S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Streaming Speech Generation

Kavli Affiliate: Xiang Zhang

| First 5 Authors: Yu Pan, Yuguang Yang, Yanni Hu, Jianhao Ye, Xiang Zhang

| Summary:

Multilingual speech-to-speech translation (S2ST) aims to directly convert
spoken utterances from multiple source languages into fluent and intelligible
speech in a target language. Despite recent progress, several critical
challenges persist: 1) achieving high-quality S2ST remains a significant
obstacle; 2) most existing S2ST methods rely heavily on large-scale parallel
speech corpora, which are difficult and resource-intensive to obtain. To tackle
these challenges, we introduce S2ST-Omni, a novel, efficient, and scalable
framework tailored for multilingual speech-to-speech translation. Specifically,
we decompose S2ST into speech-to-text translation (S2TT) and text-to-speech
synthesis (TTS). To enable high-quality S2TT while mitigating reliance on
large-scale parallel speech corpora, we leverage powerful pretrained models:
Whisper for robust audio understanding and Qwen 3.0 for advanced text
comprehension. A lightweight speech adapter is introduced to bridge the
modality gap between speech and text representations, facilitating effective
utilization of pretrained multimodal knowledge. To ensure both translation
accuracy and real-time responsiveness, we adopt a streaming speech generation
model in the TTS stage, which generates the target speech in an autoregressive
manner. Extensive experiments conducted on the CVSS benchmark demonstrate that
S2ST-Omni consistently surpasses several state-of-the-art S2ST baselines in
translation quality, highlighting its effectiveness.
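
The summary only names the components, so the following is a minimal, hypothetical sketch of how a lightweight speech adapter might bridge Whisper encoder outputs into an LLM's embedding space. The class name, the frame-stacking downsampling, the MLP structure, and the dimensions (1280 matches the Whisper large encoder; 4096 is an assumed LLM hidden size) are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a speech adapter; names and dims are assumptions.
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Projects Whisper encoder frames into the LLM embedding space."""

    def __init__(self, whisper_dim: int = 1280, llm_dim: int = 4096,
                 downsample: int = 4):
        super().__init__()
        self.downsample = downsample
        # Stacking adjacent frames shortens the sequence the LLM attends to.
        self.proj = nn.Sequential(
            nn.Linear(whisper_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, whisper_dim) from a frozen Whisper encoder.
        b, t, d = feats.shape
        t -= t % self.downsample                   # drop ragged tail frames
        stacked = feats[:, :t].reshape(b, t // self.downsample,
                                       d * self.downsample)
        return self.proj(stacked)                  # (batch, t/4, llm_dim)

# The adapter's outputs would be prepended to the translation prompt's
# token embeddings before being fed to the LLM for speech-to-text
# translation.
speech = torch.randn(2, 100, 1280)                 # stand-in Whisper features
prefix = SpeechAdapter()(speech)                   # -> (2, 25, 4096)
```

On the TTS side, the abstract's streaming, autoregressive generation implies that target speech can begin playing back before the full translation is decoded; the exact speech-token interface is not specified in the summary.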

| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3
