S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning

Kavli Affiliate: Xiang Zhang

| First 5 Authors: Yu Pan, Yu Pan

| Summary:

Despite recent advances in multilingual speech-to-speech translation (S2ST),
several critical challenges persist: 1) achieving high-quality translation
remains a major hurdle, and 2) most existing methods heavily rely on
large-scale parallel speech corpora, which are costly and difficult to obtain.
To address these issues, we propose S2ST-Omni, an efficient and
scalable framework for multilingual S2ST. Specifically, we decompose the S2ST
task into speech-to-text translation (S2TT) and text-to-speech synthesis (TTS).
For S2TT, we propose an effective speech language model that integrates the
pretrained Whisper encoder for robust audio understanding and Qwen 3.0 for
advanced text comprehension. A lightweight speech adapter is employed to bridge
the modality gap between speech and text representations. To further facilitate
multimodal knowledge learning, a two-stage fine-tuning strategy is
introduced. In the TTS stage, we adopt a streaming autoregressive generation
approach to produce natural and fluent target speech. Experiments on the CVSS
benchmark show that S2ST-Omni consistently outperforms existing
state-of-the-art S2ST systems in translation quality, highlighting its
effectiveness and superiority.
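
The lightweight speech adapter described above can be sketched as a simple downsample-and-project module that maps frame-level speech-encoder features into the text model's embedding space. The dimensions, the frame-stacking factor, and the linear-plus-ReLU design below are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper):
D_SPEECH = 1280   # width of the speech encoder's output features
D_TEXT = 2048     # width of the text model's input embeddings
STACK = 4         # stack adjacent frames to shorten the speech sequence

# Adapter parameters: a single linear projection over stacked frames.
W = rng.standard_normal((D_SPEECH * STACK, D_TEXT)) * 0.02
b = np.zeros(D_TEXT)

def speech_adapter(feats: np.ndarray) -> np.ndarray:
    """Map (T, D_SPEECH) encoder features to (T // STACK, D_TEXT) embeddings."""
    T = feats.shape[0] - feats.shape[0] % STACK        # drop any ragged tail
    stacked = feats[:T].reshape(T // STACK, D_SPEECH * STACK)
    return np.maximum(stacked @ W + b, 0.0)            # linear + ReLU

speech_feats = rng.standard_normal((100, D_SPEECH))    # 100 encoder frames
text_like = speech_adapter(speech_feats)
print(text_like.shape)  # (25, 2048)
```

Because the adapter shortens the sequence (here by 4x) while matching the text embedding width, the language model can consume speech and text tokens in a shared representation space.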

| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3