Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition

Kavli Affiliate: Zeeshan Ahmed

| First 5 Authors: Niko Moritz, Ruiming Xie, Yashesh Gaur, Ke Li, Simone Merello

| Summary:

We propose the joint speech translation and recognition (JSTAR) model that
leverages the fast-slow cascaded encoder architecture for simultaneous
end-to-end automatic speech recognition (ASR) and speech translation (ST). The
model is transducer-based and uses a multi-objective training strategy that
optimizes both ASR and ST objectives simultaneously. This allows JSTAR to
produce high-quality streaming ASR and ST results. We apply JSTAR in a
bilingual conversational speech setting with smart-glasses, where the model is
also trained to distinguish speech from different directions corresponding to
the wearer and a conversational partner. Different model pre-training
strategies are studied to further improve results, including training of a
transducer-based streaming machine translation (MT) model for the first time
and applying it for parameter initialization of JSTAR. We demonstrate that
JSTAR outperforms a strong cascaded ST model in both BLEU score and
latency.
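
The multi-objective training described in the summary can be pictured as combining one loss per task into a single training objective. The sketch below is an illustration only: the summary does not say how the ASR and ST objectives are combined, so the weighted sum, the `JointLossWeights` class, the `joint_loss` function, and the default weights of 0.5 are all assumptions rather than details of JSTAR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointLossWeights:
    """Hypothetical interpolation weights for the two objectives."""
    asr: float = 0.5  # assumed weight on the ASR (transcription) loss
    st: float = 0.5   # assumed weight on the ST (translation) loss


def joint_loss(asr_loss: float, st_loss: float,
               weights: JointLossWeights = JointLossWeights()) -> float:
    """Combine per-task losses into one scalar training objective.

    A weighted sum is one common way to optimize two objectives at once;
    it is assumed here, not taken from the paper.
    """
    return weights.asr * asr_loss + weights.st * st_loss


if __name__ == "__main__":
    # Toy loss values standing in for per-batch transducer losses.
    print(joint_loss(asr_loss=2.0, st_loss=4.0))
```

In a real training loop the two inputs would be the transducer losses computed for the ASR and ST branches on the same batch, and the weights would be tuned as hyperparameters.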

| Search Query: ArXiv Query: search_query=au:"Zeeshan Ahmed"&id_list=&start=0&max_results=3