Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT

Kavli Affiliate: Zeeshan Ahmed

| First 5 Authors: Zeeshan Ahmed

| Summary:

This paper tackles several challenges that arise when integrating Automatic
Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device
streaming speech translation. Although state-of-the-art ASR systems based on
Recurrent Neural Network Transducers (RNN-T) can perform real-time
transcription, achieving streaming translation in real time remains a
significant challenge. To address this issue, we propose a simultaneous
translation approach that effectively balances translation quality and latency.
We also investigate efficient integration of ASR and MT, leveraging linguistic
cues generated by the ASR system to manage context and utilizing efficient
beam-search pruning techniques such as time-out and forced finalization to
maintain the system’s real-time factor. We apply our approach to on-device
bilingual conversational speech translation and demonstrate that our techniques
outperform baselines in terms of latency and quality. Notably, our technique
narrows the quality gap with non-streaming translation systems, paving the way
for more accurate and efficient real-time speech translation.
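
The summary names time-out and forced finalization as beam-search pruning
techniques but gives no detail on how they work. The sketch below shows one
plausible reading, not the paper's implementation: the search stops expanding
once a wall-clock budget is spent (time-out) and then commits the best
hypothesis available, even a partial one (forced finalization). The names
`beam_search_with_timeout`, `toy_step`, `budget_s`, and the injectable `clock`
are hypothetical and chosen for illustration only.

```python
import math
import time
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # assumed end-of-sentence marker

# Hypothetical scorer: maps a token prefix to next-token log-probabilities.
StepFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search_with_timeout(
    step_fn: StepFn,
    beam_size: int = 2,
    max_len: int = 10,
    budget_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], bool]:
    """Return (tokens, timed_out) for the best hypothesis found in the budget."""
    start = clock()
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    timed_out = False

    for _ in range(max_len):
        # Time-out: stop expanding once the wall-clock budget is exceeded.
        if clock() - start > budget_s:
            timed_out = True
            break
        candidates = [
            (prefix + (tok,), score + logp)
            for prefix, score in beams
            for tok, logp in step_fn(prefix).items()
        ]
        if not candidates:
            break
        # Standard beam pruning: keep only the top-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == EOS:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break

    # Forced finalization: with no complete hypothesis, commit the best partial.
    pool = finished if finished else beams
    best_prefix, _ = max(pool, key=lambda c: c[1])
    return [t for t in best_prefix if t != EOS], timed_out


def toy_step(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Tiny hand-written distribution standing in for an MT decoder."""
    table = {
        (): {"hallo": math.log(0.9), "hi": math.log(0.1)},
        ("hallo",): {"welt": math.log(0.8), EOS: math.log(0.2)},
        ("hallo", "welt"): {EOS: 0.0},
    }
    return table.get(prefix, {EOS: 0.0})
```

With a generous budget, `beam_search_with_timeout(toy_step, budget_s=10.0)`
runs to completion. With a tight budget it returns whatever prefix is best when
the budget expires, and the `timed_out` flag tells the caller that the output
was force-finalized. Passing `clock` as a parameter is a design choice that
keeps the time-out path deterministic under test.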

| Search Query: ArXiv Query: search_query=au:"Zeeshan Ahmed"&id_list=&start=0&max_results=3