Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook

Kavli Affiliate: Xiang Zhang

| First 5 Authors: Min Liu

| Summary:

Existing text-to-speech systems predominantly focus on single-sentence
synthesis and lack adequate contextual modeling as well as fine-grained
performance control capabilities for generating coherent multicast audiobooks.
To address these limitations, we propose a context-aware and
emotion-controllable speech synthesis framework specifically engineered for multicast
audiobooks with three key innovations: a context mechanism for contextual
consistency, a disentanglement paradigm to decouple style control from speech
prompts for semantic consistency, and self-distillation to boost emotional
expressiveness and instruction controllability. Experimental results show
superior performance across the generation of narration, dialogue, and the
whole chapter, significantly outperforming existing baselines. Ablation studies
validate the effectiveness of the proposed methods. Demo samples can be found
at https://everest-ai.github.io/.

| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3