Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis

Kavli Affiliate: Yi Zhou

| First 5 Authors: Xuehao Zhou, Mingyang Zhang, Yi Zhou, Zhizheng Wu, Haizhou Li

| Summary:

Generating speech across different accents while preserving speaker identity
is crucial for various real-world applications. However, accurately and
independently modeling both speaker and accent characteristics in
text-to-speech (TTS) systems is challenging due to the complex variations of
accents and the inherent entanglement between speaker and accent identities. In
this paper, we propose a novel approach for multi-speaker multi-accent TTS
synthesis that aims to synthesize speech for multiple speakers, each with
various accents. Our approach employs a multi-scale accent modeling strategy to
address accent variations at different levels. Specifically, we introduce both
global (utterance level) and local (phoneme level) accent modeling to capture
overall accent characteristics within an utterance and fine-grained accent
variations across phonemes, respectively. To enable independent control of
speakers and accents, we use the speaker embedding to represent speaker
identity and achieve speaker-independent accent control through speaker
disentanglement within the multi-scale accent modeling. Additionally, we
present a local accent prediction model that enables our system to generate
accented speech directly from phoneme inputs. We conduct extensive experiments
on an English accented speech corpus. Experimental results demonstrate that our
proposed system outperforms baseline systems in terms of speech quality and
accent rendering for generating multi-speaker multi-accent speech. Ablation
studies further validate the effectiveness of different components in our
proposed system.
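
As a rough illustration of the multi-scale idea described above, the sketch below conditions each phoneme encoding on one utterance-level accent vector, one phoneme-level accent vector, and one speaker vector. The summary does not say how the paper combines these representations, so the additive combination, the function names (`add_vectors`, `condition_phonemes`), and the toy dimensions are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def add_vectors(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch")
    return [x + y for x, y in zip(a, b)]


def condition_phonemes(
    phoneme_encodings: List[Vector],
    global_accent: Vector,
    local_accents: List[Vector],
    speaker: Vector,
) -> List[Vector]:
    """Hypothetical conditioning step (assumed additive, not from the paper).

    global_accent: one vector per utterance, shared by every phoneme.
    local_accents: one vector per phoneme, for fine-grained variation.
    speaker:       one vector per utterance, carrying speaker identity.
    """
    if len(phoneme_encodings) != len(local_accents):
        raise ValueError("need exactly one local accent vector per phoneme")
    conditioned = []
    for encoding, local_accent in zip(phoneme_encodings, local_accents):
        hidden = add_vectors(encoding, global_accent)  # utterance-level accent
        hidden = add_vectors(hidden, local_accent)     # phoneme-level accent
        hidden = add_vectors(hidden, speaker)          # speaker identity
        conditioned.append(hidden)
    return conditioned
```

Because the speaker vector and the two accent inputs are separate arguments, swapping the accent inputs while holding `speaker` fixed mimics the independent control the summary describes. In a trained system that independence would additionally depend on the speaker disentanglement applied to the accent representations, which this sketch does not model.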

| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=3