MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis

Kavli Affiliate: Feng Wang

| First 5 Authors: Wenhao Guan, Yishuang Li, Tao Li, Hukai Huang, Feng Wang

| Summary:

The style transfer task in text-to-speech (TTS) refers to transferring style
information onto text content to generate speech in a specific style. However,
most existing style transfer approaches rely on either fixed emotion labels or
reference speech clips, neither of which allows flexible style transfer.
Recently, some methods have adopted text descriptions to guide style transfer.
In this paper, we propose a more flexible, multi-modal, and style-controllable
TTS framework named MM-TTS. It can use any modality (reference speech,
emotional facial images, or text descriptions) as a prompt in a unified
multi-modal prompt space to control the style of the generated speech within a
single system. The challenges of modeling such a multi-modal, style-controllable
TTS system lie mainly in two aspects: 1) aligning the multi-modal information
into a unified style space so that an arbitrary modality can serve as the style
prompt in a single system, and 2) efficiently transferring the unified style
representation onto the given text content so that the generated speech
reflects the style of the prompt. To address these problems, we propose an
aligned multi-modal prompt encoder that embeds the different modalities into a
unified style space, supporting style transfer from any modality. Additionally,
we present a new adaptive style transfer method named Style Adaptive
Convolutions to achieve a better style representation. Furthermore, we design a
Rectified Flow based Refiner that mitigates the over-smoothing of
Mel-spectrograms and generates audio of higher fidelity. Since there is no
public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS,
which is related to the field of expressive talking-head generation. Our
experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that
MM-TTS achieves satisfactory results with multi-modal prompts.
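
The summary does not spell out how Style Adaptive Convolutions condition the acoustic model on the unified style embedding, so the following is only a minimal PyTorch sketch of one plausible realization: a 1-D convolution whose normalized output is modulated by a scale and shift predicted from the style vector (an AdaIN-style assumption; the class name `StyleAdaptiveConv1d` and all dimensions are hypothetical, not taken from the paper).

```python
import torch
import torch.nn as nn


class StyleAdaptiveConv1d(nn.Module):
    """Illustrative convolution modulated by a style embedding.

    NOTE: this is an assumption-based sketch, not the paper's exact
    formulation. It conditions features on a style vector via a
    per-channel scale/shift predicted from that vector.
    """

    def __init__(self, channels: int, style_dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        # Predict per-channel scale (gamma) and shift (beta) from the style vector.
        self.affine = nn.Linear(style_dim, 2 * channels)
        self.norm = nn.InstanceNorm1d(channels, affine=False)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        h = self.norm(self.conv(x))
        return gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)


# Usage: condition text-encoder features on a style vector that could come
# from any prompt modality (speech, face image, or text description) after
# it has been mapped into the unified style space.
layer = StyleAdaptiveConv1d(channels=256, style_dim=128)
text_features = torch.randn(2, 256, 120)   # (batch, channels, frames)
style_vector = torch.randn(2, 128)          # unified style embedding
out = layer(text_features, style_vector)    # (2, 256, 120)
```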

| Search Query: ArXiv Query: search_query=au:"Feng Wang"&id_list=&start=0&max_results=3