Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling

Kavli Affiliate: Xiang Zhang

| First 5 Authors: Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye

| Summary:

Zero-shot voice conversion (VC) aims to transform the source speaker timbre
into that of an arbitrary unseen speaker without altering the original speech
content. While recent zero-shot VC methods have shown remarkable progress,
there remains considerable room for improvement in speaker similarity and
speech naturalness. In this paper, we propose
Takin-VC, a novel zero-shot VC framework based on jointly hybrid content and
memory-augmented context-aware timbre modeling to tackle this challenge.
Specifically, we first present a hybrid content encoder, guided by neural codec
training, that leverages quantized features from pre-trained WavLM and
HybridFormer to extract the linguistic content of the source speech.
Subsequently, we introduce an advanced cross-attention-based
context-aware timbre modeling approach that learns the fine-grained,
semantically associated target timbre features. To further enhance both speaker
similarity and real-time performance, we utilize a conditional flow matching
model to reconstruct the Mel-spectrogram of the source speech. Additionally, we
introduce an efficient memory-augmented module designed to generate high-quality
conditional target inputs for the flow matching process, thereby improving the
overall performance of the proposed system. Experimental results demonstrate
that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC
systems, delivering superior performance in terms of both speech naturalness
and speaker similarity.
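
As a rough illustration of the conditional flow matching step mentioned above, the following minimal sketch builds one training pair and runs a simple Euler sampler. It assumes the common linear (optimal-transport) probability path between Gaussian noise and a target Mel frame; the summary does not state which path Takin-VC uses. The function names, the toy 4-bin "Mel frame", and the oracle velocity function are all hypothetical. In a real system the velocity would be predicted by a neural network conditioned on the content and timbre features.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Linear-path flow matching: return the interpolated point x_t and
    the target velocity (x1 - x0) that a network would regress onto."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - u) ** 2
               for p, u in zip(predicted_velocity, target_velocity)) / n


def euler_sample(x0, velocity_fn, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data (assumption): a 4-bin "Mel frame" and a Gaussian noise start point.
rng = random.Random(0)
mel_frame = [0.2, -1.0, 0.5, 1.5]
noise = [rng.gauss(0.0, 1.0) for _ in mel_frame]

x_mid, target_v = cfm_training_pair(noise, mel_frame, t=0.3)

# Oracle stand-in for a trained network: always returns the true velocity.
def oracle_velocity(x, t):
    return target_v

sample = euler_sample(noise, oracle_velocity, steps=8)
```

Because the velocity along a linear path is constant, a small number of Euler steps suffices to reach the target in this toy setting, which hints at why flow matching is attractive when inference speed matters.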

| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3