Kavli Affiliate: Xiang Zhang
| First 5 Authors: Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye
| Summary:
Expressive zero-shot voice conversion (VC) is a critical and challenging task
that aims to transform the source timbre into an arbitrary unseen speaker while
preserving the original content and expressive qualities. Despite recent
progress in zero-shot VC, there remains considerable potential for improvements
in speaker similarity and speech naturalness. Moreover, existing zero-shot VC
systems struggle to fully reproduce paralinguistic information in highly
expressive speech, such as breathing, crying, and emotional nuances, limiting
their practical applicability. To address these issues, we propose Takin-VC, a
novel expressive zero-shot VC framework built on adaptive hybrid content
encoding and memory-augmented, context-aware timbre modeling. Specifically, we introduce
an innovative hybrid content encoder that incorporates an adaptive fusion
module, capable of effectively integrating quantized features of the
pre-trained WavLM and HybridFormer in an implicit manner, so as to extract
precise linguistic features while enriching paralinguistic elements. For timbre
modeling, we propose advanced memory-augmented and context-aware modules to
generate high-quality target timbre features and fused representations that
seamlessly align source content with target timbre. To enhance real-time
performance, we adopt a conditional flow matching model to reconstruct the
Mel-spectrogram of the source speech. Experimental results show that our
Takin-VC consistently surpasses state-of-the-art VC systems, achieving notable
improvements in terms of speech naturalness, speech expressiveness, and speaker
similarity, while offering enhanced inference speed.
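The conditional flow matching objective mentioned above can be illustrated with a minimal sketch. The snippet below is a toy, assumption-laden illustration (not the paper's actual implementation): it uses the common optimal-transport form of conditional flow matching, where a sample is interpolated between Gaussian noise and data at a random time step and a network would regress the conditional velocity field. All array shapes and the `sigma_min` value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_target(x0, x1, t, sigma_min=1e-4):
    """Toy optimal-transport conditional flow matching setup (hypothetical):
    interpolate between a noise sample x0 and a data sample x1 at time t,
    and return the point on the path plus the velocity regression target."""
    t = t.reshape(-1, 1)
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1  # sample on the probability path
    u_t = x1 - (1.0 - sigma_min) * x0                  # conditional velocity target
    return x_t, u_t

# Toy "mel-spectrogram" frames: a batch of 4 vectors with 80 mel bins.
x1 = rng.standard_normal((4, 80))  # data (e.g. target mel frames)
x0 = rng.standard_normal((4, 80))  # Gaussian prior sample
t = rng.uniform(size=4)            # random time steps in [0, 1]

x_t, u_t = cfm_training_target(x0, x1, t)

# A network v_theta(x_t, t, condition) would be trained with an MSE loss
# against u_t; here we evaluate that loss for a trivial zero predictor.
loss = float(np.mean((np.zeros_like(u_t) - u_t) ** 2))
print(x_t.shape, u_t.shape, loss)
```

At inference, one would integrate the learned velocity field from noise to data with an ODE solver in a small number of steps, which is the usual source of the speed advantage the abstract reports.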
| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3