Kavli Affiliate: Jia Liu
| First 5 Authors: Wenhan Yao, Fen Xiao, Xiarun Chen, Jia Liu, YongQiang He
| Summary:
As a foundational technology for intelligent human-computer interaction,
voice conversion (VC) seeks to transform speech from any source timbre into any
target timbre. Traditional voice conversion methods based on Generative
Adversarial Networks (GANs) encounter significant challenges in precisely
encoding diverse speech elements and effectively synthesizing these elements
into natural-sounding converted speech. To overcome these limitations, we
introduce Pureformer-VC, an encoder-decoder framework that utilizes Conformer
blocks to build a disentangled encoder and employs Zipformer blocks to create a
style transfer decoder. We adopt a variational decoupled training approach that
isolates speech components using a Variational Autoencoder (VAE), complemented
by triplet discriminative training to enhance the discriminability of the
speaker representations. Furthermore, we incorporate the Attention Style Transfer
Mechanism (ASTM) with Zipformer’s shared weights to improve the style transfer
performance in the decoder. We conducted experiments on two multi-speaker
datasets. The experimental results demonstrate that the proposed model achieves
subjective evaluation scores comparable to existing approaches while
significantly improving objective metrics in both many-to-many and many-to-one
VC scenarios.
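
The summary describes three training signals: VAE-style reconstruction, KL regularization of the
disentangled latent codes, and a triplet loss on speaker embeddings. The sketch below shows how
such a combined objective could be assembled; it assumes PyTorch, and the tensor names, loss
weights, and choice of L1 reconstruction are illustrative assumptions rather than the authors'
actual implementation.

    import torch
    import torch.nn.functional as F

    def vae_triplet_objective(recon, target, mu, logvar,
                              anchor_emb, pos_emb, neg_emb,
                              kl_weight=1.0, triplet_margin=0.2, triplet_weight=1.0):
        """Combine VAE reconstruction and KL terms with a triplet loss on speaker embeddings.

        recon/target: decoder output and ground-truth mel-spectrograms, shape (B, T, n_mels)
        mu/logvar:    parameters of the variational posterior over the disentangled codes
        *_emb:        speaker embeddings for anchor, same-speaker, and different-speaker utterances
        """
        # Reconstruction term (L1 is a common choice for mel-spectrograms; assumed here)
        recon_loss = F.l1_loss(recon, target)

        # KL divergence between the posterior N(mu, sigma^2) and a standard normal prior
        kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

        # Triplet term pulls same-speaker embeddings together and pushes different speakers apart
        triplet_loss = F.triplet_margin_loss(anchor_emb, pos_emb, neg_emb, margin=triplet_margin)

        return recon_loss + kl_weight * kl_loss + triplet_weight * triplet_loss

In such a setup the speaker embeddings would come from the disentangled Conformer encoder and the
reconstruction from the Zipformer-based style transfer decoder; the weights and margin are
hyperparameters not specified in the summary.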
| Search Query: ArXiv Query: search_query=au:"Jia Liu"&id_list=&start=0&max_results=3