Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training

Kavli Affiliate: Jia Liu

| First 5 Authors: Wenhan Yao, Zedong Xing, Xiarun Chen, Jia Liu, Yongqiang He

| Summary:

One-shot voice conversion (VC) aims to change the timbre of any source speech
to match that of a target speaker given only a single speech sample. Existing
style-transfer-based VC methods rely on speech representation disentanglement
and struggle to encode each speech component accurately and independently and
to recompose the components into converted speech effectively. To address
this, we propose Pureformer-VC, which uses Conformer blocks to build a
disentangled encoder and Zipformer blocks to build a style-transfer decoder as
the generator. In the decoder, styleformer blocks integrate speaker
characteristics into the generated speech. The model is trained with a
generative VAE loss on the encoded components and a triplet loss for
unsupervised discriminative training, and the styleformer method is applied to
Zipformer's shared weights for style transfer. Experimental results show that
the proposed model achieves comparable subjective scores and improves on
objective metrics compared with existing methods in the one-shot voice
conversion scenario.
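The triplet loss mentioned above pulls embeddings of the same speaker together while pushing different-speaker embeddings at least a margin apart. The following is a minimal sketch of that objective on toy speaker embeddings, not the paper's actual implementation; the function name, margin value, and embedding values are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: enforce d(anchor, positive) + margin
    <= d(anchor, negative) for speaker embeddings.
    (Illustrative sketch; not the paper's exact formulation.)"""
    d_ap = np.linalg.norm(anchor - positive)  # same-speaker distance
    d_an = np.linalg.norm(anchor - negative)  # cross-speaker distance
    return max(0.0, d_ap - d_an + margin)

# Toy 3-dimensional "speaker embeddings" (hypothetical values).
a = np.array([1.0, 0.0, 0.0])   # anchor utterance
p = np.array([0.9, 0.1, 0.0])   # another utterance, same speaker
n = np.array([0.0, 1.0, 0.0])   # utterance from a different speaker

loss = triplet_loss(a, p, n)  # 0.0 here: the margin is already satisfied
```

Because the negative is already farther from the anchor than the positive by more than the margin, the loss is zero; harder negatives yield a positive penalty that drives the encoder toward discriminative speaker representations.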

| Search Query: ArXiv Query: search_query=au:”Jia Liu”&id_list=&start=0&max_results=3
