Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features

Kavli Affiliate: Zhuo Li

| First 5 Authors: Yuxiang Zhang, Zhuo Li, Jingze Lu, Wenchao Wang, Pengyuan Zhang

| Summary:

Current synthetic speech detection (SSD) methods perform well on certain
datasets but still face issues of robustness and interpretability. A possible
reason is that these methods do not analyze the deficiencies of synthetic
speech. In this paper, the flaws of the speaker features inherent in the
text-to-speech (TTS) process are analyzed. Differences in the temporal
consistency of intra-utterance speaker features arise due to the lack of
fine-grained control over speaker features in TTS. Since the speaker
representations in TTS are based on speaker embeddings extracted by encoders,
the distribution of inter-utterance speaker features differs between synthetic
and bonafide speech. Based on these analyses, an SSD method based on temporal
consistency and distribution of speaker features is proposed. On one hand,
modeling the temporal consistency of intra-utterance speaker features can aid
speech anti-spoofing. On the other hand, distribution differences in
inter-utterance speaker features can be utilized for SSD. The proposed method
offers low computational complexity and performs well in both cross-dataset and
silence trimming scenarios.
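
To make the idea of intra-utterance temporal consistency concrete, here is a minimal sketch, not the authors' implementation. It assumes an utterance has already been split into short segments and that some speaker encoder (not specified here) has produced one embedding per segment. The function names `cosine` and `temporal_consistency` are hypothetical, and the mean cosine similarity of adjacent segments is only one simple stand-in for the consistency modeling the summary describes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_u * norm_v)


def temporal_consistency(segment_embeddings):
    """Mean cosine similarity between adjacent segment-level embeddings.

    Assumption: segment_embeddings is a time-ordered list of speaker
    embeddings taken from a single utterance. Values near 1 mean the
    speaker representation stays stable over time.
    """
    if len(segment_embeddings) < 2:
        raise ValueError("need at least two segment embeddings")
    pairs = zip(segment_embeddings, segment_embeddings[1:])
    sims = [cosine(a, b) for a, b in pairs]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy embeddings, invented purely for illustration.
    steady = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    drifting = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print("steady:", temporal_consistency(steady))
    print("drifting:", temporal_consistency(drifting))
```

A score like this could serve as one input feature to a detector. The summary does not say in which direction synthetic and bonafide speech differ, so any decision threshold would have to be learned from data.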

| Search Query: ArXiv Query: search_query=au:"Zhuo Li"&id_list=&start=0&max_results=3