Kavli Affiliate: Zhuo Li
| First 5 Authors: Yuxiang Zhang, Zhuo Li, Jingze Lu, Hua Hua, Wenchao Wang
| Summary:
Current speech anti-spoofing countermeasures (CMs) show excellent
performance on specific datasets. However, removing the silence of test speech
through Voice Activity Detection (VAD) can severely degrade performance. In
this paper, the impact of silence on speech anti-spoofing is analyzed. First,
the reasons for the impact are explored, including the proportion of silence
duration and the content of silence. The proportion of silence duration in
spoof speech generated by text-to-speech (TTS) algorithms is lower than that in
bonafide speech. Moreover, the content of silence produced by different waveform
generators differs from that of bonafide speech. Then the impact of silence on
model prediction is explored. Even after retraining, error rates on spoof speech
generated by neural-network-based end-to-end TTS algorithms rise significantly
when the silence is removed. To demonstrate the reasons for the
impact of silence on CMs, the attention distribution of a CM is visualized
through class activation mapping (CAM). Furthermore, experiments that mask
either silence or non-silence demonstrate the
significance of the proportion of silence duration for detecting TTS and the
importance of silence content for detecting voice conversion (VC). Based on the
experimental results, masking silence is also proposed as a way to improve the
robustness of CMs against unknown spoofing attacks. Finally, attacks on
anti-spoofing CMs that concatenate silence are introduced, along with low-pass
filtering to mitigate the effects of both VAD and the silence attack.
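
The "proportion of silence duration" discussed in the summary can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: it uses a simple frame-energy threshold, not the VAD used in the paper, and the function name `silence_proportion`, the 25 ms / 10 ms framing, and the -40 dB threshold are all hypothetical choices introduced here.

```python
import numpy as np


def silence_proportion(wave, sample_rate, frame_ms=25.0, hop_ms=10.0,
                       threshold_db=-40.0):
    """Return the fraction of frames whose RMS level is below threshold_db.

    The level of each frame is measured relative to the loudest frame in the
    same utterance. This is a crude energy-based stand-in for a real VAD
    (an assumption for illustration, not the paper's method).
    """
    wave = np.asarray(wave, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if frame_len < 1 or hop_len < 1:
        raise ValueError("frame and hop must each span at least one sample")
    if wave.ndim != 1 or len(wave) < frame_len:
        raise ValueError("expected a 1-D signal at least one frame long")

    # Slice the signal into overlapping frames with an index matrix.
    n_frames = 1 + (len(wave) - frame_len) // hop_len
    starts = hop_len * np.arange(n_frames)[:, None]
    frames = wave[starts + np.arange(frame_len)[None, :]]

    # Per-frame RMS, expressed in dB relative to the loudest frame.
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    loudest = rms.max()
    if loudest == 0.0:
        return 1.0  # an all-zero signal is treated as entirely silent
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-12) / loudest)

    return float(np.mean(level_db < threshold_db))


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # 1 s of synthetic "speech"
    padded = np.concatenate([np.zeros(sr), tone])  # 1 s of leading silence
    print(silence_proportion(tone, sr))
    print(silence_proportion(padded, sr))
```

Comparing this statistic between bonafide and spoofed utterances is one way to observe the kind of duration gap the summary describes. With real recordings, the threshold would need tuning, because recorded silence contains background noise rather than exact zeros.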
| Search Query: ArXiv Query: search_query=au:"Zhuo Li"&id_list=&start=0&max_results=3