Kavli Affiliate: Zhuo Li
| First 5 Authors: Zhuo Li
| Summary:
Capturing long-range dependencies and modeling long temporal contexts have been
shown to benefit speaker verification. In this paper, we propose combining the
Hierarchical-Split block (HS-block) with the Depthwise Separable Self-Attention
(DSSA) module to capture richer multi-range speaker context features from a
local and a global perspective, respectively. Specifically, the HS-block splits
the feature map and filters into several groups and stacks them in one block,
which enlarges the receptive fields (RFs) locally. The DSSA module improves the
multi-head self-attention mechanism with a depthwise-separable strategy and an
explicit sparse attention strategy, modeling pairwise relations globally and
capturing effective long-range dependencies in each channel. Experiments are
conducted on VoxCeleb and SITW. Our best system, which combines the HS-block
and the DSSA module, achieves 1.27% EER on the VoxCeleb1 test set and 1.56% EER
on SITW.
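
The explicit sparse attention strategy mentioned above can be illustrated with a
minimal sketch. It assumes the common top-k formulation: for each query, only
the k largest attention scores are kept and the rest are masked out before the
softmax. The summary does not specify this form, so the function names
(`sparse_attention_weights`, `sparse_attend`) and the parameter `k` are
hypothetical, and the depthwise-separable projections and multiple heads of the
DSSA module are omitted. This is not the authors' implementation.

```python
import math


def sparse_attention_weights(scores, k):
    """Top-k softmax over each row of a score matrix (list of lists).

    Assumption: "explicit sparse attention" means keeping the k largest
    scores per query and masking the others to -inf before the softmax.
    """
    weights = []
    for row in scores:
        k_eff = max(1, min(k, len(row)))
        # The k-th largest score is the cut-off; ties may keep more than k.
        threshold = sorted(row, reverse=True)[k_eff - 1]
        kept = [s if s >= threshold else float("-inf") for s in row]
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        peak = max(kept)
        exps = [math.exp(s - peak) for s in kept]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def sparse_attend(queries, keys, values, k):
    """Scaled dot-product attention using the top-k weights above."""
    dim = len(queries[0])
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, key)) / math.sqrt(dim) for key in keys]
        for q in queries
    ]
    weights = sparse_attention_weights(scores, k)
    n_out = len(values[0])
    return [
        [sum(w * v[c] for w, v in zip(w_row, values)) for c in range(n_out)]
        for w_row in weights
    ]


# Usage: one query over four frames, keeping only the two strongest scores.
example_weights = sparse_attention_weights([[1.0, 3.0, 2.0, 0.0]], k=2)
```

In this sketch every position outside the top k receives exactly zero weight,
so each output frame depends only on its k most relevant frames rather than on
a dense average over the whole utterance.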
| Search Query: ArXiv Query: search_query=au:"Zhuo Li"&id_list=&start=0&max_results=10