Ruyun Li1, Peng Ouyang2, Dandan Song2 and Shaojun Wei1, 1Tsinghua University, China and 2TsingMicro Co. Ltd., China
Recently, speaker embeddings extracted by deep neural networks (DNNs) have performed well in speaker verification (SV). However, they are sensitive to changes in scenario and too computationally intensive to deploy on portable devices. In this paper, we first combine rhythm and MFCC features to improve the robustness of speaker verification. The rhythm feature reflects the distribution of phonemes and helps reduce the equal error rate (EER), especially in intra-speaker verification. In addition, we propose a multi-task knowledge distillation architecture that transfers both embedding-level and label-level knowledge from a well-trained large teacher network to a highly compact student network. The results show that rhythm features and multi-task knowledge distillation significantly improve the performance of the student network. In the ultra-short-duration scenario, the student network achieves a relative EER reduction of 32% while using only 14.9% of the parameters of the teacher network.
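The combination of embedding-level and label-level transfer described above can be sketched as a single training objective. The sketch below is illustrative, not the paper's exact formulation: the function name, the temperature `T`, and the weights `alpha` and `beta` are assumptions, and it uses a common choice of losses (cross-entropy on hard labels, temperature-scaled KL divergence against the teacher's logits, and mean-squared error between embeddings).

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multitask_kd_loss(student_logits, teacher_logits,
                      student_emb, teacher_emb, labels,
                      T=2.0, alpha=0.5, beta=0.5):
    """Illustrative multi-task distillation loss (names/weights assumed).

    Label-level transfer: cross-entropy on hard labels plus KL divergence
    between temperature-softened teacher and student posteriors.
    Embedding-level transfer: MSE between student and teacher embeddings.
    """
    n = len(labels)
    # Hard-label cross-entropy on the student's predictions
    p_s = softmax(student_logits)
    ce = -np.mean(np.log(p_s[np.arange(n), labels] + 1e-12))
    # Soft-label KL divergence, scaled by T^2 as in standard distillation
    p_t = softmax(teacher_logits, T)
    p_s_T = softmax(student_logits, T)
    kl = np.mean((p_t * np.log((p_t + 1e-12) / (p_s_T + 1e-12))).sum(axis=-1)) * T * T
    # Embedding-level matching
    mse = np.mean((student_emb - teacher_emb) ** 2)
    return ce + alpha * kl + beta * mse
```

When student and teacher agree exactly, the KL and MSE terms vanish and only the hard-label cross-entropy remains, so the weights `alpha` and `beta` control how strongly the student is pulled toward the teacher.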
Multi-task learning, Knowledge distillation, Rhythm variation, Angular softmax, Speaker verification.