Learning Speaker Embedding with Momentum Contrast
Speaker verification can be formulated as a representation learning task, where speaker-discriminative embeddings are extracted from utterances of variable lengths. Momentum Contrast (MoCo) is a recently proposed unsupervised representation learning framework, and has shown its effectiveness for learning good feature representation for downstream vision tasks. In this work, we apply MoCo to learn speaker embedding from speech segments. We explore MoCo for both unsupervised learning and pretraining settings. In the unsupervised scenario, embedding is learned by MoCo from audio data without using any speaker specific information. On a large scale dataset with 2,500 speakers, MoCo can achieve EER 4.275% trained unsupervisedly, and the EER can decrease further to 3.58% if extra unlabelled data are used. In the pretraining scenario, encoder trained by MoCo is used to initialize the downstream supervised training. With finetuning on the MoCo trained model, the equal error rate (EER) reduces 13.7% relative (1.44% to 1.242%) compared to a carefully tuned baseline training from scratch. Comparative study confirms the effectiveness of MoCo learning good speaker embedding.
READ FULL TEXT