Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification

06/19/2019
by   Youngmoon Jung, et al.

In this paper, we propose a new pooling method called spatial pyramid encoding (SPE) to generate speaker embeddings for text-independent speaker verification. We first partition the output feature maps from a deep residual network (ResNet) into increasingly fine sub-regions and extract a speaker embedding from each sub-region through a learnable dictionary encoding layer. These embeddings are concatenated to obtain the final speaker representation. The SPE layer not only generates a fixed-dimensional speaker embedding for a variable-length speech segment, but also aggregates information about the feature distribution from multi-level temporal bins. Furthermore, we apply deep length normalization by augmenting the loss function with ring loss. With ring loss, the network gradually learns to normalize the speaker embeddings through the model weights themselves while preserving convexity, leading to more robust speaker embeddings. Experiments on the VoxCeleb1 dataset show that the proposed system using the SPE layer and ring loss-based deep length normalization outperforms both i-vector and d-vector baselines.
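The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the ResNet output has already been flattened to frame-level features of shape (T, D), uses pyramid levels of 1, 2, and 4 temporal bins, replaces the trained dictionary encoding layer with a simplified soft-assignment residual encoder using fixed centers and scales, and writes ring loss in the form given in the Ring loss paper, (weight / 2) times the mean of (||f|| - R)^2. All function and parameter names here (`lde_encode`, `spatial_pyramid_encode`, `ring_loss`, `levels`, `radius`, `weight`) are hypothetical.

```python
import numpy as np


def lde_encode(frames, centers, scales):
    """Simplified dictionary encoding (assumption, not the paper's exact layer).

    frames:  (T, D) frame-level features
    centers: (C, D) dictionary components
    scales:  (C,)   positive smoothing factors
    Returns a (C * D,) vector of weighted mean residuals.
    """
    resid = frames[:, None, :] - centers[None, :, :]      # (T, C, D)
    dist = (resid ** 2).sum(axis=-1)                      # (T, C)
    logits = -scales[None, :] * dist
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)                  # soft assignment per frame
    num = (w[:, :, None] * resid).sum(axis=0)             # (C, D)
    den = w.sum(axis=0)[:, None] + 1e-8                   # (C, 1)
    return (num / den).reshape(-1)


def spatial_pyramid_encode(frames, centers, scales, levels=(1, 2, 4)):
    """Encode each temporal bin at each pyramid level and concatenate.

    The output length is sum(levels) * C * D, independent of T.
    """
    if frames.shape[0] < max(levels):
        raise ValueError("need at least max(levels) frames")
    parts = []
    for n_bins in levels:
        # Split the time axis into n_bins roughly equal sub-regions.
        for chunk in np.array_split(frames, n_bins, axis=0):
            parts.append(lde_encode(chunk, centers, scales))
    return np.concatenate(parts)


def ring_loss(embeddings, radius, weight):
    """Penalize the gap between each embedding norm and a target radius.

    In training, radius would be a learned parameter; here it is a plain float.
    """
    norms = np.linalg.norm(embeddings, axis=1)
    return 0.5 * weight * np.mean((norms - radius) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.normal(size=(3, 5))
    scales = np.ones(3)
    for n_frames in (40, 97):
        emb = spatial_pyramid_encode(rng.normal(size=(n_frames, 5)), centers, scales)
        print(n_frames, emb.shape)
```

With 7 bins in total (1 + 2 + 4), 3 dictionary components, and 5-dimensional features, both the 40-frame and the 97-frame input map to a vector of length 105, which is the fixed-dimensional property the abstract describes. In a real system the ring loss term would be added to the classification loss, with the centers, scales, and radius all learned jointly.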

research
04/14/2018

Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

In this paper, we explore the encoding/pooling layer and loss function i...
research
02/28/2018

Ring loss: Convex Feature Normalization for Face Recognition

We motivate and present Ring loss, a simple and elegant feature normaliz...
research
06/08/2018

Analysis of Length Normalization in End-to-End Speaker Verification System

The classical i-vectors and the latest end-to-end deep speaker embedding...
research
02/21/2019

Deep Speaker Embedding Learning with Multi-Level Pooling for Text-Independent Speaker Verification

This paper aims to improve the widely used deep speaker embedding x-vect...
research
11/19/2019

Partial AUC optimization based deep speaker embeddings with class-center learning for text-independent speaker verification

Deep embedding based text-independent speaker verification has demonstra...
research
12/21/2020

Multi-stream Convolutional Neural Network with Frequency Selection for Robust Speaker Verification

Speaker verification aims to verify whether an input speech corresponds ...
research
04/02/2018

A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification

A novel learnable dictionary encoding layer is proposed in this paper fo...