Acoustic Neighbor Embeddings

by Woojay Jeon, et al.

This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings, where speech or text of arbitrary length is mapped to a vector space of fixed, reduced dimensions by adapting stochastic neighbor embedding (SNE) to sequential inputs. The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences. Two encoder neural networks are trained: an acoustic encoder that accepts speech signals in the form of frame-wise subword posterior probabilities obtained from an acoustic model, and a text encoder that accepts text in the form of subword transcriptions. Compared to a known method based on a triplet loss, the proposed method is shown to have more effective gradients for neural network training. Experimentally, it also gives more accurate results when the two encoder networks are used in tandem in a word (name) recognition task, and when the text encoder network is used standalone in an approximate phonetic match task. In particular, in a name recognition task depending solely on the Euclidean distance between embedding vectors, the proposed embeddings can achieve recognition accuracy that closely approaches that of conventional finite state transducer (FST)-based decoding. For test data with 1K vocabularies, the accuracy difference is 0.6 18-dimensional embeddings, and for test data with a 1M vocabulary, the difference is 0.4
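
The recognition step described in the abstract, which depends only on the Euclidean distance between embedding vectors, can be illustrated with a minimal sketch. It assumes the two encoders have already produced the vectors; the function name `recognize`, the toy vocabulary, and the 3-dimensional coordinates are hypothetical and not taken from the paper.

```python
import math

def recognize(acoustic_vec, text_vecs, vocabulary):
    """Return the vocabulary entry whose text embedding lies closest
    (in Euclidean distance) to the acoustic embedding."""
    best = min(
        range(len(vocabulary)),
        key=lambda i: math.dist(acoustic_vec, text_vecs[i]),
    )
    return vocabulary[best]

# Hypothetical toy data: in practice these vectors would come from the
# trained text encoder and acoustic encoder, respectively.
vocabulary = ["anna", "hannah", "bob"]
text_vecs = [
    [0.0, 0.0, 1.0],   # "anna"
    [0.1, 0.0, 0.9],   # "hannah" (phonetically close to "anna")
    [2.0, -1.0, 0.0],  # "bob"
]
acoustic_vec = [0.08, 0.0, 0.92]  # embedding of a spoken name

result = recognize(acoustic_vec, text_vecs, vocabulary)
print(result)
```

A linear scan like this is only practical for small vocabularies; for a vocabulary on the order of 1M entries, some form of indexed nearest-neighbor search would likely be needed, which the abstract does not specify.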



Improving Word Recognition using Multiple Hypotheses and Deep Embeddings

We propose a novel scheme for improving the word recognition accuracy us...

Additional Shared Decoder on Siamese Multi-view Encoders for Learning Acoustic Word Embeddings

Acoustic word embeddings — fixed-dimensional vector representations of a...

Multi-view Recurrent Neural Acoustic Word Embeddings

Recent work has begun exploring neural acoustic word embeddings---fixed-...

Acoustically Grounded Word Embeddings for Improved Acoustics-to-Word Speech Recognition

Direct acoustics-to-word (A2W) systems for end-to-end automatic speech r...

Deep convolutional acoustic word embeddings using word-pair side information

Recent studies have been revisiting whole words as the basic modelling u...

Contextual Joint Factor Acoustic Embeddings

Embedding acoustic information into fixed length representations is of i...

Contextualized Generative Retrieval

The text retrieval task is mainly performed in two ways: the bi-encoder ...
