Towards Visually Grounded Sub-Word Speech Unit Discovery

by David Harwath et al.

In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition. We present a series of experiments investigating the information encoded by these events.




Related papers:

- Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech — "In this paper, we study how word-like units are represented and activate..."
- Learning to Recognise Words using Visually Grounded Speech — "We investigated word recognition in a Visually Grounded Speech model. Th..."
- A Neural Network Model of Lexical Competition during Infant Spoken Word Recognition — "Visual world studies show that upon hearing a word in a target-absent vi..."
- Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech — "In this paper, we present a method for learning discrete linguistic unit..."
- Interpreting intermediate convolutional layers of CNNs trained on raw speech — "This paper presents a technique to interpret and visualize intermediate ..."
- Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody — "This paper investigates the use of word surprisal, a measure of the pred..."
- WEST: Word Encoded Sequence Transducers — "Most of the parameters in large vocabulary models are used in embedding ..."
