Synchronising audio and ultrasound by learning cross-modal embeddings

07/01/2019
by Aciel Eshky, et al.

Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the signals post hoc. To address this problem, we employ a two-stream neural network which exploits the correlation between the two modalities to find the offset. We train our model on recordings from 69 speakers, and show that it correctly synchronises 82.9% of test utterances from unseen therapy sessions and unseen speakers, thus considerably reducing the number of utterances to be manually synchronised. An analysis of model performance on the test utterances shows that directed phone articulations are more difficult to automatically synchronise compared to utterances containing natural variation in speech such as words, sentences, or conversations.

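The abstract only outlines the approach, so the following is a minimal PyTorch sketch of the general idea: two streams embed short windows of audio features and ultrasound frames into a shared space, a contrastive-style loss pulls truly synchronised pairs together and pushes artificially offset pairs apart, and at test time candidate offsets are scored by the mean embedding distance between the streams. All names (TwoStreamSync, contrastive_loss, best_offset), layer sizes, feature shapes, and the offset grid are illustrative assumptions, not the authors' architecture or hyperparameters.

```python
# Hedged sketch of a two-stream cross-modal embedding model for offset search.
# Shapes, layer sizes, and margins below are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamSync(nn.Module):
    """Embed audio windows and ultrasound windows into a shared space."""

    def __init__(self, embed_dim: int = 128, ultra_frames: int = 5):
        super().__init__()
        # Audio stream: windows of spectral features, shape (batch, 1, n_mels, n_steps).
        self.audio_net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(32 * 4 * 4, embed_dim),
        )
        # Ultrasound stream: a stack of tongue-image frames, shape (batch, ultra_frames, H, W).
        self.ultra_net = nn.Sequential(
            nn.Conv2d(ultra_frames, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(32 * 4 * 4, embed_dim),
        )

    def forward(self, audio, ultra):
        a = F.normalize(self.audio_net(audio), dim=-1)
        u = F.normalize(self.ultra_net(ultra), dim=-1)
        return a, u


def contrastive_loss(a, u, same, margin: float = 1.0):
    """same = 1 for truly synchronised pairs, 0 for artificially offset pairs."""
    d = F.pairwise_distance(a, u)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()


def best_offset(model, audio_windows, ultra_windows, candidate_shifts):
    """Return the candidate shift (in windows) with the smallest mean embedding distance.
    Assumes every shift is smaller than the number of available windows."""
    scores = []
    with torch.no_grad():
        for s in candidate_shifts:
            # Slide one modality relative to the other by s windows.
            if s >= 0:
                a, u = audio_windows[s:], ultra_windows[: len(ultra_windows) - s]
            else:
                a, u = audio_windows[: len(audio_windows) + s], ultra_windows[-s:]
            n = min(len(a), len(u))
            ea, eu = model(a[:n], u[:n])
            scores.append(F.pairwise_distance(ea, eu).mean().item())
    return candidate_shifts[int(torch.tensor(scores).argmin())]
```

In a setup like this, negative training pairs can be generated by randomly shifting one modality relative to the other, and the test-time grid of candidate shifts would cover the range of offsets expected from hardware synchronisation failures; those details are consistent with the abstract but are assumptions here.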

Related research

05/31/2021
Automatic audiovisual synchronisation for ultrasound tongue imaging
Ultrasound tongue imaging is used to visualise the intra-oral articulato...

11/19/2020
TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos
We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of a...

09/19/2023
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement
Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech ...

06/08/2021
Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces
Articulatory-to-acoustic mapping seeks to reconstruct speech from a reco...

07/01/2019
Speaker-independent classification of phonetic segments from raw ultrasound in child speech
Ultrasound tongue imaging (UTI) provides a convenient way to visualize t...

04/23/2017
Learning weakly supervised multimodal phoneme embeddings
Recent works have explored deep architectures for learning multimodal sp...

02/04/2023
LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers
Lipreading refers to understanding and further translating the speech of...
