Voice conversion aims to transform source speech into a different target...
Acoustic word embeddings (AWEs) are fixed-dimensional vector representat...
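As a rough illustration of the general idea rather than any specific paper's method, one widely used baseline turns a variable-duration segment into a fixed-dimensional AWE by uniformly sampling a fixed number of frames and concatenating their features; the sketch below assumes NumPy-style frame features and an arbitrary choice of 10 kept frames.

```python
# Sketch of a simple downsampling baseline for acoustic word embeddings.
# The frame features and the number of kept frames are illustrative choices.
import numpy as np

def downsample_embedding(frames, n_keep=10):
    """frames: (num_frames, feat_dim) per-frame features (e.g. MFCCs) for one
    word segment. Returns a fixed vector of size n_keep * feat_dim."""
    idx = np.linspace(0, len(frames) - 1, n_keep).round().astype(int)
    return frames[idx].flatten()
```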
Can we develop a model that can synthesize realistic speech directly fro...
We propose a visually grounded speech model that learns new words and th...
We consider hate speech detection through keyword spotting on radio broa...
Any-to-any voice conversion aims to transform source speech into a targe...
We propose a visually grounded speech model that acquires new words and ...
We consider the problem of few-shot spoken word classification in a sett...
Diffusion models have shown exceptional scaling properties in the image...
Imagine being able to show a system a visual depiction of a keyword and...
We propose AudioStyleGAN (ASGAN), a new generative adversarial network (...
Visually grounded speech (VGS) models are trained on images paired with...
Latent Dirichlet allocation (LDA) is widely used for unsupervised topic...
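For background, LDA's generative story can be written down in a few lines; the topic count, vocabulary size, and Dirichlet priors below are illustrative values, not settings taken from the paper.

```python
# Sketch of LDA's generative process for a single document. All sizes and
# hyperparameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len, alpha, beta = 5, 1000, 50, 0.1, 0.01

topics = rng.dirichlet([beta] * V, size=K)       # per-topic word distributions
theta = rng.dirichlet([alpha] * K)               # this document's topic mixture
z = rng.choice(K, size=doc_len, p=theta)         # a topic for each token
words = [rng.choice(V, p=topics[k]) for k in z]  # each word drawn from its topic
```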
Recent work on unsupervised speech segmentation has used self-supervised...
Keyword localisation is the task of finding where in a speech utterance ...
While multi-agent reinforcement learning has been used as an effective m...
Voice conversion (VC) has been proposed to improve speech recognition sy...
The goal of voice conversion is to transform source speech into a target...
Contrastive predictive coding (CPC) aims to learn representations of spe...
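CPC trains an encoder and an autoregressive context network with the InfoNCE objective, treating other items in a batch as negatives. The snippet below is a simplified single-prediction-step sketch with a log-bilinear score; variable names and shapes are assumptions, not code from the paper.

```python
# Minimal single-step InfoNCE sketch in the style of CPC. Shapes and names
# are illustrative; a full model would sum this loss over several future steps.
import torch
import torch.nn.functional as F

def info_nce(z_future, c_context, W):
    """z_future: (batch, dim) encodings of the true future frames.
    c_context: (batch, dim) context vectors from the autoregressive model.
    W: (dim, dim) bilinear prediction matrix for this prediction offset."""
    pred = c_context @ W                          # predicted future encodings
    logits = pred @ z_future.t()                  # (batch, batch) similarity scores
    labels = torch.arange(len(z_future), device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)        # rest of the batch acts as negatives
```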
Breakthrough advances in reinforcement learning (RL) research have led t...
Acoustic word embedding models map variable duration speech segments to ...
Visually grounded speech models learn from images paired with spoken cap...
Voice conversion is the task of converting a spoken utterance from a sou...
Acoustic word embeddings (AWEs) are fixed-dimensional representations of...
Non-native speakers show difficulties with spoken word processing. Many...
We investigate segmenting and clustering speech into low-bitrate phone-l...
Developments in weakly supervised and self-supervised models could enabl...
Many speech processing tasks involve measuring the acoustic similarity b...
We propose direct multimodal few-shot models that learn a shared embeddi...
We propose a new unsupervised model for mapping a variable-duration spee...
Research in NLP lacks geographic diversity, and the question of how NLP ...
We consider the task of multimodal one-shot speech-image matching. An ag...
In the first year of life, infants' speech perception becomes attuned to...
Acoustic word embeddings are fixed-dimensional representations of variab...
In this paper, we explore vector quantization for acoustic unit discover...
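The core quantisation step in such models replaces each encoder output with its nearest codebook vector; the generic sketch below shows only this lookup, not the paper's full model or training objective, and all names are illustrative.

```python
# Generic nearest-neighbour codebook lookup, the quantisation step of a
# vector-quantised model. Not any specific paper's full architecture.
import numpy as np

def quantise(frames, codebook):
    """frames: (T, D) encoder outputs; codebook: (K, D) code vectors.
    Returns the chosen code indices and the quantised frames."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]
```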
Recent studies have introduced methods for learning acoustic word embedd...
In zero-resource settings where transcribed speech audio is unavailable,...
Africa has over 2000 languages. Despite this, African languages account ...
Acoustic word embeddings are fixed-dimensional representations of variab...
Standard video codecs rely on optical flow to guide inter-frame predicti...
Recent deep learning models outperform standard lossy image compression...
Recent work in signal propagation theory has shown that dropout limits t...
Recent work has established the equivalence between deep neural networks...
Given a large amount of unannotated speech in a language with few resour...
Recent work has shown that speech paired with images can be used to lear...
For our submission to the ZeroSpeech 2019 challenge, we apply discrete l...
A number of recent studies have started to investigate how speech system...
We compare features for dynamic time warping based keyword spotting in a...
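As a reminder of the underlying algorithm, the sketch below computes a length-normalised DTW cost between a keyword template and a candidate segment using cosine frame distances; feature extraction and the sliding search over utterances are omitted, and the helper name is hypothetical.

```python
# Length-normalised DTW cost between two sequences of per-frame features,
# as used when scoring keyword templates against search segments.
import numpy as np

def dtw_cost(query, segment):
    """query, segment: (frames, feat_dim) arrays (e.g. MFCCs or learned features).
    Lower cost means the segment matches the keyword template more closely."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = segment / np.linalg.norm(segment, axis=1, keepdims=True)
    dist = 1.0 - q @ s.T                      # frame-wise cosine distances
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)     # accumulated alignment costs
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)
```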
Imagine a robot is shown new concepts visually together with spoken tags...
Unsupervised subword modeling aims to learn low-level representations of...