Speech representation learning: Learning bidirectional encoders with single-view, multi-view, and multi-task methods

by Qingming Tang, et al.

This thesis focuses on representation learning for sequence data over time or space, with the aim of improving downstream sequence prediction tasks through the learned representations. Supervised learning has been the dominant approach for training deep neural networks to learn good sequential representations. However, one factor limiting the scaling of supervised learning is the scarcity of annotated data. Motivated by this challenge, it is natural to explore representation learning methods that can exploit large amounts of unlabeled and weakly labeled data, as well as additional data modalities. I describe a broad study of representation learning for speech data. Unlike most other work, which focuses on a single learning setting, this thesis studies multiple settings: supervised learning with auxiliary losses, unsupervised learning, semi-supervised learning, and multi-view learning. Beyond these different learning problems, I also explore multiple approaches to representation learning. Although I focus on speech data, the methods described in this thesis can also be applied to other domains. The field of representation learning is developing rapidly: state-of-the-art results on speech-related tasks are now typically obtained with Transformers pre-trained via large-scale self-supervised learning, which aims to learn generic representations that benefit multiple downstream tasks, and since 2020 large-scale pre-training has been the de facto choice for achieving good performance. This thesis, completed after that shift, does not attempt to summarize or compare against the latest results on speech representation learning; instead, it presents a distinctive study of speech representation learning before the Transformer era, one that covers multiple learning settings. Some of its findings may still be useful today.



