STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

03/20/2022
by Qingkai Fang, et al.

How can we learn better speech representations for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often try to transfer powerful machine translation (MT) capabilities to ST but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate this discrepancy. Specifically, we mix up the representation sequences of different modalities, feed both unimodal speech sequences and multimodal mixed sequences to the translation model in parallel, and regularize their output predictions with a self-learning framework. Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy and achieves significant improvements over a strong baseline on eight translation directions.
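
The abstract names two ingredients, mixing speech and text representation sequences and regularizing the predictions of the two input streams, but gives no formulas. The following is a minimal Python sketch of how those ingredients could fit together, not the authors' implementation. It assumes that speech frames are already aligned to text units, that mixing picks the text embedding for each unit with a probability p, and that the regularizer is a Jensen-Shannon divergence between the two output distributions; the function names, the alignment, and both of those choices are assumptions made for illustration.

```python
import math
import random


def mix_sequences(speech_segments, text_embeddings, p, rng):
    """Build a mixed sequence from aligned speech and text units.

    ASSUMPTION: speech_segments[i] is the list of speech frame vectors
    aligned to the text unit whose embedding is text_embeddings[i].
    For each unit we keep the text embedding with probability p,
    otherwise we keep all of that unit's speech frames.
    """
    mixed = []
    for frames, embedding in zip(speech_segments, text_embeddings):
        if rng.random() < p:
            mixed.append(embedding)   # one vector from the text side
        else:
            mixed.extend(frames)      # every frame from the speech side
    return mixed


def kl_divergence(a, b):
    """KL(a || b) for two discrete distributions given as lists."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def js_divergence(p_dist, q_dist):
    """Jensen-Shannon divergence, used here as a hypothetical
    consistency penalty between the predictions for the speech-only
    input and the mixed input."""
    mid = [(x + y) / 2 for x, y in zip(p_dist, q_dist)]
    return 0.5 * kl_divergence(p_dist, mid) + 0.5 * kl_divergence(q_dist, mid)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two text units; the first is aligned to two speech frames, the second to one.
    speech = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
    text = [[1.0, 1.0], [2.0, 2.0]]
    print(mix_sequences(speech, text, p=0.5, rng=rng))
    print(js_divergence([0.7, 0.3], [0.6, 0.4]))
```

Under these assumptions, p = 0 reproduces the unimodal speech sequence and p = 1 reproduces the text sequence, so p controls how far the mixed input sits from pure speech. The divergence term is zero when the two streams agree and grows as their predictions drift apart.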


research
05/23/2023

Improving speech translation by fusing speech and text

In speech translation, leveraging multimodal data to improve model perfo...
research
05/24/2023

CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation

End-to-end speech translation (ST) is the task of translating speech sig...
research
05/15/2023

Understanding and Bridging the Modality Gap for Speech Translation

How to achieve better end-to-end speech translation (ST) by leveraging (...
research
12/19/2022

WACO: Word-Aligned Contrastive Learning for Speech Translation

End-to-end Speech Translation (E2E ST) aims to translate source speech i...
research
05/22/2023

Duplex Diffusion Models Improve Speech-to-Speech Translation

Speech-to-speech translation is a typical sequence-to-sequence learning ...
research
04/21/2021

End-to-end Speech Translation via Cross-modal Progressive Training

End-to-end speech translation models have become a new trend in the rese...
research
02/10/2021

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation

Recently text and speech representation learning has successfully improv...
