Joint speech-language training is challenging due to the large demand fo...
Artificial General Intelligence (AGI) requires comprehensive understandi...
The convergence of text, visual, and audio data is a key step towards
hu...
Code-switching speech refers to a means of expression by mixing two or m...
In real application scenarios, it is often challenging to obtain a large...
In this report, we describe our submitted system for track 2 of the VoxC...
Self-supervised learning (SSL) methods have proven to be very successful...
Human intelligence is multimodal; we integrate visual, linguistic, and
a...
This paper studies a novel pre-training technique with unpaired speech d...
Recently, pioneering work has found that pre-trained speech models can solve
fu...
Self-supervised learning (SSL) achieves great success in speech recognit...
The advances in attention-based encoder-decoder (AED) networks have brou...
Multilingual end-to-end (E2E) models have shown great potential in the
...
The speech representations learned from large-scale unlabeled data have ...
Self-supervised learning (SSL) is a long-standing goal for speech proces...
End-to-end (E2E) spoken language understanding (SLU) can infer semantics...
In this paper, we propose a unified pre-training approach called UniSpee...
Supervised systems require human labels for training. But are humans
th...
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN...
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN...