Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

by Zhifu Gao, et al.

Online end-to-end ASR has recently gained increasing attention. However, online systems still lag far behind offline systems in recognition quality. For a specific scenario, one can trade off recognition performance against latency by training multiple systems with different delays, each matched to the performance and latency requirements of that scenario. In this work, rather than trading off performance against latency, we envisage a single system that meets the needs of different scenarios. We propose a novel architecture, termed Universal ASR, that unifies streaming and non-streaming ASR models in one system. The embedded streaming ASR model can be configured with different delays, according to requirements, to produce real-time recognition results, while the non-streaming model refreshes the final recognition result for better performance. We evaluate our approach on the public AISHELL-2 benchmark and on an industrial-level 20,000-hour Mandarin speech recognition task. The experimental results show that Universal ASR provides an efficient mechanism for integrating streaming and non-streaming models, so that speech can be recognized both quickly and accurately. On the AISHELL-2 task, Universal ASR comfortably outperforms other state-of-the-art systems.
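
The control flow described in the abstract (configurable-delay partial results followed by a full-context refresh) can be illustrated with a minimal sketch. This is not the authors' implementation. The function `universal_decode`, its parameters, and the two recognizer callables are hypothetical names introduced only for illustration, and the delay is modeled simply as a chunk size in frames, which is an assumption.

```python
from typing import Callable, Iterator, Sequence, Tuple

# Hypothetical type aliases: a "recognizer" maps acoustic frames to text.
Frames = Sequence[float]
Recognizer = Callable[[Frames], str]


def universal_decode(
    frames: Frames,
    chunk_size: int,
    streaming_pass: Recognizer,
    full_context_pass: Recognizer,
) -> Iterator[Tuple[str, str]]:
    """Yield ("partial", text) after each chunk, then one ("final", text).

    chunk_size stands in for the configurable delay (an assumption):
    a smaller chunk gives earlier partial results, and the streaming
    pass only ever sees the frames received so far.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    total = len(frames)
    # Streaming pass: re-run on the growing prefix as each chunk arrives.
    for end in range(chunk_size, total + chunk_size, chunk_size):
        seen = frames[: min(end, total)]
        yield ("partial", streaming_pass(seen))

    # Non-streaming pass: refresh the result once the full utterance is known.
    yield ("final", full_context_pass(frames))
```

With placeholder recognizers such as `lambda seen: f"s{len(seen)}"`, a caller would display each "partial" result as it arrives and replace it with the "final" result at the end of the utterance. In a real system both passes would share model components rather than being independent callables.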




