Sparks of Large Audio Models: A Survey and Outlook

08/24/2023
by Siddique Latif, et al.

This survey paper provides a comprehensive overview of recent advancements and challenges in applying large language models to audio signal processing. Audio processing, with its diverse signal representations and wide range of sources, from human voices to musical instruments and environmental sounds, poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess in a variety of audio tasks, ranging from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, Foundational Audio Models such as SeamlessM4T have recently begun to act as universal translators, supporting multiple speech tasks for up to 100 languages without relying on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies for Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions for Large Audio Models, with the intent to spark further discussion and thereby foster innovation in the next generation of audio-processing systems. Furthermore, to keep pace with the rapid development in this area, we will continually update the accompanying repository with recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models.


