Towards Learning Universal Audio Representations

11/23/2021
by Luyu Wang, et al.

The ability to learn universal audio representations that can solve diverse speech, music, and environment tasks can spur many applications that require general sound content understanding. In this work, we introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learning systems on that benchmark. We discover that previous sound event classification or speech models do not generalize outside of their domains. We observe that more robust audio representations can be learned with the SimCLR objective; however, the model's transferability depends heavily on the model architecture. We find the Slowfast architecture is good at learning rich representations required by different domains, but its performance is affected by the normalization scheme. Based on these findings, we propose a novel normalizer-free Slowfast NFNet and achieve state-of-the-art performance across all domains.


research
03/06/2022

HEAR 2021: Holistic Evaluation of Audio Representations

What audio embedding approach generalizes best to a wide range of downst...
research
09/14/2023

EnCodecMAE: Leveraging neural codecs for universal audio representation learning

The goal of universal audio representation learning is to obtain foundat...
research
03/20/2022

A Study on Robustness to Perturbations for Representations of Environmental Sound

Audio applications involving environmental sound analysis increasingly u...
research
09/19/2023

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Audio-visual representation learning aims to develop systems with human-...
research
07/29/2023

UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models

Multimodal large models have been recognized for their advantages in var...
research
10/27/2020

Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

Self-supervised audio representation learning offers an attractive alter...
research
12/11/2020

Analysis of Feature Representations for Anomalous Sound Detection

In this work, we thoroughly evaluate the efficacy of pretrained neural n...
