BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

by Daisuke Niizumi, et al.

Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. For serving the diverse needs of tasks such as recognition of emotions or music genres, representations should provide multiple aspects of these robust features, such as local and global features and their statistics. To implement our principle, we propose a self-supervised learning method: Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"). BYOL-A pre-trains representations of the input sound that are invariant to audio data augmentations by minimizing the difference between a pair of augmented variants of the input, which makes the learned representations robust to the perturbations of sounds. In the BYOL-A encoder, global pooling calculates representations that form multi-aspect information by combining statistics of frequency- and channel-wise, local, and global features. As a result, the learned representations should provide multi-aspect robust features of the input and serve the various needs of diverse tasks. We evaluated general audio task performance against previous state-of-the-art methods, and BYOL-A showed competitive results in all tasks, with the best average result of 72.4% and a VoxCeleb1 result of 63.8%. Extensive ablation experiments validated the contributions of BYOL-A components. Our code is available online.
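The two core ideas in the abstract — minimizing the difference between representations of two augmented variants, and a pooling step that concatenates statistics of the encoder's feature map — can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the feature-map shape `(channels, frequency, time)`, the function names, and the choice of mean and max statistics over time are illustrative assumptions; the `normalized_mse` form follows the standard BYOL objective (equivalent to 2 minus twice the cosine similarity).

```python
import numpy as np

def multi_aspect_pooling(feature_map: np.ndarray) -> np.ndarray:
    """Illustrative multi-aspect pooling over an encoder feature map.

    feature_map: shape (channels, frequency, time). Channel and frequency
    axes are flattened into one feature axis, then mean (global, average
    energy) and max (local, salient peak) statistics over time are
    concatenated into a single representation vector.
    """
    c, f, t = feature_map.shape
    x = feature_map.reshape(c * f, t)          # (channel*frequency, time)
    mean_pool = x.mean(axis=1)                 # global statistics over time
    max_pool = x.max(axis=1)                   # local salient features
    return np.concatenate([mean_pool, max_pool])  # shape (2 * c * f,)

def normalized_mse(z_online: np.ndarray, z_target: np.ndarray) -> float:
    """BYOL-style loss: mean squared error between L2-normalized vectors,
    which equals 2 - 2 * cosine_similarity. Minimizing it pulls the
    representations of two augmented variants of one input together."""
    z1 = z_online / np.linalg.norm(z_online)
    z2 = z_target / np.linalg.norm(z_target)
    return 2.0 - 2.0 * float(z1 @ z2)

# Example with a random feature map standing in for a real encoder output.
rng = np.random.default_rng(0)
fm = rng.normal(size=(8, 4, 16))               # (channels, frequency, time)
rep = multi_aspect_pooling(fm)
print(rep.shape)                               # (64,)

# Identical "views" give zero loss; different views give a positive loss.
print(normalized_mse(rep, rep))                # 0.0 (up to rounding)
```

The max statistic keeps short transient events visible in the representation even when mean pooling would wash them out, which is one way a single vector can serve both clip-level tasks (e.g. genre) and event-sensitive tasks.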


