CLAP: Learning Audio Concepts From Natural Language Supervision

06/09/2022
by Benjamin Elizalde, et al.

Mainstream audio analytics models are trained under the paradigm of one class label for many recordings, focusing on a single task. Learning under such restricted supervision limits the flexibility of models because they require labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which learns to connect language and audio by using two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space. We trained CLAP with 128k audio-text pairs and evaluated it on 16 downstream tasks across 8 domains, such as Sound Event Classification, Music tasks, and Speech-related tasks. Although CLAP was trained with significantly fewer pairs than comparable computer vision models, it establishes state-of-the-art (SoTA) Zero-Shot performance. Additionally, we evaluated CLAP in a supervised learning setup and achieved SoTA in 5 tasks. Hence, CLAP's Zero-Shot capability removes the need to train with class labels, enables flexible class prediction at inference time, and generalizes to multiple downstream tasks.
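The sketch below illustrates, in PyTorch, the two ideas the abstract describes: a symmetric contrastive loss that pulls paired audio and text embeddings together in the joint multimodal space, and zero-shot classification that scores an audio clip against text prompts built from class labels. This is a minimal illustration under stated assumptions, not the authors' released code; `text_encoder` is a hypothetical stand-in for any text encoder that maps a list of strings to embeddings in the joint space, and the prompt template is one plausible choice.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the audio-text similarity matrix.

    audio_emb, text_emb: (batch, dim) projections of paired audio clips
    and their text descriptions into the joint multimodal space.
    """
    # L2-normalize so dot products are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits; matched pairs lie on the diagonal.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast audio-to-text and text-to-audio, then average.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def zero_shot_classify(audio_emb, class_names, text_encoder):
    """Zero-shot prediction: embed each class label as a text prompt and
    pick the class whose embedding is closest to the audio embedding.

    text_encoder is a hypothetical callable mapping a list of strings to
    (num_classes, dim) embeddings in the same joint space as audio_emb.
    """
    prompts = [f"This is a sound of {name}." for name in class_names]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    scores = audio_emb @ text_emb.t()   # (batch, num_classes) cosine scores
    return scores.argmax(dim=-1)        # index of the best-matching class
```

Because the classes enter only through text prompts, the class set can be changed at inference time without retraining, which is the flexibility the abstract refers to.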


