Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers

08/24/2022
by Paul Primus, et al.

Standard machine learning models for tagging and classifying acoustic signals cannot handle classes that were not seen during training. Zero-Shot (ZS) learning overcomes this restriction by predicting classes based on adaptable class descriptions. This study investigates the effectiveness of self-attention-based audio embedding architectures for ZS learning. To this end, we compare the very recent patchout spectrogram transformer with two classic convolutional architectures. We evaluate these three architectures on three tasks, each with its own benchmark dataset: general-purpose tagging on AudioSet, environmental sound classification on ESC-50, and instrument tagging on OpenMIC. Our results show that the self-attention-based embedding methods outperform both convolutional architectures in all of these settings. By designing training and test data accordingly, we observe that prediction performance suffers significantly when the 'semantic distance' between training and new test classes is large, an effect that deserves more detailed investigation.
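
The idea of predicting classes from class descriptions can be illustrated with a minimal sketch. It assumes that an audio clip and each class description have already been mapped into a shared embedding space, and that the class with the highest cosine similarity to the clip is chosen. The function names, the toy three-dimensional vectors, and the use of cosine similarity are illustrative assumptions; the abstract does not specify the compatibility function or embedding models actually used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(audio_embedding, class_embeddings):
    """Score every class description against the audio embedding.

    Hypothetical helper: `class_embeddings` maps a class label to the
    embedding of its textual description. New classes can be added to
    this mapping at test time without retraining anything.
    """
    scores = {
        label: cosine_similarity(audio_embedding, embedding)
        for label, embedding in class_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (made up for illustration, not taken from any model).
class_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "rain": [0.1, 0.9, 0.1],
    "violin": [0.0, 0.2, 0.9],
}
audio_embedding = [0.1, 0.3, 0.8]

label, scores = zero_shot_predict(audio_embedding, class_embeddings)
print(label)
```

Because the class set is just a dictionary of description embeddings, classes unseen during training can be scored in the same way as seen ones, which is the property the abstract attributes to ZS learning.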

research
11/11/2019

Visualizing and Understanding Self-attention based Music Tagging

Recently, we proposed a self-attention based music tagging model. Differ...
research
07/05/2019

Zero-shot Learning for Audio-based Music Classification and Tagging

Audio-based music classification and tagging is typically based on categ...
research
06/10/2022

Zero-Shot Audio Classification using Image Embeddings

Supervised learning methods can solve the given problem in the presence ...
research
09/15/2023

Exploring Meta Information for Audio-based Zero-shot Bird Classification

Advances in passive acoustic monitoring and machine learning have led to...
research
07/30/2021

Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning

Zero-Shot Learning (ZSL) aims to recognise unseen object classes, which ...
research
01/26/2023

SemSup-XC: Semantic Supervision for Zero and Few-shot Extreme Classification

Extreme classification (XC) involves predicting over large numbers of cl...
research
03/25/2022

AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification

After its sweeping success in vision and language tasks, pure attention-...
