Reducing Model Complexity for DNN Based Large-Scale Audio Classification

11/01/2017
by Yuzhong Wu, et al.

Audio classification is the task of identifying the sound categories associated with a given audio signal. This paper presents an investigation of large-scale audio classification based on the recently released AudioSet database. AudioSet comprises 2 million audio samples from YouTube, which are human-annotated with 527 sound category labels. Audio classification experiments with the balanced training set and the evaluation set of AudioSet are carried out by applying different types of neural network models. The classification performance and the model complexity of these models are compared and analyzed. While the CNN models show better performance than the MLP and RNN models, their model complexity is relatively high and undesirable for practical use. We propose two different strategies that aim at constructing low-dimensional embedding feature extractors and hence reducing the number of model parameters. It is shown that the simplified CNN model has only 1/22 of the parameters of the original model, with only a slight degradation in performance.
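
As a rough illustration of why a low-dimensional embedding can shrink a model, the sketch below counts the parameters of fully connected layers placed between a flattened feature map and the 527-way output. It is not the architecture from the paper: the feature size, the hidden width, the embedding width, and the helper names `dense_params` and `mlp_params` are all hypothetical, and only the 527 output classes come from the abstract. The resulting ratio is not meant to reproduce the reported 1/22 reduction.

```python
from typing import Sequence


def dense_params(n_in: int, n_out: int) -> int:
    """Parameter count of one fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes: Sequence[int]) -> int:
    """Total parameter count of a stack of fully connected layers."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


NUM_CLASSES = 527      # number of AudioSet labels, as stated in the abstract
FLAT_FEATURES = 8192   # hypothetical size of a flattened CNN feature map
HIDDEN = 1024          # hypothetical wide hidden layer
EMBED = 64             # hypothetical low-dimensional embedding

# Wide hidden layer versus a narrow embedding layer in the same position.
baseline = mlp_params([FLAT_FEATURES, HIDDEN, NUM_CLASSES])
bottleneck = mlp_params([FLAT_FEATURES, EMBED, NUM_CLASSES])

print(f"baseline parameters:   {baseline:,}")
print(f"bottleneck parameters: {bottleneck:,}")
print(f"reduction factor:      {baseline / bottleneck:.1f}x")
```

Most of the parameters in this toy setup sit in the first layer, whose size scales with the product of its input and output widths, so narrowing that layer's output is what drives the reduction. Whether accuracy survives such a bottleneck is an empirical question that this count does not address.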

research
09/29/2016

CNN Architectures for Large-Scale Audio Classification

Convolutional Neural Networks (CNNs) have proven very effective in image...
research
02/02/2022

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Audio classification is an important task of mapping audio samples into ...
research
10/26/2022

AVES: Animal Vocalization Encoder based on Self-Supervision

The lack of annotated training data in bioacoustics hinders the use of l...
research
06/12/2018

Sample Dropout for Audio Scene Classification Using Multi-Scale Dense Connected Convolutional Neural Network

Acoustic scene classification is an intricate problem for a machine. As ...
research
11/15/2019

Cross-modal supervised learning for better acoustic representations

Obtaining large-scale human-labeled datasets to train acoustic represent...
research
11/06/2017

Unsupervised Learning of Semantic Audio Representations

Even in the absence of any explicit semantic annotation, vast collection...
research
11/12/2018

NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification

This paper introduces a fast and efficient network architecture, NeXtVLA...
