Automated Audio Captioning via Fusion of Low- and High- Dimensional Features

10/10/2022
by   Jianyuan Sun, et al.
0

Automated audio captioning (AAC) aims to describe the content of an audio clip using simple sentences. Existing AAC methods are developed based on an encoder-decoder architecture that success is attributed to the use of a pre-trained CNN10 called PANNs as the encoder to learn rich audio representations. AAC is a highly challenging task due to its high-dimensional talent space involves audio of various scenarios. Existing methods only use the high-dimensional representation of the PANNs as the input of the decoder. However, the low-dimension representation may retain as much audio information as the high-dimensional representation may be neglected. In addition, although the high-dimensional approach may predict the audio captions by learning from existing audio captions, which lacks robustness and efficiency. To deal with these challenges, a fusion model which integrates low- and high-dimensional features AAC framework is proposed. In this paper, a new encoder-decoder framework is proposed called the Low- and High-Dimensional Feature Fusion (LHDFF) model for AAC. Moreover, in LHDFF, a new PANNs encoder is proposed called Residual PANNs (RPANNs) by fusing the low-dimensional feature from the intermediate convolution layer output and the high-dimensional feature from the final layer output of PANNs. To fully explore the information of the low- and high-dimensional fusion feature and high-dimensional feature respectively, we proposed dual transformer decoder structures to generate the captions in parallel. Especially, a probabilistic fusion approach is proposed that can ensure the overall performance of the system is improved by concentrating on the respective advantages of the two transformer decoders. Experimental results show that LHDFF achieves the best performance on the Clotho and AudioCaps datasets compared with other existing models

READ FULL TEXT

page 1

page 2

research
05/30/2023

Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning

Automated audio captioning (AAC) which generates textual descriptions of...
research
01/10/2022

Local Information Assisted Attention-free Decoder for Audio Captioning

Automated audio captioning (AAC) aims to describe audio data with captio...
research
07/21/2021

CL4AC: A Contrastive Loss for Audio Captioning

Automated Audio captioning (AAC) is a cross-modal translation task that ...
research
03/02/2020

Efficient Latent Representations using Multiple Tasks for Autonomous Driving

Driving in the dynamic, multi-agent, and complex urban environment is a ...
research
06/25/2021

Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets

We address voice activity detection in acoustic environments of transien...
research
11/02/2022

MAST: Multiscale Audio Spectrogram Transformers

We present Multiscale Audio Spectrogram Transformer (MAST) for audio cla...

Please sign up or login with your details

Forgot password? Click here to reset