A Squeeze-and-Excitation and Transformer based Cross-task System for Environmental Sound Recognition
Environmental sound recognition (ESR) is an emerging research topic in audio pattern recognition. Many tasks have been proposed that apply computational systems to ESR in real-life applications. However, current systems are usually designed for individual tasks, and are neither robust nor applicable to other tasks. Cross-task systems, which promote unified knowledge modeling across various tasks, have not been thoroughly investigated. In this paper, we propose a cross-task system for three different ESR tasks: acoustic scene classification, urban sound tagging, and anomalous sound detection. We present an architecture named SE-Trans that uses attention-based Squeeze-and-Excitation and Transformer encoder modules to learn the channel-wise relationships and temporal dependencies of the acoustic features. FMix is employed as the data augmentation method to improve ESR performance. Evaluations for the three tasks are conducted on recent databases from the DCASE challenges. The experimental results show that the proposed cross-task system achieves state-of-the-art performance on all tasks. Further analysis demonstrates that the proposed cross-task system can effectively utilize acoustic knowledge across different ESR tasks.
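To make the channel-wise attention concrete, the following is a minimal sketch of a generic Squeeze-and-Excitation gate in plain Python. It is not the SE-Trans implementation: the abstract does not give layer sizes, reduction ratios, or how the block is wired to the Transformer encoder, so the function name `squeeze_excite`, the weight arguments, and the channel x time x frequency input layout are all assumptions made for illustration.

```python
import math


def squeeze_excite(feature_maps, w1, b1, w2, b2):
    """Apply a generic Squeeze-and-Excitation gate (illustrative only).

    feature_maps: list of C channels, each a T x F list of lists (assumed layout).
    w1, b1: bottleneck layer, shape (C // r) x C and (C // r).
    w2, b2: expansion layer, shape C x (C // r) and C.
    Returns the rescaled feature maps and the per-channel gates.
    """
    # Squeeze: global average over time and frequency for each channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_maps
    ]
    # Excitation, step 1: bottleneck fully connected layer with ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)) + b)
        for row, b in zip(w1, b1)
    ]
    # Excitation, step 2: expand back to C values, sigmoid gate in (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
        for row, b in zip(w2, b2)
    ]
    # Scale: multiply every time-frequency bin of a channel by its gate.
    scaled = [
        [[g * v for v in row] for row in ch]
        for ch, g in zip(feature_maps, gates)
    ]
    return scaled, gates
```

In a trained network the weights would be learned, so that informative channels receive gates near 1 and less useful ones are suppressed. Temporal dependencies would then be modeled by a separate Transformer encoder stage, which this sketch does not cover.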