A Hybrid System of Sound Event Detection Transformer and Frame-wise Model for DCASE 2022 Task 4

by Yiming Li, et al.

In this paper, we describe in detail our system for DCASE 2022 Task 4. The system combines two considerably different models: an end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise model, the Metric Learning and Focal Loss CNN (MLFL-CNN). The former is an event-wise model that learns event-level representations and predicts sound event categories and boundaries directly. The latter follows the widely adopted frame-classification scheme, in which each frame is classified into event categories and event boundaries are obtained by post-processing such as thresholding and smoothing. For SEDT, we apply self-supervised pre-training on unlabeled data and adopt semi-supervised learning with an online teacher. The teacher is updated from the student model using the Exponential Moving Average (EMA) strategy and generates reliable pseudo labels for weakly labeled and unlabeled data. For the frame-wise model, we use the ICT-TOSHIBA system of DCASE 2021 Task 4. Experimental results show that the hybrid system considerably outperforms either individual model, achieving a PSDS1 of 0.420 and a PSDS2 of 0.783 on the validation set without external data. The code is available at https://github.com/965694547/Hybrid-system-of-frame-wise-model-and-SEDT.
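The two mechanisms named in the abstract, the EMA teacher update and frame-wise post-processing by thresholding and smoothing, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ema_update` and `frames_to_events`, the decay value, the 0.5 threshold, the median filter as the smoothing method, and the window and hop sizes are all assumptions chosen for illustration.

```python
from statistics import median


def ema_update(teacher, student, decay=0.999):
    """Return updated teacher parameters as an exponential moving average.

    Each teacher value becomes decay * teacher + (1 - decay) * student.
    The decay of 0.999 is an assumed value, not taken from the paper.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def frames_to_events(probs, threshold=0.5, window=3, hop_seconds=0.02):
    """Convert per-frame probabilities of one class into (onset, offset) pairs.

    Steps: binarize by threshold, smooth with a median filter (an assumed
    choice of smoothing), then merge runs of active frames into events.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    binary = [1 if p >= threshold else 0 for p in probs]

    # Zero-pad so that every frame has a full window around it.
    half = window // 2
    padded = [0] * half + binary + [0] * half
    smoothed = [median(padded[i:i + window]) for i in range(len(binary))]

    # Scan for runs of active frames; the trailing 0 closes an open run.
    events, start = [], None
    for i, active in enumerate(smoothed + [0]):
        if active and start is None:
            start = i
        elif not active and start is not None:
            events.append((start * hop_seconds, i * hop_seconds))
            start = None
    return events


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.2, 0.8, 0.9, 0.7, 0.1, 0.1]
    print(frames_to_events(probs, hop_seconds=1.0))
    print(ema_update([1.0, 2.0], [0.0, 0.0], decay=0.5))
```

In this sketch the median filter removes the isolated active frame at index 1 and fills the single-frame gap at index 2, which is the kind of clean-up frame-wise models depend on. An event-wise model such as SEDT predicts boundaries directly and does not need this step.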




