Augmentation and knowledge distillation (KD) are well-established techni...
Previously, Target Speaker Extraction (TSE) has yielded outstanding perf...
Visual information can serve as an effective cue for target speaker extr...
Currently, the most prominent algorithm to train keyword spotting (KWS) m...
Transformers have emerged as a prominent model framework for audio taggi...
Keyword spotting (KWS) is a core human-machine-interaction front-end tas...
We study the usability of pre-trained weakly supervised audio tagging (A...
Within the audio research community and the industry, keyword spotting (...
Large-scale audio tagging datasets inevitably contain imperfect labels, ...
Voice activity detection is an essential pre-processing component for sp...
Automated Audio Captioning is a cross-modal task, generating natural lan...
Automated audio captioning (AAC) aims at generating summarizing descript...
Sound event detection (SED) is the task of tagging the absence or presen...
Although recent progress in speaker verification generates powerful models...
How to visually localize multiple sound sources in unconstrained videos ...
Traditional supervised voice activity detection (VAD) methods work well ...
Traditional voice activity detection (VAD) methods work well in clean an...
Depression detection research has increased over the last few decades as...
Captioning has attracted much attention in image and video understanding...
Task 4 of the DCASE 2018 challenge demonstrated that substantially more r...
Recent advances in automatic depression detection mostly derive from mod...
An increasing amount of research has shed light on machine perception of au...