Data leakage in cross-modal retrieval training: A case study

02/23/2023
by Benno Weck, et al.

Recent progress in text-based audio retrieval was largely propelled by the release of suitable datasets. Because creating such datasets by hand is laborious, sourcing data from online resources is a cheap way to build large-scale ones. We study the recently proposed SoundDesc benchmark dataset, which was automatically sourced from the BBC Sound Effects web page. Our analysis finds that SoundDesc contains several duplicates, which leak training data into the evaluation data. This leakage leads to overly optimistic retrieval performance estimates in previous benchmarks. We propose new training, validation, and testing splits for the dataset and make them available online. To avoid weak contamination of the test data, we pool audio files that share similar recording setups so that they fall into the same split. In our experiments, the new splits serve as a more challenging benchmark.
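The two ideas in the abstract, detecting duplicates that cross the train/test boundary and keeping related recordings in one split, can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not say how duplicates were found or how recording setups were identified. The sketch assumes byte-identical files as the duplicate criterion (a real pipeline would likely need audio fingerprinting to catch re-encoded copies), and the function names `find_leaks` and `grouped_split`, along with the per-item group label, are hypothetical.

```python
import hashlib
import random
from collections import defaultdict


def content_hash(audio_bytes: bytes) -> str:
    """Hash raw file bytes; this only catches bit-identical duplicates (assumption)."""
    return hashlib.sha256(audio_bytes).hexdigest()


def find_leaks(train: dict, test: dict) -> list:
    """Return ids of test items whose content also appears in the training set."""
    train_hashes = {content_hash(data) for data in train.values()}
    return sorted(i for i, data in test.items() if content_hash(data) in train_hashes)


def grouped_split(item_groups: dict, fractions=(0.8, 0.1, 0.1), seed=0) -> dict:
    """Split items into train/val/test so that no group spans two splits.

    item_groups maps an item id to a group label, e.g. a hypothetical
    recording-session id standing in for "similar recording setup".
    """
    groups = defaultdict(list)
    for item, group in item_groups.items():
        groups[group].append(item)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # deterministic shuffle of whole groups

    names = ["train", "val", "test"]
    splits = {name: [] for name in names}
    targets = [f * len(item_groups) for f in fractions]

    idx = 0
    for key in keys:
        # Move on to the next split once the current one has reached its target size.
        while idx < len(names) - 1 and len(splits[names[idx]]) >= targets[idx]:
            idx += 1
        splits[names[idx]].extend(groups[key])
    return splits
```

Because whole groups are assigned at once, the resulting split sizes only approximate the requested fractions; that trade-off is the price of keeping related recordings together.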
