DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions

by Vijay Viswanathan, et al.

Modern machine learning relies on datasets to develop and validate research ideas. Given the growth of publicly available data, finding the right dataset to use is increasingly difficult. Any research question imposes explicit and implicit constraints on how well a given dataset will enable researchers to answer this question, such as dataset size, modality, and domain. We operationalize the task of recommending datasets given a short natural language description of a research idea, to help people find relevant datasets for their needs. Dataset recommendation poses unique challenges as an information retrieval problem; datasets are hard to directly index for search and there are no corpora readily available for this task. To facilitate this task, we build the DataFinder Dataset which consists of a larger automatically-constructed training set (17.5K queries) and a smaller expert-annotated evaluation set (392 queries). Using this data, we compare various information retrieval algorithms on our test set and present a superior bi-encoder retriever for text-based dataset recommendation. This system, trained on the DataFinder Dataset, finds more relevant search results than existing third-party dataset search engines. To encourage progress on dataset recommendation, we release our dataset and models to the public.
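The bi-encoder retrieval setup mentioned above can be illustrated with a minimal sketch: the query and each dataset description are encoded independently into vectors, and datasets are ranked by vector similarity. Everything in this sketch is an assumption made for illustration. The `encode` function is a toy bag-of-words stand-in for a trained neural encoder, and the dataset descriptions and the names `encode`, `score`, and `recommend` are hypothetical, not taken from the DataFinder release.

```python
import math
import re
from collections import Counter


def encode(text):
    """Toy stand-in encoder: L2-normalised bag-of-words vector.

    Assumption: a real bi-encoder would use a trained neural text encoder here.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def score(query_vec, doc_vec):
    """Dot product of two sparse vectors (cosine similarity, since both are normalised)."""
    return sum(w * doc_vec.get(tok, 0.0) for tok, w in query_vec.items())


def recommend(query, dataset_descriptions, k=2):
    """Rank datasets by similarity between the query and each description."""
    # Descriptions are encoded independently of the query, so in practice
    # this index could be built once and reused across queries.
    index = {name: encode(desc) for name, desc in dataset_descriptions.items()}
    q = encode(query)
    ranked = sorted(index, key=lambda name: score(q, index[name]), reverse=True)
    return ranked[:k]


# Hypothetical one-line descriptions, written for this example only.
DATASETS = {
    "SQuAD": "reading comprehension question answering over Wikipedia text",
    "ImageNet": "large scale image classification with object categories",
    "LibriSpeech": "speech recognition corpus of read English audio books",
}

print(recommend("question answering over text passages", DATASETS, k=1))
```

Because the two sides are encoded separately, the dataset vectors can be precomputed, which is the usual reason to prefer a bi-encoder over scoring each query-dataset pair jointly.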



