Goal Driven Discovery of Distributional Differences via Language Descriptions

by Ruiqi Zhong, et al.

Mining large corpora can generate useful discoveries but is time-consuming for humans. We formulate a new task, D5, that automatically discovers differences between two large corpora in a goal-driven way. The task input is a problem comprising a research goal (e.g., "comparing the side effects of drug A and drug B") and a corpus pair (e.g., two large collections of patients' self-reported reactions after taking each drug). The output is a language description (a discovery) of how these corpora differ (e.g., patients taking drug A "mention feelings of paranoia" more often). We build a D5 system, and to measure its performance quantitatively, we 1) contribute a meta-dataset, OpenD5, aggregating 675 open-ended problems ranging across business, social sciences, humanities, machine learning, and health, and 2) propose a set of unified evaluation metrics: validity, relevance, novelty, and significance. With the dataset and the unified metrics, we confirm that language models can use the goals to propose more relevant, novel, and significant candidate discoveries. Finally, our system produces discoveries previously unknown to the authors on a wide range of applications in OpenD5, including temporal and demographic differences in discussion topics, political stances and stereotypes in speech, insights in commercial reviews, and error patterns in NLP models.
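The task's input and output can be made concrete with a small Python sketch. It is illustrative only: the names `D5Problem`, `satisfaction_rate`, `frequency_gap`, and `mentions_paranoia` are hypothetical, the texts are invented toy data, and the keyword check is a crude stand-in for judging whether a description holds for a text. The frequency gap is one plausible reading of "more often", not necessarily how the paper defines validity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class D5Problem:
    """Hypothetical container for one problem: a research goal plus a corpus pair."""
    goal: str
    corpus_a: List[str]
    corpus_b: List[str]


def satisfaction_rate(corpus: List[str], holds: Callable[[str], bool]) -> float:
    """Fraction of texts in the corpus for which the description holds."""
    if not corpus:
        return 0.0
    return sum(1 for text in corpus if holds(text)) / len(corpus)


def frequency_gap(problem: D5Problem, holds: Callable[[str], bool]) -> float:
    """Assumed score: how much more often the description holds in corpus A than in B."""
    return satisfaction_rate(problem.corpus_a, holds) - satisfaction_rate(
        problem.corpus_b, holds
    )


def mentions_paranoia(text: str) -> bool:
    """Toy stand-in for the description 'mentions feelings of paranoia'."""
    return "paranoi" in text.lower()


# Invented toy data, not taken from OpenD5.
problem = D5Problem(
    goal="comparing the side effects of drug A and drug B",
    corpus_a=[
        "I felt paranoid for two days.",
        "Mild headache, nothing else.",
        "Strong paranoia at night.",
        "Slept badly.",
    ],
    corpus_b=[
        "Some nausea after the first dose.",
        "Felt a bit paranoid once.",
        "No side effects at all.",
        "Dry mouth.",
    ],
)

gap = frequency_gap(problem, mentions_paranoia)
print(f"Description holds {gap:+.2f} more often in corpus A")
```

A positive gap suggests the candidate description is more characteristic of corpus A. Relevance, novelty, and significance depend on the research goal and on what is already known, so a frequency count of this kind does not capture them.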


