Large Language Models for Automated Open-domain Scientific Hypotheses Discovery

09/06/2023
by Zonglin Yang, et al.
Nanyang Technological University
Singapore University of Technology and Design
The University of Texas at Dallas

Hypothetical induction is recognized as the main reasoning type when scientists make observations about the world and try to propose hypotheses to explain those observations. Past research on hypothetical induction has been limited in two respects: (1) the observation annotations in the dataset are not raw web corpora but manually selected sentences (resulting in a closed-domain setting); and (2) the ground-truth hypothesis annotations are mostly commonsense knowledge, making the task less challenging. In this work, we propose the first NLP dataset for social science academic hypothesis discovery, consisting of 50 recent papers published in top social science journals. The dataset also includes the raw web corpora needed to develop the hypotheses in those papers, with the final goal of creating a system that automatically generates valid, novel, and helpful (to human researchers) hypotheses given only a pile of raw web corpora. The new dataset addresses the previous limitations because it requires a system to (1) use raw web corpora as observations; and (2) propose hypotheses that are new even to humanity. We develop a multi-module framework for the task, along with three feedback mechanisms that empirically show performance gains over the base framework. Finally, our framework exhibits high performance under both GPT-4-based evaluation and social science expert evaluation.

02/07/2020

A tale of two databases: The use of Web of Science and Scopus in academic papers

Web of Science and Scopus are two world-leading and competing citation d...
01/17/2022

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

The need for raw large raw corpora has dramatically increased in recent ...
09/07/2023

Can Large Language Models Discern Evidence for Scientific Hypotheses? Case Studies in the Social Sciences

Hypothesis formulation and testing are central to empirical research. A ...
02/11/2018

Validation and Topic-driven Ranking for Biomedical Hypothesis Generation Systems

Literature underpins research, providing the foundation for new ideas. B...
06/16/2020

Causal Knowledge Extraction from Scholarly Papers in Social Sciences

The scale and scope of scholarly articles today are overwhelming human r...
02/28/2023

Goal Driven Discovery of Distributional Differences via Language Descriptions

Mining large corpora can generate useful discoveries but is time-consumi...
