Training Question Answering Models From Synthetic Data

02/22/2020
by Raul Puri, et al.

Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human-labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of large language models, and it explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we achieve higher accuracy using solely synthetic questions and answers than when using the SQuAD1.1 training set questions alone. Removing access to real Wikipedia data, we synthesize questions and answers from a synthetic corpus generated by an 8.3-billion-parameter GPT-2 model. With no access to human supervision and only access to other models, we are able to train state-of-the-art question answering networks on entirely model-generated data that achieve 88.4 Exact Match (EM) and 93.9 F1 score on the SQuAD1.1 dev set. We further apply our methodology to SQuAD2.0 and show a 2.8-point absolute gain in EM score compared to prior work using synthetic data.
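The abstract reports results as Exact Match (EM) and F1. As a rough illustration of what those two numbers measure, here is a minimal Python sketch in the spirit of the commonly used SQuAD-style evaluation. It is not the authors' code: the function names (`normalize_answer`, `exact_match`, `f1_score`) and the exact normalization steps are assumptions chosen for illustration.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors the usual SQuAD-style normalization; the
    paper's own evaluation code may differ in detail.
    """
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower!"))
    print(f1_score("in Paris France", "Paris"))
```

A dataset-level score would then average these per-question values, typically taking the best score over the available reference answers for each question. EM rewards only exact agreement after normalization, while F1 gives partial credit for overlapping tokens, which is why the reported F1 is higher than the reported EM.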

research
10/04/2020

When in Doubt, Ask: Generating Answerable and Unanswerable Questions, Unsupervised

Question Answering (QA) is key for making possible a robust communicatio...
research
04/08/2021

PQA: Perceptual Question Answering

Perceptual organization remains one of the very few established theories...
research
01/06/2021

EfficientQA : a RoBERTa Based Phrase-Indexed Question-Answering System

State-of-the-art extractive question answering models achieve superhuman...
research
10/19/2020

Understanding Unnatural Questions Improves Reasoning over Text

Complex question answering (CQA) over raw text is a challenging task. A ...
research
05/24/2023

Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering

We train a language model (LM) to robustly answer multistep questions by...
research
10/30/2022

Transfer Learning with Synthetic Corpora for Spatial Role Labeling and Reasoning

Recent research shows synthetic data as a source of supervision helps pr...
research
10/17/2022

Adversarial and Safely Scaled Question Generation

Question generation has recently gained a lot of research interest, espe...
