An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

by Jiaao Chen et al.

NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP.
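Of the method families surveyed, token-level augmentations are the simplest to illustrate. As a minimal sketch (not the paper's implementation), the EDA-style random-swap and random-deletion operations can be written as follows; the function names and parameters here are illustrative assumptions:

```python
import random

def random_swap(tokens, n_swaps=1, rng=None):
    """Token-level augmentation: swap two random token positions n_swaps times."""
    rng = rng or random.Random()
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, rng=None):
    """Token-level augmentation: drop each token independently with probability p.
    Always keeps at least one token so the example never becomes empty."""
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]
```

Each augmented sentence is paired with the original label, cheaply multiplying the effective size of a small labeled set; sentence-level, adversarial, and hidden-space methods transform larger units or perturb model representations instead of individual tokens.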


