DAGAM: Data Augmentation with Generation And Modification

by   Byeong-Cheol Jo, et al.

Text classification is a representative downstream task of natural language processing, and has exhibited excellent performance since the advent of pre-trained language models based on Transformer architecture. However, in pre-trained language models, under-fitting often occurs due to the size of the model being very large compared to the amount of available training data. Along with significant importance of data collection in modern machine learning paradigm, studies have been actively conducted for natural language data augmentation. In light of this, we introduce three data augmentation schemes that help reduce underfitting problems of large-scale language models. Primarily we use a generation model for data augmentation, which is defined as Data Augmentation with Generation (DAG). Next, we augment data using text modification techniques such as corruption and word order change (Data Augmentation with Modification, DAM). Finally, we propose Data Augmentation with Generation And Modification (DAGAM), which combines DAG and DAM techniques for a boosted performance. We conduct data augmentation for six benchmark datasets of text classification task, and verify the usefulness of DAG, DAM, and DAGAM through BERT-based fine-tuning and evaluation, deriving better results compared to the performance with original datasets.


page 1

page 2

page 3

page 4


Unsupervised Paraphrase Generation using Pre-trained Language Models

Large scale Pre-trained Language Models have proven to be very powerful ...

Adapting BERT for Word Sense Disambiguation with Gloss Selection Objective and Example Sentences

Domain adaptation or transfer learning using pre-trained language models...

Transformers as Neural Augmentors: Class Conditional Sentence Generation via Variational Bayes

Data augmentation methods for Natural Language Processing tasks are expl...

UU-Tax at SemEval-2022 Task 3: Improving the generalizability of language models for taxonomy classification through data augmentation

This paper presents our strategy to address the SemEval-2022 Task 3 PreT...

Multi-Scales Data Augmentation Approach In Natural Language Inference For Artifacts Mitigation And Pre-Trained Model Optimization

Machine learning models can reach high performance on benchmark natural ...

M-Evolve: Structural-Mapping-Based Data Augmentation for Graph Classification

Graph classification, which aims to identify the category labels of grap...

Please sign up or login with your details

Forgot password? Click here to reset