Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization

by Chen Liang, et al.

The Lottery Ticket Hypothesis suggests that an over-parametrized network consists of "lottery tickets", and that training a certain collection of them (i.e., a subnetwork) can match the performance of the full model. In this paper, we study such a collection of tickets, referred to as "winning tickets", in extremely over-parametrized models, e.g., pre-trained language models. We observe that at certain compression ratios, the generalization performance of the winning tickets can not only match but also exceed that of the full model. In particular, we observe a phase transition phenomenon: as the compression ratio increases, the generalization performance of the winning tickets first improves and then deteriorates after a certain threshold. We refer to the tickets at this threshold as "super tickets". We further show that the phase transition is task- and model-dependent: as the model becomes larger and the training dataset becomes smaller, the transition becomes more pronounced. Our experiments on the GLUE benchmark show that the super tickets improve single-task fine-tuning by 0.9 points on BERT-base and 1.0 points on BERT-large, in terms of task-average score. We also demonstrate that adaptively sharing the super tickets across tasks benefits multi-task learning.
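The compression-ratio sweep described above can be illustrated with a minimal sketch. This is not the paper's method (the paper prunes attention heads and FFN layers using sensitivity-based importance scores); it is a generic magnitude-pruning example in which `compression_ratio` is an illustrative parameter name, showing how a subnetwork mask might be extracted at each ratio before evaluating generalization:

```python
def prune_by_magnitude(weights, compression_ratio):
    """Zero out the smallest-magnitude fraction of weights.

    weights: flat list of floats (stand-in for model parameters)
    compression_ratio: fraction of weights to remove, in [0.0, 1.0]
    Returns (pruned_weights, mask) where mask marks kept weights with 1.
    """
    n_prune = int(len(weights) * compression_ratio)
    # Rank indices by absolute magnitude, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask

# Sweeping the ratio mimics the phase-transition experiment: at each
# ratio, one would retrain/evaluate the retained subnetwork and look
# for the peak ("super ticket") before performance deteriorates.
w = [0.9, -0.05, 0.4, 0.01, -0.7]
pruned, mask = prune_by_magnitude(w, 0.4)  # drops the two smallest weights
```

In the actual experiments, the pruned units are structured components (heads and layers) rather than individual weights, but the ticket-extraction logic has the same shape: score, rank, mask, then fine-tune the surviving subnetwork.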


