Exploiting Redundancy in Pre-trained Language Models for Efficient Transfer Learning
Large pre-trained contextual word representations have transformed the field of natural language processing, obtaining impressive results on a wide range of tasks. However, as models increase in size, computational limitations make them impractical for researchers and practitioners alike. We hypothesize that contextual representations have both intrinsic and task-specific redundancies. We propose a novel feature selection method, which takes advantage of these redundancies to reduce the size of the pre-trained features. In a comprehensive evaluation on two pre-trained models, BERT and XLNet, using a diverse suite of sequence labeling and sequence classification tasks, our method reduces the feature set down to 1–7% of the original size, while maintaining more than 97% of the performance.