Structured Pruning of Large Language Models
Large language models have recently achieved state-of-the-art performance across a wide variety of natural language tasks. At the same time, their size and latency have grown significantly, which makes them costly to use and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a novel structured pruning approach based on low-rank factorization and augmented Lagrangian L0 norm regularization. Our structured approach achieves significant inference speedups while matching or outperforming our unstructured pruning baseline at various sparsity levels. We apply our method to state-of-the-art models on the enwiki8 dataset and obtain a 1.19 perplexity score with just 5M parameters, vastly outperforming a model of the same size trained from scratch. We also demonstrate that our method can be applied to language model fine-tuning by pruning the BERT model on several downstream classification benchmarks.