MUX-PLMs: Pre-training Language Models with Data Multiplexing

02/24/2023
by   Vishvak Murahari, et al.

Data multiplexing is a recently proposed method for improving a model's inference efficiency by processing multiple instances simultaneously using an ordered representation mixture. Prior work on data multiplexing only used task-specific Transformers without any pre-training, which limited their accuracy and generality. In this paper, we develop pre-trained multiplexed language models (MUX-PLMs) that can be widely fine-tuned on any downstream task. Our approach includes a three-stage training procedure and novel multiplexing and demultiplexing modules that improve both throughput and downstream task accuracy. We demonstrate our method on BERT and ELECTRA pre-training objectives: our MUX-BERT and MUX-ELECTRA models achieve 2x/5x inference speedup with a 2-4% absolute drop in performance on GLUE and a 1-2% drop on token-level tasks.
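To make the core idea concrete, here is a minimal toy sketch of multiplexing and demultiplexing with plain NumPy. This is not the paper's actual architecture: the per-instance transforms and demultiplexing heads here are fixed random matrices standing in for the learned modules described above, and the shared model in between is omitted. The shapes and names (`multiplex`, `demultiplex`, `mux_transforms`, `demux_heads`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 4, 8  # number of instances multiplexed together, hidden size

# Stand-ins for the learned multiplexing/demultiplexing modules:
# one transform per position in the ordered mixture.
mux_transforms = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(N)]
demux_heads = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(N)]

def multiplex(instances):
    """Mix N instance embeddings into one shared representation,
    tagging each instance by its position-specific transform."""
    return sum(W @ x for W, x in zip(mux_transforms, instances)) / len(instances)

def demultiplex(mixed):
    """Produce one output vector per original instance from the
    single mixed representation."""
    return [V @ mixed for V in demux_heads]

instances = [rng.normal(size=d) for _ in range(N)]
mixed = multiplex(instances)   # one d-dim vector; the model runs once, not N times
outputs = demultiplex(mixed)   # N output vectors, one per instance
```

The throughput gain comes from the middle step that this sketch elides: the Transformer processes the single mixed vector (or sequence) once on behalf of all N instances, so the forward-pass cost is amortized across them.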

