Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT

05/24/2022
by James Lee-Thorp, et al.

We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of linear, mixing transformations to design the Sparse Mixer encoder model. The Sparse Mixer slightly outperforms BERT (<1%) on GLUE and SuperGLUE, but more importantly trains 65% faster. We also present a faster variant, prosaically named Fast Sparse Mixer, that marginally underperforms (<0.2%) but runs nearly twice as fast: 89% faster. We justify the design of these two models by carefully ablating through various mixing mechanisms, MoE configurations and model hyperparameters. The Sparse Mixer overcomes many of the latency and stability concerns of MoE models and offers the prospect of serving sparse student models, without resorting to distilling them to dense variants.
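To make the architecture described above concrete, below is a minimal, hypothetical JAX sketch of a single Sparse Mixer-style encoder block: a parameter-free linear mixing sublayer (shown here as an FNet-style Fourier mix purely for illustration; the paper ablates several mixing mechanisms) followed by a sparsely gated, top-1-routed mixture-of-experts feed-forward. This is not the authors' implementation; the shapes, parameter names, router, and dense expert dispatch are all assumptions made for brevity.

```python
# Illustrative sketch only (not the authors' code) of a Sparse Mixer-style block:
# a linear mixing sublayer followed by a top-1-routed MoE feed-forward.
import jax
import jax.numpy as jnp

def mixing_sublayer(x):
    """Mix token information with a 2D Fourier transform (real part), FNet-style."""
    return jnp.real(jnp.fft.fft2(x))  # x: [seq_len, d_model]

def moe_ffn(x, w_router, w_in, w_out):
    """Top-1 routed mixture-of-experts feed-forward (dense dispatch for clarity)."""
    # Router scores each token against every expert: [seq_len, num_experts].
    logits = x @ w_router
    expert_idx = jnp.argmax(logits, axis=-1)                         # [seq_len]
    gate = jax.nn.softmax(logits, axis=-1)                           # [seq_len, num_experts]
    gate = jnp.take_along_axis(gate, expert_idx[:, None], axis=-1)   # [seq_len, 1]
    # Run every expert on every token, then keep only the routed output.
    # (A real implementation dispatches tokens to experts with capacity limits.)
    hidden = jax.nn.gelu(jnp.einsum('sd,edh->esh', x, w_in))         # [E, S, d_ff]
    out = jnp.einsum('esh,ehd->esd', hidden, w_out)                  # [E, S, d_model]
    routed = out[expert_idx, jnp.arange(x.shape[0])]                 # [seq_len, d_model]
    return gate * routed

def sparse_mixer_block(x, params):
    """Residual connections as in BERT; layer norms omitted for brevity."""
    x = x + mixing_sublayer(x)
    x = x + moe_ffn(x, params['w_router'], params['w_in'], params['w_out'])
    return x

# Example usage with random weights (hypothetical sizes).
key = jax.random.PRNGKey(0)
S, D, H, E = 128, 512, 2048, 16
params = {
    'w_router': jax.random.normal(key, (D, E)) * 0.02,
    'w_in':     jax.random.normal(key, (E, D, H)) * 0.02,
    'w_out':    jax.random.normal(key, (E, H, D)) * 0.02,
}
y = sparse_mixer_block(jax.random.normal(key, (S, D)), params)  # [S, D]
```

In the full model, only some encoder layers replace attention with mixing and only some feed-forward layers are MoE; the sketch is meant only to show how the two sublayers compose into one block.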


