FastMoE: A Fast Mixture-of-Expert Training System

03/24/2021
by Jiaao He, et al.

Mixture-of-Experts (MoE) shows strong potential for enlarging language models to trillions of parameters. However, training trillion-scale MoE models requires algorithm and system co-design to obtain a well-tuned, high-performance distributed training system. Unfortunately, the only existing platform that meets these requirements depends heavily on Google's hardware (TPU) and software (Mesh TensorFlow) stack, and is not open and available to the public, especially to the GPU and PyTorch communities. In this paper, we present FastMoE, a distributed MoE training system based on PyTorch and common accelerators. The system provides a hierarchical interface that allows both flexible model design and easy adaptation to different applications, such as Transformer-XL and Megatron-LM. Unlike a direct implementation of MoE models in PyTorch, FastMoE highly optimizes training speed with sophisticated high-performance acceleration techniques. The system supports placing different experts on multiple GPUs across multiple nodes, so the number of experts can grow linearly with the number of GPUs. The source code of FastMoE is available at https://github.com/laekov/fastmoe under the Apache-2.0 license.
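To make the layer that FastMoE accelerates concrete, the sketch below shows a minimal top-k gated mixture-of-experts feed-forward layer in plain PyTorch. It is only an illustration of the general technique under assumed names (the class NaiveMoE and its parameters are hypothetical and are not FastMoE's interface); FastMoE replaces the Python loop over experts with optimized kernels and can place the experts on different GPUs across nodes.

```python
# Minimal sketch of a top-k gated Mixture-of-Experts feed-forward layer.
# Hypothetical names (NaiveMoE, num_experts, top_k); NOT FastMoE's API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveMoE(nn.Module):
    def __init__(self, d_model, d_hidden, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an independent two-layer MLP, as in a Transformer FFN.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The gate scores every token against every expert.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):
        # x: (num_tokens, d_model); a real model reshapes (batch, seq, d_model) first.
        scores = F.softmax(self.gate(x), dim=-1)            # (tokens, experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # route each token to its top_k experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = indices[:, k] == e                    # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = NaiveMoE(d_model=16, d_hidden=64)
    tokens = torch.randn(8, 16)
    print(layer(tokens).shape)  # torch.Size([8, 16])
```

A naive layer like this is what the abstract contrasts against: FastMoE exposes the same token routing through its hierarchical PyTorch interface while handling the dispatch of tokens to experts, and the placement of those experts across GPUs and nodes, with high-performance code.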


Related research:

- HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System (03/28/2022)
- MegBA: A High-Performance and Distributed Library for Large-Scale Bundle Adjustment (12/02/2021)
- AdaNet: A Scalable and Flexible Framework for Automatically Learning Ensembles (04/30/2019)
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference (08/23/2023)
- HetSeq: Distributed GPU Training on Heterogeneous Infrastructure (09/25/2020)
- TF-Replicator: Distributed Machine Learning for Researchers (02/01/2019)
- M^3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design (10/26/2022)
