ModuleFormer: Learning Modular Large Language Models From Uncurated Data

by Yikang Shen, et al.

Large Language Models (LLMs) have achieved remarkable results. However, existing models are expensive to train and deploy, and it is difficult to expand their knowledge beyond the pre-training data without forgetting previous knowledge. This paper proposes a new neural network architecture, ModuleFormer, that leverages modularity to improve the efficiency and flexibility of large language models. ModuleFormer is based on the Sparse Mixture of Experts (SMoE). Unlike the previous SMoE-based modular language model [Gururangan et al., 2021], which requires domain-labeled data to learn domain-specific experts, ModuleFormer can induce modularity from uncurated data with its new load balancing and load concentration losses. ModuleFormer is a modular architecture that includes two different types of modules: new stick-breaking attention heads and feedforward experts. Different modules are sparsely activated, conditioned on the input token, during both training and inference. In our experiments, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) Efficiency: because ModuleFormer activates only a subset of its modules for each input token, it can achieve the same performance as dense LLMs with more than twice the throughput; 2) Extendability: ModuleFormer is more immune to catastrophic forgetting than dense LLMs and can be easily extended with new modules to learn new knowledge that is not included in the training data; 3) Specialization: finetuning ModuleFormer can specialize a subset of modules to the finetuning task, and the task-unrelated modules can be easily pruned for lightweight deployment.
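
The sparse activation described above, where only a subset of modules runs for each input token, can be illustrated with a generic top-k routing sketch. This is not ModuleFormer's implementation: the function names (`top_k_route`, `sparse_moe`), the toy experts, and the choice to renormalize the selected gate weights are assumptions made purely for illustration. The stick-breaking attention heads and the load balancing and load concentration losses are not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k experts with the highest router probability.

    Returns (expert_index, weight) pairs, best first. Renormalizing the
    selected weights to sum to 1 is an assumption of this sketch.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def sparse_moe(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs.

    Experts that are not selected are never called, which is where the
    compute saving of sparse activation comes from.
    """
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](token)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Hypothetical toy experts standing in for feedforward modules.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [0.0 for _ in x],
]

logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical router scores for one token
routes = top_k_route(logits, k=2)
output = sparse_moe([1.0, 2.0], logits, experts, k=2)
```

With four experts and `k=2`, only experts 1 and 3 are evaluated for this token. In a trained model the router scores would be produced by a learned function of the token representation rather than fixed by hand.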




