Pretraining Without Attention

12/20/2022
by Junxiong Wang, et al.

Transformers have been essential to the success of pretraining in NLP. Other architectures have been tried, but they still require attention layers to match benchmark accuracy. This work explores pretraining without attention. We test recently developed routing layers based on state-space models (SSMs) and model architectures based on multiplicative gating. Used together, these modeling choices have a large impact on pretraining accuracy. Empirically, the proposed Bidirectional Gated SSM (BiGS) replicates BERT pretraining results without attention and can be extended to long-form pretraining of 4,096 tokens without approximation.
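For intuition, the sketch below shows one way a bidirectional gated SSM block could be wired together in PyTorch. It is not the authors' code: the DiagonalSSM toy recurrence, the projection names (to_v, to_u), the hidden width, the GELU activations, and the residual/LayerNorm placement are all illustrative assumptions. The paper's actual routing layers use trained state-space kernels rather than this simple per-channel scan; only the overall pattern (SSM routing in place of attention, combined by multiplicative gating) is what the abstract describes.

```python
import torch
import torch.nn as nn


class DiagonalSSM(nn.Module):
    """Toy per-channel linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = h_t.
    Stands in for a trained state-space kernel purely for illustration."""

    def __init__(self, dim):
        super().__init__()
        self.a_logit = nn.Parameter(torch.zeros(dim))  # decay, squashed into (0, 1)
        self.b = nn.Parameter(torch.ones(dim))         # input scaling

    def forward(self, x):
        # x: (batch, length, dim); scan the recurrence over the length axis.
        a = torch.sigmoid(self.a_logit)
        h = torch.zeros(x.shape[0], x.shape[2], device=x.device, dtype=x.dtype)
        outputs = []
        for t in range(x.shape[1]):
            h = a * h + self.b * x[:, t]
            outputs.append(h)
        return torch.stack(outputs, dim=1)


class BiGSBlock(nn.Module):
    """Sketch of a bidirectional gated SSM block: forward and backward SSM
    routing replaces attention, combined through multiplicative gating."""

    def __init__(self, dim, hidden=None):
        super().__init__()
        hidden = hidden or 2 * dim
        self.norm = nn.LayerNorm(dim)
        self.to_v = nn.Linear(dim, hidden)   # "value"-like path fed to the SSMs
        self.to_u = nn.Linear(dim, hidden)   # gate path
        self.fwd_ssm = DiagonalSSM(hidden)   # left-to-right context
        self.bwd_ssm = DiagonalSSM(hidden)   # right-to-left context
        self.out = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x):
        residual = x
        x = self.norm(x)
        v = self.act(self.to_v(x))
        u = self.act(self.to_u(x))
        # Bidirectional routing: run the SSM forward and on the reversed sequence.
        context = self.fwd_ssm(v) + self.bwd_ssm(v.flip(1)).flip(1)
        # Multiplicative gating: elementwise product of gate and routed context.
        return residual + self.out(u * context)


block = BiGSBlock(dim=64)
tokens = torch.randn(2, 128, 64)   # (batch, sequence length, model dim)
print(block(tokens).shape)         # torch.Size([2, 128, 64])
```

Because the sequence mixing is a recurrence rather than pairwise attention, a block of this shape scales linearly in sequence length, which is what makes long-form pretraining at 4,096 tokens feasible without approximation.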


