SDBERT: SparseDistilBERT, a faster and smaller BERT model

07/28/2022
by Devaraju Vinoda, et al.

In this work we introduce a new transformer architecture called SparseDistilBERT (SDBERT), which combines sparse attention and knowledge distillation (KD). We implemented a sparse attention mechanism to reduce the quadratic dependency on input length to linear. In addition to reducing the computational complexity of the model, we used knowledge distillation to compress it. We were able to reduce the size of the BERT model by 60% while largely preserving its performance, and training took only 40% of the time.
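As a rough illustration of the two ingredients described above, the sketch below (not the authors' code) shows a sliding-window sparse attention mask, whose cost depends on the window size rather than the full squared sequence length, together with a standard soft-target knowledge-distillation loss. The window size, temperature, and alpha values are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: sliding-window sparse attention + knowledge distillation.
# Hyperparameters (window, temperature, alpha) are illustrative only.
import torch
import torch.nn.functional as F


def local_attention(q, k, v, window: int = 64):
    """Each query attends only to keys within +/- `window` positions.

    For readability this materializes the full score matrix and masks it;
    an efficient implementation computes only the banded block, which is
    what gives the O(n * window) rather than O(n^2) scaling.
    """
    n = q.size(-2)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (..., n, n)
    idx = torch.arange(n, device=q.device)
    blocked = (idx[None, :] - idx[:, None]).abs() > window  # True = masked out
    scores = scores.masked_fill(blocked, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL against the teacher with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard


# Example shapes: (batch, heads, seq_len, head_dim)
q = k = v = torch.randn(2, 8, 512, 64)
out = local_attention(q, k, v, window=64)
```

Whether the paper uses a sliding-window pattern or a different sparsity structure (e.g., block or global tokens) is not stated in this teaser; the sketch only conveys the general idea of restricting each token's attention span and training the smaller student against a BERT teacher.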

Related research

08/26/2023
Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning
The use of large transformer-based models such as BERT, GPT, and T5 has ...

11/09/2019
Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models
In this paper, we explore the knowledge distillation approach under the ...

07/28/2020
Big Bird: Transformers for Longer Sequences
Transformers-based models, such as BERT, have been one of the most succe...

10/31/2022
QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
Limited computational budgets often prevent transformers from being used...

02/22/2021
Using Prior Knowledge to Guide BERT's Attention in Semantic Textual Matching Tasks
We study the problem of incorporating prior knowledge into a deep Transf...

11/11/2022
FAN-Trans: Online Knowledge Distillation for Facial Action Unit Detection
Due to its importance in facial behaviour analysis, facial action unit (...

03/29/2023
Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning
Language model probing is often used to test specific capabilities of th...
