
SDBERT: SparseDistilBERT, a faster and smaller BERT model

07/28/2022 ∙ by Devaraju Vinoda, et al.

In this work we introduce a new transformer architecture called SparseDistilBERT (SDBERT), which combines sparse attention with knowledge distillation (KD). We implemented a sparse attention mechanism to reduce the dependency on input length from quadratic to linear. In addition to reducing the computational complexity of the model, we used knowledge distillation (KD). We were able to reduce the size of the BERT model by 60 [...] performance and it only took 40 [...]
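The two ingredients named in the abstract can be illustrated with a small, dependency-free sketch. The abstract does not say which sparsity pattern or distillation objective SDBERT uses, so both choices here are assumptions made for illustration: a sliding-window pattern (each token attends only to neighbours within a fixed `window`, so cost grows linearly with sequence length) and the common temperature-softened KL divergence between teacher and student logits. The function names `sliding_window_attention` and `distillation_loss` are hypothetical and not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def sliding_window_attention(q, k, v, window):
    """Assumed sparse pattern: query i attends only to keys j with |i - j| <= window.

    q, k, v are lists of n vectors (lists of floats). Each query scores at most
    2 * window + 1 keys, so the cost is O(n * window * d) instead of O(n^2 * d).
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Scaled dot-product scores, restricted to the local window.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(lo, hi)
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors inside the window.
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, range(lo, hi)))
            for c in range(len(v[0]))
        ])
    return out


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Assumed KD objective: T^2 * KL(teacher || student) on softened logits."""
    log_p = log_softmax([z / temperature for z in teacher_logits])
    log_q = log_softmax([z / temperature for z in student_logits])
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


if __name__ == "__main__":
    # Toy inputs: 4 tokens with 2-dimensional embeddings.
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    print(sliding_window_attention(x, x, x, window=1))
    print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

With `window` set to the sequence length or more, the function reduces to ordinary full attention. The linear-cost behaviour comes from keeping `window` fixed as the input grows.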

Devaraju Vinoda
Pawan Kumar Yadav

research

∙ 08/26/2023

Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning

The use of large transformer-based models such as BERT, GPT, and T5 has ...

0 Apoorv Dankar, et al. ∙

research

∙ 11/09/2019

Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models

In this paper, we explore the knowledge distillation approach under the ...

0 Linqing Liu, et al. ∙

research

∙ 07/28/2020

Big Bird: Transformers for Longer Sequences

Transformers-based models, such as BERT, have been one of the most succe...

73 Manzil Zaheer, et al. ∙

research

∙ 10/31/2022

QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

Limited computational budgets often prevent transformers from being used...

3 Shira Guskin, et al. ∙

research

∙ 02/22/2021

Using Prior Knowledge to Guide BERT's Attention in Semantic Textual Matching Tasks

We study the problem of incorporating prior knowledge into a deep Transf...

0 Tingyu Xia, et al. ∙

research

∙ 11/11/2022

FAN-Trans: Online Knowledge Distillation for Facial Action Unit Detection

Due to its importance in facial behaviour analysis, facial action unit (...

0 Jing Yang, et al. ∙

research

∙ 03/29/2023

Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning

Language model probing is often used to test specific capabilities of th...

0 Namrata Shivagunde, et al. ∙