Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding

by Oren Barkan, et al.

Attention-based models have become the new state of the art in natural language understanding tasks such as question answering and sentence similarity. Recent models, such as BERT and XLNet, score a pair of sentences (A and B) using multiple cross-attention operations, a process in which each word in sentence A attends to all words in sentence B and vice versa. As a result, computing the similarity between a query sentence and a set of candidate sentences requires propagating every query-candidate sentence pair through a stack of cross-attention layers. This exhaustive process becomes computationally prohibitive when the number of candidate sentences is large. In contrast, sentence embedding techniques learn a sentence-to-vector mapping and compute the similarity between the sentence vectors via simple elementary operations such as dot product or cosine similarity. In this paper, we introduce a sentence embedding method that is based on knowledge distillation from cross-attentive models, focusing on sentence-pair tasks. The outline of the proposed method is as follows: given a cross-attentive teacher model (e.g. a fine-tuned BERT), we train a sentence-embedding-based student model to reconstruct the sentence-pair scores obtained by the teacher model. We empirically demonstrate the effectiveness of our distillation method on five GLUE sentence-pair tasks. Our method significantly outperforms several ELMo variants and other sentence embedding methods, while accelerating the computation of query-candidate sentence-pair similarities by several orders of magnitude, with an average relative degradation of 4.6
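The contrast between the two scoring styles can be illustrated with a minimal sketch. It is not the paper's implementation: `toy_embed` is a hypothetical stand-in for a trained student encoder (it just sums hash-seeded random word vectors), and the mean-squared-error form of `distillation_loss` is an assumption about how "reconstructing the teacher's scores" might be measured. The point it shows is that candidate vectors are computed once, so each new query costs a single matrix-vector product rather than one cross-attention pass per candidate.

```python
import zlib
import numpy as np

DIM = 64  # embedding size chosen arbitrarily for this sketch


def toy_embed(sentence, dim=DIM):
    """Hypothetical stand-in for a student sentence encoder.

    Sums one deterministic pseudo-random vector per token and
    L2-normalizes the result, so a dot product equals cosine similarity.
    """
    vec = np.zeros(dim)
    for token in sentence.lower().split():
        # Seed from a stable hash so the same token always maps to the same vector.
        rng = np.random.default_rng(zlib.crc32(token.encode("utf-8")))
        vec += rng.standard_normal(dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def score_candidates(query, candidate_matrix):
    """Score every candidate with one matrix-vector product."""
    return candidate_matrix @ toy_embed(query)


def distillation_loss(student_scores, teacher_scores):
    """Assumed objective: mean squared error between student and teacher scores."""
    s = np.asarray(student_scores, dtype=float)
    t = np.asarray(teacher_scores, dtype=float)
    return float(np.mean((s - t) ** 2))


candidates = [
    "the weather is nice today",
    "how do i reset my password",
    "reset your password from the login page",
]
# Candidate embeddings are computed once, independently of any query.
candidate_matrix = np.stack([toy_embed(c) for c in candidates])

scores = score_candidates("how do i reset my password", candidate_matrix)
best = int(np.argmax(scores))
```

In a real setup the teacher scores passed to `distillation_loss` would come from a fine-tuned cross-attentive model evaluated on each sentence pair, and the student encoder's parameters would be updated to reduce that loss.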


Using BERT Encoding and Sentence-Level Language Model for Sentence Ordering

Discovering the logical sequence of events is one of the cornerstones in...

Once is Enough: A Light-Weight Cross-Attention for Fast Sentence Pair Modeling

Transformer-based models have achieved great success on sentence pair mo...

ASBERT: Siamese and Triplet network embedding for open question answering

Answer selection (AS) is an essential subtask in the field of natural la...

"The Boating Store Had Its Best Sail Ever": Pronunciation-attentive Contextualized Pun Recognition

Humor plays an important role in human languages and it is essential to ...

Multilevel Sentence Embeddings for Personality Prediction

Representing text into a multidimensional space can be done with sentenc...

DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling

Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / ...

A New Sentence Ordering Method Using BERT Pretrained Model

Building systems with capability of natural language understanding (NLU)...
