Narrowing the Gap between Supervised and Unsupervised Sentence Representation Learning with Large Language Model

by   Mingxin Li, et al.

Sentence Representation Learning (SRL) is a fundamental task in Natural Language Processing (NLP), with Contrastive learning of Sentence Embeddings (CSE) as the mainstream technique due to its superior performance. An intriguing phenomenon in CSE is the significant performance gap between supervised and unsupervised methods, even when their sentence encoder and loss function are the same. Previous works attribute this performance gap to differences in two representation properties (alignment and uniformity). However, alignment and uniformity only measure the results, so they cannot answer "What happens during the training process that leads to the performance gap?" or "How can the performance gap be narrowed?". In this paper, we conduct empirical experiments to answer these "What" and "How" questions. We first answer the "What" question by thoroughly comparing the behavior of supervised and unsupervised CSE during their respective training processes. From the comparison, we observe a significant difference in fitting difficulty. We therefore introduce a metric, called the Fitting Difficulty Increment (FDI), to measure the fitting difficulty gap between the evaluation dataset and the held-out training dataset, and use it to answer the "What" question. Based on the insights gained, we tackle the "How" question by increasing the fitting difficulty of the training dataset. We achieve this by leveraging the In-Context Learning (ICL) capability of a Large Language Model (LLM) to generate data that simulates complex patterns. By utilizing the hierarchical patterns in the LLM-generated data, we effectively narrow the gap between supervised and unsupervised CSE.
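The contrastive objective underlying CSE is typically an InfoNCE-style loss with in-batch negatives: each sentence embedding is pulled toward its positive pair and pushed away from the other embeddings in the batch. A minimal NumPy sketch of that standard loss follows; the function name and the 0.05 temperature are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE contrastive loss over a batch of sentence embeddings.

    anchors, positives: (n, d) arrays; row i of `positives` is the
    positive pair for row i of `anchors`, and every other row in the
    batch serves as an in-batch negative.
    """
    # Cosine similarity between every anchor and every candidate.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = (a @ p.T) / temperature                    # (n, n) logits

    # Cross-entropy with the diagonal (the true pair) as the target.
    logits = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(anchors))
    return float(-log_prob[idx, idx].mean())
```

With perfectly aligned pairs and orthogonal negatives the loss approaches zero; mismatched pairs drive it up, which is the signal both supervised and unsupervised CSE train on.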


