Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches

by Tim Schopf, et al.

Text classification of unseen classes is a challenging Natural Language Processing task and is mainly attempted using two different types of approaches. Similarity-based approaches attempt to classify instances based on similarities between text document representations and class description representations. Zero-shot text classification approaches aim to generalize knowledge gained from a training task by assigning appropriate labels of unknown classes to text documents. Although existing studies have already investigated individual approaches in these categories, the experiments in the literature do not provide a consistent comparison. This paper addresses this gap by conducting a systematic evaluation of different similarity-based and zero-shot approaches for text classification of unseen classes. Different state-of-the-art approaches are benchmarked on four text classification datasets, including a new dataset from the medical domain. Additionally, novel SimCSE and SBERT-based baselines are proposed, as other baselines used in existing work yield weak classification results and are easily outperformed. Finally, the novel similarity-based Lbl2TransformerVec approach is presented, which outperforms previous state-of-the-art approaches in unsupervised text classification. Our experiments show that similarity-based approaches significantly outperform zero-shot approaches in most cases. Additionally, using SimCSE or SBERT embeddings instead of simpler text representations improves similarity-based classification results even further.
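The similarity-based idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' Lbl2TransformerVec method and does not use SimCSE or SBERT: the `embed` function is a hypothetical bag-of-words stand-in for a sentence encoder, and the class descriptions and example documents are invented for illustration. The sketch only shows the general pattern of assigning a document to the class whose description representation is most similar to the document representation.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words representation.

    Assumption: this stands in for a real sentence encoder such as
    SBERT or SimCSE, which would return a dense vector instead.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(document, class_descriptions):
    """Return the most similar class label and all similarity scores."""
    doc_vec = embed(document)
    scores = {
        label: cosine(doc_vec, embed(description))
        for label, description in class_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical class descriptions; no labeled training documents are used.
class_descriptions = {
    "sports": "sports game team player match score",
    "medicine": "medicine patient disease treatment doctor clinical",
}

label, scores = classify(
    "The patient received a new treatment for the disease.",
    class_descriptions,
)
print(label)
```

Replacing `embed` with a dense sentence encoder would leave the rest of the pipeline unchanged, since the classification step depends only on similarity scores between document and class description representations.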




