Multi-label Dataless Text Classification with Topic Modeling

11/05/2017
by Daochen Zha, et al.
Manually labeling documents is tedious and expensive, but it is essential for training a traditional text classifier. In recent years, a few dataless text classification techniques have been proposed to address this problem. However, existing works mainly center on single-label classification problems, in which each document is restricted to a single category. In this paper, we propose a novel Seed-guided Multi-label Topic Model, named SMTM. With a few seed words relevant to each category, SMTM conducts multi-label classification for a collection of documents without any labeled documents. In SMTM, each category is associated with a single category-topic which covers the meaning of the category. To accommodate multi-labeled documents, we explicitly model the category sparsity in SMTM using a spike-and-slab prior and a weak smoothing prior. As a result, SMTM automatically selects the relevant categories for each document without any threshold tuning. To incorporate the supervision of the seed words, we propose a seed-guided biased GPU (i.e., generalized Pólya urn) sampling procedure to guide the topic inference of SMTM. Experiments on two public datasets show that SMTM achieves better classification accuracy than state-of-the-art alternatives and even outperforms supervised solutions in some scenarios.
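The abstract does not spell out the seed-guided biased GPU procedure, so the following is only a minimal sketch of the general Pólya urn idea it builds on: when a word is assigned to a topic, related words are also promoted in that topic by a fractional amount. The function name `gpu_update`, the `related` map, and the `promotion` weight are hypothetical illustrations, not SMTM's actual sampler or parameters.

```python
from collections import defaultdict


def gpu_update(topic_word_counts, topic, word, related, promotion=0.3, sign=1):
    """Add (sign=+1) or remove (sign=-1) one assignment of `word` to `topic`.

    Under a generalized Polya urn scheme, drawing a word also adds a
    fractional count for each related word to the same topic.
    `related` and `promotion` are assumptions made for illustration.
    """
    topic_word_counts[topic][word] += sign * 1.0
    for other in related.get(word, ()):
        topic_word_counts[topic][other] += sign * promotion


# Hypothetical relatedness map: words promoted alongside a drawn word.
related = {"goal": ["soccer", "match"]}

# topic -> word -> (possibly fractional) count
counts = defaultdict(lambda: defaultdict(float))

# Assign one occurrence of "goal" to the "sports" category-topic.
gpu_update(counts, "sports", "goal", related)
print(dict(counts["sports"]))
```

In a collapsed Gibbs sampler, the same function with `sign=-1` would be called to withdraw a word's current assignment before resampling its topic, so that the promoted counts stay consistent.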

