Unsupervised Learning under Latent Label Shift

by Manley Roberts et al.

What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals p_d(y) can shift across domains but the class conditionals p(𝐱|y) do not. This work instantiates a new principle for identifying classes: elements that shift together group together. For finite input spaces, we establish an isomorphism between LLS and topic modeling: inputs correspond to words, domains to documents, and labels to topics. Addressing continuous data, we prove that when each label's support contains a separable region, analogous to an anchor word, oracle access to p(d|𝐱) suffices to identify p_d(y) and p_d(y|𝐱) up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through the domain discriminator p(d|𝐱); (ii) discretize the data by clustering examples in p(d|𝐱) space; (iii) perform non-negative matrix factorization on the discretized data; (iv) combine the recovered p(y|d) with the discriminator outputs p(d|𝐱) to compute p_d(y|𝐱) for all d. With semi-synthetic experiments, we show that our algorithm can leverage domain information to improve state-of-the-art unsupervised classification methods. We reveal a failure mode of standard unsupervised classification methods when feature-space similarity does not indicate true groupings, and show empirically that our method better handles this case. Our results establish a deep connection between distribution shift and topic modeling, opening promising lines for future work.
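The four-step pipeline in the abstract can be sketched in a few lines of scikit-learn. The sketch below is illustrative, not the paper's implementation: the semi-synthetic Gaussian data, the number of clusters, and the helper `predict_p_y_given_x` are all assumptions, and step (iv) applies Bayes' rule at cluster resolution, i.e. p_d(y|𝐱) ∝ p(c(𝐱)|y) p(y|d) where c(𝐱) is the cluster assigned to 𝐱.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical semi-synthetic setup: 3 latent classes, 4 domains.
# Label marginals p_d(y) shift across domains; p(x|y) is fixed (LLS).
n_classes, n_domains = 3, 4
p_y_given_d_true = rng.dirichlet(np.ones(n_classes), size=n_domains)
means = rng.normal(scale=4.0, size=(n_classes, 2))

X, d_labels = [], []
for d in range(n_domains):
    for _ in range(500):
        y = rng.choice(n_classes, p=p_y_given_d_true[d])
        X.append(rng.normal(means[y], 1.0))
        d_labels.append(d)
X, d_labels = np.array(X), np.array(d_labels)

# (i) train a domain discriminator to estimate p(d|x)
disc = LogisticRegression(max_iter=1000).fit(X, d_labels)
p_d_given_x = disc.predict_proba(X)

# (ii) discretize by clustering examples in p(d|x) space
n_clusters = 10
clusters = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(p_d_given_x)

# (iii) NMF on the cluster-by-domain matrix:
#       p(c|d) ≈ Σ_y p(c|y) p(y|d)
counts = np.zeros((n_clusters, n_domains))
for c, d in zip(clusters, d_labels):
    counts[c, d] += 1
p_c_given_d = counts / counts.sum(axis=0, keepdims=True)
nmf = NMF(n_components=n_classes, init="nndsvda",
          max_iter=2000, random_state=0)
W = nmf.fit_transform(p_c_given_d)   # ~ p(c|y) up to column scaling
H = nmf.components_                  # ~ p(y|d) up to row scaling
p_c_given_y = W / (W.sum(axis=0, keepdims=True) + 1e-12)
p_y_given_d = H / (H.sum(axis=0, keepdims=True) + 1e-12)

# (iv) combine via Bayes' rule at cluster resolution:
#      p_d(y|x) ∝ p(c(x)|y) p(y|d)
def predict_p_y_given_x(c, d):
    unnorm = p_c_given_y[c] * p_y_given_d[:, d]
    return unnorm / (unnorm.sum() + 1e-12)

print(predict_p_y_given_x(clusters[0], d_labels[0]))
```

Note that, as the abstract states, the latent classes are only identified up to permutation: the columns of the recovered p(y|d) need not align with the generating labels.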




