Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation

03/03/2021
by Mayee F. Chen, et al.

Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in a few labeled points or in many unlabeled points? We answer this via a framework centered on model misspecification in method-of-moments latent variable estimation. Our core result is a bias-variance decomposition of the generalization error, which shows that the unlabeled-only approach incurs additional bias under misspecification. We then introduce a correction that provably removes this bias in certain cases. We apply our decomposition framework to three scenarios (well-specified, misspecified, and corrected models) to 1) choose between labeled and unlabeled data and 2) learn from their combination. We observe, both theoretically and in synthetic experiments, that for well-specified models labeled points are worth a constant factor more than unlabeled points. With misspecification, however, their relative value is higher due to the additional bias, though it can be reduced with correction. We also apply our approach to study real-world weak supervision techniques for dataset construction.
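To make the labeled-versus-unlabeled comparison concrete, the sketch below contrasts two ways of estimating the accuracies of weak labeling sources for a binary latent label: a method-of-moments route that uses only unlabeled data (the standard triplet identity for conditionally independent sources, E[λᵢλⱼ] = aᵢaⱼ) and a direct supervised estimate from a small labeled set. The data-generating function, the accuracy values, and the sample sizes are illustrative assumptions for this sketch, not the paper's estimator or experiments; they only mirror the trade-off the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sources(n, accs, rng):
    """Hypothetical generator: latent label y ~ Uniform{-1,+1}; weak source i
    agrees with y with probability (1 + accs[i]) / 2, so E[lambda_i * y] = accs[i],
    and sources are conditionally independent given y (the well-specified case)."""
    y = rng.choice([-1, 1], size=n)
    agree = rng.random((n, len(accs))) < (1 + np.asarray(accs)) / 2
    lam = np.where(agree, y[:, None], -y[:, None])
    return lam, y

# Illustrative "true" source accuracies in the E[lambda_i * y] sense.
true_accs = np.array([0.6, 0.4, 0.5])

# Unlabeled route: method of moments via the triplet identity.
# Under conditional independence, E[lambda_i lambda_j] = a_i * a_j, so
# a_i = sqrt(E[l_i l_j] * E[l_i l_k] / E[l_j l_k]) for any triplet (i, j, k).
lam_u, _ = sample_sources(10_000, true_accs, rng)   # many unlabeled points
M = (lam_u.T @ lam_u) / len(lam_u)                  # pairwise moment matrix
a_hat_unlabeled = np.array([
    np.sqrt(M[0, 1] * M[0, 2] / M[1, 2]),
    np.sqrt(M[0, 1] * M[1, 2] / M[0, 2]),
    np.sqrt(M[0, 2] * M[1, 2] / M[0, 1]),
])

# Labeled route: estimate E[lambda_i * y] directly from a small labeled set.
lam_l, y_l = sample_sources(300, true_accs, rng)    # few labeled points
a_hat_labeled = (lam_l * y_l[:, None]).mean(axis=0)

print("true accuracies     :", true_accs)
print("unlabeled (triplet) :", np.round(a_hat_unlabeled, 3))
print("labeled (direct)    :", np.round(a_hat_labeled, 3))
```

In this well-specified toy setting both routes recover the accuracies, and the labeled estimate simply trades sample size for variance. If the conditional-independence assumption behind the triplet identity were violated (e.g., correlated sources), the unlabeled estimate would pick up exactly the kind of misspecification bias the abstract analyzes, while the labeled estimate would stay unbiased at the cost of using far fewer points.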


