Distributionally Robust Data Join

02/11/2022
by Pranjal Awasthi et al.

Suppose we are given two datasets: a labeled dataset and an unlabeled dataset that also has additional auxiliary features not present in the first. What is the most principled way to use these datasets together to construct a predictor? The answer should depend on whether the datasets are generated by the same or different distributions over their mutual feature sets, and on how similar the test distribution will be to either of those distributions. In many applications, the two datasets will likely follow different distributions, but both may be close to the test distribution. We introduce the problem of building a predictor that minimizes the maximum loss over all probability distributions, over the original features, auxiliary features, and binary labels, whose Wasserstein distance is at most r_1 from the empirical distribution of the labeled dataset and at most r_2 from that of the unlabeled dataset. This can be thought of as a generalization of distributionally robust optimization (DRO) that allows for two data sources, one of which is unlabeled and may contain auxiliary features.
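The two-ball objective above is hard to render compactly, but the single-source special case (one Wasserstein ball around the labeled empirical distribution, ignoring r_2) admits a well-known dual upper bound for Lipschitz losses: the robust loss is at most the empirical loss plus r_1 times the Lipschitz constant, which for logistic loss is the norm of the weight vector. Below is a minimal illustrative sketch of that special case, not the paper's algorithm; all function names and the numeric-gradient optimizer are our own choices for brevity.

```python
import numpy as np

def wdro_logistic_loss(w, X, y, r1):
    """Dual upper bound for single-ball Wasserstein-DRO logistic regression:
    sup over the r1-Wasserstein ball <= empirical loss + r1 * ||w||,
    since logistic loss is ||w||-Lipschitz in the features."""
    margins = y * (X @ w)
    empirical = np.mean(np.log1p(np.exp(-margins)))
    return empirical + r1 * np.linalg.norm(w)

def fit(X, y, r1, lr=0.1, steps=500):
    """Minimize the robust surrogate with a crude central-difference
    gradient descent (adequate for a toy, low-dimensional sketch)."""
    w = np.zeros(X.shape[1])
    eps = 1e-6
    for _ in range(steps):
        g = np.zeros_like(w)
        for j in range(len(w)):
            e = np.zeros_like(w)
            e[j] = eps
            g[j] = (wdro_logistic_loss(w + e, X, y, r1)
                    - wdro_logistic_loss(w - e, X, y, r1)) / (2 * eps)
        w -= lr * g
    return w
```

Larger r_1 shrinks the weight norm, trading empirical fit for robustness to distribution shift; the paper's setting additionally constrains the distribution to lie within r_2 of the unlabeled dataset's empirical distribution.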


Related research

- Domain Generalization by Marginal Transfer Learning (11/21/2017)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance (01/11/2022)
- Beyond without Forgetting: Multi-Task Learning for Classification with Disjoint Datasets (03/15/2020)
- Robust hypothesis testing and distribution estimation in Hellinger distance (11/03/2020)
- Class-prior Estimation for Learning from Positive and Unlabeled Data (11/05/2016)
- Diversify and Disambiguate: Learning From Underspecified Data (02/07/2022)
- Out-of-Distribution Generalization with Maximal Invariant Predictor (08/04/2020)
