Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data

10/24/2022
by Nabeel Seedat, et al.

High average model performance can hide the fact that models systematically underperform on subgroups of the data. We consider the tabular setting, which surfaces the unique issue of outcome heterogeneity. This is prevalent in areas such as healthcare, where patients with similar features can have different outcomes, making reliable predictions challenging. To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We do this by analyzing the behavior of individual examples during training, based on their predictive confidence and, importantly, the aleatoric (data) uncertainty. Capturing the aleatoric uncertainty permits a principled characterization and subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard). We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets. We show that, compared to baselines, Data-IQ's characterization of examples is the most robust to variation across similarly performant (yet different) models. Since Data-IQ can be used with any ML model (including neural networks, gradient boosting, etc.), this property ensures consistency of data characterization while allowing flexible model selection. Taking this a step further, we demonstrate that the subgroups enable us to construct new approaches to both feature acquisition and dataset selection. Furthermore, we highlight how the subgroups can inform reliable model usage, noting the significant impact of the Ambiguous subgroup on model generalization.
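The stratification described in the abstract can be illustrated with a minimal sketch. The abstract does not give formulas or thresholds, so everything specific here is an assumption: confidence is taken as the mean probability assigned to the true label across training checkpoints, aleatoric uncertainty as the mean of p(1 - p) across those checkpoints, and the function name `characterize_examples` and the cut-offs `conf_upper`, `conf_lower`, and `aleatoric_max` are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def characterize_examples(true_label_probs, conf_upper=0.75, conf_lower=0.25,
                          aleatoric_max=0.2):
    """Assign each example to Easy, Ambiguous, or Hard.

    true_label_probs: array of shape (n_checkpoints, n_examples) holding the
    probability the model assigned to each example's true label at each
    training checkpoint. All thresholds are illustrative assumptions.
    """
    p = np.asarray(true_label_probs, dtype=float)

    # Assumed definition: average confidence in the true label over training.
    confidence = p.mean(axis=0)

    # Assumed definition: average Bernoulli variance p(1 - p) over training.
    aleatoric = (p * (1.0 - p)).mean(axis=0)

    # Default to Ambiguous, then carve out the two low-uncertainty groups.
    groups = np.full(p.shape[1], "Ambiguous", dtype=object)
    low_uncertainty = aleatoric < aleatoric_max
    groups[(confidence >= conf_upper) & low_uncertainty] = "Easy"
    groups[(confidence <= conf_lower) & low_uncertainty] = "Hard"
    return confidence, aleatoric, groups


# Toy input: 3 checkpoints (rows) x 3 examples (columns).
probs = np.array([
    [0.90, 0.10, 0.50],
    [0.95, 0.05, 0.40],
    [0.97, 0.03, 0.60],
])
confidence, aleatoric, groups = characterize_examples(probs)
print(list(groups))
```

Because the inputs are only per-checkpoint probabilities, a sketch like this applies to any model that can be checkpointed during training, which matches the abstract's claim that the approach works with neural networks and gradient boosting alike.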


