Rescaling and other forms of unsupervised preprocessing introduce bias into cross-validation

01/25/2019
by Amit Moscovich, et al.

Cross-validation of predictive models is the de facto standard for model selection and evaluation. When used properly, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo a preliminary data-dependent transformation, such as feature rescaling or dimensionality reduction, prior to cross-validation. It is widely believed that such a preprocessing stage, if done in an unsupervised manner that does not consider the class labels or response values, has no effect on the validity of cross-validation. In this paper, we show that this belief is false. Preliminary preprocessing can introduce either a positive or a negative bias into the estimates of model performance. Thus, it may lead to sub-optimal choices of model parameters and to invalid inference. In light of this, the scientific community should re-examine the use of preliminary preprocessing prior to cross-validation across the various application domains. By default, all data transformations, including unsupervised preprocessing stages, should be learned only from the training samples and then merely applied to the validation and testing samples.
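The recommendation in the abstract can be illustrated with a minimal sketch in plain Python. It is not taken from the paper: the helper names (`fit_scaler`, `apply_scaler`, `predict_1nn`, `cross_validate`), the 1-nearest-neighbour model, and the synthetic data are all assumptions chosen for illustration. The point it shows is only that the rescaling parameters are re-learned inside every fold from the training samples and then applied unchanged to the held-out samples.

```python
import random
import statistics

def fit_scaler(rows):
    """Learn per-feature mean and standard deviation from the given rows only."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, stds

def apply_scaler(rows, scaler):
    """Apply a previously learned scaler; nothing is re-estimated here."""
    means, stds = scaler
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def predict_1nn(X_train, y_train, x):
    """Hypothetical stand-in model: 1-nearest-neighbour regression."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[dists.index(min(dists))]

def cross_validate(X, y, k=5, seed=0):
    """K-fold CV where the scaler is fit on the training folds only."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Learn the transformation from training samples only ...
        scaler = fit_scaler([X[i] for i in train])
        X_tr = apply_scaler([X[i] for i in train], scaler)
        # ... and merely apply it to the validation samples.
        X_va = apply_scaler([X[i] for i in held_out], scaler)
        y_tr = [y[i] for i in train]
        errs = [(predict_1nn(X_tr, y_tr, x) - y[i]) ** 2
                for x, i in zip(X_va, held_out)]
        fold_errors.append(statistics.fmean(errs))
    return statistics.fmean(fold_errors)

# Synthetic data (an assumption): only the first feature carries signal,
# and the second has a much larger scale, so rescaling matters for 1-NN.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 100)] for _ in range(40)]
y = [row[0] + rng.gauss(0, 0.1) for row in X]
cv_mse = cross_validate(X, y)
```

The biased variant discussed in the abstract would instead call `fit_scaler(X)` once on the full data set before splitting, so that the validation samples influence the transformation. The sketch above makes no claim about the size or sign of the resulting bias.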

research
07/01/2023

Bootstrapping the Cross-Validation Estimate

Cross-validation is a widely used technique for evaluating the performan...
research
11/27/2021

Fast and Informative Model Selection using Learning Curve Cross-Validation

Common cross-validation (CV) methods like k-fold cross-validation or Mon...
research
09/07/2023

Efficient estimation and correction of selection-induced bias with order statistics

Model selection aims to identify a sufficiently well performing model th...
research
01/18/2023

Data thinning for convolution-closed distributions

We propose data thinning, a new approach for splitting an observation in...
research
03/29/2015

Cross-validation of matching correlation analysis by resampling matching weights

The strength of association between a pair of data vectors is represente...
research
07/08/2019

Surrogate modeling of indoor down-link human exposure based on sparse polynomial chaos expansion

Human exposure induced by wireless communication systems increasingly dr...
