Credibility of high R^2 in regression problems: a permutation approach
The question of whether Y can be predicted from X arises often. A well-fitted model may perform well on observed data, but the risk of overfitting always exists, leading to poor generalization on unseen data. This paper proposes a rigorous permutation test to assess the credibility of high R^2 values in regression models. The test can also be applied to any measure of goodness of fit and requires no sample splitting: it generates new pairings (X_i, Y_j) and provides an overall interpretation of the model's accuracy. The paper introduces a new formulation of the null hypothesis and a justification for the test, which distinguish it from previous literature. The theoretical findings are applied to both simulated data and sensor data of tennis serves in an experimental context. The simulation study underscores how the available information affects the test: the less informative the predictors, the lower the probability of rejecting the null hypothesis, and detecting weaker dependence between variables requires a sufficient sample size.
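To make the idea of re-pairing (X_i, Y_j) concrete, here is a minimal sketch of a generic permutation test for R^2. It is not the paper's exact procedure or null hypothesis. It assumes an ordinary least-squares model with intercept, random shuffling of Y against fixed X, and the usual (1 + count) / (B + 1) p-value. The function names `r_squared` and `permutation_pvalue`, the number of permutations, and the simulated data are all illustrative choices.

```python
import numpy as np


def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def permutation_pvalue(X, y, n_perm=999, seed=0):
    """Compare the observed R^2 with R^2 values obtained after shuffling y.

    Shuffling y breaks any link between X and y while keeping both marginal
    distributions, so the shuffled fits show how high R^2 can get by
    overfitting alone for this sample size and number of predictors.
    """
    rng = np.random.default_rng(seed)
    observed = r_squared(X, y)
    permuted = np.array(
        [r_squared(X, rng.permutation(y)) for _ in range(n_perm)]
    )
    # Add-one correction: the observed pairing counts as one permutation.
    p_value = (1 + np.sum(permuted >= observed)) / (n_perm + 1)
    return observed, p_value


# Illustrative data (assumed, not from the paper): 30 samples, 10 predictors.
data_rng = np.random.default_rng(1)
n, d = 30, 10
X = data_rng.normal(size=(n, d))
y_noise = data_rng.normal(size=n)                    # independent of X
y_signal = 2.0 * X[:, 0] + data_rng.normal(size=n)   # depends on X

r2_noise, p_noise = permutation_pvalue(X, y_noise)
r2_signal, p_signal = permutation_pvalue(X, y_signal)
print(f"noise:  R^2 = {r2_noise:.2f}, p = {p_noise:.3f}")
print(f"signal: R^2 = {r2_signal:.2f}, p = {p_signal:.3f}")
```

With many predictors relative to the sample size, the in-sample R^2 of the pure-noise response can already look substantial, which is why the comparison against shuffled pairings is more informative than the raw value. Any other goodness-of-fit measure could be substituted for `r_squared` in this sketch.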