Connecting population-level AUC and latent scale-invariant R^2 via Semiparametric Gaussian Copula and rank correlations
Area Under the Curve (AUC) is arguably the most popular measure of classification accuracy. We use a semiparametric framework to introduce a latent scale-invariant R^2, a novel measure of variation explained for an observed binary outcome and an observed continuous predictor, and then directly link the latent R^2 to AUC. This enables a mutually consistent, simultaneous use of AUC as a measure of classification accuracy and the latent R^2 as a scale-invariant measure of explained variation. Specifically, we employ the Semiparametric Gaussian Copula (SGC) to model the joint dependence between the observed binary outcome and the observed continuous predictor via the correlation of latent standard normal random variables. Under SGC, we show how both the population-level AUC and the latent scale-invariant R^2, defined as the squared latent correlation, can be estimated using any of four rank statistics calculated on binary-continuous pairs: the Wilcoxon rank-sum statistic and the Kendall's Tau, Spearman's Rho, and Quadrant rank correlations. We then focus on three implications and applications: i) we explicitly show that, under SGC, the population-level AUC and the population-level latent R^2 are related via a monotone function that depends on the population-level prevalence rate; ii) we propose the Quadrant rank correlation as a robust semiparametric version of AUC; iii) we demonstrate how, under complex-survey designs, the Wilcoxon rank-sum statistic and the Spearman and Quadrant rank correlations provide asymptotically consistent estimators of the population-level AUC using only single-participant survey weights. We illustrate these applications using the binary outcome of five-year mortality and continuous predictors including Albumin, Systolic Blood Pressure, and accelerometry-derived measures of total volume of physical activity collected in the 2003-2006 National Health and Nutrition Examination Survey (NHANES) cohorts.
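To make the AUC and latent R^2 link concrete, the following is a minimal sketch, not the authors' code. It assumes one common convention for the SGC: the binary outcome is Y = 1{Z1 > Delta}, the predictor is X = f(Z2) for some increasing f, and (Z1, Z2) is standard bivariate normal with correlation r, so that the latent R^2 is r^2 and the prevalence is p = P(Z1 > Delta). Under these assumptions a short calculation gives AUC = 1/2 + (2 * Phi2(Phi^{-1}(p), 0; r/sqrt(2)) - p) / (2p(1 - p)), where Phi2 is the bivariate normal CDF; the abstract does not state this expression, so treat it as an illustration. All function names below are hypothetical, and the sketch compares the formula with a Wilcoxon rank-sum estimate of AUC on simulated data.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: pdf, cdf, inv_cdf


def bvn_cdf_at_zero(h, rho, steps=4000):
    """P(A < h, B < 0) for a standard bivariate normal (A, B) with correlation rho.

    Midpoint-rule integration of phi(a) * Phi(-rho * a / sqrt(1 - rho^2))
    over a in (-8, h); the mass below -8 is negligible.
    """
    lo = -8.0
    width = (h - lo) / steps
    scale = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(steps):
        a = lo + (i + 0.5) * width
        total += STD_NORMAL.pdf(a) * STD_NORMAL.cdf(-rho * a / scale)
    return total * width


def auc_from_latent(r, prevalence):
    """Population AUC implied by latent correlation r and prevalence p
    (assumed model: Y = 1{Z1 > Delta}, X increasing in Z2)."""
    h = STD_NORMAL.inv_cdf(prevalence)  # equals -Delta
    joint = bvn_cdf_at_zero(h, r / math.sqrt(2.0))
    return 0.5 + (2.0 * joint - prevalence) / (2.0 * prevalence * (1.0 - prevalence))


def empirical_auc(y, x):
    """AUC from the Wilcoxon rank-sum statistic (assumes no ties in x)."""
    order = sorted(range(len(x)), key=x.__getitem__)
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if y[i] == 1)
    n1 = sum(y)
    n0 = len(y) - n1
    return (rank_sum - n1 * (n1 + 1) / 2.0) / (n1 * n0)


def simulate(r, prevalence, n, seed=0):
    """Draw (y, x) pairs from the assumed latent Gaussian copula model."""
    rng = random.Random(seed)
    delta = STD_NORMAL.inv_cdf(1.0 - prevalence)
    y, x = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = r * z1 + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0)
        y.append(1 if z1 > delta else 0)
        x.append(math.exp(z2))  # arbitrary increasing transform; ranks are unchanged
    return y, x


if __name__ == "__main__":
    r, p = 0.5, 0.2
    y, x = simulate(r, p, n=20000, seed=1)
    print("latent R^2:", r * r)
    print("AUC from formula:", auc_from_latent(r, p))
    print("AUC from rank-sum:", empirical_auc(y, x))
```

Because the rank-sum estimate depends on x only through its ranks, replacing `math.exp` with any other increasing transform leaves the empirical AUC unchanged, which is the scale invariance the abstract refers to.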