Concentration Based Inference in High Dimensional Generalized Regression Models (I: Statistical Guarantees)
We develop simple, non-asymptotically justified methods for hypothesis testing about the coefficients θ^* ∈ R^p in high-dimensional generalized regression models where p can exceed the sample size. Given a function h: R^p → R^m, we consider H_0: h(θ^*) = 0_m against H_1: h(θ^*) ≠ 0_m, where m can be any integer in [1, p] and h can be nonlinear in θ^*. Our test statistic is based on the sample "quasi-score" vector evaluated at an estimate θ̂_α that satisfies h(θ̂_α) = 0_m, where α is the prespecified Type I error. By exploiting the concentration phenomenon for Lipschitz functions, the key component of our non-asymptotic thresholds that reflects the dimension complexity is a Monte Carlo approximation to the expectation around which the statistic concentrates; this approximation automatically captures the dependence between the coordinates. We provide probabilistic guarantees on the Type I and Type II errors of the quasi-score test. Confidence regions are also constructed for the population quasi-score vector evaluated at θ^*. The first set of results is specific to standard Gaussian linear regression models; the second set allows for reasonably flexible forms of non-Gaussian responses, heteroscedastic noise, and nonlinearity in the regression coefficients, while requiring only correct specification of the conditional means E(Y_i | X_i). The novelty of our methods is that their validity does not rely on good behavior of ‖θ̂_α − θ^*‖_2 (or even n^{-1/2}‖X(θ̂_α − θ^*)‖_2 in the linear regression case), either non-asymptotically or asymptotically.
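The sketch below is a minimal illustration of the general idea described above, not the paper's exact construction: it assumes a Gaussian linear model with known noise level σ, a fully specified null H_0: θ^* = θ_0 (so h(θ) = θ − θ_0 and the constrained estimate is simply θ_0), the sup-norm of the quasi-score as the test statistic, and a threshold formed from a Monte Carlo estimate of the expectation plus a Gaussian-Lipschitz deviation term; all of these specific choices (norm, constraint, constants) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumptions for this sketch): Gaussian linear model
# y = X theta* + eps, eps ~ N(0, sigma^2 I_n), with p > n, and the fully
# specified null H_0: theta* = theta_0, so the constrained estimate is theta_0.
n, p, sigma, alpha = 100, 300, 1.0, 0.05
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)          # data generated under H_0
theta_0 = np.zeros(p)             # hypothesized coefficient vector
y = X @ theta_star + sigma * rng.standard_normal(n)

# Test statistic: sup-norm of the sample quasi-score at the constrained estimate.
score = X.T @ (y - X @ theta_0) / n
stat = np.max(np.abs(score))

# Under H_0 the statistic equals f(eps) = ||X' eps / n||_inf, a Lipschitz
# function of the Gaussian noise with constant L = max_j ||X_j||_2 / n.
L = np.max(np.linalg.norm(X, axis=0)) / n

# Monte Carlo approximation of the expectation the statistic concentrates around.
B = 2000
draws = np.array([np.max(np.abs(X.T @ (sigma * rng.standard_normal(n)) / n))
                  for _ in range(B)])
mc_mean = draws.mean()

# Gaussian concentration for Lipschitz functions supplies the deviation term;
# the constants here are illustrative and ignore Monte Carlo error.
threshold = mc_mean + sigma * L * np.sqrt(2.0 * np.log(1.0 / alpha))

print(f"statistic = {stat:.4f}, threshold = {threshold:.4f}, "
      f"reject H_0: {stat > threshold}")
```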