Optimal Semi-supervised Estimation and Inference for High-dimensional Linear Regression

by Siyi Deng, et al.

In many settings, such as electronic health records, the outcome is much more difficult to collect than the covariates. In this paper, we consider the linear regression problem with such a data structure in the high-dimensional setting. Our goal is to investigate when and how the unlabeled data can be exploited to improve the estimation and inference of the regression parameters in linear models, especially given that such linear models may be misspecified in data analysis. In particular, we address the following two questions. (1) Can we use the labeled data together with the unlabeled data to construct a semi-supervised estimator whose convergence rate is faster than that of the supervised estimators? (2) Can we construct confidence intervals or hypothesis tests that are guaranteed to be more efficient or more powerful than those based on the supervised estimators? To address the first question, we establish the minimax lower bound for parameter estimation in the semi-supervised setting. We show that the upper bound achieved by supervised estimators, which use only the labeled data, cannot attain this lower bound. We close this gap by proposing a new semi-supervised estimator that attains the lower bound. To address the second question, we build on the proposed semi-supervised estimator and introduce two additional estimators for semi-supervised inference: the efficient estimator and the safe estimator. The former is fully efficient if the unknown conditional mean function is estimated consistently, but may not be more efficient than the supervised approach otherwise. The latter usually does not aim to provide fully efficient inference, but is guaranteed to be no worse than the supervised approach, regardless of whether the linear model is correctly specified or the conditional mean function is consistently estimated.
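
The data structure described above can be written down as follows. This is only a sketch under assumed notation: the symbols n, N, p, X_i, Y_i, Sigma, and beta* do not appear in the abstract and are introduced here for illustration, and taking the population least-squares projection as the target under misspecification is a standard choice, not something the abstract states.

```latex
% Assumed notation: n labeled pairs, N additional unlabeled covariate vectors, dimension p.
\[
  \underbrace{(X_1, Y_1), \dots, (X_n, Y_n)}_{\text{labeled}}, \qquad
  \underbrace{X_{n+1}, \dots, X_{n+N}}_{\text{unlabeled}}, \qquad
  X_i \in \mathbb{R}^p .
\]
% Assumed target: the best linear approximation to Y, which remains well defined
% even when E[Y | X] is not linear in X (a misspecified linear model).
% Sigma is assumed to be invertible.
\[
  \beta^{*}
  \;=\; \arg\min_{\beta \in \mathbb{R}^p} \mathbb{E}\bigl[(Y - X^{\top}\beta)^2\bigr]
  \;=\; \Sigma^{-1}\,\mathbb{E}[XY],
  \qquad \Sigma = \mathbb{E}\bigl[XX^{\top}\bigr] .
\]
```

In this formulation, Sigma depends only on the distribution of the covariates, so all n + N covariate vectors carry information about it, whereas E[XY] can be estimated only from the n labeled pairs. This is one intuition for where unlabeled data may help; the abstract itself does not describe how the proposed estimators are constructed.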


Efficient and Adaptive Linear Regression in Semi-Supervised Settings

We consider the linear regression problem under semi-supervised settings...

Semi-Supervised Quantile Estimation: Robust and Efficient Inference in High Dimensional Settings

We consider quantile estimation in a semi-supervised setting, characteri...

Distributed Semi-Supervised Sparse Statistical Inference

This paper is devoted to studying the semi-supervised sparse statistical...

Semi-Supervised Off Policy Reinforcement Learning

Reinforcement learning (RL) has shown great success in estimating sequen...

Collaboratively Learning Linear Models with Structured Missing Data

We study the problem of collaboratively learning least squares estimates...

Semi-Supervised Empirical Risk Minimization: When can unlabeled data improve prediction

We present a general methodology for using unlabeled data to design semi...

CAD: Debiasing the Lasso with inaccurate covariate model

We consider the problem of estimating a low-dimensional parameter in hig...
