Linear Discriminant Analysis with High-dimensional Mixed Variables

by Binyan Jiang, et al.

Datasets containing both categorical and continuous variables are frequently encountered in many areas, and with the rapid development of modern measurement technologies, the dimensions of these variables can be very high. Despite recent progress in modelling high-dimensional data for continuous variables, there is a scarcity of methods that can deal with a mixed set of variables. To fill this gap, this paper develops a novel approach for classifying high-dimensional observations with mixed variables. Our framework builds on a location model, in which the distributions of the continuous variables conditional on the categorical ones are assumed to be Gaussian. We overcome the challenge of having to split data into exponentially many cells, or combinations of the categorical variables, by kernel smoothing, and provide new perspectives on its bandwidth choice to ensure an analogue of Bochner's Lemma, which differs from the usual bias-variance tradeoff. We show that the two sets of parameters in our model can be estimated separately, and we provide penalized likelihood methods for their estimation. Results on the estimation accuracy and the misclassification rates are established, and the competitive performance of the proposed classifier is illustrated by extensive simulation and real data studies.
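
The idea of smoothing across cells instead of splitting the data can be illustrated with a minimal sketch. This is not the paper's estimator: the geometric kernel `lam ** hamming_distance`, the function names, and the toy data are all assumptions chosen for illustration. Each observation contributes to the mean of a target cell with a weight that decays with the number of categorical variables on which its cell differs from the target.

```python
# Illustrative sketch only: a kernel-smoothed cell mean for a location model.
# The kernel lam ** (Hamming distance) is an assumed, simple choice; it is
# not claimed to be the kernel or bandwidth rule used in the paper.

def hamming(a, b):
    """Number of categorical variables on which two cells differ."""
    return sum(x != y for x, y in zip(a, b))

def smoothed_cell_mean(X, cells, target, lam):
    """Weighted mean of the continuous variables for the cell `target`.

    X      : list of continuous observations (each a list of floats)
    cells  : list of categorical configurations, one per observation
    target : the categorical configuration whose mean is estimated
    lam    : bandwidth in [0, 1]; 0 uses only exact matches,
             1 pools all observations equally
    """
    weights = [lam ** hamming(c, target) for c in cells]
    total = sum(weights)
    if total == 0:
        raise ValueError("no observation receives positive weight")
    p = len(X[0])
    return [sum(w * x[j] for w, x in zip(weights, X)) / total for j in range(p)]

# Toy data (hypothetical): two binary categorical variables, two continuous ones.
cells = [(0, 0), (0, 1), (1, 1), (0, 0)]
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [3.0, 2.0]]

exact = smoothed_cell_mean(X, cells, (0, 0), 0.0)    # only the two (0, 0) rows
pooled = smoothed_cell_mean(X, cells, (0, 0), 1.0)   # plain overall mean
partial = smoothed_cell_mean(X, cells, (0, 0), 0.5)  # borrows from nearby cells
```

With `lam = 0` the estimate uses only observations in the target cell, which breaks down when most of the exponentially many cells are empty; with `lam = 1` all cells share one mean. Intermediate values borrow strength from neighbouring cells, which is the role the bandwidth plays in the abstract's description.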



Think Global and Act Local: Bayesian Optimisation over High-Dimensional Categorical and Mixed Search Spaces

High-dimensional black-box optimisation remains an important yet notorio...

Sufficient reductions in regression with mixed predictors

Most data sets comprise measurements on continuous and categorical va...

Sequential Linear Discriminant Analysis in High Dimensions Using Individual Discriminant Functions

High dimensional classification has been highlighted for the last two decade...

A Bayesian Framework for Generation of Fully Synthetic Mixed Datasets

Much of the micro data used for epidemiological studies contain sensitiv...

Detecting Outliers in High-dimensional Data with Mixed Variable Types using Conditional Gaussian Regression Models

Outlier detection has gained increasing interest in recent years, due to...

Jackknife Empirical Likelihood Approach for K-sample Tests

The categorical Gini correlation is an alternative measure of dependence...

Rank-based approach for estimating correlations in mixed ordinal data

High-dimensional mixed data as a combination of both continuous and ordi...
