A Minimum Description Length Approach to Multitask Feature Selection

05/30/2009
by Brian Tomasik, et al.

Many regression problems involve not one but several response variables (y's). Often the responses are suspected to share a common underlying structure, in which case it may be advantageous to share information across them; this is known as multitask learning. As a special case, we can use multiple responses to better identify shared predictive features -- a project we might call multitask feature selection. This thesis is organized as follows. Section 1 introduces feature selection for regression, focusing on ℓ₀ regularization methods and their interpretation within a Minimum Description Length (MDL) framework. Section 2 proposes a novel extension of MDL feature selection to the multitask setting. The approach, called the "Multiple Inclusion Criterion" (MIC), is designed to borrow information across regression tasks by more easily selecting features that are associated with multiple responses. We show in experiments on synthetic and real biological data sets that MIC can reduce prediction error in settings where features are at least partially shared across responses. Section 3 surveys hypothesis testing by regression with a single response, focusing on the parallel between the standard Bonferroni correction and an MDL approach. Mirroring the ideas in Section 2, Section 4 proposes a novel MIC approach to hypothesis testing with multiple responses and shows that on synthetic data with significant sharing of features across responses, MIC sometimes outperforms standard FDR-controlling methods in terms of finding true positives for a given level of false positives. Section 5 concludes.
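For intuition, here is a small, hedged sketch in Python of how MDL-style forward selection and an MIC-like shared-cost criterion might be set up. The function names (residual_bits, mdl_forward_select, mic_gain), the particular bit costs (log2(p) bits to name a feature, a flat per-coefficient cost, and (n/2)·log2(RSS/n) bits for the residuals), and the greedy search are illustrative assumptions, not the exact coding scheme used in the thesis.

```python
# Minimal sketch (illustrative assumptions, not the thesis's exact scheme):
# an MDL-style forward selection for one response, plus a helper showing how
# an MIC-like criterion could share a feature's naming cost across responses.

import numpy as np


def residual_bits(y, X_sel):
    """Approximate bits to encode the residuals of an OLS fit of y on X_sel."""
    n = len(y)
    if X_sel.shape[1] == 0:
        rss = np.sum((y - y.mean()) ** 2)  # intercept-only baseline
    else:
        beta, *_ = np.linalg.lstsq(X_sel, y, rcond=None)
        rss = np.sum((y - X_sel @ beta) ** 2)
    return 0.5 * n * np.log2(max(rss / n, 1e-12))


def mdl_forward_select(X, y, coef_bits=2.0):
    """Single-response MDL-style selection: a feature enters only if the drop in
    residual bits exceeds log2(p) (to name the feature) plus coef_bits (its value)."""
    n, p = X.shape
    selected = []

    def total_bits(feats):
        model_bits = len(feats) * (np.log2(p) + coef_bits)
        return model_bits + residual_bits(y, X[:, feats])

    current = total_bits(selected)
    while True:
        candidates = [j for j in range(p) if j not in selected]
        if not candidates:
            break
        best_cost, best_j = min((total_bits(selected + [j]), j) for j in candidates)
        if best_cost >= current:
            break
        selected.append(best_j)
        current = best_cost
    return selected


def mic_gain(resid_savings, p, coef_bits=2.0):
    """MIC-flavoured gain for adding one feature jointly across k responses.

    resid_savings[r] is the reduction in residual bits for response r if the
    feature is added to its model.  The log2(p) cost of naming the feature is
    paid only once; each response that uses it pays roughly log2(k) + coef_bits
    more.  A feature that helps several responses therefore clears the bar
    more easily than one that helps a single response."""
    resid_savings = np.asarray(resid_savings, dtype=float)
    k = len(resid_savings)
    per_response_cost = (np.log2(k) if k > 1 else 0.0) + coef_bits
    net = resid_savings - per_response_cost
    use = net > 0
    gain = net[use].sum() - np.log2(p)
    return gain, np.flatnonzero(use)
```

As a quick sanity check, one can generate X = np.random.randn(100, 50) with y depending on a few columns plus noise; mdl_forward_select will typically recover those columns. The mic_gain helper only illustrates the bookkeeping of the shared naming cost; a full multi-response search, as in the thesis, would wrap it in a stepwise loop over candidate features.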

