Filter Methods for Feature Selection in Supervised Machine Learning Applications – Review and Benchmark

11/23/2021
by Konstantin Hopf, et al.

The amount of data available for machine learning (ML) applications is constantly growing. With ongoing digitization, not only the number of observations but especially the number of measured variables (features) increases. Selecting the most appropriate features for predictive modeling is an important lever for the success of ML applications in business and research. Feature selection methods (FSM) that are independent of a particular ML algorithm, so-called filter methods, have been suggested in large numbers, but little guidance exists for researchers and quantitative modelers on choosing appropriate approaches for typical ML problems. This review synthesizes the substantial literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment. For concrete guidance, we consider four typical dataset scenarios that are challenging for ML models (noisy data, redundant data, imbalanced data, and cases with more features than observations). Drawing on the experience of earlier benchmarks, which considered far fewer FSMs, we compare the methods according to four criteria (predictive performance, number of relevant features selected, stability of the feature sets, and runtime). We found that methods relying on the random forest approach, the double input symmetrical relevance (DISR) filter, and the joint impurity (JIM) filter are well-performing candidates for the given dataset scenarios.

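The benchmark described above runs in the R environment, but to illustrate what a filter method does in practice, the following Python sketch scores each feature against the target independently of the downstream learner and keeps only the top-ranked subset. The scikit-learn functions, the synthetic data generator, and the choice of retaining 20 features are assumptions made for this example, not the paper's setup.

```python
# Illustrative filter-method sketch (not the paper's benchmark code):
# score features against the label without consulting any ML model,
# then train a classifier on the selected subset only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with many irrelevant features, mimicking the "noisy"
# dataset scenario mentioned in the abstract (assumed sizes).
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, n_redundant=10,
                           random_state=0)

# Filter step: rank features by mutual information with the label, keep 20.
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)

# Any learner can consume the selected subset; a random forest is used here.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy, all features:     ",
      cross_val_score(clf, X, y, cv=5).mean().round(3))
print("CV accuracy, filtered features:",
      cross_val_score(clf, X_selected, y, cv=5).mean().round(3))
```

Because the scoring step never consults the downstream classifier, the same selected subset could be passed to any learner, which is what makes filter methods algorithm-independent.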

Related research

02/20/2020 · Pulsars Detection by Machine Learning with Very Few Features
It is an active topic to investigate the schemes based on machine learni...

02/24/2020 · FSinR: an exhaustive package for feature selection
Feature Selection (FS) is a key task in Machine Learning. It consists in...

05/05/2020 · Feature Selection Methods for Uplift Modeling
Uplift modeling is a predictive modeling technique that estimates the us...

01/08/2023 · Analogical Relevance Index
Focusing on the most significant features of a dataset is useful both in...

05/31/2023 · Distance Rank Score: Unsupervised filter method for feature selection on imbalanced dataset
This paper presents a new filter method for unsupervised feature selecti...

06/15/2021 · Employing an Adjusted Stability Measure for Multi-Criteria Model Fitting on Data Sets with Similar Features
Fitting models with high predictive accuracy that include all relevant b...

04/11/2023 · Selecting Robust Features for Machine Learning Applications using Multidata Causal Discovery
Robust feature selection is vital for creating reliable and interpretabl...
