A Debiased MDI Feature Importance Measure for Random Forests

by Xiao Li, et al.

Tree ensembles such as Random Forests have achieved impressive empirical success across a wide variety of applications. To understand how these models make predictions, practitioners routinely turn to feature importance measures calculated from tree ensembles. It has long been known that Mean Decrease Impurity (MDI), one of the most widely used measures of feature importance, incorrectly assigns high importance to noisy features, leading to systematic bias in feature selection. In this paper, we address the feature selection bias of MDI from both theoretical and methodological perspectives. Based on the original definition of MDI by Breiman et al. for a single tree, we derive a tight non-asymptotic bound on the expected bias of the MDI importance of noisy features, showing that deep trees have higher (expected) feature selection bias than shallow ones. However, it is not clear how to reduce the bias of MDI using its existing analytical expression. We derive a new analytical expression for MDI, and based on this new expression, we propose a debiased MDI feature importance measure using out-of-bag samples, called MDI-oob. On both simulated data and a genomic ChIP dataset, MDI-oob achieves state-of-the-art feature selection performance from Random Forests with both deep and shallow trees.
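The bias the abstract describes is easy to reproduce. The sketch below (illustrative only, not from the paper; the data-generating process and all parameter choices are assumptions) fits a Random Forest of fully grown trees on a dataset with one informative binary feature and four pure-noise continuous features, then reads off the MDI scores. In scikit-learn, `feature_importances_` is exactly the MDI measure; with noisy labels and deep trees, the noise features receive strictly positive importance even though they carry no signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative setup (not from the paper): one informative binary feature
# plus four continuous features that are pure noise.
rng = np.random.default_rng(0)
n = 1000
X_signal = rng.integers(0, 2, size=(n, 1)).astype(float)  # informative
X_noise = rng.normal(size=(n, 4))                         # irrelevant
X = np.hstack([X_signal, X_noise])
# Noisy labels: y depends only on the signal feature, plus label noise.
y = (X_signal[:, 0] + rng.normal(size=n) > 0.5).astype(int)

# Fully grown (deep) trees, the regime where the paper shows the
# expected MDI bias of noisy features is largest.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
mdi = rf.feature_importances_  # scikit-learn's MDI (mean decrease impurity)

# Every noise feature picks up nonzero MDI: impurity decreases from
# splits that merely fit the label noise are credited as importance.
print(mdi)
```

MDI-oob, the paper's correction, instead evaluates the impurity decreases on out-of-bag samples, so splits that only fit in-bag noise no longer earn credit; the snippet above only demonstrates the problem, not the fix.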

