Classification Trees for Imbalanced and Sparse Data: Surface-to-Volume Regularization
Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped, leading to poor generalization. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation and prove estimation and feature selection consistency for SVR-Tree. Using real data applications, we compare SVR-Tree with multiple algorithms designed to deal with class imbalance.
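To give some intuition for why a surface-to-volume penalty discourages irregular decision sets, the sketch below approximates the ratio for a two-dimensional decision set stored as a boolean grid. It is an illustration only, not the SVR-Tree implementation: the function name `surface_to_volume_ratio`, the grid discretization, and the choice to count the border of the grid as part of the surface are all assumptions made for this example.

```python
import numpy as np


def surface_to_volume_ratio(mask: np.ndarray, cell: float) -> float:
    """Approximate perimeter / area of the True region of a 2D boolean grid.

    Hypothetical helper for illustration: `mask` marks the grid cells that
    belong to the decision set and `cell` is the side length of one cell.
    """
    mask = np.asarray(mask, dtype=bool)
    volume = mask.sum() * cell ** 2
    if volume == 0:
        return float("inf")  # an empty set has no well-defined ratio
    # Assumption: pad with False so that edges on the grid border count as surface.
    padded = np.pad(mask, 1, constant_values=False).astype(np.int8)
    # Each label change between neighbouring cells is one boundary edge of length `cell`.
    edges = sum(np.count_nonzero(np.diff(padded, axis=axis)) for axis in (0, 1))
    return edges * cell / volume


# Two decision sets with the same area (16 cells) but different regularity.
compact = np.zeros((10, 10), dtype=bool)
compact[3:7, 3:7] = True            # one 4 x 4 block

scattered = np.zeros((10, 10), dtype=bool)
scattered[1:9:2, 1:9:2] = True      # 16 isolated single cells

svr_compact = surface_to_volume_ratio(compact, cell=0.1)
svr_scattered = surface_to_volume_ratio(scattered, cell=0.1)
print(svr_compact, svr_scattered)
```

Both sets cover the same area, but the scattered set has four times as many boundary edges, so a penalty on this ratio favors the compact one.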