How complex is the microarray dataset? A novel data complexity metric for biological high-dimensional microarray data

08/12/2023
by Zhendong Sha, et al.

Data complexity analysis quantifies the hardness of constructing a predictive model on a given dataset. However, the effectiveness of existing data complexity measures can be challenged by the existence of irrelevant features and feature interactions in biological microarray data. We propose a novel data complexity measure, depth, that leverages an evolutionary-inspired feature selection algorithm to quantify the complexity of microarray data. By examining feature subsets of varying sizes, the approach offers a novel perspective on data complexity analysis. Unlike traditional metrics, depth is robust to irrelevant features and effectively captures complexity stemming from feature interactions. On synthetic microarray data, depth outperforms existing methods both in robustness to irrelevant features and in identifying complexity that arises from feature interactions. Applied to case-control genotype and gene-expression microarray datasets, the results reveal that a single feature of gene-expression data can account for over 90% of the performance of a multi-feature model, confirming the adequacy of the commonly used differentially expressed gene (DEG) feature selection method for gene-expression data. Our study also demonstrates that constructing predictive models is harder for genotype data than for gene-expression data. The results in this paper provide evidence for the use of interpretable machine learning algorithms on microarray data.
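To make the idea of examining feature subsets of varying sizes concrete, the following is a minimal sketch, not the paper's method. It makes several assumptions: greedy forward selection stands in for the evolutionary-inspired algorithm, a nearest-centroid classifier stands in for the predictive model, the data are synthetic (one informative feature plus 20 noise features), and `smallest_sufficient_size` is a hypothetical proxy summary rather than the paper's definition of depth.

```python
import random

def make_data(n_samples, n_noise, seed):
    """Synthetic case-control data (assumption): feature 0 is informative, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        row = [rng.gauss(2.0 * label, 1.0)]
        row += [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(row)
        y.append(label)
    return X, y

def fit_centroids(X, y, features):
    """Per-class mean of the chosen features."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    return centroids

def accuracy(centroids, X, y, features):
    """Fraction of samples assigned to the class with the nearest centroid."""
    correct = 0
    for x, t in zip(X, y):
        dist = {
            label: sum((x[f] - c) ** 2 for f, c in zip(features, centre))
            for label, centre in centroids.items()
        }
        correct += min(dist, key=dist.get) == t
    return correct / len(y)

def forward_selection_curve(X_tr, y_tr, X_te, y_te, max_size):
    """Greedy stand-in for the search: grow the subset one feature at a time,
    choosing on training accuracy and recording held-out accuracy per size."""
    selected, curve = [], []
    n_features = len(X_tr[0])
    for _ in range(max_size):
        best_f, best_acc = None, -1.0
        for f in range(n_features):
            if f in selected:
                continue
            trial = selected + [f]
            acc = accuracy(fit_centroids(X_tr, y_tr, trial), X_tr, y_tr, trial)
            if acc > best_acc:
                best_f, best_acc = f, acc
        selected.append(best_f)
        curve.append(accuracy(fit_centroids(X_tr, y_tr, selected), X_te, y_te, selected))
    return selected, curve

def smallest_sufficient_size(curve, fraction=0.9):
    """Hypothetical proxy: smallest subset size reaching `fraction` of the best accuracy."""
    target = fraction * max(curve)
    for size, acc in enumerate(curve, start=1):
        if acc >= target:
            return size
    return len(curve)

X_train, y_train = make_data(200, 20, seed=0)
X_test, y_test = make_data(200, 20, seed=1)
selected, curve = forward_selection_curve(X_train, y_train, X_test, y_test, max_size=5)
size = smallest_sufficient_size(curve)
print(selected, [round(a, 3) for a in curve], size)
```

A small value of `size` would indicate, in the spirit of the gene-expression finding above, that few features recover most of the multi-feature performance. A purely greedy search like this one can miss features that are only useful in combination, which is one reason a population-based search may be preferable for interaction-heavy data.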
