Active feature selection discovers minimal gene-sets for classifying cell-types and disease states in single-cell mRNA-seq data
Sequencing costs currently prohibit the application of single-cell mRNA-seq to many biological and clinical tasks of interest. Here, we introduce an active learning framework that constructs compressed gene sets that enable high-accuracy classification of cell-types and physiological states while analyzing a minimal number of gene transcripts. Our active feature selection procedure constructs gene sets through an iterative cell-type classification task in which misclassified cells are examined at each round to identify maximally informative genes through an 'active' support vector machine (SVM) classifier. Our active SVM procedure automatically identifies gene sets that enable >90% cell-type classification accuracy in the Tabula Muris mouse tissue survey, as well as a ∼40-gene set that enables classification of multiple myeloma patient samples with >95% accuracy. Broadly, the discovery of compact but highly informative gene sets might enable drastic reductions in sequencing requirements for applications of single-cell mRNA-seq.
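The iterative loop described above can be illustrated with a small toy sketch. This is not the authors' implementation. The synthetic data, the helper names (`make_toy_cells`, `train_linear_svm`, `active_gene_selection`), the binary labels, the subgradient-descent SVM trainer, and in particular the gene-scoring rule (label-weighted deviation from the gene's mean, summed over misclassified cells) are all assumptions chosen for illustration. The sketch only shows the general pattern: train a linear SVM on the genes chosen so far, look at the misclassified cells, and add the gene that appears most informative for them.

```python
import random

def make_toy_cells(n_cells=400, n_genes=20, informative=(3, 7), shift=1.5, seed=0):
    """Hypothetical toy data: Gaussian noise, plus a label-dependent shift on two genes."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_cells):
        label = 1 if i % 2 == 0 else -1          # two balanced "cell types"
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for j in informative:
            row[j] += shift * label
        X.append(row)
        y.append(label)
    return X, y

def train_linear_svm(X, y, lam=0.01, lr=0.1, iters=300):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:  # margin violated
                for j in range(d):
                    gw[j] += yi * xi[j]
                gb += yi
        w = [wj + lr * (gj / n - lam * wj) for wj, gj in zip(w, gw)]
        b += lr * gb / n
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def active_gene_selection(X, y, max_genes):
    """Greedily grow a gene set, scoring candidate genes on misclassified cells only."""
    n, n_genes = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(n_genes)]
    selected, accuracy = [], 0.0
    hard = list(range(n))  # before any gene is chosen, treat every cell as misclassified
    while hard and len(selected) < max_genes:
        # Assumed scoring rule (not taken from the paper).
        def score(j):
            return abs(sum(y[i] * (X[i][j] - means[j]) for i in hard))
        best = max((j for j in range(n_genes) if j not in selected), key=score)
        selected.append(best)
        Xs = [[row[j] for j in selected] for row in X]   # keep only the selected genes
        w, b = train_linear_svm(Xs, y)
        hard = [i for i in range(n) if predict(w, b, Xs[i]) != y[i]]
        accuracy = 1 - len(hard) / n                     # training accuracy
    return selected, accuracy

X, y = make_toy_cells()
selected, accuracy = active_gene_selection(X, y, max_genes=4)
print("selected genes:", selected, "training accuracy:", round(accuracy, 3))
```

The accuracy reported here is measured on the same cells used for training, which keeps the sketch short; a realistic evaluation would hold out cells, handle more than two classes, and use a tested SVM library rather than this hand-written trainer.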