From differential abundance to mtGWAS: accurate and scalable methodology for metabolomics data with non-ignorable missing observations and latent factors

by Shangshu Zhao, et al.

Metabolomics is the high-throughput study of small-molecule metabolites. Besides offering novel biological insights, these data pose unique statistical challenges, the most glaring of which is the large number of non-ignorable missing metabolite observations. To address this issue, nearly all analysis pipelines first impute missing observations and then perform analyses with methods designed for complete data. While clearly erroneous, these pipelines provide key practical advantages not present in existing statistically rigorous methods, including using both observed and missing data to increase power, fast computation to support phenome- and genome-wide analyses, and streamlined estimates for factor models. To bridge this gap between statistical fidelity and practical utility, we developed MS-NIMBLE, a statistically rigorous and powerful suite of methods that offers all the practical benefits of imputation pipelines for performing phenome-wide differential abundance analyses, metabolite genome-wide association studies (mtGWAS), and factor analysis with non-ignorable missing data. Critically, we tailor MS-NIMBLE to perform differential abundance and mtGWAS in the presence of latent factors, which reduces bias and improves power. In addition to proving its statistical and computational efficiency, we demonstrate its superior performance using three real metabolomic datasets.


