Statistical modeling of cross-sample variation and learning of latent structures in microbiome sequencing data

NIH RePORTER · NIH · R01 · $346,292

Abstract

The bacterial communities (microbiota) residing on the human body have been linked to a variety of acute and chronic diseases and conditions, such as obesity, inflammatory bowel disorders, Type 2 diabetes, depression, and urinary tract infections (UTIs), as well as to the host’s response to treatments and health interventions for these diseases and conditions. As the critical role played by the microbiota has become increasingly recognized, microbiome sequencing data sets are now routinely generated under ever more sophisticated experimental designs and survey strategies. While such data share many features and challenges with modern big data, such as high dimensionality and sparsity, they also possess characteristics peculiar to the microbiota, including (i) the explicit and latent contextual relationships among the bacterial species, such as their evolutionary and functional relationships; and (ii) the substantial heterogeneity across samples and the complex structure of the sample-to-sample variation. Effective analysis of modern microbiome studies calls for new statistical methodology that incorporates these characteristics into the data-generating mechanism. This project’s objective is to develop a suite of statistical models, methods, algorithms, and software that meets this urgent need. An initial aim is to develop a multi-scale probabilistic framework for modeling microbiome compositions that effectively characterizes the high dimensionality, sparsity, and substantial cross-sample variation in microbiome sequencing data, and accommodates features of common experimental designs, such as covariates, batch effects, and multiple time points, while striking a balance among flexibility, analytical parsimony, and computational tractability. An additional focus is to develop latent variable models for microbiome compositional data that identify latent structures such as sample clusters and species subcommunities.
A final aim is to produce user-friendly, open-source software that implements all of the proposed methods for the analysis of microbiome sequencing data. All of the models and methods developed are informed by two ongoing collaborative projects of PI Ma and his team. One concerns the identification of microbial communities associated with UTIs in aging women, and the other the study of longitudinal changes in the microbiome of cancer patients undergoing hematopoietic stem cell transplantation. These studies will serve as testbeds for all development. The models, methods, and software developed will not only result in better prediction of health outcomes in these and other microbiome studies but also help decipher the roles of the microbiome in various diseases and biomedical processes, with the ultimate goal of personalized interventions on the microbiome compositions of patients that lead to improved health.

Key facts

NIH application ID
10468838
Project number
5R01GM135440-03
Recipient
DUKE UNIVERSITY
Principal Investigator
Li Ma
Activity code
R01
Funding institute
NIH
Fiscal year
2022
Award amount
$346,292
Award type
5
Project period
2020-09-15 → 2025-08-31