Large-scale annotation-free disease correlation analysis of the iHMP

NIH RePORTER · NIH · R03 · $304,918

Abstract

Project Summary: We will work with the iHMP data resource, applying novel tools and data analysis methodologies to the challenge of associating large microbiome data sets with inflammatory bowel disease (IBD) and the onset of diabetes. We will start with an annotation-free approach that uses k-mers to preprocess the IBD and diabetes cohorts. We will then apply a novel scaling technology implemented in the sourmash software to reduce the data set size by a factor of 2000, rendering it tractable to machine learning approaches. Next, we will use random forests to determine a subset of predictive k-mers, and will measure their accuracy on validation data sets not used in the initial training. Finally, we will annotate the predictive k-mers using all available genome databases, as well as a novel method to infer the metagenomic presence of accessory genomes of known genomes. Our outcomes will include a catalog of microbial genomes that correlate with IBD subtype and the onset of diabetes, as well as automated workflows to apply similar approaches to other data sets.
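The k-mer downsampling step can be illustrated with a minimal sketch. It is not the sourmash implementation and does not call the sourmash API: the function names (`canonical_kmers`, `hash_kmer`, `scaled_sketch`), the BLAKE2b hash, and the default k-mer size are assumptions chosen for illustration. The idea shown is to hash every k-mer into a 64-bit space and keep only the hashes that fall below 1/2000 of that space, so that roughly one in 2000 distinct k-mers is retained, and the same k-mers are retained in every sample.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space (illustrative choice)


def canonical_kmers(seq, k):
    """Yield each k-mer as the lexicographic minimum of itself and its
    reverse complement, so both DNA strands map to the same string."""
    comp = str.maketrans("ACGT", "TGCA")
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing ambiguous bases such as N
        rc = kmer.translate(comp)[::-1]
        yield min(kmer, rc)


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer. BLAKE2b is a stand-in hash here,
    picked only because it ships with the standard library."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k=31, scaled=2000):
    """Return the set of k-mer hashes below MAX_HASH / scaled.

    With a well-mixed hash, about 1/scaled of the distinct k-mers are kept.
    """
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, canonical_kmers(seq, k)) if h < threshold}
```

Because the cutoff depends only on the hash value, sketches built from different samples keep the same k-mers and can be compared directly. A presence/absence table of retained hashes per sample is one plausible input to a random forest classifier, although the summary does not specify the feature encoding.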

Key facts

NIH application ID
10112077
Project number
1R03OD030596-01
Recipient
UNIVERSITY OF CALIFORNIA AT DAVIS
Principal Investigator
C. Titus BROWN
Activity code
R03
Funding institute
NIH
Fiscal year
2020
Award amount
$304,918
Award type
1
Project period
2020-09-15 → 2022-04-30