# Large-scale annotation-free disease correlation analysis of the iHMP

> **NIH R03** · UNIVERSITY OF CALIFORNIA AT DAVIS · 2020 · $304,918

## Abstract

We will work with the iHMP data resource to apply novel tools and data analysis methodologies
to the challenge of disease association between large microbiome data sets, Inflammatory
Bowel Disease (IBD), and the onset of diabetes. We will start with an annotation-free approach using
k-mers to preprocess IBD and diabetes cohorts. We then will apply a novel scaling technology
implemented in the sourmash software to reduce the data set size by a factor of 2000, rendering
it tractable to machine learning approaches. We next will use random forests to determine a
subset of predictive k-mers, and will measure their accuracy on validation data sets not used in
the initial training. Finally, we will annotate the predictive k-mers using all available genome
databases as well as a novel method to infer the metagenomic presence of accessory genomes
of known genomes. Our outcomes will include a catalog of microbial genomes that correlate
with IBD subtype and the onset of diabetes, as well as automated workflows to apply similar
approaches to other data sets.
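
The scaling step in the abstract can be illustrated with a minimal sketch of hash-based k-mer subsampling, in which a k-mer is kept only if its hash falls in the lowest 1/2000 of the hash space. This is not the sourmash implementation or API: the function names, the k-mer size of 31, the BLAKE2b hash, and the random test sequence are all assumptions made for illustration.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space used in this sketch
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def hash_kmer(kmer):
    """Hash a canonical k-mer to a 64-bit integer (BLAKE2b is an assumed stand-in)."""
    digest = hashlib.blake2b(canonical(kmer).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, ksize=31, scaled=2000):
    """Keep only k-mer hashes below MAX_HASH / scaled, i.e. about 1 in `scaled`."""
    threshold = MAX_HASH // scaled
    sketch = set()
    for i in range(len(seq) - ksize + 1):
        h = hash_kmer(seq[i:i + ksize])
        if h < threshold:
            sketch.add(h)
    return sketch


# Hypothetical input: a random 200 kb sequence standing in for real reads.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(200_000))

sketch = scaled_sketch(genome)
distinct = len({canonical(genome[i:i + 31]) for i in range(len(genome) - 30)})
print(f"distinct k-mers: {distinct}, retained hashes: {len(sketch)}")
print(f"approximate reduction factor: {distinct / max(len(sketch), 1):.0f}")
```

Because the same threshold is applied to every sample, the retained hashes are directly comparable across samples, so they can serve as feature columns for a classifier such as a random forest.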

## Key facts

- **NIH application ID:** 10112077
- **Project number:** 1R03OD030596-01
- **Recipient organization:** UNIVERSITY OF CALIFORNIA AT DAVIS
- **Principal Investigator:** C. Titus BROWN
- **Activity code:** R03
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $304,918
- **Award type:** 1
- **Project period:** 2020-09-15 → 2022-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10112077

## Citation

> US National Institutes of Health, RePORTER application 10112077, Large-scale annotation-free disease correlation analysis of the iHMP (1R03OD030596-01). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10112077. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
