Human Microbiome Compendium: large-scale curation and processing of human microbiome datasets

NIH RePORTER · NIH · R01 · $368,475

Abstract

Mounting evidence shows the microbial communities living in (and on) the human body play a key role in the etiology of disease. A major obstacle in the field is the dearth of reliable methods for extracting meaningful signals from small, noisy, intercorrelated, and highly variable microbiome datasets. Enhancing the ability of researchers to generate robust characterizations of the complex relationship between microbiota and their hosts will support novel, more reliable diagnosis of disease and bring the field one step closer to finding the causal links underlying microbiome-based therapeutics. Until now, however, researchers have not had the huge volume of data required to draw these conclusions. Although microbiome data from hundreds of thousands of samples are available in the NCBI Sequence Read Archive (SRA), these datasets have not been leveraged at a large scale. To bridge this gap, we will build an automated pipeline to process and aggregate more than 750,000 samples of amplicon and shotgun metagenomics sequencing data from all publicly available human microbiome samples. We will build a platform, which we call "The Human Microbiome Compendium," for compiling collections of relevant samples that researchers can use to find ecological dynamics that have until now been hidden in the noise. The compendium will allow users to see relative abundances of microbial taxa in every sample, and each sample will also be linked to NCBI metadata and to annotations generated by a new tool that imputes a uniform set of descriptors for sample type, body site, and host traits. We will also use the compendium to train machine learning models for dimensionality reduction, which will improve the power of independent microbiome studies by incorporating insights from the compendium's collection of hundreds of thousands of samples.
These data and tools will be distributed across multiple channels, including a web application where users will be able to upload data to be processed in real time by the dimensionality reduction tools. The proposed studies will generate the first comprehensive aggregation of the microbiome datasets available via the SRA, which will be used to provide characterizations of the human microbiome in unprecedented detail. The resulting compendium will encourage the use of publicly available data and inform new microbiome analysis tools that will help extract important associations in studies where it is impractical to acquire the sample sizes required by conventional techniques. Results from this study will be a starting point for identifying microbiome biomarkers for disease and developing novel therapeutic approaches.

Key facts

NIH application ID: 10538341
Project number: 1R01LM013863-01A1
Recipient: UNIVERSITY OF CHICAGO
Principal Investigator: Ran Blekhman
Activity code: R01
Funding institute: NIH
Fiscal year: 2022
Award amount: $368,475
Award type: 1
Project period: 2022-09-15 → 2026-07-31