# Biology-aware machine learning methods for characterizing microbiome genotype and phenotype

> **NIH R35** · UNIVERSITY OF CALIFORNIA, SAN DIEGO · 2023 · $344,669

## Abstract

PROJECT SUMMARY
The Mirarab laboratory designs leading computational methods for answering biological and biomedical questions, focusing on scalability and accuracy. These methods span several areas (e.g., microbiome profiling, multiple sequence alignment, and phylogenomics), and a common thread among them is evolutionary modeling. The lab has developed scalable and accurate methods for reconstructing evolutionary histories (i.e., phylogenies) and using these histories in downstream biomedical applications. Reconstructing phylogenies is a fundamental goal and a precursor to many biological analyses. Methods developed by this lab (e.g., ASTRAL) are at the forefront of modern genome-wide phylogenetics. Moreover, biomedical research increasingly uses evolutionary histories in diverse areas like microbiome analyses, immunology, epidemiology, and comparative genomics. While the lab has previously focused more on inferring species histories, it has recently started to shift its focus to developing methods for microbiome analyses. The inference and the use of evolutionary histories in analyzing environmental microbiome samples present a unique set of challenges.

In the next five years, the Mirarab lab will focus on designing, testing, and applying improved methods for statistical analyses of microbiome data. These methods will target two questions. (i) Profiling: What organisms constitute a given sample? (ii) Association: How are samples different in their organismal composition, and how do these differences connect to measurable characteristics of their environment? While both questions have been subject to considerable research, many computational challenges remain, providing an opportunity for better methods to make a significant impact. Instead of focusing solely on new algorithms, the lab will also work on building better reference datasets and combining data from multiple sources. Thus, the project aims to harness the unprecedented computational power, large available datasets, and recent advances in machine learning to dramatically improve the state of the art. The project will not use off-the-shelf machine learning methods in a black-box fashion. Instead, it will develop methods that incorporate biological knowledge (e.g., of the evolutionary relationships) into machine learning methods in a principled, biologically motivated fashion.

The lab will pursue several ambitious goals for both profiling and association questions. The project will (i) create methods to infer a continuously-updated reference alignment and tree encompassing all sequenced prokaryotic genomes (half a million currently) to be used for profiling, (ii) build methods for ultra-sensitive sample profiling, (iii) use deep learning to connect data obtained using amplicon sequencing and metagenomics, (iv) build discordance-aware phylogenetic measures of sample differentiation, and (v) develop machine learning method...

## Key facts

- **NIH application ID:** 10696960
- **Project number:** 5R35GM142725-03
- **Recipient organization:** UNIVERSITY OF CALIFORNIA, SAN DIEGO
- **Principal Investigator:** Siavash Mir arabbaygi
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2023
- **Award amount:** $344,669
- **Award type:** 5
- **Project period:** 2021-09-15 → 2026-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10696960

## Citation

> US National Institutes of Health, RePORTER application 10696960, Biology-aware machine learning methods for characterizing microbiome genotype and phenotype (5R35GM142725-03). Retrieved via AI Analytics 2026-05-21 from https://api.ai-analytics.org/grant/nih/10696960. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*