# Phylogenetic and computational methods for accurate and efficient analyses of large-scale metagenomics datasets

> **NIH K99** · UNIVERSITY OF CALIFORNIA BERKELEY · 2023 · $108,000

## Abstract

The overall goal of this project is to use approaches from statistics and computer science to solve significant challenges in the analysis of metabarcode and metagenomics data. Metagenomics, the study of combined genomes of organisms present in a single community, is an emerging, highly interdisciplinary field that combines genomics, bioinformatics, and systems biology, among other areas. Metagenomics has many applications to public health, especially in the areas of pathogen detection, human microbiome analysis, and biodiversity monitoring. The larger objective of this proposal is to leverage the open source software tronko, a fast approximate likelihood phylogenetic placement method that I developed for taxonomic classification, which is the first phylogenetic placement method that truly enables the use of large-scale reference databases and next generation sequencing data as queries. Tronko will be used to solve fundamental problems in analyses of metabarcode and metagenomic data, in addition to developing an application to analyses of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences that will greatly enhance the utility of environmental monitoring of SARS-CoV-2. The specific aims of this proposal are to (1) solve an important theoretical problem by applying a rigorous species delineation to assignment, (2) apply tronko to solve an important practical problem of estimating the composition of SARS-CoV-2 lineages in wastewater surveillance samples, and (3) develop a rapid custom reference database builder for analyzing metabarcode and metagenomics data.

For Aim 1, different phylogenetic groups have different variability in different parts of the tree; therefore, I plan to use Bayesian methods to estimate effective population sizes locally to establish appropriate cut-off thresholds for species assignments in different parts of the phylogeny. Current methods use arbitrary thresholds for delineation of taxonomic groups, and this method would provide an elegant solution to a long-standing limitation in species classification.

For Aim 2, SARS-CoV-2 monitoring of wastewater is an effective strategy for early detection of outbreaks. I plan to build a pipeline, and subsequently a web portal for researchers, that uses tronko to first detect the virus within a wastewater sample and then uses an expectation-maximization algorithm to estimate the proportions of viral strains. This aim would greatly aid public health researchers in assessing and managing the pandemic, since no established methods are currently available for this type of analysis.

For Aim 3, current custom reference database builders require weeks if not months of consecutive computational time in addition to access to a large amount of data storage. I propose to build a method which can be completed within a day. The method will perform in silico amplification of primers and subsequently use the amplified fragments ...
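The expectation-maximization step mentioned for Aim 2 can be illustrated with a minimal sketch. This is not tronko's code or interface, and the abstract does not specify the model. The sketch assumes a hypothetical input `read_likelihoods`, where entry `[i][j]` is the likelihood of read `i` under lineage `j`, and estimates the mixture proportions of the lineages in a sample. The function name and parameters are illustrative only.

```python
def em_lineage_proportions(read_likelihoods, n_iter=500, tol=1e-10):
    """Estimate lineage mixture proportions by EM (illustrative sketch).

    read_likelihoods[i][j] is an assumed, precomputed likelihood of read i
    given lineage j; how such values would be obtained is outside this sketch.
    """
    n_reads = len(read_likelihoods)
    n_lineages = len(read_likelihoods[0])
    # Start from a uniform guess over lineages.
    props = [1.0 / n_lineages] * n_lineages

    for _ in range(n_iter):
        totals = [0.0] * n_lineages
        for row in read_likelihoods:
            # E-step: posterior probability that this read came from each lineage.
            weighted = [p * lik for p, lik in zip(props, row)]
            norm = sum(weighted)
            if norm == 0.0:
                raise ValueError("a read has zero likelihood under every lineage")
            for j, w in enumerate(weighted):
                totals[j] += w / norm
        # M-step: new proportions are the average posterior memberships.
        new_props = [t / n_reads for t in totals]
        converged = max(abs(a - b) for a, b in zip(new_props, props)) < tol
        props = new_props
        if converged:
            break
    return props


if __name__ == "__main__":
    # Hypothetical likelihoods for four reads and two lineages.
    example = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
    print(em_lineage_proportions(example))
```

Reads that fit several lineages equally well are split across them in proportion to the current estimates, which is why an iterative scheme is used here rather than a simple count of best hits.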

## Key facts

- **NIH application ID:** 10542443
- **Project number:** 5K99GM144747-02
- **Recipient organization:** UNIVERSITY OF CALIFORNIA BERKELEY
- **Principal Investigator:** Lenore Pipes
- **Activity code:** K99
- **Funding institute:** NIH
- **Fiscal year:** 2023
- **Award amount:** $108,000
- **Award type:** 5
- **Project period:** 2022-01-01 → 2024-12-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10542443

## Citation

> US National Institutes of Health, RePORTER application 10542443, Phylogenetic and computational methods for accurate and efficient analyses of large-scale metagenomics datasets (5K99GM144747-02). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10542443. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
