# Scalable Coalescent Inference for Large Data Sets

> **NIH R01** · STANFORD UNIVERSITY · 2020 · $304,832

## Abstract

Mathematical and statistical modeling of gene genealogies (trees that reflect ancestral relationships among sampled
molecular sequences) is central to many biological fields, including population genetics, phylodynamics of infectious
disease, paleogenomics, phylogenetics, and cancer genomics. Kingman's n-coalescent is a stochastic process of gene
genealogies whose parameters depend on an evolutionary model. Inference of model parameters then contributes to an
understanding of the phenomena that have given rise to the sequences. Though many sophisticated methods have been
developed to date, major statistical and computational challenges remain because the state space of genealogies grows
superexponentially with the number of samples. We are no longer data-limited; instead, we lack computational and
statistical methods for the analysis of large-scale emerging genomic data sets. The long-term goal of the researchers is to
develop statistically consistent and computationally efficient coalescent methods for exact inference of evolutionary
parameters from next-generation sequencing datasets. The objective of this research is to apply the notion of
lumpability of Kingman's n-coalescent to address the state-space explosion problem of coalescent methods. The basic
idea is to model a coarser resolution of the underlying genealogy and reduce the cardinality of the hidden state space.
These coarser coalescent models include Tajima's coalescent and the pure-death process coalescent. The specific aims
are to (1) prove theorems for coalescent models and provide theoretical and practical tools for addressing
computational challenges when modeling different resolutions or "lumpings" of Kingman's coalescent; (2) develop
scalable methods for inference of evolutionary parameters using different coalescent models; (3) theoretically and
empirically validate the inference methods, applying them in simulations and in molecular sequences from infectious
diseases such as Zika, as well as ancient DNA samples from bison in North America and ancient and modern human
samples; (4) implement the novel methods in open source software, ensuring fast dissemination of the methodology
among researchers. The research is innovative in three ways. First, Tajima's coalescent has not yet been
exploited for inference despite the potential based on the smaller state space. Second, the methods developed here will
allow inference from data sets that have not been exploited before because of computational limitations. Third, we
will not only provide a suite of tools ready for application but we will also provide statistical results supporting our
implementations. Our proposed research on scalable modeling of genealogical trees will be significant in a number of
fields, including the theory of evolutionary trees, statistical inference in population genetics and phylogenetics, and
the analysis of molecular sequences from infectious disease and ancient DNA.
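
To give a rough sense of the state-space reduction the abstract describes, the sketch below compares the number of tree topologies under Kingman's coalescent with the number under Tajima's coalescent. It rests on standard combinatorial results that the abstract does not state: ranked, labeled binary trees on n leaves number n!(n-1)!/2^(n-1), and ranked, unlabeled tree shapes on n leaves are counted by the Euler zigzag number E_(n-1). The function names are hypothetical and this is an illustration only, not the project's software.

```python
from math import factorial


def kingman_topology_count(n: int) -> int:
    """Ranked, labeled binary tree topologies on n leaves: n!(n-1)!/2^(n-1).

    Assumption: this is the topology state space of Kingman's n-coalescent.
    """
    if n < 2:
        raise ValueError("need at least two samples")
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


def euler_zigzag(m: int) -> int:
    """m-th Euler zigzag number E_m, built row by row from the
    Entringer recurrence E(i, k) = E(i, k-1) + E(i-1, i-k)."""
    row = [1]  # row 0 of the Entringer triangle
    for i in range(1, m + 1):
        new_row = [0]
        for k in range(1, i + 1):
            new_row.append(new_row[k - 1] + row[i - k])
        row = new_row
    return row[-1]


def tajima_topology_count(n: int) -> int:
    """Ranked, unlabeled tree shapes on n leaves: E_(n-1).

    Assumption: this is the topology state space of Tajima's coalescent.
    """
    if n < 2:
        raise ValueError("need at least two samples")
    return euler_zigzag(n - 1)


if __name__ == "__main__":
    # Print both counts side by side for a few sample sizes.
    for n in (5, 10, 20):
        print(n, kingman_topology_count(n), tajima_topology_count(n))
```

For n = 10 the labeled count is 2,571,912,000 while the unlabeled count is 7,936. The pure-death process goes further by tracking only the number of lineages, so it carries no topology at all.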

## Key facts

- **NIH application ID:** 9964853
- **Project number:** 5R01GM131404-03
- **Recipient organization:** STANFORD UNIVERSITY
- **Principal Investigator:** Julia Palacios
- **Activity code:** R01 (R01, R21, SBIR, etc.)
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $304,832
- **Award type:** 5
- **Project period:** 2018-09-05 → 2022-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9964853

## Citation

> US National Institutes of Health, RePORTER application 9964853, Scalable Coalescent Inference for Large Data Sets (5R01GM131404-03). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/9964853. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*