# Scalable Computational Methods for Genealogical Inference: from species level to single cells

> **NIH NIH R01** · UNIVERSITY OF CALIFORNIA BERKELEY · 2024 · $592,707

## Abstract

PROJECT SUMMARY
Massive amounts of genomic data are currently being generated, providing unprecedented opportunities for
biomedical researchers to characterize various biological components and processes. In order to utilize these
data to make new biological discoveries and improve human health, accurate models and scalable computational
tools need to be developed to facilitate analysis and interpretation. The central objective of this project is to
address this challenge by developing more realistic probabilistic models, scalable algorithms, and user-friendly
software tools to enable the biomedical research community to better harness large genomic data. Many prob-
lems in genomics rely on computational methods for inferring genealogical information from large sequence data
and interpreting the reconstructed trees. In this application, we propose to make significant strides towards im-
proving this line of research by developing a suite of robust and scalable algorithms for probabilistic models of
molecular evolution and genealogical inference across multiple timescales. We will achieve our goal by carrying
out the following specific aims: 1) A fundamental problem in statistical analysis of molecular evolution is estimat-
ing model parameters, for which maximum likelihood estimation (MLE) is typically employed. Unfortunately, MLE
is a computationally expensive task, in some cases prohibitively so. In Aim 1, we will tackle this problem by
combining a novel MLE framework and modern optimization techniques to develop a broadly applicable
computational method that achieves several orders of magnitude speedup in MLE for general models of
molecular evolution. The ability to estimate model parameters at unprecedented speed will transform the way
that phylogenetic analysis is performed and enable the community to consider more complex, realistic models
than previously possible. We will apply our tools to improve phylogenetic inference for two clinically important
superfamilies of membrane proteins in humans, namely G protein-coupled receptors and Solute carrier trans-
porters. 2) Because of meiotic recombination, the genetic variability within humans cannot be represented by a
single tree. Instead, there are millions of different trees across the genome, where each position in the genome
will tend to have its own tree that differs only minimally from the trees in nearby sites. The collection of all these
trees, and the set of recombination points creating new trees, is represented by the Ancestral Recombination
Graph (ARG), which has a number of applications in human genetics. Despite substantial recent progress on
reconstructing ARGs, however, current methods are either too slow to scale up to large data sets, or they do not
sample ARGs accurately from the correct posterior distribution. In Aim 2, we will develop a new computational
method to improve ARG sampling. We will test the method extensively on simulated data, develop a number of
applications, ...

## Key facts

- **NIH application ID:** 10981028
- **Project number:** 1R01HG013117-01A1
- **Recipient organization:** UNIVERSITY OF CALIFORNIA BERKELEY
- **Principal Investigator:** Ian H Holmes
- **Activity code:** R01 (R01, R21, SBIR, etc.)
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $592,707
- **Award type:** 1
- **Project period:** 2024-09-10 → 2028-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10981028

## Citation

> US National Institutes of Health, RePORTER application 10981028, Scalable Computational Methods for Genealogical Inference: from species level to single cells (1R01HG013117-01A1). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10981028. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*