# Scaling up computational genomics with tree sequences

> **NIH R56** · UNIVERSITY OF OREGON · 2021 · $556,834

## Abstract

Increasing sample size is a tremendously important factor in building our understanding of the genetics of human disease. As we discover that more and more diseases have a complex web of genetic causation, we need larger and larger genetic datasets to disentangle them, and to ultimately produce successful therapies. Driven in part by this need, the community is now assembling vast collections of human genome sequences, and millions of samples will soon be commonplace. Nonhuman datasets, with applications in epidemiology, ecology, and evolution, will not be far behind. There is a profound problem, however: our computational methods for storing, processing, simulating, and analyzing genomic data are lagging far behind our ability to collect such data. The algorithms and data structures underlying today's computational methods were designed for thousands of samples, not millions, and we are in danger of being overwhelmed by the impending tsunami of data. Without a fundamental change in how we store and process genomic data, we will either not fully tap the potential of the data we collect, or the computational costs will be astronomical – or both.

Our proposal addresses this critical need by focusing on a new data structure: the succinct tree sequence. This data structure (the “tree sequence”, for brevity) encodes genetic variation data using the population genetics processes that produced the data itself – by representing variation among contemporary samples via mutations on the branches of the underlying genealogical trees. This yields extraordinary levels of data compression, with file sizes hundreds of times smaller than current community standards. Since the tree sequence was introduced in 2016 it has led to performance increases of 2–4 orders of magnitude in the diverse applications of genome simulation, calculation of statistics, and ancestry inference. Such sudden leaps in computational performance are vanishingly rare, and only possible through deep algorithmic advances.

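
The encoding described above can be illustrated with a toy sketch. This is not the project's software or API: the `Edge` and `Mutation` classes, the helper functions, and the three-sample example data are all hypothetical, chosen only to show how genotypes can be recovered from mutations placed on the branches of genealogical trees that share edges along the genome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical record: `child` inherits from `parent` on [left, right)."""
    left: float
    right: float
    parent: int
    child: int


@dataclass(frozen=True)
class Mutation:
    """Hypothetical record: a mutation on the branch directly above `node`."""
    position: float
    node: int


def parent_map(edges, position):
    """Return the child -> parent map of the local tree covering `position`."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def carriers(edges, mutation, samples):
    """Samples that descend from the mutated branch, and so carry the mutation."""
    parents = parent_map(edges, mutation.position)
    found = []
    for sample in samples:
        node = sample
        # Walk from the sample towards the root of the local tree.
        while node is not None:
            if node == mutation.node:
                found.append(sample)
                break
            node = parents.get(node)
    return found


def genotypes(edges, mutations, samples):
    """Decode a site-by-sample 0/1 matrix from the edges and mutations."""
    rows = []
    for mutation in sorted(mutations, key=lambda m: m.position):
        carrying = set(carriers(edges, mutation, samples))
        rows.append([1 if s in carrying else 0 for s in samples])
    return rows


# Toy data (assumed): samples 0-2, ancestors 3-4, genome interval [0, 10).
# Two local trees, left of and right of position 5, are stored in six edges
# because the edges 3 -> 1 and 4 -> 3 are shared by both trees.
SAMPLES = [0, 1, 2]
EDGES = [
    Edge(0, 10, 3, 1),
    Edge(0, 5, 3, 0),
    Edge(5, 10, 3, 2),
    Edge(0, 10, 4, 3),
    Edge(0, 5, 4, 2),
    Edge(5, 10, 4, 0),
]
MUTATIONS = [Mutation(2.0, 3), Mutation(7.0, 3), Mutation(8.0, 0)]

if __name__ == "__main__":
    for row in genotypes(EDGES, MUTATIONS, SAMPLES):
        print(row)
```

In this sketch each mutation is stored once, on a single branch, rather than once per sample that carries it, and adjacent trees reuse the edges they have in common. Those two properties are the intuition behind the compression described in the abstract; the toy code makes no claim about the file sizes or speedups quoted there.
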
Our research plan builds on the extraordinary successes of tree sequence methods so far, scaling up three crucial layers of computational genomics: analysis, simulation, and inference. First, we will continue our development of highly efficient tree-sequence-based methods for fundamental operations in statistical and population genetics. Second, we will scale up genome simulations by integrating tree sequence methods into complex forward-time simulations, utilizing modern, multicore processors. Third, we will combine efficient genome simulations with cutting-edge deep-learning methods to improve existing inference methods, both of tree sequences from genomic data, and of population parameters from novel tree-sequence encodings of genotype data. Together, we aim to revolutionize the way we work with population genetic variation data, and how we use it to understand human health and evolutionary processes.

 Our experienced, interdisciplinary te...

## Key facts

- **NIH application ID:** 10471496
- **Project number:** 1R56HG011395-01A1
- **Recipient organization:** UNIVERSITY OF OREGON
- **Principal Investigator:** PETER Lochhead RALPH
- **Activity code:** R56
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $556,834
- **Award type:** 1
- **Project period:** 2021-09-24 → 2023-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10471496

## Citation

> US National Institutes of Health, RePORTER application 10471496, Scaling up computational genomics with tree sequences (1R56HG011395-01A1). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10471496. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
