Scaling up computational genomics with tree sequences

NIH RePORTER · NIH · R56 · $556,834

Abstract

Increasing sample size is a tremendously important factor in building our understanding of the genetics of human disease. As we discover that more and more diseases have a complex web of genetic causation, we need larger and larger genetic datasets to disentangle them, and to ultimately produce successful therapies. Driven in part by this need, the community is now assembling vast collections of human genome sequences, and millions of samples will soon be commonplace. Nonhuman datasets, with applications in epidemiology, ecology, and evolution, will not be far behind. There is a profound problem, however: our computational methods for storing, processing, simulating, and analyzing genomic data are lagging far behind our ability to collect such data. The algorithms and data structures underlying today's computational methods were designed for thousands of samples, not millions, and we are in danger of being overwhelmed by the impending tsunami of data. Without a fundamental change in how we store and process genomic data, we will either not fully tap the potential of the data we collect, or the computational costs will be astronomical – or both.

Our proposal addresses this critical need by focusing on a new data structure: the succinct tree sequence. This data structure (the “tree sequence”, for brevity) encodes genetic variation data using the population genetics processes that produced the data itself – by representing variation among contemporary samples via mutations on the branches of the underlying genealogical trees. This yields extraordinary levels of data compression, with file sizes hundreds of times smaller than current community standards. Since the tree sequence was introduced in 2016, it has led to performance increases of 2–4 orders of magnitude in the diverse applications of genome simulation, calculation of statistics, and ancestry inference.
Such sudden leaps in computational performance are vanishingly rare, and only possible through deep algorithmic advances.

Our research plan builds on the extraordinary successes of tree sequence methods so far, scaling up three crucial layers of computational genomics: analysis, simulation, and inference. First, we will continue our development of highly efficient tree-sequence-based methods for fundamental operations in statistical and population genetics. Second, we will scale up genome simulations by integrating tree sequence methods into complex forward-time simulations, utilizing modern, multicore processors. Third, we will combine efficient genome simulations with cutting-edge deep-learning methods to improve existing inference methods, both of tree sequences from genomic data, and of population parameters from novel tree-sequence encodings of genotype data. Together, we aim to revolutionize the way we work with population genetic variation data, and how we use it to understand human health and evolutionary processes. Our experienced, interdisciplinary te...
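
The encoding the abstract describes, variation among samples stored as mutations on the branches of genealogical trees, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the names `Edge`, `Mutation`, and `genotypes`, the three-sample genealogy, and the coordinates are hypothetical, and this is neither the project's implementation nor the API of any real library. Each edge records a parent-child link that holds over a genomic interval, so adjacent trees share structure, and a sample carries a mutation exactly when the mutated node lies on its path to the root of the local tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    left: float   # start of the genomic interval (inclusive)
    right: float  # end of the genomic interval (exclusive)
    parent: int
    child: int


@dataclass(frozen=True)
class Mutation:
    position: float
    node: int  # the mutation sits on the branch directly above this node


def genotypes(samples, edges, mutation):
    """Return a 0/1 genotype for each sample at the mutation's position."""
    # Build the child -> parent map of the local tree covering this position.
    parent = {
        e.child: e.parent
        for e in edges
        if e.left <= mutation.position < e.right
    }
    result = []
    for sample in samples:
        node = sample
        carries = 0
        # Walk from the sample up to the root, looking for the mutated node.
        while node is not None:
            if node == mutation.node:
                carries = 1
                break
            node = parent.get(node)
        result.append(carries)
    return result


# Hypothetical genealogy of samples 0, 1, 2 over the interval [0, 10).
# On [0, 5) node 3 is the ancestor of samples 0 and 1;
# on [5, 10) node 3 is the ancestor of samples 1 and 2.
SAMPLES = [0, 1, 2]
EDGES = [
    Edge(0, 10, 4, 3),  # shared by both local trees, stored once
    Edge(0, 10, 3, 1),  # shared by both local trees, stored once
    Edge(0, 5, 3, 0),
    Edge(0, 5, 4, 2),
    Edge(5, 10, 3, 2),
    Edge(5, 10, 4, 0),
]

left_site = genotypes(SAMPLES, EDGES, Mutation(position=2, node=3))
right_site = genotypes(SAMPLES, EDGES, Mutation(position=7, node=3))
print(left_site)   # [1, 1, 0]
print(right_site)  # [0, 1, 1]
```

In this sketch each variant site is stored as a single (position, node) pair rather than one value per sample, and edges that persist across neighboring trees are stored only once. That shared structure is the intuition behind the compression the abstract reports, although the toy example makes no claim about the actual file format or its performance.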

Key facts

NIH application ID: 10471496
Project number: 1R56HG011395-01A1
Recipient: UNIVERSITY OF OREGON
Principal Investigator: Peter Lochhead Ralph
Activity code: R56
Funding institute: NIH
Fiscal year: 2021
Award amount: $556,834
Award type: 1
Project period: 2021-09-24 → 2023-08-31