Personal and panel references for improved alignment

NIH RePORTER · NIH · R01 · $372,820

Abstract

PROJECT SUMMARY

Next-generation sequencing is ubiquitous in the study of biology and disease. The first step when analyzing a sequencing dataset is read alignment: the process of determining where each snippet of sequencing data (“read”) came from with respect to a reference genome. Currently, genomics research is hampered by the use of a single, arbitrary reference. This fails to account for the vast genetic diversity that exists among humans and model organisms. Further, it can result in “reference bias,” in turn leading to false or misleading scientific results. We propose a three-aim project that addresses the reference bias problem on multiple fronts. In Aim 1, we will develop new methods and a new software tool called biastools for summarizing and visualizing reference bias. In Aim 2, we will develop new software and methods that address reference bias by enabling alignment to multiple representative reference genomes. In one subproject, we will use genotype imputation to infer a personalized genome with the help of a large panel of reference haplotypes. In a second subproject, we will use small collections of representative genomes connected in a “flow graph,” so that reads are ultimately analyzed with respect to the most appropriate reference. The methods described in both subprojects will be implemented as part of a new software tool called pals. Also as part of this aim, we will release a software library and tool called jector for transforming alignments from one reference coordinate system to another. Finally, for Aim 3, we apply a novel text-indexing method called r-index to enable alignment of reads to large panels of reference haplotypes. We will release the software as a software library and tool called pandex. Successful completion of the project will provide the community with new methods and references that leverage the genetic information we are gleaning from large-scale genotyping studies and from new long-read assemblies.
All software will be made available under an open source license.

Key facts

NIH application ID: 10893002
Project number: 5R01HG011392-05
Recipient: JOHNS HOPKINS UNIVERSITY
Principal Investigator: Benjamin Thomas Langmead
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $372,820
Award type: 5
Project period: 2020-09-01 → 2025-06-30