# Leveraging long-range haplotypes in sequencing data to advance large scale genetic studies

> **NIH R01** · UNIVERSITY OF MICHIGAN AT ANN ARBOR · 2021 · $359,028

## Abstract

The Human Genome Project and subsequent projects such as 1000 Genomes, Genome Sequencing Program
(GSP), and Trans-Omics Precision Medicine (TOPMed) are providing powerful resources for studying the
genetic basis of human diseases. Combining these resources and technologies with the development of new
statistical and computational methods has, in the last decade, led to the identification of thousands of loci
associated with disease-related phenotypes, primarily through array-based genome-wide association studies
(GWAS), empowered by genotype imputation from sequence-based haplotype panels. However, serious
problems remain when analyzing these data: (1) As short-read sequencing data provide only unphased
genotype data, methods for statistical phasing are used to allow advanced analyses and to generate reference
haplotypes for genotype imputation. However, current methods to phase sequence data result in several
thousand switch errors per genome. These phasing errors in turn limit the accuracy of genotype imputation and
hamper our ability to study haplotype-aware disease models such as compound heterozygotes. (2) Due to the
abundance of rare variants, it is necessary to identify high-interest variants to obtain powerful test statistics.
Within exons, the genetic code provides some of the necessary information, but for most of the genome we have
very little information that allows us to prioritize variants. (3) While samples sequenced from diverse and
admixed populations are becoming more common, few methods are designed to make use of the unique
properties of such data. For example, the distribution of local ancestry in admixed samples generates unique
haplotype structure that can be informative about the underlying phasing. Here we propose a set of novel
methods that will address these challenges by recognizing that in very large datasets most sequences will have a
recent common ancestor with at least one other sequence and that these closely related sequences will share
long segments (>1 cM) identical by descent (IBD). These IBD segments provide information about the
phasing of the underlying variants, similar to large sibships. Moreover, the length of the IBD segment provides
information about the age of variants located on the IBD segment. As young variants are more likely to be
under selection, IBD length can be used to prioritize functional noncoding variants. We also aim to leverage the
long-distance correlation of genotypes in admixed samples to identify phasing errors. As
phasing errors also change the local ancestry of a sample in individuals of heterozygous ancestry, detecting
these breaks allows us to identify and correct phasing errors. We will develop statistical models that leverage
these conceptual ideas and implement these methods in algorithms efficient enough to be applied to sample
sizes >100,000. We will use our algorithms to annotate and re-phase existing large sequencing datasets and
thus improve commonly used imput...
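
The local-ancestry idea in the abstract can be illustrated with a minimal sketch. It is not the project's method. It assumes each haplotype already carries a per-marker local-ancestry label (the labels `"AFR"` and `"EUR"` and the function names `candidate_switch_points` and `apply_switches` are hypothetical). In a region where the two haplotypes have different ancestries, a position where both labels swap at once is flagged as a candidate switch error, since a single phasing error would produce exactly that pattern.

```python
from typing import List, Sequence, Tuple


def candidate_switch_points(anc_a: Sequence[str], anc_b: Sequence[str]) -> List[int]:
    """Return marker indices i where the ancestry labels of the two
    haplotypes swap between marker i-1 and marker i.

    Only regions of heterozygous ancestry are informative: if both
    haplotypes share one ancestry, a switch error leaves labels unchanged.
    """
    if len(anc_a) != len(anc_b):
        raise ValueError("haplotypes must cover the same markers")
    points = []
    for i in range(1, len(anc_a)):
        prev_a, prev_b = anc_a[i - 1], anc_b[i - 1]
        heterozygous = prev_a != prev_b
        swapped = anc_a[i] == prev_b and anc_b[i] == prev_a
        if heterozygous and swapped:
            points.append(i)
    return points


def apply_switches(
    hap_a: Sequence, hap_b: Sequence, points: Sequence[int]
) -> Tuple[list, list]:
    """Undo candidate switch errors by exchanging the two haplotypes
    from each flagged index onward (hypothetical correction step)."""
    a, b = list(hap_a), list(hap_b)
    for p in points:
        a[p:], b[p:] = b[p:], a[p:]
    return a, b


# Toy labels with two simultaneous swaps (assumed data, for illustration only).
anc_a = ["AFR", "AFR", "EUR", "EUR", "AFR"]
anc_b = ["EUR", "EUR", "AFR", "AFR", "EUR"]
points = candidate_switch_points(anc_a, anc_b)
fixed_a, fixed_b = apply_switches(anc_a, anc_b, points)
```

A simultaneous swap could in principle also arise from real ancestry breakpoints on both haplotypes, so a statistical model would treat these positions as candidates to be weighed against other evidence rather than as confirmed errors.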

## Key facts

- **NIH application ID:** 10251017
- **Project number:** 5R01HG011031-02
- **Recipient organization:** UNIVERSITY OF MICHIGAN AT ANN ARBOR
- **Principal Investigator:** Sebastian Zoellner
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $359,028
- **Award type:** 5
- **Project period:** 2020-09-01 → 2024-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10251017

## Citation

> US National Institutes of Health, RePORTER application 10251017, Leveraging long-range haplotypes in sequencing data to advance large scale genetic studies (5R01HG011031-02). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10251017. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*