# Statistical Methods for the Analysis of Long-Read Sequencing Data

> **NIH F31** · UNIVERSITY OF MICHIGAN AT ANN ARBOR · 2021 · $37,816

## Abstract

Genome-wide association studies (GWAS) have revealed thousands of loci associated with hundreds of
complex human traits and diseases, but the underlying biological mechanisms of most of these associations
remain unclear. The majority of associated variants are in noncoding regions and presumed to influence the
trait through regulation of gene expression. Regulatory variants are often close to their target genes, but they
may also be located up to hundreds of kilobases away in distal regulatory elements. Long-range haplotype
phasing is important to study the effects of distal regulatory variants on genes and their subsequent influence
on traits. Haplotype information is most commonly obtained through family data or the statistical phasing of
genotypes from arrays or short-read whole-genome sequencing. However, new long-read sequencing
technologies determine phase directly by sequencing DNA fragments of 10 kilobases or longer. Statistical
phasing methods can be applied to variants called from long reads to infer even longer haplotypes, but existing
methods for phasing variants from long reads do not take advantage of information available in large external
reference panels, which can improve phasing for modestly sized samples. Long-read sequencing also
improves detection of structural variants including copy number variants (CNVs), which have been implicated
in numerous diseases. However, the functional consequences of CNVs have been understudied compared to
single nucleotide variants (SNVs) due to their absence from SNV genotyping arrays and the challenges of
calling CNVs from short-read sequence data. Long-read sequencing therefore enables a more comprehensive
study of the effects of CNVs on gene expression and individual-level traits. The goal of this project is to
develop statistical methods for the analysis of long-read sequence data. In Specific Aim 1, we will extend
existing methods for the statistical phasing of variants from genotype arrays or short reads to obtain long-range
phasing of variants from long reads. In Specific Aim 2, we will develop a framework that integrates phased
genetic data with molecular profiles including gene expression and chromatin accessibility to study the
regulatory effects of CNVs. We expect this research project to lead to improved methods for the analysis of
long-read sequence data that can be used by the wider genetics community.

## Key facts

- **NIH application ID:** 10124991
- **Project number:** 5F31HG011186-02
- **Recipient organization:** UNIVERSITY OF MICHIGAN AT ANN ARBOR
- **Principal Investigator:** Sarah Cheyenne Hanks
- **Activity code:** F31
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $37,816
- **Award type:** 5
- **Project period:** 2020-05-01 → 2022-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10124991

## Citation

> US National Institutes of Health, RePORTER application 10124991, Statistical Methods for the Analysis of Long-Read Sequencing Data (5F31HG011186-02). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10124991. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*