Statistical Methods for the Analysis of Long-Read Sequencing Data

Abstract

Project Summary: Genome-wide association studies (GWAS) have revealed thousands of loci associated with hundreds of complex human traits and diseases, but the underlying biological mechanisms of most of these associations remain unclear. The majority of associated variants are in noncoding regions and are presumed to influence the trait through regulation of gene expression. Regulatory variants are often close to their target genes, but they may also be located up to hundreds of kilobases away in distal regulatory elements. Long-range haplotype phasing is important for studying the effects of distal regulatory variants on genes and their subsequent influence on traits. Haplotype information is most commonly obtained through family data or the statistical phasing of genotypes from arrays or short-read whole-genome sequencing. However, new long-read sequencing technologies determine phase directly by sequencing DNA fragments of 10 kilobases or longer. Statistical phasing methods can be applied to variants called from long reads to infer even longer haplotypes, but existing methods for phasing variants from long reads do not take advantage of information available in large external reference panels, which can improve phasing for modestly sized samples. Long-read sequencing also improves detection of structural variants, including copy number variants (CNVs), which have been implicated in numerous diseases. However, the functional consequences of CNVs have been understudied compared to those of single nucleotide variants (SNVs) due to their absence from SNV genotyping arrays and the challenges of calling CNVs from short-read sequence data. Long-read sequencing therefore enables a more comprehensive study of the effects of CNVs on gene expression and individual-level traits. The goal of this project is to develop statistical methods for the analysis of long-read sequence data.
In Specific Aim 1, we will extend existing methods for the statistical phasing of variants from genotype arrays or short reads to obtain long-range phasing of variants from long reads. In Specific Aim 2, we will develop a framework that integrates phased genetic data with molecular profiles, including gene expression and chromatin accessibility, to study the regulatory effects of CNVs. We expect this research project to lead to improved methods for the analysis of long-read sequence data that can be used by the wider genetics community.

Key facts

NIH application ID: 9990008
Project number: 1F31HG011186-01
Recipient: UNIVERSITY OF MICHIGAN AT ANN ARBOR
Principal Investigator: Sarah Cheyenne Hanks
Activity code: F31
Funding institute: NIH
Fiscal year: 2020
Award amount: $37,300
Award type: 1
Project period: 2020-05-01 → 2022-04-30