SHAPEIT+Salmon: haplotype phasing and RNA-seq quantification for allele-specific eQTL mapping

NIH RePORTER · NIH · R21 · $172,209

Abstract

Allele-specific expression quantitative trait locus (eQTL) mapping has become increasingly popular because it enhances traditional eQTL mapping by revealing in considerably more detail the gene regulatory mechanisms underlying the genetic architecture of diseases. Allele-specific eQTL mapping distinguishes cis-acting eQTLs, which point to cis-regulatory elements, from trans-acting eQTLs, which point to trans-acting factors. It does so by leveraging the fact that, unlike trans-acting eQTLs, cis-acting eQTLs affect the expression of transcripts from the same haplotype as the variant itself, causing allelic imbalance in expression. However, allele-specific eQTL mapping requires reliable long-range phasing of genome sequences and accurate allele-specific expression quantification from RNA-seq data that is consistent with the genome phasing. Most existing work has treated allele-specific expression quantification and phasing as independent tasks, even though each can enhance the accuracy of the other. In this proposed research, we will modify and pair two widely used tools, SHAPEIT for genome phasing and Salmon for RNA-seq quantification, to obtain accurate phasing and allele-specific expression quantification that are consistent with each other for allele-specific eQTL mapping. The combined tool will inherit or enhance the accuracy and efficiency of the two original methods. If phased sequences are known from experimental or trio data, we will replace the EM algorithm of Salmon with an accelerated EM to handle the extreme multi-mapped read problem efficiently. If phased sequences are not available, as in unrelated individuals, we will modify SHAPEIT to jointly phase the variants and allele-specific read abundances, embedding allele-specific expression quantification within SHAPEIT and using Salmon to obtain transcript quantification and allele-specific read abundances.
As a testbed, we will use genotype and RNA-seq data from a 50-generation intercross between two inbred mouse strains. Because these data are derived from two fully sequenced inbred founders, the correct phase is known. Although we use mice as a testbed, our approach is applicable to data from any disease, tissue, or organism, including GTEx data.
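
To make the multi-mapped read problem concrete, the following is a minimal sketch of a plain EM iteration that splits ambiguous reads between two haplotype copies of a transcript. It is an illustration only: it is not Salmon's implementation and not the accelerated EM proposed here, it ignores effective transcript lengths and alignment probabilities, and the function name `em_abundances` and the toy read counts are assumptions introduced for this example.

```python
def em_abundances(compat, n_targets, n_iter=200, tol=1e-10):
    """Estimate relative abundances of targets from read compatibility lists.

    compat: one list per read, holding the indices of the targets
            (e.g. haplotype-specific transcripts) the read maps to.
    Simplification (assumption): every compatible target is equally
    likely a priori, so only the current abundances weight the split.
    """
    theta = [1.0 / n_targets] * n_targets  # uniform starting point
    n_reads = float(len(compat))
    for _ in range(n_iter):
        counts = [0.0] * n_targets
        # E-step: fractionally assign each read in proportion to theta.
        for targets in compat:
            total = sum(theta[t] for t in targets)
            if total == 0.0:
                continue  # no compatible target has support; skip read
            for t in targets:
                counts[t] += theta[t] / total
        # M-step: renormalize expected counts into abundances.
        new_theta = [c / n_reads for c in counts]
        delta = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if delta < tol:
            break
    return theta


# Toy data: target 0 = haplotype A copy, target 1 = haplotype B copy.
# 6 reads cover a variant and match A only, 2 match B only,
# and 4 reads overlap no variant, so they map to both copies.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_abundances(reads, n_targets=2)
print(theta)  # approximately [0.75, 0.25]
```

In this toy case the ambiguous reads are divided in the same 3:1 ratio implied by the allele-informative reads, which is the allelic imbalance signal that allele-specific quantification aims to recover. With many near-identical haplotype sequences most reads are ambiguous, and plain EM of this kind can need many iterations, which is the motivation for acceleration.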

Key facts

NIH application ID
10153860
Project number
5R21HG011116-02
Recipient
CARNEGIE-MELLON UNIVERSITY
Principal Investigator
Seyoung Kim
Activity code
R21
Funding institute
NIH
Fiscal year
2021
Award amount
$172,209
Award type
5
Project period
2020-05-01 → 2023-04-30