Incorporating Analysis of Gene Paralog Variation Into Existing Genomics Datasets

NIH RePORTER · NIH · R03 · $312,000

Abstract

Gene duplication is a major mechanism for the evolution of novel gene functions. Copy-number and sequence variation within multigene families are associated with many phenotypes, human diseases, and evolutionary adaptations. Yet systematic incorporation of gene paralog variation into studies of genomic diversity is lacking. Most existing tools are not well suited to delineating differences among gene family members or require prohibitively large computational resources. We recently developed an approach, QuicK-mer2, which efficiently estimates gene copy number in a paralog-specific manner. Application of our approach to data from the 1000 Genomes Project revealed rare gene-paralog variants that have not been previously reported. Here, we propose applying QuicK-mer2 to create paralog-specific copy-number estimates from existing NIH Common Fund genomics data sets. In Specific Aim 1, we will analyze genome sequencing data from the Genotype-Tissue Expression (GTEx) consortium to define the effect of gene paralog variation on gene expression levels. Although we will assess the entire genome, we will focus our analyses on variation among the largest family of transcription factors, KRAB-ZFPs (Krüppel-associated box zinc finger proteins), to identify trans-acting expression QTL. In Specific Aim 2, we will analyze variation among duplicated genes in the Gabriella Miller Kids First Data Resource with a focus on structural birth defects, a phenotype to which copy-number variation is known to be a key contributor. Many recurrent copy-number variants arise in regions that are flanked by large segments of duplicated sequence with high sequence identity. Many of these regions of segmental duplication also contain members of duplicated gene families that have important biological functions. Here, we will focus on discovering previously missed gene copy-number variation within the duplicated sequences themselves.
Together, completion of these aims will give a fuller picture of the extent of genomic variation and the impact of differences among gene paralogs on gene regulation and disease.

Key facts

NIH application ID
10104902
Project number
1R03OD030605-01
Recipient
UNIVERSITY OF MICHIGAN AT ANN ARBOR
Principal Investigator
Jeffrey M Kidd
Activity code
R03
Funding institute
NIH
Fiscal year
2020
Award amount
$312,000
Award type
1
Project period
2020-09-18 → 2023-08-31