Genome Informatics For Biobank-scale Data

NIH RePORTER · NIH · R56 · $616,069

Abstract

Biobank-scale genetic data offer a wealth of information that is not apparent in traditional, smaller cohorts. We will develop and evaluate efficient, accurate algorithms and tools for analyzing such data and revealing this information. In particular, we will develop methods for three main tasks: haplotype phasing refinement, genotype imputation, and relatedness inference. Although mature methods exist for these tasks in traditional, smaller data sets, scalable, efficient, and accurate methods and tools for genomic data at biobank scale are still lacking. Our main observation is that biobank-scale genetic data offer dense connections between individual data points. Unlike traditional methods based on Li and Stephens hidden Markov models (HMMs), we model each individual using an individual-specific cohort, i.e., all the other individuals connected to that individual. We leverage the efficient positional Burrows-Wheeler transform (PBWT), a foundational data structure for modeling haplotype matching. We were the first to develop a PBWT-based method, RaPID, for identifying identity-by-descent (IBD) segments in biobank-scale cohorts. We have also enriched the traditional PBWT data structure and algorithms to support efficient haplotype search and dynamic updates. In this application, we will leverage our algorithm development expertise to develop an IBD-based algorithm for refining the haplotype phasing of very large panels. We will also develop IBD-based algorithms for improving the efficiency and cost-effectiveness of genotype imputation using a very large reference panel. In addition, we will develop RaPID-Affin algorithms for efficient and accurate inference of genetic relatedness. Finally, we will benchmark the methods and develop free software for the community. This project will empower modern genetic research by developing efficient informatics tools for very large genotyped cohorts.

Key facts

NIH application ID: 10471476
Project number: 1R56HG011509-01A1
Recipient: UNIVERSITY OF TEXAS HLTH SCI CTR HOUSTON
Principal Investigator: Shaojie Zhang
Activity code: R56
Funding institute: NIH
Fiscal year: 2021
Award amount: $616,069
Award type: 1
Project period: 2021-09-24 → 2023-08-31