# Genome Informatics For Biobank-scale Data

> **NIH R56** · UNIVERSITY OF TEXAS HLTH SCI CTR HOUSTON · 2021 · $616,069

## Abstract

Genetic data at biobank scale offer a wealth of information that is not apparent in traditional
smaller cohorts. We will develop and evaluate efficient and accurate algorithms and tools for the
analysis of such data to reveal this information. In particular, we will develop methods for three
main tasks: haplotype phasing refinement, genotype imputation, and relatedness inference.
Although mature methods are available for these tasks in traditional smaller data sets, there is
still a lack of scalable, efficient, and accurate methods and tools for handling genomic data at
biobank scale. Our main observation is that biobank-scale genetic data offer dense connections
between individual data points. Unlike traditional methods based on the Li and Stephens hidden
Markov model (HMM), we model each individual using the individual-specific cohort, i.e., all
the other individuals that are connected to the individual. We leverage the efficient positional
Burrows-Wheeler transform (PBWT), a foundational data structure for modeling haplotype
matching. We were the first to develop a PBWT-based method, RaPID, for identifying
identity-by-descent (IBD) segments in biobank-scale cohorts. We have also enriched the traditional
PBWT data structure and algorithms to support efficient haplotype search and allow dynamic
updates. In this application, we
leverage our algorithm development expertise and develop an IBD-based algorithm for refining
haplotype phasing of very large panels. We will also develop IBD-based algorithms for
improving efficiency and cost-effectiveness of genotype imputation using a very large reference
panel. In addition, we will develop RaPID-Affin algorithms for efficient and accurate inference of
genetic relatedness. Finally, we will benchmark the methods and develop free software for the
community. This project will empower modern genetic research by developing efficient
informatics tools for very large genotyped cohorts.
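
To make the role of the PBWT more concrete, the following is a minimal Python sketch of the standard prefix-array and divergence-array construction for a panel of binary haplotypes. It is an illustration only and is not the project's RaPID code: the names `build_pbwt`, `adjacent_matches`, and `PANEL`, and the toy data, are assumptions introduced here. After the final site, haplotypes that are adjacent in the sorted order share the longest matches ending at that site, which is what makes the structure useful for finding long shared segments.

```python
from typing import List, Sequence, Tuple

# Toy panel (hypothetical data): 4 haplotypes x 4 biallelic sites, alleles 0/1.
PANEL = [
    [0, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
]


def build_pbwt(haplotypes: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Return (order, div) after sweeping all sites.

    order: haplotype indices sorted by their reversed prefixes.
    div:   div[i] is the first site of the match between order[i] and
           order[i - 1] that ends at the last site; a value equal to the
           number of sites means there is no match.
    """
    n_haps = len(haplotypes)
    n_sites = len(haplotypes[0]) if n_haps else 0
    order = list(range(n_haps))
    div = [0] * n_haps
    for k in range(n_sites):
        zeros, ones = [], []
        zeros_div, ones_div = [], []
        # p and q track the match start relative to the previous haplotype
        # placed in the 0-bucket and the 1-bucket, respectively.
        p = q = k + 1
        for hap, start in zip(order, div):
            p = max(p, start)
            q = max(q, start)
            if haplotypes[hap][k] == 0:
                zeros.append(hap)
                zeros_div.append(p)
                p = 0
            else:
                ones.append(hap)
                ones_div.append(q)
                q = 0
        # Stable partition: all 0-allele haplotypes first, then 1-allele ones.
        order = zeros + ones
        div = zeros_div + ones_div
    return order, div


def adjacent_matches(haplotypes: Sequence[Sequence[int]], min_sites: int):
    """List (hap_a, hap_b, length) for neighbours in the final sorted order
    whose match ending at the last site spans at least min_sites sites."""
    order, div = build_pbwt(haplotypes)
    n_sites = len(haplotypes[0]) if haplotypes else 0
    matches = []
    for i in range(1, len(order)):
        length = n_sites - div[i]
        if length >= min_sites:
            matches.append((order[i - 1], order[i], length))
    return matches
```

Calling `adjacent_matches(PANEL, 2)` lists the neighbouring pairs that share at least two trailing sites. The sweep costs time proportional to the number of haplotypes times the number of sites, which is the property that makes PBWT-style methods attractive for very large panels.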

## Key facts

- **NIH application ID:** 10471476
- **Project number:** 1R56HG011509-01A1
- **Recipient organization:** UNIVERSITY OF TEXAS HLTH SCI CTR HOUSTON
- **Principal Investigator:** Shaojie Zhang
- **Activity code:** R56
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $616,069
- **Award type:** 1
- **Project period:** 2021-09-24 → 2023-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10471476

## Citation

> US National Institutes of Health, RePORTER application 10471476, Genome Informatics For Biobank-scale Data (1R56HG011509-01A1). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10471476. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
