# Direct whole-genome haplotype-resolved assembly using sequence graphs

> **NIH K99** · DANA-FARBER CANCER INST · 2020 · $48,022

## Abstract

The lack of complete, high-quality sequencing of human genomes is a major bottleneck for accurate and
complete analyses in population and medical genetics. Advances in a variety of sequencing technologies have
created enormous opportunities to yield full assemblies of every chromosome and its homologue (called
haplotypes). The reconstruction of haplotype sequences from sequencing data is known as diploid assembly or
haplotype-aware de novo assembly. Standard de novo assemblers are limited in their ability to combine mixed
data types, and also collapse haplotype sequences, resulting in expensive, discontinuous, and inaccurate
assemblies. Our interim goal is a finished human genome that would not only reveal the last remaining regions
of the genome, but also benefit downstream analyses by providing an unbiased reference for comparison and
mapping, as well as the complete phased sequencing of several human and non-human genomes for specific
research projects.

This project will develop a novel computational toolkit, WHdenovo, that can optimally combine various
sequencing data types to generate phased assemblies of single individuals and pedigrees. In aim 1 (K99
phase), I will provide computationally efficient, easy-to-use, open-source, production-level tools
for generating diploid assemblies of pedigrees at minimal cost. In aim 2 (R00 phase), I will develop novel
computational tools for generating pedigree-independent diploid assemblies of single individuals over whole
genomes including centromeres. In aim 3 (R00 phase), the tools developed during aims 1 and 2 will be applied
to generating diploid assemblies of diverse human and non-human genomes, and of clinically relevant regions
such as the major histocompatibility complex (MHC) and killer cell immunoglobulin-like receptor (KIR) region. My goal
is to design tools that will be useful to large consortiums such as Genome in a Bottle, High Quality Human
Reference Genomes, and the Personal Genome Project.

My extensive background in computational biology puts me in a unique position to carry out this proposal,
which requires seamless integration of data science and genomics.

Career and Training: I received
my PhD in Computer Science at the Max Planck Institute for Informatics and started postdoctoral research in the
lab of Professor George Church at Harvard Medical School. During the K99 phase, I will continue to be
mentored by Professor Church. Under the supervision of co-mentor Heng Li, I will advance my expertise in
making computational tools efficient in practice and in tuning them for upcoming high-throughput
sequencing (HTS) datasets. This proposed plan would prepare me to be an independent bioinformatics
research scientist.

## Key facts

- **NIH application ID:** 10015321
- **Project number:** 5K99HG010906-02
- **Recipient organization:** DANA-FARBER CANCER INST
- **Principal Investigator:** Shilpa Garg
- **Activity code:** K99 (R01, R21, SBIR, etc.)
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $48,022
- **Award type:** 5
- **Project period:** 2019-09-10 → 2021-01-03

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10015321

## Citation

> US National Institutes of Health, RePORTER application 10015321, Direct whole-genome haplotype-resolved assembly using sequence graphs (5K99HG010906-02). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10015321. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
