# A deep reinforcement learning framework for haplotype assembly

> **NIH R21** · BROAD INSTITUTE, INC. · 2024 · $457,920

## Abstract

Haplotype assembly is the problem of reconstructing the combination of alleles on the maternally and
paternally inherited chromosome copies and is key to our understanding of human population genetics and
disease. Numerous statistical and molecular approaches have been developed to date to enable haplotype
reconstruction. In this work, we focus on read-based phasing of individual genomes, which involves the
assembly of the two haplotypes from whole-genome-sequencing read alignments and variant genotypes.
Fragments that span more than one heterozygous variant provide molecular linkage evidence for alleles
occurring on the same haplotype and can hence be leveraged for haplotype assembly; however, sequencing
errors make this problem challenging. Existing techniques often employ an NP-hard combinatorial optimization
formulation for this problem and rely on hand-engineered heuristics to find a solution. Here we propose a novel
framework based on deep reinforcement learning, which integrates the representational power of deep
learning with reinforcement learning, to automatically learn effective algorithms that can accurately partition
read fragments into two haplotype sets given inputs from different sequencing platforms. Importantly, this
approach does not require labeled training data, which allows us to use all the publicly available datasets
collected in large-scale sequencing repositories, such as the 1000 Genomes Project, as training data for our
models. Given the complex combinatorial structure of genomic data, an important aspect of this work is the
design and compilation of a representative training dataset to ensure model generalizability. Our
preliminary results show that our approach can achieve state-of-the-art phasing block lengths and lower error
rates on short-read inputs.
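
To make the combinatorial formulation concrete, here is a minimal Python sketch of fragment partitioning. The abstract does not name its objective, so this sketch assumes a minimum-error-correction style cost, a common formulation in the phasing literature: split the fragments into two sets and count the allele calls that disagree with each set's majority consensus. The toy `fragments` data and the helpers `partition_cost` and `best_partition` are hypothetical, and the exhaustive search stands in for the hand-engineered heuristics mentioned above. It is not the proposed reinforcement learning method.

```python
from itertools import product

# Hypothetical toy input: each fragment maps a heterozygous variant index
# to the allele it observed there (0 = reference, 1 = alternate).
fragments = [
    {0: 0, 1: 0},
    {1: 0, 2: 1},
    {0: 1, 1: 1},
    {1: 1, 2: 0},
    {0: 0, 2: 0},  # inconsistent with the rest, e.g. a sequencing error
]


def partition_cost(fragments, assignment):
    """Count allele calls that disagree with the majority call of their set.

    assignment[i] is 0 or 1, the haplotype set that fragment i is placed in.
    """
    cost = 0
    for side in (0, 1):
        counts = {}  # variant index -> [number of 0 calls, number of 1 calls]
        for fragment, chosen in zip(fragments, assignment):
            if chosen != side:
                continue
            for variant, allele in fragment.items():
                counts.setdefault(variant, [0, 0])[allele] += 1
        # At each variant the minority calls would have to be corrected.
        cost += sum(min(pair) for pair in counts.values())
    return cost


def best_partition(fragments):
    """Try every bipartition; exponential in the number of fragments."""
    best = None
    for assignment in product((0, 1), repeat=len(fragments)):
        cost = partition_cost(fragments, assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


if __name__ == "__main__":
    cost, assignment = best_partition(fragments)
    print(cost, assignment)
```

Exhaustive search like this is only feasible for a handful of fragments, which illustrates why practical tools rely on heuristics and why a learned partitioning policy is of interest.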

## Key facts

- **NIH application ID:** 10871190
- **Project number:** 1R21HG013567-01
- **Recipient organization:** BROAD INSTITUTE, INC.
- **Principal Investigator:** Victoria Popic
- **Activity code:** R21
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $457,920
- **Award type:** 1
- **Project period:** 2024-07-15 → 2026-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10871190

## Citation

> US National Institutes of Health, RePORTER application 10871190, A deep reinforcement learning framework for haplotype assembly (1R21HG013567-01). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10871190. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*