# Development of a joint machine learning/de novo assembly system for resolving viral quasispecies

> **NIH R43** · GATACA, LLC · 2020 · $267,225

## Abstract

Hepatitis B virus (HBV) establishes chronic infections in >250M people worldwide; chronicity is
on the rise, and approximately one-third of the world’s population (2 billion) has serologic evidence of
exposure. HBV coinfection with HCV and HIV is a hidden consequence of the substance use disorder
epidemic. Viral populations have extremely high sequence diversity and rapidly evolve, which explains the
vaccine failure rates and viral resistance to existing therapies and makes discovering lasting therapies
extremely challenging. Next Generation Sequencing (NGS) is the method of choice to assess the intra-host
virus population, termed a “quasispecies”. Although NGS yields a large set of short DNA sequencing reads that
represent the virions in the quasispecies, computational technologies are limited in their analysis capabilities,
resulting in particularly low resolution of complex HBV genomic structures. Another challenge is assembling
NGS reads representing short fragments of the host genome into full strains (haplotypes) without knowledge of
their true occurrence in the samples. To meet these challenges, GATACA is developing pathogen-specific
bioinformatics software, GAT-ML (GATACA Assembly Tool – machine learning [ML]) to support treatment
discovery and improve infection control. Its purpose-built algorithm uses novel ML methodologies
adapted for genome assembly, allowing GAT-ML to reconstruct complete viral
haplotypes and populations by learning the ‘language’ of the sequences. Tailored initially for HBV samples,
GAT and its new ML system will be integrated for feasibility testing in this Phase I with the following Specific
Aims:
1. Specific Aim 1. Build a joint learning system. Train and test natural language processing (NLP) methods on
HBV genetic variation.
2. Specific Aim 2. Implement and test the machine learning methods in GAT (GAT-ML).
We anticipate a working tool for characterizing HBV haplotypes, validated with multi-sourced datasets, and
extensive testing and benchmarking of offline and integrated methods.
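
As a rough illustration of what treating sequences as a "language" can mean, the sketch below splits short reads into overlapping k-mers ("words") and counts them, a common first step before applying NLP-style models to DNA. It is not GAT-ML's method, which the abstract does not describe; the function names, the toy reads, and the choice of `k = 3` are all assumptions made for this example.

```python
from collections import Counter


def kmer_tokens(read: str, k: int = 3) -> list[str]:
    """Split a DNA read into overlapping k-mers (hypothetical helper)."""
    read = read.upper()
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_vocabulary(reads: list[str], k: int = 3) -> Counter:
    """Count how often each k-mer occurs across all reads (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        counts.update(kmer_tokens(read, k))
    return counts


# Toy reads invented for illustration; real NGS reads are far longer.
reads = ["ATGCA", "TGCAT"]
vocab = kmer_vocabulary(reads, k=3)

for kmer, count in sorted(vocab.items()):
    print(kmer, count)
```

The resulting k-mer counts (or the token lists themselves) are the kind of input an NLP-style model could be trained on; which tokenization and model the project actually uses is not stated in the source.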

## Key facts

- **NIH application ID:** 10011686
- **Project number:** 1R43AI152894-01
- **Recipient organization:** GATACA, LLC
- **Principal Investigator:** Johanna C Craig
- **Activity code:** R43
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $267,225
- **Award type:** 1
- **Project period:** 2020-04-01 → 2022-03-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10011686

## Citation

> US National Institutes of Health, RePORTER application 10011686, Development of a joint machine learning/de novo assembly system for resolving viral quasispecies (1R43AI152894-01). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10011686. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
