Development of a joint machine learning/de novo assembly system for resolving viral quasispecies

NIH RePORTER · NIH · R43 · $267,225 · view on reporter.nih.gov ↗

Abstract

PROJECT SUMMARY Viral hepatitis from hepatitis B (HBV) establishes chronic infections in >250M people worldwide; chronicity is on the rise, and approximately one-third of the world’s population (2 billion) has serologic evidence of exposure. HBV coinfection with HCV and HIV is a hidden consequence of the substance use disorder epidemic. Viral populations have extremely high sequence diversity and rapidly evolve, which explains the vaccine failure rates and viral resistance to existing therapies and makes discovering lasting therapies extremely challenging. Next Generation Sequencing (NGS) is the method of choice to assess the intra-host virus population, termed a “quasispecies”. While a large set of short DNA sequencing reads are acquired that represent the virions in the quasispecies, computational technologies are limited in their analysis capabilities, resulting in particularly low resolution of complex HBV genomic structures. Another challenge is assembling NGS reads representing short fragment of the host genome into full strains (haplotypes) without knowledge of their true occurrence in the samples. To meet these challenges, GATACA is developing pathogen-specific bioinformatics software, GAT-ML (GATACA Assembly Tool – machine learning [ML]) to support treatment discovery and improve infection control. Its specifically designed algorithm utilizes novel ML methodologies adapted and modified for assisting genome assembly that will allow GAT-ML to reconstruct complete viral haplotypes and populations by learning the ‘language’ of the sequences. Tailored initially for HBV samples, GAT and its new ML system will be integrated for feasibility testing in this Phase I with the following Specific Aims: 1. Specific Aim 1. Build a joint learning system. Train and test natural language processing (NLP) methods on HBV genetic variation. 2. Specific Aim 2. Implement and test the machine learning methods in GAT (GAT-ML). We anticipate a working tool for characterizing HBV haplotypes, validated with multi-sourced datasets, and extensive testing and benchmarking of offline and integrated methods.

Key facts

NIH application ID
10011686
Project number
1R43AI152894-01
Recipient
GATACA, LLC
Principal Investigator
Johanna C Craig
Activity code
R43
Funding institute
NIH
Fiscal year
2020
Award amount
$267,225
Award type
1
Project period
2020-04-01 → 2022-03-31