Robust and cost-effective computational methods for haplotype-resolved genome assemblies

NIH RePORTER · NIH · K99 · $42,888

Abstract

Background: De novo haplotype-resolved genome assembly not only plays a critical role in studies of novel species, but is also the most comprehensive way to discover structural variants and understand repeat-rich regions of the human genome. Moreover, haplotype-resolved assemblies are the fundamental infrastructure for various pangenome references. Recent advances in accurate long-read sequencing technologies have opened the opportunity to faithfully build high-quality haplotype-resolved assemblies, but most assembly algorithms cannot take full advantage of the emerging accurate long-read data. To this end, I have developed a graph-based haplotype-resolved genome assembly algorithm, called hifiasm, which combines accurate long reads with additional data that provide long-range phasing information. Hifiasm has been widely used by multiple large-scale sequencing projects, such as the Human Pangenome Reference Consortium (HPRC), the Genome in a Bottle (GIAB), the Vertebrate Genomes Project (VGP), and the Darwin Tree of Life project. Based on hifiasm, this proposal focuses on developing a set of new haplotype-resolved assembly algorithms to further improve assembly quality for complex regions and genomes, and to substantially reduce assembly cost.

Research: My first aim is to develop a hybrid algorithm to produce high-quality haplotype-resolved assemblies for diploid genomes, focusing especially on resolving highly repetitive regions such as centromeres. The proposed algorithm will combine the advantages of length and accuracy from different types of long-read data to automatically reconstruct the last unexplored repeat-rich regions of the genome. In the second aim, I will develop a haplotype-aware scaffolding algorithm to achieve chromosome-level haplotype-resolved assemblies for diploid genomes. In the third aim, I will propose different strategies to reduce the sequencing cost and the computational cost of haplotype-resolved assembly, making it feasible for population-scale studies. I will also develop assembly algorithms to resolve complex genomes that contain more than two haplotypes. Upon completion, the proposed studies will offer efficient assembly tools for large-scale sequencing projects, and will pave the way to personal genome assembly for genomic research and clinical applications.

Career development and training: My long-term career goal is to lead an independent research group focused on developing novel computational methods for haplotype-resolved assemblies and their applications. In addition to further enhancing my training in computational method development with my mentor Dr. Heng Li, I will obtain systematic training in biomedical research from the advisory committee (Dr. Erich D. Jarvis and Dr. Scott V. Edwards for human and non-human genomes, Dr. Evan E. Eichler and Dr. Karen H. Miga for repeats and structural variations, as well as Dr. Matthew Meyerson for complex genomes including not only two h...

Key facts

NIH application ID: 10784766
Project number: 5K99HG012798-02
Recipient: DANA-FARBER CANCER INST
Principal Investigator: Haoyu Cheng
Activity code: K99
Funding institute: NIH
Fiscal year: 2024
Award amount: $42,888
Award type: 5
Project period: 2023-02-13 → 2024-07-01