K-mer indexing for pan-genome reference annotation

NIH RePORTER · NIH · U01 · $300,000

Abstract

The human genome reference sequence is one of the foundations of genome sciences, especially in the context of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research and has been particularly instrumental in human disease gene identification. However, the human genome reference is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual flexibility to represent the breadth of human variation. Important elements of individual genomes are either missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is computationally more efficient, provides accurate representation in the context of populations, and facilitates the analysis of diverse human genomes. Our goal is to use this strategy to develop a robust computational architecture that will encode and annotate large collections of genomes in the context of a pan-genome reference. First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype/phased reference genomes by 1) generating an index of all K-mers in the human reference genome GRCh38 in a manner that can efficiently store variant information as metadata, 2) incrementally updating the K-mer index to include all novel K-mers derived from ongoing population sequencing efforts, and 3) developing schemes for directly analyzing compressed genomic data.
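The first aim (index the K-mers of a reference, then add novel K-mers from new genomes) can be illustrated with a minimal Python sketch. This is not the project's implementation: the function names (`kmers`, `build_index`, `add_genome`), the toy sequences, and the choice of a plain in-memory dictionary are all assumptions made for illustration, and a real index over GRCh38 would need a compact on-disk representation.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield (offset, k-mer) pairs for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]

def build_index(name, seq, k):
    """Map each k-mer to the set of (genome name, offset) pairs where it occurs."""
    index = defaultdict(set)
    for i, kmer in kmers(seq, k):
        index[kmer].add((name, i))
    return index

def add_genome(index, name, seq, k):
    """Add a genome to the index in place; return the k-mers not seen before."""
    novel = set()
    for i, kmer in kmers(seq, k):
        if kmer not in index:  # membership test does not create an entry
            novel.add(kmer)
        index[kmer].add((name, i))
    return novel

# Hypothetical toy data: a stand-in "reference" and one haplotype that
# differs from it by a single substitution at offset 4 (A -> T).
reference = "ACGTACGGA"
sample = "ACGTTCGGA"
K = 4

index = build_index("ref", reference, K)
novel = add_genome(index, "sample1", sample, K)
print(sorted(novel))  # ['CGTT', 'GTTC', 'TCGG', 'TTCG']
```

In this toy case a single substitution introduces four novel 4-mers (the windows that overlap the changed base), while shared K-mers such as `ACGT` simply gain a second occurrence record.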
Second, we plan to apply K-mer representation to genomic analysis by 1) providing the entirety of known human genetic variation in an aggregated index that is computationally efficient and easy to understand, 2) developing functions for our pan-genomic index that support ultra-rapid queries, such as queries for clinically important variants, and 3) linking conventional coordinate information to the K-mer metadata in the pan-genome index so that genetic variation can be annotated against a particular genome reference. Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility of our approach, to promote community engagement, and to enable contributions from the research community. We expect that completion of these aims will provide a scalable computational architecture that incorporates the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that will remain nearly constant as the database grows; and a universally accessible portal using cloud computing. This work will help solve the issues of multiple assemblies. It will improve researchers’ ability to understand the relationship of variants and disease, while also providing great savings over the long-term ...
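The second aim (variant queries against K-mer metadata that carries conventional coordinates) can be sketched in the same spirit. Everything here is hypothetical: the helper names (`variant_kmers`, `annotate`, `query`), the `chrToy` contig, and the label format are illustrative assumptions, and the sketch handles only single-base substitutions. It stores a coordinate-based label under each K-mer that spans a variant, so a query is one hash lookup per K-mer of the read, a cost that on average does not depend on how many K-mers the index holds.

```python
def variant_kmers(ref_seq, pos, alt, k):
    """Return the k-mers that span a single-base substitution at 0-based pos."""
    mutated = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    start = max(0, pos - k + 1)
    end = min(pos, len(mutated) - k)
    return [mutated[i:i + k] for i in range(start, end + 1)]

def annotate(index, ref_seq, chrom, pos, alt, k):
    """Store a coordinate-based variant label under each spanning k-mer."""
    label = f"{chrom}:{pos + 1}{ref_seq[pos]}>{alt}"  # 1-based position for display
    for kmer in variant_kmers(ref_seq, pos, alt, k):
        index.setdefault(kmer, set()).add(label)

def query(index, read, k):
    """Collect variant labels for every k-mer of the read (one lookup per k-mer)."""
    hits = set()
    for i in range(len(read) - k + 1):
        hits |= index.get(read[i:i + k], set())
    return hits

# Hypothetical toy reference with one annotated substitution (A -> T at offset 4).
ref = "ACGTACGGA"
index = {}
annotate(index, ref, "chrToy", 4, "T", 4)

print(query(index, "GTTCGG", 4))  # {'chrToy:5A>T'}  read carries the variant
print(query(index, "GTACGG", 4))  # set()            read matches the reference
```

A caveat on this simplification: a variant-spanning K-mer can also occur elsewhere in a real genome, so a practical index would need longer K-mers or additional context to avoid ambiguous hits.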

Key facts

NIH application ID: 10328233
Project number: 5U01HG010963-03
Recipient: STANFORD UNIVERSITY
Principal Investigator: Hanlee P Ji
Activity code: U01
Funding institute: NIH
Fiscal year: 2022
Award amount: $300,000
Award type: 5
Project period: 2020-02-01 → 2024-01-31