Developing robust and scalable genomics tools and databases to analyze immune receptor repertoires across diverse populations

NIH RePORTER · NIH · R01 · $823,771

Abstract

Recent advances in high-throughput sequencing technologies enable cost-effective characterization of the immune system and provide novel opportunities to study the adaptive immune receptor repertoire (AIRR) at the population scale. In particular, AIRR analysis provides essential insight into the complexity of the immune system across a large variety of human diseases, including infectious diseases, cancer, autoimmune conditions, and neurodegenerative diseases. A commonly used assay-based approach (i.e., AIRR-Seq) provides a detailed view of the adaptive immune system by leveraging deep sequencing of amplified DNA or RNA from the variable region of the T and B cell receptor (TCR and BCR) loci. However, the limited number of samples probed by the AIRR-Seq approach restricts the ability to detect novel population-specific V(D)J gene alleles across ethnically diverse and admixed populations. Non-targeted next-generation sequencing (NGS) (e.g., WGS) promises to fill the existing data gap by providing hundreds of thousands of NGS datasets across various ancestry groups. However, reliable and scalable bioinformatics algorithms have yet to be developed to utilize non-targeted NGS technologies to assemble novel population-specific alleles that would support effect-size heterogeneity across ancestries. Comprehensive population-specific allelic immunogenomics reference databases are also lacking. This void exacerbates existing health disparities, as discoveries in medical immunogenomics continue to be a privilege and benefit for populations of European ancestry. The current state-of-the-art databases were built on the genetic architecture of individuals of European ancestry and thus fail to capture allelic variation across diverse populations. Ongoing initiatives by the Adaptive Immune Receptor Repertoire Community (AIRR-C) to improve the representation of diverse populations in reference databases (e.g., OGRDB and VDJbase) ignore individuals of non-European ancestry and only incorporate an extremely small number of individuals of European descent. We propose to utilize a data science approach for studying the variation of the human adaptive immune system at a truly global scale, improving studies of immunological health and diseases, and reducing health disparities. In this study, we will develop robust and scalable bioinformatics tools and databases able to leverage the largest datasets covering individuals of various ancestries, composed of over half a million NGS samples spanning the AIRR-Seq, RNA-Seq, and WGS technologies. We will perform rigorous benchmarking of the developed bioinformatics methods on both simulated and real data to demonstrate the feasibility of using NGS-based approaches to assemble novel V(D)J alleles. The availability of large and ethnically diverse sets of samples will allow us to discover novel population-specific V(D)J alleles, which will enrich existing immunogenomics databases with population-specific i...

Key facts

NIH application ID: 10656981
Project number: 1R01AI173172-01A1
Recipient: UNIVERSITY OF SOUTHERN CALIFORNIA
Principal Investigator: SERGHEI MANGUL
Activity code: R01
Funding institute: NIH
Fiscal year: 2023
Award amount: $823,771
Award type: 1
Project period: 2023-02-10 → 2028-01-31