Large-scale integrated data analysis of lymphocyte receptor repertoires with workflows

NIH RePORTER · NIH · U01 · $499,570

Abstract

Over the last decade, high-throughput B cell and T cell receptor repertoire sequencing has become a fundamental method for investigating adaptive immune responses. The Immcantation framework, consisting of open-source Python and R packages, provides a comprehensive ecosystem for analyzing Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data, covering critical steps such as pre-processing, clonal relationship identification, lineage reconstruction, and somatic hypermutation analysis. This framework is widely used in infectious and immune-mediated disease (IID) research, with over 100,000 downloads in 2022. However, as sequencing technologies advance and datasets grow larger, there is a need for scalable computational workflows that combine the individual analysis steps. To meet this demand and support the expanding user and developer community, we developed nf-core/airrflow (AIRRflow), a Nextflow workflow that integrates the individual Immcantation tools into a high-throughput analysis pipeline. The workflow offers parallelization, scalability, and compatibility with diverse computing infrastructures, including High-Performance Computing (HPC) clusters and commercial clouds. It is part of the nf-core project, a community-driven effort collecting Nextflow pipelines with an emphasis on robustness and reproducibility. This proposal aims to enhance the usability, findability, accessibility, interoperability, and scalability of AIRRflow to cater to a broader audience in the IID research community.
The proposed aims include adding new functionality to handle the integration of data from large numbers of subjects and to facilitate interpretability. This functionality includes embedding methods that translate receptor sequences to length-independent numerical vectors suitable for machine learning, determination of convergent responses across infectious and immune-mediated diseases, and annotation of receptor specificity leveraging public databases like IEDB. To enhance accessibility of data from public databases and ensure compliance with FAIR software principles, we will include automated data download from the Sequence Read Archive (SRA) and ImmPort, expand the supported data types to RNA-seq and single-cell RNA-seq, implement scalability tests, and make the workflow metadata accessible through suitable portals like the NIAID Data Ecosystem Discovery Portal. We will actively work towards expanding the user base by offering hands-on training, tutorials targeting relevant use cases for the IID research community, and community engagement events, and we will gather feedback through multiple channels including surveys, GitHub issue tracking, and Slack. These improvements will make AIRRflow an even more valuable resource for researchers in the IID community.

Key facts

NIH application ID: 10948588
Project number: 1U01AI184647-01
Recipient: YALE UNIVERSITY
Principal Investigator: Gisela Gabernet
Activity code: U01
Funding institute: NIH
Fiscal year: 2024
Award amount: $499,570
Award type: 1
Project period: 2024-08-09 → 2025-05-31