# Large-scale integrated data analysis of lymphocyte receptor repertoires with workflows

> **NIH U01** · YALE UNIVERSITY · 2024 · $499,570

## Abstract

Over the last decade, high-throughput B cell and T cell receptor repertoire sequencing has become a
fundamental method for investigating adaptive immune responses. The Immcantation framework, consisting of
open-source Python and R packages, provides a comprehensive analytical ecosystem for this Adaptive Immune
Receptor Repertoire sequencing (AIRR-seq) data analysis, covering critical steps like pre-processing, clonal
relationship identification, lineage reconstruction, and somatic hypermutation analysis. This framework has
gained widespread usage in infectious and immune-mediated disease research, with over 100,000 downloads
in 2022. However, as sequencing technologies advance and datasets grow larger, there is a need for scalable
computational workflows combining the individual analysis steps. To meet this demand and support the
expanding user and developer community, we developed nf-core/airrflow (AIRRflow), a Nextflow workflow that
integrates the individual Immcantation tools into a high-throughput analysis pipeline. The workflow offers
parallelization, scalability, and compatibility with diverse computing infrastructures, including High-Performance
Computing (HPC) clusters and commercial clouds. It is part of the nf-core project, a community-driven effort
collecting Nextflow pipelines with an emphasis on robustness and reproducibility. This proposal aims to enhance
AIRRflow usability, findability, accessibility, interoperability and scalability to cater to a broader audience in the
infectious and immune-mediated disease (IID) research community. The proposed aims include adding new
functionality to handle the integration of data from large numbers of subjects and facilitate interpretability,
including embedding methods that translate receptor sequences to length-independent numerical vectors
suitable for machine learning, determination of convergent responses across infectious and immune-mediated
diseases, and annotation of receptor specificity leveraging public databases like IEDB. To enhance accessibility
of data from public databases and ensure compliance with FAIR software principles, we will include automated
data download from the Sequence Read Archive (SRA) and ImmPort, expand the supported data types to
RNAseq and single-cell RNAseq, implement scalability tests, and make the workflow metadata accessible
through suitable portals like the NIAID Data Ecosystem Discovery Portal. We will actively work towards
expanding the user base by offering hands-on training, tutorials targeting relevant use cases for the IID research
community, and community engagement events, gathering feedback through multiple channels including
surveys, GitHub issue tracking, and Slack. These improvements will make AIRRflow an even more valuable
resource for researchers in the IID community.

## Key facts

- **NIH application ID:** 10948588
- **Project number:** 1U01AI184647-01
- **Recipient organization:** YALE UNIVERSITY
- **Principal Investigator:** Gisela Gabernet
- **Activity code:** U01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $499,570
- **Award type:** 1
- **Project period:** 2024-08-09 → 2025-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10948588

## Citation

> US National Institutes of Health, RePORTER application 10948588, Large-scale integrated data analysis of lymphocyte receptor repertoires with workflows (1U01AI184647-01). Retrieved via AI Analytics 2026-06-12 from https://api.ai-analytics.org/grant/nih/10948588. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
