# The AnVIL Data Ecosystem

> **NIH U24** · BROAD INSTITUTE, INC. · 2020 · $4,500,000

## Abstract

In this proposal, we bring together a unified team with a strong track record of developing secure and scalable software systems to support flagship scientific efforts, such as the All of Us Research Program, the Genomic Data Commons (GDC), and the Human Cell Atlas (HCA). Our group will leverage these experiences, and the software developed for them, to create an ecosystem of applications that will both serve the needs of the AnVIL and interoperate with other NIH data resources. We will accomplish this through the following Aims:

- **Aim 1 (Software Engineering):** Leverage existing software capabilities to create tools for storing, sharing, and analyzing AnVIL datasets at unlimited scale. During the past five years, our groups have created a suite of modular and open source software capabilities that address key needs in genomic data science. We will leverage these existing capabilities and extend them in novel directions to address AnVIL-specific scientific goals relating to human genetics and functional genomics.
- **Aim 2 (Data Engineering):** Curate data and metadata resources so that they are easily accessible. The AnVIL will not only be a suite of software services, but also a vast repository of genotypic and phenotypic information. For this resource to be usable by the community, it must be organized, curated, and made accessible. We will accomplish this by processing genomic datasets using a consistent set of best-practices pipelines, and mapping phenotypes to a common data model.
- **Aim 3 (Operations):** Stand up and support a data environment for the AnVIL community, and integrate it with other NIH resources as part of a federated NIH-wide genomic data commons. The modular components of Aim 1 are critical building blocks, but they alone are not enough to meet the needs of the AnVIL; they must also be stood up as services and integrated into a coherent entity, which we call a “data environment.” We propose to create an AnVIL data environment that will enable researchers to access datasets in a secure, compliant, and facile manner.

The guiding principle of these efforts is that progress in genomic science will happen most rapidly if there is a diversity of solutions created by a plurality of groups. Towards that end, our approach to engineering the software components of Aim 1, curating the datasets of Aim 2, and operating the software services of Aim 3 is to catalyze an ecosystem of activity around the AnVIL. Our proposal focuses not only on creating and operating software services ourselves, but also on incorporating third-party solutions. We propose to accomplish this by architecting the AnVIL data environment according to the following principles: (i) modularity, (ii) openness, (iii) community engagement, (iv) standardization, and (v) interoperability.

## Key facts

- **NIH application ID:** 9990833
- **Project number:** 5U24HG010262-03
- **Recipient organization:** BROAD INSTITUTE, INC.
- **Principal Investigator:** Robert J Carroll
- **Activity code:** U24
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $4,500,000
- **Award type:** 5
- **Project period:** 2018-09-19 → 2023-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9990833

## Citation

> US National Institutes of Health, RePORTER application 9990833, The AnVIL Data Ecosystem (5U24HG010262-03). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/9990833. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*