Implementing the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL)

NIH RePORTER · NIH · U24 · $645,000

Abstract

Project Summary

NIH-sponsored biomedical research is increasingly moving to cloud-based data storage and analysis systems. The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) powers the next generation of computational genomics research across the NHGRI using cloud-scale data and compute resources. The platform provides multiple entry points for data access and analysis, including data search with Gen3, workflows with Terra and Dockstore, notebook environments including Jupyter and RStudio, Bioconductor packages for analysis leveraging AnVIL APIs and services, and Galaxy instances for interactive analysis. By providing a unified environment for data management and compute, AnVIL eliminates the need for data movement, allows for controlled access to sensitive data, and provides elastic computing resources that researchers can acquire as needed.

The NIH Cloud Platform Interoperability (NCPI) effort aims to address interoperability issues across NIH cloud systems, including AnVIL, by implementing key technologies and standards. We will work with NCPI working groups to define use cases and lead outreach, and we will implement several major technologies within AnVIL. First, we will enhance support for the NIH Researcher Auth Service (RAS) to enable researchers to establish their identity and access data they are authorized to use across Terra and Galaxy. Second, we will enhance support for the Global Alliance for Genomics and Health (GA4GH) Data Repository Service (DRS) so that data consumers can access data objects in a single, standard way. Third, we will enhance support in AnVIL for the Fast Healthcare Interoperability Resources (FHIR) standard. This will facilitate access to eMERGE and related projects by users in AnVIL and other NCPI platforms.

Next, we will develop new resources and guides for budgeting for cloud computing. For this, we will identify the most commonly used tools and workflows run within Galaxy, and model the cost of these tools by varying data sets (e.g., sequencing coverage or number of genomes) and computational resources (e.g., number of CPUs, peak RAM). A statistical analysis of the results will be published and will serve as a tool to reduce cost as a barrier to cloud research and cloud interoperability.

Finally, interoperability of workflow generation is hampered by the fact that not all cloud platforms support the same sets of workflow languages. To address this, we will develop a Kubernetes-based computational engine to link workflows from multiple workflow languages. This work will initially focus on Snakemake workflows and will then be extended to support WDL, CWL, and Galaxy workflows using their respective execution engines. This will simplify the transition from institutional HPC to the cloud and make it possible for researchers to seamlessly execute workflows across NCPI platforms.

Key facts

NIH application ID: 10405959
Project number: 3U24HG010263-04S1
Recipient: JOHNS HOPKINS UNIVERSITY
Principal Investigator: Jeremy Goecks
Activity code: U24
Funding institute: NIH
Fiscal year: 2021
Award amount: $645,000
Award type: 3
Project period: 2018-09-21 → 2023-06-30