The Integration of Trans-omics for Precision Medicine (TOPMED) and Other Heart, Lung, Blood and Sleep (HLBS) Data Sets with the Data Commons

NIH RePORTER · NIH · OT3 · $8,750,000

Abstract

The life sciences are in the midst of a data revolution. Cheap and accurate genome sequencing is a reality, high-resolution imaging is becoming routine, and clinical data is increasingly stored in machine-readable formats. These breakthroughs have brought us to the threshold of a new era in biomedicine, one where the data sciences hold the potential to propel our understanding and treatment of human disease. Achieving this potential, however, will require creating software platforms that can support storing, sharing, and analyzing data at unlimited scale. In this application, we propose to address this unmet need by bringing together three groups (the University of Chicago, the Broad Institute, and the University of California at Santa Cruz), each with a strong track record of developing production-grade software platforms to support flagship scientific efforts, including the All of Us Cohort Program, the Genomic Data Commons (GDC) and its affiliated NCI Cloud Pilots program, and the Human Cell Atlas Data Coordination Platform (HCA DCP). Our goal is to align and integrate our individual efforts at building data platforms, in order to build a cohesive environment that can serve the needs of the NIH Data Commons and beyond. Because these platforms were each developed to fulfill differing use cases, there is currently far more complementarity than overlap between them. For example, Dr. Grossman has extensive expertise in running a hybrid cloud at scale to support the needs of the GDC; this provides cost benefits around data transport and egress that would be invaluable to the NIH Data Commons. Similarly, Dr. Philippakis has developed a cloud-based model of collaborative workspaces (FireCloud) and software for management of secondary data use restrictions (DUOS), and Dr. Paten has long been a leader in developing and implementing standardized APIs as part of the GA4GH. It is this complementarity that motivates us to integrate our efforts.
In the sections below, we present our plans for creating the Commons Alliance Platform. In addition to having a unified technical vision for what is needed, we are also aligned around a core set of guiding principles:
(1) Open-source: All the software we develop, from user interfaces down to cloud metal, is open-source. This includes not only the software that would be funded via this awarding mechanism, but all software developed and deployed by our team.
(2) Modular and interoperable: A design principle of all complex software undertakings is “separation of concerns,” i.e., the notion that there should be a clean division between architectural components, each encapsulated by well-defined interfaces. We are committed to building modular and interoperable software and, in doing so, encouraging the creation of an ecosystem around it.
(3) Standards-driven: Our team is committed to creating and utilizing standardized APIs and data formats. We have been leaders in GA4GH since its founding, chairing various working ...

Key facts

NIH application ID: 10267909
Project number: 3OT3HL142481-01S3
Recipient: UNIVERSITY OF CALIFORNIA SANTA CRUZ
Principal Investigator: ROBERT L. GROSSMAN
Activity code: OT3
Funding institute: NIH
Fiscal year: 2020
Award amount: $8,750,000
Award type: 3
Project period: 2017-09-30 → 2022-03-30