# A More Perfect Union: Leveraging Clinically Deployed Models and Cancer Epidemiology Cohort Data to Improve AI/ML Readiness of NIH-Supported Population Sciences Resources

> **NIH U01** · BECKMAN RESEARCH INSTITUTE/CITY OF HOPE · 2022 · $348,657

## Abstract

NIH has invested hundreds of millions of dollars in large-scale prospective observational cohorts. These
studies' diverse and valuable data have been used to generate important discoveries about how lifestyle and
environment affect health and disease. These high-dimensional and multi-modal real-world data can enable
broad research, including new AI/ML applications. Unfortunately, the standard methods cohorts use to store,
manage, analyze, and share their data are not ideal for contemporary AI/ML use. This creates a “readiness
gap” that hinders new AI/ML research. This project proposes an innovative yet feasible approach to close that
gap by improving AI/ML readiness at multiple levels. Our multidisciplinary team includes AI/ML experts at City
of Hope (COH); experienced population scientists from the California Teachers Study (CTS) cohort team; and
cloud computing specialists from the San Diego Supercomputer Center's (SDSC) Sherlock Cloud. The CTS
includes 133,477 female participants who have been followed continuously since 1995. Through surveys and
linkages, the CTS has collected comprehensive exposure and lifestyle data and has identified over 28,000
cancers; over 34,000 deaths; and over 800,000 individual hospitalizations. Based on an AI/ML readiness
framework, we will update the CTS's data & computing architecture; reconfigure data exploration and
aggregation tools and documentation; and use CTS data to test, evaluate, and expand existing, clinically
deployed AI/ML models. First, we will expand the current private CTS data analytics cloud to include a new
scalable computing environment specifically for AI/ML. We will deploy Amazon Web Services (AWS) resources
for AI/ML within our secure CTS enclave and provision GPU-enabled instances running a full suite of scientific
computing and AI/ML packages in Python and Jupyter Notebooks. Second, we will generate embeddings in
the CTS data to reduce the data complexity that is a barrier to AI/ML applications. Embeddings are low-
dimensional latent representations that compress data from multiple modalities into vectors that represent a
compact embedding, or abstracted summary, of a participant's data. Use of unsupervised learning and an
autoencoder deep neural network will cluster CTS data into phenotype-based subgroups that can be used for
essential AI/ML functions, such as cohort discovery, close-neighbor identification, and imputation. Third, we will
augment clinically deployed risk models at COH (e.g., for readmissions) with CTS data to directly evaluate the
potential for real-world cohort data to improve model performance and the portability of clinical models into
cohort populations. Each of these three initiatives will be documented in interactive tutorial notebooks that will
be FAIR for the research community. This project includes a balanced combination of people, process, and
technology: a new multidisciplinary team of experts from relevant fields; new general-purpose e...
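
The embedding step described in the abstract can be illustrated with a toy sketch. The abstract does not specify the network architecture, features, or training procedure, so everything below is an assumption made for illustration: the data are synthetic (not CTS data), the autoencoder has a single small hidden layer, and the names `standardize`, `train_autoencoder`, `encode`, and `nearest_neighbors` are hypothetical. The sketch compresses each synthetic participant record into a two-dimensional vector and then looks up close neighbors in that embedding space.

```python
import numpy as np


def standardize(X):
    """Scale each feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def train_autoencoder(X, latent_dim=2, lr=0.01, steps=2000, seed=0):
    """Fit a one-hidden-layer autoencoder with full-batch gradient descent.

    Returns the learned parameters and the reconstruction loss per step.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, latent_dim))  # encoder weights
    b1 = np.zeros(latent_dim)
    W2 = rng.normal(scale=0.1, size=(latent_dim, d))  # decoder weights
    b2 = np.zeros(d)
    losses = []
    for _ in range(steps):
        Z = np.tanh(X @ W1 + b1)          # embedding (latent vectors)
        X_hat = Z @ W2 + b2               # reconstruction
        err = X_hat - X
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))
        # Backpropagate the mean squared reconstruction error.
        G = 2.0 * err / n
        dW2 = Z.T @ G
        db2 = G.sum(axis=0)
        dA = (G @ W2.T) * (1.0 - Z ** 2)  # derivative of tanh
        dW1 = X.T @ dA
        db1 = dA.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1, W2, b2), losses


def encode(X, params):
    """Map records to their low-dimensional embedding."""
    W1, b1, _, _ = params
    return np.tanh(X @ W1 + b1)


def nearest_neighbors(Z, index, k=3):
    """Return the indices of the k records closest to Z[index]."""
    dist = np.linalg.norm(Z - Z[index], axis=1)
    dist[index] = np.inf                  # exclude the record itself
    return np.argsort(dist)[:k]


# Synthetic stand-in for cohort data: two groups, six numeric features each.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=1.5, scale=0.5, size=(50, 6))
group_b = rng.normal(loc=-1.5, scale=0.5, size=(50, 6))
X = standardize(np.vstack([group_a, group_b]))

params, losses = train_autoencoder(X)
Z = encode(X, params)
neighbors = nearest_neighbors(Z, index=0, k=3)
print("embedding shape:", Z.shape)
print("closest records to record 0:", neighbors)
```

In a setting like the one the abstract describes, the same embedding vectors could also feed a clustering step for subgroup discovery or a neighbor-based imputation step; a production model would likely use a deep learning framework and a deeper network than this sketch.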

## Key facts

- **NIH application ID:** 10594304
- **Project number:** 3U01CA199277-08S2
- **Recipient organization:** BECKMAN RESEARCH INSTITUTE/CITY OF HOPE
- **Principal Investigator:** James V Lacey
- **Activity code:** U01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $348,657
- **Award type:** 3
- **Project period:** 2015-09-01 → 2025-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10594304

## Citation

> US National Institutes of Health, RePORTER application 10594304, A More Perfect Union: Leveraging Clinically Deployed Models and Cancer Epidemiology Cohort Data to Improve AI/ML Readiness of NIH-Supported Population Sciences Resources (3U01CA199277-08S2). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10594304. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
