# ENRICHing NIH Imaging Datasets to Prepare them for Machine Learning

> **NIH R01** · UNIVERSITY OF CALIFORNIA, SAN FRANCISCO · 2023 · $350,868

## Abstract

PROJECT SUMMARY

**Objective:** The goal of the parent proposal is to develop and optimize deep learning (DL) to improve detection of congenital heart disease (CHD) from fetal ultrasound imaging. This work includes evaluation of an imaging collection spanning two decades, tens of thousands of patients, and several clinical centers across a range of healthcare settings.

**Background:** Through this work, we have found that the performance of DL models is critically linked to the quality of the datasets used to train and test them. However, the AI/ML field lacks a complete understanding of how to measure “quality.” To date, image datasets are either described subjectively or measured crudely by size, i.e., the number of images they contain. Yet “more is better” fails to account for the key importance of diversity in the quality of image datasets. In parent Aim 1, we sought to develop better metrics for dataset quality and content, founded in information theory and leveraging diversity. This work has already proven quite useful for our parent use case, but it is also extremely important for all imaging datasets in order to save on data storage/transfer costs, harmonize data intelligently, save on laborious image labeling, screen for artifacts both anticipated and unanticipated, and ensure diversity at several levels.

**Preliminary Studies:** Our multi-disciplinary team in imaging, DL, and information theory has successfully developed a framework to analyze image datasets, called ENRICH. ENRICH consists of two main steps. First, a similarity metric is calculated for all pairs of images in a given dataset, forming a matrix of pairwise-similarity values. Second, an instance-selection algorithm operates on the matrix to describe its diversity and/or curate the most informative images. ENRICH is customizable: different pairwise image similarity metrics and curation algorithms can be used for different tasks. An initial implementation of ENRICH aimed at reducing redundancy allowed us to achieve the same DL model performance in a CHD classification task from only a fraction of the original training data. It also identified data structure and imaging artifacts without a priori labeling, among other achievements (see Research Strategy).

**Goals of Supplement:** The next logical step is to apply ENRICH to more biomedical datasets, both to further validate its utility and to provide quantitative descriptors of quality on datasets important for the research community.

**Aims:** (1) We will run ENRICH on several NIH imaging datasets; (2) we will validate labels and add annotations to targeted subsets of these datasets; and (3) we will document and publish these methods for the research community to use, including connecting with the original NIH repository for each dataset.

**Environment and Impact:** The proposed work is supported by an outstanding environment at the crossroads of data science, imaging, and information theory and will provide valuable tools and in...
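
The two-step structure described under Preliminary Studies (a pairwise-similarity matrix, then an instance-selection pass over that matrix) can be illustrated with a minimal sketch. This is not the ENRICH implementation. The abstract does not name a similarity metric or a curation algorithm, so the sketch assumes cosine similarity over per-image feature vectors and a greedy farthest-first selection. The function names `pairwise_cosine_similarity` and `greedy_diverse_subset` and the toy feature values are hypothetical.

```python
import numpy as np


def pairwise_cosine_similarity(features):
    """Step 1 (assumed metric): cosine similarity for every pair of rows."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def greedy_diverse_subset(similarity, k):
    """Step 2 (assumed algorithm): greedy farthest-first selection.

    Repeatedly adds the item whose highest similarity to any
    already-selected item is lowest, i.e. the least redundant item.
    """
    similarity = np.asarray(similarity, dtype=float)
    n = similarity.shape[0]
    k = min(k, n)
    # Seed with the item that is least similar to the dataset on average.
    selected = [int(np.argmin(similarity.mean(axis=1)))]
    # max_sim[i]: highest similarity between item i and the selected set.
    max_sim = similarity[selected[0]].copy()
    while len(selected) < k:
        max_sim[selected] = np.inf  # exclude items that are already chosen
        candidate = int(np.argmin(max_sim))
        selected.append(candidate)
        max_sim = np.maximum(max_sim, similarity[candidate])
    return selected


if __name__ == "__main__":
    # Toy stand-in for image features: two near-duplicate pairs and one outlier.
    toy_features = np.array([
        [1.00, 0.00],
        [0.99, 0.01],
        [0.00, 1.00],
        [0.01, 0.99],
        [-1.00, 0.00],
    ])
    sim = pairwise_cosine_similarity(toy_features)
    print(greedy_diverse_subset(sim, k=3))
```

With the toy data, selecting three items keeps one representative from each group and skips the near-duplicates. That mirrors the redundancy-reduction use described in the abstract, although real image datasets would need a task-appropriate similarity metric.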

## Key facts

- **NIH application ID:** 10842910
- **Project number:** 3R01HL150394-04S1
- **Recipient organization:** UNIVERSITY OF CALIFORNIA, SAN FRANCISCO
- **Principal Investigator:** Rima Arnaout
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2023
- **Award amount:** $350,868
- **Award type:** 3
- **Project period:** 2020-04-01 → 2025-03-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10842910

## Citation

> US National Institutes of Health, RePORTER application 10842910, ENRICHing NIH Imaging Datasets to Prepare them for Machine Learning (3R01HL150394-04S1). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10842910. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
