ENRICHing NIH Imaging Datasets to Prepare them for Machine Learning

NIH RePORTER · NIH · R01 · $350,868

Abstract

PROJECT SUMMARY

Objective: The goal of the parent proposal is to develop and optimize deep learning (DL) to improve detection of congenital heart disease (CHD) from fetal ultrasound imaging. This work includes evaluation of an imaging collection spanning two decades, tens of thousands of patients, and several clinical centers across a range of healthcare settings.

Background: Through this work, we have found that performance of DL models is critically linked to the quality of the datasets used to train and test them. However, the AI/ML field lacks a complete understanding of how to measure “quality.” To date, image datasets are either described subjectively or measured crudely by size, i.e., the number of images they contain. However, “more is better” fails to account for the key importance of diversity in the quality of image datasets. In parent Aim 1, we sought to develop better metrics for dataset quality and content, founded in information theory and leveraging diversity. This work has already proven quite useful for our parent use case, but it is also extremely important for all imaging datasets in order to save on data storage/transfer costs, harmonize data intelligently, save on laborious image labeling, screen for artifacts both anticipated and unanticipated, and ensure diversity at several levels.

Preliminary Studies: Our multi-disciplinary team in imaging, DL, and information theory has successfully developed a framework to analyze image datasets, called ENRICH. ENRICH consists of two main steps. First, a similarity metric is calculated for all pairs of images in a given dataset, forming a matrix of pairwise-similarity values. Second, an instance-selection algorithm operates on the matrix to describe its diversity and/or curate the most informative images. ENRICH is customizable in that different choices of pairwise image similarity metric and of curation algorithm can be used for different tasks. An initial implementation of ENRICH aimed at reducing redundancy allowed us to achieve the same DL model performance in a CHD classification task from only a fraction of the original training data. It also identified data structure and imaging artifacts without a priori labeling, among other achievements (see Research Strategy).

Goals of Supplement: The next logical step is to apply ENRICH to more biomedical datasets, both to further validate its utility and to provide quantitative descriptors of quality on datasets important for the research community.

Aims: (1) We will run ENRICH on several NIH imaging datasets, including (2) validating labels and adding annotations to targeted subsets of these datasets. (3) We will document and publish these methods for the research community to use, including connecting with the original NIH repository for each dataset.

Environment and Impact: The proposed work is supported in an outstanding environment at the crossroads of data science, imaging, and information theory and will provide valuable tools and in...
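
The two-step structure the abstract describes for ENRICH (a pairwise-similarity matrix, then an instance-selection algorithm that operates on it) can be illustrated with a minimal sketch. The abstract does not specify which similarity metric or selection algorithm the authors use, so everything here is an assumption chosen for illustration: cosine similarity on flattened pixels stands in for the metric, a greedy "least similar to what is already selected" rule stands in for the curation step, and the function names `pairwise_cosine_similarity` and `select_diverse` are hypothetical.

```python
from __future__ import annotations

import numpy as np


def pairwise_cosine_similarity(images: np.ndarray) -> np.ndarray:
    """Step 1 (illustrative): similarity for every pair of images, as an (n, n) matrix.

    Cosine similarity on flattened pixels is an assumed stand-in metric,
    not necessarily the one used by ENRICH.
    """
    flat = images.reshape(len(images), -1).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)  # guard against all-zero images
    return unit @ unit.T


def select_diverse(similarity: np.ndarray, k: int) -> list[int]:
    """Step 2 (illustrative): greedily pick k images, each one the least
    similar to the images already picked.

    This greedy rule is an assumed stand-in for an instance-selection algorithm.
    """
    n = similarity.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of images")
    # Seed with the image that has the lowest mean similarity to the dataset.
    selected = [int(np.argmin(similarity.mean(axis=1)))]
    # max_sim[i] = highest similarity between image i and any selected image.
    max_sim = similarity[selected[0]].copy()
    while len(selected) < k:
        max_sim[selected] = np.inf  # already-selected images are never re-picked
        nxt = int(np.argmin(max_sim))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, similarity[nxt])
    return selected


# Toy data (synthetic, not from any NIH dataset): three random 8x8 "images"
# followed by a near-duplicate of each, so the dataset is redundant by design.
rng = np.random.default_rng(0)
base = rng.random((3, 8, 8))
images = np.concatenate([base, base + 0.001 * rng.random((3, 8, 8))])

S = pairwise_cosine_similarity(images)
picked = select_diverse(S, k=3)
print(picked)
```

On this toy data the three selected indices come from three different duplicate pairs, which is the redundancy-reduction behavior the abstract attributes to its initial implementation. Swapping in a different metric or selection rule only requires replacing one of the two functions, mirroring the customizability described above.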

Key facts

NIH application ID: 10842910
Project number: 3R01HL150394-04S1
Recipient: UNIVERSITY OF CALIFORNIA, SAN FRANCISCO
Principal Investigator: Rima Arnaout
Activity code: R01
Funding institute: NIH
Fiscal year: 2023
Award amount: $350,868
Award type: 3
Project period: 2020-04-01 → 2025-03-31