Guiding humans to create better labeled datasets for machine learning in biomedical research

NIH RePORTER · NIH · R01 · $383,889

Abstract

Machine learning (ML) has seen tremendous advances in the past decade, fueled by growth in computing and the availability of large labeled datasets. While the impact of these advances on clinical and biomedical research is potentially significant, these applications face unique challenges due to the difficulty of acquiring labels from biomedical experts. Furthermore, ML algorithms often fail to generalize across institutions or datasets due to measurement biases (e.g., MR scanners) or intrinsic demographic or biological differences between cohorts or datasets, which limits their impact in biomedical science. This proposal will develop new methodology and open-source software that biomedical data scientists can use in their applications to (1) improve data labeling by identifying the samples that provide the most benefit for training ML algorithms; (2) improve generalization of ML models across institutions; and (3) perform this work on scalable cloud platforms. We will first explore how to improve upon active learning methods, which interactively construct labeled datasets by having an algorithm select samples that address its weaknesses and present these samples to an expert for labeling. We will then investigate how these samples can be selected to improve the performance of ML algorithms across multiple institutions by learning robust patterns that are not specific to any one site. Finally, we will develop an extensible software framework that developers can integrate into their own applications to take advantage of these methods, and that can operate on cloud platforms to support scalable analysis of large datasets.
This work will be developed through a combination of simulation studies, using a unique repository of over 280,000 human markups of digital pathology images at multiple institutions, and user studies of the developed software frameworks focused on applications in perinatal pathology and the human placenta. The software tools will impact a broad range of biomedical applications beyond pathology where data labeling and multi-institutional studies remain challenging.
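
To illustrate the active learning idea described in the abstract, the following is a minimal sketch of uncertainty sampling, one common way for an algorithm to pick the samples it is least sure about and send them to an expert. It is not the project's method. The function names, the entropy criterion, and the example probabilities are all assumptions introduced here for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predicted_probs, k):
    """Return indices of the k unlabeled samples with the highest entropy.

    High entropy means the model is uncertain, so an expert label for
    that sample is assumed to be more informative for retraining.
    """
    ranked = sorted(
        range(len(predicted_probs)),
        key=lambda i: entropy(predicted_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs for four unlabeled samples (two classes).
unlabeled_probs = [
    [0.98, 0.02],  # confident prediction
    [0.55, 0.45],  # uncertain prediction
    [0.80, 0.20],
    [0.50, 0.50],  # maximally uncertain prediction
]

queried = select_for_labeling(unlabeled_probs, k=2)
print(queried)  # [3, 1]
```

In a full loop, the queried samples would be labeled by an expert, added to the training set, and the model retrained before the next round of selection.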

Key facts

NIH application ID: 10867456
Project number: 5R01LM013523-04
Recipient: NORTHWESTERN UNIVERSITY
Principal Investigator: Lee Cooper
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $383,889
Award type: 5
Project period: 2021-09-01 → 2026-05-31