# Guiding humans to create better labeled datasets for machine learning in biomedical research

> **NIH R01** · NORTHWESTERN UNIVERSITY · 2022 · $403,132

## Abstract

Machine learning (ML) has seen tremendous advances in the past decade, fueled by growth in computing and
the availability of large labeled datasets. While the impact of these advances on clinical and biomedical
research is potentially significant, these applications face unique challenges due to the difficulty of acquiring
labels from biomedical experts. Furthermore, ML algorithms often fail to generalize across institutions or
datasets due to measurement biases (e.g., from MR scanners) or intrinsic demographic or biological differences
between cohorts or datasets, which limits their impact in biomedical science. This proposal will develop new
methodology and open-source software that biomedical data scientists can use with their applications to (1)
improve data labeling by identifying the samples whose labels provide the most benefit for training ML
algorithms; (2) improve generalization of ML models across institutions; and (3) perform this work on scalable
cloud platforms. We will first explore how to improve upon methods known as active learning that interactively
construct labeled datasets by having an algorithm select samples that address its weaknesses and present
these samples to an expert for labeling. We will then investigate how these samples can be selected to
improve the performance of ML algorithms across multiple institutions by learning robust patterns that are not
specific to any one site. Finally, we will develop an extendable software framework that developers can
integrate into their own applications to take advantage of these methods, and that can operate on cloud
platforms to support scalable analysis of large datasets. This work will be developed through a combination of
simulation studies using a unique repository of over 280,000 human markups of digital pathology images at
multiple institutions, and user studies of the developed software frameworks, focused on applications in
perinatal pathology and the human placenta. The software tools will impact a broad variety of biomedical
applications beyond pathology where data labeling and multi-institutional studies remain challenging.
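
The sample-selection step of active learning described above can be illustrated with a minimal sketch. It uses uncertainty sampling with predictive entropy, which is one common selection strategy and is assumed here only for illustration; the abstract does not say which selection criteria the project uses. The function names and the `pool_probs` values are hypothetical.

```python
import numpy as np


def predictive_entropy(probs):
    """Entropy of each row of class probabilities (higher = more uncertain)."""
    probs = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(probs * np.log(probs), axis=1)


def select_for_labeling(probs, k):
    """Return indices of the k most uncertain samples in the unlabeled pool."""
    scores = predictive_entropy(probs)
    # argsort is ascending, so reverse it to put the most uncertain first
    return np.argsort(scores)[::-1][:k]


# Hypothetical model outputs for five unlabeled samples and two classes.
pool_probs = np.array([
    [0.99, 0.01],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.50, 0.50],
    [0.95, 0.05],
])

# These samples would be presented to an expert for labeling.
queried = select_for_labeling(pool_probs, k=2)
print(queried.tolist())
```

In a full loop, the expert's labels for the queried samples would be added to the training set, the model retrained, and the selection repeated on the remaining pool.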

## Key facts

- **NIH application ID:** 10466914
- **Project number:** 5R01LM013523-02
- **Recipient organization:** NORTHWESTERN UNIVERSITY
- **Principal Investigator:** Lee Cooper
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $403,132
- **Award type:** 5
- **Project period:** 2021-09-01 → 2025-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10466914

## Citation

> US National Institutes of Health, RePORTER application 10466914, Guiding humans to create better labeled datasets for machine learning in biomedical research (5R01LM013523-02). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10466914. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*