# Opening the Black Box of Machine Learning Models

> **NIH R35** · UNIVERSITY OF WASHINGTON · 2021 · $388,750

## Abstract

Biomedical data is vastly increasing in quantity, scope, and generality, expanding opportunities to discover
novel biological processes and clinically translatable outcomes. Machine learning (ML), a key technology in
modern biology that addresses these changing dynamics, aims to infer meaningful interactions among variables
by learning their statistical relationships from data consisting of measurements on variables across samples.
Accurate inference of such interactions from big biological data can lead to novel biological discoveries,
therapeutic targets, and predictive models for patient outcomes. However, a greatly increased hypothesis space,
complex dependencies among variables, and “black-box” ML models pose difficult open challenges.
To meet these challenges, we have been developing innovative, rigorous, and principled ML techniques to infer
reliable, accurate, and interpretable statistical relationships in various kinds of biological network inference problems,
pushing the boundaries of both ML and biology.

Fundamental limitations of current ML techniques leave many future opportunities to translate inferred
statistical relationships into biological knowledge, as exemplified in a standard biomarker discovery problem –
an extremely important problem for precision medicine. Biomarker discovery using high-throughput molecular
data (e.g., gene expression data) has significantly advanced our knowledge of molecular biology and genetics.
The current approach attempts to find a set of features (e.g., gene expression levels) that best predict a phenotype
and use the selected features, or molecular markers, to determine the molecular basis for the phenotype.
However, the low success rates of replication in independent data and of reaching clinical practice indicate three
challenges posed by the current ML approach. First, high dimensionality, hidden variables, and feature correlations
create a discrepancy between predictability (i.e., statistical associations) and true biological interactions; we need
new feature selection criteria to make the model better explain rather than simply predict phenotypes. Second,
complex models (e.g., deep learning or ensemble models) can more accurately describe intricate relationships
between genes and phenotypes than simpler, linear models, but they lack interpretability. Third, analyzing
observational data without conducting interventional experiments does not prove causal relations.

To address these problems, we propose an integrated machine learning methodology for learning interpretable models
from data that will: 1) select interpretable features likely to provide meaningful phenotype explanations, 2) make
interpretable predictions by estimating the importance of each feature to a prediction, and 3) iteratively validate
and refine predictions through interventional experiments. For each challenge, we will develop a generalizable
ML framework that focuses on different aspects of model interpre...
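
To make "estimating the importance of each feature to a prediction" concrete, here is a minimal sketch of one well-known attribution scheme: exact Shapley values computed by enumerating feature coalitions. It is an illustration only and is not taken from the abstract. The toy model, the gene values, and the all-zero baseline are hypothetical, and features left out of a coalition are simply replaced by their baseline value, which is one of several possible conventions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for a single prediction.

    Assumption: a feature that is "absent" from a coalition is set to
    its baseline value. Cost grows exponentially with len(x), so this
    is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model: three gene expression levels -> phenotype score.
# Genes 0 and 1 interact; gene 2 has no effect on the score.
def toy_model(z):
    return 2.0 * z[0] + 1.0 * z[1] + 3.0 * z[0] * z[1]


x = [1.0, 2.0, 5.0]          # hypothetical sample
baseline = [0.0, 0.0, 0.0]   # hypothetical reference sample
phi = shapley_values(toy_model, x, baseline)
```

In this toy case the attributions sum to the difference between the prediction for `x` and the prediction for `baseline`, the interaction term is split evenly between genes 0 and 1, and the unused gene 2 receives an attribution of zero.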

## Key facts

- **NIH application ID:** 10224845
- **Project number:** 5R35GM128638-04
- **Recipient organization:** UNIVERSITY OF WASHINGTON
- **Principal Investigator:** Su-In Lee
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $388,750
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2018-07-01 → 2023-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10224845

## Citation

> US National Institutes of Health, RePORTER application 10224845, Opening the Black Box of Machine Learning Models (5R35GM128638-04). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10224845. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
