Opening the Black Box of Machine Learning Models

NIH RePORTER · NIH · R35 · $388,750

Abstract

Project Summary

Biomedical data is growing rapidly in quantity, scope, and generality, expanding opportunities to discover novel biological processes and clinically translatable outcomes. Machine learning (ML), a key technology in modern biology that addresses these changing dynamics, aims to infer meaningful interactions among variables by learning their statistical relationships from data consisting of measurements on variables across samples. Accurate inference of such interactions from big biological data can lead to novel biological discoveries, therapeutic targets, and predictive models for patient outcomes. However, a greatly increased hypothesis space, complex dependencies among variables, and opaque “black-box” ML models pose open challenges. To meet these challenges, we have been developing innovative, rigorous, and principled ML techniques to infer reliable, accurate, and interpretable statistical relationships in various kinds of biological network inference problems, pushing the boundaries of both ML and biology. Fundamental limitations of current ML techniques leave many future opportunities to translate inferred statistical relationships into biological knowledge, as exemplified by a standard biomarker discovery problem, which is extremely important for precision medicine. Biomarker discovery using high-throughput molecular data (e.g., gene expression data) has significantly advanced our knowledge of molecular biology and genetics. The current approach attempts to find a set of features (e.g., gene expression levels) that best predict a phenotype and then uses the selected features, or molecular markers, to determine the molecular basis for the phenotype. However, the low rates of successful replication in independent data and of translation into clinical practice point to three challenges posed by the current ML approach.
First, high dimensionality, hidden variables, and feature correlations create a discrepancy between predictability (i.e., statistical associations) and true biological interactions; we need new feature selection criteria so that the model explains, rather than simply predicts, phenotypes. Second, complex models (e.g., deep learning or ensemble models) can describe intricate relationships between genes and phenotypes more accurately than simpler, linear models, but they lack interpretability. Third, analyzing observational data without conducting interventional experiments cannot establish causal relations. To address these problems, we propose an integrated machine learning methodology for learning interpretable models from data that will: 1) select interpretable features likely to provide meaningful phenotype explanations, 2) make interpretable predictions by estimating the importance of each feature to a prediction, and 3) iteratively validate and refine predictions through interventional experiments. For each challenge, we will develop a generalizable ML framework that focuses on different aspects of model interpre...

Key facts

NIH application ID: 10437684
Project number: 5R35GM128638-05
Recipient: UNIVERSITY OF WASHINGTON
Principal Investigator: Su-In Lee
Activity code: R35
Funding institute: NIH
Fiscal year: 2022
Award amount: $388,750
Award type: 5
Project period: 2018-07-01 → 2023-06-30