Reliable post hoc interpretations of deep learning in genomics

NIH RePORTER · NIH · R01 · $129,318

Abstract

PROJECT SUMMARY Understanding how transcription factors coordinate to bind non-coding DNA provides mechanistic insights into transcriptional regulation. Recent developments in deep neural networks (DNNs) have revolutionized our ability to study regulatory genomics. Although DNNs predict more accurately than traditional computational genomics methods, their low interpretability has earned them a reputation as black boxes. To address this gap, post hoc model interpretability methods have emerged to identify the important features that a network has learned. Of these, attribution maps have demonstrated promise: they provide an importance score for each nucleotide in a given sequence, and these scores have a natural interpretation as single-nucleotide variant effects. In principle, attribution maps should contain enough information to identify the motifs that are important for cell-type-specific regulatory functions and to annotate their positions at base resolution. In practice, however, attribution maps are often noisy; in addition to motifs, they contain spurious importance scores for arbitrary nucleotides, for reasons that are not well established. Despite their promise, interpreting a DNN through attribution maps therefore remains challenging.

Here we propose three complementary aims that serve to maximize the biological insights that we can achieve from attribution maps for genomic DNNs. In Aim 1, we will develop a model selection framework to identify, from a set of candidate DNNs, the DNN that yields both high generalization performance and interpretable attribution maps. In Aim 2, we will develop robust training strategies based on regularization and data augmentations tailored for genomics, with the broader aim of ensuring that DNNs yield high-quality attribution maps and high generalization. In Aim 3, we will develop and employ interpretable computational methods that analyze attribution maps directly to facilitate the discovery of functional motifs and annotate their positions.

Each aim will be implemented as open-source software in TensorFlow and PyTorch. As the number of deep learning applications in genomics is rising quickly, these user-friendly computational tools will greatly benefit the biomedical community by enabling robust training and interpretability analysis for any DNN trained on functional genomics assays. This, in turn, will drive new discoveries in cis-regulatory biology across the many biological systems to which deep learning has already been applied and in the new applications that will continue to emerge.
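To make the idea of an attribution map concrete, the sketch below computes per-nucleotide importance scores by in silico mutagenesis, which is one way to read the scores as single-nucleotide variant effects. It is an illustration only and not the project's method: `toy_model`, `MOTIF`, and `ism_attribution` are hypothetical names, and the "model" is a hard-coded motif matcher standing in for a trained DNN.

```python
# Hypothetical stand-in for a trained genomic DNN: it scores a sequence by
# its best match to one hard-coded motif. The motif is an arbitrary
# illustrative choice, not something taken from the project.
MOTIF = "TGACTCA"


def toy_model(seq: str) -> float:
    """Return the largest number of motif matches over all windows."""
    best = 0
    for start in range(len(seq) - len(MOTIF) + 1):
        window = seq[start:start + len(MOTIF)]
        matches = sum(a == b for a, b in zip(window, MOTIF))
        best = max(best, matches)
    return float(best)


def ism_attribution(seq: str, model) -> list:
    """In silico mutagenesis attribution map.

    For each position, substitute each of the three alternative
    nucleotides, record the drop in model output relative to the
    reference sequence, and average the three drops.
    """
    reference = model(seq)
    scores = []
    for i, base in enumerate(seq):
        drops = []
        for alt in "ACGT":
            if alt == base:
                continue
            mutant = seq[:i] + alt + seq[i + 1:]
            drops.append(reference - model(mutant))
        scores.append(sum(drops) / len(drops))
    return scores


# One motif instance embedded in a neutral background.
seq = "CCCCC" + MOTIF + "CCCCC"
scores = ism_attribution(seq, toy_model)
for base, score in zip(seq, scores):
    print(base, score)
```

With this toy model the map is clean: only the motif positions receive nonzero scores. Attribution maps from real DNNs are typically noisier, which is the problem the abstract describes.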

Key facts

NIH application ID: 11100471
Project number: 3R01GM149921-02S1
Recipient: COLD SPRING HARBOR LABORATORY
Principal Investigator: Peter K Koo
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $129,318
Award type: 3
Project period: 2023-08-01 → 2027-04-30