Interpretable Computational Models of Functional Genomics Data

NIH RePORTER · NIH · R01 · $417,299

Abstract

PROJECT SUMMARY Understanding how the coordination of cis-regulatory elements (CREs) influences biological processes, such as transcription and alternative splicing, is a major goal in computational genomics. This remains a challenge because CRE activity at any given locus may depend on a host of other factors, including sequence context and/or the presence of other CREs nearby. Recent developments in deep convolutional neural networks (CNNs) have revolutionized our ability to predict regulatory functions from DNA sequence. Unlike previous computational methods based on position-weight matrices, which capture an additive model of CREs, CNNs can, in principle, also learn higher-order dependencies within the CRE, with other CREs, and with the broader sequence context. However, CNNs are essentially black-box models, with parameters that don’t have clear biological meaning. Hence, it remains a challenge to translate the improved predictions of a CNN into new biological insights. Here we propose to develop three different computational methods that can comprehensively characterize higher-order interactions within CREs and across different CREs from functional genomics data, specifically ChIP-seq and CLIP-seq data publicly available through ENCODE. Each method serves as its own separate Aim and will be developed in parallel. In Aim 1, we will develop a new post hoc model interpretability method that employs interpretable quantitative models, originally developed to understand complex genetic interactions in laboratory-based comprehensive mutagenesis (e.g., multiplex assays of variant effects), to characterize CRE dependencies learned by a CNN, using synthetic sequences to target specific biological hypotheses. In Aim 2, we will develop new CNN architectures where the learned parameters will express higher-order interactions that have direct biological interpretations.
In Aim 3, we will combine a Bayesian nonparametric framework for modeling CREs with CNN-based CRE annotations and GPU acceleration to develop new methods for understanding how CREs are specified in the genome. Successful completion of these Aims will provide a leap forward in our understanding of higher-order CRE dependencies that are exploited but have not yet been fully revealed by CNNs. This work will provide the community with: (1) a new suite of open-source computational tools that address the problem of modeling CREs and their dependencies in functional genomics data; and (2) a comprehensive genome-wide catalogue of CRE syntax for transcription factors and RNA-binding proteins that will be hosted on a user-friendly webserver.

Key facts

NIH application ID: 10453055
Project number: 1R01HG012131-01A1
Recipient: COLD SPRING HARBOR LABORATORY
Principal Investigator: Peter K Koo
Activity code: R01
Funding institute: NIH
Fiscal year: 2022
Award amount: $417,299
Award type: 1
Project period: 2022-09-07 → 2027-06-30