# Interpreting function of non-coding sequences with synthetic biology and machine learning

> **NIH F31** · WASHINGTON UNIVERSITY · 2021 · $31,970

## Abstract

Most disease-associated variants lie in non-coding regions of the genome and exert their influence through
effects on gene expression. However, we lack a predictive framework to interpret such non-coding variants,
limiting how genomic data is used in precision medicine. We may be able to interpret non-coding variants with
new machine learning algorithms, but so far the practical applications of machine learning in functional
genomics have been limited because of two major challenges. First, the size and diversity of training data sets
in functional genomics are orders of magnitude smaller than in applications where machine learning has been
successful, such as image recognition and product recommendation. A second challenge is that if training data
are not collected in an appropriate in vitro cellular model, then the resulting machine learning models may not
generalize to relevant in vivo cell types. To improve the application of machine learning to non-coding variants,
I propose to address both the limited size of training data sets and the efficacy of cell culture models.

A core principle of machine learning is that model performance improves with more data. In Aim 1, I propose to
increase the size and diversity of training data by performing iterative cycles of machine learning and
experimental validation with Massively Parallel Reporter Assays (MPRAs). The key aspect of my approach is to
algorithmically design each successive MPRA library to contain sequences that are most likely to improve the
next round of modeling. I recently trained my first model on data that I collected from MPRA experiments of
cis-regulatory sequences that function in mammalian photoreceptors. To avoid any issues with cell lines, I
performed these experiments in ex vivo developing retinas, which retain the appropriate tissue architecture.
However, unlike photoreceptors, most cell types are not experimentally tractable in their native physiological
context. Thus, it will be important to determine how well in vitro cell lines recapitulate in vivo cis-regulation. In
Aim 2, I propose to determine whether a tractable cell culture model can recapitulate results from ex vivo
retinas. I will use existing MPRA data from ex vivo retinas as a standard to compare against data collected in
cell lines engineered to express combinations of photoreceptor transcription factors. I aim to address whether
engineering tractable cell lines to express tissue-specific transcription factors might be a general approach for
collecting data to train machine learning models that generalize to in vivo systems. Successful completion of
these aims will produce a general approach to increase the size and diversity of functional genomic training
data, and may result in a general method for producing experimentally tractable systems for machine learning
applications, ultimately helping us better apply genomic data to precision medicine.
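
The iterative design loop described for Aim 1 can be illustrated with a minimal sketch. The abstract does not say how sequences are chosen for each successive library, so the selection rule here (disagreement among an ensemble of bootstrap-trained ridge regressions on k-mer counts) is an assumption picked for simplicity, not the project's method. The `measure` callback is a hypothetical stand-in for running an MPRA experiment, and all function names are illustrative.

```python
import itertools
import numpy as np

K = 3  # k-mer length used for the toy sequence features (assumption)
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def featurize(seq):
    """Count overlapping k-mers in a DNA sequence."""
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        counts[KMER_INDEX[seq[i:i + K]]] += 1
    return counts


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression weights."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def select_next_library(X_train, y_train, X_pool, batch_size, rng, n_models=10):
    """Return pool positions where a bootstrap ensemble disagrees the most."""
    preds = []
    for _ in range(n_models):
        # Each ensemble member sees a different bootstrap resample.
        idx = rng.integers(0, len(y_train), size=len(y_train))
        preds.append(X_pool @ fit_ridge(X_train[idx], y_train[idx]))
    uncertainty = np.var(np.array(preds), axis=0)
    return np.argsort(uncertainty)[::-1][:batch_size]


def run_cycles(sequences, measure, n_cycles=3, batch_size=20, seed=0):
    """Alternate between model fitting and 'measuring' a newly designed library.

    `measure` is a hypothetical callback standing in for an MPRA experiment.
    """
    rng = np.random.default_rng(seed)
    X_all = np.array([featurize(s) for s in sequences])
    # Round 0: a random starting library.
    labeled = [int(i) for i in rng.choice(len(sequences), size=batch_size, replace=False)]
    activity = {i: measure(sequences[i]) for i in labeled}
    for _ in range(n_cycles):
        pool = np.array([i for i in range(len(sequences)) if i not in activity], dtype=int)
        if len(pool) == 0:
            break  # nothing left to test
        y = np.array([activity[i] for i in labeled])
        picked = select_next_library(X_all[labeled], y, X_all[pool], batch_size, rng)
        for i in pool[picked]:
            activity[int(i)] = measure(sequences[int(i)])
            labeled.append(int(i))
    return labeled, activity
```

Calling `run_cycles(candidate_sequences, measure)` returns the indices tested so far and their measured activities. In practice the ridge model would likely be replaced by whatever model is being trained, and the uncertainty score by an acquisition rule suited to it.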

## Key facts

- **NIH application ID:** 10177882
- **Project number:** 5F31HG011431-02
- **Recipient organization:** WASHINGTON UNIVERSITY
- **Principal Investigator:** Ryan Zachary Friedman
- **Activity code:** F31
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $31,970
- **Award type:** 5
- **Project period:** 2020-07-01 → 2023-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10177882

## Citation

> US National Institutes of Health, RePORTER application 10177882, Interpreting function of non-coding sequences with synthetic biology and machine learning (5F31HG011431-02). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10177882. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
