# Computational methods to interpret genomic variation and integrate functional genomics data in genetic analysis of human diseases

> **NIH R35** · COLUMBIA UNIVERSITY HEALTH SCIENCES · 2024 · $406,912

## Abstract

The overall research direction in my lab is to develop new computational methods to enable new discoveries in genetic
studies of human diseases. We design methods using machine learning models based on biological intuitions to extract
knowledge from large genome sequencing and functional genomics data sets.
Recent large-scale genome and exome sequencing studies of human diseases have successfully identified novel risk
genes and improved diagnostic yields in clinical genetic testing, especially in rare diseases, developmental disorders, and
cancer. However, many significant genetic questions remain unsolved and cannot be solved by the accumulation of genetic
data alone. Most of the risk genes of human diseases are still unknown. In particular, the role of rare variants has been
under-studied. One major bottleneck is the lack of highly accurate and automated tools to interpret genetic variation. Rare
missense variants account for most protein-coding variants with potential functional impact; however, most of them
do not contribute to diseases. The inability to accurately predict their functional impact is a critical hurdle to identifying risk
genes in genetic research studies and to disambiguating variants of uncertain significance in clinical practice. We see a
unique opportunity to dramatically improve computational methods in the next five years, due to the following confluent
factors: accumulating large population genome sequence data, modern deep learning methods to model genomic and
protein sequence and structure, human functional genomics data across cell types and developmental stages, and scalable
methods to profile the molecular effects of genetic variants. We will focus on three areas. The first is computational prediction
of the functional impact of missense variants. We use deep neural networks to learn effective representations of protein
sequence and structure in prediction models and use probabilistic graphical models to jointly estimate effects at the molecular
and population levels. The second is computational integration of functional genomics and genetics data. We will fuse
machine learning with statistical genetics to develop methods that model disease genetic data together with single-cell
expression and regulatory profiles of normal individuals. The methods will both improve the statistical power of new risk gene
discovery and generate biological insights into disease etiology. Third, we will continue to develop new bioinformatics tools
to improve the detection and automated confirmation of copy number variants and mosaic mutations from large-scale
genomics data.
Finally, our collaboration with experts in medical genetics will provide positive feedback loops to improve the methods
and generate new biological insights and clinical utility. Our research will produce new methods to analyze genomics data,
and ultimately these methods will enable new discoveries in disease genetic studies and improve the yield of clinical
genetic diagnostics.

## Key facts

- **NIH application ID:** 10842456
- **Project number:** 5R35GM149527-02
- **Recipient organization:** COLUMBIA UNIVERSITY HEALTH SCIENCES
- **Principal Investigator:** Yufeng Shen
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $406,912
- **Award type:** 5
- **Project period:** 2023-06-01 → 2028-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10842456

## Citation

> US National Institutes of Health, RePORTER application 10842456, Computational methods to interpret genomic variation and integrate functional genomics data in genetic analysis of human diseases (5R35GM149527-02). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10842456. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
