Computational methods to interpret genomic variation and integrate functional genomics data in genetic analysis of human diseases

NIH RePORTER · NIH · R35 · $406,912

Abstract

The overall research direction in my lab is to develop new computational methods that enable new discoveries in genetic studies of human diseases. We design methods using machine learning models based on biological intuitions to extract knowledge from large genome sequencing and functional genomics data sets. Recent large-scale genome and exome sequencing studies of human diseases have successfully identified novel risk genes and improved diagnostic yields in clinical genetic testing, especially in rare diseases, developmental disorders, and cancer. However, many significant genetic questions remain unsolved and cannot be solved by the accumulation of genetic data alone. Most of the risk genes of human diseases are still unknown. In particular, the role of rare variants has been under-studied. One major bottleneck is the lack of highly accurate and automated tools to interpret genetic variation. Rare missense variants account for most protein-coding variants with potential functional impact; however, most of them do not contribute to disease. The inability to accurately predict their functional impact is a critical hurdle to identifying risk genes in genetic research studies and to disambiguating variants of uncertain significance in clinical practice. We see a unique opportunity to dramatically improve computational methods in the next five years, due to the following confluent factors: accumulating large population genome sequence data, modern deep learning methods to model genomic and protein sequence and structure, human functional genomics data across cell types and developmental stages, and scalable methods to profile the molecular effects of genetic variants. We will focus on three areas. The first is computational prediction of the functional impact of missense variants.
We use deep neural networks to learn effective representations of protein sequence and structure in prediction models and use probabilistic graphical models to jointly estimate effects at molecular and population levels. The second is computational integration of functional genomics and genetics data. We will fuse machine learning with statistical genetics to develop methods that model disease genetic data together with single-cell expression and regulatory profiles of normal individuals. The methods will both improve the statistical power of new risk gene discovery and generate biological insights into disease etiology. Third, we will continue to develop new bioinformatics tools to improve detection and automated confirmation of copy number variants and mosaic mutations from large-scale genomics data. Finally, our collaboration with experts in medical genetics will provide positive feedback loops to improve the methods and generate new biological insights and clinical utility. Our research will produce new methods to analyze genomics data, and ultimately these methods will enable new discoveries in disease genetic studies and improve the yield of clinical genetic diagnostics.

Key facts

NIH application ID: 10842456
Project number: 5R35GM149527-02
Recipient: COLUMBIA UNIVERSITY HEALTH SCIENCES
Principal Investigator: Yufeng Shen
Activity code: R35
Funding institute: NIH
Fiscal year: 2024
Award amount: $406,912
Award type: 5
Project period: 2023-06-01 → 2028-05-31