Methods for Integrative Genomic Data Analysis

NIH RePORTER · NIH · R01 · $452,629

Abstract

The broad, long-term objective of this project is to develop novel statistical methods, theory, and computational tools for modeling multiple large-scale, high-dimensional genomic data sets, motivated by important biological questions and experiments. New high-throughput technologies and next-generation sequencing are generating many types of very high-dimensional genetic, genomic, epigenomic, and metabolomic data, with the goal of an integrative understanding of complex phenotypes. Integrative analysis of genomic data from different populations and tissues can potentially increase the power to detect disease-associated genetic variants and genes and make causal inference possible in genomic studies, eventually leading to an understanding of causal disease pathways and to genomics-based risk prediction. The specific aims of the current project are to develop new statistical models and methods for polygenic risk score (PRS) prediction and for integrative analysis of eQTL and genome-wide association study (GWAS) data to identify possible causal genes and pathways of complex diseases. To use data effectively across different ethnic groups and tissues, this project will develop several novel transfer learning methods that yield better estimates of polygenic risk scores and increase the power to detect trait-associated variants in minority populations. The project will also develop meta-learning methods to predict ethnicity- and tissue-specific gene expression in order to increase the power of transcriptome-wide association studies (TWAS). Finally, statistical methods for genome-wide co-localization analysis that can effectively integrate GTEx data with GWAS summary statistics will be developed in order to identify possible causal disease genes and pathways.
These methods hinge on a novel integration of methods for multiple related high-dimensional regressions, high-dimensional Gaussian sequence models, and subspace estimation. The new methods can be applied to different types of genomic data and should help identify the genes and biological pathways underlying various complex human diseases, as well as support genomics-based disease risk prediction. The proposed work will contribute statistical methodology and theory for transfer learning and meta-learning in high-dimensional genomic data, both to study complex phenotypes and to offer insights into each of the biological areas represented by the various data sets, including Alzheimer's disease, cardiometabolic syndrome, and chronic kidney disease. All algorithms and software tools, the resulting polygenic risk score models, and the tissue-specific gene expression prediction models will be made available on GitHub together with detailed documentation.

Key facts

NIH application ID: 10924010
Project number: 5R01GM129781-06
Recipient: UNIVERSITY OF PENNSYLVANIA
Principal Investigator: Hongzhe Lee
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $452,629
Award type: 5
Project period: 2018-09-01 → 2027-08-31