# Methods for Integrative Genomic Data Analysis

> **NIH R01** · UNIVERSITY OF PENNSYLVANIA · 2024 · $452,629

## Abstract

The broad, long-term objective of this project is to develop novel statistical methods, theory, and computational tools for modeling multiple large-scale, high-dimensional genomic data sets, motivated by important biological questions and experiments. New high-throughput technologies and next-generation sequencing are generating many types of very high-dimensional genetic, genomic, epigenomic, and metabolomic data, with the goal of building an integrative understanding of complex phenotypes. Integrative analysis of genomic data from different populations and tissues can potentially increase the power to detect disease-associated genetic variants and genes, and can make causal inference possible in genomic studies, eventually leading to an understanding of causal disease pathways and to genomics-based risk prediction. The specific aims of the current project are to develop new statistical models and methods for polygenic risk score (PRS) prediction and for integrative analysis of eQTL and genome-wide association study (GWAS) data to identify possible causal genes and pathways of complex diseases. To make effective use of data across different ethnic groups and tissues, this project will develop several novel transfer learning methods that yield better estimates of polygenic risk scores and increase the power to detect trait-associated variants in minority populations. The project will also develop meta-learning methods to predict ethnicity- and tissue-specific gene expression in order to increase the power of transcriptome-wide association studies (TWAS). Finally, statistical methods for genome-wide co-localization analysis that effectively integrate GTEx data with GWAS summary statistics will be developed in order to identify possible causal disease genes and pathways. These methods hinge on a novel integration of methods for multiple related high-dimensional regressions, high-dimensional Gaussian sequence models, and subspace estimation. The new methods can be applied to different types of genomic data and will ideally help facilitate the identification of genes and biological pathways underlying various complex human diseases, as well as genomics-based disease risk prediction. The proposed work will contribute statistical methodology and theory for transfer learning and meta-learning in high-dimensional genomic data to the study of complex phenotypes, and will offer insights into each of the biological areas represented by the various data sets, including Alzheimer's disease, cardiometabolic syndrome, and chronic kidney disease. All algorithms, software tools, and the resulting polygenic risk score models and tissue-specific gene expression prediction models, together with detailed documentation, will be made available on GitHub.

## Key facts

- **NIH application ID:** 10924010
- **Project number:** 5R01GM129781-06
- **Recipient organization:** UNIVERSITY OF PENNSYLVANIA
- **Principal Investigator:** Hongzhe Li
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $452,629
- **Award type:** 5
- **Project period:** 2018-09-01 → 2027-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10924010

## Citation

> US National Institutes of Health, RePORTER application 10924010, Methods for Integrative Genomic Data Analysis (5R01GM129781-06). Retrieved via AI Analytics 2026-05-27 from https://api.ai-analytics.org/grant/nih/10924010. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
