Statistical Methods for Data Integration and Applications to Genome-wide Association Studies

NIH RePORTER · NIH · R01 · $421,361

Abstract

Large-scale epidemiologic studies, including biobanks and genome-wide association studies (GWAS), are now rapidly leading to the identification of novel risk factors for complex diseases. There is increasing opportunity to develop comprehensive models for disease risk incorporating genetic markers, other biomarkers, lifestyle factors and sociodemographic indicators. There are, however, major challenges, as information on all of the potential risk factors is often not available in a single adequately large study. Instead, information may be available from different studies, each of which may include only a subset of the desired variables. Further, because of privacy concerns with individual-level data, only summary-level information, i.e., estimates of model parameters, may be available from some studies. We propose to develop a series of novel statistical methods that will allow data integration across disparate datasets to tackle modern problems faced in genetic and epidemiologic studies. In Aim 1, we will develop a framework for building generalized linear models using detailed covariate data from a main study, while incorporating summary-statistics information from an external study. We will develop a series of applications of this framework to GWAS, where we will use covariate data from biobanks and perform combined analysis with external summary-statistics data for powerful exploration of gene-environment interactions and mediations. In Aim 2, we will extend the proposed framework of Aim 1 to develop models with high-dimensional covariates and regularized parameter estimates. We will develop novel computational algorithms for practical implementation of the method in large-scale data analysis and develop new theory for inference on model parameters. We will further develop applications of the proposed method for fine-mapping and polygenic risk score analysis conditional on covariates. In Aim 3, we will develop applications of the data integration framework to account for differences in the accuracy and depth of disease outcome data across studies. We will illustrate applications of the methods across the aims using datasets on cancers (breast, melanoma and lung), cardiometabolic traits (type-2 diabetes and coronary artery disease) and a psychiatric disorder (major depressive disorder). We will develop and freely distribute user-friendly software.

Key facts

NIH application ID: 10980746
Project number: 1R01HG013137-01A1
Recipient: JOHNS HOPKINS UNIVERSITY
Principal Investigator: Nilanjan Chatterjee
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $421,361
Award type: 1
Project period: 2024-09-23 → 2029-01-31