Causal and integrative deep learning for Alzheimer's disease genetics

NIH RePORTER · NIH · U01 · $733,352

Abstract

Summary: In response to PAR-19-269, “Cognitive Systems Analysis of Alzheimer's Disease Genetic and Phenotypic Data”, we propose developing and applying more powerful and robust machine learning methods for causal and integrative analysis, especially deep learning approaches for instrumental variable analysis, to identify causal risk/protective factors for Alzheimer's disease (AD) in the post-GWAS era by leveraging published large-scale GWAS, whole-genome sequencing (WGS) and other omic and neuroimaging data. Our main motivation is to extend an emerging and increasingly influential approach of integrating GWAS with gene expression data, called transcriptome-wide association studies (TWAS), aiming to improve over the current practice of GWAS by not only increasing statistical power, but also identifying (putative) causal genes, thus gaining insights into the genetic basis of common diseases and complex traits. The statistical principle underlying TWAS is the (two-sample) two-stage least squares (2SLS) method for linear models in the framework of instrumental variable (IV) analysis for causal inference. In practice, however, TWAS may fail to identify true causal genes while giving false positives when its modeling assumptions are violated, e.g., due to non-linear effects of IVs or gene expression, or due to invalid IVs (in the presence of horizontal pleiotropy of SNPs). First, we propose developing linear models and neural network models incorporating a large number of functional annotations on the genome (e.g., various types of functional genomic and epigenetic data from the ENCODE and Roadmap Epigenomics projects) as prior knowledge to improve imputing/predicting gene expression (or other molecular or imaging endophenotypes or complex traits/diseases) via SNPs, corresponding to the first stage of 2SLS.
Second, we propose neural networks as more flexible non-linear models for the second stage of 2SLS in the presence of invalid IVs, which may be SNPs having direct (or horizontal pleiotropic) effects on the outcome, as expected from widespread pleiotropy. We then combine the approaches in the above two stages to form a more flexible and robust neural network approach as an extension of 2SLS for causal inference. Third, we consider inferring causal directions between two traits, e.g., a gene's expression and AD, allowing non-linear relationships between SNPs and traits and between the two traits. This is critical in reducing false positives, e.g., due to reverse causation, but has been largely understudied. Fourth, we apply the new (and existing) methods to transcriptomic, proteomic, neuroimaging and AD GWAS/WGS data to identify (putative) causal genes, proteins and brain regions of interest (ROIs) for AD, while building the corresponding genetic prediction models for endophenotypes and AD risk. Finally, we will develop and disseminate publicly available software implementing the proposed analysis methods, e.g., as Python programs or R package...
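
To make the two-sample 2SLS principle behind TWAS concrete, the following is a minimal sketch on simulated data. It is not the project's software and does not implement the proposed neural network extensions; it covers only the standard linear baseline the abstract describes. All names (`fit_ols`, `simulate`, `two_sample_2sls`), the sample sizes, the SNP effect sizes and the causal effect of 0.3 are hypothetical choices made for illustration, and all SNPs are assumed to be valid IVs (no horizontal pleiotropy).

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients, with an intercept as the first entry."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def simulate(n, snp_effects, causal_effect, rng):
    """Simulate genotypes Z, expression X and outcome Y (illustrative only).

    U is an unmeasured confounder affecting both X and Y, which is what
    biases a naive regression of Y on X.
    """
    Z = rng.binomial(2, 0.3, size=(n, len(snp_effects))).astype(float)
    U = rng.normal(size=n)
    X = Z @ snp_effects + U + rng.normal(size=n)
    Y = causal_effect * X + U + rng.normal(size=n)
    return Z, X, Y


def two_sample_2sls(Z_ref, X_ref, Z_gwas, Y_gwas):
    """Two-sample 2SLS: impute expression from SNPs, then regress the outcome."""
    # Stage 1: learn the SNP -> expression model in the reference sample.
    stage1 = fit_ols(Z_ref, X_ref)
    # Impute expression in the GWAS sample, where it was never measured.
    X_hat = stage1[0] + Z_gwas @ stage1[1:]
    # Stage 2: regress the outcome on the genetically imputed expression.
    stage2 = fit_ols(X_hat, Y_gwas)
    return stage2[1]


rng = np.random.default_rng(0)
snp_effects = np.array([0.5, -0.4, 0.3, 0.6, -0.5])  # assumed, for illustration
true_effect = 0.3                                     # assumed, for illustration

Z_ref, X_ref, _ = simulate(5_000, snp_effects, true_effect, rng)
Z_gwas, X_gwas, Y_gwas = simulate(20_000, snp_effects, true_effect, rng)

iv_estimate = two_sample_2sls(Z_ref, X_ref, Z_gwas, Y_gwas)
naive_estimate = fit_ols(X_gwas, Y_gwas)[1]  # confounded by U

print(f"true effect:       {true_effect:.3f}")
print(f"two-sample 2SLS:   {iv_estimate:.3f}")
print(f"naive OLS (Y ~ X): {naive_estimate:.3f}")
```

In this setup the naive regression is pulled away from the true effect by the confounder, while the 2SLS estimate should land close to it. If one of the simulated SNPs were also given a direct effect on `Y`, the same estimator would generally become biased, which is the invalid-IV problem the second aim targets.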

Key facts

NIH application ID: 10267373
Project number: 1U01AG073079-01
Recipient: UNIVERSITY OF MINNESOTA
Principal Investigator: Wei Pan
Activity code: U01
Funding institute: NIH
Fiscal year: 2021
Award amount: $733,352
Award type: 1
Project period: 2021-09-15 → 2026-08-31