# Statistical Methods for Analysis of Massive Genetic and Genomic Data in Cancer Research

> **NIH R35** · HARVARD UNIVERSITY D/B/A HARVARD SCHOOL OF PUBLIC HEALTH · 2024 · $877,834

## Abstract

With massive genome, exposome and phenome data rapidly becoming available in population and clinical studies,
data science has become critically important and provides unprecedented opportunities for new
discoveries in cancer. This competing renewal application of an NCI Outstanding Investigator Award (R35)
aims at developing and applying scalable, interpretable and transferable statistical and machine learning (ML)
methods for integrative analysis of massive germline whole genome sequencing (WGS) and somatic whole
exome sequencing (WES) data, epidemiological and clinical data, in large-scale multi-ethnic biobanks,
population and clinical studies of cancer, with experimental cell-specific multi-omic functional data, such as
single-cell RNA/ATAC-seq data. Our ultimate goal is to use advanced data science methods and different
types of population, clinical, and experimental data to accelerate progress in advancing from cancer gene
mapping to mechanisms to cancer prevention and medicine, discover new effective trans-ethnic precision
cancer prevention and treatment strategies, and reduce health disparities in cancer genetic research. This
application aims to meet the pressing quantitative needs for the analysis of massive data in cancer research.
Specifically, (A) for genetic cancer epidemiology, we will develop scalable, interpretable and transferable
statistical and ML methods for (1) rare variant analysis by integrating population-based WGS and experimental
single-cell functional data; (2) advancing from associated variants with unknown causality and biology to causal
variants, genes and pathways using causal mediation analysis and Mendelian randomization by integrating
genetic, cell-specific omic, biomarker and phenotype data; (3) estimating transferable trans-ethnic polygenic
risk scores (PRSs) and heritability using common and rare variants by integrating WGS data with experimental
and in silico cell-specific functional annotations and non-genetic data, for actionable prevention strategies; (4)
federated and transferable trans-ethnic single phenotype and phenome-wide genetic analysis in large WGS
studies and biobanks. (B) For cancer genetic medicine, we will develop scalable and interpretable statistical
and machine learning methods for (1) joint analysis of germline WGS and tumor somatic WES data to identify
genetic variants that predispose to cancer subtypes; (2) integrative analysis of tumor somatic WES data and
clinicopathological characteristics to identify patient profiles for improved efficacy of immunotherapies; (3)
analysis of the effects of clonal hematopoiesis, mitochondrial dysfunction, and leukocyte telomere length, called
from germline WGS data, on tumor somatic events, cancer prognosis and responses to immunotherapies. We
will apply the proposed methods in lung cancer and breast cancer genetic epidemiological and clinical studies
and biobanks. We will develop open access cluster and cloud-based software of these met...

## Key facts

- **NIH application ID:** 10896421
- **Project number:** 5R35CA197449-10
- **Recipient organization:** HARVARD UNIVERSITY D/B/A HARVARD SCHOOL OF PUBLIC HEALTH
- **Principal Investigator:** XIHONG LIN
- **Activity code:** R35
- **Funding institute:** NIH, National Cancer Institute (NCI)
- **Fiscal year:** 2024
- **Award amount:** $877,834
- **Award type:** 5
- **Project period:** 2015-08-05 → 2029-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10896421

## Citation

> US National Institutes of Health, RePORTER application 10896421, Statistical Methods for Analysis of Massive Genetic and Genomic Data in Cancer Research (5R35CA197449-10). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10896421. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*