Recovering reproducible and local signal in genomic data

NIH RePORTER · NIH · P20 · $167,063

Abstract

Challenge. One of the most important challenges in biological science today is to elucidate the extent to which complex experiments, which measure hundreds of thousands of variables, can be analyzed to generate consistent, global signal when repeated, and to identify local signal related to tissues, cancer types, or population structure. Importantly, we must account for the intrinsic variation across different studies and control for technical confounders as part of this task. Most measurements from high-dimensional biological experiments display variation arising both from biological sources, such as genes belonging to different tissues or different positions in the brain, and from technical sources. While some components reappear across multiple tissues, global biological signal is more likely than spurious signal to be reproducibly present in multiple tissues. Our challenge is to systematically and reliably identify the global biological factors and to estimate the signal specific to each study.

Aims. To meet this challenge, we propose a novel concept that combines ideas from meta-analysis and dimension reduction in statistical modeling. We posit that one can develop high-dimensional data reduction techniques that simultaneously function as multi-study tools, extracting both consistent signal and study-specific local signal. This proposal develops statistical methods for identifying shared and study-specific signal across multiple cancer studies. In this work, it is crucial to understand both the shared signal (here, gene co-expression shared across different cancer types) and the signal specific to each study. This proposal will pilot this concept by building novel classes of multi-logistic regression and factor analysis methods. The key is to decompose the data from each study into latent dimensions, some of which are global while others are specific to a local signal. This will simultaneously achieve two goals: learning reproducible biological features shared among studies, and identifying the variation specific to each study. Specific aims include methodology design, software development, and applications.

Impact. The concepts, approaches, and software tools generated by this research will have a direct impact on the ability of the biomedical community to reproducibly identify stable signals across multiple high-throughput biology studies and to capture local signals. Our tools will also enable more reliable identification of artifacts, and thus facilitate more efficient experimental designs and guide technological development. We also hope to impact data science beyond genomics. Our study will be the first opportunity to evaluate the novel concept of sharing latent factors while also estimating local latent structures. The proposed work could subsequently provide the inspiration, as well as the practical foundation, for expanding this concept to a variety of other dimension reduction and machine learning techniques.
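
Illustration. The decomposition described in the abstract, where each study's data are split into global latent dimensions and study-specific ones, can be made concrete with a small simulation. The sketch below is not the proposal's method or software. It assumes a generic linear factor model, x = Phi f + Lambda_s l + e, with independent unit-variance factors and independent noise. All names (PHI, LAMBDA, NOISE, the helper functions) and the toy numbers are hypothetical. Under these assumptions the covariance implied for study s is Phi Phi^T + Lambda_s Lambda_s^T + diag(noise), so the first term is the co-expression shared by every study and the second is local to study s.

```python
import random

def gram(loadings):
    """Return L @ L.T for a loading matrix stored as a list of rows."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in loadings]
            for row_i in loadings]

def implied_covariance(shared, specific, noise_var):
    """Covariance of x = shared @ f + specific @ l + e, assuming independent
    unit-variance factors and independent noise with variances noise_var."""
    common, local = gram(shared), gram(specific)
    p = len(shared)
    return [[common[i][j] + local[i][j] + (noise_var[i] if i == j else 0.0)
             for j in range(p)] for i in range(p)]

def simulate_study(shared, specific, noise_var, n_samples, rng):
    """Draw n_samples observations from the assumed generative model."""
    k, j = len(shared[0]), len(specific[0])
    samples = []
    for _ in range(n_samples):
        f = [rng.gauss(0.0, 1.0) for _ in range(k)]  # shared factor scores
        l = [rng.gauss(0.0, 1.0) for _ in range(j)]  # study-specific scores
        samples.append([
            sum(s * fa for s, fa in zip(shared[g], f))
            + sum(t * lb for t, lb in zip(specific[g], l))
            + rng.gauss(0.0, noise_var[g] ** 0.5)
            for g in range(len(shared))
        ])
    return samples

def sample_covariance(samples):
    """Unbiased sample covariance matrix of a list of observations."""
    n, p = len(samples), len(samples[0])
    means = [sum(row[g] for row in samples) / n for g in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / (n - 1)
             for j in range(p)] for i in range(p)]

# Hypothetical toy setup: 4 genes, 1 shared factor, 1 specific factor per study.
PHI = [[0.9], [0.8], [0.7], [0.6]]                 # loadings common to all studies
LAMBDA = {
    "study_A": [[0.5], [0.0], [0.0], [0.5]],       # local signal on genes 0 and 3
    "study_B": [[0.0], [0.6], [0.6], [0.0]],       # local signal on genes 1 and 2
}
NOISE = [0.2, 0.2, 0.2, 0.2]

rng = random.Random(0)
implied = {name: implied_covariance(PHI, lam, NOISE) for name, lam in LAMBDA.items()}
empirical = {name: sample_covariance(simulate_study(PHI, lam, NOISE, 20000, rng))
             for name, lam in LAMBDA.items()}

for name in LAMBDA:
    worst = max(abs(empirical[name][i][j] - implied[name][i][j])
                for i in range(4) for j in range(4))
    print(name, "largest gap between sample and implied covariance:", round(worst, 3))
```

In this toy setup, the covariance between genes 0 and 1 comes only from the shared loadings and is therefore the same in both studies, while the covariance between genes 0 and 3 is inflated in study_A alone. Recovering Phi and Lambda_s from real data is the estimation problem the proposal addresses, and this sketch does not attempt it.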

Key facts

NIH application ID: 10904898
Project number: 5P20GM109035-09
Recipient: BROWN UNIVERSITY
Principal Investigator: Roberta De Vito
Activity code: P20
Funding institute: NIH
Fiscal year: 2024
Award amount: $167,063
Award type: 5
Project period: 2016-06-01 → 2026-07-31