# New Statistical Methods for Medical Signals and Images

> **NIH R01** · STANFORD UNIVERSITY · 2023 · $496,901

## Abstract

The analysis of large datasets from computational biology and medicine represents an important challenge for statisticians. These biomedical data typically have a large number of correlated features with relatively weak signals for predicting phenotypes of interest. Examples include DNA sequences and GWAS, mass spectra, RNA-seq and protein arrays. The broad goal of this ongoing three-investigator grant is to develop and study statistical techniques that enhance the analysis and interpretation of these data. The team combines experience in statistical modeling, algorithmic development, and theoretical analysis. Through four Specific Aims, the new projects focus on the development and validation of state-of-the-art statistical methods that use structure to learn from high-dimensional data and so advance human population health.

1. Cluster-aware supervised learning. In “omics” settings, there are a large number of features that often exhibit sizable correlations. This aim proposes the Cluster-Aware Lasso, a statistical method that fits a lasso regression model while adaptively selecting clusters of features using a hierarchical clustering-based approach, enforcing a notion of a tree-respecting solution. It will be validated on gene expression and mass spectrometry data, and extensions to other supervised learning settings will be studied.
2. SNP selection from GWAS summary statistics with FDR control. Genome-wide association studies often report findings for phenotypes in terms of summary statistics for individual SNPs. This aim develops a statistical method to identify causal SNPs while controlling the False Discovery Rate. It uses an estimate of the SNP correlation matrix based on linkage-disequilibrium data, an approximate multivariate lasso fit and model-X knockoff techniques, and will be validated on UK Biobank data.
3. Inference for high-dimensional genetic covariance matrices. Statistical estimation of large genetic covariance matrices is needed to learn whether genetic variation at phenome-wide scale is concentrated in relatively few trait combinations, with implications for evolution and pleiotropy. This aim will explore biases in Restricted Maximum Likelihood and study alternative parametric and nonparametric methods of estimation, both by asymptotic approximation and by simulation.
4. Mixture lasso for multiple instance learning. It is often known whether a person is sick, but not which of their immune cells are responding to a particular illness, nor which parts of biopsied tissue are diseased. There is a label only for each patient, but the data instances sit at a more granular level. The aim is to predict the label of each data instance. This project proposes a supervised learning method based on mixtures and the lasso, with validation on viral sequence and mass spectrometry data.

Working together, the investigators and their students will implement the new statistical tools into publicly available software, following a pattern established in ear...
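
Aim 2 in the abstract refers to knockoff techniques with False Discovery Rate control. As a rough illustration only, the sketch below shows the standard "knockoff+" selection rule from the knockoff filter literature: given feature statistics W_j (large positive values suggest a real signal, and signs of null features are symmetric), it picks the smallest threshold t whose estimated false discovery proportion, (1 + #{W_j <= -t}) / max(1, #{W_j >= t}), is at most the target level q. This is not the investigators' summary-statistics method; the function names and the toy W values are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence


def knockoff_plus_threshold(w: Sequence[float], q: float) -> float:
    """Return the knockoff+ threshold for statistics w at target FDR q.

    Scans the nonzero |W_j| in increasing order and returns the first t with
    (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q, or infinity if none exists.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false positives
        positives = sum(1 for x in w if x >= t)   # features that would be selected
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w: Sequence[float], q: float) -> List[int]:
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical toy statistics: six strong positives and four small values near zero.
w_toy = [5.0, 4.2, 3.9, 3.1, 2.8, 2.5, -0.4, 0.3, -0.2, 0.1]
selected = knockoff_select(w_toy, q=0.2)
print(selected)
```

In practice the statistics W_j would come from comparing each real feature with its constructed knockoff copy (for example, differences in lasso coefficient magnitudes); that construction is outside the scope of this sketch.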

## Key facts

- **NIH application ID:** 10734451
- **Project number:** 2R01GM134483-28
- **Recipient organization:** STANFORD UNIVERSITY
- **Principal Investigator:** Iain M Johnstone
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2023
- **Award amount:** $496,901
- **Award type:** 2
- **Project period:** 1996-09-10 → 2027-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10734451

## Citation

> US National Institutes of Health, RePORTER application 10734451, New Statistical Methods for Medical Signals and Images (2R01GM134483-28). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10734451. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
