The analysis of large datasets from computational biology and medicine represents an important challenge for statisticians. These biomedical data typically have a large number of correlated features with relatively weak signals for predicting phenotypes of interest. Examples include DNA sequences and GWAS, mass spectra, RNA-seq, and protein arrays. The broad goal of this ongoing three-investigator grant is to develop and study statistical techniques that enhance the analysis and interpretation of these data. The team combines experience in statistical modeling, algorithmic development, and theoretical analysis. Through four Specific Aims, the new projects focus on the development and validation of state-of-the-art statistical methods that exploit structure in high-dimensional data to advance human population health.
1. Cluster-aware supervised learning. In "omics" settings, there are a large number of features that often exhibit sizable correlations. This aim proposes the Cluster-Aware Lasso, a statistical method that fits a lasso regression model while adaptively selecting clusters of features through a hierarchical clustering-based approach, which enforces a tree-respecting solution. It will be validated on gene expression and mass spectrometry data, and extensions to other supervised learning settings will be studied.
2. SNP selection from GWAS summary statistics with FDR control. Genome-wide association studies often report findings for phenotypes in terms of summary statistics for individual SNPs. This aim develops a statistical method to identify causal SNPs while controlling the False Discovery Rate. It uses an estimate of the SNP correlation matrix based on linkage-disequilibrium data, an approximate multivariate lasso fit, and model-X knockoff techniques, and it will be validated on UK Biobank data.
3. Inference for high-dimensional genetic covariance matrices.
Statistical estimation of large genetic covariance matrices is needed to learn whether genetic variation at phenome-wide scale is concentrated in relatively few trait combinations, with implications for evolution and pleiotropy. This aim will explore biases in Restricted Maximum Likelihood and study alternative parametric and nonparametric methods of estimation, both by asymptotic approximation and by simulation.
4. Mixture lasso for multiple instance learning. It is often known whether a person is sick, but not which of their immune cells are responding to a particular illness, nor which parts of biopsied tissue are diseased. Labels are available only at the patient level, whereas the data instances are observed at a more granular level. The aim is to predict the label of each data instance. This project proposes a supervised learning method based on mixtures and the lasso, with validation on viral sequence and mass spectrometry data.
Working together, the investigators and their students will implement the new statistical tools in publicly available software, following a pattern established in ear...