PROJECT SUMMARY

Cancer develops when pathways controlling cell survival, cell fate or genome maintenance are disrupted by the somatic alteration of key driver genes. Understanding the mechanism and impact of pathway disruption is therefore essential for an accurate characterization of cancer biology and identification of therapeutic targets. A common approach for studying pathway dysregulation in cancer involves the analysis of tumor gene expression data using gene set testing or pathway analysis techniques. Gene set testing is an effective and widely applied hypothesis aggregation method that uses prior knowledge regarding gene function to test a smaller number of more biologically meaningful hypotheses and thereby improve interpretation, replication and power relative to a gene-level analysis. Although the gene set analysis of large cancer gene expression data sets has successfully identified pathways commonly impacted in human cancer, existing pathway analysis methods have two important limitations when applied to cancer gene expression data. First, most existing gene set collections model the pattern of gene activity found in normal tissues, which can differ significantly from the pattern found within tumors. Using these gene sets to analyze cancer gene expression data can produce misleading results with the potential for a significantly inflated type II error rate. Second, standard gene set testing methods leverage only the gene expression data for the analyzed samples. Although there are some cancer-specific pathway analysis methods that consider multiple omics modalities, e.g., expression and mutations, information regarding the expression of genes in the associated normal tissue is not utilized by existing techniques. Ignoring normal tissue gene expression can result in a cancer-focused analysis that simply recapitulates the phenotype of the associated normal tissue rather than capturing cancer-specific activity.
To address these challenges, we will develop novel and innovative bioinformatics algorithms that 1) optimize existing gene set collections to reflect the pattern of gene activity found in dysplastic tissue, and 2) leverage information regarding normal tissue gene activity during gene set analysis.