Project 3: Statistical Methods for Genome Characterization

Abstract

Understanding the role that genes play in life is a key issue in biomedical sciences, yet the overwhelming majority of sequences in public databases remain uncharacterized. Functional annotation is important for a variety of downstream analyses of genetic data, but experimental characterization of function remains costly and slow, making computational prediction an important endeavor. This project therefore proposes three Aims focused on functional genomics. In our first Specific Aim, we propose to develop a probabilistic evolutionary model, built upon phylogenetic trees and experimental Gene Ontology functional annotations, that allows automated prediction of function for unannotated genes. We will develop a probabilistic hierarchical modeling framework that will allow joint inference, and borrowing of strength, across a family of related trees. We expect this to significantly improve overall accuracy. Our approach will provide a scalable computational method that will enable gene annotation to be kept up to date regardless of the flow of new experimental data. Our second Aim focuses on the development of improved statistical methods for pathway analysis. Such methods aim to detect over-representation of members of a super-structure, such as a genetic pathway, in a list of objects of interest from an experimental or statistical analysis. However, pathway definitions are not consistent between resources: the overlap between two definitions of the same pathway in different resources can be as low as 30%. In this Aim we will develop methods that focus on the network structure itself, which is much more robust. Our third Aim focuses on analysis of epigenetic conservation. The epigenome dictates cell phenotype, and it is increasingly possible to infer which genes are silenced or expressed by measuring the epigenome of a cell.
Cancers are characterized by multiple genes that show both hypermethylation and hypomethylation relative to normal tissues. We will develop advanced statistical methods to assess how conservation of DNA methylation varies along the genome, and we will validate them using measures of ‘essentiality’ taken from the Cancer Dependency Map and drug sensitivity data taken from the Genomics of Drug Sensitivity in Cancer (GDSC) Project.