Abstract This project concerns how to extract clinically actionable diagnostic information from the mutational patterns observed in tumor sequencing panels, which are increasingly used in the routine medical care of cancer patients. In recent years the mutational landscape has come under intense scrutiny, using publicly available databases such as The Cancer Genome Atlas (TCGA) and other important sources of information on somatic mutations. Most of this attention has focused on major cancer genes, and especially on the hotspots in these genes at which mutations occur frequently. However, the vast majority of somatic mutations occur at “rare” genetic loci. Of the 1,788,153 distinct mutations observed in the 10,295 TCGA tumors, over 92% were singletons, i.e., mutations observed in only one tumor. Moreover, when new tumors are sequenced, on average 60% of the mutations observed were not observed in TCGA. To date, investigators have mostly ignored this “hidden iceberg” of potential information. Our proposal is motivated by the belief that at least a portion of these rare mutations contain important information that could be harnessed for clinical purposes. In preliminary work we have adapted statistical methods developed for analogous investigations in other scientific fields, such as species identification in ecology and language processing, and have demonstrated that the probabilities of observing rare variants in known cancer genes differ markedly by gene, that these probabilities can be estimated accurately, and that for some genes the probabilities exhibit strong lineage dependency. Motivated by these findings, we propose to broaden the scope of these methods to investigate lineage dependency throughout the genome and to use this information to develop accurate tools for classifying tumors by tissue site of origin.
In Aim 1, we will integrate data from various bioinformatic resources to characterize genes, as well as mutations in non-coding parts of the genome, on the basis of their local GC content, DNA replication timing, transcriptional activity, chromatin accessibility, and histone modification marks in the corresponding tissues of origin, with a view to mapping lineage-dependent variation in rare and previously unobserved variants. In Aim 2, we will use this information to construct a classification tool based on a penalized hierarchical mixed-effects statistical model that permits direct use of these “meta-features” for imputing the discriminatory effects of rare and previously unseen variants. We will examine the predictive accuracy of the model using empirical validation datasets and study its computational feasibility in different data settings, e.g., panel sequencing versus whole-exome and whole-genome sequencing. The ultimate goal is to create a tool for the classification of the anatomic site of origin of cancers of unknown primary and of cancers detected th...