Project Summary

The size of genetic data sets is growing exponentially. At the current rate of growth, the largest reference panels of phased, sequenced individuals will contain millions of individuals within 5-7 years. This research will address the computational challenges of performing genotype phasing and imputation in large cohorts and with large reference panels. Large cohorts from outbred populations typically contain a mixture of nominally unrelated and closely related individuals. Current phasing methods for these large data sets do not model parent-offspring or other close relationships. We will develop a new phasing method that greatly increases phase accuracy in closely related individuals and that scales to large sample sizes. Increasing reference panel size increases genotype phasing and imputation accuracy, but it also increases computational cost. We will develop a new reference file format that substantially reduces the computational cost of imputation and phasing with large reference panels. We will provide a format specification, software, and software libraries so that other researchers and software developers can readily use the new reference file format. We will also develop a new computational method for finding shared haplotype segments between a reference panel and a target haplotype. This new method will significantly reduce the computational cost of phasing and imputation with large reference panels. Finally, we will extend the fastest, most accurate method for genotype phasing and imputation (Beagle 5.0) to analyse chromosome X data. This extension will improve genetic studies of this important chromosome.