PROJECT SUMMARY
Due to mandates from funding agencies and publishers, high-throughput molecular data from Down syndrome individuals and controls (mostly humans and mice) are available in public repositories. Researchers can use such data to corroborate their own findings and pose new research questions. Doing so would help to leverage prior investments and complement efforts by the INCLUDE Data Coordinating Center (DCC) to generate data for new cohorts. Our proposal focuses specifically on mRNA expression and DNA methylation data. These data types shed light on how genes are regulated, how molecular aberrations lead to medical conditions, and how medical outcomes can be predicted, potentially leading to improved diagnostics, treatments, and insights into human health and disease. However, many data-generation platforms are used for these data types, and researchers use a wide range of techniques for normalizing the data, checking data quality (if they check at all), and mapping to gene annotations. To be reused most effectively, the data must be reprocessed from their original form; normalized and quality-checked consistently; and mapped to current annotations. Agencies that manage public repositories lack the resources and expertise to perform these steps. In our first aim, we will address this problem using a data-curation approach. We have identified 148 datasets specific to Down syndrome that we believe should be prioritized for reuse. Using our expertise in molecular-data processing and bioinformatics, we will re-normalize, quality-check, summarize, and annotate the data using an approach that maximizes consistency across all of the datasets. Additionally, we will map the metadata to biomedical-ontology terms in collaboration with the INCLUDE DCC. We expect that these efforts will reduce barriers to data reuse for researchers in the Down syndrome community and accelerate progress in the field. Our second aim focuses on interoperability. 
For many research questions, a single dataset is insufficient. Sample sizes may be small, or a single dataset may not represent the range of phenotypes or other factors necessary to answer a given question. Therefore, it is often crucial to integrate datasets from multiple sources. However, systematic differences between datasets are inevitable due to differences in populations, laboratory conditions, and environmental factors. Failing to adjust for these differences will likely lead to biased conclusions. We will evaluate the feasibility of making these adjustments with generative neural networks, a type of algorithm that is highly configurable and is behind many of the most influential artificial-intelligence advances of the past decade. We will apply these algorithms in the context of studying medical conditions that co-occur with Down syndrome, such as autoimmune conditions, dementia-related disease, congenital heart defects, and leukemias. Our algorithms will search for systematic patterns that differ between datasets and generate a modified vers...