SUMMARY 6.1 million children in the US currently suffer from asthma, making it the most common chronic disease of childhood. Significant racial and ethnic disparities exist: African American (AA) children are 8 times more likely to die of asthma than non-Hispanic white children. Genetic, environmental, and psychosocial factors are believed to jointly cause the disease by affecting biological pathways related to asthma pathophysiology. Within our parent R01 award (5R01MD015409), abbreviated as the “Stress, Epigenome and Asthma” (SEA) study, we hypothesize that exposure to psychosocial stress in childhood may act at a mechanistic (biological) level, altering the function of the genome through epigenetic modifications. To test our hypothesis, we are collecting large amounts of data in a prospective social epigenomics study of asthmatic AA children and their families, including high-resolution epigenetic profiles, comprehensive social determinants of health (SDOH), and chronic stress information. While the parent award proposes to make the ‘omics’ dataset ready for downstream AI/ML approaches, we recognize the need to also prepare our SDOH and chronic stress data for similar applications, which is outside the scope of the parent award. Specifically, we argue that the SEA study data will greatly benefit from AI/ML techniques such as ensemble models, which are capable of natively capturing differential outcomes across combinations of features. However, because exposure to chronic stressors is tied to a child’s social environment, developing reliable models will require significant effort to prepare and contextualize the collected data. We hypothesize that this can be accomplished by linking the collected social and clinical data with disparate population-level datasets. Our supplement will address two aims: 1) We will develop novel quantitative measures to define the representativeness of study participant data. 
By utilizing publicly available population-level data (e.g., Census data), we will develop a framework to compare the sociodemographic profile of study participants against the expected distribution of individuals in a geographic reference area and, by doing so, identify subgroups that may be misaligned with the community to which results are expected to generalize. By further linking this alignment to data quality measures (e.g., missingness), we can create a standardized tool to convey the dataset’s intrinsic biases on population subsets to aid in designing analyses and interpreting AI/ML model results; and 2) We will extend traditional AI/ML imputation preprocessing methods to account for socioeconomic factors. Because chronic stress is deeply interconnected with children’s social environment and sampling is not balanced by geographic region, current imputation estimates for data in subgroups with a high degree of missingness would be primarily driven by relationships found in cohorts with m...