PROJECT SUMMARY / ABSTRACT
The overarching goal of this study is to use new, large multi-modal data resources and machine-learning-based data mining algorithms to better understand risk factors and improve diagnosis for people with amyotrophic lateral sclerosis (ALS). ALS is a rare, fatal neurodegenerative disorder; 90% of cases are sporadic and do not have a genetic cause, and their contributing risk factors are largely unknown. Most of what is known about ALS risk factors comes from epidemiological studies using registry data, which historically have formed the main standardized big data source for describing the natural history, epidemiology, and burden of disease; however, the strength of evidence resulting from these studies varies greatly. One potential major limitation of registry data is that the fields collected are based on known potential risk factors, which restricts their usability for exploring novel associations and causal relationships. Moreover, ALS is a rare disease with low prevalence, which makes it infeasible to study its etiology using traditional observational study designs because of statistical power constraints. The digitization of healthcare records and the capacity to link them to other relevant data sources now enable a more representative, enriched, and statistically powerful study population, one that is ideal for leveraging machine-learning-driven, hypothesis-generating models to identify new risk factors and patterns important for understanding, diagnosing, or treating people with ALS. Building on an established, well-integrated real-world big data source and an established ensemble embedded feature selection framework, a multi-marker (biomarker, clinical marker, geo-marker, socio-marker) discovery algorithm will be developed to discover novel, generalizable risk factors (Aim 1), new symptomatic patterns for early diagnosis (Aim 2), and effective clinical care pathways for ALS (Aim 3).
To best translate findings into clinical insights, a multi-disciplinary and multi-stakeholder team has been assembled, including not only investigators with diverse expertise in statistics, machine learning, clinical research informatics, neurology, computer science, and epidemiology, but also an engaged patient advisory board with diverse social backgrounds. The proposed work will be one of the first pilot studies applying AI/ML-based, hypothesis-generating algorithms to statistically powerful real-world data to bridge the knowledge gap on ALS risk factors. The work will not only provide the CDC's Agency for Toxic Substances and Disease Registry (ATSDR) with empirical evidence to better prioritize future decisions on expanding the ALS registry risk factor survey but also inform better-designed proposals for future etiological studies and targeted trials for ALS. This study will also provide an exemplar framework that can be generalized to advance research in other rare and complex disease domains by leveraging r...