# Using Machine Learning with Real-World Data to Identify Autism Risk in Children

> **NIH R21** · UNIVERSITY OF SOUTHERN CALIFORNIA · 2022 · $266,964

## Abstract

Early and accurate identification of autism spectrum disorder (ASD) is important because ASD interventions
can support positive long-term developmental outcomes. However, there is a delay of more than 2 years between the age
at which children can reliably be diagnosed and the average age of diagnosis, and 1 in 4 U.S. children aged 8 with ASD
have not been diagnosed. Girls and Latino children are disproportionately impacted by the problem of delayed
diagnosis and under-identification of ASD, in part because clinicians are less likely to recognize ASD risk
factors in them and refer them for an ASD evaluation. Therefore, predicting ASD risk at a population level is
needed to enhance early and accurate detection, particularly in these underserved populations. Researchers
are beginning to harness clinical informatics methods to identify ASD from real-world data in electronic health
records (EHRs), using both structured (e.g., diagnosis codes) and unstructured data (e.g., physician notes).
However, existing algorithms suffer from multiple major flaws, including non-representativeness of training
samples, outdated diagnosis codes and natural language processing (NLP) methods, and a lack of ‘verified’
ASD diagnosis in their gold standard datasets. This proposed research addresses these gaps by developing a
contemporary ASD risk model that uses state-of-the-art machine learning and NLP methods. Using EHR data
from Children’s Hospital Los Angeles (including a gold standard dataset with ‘verified’ ASD diagnoses from the
Boone Fetter Clinic) and the OneFlorida Data Trust (a Florida state-wide EHR database), we will (1) develop a
computable phenotype for ASD using both structured and unstructured EHR data (including parent-reported
ASD discriminators and features associated with ASD that are often found in free text in children’s records),
and (2) develop a machine-learning risk prediction model for ASD. This will lay the foundation for a clinical
decision support tool, to be integrated into EHRs to notify a clinician when a child warrants ASD evaluation.
This has potential to improve ASD identification in all children, but it may particularly benefit girls and Latino
children, reducing sex and ethnic disparities. Further, it can readily be expanded into a ‘next steps’ study across the
overall PCORnet, which provides healthcare to over 24 million children. By using EHRs, this proposal holds
promise for future cost-effective health systems interventions that can help to correct a sociodemographic
‘imbalance’ in ASD research by reaching girls and Latino children at risk for ASD.
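
As an illustration only, the idea of a computable phenotype that combines structured and unstructured EHR data can be sketched as a simple rule: flag a record when it carries an ASD diagnosis code, or when its free-text notes mention ASD often enough without a nearby negation. This is a minimal sketch under stated assumptions, not the project's method. The `Record` type, the keyword and negation lists, the same-sentence negation check, and the `min_mentions` threshold are all hypothetical; the only external facts used are the ICD-10-CM code F84.0 and the ICD-9-CM 299.0x codes for autistic disorder.

```python
import re
from dataclasses import dataclass, field

# Structured evidence: ICD-10-CM F84.0 and ICD-9-CM 299.0 / 299.00 / 299.01.
ASD_CODE = re.compile(r"^(F84\.0|299\.0[01]?)$")

# Hypothetical keyword and negation lists for free-text notes (assumptions).
NOTE_TERMS = re.compile(r"\b(autism|autistic|asd)\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(no|not|denies|without|ruled out)\b", re.IGNORECASE)


@dataclass
class Record:
    """Hypothetical, simplified view of one child's EHR data."""
    patient_id: str
    dx_codes: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)


def has_structured_evidence(rec: Record) -> bool:
    """True if any diagnosis code matches the ASD code pattern."""
    return any(ASD_CODE.match(code.strip().upper()) for code in rec.dx_codes)


def count_note_mentions(rec: Record) -> int:
    """Count ASD keyword mentions not preceded by a negation cue
    earlier in the same sentence (a deliberately crude heuristic)."""
    count = 0
    for note in rec.notes:
        for sentence in re.split(r"[.!?\n]+", note):
            for match in NOTE_TERMS.finditer(sentence):
                if not NEGATION.search(sentence[:match.start()]):
                    count += 1
    return count


def flag_for_review(rec: Record, min_mentions: int = 2) -> bool:
    """Flag when a diagnosis code is present, or when notes contain
    at least `min_mentions` non-negated ASD mentions."""
    return has_structured_evidence(rec) or count_note_mentions(rec) >= min_mentions


if __name__ == "__main__":
    example = Record(
        "demo",
        notes=["Mother concerned about autism.", "Referred for ASD evaluation."],
    )
    print(flag_for_review(example))
```

A rule like this would normally be checked against a gold standard set of verified diagnoses before use, and a learned risk model would replace the fixed threshold with features weighted from data.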

## Key facts

- **NIH application ID:** 10430153
- **Project number:** 1R21MH129682-01
- **Recipient organization:** UNIVERSITY OF SOUTHERN CALIFORNIA
- **Principal Investigator:** Amber M. Angell
- **Activity code:** R21
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $266,964
- **Award type:** 1
- **Project period:** 2022-03-14 → 2024-02-29

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10430153

## Citation

> US National Institutes of Health, RePORTER application 10430153, Using Machine Learning with Real-World Data to Identify Autism Risk in Children (1R21MH129682-01). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10430153. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
