Collaborative Research: The State of the State: Archival, Unstructured Data and Machine Learning


Abstract

This project uses machine learning to create a database of State of the State (SOTS) addresses from 1800 to 2016 and of state-level agendas. Data collection involves gathering and cleaning the full set of governors' speeches over time. The SOTS data are stored in publicly available data repositories and on a website developed by the PIs. Methodologically, the project advances the study of unstructured data and the use of artificial intelligence and machine learning. The data support knowledge and scholarship related to public decision making and provide a web resource for educators and journalists.

This project extends the SOTS dataset, which covers State of the State addresses from 1800 to 2016. The PIs collect, process, and analyze SOTS speeches from years prior to 1960, using techniques developed to handle poor-quality documents and implemented in software created by one of the PIs. The software applies machine learning to isolate, enhance, and extract text from hard-to-read documents, correcting document layout problems with a novel statistical approach before running optical character recognition (OCR). This yields significantly higher accuracy than other current approaches.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Key facts

NSF award ID: 2529209
Awardee: Washington University (MO)
SAM.gov UEI: L6NFUM28LQM5
PI: Daniel Butler
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: Artificial Intelligence (AI), Machine Learning Theory, UNDERGRADUATE EDUCATION, GRADUATE INVOLVEMENT
Estimated total: $225,431
Funds obligated: $225,431
Transaction type: Standard Grant
Period: 09/01/2025 → 08/31/2027