Corpus phonetics and speech technology infrastructure

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $101,674

Abstract

Speech technology, including artificial intelligence (AI) trained on speech data, performs poorly in cases where little or no recorded audio data exists to train the required AI models. Building better speech technology in these cases requires creating collections of speech materials and their transcriptions. However, transcription is immensely time-consuming without the assistance of existing AI technologies. This project builds a high-quality speech data set to enable phonetics and phonology research for several low-data languages, and to model an approach to ease the “transcription bottleneck” assisted by techniques in AI and natural language processing (NLP). The project jointly engages the expert perspectives of users of target languages, linguists, and computer scientists, and establishes an infrastructure for collaborative, computationally mediated language work. Other benefits to society include bridging laboratory-style research and real-world applications and providing innovative educational opportunities for trainees.

This project builds a 60-hour corpus of naturalistic and read speech data recorded in the field, suitable for both AI/NLP applications and research in acoustic phonetics and phonology. Unsupervised or weakly supervised machine learning techniques are used to semi-automatically transcribe and annotate a portion of the speech corpus. This transcription and annotation process uses a novel human-in-the-loop approach making direct use of expert speaker

Key facts

NSF award ID: 2438916
Awardee: SUNY at Buffalo (NY)
SAM.gov UEI: LMCJKRFW5R81
PI: Matthew Faytak
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: Artificial Intelligence (AI), LINGUISTICS, DLI-Dyn Language Infrastructure, GRADUATE INVOLVEMENT, SCIENCE, MATH, ENG & TECH EDUCATION
Estimated total: $101,674
Funds obligated: $101,674
Transaction type: Standard Grant
Period: 09/01/2025 → 08/31/2027