# Harnessing multimodal data to enhance machine learning of children’s vocalizations

> **NIH R01** · UNIVERSITY OF MIAMI CORAL GABLES · 2021 · $199,999

## Abstract

This Administrative Supplement proposes implementation of a multimodal data pipeline to support
machine learning of child language production in complex naturalistic environments. The Supplement
builds on the parent R01 (DC018542) that gathers objective, longitudinal data to capture the vocal
interactions of children with hearing loss (HL). Even with cochlear implantation, HL is a life-altering
condition with high social costs. Inclusion of children with HL and typically hearing (TH) peers in
preschool classrooms is a national standard, but it is not clear how early vocal interaction contributes to
the language development of children with HL and their TH peers. The parent R01 employs
computational models of child location and orientation to indicate when children are in social contact
with their peers and teachers. An additional strategy for pursuing the broad goals of the R01—
identifying interactive contexts in which children produce phonemically complex vocalizations and
interactive speech—is machine learning. Machine learning algorithms can determine the contextual,
individual, and interactive factors that predict children’s vocalizations and vocal interactions. However,
the parent R01 does not propose machine learning, nor are data disseminated in a format designed to
facilitate machine learning. To facilitate machine learning in the classroom, a rigorous diarization
process is required to determine speaker identity, which is operationalized as the likelihood that each
vocalization was spoken by a given child or teacher. We will integrate audio processing of each target
child and teacher’s first-person audio recording with processing of their interactive partners’ recordings.
The influence of partner recordings will be determined by their physical distance and orientation relative
to the target. This will yield a weighted speaker identification score for each vocalization. For 25% of the
sample, the algorithmic score will be compared to speaker identification provided by trained coders to
quantify intersystem reliability. Processed datasets will include 7,160 hours of multimodal recordings of
child and teacher movement in classrooms synchronized with continuously recorded, child- and
teacher-specific (first-person) audio recordings. De-identified output data will characterize vocalizations
with respect to algorithmically computed speaker identification probabilities, coder-identified speaker
identity (25% of sample), phonemic complexity and audio characteristics (e.g., fundamental frequency),
as well as the position and relative orientation of all individuals in the classroom, and child
demographics (including characterizations of HL). Over the course of the supplement, output data,
Python processing code, and metadata descriptions of the processing pipeline will be disseminated in
dedicated distribution portals including GitHub, Kaggle, and the UCI repository. Recordings will be
released to certified investigators via NIH-f...
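
The weighted speaker identification score described in the abstract can be illustrated with a minimal Python sketch. This is not the project's pipeline. The record schema (`PartnerEvidence`), the exponential distance decay, the cosine orientation term, the `distance_scale_m` default, and the choice of Cohen's kappa as the reliability statistic are all assumptions made for illustration; the abstract states only that partner recordings are weighted by distance and orientation and that algorithmic scores are compared with trained coders.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class PartnerEvidence:
    """Hypothetical record: what one partner's recorder says about a vocalization."""
    speaker_id: str
    audio_likelihood: float  # in [0, 1], from the partner's own first-person audio
    distance_m: float        # distance from the target wearer
    angle_deg: float         # 0 = facing the target, 180 = facing away


def proximity_weight(distance_m, angle_deg, distance_scale_m=1.5):
    """Assumed weighting: exponential decay with distance times a cosine orientation term."""
    distance_term = math.exp(-distance_m / distance_scale_m)
    orientation_term = (1.0 + math.cos(math.radians(angle_deg))) / 2.0
    return distance_term * orientation_term


def speaker_scores(target_id, target_likelihood, partners):
    """Return normalized scores per candidate speaker for one vocalization."""
    weighted = {target_id: target_likelihood}  # the target's own recording has weight 1
    for p in partners:
        weight = proximity_weight(p.distance_m, p.angle_deg)
        weighted[p.speaker_id] = p.audio_likelihood * weight
    total = sum(weighted.values())
    if total == 0:
        return {speaker: 0.0 for speaker in weighted}
    return {speaker: value / total for speaker, value in weighted.items()}


def cohen_kappa(algorithm_labels, coder_labels):
    """Chance-corrected agreement between algorithmic and coder-assigned speakers."""
    n = len(algorithm_labels)
    observed = sum(a == c for a, c in zip(algorithm_labels, coder_labels)) / n
    counts_a = Counter(algorithm_labels)
    counts_c = Counter(coder_labels)
    expected = sum(counts_a[k] * counts_c[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    partners = [
        PartnerEvidence("peer_1", audio_likelihood=0.8, distance_m=0.5, angle_deg=10),
        PartnerEvidence("teacher_1", audio_likelihood=0.8, distance_m=4.0, angle_deg=150),
    ]
    print(speaker_scores("target_child", 0.6, partners))
    print(cohen_kappa(["child", "child", "teacher"], ["child", "teacher", "teacher"]))
```

Under these assumptions, a nearby partner facing the target competes strongly with the target for attribution, while a distant partner facing away contributes almost nothing. Any real weighting function would need to be fit and validated against the coder-labeled 25% of the sample.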

## Key facts

- **NIH application ID:** 10411575
- **Project number:** 3R01DC018542-01A1S1
- **Recipient organization:** UNIVERSITY OF MIAMI CORAL GABLES
- **Principal Investigator:** DANIEL S MESSINGER
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $199,999
- **Award type:** 3
- **Project period:** 2021-02-01 → 2026-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10411575

## Citation

> US National Institutes of Health, RePORTER application 10411575, Harnessing multimodal data to enhance machine learning of children’s vocalizations (3R01DC018542-01A1S1). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10411575. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
