# Improving AI/ML-readiness of FaceBase Research Datasets

> **NIH U01** · UNIVERSITY OF SOUTHERN CALIFORNIA · 2021 · $337,557

## Abstract

PROJECT SUMMARY
The FaceBase III Hub was created by the National Institute of Dental and
Craniofacial Research (NIDCR) as a data repository to serve the entire community of
dental and craniofacial researchers by sharing diverse data related to craniofacial development
and dysmorphia, as well as other research communities that can leverage the diverse data that
is in the FaceBase repository. One particularly unique and important element of FaceBase III is
that it has over 22,000 facial images from over 11,000 human subjects, many of which are
labeled with syndromes based on clinical and genomic diagnoses.
Facial images are a critical resource for studying the correlation between genotype and
phenotype and have received intense interest within the Artificial Intelligence (AI) and Machine
Learning (ML) research field with notable advances in automated phenotyping. While FaceBase
embraces the FAIR (Findable, Accessible, Interoperable, and Reusable) principles, there are
unique concerns specific to AI/ML research including: presence of noise, uncertainty of labels,
and bias within datasets. It is imperative that we remedy any limitations in the utility of
FaceBase’s facial imaging data for AI/ML research.
In this project, we propose to unlock the tremendous potential of FaceBase facial scans by
identifying gaps in how data is characterized, formatted, and preprocessed from the perspective
of its use in AI/ML research and algorithm development. To accomplish this, we propose to
initiate a pilot application that applies existing deep learning algorithms developed by
investigators in this proposal to existing FaceBase data (Aim 1). The goal of the pilot is to
identify how curation, organization and preparation of FaceBase data might be improved so as
to streamline their use in AI/ML-based investigations.
Based on what we learn from the pilot, we will modify the current FaceBase self-curation
processes specifically around Facial Scans (Aim 2). This will require us to streamline our
process associated with curation of human subject data, so that we have the necessary rich
descriptive elements while maintaining required restrictions on data handling.
Ultimately, the goal is to position the FaceBase Hub so that the existing facial scan resources
become more broadly useful to AI/ML researchers. More significantly, we expect to see an
increased availability of facial scan data and other associated data types, such as genotyping
and neurofunctional data. By making the proposed improvements to our data ingest procedures,
we anticipate that this proposal will allow FaceBase to scale to significantly larger data set sizes
and, consequently, to cement and expand its position as a unique resource for the broader
NIH community of ML and AI researchers.

## Key facts

- **NIH application ID:** 10412668
- **Project number:** 3U01DE028729-03S2
- **Recipient organization:** UNIVERSITY OF SOUTHERN CALIFORNIA
- **Principal Investigator:** Yang Chai
- **Activity code:** U01
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $337,557
- **Award type:** 3
- **Project period:** 2019-08-01 → 2022-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10412668

## Citation

> US National Institutes of Health, RePORTER application 10412668, Improving AI/ML-readiness of FaceBase Research Datasets (3U01DE028729-03S2). Retrieved via AI Analytics 2026-05-28 from https://api.ai-analytics.org/grant/nih/10412668. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
