From Semi-supervised Learning to Prediction-powered Inference


Abstract

In today’s world, data are cheap, plentiful, and everywhere . . . except when they are not. The vast majority of books have been digitized, a substantial portion of the internet has been scraped, and low-cost biomedical technologies are available at a doctor’s (or patient’s) fingertips. But often the data needed for a particular real-world problem are much harder to access, due to cost or other constraints. This project will answer the question: how can the outputs of machine learning or artificial intelligence algorithms be used to augment limited datasets in order to draw meaningful statistical conclusions? Consider a setting where the target of inference is a functional of the joint distribution of X and Y, and n independent and identically distributed observations of (X,Y) are available. This research considers the following questions: under what circumstances, by how much, and how can additional observations for which we only have access to X (and not Y) improve inference? The investigative team will consider these questions first from a theoretical perspective, by establishing new semi-parametric efficiency results for semi-supervised learning (Project 1); then from a methodological perspective, by developing new and improved estimators for prediction-powered inference (PPI, Project 2); and finally from an applied perspective, by proposing PPI estimators of true positive rate, false positive rate, and area under the curve (Project 3). This award reflects NSF's st
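
To make the setting in the abstract concrete, the following is a minimal sketch of the basic prediction-powered estimator of a mean, E[Y], as it commonly appears in the PPI literature. It is not the estimators this project proposes. It assumes n labeled pairs (X, Y), N additional observations of X alone, and a pre-trained predictor f. The function names `ppi_mean` and `classical_mean`, the simulated data, and the deliberately biased predictor are all illustrative assumptions, not taken from the award.

```python
import numpy as np


def classical_mean(y_labeled):
    """Sample mean of the labeled outcomes and its standard error."""
    y = np.asarray(y_labeled, dtype=float)
    return y.mean(), np.sqrt(y.var(ddof=1) / y.size)


def ppi_mean(y_labeled, preds_labeled, preds_unlabeled):
    """Prediction-powered estimate of E[Y] and its standard error.

    The average prediction on the unlabeled data is corrected by the
    average prediction error ("rectifier") measured on the labeled data.
    """
    y = np.asarray(y_labeled, dtype=float)
    f_lab = np.asarray(preds_labeled, dtype=float)
    f_unl = np.asarray(preds_unlabeled, dtype=float)

    rectifier = y - f_lab  # prediction error on the labeled sample
    estimate = f_unl.mean() + rectifier.mean()

    # The two terms come from independent samples, so their variances add.
    variance = f_unl.var(ddof=1) / f_unl.size + rectifier.var(ddof=1) / y.size
    return estimate, np.sqrt(variance)


# Hypothetical simulation: n labeled points, N unlabeled points.
rng = np.random.default_rng(0)
n, N = 200, 20_000

x_lab = rng.normal(size=n)
y_lab = 2.0 * x_lab + rng.normal(scale=0.5, size=n)
x_unl = rng.normal(size=N)


def f(x):
    """Stand-in for a pre-trained model; biased on purpose."""
    return 2.0 * x + 0.3


est_classical, se_classical = classical_mean(y_lab)
est_ppi, se_ppi = ppi_mean(y_lab, f(x_lab), f(x_unl))

print(f"classical: {est_classical:.3f} (SE {se_classical:.3f})")
print(f"PPI:       {est_ppi:.3f} (SE {se_ppi:.3f})")
```

In this sketch the rectifier removes the predictor's bias, and the standard error shrinks only to the extent that the predictions track Y closely. With an uninformative predictor the PPI standard error can exceed the classical one, which is one way of reading the abstract's question of under what circumstances, and by how much, unlabeled observations help.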

Key facts

NSF award ID: 2514344
Awardee: University of Washington (WA)
SAM.gov UEI: HD1WMN6945W6
PI: Daniela Witten
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: Machine Learning Theory, Artificial Intelligence (AI)
Estimated total: $250,000
Funds obligated: $250,000
Transaction type: Standard Grant
Period: 09/15/2025 → 08/31/2028