In today’s world, data are cheap, plentiful, and everywhere... except when they are not. The vast majority of books have been digitized, a substantial portion of the internet has been scraped, and low-cost biomedical technologies are available at a doctor’s (or patient’s) fingertips. But the data needed for a particular real-world problem are often much harder to access, due to cost or other constraints. This project will answer the question: how can the outputs of machine learning or artificial intelligence algorithms be used to augment limited datasets in order to draw meaningful statistical conclusions? Consider a setting where the target of inference is a functional of the joint distribution of X and Y, and n independent and identically distributed observations of (X, Y) are available. This research considers the following questions: under what circumstances, by how much, and how can additional observations for which only X (and not Y) is available improve inference? The investigative team will consider these questions first from a theoretical perspective, by establishing new semi-parametric efficiency results for semi-supervised learning (Project 1); then from a methodological perspective, by developing new and improved estimators for prediction-powered inference (PPI, Project 2); and finally from an applied perspective, by proposing PPI estimators of true positive rate, false positive rate, and area under the curve (Project 3). This award reflects NSF's st