Best Practices for Using Data Generated by AI or Machine Learning

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $307,273

Abstract

Empirical analysts now routinely generate new data by deploying artificial intelligence (AI) or machine learning (ML) algorithms on large, unstructured data sets. Examples include quantifying sentiment or uncertainty in news text using large language models or natural language processing methods; measuring product characteristics from review text and product images on online platforms; and imputing missing variables from demographic information. In standard practice, AI- or ML-generated data are treated as if they were regular numerical data for the purposes of data analysis. However, this standard approach can introduce bias in parameter estimates and lead to invalid conclusions. This project develops new econometric methods and statistical theory to inform best practice for empirical researchers using these data. The research improves the quality of data analysis performed by businesses, non-profits, and government organizations. Its interdisciplinary nature helps to forge connections between academia, policy makers, and industry.

The research improves the validity of empirical research using AI- and ML-generated data. The project contributes novel econometric methods for working with data generated by AI and ML algorithms that correct the bias and inference problems present in current empirical practice. The methods are rigorously justified with new statistical theory for AI- and ML-generated data. A key contribution is the development of novel a
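
The claim that treating AI- or ML-generated data as regular numerical data can bias parameter estimates can be illustrated with a small simulation. The sketch below is not the project's method. It assumes one simple mechanism, classical measurement error, in which a hypothetical ML-generated proxy equals the true variable plus independent prediction error. All names and parameter values (`simulate`, `noise_sd`, `beta`) are illustrative assumptions.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


def simulate(n=20000, beta=2.0, noise_sd=1.0, seed=0):
    """Compare the slope from the true variable with the slope from a noisy proxy."""
    rng = random.Random(seed)
    # True latent variable (for example, "true" sentiment); not observed in practice.
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Hypothetical ML-generated proxy: truth plus independent prediction error
    # (an assumption made for illustration only).
    x_ml = [x + rng.gauss(0.0, noise_sd) for x in x_true]
    # The outcome depends on the true variable, not on the proxy.
    y = [beta * x + rng.gauss(0.0, 1.0) for x in x_true]
    return ols_slope(x_true, y), ols_slope(x_ml, y)


slope_true, slope_ml = simulate()
print(f"slope using true variable: {slope_true:.3f}")
print(f"slope using ML proxy:      {slope_ml:.3f}")
```

Under these assumptions the proxy-based slope is pulled toward zero by a factor of roughly var(x) / (var(x) + var(error)), which is one half for the default settings. Real ML-generated data can have prediction errors that are correlated with the truth or with other variables, so the direction and size of the bias need not follow this simple pattern.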

Key facts

NSF award ID: 2521471
Awardee: Yale University (CT)
SAM.gov UEI: FL6GV84CKN57
PI: Timothy Christensen
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: Artificial Intelligence (AI), GRADUATE INVOLVEMENT
Estimated total: $307,273
Funds obligated: $307,273
Transaction type: Standard Grant
Period: 08/15/2025 → 07/31/2028