Collaborative Research: Statistical Foundations for Scalable and Robust Data Valuation

NSF Award Search · 01002627DB NSF RESEARCH & RELATED ACTIVIT · $150,000

Abstract

High-quality data are increasingly central to modern machine learning and artificial intelligence, enabling advances in scientific discovery, automated decision-making, and emerging AI technologies. Yet transparent and reliable mechanisms for crediting and compensating those who contribute data used to train AI systems are often lacking. This project will develop statistical and machine-learning methods for measuring the value of data in AI model training and data-driven decision systems. The work addresses fundamental challenges in data valuation, including robustness to strategic manipulation, computational scalability for large-scale learning systems, and principled uncertainty quantification in assigning value to data contributions. The outcomes of this project will support transparent, fair, and sustainable AI data ecosystems while improving incentives for sharing high-quality and socially beneficial data. The project will also support graduate and undergraduate training, development of educational materials, public dissemination of results, and open-source software for the broader AI and data science communities.

The research will develop statistical foundations for scalable and robust Shapley-value-based data valuation in modern machine learning through three integrated directions. First, it will develop priority-aware valuation rules that incorporate precedence relationships and priority weights, enabling originality, provenance, and individual risk consider

Key facts

NSF award ID: 2610423
Awardee: Carnegie Mellon University (PA)
SAM.gov UEI: U3NKNFLNQ613
PI: Weijing Tang
Primary program: 01002627DB NSF RESEARCH & RELATED ACTIVIT
All programs: Artificial Intelligence (AI), Machine Learning Theory
Estimated total: $150,000
Funds obligated: $150,000
Transaction type: Standard Grant
Period: 07/01/2026 → 06/30/2029