Interactions between high-dimensional probability and data science

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $239,966

Abstract

This project addresses theoretical challenges in high-dimensional probability, with a particular focus on those arising in data science. It aims to develop rigorous mathematical foundations for understanding the authenticity and privacy of synthetic data, tackling questions such as “What is artificial, mathematically?” and “How can we distinguish artificial data from real?” As a related aim, the project will broaden the reach of random matrix theory in data science by developing new geometric approaches to random matrices and random tensors.

By establishing a probabilistic framework for detecting synthetic data, the project will develop an adversarial classification model and characterize the regimes where artificial data can be reliably identified. This analysis will draw on connections to high-dimensional Gaussian geometry and convexity. To develop a mathematical framework for private synthetic data, the project will explore metric-based characterizations of the privacy-accuracy tradeoff, grounded in the methodology of high-dimensional probability. Furthermore, this project will advance non-spectral random matrix theory by developing and applying high-dimensional probability methods to study approximation numbers, general operator norms, and norms of the inverse of random matrices.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Key facts

NSF award ID: 2451011
Awardee: University of California-Irvine (CA)
SAM.gov UEI: MJC5FCYQTPE6
PI: Roman Vershynin
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: Artificial Intelligence (AI), Machine Learning Theory
Estimated total: $239,966
Funds obligated: $239,966
Transaction type: Standard Grant
Period: 07/15/2025 → 06/30/2027