III: Medium: SMARTCAT: Developing Smart Data Catalogs for Data Science and AI

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $1,000,000

Abstract

The world has become data-driven. Organizations, such as companies, domain sciences, and government agencies, increasingly have numerous datasets scattered across many locations. When starting a data science or AI project, users often must find specific datasets, then analyze them to extract insights. However, finding the needed datasets among a “sea of datasets” is often very difficult, so organizations increasingly use data catalogs for this purpose. A data catalog stores the names, descriptions, and other characteristics of datasets, as well as the relationships among them. Users can then query the catalog to find desired datasets. As such, data catalogs have become a critical enabler for data science and AI projects. Yet the state of the art in catalog development has remained quite limited, leading to underwhelming performance that falls short of users’ needs. In particular, not enough attention is devoted to the “pain points” of catalog users, and there is very little interaction among the research, vendor, user, and open-source tool communities. This has negatively impacted users, especially in domain sciences, with anecdotal evidence of intensive manual work to construct catalogs. This project seeks to address these limitations by first developing innovative and practical solutions for several pain points of catalog users, thereby accelerating research on these critical topics. Second, the project will combine these solutions to build SmartCat, a catalog software, and

Key facts

NSF award ID: 2504787
Awardee: University of Wisconsin-Madison (WI)
SAM.gov UEI: LCLSJAGTNZQ7
PI: AnHai Doan
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: INFO INTEGRATION & INFORMATICS, MEDIUM PROJECT
Estimated total: $1,000,000
Funds obligated: $1,000,000
Transaction type: Standard Grant
Period: 07/01/2025 → 06/30/2029