CAREER: Kishu: Checkpointing Data Science with Nonintrusive State Manager

NSF Award Search · 01002829DB NSF RESEARCH & RELATED ACTIVIT · $600,000

Abstract

Today's data science systems, ranging from batch jobs to interactive interfaces, are surprisingly fragile. Data scientists typically use dozens of libraries, but a single bug in any of them can destroy hours or even days of computation, causing significant pain. This issue has been widely discussed in the data science community and academic literature. Yet no principled mechanisms have been proposed to address it, which might be puzzling to database researchers because existing databases implement checkpointing to periodically save changes in data for future recovery. Why haven't data science systems adopted checkpointing? What are the unique properties of data science systems that challenge the adoption? This project will answer these questions and bring checkpointing to data science systems with zero modifications to existing libraries and programs. If successful, this project can enable checkpointing, for the first time, in today's data science ecosystems. It will enable recovery from crashes, execution “undos”, suspension of cloud resources without data loss, and more.

This project first identifies a critical challenge: data science systems lack mechanisms for detecting changes in data, an important premise of checkpointing. Existing databases achieve this with centralized buffer pools. In contrast, data science systems intentionally omit centralized data spaces, allowing individual libraries to manage data using shared memory, GPUs, and remote machines for high performance.
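To make the change-detection premise concrete, the sketch below shows one naive way a checkpointer could find changed variables without a buffer pool: fingerprint each variable's serialized bytes and save only those whose fingerprint differs from the previous checkpoint. This is an illustration only, not the project's method. The class `NaiveCheckpointer` and its methods are hypothetical names, and the approach assumes every variable can be serialized with Python's `pickle`. That assumption may not hold for data kept in shared memory, on GPUs, or on remote machines, where the pickled bytes need not reflect the real state.

```python
import hashlib
import pickle


class NaiveCheckpointer:
    """Hypothetical helper: saves only variables whose pickled bytes changed."""

    def __init__(self):
        self.fingerprints = {}  # variable name -> digest at last checkpoint
        self.store = {}         # variable name -> pickled bytes (stand-in for disk)

    def checkpoint(self, namespace):
        """Save changed variables from `namespace`; return their names, sorted."""
        changed = []
        for name, value in namespace.items():
            try:
                blob = pickle.dumps(value)
            except Exception:
                # Assumption: unpicklable state is simply skipped here, which is
                # the kind of gap a real checkpointer could not tolerate.
                continue
            digest = hashlib.sha256(blob).hexdigest()
            if self.fingerprints.get(name) != digest:
                self.fingerprints[name] = digest
                self.store[name] = blob
                changed.append(name)
        # Forget variables that no longer exist in the namespace.
        for name in list(self.store):
            if name not in namespace:
                del self.store[name]
                del self.fingerprints[name]
        return sorted(changed)

    def restore(self):
        """Rebuild the namespace from the last saved bytes."""
        return {name: pickle.loads(blob) for name, blob in self.store.items()}


ckpt = NaiveCheckpointer()
state = {"rows": [1, 2, 3], "label": "raw"}
first = ckpt.checkpoint(state)   # both variables are new, so both are saved
state["rows"].append(4)          # in-place mutation of one variable
second = ckpt.checkpoint(state)  # only "rows" is detected as changed
restored = ckpt.restore()
```

Here `first` lists both variables and `second` lists only `rows`. The sketch still serializes every variable at every checkpoint just to learn whether it changed, so detection costs as much as a full save.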

Key facts

NSF award ID: 2440498
Awardee: University of Illinois at Urbana-Champaign (IL)
SAM.gov UEI: Y8CWNJRCNN91
PI: Yongjoo Park
Primary program: 01002829DB NSF RESEARCH & RELATED ACTIVIT
All programs: CAREER-Faculty Erly Career Dev, INFO INTEGRATION & INFORMATICS
Estimated total: $600,000
Funds obligated: $381,371
Transaction type: Continuing Grant
Period: 06/15/2025 → 05/31/2030