CAREER: ProTrain: Enabling Efficient Large Language Model Training via Performance, Energy, and Reliability Co-optimizations

NSF Award Search · 01002627DB NSF RESEARCH & RELATED ACTIVIT · $533,628

Abstract

The rapid growth of large language models (LLMs) has enabled major advances in artificial intelligence (AI), including systems that assist with writing, coding, education, and decision-making. However, training these models demands enormous computing resources, creating significant challenges across multiple dimensions, including model quality, training time, energy efficiency, and reliability. Although many optimization techniques have been proposed, most focus on only one or a few aspects of training, leaving their overall impact on total training efficiency unclear. This project addresses this gap by developing a systematic understanding of the trade-offs among existing optimization strategies and by delivering a quantitative efficiency model that enables informed, cost-aware decision making for LLM training. In addition, the project advances optimization methods in underexplored areas, particularly energy efficiency and reliability. The anticipated outcomes will promote more sustainable computing practices, strengthen national competitiveness in AI, and support applications that advance economic growth, education, national security, and public services. The project also establishes an integrated education program to support workforce development and expand participation in advanced computing and the AI industry.

This project develops a unified framework for analyzing and optimizing LLM training efficiency across performance, energy consumption, reliability, and model quality, addressing the growing gap between the unprecedented resource demands of LLM training and the limitations of existing optimization approaches. The research comprises three primary components. First, it develops a novel efficiency model that integrates performance, energy, reliability, and quality optimizations to enable holistic decision making for large-scale training systems. Second, it designs a mathematically grounded, checkpoint-free fault tolerance mechanism that improves error det

Key facts

NSF award ID: 2540555
Awardee: University of Oregon Eugene (OR)
SAM.gov UEI: Z3FGN9MF92U2
PI: Jieyang Chen
Primary program: 01002627DB NSF RESEARCH & RELATED ACTIVIT
All programs: Artificial Intelligence (AI), CAREER-Faculty Erly Career Dev, Microelectronics and Semiconductors, COMPUTER ARCHITECTURE
Estimated total: $533,628
Funds obligated: $308,232
Transaction type: Continuing Grant
Period: 09/01/2026 → 08/31/2031