CIF: Small: Towards practical gradient coding

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $582,000

Abstract

Machine learning systems have made revolutionary advances in several areas, including (but not limited to) automated speech and image recognition, scientific discovery, human health, and national security. These advances have been made possible in large part by the training of high-capacity models that can capture and infer complex relationships within vast amounts of data, such as images, video, and speech. Such training is resource-intensive and failure-prone, and it typically requires large groups of computers that operate collaboratively to achieve the overall objectives. For instance, by conservative estimates, training a current state-of-the-art model for language understanding consumes enough energy to power over one thousand average US households for a year. Moreover, a rule of thumb in distributed computing states that "failures are the norm, rather than the exception". This project will investigate resource-efficient and fault-tolerant schemes for distributed model training in machine learning. Specifically, the training time depends on the reliability and speed of the computers and on the speed of communication between them. This project will examine techniques for simultaneously increasing both the reliability and the speed of the process. If successful, this will result in significant energy and monetary savings in scenarios where machine learning is routinely deployed. The ability to work with large-scale com

Key facts

NSF award ID: 2523473
Awardee: Iowa State University (IA)
SAM.gov UEI: DQDBM7FGJPC5
PI: Aditya Ramamoorthy
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: Machine Learning Theory, SMALL PROJECT, NETWORK CODING AND INFO THEORY, EXP PROG TO STIM COMP RES
Estimated total: $582,000
Funds obligated: $582,000
Transaction type: Standard Grant
Period: 07/01/2025 → 06/30/2028