The rapid growth of large language models (LLMs) has enabled major advances in artificial intelligence (AI), including systems that assist with writing, coding, education, and decision-making. However, training these models demands enormous computing resources, creating significant challenges across multiple dimensions, including model quality, training time, energy efficiency, and reliability. Although many optimization techniques have been proposed, most focus on only one or a few aspects of training, leaving their overall impact on total training efficiency unclear. This project addresses this gap by developing a systematic understanding of the trade-offs among existing optimization strategies and by delivering a quantitative efficiency model that enables informed, cost-aware decision-making for LLM training. In addition, the project advances optimization methods in underexplored areas, particularly energy efficiency and reliability. The anticipated outcomes will promote more sustainable computing practices, strengthen national competitiveness in AI, and support applications that advance economic growth, education, national security, and public services. The project also establishes an integrated education program to support workforce development and expand participation in advanced computing and the AI industry.

This project develops a unified framework for analyzing and optimizing LLM training efficiency across performance, energy consumption, reliability, and model quality, addressing the growing gap between the unprecedented resource demands of LLM training and the limitations of existing optimization approaches. The research comprises three primary components. First, it develops a novel efficiency model that integrates performance, energy, reliability, and quality optimizations to enable holistic decision-making for large-scale training systems. Second, it designs a mathematically grounded, checkpoint-free fault-tolerance mechanism that improves error det