Scientific findings should come with error rates that mean what they say: among findings assigned a 5 percent chance of error, about 5 in 100 should turn out to be wrong. This standard, called calibration, underlies trusted probability claims from weather forecasting to machine learning, but it is not yet a routine part of the statistical tools used in many large-scale scientific studies. The issue arises whenever researchers must triage long lists of possible discoveries, anomalies, or published claims. In metascience, the question is which findings in the literature will replicate; in AI safety, which suspicious model inputs deserve greater scrutiny. Current methods control the average error rate across an entire list of discoveries, but they rarely provide individual findings with calibrated error probabilities. This award supports research on calibrated hypothesis testing, which will develop methods that distinguish strong evidence from borderline evidence with interpretable, rigorous guarantees. The work will support more reproducible science and safer data-driven systems, while training graduate researchers, developing new instructional materials, and releasing open-source software.

This project will develop theory and methodology for calibrated, large-scale inference. The framework draws upon probabilistic forecasting but addresses a distinct challenge: unlike forecasting, where labels are eventually observed, in multiple testing the ground truth is never revealed,