Systems based on modern large language models (LLMs) play an increasing role in how users access information and compose text. For instance, a user running a web search increasingly relies on LLM-based systems to summarize the results rather than viewing individual web pages, and may use such systems to “talk to” long documents like financial reports rather than reading them in their entirety. To support these new paradigms, it is important that an LLM be able to generate responses that are factual, informative and safe. However, satisfying these criteria is not sufficient: a response should also be at the right level of abstraction or detail, in the right format, creative where appropriate, and aligned with other user needs. Current evaluation practice has neglected these subtler factors. This project proposes to address this shortcoming by identifying a set of “evaluation concepts” that name the kinds of areas where LLMs are failing, like “lack of detail in a list.” The project will then develop technology for automatically evaluating and improving LLM responses according to these concepts.

This project aims to improve the evaluation and the functionality of LLMs in two ways. First, the project will discover a concept taxonomy and learn how to evaluate LLM responses according to the concepts in that taxonomy. This process will necessitate advances in reward models, which are themselves LLMs, customized to reliably score responses. S