Prospective analysis to determine model accuracy performance and boundaries in the post-AlphaFold2 environment

NIH RePORTER · NIH · R01 · $297,890

Abstract

Experimental determination of protein structure often provides atomic-accuracy models, but is inherently time-consuming, costly, and not always possible. Computational modeling offers an alternative. Recent application of deep learning methods to structure modeling now allows the generation of highly accurate models of structure for the great majority of single proteins. Progress in the accuracy of modeling protein complexes is also expected in the near term. These developments have dramatically expanded the utility of models in biology and medicine. However, having even high-accuracy models at one’s disposal is not the same as having an experimental result. While experimental structures come with established norms and procedures to ascertain accuracy, models are at best annotated with predictions of accuracy. The increased use of models underscores the need for similar knowledge, norms, and procedures for models. The Critical Assessment of Structure Prediction (CASP, funded by the parent grant) is already focused on assessing the performance of modeling techniques and of methods for estimating model accuracy, and to this end, extensive additional data are currently being acquired as part of the CASP15 experiment. But a detailed assessment of accuracy and accuracy prediction properties under an adequate range of conditions requires more extensive information than will be available through traditional CASP procedures. The release of close to a million models obtained with the AlphaFold2 method from DeepMind has created an opportunity to perform such an analysis. We will use these data to perform a prospective analysis of model accuracy and accuracy prediction performance. Specifically, we will compare structures released by the PDB with the corresponding previously deposited models.
We will assess the agreement between these experimental and computed structures, both overall and at the individual amino acid level, and the agreement between the predicted and actual accuracy. Critically, we will examine these factors as a function of the relevant variables, particularly the quality and type of the experimental information, environmental effects such as crystal packing, function-related structural features, rare structural features, protein interface regions, sequence length, and the depth of the available amino acid sequence alignment used to generate the model. We will also build tools to analyze and visualize the relationships and interdependencies among these data, to facilitate and expand our understanding of model accuracy performance. The overall goal of this work is to provide a fine-grained landscape of the accuracy obtained with current modeling methods, and confidence limits on accuracy in each situation. Although the present analysis is restricted to AlphaFold2 models, identification of the most important variables should allow us to apply the procedures to the range of methods represented in the smaller data sets obta...

Key facts

NIH application ID: 10672042
Project number: 3R01GM100482-11S1
Recipient: UNIVERSITY OF CALIFORNIA AT DAVIS
Principal Investigator: KRZYSZTOF A FIDELIS
Activity code: R01
Funding institute: NIH
Fiscal year: 2022
Award amount: $297,890
Award type: 3
Project period: 2012-06-15 → 2025-05-31