Applied common sense
The why, what and how of validation
Gerard J. Kleywegt
Protein Data Bank in Europe (pdbe.org)
EMBL-EBI, Cambridge, UK
Winter School on Structural Biology, CEITEC, Brno, 13 February 2015

What is validation?

Validation according to the dictionary
• Validation = establishing or checking the truth or accuracy of (something)
  • Theory
  • Hypothesis
  • Model
  • Assertion, claim, statement
• Integral part of scientific activity!
• “Science is a way of trying not to fool yourself. The first principle is that you must not fool yourself, and you are the easiest person to fool.” (Richard Feynman)

Critical thinking
• Essential “24/7” skill for every scientist
• And, in fact, for every non-scientist too
• Important aspect of validation

Critical thinking
• What is wrong here?
  • The tacR gene regulates the human nervous system
  • The tacQ gene is similar to tacR but is found in E. coli
  • ==> The tacQ gene regulates the nervous system in E. coli!
• And here?
  • “The tetramer has a total surface area of 81,616Å²” (Implies: ±0.5Å² …)

Validation = critical assessment
• How good is my model, really?
• At the very least:
  • Does it explain all the data that I used?
  • Does it explain all the prior knowledge that I had?
• More importantly:
  • Does my model explain all the data that I didn’t use?
  • Does my model explain all the prior knowledge that I didn’t use?
  • Is my model the best possible, most parsimonious explanation for the data?
  • Are the testable predictions based on my model correct?
• If any of these questions is answered with “no”, you have a problem!
Occam’s razor
Popper’s falsifiability principle

The why of validation

Crystallography is great!
And NMR, 3DEM, SAS etc. too, of course!
• Crystallography can provide important biological insight and understanding!

Crystallography is great!
• Crystallography can result in an all-expenses-paid trip to Stockholm (albeit in December)!
Nightmare before Christmas
… but sometimes we get it horribly wrong

Why do errors survive?
• “Why do errors make it into the literature and the PDB?”
• Suggestions from students
  • Cold Spring Harbor course, 2005
  • Copenhagen University course, 2006
Who/what do YOU think is to blame?

Playing the Blame Game …
• Crystallographer
  • ignorance, lack of experience, incompetence, incorrect preconceptions/bias, cheating, laziness, “science by mouse-click”, stress, can’t be bothered to fix minor problems, no validation
• PI
  • pressure to publish/graduate fast, career interest, competition/scoops, grant writing, insufficient supervision
• Referees/editors
  • lazy, inadequate reviewing routines, no access to raw data, “validation by senior author name”, lack of experience
• Software
  • misses or causes errors
• PDB
  • doesn’t check
• “Nature”
  • limitations of the technique/resolution, errors hard to detect, poor data

Why do crystallographers make mistakes?
• Limitations to the data
  • Incomplete
  • Weak
  • Limited resolution
  • Space and time averaged
  • Phase errors
• The human factor
  • Subjectivity and bias involved in map interpretation and refinement (even at atomic resolution!)
  • Inexperienced people do the work, use of black boxes, …
  • Not everybody is a good chemist
  • Even experienced people make mistakes
Kleywegt, Acta Cryst. D65, 134 (2009)

Crystallographer = Super(wo)man?
• The crystallographer ideally has
  • Knowledge of the history of the sample
  • Knowledge of the biology of the system
  • Knowledge of chemistry
  • Knowledge of physics
  • Understanding of data collection and processing
  • Understanding of the refinement process and software
  • Experience in map interpretation (preferably with a range of resolutions, space groups, etc.)
  • Read and remembered all the relevant literature
  • …
(Wayne Hendrickson)

The odds are stacked against us
• Crystallographers produce models of structures that will contain errors
  • High resolution AND skilled crystallographer → probably nothing major
  • High resolution XOR skilled crystallographer → possibly nothing major
  • NOT (High resolution OR skilled crystallographer) → pray for nothing major

“I know the human being and fish can coexist peacefully”

A little experiment
• Hypothesis: “If a card has a vowel on one side, then it has an even number on the other side”
• Validate this hypothesis by turning as few cards as possible
• How many, and which, cards must you turn?
Wason selection task

Confirmation bias
• A scientific model is a hypothesis to be shot down
• We should be looking for disconfirming evidence
• But we often don’t! We tend to look for supporting evidence
  • Reasonable expectation to find a ligand + Any old density blob in a reasonable ligand-binding site => Model the ligand!
  • Even if it isn’t really there…
  • Conversely: we don’t expect a ligand, so we model waters

“Believing is seeing…”
Retracted “ligand complex” published in Nature

“A philosopher is a blind man in a dark room looking for a black cat that isn’t there”
“A crystallographer is the man who finds it”
Paraphrasing HL Mencken

Xtallography ≠ exact science
• Crystallographic models will contain errors
  • Crystallographers need to fix errors (if possible)
  • Users need to be aware of potentially problematic aspects of the model
  • Note: every crystallographer is also a user!
• Validation is important
  • Is the model as a whole reliable?
  • How about the bits that are of particular interest?
    • Active-site residues
    • Interface residues
    • Ligand, inhibitor, co-factor, …

Why don’t people admit to their errors easily?
• To err is human
  • But so is denying that you erred
  • In some cases, “retraction battles” have raged for years
• Cognitive dissonance - discomfort caused by conflicting views of self
  • “I am an intelligent, hard-working scientist who makes good decisions”
  • “There is an error in my structure”
• How to resolve this discomfort?

Cognitive dissonance – ways of coping
• (1) Self-justification/denial/passing the buck
  • “There’s nothing wrong with it”
  • “It doesn’t change the conclusions”
  • “Everybody makes those kinds of errors”
  • “It’s really a matter of interpretation”
  • “It’s probably low occupancy/high mobility”
  • “There is strain in the active site”
  • “It fits other data/my chemical intuition”
  • “It was my student’s first structure”
  • “Legacy software changed the signs of ΔFanom”
• (2) Depression – no need for that!
• (3) Acceptance/reconciliation – the grown-up thing to do
  • “I made an error, I’ll fix it and learn from it”
  • Still an intelligent, hard-working scientist!
  • Doing yourself and science a favour

Proceedings of the CCP4 Study Weekend. Accuracy and Reliability of Macromolecular Crystal Structures (1990)
(David Eisenberg)

Cognitive dissonance in action
• Single N-C bonds of 1.1 and 1.6Å
• Non-bonded C…C contact of 2.0Å
• PO3 moiety separated by 2.7Å from O

THE LIGAND N5G IN THIS ENTRY IS N5-IMINIUM PHOSPHATE. HOWEVER,
THERE IS SOME DISCREPANCY IN THE GEOMETRY. THE GEOMETRY FOR N5G
IS SUGGESTED BY THE REFINEMENT. THE CO-ORDINATES FIT WELL IN THE
ELECTRON DENSITY MAP. THE MAP WAS GENERATED USING A DATASET
COLLECTED AT 2.8 ANGSTROM RESOLUTION. THE DENSITY FOR THE LIGAND
IS UNAMBIGUOUS AND THEREFORE THE GEOMETRIES ARE CORRECT AND ARE
AS THEY WOULD BE IN A BIOLOGICAL MOLECULE, WHERE THE MICRO
ENVIRONMENT HAS A PROFOUND INFLUENCE ON THE GEOMETRIES OF THE
LIGAND.

The experimental “evidence”
“Evidence that molecular-orbital theory breaks down in the presence of a protein crystallographer” (K. Henrick)
pdbe.org/valrep/3hy4

Errors and validation
• We need to take the drama out of the whole issue of errors and validation
• “When a friend makes a mistake, the friend remains a friend and the mistake remains a mistake” (S. Peres)
• Lao Tzu (more than 2500 years ago):
  A great nation is like a great man:
  When he makes a mistake, he realises it
  Having realised it, he admits it
  Having admitted it, he corrects it
  He considers those who point out his faults as his most benevolent teachers.

What kinds of errors do crystallographers make?

Errors in protein structures
• Brändén & Jones (1990)
  • Mistracing an entire molecule or domain
  • Register errors
  • Local errors in the main chain
  • Sidechain errors
Kleywegt, Acta Cryst. D56, 249 (2000)

Example of a tracing error
1PHY (1989, 2.4Å, PNAS)
2PHY (1995, 1.4Å)
Entire molecule traced incorrectly

Example of a tracing error
1FZN (2000, 2.55Å, Nature)
2FRH (2006, 2.6Å)
- One helix in register, two helices in place, rest wrong
- 1FZN obsolete, but complex with DNA still in PDB (1FZP)

What are register errors?
• For a segment of a model, the assigned sequence is out-of-register with the actual density

Example of a register error
• 1CHR (light; 3.0Å, 1994, Acta D) vs. 2CHR (dark)

Example of a register error
1ZEN (green carbons), 1996, 2.5Å, Structure
1B57 (gold carbons), 1999, 2.0Å

1B57 (A) ---SKIFDFVKPGVITGDDVQKVFQ
             .. .......... |||||||
1ZEN (_) SKI-FD-FVKPGVITGD-DVQKVFQ
(. = ALIGN, | = ID)

Confirmed by iterative build-omit maps (Tom Terwilliger et al., 2008)

Problems with ligands

Reasonable assumptions?
• Typical assumptions
  • We know what the ligand is
  • The modelled ligand was really there
  • We didn’t miss anything important
  • The observed conformation is reliable
  • At high resolution we get all the answers
  • The H-bonding network is known
  • We can trust the waters
  • We are good chemists
  • (The complex structure is relevant for drug design)

A case of mistaken identity…
3OEG – bacteriochlorophyll-a
3VDI – PEG fragment and waters
Tronrud & Allen, Photosynth. Res. 112, 71 (2012)

The ligand is really there?
(J. Amer. Chem. Soc., August 2002)
Dude, where’s my density?
1FQH (2000, 2.8Å, JACS)

We didn’t miss anything?
Conundrum!
2GWX (1999, 2.3Å, Cell)
Oh, that ligand!
2BAW (2006, same data!)

Small-molecule anomalies
• 3-Phenylpropylamine in 1TNK, 1994, 1.8Å, Nature Struct. Biol.
• Aromatic carbon in between planar (0°) and pyramidal (35°) … 17°

Oops-a-daisy!
• COA = coenzyme A
• 2.25Å, R 0.25/0.28, Mol. Cell
• Deposited 2003
• Non-bonded contacts as close as 0.54Å
• Bond lengths up to 6.7Å
• Bond angles as low as 18°
• Impropers of 160°

Validation of PDB ligand structures by CCDC
• 16% of PDB entries deposited in 2006 had ligand geometries that were almost certainly in significant error (in-house analysis using Relibase+/Mogul)
• The good news - for structures before 2000 the figure was 26%

              Wrong   Plausible   Not unusual
  Pre 2000    26%     34%         40%
  2006        16%     29%         55%

(Jana Hennemann & John Liebeschuetz)
Liebeschuetz et al., J. Comput. Aid. Mol. Des. 26, 169 (2012)

High resolution reveals all?
• Even at very high resolution there are sources of subjectivity and ambiguity
  • How to model temperature factors?
  • Is a blob of density a water or not?
  • How to model alternative conformations?
  • How to interpret density of unknown entities?
  • How to tell C/N/O apart?

The 22nd amino acid @ 1.55Å
Sodium chloride
Ammonium sulfate
(Hao et al., 2002; PDB entries 1L2Q and 1L2R)

The what of validation

How do we generate new knowledge?
[Diagram: Curiosity, New questions, Prior knowledge, Experiment, New data, Synthesis and interpretation, New model or hypothesis, Predictions]

Errors affect measurements
• Random errors (noise)
  • Affect precision
  • Usually normally distributed
  • Reduce by increasing the number of observations
• Systematic errors (bias)
  • Affect accuracy
  • Incomplete knowledge or inadequate design
  • Reproducible
• Gross errors (bloopers)
  • Incorrect assumptions, undetected mistakes or malfunctions
  • Sometimes detectable as outliers

Errors affect measurements
• How tall is Gerard?
  • 200 203 202 203 202 201 203 80
• Random error
• Systematic error
• Gross error

Anisotropic model of Gerard

What can go wrong?
[Diagram: Curiosity, New questions, Prior knowledge, Experiment, New data, Synthesis and interpretation, New model or hypothesis, Predictions]
Sod’s Law (a.k.a. Murphy’s Law)

Various kinds of validation
[Diagram: Curiosity, New questions, Prior knowledge, Experiment, New data, Synthesis and interpretation, New model or hypothesis, Predictions, Unused knowledge, Unused data]
This model of hypothesis validation is entirely general for experimental sciences
How does it apply to protein crystallography?

The how of validation

What is a good model?
• A good model makes SENSE in all respects!

Various kinds of crystal structure validation
[Same diagram, annotated with:]
Geometry, Stereo-chemistry, Close contacts, Sequence, Chemical structure, Biosynthetic pathways, …

Various kinds of crystal structure validation
[Same diagram, annotated with:]
R-value, Real-space fit, B-values, ksol, …

Various kinds of crystal structure validation
[Same diagram, annotated with:]
Rfree, Binding data, Mutant data, Conserved residues, Heavy-atom sites, SAXS envelope, …

Various kinds of crystal structure validation
[Same diagram, annotated with:]
Ramachandran, Rotamers, Environments, …

Various kinds of crystal structure validation
[Same diagram, annotated with:]
Falsifiable hypotheses

Validation in a nutshell
• Compare your model to the experimental data and to the prior knowledge. It should:
  • Reproduce knowledge/information/data used in the construction of the model
    • R, RMSD bond lengths, chirality, …
  • Predict knowledge/information/data not used in the construction of the model
    • Rfree, Ramachandran plot, packing quality, …
• Global and local
• Model alone, data alone, fit of model and data
• … and if your model fails to do this, there had better be a plausible explanation!

What is “the PDB” doing about validation?
SOMETHING IS WRONG IN THE PDB!

What is “the PDB”?
wwPDB
wwpdb.org

wwPDB partnership
• Collaborate on “data in”
  • Policy issues
  • Weekly releases
  • Validation standards
  • Format specifications
  • Chemical Component Dictionary
  • Deposition and annotation procedures
  • Archive quality and remediation
  • Journal interactions
  • Community interactions
• Friendly competition on “data out”
  • Serving PDB data with added-value
  • PDB-based services
  • Other services, resources and activities
wwpdb.org

Validation addresses important questions
• Entry-specific validation (quality control)
  • Is this model ready for archiving and publication?
  • Is this model a faithful, reliable and complete interpretation of the experimental data?
  • Are there any obvious errors/problems?
  • Are the conclusions drawn in the paper justified by the data?
  • Is this model suitable for my application?
• Archive-wide validation (comparative)
  • Is this model a better interpretation of the data?
  • What is the best model for this molecule/complex to answer my research question?
  • Which models should I select/omit when mining the PDB?

Validation by wwPDB - advantages
• Applies community-agreed methods uniformly
• Improves the quality and consistency of the PDB archive
• Supports editors and referees
• Helps users assess if an entry is suitable
• Helps users compare related entries
• Enables identification of outliers when mining the PDB
• Stimulates adoption of better protocols by the community

The future of validation
• wwPDB X-ray Validation Task Force
[Figure labels: Archive-wide analysis; Percentile scores; PDF report for depositor & referees; Statistics and plots for the entry, per chain, per residue, and list of unusual features]
X-ray VTF: Read et al., Structure 19, 1395 (2011)

wwPDB X-ray validation pipeline
[Diagram: Validation pipeline 1.0. Components: Deposited data (coordinates & reflections); MolProbity; EDS; Xtriage; Mogul; Percentiles; Distributions; External reference files (e.g., Engh & Huber); PDF maker; Validation XML file]
Gore et al., Acta Cryst. D68, 478 (2012)

What does it mean for a crystallographer?
• There are three uses of the validation pipeline
  • At deposition time
    • Not all checks can be run, e.g. some sequence and ligand checks
    • Report for depositor
  • At annotation time
    • Complete validation report, also suitable for editors/referees
  • Independently of deposition
    • Anonymous web-based server to use on models not (yet) in the PDB
    • Not all checks can be done
    • Will be developed once the production pipeline is up and running
• Will not be available as a stand-alone software package

Validation reports
• Front cover
  • Deposition info
  • Software info
• wwpdb.org/validation-reports.html
• wwpdb.org/validation-servers.html
pdbe.org/valrep/1cbs

Validation reports
• Summary
  • Quality vs. all PDB X-ray
  • Quality vs. entries at similar resolution
  • Overview of residue-based quality for every polymer
  • Table of ligands that may need attention

Validation reports
• Entry contents
  • Inventory

Validation reports
• Residue quality
  • One plot per polymer
  • Coloured by number of types of geometric outliers
  • Grey if not modelled
  • Red dots: poor density (RSR-Z > 2, as in EDS)

Validation reports
• “Table 1”
• Xtriage

Validation reports
• Model quality
  • Bond lengths and angles
  • Torsion angles (Ramachandran, rotamers)
  • Clashes
• Separately for standard residues, non-standard residues, ligands, carbohydrates
• Generally: information about distribution, outlier stats, percentile scores, list of up to 5 (worst) outliers (full reports contain all outliers)

Validation reports
• Geometry validation of ligands and non-standard entities
  • Mogul (CCDC)
  • wwPDB will get CSD coordinates for new and existing compounds (if they are available, of course)

Validation reports
• Model/data fit
  • Proteins, DNA, RNA
    • RSR and RSR-Z (EDS)
  • Ligands etc.
    • RSR and LLDF

Public X-ray Validation Reports
pdbe.org – rcsb.org – pdbj.org
Beta site at PDBe
http://wwwdev.ebi.ac.uk/pdbe/entry/pdb/1cbs

Other methods?
Nature 514, 416 (2014)

Other Methods?
• Model validation using same criteria as X-ray
  • MolProbity, Mogul
• Some special model-related issues per technique
  • X-ray: alternative conformations
  • NMR: ensemble of models; well-defined regions
  • 3DEM: clashes of rigid-body fitted models; difference in species of model and sample sequence
• Data quality and model/data-fit assessment will be different for each technique

NMR Validation
• NMR VTF recommendations published
• Global quality scores reported for “well-defined residues” only
  • As averages over the ensemble
  • Medoid model only

3DEM Validation
• Model validation
  • Clashes?
  • Taxonomy?
  • Homology models?
  • Non-atomistic models?
  • Cα-only models?
  • Rigid-body vs. flexible fitting vs. de novo modelling?
• Data and map validation
  • Per technique and resolution regime
  • Tilt-pair analysis; handedness; projections vs. raw data
• Map + model
  • Depending on resolution regime and model-building method?

EM Validation Reports
• Metrics relevant for EM models
• Define “Table 1” for EM

Validation by wwPDB
• By no means the end of the story!
• Room for extension and improvement
  • Ligands, nucleic acids, carbohydrates, NCS, space-group errors, …
  • wwPDB ligand-validation workshop in 2015
• X-ray
  • Re-convene X-ray VTF in 2015 to evaluate and update recommendations
• NMR
  • Further development in progress
• EM
  • Rudimentary at present, lots more work needed
• All methods: annual re-compute of distributions
• User feedback welcome at validation@mail.wwpdb.org

“Other other” methods
• SAS – wwPDB task force (2012, 2014)
• Hybrid methods – wwPDB task force (2014)
  • For example: solid-state NMR + EM + SAXS + solution NMR + homology modelling …
• Questions
  • What to archive and where?
  • What to accept?
  • What requirements for deposition?
  • How to validate?
  • What to do with non-atomistic models?
  • What to do with homology models?

SAS Task Force recommendations
• Need repository for SAXS and SANS data
• Need dictionary (data model) for SAXS and SANS
• Shape/bead and atomistic models should be archived (somewhere, somehow)
• Validation criteria need to be defined
• Archive of non-atomistic models from hybrid data
  • What should (not) be in the PDB?
Trewhella et al., Structure 21, 875 (2013)

SAS archives
bioisis.net – sasbdb.org

Hybrid Methods
• Task Force met in October 2014
• Representatives of existing task forces, other methods, integrative modellers, and wwPDB
• Questions about what to archive where, what data and metadata, how to validate

wwPDB Hybrid Methods Task Force
Nature 514, 416 (2014)
EMBL-EBI, Hinxton, 6-7 October, 2014
X-ray, NMR, 3DEM/ET, SAS, FRET, EPR, MS, …
Modelling, Docking, Validation, Visualisation, Archiving, …

Key outcomes of discussion
• Be as inclusive as possible in collecting data from many different experimental methods
• Accommodate many types of structural representations
• Create a federated system to collect/curate data
• Use a common interface to collect data
• wwPDB should play a leadership role
• Whitepaper to describe vision

What have we learned?

Why do/did things sometimes go horribly wrong in X-ray?
• Blind optimism/naïveté/ignorance
• Belief in (wrong) numbers and in “magic” refinement programs
• Inappropriate (use of) modelling/refinement methods
• Fitting too many parameters
• No/inappropriate quality control/validation
• “Believing is seeing”
• Large influx of non-experts

Of course, none of this should be news or surprising…
Hendrickson (CCP4 Proc., 1980) - “That which is not restricted will take its liberties”
Knight et al. (CCP4 Proc., 1990) - “None of this evidence is dependent on a refined model and instead makes use of known facts about proteins in general and the S subunit of RuBisCO in particular”

Brändén & Jones, Nature 343, 687 (1990)

Lessons
• Have we learned anything from 25 years of errors?
• Education is important
  • Avoid blind optimism, naïveté, belief in “magic” programs
  • Don’t be afraid to ask a colleague’s help or opinion
• Use restraint and restraints when modelling
  • Consider the ratio of observations and parameters
  • Consider the information content of your data
• Null-hypothesis: everything is normal!
  • Trans-peptides, bond lengths/angles, rotamers, NCS, …
  • Unless your data shouts at you otherwise, or you have reliable prior knowledge

Lessons
• Have we learned anything from 25 years of errors?
• Use (lots of) validation tools throughout, not just when you deposit
  • Or worse, rely on wwPDB annotators to tell you what’s dodgy about your model…
• Be your own fiercest critic!
  • Avoid confirmation bias - try to shoot down your own models and hypotheses
• How will you deal with cognitive dissonance?

What you would like your plots to look like…
pdbe.org/valrep/4lfq

Validation reports for today’s structures
• New-style wwPDB X-ray validation reports are available for most of the structures shown or discussed in this lecture (even superseded ones) from http://www.ebi.ac.uk/~gerard/valrepcshl.html
• Examples:
  • 1Z2R (part of the pentaretraction); 4.2Å
  • 3LNA (imagined ligand); 2.7Å

Where to go from here?
• Download and read:
  • GJ Kleywegt. Validation of protein crystal structures. Acta Crystallographica D56, 249-265 (2000) (and many references therein)
  • GJ Kleywegt. On vital aid: the why, what and how of validation. Acta Crystallographica D65, 134-139 (2009)
• Do this web-based tutorial:
  • http://xray.bmc.uu.se/embo2001/modval

Acknowledgements
• Alwyn Jones (Uppsala U)
• Randy Read (Cambridge U)
• Andy Davis (AstraZeneca)
• Members of the wwPDB and EMDataBank VTFs
• CCDC
• Colleagues
  • Uppsala, PDBe, wwPDB, EMDataBank, EBI, EMBL
• Everybody whom I have ever discussed validation and errors in protein structures with
• Many funding agencies in Sweden, UK, Europe and US as well as Uppsala University and EMBL

Questions?

If you see this slide, I’ve gone too far
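
Worked example (not part of the original slides): the “How tall is Gerard?” measurements from the “Errors affect measurements” slide can be used to illustrate how a gross error distorts a naive average and how it may show up as an outlier. The sketch below is only one possible approach. It assumes a median/MAD-based modified z-score with the conventional constants 0.6745 and 3.5, neither of which comes from the lecture, and the function name flag_gross_errors is hypothetical. Note that no statistic computed from these numbers alone can reveal a systematic error, such as a mis-calibrated ruler.

```python
from statistics import mean, median

# Measurements quoted on the "How tall is Gerard?" slide (units are not given there).
heights = [200, 203, 202, 203, 202, 201, 203, 80]


def flag_gross_errors(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    Assumption: the modified z-score 0.6745 * |x - median| / MAD is used,
    with the commonly quoted cut-off of 3.5. Both are illustrative choices.
    """
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: more than half the values are identical.
        return [v for v in values if v != med]
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


suspects = flag_gross_errors(heights)
kept = [v for v in heights if v not in suspects]

print("mean of all values:     ", mean(heights))
print("suspected gross errors: ", suspects)
print("mean after removing them:", mean(kept))
```

The remaining scatter between 200 and 203 is the random error: averaging more observations reduces it, whereas a gross error has to be found and removed (or, better, explained) first.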