Structural databases & Models of structures

Outline
• Structural databases
  • Data formats (PDB, mmCIF, PDBML)
  • wwPDB
  • Other resources
• 3D data validation
• 3D protein modelling
• Models validation and databases

Structural databases

Data formats
• different formats are used to represent primary macromolecular 3D structure data
  • PDB
  • mmCIF
  • PDBML
  • ...
• the spatial 3D coordinates of each atom are recorded

PDB format
• designed in the early 1970s for the first entries of the PDB database
• rigid structure of 80 characters per line, including spaces
• still the most widely supported format
• an entry contains
  • atomic coordinates
  • chemical and biological features
  • experimental details of the structure determination
  • structural features
    • secondary structure assignments
    • hydrogen bonding
    • biological assemblies (REMARK 350)
    • active sites
    • ...
• advantages
  • widely used → supported by the majority of tools
  • easy to read and easy to use → suitable for accessing individual entries
• disadvantages
  • inconsistency between individual PDB entries as well as between PDB records within one entry (e.g., different residue numbering in the SEQRES and ATOM sections) → not suitable for computer extraction of information
  • absolute limits on the size of certain data items, e.g., at most 99,999 atom records and at most 26 chains → large systems such as the ribosomal subunit must be divided into multiple PDB files → not suitable for analysis and comparison of experimental and structure data across the entire database

mmCIF format
• macromolecular Crystallographic Information File (mmCIF)
• developed to handle increasingly complicated structure data
• each field of information is explicitly assigned by a tag and linked to other fields through a special syntax
• advantages
  • easily parsable by computer software
  • consistency of data across the database
  → suitable for analysis and comparison of experimental and structure data across the entire database
• disadvantages
  • difficult to read
  • rarely supported by visualization and computational tools
  → not suitable for accessing individual entries

PDBML format
• Protein Data Bank Markup Language (PDBML)
• XML version of the PDB format

Structural databases
• Primary
  • wwPDB: 3D structures of biopolymers
  • BMRB: specific to nuclear magnetic resonance (NMR) data
  • EMDB: specific to electron microscopy (EM) data
  • NDB: 3D structures of nucleic acids, http://ndbserver.rutgers.edu/
  • CSD: 3D structures of small molecules (commercial), http://www.ccdc.cam.ac.uk/products/csd/
• Other sources
  • PDBsum, SCOP, Proteopedia, Structural Biology Knowledgebase

wwPDB
• joint initiative of four organizations
  • Research Collaboratory for Structural Bioinformatics (RCSB PDB)
  • Protein Data Bank in Europe (PDBe)
  • Protein Data Bank Japan (PDBj)
  • Biological Magnetic Resonance Data Bank (BMRB)
• [Figure: wwPDB database growth]
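Looking back at the PDB format: because every field occupies a fixed column range within the 80-character line, a record can be read by plain string slicing. The sketch below is a minimal illustration in Python, assuming the standard ATOM/HETATM column layout. The names `Atom` and `parse_atom_line` are hypothetical, introduced only for this example, and only a subset of the fields is extracted.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """Subset of the fields stored in one ATOM/HETATM record (hypothetical container)."""
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    b_factor: float


def parse_atom_line(line: str) -> Atom:
    """Read one ATOM/HETATM record by slicing its fixed columns.

    Assumes the standard column layout (1-based): serial 7-11, atom name 13-16,
    residue name 18-20, chain 22, residue number 23-26, x/y/z 31-54, B-factor 61-66.
    """
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # five columns only
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],             # a single character
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        b_factor=float(line[60:66]),
    )


# Illustrative record; the values are made up for the example.
SAMPLE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"
atom = parse_atom_line(SAMPLE)
```

The slicing also makes the size limits visible: the serial number has only five columns and the chain identifier only one, which is where the caps on atoms and chains per file come from.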
wwPDB
• worldwide Protein Data Bank (wwPDB)
• http://www.wwpdb.org/
• central repository of experimental macromolecular structures
• more than 225,000 structures (October 2024), updated every week
• mostly protein structures (87 %), followed by complexes of proteins with nucleic acids or oligosaccharides (11 %) and nucleic acid structures (2 %)
• most structures determined by X-ray crystallography (84 %), the rest by EM (10 %) or NMR (6 %)
• deposition of a structure in the wwPDB is a requirement for its publication

wwPDB – data deposition
• all data can be deposited at the RCSB PDB, PDBe or PDBj site
• same requirements on the content and format of the final files
  • structures of biopolymers
  • structures determined by experimental techniques
  • structures containing the required information
• same validation methods → uniformity of the final archive
• PDB-ID
  • assigned to each deposition
  • unique identifier of each structure
  • four-character code

wwPDB – data validation
• assessment of the quality of deposited atomic models (structure validation) and of how well these models fit the experimental data (experimental validation)
• validation using accepted community standards
  • covalent bond distances and angles
  • stereochemical validation
  • atom and ligand nomenclature
  • geometry
  • NMR data specific checks
  • ...

wwPDB – data access
• access to the PDB archive is free and publicly available from the RCSB PDB, PDBe or PDBj site
• FTP
  • the RCSB PDB, PDBe and PDBj sites distribute the same PDB archive
  • updated weekly
• web sites
  • each wwPDB site provides its own services and resources → different views and analyses of the structural data
  • sequence-based and text-based queries

RCSB PDB
• http://pdb.rcsb.org
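As a small illustration of the PDB-ID described above, the following sketch checks that an identifier is a four-character alphanumeric code and builds a download address from it. The function names and the URL template are assumptions made for this example and are not taken from the slides; the site documentation should be checked before relying on them.

```python
import re

# Four alphanumeric characters, as stated for the PDB-ID.
_PDB_ID_PATTERN = re.compile(r"[0-9A-Za-z]{4}")


def normalize_pdb_id(raw: str) -> str:
    """Return the identifier in upper case, or raise if it is not a four-character code."""
    pdb_id = raw.strip()
    if not _PDB_ID_PATTERN.fullmatch(pdb_id):
        raise ValueError(f"not a four-character PDB-ID: {raw!r}")
    return pdb_id.upper()


def entry_url(raw_id: str, extension: str = "cif") -> str:
    """Build a download URL for one entry.

    ASSUMPTION: the URL template below is illustrative only.
    """
    return f"https://files.rcsb.org/download/{normalize_pdb_id(raw_id)}.{extension}"


# Example: the corrected model mentioned later in the slides.
url = entry_url(" 2phy ")
```

Normalizing the case and rejecting malformed codes early keeps later lookups from failing in less obvious ways.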
Other structure-based resources
• PDBsum
  • http://www.ebi.ac.uk/pdbsum/
  • provides summaries and pre-computed analyses for structures deposited in the wwPDB
• Structural Classification of Proteins (SCOP)
  • http://scop.mrc-lmb.cam.ac.uk/scop/
  • provides classifications of proteins with known 3D structure according to their evolutionary and structural relationships
• Proteopedia
  • http://www.proteopedia.org/wiki/index.php/
  • free, collaborative 3D encyclopedia of proteins and other molecules
• Structural Biology Knowledgebase
  • http://sbkb.org/
  • provides up-to-date information about advances in structural biology and structural genomics

Structural quality assurance

Outline
• Revision of concepts
• Important truths about structures
• Errors in deposited structures
  • systematic errors
  • random errors
• Selecting reliable structure
  • rules of thumb
  • quality checks
  • programs and databases

Concepts
• Resolution
  • measure of the level of detail present in the diffraction pattern
  • [Figure: level of detail at 3 Å, 2 Å and 1 Å resolution]
• R-factor (R-value)
  • measure of model quality, i.e. how well the model can reproduce the experimental data
• Thermal factors (B-factors)
  • measure of how much an atom oscillates or vibrates around the position specified in the model
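For reference, the R-factor and the B-factor are commonly written as follows. These are the standard crystallographic definitions rather than formulas taken from the slides: F_obs and F_calc denote the observed and model-calculated structure factor amplitudes, the sums run over all reflections hkl, and ⟨u²⟩ is the mean square displacement of an atom from its model position.

```latex
R = \frac{\sum_{hkl} \bigl|\, |F_{\mathrm{obs}}(hkl)| - |F_{\mathrm{calc}}(hkl)| \,\bigr|}
         {\sum_{hkl} |F_{\mathrm{obs}}(hkl)|}
\qquad\qquad
B = 8\pi^{2} \langle u^{2} \rangle
```

A lower R means the model reproduces the measured amplitudes more closely; with u in Å, B is expressed in Å².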
Important truths about structures
• all structures are just models devised to satisfy experimental data → they contain random and systematic errors
• individual structures differ in quality
• most structures are reasonably accurate, containing "only" random errors, but some structures are seriously incorrect
• structures should be carefully selected and critically assessed before being used for a specific purpose → quality checks of structures

Errors in deposited structures
• systematic errors
• random errors

Systematic errors
• relate to the accuracy of the model, i.e. how well it corresponds to the "true" structure of the molecule in question
• often include errors of interpretation
  • low quality of the electron density map → difficult to find the correct tracing of the molecule(s) through it → mistracing and "frame-shift" errors
  • spectral interpretations (assignment of individual NMR signals to individual atoms)
• may lead to a completely wrong final structure

Examples of systematic errors
• completely wrong structures
  • trace of the protein chain follows the wrong path through the electron density → completely incorrect fold
  • [Figure: incorrect model (1PHY) and corrected model (2PHY)]
• wrong connectivity between secondary structure elements
  • incorrect order of secondary structure elements → many of the protein's residues are in the wrong place in the 3D structure
  • [Figure: incorrect model (1PTE) and corrected model (3PTE)]
Examples of systematic errors
• frame-shift errors
  • occur where a residue is fitted into the electron density that belongs to the next residue; the shift persists until a compensating error is made (two residues are fitted into the density of a single residue)
  • occur almost exclusively at very low resolution (> 3.0 Å), often in loop regions
• fitting of incorrect main chain or side chain conformations into the density
  • usually the least serious, but can still affect biological interpretations

Random errors
• depend on how precisely a given measurement can be made
• all measurements contain errors at some degree of precision → uncertainties in atomic positions
• less serious than systematic errors
• if a structure is essentially correct, the sizes of the random errors determine how precise the structure is

Examples of random errors
• uncertainties in atomic positions
  • typically in the range of 0.01 to 1.27 Å, median 0.28 Å
• side chain flips
  • His/Asn/Gln side chains are nearly symmetrical in shape → they fit the electron density equally well when rotated by 180°
  • the N and O atoms of the side-chain amide are difficult to distinguish from X-ray data
  • [Figure: the two orientations of a side-chain amide, with the N and O positions swapped]

Selecting reliable structure
• rules of thumb for selecting structures
  • X-ray structures
  • NMR structures
• quality checks of structures
  • validation of protein structures
  • programs for quality checks
  • quality information on the web
Rules of thumb for selecting structures
• X-ray structures
  • reasonably accurate structure: resolution ≤ 2.0 Å and R-factor ≤ 0.2
  • selection criteria always depend on the type of analysis required (e.g., for a comparison of folds a resolution of 3.0 Å is sufficient, whereas an analysis of side chain torsional conformers requires a resolution ≤ 1.2 Å)
  • the R-factor can easily be fooled → a better indicator of model reliability is Rfree, calculated in the same way as the R-factor but using only a small fraction of the experimental data that is excluded from refinement; Rfree should be ≤ 0.4
  • local errors are indicated by residue B-factors > 50 Å², but quality checks should always be performed to assess possible local problems in a structure
• NMR structures
  • no simple rule of thumb as in the case of X-ray structures
  • information on structure quality can be found in the original paper or obtained by quality checks
  • ResProx (http://www.resprox.ca/): predicts the atomic resolution of NMR protein structures using machine learning
  • DRESS (http://www.cmbi.ru.nl/dress/) and RECOORD (http://www.ebi.ac.uk/pdbe-apps/nmr/recoord/main.html) web servers: provide improved versions of old NMR models, obtained by re-refinement of the original experimental data using more up-to-date force fields and refinement protocols
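The X-ray rules of thumb above can be sketched as a simple filter. This is a minimal illustration, assuming the resolution, R-factor, Rfree and per-residue B-factors have already been read from an entry; the class and function names are hypothetical, the example identifiers are made up, and the default thresholds are the ones quoted in the list and should be adapted to the analysis at hand.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class XrayEntry:
    """Global quality indicators of one X-ray structure (hypothetical container)."""
    pdb_id: str
    resolution: float               # in Å; smaller is better
    r_factor: float
    r_free: Optional[float] = None  # not reported for every entry


def is_reasonably_accurate(entry: XrayEntry,
                           max_resolution: float = 2.0,
                           max_r_factor: float = 0.2,
                           max_r_free: float = 0.4) -> bool:
    """Apply the rule-of-thumb thresholds to the global indicators."""
    if entry.resolution > max_resolution or entry.r_factor > max_r_factor:
        return False
    if entry.r_free is not None and entry.r_free > max_r_free:
        return False
    return True


def flag_high_b_residues(residue_b: Dict[int, float], cutoff: float = 50.0) -> List[int]:
    """Return residue numbers whose B-factor exceeds the cutoff (possible local errors)."""
    return sorted(number for number, b in residue_b.items() if b > cutoff)


# Made-up entries for illustration only.
candidates = [
    XrayEntry("xxx1", resolution=1.8, r_factor=0.18, r_free=0.22),
    XrayEntry("xxx2", resolution=2.5, r_factor=0.19, r_free=0.24),
    XrayEntry("xxx3", resolution=1.9, r_factor=0.19, r_free=0.45),
]
selected = [e.pdb_id for e in candidates if is_reasonably_accurate(e)]
```

A filter like this only narrows the candidate list; flagged residues and borderline entries still call for the quality checks described next.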
Quality checks of structures
• checks of structure geometry, stereochemistry and other structural properties
• tests of normality
  • comparison of a given protein or nucleic acid structure against what is already known about these molecules
  • the knowledge comes from high-resolution structures of small molecules and from systematic analyses of existing protein and nucleic acid structures
• not all outliers from the norm are errors (e.g., an unusual torsion angle of a single residue); however, a structure exhibiting a large number of outliers and oddities is probably problematic

Validation of protein structures
• Ramachandran plot
  • check of the stereochemical quality of protein structures
  • plot of the Ψ versus the Φ main chain torsion angle for every amino acid residue in the protein (except the two terminal residues)
  • favorable and "disallowed" regions of the plot determined from analyses of existing structures
  • typical protein structures: residues tightly clustered in the most favored regions, with few or no residues in the "disallowed" regions
  • poorly defined protein structures: residues more dispersed, with many of them in the "disallowed" regions
  • [Figure: Ramachandran plots (Ψ versus Φ) of a typical protein structure and of a poorly defined protein structure]
• side chain torsion angles
  • preferred conformations of side chain torsion angles obtained by analyses of existing structures
  • χ1: torsion angle defined by the atoms N-Cα-Cβ-Aγ
  • χ2: torsion angle defined by the atoms Cα-Cβ-Aγ-Aδ, ...
  • (A stands for the side-chain atom at the given position, whatever its element)
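Both the main chain angles of the Ramachandran plot and the side chain χ angles are torsion angles defined by four consecutive atoms, so one routine covers them all. The sketch below is a minimal pure-Python version using the usual atan2 formulation; the function names are introduced here for illustration, and the coordinates in the example are made up. For the main chain, Φ is defined by C(i-1), N, Cα, C and Ψ by N, Cα, C, N(i+1).

```python
import math
from typing import Sequence, Tuple

Vector = Tuple[float, float, float]


def _sub(a: Sequence[float], b: Sequence[float]) -> Vector:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Sequence[float], b: Sequence[float]) -> Vector:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle_deg(p0, p1, p2, p3) -> float:
    """Torsion angle in degrees, in the range (-180, 180], for four atom positions."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)   # the central bond, about which the rotation is measured
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Made-up coordinates: the fourth atom is rotated by a quarter turn about the central bond.
angle = torsion_angle_deg((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```

Using atan2 instead of acos keeps the sign of the angle, which is needed to tell apart mirror-related conformations.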
Validation of protein structures
• bad and unfavorable atom-atom contacts
  • "simple" count of bad contacts, e.g., two nonbonded atoms with a center-to-center distance smaller than the sum of their van der Waals radii
  • evaluation of the environment of individual atoms or residue fragments with respect to the environments found in high-resolution crystal structures
• secondary structure
  • ~ 50-60 % of residues are usually in regions of regular secondary structure
  • poorly defined structures: main chain O and N atoms can lie beyond normal hydrogen bonding distances → some of the α-helices and β-strands are not detected by the secondary structure assignment programs
  • [Figure: secondary structure of a typical protein structure and of a poorly defined protein structure]
• other parameters
  • counts of unsatisfied hydrogen bond donors
  • hydrogen bonding energies
  • knowledge-based potentials assessing how "happy" each residue is in its local environment; many unhappy residues → "sad" overall structure
  • real-space R-factor expressing how well each residue fits its electron density; can also be expressed as a real-space correlation coefficient

Programs for quality checks
• Proteins
  • PROCHECK
  • WHAT_CHECK
  • Verify3D
  • MolProbity
  • ANOLEA
• PROCHECK
  • http://www.ebi.ac.uk/thornton-srv/software/PROCHECK/
  • variety of plots for protein structures: Ramachandran plot, χ1-χ2 plot for each amino acid type, main chain bond lengths and bond angles, secondary structure plot, ...
  • parameters that deviate from the norm are highlighted
  • NMR-PROCHECK: version specific for NMR
• WHAT_CHECK (subset of the WHAT IF package)
  • http://swift.cmbi.ru.nl/gv/whatcheck/
  • checks space group and symmetry, bond lengths and angles, bad contacts, hydrogen bonds, ...
  • detailed output of the discrepancies of the given protein structure from the norms
• Verify3D
  • https://genesilico.pl/toolkit/unimod?method=Verify3D
  • evaluates each residue's environment in terms of secondary structure, buried surface area, and fraction of the side chain covered by polar atoms
• MolProbity
  • http://molprobity.biochem.duke.edu/
  • detailed all-atom contact analysis within a given protein structure
• ANOLEA
  • http://melolab.org/anolea/index.html
  • knowledge-based evaluation of atom-atom contacts

Quality information on the web
• several databases provide pre-computed quality criteria for all wwPDB structures
  • EDS
  • PDBsum
  • PDBREPORT
  • RCSB PDB
• Electron Density Server (EDS)
  • http://eds.bmc.uu.se/eds/, also available via the PDBe site
  • information about the local quality of all wwPDB structures with deposited experimental data
  • plot of the real-space R-factor (RSR): how well each residue fits its electron density
  • plot of the Z-score: a large positive spike → the residue has a considerably worse RSR than the average residue of the same type in structures determined at similar resolution
  • Ramachandran plot
  • ...
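The Z-score idea used for the RSR plot can be sketched in a few lines: the RSR of a residue is compared with the mean and standard deviation of RSR values for the same residue type at similar resolution. The reference statistics, the cutoff of 2.0 and all names below are hypothetical placeholders chosen for illustration, not values used by EDS.

```python
from typing import Dict, List, Tuple

# HYPOTHETICAL reference statistics (mean RSR, standard deviation) per residue type,
# standing in for values derived from structures at similar resolution.
REFERENCE: Dict[str, Tuple[float, float]] = {
    "ALA": (0.10, 0.03),
    "LYS": (0.16, 0.05),
}


def rsr_z_score(rsr: float, mean: float, sd: float) -> float:
    """Number of standard deviations by which a residue's RSR differs from the reference mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (rsr - mean) / sd


def flag_poor_fit(residues: List[Tuple[str, str, float]],
                  reference: Dict[str, Tuple[float, float]] = REFERENCE,
                  z_cutoff: float = 2.0) -> List[str]:
    """Return labels of residues whose Z-score exceeds the (assumed) cutoff."""
    flagged = []
    for label, residue_type, rsr in residues:
        mean, sd = reference[residue_type]
        if rsr_z_score(rsr, mean, sd) > z_cutoff:
            flagged.append(label)
    return flagged


# Made-up residues: (label, residue type, RSR).
example = [("A10", "ALA", 0.11), ("A11", "LYS", 0.31), ("A12", "ALA", 0.19)]
poor = flag_poor_fit(example)
```

Comparing against residues of the same type matters because long, flexible side chains tend to fit their density less well than small ones even in a correct model.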
Quality information on the web
• PDBsum
  • http://www.ebi.ac.uk/pdbsum/
  • provides numerous structural analyses of all wwPDB structures, including full PROCHECK output (for all protein-containing entries)
• PDBREPORT
  • http://swift.cmbi.ru.nl/gv/pdbreport/
  • provides a pre-computed WHAT_CHECK report for any structure in the wwPDB
• RCSB PDB
  • http://pdb.rcsb.org/
  • provides geometrical analyses for each entry, including information about bond lengths, angles and dihedral angles