Models of structures

3D structure prediction
• homology modeling
• fold recognition
• ab initio prediction
• "hybrid" approaches
• assessment
• databases of protein models

Importance of structure
• no experimental structure is available for most sequences
[Figure: number of entries (millions) by year]

Homology modeling
• basic principle: structure is more conserved than sequence
  ▪ similar sequences adopt practically identical structures (example: haloalkane dehalogenase LinB, PDB ID 1iz7, and haloalkane dehalogenase DhaA, PDB ID 1cqw; sequence identity ~50 %)
  ▪ distantly related sequences still fold into similar structures (example: haloalkane dehalogenase LinB, PDB ID 1iz7, and chloroperoxidase L, PDB ID 1a88; sequence identity ~15 %)
[Figure: number of folds in the SCOP database by year, per year and total]
• builds an atomic-resolution model of the target protein based on the experimental 3D structure (template) of a homologous protein
• the most accurate 3D prediction approach
• if no reliable template is available → fold recognition or ab initio prediction
• the quality of the model depends on the sequence identity/similarity between the target and template proteins
• for a protein of standard length, it should be > 25 % identity / > 40 % similarity
[Figure: safe homology modeling zone and twilight zone]

Homology modeling – steps
[Diagram: target sequence (...MSLGAKPFGE...) → database search → selection of template → sequence alignment → building model framework → loop and side-chain modeling → model optimization → model validation, with iteration where needed]
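As a rough illustration of the identity criterion quoted for homology modeling, the sketch below computes percent identity from an already aligned target/template pair and compares it with the 25 % threshold. The function names are hypothetical, the 10-residue fragment is the toy alignment shown on the slides, and a fragment this short is far too small for the threshold to be meaningful; it only shows the arithmetic.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def modeling_zone(identity: float, threshold: float = 25.0) -> str:
    """Classify an identity value against the 25 % rule of thumb (an assumption
    that only holds for proteins of standard length)."""
    if identity > threshold:
        return "safe homology modeling zone"
    return "twilight zone"


# Toy alignment fragment taken from the slides (target on top, template below).
target = "MSLGAKPFGE"
template = "MGV-AKTYGE"

identity = percent_identity(target, template)
print(f"identity = {identity:.1f} % -> {modeling_zone(identity)}")
```

Dividing by the number of gap-free columns is only one of several conventions; dividing by the length of the shorter sequence or of the whole alignment gives different numbers, so the convention should always be stated together with the value.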
Database search
• standard sequence-similarity searches
  ▪ comparison of the target sequence to all sequences with known 3D structures in the wwPDB database
  ▪ BLAST, FASTA, ...
• profile-based searches
  ▪ more sensitive than standard sequence-similarity searches
  ▪ PSI-BLAST, HMMER, HHblits, ...
• fold recognition methods
  ▪ applied if no template can be reliably identified by sequence- or profile-based methods (sequence identity below the recommended 25 %)
  ▪ FUGUE, GenTHREADER, pro-sp3-TASSER, ...

Selection of template
• wrong template = wrong model
• more than one possible template may be identified → a combination of different criteria is used to select the final template:
  ▪ sequence identity between the template and target protein
  ▪ coverage between the template and query sequences
  ▪ resolution of the template structure, number of errors
  ▪ proportion of conserved residues in the region of interest (e.g., binding-site residues)
  ▪ ...
• multiple templates can be used to create a combined model
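The template selection criteria listed above can be combined into a single ranking score. The following is a minimal sketch under stated assumptions: the `TemplateHit` fields, the weights, the resolution scaling and the three candidate templates are all hypothetical and are not taken from any real search result or published scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class TemplateHit:
    template_id: str          # made-up identifier, not a real PDB entry
    identity: float           # % sequence identity to the target
    coverage: float           # fraction of the target covered by the alignment (0..1)
    resolution: float         # resolution of the template structure in angstroms
    site_conservation: float  # fraction of conserved residues in the region of interest (0..1)


def template_score(hit: TemplateHit, weights=(0.4, 0.3, 0.1, 0.2)) -> float:
    """Weighted sum of the selection criteria; the weights are arbitrary assumptions."""
    w_id, w_cov, w_res, w_site = weights
    # Map resolution to 0..1: 1.0 A or better scores 1, 4.0 A or worse scores 0.
    res_term = max(0.0, min(1.0, (4.0 - hit.resolution) / 3.0))
    return (w_id * hit.identity / 100.0
            + w_cov * hit.coverage
            + w_res * res_term
            + w_site * hit.site_conservation)


def rank_templates(hits):
    """Return the hits sorted from best to worst score."""
    return sorted(hits, key=template_score, reverse=True)


# Hypothetical candidates, invented for illustration only.
hits = [
    TemplateHit("tmplA", identity=48.0, coverage=0.95, resolution=1.8, site_conservation=0.9),
    TemplateHit("tmplB", identity=55.0, coverage=0.60, resolution=3.2, site_conservation=0.5),
    TemplateHit("tmplC", identity=22.0, coverage=0.98, resolution=1.2, site_conservation=0.4),
]

for hit in rank_templates(hits):
    print(hit.template_id, round(template_score(hit), 3))
```

In this toy example the template with the highest sequence identity (tmplB) ranks last because of its poor coverage and resolution, which is the reason for combining several criteria rather than relying on identity alone.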
Sequence alignments
• the reliability of the alignment decreases with decreasing similarity between the target and template sequences
• the quality of the alignment is crucial: it determines the quality of the final model
• the pairwise target-template alignment provided by database search methods is almost guaranteed to contain errors → more sophisticated methods are needed
  ▪ multiple sequence alignment
  ▪ profile-driven alignments
  ▪ correction of the alignment based on the template structure
• multiple sequence alignment
  ▪ works with more information than pairwise alignment → more reliable
  ▪ MUSCLE, Clustal Omega, T-Coffee

Building model framework
• copying the basic shape of the template to the model
  ▪ if two aligned residues differ, the backbone coordinates for N, Cα, C and O, and often also Cβ, can be copied
  ▪ conserved residues can be copied completely to provide an initial guess
  ▪ residues that are not present in the target (because the target can have fewer residues than the template) are not copied
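The copying rules for the model framework can be sketched as a single pass over the alignment columns. This is an illustrative toy, not how any particular modeling program implements the step: the data layout (one dictionary of atom name to coordinates per template residue), the function name and the coordinates are assumptions. Target residues without a template counterpart are left empty for the later loop-modeling step.

```python
BACKBONE_ATOMS = ("N", "CA", "C", "O")


def build_framework(target_aln, template_aln, template_residues):
    """Copy template coordinates onto the target according to the alignment.

    template_residues: list with one {atom_name: (x, y, z)} dict per template
    residue, in sequence order (a hypothetical, simplified representation).
    Returns a list of (target_residue, atoms or None) tuples.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    model = []
    tpl_index = 0
    for tgt, tpl in zip(target_aln, template_aln):
        atoms = None
        if tpl != "-":
            atoms = template_residues[tpl_index]
            tpl_index += 1
        if tgt == "-":
            continue                          # template residue absent from target: not copied
        if atoms is None:
            model.append((tgt, None))         # insertion in target: left for loop modeling
        elif tgt == tpl:
            model.append((tgt, dict(atoms)))  # conserved residue: copy all atoms
        else:
            # Differing residue: copy backbone and Cb (glycine has no Cb).
            keep = BACKBONE_ATOMS if tgt == "G" else BACKBONE_ATOMS + ("CB",)
            model.append((tgt, {n: xyz for n, xyz in atoms.items() if n in keep}))
    return model


# Toy template with made-up coordinates: Ala, Lys (side chain truncated), Gly.
template_residues = [
    {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "C": (2.0, 1.4, 0.0),
     "O": (1.3, 2.4, 0.0), "CB": (2.1, -1.2, 0.5)},
    {"N": (3.3, 1.5, 0.0), "CA": (4.0, 2.8, 0.0), "C": (5.5, 2.6, 0.0),
     "O": (6.1, 1.5, 0.0), "CB": (3.6, 3.7, 1.2), "CG": (4.2, 5.1, 1.2)},
    {"N": (6.1, 3.8, 0.0), "CA": (7.5, 3.9, 0.0), "C": (8.1, 5.3, 0.0),
     "O": (7.4, 6.3, 0.0)},
]

model = build_framework("ARW-", "AK-G", template_residues)
for residue, atoms in model:
    print(residue, sorted(atoms) if atoms else "to be built by loop modeling")
```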
Loop modeling
• inserting missing residues into the continuous backbone
• prediction of loop conformation is a difficult task (especially for loops longer than 5-8 residues)
  ▪ knowledge-based prediction: use of libraries of possible loop conformations known from experimentally determined structures with the same local sequence
  ▪ ab initio prediction: use of energy functions to find the optimal conformation, followed by minimization of the structure
  ▪ hybrid approach: the loop is divided into small fragments that are each compared separately to known structures

Side-chain modeling
• adding the side chains of amino acids to the model backbone
• rotamer libraries: common side-chain conformations (rotamers) extracted from high-resolution X-ray structures → possible rotamers are explored and scored with an energy function
• backbone-dependent rotamer libraries: the optimal conformation of a side chain depends on the local backbone conformation (5-9 neighboring residues) → only the rotamers corresponding to the best backbone matches are explored, which greatly reduces the conformational search space
[Figure: backbone-dependent rotamer library. According to the library, the backbone favors two different conformations for tyrosine, which appear about equally often in the database.]
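The idea of exploring library rotamers and scoring them with an energy function can be illustrated as follows. Everything here is a hedged toy: the library entry, its chi angles and probabilities are invented (they only mimic the tyrosine case with two nearly equally frequent conformations), and the score, the negative log of the library probability plus a user-supplied energy term, is one plausible choice rather than the scheme of a specific program.

```python
import math

# Hypothetical toy library: (residue, phi bin, psi bin) -> [(chi1, chi2, probability)].
# The numbers are invented for illustration and do not come from a real rotamer library.
ROTAMER_LIBRARY = {
    ("TYR", -60, -40): [
        (-65.0, 95.0, 0.48),
        (180.0, 80.0, 0.46),
        (60.0, 90.0, 0.06),
    ],
}


def backbone_bin(angle: float, width: int = 10) -> int:
    """Assign a backbone dihedral angle to the nearest library bin."""
    return int(round(angle / width)) * width


def best_rotamer(residue: str, phi: float, psi: float, energy_fn):
    """Pick the rotamer minimizing -ln(probability) + energy_fn(chi1, chi2)."""
    candidates = ROTAMER_LIBRARY[(residue, backbone_bin(phi), backbone_bin(psi))]

    def score(rotamer):
        chi1, chi2, prob = rotamer
        return -math.log(prob) + energy_fn(chi1, chi2)

    return min(candidates, key=score)


def no_environment(chi1, chi2):
    """Energy term that ignores the surroundings."""
    return 0.0


def clash_near_minus_65(chi1, chi2):
    """Made-up penalty standing in for a steric clash around chi1 = -65 degrees."""
    return 5.0 if abs(chi1 + 65.0) < 30.0 else 0.0


print(best_rotamer("TYR", -63.0, -41.0, no_environment))
print(best_rotamer("TYR", -63.0, -41.0, clash_near_minus_65))
```

Without an environment term the slightly more frequent rotamer wins; once the hypothetical clash penalty is applied, the second conformation is selected instead, so the energy function decides between library conformations of similar frequency.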
Model optimization
• energy minimization: may introduce many errors that move the model away from its correct structure → must be used carefully
• molecular dynamics simulation: follows the motions of the protein and mimics the folding process

Model validation
• a finished model contains errors (like any other structure); the number of errors (for a given method) mainly depends on:
  ▪ the percentage of sequence identity between the template and target sequences, e.g., 90 %: accuracy of the model comparable to X-ray structures; 50-90 %: larger local errors; < 25 %: often very large errors
  ▪ the number of errors in the template structure
• problems that occur far from the site of interest may be ignored; others should be tackled
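The caution about energy minimization can be made concrete with a toy steepest-descent run. The sketch below assumes a deliberately crude "force field" that only restrains consecutive Cα-Cα distances to 3.8 Å with a harmonic term; the coordinates, force constant, step size and number of steps are arbitrary assumptions. It reports both the energy drop and how far the coordinates drifted from the starting model.

```python
import math


def bond_energy(coords, d0=3.8, k=1.0):
    """Sum of harmonic terms 0.5 * k * (d - d0)^2 over consecutive points."""
    return sum(0.5 * k * (math.dist(a, b) - d0) ** 2
               for a, b in zip(coords, coords[1:]))


def gradient(coords, d0=3.8, k=1.0):
    """Analytical gradient of bond_energy with respect to every coordinate."""
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for i in range(len(coords) - 1):
        a, b = coords[i], coords[i + 1]
        d = math.dist(a, b)
        if d == 0.0:
            continue  # direction undefined for coincident points
        factor = k * (d - d0) / d
        for axis in range(3):
            g = factor * (a[axis] - b[axis])
            grad[i][axis] += g
            grad[i + 1][axis] -= g
    return grad


def minimize(coords, steps=50, step_size=0.1):
    """Plain steepest descent with a fixed step size."""
    coords = [list(c) for c in coords]
    for _ in range(steps):
        for point, g in zip(coords, gradient(coords)):
            for axis in range(3):
                point[axis] -= step_size * g[axis]
    return [tuple(c) for c in coords]


def rmsd(a, b):
    """Root-mean-square deviation between two coordinate sets, without superposition."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


# Made-up starting model with distorted consecutive distances.
start = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.5, 0.0, 0.0), (11.0, 1.0, 0.0)]
relaxed = minimize(start)

print("energy before:", round(bond_energy(start), 4))
print("energy after: ", round(bond_energy(relaxed), 4))
print("drift (RMSD): ", round(rmsd(start, relaxed), 4))
```

The energy decreases, but only with respect to the terms the energy function knows about; the nonzero drift shows that the atoms moved, and nothing in the calculation guarantees that they moved toward the correct structure.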
Iteration
• portions of the homology modeling process can be iterated to correct identified errors
  ▪ small errors introduced during optimization → running a shorter molecular dynamics simulation
  ▪ error in a loop → choosing another loop conformation in the loop modeling step
  ▪ large mistakes in the backbone conformation → repeating the whole process with another alignment or even a different template
  ▪ ...

Homology modeling programs
• MODELLER
  ▪ http://salilab.org/modeller/
  ▪ models are built by satisfying spatial restraints on Cα-Cα bond lengths and angles, side-chain dihedral angles, and van der Waals interactions
  ▪ restraints are calculated from the template structures
  ▪ available as a web server at different sites, e.g., as part of the ModWeb workflow (https://modbase.compbio.ucsf.edu/modweb/), the GeneSilico server (https://genesilico.pl/toolkit/unimod?method=Modeller) or the Bioinformatics Toolkit (http://toolkit.lmb.uni-muenchen.de/modeller)
• SWISS-MODEL
  ▪ http://swissmodel.expasy.org/
  ▪ fully automated protein structure homology modeling server

Model validation
• mostly the same principles as used for the validation of experimental structures
• always check both the model and the template
  ▪ the model cannot improve on the template in regions where the template is "bad"
• checks of normality
  ▪ inside/outside distributions of polar and apolar residues
  ▪ bad contacts
  ▪ evaluation of atom/residue environment
• energy-based checks
  ▪ side-chain clashes
  ▪ bond lengths and angles

Model validation programs
• QMEAN
  ▪ https://swissmodel.expasy.org/qmean/
  ▪ composite scoring function for the quality estimation of protein structure models; evaluates torsion angles, solvation and non-bonded interactions, and the agreement between predicted and calculated secondary structure and solvent accessibility
• Verify3D
• ANOLEA
• PROCHECK
• WHATCHECK
• PROSA II
• ...

Fold recognition (threading)
• predicts the fold of a protein by fitting its sequence into a structural database and selecting the best-fitting fold
• provides a rough approximation of the overall topology of the native structure → does not generate fully refined atomic models for the query sequence
• can be used when no suitable template structures are available for homology modeling
• fails if the correct protein fold does not exist in the database
• high rates of false positives
[Diagram: threading. The target sequence (MSLGAKPFGE...) is fitted to each fold in the fold library (fold 1, fold 2, ..., fold n), followed by model building and energy calculations.]
• pairwise energy-based methods (threading): the protein sequence is searched against a structural database to find the best-matching structural fold using energy-based criteria
[Figure: pairwise energy (kcal/mol) as a function of Cβ-Cβ distance for Glu-Asp and Glu-Arg pairs with l > 10, where l is the distance in sequence. Such potentials can be calculated from collections of known structures; density normalization is required.]
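One common way to obtain distance-dependent pair energies from collections of known structures is the inverse Boltzmann relation, E(r) = -kT ln(p_pair(r) / p_ref(r)), where dividing by the reference distribution of all residue pairs plays the role of the density normalization. The sketch below is a minimal illustration under stated assumptions: the slides do not say which formula was used, and the distance bins and counts are invented so that the oppositely charged Glu-Arg pair comes out favorable and the like-charged Glu-Asp pair unfavorable at short distance.

```python
import math

KT = 0.593  # kcal/mol at roughly 298 K


def pair_potential(pair_counts, reference_counts, kT=KT, pseudocount=1.0):
    """Inverse Boltzmann energies per distance bin.

    pair_counts: how often the specific residue pair is seen in each distance bin.
    reference_counts: how often any residue pair is seen in each bin (normalization).
    A pseudocount avoids log(0) for empty bins.
    """
    if len(pair_counts) != len(reference_counts):
        raise ValueError("count lists must cover the same bins")
    n_pair = sum(pair_counts) + pseudocount * len(pair_counts)
    n_ref = sum(reference_counts) + pseudocount * len(reference_counts)
    energies = []
    for obs, ref in zip(pair_counts, reference_counts):
        p_obs = (obs + pseudocount) / n_pair
        p_ref = (ref + pseudocount) / n_ref
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies


# Invented counts for four Cb-Cb distance bins (4-6, 6-8, 8-10, 10-12 angstroms).
reference = [100, 200, 300, 400]
glu_arg = [40, 30, 20, 10]   # opposite charges: enriched at short distance
glu_asp = [5, 15, 30, 50]    # like charges: depleted at short distance

for name, counts in (("Glu-Arg", glu_arg), ("Glu-Asp", glu_asp)):
    print(name, [round(e, 2) for e in pair_potential(counts, reference)])
```

In a threading run, energies of this kind would be summed over all residue pairs of the crude model built on each fold, and the totals would then be used for ranking.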
Fold recognition (threading)
• pairwise energy-based methods (threading) proceed in four steps:
  1. alignment of the query sequence with each structural fold in the fold library (essentially performed at the sequence-profile level)
  2. building a crude model for the target sequence (replacing aligned residues in the template structure with the corresponding residues in the query)
  3. calculating the energy of the raw model
  4. ranking the models by their energetics: the lowest-energy fold represents the structurally most compatible fold

Fold recognition (profiles)
• profile methods

Fold recognition programs
• PHYRE
  ▪ http://www.sbg.bio.ic.ac.uk/phyre2/
  ▪ profile-based method
  ▪ the highest-scoring alignments are used to construct full 3D models of the query: missing or inserted regions are repaired using a loop library and a reconstruction procedure, and side chains are placed using a fast graph-based algorithm
• RaptorX
  ▪ http://raptorx.uchicago.edu/
  ▪ provides single-template threading, alignment quality prediction, and multiple-template threading
• GenTHREADER
  ▪ http://bioinf.cs.ucl.ac.uk/psipred/
  ▪ uses a hybrid of the profile and pairwise energy methods
  ▪ a multiple sequence alignment and secondary structure predictions derived for the query are used as input for threading
  ▪ threading results are evaluated using neural networks

Ab initio prediction
• attempts to generate a structure by using physicochemical principles only
• used when neither homology modeling nor fold recognition can be applied
• searches for the structure at the global free-energy minimum
• so far, still limited success in obtaining correct structures

Ab initio prediction programs
• Rosetta
  ▪ http://www.rosettacommons.org/
  ▪ software suite for predicting and designing protein structures, protein folding mechanisms, and protein-protein interactions

"Hybrid" 3D structure prediction programs
• I-TASSER
  ▪ http://zhanglab.ccmb.med.umich.edu/I-TASSER/
  ▪ combines homology modeling, threading and ab initio predictions
  ▪ No. 1 server for protein structure prediction in previous CASP experiments
• Robetta
  ▪ http://robetta.bakerlab.org/
  ▪ combines homology modeling and ab initio predictions
  ▪ implements the Rosetta software

AlphaFold 1: ML-powered threading
  ▪ combines threading with ML
  ▪ No. 1 method for protein structure prediction in the CASP13 (2018) experiment

AlphaFold 2: ML revolution
  ▪ two independent tracks for sequence and structure, each ML-powered
  ▪ attention layer in the structure module doing the "trick"
  ▪ AF-Multimer for protein interactions
  ▪ No. 1 method for protein structure prediction in the CASP14 (2020) experiment

AlphaFold 3: turning it up another notch
  ▪ simplified representations (sequence-based)
  ▪ structural information only optional
  ▪ simplified network
  ▪ models all sorts of biomolecules

ML-powered (reverse) folding
[Figure labels: AlphaFold, ESMFold, ProteinMPNN, RoseTTAFold diffusion]
• state-of-the-art homology modeling approaches
• simple problems can be solved by simpler approaches
• ESMFold: quality is length-dependent
• hallucinates new 3D structures from scratch (ML has learned what structures look like)
• solves the reverse problem: predicts the optimal sequence from a 3D structure

Assessment of prediction methods
• CASP (Critical Assessment of techniques for protein Structure Prediction)
  ▪ http://predictioncenter.org/
  ▪ biennial international contest providing objective evaluation of the performance of individual prediction methods
  ▪ evaluation based on a large number of blind predictions on real 3D structures: contestants are given protein sequences whose structures have been solved but not yet published, and the predictions are compared with the newly determined structures
  ▪ competition in several categories
• CAMEO (Continuous Automated Model EvaluatiOn)
  ▪ https://www.cameo3d.org/
  ▪ weekly assessment of new structures in the PDB
  ▪ registered prediction servers are sent weekly requests for not-so-easy new structures in the weekly PDB pre-release
  ▪ multiple scores considered; normalized average (lDDT) reported
  ▪ categories:
    3D: prediction of the 3D coordinates of a protein from its sequence
    QE: model quality estimation, i.e., assessment of the quality measures reported by participant servers

Databases of protein models
• ModBase
  ▪ http://modbase.compbio.ucsf.edu/modbase-cgi/index.cgi
  ▪ database of annotated protein models generated by an automated pipeline that includes the MODELLER program
  ▪ contains ~38 million models for ~6.5 million unique sequences
• SWISS-MODEL Repository
  ▪ http://swissmodel.expasy.org/repository/
  ▪ database of annotated protein models generated by the automated homology-modeling pipeline SWISS-MODEL
  ▪ contains 2.2 million models for UniProt sequences
• PMDB (Protein Model DataBase)
  ▪ http://srv00.recas.ba.infn.it/PMDB/
  ▪ contains manually built 3D protein models
  ▪ users can download as well as submit models along with related supporting evidence