Models of structures
3D structure prediction
25-3D Modelling
3D structure prediction
q homology modeling
q fold recognition
q ab initio prediction
q “hybrid” approaches
q Assessment
q databases of protein models
Importance of structure
q no experimental structure for most of the sequences
[Figure: number of entries (millions) by year]
Homology modeling
q basic principle – structure is more conserved than sequence
§ similar sequences adopt practically identical structures
[Figure: haloalkane dehalogenase LinB (PDB-ID 1iz7) compared with
haloalkane dehalogenase DhaA (PDB-ID 1cqw); sequence identity: ~ 50 %]
§ distantly related sequences still fold into similar structures
[Figure: haloalkane dehalogenase LinB (PDB-ID 1iz7) compared with
chloroperoxidase L (PDB-ID 1a88); sequence identity: ~ 15 %]
q number of folds in SCOP database
[Figure: number of folds by year, per year and total]
q builds an atomic-resolution model of the target protein
based on the experimental 3D structure (template) of a
homologous protein
q the most accurate 3D prediction approach
q if no reliable template is available → fold recognition or ab
initio prediction
q the quality of the model depends on the sequence identity
/similarity between the target and template proteins
q for a protein of standard length, the sequence identity should be
> 25 % and the sequence similarity > 40 %
[Figure: safe homology modeling zone vs. twilight zone]
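The identity criterion above can be checked directly on an aligned target-template pair. The following is a minimal sketch, assuming identity is counted over aligned (non-gap) columns only and that a fixed 25 % cutoff is an acceptable simplification; the function names are illustrative and not taken from any modelling package.

```python
def percent_identity(target_aln: str, template_aln: str) -> float:
    """Percent identity over columns where neither row has a gap."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(target_aln, template_aln)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def zone(identity: float, cutoff: float = 25.0) -> str:
    """Classify a pair using a fixed cutoff (assumption: standard-length protein)."""
    if identity > cutoff:
        return "safe homology modeling zone"
    return "twilight zone"


# Alignment fragment used in the slides' workflow diagram
target = "MSLGAKPFGE"
template = "MGV-AKTYGE"
pid = percent_identity(target, template)
print(f"{pid:.1f} % identity -> {zone(pid)}")
```

Other conventions for the denominator (alignment length, length of the shorter sequence) give different percentages, so the convention should be stated whenever a threshold is quoted.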
Database search
q standard sequence-similarity searches
§ comparison of the target sequence to all sequences with known 3D
structures in the wwPDB database
§ BLAST, FASTA,...
q profile-based searches
§ more sensitive than standard sequence-similarity searches
§ PSI-BLAST, HMMER, HHblits, ...
q fold recognition methods
§ applied if no template can be reliably identified by the sequence-
or profile-based methods (sequence identity below the recommended 25 %)
§ FUGUE, GenTHREADER, pro-sp3-TASSER, ...
Selection of template
q wrong template = wrong model
q more than one possible template may be identified → a
combination of different criteria to select the final template:
§ sequence identity between the template and target proteins
§ coverage between the template and target sequences
§ the resolution of the template structure and the number of errors in it
§ the proportion of conserved residues in the region of interest (e.g.,
binding site residues)
§ ...
q multiple templates can be used to create a combined model
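One way to combine the criteria listed above is a weighted score. This is only a sketch under stated assumptions: the weights, the resolution scaling, and the template identifiers (tmplA, tmplB, tmplC) are hypothetical and chosen for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Template:
    name: str              # hypothetical identifier, not a real PDB entry
    identity: float        # % sequence identity to the target
    coverage: float        # fraction of the target covered (0-1)
    resolution: float      # in Å, lower is better
    site_conserved: float  # fraction of conserved residues in the region of interest (0-1)


def score(t: Template, w_id=0.4, w_cov=0.3, w_res=0.1, w_site=0.2) -> float:
    """Weighted sum of the selection criteria (weights are assumptions)."""
    # Map resolution onto 0-1: 1.0 Å or better -> 1, 4.0 Å or worse -> 0
    res_term = max(0.0, min(1.0, (4.0 - t.resolution) / 3.0))
    return (w_id * t.identity / 100.0 + w_cov * t.coverage
            + w_res * res_term + w_site * t.site_conserved)


candidates = [
    Template("tmplA", identity=48.0, coverage=0.95, resolution=1.8, site_conserved=0.9),
    Template("tmplB", identity=52.0, coverage=0.60, resolution=2.9, site_conserved=0.5),
    Template("tmplC", identity=30.0, coverage=0.99, resolution=1.2, site_conserved=0.7),
]
ranked = sorted(candidates, key=score, reverse=True)
for t in ranked:
    print(t.name, round(score(t), 3))
```

In this toy case the template with the highest identity (tmplB) ranks last because of its low coverage and poor resolution, which is the reason for using several criteria rather than identity alone.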
Sequence alignments
q reliability of alignment decreases with decreasing similarity
of the target and template sequences
q quality of alignment is crucial – it determines the quality of
the final model
q the pairwise target-template alignment provided by the
database search methods is almost guaranteed to contain
errors → more sophisticated methods needed
§ multiple sequence alignment
§ Profile-driven alignments
§ correction of alignment based on the template structure
q multiple sequence alignment
§ works with more information than pairwise alignment → more
reliable
§ MUSCLE, CLUSTAL Omega, T-Coffee
Building model framework
q copying the basic shape of the template to the model
§ if the two aligned residues differ, the backbone coordinates for N, Cα,
C and O, and often also Cβ can be copied
§ conserved residues can be copied completely to provide an initial
guess
§ residues that are not present in the target (because the target can
have fewer residues than the template) are not copied
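The copying rules above can be sketched as a walk along the target-template alignment. This is a simplified illustration, assuming the template is given as one atom dictionary per residue; the data layout, the helper fake_residue, and the dummy coordinates are hypothetical.

```python
BACKBONE = ("N", "CA", "C", "O")


def build_framework(target_aln, template_aln, template_residues):
    """Return a list of (target residue, copied atoms or None)."""
    if len(target_aln) != len(template_aln):
        raise ValueError("alignment rows must have equal length")
    model = []
    tpl_index = 0  # position in the ungapped template structure
    for tgt, tpl in zip(target_aln, template_aln):
        atoms = None
        if tpl != "-":
            atoms = template_residues[tpl_index]
            tpl_index += 1
        if tgt == "-":
            continue                       # template-only residue: not copied
        if atoms is None:
            model.append((tgt, None))      # target-only residue: left for loop modelling
        elif tgt == tpl:
            model.append((tgt, dict(atoms)))  # conserved: copy all atoms
        else:
            # differing residues: backbone and Cβ only (glycine has no Cβ)
            keep = BACKBONE if tgt == "G" else BACKBONE + ("CB",)
            model.append((tgt, {n: xyz for n, xyz in atoms.items() if n in keep}))
    return model


def fake_residue(index, atom_names):
    """Dummy coordinates so the example is self-contained."""
    return {name: (float(index), float(k), 0.0) for k, name in enumerate(atom_names)}


template_residues = [
    fake_residue(0, ("N", "CA", "C", "O", "CB", "CG", "SD", "CE")),        # M
    fake_residue(1, ("N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ")),  # K
    fake_residue(2, ("N", "CA", "C", "O", "CB", "CG1", "CG2")),            # V
    fake_residue(3, ("N", "CA", "C", "O", "CB")),                          # A
]
model = build_framework("MS-GA", "MKV-A", template_residues)
for residue, atoms in model:
    print(residue, sorted(atoms) if atoms else "no coordinates yet")
```

Residues returned with no coordinates are the gaps that the loop modelling step has to fill.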
Loop modelling
q inserting missing residues into the continuous backbone
q prediction of loop conformation is a difficult task (especially
for loops longer than 5-8 residues)
§ knowledge based prediction – use of libraries of possible loop
conformations known from experimentally determined structures
with the same local sequence
§ ab initio prediction – use of energy functions to find the
optimal conformation, followed by minimization of the structure
§ hybrid approach – the loop is divided into small fragments that are all
separately compared to known structures
Side-chain modelling
q adding side-chains of amino acids to the model backbone
q rotamer libraries – common side-chain conformations (rotamers)
extracted from high-resolution X-ray structures → possible rotamers
are explored and scored with an energy function
q backbone-dependent rotamer libraries – the optimal conformation of
the side chain depends on the local backbone conformation (5 - 9
neighboring residues) → only the rotamers corresponding to the best
backbone matches are explored, which greatly reduces the
conformational search space
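A backbone-dependent lookup followed by energy scoring might look roughly like this. The library entries, the 20-degree binning, and the clash penalty below are made-up illustrative values, not data from a published rotamer library.

```python
import math

# Hypothetical toy library: (residue, phi bin, psi bin) -> [(chi1, chi2, probability)]
LIBRARY = {
    ("TYR", -60, -40): [(-65.0, 100.0, 0.48), (180.0, 80.0, 0.46), (60.0, 90.0, 0.06)],
    ("TYR", -120, 140): [(-65.0, 100.0, 0.70), (180.0, 80.0, 0.25), (60.0, 90.0, 0.05)],
}


def bin_angle(angle: float, width: int = 20) -> int:
    """Round a dihedral angle to the nearest bin centre."""
    return int(round(angle / width)) * width


def candidate_rotamers(residue: str, phi: float, psi: float):
    """Only rotamers stored for the matching backbone bin are considered."""
    return LIBRARY.get((residue, bin_angle(phi), bin_angle(psi)), [])


def best_rotamer(residue, phi, psi, clash_energy):
    """Pick the rotamer with the lowest -ln(probability) + clash energy."""
    def energy(rotamer):
        chi1, chi2, probability = rotamer
        return -math.log(probability) + clash_energy(chi1, chi2)
    rotamers = candidate_rotamers(residue, phi, psi)
    return min(rotamers, key=energy) if rotamers else None


def no_clash(chi1, chi2):
    return 0.0


def neighbour_clash(chi1, chi2):
    """Hypothetical environment: a neighbouring atom blocks chi1 near -65 degrees."""
    return 5.0 if abs(chi1 + 65.0) < 15.0 else 0.0


free_choice = best_rotamer("TYR", -57.0, -43.0, no_clash)
crowded_choice = best_rotamer("TYR", -57.0, -43.0, neighbour_clash)
print(free_choice, crowded_choice)
```

With two rotamers of nearly equal probability, the environment term decides between them, which is why the library alone is not sufficient.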
[Figure: backbone-dependent rotamer library. For the backbone
conformation shown, the library favors two different conformations of
tyrosine, which appear about equally often in the database.]
Model optimization
q energy minimization – may introduce many errors moving
the model away from its correct structure → must be used
carefully
q molecular dynamics simulation – follows the motions of
the protein and mimics the folding process
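As an illustration of energy minimization, the sketch below applies steepest descent to a toy energy that contains only harmonic terms between consecutive Cα atoms. The force constant, step size, and starting coordinates are assumptions; 3.8 Å is used as the ideal Cα-Cα distance. Real force fields contain many more terms.

```python
import math


def bond_energy_and_gradient(coords, ideal=3.8, k=1.0):
    """Harmonic energy 0.5*k*(r - ideal)^2 summed over consecutive atoms."""
    energy = 0.0
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for i in range(len(coords) - 1):
        delta = [coords[i + 1][a] - coords[i][a] for a in range(3)]
        r = math.sqrt(sum(x * x for x in delta))
        energy += 0.5 * k * (r - ideal) ** 2
        for a in range(3):
            g = k * (r - ideal) * delta[a] / r
            grad[i + 1][a] += g
            grad[i][a] -= g
    return energy, grad


def minimize(coords, steps=200, step_size=0.1):
    """Steepest descent: move every atom against its energy gradient."""
    coords = [list(c) for c in coords]
    for _ in range(steps):
        _, grad = bond_energy_and_gradient(coords)
        for atom, g in zip(coords, grad):
            for a in range(3):
                atom[a] -= step_size * g[a]
    return coords


start = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 5.0, 0.0)]
e_start, _ = bond_energy_and_gradient(start)
relaxed = minimize(start)
e_final, _ = bond_energy_and_gradient(relaxed)
print(f"energy before: {e_start:.3f}, after: {e_final:.2e}")
```

The minimizer only lowers the energy it is given; if that energy function is inaccurate, the atoms still move, which is one way to read the warning that minimization must be used carefully.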
Model validation
q a finished model contains errors (like any other structure) – the
number of errors (for a given method) mainly depends on:
§ the percentage of sequence identity between the template and target
sequences, e.g., ~90 %: accuracy of the model comparable to X-ray
structures; 50 %-90 %: larger local errors; < 25 %: often very
large errors
§ the number of errors in the template structure
q problems that occur far from the site of interest may be
ignored; others should be tackled
Homology modelling – steps
1. target sequence (...MSLGAKPFGE...)
2. database search
3. selection of template
4. sequence alignment (...MSLGAKPFGE... aligned with ...MGV-AKTYGE...)
5. building model framework
6. loop and side-chain modeling
7. model optimization
8. model validation
9. iteration (return to earlier steps as needed)
Iteration
q portions of the homology modeling process can be iterated to
correct identified errors
§ small errors introduced during the optimization → running a shorter
molecular dynamics simulation
§ error in a loop → choosing another loop conformation in the loop
modeling step
§ large mistakes in the backbone conformation → repeating the whole
process with another alignment or even different template
§ ...
Homology modeling programs
q MODELLER
§ http://salilab.org/modeller/
§ models built by satisfying the spatial restraints of the Cα-Cα bond
lengths and angles, the dihedral angles of the side-chains, and van
der Waals interactions
§ restraints calculated from the template structures
§ available as a web server at different sites, e.g., part of: ModWeb
workflow https://modbase.compbio.ucsf.edu/modweb/, GeneSilico
server https://genesilico.pl/toolkit/unimod?method=Modeller or
Bioinformatics toolkit http://toolkit.lmb.uni-muenchen.de/modeller
q SWISS-MODEL
§ http://swissmodel.expasy.org/
§ fully automated protein structure homology modeling server
Model validation
q mostly the same principles as used for the validation of
experimental structures
q always check both model and template
§ the model cannot be better than the template in regions where the
template itself is “bad”
q checks of normality
§ inside/outside distributions of polar and apolar residues
§ bad contacts
§ evaluation of atom/residue environment
q energy-based checks
§ side-chain clashes
§ bond lengths and angles
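A "bad contacts" check can be approximated by listing atom pairs that are closer than a distance cutoff. This is a minimal sketch: the single 2.2 Å cutoff, the sequence-separation filter, and the coordinates are illustrative assumptions, whereas validation programs typically use atom-type-specific criteria.

```python
import math
from itertools import combinations


def bad_contacts(atoms, min_dist=2.2, min_separation=2):
    """atoms: list of (residue number, atom name, (x, y, z)).

    Returns pairs closer than min_dist, skipping residues that are fewer
    than min_separation positions apart in the sequence.
    """
    clashes = []
    for (res_a, name_a, xyz_a), (res_b, name_b, xyz_b) in combinations(atoms, 2):
        if abs(res_a - res_b) < min_separation:
            continue
        distance = math.dist(xyz_a, xyz_b)
        if distance < min_dist:
            clashes.append(((res_a, name_a), (res_b, name_b), round(distance, 2)))
    return clashes


atoms = [
    (1, "CA", (0.0, 0.0, 0.0)),
    (2, "CA", (3.8, 0.0, 0.0)),
    (5, "CA", (1.0, 1.0, 0.0)),    # placed too close to residue 1
    (9, "CA", (10.0, 10.0, 10.0)),
]
clashes = bad_contacts(atoms)
print(clashes)
```

Running the same check on the template helps to tell errors inherited from the template apart from errors introduced during modelling.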
Model validation programs
q QMEAN
§ https://swissmodel.expasy.org/qmean/
§ composite scoring function for the quality estimation of protein
structure models; evaluates torsion angles, solvation and non-bonded
interactions and the agreement between predicted and calculated
secondary structure and solvent accessibility
q Verify3D
q ANOLEA
q PROCHECK
q WHATCHECK
q PROSA II
q …
Fold recognition (Threading)
q predicts the fold of a protein by fitting its sequence into a
structural database and selecting the best-fitting fold
q provides a rough approximation of the overall topology of
the native structure → does not generate fully refined
atomic models for the query sequence
q can be used when no suitable template structures are available
for homology modeling
q fails if the correct protein fold does not exist in the database
q high rates of false positives
q pairwise energy-based methods (threading) – protein
sequence is searched for in a structural database to find the
best matching structural fold using energy-based criteria
[Figure: pairwise energy (kcal/mol) as a function of the Cβ-Cβ distance
for Glu-Asp and Glu-Arg pairs with sequence separation l > 10]
§ l is the distance in sequence (density normalization required)
§ the energies can be calculated from collections of known structures
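Such pair energies are commonly derived with an inverse Boltzmann relation, E(r) = -kT ln(f_pair(r) / f_ref(r)), where the reference distribution provides the normalization. The sketch below assumes that approach; the distance-bin counts are invented for illustration and the pseudocount is an assumption to avoid taking the logarithm of zero.

```python
import math

KT = 0.593  # kcal/mol at about 298 K (approximate value)


def pair_potential(pair_counts, reference_counts, kt=KT, pseudocount=1.0):
    """Energy per distance bin from observed vs. reference frequencies."""
    if len(pair_counts) != len(reference_counts):
        raise ValueError("both histograms must use the same distance bins")
    n_bins = len(pair_counts)
    total_pair = sum(pair_counts) + pseudocount * n_bins
    total_ref = sum(reference_counts) + pseudocount * n_bins
    energies = []
    for n_pair, n_ref in zip(pair_counts, reference_counts):
        f_pair = (n_pair + pseudocount) / total_pair
        f_ref = (n_ref + pseudocount) / total_ref
        energies.append(-kt * math.log(f_pair / f_ref))
    return energies


# Hypothetical counts per Cβ-Cβ distance bin (short, medium, long)
oppositely_charged_pair = [30, 20, 10]  # enriched at short distance
all_pairs_reference = [100, 200, 300]
energies = pair_potential(oppositely_charged_pair, all_pairs_reference)
print([round(e, 2) for e in energies])
```

A pair that is seen more often than the reference at a given distance gets a negative (favourable) energy there, and a pair seen less often gets a positive one.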
[Figure: threading workflow. The target sequence (MSLGAKPFGE...) is
threaded onto each fold of the fold library (fold 1, fold 2, ..., fold n),
followed by model building, energy calculations, and scoring and ranking.]
1. alignment of the query sequence with each structural fold in the
fold library (essentially performed at the sequence profile level)
2. building a crude model for the target sequence (replacing aligned
residues in the template structure with the corresponding residues
in the query)
3. calculating energy of the raw model
4. ranking of the models based on the energetics – the lowest energy
fold represents the structurally most compatible fold
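The steps above can be illustrated with a toy threading run. This sketch makes strong simplifying assumptions: the alignment is ungapped (query position i is placed on fold position i), each fold is reduced to a list of contacting positions, and the contact energies are crude made-up values. The fold names are hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def contact_energy(a: str, b: str) -> float:
    """Crude illustrative contact energies (not a published potential)."""
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return -1.0
    if (a in POSITIVE and b in NEGATIVE) or (a in NEGATIVE and b in POSITIVE):
        return -1.0
    if (a in POSITIVE and b in POSITIVE) or (a in NEGATIVE and b in NEGATIVE):
        return 1.0
    return 0.0


def thread_energy(query: str, contacts) -> float:
    """Place the query on the fold (ungapped) and sum the contact energies."""
    return sum(contact_energy(query[i], query[j])
               for i, j in contacts
               if i < len(query) and j < len(query))


def rank_folds(query: str, fold_library: dict):
    """Return (fold name, energy) pairs, lowest energy first."""
    scored = [(name, thread_energy(query, contacts))
              for name, contacts in fold_library.items()]
    return sorted(scored, key=lambda item: item[1])


fold_library = {
    "fold_a": [(0, 2), (2, 4), (1, 3)],
    "fold_b": [(1, 5), (0, 3), (2, 5)],
    "fold_c": [(0, 4), (3, 5), (1, 2)],
}
ranking = rank_folds("LKVELK", fold_library)
print(ranking)
```

Real threading programs optimize the sequence-to-structure alignment with gaps and normalize the scores, which this sketch leaves out.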
q profile methods
Fold recognition programs
q PHYRE
§ http://www.sbg.bio.ic.ac.uk/phyre2/
§ profile-based method
§ the highest scoring alignments are used to construct full 3D
models of the query – missing or inserted regions are repaired using
a loop library and reconstruction procedure, side-chains are placed
using a fast graph-based algorithm
q RaptorX
§ http://raptorx.uchicago.edu/
§ provides single-template threading, alignment quality prediction,
and multiple-template threading
q GenTHREADER
§ http://bioinf.cs.ucl.ac.uk/psipred/
§ uses a hybrid of the profile and pairwise energy methods
§ multiple sequence alignment and secondary structure predictions
derived for the query are used as input for threading
§ threading results are evaluated using neural networks
Ab initio prediction
q attempts to generate a structure by using physicochemical
principles only
q used when neither homology modeling nor fold recognition
can be applied
q searches for the structure at the global free-energy minimum
q so far only limited success in obtaining correct structures
Ab initio prediction programs
q Rosetta
§ http://www.rosettacommons.org/
§ software suite for predicting and designing protein structures,
protein folding mechanisms, and protein-protein interactions
“Hybrid” 3D structure prediction programs
q I-TASSER
§ http://zhanglab.ccmb.med.umich.edu/I-TASSER/
§ combines homology modeling, threading and ab initio predictions
§ No. 1 server for protein structure prediction in previous CASP
experiments
q Robetta
§ http://robetta.bakerlab.org/
§ combines homology modeling and ab initio predictions
§ implements ROSETTA software
AlphaFold1: ML-powered threading
§ Combines threading with ML
§ No. 1 method for protein structure prediction in the CASP13 (2018)
experiment
AlphaFold 2: ML revolution
§ Two independent tracks for sequence and structure, each ML-powered
§ Attention layer in the structure module doing the “trick”
§ AF multimer for protein interactions
§ No. 1 method for protein structure prediction in the CASP14 (2020) experiment
AlphaFold 3: turning up another notch
§ Simplified representations (sequence-based)
§ Structural info only optional.
§ Simplified network
§ Models all sorts of biomolecules.
ML-powered (reverse) folding
q AlphaFold, ESMFold
§ state-of-the-art homology modelling approaches
§ simple problems can be solved by simpler approaches
§ ESMFold: quality is length-dependent
q RoseTTAFold diffusion
§ hallucinates new 3D structures from scratch (the ML model has learned
what structures look like)
q ProteinMPNN
§ solves the reverse problem: from a 3D structure, predicts the optimal
sequence
Assessment of prediction methods
q CASP (Critical Assessment of techniques for protein
Structure Prediction)
§ http://predictioncenter.org/
§ biennial international contest providing objective evaluation of the
performance of individual prediction methods
§ evaluation based on a large number of blind predictions – contestants
are given protein sequences whose structures have been solved but
not yet published; the predictions are then compared with the newly
determined structures
§ competition in several categories
q CAMEO (Continuous Automated Model EvaluatiOn)
§ https://www.cameo3d.org/
§ weekly assessment of new structures in the PDB
§ registered prediction servers receive weekly requests for not-so-easy
new structures from the weekly PDB pre-release
§ multiple scores considered; normalized average (lDDT) reported
§ Categories:
§ 3D: Prediction of the 3D coordinates of a protein from sequence
§ QE: Model quality Estimation: Assessment of quality measures
reported by participant servers
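The lDDT score mentioned above compares inter-atomic distances in the model with those in the reference structure and needs no superposition. The following is a simplified Cα-only sketch, assuming a 15 Å inclusion radius and tolerance thresholds of 0.5, 1, 2 and 4 Å; the full score uses all atoms and additional stereochemical checks, and the coordinates here are invented.

```python
import math


def lddt_ca(reference, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of reference Cα-Cα distances preserved in the model."""
    if len(reference) != len(model):
        raise ValueError("reference and model must have the same number of atoms")
    preserved = 0
    total = 0
    for i in range(len(reference)):
        for j in range(i + 1, len(reference)):
            d_ref = math.dist(reference[i], reference[j])
            if d_ref >= radius:
                continue  # only local distances are scored
            d_model = math.dist(model[i], model[j])
            difference = abs(d_ref - d_model)
            total += len(thresholds)
            preserved += sum(difference <= t for t in thresholds)
    return preserved / total if total else 0.0


reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
shifted = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in reference]  # rigid translation
distorted = reference[:3] + [(11.4, 3.0, 0.0)]                    # last atom displaced

score_shifted = lddt_ca(reference, shifted)
score_distorted = lddt_ca(reference, distorted)
print(score_shifted, score_distorted)
```

The translated copy keeps a perfect score because only internal distances are compared, while displacing a single atom lowers the score through the distances that involve it.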
Databases of protein models
q ModBase
§ http://modbase.compbio.ucsf.edu/modbase-cgi/index.cgi
§ database of annotated protein models generated by the automated
pipeline including the MODELLER program
§ contains ~38 million models for ~6.5 million unique sequences
q SWISS-MODEL repository
§ http://swissmodel.expasy.org/repository/
§ database of annotated protein models generated by the automated
homology-modeling pipeline SWISS-MODEL.
§ contains 2.2 million models for UniProt sequences
q PMDB (Protein Model DataBase)
§ http://srv00.recas.ba.infn.it/PMDB/
§ contains manually built 3D protein models
§ users can download as well as submit models along with related
supporting evidence