AlphaFold2: ML Revolution in Structural Biology
Marian Novotný, Karel Berka
9th September 2021

Outline
• Protein structure prediction
• CASP14
• AlphaFold2 - under the hood
• Uses of AF2
• AF2 DB
• AF2 in MetaCentrum
• Other software (RoseTTAFold, ML)
• Publicly available AF2 servers - the power of Jupyter notebooks
• Limitations and future challenges

So what is a protein?
• elements of cells that actually do things
• responsible for almost everything
• composed of amino acids
• produced from RNA by ribosomes
• folding leads to a 3D structure
• humans have around 20 000 different proteins
Image credits: Wikipedia; Guo et al., 2019

Knowing the structure helps to understand the function
Image credits: Hayer-Hartl et al., 2015; Wikipedia

Solving 3D structures is expensive...
https://bair.berkeley.edu/blog/2019/11/04/proteins/
The gap between the numbers of experimental structures and known sequences is increasing over time.

Can we use sequence to predict 3D structure?
• C.B. Anfinsen received the Nobel Prize in Chemistry (1972) for describing the relationship between sequence and structure: "The native conformation is determined by the totality of interatomic interactions and hence by the amino acid sequence, in a given environment."
• it should therefore be possible to predict structure from sequence

Principles of prediction from sequence
https://www.unil.ch/pmf/en/home/menuinst/technologies/homology-modeling.html

Structure prediction = simulation of protein folding
Levinthal's paradox: a protein of 100 aa has 10^70 available conformations -> sampling one conformation every 10^-11 s, it would take about 10^52 years to find its native shape

How to move the prediction field forward?
• transparent competition
• provide an "environment" for communication and exchange of experience
• develop metrics for careful examination of predicted structures
• CASP - Critical Assessment of protein Structure Prediction
• held every two years since 1994
• predictions are compared with experimentally solved structures
[Figure: CASP 11]

How to compare structures?
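The GDT_TS score that CASP uses for this comparison can be sketched in a few lines. This is only a minimal illustration, assuming the model and the experimental structure are already superposed and given as matching lists of Cα coordinates in Å; the real GDT procedure also searches over many superpositions to maximise the score, which is omitted here. The function name `gdt_ts` and the toy coordinates are hypothetical.

```python
import math

def gdt_ts(model_ca, ref_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS (0-100) for two already-superposed C-alpha traces.

    Assumption: no superposition search is done, so this can only
    underestimate the score a full GDT implementation would report.
    """
    if len(model_ca) != len(ref_ca) or not ref_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Distance of each model residue from its experimental position (in Å).
    dists = [math.dist(m, r) for m, r in zip(model_ca, ref_ca)]
    # Percentage of residues within each cutoff, then the mean over cutoffs.
    percents = [100.0 * sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return sum(percents) / len(percents)

# Toy example: four residues displaced by 0.5, 1.5, 3.0 and 9.0 Å.
ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]
print(gdt_ts(model, ref))  # (25 + 50 + 75 + 75) / 4 = 56.25
```

Because the four cutoffs are averaged, a residue that is off by 9 Å costs the same as one that is off by 90 Å, which makes the score more tolerant of a few badly placed regions than a global RMSD.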
https://predictioncenter.org/casp14/doc/presentations/2020_11_30_CASP14_Introduction_Moult.pdf
GDT_TS = Global Distance Test - Total Score (max 100%)
The conventional GDT_TS in CASP is the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of model residues that fall within the cutoff of their experimental position.

2018: AlphaFold enters...
https://predictioncenter.org/casp14/doc/presentations/2020_11_30_CASP14_Introduction_Moult.pdf

2020: AlphaFold2 wins
https://predictioncenter.org/casp14/doc/presentations/2020_11_30_CASP14_Introduction_Moult.pdf

What does a good prediction look like?
GDT_TS = 96.5

The worst prediction of AlphaFold2 in CASP14
GDT_TS = 44.6

Side-chain predictions: ORF8 (SARS-CoV-2)
GDT_TS = 87

So how does it work?

AlphaFold2 - under the hood

AlphaFold2
• Input: sequence extended by MSA + structural templates
• Evoformer and Structure module (with MD simulation)
• pLDDT - predicted local (per-residue) confidence
https://www.nature.com/articles/s41586-021-03819-2

MSA
multiple sequence alignment using standard tools - jackhmmer, HHblits
• sequence DBs:
  - UniRef90
  - UniClust30 = for sequence self-distillation
  - metagenomic DBs - to fully cover classes underrepresented in UniRef90:
    Big Fantastic Database (BFD) = 66M protein families from 2.2G protein sequences
    clustered MGnify
• at least 30 sequences per MSA were needed, otherwise quality deteriorated
https://www.nature.com/articles/s41586-021-03819-2

Training
• PDB database + PDB70 clusters
• training DB: 40% identity clusters, crops of 256 residues, batches of 128 on Tensor Processing Units (TPUs)
• accuracy enhanced by noisy-student self-distillation:
  - predict 350 000 structures from UniRef30 using the trained network
  - filter to a high-confidence subset
  - then train again from scratch with a mixture of PDB and UniRef30
  => effective use of unlabelled sequence data
• randomly mask or mutate individual residues of the MSA, BERT-style (Bidirectional Encoder Representations from Transformers) => the network learns to predict the masked elements within the MSA
https://www.nature.com/articles/s41586-021-03819-2
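The BERT-style masking of the MSA mentioned on the Training slide can be illustrated with a small sketch. It is not AlphaFold2's actual data pipeline: the corruption rate, the mask/mutate split, the `#` mask symbol, the decision to skip gap characters, and the name `corrupt_msa` are all assumptions chosen for illustration. The idea shown is only that some residues are hidden or mutated and the originals are kept as prediction targets.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical mask symbol, not taken from the AlphaFold2 code

def corrupt_msa(msa, rate=0.15, p_mask=0.7, rng=None):
    """Randomly mask or mutate residues of an MSA (list of aligned strings).

    Returns the corrupted MSA and a list of (row, column, original residue)
    targets that a network would be trained to recover.
    All default values here are illustrative assumptions.
    """
    rng = rng or random.Random()
    corrupted, targets = [], []
    for row, seq in enumerate(msa):
        chars = list(seq)
        for col, residue in enumerate(seq):
            # Gaps are left alone in this sketch; other positions are
            # selected for corruption with probability `rate`.
            if residue == "-" or rng.random() >= rate:
                continue
            targets.append((row, col, residue))
            if rng.random() < p_mask:
                chars[col] = MASK  # hide the residue
            else:
                # Mutate to a different, uniformly chosen amino acid.
                chars[col] = rng.choice([a for a in AMINO_ACIDS if a != residue])
        corrupted.append("".join(chars))
    return corrupted, targets

msa = ["MKTAYIAK", "MKT-YLAK", "MRTAYIAR"]
noisy, targets = corrupt_msa(msa, rate=0.3, rng=random.Random(0))
for original, changed in zip(msa, noisy):
    print(original, "->", changed)
```

Training the network to fill in the `targets` from the corrupted alignment forces it to use covariation between columns and between sequences, which is the same signal it later needs for structure prediction.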
Evoformer
- mixing MSA and pair representations via updates
- graph inference problem in 3D space - edges = residues in proximity
- updates are applied in each block separately (48 blocks); AF1 updated the whole network at once
- using triangles (instead of just pairs)
https://www.nature.com/articles/s41586-021-03819-2

Structure module
• prioritizes backbone positions + orientations
• residue gas - free-floating rigid-body rotations and translations
• updates:
  - IPA (invariant point attention) - neural activations only in rigid 3D
  - equivariant update using the updated activations
• backbone geometry is fixed later (avoids the loop closure problem)
• final side-chain refinement: OpenMM with the Amber99sb force field
https://www.nature.com/articles/s41586-021-03819-2

Effect of cross-chain contacts
• prediction is worse for heterotypic contacts (large complexes where the 3D structure is dictated by other chains in the complex)
• homotypic contacts yield high accuracy even when the chains are intertwined
PDB 6SKO

Timings
• one GPU minute per model with 384 residues => allows proteome-scale studies
• 1500-residue trimer (SARS-CoV-2 S protein): about a day on the ELIXIR CZ MetaCentrum pipeline

AlphaFoldDB
https://www.alphafold.ebi.ac.uk/
Complete structures of 20 model organisms
AlphaFold tells you where it is right!

How good are the predictions of human proteins?
pLDDT - per-residue estimate of the model's confidence on a scale from 0 to 100. It is the model's predicted score on the lDDT-Cα metric (a local superposition-free score for comparing protein structures and models using distance difference tests).

Usages

AlphaFold in Google Colab
https://colab.research.google.com/github/sokrypton/ColabFold/
• GitHub-hosted Jupyter notebooks running in the Google Colab environment
• limited in size
Mirdita M, Ovchinnikov S, Steinegger M. ColabFold - Making protein folding accessible to all.
bioRxiv, 2021

AlphaFold on ELIXIR CZ
• AlphaFold needs a GPU to run -> not many people have one in their PC
• AlphaFold has been installed on ELIXIR CZ hardware
• ELIXIR CZ is accessible through MetaCentrum
• speed depends on the size of the predicted protein, but can be on the order of tens of minutes

AlphaFold is just a start... (as of 9.9.2021)
• others use AlphaFold ideas to develop their own 3D structure predictions - RoseTTAFold
• prediction of designed proteins
• prediction of RNA structures
• prediction of orphan proteins
• molecular replacement
• interpretation of cryoEM
• pLDDT can act as an IDP predictor

Accurate prediction of protein structures and interactions using a three-track neural network
https://www.science.org/doi/full/10.1126/science.abj8754

Geometric deep learning of RNA structure
https://www.science.org/doi/10.1126/science.abe5650

Single-sequence protein structure prediction using language models from deep learning
https://www.youtube.com/watch?v=eobc7cMMpeY&feature=youtu.be

AlphaFold and Implications for Intrinsically Disordered Proteins
Ruff KM, Pappu RV. AlphaFold and Implications for Intrinsically Disordered Proteins. J Mol Biol. 2021. https://doi.org/10.1016/j.jmb.2021.167208

MrParse: Finding homologues in the PDB and the EBI AlphaFold database for Molecular Replacement and more
Adam J. Simpkin, Jens M. H. Thomas, Ronan M. Keegan, Daniel J. Rigden. doi: https://doi.org/10.1101/2021.09.02.458604

Limitations
Are structural biologists and bioinformaticians on the job market?
• AlphaFold cannot do multiprotein complexes - interactions
• AlphaFold cannot do point mutations - design of functions
• AlphaFold cannot do conformational changes or dynamics
• AlphaFold cannot do effects of post-translational protein modifications
• AlphaFold cannot do ligand effects
• AlphaFold is not good with orphan sequences
• AlphaFold does not tell us much about the folding process

Are the models good enough for drug design?
• we do not know yet
• average RMSD of AlphaFold2 models is 1.3 Å
• average RMSD of X-ray structures is 0.3 - 0.5 Å
• the best AlphaFold prediction has an RMSD of 0.6 Å
• locally, AlphaFold2 might already be there
T1064

Summary
• AlphaFold2 made a huge leap in prediction accuracy
• the role of open science and publicly available data cannot be overstated
• the CASP competition was a driver of the change
• AlphaFold is publicly available and can be run from many places, including ELIXIR CZ
• AlphaFold has already inspired many tools
• AlphaFold's limits are yet to be fully described

Thank you for your attention. Any questions?

Extra slides

Architectural details

Interpreting the neural network
https://www.nature.com/articles/s41586-021-03819-2
depth of the neural network - it is usually quick, but for challenging targets it can be quite deep