LOSCHMIDT LABORATORIES
PROTEIN ENGINEERING
2. IN SILICO IDENTIFICATION OF PROTEINS

Outline
□ Why search for new proteins?
□ How to acquire new proteins?
■ traditional approach
■ metagenomic approach
■ bioinformatic approach
□ Bioinformatic approach
■ Where to find target sequences?
■ How to find target sequences?
■ How to recognize interesting sequences?
□ What to keep in mind?

Strategies for protein optimization
[Diagram: optimization of a protein for applications, either by protein engineering or by exploiting natural diversity]

Why search for new proteins?
□ better understanding of structure-function relationships
■ required for rational design
□ novel properties
■ stability [Plot: activity [%] vs. time [h]]
■ temperature profile
■ activity
■ specificity [Plot: specific activity]
■ enantioselectivity [Chromatogram: ethyl (S)-2-bromopropionate, time (min)]
□ better starting points for protein engineering
-> proteins with desired properties -> practical applications

How to acquire new proteins?
□ traditional approach
[Diagram: sample -> enrichment -> isolation -> DNA library -> screening]
■ microorganisms possessing the target activity are enriched from the environment and isolated in pure culture
■ proteins or the corresponding genes are recovered from the organisms by protein purification, DNA library screening, PCR with specific primers, etc.
■ the majority of microorganisms (> 99 %) cannot be cultivated using standard techniques -> a large fraction of the microbial diversity in an environment is lost
□ metagenomic approach
[Diagram: sample -> DNA library]
■ isolation and cloning of DNA extracted directly from an environmental sample (without culturing the organisms present)
■ genes are recovered by DNA library screening, PCR with specific primers, etc.
■ enables exploration of the biodiversity of uncultured microorganisms
□ bioinformatic approach
[Diagram: (meta)genomic sequencing projects -> sequence database; screenshot of the NCBI Entrez cross-database search page]
[Diagram: in silico "screening" of a sequence database (example hits: GenBank protein records for LinB) -> gene synthesis or DNA request]
■ fast and cheap way to identify novel proteins -> one cannot find what is not in the database (but there is a lot of data - more than one usually needs)
■ genes are recovered by gene synthesis or obtained from sequencing consortia upon request

Where to find target sequences?
■ databases of nucleotide sequences
■ databases of protein sequences

Databases of nucleotide sequences
□ GenBank
■ http://www.ncbi.nlm.nih.gov/genbank/
■ provided by NCBI (National Center for Biotechnology Information)
□ EMBL-Bank
■ http://www.ebi.ac.uk/embl/
■ provided by EBI (European Bioinformatics Institute)
□ DDBJ (DNA Data Bank of Japan)
■ http://www.ddbj.nig.ac.jp/
■ provided by the National Institute of Genetics, Japan

Databases of nucleotide sequences
□ GenBank, EMBL-Bank, DDBJ
■ annotated collections of all publicly available nucleotide sequences
■ freely available to the wider community
■ contain data obtained from genomic centers or research institutions
■ daily synchronization of new or updated data
■ contain about 250,000,000 sequences (Feb 2024)
■ mostly automatic annotations - lower quality, errors

Databases of protein sequences
□ UniProtKB
■ http://www.uniprot.org/
■ provided by EBI, the Swiss Institute of Bioinformatics, and the Protein Information Resource
□ nr Protein database
■ http://www.ncbi.nlm.nih.gov/protein/
■ provided
by NCBI

Databases of protein sequences
□ UniProtKB, nr Protein database
■ annotated collections of publicly available protein sequences
■ freely available to the wider community
■ contain data obtained by conceptual translation of coding sequences from EMBL-Bank/GenBank/DDBJ or provided by research institutions
■ contain more than 250,000,000 sequences (Feb 2024)
■ mostly automatic annotations - lower quality, errors

Databases of protein sequences
□ UniProtKB
■ rich annotations (e.g., information about the function of the protein and of individual amino acids, experimental data, biological ontologies, classifications, ...)
■ clear indication of annotation quality (manual vs. automatic)
[Screenshot: UniProt home page, release 2024_05; UniProtKB Reviewed (Swiss-Prot): 572,214 entries; UniRef clusters at 100 %, 90 %, and 50 % identity]
□ UniProtKB/Swiss-Prot
■ high-quality annotations, i.e., manually annotated entries or expert-reviewed automatic annotations
■ source of reliable information
■ contains "only" ~ 570,000 sequences (Feb 2025)
□ UniProtKB/TrEMBL
■ automatic annotations - lower quality, errors
■ contains ~ 250,000,000 sequences (Feb 2025)

Databases of protein sequences
[Plot: number of sequences vs. number of characterized proteins]

Pitfalls of sequence databases
□ large number of errors
■ errors in sequences (wrong base, frameshift errors)
■ wrong positions of genes
■ exon-intron boundary errors
■ errors and inaccuracies in annotations

How to find target sequences?
■ text-based searches
■ sequence-based searches

Data format
□ FASTA sequence:
■ header starting with ">" followed by the sequence description
■ sequence data start on a new line

>Haloalkane dehalogenase LinB
MSLGAKPFGEKKFIEIKGRRMAYIDEGTGDPILFQHGNPTSSYLWRNIMPHCAGLGRLIACDLI
GMGDSDKLDPSGPERYAYAEHRDYLDALWEALDLGDRVVLVVHDWGSALGFDWARRHRERVQGI
AYMEAIAMPIEWADFPEQDRDLFQAFRSQAGEELVLQD

Text-based searches
□ database retrieval systems
■ enable quick and easy searching of many databases at the same time
■ queries are specified using logical operators (AND, OR, NOT, ...)
■ Entrez (NCBI), SRS (EBI)
□ results depend on sequence annotations
■ erroneous, inaccurate, or too general annotations
■ synonyms
■ misspellings

Text-based searches
□ database retrieval systems
■ example query (Entrez, search across databases): mouse[ORGN] AND kinase AND (exons OR introns)
[Screenshot: Entrez cross-database results for this query, e.g., 1258 PubMed, 152 Nucleotide, and 96 Protein records]
□ advanced search options
[Screenshot: UniProtKB advanced search combining the Gene Name [GN], Taxonomy [OC], and Keyword [KW] fields with AND]
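The two ideas above (FASTA records and annotation-dependent text searches) can be illustrated together. The following is a minimal sketch in plain Python, not the behaviour of Entrez, SRS, or UniProtKB: the helpers `parse_fasta` and `text_search` are hypothetical, and the second and third records are invented for illustration. It shows how a misspelled or too general annotation turns into a false negative.

```python
from typing import Dict, Iterable, List, Sequence


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # description follows the ">" sign
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence data may span several lines
    return records


def text_search(records: Dict[str, str],
                required: Sequence[str],
                excluded: Sequence[str] = ()) -> List[str]:
    """Headers containing all `required` terms (AND) and no `excluded` term (NOT)."""
    hits = []
    for header in records:
        text = header.lower()
        if (all(term.lower() in text for term in required)
                and not any(term.lower() in text for term in excluded)):
            hits.append(header)
    return hits


# First record: start of the LinB entry from the slide.
# The other two records are hypothetical and only illustrate annotation problems.
example = """>Haloalkane dehalogenase LinB
MSLGAKPFGEKKFIEIKGRRMAYIDEGTGDPILFQHGNPTSSYLWRNIMPHCAGLGRLIACDLI
GMGDSDKLDPSGPERYAYAEHRDYLDALWEALDLGDRVVLVVHDWGSALGFDWARRHRERVQGI
>Haloalkan dehalogenase (hypothetical entry with a misspelled annotation)
MSEIGTGFPFDPHYVEVLGERMHYVDVGPRDGT
>Hypothetical protein (hypothetical entry with a too general annotation)
MSLGAKPFGEKKFIEIKGRR
"""

records = parse_fasta(example.splitlines())
hits = text_search(records, required=["haloalkane", "dehalogenase"])
print(hits)
```

Only the correctly annotated record is returned; the other two entries are missed even though their sequences could belong to the same family, which is why text-based results should be complemented by sequence-based searches.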
Sequence-based searches
□ searches based on sequence similarity
■ results are not influenced by sequence annotations
□ rely on the assumption that proteins with the same function have similar sequences
■ not always true - close homologs vs. distant homologs vs. analogs
[Figure: multiple sequence alignment of seven sequences with conserved positions highlighted]

□ BLAST
■ heuristic search for significant sequence similarities
■ reasonable sensitivity and good speed
■ gold standard in sequence searching
□ PSI-BLAST
■ "iterative BLAST" making use of a multiple sequence alignment
■ more sensitive search that detects weak but biologically significant similarities between sequences
□ HMMER
■ uses hidden Markov models for sensitive detection of remote homologs
■ slower than BLAST for simple sequence similarity searches
■ widely used in protein family classification and domain identification

□ Basic Local Alignment Search Tool
■ break the query into words (e.g., MEAAVKEEISVEDEAVDKNI -> MEA, EAA, AAV, AVK, VKE, KEE, ...)
■ break the database sequences into words
■ compare the word lists by hashing (allow near matches)

□ Basic Local Alignment Search Tool
■ aligned residues of the query and database sequences are scored with the BLOSUM scoring matrix
[Figure: BLOSUM62 amino acid substitution matrix]
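The word-splitting and hashing steps can be sketched as follows. This is a minimal illustration, assuming a word size of 3 and exact word matches only (real BLAST also accepts near matches scored with a substitution matrix); the function names and the database sequence are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_into_words(seq: str, k: int = 3) -> List[Tuple[int, str]]:
    """Return all overlapping words of length k with their start positions."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def index_words(seq: str, k: int = 3) -> Dict[str, List[int]]:
    """Hash table: word -> list of start positions in seq."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos, word in split_into_words(seq, k):
        index[word].append(pos)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> List[Tuple[int, int, str]]:
    """Exact word hits (query_pos, subject_pos, word) found via the hash table."""
    subject_index = index_words(subject, k)
    return [(q_pos, s_pos, word)
            for q_pos, word in split_into_words(query, k)
            for s_pos in subject_index.get(word, [])]


query = "MEAAVKEEISVEDEAVDKNI"   # query sequence from the slide
subject = "GSAVKEEITVEDEAL"      # hypothetical database sequence

words = [word for _, word in split_into_words(query)]
seeds = find_seeds(query, subject)
print(words[:6])
print(seeds)
```

Each seed marks a position where the query and the database sequence share a word; in BLAST such seeds are the starting points for extending the local alignment.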
Example of scoring an aligned segment with the BLOSUM matrix (an exact word match is found and extended into a high-scoring segment pair, HSP):

Query sequence:      R   P   P   Q   G   L   F
Database sequence:   D   P   P   E   G   V   V
Score:              -2   7   7   2   6   1  -1

■ HSP: optimal accumulated score = 7 + 7 + 2 + 6 + 1 = 23

Sequence-based searches
□ BLAST input
[Screenshot: NCBI blastp suite input form: query sequence (accession number, gi, or FASTA sequence), optional query subrange, job title, and search set (database, e.g., non-redundant protein sequences (nr); organisms to include or exclude)]

Sequence-based searches
□ BLAST results
[Screenshot: NCBI BLAST results page listing the sequences producing significant alignments, with the Score and E-value columns highlighted]
BLAST hits (sequences producing significant alignments):

Description | Max Score | Total Score | Query Cover | E value | Per. Ident | Accession
achaete-scute homolog 2 [Homo sapiens] | 373 | 373 | 100% | 2e-130 | 100.00% | NP_005161.1
achaete-scute homolog 2 [Pongo abelii] | 368 | 368 | 100% | 3e-128 | 98.96% | XP_002821424.1
achaete-scute homolog 2 [Nomascus leucogenys] | 361 | 361 | 100% | 2e-125 | 97.41% | XP_003282133.1
achaete-scute homolog 2 [Macaca nemestrina] | 356 | 356 | 100% | 1e-123 | 96.37% | XP_011719606.1
achaete-scute homolog 2 [Piliocolobus tephrosceles] | 356 | 356 | 100% | 1e-123 | 96.37% | XP_023039276.1
achaete-scute homolog 2 [Papio anubis] | 297 | 297 | 100% | 3e-100 | 95.85% | XP_003909431.1
PREDICTED: achaete-scute homolog 2 [Chlorocebus sabaeus] | 297 | 297 | 100% | 3e-100 | 95.34% | XP_008003331.1
PREDICTED: achaete-scute homolog 2 [Rhinopithecus bieti] | 294 | 294 | 100% | 3e-99 | 95.34% | XP_017741776.1
PREDICTED: achaete-scute homolog 2 [Cebus capucinus imitator] | 271 | 271 | 92% | 4e-90 | 96.07% | XP_017363199.1
PREDICTED: achaete-scute homolog 2 [Callithrix jacchus] | 269 | 269 | 100% | 3e-89 | 94.82% | XP_009006952.1
achaete-scute homolog 2 [Sus scrofa] | 265 | 265 | 100% | 1e-87 | 84.97% | NP_001116463.1
PREDICTED: achaete-scute homolog 2 [Capra hircus] | 261 | 261 | 92% | 5e-86 | 85.39% | XP_017899088.1
□ BLAST Score
■ raw score normalized on the basis of the scoring method
■ sum of substitution scores and gap penalties
■ higher is better, but the score alone does not adequately represent the significance of the alignment
□ BLAST E-value
■ number of BLAST alignments with a given or better Score that are expected to be seen simply by chance (i.e., with random sequences)
■ indicator of alignment significance (adjusted to the database size)
■ results associated with the lowest E-values are the best
■ hits with an E-value > 0.01 belong to the "grey zone" - do not trust them

□ BLAST alignment
■ identity and similarity level between the query and the aligned sequence
■ alignment length and coverage of the query sequence - the alignment is local, therefore one should always check that the alignment covers a significant portion of the query sequence (e.g., the alignment may involve only a few amino
acids from the query sequence -> not a significant hit)

□ PSI-BLAST results
■ alignment
>gb|AAT70109.1| CurN [Lyngbya majuscula] Length=341
Score = 303 bits (777), Expect = 5e-81, Method: Composition-based stats.
Identities = 148/297 (49%), Positives = 188/297 (63%), Gaps = 8/297 (2%)
[Alignment: Query residues 2-292 aligned to Sbjct residues 41-335]

□ text-based search
■ good for finding evolutionarily "unrelated" proteins with some specific function
■ a large number of false negative results (missed proteins with the target function) and false positive results (identified proteins with a different function) due to erroneous or inaccurate annotations
□ sequence-based search
■ good for finding members of a protein family (i.e., a group of evolutionarily related proteins sharing some specific function) -> not suitable for finding "unrelated" proteins
■ potential false positive results (i.e., proteins belonging to other evolutionarily related families)
■ searches using protein sequence queries are generally more sensitive than using
nucleotide sequence queries (20 different amino acids vs. 4 different nucleotides)
□ combination of text-based and sequence-based approaches
1. text-based search
2. subdivision of the identified sequences into evolutionarily related groups
3. selection of a few representatives for each group
4. sequence-based searches using each representative as a query
■ potential false positive results - should be filtered

How to recognize interesting sequences?
■ sequence clustering
■ sequence comparison
■ information about host organisms
■ automated in silico enzyme identification
■ reconstruction of ancestral proteins

□ Clustering based on pairwise sequence similarities
■ can be used for a fast and rough classification of sequences in large datasets (thousands of sequences)
-> effective way to filter the results of database searches
-> identification of members of individual protein families
□ Tools:
■ CLANS - visualization of pairwise sequence similarities in three-dimensional space -> overview of the sequence space (https://toolkit.tuebingen.mpg.de/tools/clans)
■ CD-HIT - clustering and comparison of protein or nucleotide sequences (https://sites.google.com/view/cd-hit/)
[Figures: clustering maps of database hits with the known target proteins marked; one cluster corresponds to haloalkane dehalogenases, in which previously known members and new family members are distinguished]

□ multiple sequence alignment
■ analysis of conserved residues within
a protein family -> identification of protein family members
[Figure: multiple sequence alignment of eight sequences; sequences 1-7 share the conserved positions (e.g., Y, PFP, FP, LVP), whereas sequence 8 deviates at them (Y, KFQ, V, NFD)]

Sequence comparison
□ multiple sequence alignment
■ identification of sequences with unique features -> proteins with potentially novel characteristics or problematic for production
[Figure: multiple sequence alignment with sequence labels Shesp, Sheama, Pelpro, Desace, Xanaxo, Xylfas, Chlaur, Despsy, Rhobal, Burcen, Myctub, Nocfar, Jansp, uncbac, Erylit, Polsp, Mycavi, Maraqu, Caucre, Pseatl, Psycry, Shefri; one sequence carries a long insertion where the others have gaps]
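As a rough illustration of analysing conserved residues within a protein family, the sketch below computes the most common residue and its frequency for each alignment column and reports where each sequence deviates from the consensus at highly conserved columns. The toy alignment, the 0.8 threshold, and the function names are assumptions made for this example, not data or tools from the lecture.

```python
from collections import Counter
from typing import Dict, List, Tuple


def column_consensus(alignment: List[str]) -> List[Tuple[str, float]]:
    """For each alignment column return (most common residue, its frequency)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


def deviations(alignment: Dict[str, str], threshold: float = 0.8) -> Dict[str, List[int]]:
    """Map each sequence name to the conserved columns (0-based) where it
    differs from the consensus residue."""
    consensus = column_consensus(list(alignment.values()))
    conserved = [i for i, (res, freq) in enumerate(consensus)
                 if freq >= threshold and res != "-"]
    return {name: [i for i in conserved if seq[i] != consensus[i][0]]
            for name, seq in alignment.items()}


# Hypothetical toy alignment (already aligned, equal lengths)
toy_alignment = {
    "seq1": "YDAPFPLVP",
    "seq2": "YDAPFPLVP",
    "seq3": "YEAPFPLMP",
    "seq4": "YDAPFPLVP",
    "seq5": "YVSKFQNFD",
}

result = deviations(toy_alignment)
for name, columns in result.items():
    print(name, columns)
```

A sequence that deviates at many conserved columns (here the hypothetical `seq5`) is either not a member of the family or a candidate with potentially unusual properties, so it deserves a closer look before being selected or discarded.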
Sequence comparison
□ multiple sequence alignment
□ essential residues
■ analysis of conserved residues important for function (catalytic, binding, coordinating residues, etc.)
■ in UniProt/Swiss-Prot: Function -> Features -> list of active site or binding site residues
□ phylogenetics
■ establishment of evolutionary relationships among sequences
■ classification of sequences [Figure: phylogenetic tree with the HLD-II subfamily marked]
■ information about experimental data -> selection of novel proteins

Information about host organisms
□ extremophiles - microorganisms living in extreme conditions
■ geochemical extremes (pH, salinity)
■ physical extremes (temperature, pressure)
□ proteins from extremophiles
■ often adapted to extreme conditions -> unique characteristics, useful for practical applications

□ Genomes OnLine Database (GOLD)
■ http://www.genomesonline.org/
■ list of complete (> 36,000) and ongoing (> 115,000) projects (Feb 2024)
■ information about individual projects and source organisms
□ Entrez Genome
■ http://www.ncbi.nlm.nih.gov/sites/genome
■ provided by NCBI
■ ~ 700,000 genomes, information organized by organism
■ information about genome, source organism, genes, encoded proteins,
graphical representations, ...

Information about host organisms
□ GOLD
[Screenshot: GOLD project statistics (metagenomes: 370 studies, 2642 samples; isolate genomes: 4169 complete, 17714 incomplete, and 1500 targeted projects) and organism metadata, e.g., oxygen requirement: aerobe; cell shape: rod-shaped; motility: nonmotile; temperature range: psychrophile; salinity: halotolerant; biotic relationships: free living]

Information about host organisms
□ Entrez Genome
■ example: Psychrobacter cryohalolentis, a psychrotolerant organism
■ Psychrobacter spp. are cold-adapted organisms that are often isolated from extreme environments such as permafrost or Antarctic ice
■ Psychrobacter cryohalolentis K5 was isolated from saline liquid (12-14 %) found 11-24 m below the surface within a forty-thousand-year-old Siberian permafrost at the Kolyma-Indigirka lowland; the strain will provide insight into growth at extremely low temperatures
[Screenshot, continued: genome table (chromosome CP000323.1: 3.06 Mb, 42.3 % GC; plasmid CP000324.1: 0.041 Mb, 33.3 % GC) and biological properties (shape: bacilli; motility: no; salinity: moderate halophilic; temperature range: psychrophilic; habitat: multiple; human pathogen: no)]

Automated in silico enzyme identification
□ EnzymeMiner
■ https://loschmidt.chemi.muni.cz/enzymeminer/
■ search for novel enzymes with a particular activity
■ the only inputs are a FASTA sequence and the essential residues
■ filtering of sequences using catalytic residues, MSA, clustering
■ annotation of sequences based on bioinformatics predictions and on information available in sequence and genome databases
■ 2D space of the sequence similarity network

Automated in silico enzyme identification
□ EnzymeMiner
[Diagram: query sequences (e.g., LinB in FASTA format) + essential residues -> EnzymeMiner -> annotated putative enzymes]

Essential residues (allowed residues and their positions in the query sequences):

Accession | Halide 1 | Nucleophile | Halide 2 | Proton donor | Proton acceptor
allowed residues | N, W | D | W | E, D | H
D4Z2G1 | 38 | 108 | 109 | 132 | 272
P22643 | 125 | 124 | 175 | 260 | 289
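Filtering by essential residues can be pictured with the sketch below. It is an illustrative assumption about how such a check might look, not EnzymeMiner's actual implementation, and `has_essential_residues` is a hypothetical helper. It uses the N-terminal LinB fragment from the FASTA example earlier in this lecture together with the D4Z2G1 positions from the essential-residue table; the proton acceptor at position 272 lies beyond this fragment and is therefore left out.

```python
from typing import Dict, Set

# N-terminal fragment of LinB as given in the FASTA example of this lecture
LINB_FRAGMENT = (
    "MSLGAKPFGEKKFIEIKGRRMAYIDEGTGDPILFQHGNPTSSYLWRNIMPHCAGLGRLIACDLI"
    "GMGDSDKLDPSGPERYAYAEHRDYLDALWEALDLGDRVVLVVHDWGSALGFDWARRHRERVQGI"
    "AYMEAIAMPIEWADFPEQDRDLFQAFRSQAGEELVLQD"
)

# 1-based position in the query -> set of allowed residues
ESSENTIAL: Dict[int, Set[str]] = {
    38: {"N", "W"},   # halide 1
    108: {"D"},       # nucleophile
    109: {"W"},       # halide 2
    132: {"E", "D"},  # proton donor
}


def has_essential_residues(seq: str, essential: Dict[int, Set[str]]) -> bool:
    """True if every essential position exists in seq and holds an allowed residue."""
    return all(pos <= len(seq) and seq[pos - 1] in allowed
               for pos, allowed in essential.items())


# Hypothetical variant with the nucleophile (D108) replaced by alanine
mutant = LINB_FRAGMENT[:107] + "A" + LINB_FRAGMENT[108:]

print(has_essential_residues(LINB_FRAGMENT, ESSENTIAL))
print(has_essential_residues(mutant, ESSENTIAL))
```

For database hits, the positions would first have to be mapped from the query to each hit through a sequence alignment; this sketch checks fixed positions only to keep the idea visible.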
[Screenshot: EnzymeMiner job submission page ("Automated mining of soluble enzymes with diverse structures, catalytic properties and stabilities"): queries are chosen from Swiss-Prot sequences (e.g., EC 1.1.1.1 - alcohol dehydrogenase) or from custom sequences; up to 40 sequences can be selected from a table or from a similarity network]

Automated in silico enzyme identification
□ EnzymeMiner: target selection table
[Screenshot: target selection table with filters (solubility threshold, primary domains such as PF00561 (Abhydrolase_1), identity to queries), annotation tabs (organism, temperature, salinity, biotic relationship, disease, transmembrane, 3D structure), and per-hit columns including accession, annotation, closest query, and identity to the closest query; the listed hits are annotated as haloalkane dehalogenases with D4Z2G1 as the closest query, with identities from 74.1 % downward]
□ EnzymeMiner: sequence similarity network
[Screenshot: sequence similarity network at 50 % identity (94 nodes, 1466 edges), downloadable as a Cytoscape session]

EnzymeMiner use case
[Diagram: global workflow with the number of hits and a timeline in days]
■ computational steps: sequence searches -> clustering -> multiple sequence alignment -> homology modeling -> calculation of cavities and tunnels, database annotations, prediction of sequence characteristics -> molecular docking -> prioritization
■ experimental steps: selection of genes -> gene synthesis -> subcloning -> gene expression -> expression analysis -> robotic activity screening -> purification
■ characterization: secondary structure, quaternary structure, thermostability, pH and temperature profile, substrate specificity, steady-state kinetics, enantioselectivity, degradation activity
■ number of hits along the workflow: 5661 sequences -> 992 sequences -> 658 sequences -> 530 sequences -> 20 sequences -> 12 proteins -> 9 proteins -> 8 proteins

EnzymeMiner use case
■ putative dehalogenases: 2,905 sequences
■ gene synthesis: 67 sequences
■ characterization: 35 proteins

EnzymeMiner use case
■ the most thermostable enzyme: Tm = 71 °C
■ the most catalytically efficient enzyme (100x)
■ enzymes active at near-to-zero temperature
[Figure: characterized properties: activity, substrate specificity, enantioselectivity, regioselectivity, catalytic efficiency, pH optimum, pH stability, temperature optimum, temperature stability]
[Figure labels: Adv. Bioinformatics and Enzymology 3+ years; Enzyme Engineering 25+ years; Enzyme Discovery 30+ years]

Reconstruction of ancestral proteins
Reconstruction of ancestral proteins
[Figure: phylogenetic tree of the sequence family; leaf labels include WP_013040256-LinB and WP_012893836]
[Figure: The Tree of Life (L'ARBRE DE LA VIE, EL ÁRBOL DE LA VIDA, LEBENSBAUM)]

Reconstruction of ancestral proteins
□ I. Collect sequences and align them
□ II. Create a tree
□ III. Reconstruct ancestors
[Figure: example aligned sequence fragments and the reconstructed ancestral fragments]

Reconstruction of ancestral proteins
□ FireProtASR
■ https://loschmidt.chemi.muni.cz/fireprotasr/
■ web server for automated ancestral sequence reconstruction
■ ancestral reconstruction, successor prediction, latent space analysis
■ design of stable proteins with good yields and broad specificity
[Screenshot: FireProtASR v1.0 job submission page (fully automated ancestral sequence reconstruction). The starting-point sequence is entered directly or uploaded as a file and then validated; job title and e-mail are optional.]
Musil M, Khan R, Stourac J, Bednar D, Damborsky J. 2020: FireProt-ASR: Web Server for Fully Automated Ancestral Sequence Reconstruction (submitted)

Reconstruction of ancestral proteins
□ FireProtASR
[Screenshot: FireProtASR results page with the phylogenetic tree, the multiple-sequence alignment, the list of mutations, and per-position residue probabilities for a selected ancestral node (node 227, position 68)]

FireProtASR use case
□ 57 wild types and 56 ancestors tested
□ ancestors:
■ average Tm +9 °C
■ 60 % have higher Tm than the best WT
■ enzymes with top activity
■ slightly better yields
[Figure: thermostability distribution, Tm (°C)]

What to keep in mind?
□ sequence databases
■ nucleotide: GenBank, EMBL-BANK, DDBJ; protein: UniProtKB, nr protein database
■ errors in sequences and annotations
□ database searches
■ text-based: results influenced by sequence annotations
■ sequence-based: identification of family members (BLAST, PSI-BLAST, E-value)
■ false positive results: sequences should be filtered
□ selection of proteins for experimental characterization
■ clustering: classification and filtering of hits from database searches
■ sequence comparison: classification and identification of unique sequences
■ sequences from extremophiles: potentially adapted to extreme conditions
■ EnzymeMiner: automated identification of interesting catalysts
■ FireProtASR: design of stable, soluble, and broad-specificity proteins

What to keep in mind?
□ in silico identification and analysis of sequences: a fast and cheap way to identify new proteins

□ Xiong, J. (2006). Essential Bioinformatics. Cambridge University Press, New York, p. 352.
□ Claverie, J.-M. and Notredame, C. (2006). Bioinformatics For Dummies (2nd ed.). Wiley Publishing, Hoboken, p. 436.
□ Steele, H.L. et al. (2009). Advances in Recovery of Novel Biocatalysts from Metagenomes. Journal of Molecular Microbiology and Biotechnology 16: 25-37.
□ NCBI Resource Coordinators (2013). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 41: D8-D20.
□ Magrane, M. and UniProt Consortium (2011). UniProt Knowledgebase: a hub of integrated protein data. Database 2011: bar009.
□ Frickey, T. and Lupas, A. (2004). CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20: 3702-3704.
□ Pagani, I. et al. (2012). The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Research 40: D571-D579.
□ Van den Burg, B. (2003). Extremophiles as a source for novel enzymes. Current Opinion in Microbiology 6: 213-218.

LOSCHMIDT LABORATORIES
Protein production
Protein Engineering Lecture #3
Michal Vašina