PROTEIN ENGINEERING
2. IN SILICO IDENTIFICATION OF PROTEINS
Loschmidt Laboratories, Department of Experimental Biology, Masaryk University, Brno

□ Why search for new proteins?
□ How to acquire new proteins?
■ traditional approach
■ metagenomic approach
■ bioinformatic approach
□ Bioinformatic approach
■ Where to find target sequences?
■ How to find target sequences?
■ How to recognize interesting sequences?
□ What to keep in mind?

Why search for new proteins?
□ better understanding of structure-function relationships
■ required for rational design
□ novel properties
■ stability
■ temperature profile
■ activity
■ specificity
■ enantioselectivity (e.g., ethyl (S)-2-bromopropionate vs. ethyl (R)-2-bromopropionate)
□ better starting points for protein engineering
-> proteins with desired properties -> practical applications
[Figure: stability plot, y-axis in %, x-axis time [h] (0-600 h)]

How to acquire new proteins?
□ traditional approach
■ microorganisms possessing the target activity are enriched from the environment and isolated in pure culture
■ proteins or the corresponding genes are recovered from the organisms by protein purification, DNA library screening, PCR with specific primers, ...
■ (-) the majority of microorganisms (> 99 %) cannot be cultivated using standard techniques -> a large fraction of the microbial diversity in an environment is lost
□ metagenomic approach
■ isolation and cloning of DNA extracted directly from an environmental sample (without culturing the organisms present)
■ genes are recovered by DNA library screening, PCR with specific primers, ...
■ (+) enables exploration of the biodiversity of uncultured microorganisms
□ bioinformatic approach
■ sequence data from genomic and metagenomic sequencing projects are stored in sequence databases
■ in silico searching of sequence databases
-> (+) fast and cheap way to identify novel proteins
-> (-) one cannot find what is not in the database (but there is a lot of data, usually more than one needs)
■ genes are recovered by gene synthesis or obtained from sequencing consortia upon request
[Figure: workflow of the bioinformatic approach: (meta)genomic sequencing projects -> sequence database -> in silico screening (Entrez search across databases for "linb", returning LinB protein records, e.g., from Sphingomonas paucimobilis) -> gene synthesis, DNA request]

Where to find target sequences?
■ databases of nucleotide sequences
■ databases of protein sequences

Databases of nucleotide sequences
□ GenBank
■ http://www.ncbi.nlm.nih.gov/genbank/
■ provided by NCBI (National Center for Biotechnology Information)
□ EMBL-Bank
■ http://www.ebi.ac.uk/embl/
■ provided by EBI (European Bioinformatics Institute)
□ DDBJ (DNA Data Bank of Japan)
■ http://www.ddbj.nig.ac.jp/
■ provided by the National Institute of Genetics, Japan
□ GenBank, EMBL-Bank, DDBJ
■ annotated collections of all publicly available nucleotide sequences
■ freely available to the wide community
■ contain data obtained from genomic centers or research institutions
■ daily synchronization of new or updated data
■ (+) contain about 250,000,000 sequences
■ (-) mostly automatic annotations - lower quality, errors

Databases of protein sequences
□ UniProtKB
■ http://www.uniprot.org/
■ provided by EBI, Swiss Institute of Bioinformatics and Protein Information Resource
□ nr Protein database
■ http://www.ncbi.nlm.nih.gov/protein/
■ provided by NCBI
□ UniProtKB, nr Protein database
■ annotated collections of publicly available protein sequences
■ freely available to the wide community
■ contain data obtained by conceptual translation of coding sequences from EMBL-Bank/GenBank/DDBJ or provided by research institutions
■ (+) contain more than 100,000,000 sequences
■ (-) mostly automatic annotations - lower quality, errors
□ UniProtKB
■ rich annotations (e.g., information about the function of the protein and of individual amino acids, experimental data, biological ontologies, classifications, ...)
■ clear indication of annotation quality (manual vs. automatic)
[Figure: UniProtKB protein knowledgebase, split into UniProtKB/Swiss-Prot (reviewed, manual annotation) and UniProtKB/TrEMBL (unreviewed, automatic annotation)]
□ UniProtKB/Swiss-Prot
■ high-quality annotations, i.e., manually annotated entries or expert-reviewed automatic annotations
■ (+) source of reliable information
■ (-) contains "only" ~ 600,000 sequences
□ UniProtKB/TrEMBL
■ (-) automatic annotations - lower quality, errors
■ (+) contains ~ 180,000,000 sequences
[Figure: number of sequences vs. number of characterized proteins]

Pitfalls of sequence databases
□ large number of errors
■ errors in sequences (wrong base, frameshift errors)
■ wrong positions of genes
■ exon-intron boundary errors
■ errors and inaccuracies in annotations

How to find target sequences?
■ text-based searches
■ sequence-based searches

Text-based searches
□ database retrieval systems
■ enable quick and easy search of many databases at the same time
■ specification of queries using logical operators (AND, OR, NOT, ...)
■ Entrez (NCBI), SRS (EBI)
□ (-) results depend on sequence annotations
■ erroneous, inaccurate or too general annotations
■ synonyms
■ misspellings
[Figure: Entrez search across databases with the query "mouse[ORGN] AND kinase AND (exons OR introns)" and the number of results returned by each database (PubMed, Nucleotide, Protein, ...)]
Sequence-based searches
□ searches based on sequence similarity
■ (+) results not influenced by sequence annotations
□ rely on the assumption that proteins with the same function have similar sequences
■ (-) not always true - close homologs vs. distant homologs vs. analogs
[Figure: multiple sequence alignment of seven homologous sequences sharing conserved motifs]
□ BLAST
■ based on local pairwise alignment
□ PSI-BLAST
■ "iterative BLAST" making use of multiple sequence alignment
■ very sensitive search strategy for detecting weak but biologically significant similarities between sequences
□ ...

Basic Local Alignment Search Tool
□ query sequence: MEAAVKEEISVEDEAVDKNI
■ break the query into words: MEA, EAA, AAV, AVK, VKE, ..., EIS, ISV, ...
■ break the database sequences into words
■ compare the word lists by hashing (allow near matches)

Basic Local Alignment Search Tool
□ BLOSUM scoring matrix (the values shown correspond to BLOSUM62)
Ala   4
Arg  -1   5
Asn  -2   0   6
Asp  -2  -2   1   6
Cys   0  -3  -3  -3   9
Gln  -1   1   0   0  -3   5
Glu  -1   0   0   2  -4   2   5
Gly   0  -2   0  -1  -3  -2  -2   6
His  -2   0   1  -1  -3   0   0  -2   8
Ile  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
Leu  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
Lys  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
Met  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
Phe  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
Pro  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
Ser   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
Thr   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
Trp  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Tyr  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
Val   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
□ an exact match is scanned and extended into a high-scoring segment pair (HSP)
Query sequence:     R   P   P   Q   G   L   F
Database sequence:  D   P   P   E   G   V   V
Score:             -2   7   7   2   6   1  -1
HSP: optimal accumulated score = 7 + 7 + 2 + 6 + 1 = 23
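The HSP example on the slide can be reproduced with a few lines of Python. This is only a minimal sketch under stated assumptions: the dictionary holds just the six BLOSUM62 pairs needed for this example, and the best segment is found with a simple maximum-subarray scan over an ungapped alignment. Real BLAST instead extends a seed word in both directions and stops when the score drops too far below the best value seen, so the function names and the scan below are illustrative, not BLAST's implementation.

```python
# Subset of BLOSUM62 covering only the residue pairs of the slide's example.
BLOSUM62_SUBSET = {
    ("R", "D"): -2, ("P", "P"): 7, ("Q", "E"): 2,
    ("G", "G"): 6, ("L", "V"): 1, ("F", "V"): -1,
}


def pair_score(a, b):
    """Look up a substitution score, treating the matrix as symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def best_ungapped_segment(query, subject):
    """Return (score, start, end) of the best contiguous segment.

    Both sequences must have the same length and are compared position by
    position without gaps (maximum-subarray scan, illustrative only).
    """
    scores = [pair_score(q, s) for q, s in zip(query, subject)]
    best = (scores[0], 0, 1)
    running, start = 0, 0
    for i, sc in enumerate(scores):
        if running <= 0:
            running, start = sc, i   # start a new segment here
        else:
            running += sc            # extend the current segment
        if running > best[0]:
            best = (running, start, i + 1)
    return best


score, start, end = best_ungapped_segment("RPPQGLF", "DPPEGVV")
print(score, "RPPQGLF"[start:end], "DPPEGVV"[start:end])  # 23 PPQGL PPEGV
```

The flanking pairs R/D (-2) and F/V (-1) lower the accumulated score, so they are left out of the segment, which is why the slide sums only the five inner positions.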
Sequence-based searches
□ PSI-BLAST input
[Figure: NCBI blastp suite input form: query sequence (accession number, gi, or FASTA sequence), optional query subrange or uploaded file, job title, database (non-redundant protein sequences, nr), optional restriction to a selected organism]

Sequence-based searches
□ PSI-BLAST results: score and E-value of hits
Sequences producing significant alignments
Description                                                     Max score  Total score  Query cover  E value  Per. ident  Accession
achaete-scute homolog 2 [Homo sapiens]                                373          373         100%   2e-130     100.00%  NP_005161.1
achaete-scute homolog 2 [Pongo abelii]                                368          368         100%   3e-128      98.96%  XP_002821424.1
achaete-scute homolog 2 [Nomascus leucogenys]                         361          361         100%   2e-125      97.41%  XP_003282133.1
achaete-scute homolog 2 [Macaca nemestrina]                           356          356         100%   1e-123      96.37%  XP_011719606.1
achaete-scute homolog 2 [Piliocolobus tephrosceles]                   356          356         100%   1e-123      96.37%  XP_023039276.1
achaete-scute homolog 2 [Papio anubis]                                297          297         100%   3e-100      95.85%  XP_003909431.1
PREDICTED: achaete-scute homolog 2 [Chlorocebus sabaeus]              297          297         100%   3e-100      95.34%  XP_008003331.1
PREDICTED: achaete-scute homolog 2 [Rhinopithecus bieti]              294          294         100%    3e-99      95.34%  XP_017741776.1
PREDICTED: achaete-scute homolog 2 [Cebus capucinus imitator]         271          271          92%    4e-90      96.07%  XP_017363199.1
PREDICTED: achaete-scute homolog 2 [Callithrix jacchus]               269          269         100%    3e-89      94.82%  XP_009006952.1
achaete-scute homolog 2 [Sus scrofa]                                  265          265         100%    1e-87      84.97%  NP_001116463.1
PREDICTED: achaete-scute homolog 2 [Capra hircus]                     261          261          92%    5e-86      85.39%  XP_017899088.1

Sequence-based searches
□ PSI-BLAST results: alignment
>gb|AAT70109.1| CurN [Lyngbya majuscula] Length=341
Score = 303 bits (777), Expect = 8e-81, Method: Composition-based stats.
Identities = 148/297 (49%), Positives = 188/297 (63%), Gaps = 8/297 (2%)
[Figure: pairwise alignment of the query (residues 2-292) with the database sequence (residues 41-335)]

□ BLAST Score
■ normalized raw score
■ raw score = sum of substitution scores and gap penalties
■ higher is better, but the score alone does not adequately represent the significance of an alignment
□ BLAST E-value
■ the number of BLAST alignments with a given Score that are expected to be seen simply by chance
■ indicator of alignment significance
■ results associated with the lowest E-values are the best
■ hits with an E-value > 0.01 belong to the "grey zone" - do not trust them
□ BLAST alignment
■ identity and similarity level between the query and the aligned sequence
■ alignment length and coverage of the query sequence - the alignment is local, therefore one should always check that it covers a significant portion of the query sequence (e.g., an alignment involving only a few amino acids of the query sequence is not a significant hit)

Optimal search strategy
□ text-based search
■ good for finding evolutionarily "unrelated" proteins with some specific function
■ a large number of false negative (missed proteins with the target function) and false positive (identified proteins with a different function) results due to erroneous or inaccurate annotations
□ sequence-based search
■ good for finding members of a protein family (i.e., a group of evolutionarily related proteins sharing some specific function) -> not suitable for finding "unrelated" proteins
■ potential false positive results (i.e., proteins belonging to other evolutionarily related families)
■ searches using protein sequence queries are generally more sensitive than searches using nucleotide sequence queries (20 different amino acids vs. 4 different nucleotides)
□ combination of text-based and sequence-based approaches
1. text-based search
2. subdivision of the identified sequences into evolutionarily related groups
3. selection of a few representatives for each group
4. sequence-based searches using each representative as a query
■ potential false positive results - should be filtered

How to recognize interesting sequences?
■ sequence clustering
■ sequence comparison
■ information about host organisms
■ automated in silico enzyme identification
■ ancestral sequence reconstruction

Sequence clustering
□ clustering based on pairwise sequence similarities
■ can be used for a fast and rough classification of sequences in large datasets (thousands of sequences) -> effective way to filter results of database searches -> identification of members of individual protein families
■ CLANS - visualization of pairwise sequence similarities in three-dimensional space -> overview of sequence space
[Figure: clustering of sequences based on pairwise similarities, with the known target proteins highlighted]
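Filtering hits and grouping them by pairwise similarity, as described above, can be sketched in Python. This is a minimal illustration under explicit assumptions: the hit records, accessions and identity values are invented, the 0.01 E-value cutoff is the "grey zone" limit from the slides, and the coverage cutoff of 0.5 is an arbitrary placeholder. The grouping uses plain single-linkage clustering with a union-find structure; it is a stand-in for the idea of similarity-based clustering and not the force-directed layout used by CLANS.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str      # hypothetical identifier
    evalue: float       # BLAST E-value of the hit
    query_cover: float  # fraction of the query covered by the alignment


def filter_hits(hits, max_evalue=0.01, min_cover=0.5):
    """Keep hits outside the grey zone that also cover enough of the query."""
    return [h for h in hits if h.evalue <= max_evalue and h.query_cover >= min_cover]


def single_linkage(names, identity, threshold):
    """Put two sequences in one cluster if a chain of pairs >= threshold links them."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (a, b), ident in identity.items():
        if a in parent and b in parent and ident >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


hits = [
    Hit("seqA", 1e-80, 0.98),
    Hit("seqB", 3e-75, 0.95),
    Hit("seqC", 2e-12, 0.90),
    Hit("seqD", 0.5, 0.97),    # E-value in the grey zone
    Hit("seqE", 1e-20, 0.10),  # covers only a small part of the query
]
# Pairwise sequence identities (fractions) between the remaining hits.
identity = {("seqA", "seqB"): 0.72, ("seqA", "seqC"): 0.21, ("seqB", "seqC"): 0.24}

kept = [h.accession for h in filter_hits(hits)]
print(kept)                                            # ['seqA', 'seqB', 'seqC']
print(single_linkage(kept, identity, threshold=0.5))   # [['seqA', 'seqB'], ['seqC']]
```

Lowering the threshold merges clusters (at 0.2 all three sequences fall into one group), which is why the cutoff should be chosen with the expected similarity within the target family in mind.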
Sequence clustering
□ clustering based on pairwise sequence similarities
[Figure: sequence clusters of the target protein family and of related families, each shown with its reaction scheme: C-C hydrolases, haloalkane dehalogenases (R-X + H2O -> R-OH + HX), perhydrolases, epoxide hydrolases; previously known members and new family members are marked with different symbols]

Sequence comparison
□ multiple sequence alignment
■ analysis of conserved residues within a protein family -> identification of protein family members
[Figure: multiple sequence alignment of eight sequences; sequences 1-7 share conserved motifs (e.g., PFP, FP, LVP), sequence 8 does not]
□ multiple sequence alignment
■ identification of sequences with unique features -> proteins with potentially novel characteristics
[Figure: multiple sequence alignment of homologous sequences from various organisms; one sequence carries a long insertion that is absent from the others]
□ phylogenetics
■ establishment of evolutionary relationships among sequences
■ classification of sequences
■ information about experimental data -> selection of novel proteins
[Figure: phylogenetic tree of haloalkane dehalogenases with the subfamilies HLD-I, HLD-II and HLD-III; leaves are labelled with enzyme names (e.g., DhlA, DhmA, LinB, DmbA, DbjA, DatA)]

Information about host organisms
□ extremophiles - microorganisms living in extreme conditions
■ geochemical extremes (pH, salinity)
■ physical extremes (temperature, pressure)
□ proteins from extremophiles
■ often adapted to extreme conditions -> unique characteristics, useful for practical applications
□ Genomes OnLine Database (GOLD)
■ http://www.genomesonline.org/
■ list of complete (> 6,000), ongoing (> 27,000) and targeted (> 1,000) genome projects
■ information about individual projects and source organisms
□ Entrez Genome
■ http://www.ncbi.nlm.nih.gov/sites/genome
■ provided by NCBI
■ data from more than 20,000 finished or ongoing genome projects (includes almost 10,000 organisms)
■ information about genome, source organism, genes, encoded proteins, graphical representations, ...
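Once information about source organisms has been collected, picking out hosts from extreme environments is a simple filter. The sketch below assumes the metadata have already been exported into plain records; the field names, the category labels and the decision to count "Halotolerant" as interesting are illustrative assumptions, not the actual export format or vocabulary of GOLD or Entrez Genome.

```python
from dataclasses import dataclass


@dataclass
class Organism:
    name: str
    temperature_range: str  # e.g. "Mesophile", "Psychrophile", "Thermophile"
    salinity: str           # e.g. "Non-halophilic", "Halotolerant", "Halophile"


# Labels treated as signs of adaptation to extreme conditions (assumed set).
EXTREME_LABELS = {
    "Psychrophile", "Thermophile", "Hyperthermophile",
    "Halophile", "Halotolerant",
}


def extreme_traits(org):
    """Return the sorted list of extreme-condition labels attached to an organism."""
    return sorted({org.temperature_range, org.salinity} & EXTREME_LABELS)


organisms = [
    # Illustrative records; the second one is entirely hypothetical.
    Organism("Psychrobacter cryohalolentis K5", "Psychrophile", "Halotolerant"),
    Organism("hypothetical mesophilic host", "Mesophile", "Non-halophilic"),
]

candidates = {o.name: extreme_traits(o) for o in organisms if extreme_traits(o)}
print(candidates)
# {'Psychrobacter cryohalolentis K5': ['Halotolerant', 'Psychrophile']}
```

Sequences whose host appears in `candidates` could then be prioritized for experimental characterization, for instance when cold-active or salt-tolerant enzymes are sought.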
Information about host organisms
□ GOLD
[Figure: GOLD overview (metagenomes: 370 studies, 2642 samples; isolate genomes: 4169 complete, 17714 incomplete and 1600 targeted projects) and an organism metadata record, e.g., oxygen requirement: aerobe; cell shape: rod-shaped; motility: non-motile; temperature range: psychrophile; salinity: halotolerant; biotic relationships: free living]
□ Entrez Genome
[Figure: Entrez Genome record of Psychrobacter cryohalolentis, a psychrotolerant organism. Psychrobacter spp. are cold-adapted organisms that are often isolated from extreme environments such as permafrost or the Antarctic ice. Psychrobacter cryohalolentis K5 was isolated from saline liquid (12-14%) found 11-24 m below the surface within a forty-thousand-year-old Siberian permafrost at the Kolyma-Indigirka lowland. This strain will provide insight into growth at extremely low temperatures.]
[Figure, continued: genome table of Psychrobacter cryohalolentis K5 (human pathogen: no): chromosome CP000323.1, 3.06 Mb, 42.3% GC; plasmid CP000324.1, 0.041 Mb]

Automated in silico enzyme identification
[Figure: EnzymeMiner input form with two query sequences in FASTA format (D4Z2G1 and P22643) and the essential residue templates]
Essential residue templates
Accession         Halide 1  Nucleophile  Halide 2  Proton donor  Proton acceptor
allowed residues  N, W      D            W         E, D          H
D4Z2G1            38        108          109       132           272
P22643            125       124          175       260           289

Automated in silico enzyme identification
[Figure: EnzymeMiner target selection table with filters (solubility threshold, primary domain PF00561 (Abhydrolase_1), identity to queries) and the columns accession, annotation, closest query, identity to closest query, kingdom, solubility, sequence length and domain annotation; the listed hits are annotated as haloalkane dehalogenases with roughly 69-74% identity to the query D4Z2G1 and include the crystal structure 2QVB_A]
[Figure: sequence similarity network of the selected hits (identity: 50%, nodes: 94, edges: 1466), downloadable as a Cytoscape session]

Ancestral sequence reconstruction
□ proteins with exceptional properties
■ improved stability, yields, specificity, ...
□ resurrection of the most probable protein sequences from the past
■ selection of homologous sequences
■ multiple sequence alignment
■ construction of a phylogenetic tree
■ reconstruction of ancestral sequences
[Figure: ancestral sequence reconstruction in three steps: I. collect sequences and align them, II. create a tree, III. reconstruct ancestors]

Ancestral sequence reconstruction
□ FireProtASR web server
■ https://loschmidt.chemi.muni.cz/fireprotasr/
[Figure: FireProtASR submission page ("Fully automated ancestral sequence reconstruction"): selection of the starting point (own sequence or uploaded sequence file), optional job title and e-mail. Cited on the page: Musil M, Khan R, Stourac J, Bednar D, Damborsky J (2020). FireProt-ASR: Web Server for Fully Automated Ancestral Sequence Reconstruction (submitted).]
Ancestral sequence reconstruction
□ FireProtASR web server
■ https://loschmidt.chemi.muni.cz/fireprotasr/
[Figure: FireProtASR results page with the phylogenetic tree, the multiple sequence alignment, the substitutions of the selected ancestral sequence and the per-position amino acid probabilities]

What to keep in mind?
□ sequence databases
■ nucleotide: GenBank, EMBL-Bank, DDBJ; protein: UniProtKB, nr Protein database
■ errors in sequences and annotations
□ database searches
■ text-based: results influenced by sequence annotations
■ sequence-based: identification of family members - BLAST, PSI-BLAST - E-value
■ combination of both approaches: optimal strategy, but false positives must be filtered
□ selection of proteins for experimental characterization
■ clustering: classification and filtering of hits from database searches - CLANS
■ sequence comparison: classification and identification of unique sequences
■ sequences from extremophiles: potentially adapted to extreme conditions
■ EnzymeMiner: automated identification of interesting catalysts
■ ancestral sequence reconstruction: ancestral protein sequences with interesting properties
□ in silico identification and analysis of sequences - fast and cheap way to identify new proteins

PROTEIN ENGINEERING
3. PREPARATION OF RECOMBINANT PROTEINS, PROTEIN EXPRESSION AND PURIFICATION
Loschmidt Laboratories, Department of Experimental Biology, Masaryk University, Brno