LOSCHMIDT LABORATORIES
PROTEIN ENGINEERING

2. IN SILICO IDENTIFICATION OF PROTEINS

Loschmidt Laboratories
Department of Experimental Biology
Masaryk University, Brno

□ Why search for new proteins?
□ How to acquire new proteins?
  ■ traditional approach
  ■ metagenomic approach
  ■ bioinformatic approach
□ Bioinformatic approach
  ■ Where to find target sequences?
  ■ How to find target sequences?
  ■ How to recognize interesting sequences?
□ What to keep in mind?

Why search for new proteins?
■ plenty of reasons
[Figure: optimization of a protein for an application, drawing on protein engineering and natural diversity]

Why search for new proteins?
□ better understanding of structure-function relationships
  ■ required for rational design
□ novel properties
  ■ stability [Figure: activity [%] versus time [h]]
  ■ temperature profile
  ■ activity
  ■ specificity
  ■ enantioselectivity [Figure: ethyl (S)-2-bromopropionate versus ethyl (R)-2-bromopropionate]
□ better starting points for protein engineering

Why search for new proteins?
□ better understanding of structure-function relationships
□ novel properties
□ better starting points for protein engineering
-> proteins with desired properties
-> practical applications

How to acquire new proteins?
■ traditional approach
■ metagenomic approach
■ bioinformatic approach

How to acquire new proteins?
□ traditional approach
  ■ microorganisms possessing the target activity are enriched from the environment and isolated in pure culture
  ■ proteins or the corresponding genes are recovered from the organisms by protein purification, DNA library screening, PCR with specific primers, ...
  ■ (-) the majority of microorganisms (> 99 %) cannot be cultivated using standard techniques -> a large fraction of the microbial diversity in an environment is lost

How to acquire new proteins?
□ metagenomic approach
  ■ isolation and cloning of DNA extracted directly from an environmental sample (without culturing the organisms present)
  ■ genes are recovered by DNA library screening, PCR with specific primers, ...
  ■ (+) enables exploration of the biodiversity of uncultured microorganisms

How to acquire new proteins?
□ bioinformatic approach
[Figure: workflow from (meta)genomic sequencing projects to a sequence database, then in silico "screening" (example: an Entrez search for "linb" returning LinB protein records), and finally gene synthesis or a DNA request]
  ■ sequence data from genomic and metagenomic sequencing projects are stored in sequence databases
  ■ in silico searching of sequence databases
    -> (+) fast and cheap way to identify novel proteins
    -> (-) one cannot find what is not in the database (but there is a lot of data, usually more than one needs)
  ■ genes are recovered by gene synthesis or obtained from sequencing consortia upon request

Where to find target sequences?
■ databases of nucleotide sequences
■ databases of protein sequences

Databases of nucleotide sequences
□ GenBank
  ■ http://www.ncbi.nlm.nih.gov/genbank/
  ■ provided by NCBI (National Center for Biotechnology Information)
□ EMBL-Bank
  ■ http://www.ebi.ac.uk/embl/
  ■ provided by EBI (European Bioinformatics Institute)
□ DDBJ (DNA Data Bank of Japan)
  ■ http://www.ddbj.nig.ac.jp/
  ■ provided by the National Institute of Genetics, Japan

Databases of nucleotide sequences
□ GenBank, EMBL-Bank, DDBJ
  ■ annotated collections of all publicly available nucleotide sequences
  ■ freely available to the wider community
  ■ contain data obtained from genomic centers or research institutions
  ■ new and updated data are synchronized daily
  ■ (+) contain about 250,000,000 sequences
  ■ (-) mostly automatic annotations: lower quality, errors

Databases of protein sequences
□ UniProtKB
  ■ http://www.uniprot.org/
  ■ provided by EBI, the Swiss Institute of Bioinformatics and the Protein Information Resource
□ nr Protein database
  ■ http://www.ncbi.nlm.nih.gov/protein/
  ■ provided by NCBI

Databases of protein sequences
□ UniProtKB, nr Protein database
  ■ annotated collections of publicly available protein sequences
  ■ freely available to the wider community
  ■ contain data obtained by conceptual translation of coding sequences from EMBL-Bank/GenBank/DDBJ or provided by research institutions
  ■ (+) contain more than 100,000,000 sequences
  ■ (-) mostly automatic annotations: lower quality, errors

Databases of protein sequences
□ UniProtKB
  ■ rich annotations (e.g., information about the function of the protein and of individual amino acids, experimental data, biological ontologies, classifications, ...)
  ■ clear indication of annotation quality (manual vs. automatic)
[Figure: UniProtKB protein knowledgebase, split into UniProtKB/Swiss-Prot (reviewed, manual annotation) and UniProtKB/TrEMBL (unreviewed, automatic annotation)]

Databases of protein sequences
□ UniProtKB/Swiss-Prot
  ■ high-quality annotations, i.e., manually annotated entries or expert-reviewed automatic annotations
  ■ (+) source of reliable information
  ■ (-) contains "only" ~ 560,000 sequences
□ UniProtKB/TrEMBL
  ■ (-) automatic annotations: lower quality, errors
  ■ (+) contains ~ 180,000,000 sequences
[Figure: number of sequences versus number of characterized proteins]

Pitfalls of sequence databases
□ (-) large number of errors
  ■ errors in sequences (wrong base, frameshift errors)
  ■ wrong positions of genes
  ■ exon-intron boundary errors
  ■ errors and inaccuracies in annotations

How to find target sequences?
■ text-based searches
■ sequence-based searches

Text-based searches
□ database retrieval systems
  ■ enable quick and easy searching of many databases at the same time
  ■ queries are specified using logical operators (AND, OR, NOT, ...)
  ■ Entrez (NCBI), SRS (EBI)
□ (-) results depend on sequence annotations
  ■ erroneous, inaccurate or too general annotations
  ■ synonyms
  ■ misspellings

Text-based searches
□ database retrieval systems
  ■ example query: mouse[ORGN] AND kinase AND (exons OR introns)
[Screenshot: Entrez cross-database search for this query; the result page lists the number of hits per database (PubMed, Nucleotide, Protein, ...)]

Sequence-based searches
□ searches based on sequence similarity
  ■ (+) results are not influenced by sequence annotations
□ rely on the assumption that proteins with the same function have similar sequences
  ■ (-) not always true: close homologs vs. distant homologs vs. analogs
[Figure: multiple sequence alignment of seven homologous sequences]
□ BLAST
  ■ based on local pairwise alignment
□ PSI-BLAST
  ■ "iterative BLAST" making use of multiple sequence alignment
  ■ very sensitive search strategy to detect weak but biologically significant similarities between sequences
□ ...

Sequence-based searches
□ Basic Local Alignment Search Tool
  ■ break the query into words, e.g. MEAAVKEEISVEDEAVDKNI -> MEA, EAA, AAV, AVK, VKE, ..., EIS, ISV, ...
  ■ break the database sequences into words
  ■ compare the word lists by hashing (allow near matches)

Sequence-based searches
□ Basic Local Alignment Search Tool
  ■ BLOSUM scoring matrix (lower triangle)

    Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
Ala   4
Arg  -1   5
Asn  -2   0   6
Asp  -2  -2   1   6
Cys   0  -3  -3  -3   9
Gln  -1   1   0   0  -3   5
Glu  -1   0   0   2  -4   2   5
Gly   0  -2   0  -1  -3  -2  -2   6
His  -2   0   1  -1  -3   0   0  -2   8
Ile  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
Leu  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
Lys  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
Met  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
Phe  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
Pro  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
Ser   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
Thr   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
Trp  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Tyr  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
Val   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4

  ■ an exact word match is located by scanning and then extended into a high-scoring segment pair (HSP)

Query sequence:     R   P   P   Q   G   L   F
Database sequence:  D   P   P   E   G   V   V
Score:             -2   7   7   2   6   1  -1

  ■ HSP: optimal accumulated score = 7 + 7 + 2 + 6 + 1 = 23

Sequence-based searches
□ PSI-BLAST input
[Screenshot: NCBI blastp suite input form: query sequence (accession number, gi, or FASTA sequence), optional query subrange, job title, search set (database: Non-redundant protein sequences (nr); optional organism restriction)]

Sequence-based searches
□ PSI-BLAST results: hits with Score and E-value
  ■ sequences producing significant alignments:

Accession       Max score  Total score  Query cover  E value  Per. ident  Description
NP_005161.1     373        373          100%         2e-130   100.00%     achaete-scute homolog 2 [Homo sapiens]
XP_002821424.1  368        368          100%         3e-128   98.96%      achaete-scute homolog 2 [Pongo abelii]
XP_003282133.1  361        361          100%         2e-125   97.41%      achaete-scute homolog 2 [Nomascus leucogenys]
XP_011719606.1  356        356          100%         1e-123   96.37%      achaete-scute homolog 2 [Macaca nemestrina]
XP_023039276.1  356        356          100%         1e-123   96.37%      achaete-scute homolog 2 [Piliocolobus tephrosceles]
XP_003909431.1  297        297          100%         3e-100   95.85%      achaete-scute homolog 2 [Papio anubis]
XP_008003331.1  297        297          100%         3e-100   95.34%      PREDICTED: achaete-scute homolog 2 [Chlorocebus sabaeus]
XP_017741776.1  294        294          100%         3e-99    95.34%      PREDICTED: achaete-scute homolog 2 [Rhinopithecus bieti]
XP_017363199.1  271        271          92%          4e-90    96.07%      PREDICTED: achaete-scute homolog 2 [Cebus capucinus imitator]
XP_009006952.1  269        269          100%         3e-89    94.82%      PREDICTED: achaete-scute homolog 2 [Callithrix jacchus]
NP_001116463.1  265        265          100%         1e-87    84.97%      achaete-scute homolog 2 [Sus scrofa]
XP_017899088.1  261        261          92%          5e-86    85.39%      PREDICTED: achaete-scute homolog 2 [Capra hircus]

Sequence-based searches
□ PSI-BLAST results: alignment

>gb|AAT70109.1| CurN [Lyngbya majuscula]
Length=341
Score = 303 bits (777), Expect = 8e-81, Method: Composition-based stats.
Identities = 148/297 (49%), Positives = 188/297 (63%), Gaps = 8/297 (2%)

[Screenshot: pairwise alignment of the query (residues 2-292) with the subject sequence (residues 41-335)]

Sequence-based searches
□ BLAST Score
  ■ normalized raw score
  ■ raw score = sum of substitution scores and gap penalties
  ■ higher is better, but the score alone does not adequately represent the significance of the alignment
□ BLAST E-value
  ■ the number of alignments with a given Score that are expected to be seen simply by chance
  ■ indicator of alignment significance
  ■ results associated with the lowest E-values are the best
  ■ hits with an E-value > 0.01 belong to the "grey zone": do not trust them
□ BLAST alignment
  ■ identity and similarity level between the query and the aligned sequence
  ■ alignment length and coverage of the query sequence: the alignment is local, so one should always check that it covers a significant portion of the query sequence (e.g., an alignment involving only a few amino acids from the query sequence is not a significant hit)

Optimal search strategy
□ text-based search
  ■ good for finding evolutionarily "unrelated" proteins with some specific function
  ■ a large number of false negatives (missed proteins with the target function) and false positives (identified proteins with a different function) due to erroneous or inaccurate annotations
□ sequence-based search
  ■ good for finding members of a protein family (i.e., a group of evolutionarily related proteins sharing some specific function) -> not suitable for finding "unrelated" proteins
  ■ potential false positive results (i.e., proteins belonging to other evolutionarily related families)
  ■ searches using protein sequence queries are generally more sensitive than searches using nucleotide sequence queries (20 different amino acids vs. 4 different nucleotides)
□ combination of text-based and sequence-based approaches
  1. text-based search
  2. subdivision of the identified sequences into evolutionarily related groups
  3. selection of a few representatives for each group
  4. sequence-based searches using each representative as a query
  ■ potential false positive results: should be filtered

How to recognize interesting sequences?
■ sequence clustering
■ sequence comparison
■ information about host organisms
■ automated in silico enzyme identification

Sequence clustering
□ clustering based on pairwise sequence similarities
  ■ can be used for a fast and rough classification of sequences in large datasets (thousands of sequences)
    -> effective way to filter the results of database searches
    -> identification of members of individual protein families
  ■ CLANS: visualization of pairwise sequence similarities in three-dimensional space -> overview of sequence space
[Figure: cluster map of sequences based on pairwise sequence similarities, with the known target proteins highlighted]
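The filtering and clustering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CLANS itself and not a BLAST parser: the `Hit` record, the hit names, the E-values, coverages and pairwise identities are all made up, the coverage cutoff of 0.7 and the identity threshold of 40 % are arbitrary, and plain single-linkage grouping stands in for the similarity-based layout that CLANS computes. Only the E-value limit of 0.01 follows the rule of thumb given in the slides.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Hit:
    """One database hit; all values used below are invented for illustration."""
    name: str
    evalue: float          # E-value reported for the hit
    query_coverage: float  # fraction of the query covered by the alignment (0-1)


def filter_hits(hits, max_evalue=0.01, min_coverage=0.7):
    """Drop grey-zone hits and hits whose local alignment covers too little of the query.

    max_evalue follows the 0.01 rule of thumb; min_coverage is an arbitrary choice.
    """
    return [h for h in hits
            if h.evalue <= max_evalue and h.query_coverage >= min_coverage]


def single_linkage_clusters(names, identity, threshold):
    """Group names so that any pair with identity >= threshold ends up in one cluster."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        # Pairs missing from the table are treated as unrelated (identity 0).
        pair_identity = identity.get((a, b), identity.get((b, a), 0.0))
        if pair_identity >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    # Largest clusters first, members in alphabetical order.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


hits = [
    Hit("hit1", 2e-130, 1.00),
    Hit("hit2", 8e-81, 0.95),
    Hit("hit3", 3e-20, 0.90),
    Hit("hit4", 1e-12, 0.85),
    Hit("hit5", 5e-9, 0.80),
    Hit("hit6", 0.5, 0.90),    # E-value in the grey zone
    Hit("hit7", 1e-30, 0.05),  # alignment covers only a small part of the query
]
kept = [h.name for h in filter_hits(hits)]

# Hypothetical pairwise identities in percent.
identity = {
    ("hit1", "hit2"): 62.0,
    ("hit2", "hit3"): 48.0,
    ("hit4", "hit5"): 55.0,
    ("hit1", "hit4"): 18.0,
}
clusters = single_linkage_clusters(kept, identity, threshold=40.0)
print(kept)
print(clusters)
```

With single linkage, hit1 and hit3 land in the same group even though they were never compared directly, because both are similar to hit2. That behaviour suits a rough first classification of families, but it can chain unrelated sequences together if the threshold is set too low.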
Sequence clustering
□ clustering based on pairwise sequence similarities
[Figure: cluster map with the target protein family highlighted]
[Figure: cluster map with families labeled C-C hydrolases, haloalkane dehalogenases (R-X + H2O -> R-OH + HX), perhydrolases and epoxide hydrolases; legend: previously known members, new family members]

Sequence comparison
□ multiple sequence alignment
  ■ analysis of conserved residues within a protein family -> identification of protein family members
[Figure: multiple sequence alignment of eight sequences; sequences 1-7 share the conserved residues, sequence 8 does not, and the final panel shows sequences 1-7 only]

Sequence comparison
□ multiple sequence alignment
  ■ identification of sequences with unique features -> proteins with potentially novel characteristics
[Figure: multiple sequence alignment of putative family members from many organisms; one sequence stands out with a long unique insertion]

Sequence comparison
□ phylogenetics
  ■ establishment of evolutionary relationships among sequences
  ■ classification of sequences
  ■ information about experimental data -> selection of novel proteins
[Figure: phylogenetic tree of haloalkane dehalogenases divided into the subfamilies HLD-I, HLD-II and HLD-III]

Information about host organisms
□ extremophiles: microorganisms living in extreme conditions
  ■ geochemical extremes (pH, salinity)
  ■ physical extremes (temperature, pressure)
□ proteins from extremophiles
  ■ often adapted to extreme conditions -> unique characteristics, useful for practical applications

Information about host organisms
□ Genomes OnLine Database (GOLD)
  ■ http://www.genomesonline.org/
  ■ list of complete (> 6,000), ongoing (> 27,000) and targeted (> 1,000) genome projects
  ■ information about individual projects and source organisms
□ Entrez Genome
  ■ http://www.ncbi.nlm.nih.gov/sites/genome
  ■ provided by NCBI
  ■ data from more than 20,000 finished or ongoing genome projects (includes almost 10,000 organisms)
  ■ information about genome, source organism, genes, encoded proteins, graphical representations, ...
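Organism metadata of the kind curated by these resources (for example temperature range and salinity) can be used to shortlist sequences that come from extremophiles. The following is a minimal Python sketch, assuming the metadata have already been exported into simple records. The record layout, the organism names and the sets of categories counted as "extreme" are hypothetical and do not reflect the actual export formats of GOLD or Entrez Genome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrganismRecord:
    """Hypothetical, simplified metadata record for a source organism."""
    organism: str
    temperature_range: str  # e.g. "Psychrophile", "Mesophile", "Thermophile"
    salinity: str           # e.g. "Halotolerant", "Halophile"


# Categories treated as "extreme" in this sketch; the choice is an assumption.
EXTREME_TEMPERATURE = {"Psychrophile", "Thermophile", "Hyperthermophile"}
EXTREME_SALINITY = {"Halophile"}


def extremophile_candidates(records):
    """Return records whose temperature range or salinity falls in an extreme category."""
    return [
        r for r in records
        if r.temperature_range in EXTREME_TEMPERATURE or r.salinity in EXTREME_SALINITY
    ]


# Made-up records for illustration only.
records = [
    OrganismRecord("organism A", "Psychrophile", "Halotolerant"),
    OrganismRecord("organism B", "Mesophile", "Non-halophilic"),
    OrganismRecord("organism C", "Mesophile", "Halophile"),
    OrganismRecord("organism D", "Hyperthermophile", "Non-halophilic"),
]
selected = [r.organism for r in extremophile_candidates(records)]
print(selected)
```

Such a filter only narrows the list of candidates: a protein from a psychrophile is potentially cold-adapted, but this has to be confirmed experimentally.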
Information about host organisms
□ GOLD
[Screenshot: GOLD statistics (metagenomes: 370 studies, 2642 samples; isolate genomes: 4169 complete, 17714 incomplete and 1600 targeted projects) and organism metadata fields, e.g. oxygen requirement: aerobe; cell shape: rod-shaped; motility: non-motile; temperature range: psychrophile; salinity: halotolerant; biotic relationships: free living]
□ Entrez Genome
[Screenshot: Entrez Genome entry for Psychrobacter cryohalolentis, a psychrotolerant organism. Lineage: Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Moraxellaceae; Psychrobacter; Psychrobacter cryohalolentis. Psychrobacter spp. are cold-adapted organisms that are often isolated from extreme environments such as permafrost or the Antarctic ice. Representative: Psychrobacter cryohalolentis K5, isolated from saline liquid (12-14 %) found 11-24 m below the surface within forty-thousand-year-old Siberian permafrost at the Kolyma-Indigirka lowland. This strain will provide insight into growth at extremely low temperatures. Human pathogen: no. The genome table lists one chromosome (INSDC CP000323.1, 3.06 Mb, 42.3 % GC) and one plasmid (INSDC CP000324.1, 0.041221 Mb)]

Automated in silico enzyme identification
[Screenshot: EnzymeMiner input form with query sequences (e.g. DrbA), other known sequences (e.g. DmbC), and essential residue templates]
□ essential residue templates ("-" marks a position left empty in the form)

Accession  nucleophile  acid1  acid2  base  halide1    halide2    halide3
(allowed)  D            D,E    D,E    H     H,N,Q,W,Y  H,N,Q,W,Y  H,N,Q,W,Y
DrbA       139          -      272    300   71         140        -
DmbB       123          -      250    279   -          124        164
DmbC       109          -      238    267   43         110        -

[Screenshot: EnzymeMiner job output with downloadable result tables and a target selection table; filters include a solubility threshold and the primary domain PF00561 (Abhydrolase_1); columns include accession, annotation, closest query, identity to the closest query, kingdom, solubility, sequence length and domain]

What to keep in mind?
□ sequence databases
  ■ nucleotide: GenBank, EMBL-Bank, DDBJ; protein: UniProtKB, nr Protein database
  ■ errors in sequences and annotations
□ database searches
  ■ text-based: results influenced by sequence annotations
  ■ sequence-based: identification of family members (BLAST, PSI-BLAST, E-value)
  ■ combination of both approaches: optimal strategy
  ■ false positive results: sequences should be filtered
□ selection of proteins for experimental characterization
  ■ clustering: classification and filtering of hits from database searches (CLANS)
  ■ sequence comparison: classification and identification of unique sequences
  ■ sequences from extremophiles: potentially adapted to extreme conditions
  ■ EnzymeMiner: automated identification of interesting catalysts
□ in silico identification and analysis of sequences: a fast and cheap way to identify new proteins

References
□ Xiong, J. (2006). Essential Bioinformatics. Cambridge University Press, New York, p. 352.
□ Claverie, J-M. and Notredame, C. (2006). Bioinformatics For Dummies (2nd ed.). Wiley Publishing, Hoboken, p. 436.
□ Steele, H.L. et al. (2009). Advances in Recovery of Novel Biocatalysts from Metagenomes. Journal of Molecular Microbiology and Biotechnology 16: 25-37.
□ NCBI Resource Coordinators (2013). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 41: D8-D20.
□ Magrane, M. and UniProt Consortium (2011). UniProt Knowledgebase: a hub of integrated protein data. Database 2011: bar009.
□ Frickey, T. and Lupas, A. (2004). CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20: 3702-3704.
□ Pagani, I. et al. (2012). The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Research 40: D571-D579.
□ Van den Burg, B. (2003). Extremophiles as a source for novel enzymes. Current Opinion in Microbiology 6: 213-218.