LOSCHMIDT LABORATORIES
PROTEIN ENGINEERING
2. IN SILICO IDENTIFICATION OF PROTEINS
Loschmidt Laboratories, Department of Experimental Biology, Masaryk University, Brno

□ Why search for new proteins?
□ How to acquire new proteins?
  ■ traditional approach
  ■ metagenomic approach
  ■ bioinformatic approach
□ Bioinformatic approach
  ■ Where to find target sequences?
  ■ How to find target sequences?
  ■ How to recognize interesting sequences?
□ What to keep in mind?

Why search for new proteins?
■ plenty of reasons
□ better understanding of structure-function relationships
  ■ required for rational design
  [Figure: product (ROH), enzyme-intermediate (E-R) and free enzyme (E) plotted against time]
□ novel properties
  ■ stability [Figure: values in % plotted against time [h]]
  ■ temperature profile [Figure: values plotted against temperature (°C)]
  ■ activity
  ■ specificity
  ■ enantioselectivity [Figure: ethyl (S)-2-bromopropionate, plotted against time (min)]
□ better starting points for protein engineering
-> proteins with desired properties -> practical applications

How to acquire new proteins?
■ traditional approach
■ metagenomic approach
■ bioinformatic approach

□ traditional approach
  [Figure: sample, enrichment, isolation]
  ■ microorganisms possessing the target activity are enriched from the environment and isolated in pure culture
  ■ proteins or the corresponding genes are recovered from the organisms by protein purification, DNA library screening, PCR with specific primers, ...
  ■ (-) the majority of microorganisms (> 99 %) cannot be cultivated using standard techniques -> a large fraction of the microbial diversity in an environment is lost

□ metagenomic approach
  [Figure: sample]
  ■ isolation and cloning of DNA extracted directly from an environmental sample (without culturing the organisms present)
  ■ genes are recovered by DNA library screening or PCR with specific primers, ...
  ■ (+) makes it possible to explore the biodiversity of uncultured microorganisms

□ bioinformatic approach
  [Figure: (meta)genomic sequencing projects -> sequence database (Entrez cross-database search for "linb") -> in silico "screening"]
  [Figure: two LinB protein records returned by the search (one of them ABI93216, the other from Sphingomonas paucimobilis) -> gene synthesis, DNA request]

□ bioinformatic approach
  ■ sequence data from genomic and metagenomic sequencing projects are stored in sequence databases
  ■ in silico searching of sequence databases
    -> (+) fast and cheap way to identify novel proteins
    -> (-) one cannot find what is not in the database (but there is a lot of data, more than one usually needs)
  ■ genes are recovered by gene synthesis or obtained from sequencing consortia upon request

Where to find target sequences?
■ databases of nucleotide sequences
■ databases of protein sequences

Databases of nucleotide sequences
□ GenBank
  ■ http://www.ncbi.nlm.nih.gov/genbank/
  ■ provided by NCBI (National Center for Biotechnology Information)
□ EMBL-Bank
  ■ http://www.ebi.ac.uk/embl/
  ■ provided by EBI (European Bioinformatics Institute)
□ DDBJ (DNA Data Bank of Japan)
  ■ http://www.ddbj.nig.ac.jp/
  ■ provided by the National Institute of Genetics, Japan
□ GenBank, EMBL-Bank, DDBJ
  ■ annotated collections of all publicly available nucleotide sequences
  ■ freely available to the wider community
  ■ contain data obtained from genomic centers or research institutions
  ■ daily synchronization of new or updated data
  ■ (+) contain about 250,000,000 sequences
  ■ (-) mostly automatic annotations: lower quality, errors

Databases of protein sequences
□ UniProtKB
  ■ http://www.uniprot.org/
  ■ provided by EBI, the Swiss Institute of Bioinformatics and the Protein Information Resource
□ nr Protein database
  ■ http://www.ncbi.nlm.nih.gov/protein/
  ■ provided by NCBI
□ UniProtKB, nr Protein database
  ■ annotated collections of publicly available protein sequences
  ■ freely available to the wider community
  ■ contain data obtained by conceptual translation of coding sequences from EMBL-Bank/GenBank/DDBJ or provided by research institutions
  ■ (+) contain more than 100,000,000 sequences
  ■ (-) mostly automatic annotations: lower quality, errors
□ UniProtKB
  ■ rich annotations (e.g., information about the function of the protein and of individual amino acids, experimental data, biological ontologies, classifications, ...)
  ■ clear indication of annotation quality (manual vs. automatic)
  [Figure: UniProtKB protein knowledgebase, split into UniProtKB/Swiss-Prot (reviewed, manual annotation) and UniProtKB/TrEMBL (unreviewed, automatic annotation)]
□ UniProtKB/Swiss-Prot
  ■ high-quality annotations, i.e., manually annotated entries or expert-reviewed automatic annotations
  ■ (+) source of reliable information
  ■ (-) contains "only" ~ 560,000 sequences
□ UniProtKB/TrEMBL
  ■ (-) automatic annotations: lower quality, errors
  ■ (+) contains ~ 180,000,000 sequences

Unexplored protein diversity
  [Figure: number of sequences (GenBank) compared with number of characterized proteins (Swiss-Prot), 1990-2018]

Pitfalls of sequence databases
□ (-) large number of errors
  ■ errors in sequences (wrong base, frameshift errors)
  ■ wrong positions of genes
  ■ exon-intron boundary errors
  ■ errors and inaccuracies in annotations

How to find target sequences?
■ text-based searches
■ sequence-based searches

Text-based searches
□ database retrieval systems
  ■ enable quick and easy search of many databases at the same time
  ■ specification of queries using logical operators (AND, OR, NOT, ...)
  ■ Entrez (NCBI), SRS (EBI)
□ (-) results dependent on sequence annotations
  ■ erroneous, inaccurate or too general annotations
  ■ synonyms
  ■ misspellings

Text-based searches
□ database retrieval systems
  [Screenshot: Entrez cross-database search for "mouse[ORGN] AND kinase AND (exons OR introns)"; result counts include PubMed 1258, Nucleotide 152 and Protein 96; counts displayed in gray indicate that one or more terms were not found]

Sequence-based searches
□ searches based on sequence similarity
  ■ (+) results not influenced by sequence annotations
□ rely on the assumption that proteins with the same function have similar sequences
  ■ (-) not always true: close homologs vs. distant homologs vs. analogs
  [Figure: alignment of seven related protein sequences]
□ BLAST
  ■ based on local pairwise alignment
□ PSI-BLAST
  ■ "iterative BLAST" making use of multiple sequence alignment
  ■ very sensitive search strategy to detect weak but biologically significant similarities between sequences
□ ...
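
The Boolean query from the text-based search slides, mouse[ORGN] AND kinase AND (exons OR introns), can be assembled programmatically, which makes it easier to OR together the synonyms that text-based searches tend to miss. The following is a minimal sketch only: the helper names (term, any_of, all_of) are hypothetical, and the synonym list in the second query is an illustrative assumption rather than a recommended search.

```python
def term(text, field=None):
    """Format one search term; multi-word terms are quoted as a phrase."""
    quoted = f'"{text}"' if " " in text else text
    return f"{quoted}[{field}]" if field else quoted


def any_of(*terms):
    """Combine alternatives (e.g., synonyms) with OR inside parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Require every term with AND."""
    return " AND ".join(terms)


# The query shown on the slide.
slide_query = all_of(term("mouse", "ORGN"), "kinase", any_of("exons", "introns"))
print(slide_query)  # mouse[ORGN] AND kinase AND (exons OR introns)

# Hypothetical synonym list, used only to illustrate OR-ing alternatives.
synonym_query = any_of(term("haloalkane dehalogenase"), term("LinB"))
print(synonym_query)  # ("haloalkane dehalogenase" OR LinB)
```

Building the string from parts keeps the parentheses balanced; it does not help with misspelled or overly general annotations, which still have to be caught by sequence-based searches.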
Sequence-based searches
□ Basic Local Alignment Search Tool
  ■ break the query into words: MEAAVKEEISVEDEAVDKNI -> MEA, EAA, AAV, AVK, VKE, ..., EIS, ISV, ...
  ■ break the database sequences into words
  ■ compare the word lists by hashing (allow near matches)

Basic Local Alignment Search Tool
□ BLOSUM scoring matrix

    Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
Ala   4
Arg  -1   5
Asn  -2   0   6
Asp  -2  -2   1   6
Cys   0  -3  -3  -3   9
Gln  -1   1   0   0  -3   5
Glu  -1   0   0   2  -4   2   5
Gly   0  -2   0  -1  -3  -2  -2   6
His  -2   0   1  -1  -3   0   0  -2   8
Ile  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
Leu  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
Lys  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
Met  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
Phe  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
Pro  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
Ser   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
Thr   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
Trp  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Tyr  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
Val   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4

□ scoring example (the exact match is scanned and extended)

Query sequence:      R   P   P   Q   G   L   F
Database sequence:   D   P   P   E   G   V   V
Score:              -2   7   7   2   6   1  -1

  ■ HSP: optimal accumulated score = 7+7+2+6+1 = 23

Sequence-based searches
□ PSI-BLAST input
  [Screenshot: NCBI BLAST protein search form with fields for the query sequence (accession number, gi, or FASTA sequence), query subrange, job title, database (non-redundant protein sequences, nr) and organism]
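
The word-splitting step and the scoring example from the Basic Local Alignment Search Tool slides can be sketched in a few lines. This is a simplified illustration, not the BLAST implementation: the function names are hypothetical, the score table contains only the entries needed for the slide example (taken from the matrix on the slide), and the best segment is found by scanning the whole ungapped alignment rather than by extending a seed word as BLAST does.

```python
def words(sequence, size=3):
    """Break a sequence into all overlapping words of the given size."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


# Subset of the slide's scoring matrix: only the pairs used in the example.
SLIDE_SCORES = {
    ("R", "D"): -2, ("P", "P"): 7, ("Q", "E"): 2,
    ("G", "G"): 6, ("L", "V"): 1, ("F", "V"): -1,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in SLIDE_SCORES:
        return SLIDE_SCORES[(a, b)]
    return SLIDE_SCORES[(b, a)]  # KeyError for pairs outside the subset


def best_segment(query, subject):
    """Return (score, (start, end), column scores) of the best ungapped segment."""
    column_scores = [pair_score(a, b) for a, b in zip(query, subject)]
    best, best_span = column_scores[0], (0, 1)
    running, start = 0, 0
    for i, s in enumerate(column_scores):
        if running <= 0:          # a non-positive prefix never helps: restart here
            running, start = s, i
        else:
            running += s
        if running > best:
            best, best_span = running, (start, i + 1)
    return best, best_span, column_scores


print(words("MEAAVKEEISVEDEAVDKNI")[:5])  # ['MEA', 'EAA', 'AAV', 'AVK', 'VKE']

score, (start, end), columns = best_segment("RPPQGLF", "DPPEGVV")
print(columns)                       # [-2, 7, 7, 2, 6, 1, -1]
print(score, "RPPQGLF"[start:end])   # 23 PPQGL
```

The segment drops the flanking columns with negative scores (R/D and F/V), which is why the accumulated score of the HSP is 23 rather than the sum over all seven columns.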
Sequence-based searches
□ PSI-BLAST results
  [Screenshot: hit list with columns Description, Max Score, Total Score, Query Cover, E value, Per. Ident and Accession; e.g., achaete-scute homolog 2 [Homo sapiens]: Max Score 373, Total Score 373, Query Cover 100%, E value 2e-130, Per. Ident 100.00%, accession NP_005161.1]
  [Screenshot: pairwise alignment of the query with gb|AAT70109.1| CurN [Lyngbya majuscula], Length=341: Score = 303 bits (777), Expect = 8e-81, Method: Composition-based stats., Identities = 148/297 (49%), Positives = 188/297 (63%), Gaps = 8/297 (2%); query residues 2-292 aligned with subject residues 41-335]

□ BLAST Score
  ■ normalized raw score
  ■ raw score = sum of substitution scores and gap penalties
  ■ higher is better, but it does not adequately represent the significance of the alignment
□ BLAST E-value
  ■ equal to the number of BLAST alignments with a given Score that are expected to be seen by chance
  ■ indicator of alignment significance
  ■ results associated with the lowest E-values are the best
  ■ hits with an E-value > 0.01 belong to the "grey zone": do not trust them
□ BLAST alignment
  ■ identity and similarity level between the query and the aligned sequence
  ■ alignment length and coverage of the query sequence: the alignment is local, so always check that it covers a significant portion of the query sequence (e.g., an alignment involving only a few amino acids of the query is not a significant hit)

□ text-based search
  ■ good for finding evolutionarily "unrelated" proteins with some specific function
  ■ a large number of false negatives (missed proteins with the target function) and false positives (identified proteins with a different function) due to erroneous or inaccurate annotations
□ sequence-based search
  ■ good for finding members of a protein family (i.e., a group of evolutionarily related proteins sharing some specific function) -> not suitable for finding "unrelated" proteins
  ■ potential false positive results (i.e., proteins belonging to other evolutionarily related families)
  ■ searches using protein sequence queries are generally more sensitive than searches using nucleotide sequence queries (20 different amino acids vs. 4 different nucleotides)
□ combination of text-based and sequence-based approaches
  1. text-based search
  2. subdivision of identified sequences into evolutionarily related groups
  3. selection of a few representatives for each group
  4. sequence-based searches using each representative as a query
  ■ potential false positive results: should be filtered

How to recognize interesting sequences?
■ sequence clustering
■ sequence comparison
■ information about host organisms
■ automated in silico enzyme identification

Sequence clustering
□ clustering based on pairwise sequence similarities
  ■ can be used for a fast and rough classification of sequences in large datasets (thousands of sequences)
    -> effective way to filter results of database searches
    -> identification of members of individual protein families
  ■ CLANS: visualization of pairwise sequence similarities in three-dimensional space -> overview of sequence space
  [Figure: cluster map of sequences with the known target proteins and the target protein family highlighted]
  [Figure: cluster map with groups labelled C-C hydrolases, haloalkane dehalogenases (R-X + H2O -> R-OH + HX), perhydrolases and epoxide hydrolases; legend: previously known members, new family members]

Sequence comparison
□ multiple sequence alignment
  ■ analysis of conserved residues within a protein family -> identification of protein family members
  [Figure: multiple sequence alignment of eight sequences; the conserved residues (Y, PFP, FP, VP) are present in sequences 1-7 but not in sequence 8]
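
The idea of using conserved residues to recognize family members can be sketched as follows. This is a minimal illustration under stated assumptions: the sequences are short hypothetical toy strings that are already aligned (equal length, "-" for gaps), and the function names are not taken from any alignment tool.

```python
from collections import Counter


def conserved_columns(alignment, threshold=1.0):
    """Return (column index, residue) pairs whose most common residue
    occurs in at least `threshold` of the aligned sequences."""
    n = len(alignment)
    conserved = []
    for i, column in enumerate(zip(*alignment)):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / n >= threshold:
            conserved.append((i, residue))
    return conserved


def has_conserved_residues(sequence, conserved):
    """Check whether an aligned candidate carries every conserved residue."""
    return all(sequence[i] == residue for i, residue in conserved)


# Hypothetical, pre-aligned toy sequences (not real proteins).
family = ["AYDAPFPD", "AYEAPFPE", "VYDAPFPQ"]
pattern = conserved_columns(family)
print(pattern)  # [(1, 'Y'), (3, 'A'), (4, 'P'), (5, 'F'), (6, 'P')]

print(has_conserved_residues("SYDAPFPN", pattern))  # True
print(has_conserved_residues("SYVKF-GN", pattern))  # False
```

Lowering the threshold below 1.0 tolerates occasional substitutions, at the cost of admitting more sequences that may belong to related families.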
Sequence comparison
□ multiple sequence alignment
  ■ identification of sequences with unique features -> proteins with potentially novel characteristics
  [Figure: multiple sequence alignment of more than twenty homologs (labels such as Shesp, Sheama, Pelpro, Desace, Xanaxo, Xylfas, Chlaur, Despsy, Rhobal, Burcen, Myctub, Nocfar, ...); one sequence carries a long insertion that is absent from all the others]
Sequence comparison
□ phylogenetics
  ■ establishment of evolutionary relationships among sequences
  ■ classification of sequences
  [Figure: phylogenetic tree classifying haloalkane dehalogenases (e.g., DhlA, DhmA, LinB, DmbA, DbjA, DatA, DrbA) into the subfamilies HLD-I, HLD-II and HLD-III]
  ■ information about experimental data -> selection of novel proteins

Information about host organisms
□ extremophiles: microorganisms living in extreme conditions
  ■ geochemical extremes (pH, salinity)
  ■ physical extremes (temperature, pressure)
□ proteins from extremophiles
  ■ often adapted to extreme conditions -> unique characteristics, useful for practical applications
□ Genomes OnLine Database (GOLD)
  ■ http://www.genomesonline.org/
  ■ list of complete (> 6,000), ongoing (> 27,000) and targeted (> 1,000) genome projects
  ■ information about individual projects and source organisms
□ Entrez Genome
  ■ http://www.ncbi.nlm.nih.gov/sites/genome
  ■ provided by NCBI
  ■ data from more than 20,000 finished or ongoing genome projects (covering almost 10,000 organisms)
  ■ information about genome, source organism, genes, encoded proteins, graphical representations, ...
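
Organism metadata of the kind these resources provide (e.g., temperature range and salinity) can be used to flag hits that come from extremophiles. The sketch below is illustrative only: the Hit record, the accessions and the sets of "extreme" labels are assumptions made for this example and do not reproduce the GOLD or Entrez Genome data model.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str          # hypothetical identifier of a database hit
    organism: str
    temperature_range: str  # e.g., "Psychrophile", "Mesophile"
    salinity: str           # e.g., "Halotolerant", "Unknown"


# Assumed label sets; real resources may use different vocabularies.
EXTREME_TEMPERATURE = {"Psychrophile", "Thermophile", "Hyperthermophile"}
EXTREME_SALINITY = {"Halophile", "Halotolerant"}


def from_extremophile(hit):
    """True if any recorded property of the host organism is extreme."""
    return (hit.temperature_range in EXTREME_TEMPERATURE
            or hit.salinity in EXTREME_SALINITY)


hits = [
    Hit("hit_1", "Psychrobacter cryohalolentis K5", "Psychrophile", "Halotolerant"),
    Hit("hit_2", "Escherichia coli", "Mesophile", "Unknown"),
]

candidates = [h.accession for h in hits if from_extremophile(h)]
print(candidates)  # ['hit_1']
```

Such a filter only narrows down the list; whether a protein is actually adapted to the extreme condition still has to be shown experimentally.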
Information about host organisms
□ GOLD
  [Screenshot: GOLD project statistics (complete projects: 4169, incomplete projects: 17714, targeted projects: 1600; metagenome studies: 370, samples: 2642) and organism metadata fields, e.g., oxygen requirement: aerobe; cell shape: rod-shaped; motility: non-motile; temperature range: psychrophile; salinity: halotolerant; biotic relationships: free living]

Information about host organisms
□ Entrez Genome
  [Screenshot: Entrez Genome record for Psychrobacter cryohalolentis]
  ■ Psychrobacter cryohalolentis: psychrotolerant organism
  ■ lineage: Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Moraxellaceae; Psychrobacter; Psychrobacter cryohalolentis
  ■ Psychrobacter spp. are cold-adapted organisms that are often isolated from extreme environments such as permafrost or the Antarctic ice
  ■ representative: Psychrobacter cryohalolentis K5, isolated from saline liquid (12-14%) found 11-24 m below the surface within a forty-thousand-year-old Siberian permafrost at the Kolyma-Indigirka lowland; this strain will provide insight into growth at extremely low temperatures
  ■ human pathogen: no
  ■ genome: chromosome (INSDC CP000323.1, 3.06 Mb, GC 42.3%) and one plasmid (INSDC CP000324.1)
  ■ biological properties: shape: bacilli; motility: no; salinity: moderate halophilic; temperature range: psychrophilic; habitat: multiple
Automated in silico enzyme identification
□ Enzyme Miner
  ■ https://loschmidt.chemi.muni.cz/enzymeminer/
  [Figure: query sequences, essential residues and other sequences -> ENZYME MINER -> annotated putative enzymes, solubility prediction]

Automated in silico enzyme identification
  [Screenshot: Enzyme Miner start page, "Automated mining of enzymes with diversified function", with a form to submit a new job from Swiss-Prot sequences (by EC number) or from custom sequences]
  ■ Hon, J., Borko, S., Bednar, D., Prokop, Z., Martinek, T., Damborsky, J. (2019). EnzymeMiner: Web Server for Automated Mining of Sequences Encoding Enzymes with Diversified Functions. Nucleic Acids Research (in preparation).
  ■ keywords: enzyme mining, protein function, diversity, sequence space, computational characterization
  [Screenshot: Enzyme Miner input form with query sequences in FASTA format (DrbA), other known sequences (DmbC) and a table of essential residue templates: one row per protein (DrbA, DmbB, DmbC) and one column per essential residue (nucleophile, acid1, acid2, base, halide1, halide2, halide3) with the allowed amino acids and the residue position in each protein; e.g., nucleophile at position 139 in DrbA, 123 in DmbB and 109 in DmbC]
  [Screenshot: Enzyme Miner job output with downloadable result tables (xlsx, tsv, raw results) and a target selection table that can be filtered by solubility threshold and primary domain; columns: Accession, Annotation, Closest query, Identity to closest query, Kingdom, Solubility, Sequence length, Domain; e.g., WP_071575177.1, annotated as haloalkane dehalog..., identity 70.3, solubility 0.9399, sequence length 270]
What to keep in mind?
□ sequence databases
  ■ nucleotide: GenBank, EMBL-Bank, DDBJ; protein: UniProtKB, nr Protein database
  ■ errors in sequences and annotations
□ database searches
  ■ text-based: results influenced by sequence annotations
  ■ sequence-based: identification of family members (BLAST, PSI-BLAST, E-value)
  ■ combination of both approaches: optimal strategy
  ■ false positive results: sequences should be filtered
□ selection of proteins for experimental characterization
  ■ clustering: classification and filtering of hits from database searches (CLANS)
  ■ sequence comparison: classification and identification of unique sequences
  ■ sequences from extremophiles: potentially adapted to extreme conditions
  ■ Enzyme Miner: automated identification of interesting catalysts
□ in silico identification and analysis of sequences: fast and cheap way to identify new proteins

□ Xiong, J. (2006). Essential Bioinformatics. Cambridge University Press, New York, p. 352.
□ Claverie, J-M. and Notredame, C. (2006). Bioinformatics For Dummies (2nd ed.). Wiley Publishing, Hoboken, p. 436.
□ Steele, H.L. et al. (2009). Advances in Recovery of Novel Biocatalysts from Metagenomes. Journal of Molecular Microbiology and Biotechnology 16: 25-37.
□ NCBI Resource Coordinators (2013). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 41: D8-D20.
□ Magrane, M. and UniProt Consortium (2011). UniProt Knowledgebase: a hub of integrated protein data. Database 2011: bar009.
□ Frickey, T. and Lupas, A. (2004). CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20: 3702-3704.
□ Pagani, I. et al. (2012). The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Research 40: D571-D579.
□ Van den Burg, B. (2003). Extremophiles as a source for novel enzymes. Current Opinion in Microbiology 6: 213-218.

LOSCHMIDT LABORATORIES
PROTEIN ENGINEERING
3. PREPARATION OF RECOMBINANT PROTEINS, PROTEIN EXPRESSION AND PURIFICATION
Loschmidt Laboratories, Department of Experimental Biology, Masaryk University, Brno