LOSCHMIDT LABORATORIES
PROTEIN ENGINEERING
2. IN SILICO IDENTIFICATION OF PROTEINS
Loschmidt Laboratories Department of Experimental Biology Masaryk University, Brno
□ Why to search for new proteins?
□ How to acquire new proteins?
■ traditional approach
■ metagenomic approach
■ bioinformatic approach
□ Bioinformatic approach
■ Where to find target sequences?
■ How to find target sequences?
■ How to recognize interesting sequences?
□ What to keep in mind?
Why to search for new proteins?
■ plenty of reasons
□ better understanding of structure-function relationships
■ required for rational design
[Figure: time course of product (ROH), enzyme-intermediate (E-R) and free enzyme (E)]
[Figure (novel properties, stability): activity [%] vs. time [h], 0-600 h]
[Figure (novel properties, temperature profile): activity vs. temperature (°C)]
[Figure (novel properties, enantioselectivity): conversion of ethyl (S)-2-bromopropionate vs. time (min)]
□ better understanding of structure-function relationships
□ novel properties
■ stability
■ temperature profile
■ activity
■ specificity
■ enantioselectivity
□ better understanding of structure-function relationships
□ novel properties
□ better starting points for protein engineering
-> proteins with desired properties -> practical applications
How to acquire new proteins?
■ traditional approach
■ metagenomic approach
■ bioinformatic approach
How to acquire new proteins?
□ traditional approach
[Figure: sample -> enrichment -> isolation]
■ microorganisms possessing target activity are enriched from the environment and isolated in pure culture
■ proteins or corresponding genes are recovered from organisms by protein purification, DNA library screening, PCR with specific primers,...
■ drawback: the majority of microorganisms (> 99 %) cannot be cultivated using standard techniques -> a large fraction of the microbial diversity in an environment is lost
How to acquire new proteins?
□ metagenomic approach
■ isolation and cloning of DNA extracted directly from an environmental sample (without culturing the organisms present)
■ genes are recovered by DNA library screening or PCR with specific primers,...
■ advantage: makes the biodiversity of uncultured microorganisms accessible
How to acquire new proteins?
□ bioinformatic approach
[Figure: (meta)genomic sequencing projects -> sequence database (NCBI Entrez) -> in silico "screening" (example hits: LinB protein records such as ABI93216) -> gene synthesis or DNA request]
■ sequence data from genomic and metagenomic sequencing projects are stored in sequence databases
■ in silico searching of sequence databases
-> advantage: fast and cheap way to identify novel proteins
-> limitation: one cannot find what is not in the database (but there is a lot of data, more than one usually needs)
■ genes are recovered by gene synthesis or obtained from sequencing consortia upon request
Where to find target sequences?
■ databases of nucleotide sequences
■ databases of protein sequences
Databases of nucleotide sequences
□ GenBank
■ http://www.ncbi.nlm.nih.gov/genbank/
■ provided by NCBI (National Center for Biotechnology Information)
□ EMBL-Bank
■ http://www.ebi.ac.uk/embl/
■ provided by EBI (European Bioinformatics Institute)
□ DDBJ (DNA Data Bank of Japan)
■ http://www.ddbj.nig.ac.jp/
■ provided by the National Institute of Genetics, Japan
Databases of nucleotide sequences
□ GenBank, EMBL-Bank, DDBJ
■ annotated collections of all publicly available nucleotide sequences
■ freely available to the wider community
■ contain data obtained from genomic centers or research institutions
■ daily synchronization of new or updated data
■ advantage: contain about 250,000,000 sequences
■ drawback: mostly automatic annotations - lower quality, errors
Databases of protein sequences
□ UniProtKB
■ http://www.uniprot.org/
■ provided by EBI, Swiss Institute of Bioinformatics and Protein Information Resource
□ nr Protein database
■ http://www.ncbi.nlm.nih.gov/protein/
■ provided by NCBI
Databases of protein sequences
□ UniProtKB, nr Protein database
■ annotated collections of publicly available protein sequences
■ freely available to the wider community
■ contain data obtained by conceptual translation of coding sequences from EMBL-Bank/GenBank/DDBJ or provided by research institutions
■ advantage: contain more than 100,000,000 sequences
■ drawback: mostly automatic annotations - lower quality, errors
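Conceptual translation, mentioned above, can be sketched in a few lines. The example below is a minimal illustration and not the pipeline used by any of the databases: it assumes the standard genetic code, reading frame 1, and a coding sequence without introns. The function name `translate` and the example sequence are illustrative.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1); bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        # "X" marks codons that contain ambiguous bases (assumption of this sketch).
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Illustrative sequence encoding the peptide MSLGAKPF followed by a stop codon.
print(translate("ATGAGCCTGGGCGCGAAACCGTTTTAA"))
```

Sequencing errors such as a frameshift change every codon downstream of the error, which is one reason why automatically translated entries can be wrong.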
Databases of protein sequences
□ UniProtKB (protein knowledgebase)
■ rich annotations (e.g., information about function of protein and individual amino acids, experimental data, biological ontologies, classifications,...)
■ clear indication of annotation quality (manual vs. automatic)
■ UniProtKB/Swiss-Prot: reviewed, manual annotation
■ UniProtKB/TrEMBL: unreviewed, automatic annotation
Databases of protein sequences
□ UniProtKB/Swiss-Prot
■ high-quality annotations, i.e., manually annotated entries or expert-reviewed automatic annotations
■ advantage: source of reliable information
■ drawback: contains "only" ~ 560,000 sequences
□ UniProtKB/TrEMBL
■ drawback: automatic annotations - lower quality, errors
■ advantage: contains ~ 180,000,000 sequences
Unexplored protein diversity
[Figure: number of sequences (GenBank) vs. number of characterized proteins (Swiss-Prot), 1990-2018; the gap between the two curves is the unexplored protein diversity]
Pitfalls of sequence databases
□ large number of errors
■ errors in sequences (wrong base, frameshift errors)
■ wrong positions of genes
■ exon-intron boundary errors
■ errors and inaccuracies in annotations
How to find target sequences?
■ text-based searches
■ sequence-based searches
Text-based searches
□ database retrieval systems
■ enable quick and easy search of many databases at the same time
■ queries are specified using logical operators (AND, OR, NOT,...)
■ Entrez (NCBI), SRS (EBI)
□ drawback: results depend on sequence annotations
■ erroneous, inaccurate or too general annotations
■ synonyms
■ misspellings
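The dependence of text-based searches on annotations can be illustrated with a toy Boolean search. This is only a sketch: the records are hypothetical, the function `search` is not part of Entrez or SRS, and real retrieval systems use indexed fields rather than whole-word matching.

```python
import re

# Hypothetical annotation records; seq3 contains a misspelling,
# seq4 carries an annotation that is too general to be found.
RECORDS = {
    "seq1": "haloalkane dehalogenase [Sphingobium japonicum]",
    "seq2": "alpha/beta hydrolase [Mycobacterium tuberculosis]",
    "seq3": "haloalkane dehalogenaze [Psychrobacter cryohalolentis]",
    "seq4": "hypothetical protein [Xanthobacter autotrophicus]",
}


def words(text):
    """Lower-case words of an annotation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(records, all_of=(), any_of=(), none_of=()):
    """IDs whose annotation has every term in all_of (AND),
    at least one term in any_of (OR) and no term in none_of (NOT)."""
    hits = []
    for seq_id, annotation in records.items():
        w = words(annotation)
        if (all(t in w for t in all_of)
                and (not any_of or any(t in w for t in any_of))
                and not any(t in w for t in none_of)):
            hits.append(seq_id)
    return hits


# haloalkane AND dehalogenase: the misspelled seq3 is missed (false negative).
print(search(RECORDS, all_of=["haloalkane", "dehalogenase"]))
# (dehalogenase OR hydrolase) NOT mycobacterium
print(search(RECORDS, any_of=["dehalogenase", "hydrolase"], none_of=["mycobacterium"]))
```

Even if seq3 and seq4 encode the target enzyme, neither is retrieved, because the search sees only the annotation text and never the sequence.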
Text-based searches
□ database retrieval systems
■ example Entrez query across all NCBI databases: mouse[ORGN] AND kinase AND (exons OR introns)
[Figure: Entrez cross-database results page for this query, listing the number of hits per database (e.g., 152 Nucleotide and 96 Protein records); counts displayed in gray indicate that one or more terms were not found]
Sequence-based searches
□ searches based on sequence similarity
■ advantage: results not influenced by sequence annotations
□ rely on the assumption that proteins with the same function have similar sequences
■ caveat: not always true - close homologs vs. distant homologs vs. analogs
[Figure: multiple sequence alignment of seven homologous sequences sharing conserved residues (e.g., Y, PFP, FP, LVP)]
□ BLAST
■ based on local pairwise alignment
□ PSI-BLAST
■ "iterative BLAST" making use of multiple sequence alignment
■ very sensitive search strategy to detect weak but biologically significant similarities between sequences
□ ...
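The idea behind the "iterative BLAST" can be sketched as follows: the hits of one round are aligned and turned into a position-specific scoring matrix, which then replaces the single query in the next round. The code below is a strongly simplified illustration, assuming gap-free aligned segments, a uniform background and a fixed pseudocount; PSI-BLAST itself uses sequence weighting and data-dependent pseudocounts. The motif segments are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Position-specific log-odds scores (base 2) against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Score of a segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical aligned segments collected in a first search round.
round_one_hits = ["HGNPTSS", "HGNPTWS", "HGQPTSS", "HGNPTSA"]
pssm = build_pssm(round_one_hits)

print(round(score(pssm, "HGNPTSS"), 2))  # fits the profile
print(round(score(pssm, "AAAAAAA"), 2))  # does not fit the profile
```

Because the profile rewards residues at the positions where the family conserves them, it can detect weaker similarities than a single query sequence scored with a general substitution matrix.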
□ Basic Local Alignment Search Tool (BLAST)
■ break the query into words (e.g., MEAAVKEEISVEDEAVDKNI -> MEA, EAA, AAV, AVK, VKE, ..., EIS, ISV, ...)
■ break database sequences into words
■ compare the word lists by hashing (allow near matches)
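The word-matching step can be sketched with a hash table. This is a minimal illustration, assuming a word length of 3 and exact matches only; BLAST additionally accepts near matches whose substitution score exceeds a threshold. The database sequences are hypothetical.

```python
from collections import defaultdict


def words(seq, w=3):
    """All overlapping words of length w with their start positions."""
    return [(seq[i:i + w], i) for i in range(len(seq) - w + 1)]


def index_database(database, w=3):
    """Hash table: word -> list of (sequence id, position)."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for word, pos in words(seq, w):
            index[word].append((seq_id, pos))
    return index


def find_seeds(query, index, w=3):
    """Exact word matches as (query position, sequence id, subject position)."""
    return [
        (q_pos, seq_id, s_pos)
        for word, q_pos in words(query, w)
        for seq_id, s_pos in index.get(word, [])
    ]


query = "MEAAVKEEISVEDEAVDKNI"
database = {"db1": "GSMEAAVRQEISVE", "db2": "PLKTWW"}  # hypothetical sequences

print([word for word, _ in words(query)][:5])
print(find_seeds(query, index_database(database)))
```

Each seed is a candidate starting point that is subsequently extended into a longer local alignment; sequences without any seed (db2 here) are never aligned, which is what makes the search fast.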
Basic Local Alignment Search Tool
□ BLOSUM scoring matrix (values correspond to BLOSUM62; lower triangle shown)
	Ala	Arg	Asn	Asp	Cys	Gln	Glu	Gly	His	Ile	Leu	Lys	Met	Phe	Pro	Ser	Thr	Trp	Tyr	Val
Ala	4
Arg	-1	5
Asn	-2	0	6
Asp	-2	-2	1	6
Cys	0	-3	-3	-3	9
Gln	-1	1	0	0	-3	5
Glu	-1	0	0	2	-4	2	5
Gly	0	-2	0	-1	-3	-2	-2	6
His	-2	0	1	-1	-3	0	0	-2	8
Ile	-1	-3	-3	-3	-1	-3	-3	-4	-3	4
Leu	-1	-2	-3	-4	-1	-2	-3	-4	-3	2	4
Lys	-1	2	0	-1	-3	1	1	-2	-1	-3	-2	5
Met	-1	-1	-2	-3	-1	0	-2	-3	-2	1	2	-1	5
Phe	-2	-3	-3	-3	-2	-3	-3	-3	-1	0	0	-3	0	6
Pro	-1	-2	-2	-1	-3	-1	-1	-2	-2	-3	-3	-1	-2	-4	7
Ser	1	-1	1	0	-1	0	0	0	-1	-2	-2	0	-1	-2	-1	4
Thr	0	-1	0	-1	-1	-1	-1	-2	-2	-1	-1	-1	-1	-2	-1	1	5
Trp	-3	-3	-4	-4	-2	-2	-3	-2	-2	-3	-2	-3	-1	1	-4	-3	-2	11
Tyr	-2	-2	-2	-3	-2	-1	-2	-3	2	-1	-1	-2	-1	3	-3	-2	-2	2	7
Val	0	-3	-3	-3	-1	-2	-2	-3	-3	3	1	-2	1	-1	-2	-2	0	-3	-1	4
□ example: extension of a word match into a high-scoring segment pair (HSP)
Query sequence:	R	P	P	Q	G	L	F
Database sequence:	D	P	P	E	G	V	V
Score:	-2	7	7	2	6	1	-1
■ a word match is found first and then extended in both directions
■ HSP: PPQGL aligned with PPEGV
■ optimal accumulated score = 7+7+2+6+1 = 23
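The HSP score of the example can be reproduced with a short script. This is a sketch under simplifying assumptions: only the BLOSUM62 entries needed for the example are included, the two sequences are already placed on the same diagonal, and the best segment is found by a maximum-subarray scan rather than by the seed extension with drop-off used in BLAST.

```python
# Only the BLOSUM62 entries needed for this example.
BLOSUM62_SUBSET = {
    ("R", "D"): -2, ("P", "P"): 7, ("Q", "E"): 2,
    ("G", "G"): 6, ("L", "V"): 1, ("F", "V"): -1,
}


def sub_score(a, b):
    """Substitution score; the matrix is symmetric, so try both orders."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def best_ungapped_segment(query, subject):
    """Per-position scores and the best contiguous segment (score, start, end)."""
    scores = [sub_score(a, b) for a, b in zip(query, subject)]
    best = (0, 0, 0)
    running, start = 0, 0
    for i, s in enumerate(scores):
        if running <= 0:          # a non-positive prefix never helps: restart here
            running, start = 0, i
        running += s
        if running > best[0]:
            best = (running, start, i + 1)
    return scores, best


query, subject = "RPPQGLF", "DPPEGVV"
scores, (total, start, end) = best_ungapped_segment(query, subject)
print(scores)
print(query[start:end], subject[start:end], total)
```

The flanking pairs R/D and F/V have negative scores, so including them would lower the total; the segment PPQGL/PPEGV with score 23 is therefore reported as the HSP.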
Sequence-based searches
□ PSI-BLAST input
[Figure: NCBI blastp suite input form: query sequence (accession number, gi, or FASTA sequence; here MSLGAKPFGEKKFIEIKGRRMAYIDEGTGDPILFQHGNPTSSYLWRNI...), optional query subrange, job title, database (non-redundant protein sequences, nr) and optional organism filter]
Sequence-based searches
□ PSI-BLAST results: list of hits with Score and E-value
Description	Max Score	Total Score	Query Cover	E value	Per. Ident	Accession
achaete-scute homolog 2 [Homo sapiens]	373	373	100%	2e-130	100.00%	NP_005161.1
achaete-scute homolog 2 [Pongo abelii]	368	368	100%	3e-128	98.96%	XP_002821424.1
achaete-scute homolog 2 [Nomascus leucogenys]	361	361	100%	2e-125	97.41%	XP_003282133.1
achaete-scute homolog 2 [Macaca nemestrina]	356	356	100%	1e-123	96.37%	XP_011719606.1
achaete-scute homolog 2 [Piliocolobus tephrosceles]	356	356	100%	1e-123	96.37%	XP_023039276.1
achaete-scute homolog 2 [Papio anubis]	297	297	100%	3e-100	95.85%	XP_003909431.1
PREDICTED: achaete-scute homolog 2 [Chlorocebus sabaeus]	297	297	100%	3e-100	95.34%	XP_008003331.1
PREDICTED: achaete-scute homolog 2 [Rhinopithecus bieti]	294	294	100%	3e-99	95.34%	XP_017741776.1
PREDICTED: achaete-scute homolog 2 [Cebus capucinus imitator]	271	271	92%	4e-90	96.07%	XP_017363199.1
PREDICTED: achaete-scute homolog 2 [Callithrix jacchus]	269	269	100%	3e-89	94.82%	XP_009006952.1
achaete-scute homolog 2 [Sus scrofa]	265	265	100%	1e-87	84.97%	NP_001116463.1
PREDICTED: achaete-scute homolog 2 [Capra hircus]	261	261	92%	5e-86	85.39%	XP_017899088.1
Sequence-based searches
□ PSI-BLAST results: alignment
■ hit: gb|AAT70109.1| CurN [Lyngbya majuscula], Length=341
■ Score = 303 bits (777), Expect = 8e-81, Method: Composition-based stats.
■ Identities = 148/297 (49%), Positives = 188/297 (63%), Gaps = 8/297 (2%)
[Figure: pairwise alignment of query residues 2-292 with subject residues 41-335, shown in blocks of 60 columns with a match line between the Query and Sbjct rows; the conserved stretch PVLFLHGNPTSSYLWRNIIP is identical in both sequences]
□ BLAST Score
■ bit score: normalized raw score
■ raw score = sum of substitution scores and gap penalties
■ higher is better, but the score alone does not adequately represent the significance of an alignment
□ BLAST E-value
■ the number of alignments with a given Score that are expected to be seen by chance alone in a database of the searched size
■ indicator of alignment significance
■ hits with the lowest E-values are the best
■ hits with an E-value > 0.01 belong to the "grey zone" - do not trust them
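The relation between raw score, bit score and E-value follows the Karlin-Altschul statistics: E = K·m·n·exp(-λS) and S' = (λS - ln K) / ln 2, so that E = m·n·2^(-S'). The sketch below evaluates these formulas. The values of λ and K are assumptions (typical values reported for BLOSUM62 with gap costs 11/1), and the query and database lengths are hypothetical; BLAST also applies length corrections that are omitted here.

```python
import math

# Assumed statistical parameters (typical for gapped BLOSUM62, gap costs 11/1).
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


query_len = 296          # hypothetical query length
db_len = 1_000_000_000   # hypothetical total database length in residues

for raw in (50, 100, 200):
    print(raw, round(bit_score(raw), 1), f"{e_value(raw, query_len, db_len):.1e}")
```

Two consequences are worth noting: the E-value decreases exponentially with the score, and the same alignment receives a larger (worse) E-value when the searched database grows.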
□ BLAST alignment
■ identity and similarity level between the query and the aligned sequence
■ alignment length and coverage of the query sequence - the alignment is local, so always check that it covers a significant portion of the query (an alignment involving only a few amino acids of the query is not a significant hit)
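Both checks, E-value and query coverage, can be combined into a simple filter. This is a sketch: the `Hit` record and the coverage threshold of 0.7 are assumptions made for illustration, the E-value threshold of 0.01 is taken from the text above, and the query length of 296 as well as the second hit are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    e_value: float
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive


def query_coverage(hit, query_len):
    """Fraction of the query covered by the local alignment."""
    return (hit.query_end - hit.query_start + 1) / query_len


def is_reliable(hit, query_len, max_e=0.01, min_coverage=0.7):
    """Keep hits outside the grey zone that cover most of the query."""
    return hit.e_value <= max_e and query_coverage(hit, query_len) >= min_coverage


QUERY_LEN = 296  # hypothetical query length
hits = [
    Hit("AAT70109.1", 8e-81, 2, 292),    # E-value and query range of the alignment shown above
    Hit("short_motif_hit", 1e-4, 40, 64),  # hypothetical: significant but very short
    Hit("grey_zone_hit", 0.5, 5, 290),     # hypothetical: long but not significant
]

for hit in hits:
    print(hit.subject_id, round(query_coverage(hit, QUERY_LEN), 2), is_reliable(hit, QUERY_LEN))
```

Only the first hit passes both criteria; the second would be a typical case of a shared short motif or domain fragment rather than a full-length homolog.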
□ text-based search
■ good for finding evolutionarily "unrelated" proteins with some specific function
■ a large number of false negative (missed proteins with the target function) and false positive (identified proteins with a different function) results due to erroneous or inaccurate annotations
□ sequence-based search
■ good for finding members of a protein family (i.e., a group of evolutionarily related proteins sharing some specific function) -> not suitable for finding "unrelated" proteins
■ potential false positive results (i.e., proteins belonging to other evolutionarily related families)
■ searches using protein sequence queries are generally more sensitive than searches using nucleotide sequence queries (20 different amino acids vs. 4 different nucleotides)
□ combination of text-based and sequence-based approaches
1. text-based search
2. subdivision of the identified sequences into evolutionarily related groups
3. selection of a few representatives for each group
4. sequence-based searches using each representative as a query
■ potential false positive results - should be filtered
How to recognize interesting sequences?
■ sequence clustering
■ sequence comparison
■ information about host organisms
■ automated in silico enzyme identification
□ clustering based on pairwise sequence similarities
■ can be used for a fast and rough classification of sequences in large datasets (thousands of sequences)
-> effective way to filter results of database searches
-> identification of members of individual protein families
■ CLANS - visualization of pairwise sequence similarities in three-dimensional space -> overview of sequence space
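Clustering based on pairwise similarities can be sketched with single-linkage grouping. Note that this is not the CLANS algorithm (CLANS arranges sequences by a force-directed layout of BLAST-based similarities); it is a simpler stand-in that assumes pre-aligned sequences of equal length, identity as the similarity measure and a threshold of 0.6. All sequences are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two aligned, equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold=0.6):
    """Single linkage: sequences connected by similarity >= threshold share a cluster."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical pre-aligned sequence fragments.
seqs = {
    "s1": "HGNPTSSYLW",
    "s2": "HGNPTWSFLY",
    "s3": "HGNPTSSFLW",
    "s4": "MKVLAAGDEE",
    "s5": "MKVLSAGDEQ",
}
print(cluster(seqs))
```

With thousands of database hits, the cluster that contains the known target proteins marks the target family, and the remaining clusters can be discarded as probable false positives.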
[Figure: cluster map of sequence space; the known target proteins are highlighted and the cluster around them marks the target protein family]
[Figure: cluster map separating related families: C-C hydrolases, haloalkane dehalogenases (R-X + H2O -> R-OH + HX), perhydrolases and epoxide hydrolases; previously known members and new family members are marked with different symbols]
□ multiple sequence alignment
■ analysis of conserved residues within protein family -> identification of protein family members
[Figure: multiple sequence alignment of eight sequences; sequences 1-7 share the conserved residues Y, PFP, FP and LVP and are identified as family members, whereas sequence 8 lacks them and is excluded]
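The use of conserved residues for recognizing family members can be sketched as follows. The example assumes an existing gap-aware alignment in which the candidate is aligned to the same columns as the family; the sequences are hypothetical fragments, and requiring strict conservation (`min_fraction=1.0`) is an assumption of this sketch.

```python
from collections import Counter


def conserved_columns(alignment, min_fraction=1.0):
    """{column index: residue} for columns dominated by one residue (gaps excluded)."""
    conserved = {}
    for i, column in enumerate(zip(*alignment)):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(column) >= min_fraction:
            conserved[i] = residue
    return conserved


def is_family_member(candidate, conserved):
    """True if the aligned candidate carries every conserved residue."""
    return all(candidate[i] == aa for i, aa in conserved.items())


# Hypothetical aligned fragments of known family members.
family = ["AYDAPFPDHY", "AYEAPFPQRY", "VYDAPFPEEY"]
conserved = conserved_columns(family)
print(conserved)

print(is_family_member("AYDAPFPDIY", conserved))  # carries all conserved residues
print(is_family_member("YVEKFEEGYT", conserved))  # lacks them
```

Residues that are conserved across a family are often essential for structure or catalysis, so a hit lacking them is unlikely to have the target function even if its overall similarity is high.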
Sequence comparison
□ multiple sequence alignment
■ identification of sequences with unique features -> proteins with potentially novel characteristics
[Figure: multiple sequence alignment of putative family members (sequence labels: Shesp, Sheama, Pelpro, Desace, Xanaxo, Xylfas, Chlaur, Despsy, Rhobal, Burcen, Myctub, Nocfar, Jansp, uncbac, Erylit, Polsp, Mycavi, Maraqu, Caucre, Pseatl, Psycry, Shefri, Xanant); one sequence carries a long insertion that is absent from all other sequences and is highlighted as a unique feature]
□ phylogenetics
■ establishment of evolutionary relationships among sequences
Sequence comparison
□ phylogenetics
■ classification of sequences
[Figure: phylogenetic tree of haloalkane dehalogenases (HLDs) divided into the subfamilies HLD-I, HLD-II and HLD-III; labeled members include DhlA, DhmA, DmbA, DmbB, LinB, DatA, DbjA, DmlA, DrbA, DspA and others]
□ phylogenetics
■ information about experimental data -> selection of novel proteins
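One simple way to combine evolutionary distances with information about experimental data is sketched below. It is not a tree-building method: it only computes pairwise p-distances from an alignment and picks the uncharacterized sequence that is farthest from every characterized protein. This selection heuristic, the function names and the sequences are assumptions made for illustration.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, ignoring gap columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def most_novel(aligned, characterized):
    """Uncharacterized sequence whose nearest characterized relative is farthest away."""
    candidates = [name for name in aligned if name not in characterized]
    return max(
        candidates,
        key=lambda name: min(
            p_distance(aligned[name], aligned[ref]) for ref in characterized
        ),
    )


# Hypothetical aligned sequences; "known1" has experimental data.
aligned = {
    "known1": "HGNPTSSYLW",
    "new1": "HGNPTSSFLW",
    "new2": "HGQPSWSF-Y",
}
characterized = {"known1"}

for name in ("new1", "new2"):
    print(name, round(p_distance(aligned[name], aligned["known1"]), 3))
print(most_novel(aligned, characterized))
```

Sequences close to characterized proteins are likely to behave similarly, while distant branches of the tree are more likely to yield proteins with novel properties.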
Information about host organisms
□ extremophiles - microorganisms living in extreme conditions
■ geochemical extremes (pH, salinity)
■ physical extremes (temperature, pressure)
□ proteins from extremophiles
■ often adapted to extreme conditions -> unique characteristics, useful for practical applications
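Once organism metadata are available, hits from extremophiles can be flagged automatically. The sketch below assumes metadata records in the form of simple dictionaries and a hand-made list of labels that count as extreme; the first record uses the temperature range and salinity reported for Psychrobacter cryohalolentis K5 in GOLD, the other two records are hypothetical.

```python
# Labels treated as "extreme" in this sketch (assumption, not an official vocabulary).
EXTREME = {
    "temperature": {"Psychrophile", "Thermophile", "Hyperthermophile"},
    "salinity": {"Halophile", "Halotolerant"},
}

ORGANISMS = [
    {"name": "Psychrobacter cryohalolentis K5",
     "temperature": "Psychrophile", "salinity": "Halotolerant"},
    {"name": "hypothetical mesophile", "temperature": "Mesophile", "salinity": None},
    {"name": "hypothetical thermophile", "temperature": "Thermophile", "salinity": None},
]


def extreme_traits(record):
    """Metadata fields of an organism whose value is one of the extreme labels."""
    return {
        field: record[field]
        for field, labels in EXTREME.items()
        if record.get(field) in labels
    }


for organism in ORGANISMS:
    traits = extreme_traits(organism)
    if traits:
        print(organism["name"], traits)
```

Proteins from the flagged organisms are candidates for properties such as cold activity or salt tolerance, which still have to be confirmed experimentally.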
Information about host organisms
□ Genomes OnLine Database (GOLD)
■ http://www.genomesonline.org/
■ list of complete (>6,000), ongoing (> 27,000) and targeted genome (>1,000) projects
■ information about individual projects and source organisms
□ Entrez Genome
■ http://www.ncbi.nlm.nih.gov/sites/genome
■ provided by NCBI
■ data from more than 20,000 finished or ongoing genome projects (includes almost 10,000 organisms)
■ information about genome, source organism, genes, encoded proteins, graphical representations,...
□ GOLD
■ project statistics shown on the web page: metagenomes (studies: 370, samples: 2642); isolate genomes (complete projects: 4169, incomplete projects: 17714, targeted projects: 1600)
■ organism metadata (example record):
Field	Value
Oxygen requirement	Aerobe
Cell shape	Rod-shaped
Motility	Non motile
Temperature range	Psychrophile
Salinity	Halotolerant
Biotic relationships	Free living
■ fields left empty in this record: sporulation, pressure, pH, cell diameter, cell length, color, Gram staining
Information about host organisms
□ Entrez Genome
■ example record: Psychrobacter cryohalolentis (psychrotolerant organism)
■ lineage: Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Moraxellaceae; Psychrobacter; Psychrobacter cryohalolentis
■ Psychrobacter spp. are cold-adapted organisms that are commonly isolated from low-temperature and extreme environments such as permafrost or the Antarctic ice
■ representative strain: Psychrobacter cryohalolentis K5, isolated from saline liquid (12-14%) found 11-24 m below the surface within forty-thousand-year-old Siberian permafrost at the Kolyma-Indigirka lowland; human pathogen: no
■ genome: chromosome (INSDC CP000323.1; 3.06 Mb, GC 42.3%, 2,467 proteins) and plasmid 1 (INSDC CP000324.1; 0.041221 Mb, 44 proteins)
■ biological properties: shape: bacilli; motility: no; salinity: moderate halophilic; temperature range: psychrophilic; habitat: multiple
Automated in silico enzyme identification
□ Enzyme Miner
■ https://loschmidt.chemi.muni.cz/enzymeminer/
■ input: query sequences, essential residues, other known sequences
■ output: annotated putative enzymes, solubility prediction
Automated in silico enzyme identification
□ Enzyme Miner: automated mining of enzymes with diversified function
■ Hon, J., Borko, S., Bednar, D., Prokop, Z., Martinek, T., Damborsky, J., 2019: EnzymeMiner: Web Server for Automated Mining of Sequences Encoding Enzymes with Diversified Functions. Nucleic Acids Research (in preparation).
■ keywords: enzyme mining, protein function, diversity, sequence space, computational characterization
□ job input
■ query sequences: Swiss-Prot sequences (selected by EC number) or custom sequences in FASTA format (e.g., DrbA)
■ other known sequences (e.g., DmbC)
■ essential residue templates: one row per protein (DrbA, DmbB, DmbC) and one column per essential residue with its allowed amino acids and position: nucleophile (D), acid1 (D, E), acid2 (D, E), base (H), halide1-3 (H, N, Q, W, Y); e.g., DrbA: nucleophile 139, acid2 272
□ job output
■ downloads: result table (xlsx, tsv), raw results
■ target selection table with filters: solubility threshold, primary domains (e.g., PF00561, Abhydrolase_1), extra domain, known organism, temperature, salinity, biotic relationship, disease, transmembrane, with structure
■ excerpt of the target selection table:
Accession	Annotation	Identity to closest query [%]	Kingdom	Solubility	Sequence length
2PSD_A	Chain A, Crystal Stru...	41.5	E	0.9735	313
2PSF_A	Chain A, Crystal Stru...	41.5	E	0.9515	310
2PSJ_A	Chain A, Crystal Stru...	41.5	E	0.9614	312
What to keep in mind?
□ sequence databases
■ nucleotide: GenBank, EMBL-BANK, DDBJ; protein: UniProtKB, nr Protein database
■ errors in sequences and annotations
□ database searches
■ text-based: results influenced by sequence annotations
■ sequence-based: identification of family members - BLAST, PSI-BLAST - E-value
■ combination of both approaches: optimal strategy
■ false positive results: sequences should be filtered
□ selection of proteins for experimental characterization
■ clustering: classification and filtering of hits from database searches - CLANS
■ sequence comparison: classification and identification of unique sequences
■ sequences from extremophiles: potentially adapted to extreme conditions
■ Enzyme Miner: automated identification of interesting catalysts
What to keep in mind?
□ in silico identification and analysis of sequences - fast and cheap way to identify new proteins
3. PREPARATION OF RECOMBINANT PROTEINS, PROTEIN EXPRESSION AND PURIFICATION