Secondary database searching
Bioinformatics - lectures
why search secondary databases?
secondary databases
regular expressions
fingerprints
blocks
profiles
Hidden Markov Models
Why search secondary databases?
Interpretation of the results from primary database searches is sometimes difficult:
X.000.000 sequences from XX.000 organisms
complex and redundant search outputs
irrelevant matches of low-complexity sequences, repetitive sequences, modular sequences
local regions of similarity in multi-domain proteins
truncated description lines
Secondary database searches make it possible to identify both homology and, more exactingly, orthology.
Secondary databases
Secondary databases contain information derived from primary sequence data, typically in the form of abstractions: regular expressions, fingerprints, blocks, profiles or Hidden Markov Models.
These abstractions represent distillations of the most conserved features of multiple alignments.
The abstractions are useful for diagnosing the family membership of newly determined sequences.
Secondary databases
PROSITE - regular expressions
PRINTS - fingerprints
BLOCKS - blocks
PROFILES - profiles
PFAM - Hidden Markov Models
IDENTIFY - fuzzy regular expressions
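Terms used in sequence analysis methods
[Figure: an alignment containing insertions, with conserved regions labelled as motifs and the set of motifs labelled as a fingerprint. A single motif (cydeggis, cyedggis, cyeeggit, cyhgdggs, cyrgdgnt) is shown encoded as a frequency matrix, as a weight matrix (block), and as the regular expression C-Y-x2-[DG]-G-x-[ST].]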
Regular expressions
■ A regular expression reduces the sequence data to the most conserved residue information.
Multiple alignment
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
Regular expression [AS]-D-[IVL]-G-X5-C-[DE]-R-[FY]2-Q
Limitations:
■ stringent pattern - retrieves only identical matches and can miss remote relatives
■ fuzzier pattern - better chance of detecting remote relatives, but results in noisier output
■ a single motif may not be sufficient to infer the function
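The pattern above can be checked against its own alignment with a few lines of Python. This is a minimal sketch, not a PROSITE tool: the function name `pattern_to_regex` and the helper `matches` are hypothetical, and the parser only covers the notation used on these slides (single residues, `[allowed]` sets, `{excluded}` sets, `X` for any residue, and repeat counts written as `5` or `(5)`).

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate a slide-style pattern, e.g. [AS]-D-X5-[FY]2, into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        # An element is a residue, a [set] or an {excluded set}, optionally
        # followed by a repeat count written as 5 or (5).
        m = re.fullmatch(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])\(?(\d*)\)?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element {element!r}")
        residue, count = m.groups()
        if residue in ("X", "x"):
            residue = "."                          # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # any residue except these
        parts.append(residue + ("{%s}" % count if count else ""))
    return "".join(parts)

# Alignment and pattern copied from the slide.
ALIGNMENT = [
    "ADLGAVFALCDRYFQ",
    "SDVGPRSCFCERFYQ",
    "ADLGRTQNRCDRYYQ",
    "ADIGQPHSLCERYFQ",
]
PATTERN = "[AS]-D-[IVL]-G-X5-C-[DE]-R-[FY]2-Q"
regex = re.compile(pattern_to_regex(PATTERN))

def matches(sequence: str) -> bool:
    """True if the whole sequence fits the pattern."""
    return regex.fullmatch(sequence) is not None

for seq in ALIGNMENT:
    print(seq, matches(seq))
# A made-up relative with a conservative A -> T change at position 1 is missed.
print("TDLGAVFALCDRYFQ", matches("TDLGAVFALCDRYFQ"))
```

The last line illustrates the first limitation: a single substitution outside the allowed sets is enough for the stringent pattern to reject an otherwise identical sequence.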
Regular expressions
■ Regular expressions work most effectively when a particular protein family can be characterised by a highly conserved motif (10-20 residues).
Limitation: short patterns (3-4 residues) are not sufficiently discriminative.
Asp-Ala-Val-Ile-Asp (DAVID): 71 exact matches in OWL29.6
Asp-Ala-Val-Glu (DAVE): 1088 exact matches in OWL29.6
Regular expressions
■ Rules - short patterns that provide a guide to the possible existence of functional sites:
Functional site                            Regular expression
N-glycosylation site                       N-{P}-[ST]-{P}
Protein kinase C phosphorylation site      [ST]-X-[RK]
Casein kinase II phosphorylation site      [ST]-X(2)-[DE]
Asp and Asn hydroxylation site             C-X-[DN]-X(4)-[FY]-X-C
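A rough sketch of how such rules might be applied to a sequence is given below. The four patterns are translated by hand into Python regular expressions; the dictionary `RULES`, the function `scan` and the test sequence are hypothetical names and data chosen for illustration. A lookahead is used so that overlapping hits are reported as well.

```python
import re

# The four rules from the table, hand-translated to Python regex syntax.
RULES = {
    "N-glycosylation": r"N[^P][ST][^P]",
    "Protein kinase C phosphorylation": r"[ST].[RK]",
    "Casein kinase II phosphorylation": r"[ST].{2}[DE]",
    "Asp and Asn hydroxylation": r"C.[DN].{4}[FY].C",
}

def scan(sequence: str) -> dict:
    """Return, per rule, a list of (1-based start position, matched segment)."""
    hits = {}
    for name, regex in RULES.items():
        # Wrapping the pattern in a lookahead lets finditer report overlapping matches.
        overlapping = re.compile("(?=(" + regex + "))")
        hits[name] = [(m.start() + 1, m.group(1)) for m in overlapping.finditer(sequence)]
    return hits

# Made-up sequence, used only to exercise the rules.
HITS = scan("ANGTKSSAREE")
for rule, found in HITS.items():
    print(rule, found)
```

Because the rules are so short, a hit only indicates that a site is possible; as the slide says, it is a guide rather than evidence of function.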
Regular expressions
■ Fuzzy regular expressions introduce fuzziness into patterns by using groups of amino acids with similar biochemical properties (FYW - aromatic, HKR - basic, etc.).
Multiple alignment
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
Fuzzy regular expression [ASGPT]-D-[IVLM]-G-X5-C-[DENQ]-R-[FYW]2-Q
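The step from the strict pattern to the fuzzy one can be sketched as a simple widening of each bracketed set. The group list below is an assumption: it is read off the bracketed sets in the fuzzy pattern above, plus the basic group HKR named in the text, and the function name `widen` is hypothetical. Fully conserved single residues are left untouched, as on the slide.

```python
import re

# Assumed residue groups (not an official table): taken from the fuzzy
# pattern on the slide, plus the basic group HKR mentioned in the text.
GROUPS = ["ASGPT", "IVLM", "DENQ", "FYW", "HKR"]

def widen(pattern: str) -> str:
    """Replace every bracketed residue set by the union of the groups its members belong to."""
    def widen_set(match):
        residues = match.group(1)
        widened = []
        for group in GROUPS:
            if any(r in group for r in residues):
                widened.extend(r for r in group if r not in widened)
        # Residues that belong to no group are kept as they are.
        widened.extend(r for r in residues if r not in widened)
        return "[" + "".join(widened) + "]"
    return re.sub(r"\[([A-Z]+)\]", widen_set, pattern)

STRICT = "[AS]-D-[IVL]-G-X5-C-[DE]-R-[FY]2-Q"
FUZZY = widen(STRICT)
print(FUZZY)
```

With these groups the sketch reproduces the fuzzy pattern shown on the slide; a different choice of groups would give a different, more or less permissive, pattern.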
Residue                      Property                         Colour
Asp, Glu                     Acidic                           red
His, Arg, Lys                Basic                            blue
Ser, Thr, Asn, Gln           Polar neutral                    green
Ala, Val, Leu, Ile, Met      Hydrophobic aliphatic            white
Phe, Tyr, Trp                Hydrophobic aromatic             purple
Pro, Gly                     Special structural properties    brown
Cys                          Disulphide bond former           yellow
[Figure: diagram grouping amino acids by property; region labels include Aliphatic, Aromatic, Hydrophobic, Positive, Polar and Charged.]
Regular expressions
■ Introducing fuzziness into regular expressions increases the number of matches retrieved from the sequence database:
Regular expression             No. of exact matches (OWL29.6)
D-A-V-I-D                      71
D-A-V-I-[DENQ]                 252
[DENQ]-A-V-I-[DENQ]            925
[DENQ]-A-[VLI]-I-[DENQ]        2739
[DENQ]-[AQ]-[VLI]2-[DENQ]      51506
Fingerprints
Motivation: a multiple alignment often contains more than one conserved region.
Groups of motifs are excised from the sequence alignment and converted into matrices populated by the residue frequencies observed at each position.
Unweighted scoring system - no additional mutation or substitution matrices are employed.
Weighted scoring system - additional matrices are employed, resulting in a less sparse matrix but poorer signal-to-noise performance.
(a)
YVTVQHKKLRTPL
YVTVQHKKLRTPL
YVTVQHKKLRTPL
AATMKFKKLRHPL
AATMKFKKLRHPL
YIFATTKSLRTPA
VATLRYKKLRQPL
YIFGGTKSLRTPA
WVFSAAKSLRTPS
WIFSTSKSLRTPS
YLFTKTKSLQTPA
(b)
T	C	A	G	N	S	P	F	L	Y	H	Q	V	K	D	E	I	W	R	M	B	X	Z
0	0	2	0	0	0	0	0	0	7	0	0	1	0	0	0	0	2	0	0	0	0	0
0	0	3	0	0	0	0	0	2	0	0	0	4	0	0	0	3	0	0	0	0	0	0
0	0	0	0	0	0	0	6	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	1	1	0	3	0	0	1	0	0	0	3	0	0	0	0	0	0	2	0	0	0
0	0	1	1	0	0	0	0	0	0	0	3	0	4	0	0	0	0	1	0	0	0	0
0	0	1	0	0	1	0	2	0	1	3	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	12	0	0	0	0	0	0	0	0	0
0	0	0	0	0	6	0	0	0	0	0	0	0	6	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	12	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0	0	0	10	0	0	0	0
0	0	0	0	0	0	0	0	0	0	2	1	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	12	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	4	0	0	2	0	0	6	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Example of a frequency matrix derived from the initial unweighted motif (a) and the PAM-weighted matrix (b)
(a) [Figure: frequency matrix with one row per motif position (13 rows) and one column per residue code, in the order T C A G N S P F L Y H Q V K D E I W R M B X Z; cells hold residue counts.]
(b) [Figure: PAM-weighted matrix for the same motif, with one row per motif position and one column per residue; cells hold positive and negative integer scores.]
Blocks
Conserved motifs are located by a first motif-finding algorithm, which searches for spaced residue triplets (e.g., Ala-X-X-Val-X-Trp); block scores are calculated using the BLOSUM 62 substitution matrix.
Blocks are validated by a second motif-finding algorithm, which searches for the highest-scoring set of blocks that occur in the correct order without overlapping.
Sequences are clustered to avoid a bias due to identical sequences.
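The idea of a spaced residue triplet can be illustrated with a small enumeration. This is only a sketch of the triplet search, not the BLOCKS software: the function names, the `max_spacing` limit and the three example sequences are hypothetical, and the BLOSUM 62 scoring, the validation step and the clustering are left out.

```python
from collections import defaultdict

def spaced_triplets(sequence, max_spacing=4):
    """All triplets (aa1, d1, aa2, d2, aa3) where d1 and d2 are the distances between residues."""
    triplets = set()
    n = len(sequence)
    for i in range(n):
        for d1 in range(1, max_spacing + 1):
            for d2 in range(1, max_spacing + 1):
                k = i + d1 + d2
                if k < n:
                    triplets.add((sequence[i], d1, sequence[i + d1], d2, sequence[k]))
    return triplets

def shared_triplets(sequences, min_fraction=1.0, max_spacing=4):
    """Triplets present in at least min_fraction of the sequences, with their support."""
    support = defaultdict(int)
    for seq in sequences:
        for triplet in spaced_triplets(seq, max_spacing):
            support[triplet] += 1
    needed = min_fraction * len(sequences)
    return {t: c for t, c in support.items() if c >= needed}

def describe(triplet):
    """Render a triplet in one-letter form, e.g. ('A', 3, 'V', 2, 'W') as AxxVxW."""
    a, d1, b, d2, c = triplet
    return a + "x" * (d1 - 1) + b + "x" * (d2 - 1) + c

# Made-up sequences that all contain Ala-X-X-Val-X-Trp.
SEQUENCES = ["GGAKLVRWCC", "TTAPQVSWDE", "MAHHVKWPPP"]
SHARED = shared_triplets(SEQUENCES)
print(sorted(describe(t) for t in SHARED))
```

Triplets shared by most sequences mark candidate positions around which a block could be built; ranking and extending them is where the scoring matrix would come in.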
[Figure: plot for OPSD_SHEEP; x-axis: residue number (up to 350); y-axis from 0 to 100.]
CCKR_HUMAN  ( 362) SSCVNPIIYCFMNKRFR    3
CCKR_RAT    ( 378) SSCVNPIIYCFMNKRFR    3

FML2_HUMAN  ( 294) NSCLNPMLYVFVGQDFR    4
FMLR_HUMAN  ( 293) NSCLNPMLYVFMGQDFR    4
FMLR_MOUSE  ( 304) NSCLNPMLYVFMGQDFR    4
FMLR_RABIT  ( 295) NSCLNPMLYVFMGQDFR    4

GASR_CANFA  ( 388) SACVNPLVYCFMHRRFR    5
GASR_HUMAN  ( 382) SACVNPLVYCFMHRRFR    5
GASR_PRANA  ( 385) SACVNPLVYCFMHRRFR    5
GASR_RABIT  ( 387) SACVNPLVYCFMHRRFR    5
GASR_RAT    ( 387) SACVNPLVYCFMHRRFR    5

ET1R_BOVIN  ( 361) NSCINPIALYFVSKKFK    9
ET1R_RAT    ( 361) NSCINPIALYFVSKKFK    9
ETBR_BOVIN  ( 377) NSCINPIALYLVSKRFK    9
ETBR_HUMAN  ( 378) NSCINPIALYLVSKRFK    9
ETBR_PIG    ( 379) NSCINPIALYLVSKRFK    9
ETBR_RAT    ( 378) NSCINPIALYLVSKRFK    9

OPSD_LOLFO  ( 307) SAIHNPMIYSVSHPKFR   12
OPSD_OCTDO  ( 308) SAIHNPIVYSVSHPKFR   12
OPSD_TODPA  ( 306) SAIHNPMIYSVSHPKFR   12

P2UR_HUMAN  ( 296) NSCLDPVLYFLAGQRLV   13
P2UR_MOUSE  ( 298) NSCLDPVLYFLAGQRLV   13
P2UR_RAT    ( 297) NSCLDPVLYFLAGQRLV   13

5H6_RAT     ( 312) NSTMNPIIYPLFMRDFK   16

EDG1_HUMAN  ( 302) NSGTNPIIYTLTNKEMR   21

EBI2_HUMAN  ( 300) NCCMDPFIYFFACKGYK   23

OXYR_HUMAN  ( 321) NSCCNPWIYMLFTGHLF   24
OXYR_PIG    ( 323) NSCCNPWIYMLFTGHLF   24

V1AR_HUMAN  ( 340) NSCCNPWIYMFFSGHLL   18
V1AR_RAT    ( 346) NSCCNPWIYMFFSGHLL   18

PER3_BOVIN  ( 337) NQILDPWVYLLLRKILL   35
PER3_HUMAN  ( 338) NQILDPWVYLLLRKILL   35

YN84_CAEEL  ( 331) SCVAYFLIFTLLNRGIR  100
Profiles
Based on entire sequences.
Profiles define which residues are allowed at given positions, which positions are highly conserved and which degenerate, which positions can tolerate insertions.
The scoring system may include evolutionary weights and results from structural analysis.
/DEFAULT:   MI=-26;   X=-3;   IM=0;   MD=-26;   D=-3;   DM=0;
/M: SY='F'; M=-2,-3,-3,-4,2,-3,-2,1,-2,0,-1,-2,-3,-3,-4,-2,-1,0,-5,2;
/M: SY='I'; M=-1,-5,-2,-3,-2,-3,0,1,1,-1,1,-1,-2,-1,1,-1,0,1,-4,-4;
/M: SY='A'; M=2,-3,1,0,-5,2,-2,-1,-1,-3,-2,1,1,0,-2,2,2,0,-8,-5;
/M: SY='L'; M=-3,-8,-5,-4,2,-6,-2,2,-4,6,4,-3,-3,-2,-3,-3,-2,1,-3,0;
/M: SY='Y'; M=-4,-2,-6,-6,9,-7,0,-1,-5,-1,-3,-3,-6,-5,-6,-4,-4,-4,-1,11;
/M: SY='D'; M=1,-6,3,3,-7,0,0,-2,-1,-4,-3,2,0,1,-2,0,0,-2,-9,-6;
/M: SY='Y'; M=-5,-3,-6,-6,10,-7,-1,-1,-2,-1,-2,-3,-6,-5,-5,-4,-4,-4,-1,11;
/M: SY='K'; M=-1,-6,1,1,-4,-2,0,-2,2,-3,-1,1,-1,1,1,0,0,-3,-7,-6;
/M: SY='A'; M=1,-4,1,0,-5,1,-1,-1,0,-3,-1,1,0,0,0,1,1,-1,-7,-6;
/M: SY='R'; M=0,-5,0,0,-5,-1,0,-1,1,-3,-1,1,0,1,1,0,0,-2,-5,-5;
/M: SY='R'; M=0,-5,1,1,-6,0,1,-2,1,-4,-2,1,0,1,2,1,0,-2,-5,-5;
/M: SY='E'; M=1,-5,2,2,-6,0,0,-2,-1,-4,-2,1,1,1,-1,0,0,-3,-8,-6;
/M: SY='D'; M=0,-6,2,2,-6,0,1,-3,0,-5,-3,2,-1,2,-1,0,0,-4,-7,-4;
/M: SY='D'; M=0,-5,4,3,-6,0,0,-2,-1,-3,-2,2,-2,2,-2,0,-1,-3,-9,-6;
/M: SY='L'; M=-2,-8,-5,-5,2,-5,-3,3,-4,7,5,-4,-3,-3,-4,-3,-2,3,-4,-2;
/M: SY='S'; M=1,-4,1,1,-5,1,0,-2,1,-4,-2,1,0,0,0,1,1,-2,-6,-5;
/M: SY='F'; M=-3,-7,-6,-6,6,-5,-3,3,-2,5,3,-4,-5,-4,-5,-4,-3,1,-3,3;
/M: SY='Q'; M=-1,-6,0,0,-3,-2,1,-1,1,-2,0,0,-1,1,1,-1,0,-1,-6,-4;
/M: SY='K'; M=-1,-8,0,1,-3,-2,0,-2,3,-3,0,1,0,2,2,0,0,-3,-6,-6;
/M: SY='G'; M=2,-5,1,0,-7,7,-3,-4,-2,-6,-4,1,-1,-2,-4,2,0,-2,-10,-8;
/M: SY='D'; M=1,-7,5,4,-8,1,1,-3,0,-5,-3,2,-1,2,-2,0,0,-4,-10,-6;
/M: SY='I'; M=0,-5,-1,-2,-2,-2,-1,2,0,0,1,-1,-2,0,0,-1,0,1,-6,-5;
/M: SY='L'; M=-2,-6,-5,-5,3,-5,-3,4,-3,6,4,-4,-4,-3,-4,-3,-2,3,-5,0;
/M: SY='Q'; M=-1,-5,-1,-1,-3,-2,0,0,0,-2,-1,0,-1,0,0,-1,0,-1,-6,-3;
/M: SY='V'; M=0,-4,-3,-4,-1,-3,-3,5,-3,3,3,-2,-2,-2,-3,-2,0,5,-8,-4;
/M: SY='L'; M=-1,-6,-3,-3,-1,-3,-2,2,-3,3,2,-2,-2,-2,-3,-2,-1,2,-5,-3;
/M: SY='D'; M=0,-6,3,3,-6,0,1,-3,2,-5,-2,2,-1,2,1,0,0,-4,-7,-5;
/M: SY='K'; M=-1,-6,0,0,-2,-1,0,-3,3,-4,-1,1,-1,0,1,0,0,-3,-6,-4;
/M: SY='N'; M=1,-4,1,1,-5,0,0,-2,0,-3,-2,1,1,0,-1,1,1,-1,-7,-5;
/I: MI=0; I=-1; MD=0; /M: SY='X'; M=0; D=-1;
/M: SY='G'; M=1,-5,0,0,-5,1,-2,-1,-2,-3,-2,0,0,-1,-2,0,0,-1,-8,-6;
/M: SY='G'; M=1,-6,3,3,-7,3,0,-4,-1,-5,-4,2,-1,1,-2,1,0,-3,-10,-6;
/M: SY='W'; M=-9,-12,-9,-11,1,-11,-4,-8,-5,-3,-6,-6,-8,-7,3,-4,-8,-9,26,0;
/M: SY='W'; M=-7,-9,-9,-9,0,-9,-4,-5,-5,-1,-4,-6,-7,-6,2,-3,-6,-6,18,-1;
/M: SY='K'; M=-1,-7,0,0,-3,-2,0,-2,2,-3,-1,1,-1,1,2,0,-1,-3,-5,-5;
/M: SY='G'; M=2,-3,0,-1,-6,3,-3,-2,-3,-4,-3,0,0,-2,-3,1,0,0,-10,-6;
/M: SY='Q'; M=-2,-6,0,0,-3,-3,1,-2,0,-2,-1,0,-2,1,1,-1,-1,-3,-5,-3;
/I: MI=0; I=-2; MD=0; /M: SY='X'; M=0; D=-2;
/M: SY='T'; M=0,-4,-1,-1,-4,0,-2,0,-1,-2,0,0,-1,-1,-1,0,1,0,-7,-5;
/M: SY='T'; M=0,-5,0,0,-3,-1,-1,-1,1,-3,-1,1,-1,0,0,1,1,-1,-6,-4;
/M: SY='G'; M=0,-5,0,-1,-5,3,-2,-3,-1,-5,-3,0,-1,-1,-1,1,0,-2,-7,-6;
/M: SY='K'; M=0,-6,1,1,-5,-1,1,-2,2,-4,-1,1,-1,2,2,0,0,-3,-6,-6;
/M: SY='R'; M=-1,-6,-1,-1,-5,-3,1,-1,1,-3,-1,0,-1,1,3,-1,-1,-2,-2,-6;
/M: SY='G'; M=1,-5,0,0,-6,6,-3,-3,-3,-5,-4,0,-1,-2,-4,1,0,-2,-10,-6;
/M: SY='W'; M=-5,-5,-5,-5,2,-6,-2,-2,-4,-1,-3,-3,-6,-5,-3,-3,-4,-4,4,3;
/M: SY='F'; M=-3,-5,-6,-6,6,-5,-3,4,-1,3,2,-4,-4,-5,-4,-3,-2,2,-4,3;
/M: SY='P'; M=2,-4,-1,-1,-7,-1,0,-3,-2,-4,-3,-1,8,0,0,1,0,-2,-8,-7;
/M: SY='G'; M=1,-3,0,0,-4,2,-1,-2,0,-3,-2,0,0,-1,-1,1,1,-1,-6,-5;
/M: SY='N'; M=1,-5,2,1,-5,0,1,-2,1,-4,-2,2,0,0,0,1,1,-2,-7,-4;
/M: SY='Y'; M=-5,-1,-7,-7,10,-8,-1,-1,-5,-1,-3,-3,-7,-6,-6,-4,-4,-5,0,13;
/M: SY='V'; M=0,-3,-3,-5,-2,-2,-3,5,-3,2,2,-2,-2,-3,-4,-1,0,5,-8,-5;
/M: SY='E'; M=1,-6,2,3,-6,0,0,-2,1,-4,-2,1,0,2,0,0,0,-3,-8,-6;
/M: SY='P'; M=0,-5,-1,-1,-2,-2,-1,-2,-1,-3,-2,0,1,-1,-2,0,-1,-2,-6,-3;
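The position-specific scores in the /M: lines can be used to score a sequence segment. The sketch below parses the first three match positions of the profile above and slides them along a sequence without gaps. Two things are assumptions: the 20 scores are taken to be in alphabetical one-letter order (A, C, D, ... Y), which agrees with the highest score falling on the SY residue in rows such as SY='W', and the insert and delete penalties (MI, I, MD, D) are ignored altogether. All function names are hypothetical.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the M= scores

# First three match positions of the profile shown above.
PROFILE_LINES = [
    "/M: SY='F'; M=-2,-3,-3,-4,2,-3,-2,1,-2,0,-1,-2,-3,-3,-4,-2,-1,0,-5,2;",
    "/M: SY='I'; M=-1,-5,-2,-3,-2,-3,0,1,1,-1,1,-1,-2,-1,1,-1,0,1,-4,-4;",
    "/M: SY='A'; M=2,-3,1,0,-5,2,-2,-1,-1,-3,-2,1,1,0,-2,2,2,0,-8,-5;",
]

def parse_match_line(line):
    """Return (consensus symbol, {residue: score}) for one /M: line."""
    m = re.search(r"SY='(.)';\s*M=([-\d,]+);", line)
    if m is None:
        raise ValueError(f"not a match line: {line!r}")
    scores = [int(v) for v in m.group(2).split(",")]
    if len(scores) != len(AMINO_ACIDS):
        raise ValueError("expected 20 scores per match position")
    return m.group(1), dict(zip(AMINO_ACIDS, scores))

PROFILE = [parse_match_line(line)[1] for line in PROFILE_LINES]

def score_segment(profile, segment):
    """Ungapped score: sum of the position-specific scores of the residues."""
    return sum(position[aa] for position, aa in zip(profile, segment))

def best_window(profile, sequence):
    """Best-scoring ungapped placement as (score, 0-based start)."""
    width = len(profile)
    windows = range(len(sequence) - width + 1)
    return max((score_segment(profile, sequence[i:i + width]), -i) for i in windows)

print(score_segment(PROFILE, "FIA"))
best_score, neg_start = best_window(PROFILE, "GGWCFIAKK")
print(best_score, -neg_start)
```

A real profile search also allows insertions and deletions at the penalties given in the /DEFAULT: and /I: lines, which requires dynamic programming rather than a sliding window.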
Hidden Markov Models
Based on entire sequences.
HMMs are probabilistic models consisting of a number of interconnected states: linear chains of match, delete and insert states.
Each position in the multiple alignment is assigned to a match, insert or delete state.
Construction: seed alignment, iterative sequence gathering, final alignment (all automatic).
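How alignment positions map onto match, insert and delete states can be sketched with a common heuristic, which is an assumption here and not necessarily the rule used by Pfam: a column is a match column if at most half of its entries are gaps, a gap in a match column is a delete state, and a residue in any other column is an insert state. The toy alignment and the function names are hypothetical.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Flag each alignment column as a match column (True) or an insert column (False)."""
    n_seqs = len(alignment)
    return [col.count("-") / n_seqs <= max_gap_fraction for col in zip(*alignment)]

def state_path(aligned_seq, is_match):
    """State sequence of one aligned row: M = match, D = delete, I = insert."""
    path = []
    for symbol, match_col in zip(aligned_seq, is_match):
        if match_col:
            path.append("M" if symbol != "-" else "D")
        elif symbol != "-":
            path.append("I")   # residue in a mostly-gap column
        # a gap in an insert column uses no state at all
    return "".join(path)

# Toy alignment; '-' marks a gap.
ALIGNMENT = ["AC-GT", "AC-GT", "ACTG-", "A--GT"]
IS_MATCH = match_columns(ALIGNMENT)
PATHS = [state_path(row, IS_MATCH) for row in ALIGNMENT]
for row, path in zip(ALIGNMENT, PATHS):
    print(row, path)
```

Counting how often each residue is emitted in each match state, and how often each transition between states is used, would then give the emission and transition probabilities of the model.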