Secondary database searching

Bioinformatics - lectures
- Introduction
- Information networks
- Protein information resources
- Genome information resources
- DNA sequence analysis
- Pairwise sequence alignment
- Multiple sequence alignment
- Secondary database searching
- Analysis packages
- Protein structure modelling

Secondary database searching
- why search secondary databases?
- secondary databases
- regular expressions
- fingerprints
- blocks
- profiles
- Hidden Markov Models

Why search secondary databases?
Interpretation of the results from database searches is sometimes difficult:
- primary databases: X.000.000 sequences from XX.000 organisms
- complex and redundant search outputs
- irrelevant matches of low-complexity sequences, repetitive sequences, modular sequences
- local regions of similarity in multi-domain proteins
- truncated description lines
Secondary database searches make it possible to identify both homology and the more exacting orthology.

Secondary databases
Contain information derived from primary sequence data, typically in the form of abstractions: regular expressions, fingerprints, blocks, profiles or Hidden Markov Models.
These abstractions represent distillations of the most conserved features of multiple alignments.
The abstractions are useful for discriminating family membership of newly determined sequences.

Secondary databases
PROSITE - regular expressions
PRINTS - fingerprints
BLOCKS - blocks
PROFILES - profiles
PFAM - Hidden Markov Models
IDENTIFY - fuzzy regular expressions

Terms used in sequence analysis methods
[Figure: an alignment with its conserved motifs marked. Labels: fingerprint, motif, insertions, frequency matrix, weight matrix (block), regular expression. Example motif: CYDEGGIS, CYEDGGIS, CYEEGGIT, CYHGDGGS, CYRGDGNT; corresponding regular expression: C-Y-X2-[DG]-G-X-[ST]]

Regular expressions
■ A regular expression reduces the sequence data to the most conserved residue information.
Multiple alignment:
    ADLGAVFALCDRYFQ
    SDVGPRSCFCERFYQ
    ADLGRTQNRCDRYYQ
    ADIGQPHSLCERYFQ
Regular expression:
    [AS]-D-[IVL]-G-X5-C-[DE]-R-[FY]2-Q

Limitations:
- stringent pattern - retrieves only identical matches and can miss remote relatives
- fuzzier pattern - better chance of detecting remote relatives, but produces noisier output
- a single motif may not be sufficient to infer the function

Regular expressions
■ Regular expressions work most effectively when a particular protein family can be characterised by a highly conserved motif (10-20 residues).
Limitation: short patterns (3-4 residues) are not sufficiently discriminative.

Pattern                        Exact matches in OWL29.6
Asp-Ala-Val-Ile-Asp (DAVID)    71
Asp-Ala-Val-Glu (DAVE)         1088

Regular expressions
■ Rules - short patterns that can be used as a guide to the possible existence of functional sites:

Functional site                           Regular expression
N-glycosylation site                      N-{P}-[ST]-{P}
Protein kinase C phosphorylation site     [ST]-X-[RK]
Casein kinase II phosphorylation site     [ST]-X(2)-[DE]
Asp and Asn hydroxylation site            C-X-[DN]-X(4)-[FY]-X-C

Regular expressions
■ Fuzzy regular expressions - regular expressions in which fuzziness is introduced by allowing groups of amino acids with similar biochemical properties (FYW - aromatic, HKR - basic, etc.).
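The pattern notation used on these slides maps almost one-to-one onto ordinary regular expressions. The sketch below is a minimal illustration, not the PROSITE scanning software: the function names (`pattern_to_regex`, `matches`, `find_sites`) and the two test sequences marked as hypothetical are assumptions made for the example. It accepts both repeat styles seen above (`X5` and `X(4)`), and it shows the trade-off listed under "Limitations": the stringent pattern matches the four aligned sequences but rejects a relative with conservative substitutions, while a fuzzier pattern built from property groups accepts it.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate a slide-style pattern (elements joined by '-') to Python regex syntax."""
    element_re = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)\)|(\d+))?")
    parts = []
    for element in pattern.split("-"):
        m = element_re.fullmatch(element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, n_paren, n_plain = m.groups()
        if core in ("x", "X"):
            regex = "."                          # X: any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"      # {P}: any residue except P
        else:
            regex = core                         # single residue or [..] set
        count = n_paren or n_plain               # X(4) and X4 both mean 4 repeats
        parts.append(regex + ("{%s}" % count if count else ""))
    return "".join(parts)

def matches(pattern: str, sequence: str) -> bool:
    """True if the whole sequence fits the pattern."""
    return re.fullmatch(pattern_to_regex(pattern), sequence) is not None

def find_sites(pattern: str, sequence: str) -> list:
    """0-based start positions of all (possibly overlapping) pattern matches."""
    lookahead = "(?=" + pattern_to_regex(pattern) + ")"
    return [m.start() for m in re.finditer(lookahead, sequence)]

# Alignment and patterns taken from the slides.
ALIGNMENT = ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
             "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]
STRICT = "[AS]-D-[IVL]-G-X5-C-[DE]-R-[FY]2-Q"
FUZZY = "[ASGPT]-D-[IVLM]-G-X5-C-[DENQ]-R-[FYW]2-Q"

# Hypothetical remote relative (not from the slides): G, M, N and W replace
# residues of the same property group.
RELATIVE = "GDMGAVFALCNRWFQ"

print(pattern_to_regex(STRICT))
print([matches(STRICT, s) for s in ALIGNMENT])
print(matches(STRICT, RELATIVE), matches(FUZZY, RELATIVE))

# N-glycosylation rule applied to a hypothetical peptide.
print(find_sites("N-{P}-[ST]-{P}", "MNGSANPTQ"))
```

Short rules such as `N-{P}-[ST]-{P}` match very often by chance, which is why the slides describe them only as a guide to possible functional sites.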
Multiple alignment:
    ADLGAVFALCDRYFQ
    SDVGPRSCFCERFYQ
    ADLGRTQNRCDRYYQ
    ADIGQPHSLCERYFQ
Fuzzy regular expression:
    [ASGPT]-D-[IVLM]-G-X5-C-[DENQ]-R-[FYW]2-Q

Residue                    Property                        Colour
Asp, Glu                   Acidic                          red
His, Arg, Lys              Basic                           blue
Ser, Thr, Asn, Gln         Polar neutral                   green
Ala, Val, Leu, Ile, Met    Hydrophobic aliphatic           white
Phe, Tyr, Trp              Hydrophobic aromatic            purple
Pro, Gly                   Special structural properties   brown
Cys                        Disulphide bond former          yellow

[Figure: Venn diagram of amino acid properties; legible labels: Aliphatic, Aromatic, Hydrophobic, Positive, Polar, Charged]

Regular expressions
■ Introducing fuzziness into regular expressions increases the number of matches retrieved from the sequence database:

Regular expression              No. of exact matches (OWL29.6)
D-A-V-I-D                       71
D-A-V-I-[DENQ]                  252
[DENQ]-A-V-I-[DENQ]             925
[DENQ]-A-[VLI]-I-[DENQ]         2739
[DENQ]-[AQ]-[VLI]2-[DENQ]       51506

Fingerprints
Motivation: a multiple alignment often contains more than one conserved region.
Groups of motifs are excised from the alignment and converted into matrices populated by the residue frequencies observed at each position.
Unweighted scoring system - no additional mutation or substitution matrices are employed.
Weighted scoring system - additional matrices are employed, resulting in a less sparse matrix but poorer signal-to-noise performance.
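The conversion of an aligned motif into a frequency matrix, and its use as an unweighted scoring system, can be sketched in a few lines. This is only an illustration under stated assumptions, not the PRINTS implementation: the function names, the residue ordering in `ALPHABET`, and the query sequence are hypothetical, and the motif is simply the four-sequence alignment used in the regular expression examples. A window of a query sequence is scored by summing, position by position, how often its residue was observed in the motif; no substitution matrix is involved.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # assumed column order: one-letter codes, alphabetical

def frequency_matrix(motif):
    """One row per motif position, one column per residue; entries are raw counts."""
    length = len(motif[0])
    if any(len(seq) != length for seq in motif):
        raise ValueError("all motif sequences must have the same length")
    matrix = []
    for column in zip(*motif):              # walk through the alignment column by column
        counts = Counter(column)
        matrix.append([counts.get(res, 0) for res in ALPHABET])
    return matrix

def score_window(matrix, window):
    """Unweighted score: sum of the observed counts for the window's residues."""
    return sum(row[ALPHABET.index(res)]
               for row, res in zip(matrix, window) if res in ALPHABET)

def best_hit(matrix, sequence):
    """Slide the matrix along the sequence; return (best score, 0-based offset)."""
    width = len(matrix)
    return max((score_window(matrix, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Motif taken from the alignment shown in the regular expression slides.
MOTIF = ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
         "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]
matrix = frequency_matrix(MOTIF)

# Hypothetical query sequence containing the first motif sequence at offset 3.
query = "MKT" + "ADLGAVFALCDRYFQ" + "GH"
print(best_hit(matrix, query))
```

With only four sequences most cells of the matrix are zero, which is the sparseness that the weighted scheme trades against signal-to-noise.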
(a) Aligned motif (as extracted):
    YVTVQHKKLRTPL
    YVTVQHKKLRTPL
    YVTVQHKKLRTPL
    AATMKFKKLRHPL
    AATMKFKKLRHPL
    YIFATTKSLRTPA
    VATLRYKKLRQPL
    YIFGGTKSLRTPA
    WVFSAAKSLRTPS
    WIFSTSKSLRTPS
    YLFTKTKSLQTPA
(b) [Figure: frequency matrix of the motif, one row per motif position and one column per residue in the order T C A G N S P F L Y H Q V K D E I W R M B X Z; matrix values not reliably recoverable from the extraction]

Example of frequency matrix derived from initial unweighted motif (a) and PAM-weighted matrix (b)
[Figure: (a) frequency matrix, (b) PAM-weighted matrix; residue columns in the order T C A G N S P F L Y H Q V K D E I W R M B X Z; matrix values not recoverable from the extraction]

Blocks
- Conserved motifs are located by a first motif-finding algorithm: a search for spaced residue triplets (e.g., Ala-X-X-Val-X-Trp); a block score is weighted using the BLOSUM 62 substitution matrix.
- Blocks are validated by a second motif-finding algorithm: a search for the highest-scoring set of blocks in the correct order without overlapping.
- Sequences are clustered to avoid a bias due to identical sequences.

[Figure: plot for OPSD_SHEEP; horizontal axis: residue number (0 to 350); plot content not recoverable from the extraction]

ID            Start    Segment              Weight
CCKR_HUMAN    ( 362)   SSCVNPIIYCFMNKRFR      3
CCKR_RAT      ( 378)   SSCVMPIIYCFMNKRFR      3
FML2_HUMAN    ( 294)   NSCLNPMLYVFVGQDFR      4
FMLR_HUMAN    ( 293)   NSCLNPMLYVFMGQDFR      4
FMLR_MOUSE    ( 304)   NSCLNPMLYVFMGQDFR      4
FMLR_RABIT    ( 295)   NSCLNPMLYVFMGQDFR      4
GASR_CANFA    ( 388)   SACVNPLVYCFMHRRFR      5
GASR_HUMAN    ( 382)   SACVNPLVYCFMHRRFR      5
GASR_PRANA    ( 385)   SACVNPLVYCFMHRRFR      5
GASR_RABIT    ( 387)   SACVNPLVYCFMHRRFR      5
GASR_RAT      ( 387)   SACVNPLVYCFMHRRFR      5
ET1R_BOVIN    ( 361)   NSCINPIALYFVSKKFK      9
ET1R_RAT      ( 361)   NSCINPIALYFVSKKFK      9
ETBR_BOVIN    ( 377)   NSCINPIALYLVSKRFK      9
ETBR_HUMAN    ( 378)   NSCINPIALYLVSKRFK      9
ETBR_PIG      ( 379)   NSCINPIALYLVSKRFK      9
ETBR_RAT      ( 378)   NSCINPIALYLVSKRFK      9
OPSD_LOLFO    ( 307)   SAIHNPMIYSVSHPKFR     12
OPSD_OCTDO    ( 308)   SAIHNPIVYSVSHPKFR     12
OPSD_TODPA    ( 306)   SAIHNPMIYSVSHPKFR     12
P2UR_HUMAN    ( 296)   NSCLDPVLYFLAGQRLV     13
P2UR_MOUSE    ( 298)   NSCLDPVLYFLAGQRLV     13
P2UR_RAT      ( 297)   NSCLDPVLYFLAGQRLV     13
5H6_RAT       ( 312)   NSTMNPIIYPLFMRDFK     16
EDG1_HUMAN    ( 302)   NSGTNPIIYTLTNKEMR     21
EBI2_HUMAN    ( 300)   NCCMDPFIYFFACKGYK     23
OXYR_HUMAN    ( 321)   NSCCNPWIYMLFTGHLF     24
OXYR_PIG      ( 323)   NSCCNPWIYMLFTGHLF     24
V1AR_HUMAN    ( 340)   NSCCNPWIYMFFSGHLL     18
V1AR_RAT      ( 346)   NSCCNPWIYMFFSGHLL     18
PER3_BOVIN    ( 337)   NQILDPWVYLLLRKILL     35
PER3_HUMAN    ( 338)   NQILDPWVYLLLRKILL     35
YN84_CAEEL    ( 331)   SCVAYFLIFTLLNRGIR    100

Profiles
Based on entire sequences.
Profiles define which residues are allowed at given positions, which positions are highly conserved and which degenerate, and which positions can tolerate insertions.
The scoring system may include evolutionary weights and results from structural analysis.
/DEFAULT: MI=-26; I=-3; IM=0; MD=-26; D=-3; DM=0;
/M: SY='F';M=-2,-3,-3,-4,2,-3,-2,1,-2,0,-1,-2,-3,-3,-4,-2,-1,0,-5,2;
/M: SY='I';M=-1,-5,-2,-3,-2,-3,0,1,1,-1,1,-1,-2,-1,1,-1,0,1,-4,-4;
/M: SY='A';M=2,-3,1,0,-5,2,-2,-1,-1,-3,-2,1,1,0,-2,2,2,0,-8,-5;
/M: SY='L';M=-3,-8,-5,-4,2,-6,-2,2,-4,6,4,-3,-3,-2,-3,-3,-2,1,-3,0;
/M: SY='Y';M=-4,-2,-6,-6,9,-7,0,-1,-5,-1,-3,-3,-6,-5,-6,-4,-4,-4,-1,11;
/M: SY='D';M=1,-6,3,3,-7,0,0,-2,-1,-4,-3,2,0,1,-2,0,0,-2,-9,-6;
/M: SY='Y';M=-5,-3,-6,-6,10,-7,-1,-1,-2,-1,-2,-3,-6,-5,-5,-4,-4,-4,-1,11;
/M: SY='K';M=-1,-6,1,1,-4,-2,0,-2,2,-3,-1,1,-1,1,1,0,0,-3,-7,-6;
/M: SY='A';M=1,-4,1,0,-5,1,-1,-1,0,-3,-1,1,0,0,0,1,1,-1,-7,-6;
/M: SY='R';M=0,-5,0,0,-5,-1,0,-1,1,-3,-1,1,0,1,1,0,0,-2,-5,-5;
/M: SY='R';M=0,-5,1,1,-6,0,1,-2,1,-4,-2,1,0,1,2,1,0,-2,-5,-5;
/M: SY='E';M=1,-5,2,2,-6,0,0,-2,-1,-4,-2,1,1,1,-1,0,0,-3,-8,-6;
/M: SY='D';M=0,-6,2,2,-6,0,1,-3,0,-5,-3,2,-1,2,-1,0,0,-4,-7,-4;
/M: SY='D';M=0,-5,4,3,-6,0,0,-2,-1,-3,-2,2,-2,2,-2,0,-1,-3,-9,-6;
/M: SY='L';M=-2,-8,-5,-5,2,-5,-3,3,-4,7,5,-4,-3,-3,-4,-3,-2,3,-4,-2;
/M: SY='S';M=1,-4,1,1,-5,1,0,-2,1,-4,-2,1,0,0,0,1,1,-2,-6,-5;
/M: SY='F';M=-3,-7,-6,-6,6,-5,-3,3,-2,5,3,-4,-5,-4,-5,-4,-3,1,-3,3;
/M: SY='Q';M=-1,-6,0,0,-3,-2,1,-1,1,-2,0,0,-1,1,1,-1,0,-1,-6,-4;
/M: SY='K';M=-1,-8,0,1,-3,-2,0,-2,3,-3,0,1,0,2,2,0,0,-3,-6,-6;
/M: SY='G';M=2,-5,1,0,-7,7,-3,-4,-2,-6,-4,1,-1,-2,-4,2,0,-2,-10,-8;
/M: SY='D';M=1,-7,5,4,-8,1,1,-3,0,-5,-3,2,-1,2,-2,0,0,-4,-10,-6;
/M: SY='I';M=0,-5,-1,-2,-2,-2,-1,2,0,0,1,-1,-2,0,0,-1,0,1,-6,-5;
/M: SY='L';M=-2,-6,-5,-5,3,-5,-3,4,-3,6,4,-4,-4,-3,-4,-3,-2,3,-5,0;
/M: SY='Q';M=-1,-5,-1,-1,-3,-2,0,0,0,-2,-1,0,-1,0,0,-1,0,-1,-6,-3;
/M: SY='V';M=0,-4,-3,-4,-1,-3,-3,5,-3,3,3,-2,-2,-2,-3,-2,0,5,-8,-4;
/M: SY='L';M=-1,-6,-3,-3,-1,-3,-2,2,-3,3,2,-2,-2,-2,-3,-2,-1,2,-5,-3;
/M: SY='D';M=0,-6,3,3,-6,0,1,-3,2,-5,-2,2,-1,2,1,0,0,-4,-7,-5;
/M: SY='K';M=-1,-6,0,0,-2,-1,0,-3,3,-4,-1,1,-1,0,1,0,0,-3,-6,-4;
/M: SY='N';M=1,-4,1,1,-5,0,0,-2,0,-3,-2,1,1,0,-1,1,1,-1,-7,-5;
/I: MI=0; I=-1; MD=0;
/M: SY='X'; M=0; D=-1;
/M: SY='G';M=1,-5,0,0,-5,1,-2,-1,-2,-3,-2,0,0,-1,-2,0,0,-1,-8,-6;
/M: SY='G';M=1,-6,3,3,-7,3,0,-4,-1,-5,-4,2,-1,1,-2,1,0,-3,-10,-6;
/M: SY='W';M=-9,-12,-9,-11,1,-11,-4,-8,-5,-3,-6,-6,-8,-7,3,-4,-8,-9,26,0;
/M: SY='W';M=-7,-9,-9,-9,0,-9,-4,-5,-5,-1,-4,-6,-7,-6,2,-3,-6,-6,18,-1;
/M: SY='K';M=-1,-7,0,0,-3,-2,0,-2,2,-3,-1,1,-1,1,2,0,-1,-3,-5,-5;
/M: SY='G';M=2,-3,0,-1,-6,3,-3,-2,-3,-4,-3,0,0,-2,-3,1,0,0,-10,-6;
/M: SY='Q';M=-2,-6,0,0,-3,-3,1,-2,0,-2,-1,0,-2,1,1,-1,-1,-3,-5,-3;
/I: MI=0; I=-2; MD=0;
/M: SY='X'; M=0; D=-2;
/M: SY='T';M=0,-4,-1,-1,-4,0,-2,0,-1,-2,0,0,-1,-1,-1,0,1,0,-7,-5;
/M: SY='T';M=0,-5,0,0,-3,-1,-1,-1,1,-3,-1,1,-1,0,0,1,1,-1,-6,-4;
/M: SY='G';M=0,-5,0,-1,-5,3,-2,-3,-1,-5,-3,0,-1,-1,-1,1,0,-2,-7,-6;
/M: SY='K';M=0,-6,1,1,-5,-1,1,-2,2,-4,-1,1,-1,2,2,0,0,-3,-6,-6;
/M: SY='R';M=-1,-6,-1,-1,-5,-3,1,-1,1,-3,-1,0,-1,1,3,-1,-1,-2,-2,-6;
/M: SY='G';M=1,-5,0,0,-6,6,-3,-3,-3,-5,-4,0,-1,-2,-4,1,0,-2,-10,-6;
/M: SY='W';M=-5,-5,-5,-5,2,-6,-2,-2,-4,-1,-3,-3,-6,-5,-3,-3,-4,-4,4,3;
/M: SY='F';M=-3,-5,-6,-6,6,-5,-3,4,-1,3,2,-4,-4,-5,-4,-3,-2,2,-4,3;
/M: SY='P';M=2,-4,-1,-1,-7,-1,0,-3,-2,-4,-3,-1,8,0,0,1,0,-2,-8,-7;
/M: SY='G';M=1,-3,0,0,-4,2,-1,-2,0,-3,-2,0,0,-1,-1,1,1,-1,-6,-5;
/M: SY='N';M=1,-5,2,1,-5,0,1,-2,1,-4,-2,2,0,0,0,1,1,-2,-7,-4;
/M: SY='Y';M=-5,-1,-7,-7,10,-8,-1,-1,-5,-1,-3,-3,-7,-6,-6,-4,-4,-5,0,13;
/M: SY='V';M=0,-3,-3,-5,-2,-2,-3,5,-3,2,2,-2,-2,-3,-4,-1,0,5,-8,-5;
/M: SY='E';M=1,-6,2,3,-6,0,0,-2,1,-4,-2,1,0,2,0,0,0,-3,-8,-6;
/M: SY='P';M=0,-5,-1,-1,-2,-2,-1,-2,-1,-3,-2,0,1,-1,-2,0,-1,-2,-6,-3;

Hidden Markov Models
Based on entire sequences.
HMMs are probabilistic models consisting of a number of interconnecting states - linear chains of match, delete or insert states.
Each position in the multiple alignment is assigned to a match, insert or delete state.
Construction: seed alignment, iterative sequence gathering, final alignment (all automatic).
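How alignment positions map onto match, insert and delete states can be illustrated with a small sketch. The rule used here is an assumption, not something stated on the slides: a column is treated as a match column when at most half of its sequences have a gap, and as an insert column otherwise (a common heuristic, with the threshold exposed as a parameter). The function names and the toy alignment are hypothetical, and no transition or emission probabilities are estimated; the code only derives the state path each sequence would take through the linear chain.

```python
def assign_columns(alignment, gap="-", max_gap_fraction=0.5):
    """Return one boolean per alignment column: True = match column, False = insert column."""
    n = len(alignment)
    is_match = []
    for column in zip(*alignment):
        gap_fraction = column.count(gap) / n
        is_match.append(gap_fraction <= max_gap_fraction)   # assumed threshold rule
    return is_match

def state_path(sequence, is_match, gap="-"):
    """State path of one aligned sequence: M = match, D = delete, I = insert."""
    path = []
    for residue, match_column in zip(sequence, is_match):
        if match_column:
            # In a match column a residue uses the match state, a gap the delete state.
            path.append("D" if residue == gap else "M")
        elif residue != gap:
            # In an insert column only sequences that carry a residue visit the insert state.
            path.append("I")
    return "".join(path)

# Hypothetical toy alignment: column 3 is mostly gaps, so it becomes an insert column.
TOY_ALIGNMENT = ["AD-GQ",
                 "AD-GQ",
                 "ADLGQ",
                 "A--GQ"]

columns = assign_columns(TOY_ALIGNMENT)
paths = [state_path(seq, columns) for seq in TOY_ALIGNMENT]
print(columns)
print(paths)
```

In a full profile HMM the counts of residues seen in each match state and of the M, I and D transitions would then be turned into emission and transition probabilities.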