Bioinformatics - lectures

Introduction
Information networks
Protein information resources
Genome information resources
DNA sequence analysis
Pairwise sequence alignment
Multiple sequence alignment
Secondary database searching
Analysis packages
Protein structure modelling

Multiple sequence alignment

multiple sequence alignment
consensus sequence
manual methods
simultaneous and progressive methods
databases of multiple sequence alignments
hybrid approach for database searching

Multiple sequence alignment

A multiple sequence alignment is a 2D table in which the rows represent individual sequences and the columns represent residue positions.
Multiple sequence alignments are essential for the analysis of gene families.
Sequence-based multiple sequence alignments: constructed from similar strings of amino acid residues.
Structure-based multiple sequence alignments: constructed from structural evidence.

Colour-coded multiple sequence alignments

[Figure: colour-coded alignment of sequences including chk-H5, with a secondary-structure track marking Alpha 1 and Alpha 2.]

Multiple alignment and the consensus sequence

            1    2    3    4    5    6    7    8    9    10
I           Y    D    G    G    A    V    -    E    A    L
II          Y    D    G    G    -    -    -    E    A    L
III         F    E    G    G    I    L    V    E    A    L
IV          F    D    -    G    I    L    V    Q    A    V
V           Y    E    G    G    A    V    V    Q    A    L
Consensus   y    d    G    G    A/I  V/L  v    e    A

Multiple alignment and the profile, block and fingerprint

Regular expression: C-Y-X2-[DG]-G-x-[ST]
Block: ungapped block of residues (weighted)
Profile: weighted matrix

Manual methods

Manual methods are subjective, but they allow experimental evidence (e.g., mutagenesis data, structural knowledge) to be incorporated into the multiple alignment.
Manually refining the alignments produced by automatic methods is the best approach.
Intuitive colouring schemes assist the eye in spotting similarities.
Relatedness is evaluated quantitatively by calculating residue identities and similarities.

Residue                    Property                        Colour
Asp, Glu                   Acidic                          red
His, Arg, Lys              Basic                           blue
Ser, Thr, Asn, Gln         Polar neutral                   green
Ala, Val, Leu, Ile, Met    Hydrophobic aliphatic           white
Phe, Tyr, Trp              Hydrophobic aromatic            purple
Pro, Gly                   Special structural properties   brown
Cys                        Disulphide bond former          yellow

[Figure: diagram grouping amino acids by property (aliphatic, aromatic, hydrophobic, positive, polar, charged).]

Simultaneous methods

Simultaneous methods align all sequences in a given set at once, rather than aligning pairs of sequences or building sequence clusters.
They extend the 2D dynamic programming matrix to more dimensions.
Number of dimensions = number of sequences.
Suitable only for small sets of short sequences.

Progressive methods

A multi-dimensional dynamic programming matrix is not practical for realistic problems, i.e. larger sets of longer sequences.

CLUSTAL
1. construction of an evolutionary tree
2. pairwise alignment of closely related sequences, followed by addition of less related sequences
3. final alignment, final evolutionary tree

CLUSTALW
- positioning of gaps in closely related sequences according to their variability

Databases of multiple alignments

Multiple alignments bring together sequences from different species.
This important evolutionary information can enhance the sensitivity of database searches.
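To make concrete how an alignment can be condensed into something searchable, the following minimal Python sketch turns the five-row example alignment from the consensus slide into a regular expression. The function name `alignment_to_regex`, the use of `-` for gaps and the rule that a column containing a gap becomes optional are assumptions made for illustration; they are not a method given in the lecture, and real pattern databases are curated far more carefully.

```python
import re

# Rows I-V of the example alignment, transcribed with '-' for gaps.
ALIGNMENT = [
    "YDGGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]


def alignment_to_regex(rows, gap="-"):
    """Collapse each alignment column into one regex element (hypothetical helper)."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same length")
    parts = []
    for column in zip(*rows):
        residues = sorted(set(column) - {gap})
        if not residues:
            continue  # a column made only of gaps contributes nothing
        if len(residues) == 1:
            element = residues[0]
        else:
            element = "[" + "".join(residues) + "]"
        if gap in column:
            element += "?"  # assumption: a gapped column may be skipped
        parts.append(element)
    return "".join(parts)


pattern = alignment_to_regex(ALIGNMENT)
print(pattern)
for row in ALIGNMENT:
    ungapped = row.replace("-", "")
    print(ungapped, bool(re.fullmatch(pattern, ungapped)))
```

Every sequence in the alignment matches the pattern derived from it, but so does any other sequence that combines the observed residues, which is one reason such abstractions trade specificity for sensitivity.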
Various abstractions (regular expressions, profiles, blocks, fingerprints or HMMs) can be searched against sequence databases.
The more information used in a query, the higher the sensitivity.
Results of searches that use alignments are more difficult to interpret.

Databases of multiple alignments

Multiple alignment databases available via the Web are produced automatically (e.g., PFAM) or manually (e.g., PRINTS).
Iterative automatic methods may include false-positive sequences in the alignment, which corrupt it through the insertion of many unrealistic gaps.

Pfam: protein families database of alignments and HMMs (The Sanger Centre)

zf-C2H2
Accession number: PF00096
Zinc finger, C2H2 type
Figure 1: 1a1h, complex (zinc finger/DNA): QGSR (Zif268 variant) zinc finger-DNA complex (GCAC site)

The C2H2 zinc finger is the classical zinc finger domain. The two conserved cysteines and histidines co-ordinate a zinc ion. The following pattern describes the zinc finger.

#-X-C-X(1-5)-C-X3-#-X5-#-X2-H-X(3-6)-[H/C]

Where X can be any amino acid, and numbers in brackets indicate the number of residues. The positions marked # are those that are important for the stable fold of the zinc finger. The final position can be either His or Cys. The C2H2 zinc finger is composed of two short beta strands followed by an alpha helix. The amino terminal part of the helix binds the major groove in DNA binding zinc fingers.

INTERPRO description (entry IPR000822)

Zinc finger domains [MEDLINE: 88151019], PUB00005329 are nucleic acid-binding protein structures first identified in the Xenopus transcription factor TFIIIA. These domains have since been found in numerous nucleic acid-binding proteins. A zinc finger domain is composed of 25 to 30 amino-acid residues including 2 conserved Cys and 2 conserved His residues in a C-2-C-12-H-3-H type motif. The 12 residues separating the second Cys and the first His are mainly polar and basic, implicating this region in particular in nucleic acid binding.
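The Pfam pattern for the C2H2 zinc finger quoted above can be read as a regular expression. The sketch below, in Python, is one hedged way to do the translation: the helper `pattern_to_regex` and the test peptide are hypothetical, and because the source does not say which residues are allowed at the `#` positions, they are treated here as "any residue", which is an assumption.

```python
import re

# Pattern as given in the Pfam description of zf-C2H2.
PFAM_PATTERN = "#-X-C-X(1-5)-C-X3-#-X5-#-X2-H-X(3-6)-[H/C]"

# One token is either a bracketed choice or a symbol with an optional repeat.
TOKEN = re.compile(r"\[[^\]]+\]|[#A-Z](?:\(\d+-\d+\)|\d+)?")


def pattern_to_regex(pattern):
    """Translate the slide's pattern notation into Python regex syntax."""
    parts = []
    for token in TOKEN.findall(pattern):
        if token.startswith("["):
            # [H/C] means "H or C"
            parts.append("[" + token[1:-1].replace("/", "") + "]")
            continue
        symbol, suffix = token[0], token[1:]
        # Assumption: '#' (fold-critical position) is treated like X, any residue.
        base = "." if symbol in "#X" else symbol
        span = re.fullmatch(r"\((\d+)-(\d+)\)", suffix)
        if span:
            base += "{" + span.group(1) + "," + span.group(2) + "}"
        elif suffix:
            base += "{" + suffix + "}"
        parts.append(base)
    return "".join(parts)


regex = pattern_to_regex(PFAM_PATTERN)
print(regex)

# Illustrative finger-like peptide, not taken from the lecture.
peptide = "YACPVESCDRRFSRSDELTRHIRIH"
print(bool(re.fullmatch(regex, peptide)))
```

A pattern this permissive will also match unrelated sequences that happen to have cysteines and histidines at suitable spacings, so a regex hit is a hint rather than proof of a zinc finger.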
The zinc finger motif is an unusually small, self-folding domain in which Zn is a crucial component of its tertiary structure. All bind 1 atom of Zn in a tetrahedral array to yield a finger-like projection, which interacts with nucleotides in the major groove of the nucleic acid. The Zn binds to the conserved Cys and His residues. Fingers have been found to bind to about 5 base pairs of nucleic acid containing short runs of guanine residues. They have the ability to bind to both RNA and DNA, a versatility not demonstrated by the helix-turn-helix motif. The zinc finger may thus represent the original nucleic acid binding protein. It has also been suggested that a Zn-centred domain could be used in a protein interaction, e.g. in protein kinase C. Many classes of zinc fingers are characterized according to the number and positions of the histidine and cysteine residues involved in the zinc atom coordination. In the first class to be characterized, called C2H2, the first pair of zinc coordinating residues are cysteines, while the second pair are histidines.

For additional annotation, see the PROSITE document PDOC00028.

[Figure: Pfam page for zf-C2H2 showing HMM scores and the alignment of zinc finger sequences, each row labelled with its accession number (e.g., P25490, P07247, P08045).]

Hybrid approach for database searching

PSI-BLAST
- Position-Specific Iterated BLAST
- algorithm by Altschul et al. (1997)
- incorporates elements of both pairwise and multiple sequence alignment methods
- procedure: initial search, creation of position-specific profiles from the hits, new search, ... repeated in iterations
- advantage: detects even very weak similarities
- disadvantages: the profile can be diluted if low-complexity regions are not masked; inclusion of a single false-positive sequence in the profile biases it towards unrelated sequences

Graphic hit list from a database search using PSI-BLAST

[Figure: graphic hit list with a colour key for alignment scores (<40, 40-50, ...) plotted along query positions 0 to 250.]
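The search, profile, search-again loop described for PSI-BLAST can be illustrated with a toy Python sketch. This is not the BLAST algorithm: it assumes ungapped peptides of equal length, scores a sequence by summing plain column frequencies instead of log-odds values, and uses a made-up query, database and threshold. All names (`build_profile`, `score`, `iterative_search`, `DATABASE`) are hypothetical.

```python
from collections import Counter


def build_profile(members):
    """Per-column residue frequencies for equal-length, ungapped sequences."""
    n = len(members)
    return [
        {res: count / n for res, count in Counter(column).items()}
        for column in zip(*members)
    ]


def score(profile, seq):
    """Sum, over positions, the profile frequency of the residue observed."""
    return sum(column.get(res, 0.0) for column, res in zip(profile, seq))


def iterative_search(query, database, threshold, max_rounds=10):
    """Repeat search and profile rebuilding until the hit list stops changing."""
    hits, history = [], []
    for _ in range(max_rounds):
        # The profile is built from the query plus all current hits.
        profile = build_profile([query] + [database[name] for name in hits])
        new_hits = sorted(
            name for name, seq in database.items()
            if score(profile, seq) >= threshold
        )
        history.append(new_hits)
        if new_hits == hits:
            break
        hits = new_hits
    return hits, history


# Made-up data for illustration only.
QUERY = "ACDEFG"
DATABASE = {
    "seq_a": "ACDKFW",  # close to the query
    "seq_b": "ACDKYW",  # close to seq_a, more distant from the query
    "seq_c": "MCNKYW",  # distant
    "seq_d": "PQRSTV",  # unrelated
}

final_hits, rounds = iterative_search(QUERY, DATABASE, threshold=3.5)
print(rounds)
print(final_hits)
```

In this toy run the first round finds only `seq_a`; once `seq_a` is in the profile, `seq_b` scores above the threshold as well. The same mechanism explains the disadvantage noted above: a false positive admitted in one round shifts the profile and can pull in further unrelated sequences.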