Bioinformatics - lectures
Introduction, Information networks, Protein information resources, Genome information resources, DNA sequence analysis, Pairwise sequence alignment, Multiple sequence alignment, Secondary database searching, Analysis packages, Protein structure modelling
Multiple sequence alignment
multiple sequence alignment
consensus sequence
manual methods
simultaneous and progressive methods
databases of multiple sequence alignments
hybrid approach for database searching
Multiple sequence alignment
A multiple sequence alignment is a 2D table in which the rows represent individual sequences and the columns represent residue positions.
Multiple sequence alignments are essential for the analysis of gene families.
Sequence-based multiple sequence alignments are constructed according to similar strings of amino acid residues.
Structure-based multiple sequence alignments are constructed according to structural evidence.
Colour-coded multiple sequence alignments
[Figure: colour-coded multiple sequence alignment of six sequences, annotated with the positions of secondary-structure elements (alpha helices and beta strands).]
Multiple sequence alignment
Construction of a multiple sequence alignment:
- the positioning of residues within any sequence is preserved (absolute positions)
- similar residues in all sequences are brought into vertical register (relative positions)
All residues in a single column of an alignment share the same relative position but may have different absolute positions (unless the sequences are identical).
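The distinction between relative and absolute positions can be illustrated with a minimal Python sketch. The function name `absolute_positions` and the two aligned rows are hypothetical, chosen only for illustration.

```python
def absolute_positions(aligned_row, gap_chars="-."):
    """Map each alignment column (relative position) to the residue's
    1-based absolute position in the ungapped sequence, or None for a gap."""
    positions = []
    count = 0
    for symbol in aligned_row:
        if symbol in gap_chars:
            positions.append(None)   # a gap has no absolute position
        else:
            count += 1
            positions.append(count)
    return positions

# Two rows of a hypothetical alignment (illustrative data only).
row_a = "YDGGAV-EAL"
row_b = "YDGG---EAL"

pos_a = absolute_positions(row_a)
pos_b = absolute_positions(row_b)

# Column index 7 holds E in both rows: same relative position,
# different absolute positions.
print(pos_a[7], pos_b[7])  # prints: 7 5
```

Because gaps shift only the columns, the residue order inside each sequence is untouched, which is exactly the constraint listed above.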
Consensus sequence
The alignment table can be summarised by:
- a single line: pseudo-sequence
- an unweighted matrix: fingerprint
- an ungapped block of residues (weighted): block
- a weighted matrix: profile
Multiple alignment and the consensus sequence
            1    2    3    4    5    6    7    8    9    10
I           Y    D    G    G    A    V    -    E    A    L
II          Y    D    G    G    -    -    -    E    A    L
III         F    E    G    G    I    L    V    E    A    L
IV          F    D    -    G    I    L    V    Q    A    V
V           Y    E    G    G    A    V    V    Q    A    L
consensus   y    d    G    G    A/I  V/L  v    e    A
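A consensus line of this kind can be derived mechanically. The sketch below applies one possible rule, which is an assumption and not necessarily the rule used for the slide: a residue present in at least 80% of the rows is written in upper case, a weaker plurality winner in lower case, and tied residues are joined with "/". Only columns 1 to 9 of the table are used; the names `consensus` and `strong` are hypothetical.

```python
from collections import Counter

def consensus(rows, gap="-", strong=0.8):
    """Summarise each alignment column as a consensus symbol."""
    n_rows = len(rows)
    summary = []
    for column in zip(*rows):
        counts = Counter(r for r in column if r != gap)
        if not counts:               # column made only of gaps
            summary.append(gap)
            continue
        top = max(counts.values())
        # Residues sharing the highest count, in order of first appearance.
        winners = [r for r, c in counts.items() if c == top]
        if len(winners) > 1:
            summary.append("/".join(winners))
        elif top / n_rows >= strong:
            summary.append(winners[0].upper())
        else:
            summary.append(winners[0].lower())
    return summary

rows = [
    "YDGGAV-EA",  # I
    "YDGG---EA",  # II
    "FEGGILVEA",  # III
    "FD-GILVQA",  # IV
    "YEGGAVVQA",  # V
]
print(" ".join(consensus(rows)))  # prints: y d G G A/I V/L v e A
```

With the 80% threshold this rule happens to reproduce the consensus row of the table; other thresholds or similarity-aware rules are equally valid choices.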
Multiple alignment and the profile, block and fingerprint
C-Y-X2-[DG]-G-x-[ST]
regular expression
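A pattern written in this style can be translated into an ordinary regular expression and searched against a sequence. The following is a minimal sketch that handles only the elements appearing in the pattern above (literal residues, `x`, `X2`-style repeats and bracketed classes); the function name `pattern_to_regex` and the test sequence are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Translate a dash-separated motif pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        if element.lower() == "x":
            parts.append(".")                     # any single residue
        elif element[0] in "xX" and element[1:].isdigit():
            parts.append(".{%s}" % element[1:])   # fixed run of any residues
        else:
            parts.append(element)                 # literal residue or [..] class
    return "".join(parts)

regex = pattern_to_regex("C-Y-X2-[DG]-G-x-[ST]")
print(regex)  # prints: CY.{2}[DG]G.[ST]

# Hypothetical sequence used only to demonstrate the search.
match = re.search(regex, "MKACYRLDGWSPQ")
print(match.start(), match.group())  # prints: 3 CYRLDGWS
```

A regular expression either matches or does not, so a single unexpected substitution at a fixed position causes a true family member to be missed; profiles and HMMs soften this by scoring positions instead.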
Manual methods
Manual methods are subjective; however, they make it possible to incorporate experimental evidence (e.g., mutagenesis data, structural knowledge) into the multiple alignment.
Manual refinement of multiple alignments produced by automatic methods is the best approach.
Intuitive colouring schemes assist the eye in spotting similarities.
Quantitative evaluation of relatedness through calculation of residue identities/similarities.
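As a rough illustration of such a quantitative measure, the sketch below computes percent identity between two aligned rows. Several denominator conventions exist; here columns containing a gap in either row are skipped, which is an assumption made for this example. The function name and the input rows are hypothetical.

```python
def percent_identity(row_a, row_b, gap_chars="-."):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = identical = 0
    for a, b in zip(row_a, row_b):
        if a in gap_chars or b in gap_chars:
            continue                 # ignore gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

# 5 identical residues out of 9 compared columns.
print(round(percent_identity("YDGGAV-EAL", "FEGGILVEAL"), 1))  # prints: 55.6
```

A similarity score would be obtained the same way, counting residues from the same property group (for example the colour groups) as matches instead of requiring identity.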
Residue                    Property                         Colour
Asp, Glu                   Acidic                           red
His, Arg, Lys              Basic                            blue
Ser, Thr, Asn, Gln         Polar neutral                    green
Ala, Val, Leu, Ile, Met    Hydrophobic aliphatic            white
Phe, Tyr, Trp              Hydrophobic aromatic             purple
Pro, Gly                   Special structural properties    brown
Cys                        Disulphide bond former           yellow
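The colouring scheme is easy to apply programmatically. The sketch below encodes the seven groups with one-letter amino acid codes and returns the colour for each symbol of an aligned row; the names `COLOUR_GROUPS` and `colour_row` are hypothetical, and gaps or unknown symbols are mapped to "none" as an assumption.

```python
# One-letter codes for the residue groups of the colouring scheme.
COLOUR_GROUPS = {
    "red": "DE",        # acidic
    "blue": "HRK",      # basic
    "green": "STNQ",    # polar neutral
    "white": "AVLIM",   # hydrophobic aliphatic
    "purple": "FYW",    # hydrophobic aromatic
    "brown": "PG",      # special structural properties
    "yellow": "C",      # disulphide bond former
}

RESIDUE_COLOUR = {
    residue: colour
    for colour, residues in COLOUR_GROUPS.items()
    for residue in residues
}

def colour_row(aligned_row):
    """Return the colour name for every symbol in an aligned row."""
    return [RESIDUE_COLOUR.get(symbol.upper(), "none") for symbol in aligned_row]

print(colour_row("YDG-C"))
# prints: ['purple', 'red', 'brown', 'none', 'yellow']
```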
[Figure: diagram grouping amino acids by property: aliphatic, aromatic, hydrophobic, positive, polar, charged.]
Simultaneous methods
Simultaneous methods align all sequences in a given set at once, rather than aligning pairs of sequences or building sequence clusters.
Extension of 2D dynamic programming matrix to more dimensions.
Number of dimensions = number of sequences.
Suitable only for small sets of short sequences.
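The restriction to small sets of short sequences follows from the size of the matrix. As a back-of-the-envelope sketch, a full dynamic programming matrix for sequences of lengths L1..LN has (L1+1)(L2+1)...(LN+1) cells, and each cell can be reached from up to 2^N - 1 neighbouring cells. The function names below are hypothetical.

```python
def dp_cells(lengths):
    """Number of cells in a full N-dimensional dynamic programming matrix."""
    cells = 1
    for length in lengths:
        cells *= length + 1          # one extra row/column for the leading gap
    return cells

def predecessors(n_sequences):
    """Number of neighbouring cells feeding each cell (all gap/residue choices)."""
    return 2 ** n_sequences - 1

for n in (2, 3, 5, 10):
    print(n, dp_cells([100] * n), predecessors(n))
```

For two sequences of 100 residues the matrix has 10201 cells; for three it already has 1030301, and the count keeps multiplying by 101 with every added sequence.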
Progressive methods
A multi-dimensional dynamic programming matrix is not applicable to realistic problems, i.e. larger sets of longer sequences.
CLUSTAL
1. construction of an evolutionary tree
2. pairwise alignment of closely related sequences, followed by addition of less related sequences
3. final alignment, final evolutionary tree
CLUSTALW
- positioning of gaps in closely related sequences according to their variability
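The first step, building a tree that fixes the order in which sequences are merged, can be sketched as follows. This is not the CLUSTAL implementation: `difflib.SequenceMatcher` stands in for a proper pairwise alignment score, and simple average-linkage clustering is assumed for the tree. The function names and the four toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def distance(a, b):
    # Stand-in for a pairwise alignment score: 1 - similarity ratio.
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()

def guide_tree(seqs):
    """seqs: dict name -> sequence. Returns nested tuples giving the join order."""
    dist = {frozenset(pair): distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def average(i, j):
        pairs = [(a, b) for a in clusters[i][1] for b in clusters[j][1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(*ij))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

toy = {
    "A": "ACDEFGHIKL",
    "B": "ACDEFGHIKM",
    "C": "ACDQFGHLKM",
    "D": "WWYYPPNNRR",
}
print(guide_tree(toy))
```

In a progressive method the innermost pair of the tree would be aligned first, and the remaining sequences added in order of decreasing relatedness, so the unrelated sequence D is joined last.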
Databases of multiple alignments
Multiple alignments bring together sequences from different species. This important evolutionary information can enhance sensitivity of database searches.
Various abstractions (regular expressions, profiles, blocks, fingerprints or HMMs) can be searched against sequence databases. The more information a query uses, the higher the sensitivity.
Results of searches using multiple alignments are more difficult to interpret.
Databases of multiple alignments
Multiple alignment databases available via the Web are produced automatically (e.g., Pfam) or manually (e.g., PRINTS).
Iterative automatic methods may include false-positive sequences in the alignment, which corrupt it by forcing the insertion of many unrealistic gaps.
Pfam (The Sanger Centre)
Protein families database of alignments and HMMs
zf-C2H2
Figure 1: 1a1h, complex (zinc finger/DNA): QGSR (Zif268 variant) zinc finger-DNA complex (GCAC site)
Accession number: PF00096
Zinc finger, C2H2 type
The C2H2 zinc finger is the classical zinc finger domain. The two conserved cysteines and histidines co-ordinate a zinc ion. The following pattern describes the zinc finger:
#-X-C-X(1-5)-C-X3-#-X5-#-X2-H-X(3-6)-[H/C]
where X can be any amino acid, and numbers in brackets indicate the number of residues. The positions marked # are those that are important for the stable fold of the zinc finger. The final position can be either His or Cys. The C2H2 zinc finger is composed of two short beta strands followed by an alpha helix. The amino-terminal part of the helix binds the major groove in DNA-binding zinc fingers.
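The zinc finger pattern quoted in this description can be written as a regular expression. In the sketch below the positions marked # are treated as "any residue", which is a simplifying assumption since the description does not list which residues are allowed there. The name `ZF_C2H2` and the test sequence are illustrative only.

```python
import re

# '#' (fold-stabilising position) and X are both treated as "any residue".
ZF_C2H2 = re.compile(
    r"."        # '#'
    r"."        # X
    r"C"
    r".{1,5}"   # X(1-5)
    r"C"
    r".{3}"     # X3
    r"."        # '#'
    r".{5}"     # X5
    r"."        # '#'
    r".{2}"     # X2
    r"H"
    r".{3,6}"   # X(3-6)
    r"[HC]"     # final position: His or Cys
)

# Illustrative sequence containing one C2H2-like finger.
sequence = "RPYACPVESCDRRFSRSDELTRHIRIHTGQKP"
match = ZF_C2H2.search(sequence)
print(match.start(), match.group())
# prints: 2 YACPVESCDRRFSRSDELTRHIRIH
```

Treating # as a wildcard makes the expression more permissive than the original pattern, so hits found this way would need checking against the family alignment or HMM.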
INTERPRO description (entry IPR000822)
Zinc finger domains [MEDLINE: 88151019], PUB00005329 are nucleic acid-binding protein structures first identified in the Xenopus transcription factor TFIIIA. These domains have since been found in numerous nucleic acid-binding proteins. A zinc finger domain is composed of 25 to 30 amino-acid residues including 2 conserved Cys and 2 conserved His residues in a C-2-C-12-H-3-H type motif. The 12 residues separating the second Cys and the first His are mainly polar and basic, implicating this region in particular in nucleic acid binding. The zinc finger motif is an unusually small, self-folding domain in which Zn is a crucial component of its tertiary structure. All bind 1 atom of Zn in a tetrahedral array to yield a finger-like projection, which interacts with nucleotides in the major groove of the nucleic acid. The Zn binds to the conserved Cys and His residues. Fingers have been found to bind to about 5 base pairs of nucleic acid containing short runs of guanine residues. They have the ability to bind to both RNA and DNA, a versatility not demonstrated by the helix-turn-helix motif. The zinc finger may thus represent the original nucleic acid binding protein. It has also been suggested that a Zn-centred domain could be used in a protein interaction, e.g. in protein kinase C. Many classes of zinc fingers are characterized according to the number and positions of the histidine and cysteine residues involved in the zinc atom coordination. In the first class to be characterized, called C2H2, the first pair of zinc coordinating residues are cysteines, while the second pair are histidines.
For additional annotation, see the PROSITE document PDOC00028.
[Figure: Pfam page showing part of the zf-C2H2 multiple alignment, with the aligned zinc finger segments and their sequence accession numbers.]
Hybrid approach for database searching
PSI-BLAST
- Position-Specific Iterated BLAST
- algorithm by Altschul et al. (1997)
- incorporates elements of both pairwise and multiple sequence alignment methods
- procedure: initial search, creation of position-specific profiles from the hits, new search, and so on in iterations
- advantage: detects even very weak similarities
- disadvantages: the profile can be diluted if low-complexity regions are not masked; inclusion of a single false-positive sequence in the profile biases it towards unrelated sequences
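The iterative procedure can be illustrated with a toy sketch. It is not the PSI-BLAST algorithm itself: it assumes ungapped hits of fixed length, a uniform background frequency, simple pseudocounts, and a raw-score cut-off in place of E-values. All names (`build_pssm`, `best_window`, `psi_search`), the query, the database and the threshold are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(hits, pseudocount=1.0):
    """Log-odds position-specific matrix from equal-length, ungapped hits."""
    background = 1.0 / len(AMINO_ACIDS)          # uniform background (assumption)
    pssm = []
    for column in zip(*hits):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def best_window(pssm, sequence):
    """Return (score, start) of the best-scoring ungapped window, or None."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best

def psi_search(query, database, rounds=3, threshold=6.0):
    """Search, rebuild the profile from the hits, and search again."""
    hits = [query]
    for _ in range(rounds):
        pssm = build_pssm(hits)
        new_hits = [query]
        for seq in database:
            found = best_window(pssm, seq)
            if found is not None and found[0] >= threshold:
                start = found[1]
                new_hits.append(seq[start:start + len(query)])
        if new_hits == hits:                     # converged: no change in hits
            break
        hits = new_hits
    return hits

query = "CPECGKSF"
database = ["AAACPECGKAFSS", "MMCAECGRSFLL", "WWWWWWWWWWWW"]
print(psi_search(query, database, rounds=1))
print(psi_search(query, database))
```

In this toy data set the second database sequence scores below the cut-off against the query alone and is picked up only once the first hit has been folded into the profile, which mirrors the gain in sensitivity. The same mechanism explains the disadvantage: a false positive admitted in one round shapes every later profile.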
Graphic hit list from a database search using PSI-BLAST
[Figure: hits drawn along the query sequence (positions 0 to 250) and colour-coded by alignment score.]