Bioinformatics
Jiří Damborský
National Center for Biomolecular Research
jiri@chemi.muni.cz, ph. 41129 377, Kotlářská 2, bldg. 7, 2nd floor
Bioinformatics - what is it?
■ The term bioinformatics is used to encompass almost all computer applications in biological sciences.
■  Information technology applied to the management and analysis of biological data
■  originally - analysis of sequence data (80s)
■  presently - also analysis of 3D-structures
Bioinformatics - study material
■  Introduction to bioinformatics, T.K. Attwood and D.J. Parry-Smith, Longman, Essex, 1999.
■  copy of the slides
■  http://www.chemi.muni.cz/~jiri
■  http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/prefacefrm.html
Bioinformatics - composition
12 lectures per semester
3 hours per week
■  1st and 2nd hour = lectures - theory
■  3rd hour = practical course on computers
Bioinformatics - lectures
■  Introduction
■  Information networks
■  Protein information resources
■  Genome information resources
■  DNA sequence analysis
■  Pairwise sequence alignment
■  Multiple sequence alignment
■  Secondary database searching
■ Analysis packages
■  Protein structure modelling
Bioinformatics - practical training
■  Biological databases
■  Searching and modelling servers
■  Building a sequence search protocol
■  Case examples
■  Protein structure prediction
■  Protein modelling
■  Follow-up of lectures
Introduction
■  history of sequencing
■  what is Bioinformatics?
■  sequence to structure deficit
■  genome projects
■  why is Bioinformatics important?
■  pattern recognition and prediction
■  folding problem
■  sequence analysis
■  homo/analogy and ortho/paralogy
History of sequencing
■ Protein sequencing
>- separation of peptides, identification and quantification of amino acids
>- Edman degradation
>- mass-spectrometry - advantage in identification of post-translational modifications
>- 1955 sequencing of the peptide insulin
>- 1960 sequencing of enzyme ribonuclease
>- 1980s automated sequencers
History of sequencing
■ Nucleic acid sequencing
>- tRNA - short, could be purified
>- DNA - large (human chromosome 55-250 x 10^6 bp); the longest fragment for sequencing is 500 bp; purification is problematic
>- advent of gene cloning and PCR
>- 1972 DNA cloning
>- 1975 DNA sequencing
>- 1980s and 1990s sequence revolution
The history of technology developments and structure determination

Transfer RNA: (a) primary sequence and (b) tertiary structure
Automatic sequencing machine
ABI Prism 310, Applied Biosystems
Automated production line in sequencing "factory"
Whitehead Institute, Center for Genome Research, USA
Sequencing chromatogram
What is Bioinformatics?
■  improvements in DNA sequencing technologies and computer-based technologies
■  originally - analysis of sequence data (1980s)
■  presently - also analysis of 3D-structures
■ The term bioinformatics is used to encompass almost all computer applications in biological sciences.
■  Information technology applied to the management and analysis of biological data.
The sequence to structure deficit		
[Figure: x-axis - date of database release (1988, 1993, 1998)]
Genome projects
■  1977 first complete genome - virus φX174, 5000 nucleotides; 11 genes
■  1995 first complete genome of a living organism, Haemophilus influenzae, 1.8 million nucleotides and 1700 genes
■  sequencing of model systems: Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Canis familiaris, Mus musculus
The genome size of various species		
[Figure: genome sizes on logarithmic axes (bits and nucleotides): φX174 phage (1977), λ phage (1982), cytomegalovirus (1990), Haemophilus influenzae (1995), Escherichia coli (1997), budding yeast (1997), Arabidopsis, rice, mouse and human, compared with the size of GenBank releases (10/82, 9/87, 10/97)]
Comparative genomic analysis of model organisms
Organism	Genome size (Mb)	Gene number	Haploid chromosome number
Bacterium (Escherichia coli)	~4	4,403	1
Yeast (Saccharomyces cerevisiae)	~12	6,190	16
Worm (Caenorhabditis elegans)	97	19,730	6
Fruit fly (Drosophila melanogaster)	120	13,601	4
Mouse (Mus musculus)	3,454	~50,000 (estimated)	20
Human (Homo sapiens)	2,910	33,609	23
Human Genome Project
■  Human Genome Project initiated in the mid-1980s
■  estimated 100,000 genes and completion in 2005
■  need for automated sequencing and improved computational techniques
■  shotgun method
■  sequencing of rough draft first
■  first draft completed in 2000 by the publicly funded International Human Genome Project consortium and the company Celera Genomics
Human Genome Project
■  ~33,000 genes
■  genes are complex due to alternative splicing
■  >1,000,000 proteins (estimated)
■  hundreds of genes resulted from horizontal transfer from bacteria (in vertebrate lineage)
■  dozens of genes derived from transposable elements (their activity however has declined)
■  the mutation rate in males is twice as high as in females
■  >1,400,000 single nucleotide polymorphisms (SNPs)
Why is bioinformatics important?
■  last 20-30 years - structural biology
■  new era - bioinformatics - due to genome projects and sequence/structure deficit
■  biological function is not known for about 50% of all genes in every sequenced genome
■  role of bioinformatics
>- data management and storage
>- data analysis = conversion of primary sequence to biological knowledge
Pattern recognition versus prediction
Levels of protein structure
Primary structure: the linear sequence of amino acids in a protein molecule
Secondary structure: regions of local regularity within a protein fold (e.g., α-helices, β-turns, β-strands)
Super-secondary structure: the arrangement of α-helices and/or β-strands into discrete folding units (e.g., β-barrels, βαβ-units, Greek keys)
Tertiary structure: the overall fold of a protein sequence, formed by the packing of its secondary and/or super-secondary structure elements
Quaternary structure: the arrangement of separate protein chains in a protein molecule with more than one subunit
Quinternary structure: the arrangement of separate molecules, such as in protein-protein or protein-nucleic acid interactions
Homology and analogy
■  Sequences are said to be homologous if they are related by divergence from a common ancestor.
■  Proteins can share similar folds (e.g., β-barrel) or similar catalytic residues (e.g., serine proteases) without any sequence similarity. Convergence to similar biological solutions from different evolutionary starting points results in analogy.
■  Sequence analysis assumes homologous proteins.
■  Homology is not a measure of similarity.
Application areas of different analysis methods			
[Figure: application areas as a function of percent identity: alignment methods at high identity, profile methods in the Twilight Zone, structure prediction in the Midnight Zone]
Orthology and paralogy
■  Proteins performing the same function in different species - orthologues.
■  Proteins performing different, but related functions within the same organism - paralogues.
■  Sequence comparison of orthologous proteins is the basis of phylogenetic analysis.
Modularity of proteins = difficulties of homology searches
Information networks
■  what is the Internet?
■  how do computers find each other?
■  FTP and Telnet
■  what is the World Wide Web?
■  HTTP, HTML and URL
■  EMBnet, EBI, NCBI
■  SRS and ENTREZ
What is the Internet?
■  Global network of computer networks that link government, academic and business institutions.
■  communication by TCP/IP
(Transmission Control Protocol/Internet Protocol)
■  computers - nodes, data - packets
■  packets may not be transferred directly from one computer to another
How do computers find each other?
■  Each computer is assigned an IP address (e.g., 147.251.28.2) and a host name of the form machine.site.domain (e.g., bilbo.chemi.muni.cz)
■  FTP - File Transfer Protocol
■ Telnet - remote connection
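The addressing scheme on this slide can be illustrated with a short sketch. It only takes apart the example address and host name given above; the function name `describe_host` and the reading of the middle labels as the "site" are assumptions made for illustration, not part of the lecture.

```python
import ipaddress

def describe_host(ip_text, host_name):
    """Split an IPv4 address and a machine.site.domain host name into parts."""
    ip = ipaddress.ip_address(ip_text)      # raises ValueError if malformed
    labels = host_name.split(".")
    return {
        "octets": [int(part) for part in ip_text.split(".")],
        "as_integer": int(ip),              # the same address as one 32-bit number
        "machine": labels[0],               # first label
        "site": ".".join(labels[1:-1]),     # assumption: everything in between
        "domain": labels[-1],               # last label (country or other domain)
    }

info = describe_host("147.251.28.2", "bilbo.chemi.muni.cz")
print(info)
```

The sketch does not contact the network; resolving a name to an address is the job of the Domain Name System.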
Example of Internet domains and subdomains		
Country-based domains
Australia	.au
Denmark	.dk
Finland	.fi
France	.fr
Germany	.de
Greece	.gr
Hungary	.hu
Ireland	.ie
Israel	.il
Italy	.it
Netherlands	.nl
New Zealand	.nz
Poland	.pl
Portugal	.pt
South Africa	.za
Spain	.es
Sweden	.se
Switzerland	.ch
United Kingdom	.uk
USA	.us
Other domains
Educational	.edu
Commercial	.com
Governmental	.gov
Military	.mil
Subdomains
Academic	.ac
Company	.co
Other organisation	.org
General	.gen
What is the World Wide Web?
■  Developed at CERN - the European Laboratory of Particle Physics.
■ The purpose was sharing of information.
■  Hypermedia based information system.
■ The most advanced information system found on the Internet.
■  Very popular - almost synonymous with the Internet.
Web browsers
■  Browser is the client communicating with servers using standard protocols.
■  Home page is the first point of contact between browser and the server.
>- Lynx - academic, VT100 terminal
>- Mosaic - academic, X-windows
>- Netscape Navigator - commercial
>- Internet Explorer - commercial
HTTP, HTML and URL
■  HTTP - HyperText Transfer Protocol
>- documents exploited by browsers are written in hypertext and transferred by HTTP
■  HTML - HyperText Markup Language
>- standard language for writing hypertext
■  URL - Uniform Resource Locator
>- unique address for a document
>- example: http://www.chemi.muni.cz/~jiri
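As a small illustration of how a URL encodes protocol, server and document, the example address from this slide can be split with Python's standard `urllib.parse` module. This is only a sketch of the URL structure; it does not fetch the page.

```python
from urllib.parse import urlsplit

url = "http://www.chemi.muni.cz/~jiri"
parts = urlsplit(url)

print(parts.scheme)  # protocol used for the transfer
print(parts.netloc)  # server (host name) holding the document
print(parts.path)    # location of the document on that server
```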
EMBnet, EBI, NCBI
■  EMBnet - network of European biocomputing and bioinformatics laboratories, established in 1988.
■  Eliminates the need for multicopies of biology databases and retrieval software.
■  Hinxton Hall = Sanger Centre + MRC Human Genome Mapping Project Resource Centre + European Bioinformatics Institute (EBI)
■  National Center for Biotechnology Information (NCBI)
SRS, ENTREZ and LinkDB
SRS - The Sequence Retrieval System
>- maintained by EBI
>- network browser for databases in molecular biology
>- allows indexing of flat-file databases
>- allows customised search of selected databases
>- links databanks: sequence, structure, bibliography, etc.
ENTREZ
>- integrates databases of NCBI
>- less flexible than SRS
>- valuable concept of neighbouring
>- links databanks: DNA and protein sequences, genome data, structural data, PubMed bibliography
SRS, ENTREZ and LinkDB
LinkDB
>- maintained by Institute for Chemical Research, Japan
>- network browser for databases in DBGET and KEGG (Kyoto Encyclopedia of Genes and Genomes)
>- links databanks: sequence, motifs, structure, amino acid properties, ligands, metabolic pathways
Network of databases linked via SRS
Network of databases linked via ENTREZ
MEDLINE
Sequences
Network of databases linked via LinkDB (DBGET database links)
Protein information resources
■  biological databases - introduction
■  primary protein sequence databases
■  composite protein sequence databases
■  secondary databases
■  composite secondary databases
■  protein structure databases
■  protein structure classification databases
Biological databases - introduction
■  Vast amounts of data produced - databases must be established for storage of the data.
■  Databases must be maintained and disseminated together with the analysis tools.
■  Classification of databases
>- flat files
>- relational
>- object-oriented
>- primary
>- secondary
>- composite
Entry from a flat file database
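To make the idea of a flat file and its index concrete, here is a minimal sketch. The two entries are hypothetical, and the layout (a two-letter line code per line, `//` closing each entry) is a simplified imitation of the style used by sequence databases such as SWISS-PROT and EMBL, not their full format. The function names are illustrative.

```python
from collections import defaultdict

# Hypothetical flat file: two-letter line code, then the field text.
FLAT_FILE = """\
ID   EXAMPLE1
DE   Hypothetical protein one
SQ   AVILDRYFH
//
ID   EXAMPLE2
DE   Hypothetical protein two
SQ   MKTAYIAK
//
"""

def parse_flat_file(text):
    """Return one dict per entry, mapping each line code to its joined text."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-entry marker
            if current:
                entries.append({code: " ".join(vals) for code, vals in current.items()})
            current = defaultdict(list)
        elif line.strip():
            code, _, value = line.partition(" ")
            current[code].append(value.strip())
    return entries

def build_index(entries, field="ID"):
    """Index entries by one field so a lookup does not rescan the whole file."""
    return {entry[field]: entry for entry in entries}

entries = parse_flat_file(FLAT_FILE)
index = build_index(entries)
print(index["EXAMPLE2"]["SQ"])
```

Retrieval systems that index flat files work on the same principle at a much larger scale: the text stays as it is, and only the index is queried.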
Relational database
Object-oriented database
Levels of protein structure and corresponding databases I
primary: sequence (AVILDRYFH) - primary database
secondary: motif ([AS]-[IL]2-X-[DE]-E-[FYH]2-H) - secondary database
tertiary: domain, module - structure database
Primary protein sequence databases
■ PIR
■ MIPS
■ SWISS-PROT
■ TrEMBL
■ NRL-3D
Store biomolecular sequences and annotations.
Primary protein sequence databases
■  PIR - Protein Sequence Database
>- 1960s by Margaret Dayhoff
>- maintained by international consortium
>- four sections PIR1-PIR4
PIR1 - fully classified and annotated entries
PIR2 - preliminary entries
PIR3 - unverified entries
PIR4 - conceptual translations of artefactual
sequences, non-transcribed, non-translated
■  MIPS - Martinsried Institute for Protein Sequences
>- collects and processes sequence data for PIR
Primary protein sequence databases
■ SWISS-PROT
>- University of Geneva, EBI, Swiss Institute of Bioinformatics
>- high-level annotations including description of the function, structure and domains, post-translational modifications, variants, etc.
>- annotated manually (high quality)
>- automatically annotated = TrEMBL
>- minimally redundant
>- interlinked with many other sources
>- efficient searching of selected fields only
>- most widely used protein sequence database
Primary protein sequence databases
■ TrEMBL - Translated EMBL
>- computer-annotated supplement of SWISS-PROT
>- contains translations of all coding sequences in EMBL
>- SP-TrEMBL (SWISS-PROT TrEMBL), REM-TrEMBL
■  NRL-3D
>- produced by PIR from sequences extracted from Brookhaven Protein Databank (PDB)
>- annotations in PIR format including structural information extracted from PDB: secondary elements, active sites, experimental method, resolution
>- makes sequence information in PDB searchable by keywords and similarity
Composite protein sequence databases
■  NRDB
■  OWL
■  MIPSX
■  SWISS-PROT+TrEMBL
Amalgamates a number of primary sources, using a set of clearly defined criteria.
Composite protein sequence databases
■ NRDB - Non-Redundant DataBase
>- developed and maintained by NCBI
>- composite: GenPept (CDS translations of GenBank), GenPeptupdate, PDB sequences, SWISS-PROT, SWISS-PROTupdate, PIR
>- advantages: comprehensive and up-to-date
>- disadvantages: not fully non-redundant (only identical copies removed), occurrence of multiple entries due to polymorphism, incorrect sequences amended in SWISS-PROT re-introduced by translation of GenBank
>- default database of the NCBI BLAST (ENTREZ/NCBI)
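The redundancy rule described for NRDB (only identical copies are removed) can be sketched as follows. The records are hypothetical, and the order in which sources are consulted is an assumption made for the example; it is not a description of how NCBI builds the database.

```python
def merge_sources(sources):
    """Merge records from several sources, keeping one copy of each identical sequence.

    sources: list of (source_name, [(entry_id, sequence), ...]) in priority order
             (the priority order is an assumption of this sketch).
    """
    seen = {}
    for source_name, records in sources:
        for entry_id, sequence in records:
            key = sequence.upper()
            if key not in seen:              # first occurrence wins
                seen[key] = (source_name, entry_id)
    return [(src, eid, seq) for seq, (src, eid) in seen.items()]

# Hypothetical records: G1 duplicates P1 exactly, G2 differs by one residue.
sources = [
    ("SWISS-PROT", [("P1", "MKTAYIAK")]),
    ("GenPept", [("G1", "MKTAYIAK"), ("G2", "MKTAYIAR")]),
]
merged = merge_sources(sources)
for record in merged:
    print(record)
```

The one-residue variant `G2` survives the merge, which is the behaviour listed above as a disadvantage: polymorphic or erroneous copies of the same protein remain as separate entries.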
Composite protein sequence databases
■ OWL
>- developed and maintained by University of Leeds
>- composite: SWISS-PROT, PIR1-4, GenBank, NRL-3D
>- SWISS-PROT the highest priority for annotation
>- advantages: less redundant, fully indexed (fast)
>- disadvantages: not up-to-date (released every 6-8 weeks), incorrect sequences
>- available from SEQNET of UK EMBnet
Composite protein sequence databases
■  MIPSX
>- developed by Max-Planck Institute in Martinsried
>- composite: PIR1-4, MIPS, NRL-3D, SWISS-PROT, TrEMBL, GenPept, Kabat, PSeqIP
>- identical entries and subsequences removed
■  SWISS-PROT+TrEMBL
>- developed and maintained by EBI
>- composite: SWISS-PROT, TrEMBL
>- advantages: comprehensive, minimally redundant, fewer errors
>- disadvantages: not as up-to-date as NRDB
>- available from SRS of EBI
Overview of primary sources of composite databases
NRDB	OWL	MIPSX	SP+TrEMBL
PDB	SWISS-PROT	PIR1-4	SWISS-PROT
SWISS-PROT	PIR	MIPSOwn	TrEMBL
PIR	GenBank	MIPSTrn	
GenPept	NRL-3D	MIPSH	
SWISS-PROTupdate		PIRMOD	
GenPeptupdate		NRL-3D	
		SWISS-PROT	
		EMTrans	
		GBTrans	
		Kabat	
		PSeqIP	
Secondary databases
■  Contain information derived from primary sequence data, typically in the form of abstractions: regular expressions, fingerprints, blocks, profiles or Hidden Markov Models.
■ These abstractions represent distillations of the most conserved features of multiple alignments.
■ The abstractions are useful for discrimination of family membership for newly determined sequences.
Terms used in sequence analysis methods
C-Y-X2-[DG]-G-x-[ST]
regular expression
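A regular expression written in this notation can be turned into a pattern that a program can search with. The sketch below converts the expression shown on the slide into Python `re` syntax; the converter `pattern_to_regex` is illustrative and covers only the elements that appear in this lecture (single residues, `x`/`X` wildcards, bracketed alternatives, and repeat counts written as `X2` or `x(2)`).

```python
import re

def pattern_to_regex(pattern):
    """Convert a motif such as C-Y-X2-[DG]-G-x-[ST] into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)\)|(\d+))?", element.strip())
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, count_paren, count_plain = m.groups()
        token = "." if residue in ("x", "X") else residue.upper()  # x = any residue
        count = count_paren or count_plain
        parts.append(token + (f"{{{count}}}" if count else ""))
    return "".join(parts)

regex = pattern_to_regex("C-Y-X2-[DG]-G-x-[ST]")
print(regex)

# Hypothetical test sequences: the first contains the motif, the second does not.
hit = re.search(regex, "AAACYKLDGQSAA")
miss = re.search(regex, "AAACYKLAGQSAA")
print(hit.group(0) if hit else None, miss)
```

A match only says that the sequence fits the pattern; whether it is a true positive still has to be judged against the family the pattern was derived from.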
Three principal methods for building secondary databases
alignment methods (PROFILE LIBRARY)
Hidden Markov Models (PFAM)
Implication of function from a sequence
DNA-binding proteins
Secondary databases
PROSITE
PRINTS
BLOCKS
Profiles
Pfam
IDENTIFY
Secondary databases
PROSITE
>- historically the first secondary database
>- maintained by Swiss Institute of Bioinformatics
>- motivation: identification of protein families
>- abstraction: regular expressions (patterns)
>- construction: automatic multiple alignment and manual extraction of conserved regions
>- ideally patterns should identify only true-positives (not false-positives)
>- entries deposited as two distinct files: pattern file and documentation file
>- primary source: SWISS-PROT
Pattern file of an entry from the PROSITE database
[Figure: example entry from the PROSITE pattern file]
Secondary databases
■ PRINTS
>- developed at University College London
>- motivation: identification of protein families by more than one pattern
>- abstraction: fingerprints (aligned motifs); fingerprints store original sequence information
>- construction: sequence information in seed motifs is augmented through iterative database scanning
>- construction of fingerprints done manually
>- primary source (original): OWL
>- primary source (new): SWISS-PROT and SP-TrEMBL
Pattern file of an entry from the PRINTS database (I)
[Figure: header of the PRINTS entry for the opsin signature: type of fingerprint, cross-references to PRINTS and BLOCKS, creation and update dates, literature reference and summary information]
Pattern file of an entry from the PRINTS database (II)
[Figure: initial motif sets of the opsin fingerprint: aligned motif sequences with their protein codes (PCODE), start positions (ST) and intervals (INT)]
Secondary databases
■  BLOCKS (abstraction: blocks)
■  Profiles (abstraction: profiles)
■  Pfam (abstraction: Hidden Markov Models)
■  IDENTIFY
>- developed at Stanford University
>- abstraction: motifs encoded by fuzzy approach (alternative residues are tolerated in motifs)
>- construction: automatically derived using the program eMOTIF
>- primary sources: PRINTS and BLOCKS
Properties of amino acids used in eMOTIF
Residue property	Residue groups
Small	Ala, Gly
Small hydroxyl	Ser, Thr
Basic	Lys, Arg
Aromatic	Phe, Tyr, Trp
Basic	His, Lys, Arg
Small hydrophobic	Val, Leu, Ile
Medium hydrophobic	Val, Leu, Ile, Met
Acidic/amide	Asp, Glu, Asn, Gln
Small/polar	Ala, Gly, Ser, Thr, Pro
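The way residue groups allow alternative residues in a motif can be sketched in a few lines. This is a simplified illustration and not the eMOTIF program itself: for each column of a small, hypothetical alignment it picks the smallest group from the table above that covers all observed residues, and falls back to a wildcard `x` when no group fits.

```python
# Residue groups from the table above, in one-letter code.
GROUPS = {
    "small": "AG",
    "small hydroxyl": "ST",
    "basic (Lys, Arg)": "KR",
    "aromatic": "FYW",
    "basic (His, Lys, Arg)": "HKR",
    "small hydrophobic": "VLI",
    "medium hydrophobic": "VLIM",
    "acidic/amide": "DENQ",
    "small/polar": "AGSTP",
}

def column_to_element(column):
    """Describe one alignment column as a residue, a residue group, or x."""
    residues = set(column)
    if len(residues) == 1:                   # fully conserved position
        return next(iter(residues))
    fitting = [g for g in GROUPS.values() if residues <= set(g)]
    if not fitting:                          # no group covers the column
        return "x"
    return "[" + min(fitting, key=len) + "]"  # smallest covering group

def motif_from_alignment(rows):
    """Build a fuzzy motif from equally long, already aligned sequences."""
    return "-".join(column_to_element(col) for col in zip(*rows))

alignment = ["AVKDW", "GLRDD", "AIKDA"]      # hypothetical aligned segment
motif = motif_from_alignment(alignment)
print(motif)
```

Choosing the smallest covering group keeps the motif as specific as the alignment allows, which limits false positives when the motif is used for searching.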
Overview of primary sources and stored information in secondary databases
Secondary database	Primary source	Stored information
PROSITE	SWISS-PROT	Regular expressions (patterns)
Profiles	SWISS-PROT	Weighted matrices (profiles)
PRINTS	OWL*	Aligned motifs (fingerprints)
Pfam	SWISS-PROT	Hidden Markov Models (HMMs)
BLOCKS	PROSITE/PRINTS	Aligned motifs (blocks)
IDENTIFY	BLOCKS/PRINTS	Fuzzy regular expressions (patterns)
Composite secondary databases
■ INTERPRO - Integrated resource of Protein Families, Domains and Sites
>- developed by EBI, SIB, University of Manchester, Sanger Centre, GENE-IT, CNRS/INRA, LION Bioscience AG and University of Bergen (European Research Project)
>- provides an integrated view of the commonly used secondary databases: PROSITE, PRINTS, SMART, Pfam and ProDom
>- accessible by ftp, www and via member databases
InterPro dataflow scheme
Protein structure databases
■  PDB
■  PDBsum
Protein structure classification databases
■  SCOP
■  CATH
Genome information resources
■  primary DNA sequence databases
■  specialised DNA sequence databases
Primary DNA sequence databases
■  EMBL
■  DDBJ
■  GenBank
■  dbEST
■  GSDB
Store DNA sequences and annotations.
Primary DNA sequence databases
■ EMBL - European Molecular Biology Laboratory
>- European Bioinformatics Institute (EBI)
>- collaboration with DDBJ and GenBank - exchange of new entries on daily basis
>- source of sequences: direct author submissions, genome projects, scientific literature, patents
>- rate of growth is exponential with doubling time ~9-12 months
>- most entries from model organisms
>- retrieval through SRS
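What a doubling time of 9-12 months means in practice can be worked out directly: with doubling time T, a database grows by a factor of 2^(t/T) over a period t. The sketch below evaluates this for a three-year period; the period is chosen only for illustration.

```python
def growth_factor(years, doubling_time_months):
    """Factor by which an exponentially growing database expands in `years`."""
    return 2 ** (12 * years / doubling_time_months)

slow = growth_factor(3, 12)  # doubling every 12 months
fast = growth_factor(3, 9)   # doubling every 9 months
print(slow, fast)
```

Over three years the database would therefore become 8 to 16 times larger, depending on which end of the quoted range applies.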
Primary DNA sequence databases
■  DDBJ - DNA Data Bank of Japan
>- National Institute of Genetics
>- collaboration with EMBL and GenBank
>- retrieval through DBGet
■  GenBank
>- National Center for Biotechnology Information (NCBI)
>- collaboration with DDBJ and EMBL
>- data split into 17 divisions
>- retrieval through Entrez
Codes for 17 divisions of GenBank
Division	Sequence subset
PRI	Primate
ROD	Rodent
MAM	Other mammalian
VRT	Other vertebrate
INV	Invertebrate
PLN	Plant, fungal, algal
BCT	Bacterial
RNA	Structural RNA
VRL	Viral
PHG	Bacteriophage
SYN	Synthetic
UNA	Unannotated
EST	EST (Expressed Sequence Tags)
PAT	Patent
STS	STS (Sequence Tagged Sites)
GSS	GSS (Genome Survey Sequences)
HTG	HTG (High Throughput Genomic Sequences)
Entry from the GenBank database
Primary DNA sequence databases
■  dbEST
>- National Center for Biotechnology Information (NCBI)
>- maintains only Expressed Sequence Tag (EST) data
■  GSDB - Genome Sequence DataBase
>- National Center for Genome Resources
>- complete collections of DNA sequence for genome-sequencing laboratories
>- on-line submission of large-scale data
>- quality checks
>- format consistent with GenBank + GSDBID
Specialised DNA sequence databases
■ SGD
■ UniGene
■ TDB
■ ACeDB
Store species-specific and technique-specific DNA sequences.
Specialised DNA sequence databases
■  SGD - Saccharomyces Genome Database
>- molecular biology and genetics of S. cerevisiae
>- complete genome, genes, proteins, phenotypes
>- first eukaryotic genome sequenced (1996)
>- sequence analysis, register of genes, 3D structural data, primer sequences for cloning
■  UniGene
>- collection of genes encoding proteins (transcript map)
>- non-redundant; derived from GenBank
>- data organised in clusters (1 cluster = 1 unique gene)
>- gene-mapping projects and gene expression analysis
Specialised DNA sequence databases
■ TDB - TIGR Database
>- suite of databases: DNA and protein sequences, gene expression, protein families, taxonomic data
>- links: TIGR microbial genome sequencing projects, parasite databases, gene index projects, A. thaliana database, human genomic dataset
■  ACeDB - A Caenorhabditis elegans DataBase
>- C. elegans genome project
>- restriction maps, gene structural information, cosmid maps, sequence data, bibliographic information
>- software to organise data (ACEDB): CGI script and Perl
ACEDB software for organisation of genomic data

DNA sequence analysis
■  why analyse DNA?
■  gene structure
■  gene sequence analysis
■  expression profile, cDNA, EST
■  EST sequences analysis
Why analyse DNA?
■ The most sensitive comparisons between sequences are at the protein level because of the redundancy of the genetic code.
■ The loss of degeneracy is accompanied by a loss of information directly linked to evolution - proteins are only functional abstractions of genetic events at the DNA level.
■  Silent mutations, important for phylogenetic analysis, cannot be detected at the protein level.
■  Exon/intron analysis and open reading frame (ORF) analysis cannot be performed at the protein level.
The genetic code
(rows: first base; columns: second base)
	T	C	A	G
T	TTT TTC Phe; TTA TTG Leu	TCT TCC TCA TCG Ser	TAT TAC Tyr; TAA TAG Stop	TGT TGC Cys; TGA Stop; TGG Trp
C	CTT CTC CTA CTG Leu	CCT CCC CCA CCG Pro	CAT CAC His; CAA CAG Gln	CGT CGC CGA CGG Arg
A	ATT ATC ATA Ile; ATG Met	ACT ACC ACA ACG Thr	AAT AAC Asn; AAA AAG Lys	AGT AGC Ser; AGA AGG Arg
G	GTT GTC GTA GTG Val	GCT GCC GCA GCG Ala	GAT GAC Asp; GAA GAG Glu	GGT GGC GGA GGG Gly
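The degeneracy of the code, and why silent mutations disappear at the protein level, can be checked with a short sketch. The codon table below is the standard genetic code written in the same T, C, A, G order as the table above; the two DNA sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for the 64 codons in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate a DNA sequence codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

original = "ATGCTTTCA"  # Met-Leu-Ser
silent = "ATGCTGTCA"    # CTT -> CTG: different DNA, still Leu

leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")
print(translate(original), translate(silent))
print(leucine_codons)
```

Both sequences give the same protein, so the substitution is visible only when the DNA sequences themselves are compared.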
Gene structure
■  Eukaryotic genes are more complex than prokaryotic genes due to the presence of introns.
■  DNA databases typically contain genomic data: untranslated sequences, introns+exons, mRNA, cDNA.
■  Gene products (proteins) can be of different length, because not all exons need be present in the final mRNA.
■ The proteins of different length originating from a single sequence are called splice variants.
Central dogma of molecular biology
[Figure: the sense strand of genomic DNA (5' UTR, exons interrupted by introns, 3' UTR) is transcribed into mRNA (5' UTR, CDS, 3' UTR), which is translated into protein]
Gene structure
■  Untranslated regions (UTRs)
>- portions of the sequence flanking the coding sequence (CDS) not translated into protein
>- UTRs (especially the 3' end) are highly gene/species specific
■  Exons
>- protein-coding DNA sequences of a gene
■  Introns
>- DNA sequences interrupting protein-coding DNA sequence of a gene
>- transcribed into RNA but edited out during post-transcriptional modifications
Gene sequence analysis
■  Conceptual translation - theoretical translation of the DNA sequence to the protein sequence using the genetic code, without biochemical support.
■  Six-frame translation results in six potential protein sequences (ORF analysis).
■  ORF analysis
>- codon for methionine - initial codon in the CDS
>- sufficient CDS length - long open reading frames rarely arise by chance
>- pattern of codon usage - species specific
>- bias towards G/C in the third base of a codon - species specific
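Six-frame translation and the first two ORF criteria (a methionine start and sufficient length) can be sketched as follows. The DNA sequence is hypothetical, the helper names are illustrative, and codon usage and G/C bias are not considered. An open reading frame that runs to the end of the sequence without a stop codon is still reported, which is a simplification.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def six_frame_translation(dna):
    """Return the conceptual translations of all six reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]   # reverse complement strand
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

def longest_orf(protein):
    """Longest stretch that starts at Met and runs to the next stop (or the end)."""
    best = ""
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best

frames = six_frame_translation("CCATGGCTAAATGAGG")  # hypothetical sequence
for name, protein in frames.items():
    print(name, protein)
best = max((longest_orf(p) for p in frames.values()), key=len)
print(best)
```

In real sequences the length threshold matters: short ORFs appear by chance in every frame, so only candidates well above the expected random length are worth following up.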
Expression profile, cDNA, EST
■ Hierarchy of genomic information
>- human genome consists of ~3 billion bp
>- ~3% of the DNA is coding sequence → mRNA → protein
>- rest of the genome needed for compact structure of chromosomes, replication, control of transcription, etc.
>- 1. chromosomal genome (genome) - genetic information common to every cell in the organism
>- 2. expressed genome (transcriptome) - part of genome expressed in a cell at specific stage in its development
>- 3. proteome - protein molecules that interact to give the cell its individual character
Expression profile, cDNA, EST
■ Expression profile
>- characteristic range of genes expressed at particular stage of development and functioning
>- goal of genome projects is to sequence entire (chromosomal) genome
>- having complete sequences and knowing what they mean - two distinct stages of understanding genome
>- alternative approach is analysis of parts of genome expressed in a cell at specific stage in its development
>- comparison of expression profiles: identification of abnormal expressions, expression levels
>- interesting for industry - gene discovery, drug design
Expression profile, cDNA, EST
■ Complementary DNA (cDNA)
>- DNA that is synthesised from a messenger RNA template using the enzyme reverse transcriptase
>- cDNA captures expression profile
>- preparation: cultivation/isolation of cells, mRNA extraction, reverse transcription of mRNA to cDNA, transformation of cDNA into library, sequencing of randomly chosen clones (100,000 out of 2 million)
>- ideally 100,000 sequences of 200-400 bp length - expressed sequence tags (ESTs)
>- in reality many failures, number of sequences lower
>- number of clones constructed and sequenced must be large enough to represent expression profile
Origin of complementary DNA and expressed sequence tags
[Figure: sense strand of genomic DNA (5' UTR, exons, introns, 3' UTR) → transcription → mRNA (5' UTR, CDS, 3' UTR) → translation → protein; EST shown against the mRNA]
Expression profile, cDNA, EST
Libraries of ESTs
>- Merck/IMAGE - 300 000 ESTs from a variety of normalised libraries - higher chance to capture different genes; expression levels not known; sequences deposited to dbEST
>- Incyte - quantitative information on expression levels - standardised libraries; expression profiles in healthy and diseased tissues; sequences form the commercial database LifeSeq
>- TIGR - TIGR Human Gene Index - integrates results from human gene projects [dbEST+GenBank] - purpose is to identify all possible human genes by sequence assembly - creates Tentative Human Consensus (THC) sequences and contigs
EST sequences analysis
EST production is highly automated (fluorescent laser systems and computer analysis of chromatograms) influencing the quality of sequences. Specific character of ESTs must be respected during their analysis:
■  EST alphabet
■ Insertions, deletions, frameshifts
■ Splice variants in EST
■  Non-coding regions
EST sequences analysis
■  EST alphabet
>- automated computer analysis of chromatograms
>- program is sometimes unable to decide base for particular position and inserts ambiguous base N
>- should be <5% of total length
■  Insertions, deletions and frameshifts
>- automated base-calling software assumes regular intervals among peaks - not always the case
>- phantom INDELs (insertions and deletions)
>- identification of INDELs by sequence comparisons
List of base-ambiguity symbols defined by IUB-IUPAC
IUB symbol	Represented bases
A	A
C	C
G	G
T/U	T
M	A or C
R	A or G
W	A or T
S	C or G
Y	C or T
K	G or T
V	A or C or G
H	A or C or T
D	A or G or T
B	C or G or T
X/N	G or A or T or C
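The ambiguity symbols and the "<5% of total length" rule of thumb for ESTs can be expressed as a small lookup table and filter. The sketch below is illustrative only: the function names are hypothetical, unknown characters are treated as fully ambiguous, and the 5% threshold is taken from the slide rather than from any standard.

```python
# IUB-IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT", "X": "ACGT",
}


def ambiguity_fraction(seq):
    """Fraction of positions whose symbol stands for more than one base."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # Symbols missing from the table are counted as ambiguous (assumption).
    ambiguous = sum(1 for base in seq if len(IUPAC.get(base, "ACGT")) > 1)
    return ambiguous / len(seq)


def passes_quality_filter(seq, max_fraction=0.05):
    """Apply the slide's rule of thumb: ambiguous bases below 5% of the length."""
    return ambiguity_fraction(seq) < max_fraction


def matches(symbol, base):
    """True if a concrete base is compatible with an IUPAC symbol."""
    return base.upper() in IUPAC[symbol.upper()]


print(ambiguity_fraction("ACGTN"))
print(passes_quality_filter("ACGTNACGTN"))
```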
EST sequences analysis
■ Splice variants
>- splice variants are represented by deletions arising from non-inclusion of exons
>- bases may be missing in an EST due to sequencing errors
>- partially good match = splice form or sequence error?
■ Non-coding regions
>- question: does this EST represent a new gene?
>- search of DNA database for similar non-coding regions
>- no hit found = the EST represents a new gene (CDS) or the EST represents non-coding sequence not present in the database
Sequencing chromatogram
EST sequences analysis
Three categories of EST analysis tools:
■  Sequence similarity search tools
■  Sequence assembly tools
■  Sequence clustering tools
EST sequences analysis
■  Sequence similarity search tools
>- current database search programs are designed to cope with EST: TBLASTN (translate DNA databases), BLASTX (translate input sequence), TBLASTX (translate both)
■  Sequence assembly tools
>- search of the databases reveals several ESTs matching the query sequence
>- alignment of hits and construction of consensus
>- search with consensus, augment, ....
>- iterative sequence alignment = sequence assembly
EST sequences analysis
■ Sequence clustering tools
>- clustering of EST sequences reduces redundancy and saves the search time
>- enables estimation of the number of genes in the EST database
>- approach 1: clustering based on sequences from comprehensive DNA database
>- approach 2: clustering of all ESTs, construction of consensus sequences representing each cluster, DNA database search using consensus sequences only
>- result = ESTs that do not match any of the database sequences
Clustering of EST library
[Figure: EST library → clustering; plus sense and minus sense ESTs drawn as arrows in opposite directions]
Pairwise sequence alignment
■  database searching
■  alphabets and complexity
■  algorithms and programs
■  sequences and sub-sequences
■  identity and similarity
■  dotplot
■  local and global similarity
■  pairwise database searching
Database searching
■  Database search can take a form of text queries or sequence similarity searches.
■ Text queries are problematic due to missing annotations in many sequences.
■  query sequence = probe; searched sequence = subject
■ The purpose of searches is to identify evolutionary relationships (homology) from sequence similarity. Important for search of analogous family members in different species.
Alphabets and complexity
■  A sequence consists of letters from an alphabet.
■ The complexity of the alphabet is defined by the number of letters it contains:
>- DNA = 4
>- EST = 5
>- proteins = 20
■  Special letters can be used for ambiguous bases (N) or residues (X). Sequence searching programs must be able to deal with them.
Algorithms and programs	
■ An algorithm is a set of steps that define a certain computational process.
■ A program is the implementation of the algorithm.
■ The same algorithm may be implemented in many programs.
Sequences and sub-sequences				
■ Alignment of two short sequences:
Unaligned, score = 6
Sequence 1 (query)      AGGVLIIQVG
                        ||||||
Sequence 2 (subject)    AGGVLIQVG
Aligned, score = 9
Sequence 1 (query)      AGGVLIIQVG
                        |||||| |||
Sequence 2 (subject)    AGGVLI-QVG
■ Score increases by the insertion of a gap. The gap increases the number of aligned identical residues.
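The two scores in this example can be reproduced by counting identical residues position by position. The sketch below assumes the simplest possible scoring (one point per identity, nothing for a gap or mismatch); `best_single_gap` is a hypothetical helper that tries a single gap at every position of the shorter sequence.

```python
def identity_score(a, b):
    """Count positions where both sequences carry the same residue (gaps never count)."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def best_single_gap(query, subject):
    """Insert one gap into `subject` at the position that maximises identities."""
    candidates = (subject[:i] + "-" + subject[i:] for i in range(len(subject) + 1))
    return max(candidates, key=lambda s: identity_score(query, s))


query = "AGGVLIIQVG"
subject = "AGGVLIQVG"

print(identity_score(query, subject))        # unaligned comparison
print(identity_score(query, "AGGVLI-QVG"))   # alignment shown on the slide
gapped = best_single_gap(query, subject)
print(gapped, identity_score(query, gapped))
```

Because the query contains two adjacent isoleucines, the gap can sit opposite either of them with the same score, so the helper may return `AGGVL-IQVG` rather than the slide's `AGGVLI-QVG`.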
Alignment of a sub-sequence with full sequence
Identity and similarity
■  Introduction of gaps solely to maximise identities is not biologically meaningful.
■  Scoring penalties are introduced to minimise opening and extension of gaps.
■  Unitary matrix (counting identities) is replaced by similarity matrix (counting similarities) = high-scoring matches are replaced by biologically meaningful low-scoring matches.
■  Diagnostic power of similarity matrices is higher.
Unitary scoring matrices: (a) DNA and (b) protein
Identity and similarity
Dayhoff Mutation Data Matrix
>- score is based on the concept of Point Accepted Mutation (PAM)
>- evolutionary distance 1 PAM = probability of a residue mutating during a distance in which 1 point mutation is accepted per 100 residues
>- 250 PAM matrix - similarity score equivalent to 20% matches remaining between two sequences = suitable for identification of similarities in twilight zone
>- limitation: derived from alignment of sequences >85% identical
Mutation Data Matrix for 250 PAMs
[Table: lower-triangular PAM250 matrix, residues in the order C S T P A G N D E Q H R K M I L V F Y W; values not recoverable from the extracted text]
Identity and similarity
■ BLOSUM matrices
>- BLOcks Substitution Matrix
>- derived from blocks of aligned sequences in BLOCKS database - represents distant relationships implicitly
>- bias from identical sequences is removed by clustering
>- BLOSUM62 = matrix derived from sequences clustered at 62% or greater identity
Identity and similarity
■ Statistical measures of alignment significance
>- performing sequence alignment computationally = creating match according to mathematical model
>- adjustable parameters: gap penalties, impact of sequence length, effect of alphabet complexity
>- level of confidence in the constructed alignment is quantified by statistical parameters:
probability (p) - probability that the constructed alignment arose by chance [should approach 0]
expected frequency (E) - number of hits one can expect to see by chance [should be <0.001]
Example hit list from a database search
[Figure: hit list with accession numbers, entry names, descriptions, scores and E-values; the top hit is LINB (P51698), and further hits include HALO_XANAU (P22643, haloalkane dehalogenase), HYES_HUMAN (P34913, soluble epoxide hydrolase), several hypothetical Mycobacterium tuberculosis proteins, non-haem haloperoxidases and proline iminopeptidases]
Dotplot
■ The most basic visual method for comparison of two sequences.
■  Separates noise (random dots) from the signal (adjacent dots).
■  Identical sequences are represented by single central diagonal line, similar sequences by a broken diagonal and dissimilar sequences by random dots.
■  Advanced dotplots utilise similarity matrices for calculation of cell scores.
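A dotplot is straightforward to build by hand. The sketch below uses plain identity (the unitary matrix) and an optional sliding window to suppress isolated random dots; the `window` and `min_matches` parameters are illustrative assumptions, and a similarity matrix could replace the equality test to obtain the "advanced" variant.

```python
def dotplot(seq_a, seq_b, window=1, min_matches=1):
    """Boolean matrix: cell (i, j) is True when the windows starting at
    seq_a[i] and seq_b[j] share at least `min_matches` identical positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= min_matches)
        matrix.append(row)
    return matrix


def render(matrix):
    """Draw the matrix with '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


print(render(dotplot("AGGVLIIQVG", "AGGVLIQVG")))
```

With `window=1` every chance identity produces a dot; raising `window` and `min_matches` keeps mostly the diagonal runs that carry the signal.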
Construction of the dotplot matrix
Dotplot of (a) identical, (b) similar and (c) related sequences
[Figure: three dotplot panels; axis labels (a) LYC1_PIG, (b) LYC1_ANAPL, (c) LYC1_MACRG]
Local and global similarity
■ Alignments are mathematical models whose behaviour can be modified through the use of adjustable parameters. The models are constructed by dynamic programming algorithms, which find the solution of a problem by solving smaller, but similar, sub-problems.
■  Global alignment - considers similarity across the entire sequence.
■  Local alignment - considers similarity in parts of sequences only.
Path matrix with optimal path by dynamic programming (sequences AIMS and AMOS)
Alignment
AIM-S
A-MOS
Local and global similarity
■ Global alignment
>- Needleman and Wunsch algorithm
>- suitable for sequences similar across most of their length (usually closely related)
>- 1. construction of 2D similarity matrix ("dotplot")
>- 2. successive summation of the cells in the matrix starting from N-terminal end → progressing through the sequence
>- 3. construction of maximum-match path through the entire sequence
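The three steps can be sketched as a short dynamic programming routine. This follows the common textbook formulation of global alignment with a linear gap penalty rather than the exact scheme of the original publication, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # Path matrix with an extra first row and column for leading gaps.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Successive summation from the N-terminal end.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + substitution(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace the maximum-match path back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("AIMS", "AMOS"))
```

With these parameters the routine recovers the alignment AIM-S / A-MOS used in the path-matrix illustration.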
Local and global similarity
■ Local alignment
>- Smith-Waterman algorithm
>- suitable for distantly related sequences displaying local regions of similarity (functionally-relevant or structurally-relevant)
>- each point of the matrix defines the end point of a potential alignment = edge cells of the matrix are initialised to 0
>- possibilities for ending the alignment are calculated for every cell
>- algorithm is much faster compared to global similarity algorithms
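A compact sketch of local alignment by dynamic programming follows. It differs from the global case in two places: cell values are never allowed to drop below 0, and the traceback starts at the highest-scoring cell and stops at the first 0. The scoring values (match +2, mismatch -1, gap -1) are assumptions for illustration only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # Edge cells stay at 0, so an alignment may start anywhere.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(0,
                              score[i - 1][j - 1] + substitution(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGTGG"))
```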
Concepts of global and local optimality in the pairwise sequence alignment
(a) Global vs. Global
(b) Local vs. Global
(c) Local vs. Local
Pairwise database searching
■  Extension of the pairwise sequence alignments.
■  Large database searches cannot be performed using the original Needleman and Wunsch or Smith-Waterman algorithms due to time limitations.
■  Very fast local-similarity search methods employing heuristics = FastA and BLAST. These methods concentrate on finding short identical matches.
Pairwise database searching
FastA
>- algorithm by Lipman and Pearson (1985)
>- identifies short words (k-tuples) common to both sequences
>- k-tuples for proteins: 1-2 residues
>- k-tuples for DNA: up to 6 bases
>- k-tuples lying close to each other on the same diagonal joined by heuristics → gapped alignments computed by dynamic programming
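The first stage of this heuristic, finding shared k-tuples and grouping them by diagonal, can be sketched as follows. This is only the word-lookup step under simplifying assumptions; the real FastA program additionally rescores, joins and re-aligns the best regions, which is not attempted here. Function names are hypothetical.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Map every k-tuple in `seq` to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, subject, k=2):
    """Count shared k-tuples per diagonal (subject position minus query position)."""
    index = ktuple_index(query, k)
    hits = Counter()
    for s_pos in range(len(subject) - k + 1):
        for q_pos in index.get(subject[s_pos:s_pos + k], ()):
            hits[s_pos - q_pos] += 1
    return hits


def best_diagonal(query, subject, k=2):
    """Return (diagonal, number_of_shared_k_tuples) for the densest diagonal."""
    hits = diagonal_hits(query, subject, k)
    return max(hits.items(), key=lambda item: item[1]) if hits else None


print(diagonal_hits("AGGVLIIQVG", "AGGVLIQVG", k=2))
```

For the two short sequences used earlier, the shared words fall on two neighbouring diagonals, which is exactly the situation where the heuristic joins them into a gapped alignment.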
Output from FastA search
Pairwise database searching
BLAST
>- Basic Local Alignment Search Tool
>- algorithm by Altschul et al. (1990)
>- identifies short ungapped sub-sequences (segment pairs) of the same length
>- sub-sequences are extended using dynamic programming to obtain local alignments - high scoring pairs (HSPs)
>- improved algorithm by Altschul et al. (1997) - produces gapped alignments
>- algorithm very fast - most commonly used for database searching
Output from BLAST search
Multiple sequence alignment
■  multiple sequence alignment
■  consensus sequence
■  manual methods
■  simultaneous and progressive methods
■  databases of multiple sequence alignments
■  hybrid approach for database searching
Multiple sequence alignment
■  Multiple sequence alignment is a 2D table in which the rows represent individual sequences and the columns the residue positions.
■  Multiple sequence alignments are essential for analysis of sets of gene families.
■  Sequence-based multiple sequence alignments - constructed according to similar strings of amino acid residues.
■  Structure-based multiple sequence alignments - constructed according to structural evidence.
Colour-coded multiple sequence alignments
Multiple sequence alignment
■  Construction of a multiple sequence alignment:
>- positioning of residues within any sequence is preserved (absolute positions)
>- similar residues in all sequences are brought into vertical register (relative positions)
■  All residues in any single column of an alignment will have the same relative position but different absolute position (unless the sequences are identical).
Consensus sequence
■ The alignment table can be summarised by:
>- a single line: pseudo-sequence
>- unweighted matrix: fingerprint
>- ungapped block of residues (weighted): block
>- weighted matrix: profile
Multiple alignment and the consensus sequence
	1	2	3	4	5	6	7	8	9	10
I	Y	D	G	G	A	V	-	E	A	L
II	Y	D	G	G	-	-	-	E	A	L
III	F	E	G	G	I	L	V	E	A	L
IV	F	D	-	G	I	L	V	Q	A	V
V	Y	E	G	G	A	V	V	Q	A	L
	y	d	G	G	A/I	V/L	v	e	A	l
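A pseudo-sequence of this kind can be derived column by column. The rule used in the sketch below is an assumption, not necessarily the one behind the slide: gaps are ignored, a column with a single residue type is written in upper case, a clear majority in lower case, and ties as alternatives separated by "/". The alignment rows are read from the table, with obvious extraction errors corrected.

```python
def column_consensus(column):
    """Summarise one alignment column as a consensus symbol."""
    counts = {}
    for residue in column:
        if residue != "-":                      # gaps do not vote
            counts[residue] = counts.get(residue, 0) + 1
    if not counts:
        return "-"
    top = max(counts.values())
    # Residues sharing the top count, in order of first appearance.
    leaders = [r for r, c in counts.items() if c == top]
    if len(counts) == 1:
        return leaders[0].upper()               # fully conserved
    if len(leaders) > 1:
        return "/".join(leaders)                # tie between residues
    return leaders[0].lower()                   # majority only


def consensus(alignment):
    """Consensus symbols for every column of an equal-length alignment."""
    return [column_consensus(col) for col in zip(*alignment)]


alignment = [
    "YDGGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(alignment)))
```

Under this rule column 7 comes out as conserved (upper case) because only gaps and V occur there, whereas the slide shows it in lower case; a rule that also penalises gaps would reproduce that.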
Multiple alignment and the profile, block and fingerprint
fingerprint
C-Y-X2-[DG]-G-x-[ST]
regular expression
Manual methods
■  Manual methods are subjective; however, they make it possible to incorporate experimental evidence (e.g., mutagenesis data, structural knowledge) into the multiple alignment.
■  Manual modification of the multiple alignments from automatic methods is the best approach.
■  Intuitive colouring schemes assist the eye in spotting similarities.
■  Quantitative evaluation of relatedness through calculation of residue identities/similarities.
Amino acid property groupings and colouring		
Residue	Property	Colour
Asp, Glu	Acidic	red
His, Arg, Lys	Basic	blue
Ser, Thr, Asn, Gln	Polar neutral	green
Ala, Val, Leu, Ile, Met	Hydrophobic aliphatic	white
Phe, Tyr, Trp	Hydrophobic aromatic	purple
Pro, Gly	Special structural properties	brown
Cys	Disulphide bond former	yellow
Venn diagram grouping properties of the amino acids
[Figure: Venn diagram; legible set labels: tiny, small, aromatic, hydrophobic, polar, charged]
Simultaneous methods
■  Simultaneous methods align all sequences in a given set at once, rather than aligning pairs of sequences or building sequence clusters.
■  Extension of 2D dynamic programming matrix to more dimensions.
■  Number of dimensions = number of sequences.
■  Suitable only for small sets of short sequences.
Progressive methods
■  Multi-dimensional programming matrix is not applicable to realistic problems - larger sets of longer sequences.
■  CLUSTAL
>- 1. construction of evolutionary tree
>- 2. pairwise alignment of the two most closely related sequences, addition of less related sequences
>- 3. final alignment, final evolutionary tree
■  CLUSTALW
>- positioning of gaps in closely related sequences according to their variability
Databases of multiple alignments
■  Multiple alignments bring together sequences from different species. This important evolutionary information can enhance sensitivity of database searches.
■  Various abstractions (regular expressions, profiles, blocks, fingerprints or HMMs) can be searched against sequence databases. More information used in a query - higher sensitivity.
■  Results of the searches using the multiple alignments are more difficult to interpret.
Databases of multiple alignments
■  Multiple alignments databases available via Web are produced automatically (e.g., PFAM) or manually (e.g., PRINTS).
■  Iterative automatic methods may include false-positive sequences in the alignment which will corrupt it by insertion of many unrealistic gaps.
Example entry from PFAM database
Example alignment from PFAM database
Hybrid approach for database searching
■ PSI-BLAST
>- Position-Specific Iterated - BLAST
>- algorithm by Altschul et al. (1997)
>- incorporates elements of both pairwise and multiple sequence alignment methods
>- procedure: initial search - creation of position specific profiles from the hits - new search ... in iterations
>- advantage: detects even very weak similarities
>- disadvantages: the profile can be diluted if low-complexity regions are not masked; inclusion of single false-positive sequence into the profile leads to bias towards unrelated sequences
Graphic hit list from a database search using PSI-BLAST
Secondary database searching
■  why to search secondary databases?
■  secondary databases
■  regular expressions
■  fingerprints
■  blocks
■  profiles
■  Hidden Markov Models
Why to search secondary databases?
■  Interpretation of the results from primary database searches is sometimes difficult:
>- X.000.000 sequences from XX.000 organisms
>- complex and redundant search outputs
>- irrelevant matches of low-complexity sequences, repetitive sequences, modular sequences
>- local regions of similarity in multi-domain proteins
>- truncated description lines
■  Secondary database searches make it possible to identify both homology and the more exacting orthology.
Secondary databases
■  Contains information derived from primary sequence data, typically in the form of abstractions: regular expressions, fingerprints, blocks, profiles or Hidden Markov Models.
■ These abstractions represent distillations of the most conserved features of multiple alignments.
■ The abstractions are useful for discrimination of family membership for newly determined sequences.
Secondary databases
■  PROSITE - regular expressions
■  PRINTS - fingerprints
■  BLOCKS - blocks
■  Profiles - profiles
■  Pfam - Hidden Markov Models
■  IDENTIFY - fuzzy regular expressions
Terms used in sequence analysis methods
fingerprint
C-Y-X2-[DG]-G-x-[ST]
regular expression
Regular expressions
■ Regular expression reduces the sequence data to the most conserved residue information.
Regular expression [AS]-D-[IVL]-G-X5-C-[DE]-R-[FY]2-Q
Multiple alignment
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
Limitations:
>- stringent pattern - retrieves only identical matches and can miss remote relatives
>- fuzzier pattern - better chance to detect remote relatives, but results in more noisy output
>- single motif may not be sufficient to infer the function
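Patterns written in this notation map almost directly onto the regular expressions of a programming language. The converter below is a minimal sketch: it supports only the elements seen in these slides (single residues, `[..]` alternatives, `{..}` exclusions, `x`/`X` wildcards and repeat counts written as `X5` or `X(5)`), and the function name is hypothetical. It is not a full PROSITE syntax parser.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)\)|(\d+))?")


def pattern_to_regex(pattern):
    """Convert a hyphen-separated sequence pattern into a compiled Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unrecognised pattern element: {element!r}")
        symbol, count = m.group(1), m.group(2) or m.group(3)
        if symbol in ("x", "X"):
            symbol = "."                          # any residue
        elif symbol.startswith("{"):
            symbol = "[^" + symbol[1:-1] + "]"    # any residue except these
        parts.append(symbol + ("{" + count + "}" if count else ""))
    return re.compile("".join(parts))


regex = pattern_to_regex("[AS]-D-[IVL]-G-X5-C-[DE]-R-[FY]2-Q")
print(regex.pattern)

alignment = ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
             "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]
print([bool(regex.fullmatch(seq)) for seq in alignment])
```

The compiled expression matches all four aligned sequences; changing a single conserved residue makes the match fail, which illustrates how a stringent pattern can miss remote relatives.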
Regular expressions
■  Regular expressions work most effectively when a particular protein family can be characterised by a highly conserved motif (10-20 residues).
■  Limitation: short patterns (3-4 residues) are not sufficiently discriminative.
Asp-Ala-Val-Ile-Asp (DAVID): 71 exact matches in OWL29.6
Asp-Ala-Val-Glu (DAVE): 1088 exact matches in OWL29.6
Regular expressions
■ Rules - short patterns that can be used to provide a guide to possible existence of functional sites:
Functional site	Regular expression
N-glycosylation site	N-{P}-[ST]-{P}
Protein kinase C phosphorylation site	[ST]-X-[RK]
Casein kinase II phosphorylation site	[ST]-X(2)-[DE]
Asp and Asn hydroxylation site	C-X-[DN]-X(4)-[FY]-X-C
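Scanning a sequence for such rules is a simple loop over hand-translated regular expressions. In the sketch below the three patterns are direct translations of the first three rules in the table, a lookahead is used so that overlapping sites are all reported, and the test sequence is made up for illustration. Because the patterns are so short, hits indicate only possible sites.

```python
import re

# Rules from the table, translated by hand into Python regex syntax.
SITE_PATTERNS = {
    "N-glycosylation": r"N[^P][ST][^P]",
    "Protein kinase C phosphorylation": r"[ST].[RK]",
    "Casein kinase II phosphorylation": r"[ST].{2}[DE]",
}


def find_sites(sequence):
    """Return (site name, 1-based start position, matched residues), sorted by position."""
    hits = []
    for name, pattern in SITE_PATTERNS.items():
        # The zero-width lookahead lets overlapping matches be found as well.
        for m in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Hypothetical peptide, not taken from any database.
for hit in find_sites("MNGSASPRTEED"):
    print(hit)
```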
Regular expressions
■ Fuzzy regular expressions - regular expressions with fuzziness introduced into patterns using groups of amino acids with similar biochemical properties (FYW - aromatic, HKR - basic, etc.).
Multiple alignment
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
Fuzzy regular expression
[ASGPT]-D-[IVLM]-G-X5-C-[DENQ]-R-[FYW]2-Q
Amino acid property groupings and colouring
Residue	Property	Colour
Asp, Glu	Acidic	red
His, Arg, Lys	Basic	blue
Ser, Thr, Asn, Gln	Polar neutral	green
Ala, Val, Leu, Ile, Met	Hydrophobic aliphatic	white
Phe, Tyr, Trp	Hydrophobic aromatic	purple
Pro, Gly	Special structural properties	brown
Cys	Disulphide bond former	yellow
Venn diagram grouping properties of the amino acids
[Figure: Venn diagram; legible set labels: hydrophobic, polar, charged]
Regular expressions
■ Introducing fuzziness into regular expressions increases the number of matches retrieved from the sequence database.
Regular expression	No. of exact matches (OWL29.6)
D-A-V-I-D	71
D-A-V-I-[DENQ]	252
[DENQ]-A-V-I-[DENQ]	925
[DENQ]-A-[VLI]-I-[DENQ]	2739
[DENQ]-[AQ]-[VLI]2-[DENQ]	51506
Fingerprints
■  Motivation: there is often more than one conserved region present in a multiple alignment.
■  Groups of motifs excised from the sequence and converted into matrices populated by the residue frequencies observed at each position.
■  Unweighted scoring system - no additional mutation or substitution matrices are employed.
■  Weighted scoring system - additional matrices are employed resulting in less sparse matrix, but poor signal-to-noise performance.
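The unweighted scoring system can be illustrated with a few lines of code: an ungapped aligned motif is turned into a table of residue counts per position, and a candidate segment is scored by summing the counts of its residues. The motif below is a made-up four-residue example, and the function names are hypothetical; real fingerprints combine several such motifs.

```python
from collections import Counter


def frequency_matrix(motifs):
    """One Counter of residue frequencies per position of an ungapped motif."""
    if len({len(m) for m in motifs}) != 1:
        raise ValueError("motifs must be ungapped and of equal length")
    return [Counter(column) for column in zip(*motifs)]


def score(segment, matrix):
    """Unweighted score: sum of observed frequencies; unseen residues add 0."""
    return sum(column[residue] for residue, column in zip(segment, matrix))


def best_window(sequence, matrix):
    """Slide the motif along a sequence; return (best score, 0-based start)."""
    width = len(matrix)
    windows = [(score(sequence[i:i + width], matrix), i)
               for i in range(len(sequence) - width + 1)]
    return max(windows) if windows else None


# Hypothetical motif, for illustration only.
motifs = ["YVTV", "YVTV", "YATL", "YIFG"]
matrix = frequency_matrix(motifs)
print(matrix[1])
print(best_window("AAYVTVAA", matrix))
```

Because residues never observed at a position contribute nothing, the matrix is sparse; replacing the raw counts with values weighted by a substitution matrix fills in those zeros, at the cost of more background noise as noted above.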
Example of (a) ungapped aligned motif and (b) its corresponding frequency matrix
[Figure: (a) seven aligned 13-residue motifs, e.g. YVTVQHKKLRTPL and YLFSKTKSLQTPA; (b) matrix of residue counts per motif position; values not recoverable]
Example of frequency matrix derived from (a) initial unweighted motif and (b) PAM-weighted matrix
[Figure: both matrices; values not recoverable]
Blocks
■  Conserved motifs are located by a first motif-finding algorithm: search for the spaced residue triplets (e.g., Ala-X-X-Val-X-Trp); a block score is weighted using BLOSUM 62 substitution matrix.
■  Validation of blocks by a second motif-finding algorithm: search for the highest-scoring set of blocks in the correct order without overlapping.
■  Sequences are clustered to avoid a bias due to identical sequences.
Visualisation of protein fingerprints and blocks
Block with clustered sequences and weighted scores
Profiles
■  Based on entire sequences.
■  Profiles define which residues are allowed at given positions, which positions are highly conserved and which degenerate, which positions can tolerate insertions.
■ The scoring system may include evolutionary weights and results from structural analysis.
Example of PROSITE profile
Hidden Markov Models
■  Based on entire sequences.
■  HMMs are probabilistic models consisting of a number of interconnecting states - linear chains of match, delete or insert states.
■  Each position in the multiple alignment is assigned to either match, insert or delete state.
■  Construction: seed alignment, iterative sequence gathering, final alignment (all automatic).
Scheme of the linear Hidden Markov Model
(M) match, (I) insertion, (D) deletion
Analysis packages
■  commercial databases
■  commercial software
■  comprehensive packages
■  packages for DNA analysis
■  intranet packages
■  Internet packages
Comprehensive packages	
■ GCG	
■ EGCG/EMBOSS	
■ Staden	
■ Lasergene	
Packages for DNA analysis
■ Sequencher	
■ VectorNTI	
■ MacVector	
Intranet packages
■  SYNERGY
■  GeneMill, GeneWorld, GeneThesaurus
Internet packages
■  CINEMA
■  EGCG/EMBOSS
■  Alfresco
Protein structure modelling
■  protein structure
■  protein structure databases
■  prediction of secondary structure
■  prediction of protein fold
■  prediction of tertiary structure
■  modelling of protein-ligand complexes
Protein structure
■  Proteins are built from amino acids linked by peptide bonds. Twenty different amino acids occur naturally in proteins.
■  Protein structure can be experimentally determined by X-ray crystallography, nuclear magnetic resonance (NMR) or by electron crystallography.
■  Levels of protein structure:
>- primary structure
>- secondary structure
>- supersecondary structure
>- tertiary structure
>- quaternary structure
Scheme of an amino acid (a) and a polypeptide chain (b)
Side chains of 20 different amino acids that occur in proteins
Levels of protein structure
Primary structure:	the linear sequence of amino acids in a protein molecule
Secondary structure:	regions of local regularity within a protein fold (e.g., α-helices, β-turns, β-strands)
Super-secondary structure:	the arrangement of α-helices and/or β-strands into discrete folding units (e.g., β-barrels, βαβ-units, Greek keys)
Tertiary structure:	the overall fold of a protein sequence, formed by the packing of its secondary and/or super-secondary structure elements
Quaternary structure:	the arrangement of separate protein chains in a protein molecule with more than one sub-unit
Quinternary structure:	the arrangement of separate molecules, such as in protein-protein or protein-nucleic acid interactions
Levels of protein structure
[Figure: primary, secondary, tertiary and quaternary structure]
Synchrotron radiation facility
European Synchrotron Radiation Facility at Grenoble, France
Protein structure databases
■ PDB
■ PDBsum
Protein structure classification databases
■ SCOP
■ CATH
Protein structure databases
■ PDB - Protein Data Bank
>- developed at Brookhaven National Laboratory
>- currently maintained by Research Collaboratory for Structural Bioinformatics (RCSB)
>- world repository of three-dimensional protein structures
>- entries from crystallographic analysis (80%), nuclear magnetic resonance (16%) and modelling (2%)
>- entries stored as flat files composed of section for information records and section for co-ordinates
>- entries identified by unique PDB-ID code (e.g., 1EDE)
>- searchable by keywords
>- interactive visualization of structures
Information on entry from the PDB database (Structure Explorer - 1CV2)
Entry from the PDB database (crystallographic info)
Entry from the PDB database (co-ordinates)
Protein structure databases
■ PDBsum
>- developed at University College London
>- summaries and analyses of protein structures (secondary database derived from PDB)
>- summary of PDB entries: resolution, R-factor, # protein chains, topology, ligands, metal ions, etc.
>- analysis of PDB entries: protein-metal and protein-ligand interactions, protein validation
>- provides links to many related databases
Information on entry from the PDBsum database (PDB id: 1cv2)
Entry from the PDBsum database (secondary elements)
Protein structure classification databases
Classification attempts to capture the structural similarities among proteins. The structural similarities relate to the evolution. The structural similarities may imply function. The classification scheme is dependent on the underlying philosophy.
Protein structure classification databases
■ SCOP - Structural Classification of Proteins
>- developed at MRC Laboratory of Molecular Biology
>- construction: combination of manual and automatic methods (complicated by multidomain proteins)
>- fold = same secondary elements in same arrangement, independently of common evolutionary origin
>- superfamily = low identity but common evolutionary origin implied from common structure and function
>- family = sequence identity >30%
Protein structure classification databases
■ CATH - Class, Architecture, Topology, Homology
>- developed at University College London
>- construction: mostly automatic
>- unique numbering scheme analogous to Enzyme Classification (E.C.) scheme
>- class = gross secondary structure content
>- architecture = gross secondary structure arrangement
>- topology = shape and connectivity of secondary structures (60% of larger protein matches smaller one)
>- homology = sequence identity >35%, common ancestry
>- sequence = clustering based on sequence identity
Prediction of secondary structure
■  Algorithms assign probability for occurrence of α-helix, β-strand, turn and random coil at particular position in the sequence.
■  Methods: statistical, stereochemical and homology/neural networks based. All methods rely on information derived from known 3D structures. Most recent methods use the information from multiple alignments.
■  Reliability of the best current methods is >70%.
Prediction of secondary structure
■  Chou-Fasman and GOR
>- statistical - amino acids show preference for particular secondary structure elements
■  PHD and NNPredict
>- neural networks - the rules for prediction are not defined in advance, they are created by training
■  NNSSP and PREDATOR
>- nearest neighbour approach
■  JPRED
>- consensus approach - utilises multiple alignments and state-of-the-art methods, then derives a consensus prediction
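The statistical idea behind methods such as Chou-Fasman can be sketched with a sliding window over per-residue propensities. This is a simplified windowed-average variant, not the full Chou-Fasman nucleation and extension rules. The propensity values are the helix and strand parameters as commonly quoted from the Chou-Fasman tables and should be verified before serious use; the window size, the cutoff of 1.0 and the neutral value for unknown residues are assumptions.

```python
# (helix propensity, strand propensity) per residue; approximate values
PROPENSITY = {
    "A": (1.42, 0.83), "R": (0.98, 0.93), "N": (0.67, 0.89), "D": (1.01, 0.54),
    "C": (0.70, 1.19), "Q": (1.11, 1.10), "E": (1.51, 0.37), "G": (0.57, 0.75),
    "H": (1.00, 0.87), "I": (1.08, 1.60), "L": (1.21, 1.30), "K": (1.16, 0.74),
    "M": (1.45, 1.05), "F": (1.13, 1.38), "P": (0.57, 0.55), "S": (0.77, 0.75),
    "T": (0.83, 1.19), "W": (1.08, 1.37), "Y": (0.69, 1.47), "V": (1.06, 1.70),
}
NEUTRAL = (1.0, 1.0)  # assumption: unknown residues carry no preference


def predict_states(sequence: str, window: int = 5, cutoff: float = 1.0) -> str:
    """Assign H (helix), E (strand) or C (other) to every position."""
    seq = sequence.upper()
    half = window // 2
    states = []
    for i in range(len(seq)):
        # Window is truncated at the sequence ends
        segment = seq[max(0, i - half): i + half + 1]
        helix = sum(PROPENSITY.get(r, NEUTRAL)[0] for r in segment) / len(segment)
        strand = sum(PROPENSITY.get(r, NEUTRAL)[1] for r in segment) / len(segment)
        if helix >= strand and helix > cutoff:
            states.append("H")
        elif strand > helix and strand > cutoff:
            states.append("E")
        else:
            states.append("C")
    return "".join(states)


# Hypothetical sequence fragment
print(predict_states("MKTAYIAKQRQISFVKSHFSRQ"))
```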
Comparison of secondary structure predictions
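Predictions from several methods can be compared column by column and combined by majority vote. The sketch below illustrates that idea only; it is not the actual JPRED procedure, and the function name, the tie-breaking rule (ties fall back to coil) and the example strings are assumptions.

```python
from collections import Counter
from typing import Sequence


def consensus_prediction(predictions: Sequence[str], tie_state: str = "C") -> str:
    """Majority vote per position over equally long state strings."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have equal length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column).most_common()
        top_state, top_count = counts[0]
        # A tie between the two most frequent states falls back to tie_state
        if len(counts) > 1 and counts[1][1] == top_count:
            result.append(tie_state)
        else:
            result.append(top_state)
    return "".join(result)


# Three hypothetical predictions for the same four residues
print(consensus_prediction(["HHHC", "HHEC", "HEEC"]))
```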
Prediction of protein fold
■ Threading
>- threading = protein fold assignment or fold recognition
>- target sequence is searched against database of folds (3D profiles) and threaded models are constructed
>- 3D profile - each residue in 3D structure is assigned environmental variables (buried area, fraction of side chain covered by polar atoms, secondary structure, etc.). Assumption - environment of the residue should be more conserved than the residue itself.
>- residue can also be described by its interactions
>- match of target sequence with 3D profile (quality of threaded models) is quantified by Z-score or energy
>- limitation: cannot handle multi-domain proteins
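Scoring a sequence against a 3D profile and converting the raw score into a Z-score can be sketched as follows. Everything here is a toy assumption: the two environment classes (B = buried, E = exposed), the two-class residue alphabet, the +1/-1 scoring table and the gapless threading. Real 3D profiles use many more environment classes and gapped alignment. In this sketch a higher score means a better fit, whereas energy-based methods use the opposite sign.

```python
import random
import statistics

HYDROPHOBIC = set("AVILMFWC")  # assumption: crude two-class residue alphabet


def profile_score(sequence: str, environments: str) -> int:
    """+1 for hydrophobic-in-buried or polar-in-exposed, -1 otherwise."""
    if len(sequence) != len(environments):
        raise ValueError("sequence and profile must have equal length")
    score = 0
    for residue, env in zip(sequence.upper(), environments):
        hydrophobic = residue in HYDROPHOBIC
        buried = env == "B"
        score += 1 if hydrophobic == buried else -1
    return score


def z_score(sequence: str, environments: str,
            n_shuffles: int = 200, seed: int = 0) -> float:
    """Raw score expressed in standard deviations above shuffled sequences."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    raw = profile_score(sequence, environments)
    residues = list(sequence.upper())
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        background.append(profile_score("".join(residues), environments))
    spread = statistics.pstdev(background)
    if spread == 0:
        return 0.0
    return (raw - statistics.mean(background)) / spread


# Hypothetical target sequence and fold profile
print(profile_score("LLLLVVVVKKKKEEEE", "BBBBBBBBEEEEEEEE"))
print(z_score("LLLLVVVVKKKKEEEE", "BBBBBBBBEEEEEEEE") > 0)
```

Shuffling keeps the amino acid composition fixed, so the Z-score reflects how well the order of residues fits the fold rather than the composition alone.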
Fold recognition by threading
Prediction of protein fold
■  Bioinbgu
>- consensus method utilising predictions from five different algorithms
■  3D-PSSM
>- scoring functions: 1D-PSSMs (sequence profiles built from relatively close homologues), 3D-PSSMs (more general profiles containing more remote homologues), matching of secondary structure elements, and propensities of the residues for solvent accessibility
■  GenThreader
>- hybrid method: profile-based alignment, evaluation of alignments by threading, evaluation of threaded models by neural network
Prediction of tertiary structure
■  Ab initio
>- 3D structure of a protein is predicted from first principles (search for global minimum structure)
>- current algorithms are not very reliable
■  Homology modelling
>- 1. alignment of modelled sequence against sequences of structurally similar proteins (templates)
>- 2. "extraction" of the backbone from template structure and positioning of side chains
>- 3. modelling of loops
>- 4. structure refinement and validation
Validation of protein models
[Figure: Quality of Protein Model, assessed by three groups of criteria]
■  Stereochemical Accuracy: torsion angle distributions (Ramachandran plots, side-chain torsion angles), planarity of peptide bonds, chirality of C-atoms, bond lengths, bond angles, planarity (aromatic ring systems)
■  Packing Quality: interatomic distances ("atomic contact quality"), secondary structure (location and geometry), hydrophobicity, hydrogen bond donors/acceptors
■  Folding Reliability: template structure (RMS deviations between backbone), 3D-1D profiles, potentials
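The RMS deviation between the backbone of a model and its template can be computed as sketched below. The sketch assumes that both structures are already superimposed and that atoms are paired one-to-one in the same order; the function name and the coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """Root-mean-square deviation between paired atoms, in input units."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical backbone atom positions (e.g. in angstroms)
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0)]
template = [(0.1, 0.0, 0.0), (1.4, 0.2, 0.0), (3.1, 0.4, 0.1)]
print(round(rmsd(model, template), 3))
```

In practice the two structures are first optimally superimposed; without that step the value also contains the rigid-body offset between them.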
Prediction of tertiary structure
■  SWISS-MODEL
>- fully automated modelling server
>- input = protein sequence; output = PDB file
>- 1. search of ExNRL-3D using BLASTP for potential templates; 2. select all templates with sequence identities above 25%; 3. generate structures of 3D models; 4. energy minimise models using GROMOS 96
>- first approach mode and optimise mode (Swiss-PDBViewer)
■  MODELLER
>- most widely used academic program for homology modelling (satisfaction of spatial restraints)
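The template selection step of an automated pipeline (keep every hit whose sequence identity to the target exceeds 25%) can be sketched as a simple filter. This is not the SWISS-MODEL code or API: the record type, the identifiers and the identity values are hypothetical, and a real pipeline would obtain them from a database search.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TemplateHit:
    template_id: str   # hypothetical identifier of a known structure
    identity: float    # percent sequence identity to the target


def select_templates(hits: Iterable[TemplateHit],
                     min_identity: float = 25.0) -> List[TemplateHit]:
    """Keep hits above the identity cutoff, most similar first."""
    chosen = [hit for hit in hits if hit.identity > min_identity]
    return sorted(chosen, key=lambda hit: hit.identity, reverse=True)


# Hypothetical search results
hits = [
    TemplateHit("tmplA", 62.0),
    TemplateHit("tmplB", 24.9),
    TemplateHit("tmplC", 31.5),
]
for hit in select_templates(hits):
    print(hit.template_id, hit.identity)
```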
Modelling of protein-ligand complexes
■  Docking
>- positioning of small organic molecules (ligands) inside the protein active site
>- different orientations and conformations of the ligand are evaluated using geometric or energetic scoring functions
>- protein-ligand interaction energy = van der Waals term + electrostatic term + H-bond term + entropic term
>- flexible docking - considers different conformations of ligand; different rotamers of protein side chains
■  Software: DOCK, AUTODOCK, FLEXX
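The additive form of an energetic scoring function (van der Waals + electrostatic + H-bond + entropic term) can be sketched as below. This is a toy illustration and not the scoring function of any of the programs listed: the 12-6 Lennard-Jones form, the single set of Lennard-Jones parameters for all atom pairs, the constant dielectric, the fixed energy per hydrogen bond and the penalty per rotatable bond are all assumptions, as are the numeric values.

```python
import math
from dataclasses import dataclass
from typing import Sequence

COULOMB_CONSTANT = 332.06  # kcal*angstrom/(mol*e^2)


@dataclass(frozen=True)
class Atom:
    x: float
    y: float
    z: float
    charge: float  # partial charge in units of e


def distance(a: Atom, b: Atom) -> float:
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def lennard_jones(r: float, epsilon: float = 0.1, r_min: float = 3.8) -> float:
    """12-6 potential with minimum -epsilon at r = r_min (illustrative values)."""
    ratio = r_min / r
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)


def coulomb(q1: float, q2: float, r: float, dielectric: float = 4.0) -> float:
    """Electrostatic energy of two point charges in a uniform dielectric."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def interaction_energy(protein: Sequence[Atom], ligand: Sequence[Atom],
                       n_hbonds: int = 0, n_rotatable_bonds: int = 0,
                       hbond_energy: float = -2.0,
                       entropy_penalty: float = 0.3) -> float:
    """Sum of the four terms; lower values mean a more favourable pose."""
    energy = 0.0
    for p in protein:
        for atom in ligand:
            r = distance(p, atom)  # overlapping atoms (r = 0) are not handled
            energy += lennard_jones(r) + coulomb(p.charge, atom.charge, r)
    energy += n_hbonds * hbond_energy              # H-bond term
    energy += n_rotatable_bonds * entropy_penalty  # entropic term
    return energy


# Hypothetical one-atom "protein" and one-atom "ligand"
site = [Atom(0.0, 0.0, 0.0, -0.5)]
pose = [Atom(3.8, 0.0, 0.0, 0.4)]
print(round(interaction_energy(site, pose, n_hbonds=1, n_rotatable_bonds=2), 2))
```

A docking run would evaluate such a function for many orientations and conformations of the ligand and keep the poses with the lowest energy.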