Bioinformatics
Jiří Damborský
National Center for Biomolecular Research
jiri@chemi.muni.cz, ph. 41129 377, Kotlářská 2, bld. 7, 2nd floor

Bioinformatics - what is it?
■ The term bioinformatics is used to encompass almost all computer applications in biological sciences.
■ Information technology applied to the management and analysis of biological data
■ originally - analysis of sequence data (1980s)
■ presently - also analysis of 3D structures

Bioinformatics - study material
■ Introduction to Bioinformatics, T.K. Attwood and D.J. Parry-Smith, Longman, Essex, 1999.
■ copy of the slides
■ http://www.chemi.muni.cz/~jiri
■ http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/prefacefrm.html

Bioinformatics - composition
12 lectures per semester, 3 hours per week
■ 1st and 2nd hour = lectures - theory
■ 3rd hour = practical course on computers

Bioinformatics - lectures
■ Introduction
■ Information networks
■ Protein information resources
■ Genome information resources
■ DNA sequence analysis
■ Pairwise sequence alignment
■ Multiple sequence alignment
■ Secondary database searching
■ Analysis packages
■ Protein structure modelling

Bioinformatics - practical training
■ Biological databases
■ Searching and modelling servers
■ Building a sequence search protocol
■ Case examples
■ Protein structure prediction
■ Protein modelling
■ Follow-up of lectures

Introduction
■ history of sequencing
■ what is Bioinformatics?
■ sequence to structure deficit
■ genome projects
■ why is Bioinformatics important?
■ pattern recognition and prediction
■ folding problem
■ sequence analysis
■ homology/analogy and orthology/paralogy

History of sequencing
■ Protein sequencing
  >- separation of peptides, identification and quantification of amino acids
  >- Edman degradation
  >- mass spectrometry - advantage in identification of post-translational modifications
  >- 1955 sequencing of the peptide insulin
  >- 1960 sequencing of the enzyme ribonuclease
  >- 1980s automated sequencers

History of sequencing
■ Nucleic acid sequencing
  >- tRNA - short, could be purified
  >- DNA - large (human chromosome 55-250 × 10^6 bp); the longest fragment for sequencing is 500 bp; purification is problematic
  >- advent of gene cloning and PCR
  >- 1972 DNA cloning
  >- 1975 DNA sequencing
  >- 1980s and 1990s sequence revolution

[Figure: The history of technology developments and structure determination]
[Figure: Transfer RNA: (a) primary sequence and (b) tertiary structure]
[Figure: Automatic sequencing machine ABI Prism 310, Applied Biosystems]
[Figure: Automated production line in a sequencing "factory", Whitehead Institute, Center for Genome Research, USA]
[Figure: Sequencing chromatogram]

What is Bioinformatics?
■ improvements in DNA sequencing technologies and computer-based technologies
■ originally - analysis of sequence data (1980s)
■ presently - also analysis of 3D structures
■ The term bioinformatics is used to encompass almost all computer applications in biological sciences.
■ Information technology applied to the management and analysis of biological data.
[Figure: The sequence to structure deficit; x-axis: date of database release]

Genome projects
■ 1977 first complete genome - virus φX174, 5000 nucleotides, 11 genes
■ 1995 first complete genome of a living organism - Haemophilus influenzae, 1.8 million nucleotides and 1700 genes
■ sequencing of model systems: Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Canis familiaris, Mus musculus

[Figure: The genome size of various species (in bits and nucleotides) compared with GenBank releases: φX174 phage (1977), λ phage (1982), cytomegalovirus (1990), Haemophilus influenzae (1995), Escherichia coli (1997), budding yeast (1997), Arabidopsis, rice, mouse, human]

Comparative genomic analysis of model organisms
Organism                               Genome size (Mb)   Gene number           Haploid chromosome number
Bacterium (Escherichia coli)           ~4                 4,403                 1
Yeast (Saccharomyces cerevisiae)       ~12                6,190                 16
Worm (Caenorhabditis elegans)          97                 19,730                6
Fruit fly (Drosophila melanogaster)    120                13,601                4
Mouse (Mus musculus)                   3,454              ~50,000 (estimated)   20
Human (Homo sapiens)                   2,910              33,609                23

Human Genome Project
■ Human Genome Project initiated in the mid-1980s
■ estimated 100,000 genes and completion in 2005
■ need for automated sequencing and improved computational techniques
■ shotgun method
■ sequencing of a rough draft first
■ first draft completed in 2000 by the publicly funded international consortium (Human Genome Project) and the company Celera Genomics

Human Genome Project
■ ~33,000 genes
■ genes are complex due to alternative splicing
■ >1,000,000 proteins (estimated)
■ hundreds of genes resulted from horizontal transfer from bacteria (in the vertebrate lineage)
■ dozens of genes derived from transposable elements (their activity, however, has declined)
■ the mutation rate in males is twice as high as in females
■ >1,400,000 single nucleotide polymorphisms (SNPs)

Why is bioinformatics important?
■ last 20-30 years - structural biology
■ new era - bioinformatics - due to genome projects and the sequence/structure deficit
■ biological function is not known for about 50% of all genes in every sequenced genome
■ role of bioinformatics
  >- data management and storage
  >- data analysis = conversion of primary sequence to biological knowledge

[Figure: Pattern recognition versus prediction]

Levels of protein structure
■ Primary structure: the linear sequence of amino acids in a protein molecule
■ Secondary structure: regions of local regularity within a protein fold (e.g., α-helices, β-turns, β-strands)
■ Super-secondary structure: the arrangement of α-helices and/or β-strands into discrete folding units (e.g., β-barrels, βαβ-units, Greek keys)
■ Tertiary structure: the overall fold of a protein sequence, formed by the packing of its secondary and/or super-secondary structure elements
■ Quaternary structure: the arrangement of separate protein chains in a protein molecule with more than one subunit
■ Quinternary structure: the arrangement of separate molecules, such as in protein-protein or protein-nucleic acid interactions

Homology and analogy
■ Sequences are said to be homologous if they are related by divergence from a common ancestor.
■ Proteins can share similar folds (e.g., β-barrel) or similar catalytic residues (e.g., serine proteases) without any sequence similarity. Convergence to similar biological solutions from different evolutionary starting points results in analogy.
■ Sequence analysis assumes homologous proteins.
■ Homology is not a measure of similarity: sequences either share a common ancestor or they do not.
[Figure: Application areas of different analysis methods; axis: percent identity; regions: alignment methods (automatic pairwise methods), twilight zone (profile methods), midnight zone (structure prediction)]

Orthology and paralogy
■ Proteins performing the same function in different species - orthologues.
■ Proteins performing different, but related functions within the same organism - paralogues.
■ Sequence comparison of orthologous proteins is the basis of phylogenetic analysis.

[Figure: Modularity of proteins = difficulties of homology searches]

Information networks
■ what is the Internet?
■ how do computers find each other?
■ FTP and Telnet
■ what is the World Wide Web?
■ HTTP, HTML and URL
■ EMBnet, EBI, NCBI
■ SRS and ENTREZ

What is the Internet?
■ Global network of computer networks that links government, academic and business institutions.
■ communication by TCP/IP (Transmission Control Protocol/Internet Protocol)
■ computers - nodes, data - packets
■ packets may not be transferred directly from one computer to another

How do computers find each other?
■ Each computer is assigned an IP address
  >- numeric form: 147.251.28.2
  >- name form (machine.site.domain): bilbo.chemi.muni.cz
■ FTP - File Transfer Protocol
■ Telnet - remote connection

Example of Internet domains and subdomains
Country-based domains     Other domains          Subdomains
Australia        .au      Educational    .edu    Academic             .ac
Denmark          .dk      Commercial     .com    Company              .co
Finland          .fi      Governmental   .gov    Other organisation   .org
France           .fr      Military       .mil    General              .gen
Germany          .de
Greece           .gr
Hungary          .hu
Ireland          .ie
Israel           .il
Italy            .it
Netherlands      .nl
New Zealand      .nz
Poland           .pl
Portugal         .pt
South Africa     .za
Spain            .es
Sweden           .se
Switzerland      .ch
United Kingdom   .uk
USA              .us

What is the World Wide Web?
■ Developed at CERN - the European Laboratory for Particle Physics.
■ The purpose was sharing of information.
■ Hypermedia-based information system.
■ The most advanced information system found on the Internet.
■ Very popular - almost synonymous with the Internet.
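The machine.site.domain naming scheme from the slide "How do computers find each other?" can be illustrated with a short sketch. It only takes the two example addresses from that slide apart; it performs no network lookup. The helper name `split_hostname` is hypothetical and introduced here for illustration only.

```python
import ipaddress


def split_hostname(hostname: str) -> dict:
    """Split a dotted host name into machine, site and domain parts.

    Assumption: the first label is the machine, the last label is the
    top-level domain, and everything in between is treated as the site.
    """
    labels = hostname.split(".")
    return {
        "machine": labels[0],
        "site": ".".join(labels[1:-1]),
        "domain": labels[-1],
    }


# Numeric form of the address: parsed and validated by the standard library.
address = ipaddress.ip_address("147.251.28.2")

# Name form of the address, split into its parts.
parts = split_hostname("bilbo.chemi.muni.cz")

print(address.version, parts)
```

Here the last label, cz, is a country-based domain of the kind listed in the table of domains and subdomains.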
Web browsers
■ A browser is the client communicating with servers using standard protocols.
■ The home page is the first point of contact between the browser and the server.
  >- Lynx - academic, VT100 terminal
  >- Mosaic - academic, X-windows
  >- Netscape Navigator - commercial
  >- Internet Explorer - commercial

HTTP, HTML and URL
■ HTTP - HyperText Transfer Protocol
  >- documents displayed by browsers are written in hypertext and transferred by HTTP
■ HTML - HyperText Markup Language
  >- standard language for writing hypertext
■ URL - Uniform Resource Locator
  >- unique address of a document
  >- example: http://www.chemi.muni.cz/~jiri

EMBnet, EBI, NCBI
■ EMBnet - network of European biocomputing and bioinformatics laboratories, established in 1988.
■ Eliminates the need for multiple copies of biological databases and retrieval software.
■ Hinxton Hall = Sanger Centre + MRC Human Genome Mapping Project Resource Centre + European Bioinformatics Institute (EBI)
■ National Center for Biotechnology Information (NCBI)

SRS, ENTREZ and LinkDB
■ SRS - The Sequence Retrieval System
  >- maintained by EBI
  >- network browser for databases in molecular biology
  >- allows indexing of flat-file databases
  >- allows customised search of selected databases
  >- links databanks: sequence, structure, bibliography, etc.
■ ENTREZ
  >- integrates the databases of NCBI
  >- less flexible than SRS
  >- valuable concept of neighbouring
  >- links databanks: DNA and protein sequences, genome data, structural data, PubMed bibliography

SRS, ENTREZ and LinkDB
■ LinkDB
  >- maintained by the Institute for Chemical Research, Japan
  >- network browser for databases in DBGET and KEGG (Kyoto Encyclopedia of Genes and Genomes)
  >- links databanks: sequence, motifs, structure, amino acid properties, ligands, metabolic pathways

[Figure: Network of databases linked via SRS]
[Figure: Network of databases linked via ENTREZ (MEDLINE, sequences)]
[Figure: Network of databases linked via LinkDB (DBGET database links)]

Protein information resources
■ biological databases - introduction
■ primary protein sequence databases
■ composite protein sequence databases
■ secondary databases
■ composite secondary databases
■ protein structure databases
■ protein structure classification databases

Biological databases - introduction
■ Vast amounts of data are produced - databases must be established for storage of the data.
■ Databases must be maintained and disseminated together with the analysis tools.
■ Classification of databases
  >- flat files
  >- relational
  >- object-oriented
  >- primary
  >- secondary
  >- composite

[Figure: Entry from a flat file database]
[Figure: Relational database]
[Figure: Object-oriented database]
[Figure: Levels of protein structure and corresponding databases: primary sequence (primary database), secondary motif (secondary database), tertiary domain or module (structure database)]

Primary protein sequence databases
■ PIR
■ MIPS
■ SWISS-PROT
■ TrEMBL
■ NRL-3D
Store biomolecular sequences and annotations.

Primary protein sequence databases
■ PIR - Protein Sequence Database
  >- established in the 1960s by Margaret Dayhoff
  >- maintained by an international consortium
  >- four sections, PIR1-PIR4
     PIR1 - fully classified and annotated entries
     PIR2 - preliminary entries
     PIR3 - unverified entries
     PIR4 - conceptual translations of artefactual sequences, non-transcribed, non-translated
■ MIPS - Martinsried Institute for Protein Sequences
  >- collects and processes sequence data for PIR

Primary protein sequence databases
■ SWISS-PROT
  >- University of Geneva → EBI → Swiss Institute of Bioinformatics
  >- high-level annotations including description of the function, structure and domains, post-translational modifications, variants, etc.
  >- annotated manually (high quality)
  >- automatically annotated = TrEMBL
  >- minimally redundant
  >- interlinked with many other sources
  >- efficient searching of selected fields only
  >- most widely used protein sequence database

Primary protein sequence databases
■ TrEMBL - Translated EMBL
  >- computer-annotated supplement of SWISS-PROT
  >- contains translations of all coding sequences in EMBL
  >- SP-TrEMBL (SWISS-PROT TrEMBL), REM-TrEMBL
■ NRL-3D
  >- produced by PIR from sequences extracted from the Brookhaven Protein Databank (PDB)
  >- annotations in PIR format including structural information extracted from PDB: secondary structure elements, active sites, experimental method, resolution
  >- makes sequence information in PDB searchable by keywords and similarity

Composite protein sequence databases
■ NRDB
■ OWL
■ MIPSX
■ SWISS-PROT+TrEMBL
Amalgamate a number of primary sources, using a set of clearly defined criteria.

Composite protein sequence databases
■ NRDB - Non-Redundant DataBase
  >- developed and maintained by NCBI
  >- composite: GenPept (CDS translations of GenBank), GenPeptupdate, PDB sequences, SWISS-PROT, SWISS-PROTupdate, PIR
  >- advantages: comprehensive and up-to-date
  >- disadvantages: not fully non-redundant (only identical copies removed), occurrence of multiple entries due to polymorphism, incorrect sequences amended in SWISS-PROT are re-introduced by translation of GenBank
  >- default database of the NCBI BLAST (ENTREZ/NCBI)

Composite protein sequence databases
■ OWL
  >- developed and maintained by the University of Leeds
  >- composite: SWISS-PROT, PIR1-4, GenBank, NRL-3D
  >- SWISS-PROT has the highest priority for annotation
  >- advantages: less redundant, fully indexed (fast)
  >- disadvantages: not up-to-date (released every 6-8 weeks), incorrect sequences
  >- available from SEQNET of UK EMBnet

Composite protein sequence databases
■ MIPSX
  >- developed by the Max-Planck Institute in Martinsried
  >- composite: PIR1-4, MIPS, NRL-3D, SWISS-PROT, TrEMBL, GenPept, Kabat, PSeqIP
  >- identical entries and subsequences removed
■ SWISS-PROT+TrEMBL
  >- developed and maintained by EBI
  >- composite: SWISS-PROT, TrEMBL
  >- advantages: comprehensive, minimally redundant, fewer errors
  >- disadvantages: not as up-to-date as NRDB
  >- available from SRS of EBI

Overview of primary sources of composite databases
NRDB               OWL          MIPSX        SP+TrEMBL
PDB                SWISS-PROT   PIR1-4       SWISS-PROT
SWISS-PROT         PIR          MIPSOwn      TrEMBL
PIR                GenBank      MIPSTrn
GenPept            NRL-3D       MIPSH
SWISS-PROTupdate                PIRMOD
GenPeptupdate                   NRL-3D
                                SWISS-PROT
                                EMTrans
                                GBTrans
                                Kabat
                                PSeqIP

Secondary databases
■ Contain information derived from primary sequence data, typically in the form of abstractions: regular expressions, fingerprints, blocks, profiles or Hidden Markov Models.
■ These abstractions represent distillations of the most conserved features of multiple alignments.
■ The abstractions are useful for deciding family membership of newly determined sequences.

[Figure: Terms used in sequence analysis methods, e.g. the regular expression C-Y-X2-[DG]-G-x-[ST]]
[Figure: Three principal methods for building secondary databases, including alignment methods (profile library) and Hidden Markov Models (Pfam)]
[Figure: Implication of function from a sequence: DNA-binding proteins]

Secondary databases
■ PROSITE
■ PRINTS
■ BLOCKS
■ Profiles
■ Pfam
■ IDENTIFY

Secondary databases
■ PROSITE
  >- historically the first secondary database
  >- maintained by the Swiss Institute of Bioinformatics
  >- motivation: identification of protein families
  >- abstraction: regular expressions (patterns)
  >- construction: automatic multiple alignment and manual extraction of conserved regions
  >- ideally, patterns should identify only true positives (no false positives)
  >- entries deposited as two distinct files: pattern file and documentation file
  >- primary source: SWISS-PROT

[Figure: Pattern file of an entry from the PROSITE database]
Secondary databases
■ PRINTS
  >- developed at University College London
  >- motivation: identification of protein families by more than one pattern
  >- abstraction: fingerprints (aligned motifs); fingerprints store the original sequence information
  >- construction: sequence information in seed motifs is augmented through iterative database scanning
  >- construction of fingerprints is done manually
  >- primary source (original): OWL
  >- primary source (new): SWISS-PROT and SP-TrEMBL

[Figure: Pattern file of an entry from the PRINTS database (I)]
[Figure: Pattern file of an entry from the PRINTS database (II): initial motif sets]
Secondary databases
■ BLOCKS (abstraction: blocks)
■ Profiles (abstraction: profiles)
■ Pfam (abstraction: Hidden Markov Models)
■ IDENTIFY
  >- developed at Stanford University
  >- abstraction: motifs encoded by a fuzzy approach (alternative residues are tolerated in motifs)
  >- construction: automatically derived using the program eMOTIF
  >- primary sources: PRINTS and BLOCKS

Properties of amino acids used in eMOTIF
Residue property      Residue groups
Small                 Ala, Gly
Small hydroxyl        Ser, Thr
Basic                 Lys, Arg
Aromatic              Phe, Tyr, Trp
Basic                 His, Lys, Arg
Small hydrophobic     Val, Leu, Ile
Medium hydrophobic    Val, Leu, Ile, Met
Acidic/amide          Asp, Glu, Asn, Gln
Small/polar           Ala, Gly, Ser, Thr, Pro

Overview of primary sources and stored information in secondary databases
Secondary database   Primary source    Stored information
PROSITE              SWISS-PROT        Regular expressions (patterns)
Profiles             SWISS-PROT        Weighted matrices (profiles)
PRINTS               OWL*              Aligned motifs (fingerprints)
Pfam                 SWISS-PROT        Hidden Markov Models (HMMs)
BLOCKS               PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY             BLOCKS/PRINTS     Fuzzy regular expressions (patterns)

Composite secondary databases
■ INTERPRO - Integrated Resource of Protein Families, Domains and Sites
  >- developed by EBI, SIB, University of Manchester, Sanger Centre, GENE-IT, CNRS/INRA, LION Bioscience AG and University of Bergen (European Research Project)
  >- provides an integrated view of the commonly used secondary databases: PROSITE, PRINTS, SMART, Pfam and ProDom
  >- accessible by ftp, www and via member databases

[Figure: InterPro dataflow scheme]

Protein structure databases
■ PDB
■ PDBsum

Protein structure classification databases
■ SCOP
■ CATH

Genome information resources
■ primary DNA sequence databases
■ specialised DNA sequence databases

Primary DNA sequence databases
■ EMBL
■ DDBJ
■ GenBank
■ dbEST
■ GSDB
Store DNA sequences and annotations.
Primary DNA sequence databases
■ EMBL - European Molecular Biology Laboratory
  >- European Bioinformatics Institute (EBI)
  >- collaboration with DDBJ and GenBank - exchange of new entries on a daily basis
  >- source of sequences: direct author submissions, genome projects, scientific literature, patents
  >- rate of growth is exponential, with a doubling time of ~9-12 months
  >- most entries are from model organisms
  >- retrieval through SRS

Primary DNA sequence databases
■ DDBJ - DNA Data Bank of Japan
  >- National Institute of Genetics
  >- collaboration with EMBL and GenBank
  >- retrieval through DBGET
■ GenBank
  >- National Center for Biotechnology Information (NCBI)
  >- collaboration with DDBJ and EMBL
  >- data split into 17 divisions
  >- retrieval through Entrez

Codes for 17 divisions of GenBank
Division   Sequence subset
PRI        Primate
ROD        Rodent
MAM        Other mammalian
VRT        Other vertebrate
INV        Invertebrate
PLN        Plant, fungal, algal
BCT        Bacterial
RNA        Structural RNA
VRL        Viral
PHG        Bacteriophage
SYN        Synthetic
UNA        Unannotated
EST        EST (Expressed Sequence Tags)
PAT        Patent
STS        STS (Sequence Tagged Sites)
GSS        GSS (Genome Survey Sequences)
HTG        HTG (High Throughput Genomic Sequences)

[Figure: Entry from the GenBank database]

Primary DNA sequence databases
■ dbEST
  >- National Center for Biotechnology Information (NCBI)
  >- maintains only Expressed Sequence Tag (EST) data
■ GSDB - Genome Sequence DataBase
  >- National Center for Genome Resources
  >- complete collections of DNA sequences for genome-sequencing laboratories
  >- on-line submission of large-scale data
  >- quality checks
  >- format consistent with GenBank + GSDBID

Specialised DNA sequence databases
■ SGD
■ UniGene
■ TDB
■ ACeDB
Store species-specific and technique-specific DNA sequences.

Specialised DNA sequence databases
■ SGD - Saccharomyces Genome Database
  >- molecular biology and genetics of S. cerevisiae
  >- complete genome, genes, proteins, phenotypes
  >- first eukaryotic genome sequenced (1996)
  >- sequence analysis, register of genes, 3D structural data, primer sequences for cloning
■ UniGene
  >- collection of genes encoding proteins (transcript map)
  >- non-redundant; derived from GenBank
  >- data organised in clusters (1 cluster = 1 unique gene)
  >- gene-mapping projects and gene expression analysis

Specialised DNA sequence databases
■ TDB - TIGR Database
  >- suite of databases: DNA and protein sequences, gene expression, protein families, taxonomic data
  >- links: TIGR microbial genome sequencing projects, parasite databases, gene index projects, A. thaliana database, human genomic dataset
■ ACeDB - A Caenorhabditis elegans DataBase
  >- C. elegans genome project
  >- restriction maps, gene structural information, cosmid maps, sequence data, bibliographic information
  >- software to organise data: ACEDB (CGI script and Perl)

[Figure: ACEDB software for organisation of genomic data]

DNA sequence analysis
■ why analyse DNA?
■ gene structure
■ gene sequence analysis
■ expression profile, cDNA, EST
■ EST sequence analysis

Why analyse DNA?
■ The most sensitive comparisons between sequences are made at the protein level because of the redundancy of the genetic code.
■ The loss of degeneracy is accompanied by a loss of information directly linked to evolution - proteins are only functional abstractions of genetic events at the DNA level.
■ Silent mutations, important for phylogenetic analysis, cannot be detected at the protein level.
■ Exon/intron analysis and open reading frame (ORF) analysis cannot be performed at the protein level.
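The point about silent mutations can be made concrete with a small sketch. It assumes the standard genetic code and uses three short made-up DNA strings (`original`, `silent` and `missense` are illustrative inputs, not data from the lecture). Two sequences that differ in the third base of a codon translate to the same protein, so the change is visible only at the DNA level.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order;
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Conceptually translate a DNA string, reading frame 1, one-letter codes."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_differences(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


original = "ATGGCTCTG"  # Met-Ala-Leu
silent = "ATGGCCCTG"    # GCT -> GCC, still Ala
missense = "ATGGATCTG"  # GCT -> GAT, Ala -> Asp

for name, dna in (("original", original), ("silent", silent), ("missense", missense)):
    print(name, dna, translate(dna), count_differences(original, dna))
```

The silent variant differs from the original at one DNA position and at no protein position, which is exactly the information that a protein-level comparison loses.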
The genetic code
(rows: first base; columns: second base; right-hand column: third base)
          T          C          A          G
T    TTT Phe    TCT Ser    TAT Tyr    TGT Cys     T
     TTC Phe    TCC Ser    TAC Tyr    TGC Cys     C
     TTA Leu    TCA Ser    TAA Stop   TGA Stop    A
     TTG Leu    TCG Ser    TAG Stop   TGG Trp     G
C    CTT Leu    CCT Pro    CAT His    CGT Arg     T
     CTC Leu    CCC Pro    CAC His    CGC Arg     C
     CTA Leu    CCA Pro    CAA Gln    CGA Arg     A
     CTG Leu    CCG Pro    CAG Gln    CGG Arg     G
A    ATT Ile    ACT Thr    AAT Asn    AGT Ser     T
     ATC Ile    ACC Thr    AAC Asn    AGC Ser     C
     ATA Ile    ACA Thr    AAA Lys    AGA Arg     A
     ATG Met    ACG Thr    AAG Lys    AGG Arg     G
G    GTT Val    GCT Ala    GAT Asp    GGT Gly     T
     GTC Val    GCC Ala    GAC Asp    GGC Gly     C
     GTA Val    GCA Ala    GAA Glu    GGA Gly     A
     GTG Val    GCG Ala    GAG Glu    GGG Gly     G

Gene structure
■ Eukaryotic genes are more complex than prokaryotic genes due to the presence of introns.
■ DNA databases typically contain genomic data: untranslated sequences, introns+exons, mRNA, cDNA.
■ Gene products (proteins) can be of different lengths, because not all exons need be present in the final mRNA.
■ Proteins of different length originating from a single sequence are called splice variants.

[Figure: Central dogma of molecular biology: genomic DNA (sense strand: 5' UTR, exons, introns, 3' UTR) → transcription → mRNA (5' UTR, CDS, 3' UTR) → translation → protein]

Gene structure
■ Untranslated regions (UTRs)
  >- portions of the sequence flanking the coding sequence (CDS) that are not translated into protein
  >- UTRs (especially at the 3' end) are highly gene/species specific
■ Exons
  >- protein-coding DNA sequences of a gene
■ Introns
  >- DNA sequences interrupting the protein-coding DNA sequence of a gene
  >- transcribed into RNA but edited out during post-transcriptional modification

Gene sequence analysis
■ Conceptual translation - theoretical translation of the DNA sequence to the protein sequence using the genetic code, without biochemical support.
■ Six-frame translation (three reading frames on each strand) results in six potential protein sequences (ORF analysis).
■ ORF analysis
  >- codon for methionine - initial codon in the CDS
  >- sufficient CDS length - long ORFs rarely occur by chance
  >- pattern of codon usage - species specific
  >- bias towards G/C in the third base of a codon - species specific

Expression profile, cDNA, EST
■ Hierarchy of genomic information
  >- the human genome consists of ~3 billion bp
  >- ~3% of the DNA is coding sequence → mRNA → protein
  >- the rest of the genome is needed for compact structure of chromosomes, replication, control of transcription, etc.
  >- 1. chromosomal genome (genome) - genetic information common to every cell in the organism
  >- 2. expressed genome (transcriptome) - part of the genome expressed in a cell at a specific stage in its development
  >- 3. proteome - protein molecules that interact to give the cell its individual character

Expression profile, cDNA, EST
■ Expression profile
  >- characteristic range of genes expressed at a particular stage of development and functioning
  >- the goal of genome projects is to sequence the entire (chromosomal) genome
  >- having complete sequences and knowing what they mean are two distinct stages of understanding a genome
  >- an alternative approach is analysis of the parts of the genome expressed in a cell at a specific stage in its development
  >- comparison of expression profiles: identification of abnormal expression, expression levels
  >- interesting for industry - gene discovery, drug design

Expression profile, cDNA, EST
■ Complementary DNA (cDNA)
  >- DNA that is synthesised from a messenger RNA template using the enzyme reverse transcriptase
  >- cDNA captures the expression profile
  >- preparation: cultivation/isolation of cells, mRNA extraction, reverse transcription of mRNA to cDNA, transformation of cDNA into a library, sequencing of randomly chosen clones (100,000 out of 2 million)
  >- ideally 100,000 sequences of 200-400 bp length - expressed sequence tags (ESTs)
  >- in reality many failures, so the number of sequences is lower
  >- the number of clones constructed and sequenced must be large enough to represent the expression profile

[Figure: Origin of complementary DNA and expressed sequence tags: genomic DNA (sense strand: 5' UTR, exons, introns, 3' UTR) → transcription → mRNA (5' UTR, CDS, 3' UTR) → translation → protein; ESTs map to the UTRs and CDS of the mRNA]

Expression profile, cDNA, EST
■ Libraries of ESTs
  >- Merck/IMAGE - 300,000 ESTs from a variety of normalised libraries - higher chance to capture different genes; expression levels not known; sequences deposited in dbEST
  >- Incyte - quantitative information on expression levels - standardised libraries; expression profiles in healthy and diseased tissues; sequences form the commercial database LifeSeq
  >- TIGR - TIGR Human Gene Index - integrates results from human gene projects (dbEST+GenBank) - the purpose is to identify all possible human genes by sequence assembly - creates Tentative Human Consensus (THC) sequences and contigs

EST sequence analysis
EST production is highly automated (fluorescent laser systems and computer analysis of chromatograms), which influences the quality of the sequences.
The specific character of ESTs must be respected during their analysis:
■ EST alphabet
■ Insertions, deletions, frameshifts
■ Splice variants in ESTs
■ Non-coding regions

EST sequence analysis
■ EST alphabet
  >- automated computer analysis of chromatograms
  >- the program is sometimes unable to decide the base for a particular position and inserts the ambiguous base N
  >- N should make up <5% of the total length
■ Insertions, deletions and frameshifts
  >- automated base-calling software assumes regular intervals between peaks - not always the case
  >- phantom INDELs (insertions and deletions)
  >- identification of INDELs by sequence comparisons

List of base-ambiguity symbols defined by IUB-IUPAC
IUB symbol   Represented bases
A            A
C            C
G            G
T/U          T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
X/N          G or A or T or C

EST sequence analysis
■ Splice variants
  >- splice variants are represented by deletions arising from non-inclusion of exons
  >- in ESTs, bases may also be missing due to sequencing or sequence errors
  >- partially good match = splice form or sequence error?
■ Non-coding regions
  >- question: does this EST represent a new gene?
  >- search of the DNA database for similar non-coding regions
  >- no hit found = the EST represents a new gene (CDS) or a non-coding sequence not present in the database

[Figure: Sequencing chromatogram]

EST sequence analysis
Three categories of EST analysis tools:
■ Sequence similarity search tools
■ Sequence assembly tools
■ Sequence clustering tools

EST sequence analysis
■ Sequence similarity search tools
  >- current database search programs are designed to cope with ESTs: TBLASTN (translates DNA databases), BLASTX (translates the input sequence), TBLASTX (translates both)
■ Sequence assembly tools
  >- search of the databases reveals several ESTs matching the query sequence
  >- alignment of hits and construction of a consensus
  >- search with the consensus, augment, ...
  >- iterative sequence alignment = sequence assembly

EST sequence analysis
■ Sequence clustering tools
  >- clustering of EST sequences reduces redundancy and saves search time
  >- enables estimation of the number of genes in the EST database
  >- approach 1: clustering based on sequences from a comprehensive DNA database
  >- approach 2: clustering of all ESTs, construction of consensus sequences representing each cluster, DNA database search using consensus sequences only
  >- result = ESTs that do not match any of the database sequences

[Figure: Clustering of an EST library: plus-sense and minus-sense ESTs grouped into clusters]

Pairwise sequence alignment
■ database searching
■ alphabets and complexity
■ algorithms and programs
■ sequences and sub-sequences
■ identity and similarity
■ dotplot
■ local and global similarity
■ pairwise database searching

Database searching
■ A database search can take the form of a text query or a sequence similarity search.
■ Text queries are problematic due to missing annotations in many sequences.
■ query sequence = probe; searched sequence = subject
■ The purpose of searches is to identify evolutionary relationships (homology) from sequence similarity. Important for finding homologous family members in different species.

Alphabets and complexity
■ A sequence consists of letters from an alphabet.
■ The complexity of the alphabet is defined by the number of letters it contains:
  >- DNA = 4
  >- EST = 5
  >- proteins = 20
■ Special letters can be used for ambiguous bases (N) or residues (X). Sequence searching programs must be able to deal with them.

Algorithms and programs
■ An algorithm is a set of steps that define a certain computational process.
■ A program is an implementation of an algorithm.
■ The same algorithm may be implemented in many programs.
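The alphabet sizes from the slide "Alphabets and complexity" can be checked with a few lines of code. This is only a sketch: the names `ALPHABETS`, `complexity` and `fits` are hypothetical, and the 20-letter protein alphabet is assumed to be the usual one-letter amino acid codes.

```python
# Alphabets as sets of one-letter symbols.
ALPHABETS = {
    "DNA": set("ACGT"),
    "EST": set("ACGTN"),  # the four bases plus the ambiguous base N
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def complexity(name: str) -> int:
    """Complexity of an alphabet = number of letters it contains."""
    return len(ALPHABETS[name])


def fits(sequence: str, name: str, ambiguous: str = "") -> bool:
    """True if every letter of the sequence belongs to the named alphabet,
    optionally extended with extra ambiguity symbols such as X."""
    return set(sequence.upper()) <= ALPHABETS[name] | set(ambiguous)


for name in ALPHABETS:
    print(name, complexity(name))

print(fits("ACGTNACG", "DNA"))        # contains N, so not plain DNA
print(fits("ACGTNACG", "EST"))
print(fits("AGGVLIXQVG", "protein", ambiguous="X"))
```

A search program that could not handle N or X would have to reject the last two sequences outright, which is why the slide stresses that such symbols must be supported.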
Sequences and sub-sequences
■ Alignment of two short sequences:

Unaligned, score = 6
  Sequence 1 (query)    AGGVLIIQVG
                        ||||||
  Sequence 2 (subject)  AGGVLIQVG

Aligned, score = 9
  Sequence 1 (query)    AGGVLIIQVG
                        |||||| |||
  Sequence 2 (subject)  AGGVLI-QVG

■ The score increases on insertion of a gap. The gap increases the number of aligned identical residues.

Alignment of a sub-sequence with full sequence

Identity and similarity
■ Introduction of gaps solely to maximise identities is not biologically meaningful.
■ Scoring penalties are introduced to minimise the opening and extension of gaps.
■ The unitary matrix (counting identities) is replaced by a similarity matrix (counting similarities) = identities still score highest, but biologically meaningful substitutions also receive (lower) scores.
■ The diagnostic power of similarity matrices is higher.

Unitary scoring matrices: (a) DNA and (b) protein

Identity and similarity
Dayhoff Mutation Data Matrix
>- the score is based on the concept of Point Accepted Mutation (PAM)
>- evolutionary distance of 1 PAM = distance over which 1 point mutation is accepted per 100 residues; the matrix gives the probability of each residue mutating over that distance
>- 250 PAM matrix - similarity score equivalent to 20% matches remaining between two sequences = suitable for identification of similarities in the twilight zone
>- limitation: derived from alignments of sequences >85% identical

Mutation Data Matrix for 250 PAMs (residue order: C S T P A G N D E Q H R K M I L V F Y W)

Identity and
similarity
■ BLOSUM matrices
>- BLOcks SUbstitution Matrix
>- derived from blocks of aligned sequences in the BLOCKS database - represents distant relationships implicitly
>- bias from identical sequences is removed by clustering
>- BLOSUM62 = matrix derived from sequences clustered at 62% or greater identity

Identity and similarity
■ Statistical measures of alignment significance
>- performing sequence alignment computationally = creating a match according to a mathematical model
>- adjustable parameters: gap penalties, impact of sequence length, effect of alphabet complexity
>- the level of confidence in the constructed alignment is quantified by statistical parameters:
   probability (p) - probability that the constructed alignment arose by chance [should approach 0]
   expected frequency (E) - number of hits one can expect to see by chance [should be <0.001]

Example hit list from a database search
Dotplot
■ The most basic visual method for comparison of two sequences.
■ The signal (runs of adjacent dots) can be separated from the noise (random dots).
■ Identical sequences are represented by a single central diagonal line, similar sequences by a broken diagonal and dissimilar sequences by random dots.
■ Advanced dotplots utilise similarity matrices for calculation of cell scores.

Construction of the dotplot matrix

Dotplot of (a) identical, (b) similar and (c) related sequences (panels labelled LYC1_PIG, LYC1_ANAPL and LYC1_MACRG)

Local and global similarity
■ Alignments are mathematical models whose behaviour can be modified through the use of adjustable parameters. The models are constructed by dynamic programming algorithms, which find the solution of a problem by solving smaller, but similar, sub-problems.
■ Global alignment - considers similarity across the entire sequence.
■ Local alignment - considers similarity in parts of sequences only.

Path matrix with optimal path by dynamic programming (sequences AIMS and AMOS)
Alignment
  AIM-S
  A-MOS

Local and global similarity
■ Global alignment
>- Needleman and Wunsch algorithm
>- suitable for sequences similar across most of their length (usually closely related)
>- 1.
construction of 2D similarity matrix ("dotplot")
>- 2. successive summation of the cells in the matrix, starting from the N-terminal end and progressing through the sequence
>- 3. construction of the maximum-match path through the entire sequence

Local and global similarity
■ Local alignment
>- Smith-Waterman algorithm
>- suitable for distantly related sequences displaying local regions of similarity (functionally-relevant or structurally-relevant)
>- each point of the matrix defines the end point of a potential alignment = edge cells of the matrix are initialised to 0
>- the possibility of ending the alignment is evaluated for every cell
>- computational cost is comparable to that of global dynamic programming (both scale with the product of the sequence lengths)

Concepts of global and local optimality in the pairwise sequence alignment: (a) Global vs. Global (b) Local vs. Global (c) Local vs. Local

Pairwise database searching
■ Extension of the pairwise sequence alignments.
■ Large database searches cannot be performed using the original Needleman and Wunsch or Smith-Waterman algorithms due to time limitations.
■ Very fast local-similarity search methods employing heuristics = FastA and BLAST. These methods concentrate on finding short identical matches.

Pairwise database searching
FastA
>- algorithm by Lipman and Pearson (1985)
>- identifies short words (k-tuples) common to both sequences
>- k-tuples for proteins: 1-2 residues
>- k-tuples for DNA: up to 6 bases
>- k-tuples lying close to each other on the same diagonal are joined by heuristics, then gapped alignments are computed by dynamic programming

Output from FastA search

Pairwise database searching
BLAST
>- Basic Local Alignment Search Tool
>- algorithm by Altschul et al. (1990)
>- identifies short ungapped sub-sequences (segment pairs) of the same length
>- sub-sequences are extended using dynamic programming to obtain local alignments - high scoring pairs (HSPs)
>- improved algorithm by Altschul et al.
(1997) - produces gapped alignments
>- the algorithm is very fast - most commonly used for database searching

Output from BLAST search

Multiple sequence alignment
■ multiple sequence alignment
■ consensus sequence
■ manual methods
■ simultaneous and progressive methods
■ databases of multiple sequence alignments
■ hybrid approach for database searching

Multiple sequence alignment
■ A multiple sequence alignment is a 2D table in which the rows represent individual sequences and the columns the residue positions.
■ Multiple sequence alignments are essential for analysis of sets of gene families.
■ Sequence-based multiple sequence alignments - constructed according to similar strings of amino acid residues.
■ Structure-based multiple sequence alignments - constructed according to structural evidence.

Colour-coded multiple sequence alignments

Multiple sequence alignment
■ Construction of a multiple sequence alignment:
>- the positioning of residues within any sequence is preserved (absolute positions)
>- similar residues in all sequences are brought into vertical register (relative positions)
■ All residues in any single column of an alignment will have the same relative position but different absolute positions (unless the sequences are identical).
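The distinction between relative and absolute positions can be made concrete with a short Python sketch. It reuses the two-sequence example from the pairwise alignment slides (AGGVLIIQVG against AGGVLI-QVG). The function name and the use of "-" as the only gap character are assumptions made for this illustration.

```python
def column_to_residue_positions(aligned_rows, column):
    """For one alignment column (1-based, the relative position), return each
    row's absolute residue position (1-based), or None where the row has a gap."""
    positions = []
    for row in aligned_rows:
        if row[column - 1] == "-":
            positions.append(None)
        else:
            # Absolute position = number of non-gap characters up to this column.
            positions.append(len(row[:column].replace("-", "")))
    return positions


alignment = ["AGGVLIIQVG",
             "AGGVLI-QVG"]

print(column_to_residue_positions(alignment, 6))  # [6, 6]
print(column_to_residue_positions(alignment, 7))  # [7, None]
print(column_to_residue_positions(alignment, 8))  # [8, 7]
```

After the gap in column 7, residues that share a column (the same relative position) no longer share an absolute position: Q is residue 8 of the first sequence but residue 7 of the second.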
Consensus sequence
■ The alignment table can be summarised by:
>- a single line: pseudo-sequence
>- unweighted matrix: fingerprint
>- ungapped block of residues (weighted): block
>- weighted matrix: profile

Multiple alignment and the consensus sequence
            1 2 3 4 5 6 7 8 9 10
I           Y D G G A V - E A L
II          Y D G G - - - E A L
III         F E G G I L V E A L
IV          F D - G I L V Q A V
V           Y E G G A V V - Q A L
Consensus   y d G G A/I V/L v e A l

Multiple alignment and the profile, block and fingerprint (figure labels: fingerprint; regular expression C-Y-x2-[DG]-G-x-[ST])

Manual methods
■ Manual methods are subjective; however, they allow experimental evidence (e.g., mutagenesis data, structural knowledge) to be incorporated into the multiple alignment.
■ Manual modification of multiple alignments produced by automatic methods is the best approach.
■ Intuitive colouring schemes assist the eye in spotting similarities.
■ Quantitative evaluation of relatedness through calculation of residue identities/similarities.

Amino acid property groupings and colouring
Residue                    Property                        Colour
Asp, Glu                   Acidic                          red
His, Arg, Lys              Basic                           blue
Ser, Thr, Asn, Gln         Polar neutral                   green
Ala, Val, Leu, Ile, Met    Hydrophobic aliphatic           white
Phe, Tyr, Trp              Hydrophobic aromatic            purple
Pro, Gly                   Special structural properties   brown
Cys                        Disulphide bond former          yellow

Venn diagram grouping properties of the amino acids (groups shown include tiny, small, aromatic, hydrophobic, polar and charged)

Simultaneous methods
■ Simultaneous methods align all sequences in a given set at once, rather than aligning pairs of sequences or building sequence clusters.
■ Extension of the 2D dynamic programming matrix to more dimensions.
■ Number of dimensions = number of sequences.
■ Suitable only for small sets of short sequences.

Progressive methods
■ A multi-dimensional programming matrix is not applicable to realistic problems - larger sets of longer sequences.
■ CLUSTAL
>- 1. construction of an evolutionary tree
>- 2.
pairwise alignment of the two most closely related sequences, followed by addition of less related sequences
>- 3. final alignment, final evolutionary tree
■ CLUSTALW
>- positioning of gaps in closely related sequences according to their variability

Databases of multiple alignments
■ Multiple alignments bring together sequences from different species. This important evolutionary information can enhance the sensitivity of database searches.
■ Various abstractions (regular expressions, profiles, blocks, fingerprints or HMMs) can be searched against sequence databases. More information used in a query - higher sensitivity.
■ Results of searches using multiple alignments are more difficult to interpret.

Databases of multiple alignments
■ Multiple alignment databases available via the Web are produced automatically (e.g., PFAM) or manually (e.g., PRINTS).
■ Iterative automatic methods may include false-positive sequences in the alignment, which will corrupt it by insertion of many unrealistic gaps.

Example entry from PFAM database

Example alignment from PFAM database

Hybrid approach for database searching
■ PSI-BLAST
>- Position-Specific Iterated BLAST
>- algorithm by Altschul et al. (1997)
>- incorporates elements of both pairwise and multiple sequence alignment methods
>- procedure: initial search - creation of position-specific profiles from the hits - new search ... in iterations
>- advantage: detects even very weak similarities
>- disadvantages: the profile can be diluted if low-complexity regions are not masked; inclusion of a single false-positive sequence in the profile leads to bias towards unrelated sequences

Graphic hit list from a database search using PSI-BLAST

Secondary database searching
■ why search secondary databases?
■ secondary databases
■ regular expressions
■ fingerprints
■ blocks
■ profiles
■ Hidden Markov Models

Why search secondary databases?
■ Interpretation of the results from primary database searches is sometimes difficult:
>- X.000.000 sequences from XX.000 organisms
>- complex and redundant search outputs
>- irrelevant matches of low-complexity sequences, repetitive sequences, modular sequences
>- local regions of similarity in multi-domain proteins
>- truncated description lines
■ Secondary database searches make it possible to identify both homology and the more exacting orthology.

Secondary databases
■ Contain information derived from primary sequence data, typically in the form of abstractions: regular expressions, fingerprints, blocks, profiles or Hidden Markov Models.
■ These abstractions represent distillations of the most conserved features of multiple alignments.
■ The abstractions are useful for discrimination of family membership for newly determined sequences.

Secondary databases
■ PROSITE - regular expressions
■ PRINTS - fingerprints
■ BLOCKS - blocks
■ Profiles - profiles
■ Pfam - Hidden Markov Models
■ IDENTIFY - fuzzy regular expressions

Terms used in sequence analysis methods (figure labels: fingerprint; regular expression C-Y-x2-[DG]-G-x-[ST])

Regular expressions
■ A regular expression reduces the sequence data to the most conserved residue information.

Regular expression
  [AS]-D-[IVL]-G-X5-C-[DE]-R-[FY]2-Q
Multiple alignment
  ADLGAVFALCDRYFQ
  SDVGPRSCFCERFYQ
  ADLGRTQNRCDRYYQ
  ADIGQPHSLCERYFQ

Limitations:
>- stringent pattern - retrieves only identical matches and can miss remote relatives
>- fuzzier pattern - better chance of detecting remote relatives, but results in noisier output
>- a single motif may not be sufficient to infer the function

Regular expressions
■ Regular expressions work most effectively when a particular protein family can be characterised by a highly conserved motif (10-20 residues).
■ Limitation: short patterns (3-4 residues) are not sufficiently discriminative.
Asp-Ala-Val-Ile-Asp (DAVID)   71 exact matches in OWL29.6
Asp-Ala-Val-Glu (DAVE)        1088 exact matches in OWL29.6

Regular expressions
■ Rules - short patterns that can be used to provide a guide to the possible existence of functional sites:

Functional site                              Regular expression
N-glycosylation site                         N-{P}-[ST]-{P}
Protein kinase C phosphorylation site        [ST]-X-[RK]
Casein kinase II phosphorylation site        [ST]-X(2)-[DE]
Asp and Asn hydroxylation site               C-X-[DN]-X(4)-[FY]-X-C

Regular expressions
■ Fuzzy regular expressions - regular expressions with fuzziness introduced into the patterns using groups of amino acids with similar biochemical properties (FYW - aromatic, HKR - basic, etc.).

Multiple alignment
  ADLGAVFALCDRYFQ
  SDVGPRSCFCERFYQ
  ADLGRTQNRCDRYYQ
  ADIGQPHSLCERYFQ
Fuzzy regular expression
  [ASGPT]-D-[IVLM]-G-X5-C-[DENQ]-R-[FYW]2-Q

Amino acid property groupings and colouring
Residue                    Property                        Colour
Asp, Glu                   Acidic                          red
His, Arg, Lys              Basic                           blue
Ser, Thr, Asn, Gln         Polar neutral                   green
Ala, Val, Leu, Ile, Met    Hydrophobic aliphatic           white
Phe, Tyr, Trp              Hydrophobic aromatic            purple
Pro, Gly                   Special structural properties   brown
Cys                        Disulphide bond former          yellow

Venn diagram grouping properties of the amino acids (groups shown include hydrophobic, polar and charged)

Regular expressions
■ Introducing fuzziness into regular expressions increases the number of matches retrieved from the sequence database.

Regular expression             No. of exact matches (OWL29.6)
D-A-V-I-D                      71
D-A-V-I-[DENQ]                 252
[DENQ]-A-V-I-[DENQ]            925
[DENQ]-A-[VLI]-I-[DENQ]        2739
[DENQ]-[AQ]-[VLI]2-[DENQ]      51506

Fingerprints
■ Motivation: there is often more than one conserved region present in a multiple alignment.
■ Groups of motifs are excised from the alignment and converted into matrices populated by the residue frequencies observed at each position.
■ Unweighted scoring system - no additional mutation or substitution matrices are employed.
■ Weighted scoring system - additional matrices are employed, resulting in a less sparse matrix but poorer signal-to-noise performance.
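The pattern notation used on the regular-expression slides above can be translated mechanically into the syntax of a general-purpose regular expression engine. The sketch below does this with Python's standard `re` module. The helper `pattern_to_regex` is hypothetical and handles only the elements that appear on the slides: single residues, `[..]` alternatives, `{..}` exclusions, `x`/`X` wildcards and repeat counts written as `X5` or `X(4)`. The peptide in the last lines is made up for the example.

```python
import re

# One pattern element: a residue, [alternatives] or {exclusions},
# optionally followed by a repeat count written as "(n)" or "n".
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)\)|(\d+))?")


def pattern_to_regex(pattern):
    """Translate a slide-style pattern such as N-{P}-[ST]-{P} into re syntax."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element.strip())
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, count_paren, count_plain = match.groups()
        if core in ("x", "X"):
            core = "."                       # wildcard: any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        count = count_paren or count_plain
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)


family = ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
          "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]
regex = pattern_to_regex("[AS]-D-[IVL]-G-X5-C-[DE]-R-[FY]2-Q")
print(regex)                                             # [AS]D[IVL]G.{5}C[DE]R[FY]{2}Q
print(all(re.fullmatch(regex, seq) for seq in family))   # True

glyco = pattern_to_regex("N-{P}-[ST]-{P}")
print(glyco)                                             # N[^P][ST][^P]
print(re.search(glyco, "MANGSAK").start())               # 2 (made-up peptide)
```

A short rule such as the N-glycosylation pattern matches very many sequences by chance, which is why such rules are only a guide to the possible existence of a site.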
Example of (a) ungapped aligned motif and (b) its corresponding frequency matrix
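The conversion of an ungapped aligned motif into a frequency matrix, and unweighted scoring against it, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used by any particular fingerprint database. The motif rows are borrowed from the regular-expression example earlier in the lecture, the function names are invented for this sketch, and the query sequence is made up.

```python
from collections import Counter


def frequency_matrix(motif_rows):
    """Count the residues observed in each column of an ungapped aligned motif."""
    length = len(motif_rows[0])
    if any(len(row) != length for row in motif_rows):
        raise ValueError("all motif rows must have the same length")
    return [Counter(column) for column in zip(*motif_rows)]


def score_window(window, matrix):
    """Unweighted score: sum of the observed counts of the window's residues."""
    return sum(column[residue] for residue, column in zip(window, matrix))


def best_match(sequence, matrix):
    """Slide the matrix along a sequence; return (best score, 0-based start)."""
    width = len(matrix)
    return max((score_window(sequence[i:i + width], matrix), i)
               for i in range(len(sequence) - width + 1))


motif = ["ADLGAVFALCDRYFQ",
         "SDVGPRSCFCERFYQ",
         "ADLGRTQNRCDRYYQ",
         "ADIGQPHSLCERYFQ"]
matrix = frequency_matrix(motif)

print(dict(matrix[0]))    # {'A': 3, 'S': 1}
print(dict(matrix[1]))    # {'D': 4}

query = "MKT" + "ADLGAVFALCDRYFQ" + "GG"   # made-up sequence containing the motif
print(best_match(query, matrix))           # (38, 3)
```

Because residues never observed in a column score zero, the matrix is sparse. A weighted scheme would fill in those zeros from a substitution matrix, at the cost of signal-to-noise performance noted above.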