Introduction

Bioinformatics - lectures
■ Introduction
■ Information networks
■ Protein information resources
■ Genome information resources
■ DNA sequence analysis
■ Pairwise sequence alignment
■ Multiple sequence alignment
■ Secondary database searching
■ Analysis packages
■ Protein structure modelling

Introduction
■ history of sequencing
■ what is bioinformatics?
■ sequence to structure deficit
■ genome projects
■ why is bioinformatics important?
■ pattern recognition and prediction
■ folding problem
■ sequence analysis
■ homo/analogy and ortho/paralogy

History of sequencing
Protein sequencing
■ separation of peptides, identification and quantification of amino acids
■ Edman degradation
■ mass spectrometry - advantage in identifying post-translational modifications
■ 1955 sequencing of the peptide insulin
■ 1960 sequencing of the enzyme ribonuclease
■ 1980s automated sequencers

History of sequencing
Nucleic acid sequencing
■ tRNA - short, could be purified
■ DNA - large (a human chromosome is 55-250 × 10^6 bp); the longest fragment that can be sequenced is 500 bp; purification is problematic
■ advent of gene cloning and PCR
■ 1972 DNA cloning
■ 1975 DNA sequencing
■ 1980s and 1990s sequence revolution

Technology development
■ 1949 Edman degradation
■ 1954 Isomorphous replacement
■ 1962 Restriction enzyme
■ 1972 DNA cloning
■ 1975 DNA sequencing
■ 1984 Pulsed-field gel electrophoresis
■ 1985 Polymerase chain reaction
■ 1987 YAC vector
■ 1988 Human Genome Project
■ 1993 DNA chip

Structure determination
■ 1951 α-helix model
■ 1953 DNA double helix model
■ 1950s Insulin primary structure
■ 1960 Myoglobin tertiary structure
■ 1965 tRNA^Ala primary structure
■ 1973 tRNA^Phe tertiary structure
■ 1977 φX174 complete genome
■ 1979 Z-DNA by single crystal diffraction
■ 1986 Protein structure by 2D NMR
■ 1995 H. influenzae complete genome

[Figure, panel (a): tRNA cloverleaf diagram with the acceptor stem, D loop, variable loop and anticodon loop labelled]
Fig. 1.8. Transfer RNA.
(a) The primary sequence and the secondary structure of yeast alanyl-transfer RNA. (b) The tertiary structure of yeast phenylalanyl-transfer RNA (PDB:1TRA).

Automatic sequencing machine
ABI Prism 310, Applied Biosystems

Automated production line in sequencing "factory"
Whitehead Institute, Center for Genome Research, USA

Sequencing chromatogram

What is Bioinformatics?
■ improvements in DNA sequencing technologies and computer-based technologies
■ originally - analysis of sequence data (1980s)
■ presently - also analysis of 3D structures

The term bioinformatics is used to encompass almost all computer applications in biological sciences.
Information technology applied to the management and analysis of biological data.

[Figure: growth of sequence and structure data; x-axis: date of database release (1988, 1993, 1998)]
Figure 1.1 The protein sequence/structure deficit in 1998. The graph illustrates the non-redundant growth of sequence data during the last decade and the corresponding growth in the number of unique structures.
Genome projects
■ 1977 first complete genome - the virus φX174, about 5,000 nucleotides and 11 genes
■ 1995 first complete genome of a living organism - Haemophilus influenzae, 1.8 million nucleotides and 1,700 genes
■ sequencing of model systems: Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Canis familiaris, Mus musculus

[Figure: genome and database sizes on a logarithmic scale, in bits (2^8 to 2^32) and in nucleotides. Entries shown: Human, Mouse, Rice, Fruit fly, Nematode (1998), Arabidopsis, Budding yeast (1997), GenBank 10/97, GenBank 9/92, GenBank 9/87, Escherichia coli (1997), Haemophilus influenzae (1995), Cytomegalovirus (1990), GenBank 10/82, λ phage (1982), φX174 phage (1977)]

Organism                               Genome size (Mb)   Gene number           Haploid chromosome number
Bacterium (Escherichia coli)           ~4                 4,403                 1
Yeast (Saccharomyces cerevisiae)       ~12                6,190                 16
Worm (Caenorhabditis elegans)          97                 19,730                not listed
Fruit fly (Drosophila melanogaster)    120                13,601                not listed
Mouse (Mus musculus)                   3,454              ~50,000 (estimated)   20
Human (Homo sapiens)                   2,910              33,609                23

Human Genome Project
■ initiated in the mid-1980s
■ estimated 100,000 genes and completion in 2005
■ need for automated sequencing and improved computational techniques
■ shotgun method
■ sequencing of a rough draft first
■ first draft completed in 2000 by the publicly funded International Human Genome Project consortium and the company Celera Genomics

Human Genome Project
■ ~33,000 genes
■ genes are complex due to alternative splicing
■ >1,000,000 proteins (estimated)
■ hundreds of genes resulted from horizontal transfer from bacteria (in the vertebrate lineage)
■ dozens of genes are derived from transposable elements (their activity, however, has declined)
■ the mutation rate in males is twice as high as in females
■ >1,400,000 single nucleotide polymorphisms (SNPs)

Why is bioinformatics important?
■ last 20-30 years - structural biology
■ new era - bioinformatics - due to genome projects and the sequence/structure deficit
■ biological function is not known for about 50% of all genes in every sequenced genome
■ role of bioinformatics
  ■ data management and storage
  ■ data analysis = conversion of primary sequence to biological knowledge

[Figure: a protein sequence in one-letter amino acid code]

Primary structure: the linear sequence of amino acids in a protein molecule
Secondary structure: regions of local regularity within a protein fold (e.g., α-helices, β-turns, β-strands)
Super-secondary structure: the arrangement of α-helices and/or β-strands into discrete folding units (e.g., β-barrels, βαβ-units, Greek keys, etc.)
Tertiary structure: the overall fold of a protein sequence, formed by the packing of its secondary and/or super-secondary structure elements
Quaternary structure: the arrangement of separate protein chains in a protein molecule with more than one subunit
Quinternary structure: the arrangement of separate molecules, such as in protein-protein or protein-nucleic acid interactions

Homology and analogy
■ Sequences are said to be homologous if they are related by divergence from a common ancestor.
■ Proteins can share similar folds (e.g., β-barrel) or similar catalytic residues (e.g., serine proteases) without any sequence similarity. Convergence to similar biological solutions from different evolutionary starting points results in analogy.
■ Sequence analysis assumes homologous proteins. Homology is not a measure of similarity.

[Figure: alignment methods plotted against percent identity, with the Twilight Zone and Midnight Zone marked. Methods shown: automatic pairwise methods, consensus methods, profile methods, structure prediction]

Orthology and paralogy
■ Proteins performing the same function in different species - orthologues.
■ Proteins performing different but related functions within the same organism - paralogues.
■ Sequence comparison of orthologous proteins forms the basis of phylogenetic analysis.
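The percent identity scale used with the Twilight Zone and Midnight Zone labels can be made concrete with a small calculation. The sketch below is an illustration only: the function names are hypothetical, the sequences are made up, and the 30% and 20% cut-offs are assumed round figures, not values given in the slides. Identity is counted here over columns where both sequences have a residue; other definitions use a different denominator and give different numbers.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap.

    Both inputs must already be aligned, i.e. have the same length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0   # columns with a residue in both sequences
    identical = 0  # compared columns where the residues match
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            identical += 1

    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def identity_zone(pid: float,
                  twilight_upper: float = 30.0,
                  midnight_upper: float = 20.0) -> str:
    """Label a percent identity value.

    The default cut-offs are assumptions chosen for illustration.
    """
    if pid < midnight_upper:
        return "midnight zone"
    if pid < twilight_upper:
        return "twilight zone"
    return "above the twilight zone"


if __name__ == "__main__":
    # Made-up aligned fragments, used only to exercise the functions.
    a = "ACDE-FG"
    b = "ACDQWFG"
    pid = percent_identity(a, b)
    print(f"{pid:.1f}% identity: {identity_zone(pid)}")
```

A high value from such a calculation measures similarity only. Whether two sequences are homologous remains a separate, evolutionary question, in line with the point that homology is not a measure of similarity.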