Molecular biology data

High-throughput technologies
• Automated sequencing
• NMR spectroscopy
• Protein crystallography
Result: a marked increase in the amount of biological data.

Classification of molecular biology databases
• Primary
• Secondary
• Structural
• Genome resources

[Figure: genome resources by taxonomic group (viruses, eukaryota, bacteria, archaea, plasmids, viroids); total species: 5023, total records: 9315.]

Molecular biology data: DNA

[Figure: sequencer read-out with position ticks for part of the sequence below.]

GATAGCGTAATGATCGGCTGGCTGCCGCATTTCATGCTGGTTTCCCAACGAAAATAACCGCTCACGGTGCCATCACGATCGCACACCGCAAAATCGGCGG
TACAGGTGGTCGCGCCCGCCGCCAGCACATCGCTGCGCCAATAATGATCTTTCAGCGGACGACAGCTCGGATGCAGCAGATCATCCGCATCCGGAACGGC
GGTGGCGGCATCACGCACTTCCAGTTCGATCGGGGCAACAATGCCGGCATCTTTCAGGGCAAAGCGAATAAACAGCACGCTCACTTCCGCGCGCAGCGCC
AGCGCGGTTTCGCGCAGATGCAGCTGATCACCCGGGCTCAGACCGGTAAACAGACGGCTATCGTTATGGCCCAGCTGCGCGGCATCGCCCGGGCTAACAA
CATACAGGTGGCGACCATCAATCACGGTCGGGGCGGCCGGATCACGGCTGGCTTCCGGATAGGCGCTCAGCAGGGTAACGGCATCCACAATCACCAGCAT

• A string of symbols may (or may not) carry meaning: a nucleotide sequence, computer 0s and 1s, ordinary language.

Meaningful sequence?
Sample (English text): "... double helix thus brought us not only joy but great relief. It was unbelievably interesting and immediately allowed us to make a serious proposal for the mechanism of gene duplication."
Frequency analysis
[Figure: histogram of letter frequencies in English text, a to z.]

Meaningful sequence?
Sample (random letters): "ozkcbjbwkdyueayklxkietjzclpgrknxhjdnqitaxyvuorfxgihkyr ..."
Frequency analysis
[Figure: histogram "Letters in random text", a to z, vertical axis 0% to 14%.]

Meaningful sequence?
010101010101010101
A sequence cannot be random and meaningful at the same time!
Random or meaningful?
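The frequency analysis used on these slides can be sketched in a few lines of Python. This is only an illustration: the function name `symbol_frequencies` and the choice of overlapping windows are assumptions made here, not something the slides prescribe. It counts single symbols and overlapping pairs in the alternating 0/1 string.

```python
from collections import Counter

def symbol_frequencies(seq, k=1):
    """Count every overlapping window of k symbols and return
    {window: (count, relative frequency)}."""
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {w: (c, c / total) for w, c in sorted(counts.items())}

seq = "01" * 10                         # 01010101010101010101
singles = symbol_frequencies(seq, 1)    # single-symbol statistics
pairs = symbol_frequencies(seq, 2)      # overlapping pair statistics

for name, table in (("singles", singles), ("pairs", pairs)):
    for window, (count, ratio) in table.items():
        print(f"{name:8s} {window:3s} {count:3d} {ratio:6.1%}")
```

Single symbols come out at 50% each, exactly as for a random bit string, while the pairs 00 and 11 never occur. Uniform single-symbol frequencies therefore say little about randomness; higher-order statistics are needed.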
Frequency analysis

Sequence: 01010101010101010101

symbol | count | ratio
0 | 10 | 50%
1 | 10 | 50%

Expected pair frequencies for a random sequence

pair | ratio
00 | 25%
11 | 25%
01 | 25%
10 | 25%

Pair frequencies for the sequence above

pair | count | ratio
00 | 0 | 0%
11 | 0 | 0%
01 | 10 | 53%
10 | 9 | 47%

What is it good for?
• GC content is, for example, higher in genic regions than in intergenic regions.
• GC islands occur in regions that regulate transcription, ...

[Figure: genomic region with the genes HBZ, HBZP, HBAP1, HBAP2, HBA2, HBA1, HBQ1.]

Table 1. Software commonly used for bacterial genome annotation and comparison

DNA level annotation
Tool | URL | Function
GeneMark | http://exon.gatech.edu/genemark/ | Protein gene prediction
Glimmer | http://www.genomics.jhu.edu/Glimmer/ | Protein gene prediction
SHOW | http://genome.jouy.inra.fr/ssb/SHOW/ | Protein gene prediction
tRNAscan-SE | http://lowelab.ucsc.edu/tRNAscan-SE/ | tRNA gene prediction
RNAmmer | http://www.cbs.dtu.dk/services/RNAmmer/ | rRNA gene prediction
RepSeek | http://www.abi.snv.jussieu.fr/~public/RepSeek/ | Search for approximate repeats in complete DNA sequences
IslandPath | http://www.pathogenomics.sfu.ca/islandpath/ | Identification of genomic islands

Protein level annotation
Tool | URL | Function
BLAST | http://www.ebi.ac.uk/blast/ | Compare a novel sequence with those contained in nucleotide and protein databases
InterProScan | http://www.ebi.ac.uk/InterProScan/ | Search for domains/motifs in the InterPro database
COGNITOR | http://www.ncbi.nlm.nih.gov/COG/old/xognitor.html | Compare a query sequence to the COG (Cluster of Orthologous Groups of proteins) database
PRIAM | http://bioinfo.genopole-toulouse.prd.fr/priam/ | Detection of enzymatic function in a fully sequenced genome, based on all sequences available in the ENZYME database
GOAnno | http://bips.u-strasbg.fr/GOAnno/ | BLAST search on the Gene Ontology database
PSORTb | http://www.psort.org/psortb/ | Prediction of bacterial protein subcellular localization
TMHMM | http://www.cbs.dtu.dk/services/TMHMM/ | Prediction of transmembrane helices in protein sequences
SignalP | http://www.cbs.dtu.dk/services/SignalP/ | Prediction of signal peptide cleavage sites in protein sequences

Comparative genomic tools
Tool | URL | Function
Mauve | http://gel.ahabs.wisc.edu/mauve/ | Multiple genome alignments in the presence of large-scale evolutionary events
MOSAIC | http://mig.jouy.inra.fr/mig/mig_eng/presentation/projects/mosaic | Define the set of backbones and loops in closely related bacterial genomes
ACT | http://www.sanger.ac.uk/Software/ACT/ | Comparative genome analysis and visualization tools for multiple genome alignments
CGAT | http://mbgd.genome.ad.jp/CGAT/ | (same entry as ACT)
MaGe | http://www.genoscope.cns.fr/agc/mage/ | Computation of gene order conservation (syntenies) between available bacterial genomes
Pathologic | http://biocyc.org/ | Metabolic network reconstruction and comparative pathway analysis
PUMA2 | http://compbio.mcs.anl.gov/puma2/ | Metabolic pathway reconstruction
The SEED | http://theseed.uchicago.edu/FIG/ | Comparative analysis and annotation tools using the subsystem approach
STRING | http://string.embl.de/ | Search Tool for the Retrieval of Interacting Proteins
PyPhy | http://www.cbs.dtu.dk/staff/thomas/pyphy/ | Reconstruction of phylogenetic relationships of complete microbial genomes
HoSeqI | http://pbil.univ-lyon1.fr/software/HoSeqI/ | Automatically assign sequences to homologous gene families from the HOGENOM database

Prediction of protein-coding genes

Prokaryotic genes
Uninterrupted stretches of DNA between a start codon (ATG; alternatively GTG, TTG, CTG) and a stop codon (TAA, TGA, TAG).
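The definition of a prokaryotic gene as an uninterrupted stretch between a start codon and a stop codon translates directly into a simple scan. The sketch below is a minimal illustration under stated assumptions: the function name `find_orfs`, the toy sequence, and the restriction to the forward strand with ATG as the only start codon are choices made here; real gene finders also scan the reverse strand and score candidates statistically.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}

def find_orfs(dna, starts=("ATG",), min_codons=1):
    """Return (start, end, sequence) for every ORF in the three forward
    frames. An ORF begins at the first start codon seen after the previous
    stop codon in that frame and ends with the next in-frame stop codon.
    Coordinates are 0-based, end-exclusive, and include the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i                      # open a new ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3    # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # close the ORF
    return orfs

# Hypothetical toy sequence, not taken from the slides.
toy = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(toy)
print(orfs)
```

Passing `starts=("ATG", "GTG", "TTG", "CTG")` includes the alternative start codons; because those are common triplets, a `min_codons` threshold then becomes important to keep the number of spurious hits down.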
Eukaryotic genes
Interrupted by introns. The average exon is 50 codons long; some are much shorter. Some introns are extremely long, so a single gene can occupy Mbp of genomic DNA.
Prediction of eukaryotic genes is much more difficult than prediction of prokaryotic genes and is STILL AN UNSOLVED problem!

Prokaryotic genes
• Prokaryotic gene = the longest ORF corresponding to the given stretch of DNA.

gtatgctggtgattgtggatgccgttaccctgctgagcgcctatccggaagccagccgtgatccggccgcccc
gaccgtgattgatggtcgccacctgtatgttgttagcccgggcgatgccgcgcagctgggccataacgatagc
cgtctgtttaccggtctgagcccgggtgatcagctgcatctgcgcgaaaccgcgctggcgctgcgcgcggaag
tgagcgtgctgtttattcgctttgccctgaaagatgccggcattgttgccccgatcgaactggaagtgcgtga
tgccgccaccgccgttccggatgcggatgatctgctgcatccgagctgtcgtccgctgaaagatcattattgg
cgcagcgatgtgctggcggcgggcgcgaccacctgtaccgccgattttgcggtgtgcgatcgtgatggcaccg
tgagcggttattttcgttgggaaaccagcattgaaattgcgggcagccagccggataccaaacagccgggctt
taaaccgagcagcgatcgcaatggcaactttagcctgccgccgaataccgcctttaaagcgatcttctatgcg
aacgcggcggatcgtcaggatctgaaactgtttattgatgatgcgccggaaccggccgccacctttgtgggta
acagcgaagatggtgtgcgtctgtttaccctgaatagcaaaggtggtaaaattcgtattgaagcgagcgcgaa
cggccgtcagagcgcgaccgatgcccgtctggcgccgctgagcgcgggcgataccgtgtggctgggctggctg
ggcgcggaagatggtgccgatgcggattataatgatggcattgttattctgcagtggccgattacctaatggg

Translating a DNA sequence

The table shows the 64 codons and the amino acid for each. The direction of the mRNA is 5' to 3'. (Colour legend in the original table: nonpolar, polar, basic, acidic, stop codon.)

1st base | 2nd base U | 2nd base C | 2nd base A | 2nd base G | 3rd base
U | UUU Phe/F | UCU Ser/S | UAU Tyr/Y | UGU Cys/C | U
U | UUC Phe/F | UCC Ser/S | UAC Tyr/Y | UGC Cys/C | C
U | UUA Leu/L | UCA Ser/S | UAA Stop (Ochre) | UGA Stop (Opal) | A
U | UUG Leu/L | UCG Ser/S | UAG Stop (Amber) | UGG Trp/W | G
C | CUU Leu/L | CCU Pro/P | CAU His/H | CGU Arg/R | U
C | CUC Leu/L | CCC Pro/P | CAC His/H | CGC Arg/R | C
C | CUA Leu/L | CCA Pro/P | CAA Gln/Q | CGA Arg/R | A
C | CUG Leu/L | CCG Pro/P | CAG Gln/Q | CGG Arg/R | G
A | AUU Ile/I | ACU Thr/T | AAU Asn/N | AGU Ser/S | U
A | AUC Ile/I | ACC Thr/T | AAC Asn/N | AGC Ser/S | C
A | AUA Ile/I | ACA Thr/T | AAA Lys/K | AGA Arg/R | A
A | AUG Met/M, Start | ACG Thr/T | AAG Lys/K | AGG Arg/R | G
G | GUU Val/V | GCU Ala/A | GAU Asp/D | GGU Gly/G | U
G | GUC Val/V | GCC Ala/A | GAC Asp/D | GGC Gly/G | C
G | GUA Val/V | GCA Ala/A | GAA Glu/E | GGA Gly/G | A
G | GUG Val/V | GCG Ala/A | GAG Glu/E | GGG Gly/G | G

Translating a DNA sequence: reading frames

DNA, template strand (3'→5'): GTACCACGACAGAGGACGGCTGTTCTGGTTATT
DNA, coding strand (5'→3'): CATGGTGCTGTCTCCTGCCGACAAGACCAATAA
mRNA (5'→3'): CAUGGUGCUGUCUCCUGCCGACAAGACCAAUAA

RF1: CAU GGU GCU GUC UCC UGC CGA CAA GAC CAA UAA
     His Gly Ala Val Ser Cys Arg Gln Asp Gln Stop
RF2: C AUG GUG CUG UCU CCU GCC GAC AAG ACC AAU AA
     Met Val Leu Ser Pro Ala Asp Lys Thr Asn
RF3: CA UGG UGC UGU CUC CUG CCG ACA AGA CCA AUA A
     Trp Cys Cys Leu Leu Pro Thr Arg Pro Ile

Translating a DNA sequence: tools
• ExPASy Translate: http://web.expasy.org/translate/
• ORF Finder (NCBI): http://www.ncbi.nlm.nih.gov/gorf/gorf.html

ExPASy (Bioinformatics Resource Portal): http://www.expasy.org/vg/index/dna
[Screenshot: ExPASy portal, category DNA, selected keyword "translation".]
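The three forward reading frames of the short mRNA example can be reproduced with a few lines of Python. This is a sketch only: the names `CODON_TABLE` and `translate` are introduced here for illustration, the table is the standard genetic code written in the usual U, C, A, G order, and stop codons are written as `*`.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def translate(seq, frame=0):
    """Translate an RNA or DNA string in the given forward frame (0, 1, 2).
    Incomplete trailing codons are ignored; '*' marks a stop codon."""
    rna = seq.upper().replace("T", "U")
    codons = [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]
    return "".join(CODON_TABLE[c] for c in codons)

mrna = "CAUGGUGCUGUCUCCUGCCGACAAGACCAAUAA"   # mRNA from the slide
frames = [translate(mrna, f) for f in range(3)]
for number, protein in enumerate(frames, start=1):
    print(f"RF{number}: {protein}")
```

Only the second frame starts with methionine; translating the reverse complement in the same way would give the three remaining frames that tools such as ExPASy Translate report.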
ExPASy ("Expert Protein Analysis System")
External resources (no support from the ExPASy team). Databases (0), Tools (5):
• EMBOSS tran…: EMBOSS sequence translation tools, incl. backtranslation
• Gra…: displays the codon bias in a graphical manner
• (name not legible): transcription, translation and reverse transcription
• Reverse Tran…: translates a protein sequence back to a nucleotide sequence
• Translate: translation of a nucleotide (DNA/RNA) sequence to a protein sequence

ExPASy Translate: http://web.expasy.org/translate/
Translate is a tool which allows the translation of a nucleotide (DNA/RNA) sequence to a protein sequence. Please enter a DNA or RNA sequence in the box below (numbers and blanks are ignored).
Input: the DNA sequence from the slide "Prokaryotic genes" (gtatgctggtgattgtggatgcc ... cagtggccgattacctaatggg), entered in upper case.
Output format: Verbose ("Met", "Stop", spaces between residues).

Translate Tool: results of translation
Open reading frames are highlighted in red.
Please select one of the following frames. In the next page, you will be able to select your initiator and retrieve your amino acid sequence.

[Screenshot: six-frame translation. Frames 5'3' 1, 5'3' 2, 3'5' 1 and 3'5' 3 are broken up by many Stop codons. The two frames without internal stops are shown below.]

5'3' Frame 3
Met LVIVDAVTLLSAYPEASRDPAAPTVIDGRHLYVVSPGDAAQLGHNDSRLFTGLSPGDQLHLRETALALRAEVSVLFIRFALKD
AGIVAPIELEVRDAATAVPDADDLLHPSCRPLKDHYWRSDVLAAGATTCTADFAVCDRDGTVSGYFRWETSIEIAGSQPDTKQP
GFKPSSDRNGNFSLPPNTAFKAIFYANAADRQDLKLFIDDAPEPAATFVGNSEDGVRLFTLNSKGGKIRIEASANGRQSATDARL
APLSAGDTVWLGWLGAEDGADADYNDGIVILQWPIT Stop W

3'5' Frame 2
PLGNRPLQNNNAIIIIRIGTIFRAQPAQPHGIARAQRRQTGIGRALTAVRARFNTNFTTFAIQGKQTHTIFAVTHKGGGRFRRIINKQF
QILTIRRVRIEDRFKGGIRRQAKVAIAIAARFKARLFGIRLAARNFNAGFPTKITAHGAITIAHRKIGGTGGRARRQHIAAPI Met IFQRTTAR Met
QQIIRIRNGGGGITHFQFDRGNNAGIFQGKANKQHAHFRAQRQRGFAQ Met QLITRAQTGKQTAIV Met AQLRGIARANNIQV
ATINHGRGGRITAGFRIGAQQGNGIHNHQH

ORF Finder (NCBI): http://www.ncbi.nlm.nih.gov/gorf/gorf.html
• ORF Finder (Open Reading Frame Finder) is a graphical analysis tool that finds open reading frames of a selectable minimum size in a user's sequence or in a sequence already in the database.
• It identifies all open reading frames using the standard or an alternative genetic code.
• The resulting sequence can be saved in various formats and searched against the sequence database.
• It is intended to help prepare complete and accurate sequence submissions and is associated with the Sequin sequence submission software.
• Input: GI or accession number, or a sequence in FASTA format; optional range (FROM, TO); choice of genetic code.

Genetic codes offered by ORF Finder
• The Standard Code
• The Vertebrate Mitochondrial Code
• The Yeast Mitochondrial Code
• The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
• The Invertebrate Mitochondrial Code
• The Ciliate, Dasycladacean and Hexamita Nuclear Code
• The Echinoderm and Flatworm Mitochondrial Code
• The Euplotid Nuclear Code
• The Bacterial and Plant Plastid Code
• The Alternative Yeast Nuclear Code
• The Ascidian Mitochondrial Code
• The Alternative Flatworm Mitochondrial Code
• Blepharisma Nuclear Code
• Chlorophycean Mitochondrial Code
• Trematode Mitochondrial Code
• Scenedesmus obliquus Mitochondrial Code
• Thraustochytrium Mitochondrial Code

ORF Finder result for the example sequence

Frame | from | to | Length
+3 | 3 | 872 | 870
-2 | 1 | 857 | 857

The +3 ORF corresponds to 5'3' Frame 3 above, the -2 ORF to 3'5' Frame 2.

Example of an annotated record (GenBank)

gene complement(5548..>5873)
  /locus_tag="ADF63_RS25535"
CDS complement(5548..>5873)
  /locus_tag="ADF63_RS25535"
  /inference="EXISTENCE: similar to AA sequence:RefSeq:WP_009876850.1"
  /note="Derived by automated computational analysis using gene prediction method: Protein Homology."
  /codon_start=3
  /transl_table=11
  /product="fucose-binding lectin"
  /protein_id="WP_049233417.1"
  /db_xref="GI:896235191"
  /translation="LPANTRFGVTAFANSSGTQTVNVLVNNETAATFSGQSTNNAVIG
  TQVLNSGSSGKVQVQVSVNGRPSDLVSAQVILTNELNFALVGSEDGTDNDYNDAVVVI
  NWPLG"

Errors
Most common sources:
• sequencing errors
• wrong prediction: alternative start codon
• shotgun sequencing

Eukaryotic genes: unicellular eukaryotes
• Genomes of unicellular eukaryotes differ markedly from one another (intron frequency, fraction of the genome made up of protein-coding genes).
• Saccharomyces cerevisiae: 67% of the genome is protein-coding, only 4% of genes contain introns.
• Slime molds: the average gene contains 3.7 introns.
• For some unicellular eukaryotes (yeasts) the same procedures can be used as for prokaryotes.

Eukaryotic genes: multicellular eukaryotes
Complex genome organization: genes are separated by long INTERGENIC regions and contain many INTRONS, some of them very LONG.

[Figure: gene map, 5' to 3', legend: coding region, untranslated region. Glyceraldehyde-3-phosphate dehydrogenase, Candida albicans.]
[Figure: gene map, 5' to 3', legend: coding region, untranslated region. Glyceraldehyde-3-phosphate dehydrogenase, Homo sapiens.]

[Figure: structure and expression of a eukaryotic gene. DNA: promoter (TATA), exon (ATG), intron (GT ... AG), exon, intron, exon (TGA). Transcription gives pre-mRNA; processing gives mRNA with a poly-A tail (AAAAAAAAAA); translation gives protein.]

[Figure: precursor, lariat intermediate, spliced product and lariat form of intron.]
Splicing Mechanism Used for mRNA Precursors. The upstream (5') exon is shown in blue, the downstream (3') exon in green, and the branch site in yellow. R stands for a purine nucleotide, Y for a pyrimidine nucleotide, and N for any nucleotide. The 5' splice site is attacked by the 2'-OH group of the branch-site adenosine residue. The 3' splice site is attacked by the newly formed 3'-OH group of the upstream exon. The exons are joined, and the intron is released in the form of a lariat. [After P. A. Sharp. Cell 2(1985):3980.]

Eukaryotic genes: multicellular eukaryotes
• Recognition of exons/introns
  Identification of splice sites: GT at the 5' end and AG at the 3' end of the intron.
• Errors in recognizing exons/introns
  A large number of errors. Long introns are classified as intergenic regions. Short intergenic regions are classified as introns.

Algorithms and tools for gene identification
* Gene prediction based on sequence homology: database searches using alignment algorithms.
* Ab initio gene prediction: prediction based on statistical parameters of the DNA sequence.
* Most commonly used methods combine both approaches.

Prokaryota (ATG ... TAA, no introns)
SEQUENCE HOMOLOGY → IDENTIFIED GENES USED TO "TRAIN" STATISTICAL METHODS → ANALYSIS OF THE REMAINING PARTS OF THE GENOME

Eukaryota (many introns, long intergenic regions)
AB INITIO STATISTICAL METHODS → IDENTIFIED EXONS → SEQUENCE HOMOLOGY

Algorithms and tools for gene identification
• Every program has advantages and disadvantages, so it is sensible to use several prediction tools.
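The GT ... AG rule for splice sites is easy to state, but on its own it is a very weak signal, which is one reason exon/intron recognition produces so many errors. The following sketch enumerates every stretch that merely starts with GT and ends with AG. The function name `candidate_introns`, the `min_len` parameter and the toy sequence are hypothetical and chosen here only for illustration; real predictors weigh much longer sequence context around both sites.

```python
def candidate_introns(dna, min_len=4):
    """List (start, end) of every substring that begins with GT and ends
    with AG and is at least min_len long (0-based, end-exclusive)."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]

# Hypothetical 20-nt toy sequence.
toy = "ATGGTAAGTCCAGGTTAGCC"
candidates = candidate_introns(toy)
for start, end in candidates:
    print(start, end, toy[start:end])
```

Even this 20-nucleotide string yields six candidate introns from three GT and three AG dinucleotides. In a genome the number of donor/acceptor pairings grows roughly quadratically with length, so statistical models or homology evidence are needed to pick the real ones.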
Examples: GeneMark, GlimmerM, GRAIL, GENSCAN, Fgenes

Algorithms and tools for gene identification
* GeneMark: http://exon.gatech.edu/GeneMark
  Uses Markov models.
  Requires organism-specific parameters, so it has to be "trained" on known genes.
  Variants for prokaryotic, eukaryotic and viral sequences.

GeneMark: http://exon.gatech.edu/GeneMark

Gene Prediction in Bacteria, Archaea and Metagenomes
For bacterial and archaeal gene prediction we recommend to use a parallel combination of GeneMark-P* and GeneMark.hmm-P with pre-computed models. A novel genome can be analyzed either by the program with Heuristic models (if the sequence is shorter than 100 kb) or by the self-training program GeneMarkS* (aka GeneMark.hmm-PS). Metagenomic sequences can be analyzed by our new program with updated heuristic models.

Gene Prediction in Eukaryotes
For eukaryotic gene prediction you can use the parallel combination of GeneMark-E* and GeneMark.hmm-E. For a novel genome (the one whose name is not in the list of available models) you can install and run locally GeneMark.hmm-ES, the self-training program (just 10MB sequence is needed for training).

Gene Prediction in Viruses, Phages and Plasmids
For novel virus, phage and plasmid gene prediction you can use either the Heuristic approach (if the sequence is shorter than 50 kb) or the self-training program GeneMarkS (aka GeneMark.hmm-PS). Both options will run the parallel combination of GeneMark and GeneMark.hmm.

Algorithms and tools for gene identification
• GENSCAN: http://genes.mit.edu/GENSCAN.html
  Complex model of gene structure (transcriptional, translational and splicing signals + statistical properties of coding and non-coding regions).
  Primary analysis of large stretches of eukaryotic genomic DNA.

GENSCAN: http://genes.mit.edu/GENSCAN.html
The New GENSCAN Web Server at MIT
Identification of complete gene structures in genomic DNA
This server provides access to the program Genscan for predicting the locations and exon-intron structures of genes in genomic sequences from a variety of organisms.
This server can accept sequences up to 1 million base pairs (1 Mbp) in length. If you have trouble with the web server or if you have a large number of sequences to process, request a local copy of the program (see instructions at the bottom of this page) or use the GENSCAN email server. If your browser (e.g., Lynx) does not support file upload or multipart forms, use the older version.

Algorithms and tools for gene identification

Program | Organism | Algorithm* | Website | Homology
GeneID | Vertebrates, plants | DP | http://www1.imim.es/geneid.html |
FGENESH | Human, mouse, Drosophila, rice | HMM | http://www.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind |
GeneParser | Vertebrates | NN | http://beagle.colorado.edu/~eesnyder/GeneParser.html | EST
Genie | Drosophila, human, other | GHMM | http://www.fruitfly.org/seq_tools/genie.html | protein
GenLang | Vertebrates, Drosophila, dicots | Grammar rule | http://www.cbil.upenn.edu/genlang/genlang_home.html |
GENSCAN | Vertebrates, Arabidopsis, maize | GHMM | http://genes.mit.edu/GENSCAN.html |
GlimmerM | Small eukaryotes, Arabidopsis, rice | IMM | http://www.tigr.org/tdb/glimmerm/glmr_form.html |
GRAIL | Human, mouse, Arabidopsis, Drosophila | NN, DP | http://compbio.ornl.gov/Grail-bin/EmptyGrailForm | EST, cDNA
HMMgene | Vertebrates, C. elegans | CHMM | http://www.cbs.dtu.dk/services/HMMgene/ |
AUGUSTUS | Human, Arabidopsis | IMM […] | http://augustus.gobics.de/ |
MZEF | Human, mouse, Arabidopsis, fission yeast | Quadratic discriminant analysis | http://rulai.cshl.org/tools/genefinder/ |

*DP, dynamic programming; NN, neural network; MM, Markov model; HMM, hidden Markov model; CHMM, class HMM; GHMM, generalized HMM; IMM, interpolated MM.

Summary
• Prediction of prokaryotic genes is much simpler than prediction of eukaryotic genes.
• Genes are predicted ab initio or on the basis of sequence homology.
• The two approaches need to be combined.
• It is sensible to use several prediction programs.

Assignment (deadline: 27 April)
• DEFINITION fucose-specific lectin [Arthroderma otae CBS 113480].
• ACCESSION XP_002846975
• VERSION XP_002846975.1
• DBSOURCE REFSEQ: accession XM_002846929.1