CG920 Genomics
Lesson 2: Genes Identification

Jan Hejátko
Functional Genomics and Proteomics of Plants, Mendel Centre for Plant Genomics and Proteomics, Central European Institute of Technology (CEITEC), Masaryk University, Brno
hejatko@sci.muni.cz, www.ceitec.muni.cz

INVESTMENTS IN EDUCATION DEVELOPMENT
This presentation is co-financed by the European Social Fund and the state budget of the Czech Republic.
European Union; Ministry of Education, Youth and Sports; OP Education for Competitiveness

Literature
■ Literature sources for Chapter 02:
■ Plant Functional Genomics, ed. Erich Grotewold, 2003, Humana Press, Totowa, New Jersey
■ Majoros, W.H., Pertea, M., Antonescu, C. and Salzberg, S.L. (2003) GlimmerM, Exonomy, and Unveil: three ab initio eukaryotic genefinders. Nucleic Acids Research, 31(13)
■ Singh, G. and Lykke-Andersen, J. (2003) New insights into the formation of active nonsense-mediated decay complexes. TRENDS in Biochemical Sciences, 28, 464
■ Wang, L. and Wessler, S.R. (1998) Inefficient reinitiation is responsible for upstream open reading frame-mediated translational repression of the maize R gene. Plant Cell, 10, 1733
■ de Souza et al. (1998) Toward a resolution of the introns early/late debate: Only phase zero introns are correlated with the structure of ancient proteins. PNAS, 95, 5094
■ Feuillet and Keller (2002) Comparative genomics in the grass family: molecular characterization of grass genome structure and evolution. Ann Bot, 89, 3-10
■ Frobius, A.C., Matus, D.Q., and Seaver, E.C. (2008) Genomic organization and expression demonstrate spatial and temporal Hox gene colinearity in the lophotrochozoan Capitella sp. I. PLoS One, 3, e4004

Outline
■ Forward and reverse genetics approaches
   ■ Differences between the approaches used for identification of genes and their function
■ Identification of genes ab initio
   ■ Structure of genes and searching for them
   ■ Genomic colinearity and genomic homology
■ Experimental identification of genes
   ■ Constructing gene-enriched libraries using methylation filtration technology
   ■ EST libraries
   ■ Forward and reverse genetics

Outline
■ Forward and reverse genetics approaches
   ■ Differences between the approaches used for identification of genes and their function

Forward vs. reverse genetics
Revolution in understanding the term "gene"
■ "Classical" genetics approaches
■ "Reverse genetics" approaches
[Figure: sequence fragment 5'-TTATATATATATATTAAAAAATAAAATAAAA]

Identification of the role of ARR21 gene
• Hypothetical signal transducer in the two-component system of Arabidopsis

Identification of the role of ARR21 gene
[Figure: recent model of cytokinin (CK) signaling via the multistep phosphorelay (MSP) pathway. Labelled components: AHK sensor histidine kinases (AHK2, AHK3, CRE1/AHK4/WOL), HPt proteins (AHP1-6), response regulators (ARR1-24). Compartments: PM, nucleus. Outputs: regulation of transcription, interaction with effector proteins.]

Identification of the role of ARR21 gene: isolation of insertional mutant
• Mutant identified by searching databases of insertional mutants (SINS, sequenced insertion sites) using BLAST

Searching in databases of insertional mutants (SINS):

   Insert_SINS: 01_09_64
   Query:    80 tcctagcgttcatgagcgtaccatacttgacaanagagaacgtagccagccatttacagg 139
                ||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||
   Sbjct: 58319 tcctagcgttcatgagcgtaccatacttgacaagagagaacgtagccagccatttacagg 58378
   ARR21: 1830

   Insert_SINS: 01_09_64
   Query:   140 tttgatatctcttgtcaaaaatgtttttggattttactgt 179
                ||||||||||||||||||||||||||||||||||||||||
   Sbjct: 58379 tttgatatctcttgtcaaaaatgtttttggattttactgt 58418
   ARR21: 1890

Localization of the dSpm insertion in the genomic sequence of ARR21 by sequencing of PCR products
[Figure: map of ARR21 with the ATG, the dSpm insertion and primer positions; labelled fragment lengths 1727 bp and 1728 bp]

Identification of the role of ARR21 gene: analysis of expression
• Expression of ARR21 in wild type and inhibition of ARR21 expression in the insertional mutant confirmed at the RNA level
[Figure: RT-PCR gels, wild type vs. insertional mutant; lanes labelled gene/cycles (ACTIN2/20, ACTIN2/25, ARR21/40); controls: water, DNA]

   gene / cycles    primers
   ACTIN2 / 25      aktLM - aktL1
   ARR21 / 40       2UI - 2LII
   ARR21 / 40       1UII - 1LI
   ARR21 / 40       2UI - dsLb

Identification of the role of ARR21 gene: phenotype analysis of mutant
• Phenotype analysis of the insertional mutant
Analysis of sensitivity to plant growth regulators
■ 2,4-D and kinetin
■ ethylene
■ Light of various wavelengths
No alterations, neither in flowering nor in the number of seeds
[Figure: phenotype of wild type and mutant in the sensitivity assays]
[Figure axis: kinetin concentration series 3, 10, 30, 100, 300, 1000]

Identification of the role of ARR21 gene: causes of absence of the phenotype
• Functional redundancy within the gene family?
[Figure: homology of ARR genes. Legend: □ ARR-A, □ ARR-B, • at least one EST found]
• Phenotype only under very specific conditions (?)

Identification of the role of ARR21 gene: summary
■ The ARR21 gene was identified by comparative analysis of the Arabidopsis genome
■ Its function was predicted on the basis of sequence analysis
■ Site-specific expression of the ARR21 gene was demonstrated at the RNA level
■ Identifying the function of ARR21 in Arabidopsis development by insertional mutagenesis was not successful, probably because of functional redundancy within the gene family

Outline
■ Identification of genes ab initio
   ■ Structure of genes and searching for them

Structure of genes
[Figure: gene structure with promoter (TATA), transcription start, 5' UTR, translation start (ATG), splice site, stop codon, 3' UTR and polyadenylation signal]

RNA splicing
[Figure: intron with its 3' splice site and conserved regions]

Identification of genes ab initio
■ Omitting 5' and 3' UTR
■ Identification of translation start (ATG) and stop codon (TAG, TAA, TGA)
■ Finding donor (typically GT) and acceptor (AG) splice sites
■ Many ORFs are not true coding sequences: in Arabidopsis, there are on average approximately 350 million ORFs in every 900 bp of sequence
■ Using various statistical models (e.g. Hidden Markov Model, HMM; see recommended literature, Majoros et al., 2003) to evaluate and score the weight of identified donor and acceptor sites

Splice site prediction
Programs for splice site prediction (specificity approximately 35 %)
□ GeneSplicer (http://www.tigr.org/tdb/GeneSplicer/gene_spl.html)
□ SplicePredictor (http://deepc2.psi.iastate.edu/cgi-bin/sp.cgi)
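The first steps listed for ab initio gene identification (locating ATG and stop codons, then candidate GT donor and AG acceptor dinucleotides) can be sketched in a few lines of Python. This is only a minimal illustration and not how GeneSplicer or SplicePredictor work internally: the function names `find_orfs` and `find_splice_candidates`, the `min_codons` threshold and the toy sequence are all assumptions made for this example. Its main point is that raw candidates are plentiful even in a very short sequence, which is why statistical scoring of the sites is needed.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs of ATG...stop ORFs on the given strand.

    Coordinates are 0-based and end-exclusive; min_codons counts the
    start and stop codons as well (hypothetical threshold).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ATG
    return sorted(orfs)


def find_splice_candidates(seq):
    """Return 0-based positions of all GT (donor) and AG (acceptor) dinucleotides."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    toy = "CCATGGCTGTAAGTTTTCAGGCTTAAGG"   # invented sequence
    print("ORFs:", find_orfs(toy))
    donors, acceptors = find_splice_candidates(toy)
    print("GT candidates:", donors)
    print("AG candidates:", acceptors)
```

Every GT and AG is reported here regardless of context; a real predictor would score each candidate against a trained model and keep only the high-scoring ones.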

Splice site prediction
[Screenshot: SplicePredictor web form]
SplicePredictor - a method to identify potential splice sites in (plant) pre-mRNA by sequence inspection using Bayesian statistical models (an older method using logitlinear models is also available).
Sequences should be in the one-letter code ({a,b,c,g,h,k,m,n,r,s,t,u,w,y}), upper or lowercase; all other characters are ignored during input. Multiple sequence input is accepted in FASTA format (sequences separated by identifier lines of the form ">SQ;name_of_sequence comments") or in GenBank format.
Paste your genomic DNA sequence here:

   GAGGAGGCACAAAATGACGAATATACAAAATGATCTTAAACAGCTAAACTATATTGGACATTTTTTCGATCTCAGATATA
   AAAGATTTCATTCAATATAATACTTGGATAAATACTCTTATTATTTTTCTTTAGTTTATTAAAAAAAACCTCTAATAAAT
   ACGAGTTTAAGTCCACAAAATCGCTTAGACTAAAATACACCATATAATTTCAAACGATAAAGTTTACAAAAGTAATATCC
   AAGTATCTCATAGTCAACATATATATAGTAATAATTAGTTGACGTATAAGAAAATAAAAATAAATAAATTAGTATCTTAT
   TTTGGGTGGTGCTGACTGGTGACTGGTGACTGCAGAATGCTCGGCAAATGGAACCATATCCCAAGACATGGGTTTTAGAT

... or upload your sequence file, or type in the GenBank accession number of your sequence.

Splice site prediction
SplicePredictor. Version of February 13, 2005.
Date run: Wed Nov 9 11:30:14 2005
Species: Homo sapiens
Model: 2-class Bayesian
Prediction cutoff (2 ln[BF]): 3.00
Local pruning: on
Non-canonical sites: not scored
Sequence your-sequence, from 1 to 9490.
Potential splice sites
[Table: predicted donor (D) and acceptor (A) sites with their location (loc), flanking sequence and scores. Most values are not legible in the extracted text. Legible entries include the acceptor sites 4254 attattgttcttcAGat, 5472 ttgttaaaattacAGct and 6135 ggtctattattatAGgt, and the donor sites 5356 caaGTgaat, 5384 ttgGTaaga and 5745 gcgGTaaga.]
[Figure: annotated sequence map with restriction sites and the predicted splice sites]

Identification of genes ab initio
Programs for splice site prediction (specificity approximately 35 %)
□ GeneSplicer (http://www.tigr.org/tdb/GeneSplicer/gene_spl.html)
□ SplicePredictor (http://deepc2.psi.iastate.edu/cgi-bin/sp.cgi)
□ NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/)

Splice site prediction
[Screenshot: NetGene2 page of the Center for Biological Sequence Analysis (CBS)]
NetGene2 Server
The NetGene2 server is a service producing neural network predictions of splice sites in human, C. elegans and A. thaliana.
[Screenshot: submission form. A single sequence can be submitted as a local file in FASTA format or pasted directly, with a sequence name and a choice of organism (Human, C. elegans, A. thaliana; here A. thaliana). The pasted sequence is the same genomic sequence as submitted to SplicePredictor.]
NOTE: The submitted sequences are kept confidential and will be erased immediately after processing.

Splice site prediction
Prediction done
NetGene2 v. 2.4
The sequence has the following composition:
Length: 9490 nucleotides.
31.3% A, 17.0% C, 19.6% G, 31.7% T, 0.0% X, 36.5% G+C

Donor splice sites, direct strand
[Table columns: pos 5'->3', phase, strand, confidence, 5' exon / intron 3'. Phase and confidence values are largely illegible in the extracted text. Predicted positions: 1704, 1906, 4134, 4619, 4915, 5356, 5384, 5809, 6057, 6096, 7369, 7886, 9323; an H flag is legible for 5384, 6057 and 7369.]

Donor splice sites, complement strand
[Table columns: pos 3'->5', pos 5'->3', phase, strand, confidence, 5' exon / intron 3'. No rows are legible.]

Acceptor splice sites, direct strand
[Table columns: pos 5'->3', phase, strand, confidence, 5' intron / exon 3'. Predicted positions: 1213, 1221, 1373, 1487, 4254, 4832, 5004, 5472, 6135, 6490, 6744, 7447, 7780, 7786; an H flag is legible for 4254 and 6135.]

[Figure: annotated sequence map with restriction sites, the predicted splice sites, exon 2 and its translation]

RNA splicing and adaptation
■ Divergences in splice site recognition in plants in practice: an example of the developmental plasticity of (not only) plants
■ Identification of a mutant with a point mutation (transition G→A) exactly at the splice site at the 5' end of the 4th exon
[Figure: sequence around exons 3 and 4 in wild type and the pis1 mutant, with restriction sites, primer positions (PDR_U1b, PDR_L1), the pis1 intron and the exon 3 and exon 4 ORFs; the "no splicing" and "pis1 DEL" variants are marked]
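One way to reason about a mutation of this kind: if the AG dinucleotide of the acceptor site at the 5' end of an exon is destroyed by a G→A transition, another AG nearby may be used instead. The minimal Python sketch below locates the nearest downstream AG and counts how many exon bases would be lost if it were used. Everything in it is an assumption made for illustration: the sequence is invented (it is not the sequence of the mutant shown in the slides) and `mutate` and `next_acceptor` are hypothetical helper names. Real splice-site choice also depends on the branch point, the polypyrimidine tract and further sequence context.

```python
def mutate(seq, pos, new_base):
    """Return seq with the base at 0-based position pos replaced by new_base."""
    return seq[:pos] + new_base + seq[pos + 1:]


def next_acceptor(seq, start):
    """Return the 0-based index of the first AG at or after start, or None."""
    idx = seq.upper().find("AG", start)
    return None if idx == -1 else idx


# Toy sequence (invented): lowercase = end of intron, uppercase = start of exon.
intron_end = "ttgctgttgcag"
exon_start = "CTTAACACTGCTAGCTGCGG"
wild_type = intron_end + exon_start

acceptor = len(intron_end) - 2                  # index of the 'a' in the terminal AG
mutant = mutate(wild_type, acceptor + 1, "a")   # G -> A transition inside the AG

wt_site = next_acceptor(wild_type, acceptor)    # the original acceptor
mut_site = next_acceptor(mutant, acceptor)      # nearest AG left in the mutant
lost = mut_site - wt_site                       # exon bases skipped if it is used

print("wild-type acceptor at", wt_site)
print("nearest AG in mutant at", mut_site)
print("exon bases lost:", lost, "| frame preserved:", lost % 3 == 0)
```

A shorter transcript of this kind is what an RT-PCR product smaller than the expected cDNA fragment would be consistent with; whether the reading frame survives depends on whether the number of lost bases is divisible by 3.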

RNA splicing and adaptation
■ Analysis by RT-PCR proved the presence of a fragment shorter than the cDNA should be after the typical splicing event
[Figure: RT-PCR gels with the primer pairs PDR_U1a/PDR_L1 and PDR_U1b/PDR_L1b in wt and pis1; size marker 100 to 500 bp]
■ Sequencing of this fragment then suggested alternative splicing with the closest possible splice site in exon 4
[Figure: sequence around exons 3 and 4 with restriction sites, showing the unspliced variant and the alternative splice site within exon 4]
■ The existence of similar defense mechanisms has been proven in other organisms as well, e.g. the instability of mutant mRNA carrying a premature stop codon (more than 50-55 nt upstream of the last exon-exon junction) in eukaryotes; see recommended literature, Singh and Lykke-Andersen, 2003

Identification of genes ab initio
Programs for exon prediction
□ 4 types of exons (according to location in the gene): initial, internal, terminal, single
□ Programs predict splice sites and also take into account the structure of the given type of exon
• initial:
   □ GENSCAN (http://genes.mit.edu/GENSCAN.html)
   □ GeneMark.hmm (http://opal.biology.gatech.edu/GeneMark/)
• internal:
   □ MZEF (http://rulai.cshl.org/tools/genefinder/)

Identification of genes ab initio
The New GENSCAN Web Server at MIT
Identification of complete gene structures in genomic DNA
This server provides access to the program Genscan for predicting the locations and exon-intron structures of genes in genomic sequences from a variety of organisms.
This server can accept sequences up to 1 million base pairs (1 Mbp) in length. If you have trouble with the web server or if you have a large number of sequences to process, request a local copy of the program or use the GENSCAN email server.
[Screenshot: submission form with the fields Organism, Suboptimal exon cutoff (optional), Sequence name (optional) and Print options; the DNA sequence (one-letter code, upper or lower case, spaces/numbers ignored) is uploaded as a file or pasted; an email address for mailing the results is optional. The same genomic sequence as above is pasted.]

Identification of genes ab initio
GENSCANW output for sequence CKI1
GENSCAN 1.0    Date run: 10-Nov-105    Time: 02:24:26
Sequence CKI1 : 9490 bp : 36.53% C+G : Isochore 1 (0-43 C+G%)
Parameter matrix: Arabidopsis.smat

Predicted genes/exons:

   Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
    1.00 Prom     1497   1536
    1.04 Intr +   5005   5383  379  0  1   70   91   343 0.772  31.41
    1.05 Intr +   5473   6056  584  2  2   38   99   532 0.722  50.76
    1.06 Intr +   6136   7368 1233  0  0   68  108   655 0.977  56.86
    1.07 Term +   7448   7660  213  1  0   43   35   212 0.999  12.65
    1.08 PlyA +   7910   7915    6                              -0.45
    2.03 PlyA -   7976   7971    6                              -4.83
    2.02 Term -   8793   8050  744  0  0  107   37   542 0.997  43.46
    2.01 Init -   9253   8936  318  1  0  105   73   386 0.999  41.18

[Rows 1.01 to 1.03 and the remaining fields of the Prom row are not legible in the extracted text.]

Suboptimal exons with probability > 0.100

   Exnum Type .Begin ...End .Len I/Ac Tscr..
     001 Init   1867   1905   39   64   3.74
     002 Init   2374   2442   69   55   2.40
     003 Intr   3894   4110  217   -3  11.55
     004 Intr   4352   4914  563   75  26.20
     005 Intr   5005   5379  375   70  22.99
     006 Intr   5442   6056  615   95  57.32

[The S, Fr, Ph, Do/T, CodRg and P columns of this table are not legible in the extracted text.]

Explanation
Gn.Ex : gene number, exon number (for reference)
Type  : Init = Initial exon (ATG to 5' splice site)
        Intr = Internal exon (3' splice site to 5' splice site)
        Term = Terminal exon (3' splice site to stop codon)
        Sngl = Single-exon gene (ATG to stop)
        Prom = Promoter (TATA box / initiation site)
        PlyA = poly-A signal (consensus: AATAAA)
S     : DNA strand (+ = input strand; - = opposite strand)
Begin : beginning of exon or signal (numbered on input strand)
End   : end point of exon or signal (numbered on input strand)
Len   : length of exon or signal (bp)
Fr    : reading frame (a forward strand codon ending at x has frame x mod 3). For example, if nucleotides 1,2,3 of the sequence are read as a codon, that's called reading frame 0. If 2,3,4 are read as a codon, that's reading frame 1. If 3,4,5 are read as a codon, that's reading frame 2, and so on. This information, together with the starting and ending positions of the exon, is sufficient to give the amino acid sequence encoded by the exon.
        Another use of the reading frame is that if you see two adjacent predicted exons separated by a relatively short intron which share the same reading frame, it may be worth looking at the possibility that the intervening intron is not correct, i.e. that the two exons plus the intervening intron might form one long exon (assuming there are no in-frame stops in the intron, of course).
Ph    : net phase of exon (exon length modulo 3). For example, an exon of length 15 bp has net phase 0 since 15 is divisible by 3, an exon of length 16 bp has net phase 1 because 16 divided by 3 leaves a remainder of 1, an exon of length 17 bp has net phase 2, and an exon of length 18 bp has net phase 0 again. The point of this is that exons whose net phase is 0 can be omitted from the gene without disrupting the reading frame: such exons are candidates for being either 1) incorrect, or 2) alternatively spliced.
I/Ac  : initiation signal or 3' splice site score (tenth bit units; x 10). If below zero, probably not a real acceptor site.
Do/T  : 5' splice site or termination signal score (tenth bit units; x 10). If below zero, probably not a real donor site.
CodRg : coding region score (tenth bit units)
P     : probability of exon (sum over all parses containing exon). This quantity is close to the actual probability that the predicted exon is correct.
Tscr  : exon score (depends on length, I/Ac, Do/T and CodRg scores).

Comments
The SCORE of a predicted feature (e.g., exon or splice site) is a log-odds measure of the quality of the feature based on local sequence properties. For example, a predicted 5' splice site with score > 100 is strong; 50-100 is moderate; 0-50 is weak; and below 0 is poor (more than likely not a real donor site).
The PROBABILITY of a predicted exon is the estimated probability under GENSCAN's model of genomic sequence structure that the exon is correct.
This probability depends in general on global as well as local sequence properties, e.g., it depends on how well the exon fits with neighboring exons. It has been shown that predicted exons with higher probabilities are more likely to be correct than those with lower probabilities.

What are the suboptimal exons?
Under the probabilistic model of gene structural and compositional properties used by GENSCAN, each possible "parse" (gene structure description) which is compatible with the sequence is assigned a probability. The default output of the program is simply the "optimal" (highest probability) parse of the sequence. The exons in this optimal parse are referred to as "optimal exons" and the translation products of the corresponding "optimal genes" are printed as GENSCAN predicted peptides. (All the data in our J Mol Biol paper and on the other GENSCAN web pages refer exclusively to the optimal parse/optimal exons.)

Of course, the optimal parse does not always correspond to the actual (biological) parse of the sequence, that is, the actual set of exons/genes present. In addition, there may be more than one parse which can be considered "correct", for example, in the case of a gene which is alternatively transcribed, translated or spliced. For both of these reasons, it may be of interest to consider "suboptimal" ("near-optimal") exons as well, i.e. exons which have reasonably high probability but are not present in the optimal parse.

Specifically, for every potential exon E in the sequence, the probability P(E) is defined as the sum of the probabilities under the model of all possible "parses" (gene structures) which contain the exact exon E in the correct reading frame. (This quantity is calculated as described on the GENSCAN exon probability page.) Given a probability cutoff C, suboptimal exons are those potential exons with P(E) > C which are not present in the optimal parse.

Suboptimal exons have a variety of potential uses.
First, suboptimal exons sometimes correspond to real exons which were missed for whatever reason by the optimal parse of the sequence. Second, regions of a prediction which contain multiple overlapping and/or incompatible optimal and suboptimal exons may in some cases indicate alternatively spliced regions of a gene (Burge & Karlin, in preparation).

The probability cutoff C used to determine which potential exons qualify as suboptimal exons can be set to any of a range of values between 0.01 and 1.00. The default value on the web page is 1.00, meaning that no suboptimal exons are printed. For most applications, a cutoff value of about 0.10 is recommended. Setting the value much lower than 0.10 will often lead to an explosion in the number of suboptimal exons, most of which will probably not be useful. On the other hand, if the value is set much higher than 0.10, then potentially interesting suboptimal exons may be missed.

Identification of genes ab initio
[Figure: GENSCAN predicted genes in sequence; x-axis in kb; legend: optimal exon, suboptimal exon; exon types: initial, internal, terminal, single-exon gene]

Regulation of translation
• Functional role of splicing in untranslated regions: these regions are an important regulatory part of genes
□ Translational repression by short ORFs in the 5' UTR
□ Identified e.g. in maize (Wang and Wessler, 1998; see the recommended literature for additional information)
□ For CKI1, an attempt was made to demonstrate this mechanism of regulation using transgenic lines carrying uidA under the control of two versions of the promoter (not confirmed so far)

  M  K  R  A  F  .
  ATGaaaagagcttttTAG ATGatggtgaaagttaca....

  M  K  R  A  F  .   M  M  V  K  V  T ...
  ATGaaaagagcttttTAG ATGatggtgaaagttaca....

The two promoter versions at the BamHI junction with uidA:

  gaggaggcacaaaatgacgaa -//- tgtattcttttgttatcaaagggtttcgactttgctccgaggaagaagataatatg^ggatcccccgggtaggtcagtcccttatgttacgtcctgtagaaaccccaacc
  (M) R I P R V G Q S L M L R P V E T P T

  -2739 GAGGAGGCACAAAATGACGAA -//- gttatacaagttcactcaaatgatggtgaaagttacaaagcttgtggcttcacgtcggatcccccgggtaggtcagtcccttatgttacgtcctgtagaaaccccaacc
  M M V K V T K L V A S R R I P R V G Q S L M L R P V E T P T

  Legend: intron, exon

Gene modelling
■ Programs for gene modelling
□ Those that take other parameters into account as well, e.g. continuity of ORFs
□ GENSCAN (http://genes.mit.edu/GENSCAN.html): very good for prediction of exons in coding regions (tested on the gene PDR9, where GENSCAN identified all 23 (!) exons)
□ GeneMark.hmm (http://opal.biology.gatech.edu/GeneMark/)
□ GlimmerHMM (http://ccb.jhu.edu/software/glimmerhmm/)
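The upstream ORF arrangement shown on the Regulation of translation slides (a short ORF, ATGaaaagagcttttTAG, preceding the main ATG) is also a small instance of the ORF continuity that gene modelling programs take into account. The following is a minimal Python sketch of scanning a leader sequence for ORFs that are closed by an in-frame stop codon. It is an illustration only, not the algorithm of any of the programs listed above; the function names are hypothetical and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate complete codons of seq; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def find_closed_orfs(seq: str):
    """Return (start, end, peptide) for every ATG followed by an in-frame stop.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        for pos in range(start, len(seq) - 2, 3):
            if CODON_TABLE.get(seq[pos:pos + 3]) == "*":
                orfs.append((start, pos + 3, translate(seq[start:pos])))
                break
    return orfs


# Sequence from the slide: short upstream ORF followed by the start of the main ORF.
leader = "ATGaaaagagcttttTAG" + "ATGatggtgaaagttaca"
print(find_closed_orfs(leader))   # [(0, 18, 'MKRAF')]
print(translate(leader[18:]))     # MMVKVT
```

Only the upstream ORF is reported as closed, because the fragment of the main ORF shown on the slide ends before any stop codon.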
Identification of genes ab initio

GeneMark
A family of gene prediction programs provided by Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology, Atlanta, Georgia.

Gene Prediction in Bacteria and Archaea
For bacterial and archaeal gene prediction, you can use the parallel combination of the GeneMark and GeneMark.hmm programs. If the DNA sequence of interest belongs to a species whose name is not in the list of available models, you should use either the Heuristic models option or, if the sequence is longer than 1 Mb, generate models with the self-training program GeneMarkS. Both options will allow you to generate models and then to use GeneMark.hmm and GeneMark in parallel.

Gene Prediction in Eukaryotes
For eukaryotic gene prediction, you can use the parallel combination of the GeneMark and GeneMark.hmm programs.

Gene Prediction in EST and cDNA
A separate page is provided for the analysis of ESTs and cDNAs.

Gene Prediction in Viruses
A separate page is provided for viral gene prediction and for access to the virus database VIOLIN.

Programs: GeneMark, GeneMark.hmm, Frame-by-Frame, GeneMarkS, Heuristic models

Eukaryotic GeneMark.hmm
References:
1. Borodovsky M. and Lukashin A. (unpublished)
2. Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M., "Gene identification in novel eukaryotic genomes by self-training algorithm", Nucleic Acids Research, 2005, Vol. 33, No. 20, 6494-6506
UPDATE October 2005: Added pre-built models of eukaryotic GeneMark.hmm ES-3.0 (E - eukaryotic; S - self-training; 3.0 - the version)

[Screenshot: Eukaryotic GeneMark.hmm input form; sequence title: CKI1; the pasted nucleotide sequence is illegible in the source]

Identification of genes ab initio
Result of last submission: GeneMark.hmm listing
Eukaryotic GeneMark.hmm version bp 3.9, April 25, 2008
Sequence name: CKI1
Sequence length: 5043 bp
G+C content: 38.7 %
Matrices file: (path illegible in the source; A. thaliana model)
Thu Oct 1 11:09:24 2009

Predicted genes/exons
Exon  Type       Exon range    Exon length
1     Initial     959  1025      67
2     Internal   1155  1394     240
3     Internal   1516  2175     660
4     Internal   2265  2643     379
5     Internal   2734  3317     584
6     Internal   3397  4629    1233
7     Terminal   4709  4921     213

[Figure: GeneMark.hmm prediction plot (order 5, window 96, step 12); x-axis: nucleotide position]

Genomic homologies
■ Searching for genes according to homologies
■ Comparison with EST databases
□ BLASTN (http://www.ncbi.nlm.nih.gov/BLAST/, http://workbench.sdsc.edu/)
■ Comparison with protein databases
□ BLASTX (http://www.ncbi.nlm.nih.gov/BLAST/, http://workbench.sdsc.edu/)
□ Genewise (http://www.ebi.ac.uk/Wise2/)
These tools compare protein sequences with genomic DNA after conceptual translation of the DNA; the amino acid sequence is therefore needed
■ Comparison with homologous genome sequences from related species
□ VISTA/AVID (http://www.lbl.gov/Tech-Transfer/techs/lbnl1690.html)
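Comparing genomic DNA with protein databases relies on translating the nucleotide sequence; BLASTX, for example, translates the nucleotide query in all six reading frames before searching. The following is a minimal Python sketch of that translation step only, with no database search. The function names are hypothetical, the standard genetic code is assumed, and the test sequence is the short upstream ORF shown earlier in the lecture.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons of seq; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str) -> dict:
    """Translate seq in frames +1..+3 (input strand) and -1..-3 (opposite strand)."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


frames = six_frame_translations("ATGaaaagagcttttTAG")
for name, peptide in frames.items():
    print(name, peptide)
```

Each of the six peptides would then be compared with the protein database; only frame +1 of this example contains the peptide MKRAF followed by a stop.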
Outline
■ Forward and reverse genetics approaches
■ Differences between the approaches used for identification of genes and their function
■ Identification of genes ab initio
■ Structure of genes and searching for them
■ Genomic colinearity and genomic homology

Genomic colinearity
■ Genomes of related species, despite large differences, share similarities in sequence organization. This information can be used to identify genes in related species when searching in databases
■ General workflow when applying genomic colinearity (also called "comparative genomics") to the experimental identification of genes in related species:
□ Mapping small genomes using low-copy DNA markers (e.g. RFLP)
□ Using these markers to identify orthologous genes (genes in related species derived from a common ancestral gene, typically with the same or a similar function)
□ A small genome (e.g. rice, 466 Mbp) can be used as a guide: low-copy molecular markers (e.g. RFLP) linked to the gene of interest are identified, and these regions are then used as probes for screening BAC libraries to identify the orthologous regions of large genomes (e.g. barley: 5 Gbp, or wheat: 16 Gbp)

Genomic colinearity
[Figure: (A) maize (2500 Mbp) and rice (400 Mbp); (B) hexaploid wheat (16 000 Mbp), barley (5000 Mbp) and rice (400 Mbp); (C) high gene density, gene-rich region; scale bars: 140 kb, 20 kb, 50 kb, 1 Mb. Feuillet and Keller, 2002]

Genomic colinearity
■ Can be used mostly for grass species (e.g. using related genes of barley, wheat, rice and maize)
■ Small genome rearrangements (deletions, duplications, inversions, translocations smaller than a few cM) are then detected by detailed comparative sequence analysis
■ During evolution, related species have diverged, mostly in non-coding regions (invasion of retrotransposons etc.)

Genomic colinearity
Genomic colinearity of HOX genes in animals
■ Transcription factors controlling the organisation of the body along the anterior-posterior axis
■ The position of the genes in the genome corresponds with their spatial expression during development
[Figure: Hox genes and phylogenetic tree, labelled "Interspecies conservation"; paralogy groups PG1-2 to PG9-14 arranged as anterior, central and posterior classes]

Genomic organization of the Capitella sp. I Hox cluster. A total of 11 Capitella sp. I Hox genes are distributed among three scaffolds. Black lines depict two scaffolds, which contain 10 of the Capitella sp. I Hox genes.
The eleventh gene, CapI-Post1, is located on a separate scaffold surrounded by ORFs of non-Hox genes (unpublished data). No predicted ORFs were identified between adjacent linked Hox genes. Transcription units are shown as boxes denoting exons, connected by lines that denote introns. Transcription orientation is denoted by arrows beneath each box. Color coding is the same as that used on the right-hand side for each ortholog. The phylogenetic tree on the right-hand side shows that the order of the genes on the chromosome is retained in several species (genome colinearity).

Outline
■ Forward and reverse genetics approaches
■ Differences between the approaches used for identification of genes and their function
■ Identification of genes ab initio
■ Structure of genes and searching for them
■ Genomic colinearity and genomic homology
■ Experimental identification of genes
■ Constructing gene-enriched libraries using methylation filtration technology

Methylation filtration
■ Preparation of gene-enriched libraries by methylation filtration technology
■ Genes are (mostly!) hypomethylated, whereas noncoding regions are methylated
■ Uses the bacterial restriction-modification system, which recognizes methylated DNA with the restriction enzymes McrA and McrBC
□ McrBC recognizes a methylated cytosine (in DNA) that follows a purine (G or A)
□ For cleavage, two such sites must be 40-2000 bp apart
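The McrBC rule stated above (a methylated cytosine preceded by a purine, with two such sites 40-2000 bp apart) can be sketched as a simple site search. This is a minimal Python sketch under stated assumptions, not a model of the enzyme: the function names are hypothetical, methylation is supplied as a set of 0-based cytosine positions on one strand only, and the distance is measured between the two methylated cytosines.

```python
def mcrbc_half_sites(seq: str, methylated: set) -> list:
    """Positions of methylated cytosines that directly follow a purine (A or G)."""
    seq = seq.upper()
    return sorted(p for p in methylated
                  if 0 < p < len(seq) and seq[p] == "C" and seq[p - 1] in "AG")


def cleavable_pairs(sites: list, min_gap: int = 40, max_gap: int = 2000) -> list:
    """Pairs of half-sites whose spacing falls in the 40-2000 bp window from the slide."""
    return [(a, b)
            for i, a in enumerate(sites)
            for b in sites[i + 1:]
            if min_gap <= b - a <= max_gap]


# Toy example: two purine-preceded methylated cytosines 52 bp apart,
# plus one methylated cytosine preceded by a pyrimidine (not a half-site).
seq = "AC" + "T" * 50 + "GC" + "T" * 10 + "TC"
methylated = {1, 53, 65}

sites = mcrbc_half_sites(seq, methylated)
print(sites)                   # [1, 53]
print(cleavable_pairs(sites))  # [(1, 53)]
```

A fragment with at least one such pair would be expected to be cut in an mcrBC+ host, so clones carrying heavily methylated (mostly noncoding) DNA are depleted from the library.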
Methylation filtration
■ Preparation of gene-enriched libraries by methylation filtration technology
■ Workflow for preparing BAC genomic libraries using methylation filtration:
□ preparation of genomic DNA free of organellar DNA (chloroplasts and mitochondria)
□ fragmentation of DNA (1-4 kbp) and ligation of adaptors
□ preparation of BAC libraries in an mcrBC+ strain of E. coli
□ selection of positive clones
■ Limited use: enrichment of coding DNA is only approx. 5-10%

Outline
■ Forward and reverse genetics approaches
■ Differences between the approaches used for identification of genes and their function
■ Identification of genes ab initio
■ Structure of genes and searching for them
■ Genomic colinearity and genomic homology
■ Experimental identification of genes
■ Constructing gene-enriched libraries using methylation filtration technology
■ EST libraries

EST libraries
■ Preparation of EST libraries
□ Isolation of mRNA
□ Reverse transcription
□ Ligation of linkers and synthesis of the second cDNA strand
□ Cloning into a suitable bacterial vector
□ Transformation into bacteria and isolation of DNA (amplification of DNA)
□ Sequencing using primers specific for the plasmid used
□ Depositing the sequencing results in a public database
[Figure: example nucleotide sequences]

Fundamentals of Genomics II, Gene Identification