CG020 Genomika
Lesson 2: Genes Identification
Jan Hejatko
Functional Genomics and Proteomics of Plants, Mendel Centre for Plant Genomics and Proteomics, CEITEC - Central European Institute of Technology, and National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno
hejatko@sci.muni.cz, www.ceitec.eu

Literature
■ Literature sources for Chapter 02:
■ Plant Functional Genomics, ed. Erich Grotewold, 2003, Humana Press, Totowa, New Jersey
■ Majoros, W.H., Pertea, M., Antonescu, C. and Salzberg, S.L. (2003) GlimmerM, Exonomy, and Unveil: three ab initio eukaryotic genefinders. Nucleic Acids Research, 31(13).
■ Singh, G. and Lykke-Andersen, J. (2003) New insights into the formation of active nonsense-mediated decay complexes. TRENDS in Biochemical Sciences, 28 (464).
■ Wang, L. and Wessler, S.R. (1998) Inefficient reinitiation is responsible for upstream open reading frame-mediated translational repression of the maize R gene. Plant Cell, 10 (1733).
■ de Souza et al. (1998) Toward a resolution of the introns early/late debate: Only phase zero introns are correlated with the structure of ancient proteins. PNAS, 95 (5094).
■ Feuillet and Keller (2002) Comparative genomics in the grass family: molecular characterization of grass genome structure and evolution. Ann Bot, 89 (3-10).
■ Fröbius, A.C., Matus, D.Q., and Seaver, E.C. (2008). Genomic organization and expression demonstrate spatial and temporal Hox gene colinearity in the lophotrochozoan Capitella sp. I.
PLoS One 3, e4004

Outline
Forward and Reverse Genetics Approaches
■ Differences between the approaches used for identification of genes and their function
Identification of Genes Ab Initio
■ Structure of genes and searching for them
■ Genomic colinearity and genomic homology
Experimental Genes Identification
■ Constructing gene-enriched libraries using methylation filtration technology
■ EST libraries
■ Forward and reverse genetics

Forward vs. Reverse Genetics
Revolution in understanding the term "gene"
■ "classical" genetics approaches
■ "reverse genetics" approaches

Identification of the role of ARR21 gene
• Hypothetical signal transducer in the two-component system of Arabidopsis

Recent Model of the CK Signaling via Multistep Phosphorelay (MSP) Pathway
[Figure: cytokinin signaling pathway. AHK sensor histidine kinases: AHK2, AHK3, CRE1/AHK4/WOL; HPt proteins: AHP1-6; response regulators: ARR1-24; outputs: regulation of transcription, interaction with effector proteins]

Identification of the role of ARR21 gene
• Mutant identified by searching in databases of insertional mutants (SINS - sequenced insertion site) using BLAST

Identification of the role of ARR21 gene - isolation of insertional mutant
Searching in databases of insertional mutants (SINS)
Insert_SINS: 01_09_64
ARR21: 1830
Query:    80 tcctagcgttcatgagcgtaccatacttgacaanagagaacgtagccagccatttacagg 139
Sbjct: 58319 tcctagcgttcatgagcgtaccatacttgacaagagagaacgtagccagccatttacagg 58378
Insert_SINS: 01_09_64
Query:   140 tttgatatctcttgtcaaaaatgtttttggattttactgt 179
Sbjct: 58379 tttgatatctcttgtcaaaaatgtttttggattttactgt 58418
ARR21: 1890

Localization of the dSpm insertion in the genomic sequence of ARR21 by sequencing of PCR products
[Figure: ARR21 gene model with the dSpm insertion; recoverable labels: primers 16k and 16p, fragments of 1727 bp and 1728 bp]

Identification of the role of ARR21 gene
• Expression of ARR21 in the wild type and loss of ARR21 expression in the insertional mutant confirmed at the RNA level

Identification of the role of ARR21 gene - analysis of expression

Identification of the role of ARR21 gene
• Phenotype analysis of the insertional mutant

Identification of the role of ARR21 gene - phenotype analysis of mutant
Analysis of sensitivity to plant growth regulators
■ 2,4-D and kinetin
■ ethylene
■ light of various wavelengths
No alterations were found, neither in flowering nor in the number of seeds
[Figure: response to kinetin; x-axis: kinetin dose (0, 1, 3, 10, 30, 100, 300, 1000)]

Identification of the role of ARR21 gene - homology of ARR genes

Identification of the role of ARR21 gene - possible reasons for the absence of the phenotype
• Functional redundancy within the gene family?
• Phenotype only under specific conditions

Identification of the role of ARR21 gene - summary
■ The ARR21 gene was identified by comparative analysis of the Arabidopsis genome
■ Its function was predicted from sequence analysis
■ Site-specific expression of ARR21 was demonstrated at the RNA level
■ Insertional mutagenesis did not reveal the function of ARR21 in the development of Arabidopsis, probably because of functional redundancy within the gene family

Outline
Identification of Genes Ab Initio
■ Structure of genes and searching for them

Genes Structure
[Figure: gene structure from the promoter through the ATG start codon to the 3' end; sequence labels: ATG....ATTCATC, ATTATCTGATATA...., ATAAATAAATGCGA]

RNA Splicing
[Figure: intron with the 3' splice site marked; boxes denote conserved regions]

Identification of Genes Ab Initio
■ Omitting 5' and 3' UTR
■ Identification of translation start (ATG) and stop codon (TAG, TAA, TGA)
■ Finding donor (typically GT) and acceptor (AG) splice sites
■ Using various statistical models (e.g. Hidden Markov Model - HMM, see recommended literature, Majoros et al., 2003) to evaluate and score the weight of identified donor and acceptor sites

Splicing Site Prediction
Programs for splice site prediction (specificity approximately 35 %)
□ GeneSplicer (http://www.tigr.org/tdb/GeneSplicer/gene_spl.html)
□ SplicePredictor (http://deepc2.psi.iastate.edu/cgi-bin/sp.cgi)

SplicePredictor
[Screenshot: SplicePredictor input page]
SplicePredictor - a method to identify potential splice sites in (plant) pre-mRNA by sequence inspection using Bayesian statistical models. Sequences should be in the one-letter code ({a,b,c,g,h,k,m,n,r,s,t,u,w,y}), upper or lower case; all other characters are ignored during input.
Multiple sequence input is accepted in FASTA format (sequences separated by identifier lines of the form ">SQ;name_of_sequence comments") or in GenBank format. Paste your genomic DNA sequence here, or upload your sequence file, or type in the GenBank accession number of your sequence.

SplicePredictor
[Screenshot: SplicePredictor output page with the explanation of output columns]

Splicing Site Prediction
Programs for splice site prediction (specificity approximately 35 %)
□ GeneSplicer (http://www.tigr.org/tdb/GeneSplicer/gene_spl.html)
□ SplicePredictor (http://deepc2.psi.iastate.edu/cgi-bin/sp.cgi)
□ NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/)

NetGene2
[Screenshot: NetGene2 Server (Center for Biological Sequence Analysis, CBS), a service producing neural network predictions of splice sites in human, C. elegans and A. thaliana DNA. Submission form: a local file with a single sequence in FASTA format or a pasted sequence, with the organism selected (Human, C. elegans, A. thaliana). Note on the page: the submitted sequences are kept confidential and will be erased immediately after processing.]

NetGene2
[Screenshot: NetGene2 prediction output listing donor and acceptor splice sites with positions, confidence values and flanking sequences]

RNA Splicing and Adaptation
■ Flexibility in splice site recognition in plants in practice - an example of developmental plasticity of (not only) plants
■ Identification of a mutant with a point mutation (G->A transition) exactly at the splice site at the 5' end of the 4th exon
[Figure: RT-PCR gel, wild type (wt) vs. pis1, primer pairs PDR_U1a/PDR_L1a and PDR_U1b/PDR_L1b; size markers 100-500 bp]
■ Analysis by RT-PCR proved the presence
of a fragment shorter than the cDNA expected after the typical splicing event
■ Sequencing of this fragment then suggested alternative splicing at the closest possible splice site in exon 4
■ The existence of similar defense mechanisms has been demonstrated in other organisms as well (e.g. instability of mutant mRNA with an early stop codon, formed > 50-55 bp before the typical stop codon, in eukaryotes; see recommended literature, Singh and Lykke-Andersen, 2003)

Identification of Genes Ab Initio
■ Programs for exon prediction
□ 4 types of exons (according to location in the gene): initial, internal, terminal, single
□ Besides predicting splice sites, the programs also take into account the structure of each exon type
• initial:
□ GENSCAN (http://hollywood.mit.edu/GENSCAN.html)
□ GeneMark.hmm (http://opal.biology.gatech.edu/GeneMark/)
• internal:
□ MZEF (http://rulai.cshl.org/tools/genefinder/)

GENSCAN
The New GENSCAN Web Server at MIT
Identification of complete gene structures in genomic DNA

Explanation
Gn.Ex : gene number, exon number (for reference)
Type : Init = Initial exon (ATG to 5' splice site); Intr = Internal exon (3' splice site to 5' splice site); Term = Terminal exon (3' splice site to stop codon); Sngl = Single-exon gene (ATG to stop); Prom = Promoter (TATA box / initiation site); PlyA = poly-A signal (consensus: AATAAA)
S : DNA strand (+ = input strand; - = opposite strand)
Begin : beginning of exon or signal (numbered on input strand)
End : end point of exon or signal (numbered on input strand)
Len : length of exon or signal (bp)
Fr : reading frame (a forward strand codon ending at x has frame x mod 3). For example, if nucleotides 1,2,3 of the sequence are read as a codon, that's called reading frame 0. If 2,3,4 are read as a codon, that's reading frame 1. If 3,4,5 are read as a codon, that's reading frame 2, and so on. This information, together with the starting and ending positions of the exon, is sufficient to give the amino acid sequence encoded by the exon. Another use of the reading frame is that if you see two adjacent predicted exons separated by a relatively short intron which share the same reading frame, it may be worth looking at the possibility that the intervening intron is not correct, i.e. that the two exons plus the intervening intron might form one long exon (assuming there are no in-frame stops in the intron, of course).
Ph : net phase of exon (exon length modulo 3). For example, an exon of length 15 bp has net phase 0 since 15 is divisible by 3, an exon of length 16 bp has net phase 1 because 16 divided by 3 leaves a remainder of 1, an exon of length 17 bp has net phase 2, and an exon of length 18 bp has net phase 0 again. The point of this is that exons whose net phase is 0 can be omitted from the gene without disrupting the reading frame: such exons are candidates for being either 1) incorrect, or 2) alternatively spliced.
I/Ac : initiation signal or 3' splice site score (tenth bit units; x 10). If below zero, probably not a real acceptor site.
Do/T : 5' splice site or termination signal score (tenth bit units; x 10). If below zero, probably not a real donor site.
CodRg : coding region score (tenth bit units)
P : probability of exon (sum over all parses containing exon). This quantity is close to the actual probability that the predicted exon is correct.
Tscr : exon score (depends on length, I/Ac, Do/T and CodRg scores).

Comments
The SCORE of a predicted feature (e.g., exon or splice site) is a log-odds measure of the quality of the feature based on local sequence properties. For example, a predicted 5' splice site with score > 100 is strong; 50-100 is moderate; 0-50 is weak; and below 0 is poor (more than likely not a real donor site).
The PROBABILITY of a predicted exon is the estimated probability under GENSCAN's model of genomic sequence structure that the exon is correct. This probability depends in general on global as well as local sequence properties, e.g., it depends on how well the exon fits with neighboring exons. It has been shown that predicted exons with higher probabilities are more likely to be correct than those with lower probabilities.

What are the suboptimal exons?
Under the probabilistic model of gene structural and compositional properties used by GENSCAN, each possible "parse" (gene structure description) which is compatible with the sequence is assigned a probability. The default output of the program is simply the "optimal" (highest probability) parse of the sequence. The exons in this optimal parse are referred to as "optimal exons" and the translation products of the corresponding "optimal genes" are printed as GENSCAN predicted peptides. (All the data in our J Mol Biol paper and on the other GENSCAN web pages refer exclusively to the optimal parse/optimal exons.)
Of course, the optimal parse does not always correspond to the actual (biological) parse of the sequence, that is, the actual set of exons/genes present. In addition, there may be more than one parse which can be considered "correct", for example, in the case of a gene which is alternatively transcribed, translated or spliced. For both of these reasons, it may be of interest to consider "suboptimal" ("near-optimal") exons as well, i.e. exons which have reasonably high probability but are not present in the optimal parse.
Specifically, for every potential exon E in the sequence, the probability P(E) is defined as the sum of the probabilities under the model of all possible "parses" (gene structures) which contain the exact exon E in the correct reading frame. (This quantity is calculated as described on the GENSCAN exon probability page.) Given a probability cutoff C, suboptimal exons are those potential exons with P(E) > C which are not present in the optimal parse.
Suboptimal exons have a variety of potential uses. First, suboptimal exons sometimes correspond to real exons which were missed for whatever reason by the optimal parse of the sequence. Second, regions of a prediction which contain multiple overlapping and/or incompatible optimal and suboptimal exons may in some cases indicate alternatively spliced regions of a gene (Burge & Karlin, in preparation).
The probability cutoff C used to determine which potential exons qualify as suboptimal exons can be set to any of a range of values between 0.01 and 1.00. The default value on the web page is 1.00, meaning that no suboptimal exons are printed. For most applications, a cutoff value of about 0.10 is recommended. Setting the value much lower than 0.10 will often lead to an explosion in the number of suboptimal exons, most of which will probably not be useful. On the other hand, if the value is set much higher than 0.10, then potentially interesting suboptimal exons may be missed.

GENSCAN
[Figure: GENSCAN predicted genes/exons in the submitted sequence (scale in kb); key: initial exon, internal exon, terminal exon, single-exon gene, optimal exon, suboptimal exon]

Regulation of Translation
• Splicing in untranslated regions - an important regulatory part of genes
• Translational repression by short ORFs in the 5' UTR
• Identified e.g. in maize (Wang and Wessler, 1998; see recommended literature for additional information)
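The idea of short ORFs in the 5' UTR can be illustrated by scanning the leader sequence for ATG...stop pairs that lie upstream of the main start codon. The following is a minimal sketch in Python, not the method used in the cited work: the function name `find_uorfs`, the `min_codons` threshold and the 4-nt spacer appended to the leader are assumptions made for illustration only; the short ORF itself (`ATGaaaagagcttttTAG`) is the one shown on the CKI1 slide.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def find_uorfs(leader, min_codons=2):
    """Return (start, end, orf) for every ATG...stop ORF found in `leader`.

    `leader` is the transcript sequence upstream of the main start codon
    (the 5' UTR). Coordinates are 0-based; `end` is exclusive and includes
    the stop codon. `min_codons` counts the ATG but not the stop codon.
    """
    seq = leader.upper()
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the reading frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    uorfs.append((start, pos + 3, seq[start:pos + 3]))
                break  # the first in-frame stop terminates this ORF
    return uorfs


# Hypothetical leader: the short ORF from the slide plus an invented spacer.
leader = "ATGaaaagagcttttTAG" + "cctt"
for start, end, orf in find_uorfs(leader):
    print(start, end, orf)
```

An ATG without an in-frame stop inside the leader is not reported here, because such an ORF would overlap or extend into the main coding region and needs separate treatment.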
• In the case of CKI1, there was an attempt to demonstrate this mechanism of regulation using transgenic lines carrying uidA under the control of two versions of the promoter (unconfirmed so far)
[Figure: short upstream ORF followed by the start of the main ORF]
ATGaaaagagcttttTAG   ATGatggtgaaagttaca....
M K R A F *          M M V K V T ...

Regulation of Translation
• Functional purpose of splicing in untranslated regions - an important regulatory part of genes
[Figure: the two promoter versions fused to uidA through a BamHI site; position label -2739]
GAGGAGGCACAAAATGACGAA -//- TGTATTCTTTTGTTATCAAAGGGTTTCGACTTTGCTCCGAGGAAGAAGATAATATCAGGATCCCCCGGGTAGGTCAGTCCCTTATGTTACGTCCTGTAGAAACCCCAACC
GAGGAGGCACAAAATGACGAA -//- GTTATACAAGTTCACTCAAATGATGGTGAAAGTTACAAAGCTTGTGGCTTCACGTCGGATCCCCCGGGTAGGTCAGTCCCTTATGTTACGTCCTGTAGAAACCCCAACC
MMVKVTKLVASRRIPRVGQSLMLRPVETPT

Gene Modelling
Programs for gene modelling
□ These take into account other parameters as well, e.g. continuity of ORFs
□ GENSCAN (http://hollywood.mit.edu/GENSCAN.html) - very good for prediction of exons in coding regions (tested on the PDR9 gene: GENSCAN identified all 23 exons)
□ GeneMark.hmm (http://opal.biology.gatech.edu/GeneMark/)
□ GlimmerHMM (https://ccb.jhu.edu/software/glimmerhmm/)

GeneMark
[Screenshot: GeneMark home page - a family of gene prediction programs provided by Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology, Atlanta; section on gene prediction in Bacteria and Archaea, including model generation with the self-training program GeneMarkS]
[Screenshot, continued: GeneMark home page - gene prediction in eukaryotes (parallel combination of GeneMark and GeneMark.hmm), in ESTs and cDNAs, and in viruses; links to programs, databases of predicted genes, help and references]

GeneMark
[Screenshot: GeneMark.hmm result page (eukaryotic GeneMark.hmm); sequence name: CKI1; G+C content: 38.73 %]
[Screenshot, continued: GeneMark.hmm table and plot of predicted genes/exons for CKI1]

Genomic Homologies
■ Searching for genes according to homologies with known sequences
■ Comparison with EST databases
□ BLASTN (http://www.ncbi.nlm.nih.gov/BLAST/, http://workbench.sdsc.edu/)
■ Comparison with protein databases
□ BLASTX (http://www.ncbi.nlm.nih.gov/BLAST/, http://workbench.sdsc.edu/)
□ Genewise (http://www.ebi.ac.uk/Wise2/)
They compare protein sequences with genomic DNA after conceptual translation of the DNA, therefore the amino acid sequence is needed
■ Comparison with homologous genome sequences from related species
□ VISTA/AVID (http://www.lbl.gov/Tech-Transfer/techs/lbnl1690.html)

Outline
Identification of Genes Ab Initio
■ Structure of genes and searching for them
■ Genomic colinearity and genomic homology

Genomic Colinearity
■ Genomes of related species, despite large differences, are characterized by similarities in sequence organization -> this information can be used for identification of genes in related species when searching in databases
■ General scheme of work when applying genomic colinearity (also called "comparative genomics") for experimental identification of genes in related species:
□ Mapping small genomes using low-copy DNA markers (e.g. RFLP)
□ Using these markers for identification of orthologous genes (genes derived from a common ancestral gene, typically with the same or similar function) of related species
□ A small genome (e.g. rice, 466 Mbp) can be used as a guide: low-copy molecular markers (e.g. RFLP) linked to the gene of interest are identified, and these regions are then used as probes for screening BAC libraries to identify orthologous regions of large genomes (e.g.
barley: 5 Gbp, or wheat: 16 Gbp)

Genomic Colinearity
[Figure: colinear regions compared between maize (2500 Mbp) and rice (400 Mbp), scale 20 kb, and between hexaploid wheat (16 000 Mbp), barley (5000 Mbp) and rice (400 Mbp), scale 300 kb; regions of high gene density and gene-rich regions are marked. Feuillet and Keller, 2002]

Genomic Colinearity
■ Can be used mostly for grass species (e.g. using related genes of barley, wheat, rice and maize)
■ Small genome rearrangements (deletions, duplications, inversions and translocations smaller than a few cM) are then detected by detailed comparative sequence analysis
■ During evolution, related species have diverged, mostly in non-coding regions (invasion of retrotransposons etc.)

Genomic Colinearity
Genomic colinearity of HOX genes in animals
■ Transcription factors controlling the organisation of the body along the anterior-posterior axis
■ Position of genes in the genome corresponds with their spatial expression during development
■ Interspecies conservation
[Figure: Capitella sp. I Hox cluster (left) and phylogenetic tree of Hox orthologs (right)]
Genomic organization of the Capitella sp. I Hox cluster. A total of 11 Capitella sp. I Hox genes are distributed among three scaffolds. Black lines depict two scaffolds, which contain 10 of the Capitella sp. I Hox genes. The eleventh gene, CapI-Post1, is located on a separate scaffold surrounded by ORFs of non-Hox genes (unpublished data). No predicted ORFs were identified between adjacent linked Hox genes. Transcription units are shown as boxes denoting exons, connected by lines that denote introns. Transcription orientation is denoted by arrows beneath each box. Color coding is the same as that used in the tree on the right-hand side for each ortholog. The phylogenetic tree on the right-hand side shows that the order of the genes on the chromosome is retained in several species (genome colinearity).
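Colinearity in the sense used above, conserved order of orthologous genes between related genomes, can be made concrete with a small calculation: map each gene of one genome to the position of its ortholog in the other genome and look for the longest set of genes whose relative order is preserved. The sketch below is a minimal illustration only. The gene names and gene orders are invented, and the longest-increasing-subsequence approach is a choice made here, not a method named in the lecture or in the cited papers.

```python
from bisect import bisect_left


def longest_colinear_block(order_a, order_b):
    """Return the longest list of shared genes that keep the same relative
    order in both genomes. Genes missing from either genome are ignored."""
    pos_in_b = {gene: i for i, gene in enumerate(order_b)}
    shared = [g for g in order_a if g in pos_in_b]
    ranks = [pos_in_b[g] for g in shared]

    # Longest increasing subsequence of `ranks` with back-pointers.
    tail_idx = []   # tail_idx[k]: index of the last element of the best run of length k+1
    tail_val = []   # rank stored at that index (kept sorted for bisect)
    prev = [-1] * len(ranks)
    for i, r in enumerate(ranks):
        k = bisect_left(tail_val, r)
        if k == len(tail_idx):
            tail_idx.append(i)
            tail_val.append(r)
        else:
            tail_idx[k] = i
            tail_val[k] = r
        prev[i] = tail_idx[k - 1] if k > 0 else -1

    # Walk the back-pointers from the end of the longest run.
    block = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        block.append(shared[i])
        i = prev[i]
    return block[::-1]


# Invented example: a small "guide" genome and a larger related genome with
# two extra genes (x1, x2) and a local inversion of g3 and g4.
small_genome = ["g1", "g2", "g3", "g4", "g5", "g6"]
large_genome = ["g1", "x1", "g2", "g4", "g3", "x2", "g5", "g6"]
block = longest_colinear_block(small_genome, large_genome)
print(len(block), "of", len(small_genome), "genes are colinear:", block)
```

In this toy example five of the six shared genes keep their order; the gene left out of the block marks the small inversion, which is the kind of local rearrangement that detailed sequence comparison is meant to detect.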
Outline
Experimental Genes Identification
■ Constructing gene-enriched libraries using methylation filtration technology

Methylation Filtration
Preparation of gene-enriched libraries by methylation filtration technology
■ genes are (mostly!) hypomethylated, non-coding regions are methylated
■ using a bacterial restriction-modification system that recognizes methylated DNA with the restriction enzymes McrA and McrBC
□ McrBC recognizes a methylated cytosine (in DNA) that follows a purine (G or A)
□ For cleavage, two such sites 40-2000 bp apart are necessary

Methylation Filtration
Scheme of work during preparation of BAC genome libraries using methylation filtration:
□ preparation of genomic DNA free of organelle DNA (chloroplasts and mitochondria)
□ fragmentation of DNA (1-4 kbp) and ligation of adaptors
□ preparation of BAC libraries in an mcrBC+ strain of E. coli
□ selection of positive clones
Limited usage: approx.
5-10 % enrichment of coding DNA only

Outline
Experimental Genes Identification
■ Constructing gene-enriched libraries using methylation filtration technology
■ EST libraries

EST Libraries
Preparation of EST libraries
■ Isolation of mRNA
■ Reverse transcription
■ Ligation of linkers and synthesis of the second cDNA strand
■ Cloning into a suitable bacterial vector
■ Transformation into bacteria and isolation of the DNA (amplification of DNA)
■ Sequencing using primers specific for the plasmid used
■ Saving the results of sequencing into a public database
[Figure: example of sequenced reads]
Basics of Genomics II, Gene Identification

Outline
Forward and Reverse Genetics Approaches
■ Differences between the approaches used for identification of genes and their function
Identification of Genes Ab Initio
■ Structure of genes and searching for them
■ Genomic colinearity and genomic homology
Experimental Genes Identification
■ Constructing gene-enriched libraries using methylation filtration technology
■ EST libraries
■ Forward and reverse genetics

Discussion