Before analysis

>P12345 Yeast chromosome1
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT

After partial analysis

>P12345 Gene1 - gene encoding the protein alcohol dehydrogenase
Annotated features: TATA, start, exon1, intron1, exon2, stop
TATAAA CGATTGACGATGACGAT ATG TACAGATTACAGATTACAGATTACAGATGT
CAGATTACAGATTACAGATTACAGATTACAGATTCA
AGATTACAGATTACAGATTACAGA TAA
>P12346 Protein1
MASAQSFYLLDHNQNQNFDDHLAVDIVMILSHERFMN

DNA Sequence Analysis

DNA sequence analysis ≈ annotation of the genome (sequence)
* identification of signals and genes
* annotation of genes (their coding sequences)

Gene annotation ≈ protein annotation
* identification and description of the physico-chemical, functional and structural properties of a given gene/protein
* DNA and amino-acid sequence, position in the genome, length, composition
* common names, literature references
* family membership, evolution
* interaction partners, activity, regulatory mechanisms
* structure, active sites, role in the metabolism of the cell

Eukaryotic Gene Structure

[Figure label: transcribed region]
[Figure labels, continued: exon 1, intron 1, exon 2, intron 2, exon 3; start codon; stop codon; 5' UTR; 3' UTR; upstream intergenic region; downstream intergenic region]

DNA Sequence Analysis

Statistics
* frequency of n-grams and other elements, repeats, codons
Signal elements
* TATA (promoter), ATG (start), STOP, GT (donor), AG (acceptor), etc.
Coding part
* similarity of the encoded sequence to other proteins
Combined approaches

Gene Identification

In prokaryotes the reliability is 95-100%; in more complex eukaryotes it is about 90% at the base level and 70% at the exon/intron level. Complicating factors in eukaryotes:
* existence of introns
* larger genomes
* low gene density (<30%; 3% in Homo sapiens)
* alternative splicing (in roughly half of the genes)
* large amounts of repetitive sequences
* occasional overlap of genes

Gene Identification

[Figure: genes delimited by start and stop codons along the sequence, with arrows showing their direction]

Eukaryotic Gene Structure

[Figure: exon 1, intron 1, exon 2, intron 2 with the 5' site, the branchpoint site and the 3' site marked; consensus AG/GT at the 5' site and CAG/NT at the 3' site]

RNA Splicing

[Figure: splicing of a pre-mRNA. Labels: 5' splice site, 3' splice site, exon 1, intron, exon 2; U1 snRNP, U2 snRNP, U4/U6 snRNP, U5 snRNP.
Step 1: lariat formation and 5' splice site cleavage.
Step 2: 3' splice site cleavage and joining of the two exon sequences.
Products: the excised intron sequence in the form of a lariat (will be degraded in the nucleus) and the exon 1 + exon 2 portion of the mRNA.]

Exon/Intron Structure (Detail)

ATGCTGTTAG GTGG...GCAG ATCGATTGAC
 (Exon 1)  (Intron 1)   (Exon 2)

SPLICE ->

ATGCTGTTAGATCGATTGAC

Typical Signals in Eukaryotic Sequences

* Promoter elements
  - CAP, CCAAT, GC and TATA
* Kozak sequence (recognised by the ribosome = RBS)
* Splicing (donor, acceptor and lariat)
* Termination signal
* Polyadenylation signal

Pol II Promoter Elements

[Figure: GC box (-200 bp), CCAAT box (-100 bp) and TATA box (-30 bp) upstream of the transcription start site (TSS), followed by the gene (exon, intron, exon)]

Pol II Promoter Elements

* Cap region/signal
  - nCAGTnG
* TATA box (~25 bp upstream)
  - TATAAAnGCCC
* CCAAT box (~100 bp upstream)
  - TAGCCAATG
* GC box (~200 bp upstream)
  - ATAGGCGnGA

Pol II Promoter Elements

The TATA box is found in ~70% of promoters.

WebLogos

[Figure: sequence logos of Lambda cI and cro binding sites, Lambda O protein binding sites, CRP binding sites and TrpR binding sites]
http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.c

Kozak (RBS) Sequence

Position:  -7 -6 -5 -4 -3 -2 -1  0  1  2  3
Base:       A  G  C  C  A  C  C  A  T  G  G

Splice Signals

[Figure: exon 1, intron 1, exon 2 with the branchpoint site; AG/GT at the exon|intron boundary and CAG/NT at the intron|exon boundary; sequence logos of both boundaries]

Miscellaneous Signals

* Polyadenylation signal
  - AATAAA or ATTAAA
  - located 20 bp upstream of the poly-A cleavage site
* Termination signal
  - AGTGTTCA
  - located ~30 bp downstream of the poly-A cleavage site

Polyadenylation

[Figure: cleavage and polyadenylation of eukaryotic pre-mRNAs; the pre-mRNA carries the AAUAAA signal and a GU-rich region, and receives a poly-A tail after cleavage]
* CPSF - Cleavage & Polyadenylation Specificity Factor
* PAP - Poly-A Polymerase
* CstF - Cleavage Stimulation Factor

Genome Analysis - Combined Methods

* Neural networks
  - Grail, GeneParser
* Linear discriminant analysis
  - GeneFinder, GeneID, MZEF
* Linguistic
  - GeneLang
* Markov chains
  - Genie, GeneMark, GenScan, VEIL
* Similarity
  - Procrustes, MT
* Decision trees

Neural Networks

Training set: ACGAAG, AGGAAG, AGCAAG, ACGAAA, AGCAAC
Desired output: EEEENN

Neural Network Definitions

Sliding window over the sequence; each base is encoded as a vector:
A = [001]
C = [010]
G = [100]
Input vector: [010100001]

Output classes:
E = [01]
N = [00]
Output vector: [01]

Neural Network Training

Input vector (from ACGAAG): [0 1 0 1 0 0 0 0 1]

Weight matrix 1 (9 x 3):
.2 .4 .1
.1 .0 .4
.7 .1 .1
.0 .1 .1
.0 .0 .0
.2 .4 .1
.0 .3 .5
.1 .1 .0
.5 .3 .1

Hidden layer: [.6 .4 .6]

Weight matrix 2 (3 x 2):
.1 .8
.0 .2
.3 .3

Output vector: [.24 .74], compared with the desired output [0 1]

After Many Iterations....
Two "generalized" weight matrices:

Weight matrix 1 (9 x 3):
.13 .08 .12
.24 .01 .45
.76 .01 .31
.06 .32 .14
.03 .11 .23
.21 .21 .51
.10 .33 .85
.12 .34 .09
.51 .31 .33

Weight matrix 2 (3 x 2):
.03 .93
.01 .24
.12 .23

Neural Networks

[Figure: input layer, one hidden layer and output layer connected by Matrix1 and Matrix2]
New pattern: ACGAGG
Prediction: EEEENN

HMM for Gene Finding

[Figure: HMM state diagram; labels: start codon, 16 backedges]

Combined Methods

* Bring 2 or more methods together (usually site detection + composition)
* GRAIL (http://compbio.ornl.gov/Grail-1.3/)
* FGENEH (http://genomic.sanger.ac.uk/gf/gf.shtml)
* HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)
* GENSCAN (http://genes.mit.edu/GENSCAN.html)
* GeneParser (http://beagle.colorado.edu/~eesnyder/GeneParser.html)
* GRPL (GeneTool/BioTools)

How Well Do They Do?

                        Nucleotide accuracy      Exon accuracy
Program    No. of seq.  Sn    Sp    AC    CC     ESn   ESp   (ESn+ESp)/2  ME    WE    PCa   PCp   OL
FGENES     195 (5)      0.86  0.88  0.84  0.83   0.67  0.67  0.69         0.12  0.09  0.20  0.17  0.02
GeneMark   195 (0)      0.87  0.89  0.84  0.83   0.53  0.54  0.54         0.13  0.11  0.29  0.27  0.09
Genie      195 (15)     0.91  0.90  0.89  0.88   0.71  0.70  0.71         0.19  0.11  0.15  0.15  0.02
Genscan    195 (3)      0.95  0.90  0.91  0.91   0.70  0.70  0.71         0.08  0.09  0.21  0.19  0.02
HMMgene    195 (5)      0.93  0.93  0.91  0.91   0.76  0.77  0.76         0.12  0.07  0.14  0.14  0.02
Morgan     127 (0)      0.75  0.74  0.70  0.69   0.46  0.41  0.43         0.20  0.28  0.28  0.25  0.07
MZEF       119 (8)      0.70  0.73  0.68  0.66   0.58  0.59  0.59         0.32  0.23  0.08  0.16  0.01

"Evaluation of gene finding programs". S. Rogic, A. K. Mackworth and B. F. F. Ouellette. Genome Research, 11: 817-832 (2001).

GenomeScan - http://genes.mit.edu/genomescan.html

[Screenshot: the "Run GenomeScan" form. Fields: organism (Vertebrate), sequence name (optional), print options (predicted peptides only), upload of a DNA sequence file (one-letter code, upper or lower case, spaces/numbers ignored)]
[Screenshot, continued: the DNA sequence can also be pasted into the form (one-letter code, upper or lower case, spaces/numbers ignored)]

TwinScan - http://genes.cs.wustl.edu/

SLAM - http://baboon.math.berkeley.edu/~syntenic/slam.html

The SLAM server: submit pairs of syntenic sequences for gene annotation and alignment. The server is currently configured for human (first sequence) and mouse (second sequence), but will work on other sequences at similar evolutionary distances. Please make sure that both sequences are in the same orientation.

[Screenshot: submission form. Fields: email address (for obtaining results), the first sequence (in FASTA format), the second sequence (in FASTA format); button: Submit sequences]

GeneComber - http://www.bioinformatics.ubc.ca/genecomber/submit.php

GeneComber (UBiC, UBC Bioinformatics Centre): ab initio gene prediction server.

[Screenshot: GeneComber submission form. Fields: GenBank accession number, upload of a FASTA DNA sequence, upload of Genscan output, Genscan training set (Vertebrate), upload of HMMGene output, processing method(s) (EUI, GI, EUI_Frame), e-mail address (required); button: Submit]

Sequence Comparison

Different categories of similarity

[Figure labels: homologs, orthologs, paralogs; frog α, chick α]
[Figure labels, continued: mouse α, mouse β, chick β, frog β; α-chain gene; β-chain gene; gene duplication; early globin gene]

Scoring Similarity

[Figure: two aligned DNA sequences with the range of alignment, a mismatch and a gap marked]

\[ S = \sum(\text{identities, mismatches}) - \sum(\text{gap penalties}) \]
\[ \text{Score} = \max(S) \]

Sequence Alignment

Sequence pairs to align:

ACGTGA         ACGTGA
CGTG     ->     CGTG     -> 4

ACGTGA
TCGTA

ACGTGATGCAG
GGAGAGCACG

ACAGTTGACGAGATGGCAGGATGCGCGATGCAGCA
GACGAGCGTGAGTGCGATCGATGACAGTGTATAT

Sequence Alignment

ACGTGA
 ||||
 CGTG

ACGTGA
 ||| |
TCGT-A

ACGTGATGCA-G
  | || ||| |
 GGAGA-GCACG

Aligning Two Sequences

ATTGCAGTGATCG
ATTGCGTCGATCG

Solution 1:
ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG

Solution 2:
ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG

Which Alignment Is Better?

Scoring scheme: match +1, mismatch -1, indel -2

Solution 1: 10 matches + 3 mismatches, score = 10 - 3 = 7
Solution 2: 12 matches + 2 gaps, score = 12 - 4 = 8

Finding the Best Alignment for Long Sequences

For two sequences of length 300 bases there are 10^179 different alignments, so they cannot be enumerated one by one. The practical solution is dynamic programming.

Dynamic Programming

Needleman-Wunsch (1970), Smith-Waterman (1981)

* The first step is trivial and covers a partial solution.
* Each further solution is scored on the basis of the preceding solutions.
* The alignment is thus extended step by step with further trivial segments.
* Repeating the previous steps leads to the final solution.

Dynamic Programming Algorithm

Needleman-Wunsch algorithm (1970)

Seq 1: A G C (columns 1-3)
Seq 2: A A A C (rows A1, A2, A3, C4)

Dynamic Programming Algorithm

match = 1, mismatch = -1, indel = -2

Initialisation of the first row and first column:

          A    G    C
     0    1    2    3
 0   0   -2   -4   -6
A1  -2
A2  -4
A3  -6
C4  -8
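As a rough illustration of how the slide example is filled in, here is a minimal Python sketch of Needleman-Wunsch global alignment. It assumes the scoring shown on the slides (match = 1, mismatch = -1, indel = -2) and the slide layout (Seq 1 along the columns, Seq 2 along the rows). The function name `needleman_wunsch` and its return shape are illustrative choices, not part of the original material, and when several tracebacks tie the sketch simply prefers the diagonal move.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, indel=-2):
    """Global alignment sketch; seq1 runs along columns, seq2 along rows."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    F = [[0] * cols for _ in range(rows)]

    # Initialisation: leading gaps cost one indel per position.
    for j in range(1, cols):
        F[0][j] = j * indel
    for i in range(1, rows):
        F[i][0] = i * indel

    # Fill: each cell is the best of diagonal (match/mismatch), up and left (indel).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + indel,
                          F[i][j - 1] + indel)

    # Traceback from the bottom-right cell; ties are broken in favour of the diagonal.
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(seq1[j - 1])
            bottom.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + indel:
            top.append("-")
            bottom.append(seq2[i - 1])
            i -= 1
        else:
            top.append(seq1[j - 1])
            bottom.append("-")
            j -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    matrix, aligned1, aligned2 = needleman_wunsch("AGC", "AAAC")
    for row in matrix:
        print(row)
    print(aligned1)
    print(aligned2)
```

For AGC against AAAC the bottom-right cell of the matrix holds the optimal global score, -1. Other tie-breaking rules in the traceback would return different alignments with the same score.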
Dynamic Programming Algorithm

match = 1, mismatch = -1, indel = -2

Filling the first row:

          A    G    C
     0    1    2    3
 0   0   -2   -4   -6
A1  -2    1   -1   -3
A2  -4
A3  -6
C4  -8

Each cell F(i,j) is computed from its three neighbours F(i-1,j-1), F(i-1,j) and F(i,j-1), where s(x_i, y_j) is the match/mismatch score and d is the indel penalty (here d = 2):

\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases} \]

Finding the Best Score

          A    G    C
     0    1    2    3
 0   0   -2   -4   -6
A1  -2    1   -1   -3
A2  -4   -1    0   -2
A3  -6   -3   -2   -1
C4  -8   -5   -4   -1

The best score is the value in the bottom-right cell, -1.

Tracing the Best Alignment

Tracing back from the bottom-right cell gives three equally good alignments:

A G - C
A A A C

A - G C
A A A C

- A G C
A A A C

Local Alignment Example

Smith-Waterman algorithm, 1981

Sequences: TACTAA (columns 1-6) and TAATA (rows T1, A2, A3, T4, A5)

Local Alignment

\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \\ 0 \end{cases} \]

          T    A    C    T    A    A
     0    1    2    3    4    5    6
 0   0    0    0    0    0    0    0
T1   0    1    0    0    1    0    0
A2   0    0    2    0    0    2    1
A3   0    0    1    1    0    1    3
T4   0    1    0    0    2    0    1
A5   0    0    2    0    0    3    1

The highest score in the matrix, 3, marks the end of the best local alignment; the traceback stops at the first cell with value 0.

Examples: Genomic DNA versus mRNA

[Figure: exons EX1, EX2 and EX3 of the mRNA aligned against genomic DNA; the alignment contains gaps between the exons]

Gap Penalties

Alignment i:
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-A-GA-

Alignment ii:
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA

Scoring Gaps

Scoring parameters: match: +1; gap_open: -2
Alignment i: -6
Alignment ii: 12

Scoring Insertions/Deletions

Scoring parameters: match: +1; indel: -2
Alignment i: -6
Alignment ii: -6

Considering Gap Opening and Gap Extension

Scoring parameters: match: +1; gap_open: -2; gap_exten: -1
Alignment i: -17
Alignment ii: 1
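The three scoring schemes above can be reproduced with a short Python sketch. It assumes that a gap run of length L costs gap_open + L * gap_exten, which is the reading that matches the slide scores, and it writes alignment ii with an eight-position gap so that both rows have the same length. The function name `score_alignment` and the `mismatch` default are illustrative assumptions; neither alignment contains a mismatch.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score a finished pairwise alignment given as two equal-length gapped rows.

    Assumption: a gap run of length L contributes gap_open + L * gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # row ("top"/"bottom") holding the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != gap_row:
                score += gap_open      # a new gap run starts here
            score += gap_extend        # every gap position pays the extension cost
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


ALIGNMENT_I = ("AAC-AATTAAG-ACTAC-GTTCATGAC",
               "A-CGA-TTA-GCAC-ACTG-T-A-GA-")
ALIGNMENT_II = ("AACAATTAAGACTACGTTCATGAC---",
                "AACAATT--------GTTCATGACGCA")

SCHEMES = {
    "gap_open only": dict(gap_open=-2, gap_extend=0),
    "indel per position": dict(gap_open=0, gap_extend=-2),
    "gap_open + gap_exten": dict(gap_open=-2, gap_extend=-1),
}

if __name__ == "__main__":
    for name, params in SCHEMES.items():
        print(name,
              score_alignment(*ALIGNMENT_I, **params),
              score_alignment(*ALIGNMENT_II, **params))
```

Both alignments contain 16 matches and 11 gap positions; they differ only in how the gap positions are grouped (11 single gaps in alignment i, two long gaps in alignment ii). That is why a per-position indel penalty cannot tell them apart, while separate opening and extension penalties favour alignment ii.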