Before analysis

>P12345 Yeast chromosome1
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT

After partial analysis

>P12345 Gene1 - gene encoding the alcohol dehydrogenase protein
Annotated features: TATA, start, exon1, intron1, exon2, stop
TATAAA CGATTGACGATGACGAT ATG TACAGATTACAGATTACAGATTACAGATGT
CAGATTACAGATTACAGATTACAGATTACAGATTCA
AGATTACAGATTACAGATTACAGA TAA
>P12346 Protein1
MASAQSFYLLDHNQNQNFDDHLAVDIVMILSHERFMN

DNA sequence analysis
* =~ annotation of the genome (sequence)
* identification of signals and genes
* annotation of genes (their coding sequences)

Gene annotation =~ protein annotation
Identification and description of the physicochemical, functional and structural properties of a given gene/protein
* DNA and amino acid (AA) sequence, position in the genome, length, composition
* common names, literature references
* family membership, evolution
* interaction partners, activity, regulatory mechanisms
* structure, active sites, role in cell metabolism

Eukaryotic Gene Structure
[Figure: the transcribed region (5' UTR, start codon, exon 1, intron 1, exon 2, intron 2, exon 3, stop codon, 3' UTR) flanked by the upstream and downstream intergenic regions]

DNA sequence analysis
Statistics
* frequencies of n-grams and other elements, repeats, codons
Signal elements
* TATA (promoter), ATG (start), STOP, GT (donor), AG (acceptor), etc.
Coding part
* similarity of the encoded sequence to other proteins
Combined approaches

Gene identification
* 95-100% reliability in prokaryotes; in more complex eukaryotes 90% at the base level and 70% at the exon/intron level
* existence of introns
* larger genomes
* low gene density (<30%; 3% in Homo sapiens)
* alternative splicing (in roughly half of the genes)
* large amount of repetitive sequences
* occasional overlap of genes

Gene identification
[Figure: start and stop positions marked along a sequence]
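The ATG (start) and STOP signals listed above can be combined into a very naive gene-finding step: scanning each forward reading frame for a start codon followed by an in-frame stop codon. The sketch below is only an illustration under stated assumptions. The function name `find_orfs`, the `min_codons` parameter and the toy sequence are hypothetical, only the forward strand is scanned, and introns are ignored, so it does not model the eukaryotic case discussed in the slides.

```python
# Minimal sketch: scan the three forward reading frames for ATG ... stop.
# find_orfs and min_codons are illustrative names, not part of any tool named in the slides.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples; end is exclusive and includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG of the currently open reading frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # number of codons before the stop codon, counting the ATG
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    # Toy sequence: ATG AAA TAG sits in frame 2.
    print(find_orfs("CCATGAAATAGGG"))
```

A real gene finder combines such signals with composition statistics and similarity searches, as the combined approaches in the slides indicate.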
Eukaryotic Gene Structure
[Figure: exon 1, intron 1, exon 2, intron 2 with the 5' site (AG/GT), the branchpoint site and the 3' site (CAG/NT)]

RNA Splicing
[Figure: RNA splicing. Labels: 5' splice site, 3' splice site, U1 snRNP, U2 snRNP, U4/U6 snRNP, U5 snRNP. Steps: (1) lariat formation and 5' splice site cleavage; (2) 3' splice site cleavage and joining of two exon sequences. Products: the excised intron sequence in the form of a lariat (will be degraded in the nucleus) and exon 1 + exon 2 as a portion of the mRNA]

Exon/Intron Structure (Detail)
ATGCTGTTAGGTGG...GCAGATCGATTGAC
Exon 1: ATGCTGTTAG   Intron 1: GTGG...GCAG   Exon 2: ATCGATTGAC
SPLICE
ATGCTGTTAGATCGATTGAC

Typical signals in eukaryotic sequences
* Promoter elements: CAP, CCAAT, GC and TATA
* Kozak sequence (recognized by the ribosome = RBS)
* Splicing (donor, acceptor and lariat)
* Termination signal
* Polyadenylation signal

Pol II Promoter Elements
[Figure: GC box (~200 bp), CCAAT box (~100 bp) and TATA box (~30 bp) upstream of the transcription start site (TSS) of a gene with exon, intron, exon]
• Cap Region/Signal
  - nCAGTnG
• TATA box (~25 bp upstream)
  - TATAAAnGCCC
• CCAAT box (~100 bp upstream)
  - TAGCCAATG
• GC box (~200 bp upstream)
  - ATAGGCGnGA
TATA box is found in ~70 % of promoters

WebLogos
[Figure: sequence logos of the lambda cI and cro binding sites and of the lambda O protein binding sites]
http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi

Kozak (RBS) Sequence
Position: -7 -6 -5 -4 -3 -2 -1  0  1  2  3
Base:      A  G  C  C  A  C  C  A  T  G  G

Splice Signals
[Figure: exon 1 / intron 1 boundary (AG/GT), branchpoint site, intron 1 / exon 2 boundary (CAG/NT); sequence logos of the exon-intron and intron-exon junctions]

Miscellaneous Signals
• Polyadenylation signal
  - AATAAA or ATTAAA
  - Located 20 bp upstream of poly-A cleavage site
• Termination Signal
  - AGTGTTCA
  - Located ~30 bp downstream of poly-A cleavage site

Polyadenylation
Cleavage and Polyadenylation of Eukaryotic pre-mRNAs
[Figure: pre-mRNA with the AAUAAA signal and a GU-rich region; cleavage, followed by polyadenylation (addition of the poly-A tail)]
CPSF - Cleavage & Polyadenylation Specificity Factor
PAP - Poly-A Polymerase
CstF - Cleavage Stimulation Factor

Genome analysis - combined methods
Neural networks
* Grail, GeneParser
Linear discriminant analysis
* GeneFinder, GeneID, MZEF
Linguistic
* GeneLang
Markov chains
* Genie, GeneMark, GenScan, VEIL
Similarity
* Procrustes, AAT
Decision trees

Neural Network
Training set: ACGAAG, AGGAAG, AGCAAG, ACGAAA, AGCAAC
Desired output: EEEENN

Definitions
Sliding window over the sequence
A = [001], C = [010], G = [100]
E = [01], N = [00]
Input vector: [010100001] (the window CGA)
Output vector: [01]

Neural Network Training
Input vector (from ACGAAG): [010100001]
Weight matrix 1:
.2 .4 .1
.1 .0 .4
.7 .1 .1
.0 .1 .1
.0 .0 .0
.2 .4 .1
.0 .3 .5
.1 .1 .0
.5 .3 .1
Hidden layer: [.6 .4 .6]
Weight matrix 2:
.1 .8
.0 .2
.3 .3
Output vector: [.24 .74], compared with the desired output [01]

After Many Iterations....
Two "Generalized" Weight Matrices
Weight matrix 1:
.13 .08 .12
.24 .01 .45
.76 .01 .31
.06 .32 .14
.03 .11 .23
.21 .21 .51
.10 .33 .85
.12 .34 .09
.51 .31 .33
Weight matrix 2:
.03 .93
.01 .24
.12 .23

Neural Networks
[Figure: a new pattern (ACGAGG) passes through the input layer, matrix 1, the hidden layer, matrix 2 and the output layer to give the prediction (EEEENN)]

HMM for Gene Finding
[Figure: HMM state diagram; labels: start codon, 16 backedges]

Combined Methods
• Bring 2 or more methods together (usually site detection + composition)
• GRAIL (http://compbio.ornl.gov/Grail-1.3/)
• FGENEH (http://genomic.sanger.ac.uk/gf/gf.shtml)
• HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)
• GENSCAN (http://genes.mit.edu/GENSCAN.html)
• GeneParser (http://beagle.colorado.edu/~eesnyder/GeneParser.html)
• GRPL (GeneTool/BioTools)

How Well Do They Do?
[Figure: bar chart, scale 0.0 to 100.0, comparing the methods Xpound, GeneID, GRAIL 2, Genie, GeneP3, RPL and RPL+ on the Burset & Guigó test set (1996)]
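The neural network training slide above can be checked by hand: encoding the window CGA with the codes from the definitions slide and multiplying through the two weight matrices reproduces the hidden layer [.6 .4 .6] and the output [.24 .74]. The sketch below does that arithmetic. It is an illustration under assumptions: the helper names are hypothetical, no activation function is applied (a choice made only because it reproduces the slide's numbers), and T is left undefined because the slide does not give a code for it.

```python
# Minimal sketch of the forward pass shown on the training slide.
# Assumption: purely linear layers (no activation); this reproduces the slide's numbers.
BASE_CODE = {"A": [0, 0, 1], "C": [0, 1, 0], "G": [1, 0, 0]}  # T is not defined on the slide

# Initial weight matrices as transcribed from the slide (rows = inputs, columns = outputs).
WEIGHTS_1 = [
    [.2, .4, .1], [.1, .0, .4], [.7, .1, .1],
    [.0, .1, .1], [.0, .0, .0], [.2, .4, .1],
    [.0, .3, .5], [.1, .1, .0], [.5, .3, .1],
]
WEIGHTS_2 = [[.1, .8], [.0, .2], [.3, .3]]


def encode_window(window):
    """Concatenate the 3-bit codes of the bases in a sliding window."""
    vector = []
    for base in window:
        vector.extend(BASE_CODE[base])  # raises KeyError for bases without a code
    return vector


def vec_mat(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_out = len(matrix[0])
    return [sum(v * row[k] for v, row in zip(vector, matrix)) for k in range(n_out)]


def forward(window):
    """Return (hidden layer, output vector) for one window."""
    hidden = vec_mat(encode_window(window), WEIGHTS_1)
    return hidden, vec_mat(hidden, WEIGHTS_2)


if __name__ == "__main__":
    hidden, output = forward("CGA")
    print([round(h, 2) for h in hidden], [round(o, 2) for o in output])
```

Training would then adjust both matrices so that the output moves toward the desired [01]; that update step is not shown here.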
Programs  Sequences  | Nucleotide accuracy    | Exon accuracy
                     | Sn    Sp    AC    CC   | ESn   ESp   (ESn+ESp)/2  ME    WE    PCa   PCp   OL
FGENES    195 (5)    | 0.86  0.88  0.84  0.83 | 0.67  0.67  0.69         0.12  0.09  0.20  0.17  0.02
GeneMark  195 (0)    | 0.87  0.89  0.84  0.83 | 0.53  0.54  0.54         0.13  0.11  0.29  0.27  0.09
Genie     195 (15)   | 0.91  0.90  0.89  0.88 | 0.71  0.70  0.71         0.19  0.11  0.15  0.15  0.02
Genscan   195 (3)    | 0.95  0.90  0.91  0.91 | 0.70  0.70  0.71         0.08  0.09  0.21  0.19  0.02
HMMgene   195 (5)    | 0.93  0.93  0.91  0.91 | 0.76  0.77  0.76         0.12  0.07  0.14  0.14  0.02
Morgan    127 (0)    | 0.75  0.74  0.70  0.69 | 0.46  0.41  0.43         0.20  0.28  0.28  0.25  0.07
MZEF      119 (8)    | 0.70  0.73  0.68  0.66 | 0.58  0.59  0.59         0.32  0.23  0.08  0.16  0.01
"Evaluation of gene finding programs" S. Rogic, A. K. Mackworth and B. F. F. Ouellette. Genome Research, 11: 817-832 (2001).

GenomeScan - http://genes.mit.edu/genomescan.html
[Screenshot: GenomeScan web form with fields for organism (Vertebrate), sequence name (optional), print options (predicted peptides only), DNA sequence upload or paste (one-letter code, upper or lower case, spaces/numbers ignored) and a Run GenomeScan button]

TwinScan - http://genes.cs.wustl.edu/
[Screenshot: TWINSCAN web form (Brent Lab, Washington University, St. Louis, MO) with organism selection, sequence upload and a Run TWINSCAN button]

Sequence alignment
ACGTGA and CGTG
ACGTGA and TCGTA
ACGTGATGCAG and GGAGAGCACG
ACAGTTGACGAGATGGCAGGATGCGCGATGCAGCA and GACGAGCGTGAGTGCGATCGATGACAGTGTATAT

Sequence alignment
ACGTGA
 CGTG          (4 matches)

ACGTGA
TCGT-A         (4 matches)

ACGTGATGCA-G
GGAGA-GCACG

Aligning Two Sequences
ATTGCAGTGATCG
ATTGCGTCGATCG

Solution 1:
ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG

Solution 2:
ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG

Which alignment is better?
ATTGCAGTGATCG
ATTGCGTCGATCG

Solution 1:
ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG
10 matches + 3 mismatches

Solution 2:
ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG
12 matches + 2 gaps

Scoring Scheme
Match +1
Mismatch -1
Indel -2

Which alignment is better?
Solution 1: Score = 7
Solution 2: Score = 8

Finding the best alignment for long sequences is tedious
For two sequences of length 300 bases there are 10^179 different alignments
Dynamic programming

Dynamic programming
* Needleman-Wunsch (1970)
* Smith-Waterman (1981)
* The first step is trivial and covers a partial solution
* Each further solution is scored on the basis of the preceding results
* The alignment is thus extended step by step by further trivial segments
* Repeating the previous steps leads to the final solution

Dynamic Programming Algorithm
Needleman-Wunsch algorithm (1970)
Seq 1) A G C
Seq 2) A A A C
match=1 mismatch=-1 indel=-2

Initialization of the first row and column with multiples of the indel penalty:
          A    G    C
     0    1    2    3
0    0   -2   -4   -6
A1  -2
A2  -4
A3  -6
C4  -8

Global pairwise alignment
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

Finding the Best Score
          A    G    C
     0    1    2    3
0    0   -2   -4   -6
A1  -2    1   -1   -3
A2  -4   -1    0   -2
A3  -6   -3   -2   -1
C4  -8   -5   -4   -1
The best score is the value in the bottom-right cell, -1.

Tracing the Best Alignment
Tracing back from the bottom-right cell gives three alignments with the same score:
A G - C
A A A C

A - G C
A A A C

- A G C
A A A C
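The global alignment recurrence and the traceback above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `needleman_wunsch` is an assumption, the penalty is passed as a negative `indel` score that is added (equivalent to subtracting the slide's d = 2), and when several optimal alignments exist the traceback simply prefers the diagonal move, so it reports only one of them.

```python
# Minimal sketch of Needleman-Wunsch global alignment with a linear indel penalty.
def needleman_wunsch(x, y, match=1, mismatch=-1, indel=-2):
    """Return (best score, aligned x, aligned y)."""
    n, m = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * indel
    for j in range(1, m + 1):
        F[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] + indel,                      # x_i against a gap
                F[i][j - 1] + indel,                      # y_j against a gap
            )

    # Traceback from the bottom-right cell; diagonal moves are preferred on ties.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + indel:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(needleman_wunsch("AGC", "AAAC"))
    print(needleman_wunsch("ATTGCAGTGATCG", "ATTGCGTCGATCG")[0])
```

For AGC and AAAC this yields the score -1 from the slides, and for the two 13-base sequences it yields 8, the score of Solution 2.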
Smith-Waterman algorithm (1981)

Local Alignment
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \\ 0 \end{cases}$$

Local Alignment Example
Columns: T A C T A A; rows: T A A T A (match=1, mismatch=-1, d=2)
          T    A    C    T    A    A
     0    1    2    3    4    5    6
0    0    0    0    0    0    0    0
T1   0    1    0    0    1    0    0
A2   0    0    2    0    0    2    1
A3   0    0    1    1    0    1    3
T4   0    1    0    0    2    0    1
A5   0    0    2    0    0    3    1
The traceback starts from a cell with the highest score (3) and stops at the first 0.

Examples: Genomic DNA versus mRNA
[Figure: genomic DNA with exons EX1 EX2 EX3 EX4 aligned to the mRNA; in the alignment the exons are separated by gaps]

Gap Penalties
i
AAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-A-GA-
ii
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA

Scoring Gaps
Scoring parameters: match: +1; Gap_open: -2
i: -6
ii: 12

Scoring Insertions/Deletions
Scoring parameters: match: +1; indel: -2
i: -6
ii: -6

Considering Gap Opening and Gap Extension
Scoring parameters: match: +1; Gap_open: -2; Gap_exten: -1
i: -17
ii: 1
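The three scoring schemes above differ only in how a run of gap characters is charged. The sketch below scores an already gapped alignment so that a gap of length L costs gap_open + L * gap_extend. That convention is an assumption, chosen because it reproduces the scores -17 and 1 on the last slide. The function name `score_alignment` and the `mismatch` default of -1 are also assumptions (the gap slides list only the match score).

```python
# Minimal sketch: score a gapped pairwise alignment under different gap schemes.
# Assumed convention: a gap of length L costs gap_open + L * gap_extend.
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_row = None  # which row ("top" or "bottom") held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gap characters")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != gap_row:
                score += gap_open  # a new gap starts in this column
            score += gap_extend    # every gap character pays the extension cost
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


ALIGNMENT_I = ("AAC-AATTAAG-ACTAC-GTTCATGAC", "A-CGA-TTA-GCAC-ACTG-T-A-GA-")
ALIGNMENT_II = ("AACAATTAAGACTACGTTCATGAC---", "AACAATT--------GTTCATGACGCA")

if __name__ == "__main__":
    schemes = {
        "gap open only": dict(gap_open=-2, gap_extend=0),
        "per-character indel": dict(gap_open=0, gap_extend=-2),
        "open + extension": dict(gap_open=-2, gap_extend=-1),
    }
    for name, params in schemes.items():
        print(name, score_alignment(*ALIGNMENT_I, **params),
              score_alignment(*ALIGNMENT_II, **params))
```

Charging mainly for opening a gap favors alignment ii, with its two long gaps, over alignment i, with its eleven single-character gaps. This is the behavior wanted when aligning genomic DNA against mRNA.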