DNA sequence analysis

Bioinformatics - lectures
- Introduction
- Information networks
- Protein information resources
- Genome information resources
- DNA sequence analysis
- Pairwise sequence alignment
- Multiple sequence alignment
- Secondary database searching
- Analysis packages
- Protein structure modelling

DNA sequence analysis
- why analyse DNA?
- gene structure
- gene sequence analysis
- expression profile, cDNA, EST
- EST sequence analysis

Why analyse DNA?
- The most sensitive comparisons between sequences are made at the protein level, because of the redundancy of the genetic code.
- The loss of degeneracy is accompanied by a loss of information directly linked to evolution: proteins are only functional abstractions of genetic events at the DNA level.
- Silent mutations, important for phylogenetic analysis, cannot be detected at the protein level.
- Exon/intron analysis and open reading frame (ORF) analysis cannot be performed at the protein level.

The genetic code (rows: first base; columns: second base)
   T                     C                     A                     G
T  TTT TTC Phe           TCT TCC TCA TCG Ser   TAT TAC Tyr           TGT TGC Cys
   TTA TTG Leu                                 TAA TAG Stop          TGA Stop
                                                                     TGG Trp
C  CTT CTC CTA CTG Leu   CCT CCC CCA CCG Pro   CAT CAC His           CGT CGC CGA CGG Arg
                                               CAA CAG Gln
A  ATT ATC ATA Ile       ACT ACC ACA ACG Thr   AAT AAC Asn           AGT AGC Ser
   ATG Met                                     AAA AAG Lys           AGA AGG Arg
G  GTT GTC GTA GTG Val   GCT GCC GCA GCG Ala   GAT GAC Asp           GGT GGC GGA GGG Gly
                                               GAA GAG Glu

Gene structure
- Eukaryotic genes are more complex than prokaryotic genes because of the presence of introns.
- DNA databases typically contain genomic data: untranslated sequences, introns and exons, mRNA, cDNA.
- Gene products (proteins) can differ in length, because not all exons need be present in the final mRNA.
- Proteins of different length originating from a single sequence are called splice variants.
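The point about silent mutations can be illustrated with a small sketch. It builds the standard genetic code from the table above and translates two DNA sequences that differ only in third codon positions. The function names `translate` and `differences` and the two example sequences are made up for illustration; they do not come from the lecture.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for the standard genetic code, ordered by
# first, second, third base over T, C, A, G ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Conceptually translate a DNA string, reading frame 1, codon by codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # "X" for codons with unknown bases
        for i in range(0, len(dna) - 2, 3)
    )


def differences(a, b):
    """Return the 0-based positions at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical example: the second sequence carries three third-base changes.
original = "ATGCTTGGTAAA"
mutated = "ATGCTCGGCAAG"

dna_changes = differences(original, mutated)
protein_changes = differences(translate(original), translate(mutated))
print(dna_changes)      # [5, 8, 11]
print(protein_changes)  # []
```

Both sequences translate to the same peptide (Met-Leu-Gly-Lys), so the three substitutions are visible only at the DNA level.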
[Figure: gene structure. Sense strand of genomic DNA (5' to 3') with 5' UTR, exons, introns and 3' UTR; transcription gives mRNA (5' UTR, CDS, 3' UTR); translation gives protein.]

Untranslated regions (UTRs)
- portions of the sequence flanking the coding sequence (CDS) that are not translated into protein
- UTRs (especially at the 3' end) are highly gene- and species-specific
Exons
- protein-coding DNA sequences of a gene
Introns
- DNA sequences interrupting the protein-coding DNA sequence of a gene
- transcribed into RNA but edited out during post-transcriptional modification

Gene sequence analysis
Conceptual translation: theoretical translation of the DNA sequence into a protein sequence using the genetic code, without biochemical support. Six-frame translation results in six potential protein sequences (ORF analysis).
ORF analysis
- codon for methionine: the initial codon in the CDS
- sufficient CDS length: long ORFs rarely occur by chance
- pattern of codon usage: species-specific
- bias towards G/C in the third base of a codon: species-specific

Expression profile, cDNA, EST
Hierarchy of genomic information
- the human genome consists of ~3 billion bp
- ~3% of the DNA is coding sequence -> mRNA -> protein
- the rest of the genome is needed for the compact structure of chromosomes, replication, control of transcription, etc.
1. chromosomal genome (genome) - genetic information common to every cell in the organism
2. expressed genome (transcriptome) - part of the genome expressed in a cell at a specific stage in its development
3. proteome - protein molecules that interact to give the cell its individual character

Expression profile
- characteristic range of genes expressed at a particular stage of development and functioning
- the goal of genome projects is to sequence the entire (chromosomal) genome
- having complete sequences and knowing what they mean are two distinct stages of understanding a genome
- an alternative approach is to analyse the parts of the genome expressed in a cell at a specific stage in its development
- comparison of expression profiles: identification of abnormal expression, expression levels
  - interesting for industry: gene discovery, drug design

Complementary DNA (cDNA)
- DNA synthesised from a messenger RNA template using the enzyme reverse transcriptase
- cDNA captures the expression profile
- preparation: cultivation/isolation of cells, mRNA extraction, reverse transcription of mRNA to cDNA, transformation of cDNA into a library, sequencing of randomly chosen clones (100,000 out of 2 million)
  - ideally 100,000 sequences of 200-400 bp length: expressed sequence tags (ESTs)
  - in reality there are many failures, so the number of sequences is lower
- the number of clones constructed and sequenced must be large enough to represent the expression profile

[Figure: origin of complementary DNA and expressed sequence tags: genomic DNA (sense strand), mRNA, protein, cDNA, EST]
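The six-frame translation and ORF analysis described under "Gene sequence analysis" can be sketched as follows. This is a minimal illustration, not a gene-finding method: it uses only two of the listed criteria (an initial methionine codon and ORF length) and ignores codon usage and G/C bias. The function names, the frame labels "+1" to "-3" and the example sequence are assumptions made for this sketch.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, ordered by first, second, third base over T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; incomplete tail is ignored."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels ("+1".."+3", "-1".."-3") to conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_orf(dna, min_codons=1):
    """Return (frame, peptide) for the longest Met-to-stop stretch, or None."""
    best = None
    for frame, protein in six_frame_translation(dna).items():
        # Every piece except the last is terminated by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start == -1:
                continue
            peptide = segment[start:]
            if len(peptide) >= min_codons and (
                best is None or len(peptide) > len(best[1])
            ):
                best = (frame, peptide)
    return best


# Hypothetical example sequence.
frames = six_frame_translation("CCATGGCTAAGTGACC")
orf = longest_orf("CCATGGCTAAGTGACC")
print(orf)  # ('+3', 'MAK')
```

In practice `min_codons` would be set much higher, since the length criterion is what separates plausible coding regions from short ORFs that occur by chance.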
Libraries of ESTs
- Merck/IMAGE: 300,000 ESTs from a variety of normalised libraries; higher chance of capturing different genes; expression levels not known; sequences deposited in dbEST
- Incyte: quantitative information on expression levels from standardised libraries; expression profiles in healthy and diseased tissues; sequences form the commercial database LifeSeq
- TIGR: the TIGR Human Gene Index integrates results from human gene projects (dbEST + GenBank); its purpose is to identify all possible human genes by sequence assembly; creates Tentative Human Consensus (THC) sequences and contigs

[Figure: (a) TBLASTN search with P20233 as query, showing hits to THC sequences (THC168921, THC169302, THC153132, THC178704 and others) along the query; (b) text alignment of P20233 with the matching THC sequences]
EST sequence analysis
- EST production is highly automated (fluorescent laser systems and computer analysis of chromatograms), which influences the quality of the sequences
- the specific character of ESTs must be respected during their analysis:
  - EST alphabet
  - insertions, deletions, frameshifts
  - splice variants in ESTs
  - non-coding regions

EST alphabet
- automated computer analysis of chromatograms
- the program is sometimes unable to decide the base at a particular position and inserts the ambiguous base N
- Ns should make up <5% of the total length

Insertions, deletions and frameshifts
- automated base-calling software assumes regular intervals between peaks, which is not always the case
- phantom INDELs (insertions and deletions)
- identification of INDELs by sequence comparisons

List of base-ambiguity symbols defined by IUB-IUPAC
IUB symbol   Represented bases
A            A
C            C
G            G
T/U          T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
X/N          G or A or T or C

Splice variants
- splice variants are represented by deletions arising from non-inclusion of exons
- bases may also be missing from an EST because of sequencing errors
- a partially good match: splice form or sequence error?

Non-coding regions
- question: does this EST represent a new gene?
  - search a DNA database for similar non-coding regions
  - no hit found: either the EST represents a new gene (CDS), or it represents a non-coding sequence not present in the database

[Figure: sequencing chromatogram]
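The EST alphabet rules above lend themselves to a short sketch: measuring the fraction of N in an EST against the 5% guideline, and using the IUB-IUPAC table to decide whether two symbols can stand for the same base. The names `n_fraction`, `passes_n_check` and `bases_compatible`, and the example reads, are hypothetical; real base-calling and quality-control software works on chromatogram data rather than on plain strings.

```python
# IUB-IUPAC symbols mapped to the bases they can represent (U treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT", "X": "ACGT",
}


def n_fraction(est):
    """Fraction of positions in the EST called as the ambiguous base N."""
    est = est.upper()
    if not est:
        return 0.0
    return est.count("N") / len(est)


def passes_n_check(est, max_fraction=0.05):
    """True if Ns make up less than max_fraction of the EST length."""
    return n_fraction(est) < max_fraction


def bases_compatible(a, b):
    """True if two IUPAC symbols share at least one possible base."""
    return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))


# Hypothetical reads: one N in 10 bases (10%) and one N in 21 bases (~4.8%).
poor_read = "ACGTNACGTA"
good_read = "ACGT" * 5 + "N"

print(n_fraction(poor_read))           # 0.1
print(passes_n_check(poor_read))       # False
print(passes_n_check(good_read))       # True
print(bases_compatible("R", "A"))      # True
print(bases_compatible("R", "C"))      # False
```

A compatibility test of this kind is one simple way to keep ambiguous positions from being counted as mismatches when ESTs are compared with each other or with database sequences.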
Three categories of EST analysis tools
- sequence similarity search tools
- sequence assembly tools
- sequence clustering tools

Sequence similarity search tools
- current database search programs are designed to cope with ESTs: TBLASTN (translates DNA databases), BLASTX (translates the input sequence), TBLASTX (translates both)

Sequence assembly tools
- a search of the databases reveals several ESTs matching the query sequence
- the hits are aligned and a consensus is constructed
- search with the consensus, augment, ...
  - iterative sequence alignment = sequence assembly

Sequence clustering tools
- clustering of EST sequences reduces redundancy and saves search time
- enables estimation of the number of genes in the EST database
- approach 1: clustering based on sequences from a comprehensive DNA database
- approach 2: clustering of all ESTs, construction of a consensus sequence representing each cluster, DNA database search using the consensus sequences only
  - result = ESTs that do not match any of the database sequences

[Figure: clustering of an EST library (labels A-D), showing plus-sense and minus-sense ESTs]
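The clustering step of approach 2 can be illustrated with a toy sketch. It groups ESTs that share at least one exact k-mer, comparing each k-mer with its reverse complement so that plus-sense and minus-sense ESTs from the same transcript fall into one cluster. This is an assumption-laden simplification, not the algorithm of any particular clustering tool: real tools tolerate sequencing errors and use alignment scores. The function names, the value of `k` and the example sequences are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Group EST indices that share an exact k-mer on either strand."""
    parent = list(range(len(ests)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # strand-independent k-mer -> index of first EST with it
    for idx, est in enumerate(ests):
        for kmer in kmers(est.upper(), k):
            # Use the smaller of the k-mer and its reverse complement as key.
            key = min(kmer, reverse_complement(kmer))
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Hypothetical transcript and ESTs derived from it.
transcript = "ATGGCTAAGTCCGATTACGGATCCAAGCTTGGA"
ests = [
    transcript[0:20],                       # plus-sense EST
    transcript[10:33],                      # overlapping plus-sense EST
    reverse_complement(transcript[5:25]),   # minus-sense EST
    "TTTTTTTTCCCCCCCCGGGGGGGG",             # unrelated sequence
]
clusters = cluster_ests(ests)
print(clusters)  # [[0, 1, 2], [3]]
```

Each resulting cluster would then be reduced to a consensus sequence, and only the consensus sequences would be searched against the DNA database.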