CG920 Genomics
Lesson 1: Introduction into Bioinformatics

Jan Hejátko
Functional Genomics and Proteomics of Plants, Mendel Centre for Plant Genomics and Proteomics, Central European Institute of Technology (CEITEC), Masaryk University, Brno
hejatko@sci.muni.cz, www.ceitec.muni.cz

Outline
• Syllabus of the course
• Definition of genomics
• Role of bioinformatics in functional genomics
• Databases
• Spectrum of on-line resources
• PRIMARY, SECONDARY and STRUCTURAL databases
• GENOME resources
• Analytical tools
• Homology searching
• Searching for sequence motifs, open reading frames, restriction sites…
• Other on-line genome tools

Course Syllabus
• Chapter 01: Introduction into Bioinformatics
• Chapter 02: Identification of Genes
• Chapter 03: Reverse Genetics Approaches
• Chapter 04: Forward Genetics Approaches
• Chapter 05: Functional Genomics Approaches
• Chapter 06: Protein-Protein Interactions and Their Analysis
• Chapter 07: Current Methods of DNA Sequencing
• Chapter 08: Structure of Genomes
• Chapter 09: Genome Evolution
• Chapter 10: Genomics and Systems Biology
• Chapter 11: Practical Aspects of Functional Genomics
  • Model Organisms
  • PCR and Primer Design

Literature
Literature resources for Chapter 01:
• Bioinformatics and Functional Genomics, 3rd Edition, Jonathan Pevsner, Wiley-Blackwell, 2015, http://www.bioinfbook.org/php/?q=book3
• Úvod do praktické bioinformatiky, Fatima Cvrčková, 2006, Academia, Praha
• Plant Functional Genomics, ed. Erich Grotewold, 2003, Humana Press, Totowa, New Jersey

Outline
• Syllabus of the course
• Definition of genomics

GENOMICS – What is it?
• Sensu lato (in the broad sense): it deals with the STRUCTURE and FUNCTION of genomes
• Sensu stricto (in the narrow sense): it deals with the FUNCTION of INDIVIDUAL GENES, i.e. FUNCTIONAL GENOMICS
• It mainly uses reverse genetics approaches
• Necessary prerequisite: knowledge of the genome (sequence), which means working with databases
GENOMICS – What is it?
Figure: forward (“classical”) genetics approaches (3 : 1) and reverse genetics approaches (?). Insertional mutagenesis: 5'TTATATATATATATTAAAAAATAAAATAAAA GAACAAAAAAGAAAATAAAATA….3'

The role of BIOINFORMATICS in FUNCTIONAL GENOMICS

Outline
• Syllabus of this course
• Definition of genomics
• Role of BIOINFORMATICS in FUNCTIONAL GENOMICS

Bioinformatics
• Definition of bioinformatics (according to the NIH Biomedical Information Science and Technology Initiative Consortium): research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

What is bioinformatics?
• Interface between biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and databases
• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
Source: J. Pevsner, http://www.bioinfbook.org/index.php

Bioinformatics
• Bioinformatics in functional genomics
• Processing and analysis of sequencing data
• Identification of reference sequences
• Identification of genes
• Identification of homologues, orthologues and paralogues
• Correlative analysis of genomes and phenotypes (incl.
human)
• Processing and analysis of transcriptional data
• Transcriptional profiling using DNA chips or next-gen sequencing
• Evaluation of experimental data and prediction of new regulations in systems biology approaches
• Mathematical modelling of gene regulatory networks

Outline
• Syllabus of this course
• Definition of genomics
• Role of BIOINFORMATICS in FUNCTIONAL GENOMICS
• Databases
• Spectrum of on-line resources

Spectrum of on-line Resources
• EBI, http://www.ebi.ac.uk/services
• NCBI, http://www.ncbi.nlm.nih.gov/

Outline
• Syllabus of this course
• Definition of genomics
• Role of BIOINFORMATICS in FUNCTIONAL GENOMICS
• Databases
• Spectrum of on-line resources
• PRIMARY, SECONDARY and STRUCTURAL databases

Primary Databases
• Contain primary datasets: DNA and protein sequences
• Sequence databases of “The Big Three”:
  • EMBL, http://www.ebi.ac.uk/embl/
  • GenBank, http://www.ncbi.nih.gov/Genbank/GenbankSearch.html
  • DDBJ, http://www.ddbj.nig.ac.jp
• Daily mutual exchange and backup of data
• Work with large amounts of data (capacity and software requirements)
• September 2003: 27.2 × 10^6 entries (approx. 33 × 10^9 bp)
• August 2005: 100 × 10^9 bp from 165,000 organisms

Figure: Growth of GenBank, 1982-2002. Axes: year; base pairs of DNA (millions); sequences (millions). Source: J. Pevsner, http://www.bioinfbook.org/index.php
Figure: Growth of GenBank + Whole Genome Shotgun (1982-November 2008): we reached 0.2 terabases. Axes: year; number of sequences in GenBank (millions); base pairs of DNA in GenBank (billions); base pairs in GenBank + WGS (billions). Source: J. Pevsner, http://www.bioinfbook.org/index.php
Figure: Growth of GenBank, Feb 15 2013 (WGS).
Interactive Concepts in Biochemistry, Rodney Boyer, Wiley, 2002, http://www.wiley.com//college/boyer/0470003790/
Figure: Growth of DNA Sequence in Repositories (x-axis: year). Source: J.
Pevsner, http://www.bioinfbook.org/index.php

Growth of DNA Sequence in Repositories (B&FG 3e, Fig. 2-3, page 22; x-axis: year)
• A vast amount of sequence data has been generated using next-generation sequencing.
• Perhaps 40 petabases (corresponding to 10 million human genomes) of DNA were generated in calendar year 2014 at major sequencing centers.

Primary Databases
• They include sets of primary data: DNA and protein sequences
• Protein sequences:
  • PIR, http://pir.georgetown.edu/
  • MIPS, http://www.mips.biochem.mpg.de
  • SWISS-PROT, http://www.expasy.org/sprot/

Primary Databases
• Types of sequences in primary databases:
  • Standard nucleotide sequences acquired by high-quality sequencing
  • ESTs (Expressed Sequence Tags)
  • HTGS (High-Throughput Genomic Sequences): results of sequencing projects without annotation
  • Reference sequences of annotated genomes
  • TPAs (Third Party Annotation): sequences annotated by a third party (by someone other than the original authors)

Primary Databases
• GenBank (NCBI), http://www.ncbi.nlm.nih.gov/

What is an Accession Number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4; DNA, RNA and protein records):
  X02775      GenBank genomic DNA sequence
  NT_030059   Genomic contig
  rs7079946   dbSNP (single nucleotide polymorphism)
  N91759.1    An expressed sequence tag (1 of 170)
  NM_006744   RefSeq DNA sequence (from a transcript)
  NP_006735   RefSeq protein
  AAC02945    GenBank protein
  Q28369      SwissProt protein
  1KT7        Protein Data Bank structure record
Source: J.
Pevsner, http://www.bioinfbook.org/index.php

NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
RefSeq identifiers include the following formats:
  Complete genome       NC_######
  Complete chromosome   NC_######
  Genomic contig        NT_######
  mRNA (DNA format)     NM_######   e.g. NM_006744
  Protein               NP_######   e.g. NP_006735
Source: J. Pevsner, http://www.bioinfbook.org/index.php

NCBI’s RefSeq project: many accession number formats for genomic, mRNA, protein sequences
  Accession         Molecule  Method           Note
  AC_123456         Genomic   Mixed            Alternate complete genomic
  AP_123456         Protein   Mixed            Protein products; alternate
  NC_123456         Genomic   Mixed            Complete genomic molecules
  NG_123456         Genomic   Mixed            Incomplete genomic regions
  NM_123456         mRNA      Mixed            Transcript products; mRNA
  NM_123456789      mRNA      Mixed            Transcript products; 9-digit
  NP_123456         Protein   Mixed            Protein products
  NP_123456789      Protein   Curation         Protein products; 9-digit
  NR_123456         RNA       Mixed            Non-coding transcripts
  NT_123456         Genomic   Automated        Genomic assemblies
  NW_123456         Genomic   Automated        Genomic assemblies
  NZ_ABCD12345678   Genomic   Automated        Whole genome shotgun data
  XM_123456         mRNA      Automated        Transcript products
  XP_123456         Protein   Automated        Protein products
  XR_123456         RNA       Automated        Transcript products
  YP_123456         Protein   Auto. & Curated  Protein products
  ZP_12345678       Protein   Automated        Protein products
Source: J.
Pevsner, http://www.bioinfbook.org/index.php

Secondary Databases
• Databases of functional or structural motifs, obtained by comparing primary data (sequences)
• PROSITE, http://www.expasy.org/prosite/
• PRINTS, http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
• TRANSFAC, http://www.gene-regulation.com/
• Scaffold/Matrix Attached Region transaction Database

Structural Databases
• PDB, http://www.rcsb.org/pdb/
Pekárová et al., Plant Journal (2011)

Outline
• Syllabus of the course
• Definition of genomics
• Role of bioinformatics in functional genomics
• Databases
• Spectrum of on-line resources
• PRIMARY, SECONDARY and STRUCTURAL databases
• GENOME resources

Genome Resources
• Human Genome Browser, http://genome.ucsc.edu/cgi-bin/hgGateway
• TAIR, The Arabidopsis Information Resource, http://www.arabidopsis.org

Outline
• Syllabus of the course
• Definition of genomics
• Role of bioinformatics in functional genomics
• Databases
• Spectrum of on-line resources
• PRIMARY,
SECONDARY and STRUCTURAL databases
• GENOME resources
• Analytical tools
• Homology searching

Analytical Tools
• Global versus local alignment
  • Global alignment: only for sequences that are similar and of similar length (BUT gaps can be inserted into one or both sequences)
  • Local alignment: identifies and compares regions of high similarity within the sequences, so it also works when, e.g., the order of protein domains changed during evolution
  • Global alignment is used mainly for multiple alignment (CLUSTALW, further in the presentation)
Source: Cvrčková, Úvod do praktické bioinformatiky

Analytical Tools
• Choosing the right type of alignment using a dotplot
  • The sequences are plotted against each other (x and y axis)
  • A dot marks identity within a window of a given size (e.g. 2 bp)
  • Diagonals shorter than a threshold are filtered out
Source: Cvrčková, Úvod do praktické bioinformatiky

Analytical Tools
• Examples of sequence alignment using a dotplot
  • Global alignment: possible only for sequences A and B
  • The remaining sequences underwent a change in the order of protein domains, so a local alignment is necessary
  • A dotplot can be obtained using BLAST2 (see further in the presentation)
Source: Cvrčková, Úvod do praktické bioinformatiky

Analytical Tools
• BLAST, http://ncbi.nlm.nih.gov/BLAST/

BLAST (Basic Local Alignment Search Tool)
• Word size: 10-11 bp or 2-3 aa
• Scoring the homology with PAM (Point Accepted Mutation) or BLOSUM (BLOcks Substitution Matrix) matrices
• Primary similarities (seed matches)
• Extending the regions of homology to the left and to the right
• Showing the results
Example: MRKEV [deletion] MRKE [substitution] MRKY [insertion] MRAKY
M R .
K E V | | | : M R A K Y
PAM 250 matrix
Source: Cvrčková, Úvod do praktické bioinformatiky

BLAST (Basic Local Alignment Search Tool)
• E = expectation value
• The expectation value gives the number of sequences expected to show the same or higher similarity when searching a database consisting of randomly assembled sequences
• The results show the fraction of identical sequence positions, for proteins also of similar positions, and/or of inserted gaps
• Searching can be restricted by the source (organism) of the sequences, e.g. known genomes of microorganisms

BLAST Specialized Versions
• There are currently many specialized versions of BLAST
• BLASTP
  • Given a protein query, it returns the most similar protein sequences from the protein database.
• BLASTN
  • Given a DNA query, it returns the most similar DNA sequences from the DNA database.
• BLASTX
  • Compares all six-frame translation products of a nucleotide query sequence (both strands) against a protein sequence database.
• TBLASTN
  • Compares a protein query against a nucleotide sequence database translated in all six reading frames.
• TBLASTX
  • Translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database.
• Other variants, e.g. MEGABLAST, for identification of identical or very similar sequences (searches for long similar regions of nucleotide sequences)
• PSI-BLAST (Position-Specific Iterated BLAST)
  • From each alignment, PSI-BLAST builds a so-called PSSM (Position-Specific Scoring Matrix)
  • The PSSM reflects the relative frequency of each amino acid residue at each position of the sequences identified as similar in the first step; a high frequency can indicate functional conservation.
  • First step: a standard BLAST search, in which PSI-BLAST collects the similar sequences with an E value better than a threshold (default = 0.005)
• PHI-BLAST (Pattern-Hit Initiated BLAST)
  • Identifies a specific motif (pattern) within a set of similar protein sequences
  • The motif must be entered using a special syntax:
    • [LVIMF] means either Leu, Val, Ile, Met or Phe
    • - is a spacer (it means nothing)
    • x(5) means 5 positions in which any residue is allowed
    • x(3, 5) means 3 to 5 positions where any residue is allowed
  • Example of a search by PHI-BLAST

Outline
• Syllabus of the course
• Definition of genomics
• Role of bioinformatics in functional genomics
• Databases
• Spectrum of on-line resources
• PRIMARY, SECONDARY and STRUCTURAL databases
• GENOME resources
• Analytical tools
• Homology searching
• Searching for sequence motifs, open reading frames, restriction sites…

Analytical Tools
• http://workbench.sdsc.edu/
• VPCR, http://grup.cribi.unipd.it/cgi-bin/mateo/vpcr2.cgi

Outline
• Syllabus of the course
• Definition of genomics
• Role of bioinformatics in functional genomics
• Databases
• Spectrum of on-line resources
• PRIMARY, SECONDARY and STRUCTURAL databases
• GENOME resources
• Analytical tools
• Homology searching
• Searching for sequence motifs, open reading frames, restriction sites…
• Other on-line genome tools

Other On-Line Genome Resources
• TIGR (The Institute for Genomic Research,
http://www.tigr.org/software/)
• Now part of the J. Craig Venter Institute

Other On-Line Genome Resources
• Online Mendelian Inheritance in Man (OMIM)

Summary
• Syllabus of the course
• Definition of genomics
• Role of bioinformatics in functional genomics
• Databases
• Spectrum of on-line resources
• PRIMARY, SECONDARY and STRUCTURAL databases
• GENOME resources
• Analytical tools
• Homology searching
• Searching for sequence motifs, open reading frames, restriction sites…
• Other on-line genome tools

Discussion
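
Worked example: PHI-BLAST pattern syntax
The pattern syntax listed on the PHI-BLAST slide ([LVIMF], the - spacer, x(5), x(3, 5)) maps closely onto regular expressions. The following is a minimal Python sketch of that mapping, not a description of how PHI-BLAST itself is implemented (PHI-BLAST combines the pattern match with a local alignment around it). The function names phi_pattern_to_regex and find_motif, the sample pattern and the sample sequence are hypothetical and serve only as illustration; only the four syntax elements named on the slide, plus single residue letters, are handled.

```python
import re

# One pattern element: "x", a single residue letter, or a bracketed set,
# optionally followed by a repeat count such as (5) or (3, 5).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:\s*,\s*(\d+))?\))?")


def phi_pattern_to_regex(pattern: str) -> str:
    """Translate a pattern in the slide's syntax into a Python regex string."""
    parts = []
    for element in pattern.split("-"):  # "-" is only a spacer between elements
        element = element.strip()
        if not element:
            continue
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        core = "." if residue == "x" else residue  # x = any residue
        if low is None:
            repeat = ""                      # no count: exactly one position
        elif high is None:
            repeat = "{%s}" % low            # x(5): exactly 5 positions
        else:
            repeat = "{%s,%s}" % (low, high)  # x(3, 5): 3 to 5 positions
        parts.append(core + repeat)
    return "".join(parts)


def find_motif(pattern: str, protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for each non-overlapping hit."""
    regex = re.compile(phi_pattern_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(protein)]


if __name__ == "__main__":
    # Hypothetical pattern and made-up sequence, for illustration only.
    print(phi_pattern_to_regex("[LVIMF]-x(5)-G-x(3, 5)-C"))
    print(find_motif("[LVIMF]-x(5)-G-x(3, 5)-C", "AAALKKKKKGAAACTT"))
```

The sketch reports only non-overlapping matches and says nothing about the similarity of the surrounding sequence, which is what the BLAST part of PHI-BLAST adds.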