CG020 Genomika
Lesson 1
Introduction to Bioinformatics
Jan Hejátko
Functional Genomics and Proteomics of Plants,
Mendel Centre for Plant Genomics and Proteomics,
CEITEC - Central European Institute of Technology
and
National Centre for Biomolecular Research,
Faculty of Science,
Masaryk University, Brno
hejatko@sci.muni.cz, www.ceitec.eu
Outline
• Syllabus of the Course
• Definition of Genomics
• Role of Bioinformatics in Functional Genomics
• Databases
• Spectrum of On-line Resources
• PRIMARY, SECONDARY and STRUCTURAL Databases
• GENOME Resources
• Analytical Tools
• Homology Searching
• Searching for Sequence Motifs, Open Reading Frames, Restriction Sites…
• Other On-line Genome Tools
Course Syllabus
• Chapter 01: Introduction to Bioinformatics
• Chapter 02: Identification of Genes
• Chapter 03: Reverse Genetics Approaches
• Chapter 04: Forward Genetics Approaches
• Chapter 05: Functional Genomics Approaches
• Chapter 06: Protein-Protein Interactions and Their Analysis
• Chapter 07: Current Methods of DNA Sequencing
• Chapter 08: Structure of Genomes
• Chapter 09: Genome Evolution
• Chapter 10: Genomics and Systems Biology
• Chapter 11: Practical Aspects of Functional Genomics
  • Model Organisms
  • PCR and Primer Design
Literature
• Literature resources for Chapter 01:
  • Bioinformatics and Functional Genomics, 3rd Edition, Jonathan Pevsner, Wiley-Blackwell, 2015, http://www.bioinfbook.org/php/?q=book3
  • Úvod do praktické bioinformatiky, Fatima Cvrčková, 2006, Academia, Praha
  • Plant Functional Genomics, ed. Erich Grotewold, 2003, Humana Press, Totowa, New Jersey
GENOMICS – What is it?
• Sensu lato (in the broad sense): it deals with the STRUCTURE and FUNCTION of genomes
• Sensu stricto (in the narrow sense): it deals with the FUNCTION of INDIVIDUAL GENES – FUNCTIONAL GENOMICS
• It mainly uses reverse genetics approaches
• Necessary prerequisite: knowledge of the genome (sequence) – work with databases
Genomics is a scientific discipline concerned with the analysis of genomes. The genome of an organism is the complete set of all genes of that organism. The genes can be located in the cytoplasm (in prokaryotes), in the nucleus (in most eukaryotic organisms), in mitochondria or in chloroplasts (in plants).
The critical prerequisite of genomics is the knowledge of gene sequences.
Functional genomics is concerned with the function of individual genes.
[Figure: Forward („classical“) Genetics Approaches (3 : 1, ?) versus Reverse Genetics Approaches (insertional mutagenesis of a known sequence, 5'TTATATATATATATTAAAAAATAAAATAAAAGAACAAAAAAGAAAATAAAATA….3'). The role of BIOINFORMATICS in FUNCTIONAL GENOMICS.]
Knowledge of gene sequences (or of the complete gene sets of individual organisms, i.e. of their genomes) gave rise to reverse genetics, which allows the function of these genes to be studied.
In contrast to "classical" or forward genetics, which starts with a phenotype, reverse genetics starts with a sequence identified as a gene in the sequenced genome. Gene identification using bioinformatics approaches will be described later (see Lesson 02).
Reverse genetics uses a spectrum of approaches, described in Lesson 03, that allow sequence-specific mutants to be isolated and their phenotypes analysed.
Forward genetics requires a phenotype alteration as its starting point, whereas reverse genetics does not; this is an important difference between the two approaches. As a result, the gene is no longer understood as a factor (trait) determining a phenotype, but rather as a piece of DNA characterized by a unique string of nucleotides, i.e. a physical DNA molecule.
• Definition of bioinformatics (according to the NIH Biomedical Information Science and Technology Initiative Consortium)
Research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioral
or health data, including those to acquire, store, organize, archive,
analyze, or visualize such data.
Bioinformatics
NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
July 17, 2000
The following working definition of bioinformatics and computational biology were developed by the BISTIC
Definition Committee and released on July 17, 2000. The committee was chaired by Dr. Michael Huerta of the
National Institute of Mental Health and consisted of the following members:
Bioinformatics Definition Committee
BISTIC Members: Michael Huerta (Chair), Florence Haseltine, Yuan Liu
Expert Members: Gregory Downing, Belinda Seto
Preamble
Bioinformatics and computational biology are rooted in life sciences as well as computer and information
sciences and technologies. Both of these interdisciplinary approaches draw from specific disciplines such as
mathematics, physics, computer science and engineering, biology, and behavioral science. Bioinformatics and
computational biology each maintain close interactions with life sciences to realize their full potential.
Bioinformatics applies principles of information sciences and technologies to make the vast, diverse, and
complex life sciences data more understandable and useful. Computational biology uses mathematical and
computational approaches to address theoretical and experimental questions in biology. Although
bioinformatics and computational biology are distinct, there is also significant overlap and activity at their
interface.
Definition
The NIH Biomedical Information Science and Technology Initiative Consortium agreed on the following
definitions of bioinformatics and computational biology recognizing that no definition could completely eliminate
overlap with other activities or preclude variations in interpretation by different individuals and organizations.
Bioinformatics: Research, development, or application of computational tools and approaches for expanding
the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive,
analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods,
mathematical modeling and computational simulation techniques to the study of biological, behavioral, and
social systems.
• Interface between the biology and computers
• Analysis of proteins, genes and genomes
using computer algorithms and databases
• Genomics is the analysis of genomes.
The tools of bioinformatics are used to make
sense of the billions of base pairs of DNA
that are sequenced by genomics projects.
What is bioinformatics?
J. Pevsner, 
http://www.bioinfbook.org/index.php
Bioinformatics
• Bioinformatics in functional genomics
  • Processing and analysis of sequencing data
    • Identification of reference sequences
    • Identification of genes
    • Identification of homologues, orthologues and paralogues
    • Correlative analysis of genomes and phenotypes (incl. human)
  • Processing and analysis of transcriptional data
    • Transcriptional profiling using DNA chips or next-gen sequencing
  • Evaluation of experimental data and prediction of new regulations in systems biology approaches
    • Mathematical modelling of gene regulatory networks
Spectrum of On-line Resources
Many on-line resources are available.
• EBI http://www.ebi.ac.uk/services
• NCBI http://www.ncbi.nlm.nih.gov/
Nowadays, the resources are interconnected and can be accessed via dedicated web portals. Among the best and most widely used portals integrating many database resources are those of the European Bioinformatics Institute (EBI) in Europe (Hinxton, United Kingdom) and the National Center for Biotechnology Information (NCBI) in the USA.
Primary Databases
• Include primary datasets: DNA and protein sequences
• Sequences in the databases of "The Big Three":
  • EMBL http://www.ebi.ac.uk/embl/
  • GenBank http://www.ncbi.nih.gov/Genbank/GenbankSearch.html
  • DDBJ http://www.ddbj.nig.ac.jp
• Daily mutual exchange and backup of data
• Work with large amounts of data (capacity and software requirements)
  • September 2003: 27.2 x 10^6 entries (approx. 33 x 10^9 bp)
  • August 2005: 100 x 10^9 bp from 165,000 organisms
[Figure: Growth of GenBank, 1982-2002, plotted by year as base pairs of DNA (millions) and sequences (millions). J. Pevsner, http://www.bioinfbook.org/index.php]
[Figure: Growth of GenBank + Whole Genome Shotgun (1982-November 2008): we reached 0.2 terabases. Plotted by year (1982-2008): number of sequences in GenBank (millions), base pairs of DNA in GenBank (billions) and base pairs in GenBank + WGS (billions). J. Pevsner, http://www.bioinfbook.org/index.php]
Growth of GenBank (as of Aug 2016)
• Dec 1982: 680,338 bp, 606 sequences
• Apr 2002: 19 x 10^9 bp, 17 x 10^6 sequences + WGS: 692 x 10^6 bp, 172,768 sequences
• Aug 2016: 218 x 10^9 bp, 196 x 10^6 sequences + WGS: 1.6 x 10^12 bp, 360 x 10^6 sequences
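The GenBank figures above can be turned into a rough growth rate. The sketch below is only an illustration: it assumes steady exponential growth between the first and the last data point (which the real curve does not follow exactly), and the function name doubling_time_months is made up for this example.

```python
import math
from datetime import date


def doubling_time_months(bp_start, bp_end, start, end):
    """Average doubling time in months, assuming steady exponential growth."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    doublings = math.log2(bp_end / bp_start)
    return months / doublings


# Figures quoted on the slide (traditional GenBank records, without WGS).
t = doubling_time_months(680_338, 218e9, date(1982, 12, 1), date(2016, 8, 1))
print(f"average doubling time: {t:.1f} months")
```

Under this simplification the database size doubled on average roughly every two years over the whole period.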
WGS
Shotgun sequencing allows a scientist to rapidly determine the sequence of very long
stretches of DNA. The key to this process is fragmenting of the genome into smaller
pieces that are then sequenced side by side, rather than trying to read the entire genome
in order from beginning to end. The genomic DNA is usually first divided into its
individual chromosomes. Each chromosome is then randomly broken into small strands
of hundreds to several thousand base pairs, usually accomplished by mechanical
shearing of the purified genetic material. Each of the short DNA pieces is then inserted
into a DNA vector (a viral genome), resulting in a viral particle containing "cloned"
genomic DNA (Fig. 1).
The collection of all the viral particles with all the different genomic DNA pieces is
referred to as a library. Just as a library consists of a set of books that together make up
all of human knowledge, a genomic library consists of a set of DNA pieces that together
make up the entire genome sequence. Placing the genomic DNA within the viral genome
allows bacteria infected with the virus to faithfully replicate the genomic DNA pieces.
Additionally, since a little bit of known sequence is needed to start the sequencing
reaction, the reaction can be primed off the known flanking viral DNA.
In order to read all the nucleotides of one organism, millions of individual clones are
sequenced. The data is sorted by computer, which compares the sequences of all the
small DNA pieces at once (in a "shotgun" approach) and places them in order by virtue
of their overlapping sequences to generate the full-length sequence of the genome (Fig.
2). To statistically ensure that the whole genome sequence is acquired by this method,
an amount of DNA equal to five to ten times the length of the genome must be
sequenced. (Interactive concepts in biochemistry, Rodney Boyer, Wiley, 2002,
http://www.wiley.com//college/boyer/0470003790/)
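The statement that five- to ten-fold coverage is needed can be illustrated with the Lander-Waterman (Poisson) model, which is not part of the quoted text and is used here as an assumption: if reads start at random positions, the chance that a given base is hit by no read at coverage c is about e^(-c). The genome size below is a hypothetical example value.

```python
import math


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by no read."""
    return math.exp(-coverage)


def expected_uncovered_bases(genome_length, coverage):
    """Expected number of bases left unsequenced at the given coverage."""
    return genome_length * expected_uncovered_fraction(coverage)


GENOME_LENGTH = 125_000_000  # hypothetical genome size in bp

for c in (1, 2, 5, 10):
    missing = expected_uncovered_bases(GENOME_LENGTH, c)
    print(f"{c:>2}x coverage: about {missing:,.0f} bp expected to be missed")
```

The model ignores cloning bias and repeats, so real projects typically miss more than this idealized estimate suggests.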
Growth of DNA Sequence in Repositories
[Figure: growth of DNA sequence in repositories by year; B&FG 3e, Fig. 2-3, page 22 (J. Pevsner, http://www.bioinfbook.org/index.php)]
A vast amount of sequence data has been generated using next-generation sequencing.
Perhaps 40 petabases (corresponding to 10 million human genomes) of DNA were generated in calendar year 2014 at major sequencing centers.
Primary Databases
• They include sets of primary data: DNA and protein sequences
• Protein sequences:
  • PIR, http://pir.georgetown.edu/
  • MIPS, http://www.mips.biochem.mpg.de
  • SWISS-PROT, http://www.expasy.org/sprot/
• Types of sequences in primary databases
  • Standard nucleotide sequences acquired by high-quality sequencing
  • ESTs (Expressed Sequence Tags)
  • HTGS (High Throughput Genomic Sequences): results of sequencing projects without annotation
  • Reference sequences of annotated genomes
  • TPAs (Third Party Annotation): sequences annotated by a third party (by someone other than the original authors)
GenBank (NCBI) http://www.ncbi.nlm.nih.gov/
Primary Databases
[Screenshots: GenBank records at NCBI, with the accession number marked]
What is an Accession Number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
Molecule  Accession   Record type
DNA       X02775      GenBank genomic DNA sequence
DNA       NT_030059   Genomic contig
DNA       rs7079946   dbSNP (single nucleotide polymorphism)
RNA       N91759.1    An expressed sequence tag (1 of 170)
RNA       NM_006744   RefSeq DNA sequence (from a transcript)
Protein   NP_006735   RefSeq protein
Protein   AAC02945    GenBank protein
Protein   Q28369      SwissProt protein
Protein   1KT7        Protein Data Bank structure record
J. Pevsner, http://www.bioinfbook.org/index.php
NCBI’s important RefSeq project:
best representative sequences
RefSeq (accessible via the main page of NCBI)
provides an expertly curated accession number that
corresponds to the most stable, agreed-upon “reference”
version of a sequence.
RefSeq identifiers include the following formats:
Complete genome NC_######
Complete chromosome NC_######
Genomic contig NT_######
mRNA (DNA format) NM_###### e.g. NM_006744
Protein NP_###### e.g. NP_006735
J. Pevsner, 
http://www.bioinfbook.org/index.php
NCBI's RefSeq project: many accession number formats for genomic, mRNA, protein sequences
Accession         Molecule  Method           Note
AC_123456         Genomic   Mixed            Alternate complete genomic
AP_123456         Protein   Mixed            Protein products; alternate
NC_123456         Genomic   Mixed            Complete genomic molecules
NG_123456         Genomic   Mixed            Incomplete genomic regions
NM_123456         mRNA      Mixed            Transcript products; mRNA
NM_123456789      mRNA      Mixed            Transcript products; 9-digit
NP_123456         Protein   Mixed            Protein products;
NP_123456789      Protein   Curation         Protein products; 9-digit
NR_123456         RNA       Mixed            Non-coding transcripts
NT_123456         Genomic   Automated        Genomic assemblies
NW_123456         Genomic   Automated        Genomic assemblies
NZ_ABCD12345678   Genomic   Automated        Whole genome shotgun data
XM_123456         mRNA      Automated        Transcript products
XP_123456         Protein   Automated        Protein products
XR_123456         RNA       Automated        Transcript products
YP_123456         Protein   Auto. & Curated  Protein products
ZP_12345678       Protein   Automated        Protein products
J. Pevsner, http://www.bioinfbook.org/index.php
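Because the RefSeq prefix encodes the molecule type, an accession can be classified from its text alone. The following is a minimal sketch: the prefix table is transcribed from the slide above, while the regular expression and the function name classify_refseq are assumptions made for this illustration and do not reproduce an official NCBI validation rule. For NP_ only the "Mixed" method is recorded, which simplifies the two NP_ rows of the table.

```python
import re

# Prefix -> (molecule, method), transcribed from the table above.
REFSEQ_PREFIXES = {
    "AC": ("Genomic", "Mixed"),
    "AP": ("Protein", "Mixed"),
    "NC": ("Genomic", "Mixed"),
    "NG": ("Genomic", "Mixed"),
    "NM": ("mRNA", "Mixed"),
    "NP": ("Protein", "Mixed"),
    "NR": ("RNA", "Mixed"),
    "NT": ("Genomic", "Automated"),
    "NW": ("Genomic", "Automated"),
    "NZ": ("Genomic", "Automated"),
    "XM": ("mRNA", "Automated"),
    "XP": ("Protein", "Automated"),
    "XR": ("RNA", "Automated"),
    "YP": ("Protein", "Auto. & Curated"),
    "ZP": ("Protein", "Automated"),
}

# Two-letter prefix, underscore, optional letters, digits, optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_([A-Z]{0,4}\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return a dict describing a RefSeq-style accession, or None if it does not fit."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        return None
    prefix = match.group(1)
    info = REFSEQ_PREFIXES.get(prefix)
    if info is None:
        return None
    return {"prefix": prefix, "molecule": info[0], "method": info[1]}


for acc in ("NM_006744", "NP_006735", "NT_030059", "X02775"):
    print(acc, "->", classify_refseq(acc))
```

Accessions such as X02775 are not in RefSeq format, so the sketch returns None for them.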
Secondary Databases
• Databases of functional or structural motifs, derived by comparing primary data (sequences)
  • PROSITE, http://www.expasy.org/prosite/
  • PRINTS, http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
• TRANSFAC http://www.gene-regulation.com/
• S/MARt DB (Scaffold/Matrix Attached Region transaction Database) http://transfac.gbf.de/SMARtDB/index.html
S/MARt DB collects information about S/MARs and the nuclear matrix proteins that are thought to be involved in the interaction of these elements with the nuclear matrix.
Structural Databases
• PDB http://www.rcsb.org/pdb/
[Figure source: Pekárová et al., Plant Journal (2011)]
Genome Resources
• Human Genome Browser http://genome.ucsc.edu/cgi-bin/hgGateway
• The Arabidopsis Information Resource (TAIR) http://www.arabidopsis.org
Analytical Tools
• Global versus local alignment
  • Global alignment: suitable only for sequences that are similar and of similar length (BUT gaps can be inserted into one or both sequences)
  • Local alignment: identifies and compares regions of high similarity even when the rest of the sequences differs, e.g. when the order of protein domains has changed during evolution
  • Global alignment is used mainly for multiple alignment (CLUSTALW, see later in the presentation)
Cvrčková, Úvod do praktické bioinformatiky
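The difference between the two alignment types can be sketched with the classical dynamic-programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local alignment). The slides do not name these algorithms, so they are used here as an assumption. The scoring values (+1 match, -1 mismatch, -1 gap) and the two test "proteins" with swapped domains are hypothetical.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best global (default) or local alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a bad region can always be dropped (score 0).
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


# Two hypothetical proteins built from the same domains in swapped order.
domain_1, linker, domain_2 = "MKTAYIAK", "GGGG", "QRLEDWH"
protein_x = domain_1 + linker + domain_2
protein_y = domain_2 + linker + domain_1

print("global score:", alignment_score(protein_x, protein_y))
print("local score: ", alignment_score(protein_x, protein_y, local=True))
```

With the domain order swapped, the global score is dragged down by gaps and mismatches, while the local score still reflects the fully conserved domain.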
• Choosing the right type of alignment using a dotplot
  • The sequences are plotted against each other (x and y axes)
  • A "dot" marks identity within a window of a given size (e.g. 2 bp)
  • Diagonals shorter than a threshold are filtered out
Cvrčková, Úvod do praktické bioinformatiky
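The dotplot procedure described above can be sketched in a few lines. This is a simplified illustration, not the implementation of any particular tool: the function names dotplot and filter_diagonals, the window size and the example sequences are all chosen for this example.

```python
def dotplot(seq_x, seq_y, window=2):
    """Return the set of (i, j) where the words of length `window` starting
    at seq_x[i] and seq_y[j] are identical."""
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            if seq_x[i:i + window] == seq_y[j:j + window]:
                dots.add((i, j))
    return dots


def filter_diagonals(dots, min_length):
    """Keep only dots that lie on diagonal runs of at least `min_length` dots."""
    kept = set()
    for (i, j) in dots:
        if (i - 1, j - 1) in dots:
            continue  # not the first dot of its diagonal run
        run = []
        while (i, j) in dots:
            run.append((i, j))
            i, j = i + 1, j + 1
        if len(run) >= min_length:
            kept.update(run)
    return kept


def render(dots, len_x, len_y):
    """Draw the plot as text: columns follow sequence x, rows follow sequence y."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for i in range(len_x))
        for j in range(len_y)
    )


x, y = "ACGTTGCAACGT", "TTGCAGGACGT"
dots = filter_diagonals(dotplot(x, y, window=2), min_length=3)
print(render(dots, len(x), len(y)))
```

A single long diagonal suggests that a global alignment is appropriate; several shifted diagonals suggest rearranged regions and therefore a local alignment.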
• Examples of sequence alignment using a dotplot
  • Global alignment is possible only for sequences A and B
  • The remaining sequences have undergone a change in the order of protein domains, so a local alignment is necessary
  • A dotplot can be obtained using BLAST2 (see later in the presentation)
Cvrčková, Úvod do praktické bioinformatiky
• BLAST http://ncbi.nlm.nih.gov/BLAST/
BLAST (Basic Local Alignment Search Tool)
• Word size: 10-11 bp or 2-3 aa
• Similarity is scored with PAM (Point Accepted Mutation) or BLOSUM (BLOcks SUbstitution Matrix) matrices
• Primary similarities (seed matches) are identified
• The regions of homology are extended to the left and to the right
• The results are shown
Example (PAM 250 matrix): MRKEV is converted into MRAKY by a deletion, a substitution and an insertion
MRKEV [deletion]
MRKE  [substitution]
MRKY  [insertion]
MRAKY
Resulting alignment:
M R . K E V
| |   | :
M R A K Y
Cvrčková, Úvod do praktické bioinformatiky
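The seed-and-extend idea can be sketched as follows. This is a toy version under several stated assumptions: seeds are exact word matches only (real BLAST also accepts similar "neighbourhood" words for proteins), scoring uses a simple +1/-2 match/mismatch scheme instead of a PAM or BLOSUM matrix, extension is ungapped, and the parameter x_drop and both function names are invented for this illustration.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size, match=1, mismatch=-2, x_drop=3):
    """Extend a seed without gaps to the right and to the left.

    Extension stops once the running score falls more than x_drop below the
    best score seen. Returns (query_start, subject_start, length, score).
    """
    best = score = word_size * match
    best_right = steps = 0
    while i + word_size + steps < len(query) and j + word_size + steps < len(subject):
        same = query[i + word_size + steps] == subject[j + word_size + steps]
        score += match if same else mismatch
        steps += 1
        if score > best:
            best, best_right = score, steps
        elif best - score > x_drop:
            break
    score = best
    best_left = steps = 0
    while i - steps - 1 >= 0 and j - steps - 1 >= 0:
        same = query[i - steps - 1] == subject[j - steps - 1]
        score += match if same else mismatch
        steps += 1
        if score > best:
            best, best_left = score, steps
        elif best - score > x_drop:
            break
    return (i - best_left, j - best_left, word_size + best_left + best_right, best)


query, subject = "TTGACGTACGAA", "CCGACGTACGGG"
for seed in find_seeds(query, subject, word_size=4):
    print(seed, "->", extend_seed(query, subject, *seed, word_size=4))
```

Several seeds inside the same conserved region extend to the same segment; a real implementation would merge such duplicates before reporting results.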
• E = expect value
  • The expect value is the number of sequences with the same or higher similarity that would be expected by chance when searching a database of the same size consisting of randomly assembled sequences
• The results show the fraction of identical positions, for proteins also the fraction of similar positions, and the fraction of inserted gaps
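The expect value is commonly computed with the Karlin-Altschul formula E = K·m·n·e^(-λS), where S is the raw alignment score, m the query length and n the database length. The formula is not given on the slide and is added here as background. The values of λ and K below are illustrative placeholders in the range often quoted for ungapped protein scoring; they are assumptions, not parameters taken from the lecture.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul expect value: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


LAMBDA, K = 0.318, 0.134           # illustrative placeholder parameters
QUERY_LEN, DB_LEN = 250, 10 ** 9   # hypothetical query and database sizes

for s in (40, 60, 80, 100):
    e = expect_value(s, QUERY_LEN, DB_LEN, LAMBDA, K)
    print(f"raw score {s:>3}: {bit_score(s, LAMBDA, K):6.1f} bits, E = {e:.2e}")
```

The E value grows linearly with the database size and falls exponentially with the score, which is why the same alignment becomes less significant as databases grow.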
Primary Databases
BLINK is a link to the pre-computed BLAST search results for the respective
sequence (see the next slide).
BLAST: Specialized Versions
• Many specialized versions of BLAST are currently available
• Searches can be restricted by the source (organism) of the sequences, e.g. to the known genomes of microorganisms
• BLASTP
  • Given a protein query, it returns the most similar protein sequences from a protein database
• BLASTN
  • Given a DNA query, it returns the most similar DNA sequences from a DNA database
• BLASTX
  • Compares all six-frame translation products of a nucleotide query sequence (both strands) against a protein sequence database
• TBLASTN
  • Compares a protein query against all six reading frames of a nucleotide sequence database
• TBLASTX
  • Translates the nucleotide query sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database
• Other variants, e.g. MEGABLAST, for the identification of identical or very similar sequences (searches for long similar regions of nucleotide sequences)
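The six-frame translation used by BLASTX, TBLASTN and TBLASTX can be sketched as below. The sketch assumes the standard genetic code and an unambiguous DNA alphabet (A, C, G, T); the frame labels "+1" to "-3" and the function names are conventions chosen for this example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six protein strings can then be compared with a protein database, which is what makes a nucleotide query usable in a protein-level search.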
• PSI-BLAST (Position-Specific Iterated BLAST)
  • First step: a standard BLAST search, in which PSI-BLAST identifies a list of similar sequences with an E value better than a threshold (default = 0.005)
  • For every alignment, PSI-BLAST creates a so-called PSSM (Position-Specific Scoring Matrix)
  • The PSSM takes into account the relative frequency of each amino acid residue at each position within the sequences identified as similar in the first step; a high frequency can indicate functional conservation
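How a position-specific matrix captures residue frequencies can be sketched as follows. This is a strongly simplified illustration and not the PSI-BLAST procedure itself: it assumes an ungapped alignment, a uniform background frequency of 1/20 and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and more elaborate pseudocounts. The aligned hits are made-up sequences.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build log-odds scores (in bits) per position from equal-length sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score(pssm, sequence):
    """Sum the position-specific scores of a sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


hits = ["MRKEV", "MRKEI", "MRREV", "MKKEV"]  # hypothetical first-round hits
pssm = build_pssm(hits)
for candidate in ("MRKEV", "MRKDV", "WWWWW"):
    print(candidate, round(score(pssm, candidate), 2))
```

Residues that dominate a column receive positive scores there, so the next search round rewards candidates that share the conserved positions.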
• PHI-BLAST (Pattern-Hit Initiated BLAST)
  • Identifies a specific sequence, e.g. a motif (pattern), within a set of similar protein sequences
  • The motif must be entered using a special syntax:
    • [LVIMF] means either Leu, Val, Ile, Met or Phe
    • - is a spacer (it has no meaning)
    • x(5) means 5 positions in which any residue is allowed
    • x(3, 5) means 3 to 5 positions in which any residue is allowed
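The pattern syntax listed above maps directly onto regular expressions. The sketch below converts only the four constructs named on the slide; other elements of the full pattern syntax are deliberately not handled. The function name pattern_to_regex, the example pattern and the example protein are hypothetical.

```python
import re

# One pattern element: x, a single residue, or a [set], optionally followed
# by (n) or (n, m).
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,\s*(\d+))?\))?$")


def pattern_to_regex(pattern):
    """Convert a pattern such as '[LVIMF]-x(5)-G' into a compiled regex."""
    parts = []
    for element in pattern.split("-"):  # '-' is only a spacer
        match = ELEMENT.match(element.strip())
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        part = "." if residue == "x" else residue  # x = any residue
        if low and high:
            part += f"{{{low},{high}}}"
        elif low:
            part += f"{{{low}}}"
        parts.append(part)
    return re.compile("".join(parts))


motif = pattern_to_regex("[LVIMF]-x(5)-G-x(3, 5)-K")  # hypothetical motif
protein = "AAALAAAAAGCCCKAA"                          # hypothetical sequence
hit = motif.search(protein)
print(motif.pattern, "->", hit.group() if hit else "no match")
```

PHI-BLAST itself combines such a pattern match with a local alignment around the hit; the sketch covers only the pattern-matching part.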
[Screenshot: example of a search by PHI-BLAST]
Analytical Tools: Searching for Sequence Motifs, Open Reading Frames, Restriction Sites…
• http://workbench.sdsc.edu/
• VPCR http://grup.cribi.unipd.it/cgi-bin/mateo/vpcr2.cgi
Other On-Line Genome Resources
• TIGR (The Institute for Genomic Research, http://www.tigr.org/software/), now part of the J. Craig Venter Institute
• Online Mendelian Inheritance in Man (OMIM)
Summary
• Syllabus of the Course
• Definition of Genomics
• Role of Bioinformatics in Functional Genomics
• Databases
• Spectrum of On-line Resources
• PRIMARY, SECONDARY and STRUCTURAL Databases
• GENOME Resources
• Analytical Tools
• Homology Searching
• Searching for Sequence Motifs, Open Reading Frames, Restriction Sites…
• Other On-line Genome Tools
Discussion