Modern methods of genome analysis
Mgr. Lenka Radová, Ph.D.
Brno, 4.12.2015
Base calling
Reads pre-processing
Mapping on reference
Quality based variant detection
Local realignment
Biological interpretation
Model based variant detection
Variant annotation
Pipeline/Workflow
Post-processing
But…
many steps in the experimental process may introduce errors and biases
Scales of genome size
Quality control
FASTQ format
• The first line starts with '@', followed by the label
• The second line contains the sequence
• The third line starts with '+'. In some variants, the '+' line contains a
second copy of the label
• The fourth line contains the Q scores represented as ASCII characters
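The four-line record layout above can be sketched as a small parser. This is a minimal illustration, not a production reader: the function names are hypothetical, each record is assumed to span exactly four lines, and the quality string is assumed to use the Phred+33 encoding (Q = ASCII code minus 33).

```python
def parse_fastq(text):
    """Yield (label, sequence, q_scores) for each 4-line FASTQ record in text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at non-empty line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumption: Phred+33 encoding of the quality characters.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


def error_probability(q):
    """Convert a Phred score Q to the base-calling error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


example = "@read1\nACGT\n+\nII5!\n"
records = list(parse_fastq(example))
print(records)
```

For the example read, the characters 'I', '5' and '!' decode to Q scores 40, 20 and 0.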
Basic biological problems
• Identification of mutations
- somatic
- germline
• Expression analyses - genes, miRNAs, etc.
An Introduction to Genetic Analysis. 7th edition. Griffiths AJF, Miller JH, Suzuki DT, et al. New York: W. H. Freeman; 2000
Mutation identification
• Whole exome or whole genome data,
ultra-deep sequencing
• Output: VCF-format
Mutation identification
• Aim: identification of point mutations
• Application: diagnosis of diseases
- inherited (germline, de novo mutations)
e.g. familial hypercholesterolemia, hemophilia, cystic fibrosis…
- acquired (somatic mutations)
e.g. cancer, leukemia, …
Germline mutations
• Comparison with reference genome
• Expected allele frequency: 30-100%
• Software: GATK, VarScan, …
• Usage: e.g. prenatal diagnostics
Somatic mutations
• Comparison tumor-normal (matched, unmatched)
• Expected allele frequency: >0.2%
• Software: MuTect, FreeBayes, deepSNV, …
• Usage: translational research, cancer diagnostics,
personalized medicine, …
Advanced biological problems
• Structural variant discovery
(deletions, duplications, CN variants, insertions,
inversions, translocations)
Advanced biological problems
• Chromothripsis = thousands of clustered
chromosomal rearrangements that occur in a single
event in localised and confined genomic regions
in one or a few chromosomes
Expression analyses – RNA-seq
• characterization of gene expression in cells via
measurement of mRNA levels
• Output: expression level table
RNA-seq
• Aim: identification of genes differentially
expressed in tissues with different conditions
(tumor vs normal, treated vs untreated,
different stages of illness, …)
• Application: translational research, diagnosis
of diseases
Expression level in RNA-seq
= The number of reads (counts)
mapping to the biological
feature of interest (gene,
transcript, exon, etc.) is
considered to be linearly
related to the abundance of the
target feature
What is differential expression?
• A gene is declared differentially expressed if
an observed difference or change in read
counts between two experimental conditions
is statistically significant, i.e. if it is
greater than what would be expected just due
to natural random variation.
• Statistical tools are needed to make such a
decision by studying the probability
distributions of the counts.
Definitions
• Sequencing depth: Total number of reads
mapped to the genome. Library size.
• Gene length: Number of bases.
• Gene counts: Number of reads mapping to
that gene (expression measurement)
Experimental design
• Pairwise comparisons: Only two experimental
conditions or groups are compared.
• Multiple comparisons: More than 2 conditions
or groups.
• Biological replicates. To draw general
conclusions: from samples to population.
• Technical replicates. Conclusions are only valid
for compared samples.
Replicates
RNA-seq biases
• Influence of sequencing depth: The higher
the sequencing depth, the higher the counts.
• Dependence on gene length: Counts are
proportional to the transcript length times the
mRNA expression level.
• Differences in the count distributions among
samples.
Options
1. Normalization: Counts should be corrected
beforehand in order to minimize these biases.
2. Statistical model should take them into
account.
Normalization methods
• RPKM (Mortazavi et al., 2008) = Reads Per Kilobase per Million mapped reads:
Counts are divided by the transcript length (kb) times the total
number of millions of mapped reads
• Upper-quartile (Bullard et al., 2010): Counts are divided by the upper
quartile of counts for transcripts with at least one read.
• TMM (Robinson and Oshlack, 2010): Trimmed Mean of M values.
• Quantiles, as in microarray normalization (Irizarry et al., 2003).
• FPKM (Trapnell et al., 2010): Instead of counts, Cufflinks software
generates FPKM values (Fragments Per Kilobase of exon per Million
fragments mapped) to estimate gene expression, which are
analogous to RPKM.
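The RPKM definition above can be written out as a short calculation. This is only a sketch: the function name, gene names, counts and lengths are made up for illustration, and the library size is passed in explicitly as the total number of mapped reads.

```python
def rpkm(counts, lengths_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads for each gene.

    counts: dict gene -> read count
    lengths_bp: dict gene -> transcript length in bases
    total_mapped_reads: library size (sequencing depth)
    """
    millions = total_mapped_reads / 1e6
    return {
        gene: count / ((lengths_bp[gene] / 1e3) * millions)
        for gene, count in counts.items()
    }


# Hypothetical toy data: geneB is three times longer and has three times
# the counts, so both genes end up with the same RPKM.
counts = {"geneA": 500, "geneB": 1500}
lengths_bp = {"geneA": 1000, "geneB": 3000}
values = rpkm(counts, lengths_bp, total_mapped_reads=2_000_000)
print(values)
```

The toy data show why length correction matters: the raw counts differ threefold, yet after dividing by transcript length and library size both genes have an RPKM of 250.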
Differential expression
• Parametric assumptions: Are they fulfilled?
• Need for replicates.
• Problems to detect differential expression in
genes with low counts.
Goal
• Based on a count table, we want to detect
differentially expressed genes between
conditions of interest.
• We will assign to each gene a p-value (0-1),
which shows us 'how surprised we should be'
to see this difference, when we assume there
is no difference.
Algorithms under active development
http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Detecting_differential_expression_by_count_analysis
Intuition
Difference is quantified and used for
p-value computation
Dispersion estimation
• For every gene, a negative binomial (NB) model is fitted based on the
counts. The most important factor in that model
to be estimated is the dispersion.
• DESeq2 estimates dispersion in 3 steps:
1. Estimates dispersion parameter for each gene
2. Plots and fits a curve
3. Adjusts the dispersion parameter towards the
curve ('shrinking')
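To make the role of the dispersion concrete: in the NB model the variance of a gene's counts is mean + dispersion × mean², so a rough per-gene estimate can be obtained from the sample mean and variance. The sketch below uses this simple method-of-moments estimate. It is not the estimator DESeq2 uses, it skips the curve fitting and shrinkage steps, and it assumes the counts are already normalized replicates from a single condition.

```python
import statistics


def moment_dispersion(counts):
    """Method-of-moments NB dispersion: (variance - mean) / mean^2, floored at 0.

    Assumption: counts are normalized replicate counts of one gene
    within one condition (at least two replicates).
    """
    mean = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance
    if mean == 0:
        return 0.0
    return max((var - mean) / mean ** 2, 0.0)


# Hypothetical replicate counts for one gene.
alpha = moment_dispersion([10, 20, 30])
print(alpha)
```

With mean 20 and sample variance 100, the estimate is (100 - 20) / 400 = 0.2. With only a few replicates such per-gene estimates are very noisy, which is the motivation for adjusting them towards a fitted curve.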
Dispersion estimation
• Black dots = estimates
from the data
• Red line = curve fitted
• Blue dots = final assigned
dispersion parameter for
that gene
Model is fitted
Test runs between 2 conditions
• for each gene 2 NB
models (one for each
condition) are made, and
a Wald test decides
whether the difference is
significant (red in plot).
i.e. we are going to perform
thousands of tests…
(if we set a cut-off on the
p-value of 0.05 and we have
performed 20000 tests, about 1000
genes are expected to appear significant
by chance alone, even if no gene is truly DE)
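The arithmetic behind this number can be checked directly. The sketch below computes the expected count and also simulates the situation where no gene is truly DE, assuming p-values are then uniformly distributed between 0 and 1. The function names and the random seed are arbitrary choices for illustration.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing the cut-off when no effect is real."""
    return n_tests * alpha


def simulate_null_hits(n_tests, alpha, seed=0):
    """Count p-values below alpha when all p-values are drawn uniformly."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


expected = expected_false_positives(20000, 0.05)
observed = simulate_null_hits(20000, 0.05)
print(expected, observed)
```

The simulated count fluctuates around the expected value of 1000 from run to run, depending on the seed.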
Check the distribution of p-values
• If the histogram of the
p-values does not
match a profile as
shown here, the test is
not reliable. Perhaps
the NB fitting step did
not succeed, or
confounding variables
are present.
Improve test results
[Figure: histogram of p-values with the cut-off at 0.05; labelled regions: false positive fraction, correctly identified as DE]
Improve test results
• Avoid unnecessary tests = apply a filter before testing
(independent filtering)
• Apply multiple testing correction
Multiple testing corrections
• Bonferroni correction controls the family-wise error
rate; Benjamini-Hochberg correction controls the false
discovery rate (FDR).
• FDR is the expected fraction of false positives among the
genes that are classified as DE.
• If we set an FDR threshold α of 0.05, about 5% of
the genes called DE are expected to be false positives.
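Both corrections can be sketched in a few lines. This is a plain-Python illustration with hypothetical function names and made-up p-values. In practice the adjusted p-values are normally taken from the analysis package itself.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: p * n, capped at 1."""
    n = len(pvalues)
    return [min(p * n, 1.0) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
print(bonf)
print(bh)
```

Genes whose Benjamini-Hochberg adjusted p-value falls below the chosen threshold are called DE. Bonferroni gives larger adjusted values for the same input, so it is the more conservative of the two.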
Including different factors
Strain: WT, Mutant (UPC)
Treatment: G, AG
Additional metadata (batch factor): Day 1, Day 2
Which genes are DE between UPC and WT?
Which genes are DE between G and AG?
Which genes are DE in WT between G and AG?
Statistical model
Gene = strain + treatment + day
• export results for unique comparisons
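The additive model above corresponds to a design matrix with one indicator column per factor. The sketch below builds such a matrix for a hypothetical sample sheet. The sample layout and the choice of reference levels (WT, G, Day 1) are assumptions for illustration, not taken from the experiment.

```python
def design_row(strain, treatment, day):
    """One row of the design matrix: intercept, strain, treatment, day.

    Assumed reference levels: strain WT, treatment G, day 1.
    """
    return [1, int(strain == "UPC"), int(treatment == "AG"), int(day == 2)]


# Hypothetical sample sheet: (strain, treatment, day).
samples = [
    ("WT", "G", 1),
    ("WT", "G", 2),
    ("WT", "AG", 1),
    ("WT", "AG", 2),
    ("UPC", "G", 1),
    ("UPC", "G", 2),
    ("UPC", "AG", 1),
    ("UPC", "AG", 2),
]

design = [design_row(*s) for s in samples]
print("intercept strainUPC treatmentAG day2")
for sample, row in zip(samples, design):
    print(sample, row)
```

Each coefficient then answers one comparison: the strain column addresses UPC versus WT, and the treatment column addresses AG versus G, both adjusted for the day (batch) effect. A question restricted to one strain, such as G versus AG within WT, would need an interaction term or a subset of the samples, which this additive sketch does not include.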
Goal
Visualization of results - heatmap
[Heatmap: differentially expressed miRNAs with adjusted p < 0.01]
Rows (miRNAs): mir-145, mir-451a, mir-139, mir-9-2, mir-9-1, mir-9-3, mir-592, mir-7-2, mir-7-3, mir-7-1
Columns (samples): expr_tumor_S1_L001_R1_001_1, expr_tumor_S3_L001_R1_001_3, expr_tumor_prosinec_S1_L001_R1_001_1, expr_sliznice_S4_L001_R1_001_4, expr_sliznice_S2_L001_R1_001_2, expr_sliznice_prosinec_S2_L001_R1_001_2