Modern Methods of Genome Analysis
Mgr. Lenka Radová, Ph.D.
Brno, 6.5.2015

Brief workflow
•RNA is isolated from cells, fragmented at random positions, and copied into complementary DNA
(cDNA).
•Fragments meeting a certain size specification (e.g., 200–300 bases long) are retained for
amplification using PCR.
•After amplification, the cDNA is sequenced using NGS; the resulting reads are aligned to a
reference genome, and the number of sequencing reads mapped to each gene in the reference is
tabulated.
•These gene counts, or digital gene expression (DGE) measures, can be transformed and used to test
differential expression
http://www.seqwright.com/images/mRNA%20Seq%20Workflow.jpg

But…
Many steps in the experimental process may introduce errors and biases.


Scales of genome size


QC in Galaxy


FASTQ format
http://drive5.com/usearch/manual/fastq_fig.jpg
•The first line starts with '@', followed by the label
•The second line contains the sequence
•The third line starts with '+'. In some variants, the '+' line contains a second copy of the label
•The fourth line contains the Q scores, one ASCII character per base
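
The record layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function names are hypothetical, and the ASCII offset of 33 is an assumption (it holds for the Sanger / Illumina 1.8+ variant, while older variants use a different offset).

```python
def parse_fastq_record(lines, offset=33):
    """Parse one four-line FASTQ record into (label, sequence, Q scores)."""
    header, sequence, plus, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@"):
        raise ValueError("first line must start with '@'")
    if not plus.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Q = ASCII code - offset (33 for Sanger / Illumina 1.8+).
    q_scores = [ord(ch) - offset for ch in quality]
    return header[1:], sequence, q_scores


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Hypothetical record, used only to exercise the function.
record = ["@read1\n", "ACGT\n", "+\n", "!+5I\n"]
label, sequence, q_scores = parse_fastq_record(record)
print(label, sequence, q_scores)
```

A Q score of 20 corresponds to an error probability of 1 in 100, and a Q score of 40 to 1 in 10,000.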

Q scores of FASTQ
http://drive5.com/usearch/manual/qscores.gif


Quality control tools for NGS data
•FastQC in Galaxy
•QC report in CLC bio

From microarrays to NGS data
•As research transitions from microarrays to sequencing-based approaches, it is essential that we
revisit many of the same concerns that the statistical community had at the beginning of the
microarray era
•A series of articles was published elucidating the need for proper experimental design

Basic biological problems
•Identification of mutations
•    - somatic
•    - germline
•Expression analyses - genes, miRNAs, etc.
An Introduction to Genetic Analysis. 7th edition. Griffiths AJF, Miller JH, Suzuki DT, et al. New
York: W. H. Freeman; 2000

Mutation identification
•Whole exome or whole genome data, ultra-deep sequencing
•Output: VCF format

Mutation identification
•Aim: identification of point mutations
•Application: diagnosis of diseases
•     - inherited (germline, de novo mutations)
•           e.g. familial hypercholesterolemia, hemophilia, cystic fibrosis…
•    - acquired (somatic mutations)
•            e.g. cancer, leukemia, …

Germline mutations
•Comparison with reference genome
•Expected allele frequency: 30-100%
•Software: GATK, VarScan, …
•Usage: e.g. prenatal diagnostics

Somatic mutations
•Comparison tumor-normal (matched, unmatched)
•Expected allele frequency: >0.2%
•Software: MuTect, FreeBayes, deepSNV, …
•Usage: translational research, cancer diagnostics, personalized medicine, …

Expression analyses – RNA-seq
•Characterization of gene expression in cells via measurement of mRNA levels
•Output: expression level table

RNA-seq
•Aim: identification of genes differentially expressed in tissues under different conditions (tumor
vs normal, treated vs untreated, different stages of illness, …)
•Application: translational research, diagnosis of diseases

Expression level in RNA-seq
•The number of reads (counts) mapping to the biological feature of interest (gene, transcript,
exon, etc.) is considered to be linearly related to the abundance of the target feature

What is differential expression?
•A gene is declared differentially expressed if an observed difference or change in read counts
between two experimental conditions is statistically significant, i.e. greater than
what would be expected from natural random variation alone.
•Statistical tools are needed to make such a decision by studying the probability distributions of the counts.

Definitions
•Sequencing depth: Total number of reads mapped to the genome (also called library size).
•Gene length: Number of bases.
•Gene counts: Number of reads mapping to that gene (expression measurement)

Experimental design
•Pairwise comparisons: Only two experimental conditions or groups are compared.
•Multiple comparisons: More than 2 conditions or groups.
•Biological replicates: allow general conclusions to be drawn, from samples to the population.
•Technical replicates: conclusions are only valid for the compared samples.
Replicates

RNA-seq biases
•Influence of sequencing depth: The higher the sequencing depth, the higher the counts.
•Dependence on gene length: Counts are proportional to the transcript length times the mRNA
expression level.
•Differences in the count distributions among samples.

Options
•1. Normalization: Counts are corrected before the analysis in order to minimize these biases.
•2. The statistical model takes the biases into account.

Normalization methods
•RPKM (Mortazavi et al., 2008) = Reads Per Kilobase per Million mapped reads: Counts are divided by the
transcript length (kb) times the total number of mapped reads (in millions).
•Upper-quartile (Bullard et al., 2010): Counts are divided by the upper quartile of counts for
transcripts with at least one read.
•TMM (Robinson and Oshlack, 2010): Trimmed Mean of M values.
•Quantiles, as in microarray normalization (Irizarry et al., 2003).
•FPKM (Trapnell et al., 2010): Instead of counts, the Cufflinks software generates FPKM values
(Fragments Per Kilobase of exon per Million fragments mapped) to estimate gene expression; these
are analogous to RPKM.
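
As a small worked example of the RPKM definition above, the following Python sketch divides a count by the transcript length in kilobases and by the number of mapped reads in millions. The function name and the numbers are illustrative assumptions, not taken from any dataset.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads.

    count / (length_bp / 1e3) / (total_mapped_reads / 1e6)
    is the same as count * 1e9 / (length_bp * total_mapped_reads).
    """
    return count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2000 bp long, library of 10 million mapped reads.
short_gene = rpkm(500, 2000, 10_000_000)

# A gene twice as long with twice the reads has the same RPKM,
# which is the point of the length correction.
long_gene = rpkm(1000, 4000, 10_000_000)

print(short_gene, long_gene)
```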

Differential expression
•Parametric assumptions: Are they fulfilled?
•Need for replicates.
•Difficulty detecting differential expression in genes with low counts.

Goal
•Based on a count table, we want to detect differentially expressed genes between conditions of
interest.
•We assign each gene a p-value (between 0 and 1), which tells us 'how surprised we should be' to see
this difference if we assume there is no true difference.



Algorithms under active development
http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Detecting_differential_expression_by_count_analysis

Intuition - gene
Condition A
sample1   sample2   sample3   sample4   sample5   sample6   sample7   sample8
23171     22903     29227     24072     23151     26336     25252     24122

Condition B
sample9   sample10  sample11  sample12  sample13  sample14  sample15  sample16
19527     26898     18880     24237     26640     22315     20952     25629

Variability A
Variability B
Compare and conclude given a mean level: similar or not?

Intuition
Difference is quantified and used for
p-value computation
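
To make the intuition concrete, the mean level and the variability of the example gene can be computed per condition. The sketch below uses the sample variance as a stand-in for "variability"; that choice is an assumption for illustration, and an actual test would go on to turn these quantities into a p-value under a count model.

```python
from statistics import mean, variance

# Counts of the example gene in the two conditions (from the slide).
condition_a = [23171, 22903, 29227, 24072, 23151, 26336, 25252, 24122]
condition_b = [19527, 26898, 18880, 24237, 26640, 22315, 20952, 25629]

mean_a, mean_b = mean(condition_a), mean(condition_b)
# Sample variance is used here as a simple measure of variability.
var_a, var_b = variance(condition_a), variance(condition_b)

print("mean A:", mean_a, "variance A:", round(var_a))
print("mean B:", mean_b, "variance B:", round(var_b))
print("difference of means:", mean_a - mean_b)
```

The difference between the two means only becomes meaningful once it is compared with the spread within each condition.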

Dispersion estimation
•For every gene, a negative binomial (NB) model is fitted to the counts. The most important parameter
to estimate in that model is the dispersion.
•DESeq2 estimates dispersion in 3 steps:
•     1. Estimates a dispersion parameter for each gene
•     2. Fits a curve through these estimates
•     3. Adjusts each gene's dispersion parameter towards the curve ('shrinking')
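
For orientation, the dispersion α links the NB variance to the mean through variance = μ + α·μ². The Python sketch below inverts this relation with a simple method-of-moments estimate for a single gene. This is a deliberately simplified stand-in for step 1 only, not the procedure DESeq2 uses, and the function name and counts are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion from var = mu + alpha * mu**2.

    Simplified illustration only; negative estimates are clamped to 0,
    which corresponds to no extra variation beyond Poisson noise.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance
    return max((var - mu) / mu ** 2, 0.0)


# Hypothetical counts for one gene across three replicates.
alpha = moment_dispersion([10, 20, 30])
print(alpha)
```

With only a few replicates, such per-gene estimates are very noisy, which is the motivation for fitting a curve and shrinking the estimates towards it.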

Dispersion estimation
•Black dots = estimates from the data
•Red line = curve fitted
•Blue dots = final assigned dispersion parameter for that gene
•
•Model is fitted

Test runs between 2 conditions
•For each gene, two NB models (one for each condition) are fitted, and a Wald test decides whether the
difference is significant (red in plot).

Test runs between 2 conditions
•i.e. we are going to perform thousands of tests…
•(if we set a cut-off of 0.05 on the p-value and perform 20000 tests, about 1000 genes are expected
to appear significant by chance alone)
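
The figure of 1000 genes can be checked with a short simulation. The sketch assumes that no gene is truly differentially expressed, so that each p-value is uniformly distributed between 0 and 1; the seed and variable names are arbitrary choices.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only

n_tests = 20000
cutoff = 0.05

# Under the null hypothesis, p-values are uniform on [0, 1].
p_values = [random.random() for _ in range(n_tests)]
false_positives = sum(p < cutoff for p in p_values)

expected = n_tests * cutoff
print("expected by chance:", expected)
print("observed in this simulation:", false_positives)
```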

Check the distribution of p-values
•If the histogram of the p-values does not match the profile shown here, the test is not reliable.
Perhaps the NB fitting step did not succeed, or confounding variables are present.

Improve test results
[Figure labels: cut-off at 0.05; false positive fraction; correctly identified as DE]

Improve test results
•Avoid unnecessary tests: apply a filter before testing (independent filtering)
•Apply multiple testing correction

Multiple testing corrections
•Bonferroni correction controls the family-wise error rate (FWER); Benjamini-Hochberg correction
controls the false discovery rate (FDR).
•FDR is the fraction of false positives among the genes that are classified as DE.
•In the example shown, with a threshold α of 0.05, 20% of the genes called DE are false positives.
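
A minimal pure-Python sketch of the Benjamini-Hochberg adjustment is given below. The function name and the example p-values are illustrative assumptions; in practice an established statistics package would normally be used instead.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
print(adjusted)

# Genes called DE at an FDR threshold of 0.05.
called_de = [i for i, q in enumerate(adjusted) if q < 0.05]
```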

Why apply multiple testing correction?
•Consider a case where you have 20 hypotheses to test, and a significance level of 0.05.
•What is the probability of observing at least one significant result just due to chance?
•P(at least one significant result) = 1 - P(no significant results)
•= 1 - (1 - 0.05)^20 ≈ 0.64
•So, with 20 independent tests, we have a 64% chance of observing at least one significant
result, even if every null hypothesis is actually true.
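
The calculation can be reproduced, and extended to other numbers of tests, with a few lines of Python. The sketch assumes independent tests; the function name is hypothetical.

```python
def family_wise_error_rate(n_tests, alpha):
    """P(at least one false positive) for independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


uncorrected = family_wise_error_rate(20, 0.05)

# Bonferroni: test each hypothesis at alpha / n_tests instead.
bonferroni = family_wise_error_rate(20, 0.05 / 20)

print(round(uncorrected, 4))
print(round(bonferroni, 4))
```

With the Bonferroni-corrected per-test level, the probability of at least one false positive stays just below 0.05.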

Including different factors
Strain: WT, Mutant (UPC)
Treatment: G, AG
Additional metadata (batch factor): Day 1, Day 2

Including different factors
Which genes are DE between UPC and WT?
Which genes are DE between G and AG?
Which genes are DE in WT between G and AG?

Statistical model
•Gene = strain + treatment + day
•Export results for unique comparisons
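
One way to picture the model "strain + treatment + day" is as a design matrix with one row per sample and one indicator column per factor. The sketch below builds such a matrix in plain Python. The sample names and their assignment to strain, treatment and day are entirely hypothetical, since the actual sample layout is not given here.

```python
# Hypothetical samples; names and factor assignments are assumptions.
samples = [
    {"name": "s1", "strain": "WT", "treatment": "G", "day": 1},
    {"name": "s2", "strain": "WT", "treatment": "AG", "day": 2},
    {"name": "s3", "strain": "UPC", "treatment": "G", "day": 2},
    {"name": "s4", "strain": "UPC", "treatment": "AG", "day": 1},
]

columns = ["intercept", "strain_UPC", "treatment_AG", "day_2"]


def design_row(sample):
    """Indicator coding with WT, G and Day 1 as reference levels."""
    return [
        1,
        int(sample["strain"] == "UPC"),
        int(sample["treatment"] == "AG"),
        int(sample["day"] == 2),
    ]


design = [design_row(s) for s in samples]

print(columns)
for sample, row in zip(samples, design):
    print(sample["name"], row)
```

Each coefficient of the fitted model then answers one comparison, for example the strain_UPC column corresponds to the difference between UPC and WT after accounting for treatment and day.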



Visualization of results - heatmap
Differentially expressed miRNAs with adjusted p < 0.01
[Heatmap axis label: samples]