Vojta Bystry
vojtech.bystry@ceitec.muni.cz
Modern methods for genome analysis
(PřF:Bi7420)
Lecture 6 : RNA-seq
differential expression
NGS data analysis
[Workflow diagram]
● Experiment design
● Raw data (.fastq): de-multiplexing, QC
● Reference mapping to Genome/Transcriptome (.bam): QC
● Analysis:
○ Interaction analysis: ChIP-seq
○ Expression analysis: RNA-seq
○ Variant analysis: WES
● Not known reference:
○ Metagenomics
○ Reference assembly
● Not “classic” reference:
○ Immunogenetic: VDJ-genes
○ CRISPR: sgRNA
○ Methylation: Bisulfite-seq…
Feature count results
Differential expression
● We have our raw read counts, but we need to find the real differences
● We want to quantify the change between before and after treatment
● Which genes changed? Are there any at all? Is there even a difference
between the samples? And what about the experimental design: if the samples
are paired, does it affect the evaluation?
● Tools for differential expression have to account for different library
depths, model and “fix” outliers, account for different expression levels, and
many other things
● Luckily, there are a few tools that handle all of this
Differential expression - tools
● DESeq2
○ More specific
● edgeR
○ More sensitive
● The important part of the calculation is the design
○ Assignment of a group/condition to a sample
○ If the samples are paired (the same patient twice) we have to account for this as well!
○ Technically, the pairing of the samples is a batch effect, so it is similar to having technical
noise in your data
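A minimal sketch of what "the design" means for paired samples. DESeq2 and edgeR are R packages and this is not their API; the Python below only illustrates how a design of the form "patient + condition" turns into a matrix with one column per patient (except the first, which is the baseline) and one column for the condition. The sample names, the `pre`/`post` labels and the function `design_matrix` are hypothetical.

```python
# Sketch only: builds a 0/1 design matrix for "patient + condition".
# Sample names and labels below are made up for illustration.

def design_matrix(samples):
    """Return (column names, rows) for an intercept + patient + condition design."""
    patients = sorted({s["patient"] for s in samples})
    # The first patient is the baseline and gets no column of its own.
    columns = ["intercept"] + [f"patient_{p}" for p in patients[1:]] + ["condition_post"]
    rows = []
    for s in samples:
        row = [1]  # intercept
        row += [1 if s["patient"] == p else 0 for p in patients[1:]]
        row.append(1 if s["condition"] == "post" else 0)
        rows.append(row)
    return columns, rows

samples = [
    {"name": "P1_pre", "patient": "P1", "condition": "pre"},
    {"name": "P1_post", "patient": "P1", "condition": "post"},
    {"name": "P2_pre", "patient": "P2", "condition": "pre"},
    {"name": "P2_post", "patient": "P2", "condition": "post"},
    {"name": "P3_pre", "patient": "P3", "condition": "pre"},
    {"name": "P3_post", "patient": "P3", "condition": "post"},
]

columns, rows = design_matrix(samples)
print(columns)
for sample, row in zip(samples, rows):
    print(sample["name"], row)
```

The patient columns absorb the differences between patients, so the `condition_post` column is left to describe the pre/post change within each patient.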
Pairing of the samples/batch effect
● Paired samples are not the same as paired-end sequencing!
● There is a bad experimental design and a good experimental design
● Very simply - more randomization gives you better results
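One possible way to randomize, sketched under assumptions: samples are shuffled within each condition and then dealt across sequencing batches, so that no batch contains only one condition. The sample names, the two conditions and the function `assign_batches` are hypothetical; this is an illustration of the idea, not a prescribed protocol.

```python
import random

def assign_batches(samples, n_batches, seed=0):
    """Shuffle samples within each condition, then deal them across batches."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    by_condition = {}
    for name, condition in samples:
        by_condition.setdefault(condition, []).append(name)
    batches = [[] for _ in range(n_batches)]
    for condition, names in by_condition.items():
        rng.shuffle(names)
        for i, name in enumerate(names):
            batches[i % n_batches].append((name, condition))
    return batches

# Hypothetical experiment: 4 control and 4 treated samples, 2 sequencing batches.
samples = [(f"ctrl_{i}", "control") for i in range(1, 5)]
samples += [(f"trt_{i}", "treated") for i in range(1, 5)]

batches = assign_batches(samples, n_batches=2)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}:", batch)
```

A bad design would put all controls in one batch and all treated samples in the other; then the batch and the condition cannot be told apart.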
Pairing of the samples/batch effect
● An example: pairing of the patients AND different sequencing years - a double
batch effect
Pairing of the samples/batch effect
Differential expression results
Count normalisation
● Normalize to:
○ Gene size
○ Library size
● RPKM - Reads Per Kilobase of transcript per Million mapped reads
● FPKM - Fragments Per Kilobase of transcript per Million mapped reads
● TPM - Transcripts Per Million
○ for every 1,000,000 RNA molecules in the RNA-seq sample, x came from this
gene/transcript
● Never ever use normalized counts for any comparisons
○ ...except when comparing a single gene across the samples of a single experiment
○ If you really, really need to use some kind of normalized counts for a comparison, use TPM
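The definitions above can be sketched in a few lines. This is a simplified illustration, assuming that "mapped reads" equals the sum of the counts in the table and that each gene has a single fixed length; the gene names, counts and lengths are made up.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())  # assumption: all counted reads = mapped reads
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total_reads) for g in counts}

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1,000,000."""
    per_kb = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    total = sum(per_kb.values())
    return {g: per_kb[g] * 1e6 / total for g in counts}

# Made-up example: geneB has twice the reads of geneA but is twice as long.
counts = {"geneA": 100, "geneB": 200, "geneC": 700}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 1000}

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(rpkm_values)
print(tpm_values)
print(sum(tpm_values.values()))  # TPM values of one sample add up to one million
```

TPM values always sum to 1,000,000 within a sample, while RPKM sums differ from sample to sample, which is one reason TPM is the less bad choice when a comparison cannot be avoided.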
log2(fold-change)
● Fold-change is usually calculated by average expression of all samples of condition 1
vs average expression of all samples of condition 2
● Example:
a) geneA expression in pre is 5, in post is 10; fold-change of post/pre is 2 = gene is up-regulated 2x
b) geneB expression in pre is 10, in post is 5; fold-change of post/pre is 0.5 = gene is down-regulated 1/2x … (O_o)
● Solution: taking log2 gives us log2(2) = 1, log2(0.5) = -1
● Nice and even distribution around 0 and a clear interpretation
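The example above as a small sketch: average the samples of each condition, divide, take log2. The replicate values are made up so that the means match the example (5 and 10), and `log2_fold_change` is a hypothetical helper. Real tools work on normalized counts and handle zero expression; this sketch does neither.

```python
import math

def log2_fold_change(condition1, condition2):
    """log2 of (mean of condition 2 / mean of condition 1)."""
    mean1 = sum(condition1) / len(condition1)
    mean2 = sum(condition2) / len(condition2)
    return math.log2(mean2 / mean1)

# Made-up replicates whose means are 5 (pre) and 10 (post), and the reverse.
gene_a = log2_fold_change([4, 5, 6], [9, 10, 11])   # pre -> post, up-regulated
gene_b = log2_fold_change([9, 10, 11], [4, 5, 6])   # pre -> post, down-regulated

print("geneA log2FC:", gene_a)
print("geneB log2FC:", gene_b)
```

A doubling and a halving now have the same magnitude with opposite signs.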
log2(fold-change)
● But it might be misleading
● A large log2FC on a low-expressed gene is most likely not biologically
relevant
● A small log2FC on a highly-expressed gene might be biologically relevant
● Example: “Common” cut-off value of fold-change of 2x (log2FC=+/-1) or 1.5x
(log2FC=+/-0.58)
○ geneA expression in WT is 10 and in KO is 4, log2FC = -1.32 YES (?)
○ geneB expression in WT is 1,000,000 and in KO is 500,001, log2FC = -0.999997 (just short of -1) NO (?)
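The two genes from the example, checked against the 2x cut-off in a short sketch. The function `passes_cutoff` is a hypothetical helper; the expression values are the ones from the example.

```python
import math

def passes_cutoff(wt, ko, min_abs_log2fc=1.0):
    """Return (log2FC of KO/WT, whether |log2FC| reaches the cut-off)."""
    log2fc = math.log2(ko / wt)
    return log2fc, abs(log2fc) >= min_abs_log2fc

gene_a = passes_cutoff(10, 4)
gene_b = passes_cutoff(1_000_000, 500_001)

print("geneA:", round(gene_a[0], 2), "passes:", gene_a[1], "absolute change:", 10 - 4)
print("geneB:", round(gene_b[0], 6), "passes:", gene_b[1], "absolute change:", 1_000_000 - 500_001)
```

geneA passes with an absolute change of 6, while geneB fails with an absolute change of 499,999, which is why a log2FC cut-off alone can be misleading.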
P-value and adjusted p-value
● P-value tries to give you “a number” saying whether the differences you are
observing between the compared conditions/samples are robust and not just
“random”
● Adjusted p-value corrects for multiple testing: we test thousands of genes,
so some of them get a small p-value just by accident
● But is adjusted p-value 0.049 really better than 0.051?
● Number of replicates highly influences the estimates
○ The observations might be the same, but with fewer replicates the statistical significance might be lower
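A sketch of one common multiple-testing correction, the Benjamini-Hochberg procedure. The slides do not say which correction is meant, so the choice of method here is an assumption; the p-values are made up and `bh_adjust` is a hypothetical helper.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * n / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted

# Made-up p-values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.005]
padj = bh_adjust(pvalues)
for p, q in zip(pvalues, padj):
    print(f"p = {p:<6} adjusted = {q:.3f}")
```

Every adjusted value is at least as large as the raw p-value, and the more tests are run, the stronger the correction becomes.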
How many differentially expressed genes do I have?
It depends on how many you want…:)
Selection of the differentially expressed (DE) genes is completely up to you
Some people use the p-value, some the adjusted p-value, some the log2FC
or their combinations, and some just take the top n genes
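To see how much the answer depends on the selection rule, here is a small sketch that applies several rules to the same table. The genes and all values are made up for illustration, and the thresholds and the function `select_de` are hypothetical choices, not recommendations.

```python
def select_de(results, max_padj=None, min_abs_log2fc=None, top_n=None):
    """Filter a result table by optional thresholds, sorted by adjusted p-value."""
    selected = [
        r for r in results
        if (max_padj is None or r["padj"] < max_padj)
        and (min_abs_log2fc is None or abs(r["log2fc"]) >= min_abs_log2fc)
    ]
    selected.sort(key=lambda r: r["padj"])
    return selected if top_n is None else selected[:top_n]

# Made-up differential expression results.
results = [
    {"gene": "g1", "log2fc": 2.5, "padj": 0.001},
    {"gene": "g2", "log2fc": 0.4, "padj": 0.0004},
    {"gene": "g3", "log2fc": -1.8, "padj": 0.049},
    {"gene": "g4", "log2fc": -1.2, "padj": 0.051},
    {"gene": "g5", "log2fc": 0.1, "padj": 0.8},
]

by_padj = select_de(results, max_padj=0.05)
by_log2fc = select_de(results, min_abs_log2fc=1.0)
by_both = select_de(results, max_padj=0.05, min_abs_log2fc=1.0)
top_two = select_de(results, top_n=2)

for label, genes in [("padj", by_padj), ("log2FC", by_log2fc),
                     ("both", by_both), ("top 2", top_two)]:
    print(label, [r["gene"] for r in genes])
```

The same table gives a different list of DE genes under each rule, so the rule should be chosen and reported deliberately.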
Statistical significance ≠ biological relevance!!!
Scientists rise up against statistical significance, Nature 567, 305-307 (2019), doi: 10.1038/d41586-019-00857-9
P-value significance
● Example
Differential expression output
www.ceitec.eu
CEITEC
@CEITEC_Brno
Thank you for your attention!