Vojta Bystry
vojtech.bystry@ceitec.muni.cz
Modern methods for genome analysis
(PřF:Bi7420)
Lecture 6 : RNA-seq
differential expression
NGS data analysis
22
Raw data
.fastq
Genome/Transcriptome
Reference Mapping
.bam
Interaction
analysis
CHIP-seq
Expression
analysis
RNAseq
Variant
analysis
WES
de-multiplexing
Not known
reference
QC
QC
Experiment
design
Not ”classic”
reference
Metagenomics
Reference
assembly
Immunogenetic
VDJ-genes
CRISPR
sgRNA
Methylation
Bisulfide-seq…
Why RNA-seq?
The goal of RNA-seq is often to perform differential expression testing to
determine which genes are expressed at different levels between conditions.
These genes can offer biological insight into the processes affected by the
condition(s) of interest.
Great resource that has made the bulk of this talk:
https://github.com/hbctraining/DGE_workshop/blob/master/lessons/01_DGE_setup_and_overview.md
Lean tutorial for mostly wetlab humans:
https://git.embl.de/provazni/rna-seq-tutorial/-/tree/EMBLVM?ref_type=heads
RNA-seq workﬂow
Read Alignment
https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html
Many different aligners/pseudo-aligners:
• Eland, Maq, Bowtie, Bowtie2, BWA, SOAP, SSAHA, TopHat, SpliceMap, Novoalign, STAR, GSNAP …
• Kallisto, Salmon
Main differences:
• Publication year, maturity, development after publication, popularity
• usage of base-call qualities, calculation of mapping qualities
• speed-vs-sensitivity trade-off
• suitability for RNA-Seq (“spliced alignment”)
• suitability for special tasks
Expected alignment percentage:
• 70% to 90% on genome
• slightly lower on transcriptome
Genome or annotated transcriptome
https://en.wikipedia.org/wiki/RNA-Seq
Read Alignment
Alignment
● Mapping to genome or transcriptome?
● Genome
○ Requires spliced alignment
○ Can find novel genes/isoforms/exons
○ Information about whole genome/transcriptome
● Transcriptome
○ No spliced alignments necessary
○ Many reads will map to multiple transcripts (shared exons)
○ Cannot find anything new
○ Difficult to determine origin of reads (multiple copies of transcripts)
Duplication removal - UMI
8
• PCR duplicates
• Optical duplicates
• How the tools recognize duplicates
‒ Maps to the exact same place
• Problem is it could be identical fragment not PCR duplicate
• UMI helps
‒ Maps to the exact same place
‒ AND have identical UMI sequence
Post-alignment QC
● Number of mapped reads - unique + multi mapped
● Mapped locations – intron, exon, intergenic
● Duplication rates
● Library strand specificity
● Captured biotypes
● Contamination (rRNA, non-self)
● 5’ to 3’ end coverage bias
Post-alignment QC - Tools
● Aligner report
○ STAR – most direct assesment
● General QC tools
○ RSeQC
○ Picard
○ Qualimap
● Feature counting tools
○ featureCounts
○ RSEM
● Non-aligment tools
○ FastQ screen
○ Biobloom
Note: Gene body coverage
● Often, libraries with high fragmentation (and low RIN numbers) combined with
polyA selection might have strong 3’ end bias
○ This is a result of polyA “pulled” fragments
● Some kits, however, target only the polyA tail or sequences close to it
○ An example is Lexogen QuantSeq which sequences only one read per mRNA molecule
close to polyA tail
Note: Gene body coverage II
Examples
Source: Sigurgeirsson et al. PLoS ONE 2014
Feature counting
● Now, when we know our alignments are solid we need to get the number of
reads mapped to a gene (or other feature)
○ From there, we can calculate the differential expression
● The question is, how do we summarize the counts
○ Do we want only uniquely mapped reads
○ Do we want also multi mapped? And how do we assign them? All? One random?
Somehow else?
○ And what if we have multiple genes which overlap each other?
Strand specific library
● We can basically have three strand specificities
○ Non stranded/Unstranded - not very common anymore
■ Direction of the read mapping is completely random (50/50)
○ Forward (sense) stranded - common for target kits and “bacterial kits”
■ Direction of the read mapping is the same as the gene it originates from
○ Reverse (antisense) stranded - “default” for Illumina and NEB kits
■ Direction of the read mapping is the opposite as the gene it originates from
● In case of paired-end sequencing it’s measure by the first (R1) read orientation
(FR, RF)
Gene Counts
Tools: HTSeq-count, featureCounts,
salmon, kallisto, etc.
Be careful about:
● Correct annotation
● What you do with multimappers
16
Feature count results
Post-alignment QC - example
DE analysis tools
● DESeq2
(https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)
● edgeR (https://bioconductor.org/packages/release/bioc/html/edgeR.html)
Normalization
Main factors we need to consider:
● Sequencing depth
● Gene length
● Difference in RNA composition
Normalization - sequencing depth
Images on this and following slides: https://github.com/hbctraining/DGE_workshop
Normalization - gene length
Normalization - RNA composition
Normalization methodsNormalization method Description Accounted factors Recommendations for use
CPM (counts per million) counts scaled by total number of reads sequencing depth gene count comparisons between replicates of
the same sample group; NOT for within
sample comparisons or DE analysis
TPM (transcripts per kilobase million) counts per length of transcript (kb) per
million reads mapped
sequencing depth
and gene length
gene count comparisons within a sample or
between samples of the same sample group;
NOT for DE analysis
RPKM/FPKM (reads/fragments per kilobase of
exon per million reads/fragments mapped)
similar to TPM sequencing depth
and gene length
gene count comparisons between genes within
a sample; NOT for between sample
comparisons or DE analysis
DESeq2's median of ratios [1] counts divided by sample-specific size
factors determined by median ratio of
gene counts relative to geometric mean
per gene
sequencing depth
and RNA
composition
gene count comparisons between samples and
for DE analysis; NOT for within sample
comparisons
EdgeR's trimmed mean of M values (TMM) [2] uses a weighted trimmed mean of the
log expression ratios between samples
sequencing depth,
RNA composition
gene count comparisons between samples and
for DE analysis; NOT for within sample
comparisons
PCA
https://hbctraining.github.io/DGE_workshop/lessons/principal_component_analysis.html
PCA
https://hbctraining.github.io/DGE_workshop/lessons/principal_component_analysis.html
PCA
https://hbctraining.github.io/DGE_workshop/lessons/principal_component_analysis.html
Sample1 PC1 score = (read count * influence) + ... for all genes
PCA
https://hbctraining.github.io/DGE_workshop/lessons/principal_component_analysis.html
PCA
PCA real life examples
PCA real life examples
PCA real life examples
● There is a bad
experimental design
and a good
experimental design
● Very simply - more
randomization gives
you better results
Pairing of the samples/batch effect
● And example pairing of the patients AND different sequencing years - double
batch
Pairing of the samples/batch effect
Pairing of the samples/batch effect
● Paired samples are not the same as paired-end sequencing!
DE analysis - proper
DE analysis - proper
We need for DEseq2:
● Table describing all the samples
● Table with raw counts
Experimental design
● Biological replicates represent multiple samples from the same sample group
● Technical replicates represent the same sample (i.e. RNA from the same mouse)
but with technical steps replicated, or the same sample sequenced multiple times
● Usually biological variance is much greater than technical variance, so we do
not need to account for technical variance to identify biological differences in
expression
● Don't spend money on technical replicates - biological replicates are much
more useful
Experimental design
Experimental design - confounding variables
A confounded RNA-Seq experiment is one where you cannot distinguish
the separate effects of two different sources of variation in the data.
● Do not try to save money by “pooling” variables together!
● Avoid introducing variables
● Keep metadata
DE analysis - proper
DE analysis - result table
log2(fold-change)
● Fold-change is usually calculated by average expression of all samples of condition 1
vs average expression of all samples of condition 2
● Example:
a) geneA expression in pre is 5, in post is 10; fold-change of post/pre is 2 = gene is up-regulated 2x
b) geneB expression in pre is 10, in post is 5; fold-change of post/pre is 0.5 = gene is down-regulated
1/2x … (O_o)
● Solution: Adding log2 gives us log2(2) = 1, log2(0.5) = -1
● Nice and even distribution around 0 and clear interpretations
P-value and adjusted p-value
● P-value tries to give you “a number” saying if the differences you are
observing are robust and the differences are not “random” between the
compared conditions/samples
● Adjusted p-value adds a correction for the multiple testing we are doing tries
to add correction of getting a p-value just by accident
● But is adjusted p-value 0.049 really better than 0.051?
● Number of replicates highly influences the estimates
○ The observations might be the same but the statistical significance might be lower
How many differentially expressed genes I have?
It depends how many you want…:)
Selection of the differentially expressed (DE) gene is completely up to you
Some people use p-value, some adjusted p-value and some people log2fc
and their combinations, some just take top n genes
Statistical significance ≠ biological relevance!!!
Scientists rise up against statistical significance, Nature 567, 305-307 (2019), doi:
10.1038/d41586-019-00857-9
P-value significance
Visualisation - MA plot
Visualisation - volcano plots
Amazing interactive tool:
Glimma -
https://bioconductor.org/p
ackages/release/bioc/htm
l/Glimma.html
Visualisation - heatmaps
Main takeaway points
● Have a question that can be answered with RNA-seq.
● Avoid confounding variables.
● Have as many biological replicates as you can.
● Don’t bother with technical replicates.
● Rather do more replicates than deeper sequencing. *
● Consult your design with a sequencing facility / bioinformatician before you start.
● Keep metadata, even if it feels crazy.
● Try to make a pilot study before you commit more resources. (Student’s mental
health is a resource too!)
*Unless you want some really low expressed genes, then you probably
need both replicates and depth
50www.ceitec.eu
CEITEC
@CEITEC_Brno
Vojta Bystry
vojtech.bystry@ceitec.muni.cz
Thank you for your attention!