Vojta Bystry
vojtech.bystry@ceitec.muni.cz
Modern methods for genome analysis
(PřF:Bi7420)
Lecture 5 : RNA-seq primary
analysis
NGS data analysis
22
Raw data
.fastq
Genome/Transcriptome
Reference Mapping
.bam
Interaction
analysis
CHIP-seq
Expression
analysis
RNAseq
Variant
analysis
WES
de-multiplexing
Not known
reference
QC
QC
Experiment
design
Not ”classic”
reference
Metagenomics
Reference
assembly
Immunogenetic
VDJ-genes
CRISPR
sgRNA
Methylation
Bisulfide-seq…
The RNA-Seq workflow
Alignment
● Mapping to genome or transcriptome?
● Genome
○ Requires spliced alignment
○ Can find novel genes/isoforms/exons
○ Information about whole genome/transcriptome
● Transcriptome
○ No spliced alignments necessary
○ Many reads will map to multiple transcripts (shared exons)
○ Cannot find anything new
○ Difficult to determine origin of reads (multiple copies of transcripts)
Alignment
● Our choice is the STAR aligner
● It performs genome alignment
● Offers a lot of settings to support splicing, soft-clipping, chimeric alignments, ...
● Other techniques (Salmon or Kallisto) do not use alignment per se and can
give you the gene count information right away
○ They use only transcriptome as a reference and are very quick
○ Drawback is you see only what’s in the transcriptome and nothing else
Duplication removal - UMI
6
• PCR duplicates
• Optical duplicates
• How the tools recognize duplicates
‒ Maps to the exact same place
• Problem is it could be identical fragment not PCR duplicate
• UMI helps
‒ Maps to the exact same place
‒ AND have identical UMI sequence
Post-alignment QC
● Number of mapped reads - unique + multi mapped
● Mapped locations – intron, exon, intergenic
● Duplication rates
● Library strand specificity
● Captured biotypes
● Contamination (rRNA, non-self)
● 5’ to 3’ end coverage bias
Post-alignment QC - Tools
● Aligner report
○ STAR – most direct assesment
● General QC tools
○ RSeQC
○ Picard
○ Qualimap
● Feature counting tools
○ featureCounts
○ RSEM
● Non-aligment tools
○ FastQ screen
○ Biobloom
Note: Gene body coverage
● Often, libraries with high fragmentation (and low RIN numbers) combined with
polyA selection might have strong 3’ end bias
○ This is a result of polyA “pulled” fragments
● Some kits, however, target only the polyA tail or sequences close to it
○ An example is Lexogen QuantSeq which sequences only one read per mRNA molecule
close to polyA tail
Note: Gene body coverage II
Examples
Source: Sigurgeirsson et al. PLoS ONE 2014
Feature counting
● Now, when we know our alignments are solid we need to get the number of
reads mapped to a gene (or other feature)
○ From there, we can calculate the differential expression
● The question is, how do we summarize the counts
○ Do we want only uniquely mapped reads
○ Do we want also multi mapped? And how do we assign them? All? One random?
Somehow else?
○ And what if we have multiple genes which overlap each other?
Strand specific library
● We can basically have three strand specificities
○ Non stranded/Unstranded - not very common anymore
■ Direction of the read mapping is completely random (50/50)
○ Forward (sense) stranded - common for target kits and “bacterial kits”
■ Direction of the read mapping is the same as the gene it originates from
○ Reverse (antisense) stranded - “default” for Illumina and NEB kits
■ Direction of the read mapping is the opposite as the gene it originates from
● In case of paired-end sequencing it’s measure by the first (R1) read orientation
(FR, RF)
Feature counting
● The regular settings are - summarize reads mapping to exons (-t exon) and
sum them up to gene id (-g gene_id)
● Other possibilities:
○ Count per exons
○ Include introns
○ …
Gene counts - Tools
● featureCounts is build around the “classic” read to gene assignment
○ By default, assigns only uniquely mapped reads an only reads uniquely assignable to a single gene (but
both can be changed)
○ Gives you raw read counts per gene
● RSEM is efficient in counting also multi mapped reads and can estimate
expression of individual gene isoforms
○ Tries to “weight” the probability a mapped position of a multi mapped read and assign it correctly to the
real source
○ Gives you estimated counts per gene as well as per isoform and normalized TPM = Transcripts per
million transcripts
● But, there is a big differences in the minimal required “good” aligned reads
Minimal number of reads and expression I
● RSEM is less precise in low read counts (<40-50M reads) and for low expressed
RNAs (difficult to estimate)
● For lower read counts it’s safer to go for featureCounts
● Our best practices for a minimal read count for each tools:
○ Less than 40-50M aligned reads (to the good stuff) -> featureCounts
○ More than 40-50M aligned reads (to the good stuff) -> RSEM
● But if you want isoforms!!! -> RSEM
Source: Roberts et al. Nature Methods 2013
16
Feature count results
Post-alignment QC - example
18www.ceitec.eu
CEITEC
@CEITEC_Brno
Vojta Bystry
vojtech.bystry@ceitec.muni.cz
Thank you for your attention!