Vojta Bystry vojtech.bystry@ceitec.muni.cz Modern methods for genome analysis (PřF:Bi7420) Lecture 6 : RNA-seq differential expression NGS data analysis 22 Raw data .fastq Genome/Transcriptome Reference Mapping .bam Interaction analysis CHIP-seq Expression analysis RNAseq Variant analysis WES de-multiplexing Not known reference QC QC Experiment design Not ”classic” reference Metagenomics Reference assembly Immunogenetic VDJ-genes CRISPR sgRNA Methylation Bisulfide-seq… Why RNA-seq? The goal of RNA-seq is often to perform differential expression testing to determine which genes are expressed at different levels between conditions. These genes can offer biological insight into the processes affected by the condition(s) of interest. Great resource that has made the bulk of this talk: https://github.com/hbctraining/DGE_workshop/blob/master/lessons/01_DGE_setup_and_overview.md Lean tutorial for mostly wetlab humans: https://git.embl.de/provazni/rna-seq-tutorial/-/tree/EMBLVM?ref_type=heads RNA-seq workflow Read Alignment https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html Many different aligners/pseudo-aligners: • Eland, Maq, Bowtie, Bowtie2, BWA, SOAP, SSAHA, TopHat, SpliceMap, Novoalign, STAR, GSNAP … • Kallisto, Salmon Main differences: • Publication year, maturity, development after publication, popularity • usage of base-call qualities, calculation of mapping qualities • speed-vs-sensitivity trade-off • suitability for RNA-Seq (“spliced alignment”) • suitability for special tasks Expected alignment percentage: • 70% to 90% on genome • slightly lower on transcriptome Genome or annotated transcriptome https://en.wikipedia.org/wiki/RNA-Seq Read Alignment Alignment ● Mapping to genome or transcriptome? ● Genome ○ Requires spliced alignment ○ Can find novel genes/isoforms/exons ○ Information about whole genome/transcriptome ● Transcriptome ○ No spliced alignments necessary ○ Many reads will map to multiple transcripts (shared exons) ○ Cannot find anything new ○ Difficult to determine origin of reads (multiple copies of transcripts) Duplication removal - UMI 8 • PCR duplicates • Optical duplicates • How the tools recognize duplicates ‒ Maps to the exact same place • Problem is it could be identical fragment not PCR duplicate • UMI helps ‒ Maps to the exact same place ‒ AND have identical UMI sequence Post-alignment QC ● Number of mapped reads - unique + multi mapped ● Mapped locations – intron, exon, intergenic ● Duplication rates ● Library strand specificity ● Captured biotypes ● Contamination (rRNA, non-self) ● 5’ to 3’ end coverage bias Post-alignment QC - Tools ● Aligner report ○ STAR – most direct assesment ● General QC tools ○ RSeQC ○ Picard ○ Qualimap ● Feature counting tools ○ featureCounts ○ RSEM ● Non-aligment tools ○ FastQ screen ○ Biobloom Note: Gene body coverage ● Often, libraries with high fragmentation (and low RIN numbers) combined with polyA selection might have strong 3’ end bias ○ This is a result of polyA “pulled” fragments ● Some kits, however, target only the polyA tail or sequences close to it ○ An example is Lexogen QuantSeq which sequences only one read per mRNA molecule close to polyA tail Note: Gene body coverage II Examples Source: Sigurgeirsson et al. PLoS ONE 2014 Feature counting ● Now, when we know our alignments are solid we need to get the number of reads mapped to a gene (or other feature) ○ From there, we can calculate the differential expression ● The question is, how do we summarize the counts ○ Do we want only uniquely mapped reads ○ Do we want also multi mapped? And how do we assign them? All? One random? Somehow else? ○ And what if we have multiple genes which overlap each other? Strand specific library ● We can basically have three strand specificities ○ Non stranded/Unstranded - not very common anymore ■ Direction of the read mapping is completely random (50/50) ○ Forward (sense) stranded - common for target kits and “bacterial kits” ■ Direction of the read mapping is the same as the gene it originates from ○ Reverse (antisense) stranded - “default” for Illumina and NEB kits ■ Direction of the read mapping is the opposite as the gene it originates from ● In case of paired-end sequencing it’s measure by the first (R1) read orientation (FR, RF) Gene Counts Tools: HTSeq-count, featureCounts, salmon, kallisto, etc. Be careful about: ● Correct annotation ● What you do with multimappers 16 Feature count results Post-alignment QC - example DE analysis tools ● DESeq2 (https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) ● edgeR (https://bioconductor.org/packages/release/bioc/html/edgeR.html) Normalization Main factors we need to consider: ● Sequencing depth ● Gene length ● Difference in RNA composition Normalization - sequencing depth Images on this and following slides: https://github.com/hbctraining/DGE_workshop Normalization - gene length Normalization - RNA composition Normalization methodsNormalization method Description Accounted factors Recommendations for use CPM (counts per million) counts scaled by total number of reads sequencing depth gene count comparisons between replicates of the same sample group; NOT for within sample comparisons or DE analysis TPM (transcripts per kilobase million) counts per length of transcript (kb) per million reads mapped sequencing depth and gene length gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped) similar to TPM sequencing depth and gene length gene count comparisons between genes within a sample; NOT for between sample comparisons or DE analysis DESeq2's median of ratios [1] counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene sequencing depth and RNA composition gene count comparisons between samples and for DE analysis; NOT for within sample comparisons EdgeR's trimmed mean of M values (TMM) [2] uses a weighted trimmed mean of the log expression ratios between samples sequencing depth, RNA composition gene count comparisons between samples and for DE analysis; NOT for within sample comparisons PCA https://hbctraining.github.io/DGE_workshop/lessons/principal_component_analysis.html PCA https://hbctraining.github.io/DGE_workshop/lessons/principal_component_analysis.html PCA https://hbctraining.github.io/DGE_workshop/lessons/principal_component_analysis.html Sample1 PC1 score = (read count * influence) + ... for all genes PCA https://hbctraining.github.io/DGE_workshop/lessons/principal_component_analysis.html PCA PCA real life examples PCA real life examples PCA real life examples ● There is a bad experimental design and a good experimental design ● Very simply - more randomization gives you better results Pairing of the samples/batch effect ● And example pairing of the patients AND different sequencing years - double batch Pairing of the samples/batch effect Pairing of the samples/batch effect ● Paired samples are not the same as paired-end sequencing! DE analysis - proper DE analysis - proper We need for DEseq2: ● Table describing all the samples ● Table with raw counts Experimental design ● Biological replicates represent multiple samples from the same sample group ● Technical replicates represent the same sample (i.e. RNA from the same mouse) but with technical steps replicated, or the same sample sequenced multiple times ● Usually biological variance is much greater than technical variance, so we do not need to account for technical variance to identify biological differences in expression ● Don't spend money on technical replicates - biological replicates are much more useful Experimental design Experimental design - confounding variables A confounded RNA-Seq experiment is one where you cannot distinguish the separate effects of two different sources of variation in the data. ● Do not try to save money by “pooling” variables together! ● Avoid introducing variables ● Keep metadata DE analysis - proper DE analysis - result table log2(fold-change) ● Fold-change is usually calculated by average expression of all samples of condition 1 vs average expression of all samples of condition 2 ● Example: a) geneA expression in pre is 5, in post is 10; fold-change of post/pre is 2 = gene is up-regulated 2x b) geneB expression in pre is 10, in post is 5; fold-change of post/pre is 0.5 = gene is down-regulated 1/2x … (O_o) ● Solution: Adding log2 gives us log2(2) = 1, log2(0.5) = -1 ● Nice and even distribution around 0 and clear interpretations P-value and adjusted p-value ● P-value tries to give you “a number” saying if the differences you are observing are robust and the differences are not “random” between the compared conditions/samples ● Adjusted p-value adds a correction for the multiple testing we are doing tries to add correction of getting a p-value just by accident ● But is adjusted p-value 0.049 really better than 0.051? ● Number of replicates highly influences the estimates ○ The observations might be the same but the statistical significance might be lower How many differentially expressed genes I have? It depends how many you want…:) Selection of the differentially expressed (DE) gene is completely up to you Some people use p-value, some adjusted p-value and some people log2fc and their combinations, some just take top n genes Statistical significance ≠ biological relevance!!! Scientists rise up against statistical significance, Nature 567, 305-307 (2019), doi: 10.1038/d41586-019-00857-9 P-value significance Visualisation - MA plot Visualisation - volcano plots Amazing interactive tool: Glimma - https://bioconductor.org/p ackages/release/bioc/htm l/Glimma.html Visualisation - heatmaps Main takeaway points ● Have a question that can be answered with RNA-seq. ● Avoid confounding variables. ● Have as many biological replicates as you can. ● Don’t bother with technical replicates. ● Rather do more replicates than deeper sequencing. * ● Consult your design with a sequencing facility / bioinformatician before you start. ● Keep metadata, even if it feels crazy. ● Try to make a pilot study before you commit more resources. (Student’s mental health is a resource too!) *Unless you want some really low expressed genes, then you probably need both replicates and depth 50www.ceitec.eu CEITEC @CEITEC_Brno Vojta Bystry vojtech.bystry@ceitec.muni.cz Thank you for your attention!