Vojta Bystry
vojtech.bystry@ceitec.muni.cz

Modern methods for genome analysis (PřF:Bi7420)
Lecture 6: RNA-seq differential expression

NGS data analysis
[Workflow diagram. Labels: Experiment design; de-multiplexing; Raw data (.fastq); QC; Genome/Transcriptome reference; Mapping (.bam); QC; Interaction analysis (ChIP-seq); Expression analysis (RNA-seq); Variant analysis (WES); Unknown reference; Not "classic" reference; Metagenomics; Reference assembly; Immunogenetics (VDJ genes); CRISPR (sgRNA); Methylation (bisulfite-seq); …]

Feature count results

Differential expression
● We have our raw read counts, but we still need to find the real differences
● We want to quantify the change between before and after treatment
● Which genes changed? Are there any at all? Is there any difference between the samples? And does the experimental design (for example, paired samples) affect the evaluation?
● Differential expression tools have to account for different library depths, model and "fix" outliers, account for different expression levels, and many other things
● Luckily, a few tools handle all of this and are ready to use

Differential expression - tools
● DESeq2
  ○ More specific
● edgeR
  ○ More sensitive
● The important part of the calculation is the design
  ○ Assignment of a group/condition to each sample
  ○ If the samples are paired (the same patient sampled twice), we have to account for this as well!
  ○ Technically, the pairing of the samples is a batch effect, so it behaves like technical noise in your data

Pairing of the samples/batch effect
● Paired samples are not the same as paired-end sequencing!
● There is bad experimental design and good experimental design
● Very simply: more randomization gives you better results

Pairing of the samples/batch effect
● An example: pairing of the patients AND different sequencing years gives a double batch effect

Pairing of the samples/batch effect

Differential expression results

Count normalisation
● Normalize to:
  ○ Gene size
  ○ Library size
● RPKM - Reads Per Kilobase of transcript per Million mapped reads
● FPKM - Fragments Per Kilobase of transcript per Million mapped reads
● TPM - Transcripts Per Million
  ○ For every 1,000,000 RNA molecules in the RNA-seq sample, x came from this gene/transcript
● Never ever use normalized counts for any comparisons
  ○ ...except comparing a single gene across the samples of a single experiment
  ○ If you really, really need to compare normalized counts, use TPM

log2(fold-change)
● Fold-change is usually calculated as the average expression of all samples of condition 1 vs the average expression of all samples of condition 2
● Example:
  a) geneA expression in pre is 5, in post is 10; fold-change post/pre is 2 = gene is up-regulated 2x
  b) geneB expression in pre is 10, in post is 5; fold-change post/pre is 0.5 = gene is down-regulated to 1/2x … (O_o)
● Solution: taking log2 gives us log2(2) = 1, log2(0.5) = -1
● Nice, symmetric distribution around 0 and a clear interpretation

log2(fold-change)
● But it might be misleading
● A large log2FC on a low-expressed gene is most likely not biologically relevant
● A small log2FC on a highly expressed gene might be biologically relevant
● Example: "common" cut-off values are a fold-change of 2x (log2FC = +/-1) or 1.5x (log2FC = +/-0.58)
  ○ geneA expression in WT is 10 and in KO is 4, log2FC = -1.32: YES (?)
  ○ geneB expression in WT is 1,000,000 and in KO is 500,001, log2FC ≈ -0.999997, just short of the -1 cut-off: NO (?)
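
A minimal Python sketch of the arithmetic on the count-normalisation and log2(fold-change) slides above. The function names (`tpm`, `log2_fold_change`, `passes_cutoff`) and the toy counts and gene lengths used for TPM are illustrative assumptions, not part of the lecture; the geneA/geneB values are the slide examples. DESeq2 and edgeR estimate fold-changes from a statistical model, so this shows only the plain arithmetic, not what those tools do internally.

```python
import math


def tpm(counts, lengths_bp):
    """Transcripts Per Million for one sample.

    counts     -- raw read counts per gene
    lengths_bp -- gene/transcript lengths in base pairs (same order)
    """
    # Step 1: correct for gene size (reads per kilobase).
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    # Step 2: correct for library size so the values sum to 1,000,000.
    scale = sum(rpk) / 1_000_000
    return [r / scale for r in rpk]


def log2_fold_change(reference, treated):
    """log2 of mean(treated) / mean(reference), e.g. post/pre or KO/WT."""
    mean_ref = sum(reference) / len(reference)
    mean_trt = sum(treated) / len(treated)
    return math.log2(mean_trt / mean_ref)


def passes_cutoff(log2fc, fold=2.0):
    """True if the change is at least `fold` in either direction."""
    return abs(log2fc) >= math.log2(fold)


# Toy TPM input (hypothetical counts and lengths).
print(tpm([100, 200, 300], [1000, 2000, 3000]))

# Slide examples: (expression in reference, expression in treated).
slide_examples = {
    "geneA pre/post": ([5], [10]),
    "geneB pre/post": ([10], [5]),
    "geneA WT/KO": ([10], [4]),
    "geneB WT/KO": ([1_000_000], [500_001]),
}
for name, (ref, trt) in slide_examples.items():
    lfc = log2_fold_change(ref, trt)
    print(name, round(lfc, 6), "passes 2x cut-off:", passes_cutoff(lfc))
```

Calling `passes_cutoff(lfc, fold=1.5)` applies the looser 1.5x cut-off from the slide; geneB (WT/KO) fails the 2x cut-off but passes the 1.5x one, which is exactly why a fixed threshold can mislead.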
P-value and adjusted p-value
● The p-value tries to give you "a number" saying whether the differences you observe between the compared conditions/samples are robust rather than "random"
● The adjusted p-value adds a correction for multiple testing: with thousands of genes tested, some get a small p-value just by accident
● But is an adjusted p-value of 0.049 really better than 0.051?
● The number of replicates strongly influences the estimates
  ○ The observations might be the same, but with fewer replicates the statistical significance might be lower

How many differentially expressed genes do I have?
It depends how many you want… :)
Selection of the differentially expressed (DE) genes is completely up to you
Some people use the p-value, some the adjusted p-value, some the log2FC or a combination of these, and some just take the top n genes
Statistical significance ≠ biological relevance!!!
Scientists rise up against statistical significance, Nature 567, 305-307 (2019), doi: 10.1038/d41586-019-00857-9

P-value significance

● Example

Differential expression output

www.ceitec.eu
CEITEC @CEITEC_Brno
Thank you for your attention!
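
Supplement to the adjusted p-value slide: a minimal Python sketch of one common multiple-testing correction, the Benjamini-Hochberg procedure, followed by a simple DE gene filter. The slides do not name the correction method, so Benjamini-Hochberg is an assumption here; the function names, gene names, p-values, and cut-offs are all hypothetical and chosen only for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Gene indices sorted by ascending raw p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that
    # adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_de(names, padj, log2fc, padj_cutoff=0.05, lfc_cutoff=1.0):
    """One possible selection: adjusted p-value AND |log2FC| cut-offs."""
    return [
        name
        for name, p, lfc in zip(names, padj, log2fc)
        if p < padj_cutoff and abs(lfc) >= lfc_cutoff
    ]


# Hypothetical results for four genes (not real data).
names = ["gene1", "gene2", "gene3", "gene4"]
raw_p = [0.01, 0.04, 0.03, 0.005]
log2fc = [1.8, -0.4, -1.2, 2.5]

padj = benjamini_hochberg(raw_p)
for name, p, q in zip(names, raw_p, padj):
    print(name, "p =", p, "padj =", round(q, 4))
print("Selected:", select_de(names, padj, log2fc))
```

Changing `padj_cutoff` or `lfc_cutoff` changes how many genes are selected, which is the point of the slide: the number of DE genes depends on the criteria you choose.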