RNA-Seq and DE analysis Bioinformatics Core Facility, Centre for Molecular Medicine, CEITEC Masaryk University 1 CEITEC-Central European Institute of Technology, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic 2 National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic Václav Hejret1,2 vaclav.hejret@ceitec.muni.cz Acknowledgement • We request co-authorship (and will assist with manuscript preparation) if we have developed novel tools, algorithms, or pipelines, participated in experiment design planning, or have contributed to a biological question addressed in a manuscript. We ask that, at a minimum, you acknowledge us in a publication to which we have contributed routine analysis without further support and discussions, data management or conversion, or data submission services. • Core Facility Bioinformatics of CEITEC Masaryk University is gratefully acknowledged for the obtaining of the scientific data presented in this paper. Description • The main goal of the experiment is to perform Differential Expression analysis. • In general, we have 10 human samples. We have 2 conditions (WT, Mut), those conditions are sequenced in five replicates, which is sufficient enough to obtain statistical significance. • Conditions to compare are: WT_vs_Mut • Samples were sequenced using Lexogen Quantseq FWD kit, therefore a lot of reads in UTR regions are expected as well as huge 3’ sequencing bias. These data can only be used for differential expression analysis. • Samples contained 6bp long UMIs used for detection of PCR duplicates in library Agreed task(s) 1. General samples QC. 2. RNA-Seq analysis. 3. DE analysis. Samples • Brief sample description and number of raw and preprocessed reads are summarized in Table 1. Table 1: Sample description - condition assignment and number of raw and preprocessed reads. Original name Sample Condition raw reads preprocessed reads Mut-1 Mut_rep1 Mut 10,729,594 10,721,500 Mut-2 Mut_rep2 Mut 10,190,321 10,182,322 Mut-3 Mut_rep3 Mut 11,309,496 11,299,605 Mut-4 Mut_rep4 Mut 10,304,741 10,296,211 Mut-5 Mut_rep5 Mut 11,986,309 11,969,557 WT-1 WT_rep1 WT 10,731,040 10,722,069 WT-2 WT_rep2 WT 10,088,717 10,079,613 WT-3 WT_rep3 WT 10,511,508 10,504,764 WT-4 WT_rep4 WT 11,060,603 11,049,278 WT-5 WT_rep5 WT 12,926,867 12,912,520 Analysis • Raw reads were quality checked (FastQC, MultiQC, minion, swan), preprocessed (Trimmomatic, FastQC, MultiQC) and mapped (STAR, Samtools, MultiQC) to the reference genome with gene annotation (genome version: Ensembl GRCh38, gene annotation: Ensembl v94) • Mapped reads were counted and summarized to genes (featureCounts) − Only uniquely mapped and uniquely assigned reads were counted − “Forward” counts were considered because of the both library prep. kit strand- specificity • Following checks were performed to estimate the overall sample quality − rRNA content estimate (fastq_screen) − Read duplication rate (dupRadar, Picard tools) − Sequenced (targeted) regions (RSeQC, Picard tools) − 5’/3’ coverage bias (Picard tools) − Expressed gene biotypes (featureCounts) − Library strandness (RSeQC) − Other quality checks are available upon request • Complete settings, used commands, tool versions and methodology part for a publication can be provided upon request Results General samples QC • The sequencing depth varies from ~10M of reads for sample WT_rep2 to ~13M of reads for sample WT_rep5. • Initial quality check for analysis did not show presence of adapters (<0%), no adapter trimming was performed • Other initial quality checks look OK Alignment and splices • The single end reads were aligned to human reference genome with the help of the gene annotation (GRCh38; Ensembl 94) using STAR aligner. • The number and percentage of uniquely- and multi-mapped reads are blablabgfgfgdfgl samples ~79% of uniquely mapped and we are getting approximately the same values across all samples, same applies for number of mapped reads, which varies between ~8M and ~10M (Table 2, Figure 1). Table 2: Number and percentage of uniquely- and multi-mapped reads Original name Sample uniquely mapped uniquely mapped (%) multimapped multimapped (%) Mut-1 Mut_rep1 8,564,030 79.88 1,468,222 13.69 Mut-2 Mut_rep2 8,128,507 79.83 1,361,834 13.37 Mut-3 Mut_rep3 8,933,108 79.06 1,622,784 14.36 Mut-4 Mut_rep4 8,186,419 79.51 1,462,967 14.21 Mut-5 Mut_rep5 9,313,518 77.81 1,661,406 13.88 WT-1 WT_rep1 8,412,480 78.46 1,643,464 15.33 WT-2 WT_rep2 7,924,970 78.62 1,485,485 14.74 WT-3 WT_rep3 8,443,782 80.38 1,503,497 14.31 WT-4 WT_rep4 8,619,720 78.01 1,590,757 14.40 WT-5 WT_rep5 10,156,024 78.65 1,868,552 14.47 Figure 1: Percentages and distribution of different mapping types. For more details please see the STAR aligner manual. rRNA contamination • Samples had a little bit higher percentage of rRNA contamination up to 6% − For polyA selection, we would expect ~ 1%, for rRNA depletion could be up to 10% − This can be due to less effective polyA selection procedure, but overall we still have a lot of usable reads and information to perform Differential expression analysis − The percentages are summarized in Figure 2 Figure 2: rRNA estimate. Mapped regions • Evaluation of the mapped genomic regions is very comparable • All samples had approximately almost no content of intronic reads ~12% counted over bases as well as ~12% counted over reads which is expected for polyA selection library protocols • All samples had majority of reads counted over bases as well as counted over reads mapped within coding regions (Figure 3 and 4). • All samples are good quality and can be safely used for Differential expression analysis. Figure 3: Mapped locations (Picard; in bases). Figure 4: Mapped locations (RSeQC; in reads). Read coverage distribution • The gene body coverage is very similar in all the samples. Figure 5: Gene body mapping distribution. Strand specificity • The analysis showed the strand-specificity of the library cannot be determined with a full confidence. − All the samples have ~86% of the forward-strand specificity, ~1% of the reversestrand specificity but ~13% mappings cannot be determined. • These are usually mappings to intergenic regions, unannotated regions or regions where two genes overlap at the same strand (Figure 6). − We consider mapping as stranded (forward), as we got the highest number of suitable reads for DE analysis. Figure 6: Library strand specificity. Read count assignment to genes • Read count assignment from featureCounts for RNA-Seq is quite high (between 52% and 57%, calculated ) − Number of assigned reads is above limits of the recommended number of reads (~5-6M). − Read assignment rates are summarized in Table 3 and Figure 7. Table 3: Summary of read-to-gene assignment. Percentages counted only from featureCounts. Sample Assigned Assigned (%) Mut-1 5,380,841 57.33 Mut-2 4,586,049 52.99 Mut-3 5,624,381 55.09 Mut-4 5,260,029 55.99 Mut-5 5,516,216 54.35 WT-1 5,251,339 52.69 Sample Assigned Assigned (%) WT-2 4,798,645 52.25 WT-3 5,048,669 54.61 WT-4 5,354,256 53.14 WT-5 6,250,925 53.61 Figure 7: Read count assignment to genes (featureCounts). Biotypes • All the samples captured mainly protein-coding genes (Figure 8). − This is expected from polyA selection based libraries. Figure 8: Captured gene biotypes. Contamination check - Biobloom tools • To check contamination we used basic model organisms eg. Mouse, Rat, Yeast, Dog, D. Melanogaster, A. thaliana, B. napus and C. elegans • All samples have about 78% of reads mapped only to human, which is quite good and expected number • The percentages are summarized in Figure 9. Figure 9: Captured read contamination. Notes and comments • We strongly recommend confirming all the events manually in genome/alignment browsers such as IGV, Tablet or Savant. − All the alignments and references are provided. Visualization IGV • You can upload the provided aligned bam files, the provided genome annotation and the gene annotation to the IGV and browse the alignments. • The IGV tutorial is available here • References are available to download (for example for IGV) here. • Used references − Homo_sapiens.GRCh38.dna.primary_assembly.fa − Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai − Homo_sapiens.GRCh38.94.sorted.gtf − Homo_sapiens.GRCh38.94.sorted.gtf.idx • All the index files (.bai, .fai, .idx) must be in the same folder as their “parent” files otherwise IGV wouldn’t open them. DE analysis • For the DE analysis we used gene counts from featureCounts tool. • Read counts for coding genes (mostly protein-coding genes) distribution is depicted in Figure 10. • Post-normalization counts (~normalized) seem to be OK (Figure 10). Figure 10: Number of reads assigned to coding genes for all the samples before normalization (raw counts). Coding genes depth for all the samples after normalization (normalized counts). • Clustering of the samples (based on ) shows that all the samples cluster together by their condition. − This is summarized in heatmap in Figure 11. • PCA shows a similar effect as the previous clustering. − PCA is depicted in Figure 12. Figure 11: Clustering of samples (DESeq2 log2 normalization). Figure 12: PCA (first two components) visualization of all samples (DESeq2 VST normalization). Differential expression results • Please note that for visualization purposes we had to set a predefined cut-off value for some of the parameters (adj.p-value and logFC) − Please read the disclaimer at the beginning of the report for more information about selected cut-off values! • In this section, we might provide several results as the number of compared conditions is variable between experiments • Differential gene expression was calculated by two separate tools - edgeR and DESeq2 − Both of the tools have different performances: edgeR is usually more sensitive and less specific, DESeq2 is usually more specific and less sensitive; edgeR is more suitable for fewer replicates (<12) as it is not as conservative as DESeq2 − Generally, DESeq2 is more conservative than edgeR. − If your goal is an exploratory analysis with a low number of replicates and with low expressed genes, edgeR might be your choice. − If you need a selection of genes which are strongly differentially expressed and you want to remove as many false-positive results as possible for a price of some falsenegative results DESeq2 might be your choice. − If you want to perform some sort of prioritization, you might select an overlap between the two tools. − Note for DESeq2: By default, DESeq2 applies independent filtering which aims at removing genes which are potential outliers or show “strange” behavior (~large variance). Filtered genes are then marked by NA in either p-value or adj.p-value column. This filtering might be in some cases too strict and might cause a loss of interesting results. For this reason, we also make DESeq2 results without the independent filtering (DESeq2_noIndFilt). If you choose DESeq2 results, I recommend to start with filtered results but look at the end of the table at the filtered genes (genes with NA in p-value/adj.-value are at the bottom of the table). In case you see some interesting results there you might consider switching to the unfiltered results. In edgeR analysis, only genes with very low expression are excluded (1 read-per-million reads in at least 3 samples). • The full description of a DE results is given after in the Output files section • Number of DE genes with default cut-off values is summarized in Table 4 − Please note this is only a tentative view on approximate differences between the conditions and not final results. • Volcano plot visualization visualizes log2FC (x-axis) and adj.p-value (y-axis). − Colored dots highlight genes above the defined adj.p-value cut-off value, blue lines show defined log2FC cut-off value. − Volcano plot from DESeq2 results can be found in Figure 13. • MA plot visualizes mean expression (x-axis) and log2FC (y-axis). − Colored dots highlight genes above the defined cut-off values (both adj.p-value and log2FC). − MA plot from DESeq2 results can be found in Figure 14. • An example of maximum top 20 most DE genes (by adj.p-value) can be found in Figure 15. − The selection of a maximum 20 genes is always based on compared conditions but visualizes expression values from all other conditions if applicable. Table 4: Summary or number of DE genes based on default cut-off values from all three DE calculations. The first two columns are counts with both adj.p-value and logFC cut-off, the other two columns are only with adj.p-value cut-off. DESeq2 results, DESeq2 results without independent filtering (noIndFilt) and edgeR results are shown. Comparison Analysis Up (adj.pval & logfc) Down (adj.pval & logfc) Up (adj.pval) Down (adj.pval) WT_vs_Mut DESeq2 1,715 1,436 3,600 3,462 Comparison Analysis Up (adj.pval & logfc) Down (adj.pval & logfc) Up (adj.pval) Down (adj.pval) WT_vs_Mut DESeq2 noIndFilt 1,669 1,404 3,439 3,291 WT_vs_Mut edgeR 1,537 1,355 3,610 3,613 Figure 13: Volcano plot (DESeq2 results with independent filtering on). Maximum of 20 most DE genes (by adj.p-value) are named. Figure 14: MA plot (DESeq2 results with independent filtering on). Maximum of 20 most DE genes (by adj.p-value) are named. Figure 15: Heatmap of maximum top 20 differentially expressed genes (DESeq2 results with independent filtering, adj.p-value < 0.05, logFC >= +/-1). Used statistics and selection of differentially expressed genes Log2FC and adjusted p-value • Differences between the conditions are expressed in log2FC (log2 of fold-change) − This is calculated by comparing average expression one condition vs the other and then the log2 transformation − This makes the results symmetric around 0 and it’s easier to understand − For example: -1 is a 2-fold decrease of the expression, +1 is a 2-fold increase of the expression. − Please note it always depends on the “direction” of the comparison - if you are not sure just compare normalized counts of one gene to see from which “direction” the log2FC was calculated. • Calculation of DE genes consists of a very high number of comparisons (thousands of genes = thousands of comparisons) − For this reason, we need to correct raw p-values for false discovery rate which emerges from the number of comparisons − This results in adjusted p-values (adj.pval, adj. p-values, adj. pval) • In case you want to use some filtering or selection based on the statistical evidence you should ALWAYS USE adjusted p-values and never raw p- values • Raw p-values can be used only in very special cases and always require a discussion with bioinformatician or statistician Note on selecting of differentially expressed genes (DEG) • Selection of differentially expressed genes always has to be primarily based on the biological background of the experiment and the hypothesis • A lot of people select DEG based on a combination of cut-off values • This might be tricky as no predefined cut-off values exist and it depends on the experiment itself, on the amount of false-positive results you are willing to accept and on the minimum level of changes if any • Commonly used statistical values, such as logFC and adj.p-value, should help you to confirm the hypothesis, results or the observations and not lead the discovery itself • If you decide to go for some cut-off values, do not blindly select only genes above the thresholds! − For example, the frequently used cut-off value of adj.p-value 0.05 at which you will reject the hypothesis the gene is not expressed the same: one gene will have adj.pvalue 0.0499 which would pass this cut-off whereas the second gene will have adj.p-value of 0.0501 and would not pass the cut-off. The difference between the genes is negligible and still, the first gene would and the other gene would not be accepted if a strict cut-off value would be set. − Another “issue” with adj.p-value is when the groups we are comparing and heterogeneous (for example clinical samples). In this case, adj.p-value are generally much higher. In this situation, we should strongly focus on biological effect and hypothesis and use statistics to either confirm the observation or to function for prioritization of the results − The similar thing applies to logFC − In the case of logFC, we also have to consider the effect of the expression of individual genes. For example, the frequently used cut-off value of 1 (2x foldchange): A gene with expression 10 in one group and 5 in other group results in logFC 1 and would pass the cut-off. The overall expression of the gene is negligible and most likely the change doesn’t have any biological effect. Another gene has an expression of 10000 in one group and 5001 in another group. This gene would not pass the cut-off value but most likely the change will have some kind of biological effect. − Sometimes small logFC might be more important than high logFC and the same applies to adj.p-value. • We should always choose the genes based on the overall properties of the gene and the comparison between conditions Output files • Results are organized by compared conditions − The compared conditions are always stated in the name of the folder − All the results in the corresponding folder are based on the currently compared conditions but in some of the cases other samples and conditions as added to the visualization/summaries as well − For example: cond1_vs_cond2 compares cond1 to cond2. In the DE results then positive log2FC signalizes more expression in cond1 (or less in cond2) and negative log2FC signalizes less expression in cond2 (or more in cond1). • Main output files are DESeq2.tsv, DESeq2_noIndFilt.tsv, and edgeR.tsv • All three main results contain several columns: − first column (no name) - Ensembl gene id − baseMean - an average expression of genes from samples from compared conditions − log2FoldChange - log2FC of the gene expression difference. The difference is calculated by comparing average expression of samples of one condition with the other samples from the other condition − stat and lfcSE (in DESeq2); LR and tgw.Disp (in edgeR) - results of the statistical test performed by the tool and variation/dispersion of the gene − pvalue - a raw p-value of the comparison − padj - multiple testing correction (Benjamini-Hochberg correction) p-value of the comparison. In case you are comparing and analyzing RNA-Seq use this as the evaluation of the statistical significance of the comparison − gene_name - common gene name − gene_biotype - biotype of the gene − **_normCounts** - normalized gene expression for each sample (not only the samples in the current comparison) − **_rawCounts** - raw read counts for each sample (not only the samples in the current comparison) • The full description is given below • Additional visualizations and/or a different combination of visualization is available upon request after discussion General/common outputs • xxx in the name of some pdf files in the description below represents compared conditions • background.txt – list of genes that were expressed in your samples (=gene was expressed in at least one of the samples). The first column is Ensembl gene ID, second is common gene name. • norm_counts.xlsx - DESeq2 normalized counts (~expression) for all genes and all samples. Can be used for visualization of the data. − norm_counts.tsv - the same information as in norm_counts.xlsx but in a text form. • all_sig_genes_normCounts.pdf – significantly DEG (here, “default cut-off” adj.p-value 0.05 and logFC 1 is considered) expression visualization. Selection of DE genes is based on the particular comparison but all conditions/samples are plotted. This is based on DESeq2 results. • counts_barplot.pdf - visualization of used (raw) read counts per sample • DESeq2.tsv – main results from differential gene expression by DESeq2. Description of the columns is given above. tsv is a tab-separated source file for the xlsx. • DESeq2_noIndFilt.tsv – results from differential gene expression without independent gene filtering (see Independent filtering and Cooks cut-off in DESeq2 manual). tsv is a tabseparated source file for the xlsx. • DESeq2_de_genes_check.txt and DESeq2_de_genes_check_noIndFilt.txt – basic summary of differential expression. “Default” adj.p-value of 0.05 and logFC of 1 are used. The upper part of the table considers only adj.p-value cut-off value, the lower part considers adj.p-value and logFC. This servers only for a demonstration of the possible effect, not as a final result • DESeq2_design_control.txt - a control file for used sample design • edgeR.tsv – main results from differential gene expression by edgeR. tsv is a tabseparated source file for the xlsx. • edgeR_de_genes_check.txt - same as in case of DESeq2 - “quick and dirty” check of the effect to get a general idea about the differences. • heatmap_selected_orderBaseMean.pdf – heatmap of maximum top 20 most differentially expressed genes (based on adj. p-value) ordered by an average expression. Blue means low expression, red means high expression. The maximum top 20 genes are selected based on the compared groups but all samples are visualized. This is based on DESeq2 results. • heatmap_selected_orderBaseMeanCluster.pdf – heatmap of maximum top 20 most differentially expressed genes (based on adj. P-value, same as above) clustered by rows and columns. Blue means low expression, red means high expression. This is based on DESeq2 results. • heatmaps_samples.pdf - heatmap visualization of sample clustering based on gene expression. Several normalizations are applied but the first one usually looks the best. • MAplot_xxx_vs_xxx_ggpubr.pdf - MA plots for the gene expression. Maximum top 20 most DEG (by adj.p-value) are named. This is based on DESeq2 results. • MAplot_xxx_vs_xxx_noIndFilt_ggpubr.pdf - the same as above but without independent filtering. • MDS_plot.pdf - MDS-based visualization of sample clustering by expression • overlap_DESeq2_edgeR_de.xlsx and overlap_DESeq2_edgeR_venn.pdf – overlap between DE genes between DESeq2 and edgeR calculation. In default, genes with adj. pvalue <0.05 and logFC +/- 1 are included. Columns DESeq2 and edgeR contain values TRUE or FALSE and this represents whether the resulting gene was included in the results by an individual tool. • pre_post_norm_counts.pdf - visualization of used raw (top figure) and normalized (bottom) read counts per sample • sample_to_sample_PCA.pdf - PCA-based visualization of sample clustering by expression. The first figure shows first two PCA components, the second shows all combinations of first three components (~3D look) • volcanoplot_xxx_vs_xxx_ggplot2.pdf – volcano plot of the results from DESeq2. Maximum top 20 most DEG (based on adj.p-value) are depicted. This is based on DESeq2. • volcanoplot_xxx_vs_xxx_noIndFilt_ggplot2.pdf - same as above but without independent filtering. • Please note the differences between MA and Volcano plots - different X and Y axes giving you a different view on the results. • Alignments can be visualized in genome browsers such as IGV, Tablet or Savant − For more details please read the manual or contact your bioinformatician • If the mapping files were shared through the online storage please download them as soon as possible and let us know so we can delete them Output files (alphabetical order) Folders Raw_fastq • Provided raw sequenced fastq files ### Alignment • Provided alignments in bam format ### QC_general • html file summarizing most of the QC done ### FeatureCounts_expression • Gene expression raw counts for each sample ### RSEM_expression • Gene and transcript expression for each sample • The normalized expressions are summarized in TPM column #### Files • .genes.results.tsv - gene expression estimates • .isoforms.results.tsv - transcript expression estimates ### UCSC_signals • Alignment signal for UCSC browser • Two signal “types” per sample are provided: − Unique - signal only from uniquely mapped reads − UniqueMultiple - signal from both uniquely and multi-mapped reads • Each set is then split into two: − str1 - signal coming from the alignment to the plus strand of DNA − str2 - signal coming from the alignment to the minus strand of DNA • If you have sense/forward specific library, the signal strand corresponds to the strand of the annotated gene (mapping —-> corresponds to the + strand annotation) • If you have antisense/reverse specific library, the signal strand corresponds to the opposite strand gene annotation (mapping <—- corresponds to the + strand annotation) DE_analysis • All files related to Differentially expressed genes analysis Gene ontology, GSEA and pathway analysis • This section summarizes possible follow-up analyses/interpretations of the differential expression results • In the following analyses the input can sometimes be specified as Ensembl ID (usually provided in the results unless noted otherwise), some tools required common gene names (usually provided in the results as well) or Entrez ID (NCBI/RefSeq) − This is always specified in the manual of the tool • GO and/or Pathway analysis can be performed for up-/down-regulated genes as well as for the whole list of de-regulated genes − Both of the analyses give a different view of the results • Analysis of separate up-/down-regulated genes gives you a direct answer on the change, in particular, GO or Pathway and the “direction” of the regulation • Analysis of all de-regulated genes (up-/down-regulated together) gives you a more broad view of the total change − Both of the results are helpful but you have to realize what kind of information you can get and what kind of answer they provide Gene Ontology • Input for gene ontology (GO) testing is usually a list of differentially expressed genes − You can either use filtering by some cut-off value(s) or select top X genes sorted by adj.p-value/logFC/… • You can either analyze all DEG or separately down-/up-regulated − Please see the section above for more details • A second input is gene background/universe − This is a list of all genes expressed in the comparison − This list is used as a “reference” for the statistical evaluation of significantly changed GO groups • For “common” organisms like human or mouse, a very nice tool is GOrilla with very nice figures and direct export to Excel − It also contains links to REViGO which can help you to summarize the GO groups into more general categories • To simply browse GO categories you can use QuickGO or Ontology Lookup Service • Another option to filter down GO categories is AmiGO2. • If, for whatever reason, you don’t have gene ontology annotations but just sequences of your genes you can still get GO analysis. Blast2GO is a nice service combining BLAST similarity search with GO annotation to produce GO analysis. There is a free and paid version where the free version gives you basic annotations and have some pretty visualizations. A tutorial is for example here https://www.youtube.com/watch?v=GqSqS_izlYg&t=335s. • Other tools include g:Profiler and many more Pathway analysis • Input data are very similar to GO (mentioned above) • PANTHER and Reactome provide quite nice analysis for the Pathway analysis and exploration. − PANTHER recently started to include GO testing as well so it became a multipurpose tool • KEGG is another commonly used tools for pathway analysis but it is not that easy to interpret the data unless you know which pathways should be involved Gene Set Enrichment • Input data to Gene Set Enrichment are a bit different than in previous cases • You do NOT subsample your results but you take a whole list of gene expression results and sort it according to specified criteria (logFC, adj.p-value, …) − In the case of logFC, you have the most positively changed genes on the top and the most negatively changed genes at the bottom and non-changed genes are in the middle but they are still in the SAME list − The enrichment then compares different parts of the list and looks for patterns or similarities • One of the most used is probably GSEA − It’s a standalone tool which takes some time to learn but provides reasonable results • GOrilla also offers gene enrichment analysis • Another option is to use DAVID which is also a multipurpose tool but I do not recommend it too much since the interpretation of the results might be difficult and very subjective. But it can still be used if there is no other option or you find it easy to work with. Protein-protein interactions • HIPPIE provides reliable and meaningful human protein-protein interaction networks. It is suitable more for gene-by-gene exploration. • STRING is a database of known and predicted protein-protein interactions supported by SIB, CPR - NNF and EMBL. This includes physical, functional as well as predicted interactions. • BioGRID is an interaction repository with data compiled through comprehensive curation efforts. Contains interactions extracted from publications for major model organisms Multipurpose tools and other • There are other tools/approaches to help with the secondary analysis • Enrichr is nice and simple to run the tool to identify enriched pathways, different gene summaries, and many others. It provides many various plots and tables • GeneMANIA visualizes genes and their interaction, co-expression as well as Gene Ontology, etc. It is suitable more for gene-by-gene exploration. • ConsensusPathDB-human integrates several different approaches including pathway analysis • BioCyc Database Collection is a collection of 9387 Pathway/Genome Databases (PGDBs), plus software tools for understanding their data. • KEGG database is a database resource that integrates genomic, chemical and systemic functional information. In particular, gene catalogs from completely sequenced genomes are linked to higher-level systemic functions of the cell, the organism, and the ecosystem. KEGG accepts common gene names and NCBI gene/transcript ID and UniProt ID. For the analysis, you can use http://www.genome.jp/kegg/tool/map_pathway2.html or http://www.genome.jp/kegg/tool/map_pathway1.html. KEGG looks for genes in your list and tries to highlight the pathway where selected genes have some kind of association. • Venny is a nice tool for a simple visualization of overlaps between up to three sets of data. R/Bioconductor • If you know how to use R you can check clusterProfiler which provides many different analyses types and has very nice manual and tutorials • topGO is another R package which offers both gene ontology and gene set enrichment analysis • ReactomePA allows for analysis of pathways with a slightly different source than KEGG • To add GO categories (and other interesting information) to your list of genes, you can try biomaRt which can directly load the Ensembl database