Gene set-based clustering of gene expression data
Vlad Popovici

• Gene expression profiling is a way of studying the activity of genes in different organisms, organs, cell lines, etc., and of determining which molecular pathways are active. It is usually performed using microarray or RNA sequencing technology.
• Clustering of gene expression profiles is used to determine whether the samples fall into groups that are internally similar and distinct from one another.
• When clustering expression profiles, various similarity measures can be considered, depending on the objective of the analysis. One interesting perspective is offered by so-called pathway activation scores, or similar quantities, which try to capture the activity of a given pathway or gene set.
• However, clustering samples (expression profiles) requires a similarity score between pairs of samples, whereas an activation score describes a single sample. Here we propose to use a gene set-based measure, e.g. Similarity(x1, x2) = abs(Score(x1, gene_set) - Score(x2, gene_set)), to produce groupings of samples relative to a reference gene set. (Strictly speaking, this quantity is a dissimilarity: smaller values indicate more similar samples.)

Aims of the project

• The project will implement (in R) a series of functions (eventually as a package) allowing such clustering with various activation scores (Z-score, Kolmogorov-Smirnov, GSEA, etc.) and a number of alternative similarity functions (e.g. absolute difference, log-ratio). For testing purposes, MSigDB (http://www.gsea-msigdb.org/gsea/msigdb/) will be used for gene set selection, together with publicly available tissue-specific expression data sets from GTEx (https://www.gtexportal.org).
• Main steps:
  • Study a tutorial on data clustering in R
  • Study some examples of R packages
  • Get familiar with data representation (within existing R packages)
  • Study some methods for gene set scoring (R packages; look in particular at the singscore package)
  • Implement the new similarity function
  • Compare with correlation-based similarity
  • Identify colon-tissue data from GTEx and test the functions/package on it, using the TPM data (normalized RNA-Seq)
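
To make the proposed measure concrete, the following is a minimal sketch in base R of how the absolute-difference dissimilarity could feed a hierarchical clustering and be compared with a correlation-based one. It is an illustration under stated assumptions, not the project's implementation: the data are simulated (not GTEx), the gene set is made up (not taken from MSigDB), the score is a simple mean of per-gene Z-scores (one of several possible "Z-score" variants), and the function names `zscore_set_score` and `abs_diff_dissimilarity` are hypothetical.

```r
# Simulated expression matrix (genes in rows, samples in columns).
# This stands in for real data such as GTEx TPM values.
set.seed(1)
n_genes   <- 200
n_samples <- 12
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("g", seq_len(n_genes)),
                               paste0("s", seq_len(n_samples))))

# Hypothetical gene set, made more active in the first six samples.
gene_set <- paste0("g", 1:20)
expr[gene_set, 1:6] <- expr[gene_set, 1:6] + 2

# Assumed scoring rule: standardise each gene across samples,
# then average the Z-scores of the gene set members within each sample.
zscore_set_score <- function(expr, gene_set) {
  genes <- intersect(gene_set, rownames(expr))
  stopifnot(length(genes) > 0)
  z <- t(scale(t(expr[genes, , drop = FALSE])))
  colMeans(z, na.rm = TRUE)
}

# Pairwise abs(Score(x1) - Score(x2)) returned as a "dist" object.
# For one-dimensional input the Manhattan distance is the absolute difference.
abs_diff_dissimilarity <- function(scores) {
  dist(scores, method = "manhattan")
}

# Gene set-based clustering of the samples.
scores     <- zscore_set_score(expr, gene_set)
d_set      <- abs_diff_dissimilarity(scores)
hc_set     <- hclust(d_set, method = "average")
groups_set <- cutree(hc_set, k = 2)

# Correlation-based clustering on all genes, for comparison.
d_cor      <- as.dist(1 - cor(expr, method = "pearson"))
hc_cor     <- hclust(d_cor, method = "average")
groups_cor <- cutree(hc_cor, k = 2)

# Cross-tabulate the two groupings.
print(table(gene_set = groups_set, correlation = groups_cor))
```

Because the dissimilarity is returned as a standard `dist` object, any other scoring function that returns one value per sample (for instance a score computed with the singscore package) could be substituted for `zscore_set_score` without changing the clustering step.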