Gene set-based clustering of gene expression data
Vlad Popovici
• Gene expression profiling measures the activity of genes in
different organisms, organs, cell lines, etc., and indicates which
molecular pathways are active. It is usually performed with
microarray or RNA sequencing technology.
• Clustering of gene expression profiles is used to determine whether
the samples form distinct groups of mutually similar profiles.
• When clustering expression profiles, various similarity measures
can be considered, depending on the objective of the analysis. One
interesting perspective is offered by so-called pathway
activation scores, or similar quantities, which try to capture the
activity of a given pathway or gene set.
• However, clustering samples (expression profiles) requires
a similarity score between pairs of samples. Here we propose to
use a gene-set-based measure, e.g.
Similarity(x1, x2) = abs(Score(x1, gene_set) - Score(x2, gene_set)),
to produce groupings of samples relative to a reference gene set.
(With this definition, smaller values indicate more similar samples.)
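The proposed measure can be illustrated with a small sketch. The project itself targets R; Python is used here only to show the arithmetic. The score is assumed to be a combined Z-score (per-gene standardization across samples, then a sum over the gene set divided by the square root of the set size), which is only one possible choice of Score. All function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def standardize_genes(expr):
    """expr maps gene -> list of values (one per sample).
    Returns per-gene z-scores computed across samples."""
    z = {}
    for gene, values in expr.items():
        mu, sd = mean(values), stdev(values)
        # A gene with no variation carries no information; set it to 0.
        z[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return z

def zscore_gene_set(z, gene_set, sample):
    """Combined Z-score of one sample for one gene set (assumed score)."""
    genes = [g for g in gene_set if g in z]
    if not genes:
        raise ValueError("no gene of the gene set is present in the data")
    return sum(z[g][sample] for g in genes) / sqrt(len(genes))

def dissimilarity(z, gene_set, i, j):
    """abs(Score(x_i, gene_set) - Score(x_j, gene_set))."""
    return abs(zscore_gene_set(z, gene_set, i) - zscore_gene_set(z, gene_set, j))

# Toy data: 3 genes measured in 4 samples (made-up numbers).
expr = {
    "A": [1.0, 1.2, 5.0, 5.1],
    "B": [2.0, 2.1, 6.0, 6.2],
    "C": [3.0, 0.5, 3.1, 0.4],
}
gene_set = {"A", "B"}
z = standardize_genes(expr)
print(dissimilarity(z, gene_set, 0, 1))  # samples 0 and 1: similar set activity
print(dissimilarity(z, gene_set, 0, 2))  # samples 0 and 2: different set activity
```

Gene C varies strongly between samples but is outside the gene set, so it does not influence the grouping. This is the sense in which the clustering is relative to a reference gene set.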
Bi4013 Team Project in Mathematical Biology and Biomedicine – Biomedical Bioinformatics (spring 2021)
Aims of the project:
• Main steps:
• Study a tutorial on data clustering in R
• Study some examples of R packages
• Get familiar with data representation (within existing R packages)
• Study some methods for gene set scoring (R packages)
• Implement the new similarity function
• Compare with correlation-based similarity
• Test the package on a publicly available colorectal cancer or breast
cancer dataset (GEO database, https://www.ncbi.nlm.nih.gov/gds),
preferably normalized microarray data
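For the comparison step, a correlation-based baseline could look like the sketch below. It assumes the common convention of taking 1 minus the Pearson correlation between two whole expression profiles as their distance. The project targets R; Python is used only for illustration, and the function names and toy profiles are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def correlation_distance(samples):
    """Pairwise 1 - r matrix; samples is a list of expression profiles."""
    n = len(samples)
    return [[1.0 - pearson(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]

# Toy profiles over the same four genes (made-up numbers).
samples = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # same shape as the first, twice the magnitude
    [4.0, 3.0, 2.0, 1.0],   # reversed pattern
]
d = correlation_distance(samples)
print(d[0][1], d[0][2])
```

Correlation distance uses all genes and is insensitive to the scale and offset of a profile, which is worth keeping in mind when comparing it with a measure based on score differences for a single gene set.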
The project will implement (in R) a package that supports such
clustering with various activation scores (Z-score, Kolmogorov-Smirnov,
GSEA, etc.) and a number of alternative similarity
functions (e.g. absolute difference, log-ratio). For testing
purposes, MSigDB will be used for gene set selection, and publicly
available colon-cancer expression data sets will serve as test data.
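One way to combine interchangeable scores and comparison functions is sketched below. It is written in Python purely for illustration (the package itself is to be written in R), and every name in it is hypothetical. The Kolmogorov-Smirnov-type score is assumed to be the unweighted running-sum statistic over genes ranked by decreasing expression. The log-ratio comparison is assumed to be the absolute log of the ratio, which is defined only for strictly positive scores.

```python
from math import log

def mean_score(profile, gene_set):
    """Average expression of the gene-set members present in the profile."""
    values = [profile[g] for g in gene_set if g in profile]
    if not values:
        raise ValueError("no gene of the gene set is present in the profile")
    return sum(values) / len(values)

def ks_score(profile, gene_set):
    """Signed KS-type running-sum statistic (unweighted)."""
    ranked = sorted(profile, key=profile.get, reverse=True)
    hits = sum(1 for g in ranked if g in gene_set)
    misses = len(ranked) - hits
    if hits == 0 or misses == 0:
        raise ValueError("need genes both inside and outside the gene set")
    running, best = 0.0, 0.0
    for g in ranked:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / hits if g in gene_set else -1.0 / misses
        if abs(running) > abs(best):
            best = running
    return best

def abs_difference(a, b):
    return abs(a - b)

def abs_log_ratio(a, b):
    """Only meaningful for strictly positive scores."""
    return abs(log(a / b))

def distance_matrix(profiles, gene_set, score, compare):
    """Score every profile once, then compare all pairs of scores."""
    scores = [score(p, gene_set) for p in profiles]
    return [[compare(a, b) for b in scores] for a in scores]

# Toy profiles (gene -> expression), made-up numbers.
profiles = [
    {"A": 9.0, "B": 8.0, "C": 2.0, "D": 1.0},
    {"A": 8.5, "B": 9.0, "C": 1.5, "D": 2.0},
    {"A": 1.0, "B": 2.0, "C": 9.0, "D": 8.0},
]
gene_set = {"A", "B"}
D_ks = distance_matrix(profiles, gene_set, ks_score, abs_difference)
D_lr = distance_matrix(profiles, gene_set, mean_score, abs_log_ratio)
print(D_ks)
print(D_lr)
```

Passing the score and the comparison as separate arguments keeps the two choices independent, so any resulting matrix can be handed to a standard clustering routine. Not every pairing is sensible: a signed score such as the KS-type statistic cannot be used with the log-ratio.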