[13:40 28/8/2010 Bioinformatics-btq453.tex] Page: 2391 2391–2397 BIOINFORMATICS ORIGINAL PAPER Vol. 26 no. 19 2010, pages 2391–2397 doi:10.1093/bioinformatics/btq453 Sequence analysis Advance Access publication August 9, 2010 An alignment-free model for comparison of regulatory sequences Hashem Koohy1,∗, Nigel P. Dyer1, John E. Reid2, Georgy Koentges3 and Sascha Ott4,∗ 1MOAC Doctoral Training Centre, Coventry House, University of Warwick, Coventry, CV4 7AL, 2MRC Biostatistics Unit, Institute of Public Health, Forvie Site, Robinson Way, Cambridge, CB2 0SR, 3Department of Biological Sciences, Gibbet Hill Campus and 4Warwick Systems Biology Centre, Coventry House, University of Warwick, Coventry, CV4 7AL, UK Associate Editor: Martin Bishop ABSTRACT Motivation: Some recent comparative studies have revealed that regulatory regions can retain function over large evolutionary distances, even though the DNA sequences are divergent and difficult to align. It is also known that such enhancers can drive very similar expression patterns. This poses a challenge for the in silico detection of biologically related sequences, as they can only be discovered using alignment-free methods. Results: Here, we present a new computational framework called Regulatory Region Scoring (RRS) model for the detection of functional conservation of regulatory sequences using predicted occupancy levels of transcription factors of interest. We demonstrate that our model can detect the functional and/or evolutionary links between some non-alignable enhancers with a strong statistical significance. We also identify groups of enhancers that are likely to be similarly regulated. Our model is motivated by previous work on prediction of expression patterns and it can capture similarity by strong binding sites, weak binding sites and even the statistically significant absence of sites. Our results support the hypothesis that weak binding sites contribute to the functional similarity of sequences. Our model fills a gap between two families of models: detailed, data-intensive models for the prediction of precise spatio-temporal expression patterns on the one side, and crude, generally applicable models on the other side. Our model borrows some of the strengths of each group and addresses their drawbacks. Availability: The RRS source code is freely available upon publication of this manuscript: http://www2.warwick.ac.uk/fac/sci/ systemsbiology/staff/ott/tools_and_software/rrs Contact: s.ott@warwick.ac.uk; hashem.koohy@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. Received on May 20, 2010; revised on July 30, 2010; accepted on August 4, 2010 1 INTRODUCTION Cis-regulatory modules (CRMs) can drive precise spatio-temporal gene expression patterns. Some recent studies show that CRMs may function similarly in different species despite substantial sequence divergence (Hare et al., 2008; Ludwig et al., 2005). This implies that, first, alignment-based sequence comparison tools are not applicable for further decoding the conserved function of such CRMs and ∗To whom correspondence should be addressed. second, some CRMs must share common patterns that drive almost identical regulatory outputs but possibly with different arrangements of binding sites. When different, but functionally related enhancer loci in the same species are considered, then alignment-based tools are not normally suitable for regulatory sequence comparisons as these sequences are not orthologous. Here, we present an alignment-free regulatory sequence comparison model called Regulatory Region Scoring (RRS) model, which is based on the potential distribution of transcription factors (TFs). Our goals are: (1) To be able to detect functionally similar enhancer regions even if the enhancer regions do not align. (2) To find groups of similar enhancers and determine relevant sequence features shared among enhancers within a group. 1.1 Previous work There has been a great deal of attention over alignment-free methods to further reveal the mechanism of transcription control (see Vinga and Almeida, 2003). Among these methods, two families are of particular interest. Models in the first family are aimed at predicting spatio-temporal gene expression patterns from the regulatory sequences. A model in this family was recently developed by Segal et al. (2008), and then attracted the attention of other researchers (see Gertz et al., 2009; Loo and Marynen, 2009; Segal and Widom, 2009; Zinzen et al., 2009). For previous related work, see Djordjevic et al. (2003); Foat et al. (2006); Roider et al. (2007); Tanay (2006); Vinga and Almeida (2003). The model by Segal et al. (2008) is based on a thermodynamic equilibrium assumption. The probability of polymerase occupancy is computed from the intrinsic equilibrium affinities and concentrations of the TFs. The gene expression level is considered to be proportional to the polymerase occupancy. This model takes into account some important aspects of TF–DNA interaction including competition of TFs for TF binding sites, selfcooperativity of TFs, and the effects of weak binding sites.Although this model advances our understanding of how genomic sequences are translated into transcriptional outputs, the complexity and data dependency of the model do not allow for a wide application of this model as a sequence comparison tool. One must deal with a complex model-fitting procedure and provide a combination of data that is rarely found at present: spatial expression patterns of a number of related enhancers, knowledge of what the key regulators are, their binding motifs and their spatial concentration. © The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org 2391 atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom [13:40 28/8/2010 Bioinformatics-btq453.tex] Page: 2392 2391–2397 H.Koohy et al. The second family of alignment-free methods is based on the rationale that functionally similar sequences must share some common words. Within these methods each sequence is associated with a vector of k-mer counts. A distance function for these vectors is then defined (Aerts et al., 2003; Blaisdell, 1986; Kantorovitz et al., 2007; Leung and Eisen, 2009; van Helden, 2004). From this family, D2z (Kantorovitz et al., 2007) is of particular interest for us because: (i) it was one of the first alignment-free methods for detection and analysis of regulatory sequences; (ii) it is a normalized version of ‘D2’Lippert et al. (2002), a natural method of comparison of k-mers in two sequences. The background normalization of this model makes it possible to account for sequences from different background distributions. However, some limitations of this method are: (1) Not all functional motifs in a pair of sequences are in the form of 6-mers. So, by considering only 6-mers as patterns underlying functional similarity of a pair of sequences, some motifs that contribute to the gene expression pattern may be overlooked. (2) Not all 6-mers are biologically meaningful words, hence using all 6-mers may mean introducing some noise to the model and furthermore, we may want to compare two sequences just based on a subset of meaningful words. (3) Within the D2z framework, degeneracy of TF binding motifs is not accounted for. So, different 6-mers are treated separately even if they only differ in one base. (4) The framework does not allow for a sequence and its reverse complement to be combined for the purposes of assessing possible TF binding. There is a gap between these two families of models. The former is based on a mechanistic understanding of the regulation of gene expression by predicting expression patterns using TF occupancy and interaction and is too dependent on a specific combination of datasets to be generally applicable. The latter family of models is defined very generally and is widely applicable, but some natural principles underlying transcriptional control, such as TF competition, motif degeneracy and effects of weak binding sites, are completely ignored. Consequently, the results are less conclusive. The RRS aims to enhance the conclusiveness of the results and lessen the data dependency of the model by borrowing the key ideas of each family of models so as to get more accurate results on a wider range of data. 1.2 Outline of model and results The first part of our model uses a modification of the thermodynamic model by Segal et al. (2008) to compute the expected number of proteins binding to each of a set of motifs in a sequence. This computation captures some of the strengths of Segal’s model, including the competition of TFs and contributions of weak binding sites. Each sequence is summarized by these expectations. The second part of our model compares the expectations of different sequences to compute a similarity score. Our model can be used to detect functionally relevant similarity in unalignable sequences (such as certain promoters and enhancers) and it provides insights into shared regulatory codes. Our main results are: (1) We have developed the model underlying the RRS. The model comprehensively evaluates the possible configurations of proteins occupying given DNA sequences and provides similarity scores based on shared combinations of motifs. (2) We show that the RRS can detect functional and evolutionary links between enhancers, which do not have significant alignments. (3) We find that weak binding sites can make a strong contribution to sequence similarity. (4) Our model treats statistically significant presence and absence of motifs symmetrically. Similarity of sequences can, therefore, be based on a combination of both for a set of motifs. We show examples of motifs making contributions to sequence similarity through their absence. (5) We use the RRS to create a network of similarities among a set of 131 known fly enhancers. The network connects 34 enhancers with 43 significant pairwise similarity scores. At least four groups of enhancers show strong statistical evidence to function similarly via a shared regulatory code. One of these groups is strongly supported by the existing experimental data. 2 METHODS The RRS takes as input a pair of sequences and a set of TF motifs. We call one of the sequences the template sequence and the other the test sequence. The task is to judge whether the test sequence has the potential to drive similar expression patterns as the template sequence, assuming expression is driven by the given set of motifs. We do not use any cutoff for probabilities of binding of these motifs to the sequences and so allow weak binding events and even absence of motifs to contribute to sequence similarity. The output from RRS is a statistical similarity score and a list of motifs that contribute to that similarity score. The model is built of two main components: one component associates each sequence with a mathematical vector reflecting which proteins with what multiplicity and what specificity have the potential to be bound to the sequence. We call the elements of these vectors motif occupancy values or, in short, o-values. These vectors give an indication of the potential enhancer function of the given sequences. The second component estimates a probability distribution of motif o-value vectors for sequences that function similar to the template sequence. We then compute a Bayes factor to evaluate if the test sequence is more similar to the template sequence or more similar to random background sequences (supplementary Fig. S1 shows a simplified schematic illustration of the RRS concept). 2.1 Occupancy values of proteins binding a sequence (motif o-values) We assume a template sequence T, a test sequence S, and a set of TF motifs M={M1,...,Mn}. We use the term configuration to denote a particular arrangement of protein molecules along the DNA sequence, which is defined by the subsequences at which each molecule is bound to the sequence. Valid configurations are those in which binding subsequences do not overlap. A configuration c with N molecules bound to a sequence is defined as c={(Mi,Pi)|1≤i≤N, Mi ∈M }, where Mi is the i-th molecule bound at a subsequence starting at position Pi. Note that N is the number of molecules bound in the configuration while n is the number of TF motifs. The length of the binding subsequence is the size of motif Mi. Further, we denote the subsequence starting at position Pi, where molecule Mi has bound, by Bi. Therefore p(Bi|Mi) means the probability of subsequence Bi using the the corresponding PSSM model and p(Bi| ¯Mi) means the probability of subsequence Bi given the background model (uniform 0−order 2392 atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom [13:40 28/8/2010 Bioinformatics-btq453.tex] Page: 2393 2391–2397 Regulatory Region Scoring Model Markov model in our case). We then associate a statistical weight with this configuration: W(c)= N i=1 p(Bi|Mi) p(Bi| ¯Mi) = N i=1 p(Mi|Bi) p( ¯Mi|Bi) × p( ¯Mi) p(Mi) (1) in which p(Bi|Mi) p(Bi| ¯Mi) is the contribution of molecule i bound to the sequence at position Pi. This variable is in turn a function of binding affinity i.e., p(Mi|Bi) p( ¯Mi|Bi) and concentration parameter i.e., p( ¯Mi) p(Mi) . This definition enables weak binding events to be included in the model. Assume that in a configuration c we have a molecule that has been weakly bound to the sequence many times. If, for the sake of simplicity we assume an equal binding affinity a (a>1) in K positions, then the contribution of this factor to the W(c) is equal to aK . Depending on K, this might be a strong contribution. The probability of each configuration c is then defined as p(c)= W(c) c∈C W(c) where C is the set of all valid configurations. We use the same dynamic programing technique as in Segal et al. (2008) to compute this probability. There can be more than one expressed protein species that can bind to a given motif. In the absence of information on either the number of protein species capable of binding a motif or the nuclear concentrations of these proteins we assume the total nuclear concentration of such proteins to be equal for each motif and set p( ¯Mi) p(Mi) to a constant value. Where such information is available it can be integrated into the RRS by setting the concentration parameters accordingly. When the concentration parameter is set to a constant value, it determines the average density of proteins bound to DNA within our model. We chose 15 as the setting for the concentration parameter and confirmed that results presented in this work are robust as long as the concentration parameter is set such that the protein density is realistic. Note that the scaling of this parameter depends on the scaling of the binding affinity and, therefore, the absolute value does not have a direct interpretation. Intuitively, this probability distribution over all possible configurations should reflect a number of aspects of enhancer function in a natural way. Overlapping binding sites will compete with each other, high-affinity binding sites will attract a binding molecule more often, and weak binding sites can exert an effect if they are present in numbers. Proteins are more likely to interact with the polymerase if they occupy the enhancer more often. Therefore, a key quantity relevant to the function of an enhancer is the expected number of copies of a given protein that bind to motifs in the enhancer (T): eT Mi =log c∈C p(c)IMi (c) (2) in which IMi (c) is the number of occurrences of motif Mi in configuration c. This definition is of particular interest because it captures both the specificity and multiplicity of a binding event of a protein to the sequence in the p(c) and IMi (c) terms, respectively. A dynamic programming approach is used to compute each occupancy value. Finally, the sequence T is associated with the vector of occupancy values, i.e. ET = and similarly sequence S is associated with ES =. Our results show that these occupancy values are length dependent. We divide them by the length of the sequences to normalize them. Therefore, each of these vectors summarizes the combined specificity and multiplicity that each protein is likely to bind to each of the sequences. 2.2 Similarity scores Our aim in this section is to define a similarity function over the space of vectors of occupancy values to extract the similarity of a given pair of o-values. Having observed o-values from the template sequence, ET , we want to test if the vector of o-values from the test sequence, ES, has been drawn from the same distribution or from a random background distribution. The logarithm of motif o-values in randomly picked sequences from the genome of the species of interest approximates a normal distribution. Therefore, the probability of a motif o-vector such as ES = can be obtained A B C D E F S f(S)Density S S Fig. 1. Illustration of the RRS similarity score for an individual motif. There are three possibilities. (A, D) The motif is neither significantly present nor absent in the template sequence. The distribution of motif o-values in sequences with the same function as the template sequence (solid line) is estimated to be equal to the random background (dashed line). In this case, irrespective of the motif o-value in the test sequence, the function f (eS Mi ) [see Equation (4)] is constant (D). (B, E) The motif o-value in the template is higher than in random sequences (B), in this case f (eS Mi ) is an increasing function (E). (C, F) The motif o-value is lower than in random sequences, indicating significant absence. In this case, f (eS Mi ) is decreasing (F). from a multivariate normal distribution. For the sake of simplicity, we shall consider an independent multivariate normal distribution. This means that the probability of the o-vector ES under the random model is p(ES|R)= n i=1 p(eS Mi |µ=µRi ,σ =σRi ), where, µRi and σRi are the mean and SD of o-values for motif i in randomly picked sequences. The probability that ES has been drawn from the same distribution as the template is p(ES|T)= n i=1 p(eS Mi |µ=eT Mi ,σ =σRi ). We define the RRS score as RRS(S|T)= p(ES|T) p(ES|R) (3) The first point to note about this definition is that it is asymmetric but one may define it as an average to make it symmetric, i.e. RRS(S,T)=(RRS(S|T)+ RRS(T|S))/2. However, it is sensible to work with the asymmetric version, in particular when comparing two sequences from different species. The second point that makes this definition more realistic and useful is the contribution of the individual motifs, which are the factors making the RRS-score: f (eS Mi ):= p(eS Mi |µ=eT Mi ,σ =σRi ) p(eS Mi |µ=µRi ,σ =σRi ) (4) for any motif Mi, where 1≤i≤n. For any test sequence S, one can consider the Equation (4) as a function of variable eS Mi with three extra parameters: eT Mi , µRi and σRi . The following cases illustrate this definition and its usage in the rest of this article: (1) If eT Mi ≈µRi (see Fig. 1A), then f (eS Mi ) can be considered as a constant function with value ≈1 (Fig. 1D). This means that if the expected number of occurrences of this motif in the template sequence is very close to the average of its expected number of occurrences in the random sequences, then the overall RRS score for the test sequence will be largely independent of number of occurrences of this motif in the test sequence. In biological terms, if the test sequence shares a regulatory code with the template sequence, but also contains additional binding sites, then these additional sites do not reduce the sequence similarity. (2) If eT Mi >µRi , then f (eS Mi ) is an increasing function. More accurately, if we assume that eT Mi >A>µRi , where A is the intersection point of the two distribution curves (Fig. 1), then f (eS Mi )≤1 if eS Mi ≤A else it is 2393 atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom [13:40 28/8/2010 Bioinformatics-btq453.tex] Page: 2394 2391–2397 H.Koohy et al. greater than one. This case occurs when the motif is strongly present in the template sequence. Accordingly, the greater the motif o-value in the test sequence, the greater the contribution of the motif (Fig. 1B and E). Note that a strongly negative RRS score in this case implies poor presence of the motif in the test sequence. (3) Similarly, if eT Mi <µRi , then f (eS Mi ) is a decreasing function. In other words, f (eS Mi )>1, if eS Mi