BIOINFORMATICS ORIGINAL PAPER
Vol. 25 no. 13 2009, pages 1632­1639
doi:10.1093/bioinformatics/btp264
Structural bioinformatics
Domain Interaction Footprint: a multi-classification approach to
predict domain­peptide interactions
Christian Schillinger1,2, Prisca Boisguerin3 and Gerd Krause1,
1Leibniz Institute for Molecular Pharmacology, Robert-Roessle-Str. 10, Berlin, 2FU-Berlin, Department of Biology,
Chemistry and Pharmacy, Takustr. 3, 14195 Berlin and 3Institute of Medical Immunology,
Charite-Universitaetsmedizin, Hessischestrasse 3-4, 10115 Berlin, Germany
Received on October 7, 2008; revised on April 1, 2009; accepted on April 15, 2009
Advance Access publication April 17, 2009
Associate Editor: Burkhard Rost
ABSTRACT
Motivation: The flow of information within cellular pathways largely
relies on specific protein­protein interactions. Discovering such
interactions that are mostly mediated by peptide recognition modules
(PRM) is therefore a fundamental step towards unravelling the
complexity of varying pathways. Since peptides can be recognized
by more than one PRM and high-throughput experiments are
both time consuming and expensive, it would be preferable to
narrow down all potential peptide ligands for one specific PRM
by a computational method. We at first present Domain Interaction
Footprint (DIF) a new approach to predict binding peptides to PRMs
merely based on the sequence of the peptides. Second, we show
that our method is able to create a multi-classification model that
assesses the binding specificity of a given peptide to all examined
PRMs at once.
Results: We first applied our approach to a previously investigated
dataset of different SH3 domains and predicted their appropriate
peptide ligands with an exceptionally high accuracy. This result
outperforms all recent methods trained on the same dataset.
Furthermore, we used our technique to build two multi-classification
models (SH3 and PDZ domains) to predict the interaction preference
between a peptide and every single domain in the corresponding
domain family at once. Predicting the domain specificity most
reliably, our proposed approach can be seen as a first step towards
a complete multi-domain classification model comprised of all
domains of one family. Such a comprehensive domain specificity
model would benefit the quest for highly specific peptide ligands
interacting solely with the domain of choice.
Contact: gkrause@fmp-berlin.de
Supplementary information: Supplementary data are available at
Bioinformatics online.
1 INTRODUCTION
Protein­protein interactions are often mediated through protein
domains that function as peptide recognition modules (PRMs),
which are small (50­150 amino acids), independently folding
domains that bind linear peptides and are found repeatedly in several
protein structures (Pawson and Nash, 2003). The PRMs of different
families (PDZ, SH3, SH2, WW, PTB, 14-3-3, etc.) are involved in a
To whom correspondence should be addressed.
variety of cellular processes and are therefore an essential key of cell
physiology as a whole. Within a family all PRMs share structural
commonalities, but differ slightly at specific positions that entails
specific binding partners. Identification of interaction partners to
different PRMs can immensely benefit in unravelling biochemical
pathways. The interaction partners are usually recognized by a short
amino acid sequence that contains specific characteristics. In case
of the SH3 domains, the binding pocket recognizes poly-prolin
motifs of other proteins. PDZ domains always bind to the last four
C-terminal residues of their interaction partner. Hence, the prediction
of a potential ligands is possible in principle, since the binding
mechanisms are very well understood (Feng et al., 1994, Songyang
et al., 1997).
Several methods have been established to elucidate a common
pattern of several peptides ( Nevill-Manning et al., 1998, Schneider
and Stephens, 1990, Timothy et al., 2006). A good classification
method should be able to classify a peptide according to previously
learned rules. These rules are subsequently used to classify unknown
peptides. Doing this accurately could save expensive and timeconsuming
experiments. Since PRMs of the same domain family
have generally a virtually similar pattern of their peptide ligands,
classifying peptides by means of manual inspection is often
ambiguously and therefore error-prone. Tong et al. (2002) for
instance, analysed different SH3 domains and constructed a specific
consensus motif of each domain. Those motifs bear such a great
resemblance to each other that a manual peptide classification
task is all but impossible, since there are peptides that match to
all such motifs at once. And as there is usually no classification
algorithm coming along with those motifs, one has to do the
classification manually. As more interaction data evolved from
various experiments, efforts were made to cluster domains in
classes and characterize those. PDZ domains, for example, can
roughly be divided into four classes based on their ligand specificity
(Boisguerin et al., 2004, Reina et al., 2002, Songyang et al.,
1997)--class 1 peptides are defined by consensus sequence motif
([S/T]X COOH), class 2 by ([ / ]X COOH), class 3 by
([D/E]XVCOOH) and class 4 by (X [D/E]COOH), respectively
( : hydrophobic, : aromatic, COOH: C-terminal carboxylic acid
of the protein). But many PDZ domains target in fact a more
comprehensive ligand sequence space. That is, to say the set
of all recognized ligands by a domain contains also peptides
whose sequences does not match the previous stated classification
1632  The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
Domain Interaction Footprint
scheme and cannot categorized by simple rules at all (Vaccaro and
Dente, 2002). Hence, we need a computational method that uses
information beyond ordinary sequence motifs to circumvent manual
classification.
There exist several methods to predict domain­peptide
interactions whose predictive capabilities have all been tested with
the phage display dataset of Tong et al. (2002) and are thus
comparable. Brannetti et al. (2000) first incorporated structural
information of a domain by SH3-specific matrices. This matrices
can be derived from SH3/peptide complexes of the Protein Data
Bank (Bernstein et al., 1997) and were used to establish a residueresidue
contact database for predicting domain­peptide interactions.
The machine learning method of Ferraro et al. (2006) relies also
on the domain­peptide contact residues in complexes of known
structures and enhances the contact matrix approach by a neural
network and an appropriate encoding of the interacting residues.
With it the accuracy of prediction of domain­peptide interactions
could further be improved. Nevertheless, a necessary prerequisite
of both methods is the necessity of structural information. Even
though both of them show a generalization ability to a certain
degree, those techniques are not applicable to cases where not
very reliable structure is available. Contrary to this dependency,
Reiss and Schwikowski (2004) developed a probabilistic generative
model of the SH3 ligand peptides which is based on a modified
Gibbs motif sampler (Lawrence, 1993). They were able to identify
ligand peptides of PRMs by combining protein sequences and
physical interaction data what makes their approach suitable for
PRMs without any known structure. Lehrach et al. (2005) replaced
the generative approach of Reiss and Schwikowski (2004) by a
Laplacian-regularized discriminative model and reached together
with Ferraro et al. (2006) the best predictive power aimed at inferring
the domain recognition specificity of the SH3 phage display dataset
so far.Another sequence-only method is the related, Hidden Markov
Model (HMM)-based work of McLaughlin et al. (2006). For a more
comprehensive overview of computational methods, see Shoemaker
and Panchenko (2007) and Lee et al. (2007).
Aside from the prediction of domain­peptide interactions, it is of
great interest to infer specific ligand peptides that preferably interact
solely with one domain despite the similar consensus patterns of
other PRMs. This demands a multi-classification approach that can
reliably classify a peptide with respect to several PRMs of the same
domain family.
In this article, we introduce Domain Interaction Footprint (DIF)
as a new sequence-based approach to predict domain­peptide
interactions and show that its performance to separate binding and
non-binding peptides is better than those of any other method.
Additionally, we show that our technique can be applied to multiclassification
tasks where it assigns a given peptide to its appropriate
PRM out of a set of domains that all share a very similar consensus
pattern of their ligand peptides.
2 METHODS
2.1 Datasets
The SH3 dataset is composed of 25 experimental result sets whereas each
one of those contains the corresponding binding and non-binding peptides of
one SH3 domain that were identified in phage display by Tong et al. (2002).
The examined peptides can be divided into classes 1 and 2, whereas
class 1 is defined by the consensus sequence motif (+x Px P) and class
Table 1. Experimentally verified binding and non-binding peptides
determined for different SH3 and PDZ domains
Domain Binding peptides Non-binding peptides
SH3
Boi1 Class 1 3 16
Boi1 Class 2 6 11
Boi2 Class 1 8 9
Bzz1-1 Class 1 14 17
Bzz1-2 Class 1 13 17
Myo3 Class 1 7 21
Myo5 Class 1 13 21
Nbp2 Class 1 16 19
Pex13 Class 1 10 12
Pex13 Class 2 16 12
Rvs167 Class 1 11 10
Rvs167 Class 2 16 5
Sho1 Class 1 18 12
Sla1-3 Class 1 8 18
Yfr024 Class 1 7 17
Yfr024 Class 2 22 6
Ygr136 Class 1 18 14
Ygr136 Class 2 15 9
Yhl002 Class 1 9 14
Yhr016 Class 1 6 16
Yhr016 Class 2 11 6
Yjl020 Class 1 4 18
Yjl020 Class 2 11 11
Ypr154 Class 1 23 11
PDZ
AF6 16
ERBIN 19
SNA1 15
N1P1 9
2 by ( Px Px+), where is a hydrophobic and + is a basic residue
(Mayer, 2001). Some domains interact not only with class 1 peptides, but
also with class 2 ones and are therefore listed twice in the dataset as two
different experiments have been carried out.An overview of the used domains
as well as the number of binding and non-binding peptides is assembled
in Table 1. Further information about this dataset can be obtained in the
supplementary material of Tong et al. (2002). In order to use the provided
sequence information with our approach, we prolonged shorter sequences
with a non-defined value X so that all sequences are of the same length. This
value has no impact in further steps as it is simply being ignored.
Additionally, we used also a PDZ dataset that consists of binding and nonbinding
peptides of four different PDZ domains [AF6 (ProteinAF-6, UniProt
P55196), SNA1 (Protein Syntrophin-1, UniProt Q13424), ERBIN (Protein
LAP2, UniProt Q96RT1), N1P1 (NHERF1 PDZ 1, UniProt O14745)] to
create a multi-classification model as all involved peptides have been
experimentally tested against every PDZ domain (Supplementary Material,
Table S1). The experimental verification of single-binding peptides makes
it possible to build a multi-dimensional classifier with disjunctive sets of
peptides.
2.1.1 Peptide array synthesis and incubation The peptides tested with
the PDZ domains were synthesized on N-modified CAPE-membrane
(Bhargava et al., 2002) and prepared with a MultiPep SPOT-robot (INTAVIS
Bioanalytical Instruments AG, Germany). Array design was performed using
the inhouse software LISA 1.71. Peptide arrays for the AF6-, ERBIN- and
SNA1-PDZ domain arrays were generated and incubated as described in
1633
C.Schillinger et al.
Boisguerin et al. (2004) whereas peptide arrays for N1P1 were synthesized
using the improved method of inverted peptides as described in Boisguerin
et al. (2007). N1P1 incubations were performed analogous to the other PDZ
domains with a concentration of 10 g/ml and a detection system consisting
in anti-His(mouse) (1:3000, Sigma)/anti-mouse-HRP (1:2600, Calbiochem).
2.2 Domain Interaction Footprint
A DIF is related to one specific protein recognition module and can be seen
as an advanced description of the most important properties of all peptides
with whom the PRM interacts with. Each given peptide contributes with its
properties partly to the model of the investigated domain interaction pattern.
Our approach of building a predictive domain interaction model used
for sequence-function analysis consists of three major steps (Fig. 1) . First,
we encode our data to be utilizable by our algorithm. This is done by
using specific property-based numerical values for amino acids, so-called
amino acid indices (Kawashima and Kanchisa, 2000). To keep our approach
as general as possible, we do not assume any previous knowledge of the
examined domain. On account of this, we do not make a preliminary selection
of amino acid's properties, but let our algorithm choose the most meaningful
of these indices by correlation-based feature selection. Thus, we capture the
specific and appropriate properties for each domain separately. Eventually,
we create a DIF by combining descriptors that were built with the empirically
observed range of the selected properties.
2.2.1 Sequence encoding The structural information and therefore the
function of a protein is due to the combination of 20 different amino acids.
Even though no explicit rule has yet been derived to conclude a protein's 3D
structure merely from its sequence, the shape is undoubtedly based on the
underlying characteristics of the amino acids in terms of volume, charge and
hydrophobicity among others. The approach relies purely on the sequence
analysis of the peptides in relation to the biological effect. The sequence of
the PRM is not considered.
In our approach, we try to capture the functional aspect of a peptide
by its amino acid sequence. For this purpose, we encode each amino acid
by specific physiochemical and biochemical properties. There are several
parameters such as logP, Verloop parameters for volume, parameters for
hydrophobicity, for polarization, for frequency of occurrence in secondary
structure elements, for flexibility, for surface description, etc. The current
database of Kawashima and Kanchisa (2000) contains 544 such amino acid
indices that we all take into consideration. A formal depiction follows.
Let S=(s1,,sns ) be a list of sequences, with ns as the total number of
sequences. Each sequence si =(ai,1,,ai,na ) is in turn composed of na amino
acids. As amino acids can be described by several specific properties and
characteristics, we define a list of np different properties by P=(p1,,pnp ).
Encoding the amino acid sequences in S with properties P leads to an
encoded sequence set p(S) containing accordingly encoded sequences p(si)=
(p1(ai,1),...,pnp (ai,na )), with a total of ne(=na np) features.
2.2.2 Feature adaptation As each PRM interacts with specific peptides,
it is reasonable that all of these peptides share a common consensus pattern
which in turn typifies an implicit interaction rule. To tackle the domain
specificity, we capture the characteristic features of each PRM separately
by scaling down the order of magnitude for the descriptor set to significant
descriptors. The selection of the most relevant properties strongly reduces the
number of relevant descriptors and provide a basis to model the mentioned
implicit interaction rule.
There are 2ne -1 possible subsets consisting of at least one feature that
can be built with P. Taking an assumed peptide length of 10 into account as
well as the 544 amino acid indices, it would result in about 25440 different
subsets. To find a preferably good descriptive model out of the vast number of
subsets, we need to select the most meaningful features. For this purpose, we
use correlation-based feature selection (CFS) to evaluate the worth of a subset
of features by considering the individual predictive ability of each feature
along with the degree of redundancy between them. The evaluation method
Fig. 1. Schematic overflow of the algorithm.
tries to select those features, whose values differ systematically between
classes (i.e. binding and non-binding). Redundant features are eliminated
since they are highly correlated with at least one of the remaining features
(Kohavi and Sommerfield, 1995). The term correlation is not intended to
refer specifically to classical linear correlation; rather it is used to refer to
a degree of dependence or predictability of one variable (position-specific
amino acid index) with another. We start with no feature and search through
the search space by adding single features until five consecutive subsets
show no improvement over the current best subset. The acceptance of a new
feature depends on the extent to which it predicts classes in areas of the
sequence space not already predicted by other features. To avoid exhaustive
enumeration of all possible subsets, we use a confined search strategy as
described by Langely (1994). This presumably best subset is labelled Q.
The size of Q(=nq) is strongly reduced with regard to the initial number of
features (na np).
The core of CFS is given through the following subset evaluation function
MA =
drcf
d+d(d-1)rf f
(1)
where MA is the heuristic merit of subset A containing d features, rcf is the
mean feature­class correlation (f A) and rf f is the average feature­feature
intercorrelation. The CFS approach given in Equation (1) is, in fact, Pearson's
correlation coefficient, where all variables have been standardized. A more
detailed survey can be found in Hall (1999).
2.2.3 Model building Once an appropriate subset Q has been selected, we
use it to create a classification model. Therefore, we define a function f (l,z)
that returns a value at position z from a list l and a Boolean function b(y) that
returns 1, if y is true and 0 otherwise. In addition, we define a position-based
list X(S,k)=(f (s1,k),...,f (sns ,k)),k Nk ns, that contains all list entries
at a given position k. Let then
Dk =(min((X(q(S),k)),max((X(q(S),k))) (2)
be a descriptor of the lowest and highest value contained in list X(q(S),k),
respectively. Hence, we can define a model DIFq(S) that describes the range
of all properties within a defined set of sequences as follows:
DIFq(S)
=(D1,...,D|Q|) (3)
Such a model captures the observed scope of the position-specific amino
acid properties of the provided data and can therefore be seen as a descriptive
pattern of S. Based on such a model, it is rather simple to estimate how well
an unknown sequence s0 fits into those generalized DIF. For this purpose,
we introduce a fitness function F [Equation (4)] that takes two arguments: a
1634
Domain Interaction Footprint
Table 2. Averaged AUC and AUC01 values of our proposed DIF and other
recently published methods applied to the SH3 dataset
Method AUC AUC01
Neural network PAIRsa 0.83 n.a.
Informatively initialized discriminativeb 0.83 0.44
Network-based motif samplerc 0.79 0.17
SH3-SPOTd 0.76 n.a.
PSSMe 0.64 n.a.
DIF 0.95 0.81
The AUC (AUC01) score was averaged twice: once over all cross-validation
experiments and again over all SH3 domains. An AUC score of 0.5 reflects utter
randomness, while a value of 1 implies perfect prediction; AUC01 represents the
calculated AUC up to a false positive rate of 0.1 (the calculated AUC01 value is stated
as a fraction of the maximum area 0.1).
n.a., not available.
aFerraro et al. (2006).
bLehrach et al. (2005).
cReiss and Schwikowski (2004).
dBrannetti et al. (2000).
ePosition-specific scoring matrix built by Ferraro et al. (2006).
peptide and a DIF the peptide will be evaluated to. The possible outcome of
F is between 0 and 1, whereas 0 stands for no commonness at all and 1 for
a perfect match.
F(s0
,DIFq(S)
)= k b(cond1)b(cond2)
k
(4)
cond1 =f (q(s0
),k)  f ( f (DIFq(S)
,k),1) (5)
cond2 =f (q(s0
),k)  f ( f (DIFq(S)
,k),2) (6)
We used this methodology to build a series of classifiers. For each
investigated SH3 domain a binary classifier was created to discriminate
between binding and non-binding peptides. In that cases, the classifier was
composed of two DIFs, one for the binding and one for the non-binding
peptides. A peptide s0 is then assigned to the DIF with the highest score.
As there is no limitation in creating DIFs of different datasets, a multiclassification
approach can comprise as many DIFs as wanted. But notice
that Step 2, feature adaption, will select different properties with regard to a
different input. We combined binding data of all PDZ domains to build one
multi-classifier and selected appropriate amino acid properties by presenting
all datasets at once. Hence, this multi-domain model can assess to which
PRM a given peptide binds most probably. We also created a multi-domain
model for all SH3 domains.
2.3 Method evaluation
In order to compare the results, our analysis was performed in the same
way as the other methods. Like the other approaches, we used a 10-fold
cross-validation to evaluate each DIF. Therefore, the dataset is divided into
10 subsets of equal size, whereas the ratio of class sizes is kept the same.
Then 10 experiments were performed in which nine distinct subsets were
used to build a model and the remaining one to evaluate it. Subsequently, the
average performance was calculated out of the 10 results. Furthermore, we
randomized the input data in different runs to attain unbiased results. In case
of multi-classification, we created a DIF for each domain and assigned an
examined peptide to the domain with the highest fitness score [Equation (4)].
In order to compare the performance of the introduced approach
to the recently published methods NN-PAIRs, PSSM (Ferraro et al.,
2006), Network-based Motif Sampler (Reiss and Schwikowski, 2004),
Informatively Initialized Discriminative (Lehrach et al., 2005) and SH3SPOT
(Brannetti et al., 2000), we used like the other authors the area under
the receiver operating characteristic (ROC) curve (AUC) (Bradley, 1997) to
evaluate the accuracy with one single measure (Table 2). An AUC of 0.5
reflects random prediction, while AUC = 1 implies perfect prediction. The
final AUC score is averaged over all 10 runs of the cross-validation. The
AUC value of the multi-classification model refers to the average score of
all the possible pairs of class combinations. Additionally, we calculated the
AUC up to 0.1, where the number of false positives is low, since the upper
range of a ROC curve is largely irrelevant for useful predictions. Note that the
maximum AUC up to a false positive rate of 0.1 is 0.1 and that the calculated
areas are stated as a fraction of this. We refer to this score as AUC01. Even
so AUC01 is more appropriate to assess the prediction performance, we use
both scores since AUC01 is not available for all of the other approaches.
The calculation of AUC as well as the feature selection were carried out
by adapting and using the Weka framework (Holmes et al., 1994).
3 RESULTS
3.1 Model building
The prerequisite of our approach is the binding information of an
examined PRM in terms of the sequences of its peptide ligands. To
build up a classifier we need at least two sets of binding information
that can be derived either from another PRM or from non-binding
peptides.
Building the actual DIF of a PRM consists of three major
steps. First, our algorithm encodes the sequences of the binding
peptides with all available amino acid indices which results in a
large multi-dimensional feature space that numerically describes
all known properties of the amino acids of the sequences. Next,
the algorithm adapts this feature space specifically to one or more
given PRM. Therefore, it selects the most relevant features by
eliminating redundancy within one set of binding information and
high intercorrelation between itself and all other sets. Eventually,
the remaining position-dependent properties are used to build one
DIF for each examined set that are subsequently combined to one
classifier.
3.2 Single-domain model
To compare the performance of our algorithm with recent methods,
we referred for the evaluation of our single-domain approach to a
well-known dataset. We used the SH3 phage display dataset of Tong
et al. (2002) and validated our method as previously done by others
(see Section 2 for more details). All prerequisites of our algorithm as
aforementioned were fulfilled since sets of binding and non-binding
peptides are available. We used this information to create 25 binary
classifiers one for each dataset and calculated the averageAUC value
(Supplementary Material, Table S2).
Our result of AUC = 0.95 (AUC01 = 0.81) outperforms all other
methods (Table 2). Using the identical dataset the performance of
the approach of Ferraro et al. (2006) and the method of Lehrach et al.
(2005) (AUC01 = 0.44) can both be stated with an average AUC of
0.83. The approach of Reiss and Schwikowski (2004) attains AUC
= 0.79 (AUC01 = 0.17), which is by and large comparable with the
result of Brannetti et al. (2000) (AUC = 0.76). A position-specific
scoring matrix reaches only an average of 0.64.
3.3 Multi-domain model
Since interaction partners of PRMs can be described by a sequence
consensus motif and those are virtually similar between PRMs of
the same domain family, it is even more challenging to discriminate
between binding peptides of two similar PRMs than to discriminate
1635
C.Schillinger et al.
Table 3. Performance of the two multiple domain models DIF-SH3 and DIF-
PDZ
Model AUC
DIF-SH3 0.88
M1DIF-Boi1c2 0.95
M1DIF-Boi2c1 0.97
M1DIF-Bzz1-1c1 0.96
M1DIF-Bzz1-2c1 0.78
M1DIF-Myo3c1 0.87
M1DIF-Myo5c1 0.80
M1DIF-Nbp2c1 0.98
M1DIF-Pex13c2 0.97
M1DIF-Rvs167c1 0.93
M1DIF-Rvs167c2 0.93
M1DIF-Sho1c1 0.77
M1DIF-Sla1-3c1 0.96
M1DIF-Yfr024c1 0.97
M1DIF-Yfr024c2 0.94
M1DIF-Ygr136c1 0.73
M1DIF-Ygr136c2 0.72
M1DIF-Yhl002c1 0.76
M1DIF-Yhr016c1 0.88
M1DIF-Yhr016c2 0.92
M1DIF-Yjl020c2 0.97
M1DIF-Ypr154c1 0.92
M1DIF-Ypr154c2 0.84
DIF-PDZ 0.89
M2DIF-AF6 0.91
M2DIF-ERBIN 0.93
M2DIF-SNA1 0.84
M2DIF-N1P1 0.87
They comprise the binding information of the SH3 and PDZ datasets, respectively. The
AUC value of DIF-SH3 (DIF-PDZ) ( in bold) is the mean of all M1DIFs (M2DIFs) .
M1DIFs and M2DIFs are part of the DIF-SH3 and DIF-PDZ multi-classification
model, respectively. Their corresponding AUC score is attained by a one-against-all
measurement.
between binding peptides and non-binding peptides at large as the
latter not necessarily share a common motif.
We attained remarkably results with both a PDZ and a SH3 multidomain
model. The average performance of our PDZ multi-domain
model is AUC = 0.89 (AUC01 = 0.81), whereas the AUC values
of the single classifiers range from 0.84 to 0.93 (Table 3). As even
the lowest AUC score of the AF6 model can be seen as a very good
result of overall accuracy, the classifier on the whole assures a strong
prediction capability. The AUC values of the individual classifiers
of the SH3 multi-domain model range from 0.72 to 0.98 with an
overall average of 0.88.
4 DISCUSSION AND CONCLUSION
We developed a sequence-based approach to predict the interaction
of PRMs with peptides. The approach extracts for this purpose the
major amino acid properties responsible for the domain­peptide
interaction and generates a DIF. Used with knowledge of nonbinding
peptides, such a DIF can be used as a general binary
classification method to distinguish between binding and nonbinding
peptides. Additionally, we derived a multi-classification
model that is able to discriminate between peptides binding to
different domains of the same family by incorporating several
datasets at once into our algorithm. We want to emphasize that no
additional structural information of the domain itself is needed and
no adjustment of any parameter need to be made at any time. The
algorithm is therefore easy to use and can be applied to all sort of
domains very quickly.
We applied our single-domain method to a well-known SH3
domain dataset as well as to a dataset of PDZ domains. The former
one was chosen to evaluate our algorithm, because it has previously
been used by other approaches (Brannetti et al., 2000, Ferraro et al.,
2006, Lehrach et al., 2005, Reiss and Schwikowski, 2004) what
makes our result comparable. We employed both datasets to build
two different multi-domain models, one for the SH3 and one for
PDZ domains, which allowed us to predict the interaction of a given
peptide to all investigated domains, either PDZ or SH3, at once.
The achievement of our single-domain method applied to the
SH3 dataset is very promising as it outperforms all other methods
compared to. While reaching an average AUC score of 0.95 and an
AUC01 score of 0.81, both the method of Ferraro et al. (2006) and
Lehrach et al. (2005) (AUC01 = 0.44) reached an AUC score of
0.83 with the identical dataset, whereas the proposed algorithms of
Brannetti et al. (2000) and Reiss and Schwikowski (2004) achieved
only 0.76 and 0.79 (AUC01 = 0.17), respectively. Such a good
result is remarkable, since we do not use any inferred information
of the domain's structure directly as the methods of Ferraro et al.
(2006) and Brannetti et al. (2000) do. Without such structural
information, the method is applicable to cases where no hints of
the 3D structure are known. Nevertheless, we capture important
aspects of the domain's structure indirectly by its propensity toward
certain properties of its peptide ligands. The identified properties'
range of observed amino acids at a fixed peptide position can be
seen as a mirror image representation of the counterpart properties
of the surrounding PRM. In other words, each residue participates
with its own value in creating the domain's mould, or footprint.
Seeing only small volume residues at a particular position of the
peptide, for example, can be interpreted as a spatial constraint due to
a narrow cleft of the bulky surrounding of the PRM. We take not only
volume constraints into consideration but also any other properties
represented in the amino acid index database of Kawashima and
Kanchisa (2000) like electrostatics, charges, hydrophobicity, etc.
The achieved good performance and improvement compared with
other approaches is likely to be attributed to the incorporation of
those encoded amino acids. Defining a scope of favourable values
for specific features is disparate to methods using frequencies of
observed amino acids. Whereas the latter need many samples to
infer a deductive scheme, the former are better suitable to interpolate
from a low sample size.
One natural concern of machine learning techniques is always
overfitting. Therefore, we took the same line as the other approaches
that we compared our method to and used cross-validation as it
was done before. Those approaches have shown successfully that
overfitting is not a problem, even though they utilize up to 57
parameters. As a result, we conclude that our technique is legitimate
as it uses comparable or less parameters in 23 out of 25 cases
(Supplementary Material, Table S4).
Additionally, we merged the binding peptides of both domains
with the same length into one dataset and measured the
discriminative ability of our method on that combined dataset. The
classifier is based upon three features and yielded as expected
1636
Domain Interaction Footprint
Table 4. Sequence motifs of peptide ligands for different PDZ domains
derived
Position
Domain -3 -2 -1 0
AF6 , , ,+ , , , , , ,
ERBIN , ,+ , , , ,+ , , ,SNA1
, ,+,- , , , ,+,N1P1
, ,+,- , , ,- , , ,- , , ,+,Amino
acids of the peptide ligands are encoded as follows: (aromatic): F, Y, W;
(hydrophobic): V, L, I, M, C, A; + (positive): D, E; - (negative): R, K, H;  (others):
N, Q, P, G, S, T. Ligand positions are numbered in reverse from the very last C-terminal
ligand residue, which is denoted as 0.
a very good result (AUC = 0.95), since the peptides represent
completely different binding mechanisms, which is quite similar
to a binding/non-binding classification. Details can be found in the
Supplementary Material (Table S3).
The structure-based method of Ferraro et al. (2006) was also
applied to a larger pep-spot dataset of 7327 tested peptides (Landgraf
et al., 2004). Using that bigger dataset improves the neural network
approach to an accuracy of AUC = 0.92. This is due to more training
instances where the net can learn from as the method is based on
interacting residues of the peptide and the domain. The relevance of
such an interacting pair in the peptide­domain complex is weighted
by its frequency. More pairs in terms of a bigger dataset leads
therefore to a more accurate model. The smaller SH3 phage display
set entails a worse performance. Hence, our single-domain approach
is more favourable on small and medium datasets.
Although our approach is very robust some prerequisites should
be considered prior an application. The method is designated when
several verified peptide interaction partners exist as training set and
is therefore not suitable in cases where only one domain­peptide
complex is available.
Furthermore, we created sequence motifs of the peptide ligands
of the PDZ domains and show that there is no general differentiation
between them (Table 4). No domain can be described by an unique
pattern. There are only very few features that in some cases can
separate the motif of one PRM from the others. A peptide, for
instance, with a negative charged amino acid (Arg, His, Lys) at
position -2 (positions are counted backwards, beginning at 0)
interacts only with the N1P1 domain according to the current dataset.
Peptides with a positive charged amino acid (Asp, Glu) at position
-1 are unique of peptide ligands of SNA1.
These examples show the rationale of highlighting position-based
properties that can be used for classification, albeit no trivial pattern
exists to classify all peptides. We take this idea further and describe
an amino acid in greatest detail by all recently known properties.
After selecting the most distinguishable properties into an advanced
sequence motif (Table 5), we have built a classifier that can reliably
predict the interaction partner of the different PDZ domains.
To build a multi-domain model, binding information of each
participating domain for a given peptide ought to be available,
otherwise the prediction cannot be evaluated unequivocally. That
is why we chose the PDZ dataset as an example for the advanced
sequence motif, since the data used for the SH3 multi-domain model
Table 5. Advanced sequence motif of used amino acid indices to build
descriptors of the PDZ multi-classifier
Peptide position Accession No. Description
0 ZASB820101 Hydrophobicity scale
0 QIAN880113 Alpha helix propensity
0 RACS820114 Value of theta (i-1)
0 AURR980112 Helix capping
-1 LEVM760103 Side chain angle theta (AAR)
-1 AURR980110 Helix capping
-1 TANS770109 frequency of coil
-1 ZASB820101 Hydrophobicity scale
-1 RICJ880108 Alpha helices preference
-1 PONP800105 Hydrophobic packing
-1 LEVM760101 Hydrophobic parameter
-1 FAUJ880103 van der Waals volume
-1 BIGC670101 Residue volume
-2 PONP800104 Surrounding hydrophobicity
-2 CHOP780205 Helix propensity
-3 MAXF760106 Frequency of alpha region
-3 HUTJ700103 Entropy of formation
-3 GEIM800103 Alpha-helix indices
-3 CHOP780208 Beta-sheet propensity
Accession number can be used to obtain more details of the corresponding publication
at http://www.genome.jp/aaindex.
cannot be stated as collectively exhaustive. In other words, there is
in the case of the SH3 dataset at least one peptide P that binds to
domain A, but we have no information whether P binds to domain
B as well (nor have we information whether P not binds to B). That
poses the question how we are supposed to handle any prediction
concerning P and B. If our approach predicts that peptide P binds
to domain B, we have no way to verify or falsify that prediction. In
case of a simple binary class decision problem of one domain, we
are not facing this sort of problem as we possess mutually exclusive
information of all available peptides. This changes, of course, if
we incorporate peptides from another domain, where we have no
information regarding the current examined domain. Nevertheless,
we combined the SH3 datasets into one multi-domain model and
received good results (Table 3). We, therefore, conclude that the
binding information of Tong's dataset is mutually exclusive albeit
this is not explicitly stated.
The PRMs of the PDZ domain family show a virtually similar
consensus pattern of their binding peptides (Table 4), however, our
multi-domain method achieves a very good result (PDZ model:
AUC = 0.89). We suggested an advanced sequence motif made
up of varying position-based amino acid characteristics (Table 5)
as a possible discriminator to separate the peptide ligands of
one PRM from all the others. Note that in our approach the
essential properties like hydrophobicity is not only described by
one parameter ( in Table 4), but by four different hydrophobic
parameters such as hydrophobicity scale, hydrophobic parameter,
surrounding hydrophobicity and hydrophobic packing (Table 5).
The more detailed description of volume and steric properties by the
essential side-chain parameters value of theta (i-1), side-chain angle
theta (AAR), van der Waals volume and residue volume (Table 5)
also contribute to the improved performance. In contrast to binary
classifications where consensus patterns are extracted to accentuate
1637
C.Schillinger et al.
differences between classes or common major characteristics of
one class, such a multi-class discriminator can not only focus on
one class but has to be composed of features whose presence or
absence assign the class belonging. Subsequently, such an advanced
motif emphasizes the characteristics'variety which is comparatively
referred to. Each of the examined PDZ domains can therefore be
defined by its preference for a peculiar specificity of that motif.
Such a multi-classification model can eventually improve the
search for interaction partners preferably binding to the domain
of choice. The complex pattern used for classification might be
used to search for new potential interaction candidates in sequence
databases. There are just few steps needed. First, the whole
sequences are splitted according to the size of peptides the DIF
was trained with. After that the descriptor of the DIF is used to
transform the encoded sequence in an evaluable form. Finally, the
evaluation takes place. Besides, it can be used to narrow down the
complete space of possible peptides for synthesis. Such an approach
might also be useful in suggesting sequences for a specific peptide
library that can be used by experimentalists. Although regulation of
PDZ domain-mediated interactions has been a major focus over the
last 10 years, only a handful of reports describe negative regulation
involving phosphorylation of ligand residues in position -2 (Chung
et al., 2004, Cohen et al., 1996, Chetkovich et al., 2002) position
-3 (Matsuda et al., 1999, Chung et al., 2000) or position -5 (Tian
et al., 2006). The phosphorylated amino acids at positions -2 and
-3 could be simulated with aspartic acid such as reported by Cohen
et al. (1996). Due to the fact that we used the whole amino acid
set during the experiment, we could also take into consideration
potential phosphorylation and therefore incorporate the result in the
search for new binding proteins in vivo. Generally, our proposed
method can be used in the analysis of any other PRM. Predicting
potential peptide ligands for peptide recognition modules would
benefit the understanding of biological networks as new interaction
partners would enhance our knowledge of existing pathways and
system biology of the cell. Considering the good performance of our
proposed method, costs of experiments can be reduced by proposing
a selection of preferably good candidates at which the examination
can focus on.
ACKNOWLEDGEMENTS
We thank D.R. Madden (Department of Biochemistry, Dartmouth
Medical School, Hanover, NH03755) for providing the N1P1 PDZ
domain.
Funding: Research Group 806 of the Deutsche
Forschungsgesellschaft (DFG); Deutsche Forschungsgemeinschaft
(DFG Grant VO885/3-1 to P.B.).
Conflict of Interest: none declared.
REFERENCES
Bernstein,F.C. et al. (1997) The Protein Data Bank: a computer-based archival file for
macromolecular structures. J. Mol. Biol., 112, 535­542.
Bhargava,S. et al. (2002) A complete substitutional analysis of VIP for better tumor
imaging properties. J. Mol. Recognit., 15, 145­153.
Boisguerin,P. et al. (2004) An improved method for the synthesis of cellulose
membrane-bound peptides with free C termini is useful for PDZ domain binding
studies. Chem. Biol., 11, 449­459.
Boisguerin,P. et al. (2007) Characterization of a Putative Phosphorylation Switch:
Adaptation of SPOT Synthesis to Analyze PDZ Domain Regulation Mechanisms.
Chembiochem., 8, 2302­2307.
Bradley,A.P. (1997) The use of the area under ROC curve in the evaluation of the
machine learning algorithms. Pattern Recognit., 30, 1145­1159.
Brannetti,B. et al. (2000) SH3-SPOT: an algorithm to predict preferred ligands
to different members of the SH3 gene family. J. Mol. Biol., 298,
313­328.
Chetkovich,D.M. et al. (2002) Phosphorylation of the postsynaptic density-95
(PSD-95)/Discs Large/Zona Occludens-1 binding site of stargazin regulates
binding to PSD-95 and synaptic targeting of AMPA receptors. J. Neurosci., 22,
5791­5796.
Chung,H.J. et al. (2000) Phosphorylation of the AMPA receptor subunit GluR2
differentially regulates its interaction with PDZ domain-containing proteins.
J. Neurosci., 20, 7258­7267.
Chung,H.J. (2004) Regulation of the NMDA receptor complex and trafficking by
activity-dependent phosphorylation of the NR2B subunit PDZ ligand. J. Neurosci.,
24, 10248­10259.
Cohen,N.A. et al. (1996) Binding of the inward rectifier K channel Kir 2.3 to PSD-95
is regulated by protein kinase A phosphorylation. Neuron, 17, 759­767.
Feng,S. et al. (1994) Two binding orientations for peptides to the Src SH3 domain:
development of a general model for SH3-ligand interactions. Science, 266,
1241­1247.
Ferraro,E. et al. (2006) A novel structure-based encoding for machine-learning
applied to the inference of SH3 domain specificity. Bioinformatics, 22,
2333­2339.
Hall,M.A. (1999) Correlation-based feature subset selection for machine learning.
PhD thesis. Department of Computer Science, University of Waikato. Available
at http://www.cs.waikato.ac.nz/~mhall/thesis.pdf.
Holmes,G. et al. (1994) Weka: a machine learning workbench. In Proceedings of the
1994 Second Australia and New Zealand Conference on Intelligent Information
Systems, pp. 357­361. Available at http://hdl.handle.net/10289/1138
Kawashima,S. and Kanchisa,M. (2000) AAindex: amino acid index database. Nucleic
Acids Res., 28, 374.
Kohavi,R. and Sommerfield, D. (1995) Feature subset selection using the wrapper
method: overfitting and dynamic search space topology. In Proceedings of the First
International Conference on Knowledge Discovery and Data Mining. Available at
https://www.aaai.org/Papers/KDD/1995/KDD95-049.pdf
Landgraf,C. et al. (2004) Protein interaction networks by proteome peptide scanning.
PLoS Biol., 2, e14.
Langely,P. (1994) Selection of relevant features in machine learning. In AAAI Fall
Symposium on Relevance, AAAI Press, CA, USA, pp. 140­144.
Lawrence,C.E. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for
multiple alignment. Science, 262, 208­214.
Lehrach,W.P. et al. (2005) A regularized discriminative model for the prediction of
protein-peptide interactions. Bioinformatics, 22, 532­540.
Lee,D. et al. (2007) Predicting protein function from sequence and structure. Nat. Rev.
Mol. Cell Biol., 8, 995­1005.
Matsuda,S. et al. (1999) Phosphorylation of Serine-880 in GluR2 by protein kinase C
prevents its C terminus from binding with glutamate receptor-interacting protein.
J. Neurochem., 73, 1765­1768.
Mayer,B.J. (2001) SH3 domains: complexity in moderation. J. Cell Sci., 114,
1253­1263.
McLaughlin,M.A. et al. (2006) Prediction of binding sites of peptide recognition
domains: an application on Grb2 and SAP SH2 domains. J. Mol. Biol., 357,
1322­1334.
Nevill-Manning,C.G. et al. (1998) Highly specific protein sequence motifs for genome
analysis. Proc. Natl Acad. Sci. USA, 95, 5865­5871.
Pawson,T. and Nash,P. (2003) Assembly of cell regulatory systems through protein
interaction domains. Science, 18, 445­452.
Reina,J. et al. (2002) Computer-aided design of a PDZ domain to recognize new target
sequences. Nat. Struct. Biol., 9, 621­627.
Reiss,D.J. and Schwikowski,B. (2004) Predicting protein-peptide interactions via a
network-based motif sampler. Bioinformatics, 20, 274­282.
Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display
consensus sequences. Nucleic Acids Res., 18, 6097­6100.
Shoemaker,B.A. and Panchenko,A.R. (2007) Deciphering protein-protein interactions.
Part II. Computational methods to predict protein and domain interaction partners.
PloS Comput. Biol., 3, 595­601.
Songyang,Z. et al. (1997) Recognition of unique carboxyl-terminal motifs by distinct
PDZ domains. Science, 275, 73­77.
1638
Domain Interaction Footprint
Tian,Q.B. et al. (2006) Interaction of LDL receptor-related protein 4 (LRP4) with
postsynaptic scaffold proteins via its C-terminal PDZ domain-binding motif, and
its regulation by Ca2+/calmodulin-dependent protein kinase II. Eur. J. Neurosci.,
23, 2864­2876.
Timothy,L. et al. (2006) MEME: discovering and analyzing DNA and protein sequence
motifs. Nucleic Acids Res., 34, 369­373.
Tong,A.H.Y. et al. (2002) A combined experimental and computational strategy to
define protein interaction networks for peptide recognition modules. Science, 295,
321­324.
Vaccaro, P.and Dente, L. (2002) PDZ domains: troubles in classification. FEBS Lett.,
512, 345­349.
Wiedemann,U. et al. (2004) Quantification of PDZ domain specificity, prediction of
ligand affinity and rational design of super-binding peptides. J. Mol. Biol., 343,
703­718.
1639