[19:47 6/6/2011 Bioinformatics-btr240.tex] Page: i43 i43–i51
BIOINFORMATICS
Vol. 27 ISMB 2011, pages i43–i51
doi:10.1093/bioinformatics/btr240
Piecewise linear approximation of protein structures using the
principle of minimum message length
Arun S. Konagurthu1,∗, Lloyd Allison1,∗, Peter J. Stuckey2 and Arthur M. Lesk3
1Clayton School of Information Technology, Monash University, Clayton, VIC 3800, 2Department of Computer
Science and Software Engineering, The University of Melbourne, Parkville, VIC 3010 Australia and 3Department of
Biochemistry and Molecular Biology and The Huck Institute for Genomics, Proteomics and Bioinformatics,
The Pennsylvania State University, University Park, PA 16802, USA
ABSTRACT
Simple and concise representations of protein-folding patterns
provide powerful abstractions for visualizations, comparisons,
classiﬁcations, searching and aligning structural data. Structures
are often abstracted by replacing standard secondary structural
features—that is, helices and strands of sheet—by vectors or linear
segments. Relying solely on standard secondary structure may result
in a signiﬁcant loss of structural information. Further, traditional
methods of simpliﬁcation crucially depend on the consistency and
accuracy of external methods to assign secondary structures to
protein coordinate data. Although many methods exist to identify
secondary structure automatically, the imprecision of definitions,
along with errors and inconsistencies in experimental structure
data, drastically limits their applicability to generate reliable simplified
representations, especially for structural comparison.
This article introduces a mathematically rigorous algorithm to
delineate protein structure using the elegant statistical and inductive
inference framework of minimum message length (MML). Our
method generates consistent and statistically robust piecewise linear
explanations of protein coordinate data, resulting in a powerful and
concise representation of the structure. The delineation is completely
independent of the approaches of using hydrogen-bonding patterns
or inspecting local substructural geometry that the current methods
use. Indeed, as is common with applications of the MML criterion,
this method is free of parameters and thresholds, in striking contrast
to the existing programs which are often beset by them.
The analysis of results over a large number of proteins suggests
that the method produces consistent delineation of structures
that encompasses, among others, the segments corresponding to
standard secondary structure.
Availability: http://www.csse.monash.edu.au/~karun/pmml.
Contact: arun.konagurthu@monash.edu; lloyd.allison@monash.edu
1 INTRODUCTION
With the rapid growth in the corpus of known structures, concise
representations are increasingly preferred to inspect and analyze
protein folding patterns (Abagyan and Maiorov, 1988; Lesk,
1995; Richardson, 1981; Taylor et al., 1983). At the core of this
simpliﬁcation is the idea that proteins contain repetitive substructural
elements and that the essence of a fold lies in the assembly and
interaction of these elements (Kamat and Lesk, 2007; Konagurthu
and Lesk, 2010; Lesk and Chothia, 1980; Lesk, 1995).
∗To whom correspondence should be addressed.
The appearance of some of these elements arises from the
periodicity in the patterns of hydrogen bonds between backbone
nitrogen and carbonyl groups along the protein polypeptide chain.
Among the standard secondary structure deﬁnitions are: helix
(α-helix, π-helix and 310-helix) and strand of sheet (Edsall et al.,
1966). Ideally, the spatial trace of α-Carbon (Cα) atoms of standard
secondary structure shows a linear trend, allowing them to be
abstracted using vectors or line segments, without much loss of
structural information about the fold. The common practice is to
fit an axis to a helix and a least-squares line to Cα or main chain
atoms of strands of sheet (Chothia et al., 1981; Lesk, 1995).
Replacement of secondary structural elements with line segments
is therefore one of the common methods to abstract protein structures
and construct concise representations of their folding patterns. The
number of standard secondary structural elements observed in
a protein is typically an order of magnitude smaller than the
number of residues in a chain. Therefore methods that utilize
concise representations clearly beneﬁt from a massive space and
computational saving, especially when comparing and analyzing
structures on a large scale (Abagyan and Maiorov, 1988; Konagurthu
et al., 2008; Mizuguchi and Go, 1995; Shi et al., 2007).
Methods that abstract protein structure at the level of secondary
structure generally rely on external programs that can automatically
assign secondary structures to coordinate data. However, accurate
identiﬁcation and assignment of secondary structure is an inexact
process (Cuff and Barton, 1999). Although definitions based on
hydrogen bonding provide some rigor in assigning secondary
structure, the standard deﬁnition of what constitutes a hydrogen
bond is based on the notion of bond energy whose measurement
can be imprecise and acutely sensitive even to small differences in
the position of nitrogen and carbonyl atoms, especially the carbonyl
oxygen positions. Two popular programs that use hydrogen bonding
as a basis for assignment of secondary structure are DSSP (Kabsch
and Sander, 1983) and STRIDE (Frishman and Argos, 1995).
On the other hand, secondary structure can be deﬁned using
geometric features such as distances and dihedral angles of Cα
atoms along the backbone in addition to other local structural
features. In fact, there is a direct correlation between patterns
of hydrogen bonding and the geometry that arises out of them.
However, secondary structural elements can deviate substantially
from ideal geometry, therefore posing severe challenges to detect
such elements using geometric features alone. Among the methods
that rely primarily on geometry to assign secondary structure are
© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
(Dupuis et al., 2004; Labesse et al., 1997; Levitt and Greer, 1977;
Majumdar et al., 2005; Richards and Kundrot, 1988; Sklenar et al.,
1989; Srinivasan and Rose, 1999; Taylor, 2001). (See Majumdar
et al. (2005) for details of popular programs that assign secondary
structural elements.)
We note that previous comparative studies have highlighted the
difficulty existing programs have in assigning secondary structure
consistently to coordinate data, and have proposed using a ‘consensus’
definition (the secondary structure assignment at the intersection
of all the methods) to arrive at a reliable simplification of protein
structures (Colloc’h et al., 1993; Cuff and Barton, 1999).
The main goal for abstracting protein structures must be to achieve
maximal economy of description with minimal loss of structural
information (Taylor, 2001). However, simplifying structures at the
level of standard secondary structure is lossy because the loop
regions are ignored. Therefore, a reliable method that achieves the
above goal and that is tolerant to measurement error and noise is
preferred. Even better would be a method entirely independent of
preconceived notions of what substructures are being sought.
Here, we describe a method that generates principled
abstractions of protein structures. Our method uses the rigorous
statistical framework of minimum message length (MML). In fact,
the realization of the goal to maximize economy and minimize
loss of information ﬁts squarely into the MML criterion, making
it extremely well-suited for this speciﬁc problem. In this work,
we treat a protein as an ordered list of Cα coordinates. Our
method uses an information-theoretic approach to explain as a
line segment the points between any pair of residues in the
structure. Each such explanation is encoded in a certain number
of bits (or code length). Using these code lengths, a globally
optimal explanation is computed which minimizes the total encoded
(message) length of the given coordinate data. The code lengths
contributing to this minimum message length result in the best
piecewise linear approximation of the structure. In stark contrast to
the existing methods, our method is completely free of parameters
and thresholds. We emphasize that our method is not a method for
delineating secondary structures. However, as expected from such
a method, our results show that the line segments generated by
this method correspond well with standard secondary structures of
proteins.
We note that this article generalizes to three dimensions the work
of Banerjee et al. (1996), who described a polygonal approximation
method on general two-dimensional sequences of points.¹ Indeed,
it can be shown that our method described in this paper can be
generalized to arbitrary dimensions and other types of structural
data (over and beyond proteins). We have attempted to keep the
notations in this paper consistent with those described in the work
of Banerjee et al. (1996) for the convenience of the reader.
Section 2 brieﬂy summarizes the MML framework, followed by
Sections 3–6 which describe the mechanics of our approach. Section
7 presents an analysis of the results of our method over a large
number of protein structures.
¹ Banerjee et al. (1996) use a related minimum description length principle
for their approach, which is a technique that was introduced a decade
after Wallace and Boulton (1968) proposed the MML criterion. The two
approaches are significantly different. See Wallace (2005) for a comparison.
2 THE MINIMUM MESSAGE LENGTH
FRAMEWORK
Wallace and Boulton (1968) ﬁrst proposed the theory of MML,
where given a set of competing hypotheses (or models) that
can explain some observed data, the MML criterion provides a
statistically rigorous framework for selecting the best hypothesis
to describe the data. In many ways, MML is a formal information-theoretic
realization of the principle of Occam’s razor.
Assume there are some observed data D and some hypothesis H
that explains the data. From Bayes’s theorem we get
p(H&D)=p(H)×p(D|H)=p(D)×p(H|D)
where p(H&D) is the joint probability of data D and the hypothesis
H, p(H) is the prior probability of hypothesis H, p(D) is the prior
probability of data D, p(H|D) is the posterior probability of H given
D, and p(D|H) is the likelihood.
MML applies the remarkable result from Shannon’s
‘Mathematical Theory of Communication’ (Shannon, 1948)
that, given an event E with a probability p(E), the message length,
l(E) for an optimal code is given by l(E)=−log2 p(E) bits.
Applying this insight to Bayes’s theorem, we get the following
relationship between conditional probabilities in terms of optimal
message lengths.
l(H&D)=l(H)+l(D|H)=l(D)+l(H|D).
The essence of inductive inference is to ﬁt a model to a mass of
observed data. For such an approach it is the hypothesis H with the
largest posterior probability p(H|D) that is often preferred. Among
the terms in the above equation, p(H) (and hence l(H)) can usually be
estimated well for some reasonable prior on hypotheses. At the same
time, the likelihood p(D|H) can also be estimated. But to estimate
the posterior probability distribution p(H|D), the prior of observed
data p(D) will be needed. Estimating p(D) can be problematic and
even impractical. However, for two competing hypotheses, $H$ and
$H'$, we have
$$l(H|D) - l(H'|D) = l(H) + l(D|H) - l(H') - l(D|H'),$$
thereby eliminating the necessity to estimate p(D) completely when
comparing hypotheses.
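The relationship between probabilities and message lengths can be illustrated with a minimal sketch. The function names and the probabilities below are hypothetical and chosen only for illustration; logarithms are taken to base 2 so that lengths are in bits.

```python
import math


def code_length(probability: float) -> float:
    """Optimal message length, in bits, of an event with the given probability."""
    return -math.log2(probability)


def two_part_length(prior_h: float, likelihood: float) -> float:
    """l(H) + l(D|H): total length of stating a hypothesis and then the data."""
    return code_length(prior_h) + code_length(likelihood)


# Hypothetical priors and likelihoods for two competing hypotheses.
length_h1 = two_part_length(prior_h=0.25, likelihood=0.5)
length_h2 = two_part_length(prior_h=0.5, likelihood=0.0625)

# The difference equals l(H1|D) - l(H2|D); p(D) is never needed.
difference = length_h1 - length_h2
preferred = "H1" if difference < 0 else "H2"
```

In this sketch the hypothesis with the shorter two-part message is preferred, which mirrors the comparison in the text without estimating p(D).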
MML is best understood through a communication process where
a transmitter and a receiver are connected through one of Shannon’s
communication channels. The objective is that a transmitter must
send some data D to the receiver. The transmitter and receiver must
have previously agreed on a set of rules (that is, a code book) of
communication using common knowledge and prior expectations.
If the transmitter can ﬁnd a good hypothesis, H∗, to ﬁt the data,
(s)he will be able to transmit the data economically.
In MML, an explanation of the data comes as a two part
message:
(1) transmit the hypothesis H∗ taking l(H∗) bits, and
(2) transmit the observed data D given H∗ taking l(D|H∗) bits.
Such a message paradigm ensures complete transparency in
communication. That is, any information that is not common
knowledge cannot be included except as a part of the message sent by
the transmitter. Otherwise, the message sent will be indecipherable
by the receiver. There can be no hidden parameters in this framework
of communication. In fact, this issue extends to stating and inferring
real-valued parameters to an ‘appropriate’ level of precision, which
is pertinent to the current problem at hand.
The MML framework additionally offers ‘safety’ in that if an
inefﬁcient code is used to encode a message, it can only make the
hypothesis look less attractive than otherwise. Note that MML yields
a natural hypothesis test: the null-model corresponds to transmitting
the data raw. If a stated hypothesis takes longer than what is required
by a null-model, then clearly such a hypothesis is unacceptable. A
more complex hypothesis fits the data better than a simpler model,
in general. We see that MML encoding gives a trade-off between
hypothesis complexity (l(H)), and its goodness of fit to the data
(l(D|H)). Therefore, the MML criterion formally justifies and realises
Occam’s razor.
An important aspect of the MML framework is that it is tolerant
of limited measurement accuracy and noise in the underlying data. For a
justification of this and a comprehensive study of the principle of
MML, see Wallace (2005).
3 FORMULATING THE PROBLEM USING
MINIMUM MESSAGE LENGTH
A protein $P=\{P_1,\cdots,P_n\}$ is a sequence² of $n$ three-dimensional
points corresponding to the coordinates (in $\mathbb{R}^3$) of Cα atoms along
the protein backbone, from its N- to C-terminus.³
Define a piecewise linear approximation of P as a subsequence of
$k \le n$ points from P of the form $Q=\{Q_1 \equiv P_{i_1},\ldots,Q_k \equiv P_{i_k}\}$ such
that $1=i_1<\cdots<i_k=n$, and the first and last points of Q are the
same as the first and last points of P (i.e. $Q_1=P_1$ and $Q_k=P_n$).
Given some subsequence Q of the sequence of points P, the protein
can be approximated (or simplified) using line segments drawn
between every successive pair of points in the subsequence, $Q_r$
and $Q_{r+1}$, $1 \le r < k$. We will use the term delineation to describe
this piecewise linear approximation. Further, we will use the term
endpoint to describe any point in Q. This is because any pair of
consecutive points, $Q_r \equiv P_{i_r}$ and $Q_{r+1} \equiv P_{i_{r+1}}$, form endpoints
for abstracting the points between $P_{i_r}$ and $P_{i_{r+1}}$ (inclusive) in
the protein with a line segment. Note that a subsequence Q with
k endpoints yields a delineation containing k−1 line segments
between successive endpoints.
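The definition above can be made concrete with a short sketch. The function names are hypothetical; indices are 1-based to match the text, and the checks simply restate the constraints $1=i_1<\cdots<i_k=n$.

```python
from typing import List, Tuple


def delineation_segments(endpoint_indices: List[int], n: int) -> List[Tuple[int, int]]:
    """Return the k-1 (i, j) index pairs defined by k endpoint indices (1-based)."""
    if endpoint_indices[0] != 1 or endpoint_indices[-1] != n:
        raise ValueError("the first and last endpoints must be P1 and Pn")
    pairs = list(zip(endpoint_indices, endpoint_indices[1:]))
    if any(i >= j for i, j in pairs):
        raise ValueError("endpoint indices must be strictly increasing")
    return pairs


def intermediate_count(i: int, j: int) -> int:
    """Number of points strictly between Pi and Pj."""
    return j - i - 1


# A hypothetical delineation of a 12-residue chain with k = 4 endpoints.
example = delineation_segments([1, 4, 9, 12], n=12)
```

Successive segments share an endpoint, so k endpoints always give k−1 segments.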
The goal of this article is to find the best delineation of a given set of
coordinate data, where the objective to select the best comes from
defining the problem using the minimum message length criterion.
Consistent with the communication process described in Section 2,
the transmitter explains the data in P with a hypothesis Q and sends
it as a message whose code length is globally minimum over all
possible hypotheses. The receiver will then be able to infer the entire data
P from the received message to a reasonable level of precision using
the general rules they have agreed upon as a part of the code book.
² We use the term sequence in this paper to mean an ordered list. This should
not be confused with the primary sequence of amino acids of a protein.
³ Assume that the protein P is oriented such that $P_1$ is the origin, $P_2$ lies
on the positive x-axis, and $P_3$ lies on the xy-plane. This is one of the many
possible schemes that ensures that our method is invariant to rotation and
translation of the frame-of-reference in which the coordinate data is defined.
(See supplementary note for a detailed discussion on this issue.)
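The orientation convention in the footnote can be sketched as follows. This is an illustrative construction rather than the authors' implementation: the helper names are hypothetical, and the code assumes that the first three points are not collinear.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)


def canonical_orientation(points):
    """Place P1 at the origin, P2 on the positive x-axis and P3 in the xy-plane.

    Assumes P1, P2 and P3 are not collinear (otherwise the frame is undefined).
    """
    p1, p2, p3 = points[0], points[1], points[2]
    ex = unit(sub(p2, p1))                 # new x-axis points from P1 to P2
    ez = unit(cross(ex, sub(p3, p1)))      # new z-axis is normal to the P1-P2-P3 plane
    ey = cross(ez, ex)                     # completes a right-handed orthonormal frame
    return [tuple(dot(sub(p, p1), axis) for axis in (ex, ey, ez)) for p in points]


# Hypothetical coordinates, for illustration only.
raw = [(1.0, 2.0, 3.0), (4.0, 6.0, 3.0), (2.0, 2.0, 7.0), (0.0, 1.0, 5.0)]
oriented = canonical_orientation(raw)
```

Because the new axes are orthonormal, all inter-point distances are preserved by this change of frame.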
For the problem of delineating a structure from coordinate data,
the transmitter will send the following two part message (see
Section 2):
(1) The ﬁrst part is the subsequence of points Q which denotes
the delineation of P. This is equivalent to transmitting the
hypothesis Q in l(Q) bits.
(2) The second part will contain the remainder of points in P
(that is, P − Q) that were not sent in the first part. In other
words, these are the points in P that are between the endpoints
stated in Q. The statement of these points will be encoded as
spatial deviations with respect to the line segments between
endpoints. This is equivalent to transmitting the observed data
P given the hypothesis Q over l(P|Q) bits.
Therefore, as a part of the codebook, the transmitter and receiver
must have agreed upon the encoding of the endpoints in Q and the
encoding of deviations of points P −Q explained by line segments
between successive endpoints in Q. Since the coordinate data of
proteins is available at some ﬁxed precision, the transmitter and
receiver agree on the speciﬁc precision at which the data should be
sent. We emphasize that the encoding of the above should allow the
receiver to decode the message to the agreed precision.
4 CODE LENGTH TO STATE THE DELINEATION
AND DATA UNDER MML CRITERION
In this section, we will discuss the statement and transmission of the
two part message described in Section 3.
4.1 Encoding the ﬁrst part of the message
The ﬁrst part pertains to the transmission of the delineation Q
containing k endpoints. The transmitter must therefore state the
number of points k. There are several optimal universal prefix codes
available to encode integers. Here, we use an asymptotically optimal
Elias omega code which encodes the integral value k in $\log^* k$
bits⁴ (Elias, 1975).
Next, the coordinates of all endpoints are to be encoded. Each
endpoint is a set of three real numbers of the form $\langle x,y,z \rangle$. Published
protein coordinate data contain three putatively significant figures
after the decimal point, in Angstrom (Å) units. The transmitter can
scale this data to one decimal precision and treat the coordinates
as integers. Now, an optimal code to send these coordinates is
for the transmitter to first send the coordinates of a bounding
rectangular box, $\langle x_{min},y_{min},z_{min} \rangle$ and $\langle x_{max},y_{max},z_{max} \rangle$, over all
possible values of x, y and z in the given data. Once this bounding
box is specified, any (x,y,z) coordinates within the box can be
coded in
$$\log(x_{max}-x_{min}) + \log(y_{max}-y_{min}) + \log(z_{max}-z_{min}) = \log\left[(x_{max}-x_{min})(y_{max}-y_{min})(z_{max}-z_{min})\right] = \log V$$
bits, where V is the volume of the bounding rectangular box. It follows from here
that all the k endpoints in Q can be stated in $k\log V$ bits.⁵
Therefore, the message length to state the first part of the
transmission requires $\log^* k + k\log V$ bits.
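The length of the first part can be sketched numerically. The function names below are hypothetical; the sketch assumes base-2 logarithms (so that lengths are in bits), coordinates scaled by 10 to integers as described above, and a minimum box extent of one unit per axis to avoid taking the logarithm of zero (that floor is an assumption of this sketch, not part of the text).

```python
import math


def log_star(x: float) -> float:
    """log* x = log x + log log x + ..., summing only the positive terms (base 2)."""
    total = 0.0
    term = math.log2(x)
    while term > 0:
        total += term
        term = math.log2(term)
    return total


def bounding_box_log_volume(points, scale: int = 10) -> float:
    """log2 V of the axis-aligned bounding box of the integer-scaled coordinates."""
    total = 0.0
    for axis in range(3):
        values = [round(p[axis] * scale) for p in points]
        extent = max(max(values) - min(values), 1)  # assumed floor of one unit
        total += math.log2(extent)
    return total


def first_part_length(k: int, log_v: float) -> float:
    """log* k + k log V: bits needed to state k endpoints."""
    return log_star(k) + k * log_v


# Hypothetical two-point example: box extents of 16, 8 and 4 units.
toy_points = [(0.0, 0.0, 0.0), (1.6, 0.8, 0.4)]
toy_log_v = bounding_box_log_volume(toy_points)
toy_length = first_part_length(2, toy_log_v)
```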
⁴ $\log^* x = \log x + \log\log x + \cdots$ (over all positive terms)
⁵ Note that the coordinates of the bounding rectangular box are constant
given the data, so they can be ignored at least for the purposes of comparing
two hypotheses.
Fig. 1. Deviations s, t and u of intermediate points $P_{i+1},\cdots,P_{j-1}$ to a line
segment between two endpoints $P_i$ and $P_j$. (Refer to main text.)
4.2 Encoding the second part of the message
In the second part, the transmitter has to encode the data, P − Q,
between endpoints stated in the first part of the message. For a
successive pair of endpoints $Q_r \equiv P_i$, $Q_{r+1} \equiv P_j$, $1 \le i < j \le n$, $1 \le r < k$
in Q, there are j−i−1 intermediate points between $P_i$ and $P_j$
in P. In this work, these intermediate data points will be treated as
noisy samples and will be stated as a set of spatial deviations with
respect to the line segment between $P_i$ and $P_j$.
If such a scheme is used to communicate the second part of
the message, for each line segment in Q between successive
endpoints, the second part of the message will encode the following
information:
(1) the number of points explained by the line segment.
(2) three spatial deviations for each intermediate point with
respect to the line that will allow the receiver to recover the
original location of the intermediate point up to a reasonable
approximation.
(3) the parameters of the probability distribution associated
with each of the three sets of spatial deviations, over all
intermediate points.
To explain the encoding of this part more clearly, consider Fig. 1.
Let $L_{ij}$ denote the line segment between two successive endpoints
in Q, $Q_r \equiv P_i$ and $Q_{r+1} \equiv P_j$. This line will be used to explain
the intermediate points $P_{i+1},\cdots,P_{j-1} \in P$. For any intermediate point
$P_r$, $i+1 \le r \le j-1$, define three spatial deviations $s_r$, $t_r$ and $u_r$.
In the reverse order, $u_r$ is the signed distance of $P_r$ to the plane
defined by the vector $P_j - P_i$ and the z-axis. To define $t_r$, first project $P_r$
onto the plane defined above. Call this projection point $P'_r$. Given this
projection, $t_r$ is the signed perpendicular distance of $P'_r$ to the line
$L_{ij}$. Finally, the deviation $s_r$ is the (unsigned) lateral distance along
the line $L_{ij}$ between points of projection of $P_{r-1}$ and $P_r$ onto the
line (Fig. 1). (Refer to the supplementary note containing a discussion
on these deviations under arbitrary rotation of the coordinates.) Note
that once the endpoints $P_i$ and $P_j$ are specified, and given the sets
of spatial deviations $s_r$'s, $t_r$'s and $u_r$'s for the intermediate points
$P_r$, $\forall\, i<r<j$, the receiver can entirely recover the coordinates of all
intermediate points.
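One way to realise these three deviations is sketched below. It is an illustration under stated assumptions, not the authors' code: the function names and the sign conventions of the frame are hypothetical, $s_r$ is computed as a signed difference of successive projections (as in the dot-product form used in Section 6.1), and the segment is assumed not to be parallel to the z-axis, where the plane is undefined.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)


def segment_frame(pi, pj):
    """Orthonormal frame (d, w, n): d along the line, n normal to the plane
    spanned by d and the z-axis, w in that plane and perpendicular to d."""
    d = unit(sub(pj, pi))
    n = unit(cross(d, (0.0, 0.0, 1.0)))  # undefined if the segment is parallel to z
    w = cross(n, d)
    return d, w, n


def deviations(points, i, j):
    """(s, t, u) for each point strictly between points[i] and points[j] (0-based)."""
    d, w, n = segment_frame(points[i], points[j])
    result, previous_along = [], 0.0
    for r in range(i + 1, j):
        rel = sub(points[r], points[i])
        along = dot(rel, d)
        result.append((along - previous_along, dot(rel, w), dot(rel, n)))
        previous_along = along
    return result


def reconstruct(pi, pj, devs):
    """Recover the intermediate points from the endpoints and the deviations."""
    d, w, n = segment_frame(pi, pj)
    along, out = 0.0, []
    for s, t, u in devs:
        along += s
        out.append(tuple(pi[a] + along * d[a] + t * w[a] + u * n[a] for a in range(3)))
    return out
```

The `reconstruct` function illustrates the closing remark of the paragraph: given the two endpoints and the three sets of deviations, the intermediate coordinates can be recovered.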
In this work, we assume the three spatial deviations s's, t's
and u's of the intermediate points to be independent and normally
distributed. Individual variables of each distribution are considered
independent and random. (See supplementary note for a discussion
on these assumptions.) Given these assumptions we have three
distributions of the form: $s \sim N(\mu_s,\sigma_s^2)$, $t \sim N(\mu_t,\sigma_t^2)$, and
$u \sim N(\mu_u,\sigma_u^2)$, where $\mu$ and $\sigma^2$ are the mean and variance of
the respective normal distributions. For the structural coordinate
data, we assume that the mean of the distributions of t's and u's
is zero: $t \sim N(0,\sigma_t^2)$, and $u \sim N(0,\sigma_u^2)$. Therefore, to communicate
the three distributions, the transmitter has to state the following four
parameters: $\langle \mu_s,\sigma_s^2,\sigma_t^2,\sigma_u^2 \rangle$.
Consider the calculations of these parameters. For the line $L_{ij}$,
there are j−i−1 intermediate points. Represent this quantity by the
variable $m_{ij}$. Then
$$\mu_s = \frac{\sum_{r=i+1}^{j-1} s_r}{m_{ij}} \approx \frac{\sum_{r=i+1}^{j} s_r}{m_{ij}+1} = \frac{D_{ij}}{m_{ij}+1},$$
where $D_{ij}$ is the Euclidean distance between $P_i$ and $P_j$. Note that
once the endpoints are transmitted (see Section 4.1), the receiver
can deduce the value of $\mu_s$, requiring no explicit statement for this
parameter in the message. This reduces the number of parameters
to be stated from four to three: $\langle \sigma_s^2,\sigma_t^2,\sigma_u^2 \rangle$.
We will now compute the code lengths to state the variance of the
three normal distributions. Variance for a Gaussian distribution is
simply ‘mean of the squares minus the square of the mean’:
$$\sigma_s^2 = \frac{\sum_{r=i+1}^{j-1} (s_r-\mu_s)^2}{m_{ij}} = \frac{\sum_{r=i+1}^{j-1} s_r^2}{m_{ij}} - \mu_s^2$$
Similarly, we have $\sigma_t^2 = \frac{\sum_{r=i+1}^{j-1} t_r^2}{m_{ij}}$ and $\sigma_u^2 = \frac{\sum_{r=i+1}^{j-1} u_r^2}{m_{ij}}$, since $\mu_t = \mu_u = 0$.
We note that the code length for each parameter varies with
$\frac{1}{2}\log m_{ij}$ bits. [See Chapter 5 of (Wallace, 2005)].
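The parameter estimates above can be sketched as follows. The function names are hypothetical. The sketch uses the first form of the variance, $\sum (s_r-\mu_s)^2 / m_{ij}$, because $\mu_s$ is the approximation $D_{ij}/(m_{ij}+1)$ rather than the exact sample mean, and this form cannot become negative; base-2 logarithms are assumed for the parameter cost.

```python
import math


def segment_parameters(devs, d_ij):
    """Estimate (mu_s, var_s, var_t, var_u) for one line segment.

    devs  -- list of (s, t, u) deviations of the m intermediate points
    d_ij  -- Euclidean distance between the two endpoints
    """
    m = len(devs)
    mu_s = d_ij / (m + 1)  # deducible by the receiver from the endpoints
    var_s = sum((s - mu_s) ** 2 for s, _, _ in devs) / m
    var_t = sum(t * t for _, t, _ in devs) / m  # mean of t assumed zero
    var_u = sum(u * u for _, _, u in devs) / m  # mean of u assumed zero
    return mu_s, var_s, var_t, var_u


def parameter_cost(m: int) -> float:
    """Bits to state the three variances, each costing (1/2) log2 m."""
    return 3 * 0.5 * math.log2(m)


# Hypothetical deviations of two intermediate points on a segment of length 3.
params = segment_parameters([(1.0, 0.5, -1.0), (1.0, 0.25, 0.5)], d_ij=3.0)
```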
With the parameters of the distributions encoded, we will now
compute the code lengths required to state the individual values of
s's. Since we have assumed that the distribution is a Gaussian, the
probability distribution of a random variable $s_r$ with parameters
$\mu_s$ and $\sigma_s^2$ is given by:
$$s_r \sim N(\mu_s,\sigma_s^2) = \frac{1}{\sqrt{2\pi}\,\sigma_s}\, e^{-\frac{(s_r-\mu_s)^2}{2\sigma_s^2}}$$
Since we assumed that variables are independent, we have
$$p\left(s_{i+1},\ldots,s_{j-1} \mid N(\mu_s,\sigma_s^2)\right) = \prod_{r=i+1}^{j-1} \frac{1}{\sqrt{2\pi}\,\sigma_s}\, e^{-\frac{(s_r-\mu_s)^2}{2\sigma_s^2}}$$
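This implies, since $\sum_{r=i+1}^{j-1}(s_r-\mu_s)^2 = m_{ij}\,\sigma_s^2$ by the definition of the variance,
$$p\left(s_{i+1},\ldots,s_{j-1} \mid N(\mu_s,\sigma_s^2)\right) = \left(\frac{1}{\sqrt{2\pi}\,\sigma_s}\right)^{m_{ij}} e^{-\frac{m_{ij}}{2}}.$$
Therefore, using Shannon's insight, the optimal code length
to describe the entire set of individual deviations s's for
a line $L_{ij}$ will require
$$-\log p\left(s_{i+1},\ldots,s_{j-1} \mid N(\mu_s,\sigma_s^2)\right) = -\log\left[\left(\frac{1}{\sqrt{2\pi}\,\sigma_s}\right)^{m_{ij}} e^{-\frac{m_{ij}}{2}}\right] = \frac{m_{ij}}{2}\log\left(2\pi e\,\sigma_s^2\right)$$
bits.
Following a similar expansion, we can show that the code
lengths for the deviations $t_r$'s and $u_r$'s are $\frac{m_{ij}}{2}\log\left(2\pi e\,\sigma_t^2\right)$ and
$\frac{m_{ij}}{2}\log\left(2\pi e\,\sigma_u^2\right)$, respectively.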
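The closed form above can be checked numerically with a small sketch. The function names and sample values are hypothetical, and base-2 logarithms are assumed. The variance is taken around the supplied mean, which is the condition under which the exponent collapses to $-m_{ij}/2$.

```python
import math


def variance_about(values, mu):
    """Mean squared deviation of the values about mu."""
    return sum((v - mu) ** 2 for v in values) / len(values)


def gaussian_code_length(values, mu):
    """Closed form: (m / 2) * log2(2 * pi * e * sigma^2)."""
    m = len(values)
    return 0.5 * m * math.log2(2 * math.pi * math.e * variance_about(values, mu))


def negative_log_likelihood_bits(values, mu):
    """Direct sum of -log2 of the Gaussian density at each value."""
    var = variance_about(values, mu)
    total = 0.0
    for v in values:
        density = math.exp(-((v - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
        total -= math.log2(density)
    return total


# Hypothetical lateral deviations around an assumed mean of 1.0.
sample = [0.8, 1.1, 1.3, 0.9]
closed_form = gaussian_code_length(sample, mu=1.0)
direct_sum = negative_log_likelihood_bits(sample, mu=1.0)
```

The two quantities agree up to floating-point rounding, which is the identity used in the derivation.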
So far in this second part, we have computed the code lengths
required to state intermediate points explained by the line $L_{ij}$. Note
that a delineation of a structure containing k endpoints defines
k−1 such line segments. For convenience in notation, assume the
endpoints of each line segment are of the form $(P_{i_r},P_{j_r})$, $1 \le r < k$. (In
practice, for a delineation, $P_{i_r}$ of the r-th line segment is equivalent
to $P_{j_{r-1}}$ of the (r−1)-th line segment.) Then the total code length of the
second part is the sum of the following terms:
(1) $\sum_{r=1}^{k-1} \log^* m_{ij}^r$ bits, where $m_{ij}^r = j_r - i_r - 1$, representing the total
code length to encode the number of intermediate points
described by all line segments in the delineation put together.
(2) $\sum_{r=1}^{k-1} \frac{3}{2}\log m_{ij}^r$ bits to encode the parameters (three per
line segment) corresponding to the distribution of spatial
deviations for all lines.
(3) $\sum_{r=1}^{k-1} \frac{m_{ij}^r}{2}\log\left(2\pi e\,(\sigma_s^r)^2\right)$ bits to encode the $s_r$'s over all line
segments.
(4) $\sum_{r=1}^{k-1} \frac{m_{ij}^r}{2}\log\left(2\pi e\,(\sigma_t^r)^2\right)$ bits to encode the $t_r$'s over all line
segments.
(5) $\sum_{r=1}^{k-1} \frac{m_{ij}^r}{2}\log\left(2\pi e\,(\sigma_u^r)^2\right)$ bits to encode the $u_r$'s over all line
segments.
4.3 Problem statement
Given a delineation Q (hypothesis) of coordinates P (data), denote
the total message length required to explain the data by the
hypothesis as H(Q). Combining the code lengths to state the two
part message described in Sections 4.1 and 4.2, the total message
length is:
$$H(Q) = \log^* k + k\log V + \sum_{r=1}^{k-1} \log^* m_{ij}^r + \sum_{r=1}^{k-1} \frac{3}{2}\log m_{ij}^r + \sum_{r=1}^{k-1} \frac{m_{ij}^r}{2}\log\left(2\pi e\,(\sigma_s^r)^2\right) + \sum_{r=1}^{k-1} \frac{m_{ij}^r}{2}\log\left(2\pi e\,(\sigma_t^r)^2\right) + \sum_{r=1}^{k-1} \frac{m_{ij}^r}{2}\log\left(2\pi e\,(\sigma_u^r)^2\right) \tag{1}$$
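Since $\log^* k \ll k\log V$, the transmitter can ignore stating that term in
the code length. Assume
$$H_{ij}^r = \log V + \log^* m_{ij}^r + \frac{3}{2}\log m_{ij}^r + \frac{m_{ij}^r}{2}\log\left(2\pi e\,(\sigma_s^r)^2\right) + \frac{m_{ij}^r}{2}\log\left(2\pi e\,(\sigma_t^r)^2\right) + \frac{m_{ij}^r}{2}\log\left(2\pi e\,(\sigma_u^r)^2\right) \tag{2}$$
$H_{ij}^r$ denotes the component code length to express each line segment
$L_{ij}^r$ with endpoints $P_{i_r}$ and $P_{j_r}$, given a delineation Q. This implies
$$H(Q) = \sum_{r=1}^{k-1} H_{ij}^r$$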
This allows us to formally define the delineation problem as follows:
The problem:
Given P containing a sequence of n points, find a subsequence $Q \subseteq P$
containing $k \le n$ points such that the total message length to explain
P with Q, $H(Q) = \sum_{r=1}^{k-1} H_{ij}^r$, is globally minimum.
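Equation (2) can be sketched as a single function. The names are hypothetical, base-2 logarithms are assumed, and two edge cases not discussed in the text are handled by assumption: a segment with no intermediate points is charged only the endpoint cost, and non-positive variances are rejected rather than floored.

```python
import math


def log_star(x: float) -> float:
    """log* x = log x + log log x + ..., summing only the positive terms (base 2)."""
    total, term = 0.0, math.log2(x)
    while term > 0:
        total += term
        term = math.log2(term)
    return total


def segment_code_length(log_v, m, var_s, var_t, var_u):
    """Component code length of one line segment, following Equation (2)."""
    if m == 0:
        return log_v  # assumption: nothing to state besides the endpoint
    if min(var_s, var_t, var_u) <= 0:
        raise ValueError("this sketch requires strictly positive variances")
    length = log_v + log_star(m) + 1.5 * math.log2(m)
    for var in (var_s, var_t, var_u):
        length += 0.5 * m * math.log2(2 * math.pi * math.e * var)
    return length


def total_code_length(segment_lengths):
    """H(Q) as the sum of the per-segment component lengths."""
    return sum(segment_lengths)


# Hypothetical variance chosen so that each Gaussian term contributes zero bits.
unit_var = 1.0 / (2 * math.pi * math.e)
example_length = segment_code_length(10.0, 4, unit_var, unit_var, unit_var)
```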
5 FINDING THE OPTIMAL DELINEATION
This section will describe the procedure to compute the optimal
delineation Q∗ for the given coordinate data. Broadly, the search for
the optimal delineation has two steps.
Potentially every pair of points Pi and Pj, 1≤i<j≤n can be a
part of the delineation in Q∗. (We note here that the segments in
the delineation must not overlap, except for successive regions, and
those only at their endpoints.) Therefore, we will first build a matrix
$H = (H_{ij})_{1 \le i < j \le n}$ of code lengths for all possible pairs of points
in P.
Then, the matrix H will be used to ﬁnd a subsequence of points
Q∗ such that the total code length H(Q∗) of the delineation is
minimized, using a one-dimensional dynamic program.
5.1 Computation of code length over all possible
segments
Equation (2) expresses the message length Hij required to describe
any line segment Lij between two points Pi and Pj. We will examine
the complexity of computing each of the components that constitute
Equation (2).
For the n points in P, there are $\binom{n}{2} = \frac{n(n-1)}{2}$ possible line
segments. The $\log V$ term in Equation (2) is constant across all
possible segments and is computed once while reading the data
points of P. Next, for each line segment, there are three parameters
whose code lengths depend on the number of points in between the
endpoints. This is trivially computed in constant time as j−i−1.
The relatively complex part is to compute the code lengths of
the spatial deviations of the line, s's, t's and u's. Each of these
three deviations has a code length that depends on its respective
variance, $\sigma_s^2$, $\sigma_t^2$ and $\sigma_u^2$. While one can compute the variance of
each set of deviations from the coordinate data, such a computation
is linear in the number of points that each line segment explains. If
this naive approach is followed, the computation of H requires $O(n^3)$
operations. We will show in Section 6 that this is redundant
and that H can be computed in
$O(n^2)$ operations, by computing the variances of all three spatial
deviations incrementally from previous computations using a set of
sufﬁcient statistics. But before that we will describe the method to
compute the optimal delineation given the matrix H.
5.2 Optimal delineation as a one-dimensional dynamic
program
Dynamic programming is perfectly suited when dealing
with problems that contain sequential constraints, where the
solutions to the subproblems have a recursive overlapping
substructure (Bellman, 1957). The problem statement in
Section 4.3 is an ideal candidate for the search strategy of
dynamic programming. Since a delineation is a subsequence which
preserves the linear ordering of its elements, the optimal delineation
of the given data can be derived by computing and memoizing (i.e.
caching) the optimal delineation of its subproblems.
We will use the matrix H of code length between all possible
endpoints to ﬁnd the optimal delineation Q∗ that minimizes H(Q∗)
using a one-dimensional dynamic program.
Let $C_i$ be an array that stores the optimal code length of
delineating points $P_1,\ldots,P_i$, $\forall\, 1 \le i \le n$. The objective is to find
the delineation of the given points where $C_n$ is minimum over all
possible subsequences of the given points. Therefore, the recurrence
relationship of optimal costs using a one-dimensional dynamic
program is as follows:
$$C_1 = 0, \qquad C_j = \min_{i=1}^{j-1} \left\{ H_{1j},\ (C_i + H_{ij}) \right\}, \quad \forall\, 1 < j \le n$$
In other words, the optimal code length to delineate the points
P1,...,Pj (1≤j≤n) builds on the optimal code length to delineate
from P1,...,Pi, if and only if the value of Ci plus the code length
to state a new line segment Hij is minimum, over all 1≤i<j.
Using the above relationship, the array C is ﬁlled iteratively
from 1 to n. Upon completion, the value Cn gives the optimal
message length corresponding to the best delineation Q∗ of P, where
H(Q∗)≡Cn is globally minimum. The subsequence of endpoints of
this optimal delineation can be computed by storing, for each j, the
back pointer i<j of the array from which the optimal value Cj was
derived. With these back pointers, a simple traceback from $C_n$ (until
$C_1$ is reached) gives the set of endpoints (in reverse order) that form
the best delineation Q∗.
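The recurrence and traceback can be sketched as follows. This is a minimal illustration with hypothetical names; the code lengths are supplied as a dictionary keyed by 1-based index pairs (i, j) with i < j, which is an assumption of this sketch rather than a data structure prescribed by the text.

```python
from typing import Dict, List, Tuple


def optimal_delineation(n: int, h: Dict[Tuple[int, int], float]) -> Tuple[float, List[int]]:
    """Return (C_n, endpoint indices) minimising the summed segment code lengths."""
    cost = [0.0] * (n + 1)   # cost[1] = 0; index 0 is unused
    back = [0] * (n + 1)     # back[j] = predecessor endpoint of j on the best path
    for j in range(2, n + 1):
        best_i = min(range(1, j), key=lambda i: cost[i] + h[(i, j)])
        cost[j] = cost[best_i] + h[(best_i, j)]
        back[j] = best_i
    endpoints = [n]
    while endpoints[-1] != 1:            # trace back from C_n until C_1 is reached
        endpoints.append(back[endpoints[-1]])
    endpoints.reverse()
    return cost[n], endpoints


# Hypothetical code lengths for a four-point chain.
toy_h = {(1, 2): 1.0, (2, 3): 1.0, (3, 4): 1.0,
         (1, 3): 5.0, (2, 4): 1.5, (1, 4): 4.0}
best_cost, best_endpoints = optimal_delineation(4, toy_h)
```

Filling the array takes a number of steps quadratic in n, since each $C_j$ inspects all earlier positions once. Because $C_1 = 0$, the case i = 1 already covers the $H_{1j}$ term of the recurrence.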
6 EFFICIENT COMPUTATION OF MATRIX H
As mentioned in Section 5.1 the matrix of code lengths H can be
computed efﬁciently in O(n2) operations and this section will show
how this can be achieved.
For the matrix H to be computable in O(n2) operations, each
element Hij in the matrix should be computable in constant time.
However, the terms $(\sigma_s^r)^2$, $(\sigma_t^r)^2$ and $(\sigma_u^r)^2$ in Equation (2) cannot be
computed in constant time directly. For a line segment $L_{ij}$, computed naively, these
three variances take time proportional to the number of points
explained by the line, leading to an $O(n^3)$ algorithm for
computing the matrix H. Below we will show that each of $(\sigma_s^r)^2$,
$(\sigma_t^r)^2$ and $(\sigma_u^r)^2$ can indeed be computed incrementally and in constant
time from previous computations, resulting in an $O(n^2)$ algorithm.
6.1 Constant-time update of $\sigma_s^2$'s
Consider first these notations: for any vector $v$ with direction ratios
$\langle x, y, z \rangle$, let $\|v\| \equiv \sqrt{x^2 + y^2 + z^2}$ represent the vector norm of $v$. Let
any point $P_i \in P$ have direction ratios of the form $\langle x_i, y_i, z_i \rangle$.
By the definitions of the spatial deviations in Section 4.2, any
$s_r$, $1 \le i < r < j \le n$, is the scalar associated with the projection of the
vector $(P_r - P_{r-1})$ onto the vector $L_{ij} \equiv (P_j - P_i)$ (refer to Fig. 1). Let
$\hat{L}_{ij} = \langle \hat{L}^x_{ij}, \hat{L}^y_{ij}, \hat{L}^z_{ij} \rangle$ represent the direction cosines of the vector $L_{ij}$,
where
\[
\hat{L}^x_{ij} = \frac{x_j - x_i}{\|L_{ij}\|}, \qquad
\hat{L}^y_{ij} = \frac{y_j - y_i}{\|L_{ij}\|}, \qquad
\hat{L}^z_{ij} = \frac{z_j - z_i}{\|L_{ij}\|}.
\]
Then $s_r$ is the dot product of $(P_r - P_{r-1})$ and $\hat{L}_{ij}$: $s_r = (P_r - P_{r-1}) \cdot \hat{L}_{ij}$. Expanding
this we get,
\[
s_r = (x_r - x_{r-1})\hat{L}^x_{ij} + (y_r - y_{r-1})\hat{L}^y_{ij} + (z_r - z_{r-1})\hat{L}^z_{ij}
\]
Denoting $S_{ij} = \sum_{r=i+1}^{j-1} s_r^2$,
\[
S_{ij} = \sum_{r=i+1}^{j-1} \left[ (x_r - x_{r-1})\hat{L}^x_{ij} + (y_r - y_{r-1})\hat{L}^y_{ij} + (z_r - z_{r-1})\hat{L}^z_{ij} \right]^2
\]
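Expanding $S_{ij}$,
\[
\begin{aligned}
S_{ij} ={}& (\hat{L}^x_{ij})^2 \sum_{r=i+1}^{j-1} (x_r - x_{r-1})^2
 + (\hat{L}^y_{ij})^2 \sum_{r=i+1}^{j-1} (y_r - y_{r-1})^2 \\
&+ (\hat{L}^z_{ij})^2 \sum_{r=i+1}^{j-1} (z_r - z_{r-1})^2 \\
&+ 2\hat{L}^x_{ij}\hat{L}^y_{ij} \sum_{r=i+1}^{j-1} (x_r - x_{r-1})(y_r - y_{r-1}) \\
&+ 2\hat{L}^y_{ij}\hat{L}^z_{ij} \sum_{r=i+1}^{j-1} (y_r - y_{r-1})(z_r - z_{r-1}) \\
&+ 2\hat{L}^x_{ij}\hat{L}^z_{ij} \sum_{r=i+1}^{j-1} (x_r - x_{r-1})(z_r - z_{r-1})
\end{aligned}
\tag{3}
\]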
Now, let $S^{xx}_{ij}$, $S^{yy}_{ij}$, $S^{zz}_{ij}$, $S^{xy}_{ij}$, $S^{yz}_{ij}$, $S^{xz}_{ij}$ be a set of variables which we
will call here sufficient statistics. These variables are of the form:
\[
S^{AB}_{ij} = \sum_{r=i+1}^{j-1} (A_r - A_{r-1})(B_r - B_{r-1}),
\]
where A and B take the values {x, y, z}.
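Expressing Equation (3) in terms of the sufficient statistics, we
get
\[
\begin{aligned}
S_{ij} ={}& (\hat{L}^x_{ij})^2 S^{xx}_{ij} + (\hat{L}^y_{ij})^2 S^{yy}_{ij} + (\hat{L}^z_{ij})^2 S^{zz}_{ij} + 2\hat{L}^x_{ij}\hat{L}^y_{ij} S^{xy}_{ij} \\
&+ 2\hat{L}^y_{ij}\hat{L}^z_{ij} S^{yz}_{ij} + 2\hat{L}^x_{ij}\hat{L}^z_{ij} S^{xz}_{ij}
\end{aligned}
\tag{4}
\]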
From Equation (4) it can be clearly seen that any $S_{i,j+1}$ can be
updated from $S_{ij}$ in constant time, using the sufficient statistics.
This holds because any $S^{AB}_{i,j+1} = S^{AB}_{ij} + (A_j - A_{j-1})(B_j - B_{j-1})$, where
$A, B \in \{x, y, z\}$.
Therefore, using the sufficient statistics, $\sigma_s$
for a line segment can be computed incrementally in constant time.
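The update behind Equation (4) can be checked numerically with a short Python sketch. It is an illustration only and not the authors' C++ code: the names `s_naive` and `s_row_incremental` are hypothetical, points are 0-indexed (x, y, z) tuples, and the end points of every segment are assumed to be distinct so that the norm of L_ij is non-zero.

```python
import math

# Index pairs for the six sufficient statistics: xx, yy, zz, xy, yz, xz.
PAIRS = [(0, 0), (1, 1), (2, 2), (0, 1), (1, 2), (0, 2)]


def direction_cosines(P, i, j):
    """Direction cosines of L_ij = P_j - P_i."""
    L = [P[j][k] - P[i][k] for k in range(3)]
    norm = math.sqrt(sum(c * c for c in L))
    return [c / norm for c in L]


def s_naive(P, i, j):
    """S_ij by direct summation over r = i+1 .. j-1 (linear time per entry)."""
    Lhat = direction_cosines(P, i, j)
    total = 0.0
    for r in range(i + 1, j):
        s_r = sum((P[r][k] - P[r - 1][k]) * Lhat[k] for k in range(3))
        total += s_r * s_r
    return total


def s_row_incremental(P, i):
    """Return {j: S_ij} for all j > i with constant work per j."""
    stats = {ab: 0.0 for ab in PAIRS}  # empty sums when j = i + 1
    row = {}
    for j in range(i + 1, len(P)):
        Lhat = direction_cosines(P, i, j)
        # Equation (4): quadratic form in the direction cosines.
        row[j] = sum((1 if a == b else 2) * Lhat[a] * Lhat[b] * stats[(a, b)]
                     for a, b in PAIRS)
        # S^{AB}_{i,j+1} = S^{AB}_{ij} + (A_j - A_{j-1})(B_j - B_{j-1})
        d = [P[j][k] - P[j - 1][k] for k in range(3)]
        for a, b in PAIRS:
            stats[(a, b)] += d[a] * d[b]
    return row
```

Only the six running sums are carried from j to j+1. The direction cosines are recomputed for each new end point, which is also a constant-time step.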
6.2 Constant-time update of $\sigma_t^2$'s
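Let $n_1$ be the normal to a plane defined by $\hat{z} \times L_{ij}$, where $\hat{z}$ is the unit
vector along the z-axis with the direction cosines $\langle 0, 0, 1 \rangle$. It follows that
the direction ratios of $n_1$ are $\langle -(y_j - y_i), (x_j - x_i), 0 \rangle$.
Define $n_2$ as a vector which is normal to the plane $L_{ij} \times n_1$. The
direction ratios of $n_2$ will be:
\[
\langle -(x_j - x_i)(z_j - z_i),\; -(y_j - y_i)(z_j - z_i),\; (x_j - x_i)^2 + (y_j - y_i)^2 \rangle
\]
Let $\hat{n}_2 = \langle \hat{n}^x_2, \hat{n}^y_2, \hat{n}^z_2 \rangle$ represent the direction cosines of $n_2$, where
\[
\hat{n}^x_2 = \frac{-(x_j - x_i)(z_j - z_i)}{\|n_2\|}, \qquad
\hat{n}^y_2 = \frac{-(y_j - y_i)(z_j - z_i)}{\|n_2\|}, \qquad
\hat{n}^z_2 = \frac{(x_j - x_i)^2 + (y_j - y_i)^2}{\|n_2\|}.
\]
Then $t_r = (P_r - P_i) \cdot \hat{n}_2$ (refer to Fig. 1). This implies
\[
t_r = (x_r - x_i)\hat{n}^x_2 + (y_r - y_i)\hat{n}^y_2 + (z_r - z_i)\hat{n}^z_2
\]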
Let $T_{ij} = \sum_{r=i+1}^{j-1} t_r^2$. Expanding along the steps we took in
the previous section, we get
\[
\begin{aligned}
T_{ij} ={}& (\hat{n}^x_2)^2 T^{xx}_{ij} + (\hat{n}^y_2)^2 T^{yy}_{ij} + (\hat{n}^z_2)^2 T^{zz}_{ij} + 2\hat{n}^x_2\hat{n}^y_2 T^{xy}_{ij} \\
&+ 2\hat{n}^y_2\hat{n}^z_2 T^{yz}_{ij} + 2\hat{n}^x_2\hat{n}^z_2 T^{xz}_{ij}
\end{aligned}
\tag{5}
\]
where any $T_{i,j+1}$ can be updated from $T_{ij}$ in constant
time.
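The text does not write out the sufficient statistics behind Equation (5). Following the pattern of Section 6.1, and noting that $t_r$ is measured from the fixed start point $P_i$ rather than from $P_{r-1}$, one plausible form (an assumption, not stated in the source) is:

```latex
T^{AB}_{ij} = \sum_{r=i+1}^{j-1} (A_r - A_i)(B_r - B_i),
\qquad A, B \in \{x, y, z\},
\\
T^{AB}_{i,j+1} = T^{AB}_{ij} + (A_j - A_i)(B_j - B_i).
```

Under this reading, extending a segment by one point adds a single product to each of the six statistics. The direction cosines of $n_2$ are then recomputed from the new end point and substituted into Equation (5).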
6.3 Constant-time update of $\sigma_u^2$'s
We have seen above that $n_1 = \langle -(y_j - y_i), (x_j - x_i), 0 \rangle$. Let $\hat{n}_1 = \langle \hat{n}^x_1, \hat{n}^y_1, 0 \rangle$
represent the direction cosines of $n_1$, where
\[
\hat{n}^x_1 = \frac{-(y_j - y_i)}{\|n_1\|}, \qquad
\hat{n}^y_1 = \frac{(x_j - x_i)}{\|n_1\|}.
\]
(Note $\hat{n}^z_1 = 0$.)
Then $u_r = (P_r - P_i) \cdot \hat{n}_1$ (refer to Fig. 1). Expanding as before we
get
\[
U_{ij} = (\hat{n}^x_1)^2 U^{xx}_{ij} + (\hat{n}^y_1)^2 U^{yy}_{ij} + 2\hat{n}^x_1\hat{n}^y_1 U^{xy}_{ij}
\tag{6}
\]
where again any $U_{i,j+1}$ can be updated from $U_{ij}$
in constant time, when sufficient statistics are maintained.
Therefore, the update rules in Equations (4)–(6) allow an efficient
computation of the matrix H of code lengths in $O(n^2)$ operations.
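The updates for Equations (5) and (6) can be sketched together in Python, because both $t_r$ and $u_r$ are projections of the same vector $(P_r - P_i)$. The sketch rests on assumptions: the running sums are taken to be sums of products of the components of $(P_r - P_i)$, which the text implies but does not write out, and the function names are hypothetical. It also assumes the segment is not parallel to the z-axis, since $n_1$ and $n_2$ vanish in that case.

```python
import math

# Index pairs for the six running sums: xx, yy, zz, xy, yz, xz.
PAIRS = [(0, 0), (1, 1), (2, 2), (0, 1), (1, 2), (0, 2)]


def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def normals(P, i, j):
    """Direction cosines of n1 = z x L_ij and n2 = L_ij x n1."""
    a, b, c = (P[j][k] - P[i][k] for k in range(3))
    return unit([-b, a, 0.0]), unit([-a * c, -b * c, a * a + b * b])


def tu_naive(P, i, j):
    """(T_ij, U_ij) by direct summation over r = i+1 .. j-1."""
    n1, n2 = normals(P, i, j)
    T = U = 0.0
    for r in range(i + 1, j):
        d = [P[r][k] - P[i][k] for k in range(3)]
        T += sum(d[k] * n2[k] for k in range(3)) ** 2
        U += sum(d[k] * n1[k] for k in range(3)) ** 2
    return T, U


def quad(nhat, stats):
    """Quadratic form of Equations (5) and (6) in the direction cosines."""
    return sum((1 if a == b else 2) * nhat[a] * nhat[b] * stats[(a, b)]
               for a, b in PAIRS)


def tu_row_incremental(P, i):
    """Return {j: (T_ij, U_ij)} for all j > i with constant work per j."""
    stats = {ab: 0.0 for ab in PAIRS}  # empty sums when j = i + 1
    row = {}
    for j in range(i + 1, len(P)):
        n1, n2 = normals(P, i, j)
        row[j] = (quad(n2, stats), quad(n1, stats))
        # Add the term for r = j so the sums are ready for end point j + 1.
        d = [P[j][k] - P[i][k] for k in range(3)]
        for a, b in PAIRS:
            stats[(a, b)] += d[a] * d[b]
    return row
```

Because the z component of $n_1$ is zero, the zz, yz and xz terms drop out of the quadratic form for U, which leaves the three terms of Equation (6).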
7 RESULTS
In the previous sections, we have demonstrated an efﬁcient and
statistically robust algorithm to simplify a protein structure with
piecewise linear segments. We implemented the described algorithm
(in C++). Our implementation is available from
http://www.csse.monash.edu.au/~karun/pmml/.
We evaluated our method using a non-redundant dataset
containing 15 399 protein structures obtained from the protein data
bank (Berman et al., 2002). (The non-redundancy here implies that
no two structures in this dataset share a sequence identity >65%.)
This dataset was culled using the program PISCES (Wang and
Dunbrack, 2003). The list of protein structures in the dataset and the
results of their delineation produced by our method can be obtained
from the aforementioned link.
Figure 2 gives the distribution of the measure of simpliﬁcation
of structures over the entire dataset. For a structure, the measure of
simpliﬁcation is the ratio of number of line segments identiﬁed by the
program over the number of residues in the structure. On average
over the entire dataset, the delineation size (that is, the number of
line segments in the delineation) constitutes 13.85% of the total size
of the structure (in residues). In addition, the average segment length
over the entire dataset is observed to be 8.11 residues. In general,
the number of segments is correlated with the total size of the protein
structure.
Fig. 2. Distribution of ratios of number of line segments over number of
residues per structure in the dataset. Ratios are expressed in percentages and
rounded to the nearest integral value.
It is of considerable interest to evaluate the agreement of standard
secondary structural elements—helices and strands of sheets—with
the delineation identiﬁed by the program. We note that an ideal
delineation of a structure must encompass these elements since
they are ideal candidates for approximation with lines or vectors
given the linear spatial trend in their geometry. In order to evaluate
the agreement, we coarsely classify each segment to one of three
secondary structure states: ‘Helix’, ‘Strand’ and ‘Other’. This three-state
classiﬁcation is based on certain geometric characteristics of the
segments in the delineation. Speciﬁcally, we compute the following
geometric proﬁles for each segment: ‘rise’, ‘pitch’ and backbone
dihedral angles φ and ψ. The rise (ρ) of the segment with endpoints
Pi and Pj is ρ=Dij/(j−i+1), where Dij is the Euclidean distance
between the endpoints. In other words, the rise gives the average
translation of points along the line between endpoints. The rise of
a standard secondary structure is directly related to the pitch (p) of
the segment. For a substructure with a geometry that repeats itself
every n residues, the relationship between rise and pitch is given by
p=nρ. Table 1 summarizes the geometric proﬁles of ideal secondary
structures (Taylor, 2001). Inspecting these proﬁles per segment,
a coarse characterisation for each segment in the delineation is
achieved.
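The rise and pitch defined above translate directly into code. The following Python sketch is illustrative: the function names are hypothetical, points are 0-indexed coordinate tuples, and the thresholds used to map these profiles to ‘Helix’, ‘Strand’ or ‘Other’ are not given in the text and are therefore not reproduced.

```python
import math


def rise(P, i, j):
    """rho = D_ij / (j - i + 1) for a segment with end points P[i] and P[j]."""
    return math.dist(P[i], P[j]) / (j - i + 1)


def pitch(periodicity, rho):
    """p = n * rho for a geometry that repeats every `periodicity` residues."""
    return periodicity * rho


# Pitch from the tabulated periodicity and rise of an ideal beta-strand.
print(pitch(2.0, 3.4))
```

With the Table 1 values for a β-strand (n = 2.0, ρ = 3.4) this reproduces the tabulated pitch of 6.8. For the α-helix, 3.6 × 1.5 gives 5.4 against the tabulated 5.5, a gap that presumably reflects rounding of the tabulated values.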
Examining the coarse segment level assignment for the structures
in the dataset, we note that the average length of segments assigned
as ‘Helix’ is 13.01 residues while the same for those assigned as
‘Strand’ is 7.33 residues.
To evaluate our coarse assignment, we choose two popular
and extensively used secondary structure assignment programs,
DSSP (Kabsch and Sander, 1983) and STRIDE (Frishman and
Argos, 1995). DSSP and STRIDE assign each residue to one of
multiple secondary structural states, including $3_{10}$-, α-, π-helices
and β-strands of sheet. For the structures in our dataset, we generate
the respective secondary structure assignments using DSSP and
STRIDE. We note that both these programs assign secondary
structure definitions at a residue level, while the coarse assignment
for our method described above is at a segment level. Therefore,
to enable a comparison between the methods, we assign all residues
within a segment to the segment-level secondary structure state.
Table 2 gives the concordance of Helix (see footnote 6) and Strand assignments
between DSSP, STRIDE and our method, PMML.

Table 1. Geometric profiles of ideal secondary structures used to classify
coarsely the delineation identified by the program. φ and ψ are average
backbone dihedral angles. n is the periodicity of the local structure. ρ is the
rise. p is the pitch

Type          φ (°)     ψ (°)    n     ρ (Å)   p (Å)
3_10-Helix    −57.1     −69.7    3.0   2.0     6.0
α-Helix       −57.8     −47.0    3.6   1.5     5.5
π-Helix       −74.0     −4.0     4.4   1.1     5.0
β-Strand      −139.0    135.0    2.0   3.4     6.8

Table 2. Percentage agreement of Helix and Strand assignments between
various methods

Comparison                 Helices (%)   Strands (%)
PMML (coarse) vs DSSP      79.0          83.3
PMML (coarse) vs STRIDE    79.3          83.1
PMML (refine) vs DSSP      92.6          92.4
PMML (refine) vs STRIDE    91.3          92.1
STRIDE vs DSSP             95.7          96.9
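The step of propagating segment-level states to residues before comparison can be sketched as follows. The text does not define how the percentage agreement in Table 2 is computed, so the measure below is an assumption: the share of residues given a state by the reference method that receive the same state from the other method. Function names and the 0-based inclusive residue ranges are likewise hypothetical.

```python
def expand_segments(segments, n_residues, default="Other"):
    """Assign each residue the state of the segment that contains it.

    segments: iterable of (start, end, state) with inclusive 0-based indices.
    Where segments share an end point, the later segment's state is kept.
    """
    labels = [default] * n_residues
    for start, end, state in segments:
        for r in range(start, end + 1):
            labels[r] = state
    return labels


def percent_agreement(ours, reference, state):
    """Percentage of reference residues in `state` that we also call `state`."""
    ref_idx = [k for k, s in enumerate(reference) if s == state]
    if not ref_idx:
        return float("nan")  # the state never occurs in the reference
    hits = sum(1 for k in ref_idx if ours[k] == state)
    return 100.0 * hits / len(ref_idx)


# Made-up 10-residue example, not taken from the dataset.
ours = expand_segments([(0, 3, "Helix"), (4, 5, "Other"), (6, 9, "Strand")], 10)
reference = ["Helix"] * 5 + ["Other"] + ["Strand"] * 4
print(percent_agreement(ours, reference, "Helix"))   # 80.0
print(percent_agreement(ours, reference, "Strand"))  # 100.0
```

In this toy case the helix disagreement comes from a single residue at the end of the helical segment.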
Although even a coarse segment level assignment by our method
produced a satisfactory concordance with DSSP and STRIDE, there
is still a disagreement of ∼15% between PMML and the other two
methods. Inspecting these differences we note that the majority
of them came from the terminal parts of the segments delineated
by our program. Therefore, we reﬁne the coarse level assignment
produced by PMML using the hydrogen bonding patterns of residues
within each segment to reassign the secondary structure state at a
residue level. We use a simple proximity (of backbone nitrogen and
carbonyl groups) and angle (of N, O, C atoms) based computation
of hydrogen bonds. Comparing our reﬁned assignments at a residue
level with DSSP and STRIDE we notice a substantial improvement
in the concordance of helix and strand assignments with DSSP
and STRIDE. (See rows 3 and 4 in Table 2.) We emphasize that
although PMML can be used to generate protein secondary structure
assignments, its real aim is to generate concise representations
of structures, irrespective of the nature of the segments of which
they are composed. For instance, PMML could be applied to RNA
structures without needing any appeal to the types of substructure
anticipated.
Footnote 6. We do not distinguish between the three types of helices and treat them as
one state.

Fig. 3. Wall-eye stereo image of 1.8 Å crystal structure of oxidized
Clostridium beijerinckii flavodoxin. Each delineated segment produced by
PMML is shown in a different color. The elements of secondary structures,
of helices and strands of sheet, were derived from the wwPDB file, 5NLL,
and are shown in this figure as thick ribbons. The labels of various secondary
structures are also shown. The bound FMN co-factor is shown at the top of
the structure as thin lines.

Table 3. The residue ranges of secondary structural elements (SSEs) in the
structure of flavodoxin shown in Fig. 3

SSE   wwPDB            PMML
β1    Lys2-Trp6        Met1-Tyr5
αA    Asn11-Glu25      Asn11-Glu27
β2    Asn31-Asn34      Gly27-Ile33
αB    Ile40-Asn45      Asn39-Glu46
β3    Ile48-Cys53      Asp47-Cys53
αC    Phe66-Lys76      Glu65-Thr75
β4    Lys81-Tyr88      Gly79-Ser87
αD    Lys94-Gly105     Gly91-Gly107
β5    Leu115-Gln118    Glu112-Gln118
αE    Asp122-Ile136    Glu120-Gln126, Gln126-Ile136

The SSEs in the rows follow the order of their appearance along the chain of the
protein from its N- to C-terminus. The column wwPDB gives the residue ranges of
various SSEs as indicated in the wwPDB file 5NLL. The column PMML gives the
corresponding residue ranges of the segmentation produced by PMML.

Manually evaluating the delineation of a large number of
structures we notice that although PMML’s delineation identifies
the regions of helix and strand consistently, there remain small
discrepancies in assigning precise beginning and end residues
of secondary structure elements as ascertained by an expert.
To highlight these differences consider the following example of
the delineation produced by PMML. Figure 3 shows the structure
of oxidized Clostridium beijerinckii ﬂavodoxin. This protein binds
a cofactor, ﬂavin mononucleotide (FMN). Flavodoxin is a small
α/β protein, containing a 5-stranded parallel β-sheet (β1,...,β5),
with two helices packed against each face of the sheet (αA,αE and
αC,αD). There is also a short helix (αB) located near the N-terminus
of the protein. (Fig. 3.) Different segments produced by PMML
are shown in different colors. The elements of secondary structure
shown as thick ribbons are the secondary structure assignments taken
from the structure’s wwPDB ﬁle (5NLL). Table 3 gives the residue
ranges (that is, start and end residues) for each secondary structural
element (SSE) of the ﬂavodoxin structure listed in its wwPDB ﬁle.
The residue ranges of the corresponding segmentation produced by
PMML are also presented in the table. Broadly, the program correctly
assigns segments to the SSEs. However, minor differences can be
observed in the locations of their start and end residues. In most
cases, we notice an absolute difference of 1 or 2 residues in the N- or
C-terminal regions of these SSEs. The segmentation in the regions
around the SSEs αE, β2 and β5 shows some discrepancies. The
residue range from wwPDB corresponding to αE was approximated
by PMML using 2 segments instead of one. The ﬁrst segment is
composed of roughly one turn of the helix at αE’s N-terminal end.
This is understandable as this turn is substantially skewed from the
main helical axis and, indeed, there is an interruption in the hydrogen
bonding. However, the second segment composed of 11 residues in
this region is consistent with the assignment in the wwPDB ﬁle.
In the case of β2, the start location identiﬁed by PMML precedes
the start location identified in the wwPDB file by four residues. On
inspecting the ﬂavodoxin structure, there appears to be a backbone
hydrogen bond between the carbonyl group of residue Asp29 and
the nitrogen of Met1 (of strand β1), so the β2 strand may well start
at residue Lys28 or Asp29. Similarly, for β5, the start location of the
segment from PMML was identiﬁed to be three residues before the
location identiﬁed in the wwPDB ﬁle, and inspecting the structure,
we note the β-bulge in strand β5, and hydrogen bonds between
atoms 80O···109N and 82N···109O; assignment of the start of the
strand β5 to residue 109 is not indefensible.
8 CONCLUSION
We have presented a novel and efﬁcient method to delineate
protein structures using the MML framework; MML is tolerant
to measurement error and other inaccuracies. The model used
in this work is independent of preconceived notions of what
substructures are being sought to simplify the observed coordinate
data. Our method maximizes the economy of representation while
minimizing the loss of information, taking into account even
the loop regions of proteins. Analysis of the delineations of
a large number of protein structures suggests that the method
is consistent in delineating, among other features, standard secondary
structures. The concise representations produced by this method
have a potential use for rapid and accurate structure comparison
and lookup. An implementation of our program is available from
http://www.csse.monash.edu.au/~karun/pmml/.
ACKNOWLEDGEMENTS
We thank the anonymous referees for comments that improved the
manuscript. L.A. and A.S.K. thank Nathan Hurst for useful pointers
during the development of this work.
Funding: ASK’s research is supported by Monash University’s
Talent Enhancement and Larkins Fellowship. NICTA is funded
by the Australian Government as represented by the Department
of Broadband, Communications and the Digital Economy and the
Australian Research Council.
Conﬂict of Interest: none declared.
REFERENCES
Abagyan,R.A. and Maiorov,V.N. (1988) A simple qualitative representation of
polypeptide chain folds: comparison of protein tertiary structures. J. Biomol. Struct.
Dyn., 5, 1267–1279.
Banerjee,S. et al. (1996) A minimum description length polygonal approximation
method. IBM Tech. Rep., RJ 10007, 1–19.
Bellman,R. (1957) Dynamic Programming. Princeton University Press, Princeton, New
Jersey.
Berman,H.M. et al. (2002) The protein data bank. Acta Crystallogr. D Biol. Crystallogr.,
58(Pt 6 No 1), 899–907.
Chothia,C. et al. (1981) Helix to helix packing in proteins. Proc. Natl Acad. Sci. USA,
78, 4146–4150.
Colloc’h,N. et al. (1993) Comparison of three algorithms for the assignment of
secondary structure in proteins. Protein Eng., 6, 377–382.
Cuff,J.A. and Barton,G.J. (1999) Evaluation and improvement of multiple sequence
methods for protein secondary structure prediction. Proteins, 34, 508–519.
Dupuis,F. et al. (2004) Protein secondary structure assignment through Voronoi
tessellation. Proteins, 55, 519–528.
Edsall,J.T. et al. (1966) A proposal of standard conventions and nomenclature for the
description of polypeptide conformations. J. Mol. Biol., 15, 399–407.
Elias,P. (1975) Universal codeword sets and representations of the integers. IEEE Trans.
Inf. Theory, 21, 194–203.
Frishman,D. and Argos,P. (1995) Knowledge-based protein secondary structure
assignment. Proteins, 23, 566–579.
Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern
recognition of hydrogen-bonded and geometrical features. Biopolymers, 22,
2577–2637.
Kamat,A.P. and Lesk,A.M. (2007) Contact patterns between helices and strands of sheet
deﬁne protein folding patterns. Proteins: Structure, Function, and Bioinformatics,
66, 869–876.
Konagurthu,A.S. and Lesk,A.M. (2010) Concise tableau representation of protein
folding patterns. J. Mol. Recogn., 23, 253–257.
Konagurthu,A.S. et al. (2008) Structural search and retrieval using tableau
representation of protein folding patterns. Bioinformatics, 24, 645–651.
Labesse,G. et al. (1997) P-SEA: a new efﬁcient assignment of secondary structure from
c alpha trace of proteins. Comput. Appl. Bio. Sci., 13, 291–295.
Lesk,A.M. and Chothia,C. (1980) How different amino acid sequences determine
similar protein structures: The structure and evolutionary dynamics of the globins.
J. Mol. Biol., 136, 225–230.
Lesk,A.M. (1995) Systematic representation of protein folding patterns. J. Mol.
Graphics, 13, 159–164.
Levitt,M. and Greer,J. (1977) Automatic identification of secondary structure in globular
proteins. J. Mol. Biol., 114, 181–239.
Majumdar,I. et al. (2005) PALSSE: A program to delineate linear secondary structural
elements from protein structures. BMC Bioinformatics, 6, 202.
Mizuguchi,K. and Go,N. (1995) Comparison of spatial arrangements of secondary
structural elements in proteins. Protein Eng., 8, 353–362.
Richardson,J.S. (1981) The anatomy and taxonomy of protein structure. Adv. Protein
Chem., 34, 167–339.
Richards,F.M. and Kundrot,C.E. (1988) Identiﬁcation of structural motifs from protein
coordinate data: secondary structure and ﬁrst-level supersecondary structure.
Proteins, 3, 71–78.
Shannon,C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. Jrnl.,
27, 379–423.
Shi,S. et al. (2007) Searching for three-dimensional secondary structural patterns in
proteins with ProSMoS. Bioinformatics, 23, 1331–1338.
Sklenar,H. et al. (1989) Describing protein structure: a general algorithm yielding
complete helicoidal parameters and a unique overall axis. Proteins, 6, 46–60.
Srinivasan,R. and Rose,G.D. (1999) A physical basis for protein secondary structure.
Proc. Natl Acad. Sci. USA, 96, 14258–14263.
Taylor,W.R. et al. (1983) An ellipsoidal approximation of protein shape. J. Mol. Graphics,
1, 30–38.
Taylor,W.R. (2001) Deﬁning linear segments in protein structures. J. Mol. Biol., 310,
1135–1150.
Wallace,C.S. and Boulton,D.M. (1968) An information measure for classiﬁcation.
Comp. J., 11, 185–194.
Wallace,C.S. (2005) Statistical and Inductive Inference using Minimum Message
Length. Information Science and Statistics. Springer, New York.
Wang,G. and Dunbrack,R. L. Jr. (2003) PISCES: a protein sequence culling server.
Bioinformatics, 19, 1589–1591.