[19:47 6/6/2011 Bioinformatics-btr240.tex] Page: i43 i43–i51 BIOINFORMATICS Vol. 27 ISMB 2011, pages i43–i51 doi:10.1093/bioinformatics/btr240 Piecewise linear approximation of protein structures using the principle of minimum message length Arun S. Konagurthu1,∗, Lloyd Allison1,∗, Peter J. Stuckey2 and Arthur M. Lesk3 1Clayton School of Information Technology, Monash University, Clayton, VIC 3800, 2Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, VIC 3010 Australia and 3Department of Biochemistry and Molecular Biology and The Huck Institute for Genomics, Proteomics and Bioinformatics, The Pennsylvania State University, University Park, PA 16802, USA ABSTRACT Simple and concise representations of protein-folding patterns provide powerful abstractions for visualizations, comparisons, classifications, searching and aligning structural data. Structures are often abstracted by replacing standard secondary structural features—that is, helices and strands of sheet—by vectors or linear segments. Relying solely on standard secondary structure may result in a significant loss of structural information. Further, traditional methods of simplification crucially depend on the consistency and accuracy of external methods to assign secondary structures to protein coordinate data. Although many methods exist automatically to identify secondary structure, the impreciseness of definitions, along with errors and inconsistencies in experimental structure data, drastically limit their applicability to generate reliable simplified representations, especially for structural comparison. This article introduces a mathematically rigorous algorithm to delineate protein structure using the elegant statistical and inductive inference framework of minimum message length (MML). Our method generates consistent and statistically robust piecewise linear explanations of protein coordinate data, resulting in a powerful and concise representation of the structure. The delineation is completely independent of the approaches of using hydrogen-bonding patterns or inspecting local substructural geometry that the current methods use. Indeed, as is common with applications of the MML criterion, this method is free of parameters and thresholds, in striking contrast to the existing programs which are often beset by them. The analysis of results over a large number of proteins suggests that the method produces consistent delineation of structures that encompasses, among others, the segments corresponding to standard secondary structure. Availability: http://www.csse.monash.edu.au/~karun/pmml. Contact: arun.konagurthu@monash.edu; lloyd.allison@monesh.edu 1 INTRODUCTION With the rapid growth in the corpus of known structures, concise representations are increasingly preferred to inspect and analyze protein folding patterns (Abagyan and Maiorov, 1988; Lesk, 1995; Richardson, 1981; Taylor et al., 1983). At the core of this simplification is the idea that proteins contain repetitive substructural elements and that the essence of a fold lies in the assembly and ∗To whom correspondence should be addressed. interaction of these elements (Kamat and Lesk, 2007; Konagurthu and Lesk, 2010; Lesk and Chothia, 1980; Lesk, 1995). The appearance of some of these elements arises from the periodicity in the patterns of hydrogen bonds between backbone nitrogen and carbonyl groups along the protein polypeptide chain. Among the standard secondary structure definitions are: helix (α-helix, π-helix and 310-helix) and strand of sheet (Edsall et al., 1966). Ideally, the spatial trace of α-Carbon (Cα) atoms of standard secondary structure show a linear trend allowing them to be abstracted using vectors or line segments, without much loss of structural information about the fold. The common practice is to fit an axis to a helix and a least-square line to Cα or main chain atoms of strands of sheet (Chothia et al., 1981; Lesk, 1995). Replacement of secondary structural elements with line segments is therefore one of the common methods to abstract protein structures and construct concise representation of their folding patterns. The number of standard secondary structural elements observed in a protein is typically an order of magnitude smaller than the number of residues in a chain. Therefore methods that utilize concise representations clearly benefit from a massive space and computational saving, especially when comparing and analyzing structures on a large scale (Abagyan and Maiorov, 1988; Konagurthu et al., 2008; Mizuguchi and Go, 1995; Shi et al., 2007). Methods that abstract protein structure at the level of secondary structure generally rely on external programs that can automatically assign secondary structures to coordinate data. However, accurate identification and assignment of secondary structure is an inexact process (Cuff and Barton, 1999). Although definitions based on hydrogen bonding provides some rigor in assigning secondary structure, the standard definition of what constitutes a hydrogen bond is based on the notion of bond energy whose measurement can be imprecise and acutely sensitive even to small differences in the position of nitrogen and carbonyl atoms, especially the carbonyl oxygen positions. Two popular programs that use hydrogen bonding as a basis for assignment of secondary structure are DSSP (Kabsch and Sander, 1983) and STRIDE (Frishman and Argos, 1995). On the other hand, secondary structure can be defined using geometric features such as distances and dihedral angles of Cα atoms along the backbone in addition to other local structural features. In fact, there is a direct correlation between patterns of hydrogen bonding and the geometry that arise out of them. However, secondary structural elements can deviate substantially from ideal geometry, therefore posing severe challenges to detect such elements using geometric features alone. Among the methods that rely primarily on geometry to assign secondary structure are © The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. atMasarykovauniverzitaonSeptember18,2011bioinformatics.oxfordjournals.orgDownloadedfrom [19:47 6/6/2011 Bioinformatics-btr240.tex] Page: i44 i43–i51 A.S.Konagurthu et al. (Dupuis et al., 2004; Labesse et al., 1997; Levitt and Greer, 1977; Majumdar et al., 2005; Richards and Kundrot, 1988; Sklenar et al., 1989; Srinivasan and Rose, 1999; Taylor, 2001). (See Majumdar et al. (2005) for details of popular programs that assign secondary structural elements.) We note that previous comparative studies have highlighted the difficulties of existing programs to assign consistently secondary structure to coordinate data and have proposed using a ‘consensus’ definition—secondary structure assignment that is at the intersection of all the methods—to arrive at a reliable simplification of protein structures (Colloc’h et al., 1993; Cuff and Barton, 1999). The main goal for abstracting protein structures must be to achieve maximal economy of description with minimal loss of structural information (Taylor, 2001). However, simplifying structures at the level of standard secondary structure is lossy because the loop regions are ignored. Therefore, a reliable method that achieves the above goal and that is tolerant to measurement error and noise is preferred. Even better would be a method entirely independent of preconceived notions of what substructures are being sought. Here, we describe a method that generates a principled abstractions of protein structures. Our method uses the rigorous statistical framework of minimum message length (MML). In fact, the realization of the goal to maximize economy and minimize loss of information fits squarely into the MML criterion, making it extremely well-suited for this specific problem. In this work, we treat a protein as an ordered list of Cα coordinates. Our method uses an information-theoretic approach to explain as a line segment the points between any pair of residues in the structure. Each such explanation is encoded in a certain number of bits (or code length). Using these code lengths, a globally optimal explanation is computed which minimizes the total encoded (message) length of the given coordinate data. The code lengths contributing to this minimum message length result in the best piecewise linear approximation of the structure. In a stark contrast to the existing methods, our method is completely free of parameters and thresholds. We emphasize that our method is not a method for delineating secondary structures. However, as expected from such a method, our results show that the line segments generated by this method correspond well with standard secondary structures of proteins. We note that this article generalizes to three dimensions the work of Banerjee et al. (1996), who described a polygonal approximation method on general two dimensional sequence of points.1 Indeed, it can be shown that our method described in this paper can be generalized to arbitrary dimensions and other types of structural data (over and beyond proteins). We have attempted to keep the notations in this paper consistent with those described in the work of Banerjee et al. (1996) for the convenience of the reader. Section 2 briefly summarizes the MML framework, followed by Sections 3–6 which describe the mechanics of our approach. Section 7 presents an analysis of the results of our method over a large number of protein structures. 1 Banerjee et al. (1996) use a related minimum description length principle for their approach, which is a technique that was introduced a decade after Wallace and Boulton (1968) proposed the MML criterion. The two approaches are significantly different. See Wallace (2005) for a comparison. 2 THE MINIMUM MESSAGE LENGTH FRAMEWORK Wallace and Boulton (1968) first proposed the theory of MML, where given a set of competing hypotheses (or models) that can explain some observed data, the MML criterion provides a statistically rigorous framework for selecting the best hypothesis to describe the data. In many ways, MML is a formal informationtheoretic realization of the principle of Occam’s razor. Assume there are some observed data D and some hypothesis H that explains the data. From Bayes’s theorem we get p(H&D)=p(H)×p(D|H)=p(D)×p(H|D) where p(H&D) is the joint probability of data D and the hypothesis H, p(H) is the prior probability of hypothesis H, p(D) is the prior probability of data D, p(H|D) is the posterior probability of H given D, and p(D|H) is the likelihood. MML applies the remarkable result from Shannon’s ‘Mathematical Theory of Communication’ (Shannon, 1948) that, given an event E with a probability p(E), the message length, l(E) for an optimal code is given by l(E)=−log2 p(E) bits. Carrying this insight to the Bayes’s theorem, we get the following relationship between conditional probabilities in terms of optimal message lengths. l(H&D)=l(H)+l(D|H)=l(D)+l(H|D). The essence of inductive inference is to fit a model to a mass of observed data. For such an approach it is the hypothesis H with the largest posterior probability p(H|D) that is often preferred. Among the terms in the above equation, p(H) (and hence l(H)) can usually be estimated well for some reasonable prior on hypotheses.At the same time, the likelihood p(D|H) can also be estimated. But to estimate the posterior probability distribution p(H|D), the prior of observed data p(D) will be needed. Estimating p(D) can be problematic and even impractical. However, for two competing hypotheses, H and H we have l(H|D)−l(H |D)=l(H)+l(D|H)−l(H )−l(D|H ), thereby eliminating the necessity to estimate p(D) completely when comparing hypotheses. MML is best understood through a communication process where a transmitter and a receiver are connected through one of Shannon’s communication channels. The objective is that a transmitter must send some data D to the receiver. The transmitter and receiver must have previously agreed on a set of rules (that is, a code book) of communication using common knowledge and prior expectations. If the transmitter can find a good hypothesis, H∗, to fit the data, (s)he will be able to transmit the data economically. In MML, an explanation of the data comes as a two part message: (1) transmit the hypothesis H∗ taking l(H∗) bits, and (2) transmit the observed data D given H∗ taking l(D|H∗) bits. Such a message paradigm ensures complete transparency in communication. That is, any information that is not common knowledge cannot be included except as a part of the message sent by the transmitter. Otherwise, the message sent will be indecipherable i44 atMasarykovauniverzitaonSeptember18,2011bioinformatics.oxfordjournals.orgDownloadedfrom [19:47 6/6/2011 Bioinformatics-btr240.tex] Page: i45 i43–i51 Piecewise linear approximation of structures using MML by the receiver. There can be no hidden parameters in this framework of communication. In fact, this issue extends to stating and inferring real-valued parameters to an ‘appropriate’ level of precision, which is pertinent to the current problem on hand. The MML framework additionally offers ‘safety’ in that if an inefficient code is used to encode a message, it can only make the hypothesis look less attractive than otherwise. Note that MMLyields a natural hypothesis test: the null-model corresponds to transmitting the data raw. If a stated hypothesis takes longer than what is required by a null-model, then clearly such an hypothesis is unacceptable. A more complex hypothesis fits the data better than a simpler model, in general. We see that MML encoding gives a trade-off between hypothesis complexity (l(H)), and its goodness of fit to the data (l(D|H)). Therefore, MML criterion formally justifies and realises Occam’s razor. An important aspect of MML framework is that it is tolerant to measurement accuracy and noise in the underlying data. For a justification of this and a comprehensive study of the principle of MML, refer (Wallace, 2005). 3 FORMULATING THE PROBLEM USING MINIMUM MESSAGE LENGTH A protein P ={P1,···,Pn} is a sequence2 of n three-dimensional points corresponding to the coordinates (in R3) of Cα atoms along the protein backbone, from its N- to C- terminus.3 Define a piecewise linear approximation of P as a subsequence of k ≤n points from P of the form Q={Q1 ≡Pi1 ,...,Qk ≡Pik } such that 1=i1 <···65%.) This dataset was culled using the program PISCES (Wang and Dunbrack, 2003). The list of proteins structures in the dataset and the results of their delineation produced by our method can be obtained from the aforementioned link. Figure 2 gives the distribution of the measure of simplification of structures over the entire dataset. For a structure, the measure of simplification is the ratio of number of line segments identified by the program over the number of residues in the structure. On an average Fig. 2. Distribution of ratios of number of line segments over number of residues per structure in the dataset. Ratios are expressed in percentages and rounded to the nearest integral value. over the entire dataset the delineation size (that is, the number of line segments in the delineation) constitutes 13.85% of the total size of structure (in residues). In addition, the average segment length over the entire dataset is observed to be 8.11 residues. In general, the number of segments is correlated to total size of the protein structure. It is of considerable interest to evaluate the agreement of standard secondary structural elements—helices and strands of sheets—with the delineation identified by the program. We note that an ideal delineation of a structure must encompass these elements since they are ideal candidates for approximation with lines or vectors given the linear spatial trend in their geometry. In order to evaluate the agreement, we coarsely classify each segment to one of three secondary structure states: ‘Helix’, ‘Strand’ and ‘Other’. This threestate classification is based on certain geometric characteristics of the segments in the delineation. Specifically, we compute the following geometric profiles for each segment: ‘rise’, ‘pitch’ and backbone dihedral angles φ and ψ. The rise (ρ) of the segment with endpoints Pi and Pj is ρ=Dij/(j−i+1), where Dij is the Euclidean distance between the endpoints. In other words, the rise gives the average translation of points along the line between endpoints. The rise of a standard secondary structure is directly related to the pitch (p) of the segment. For a substructure with a geometry that repeats itself every n residues, the relationship between rise and pitch is given by p=nρ. Table 1 summarizes the geometric profiles of ideal secondary structures (Taylor, 2001). Inspecting these profiles per segment, a coarse characterisation for each segment in the delineation is achieved. Examining the coarse segment level assignment for the structures in the dataset, we note that the average length of segments assigned as ‘Helix’ is 13.01 residues while the same for those assigned as ‘Strand’ is 7.33 residues. To evaluate our coarse assignment, we choose two popular and extensively used secondary structure assignment programs, DSSP (Kabsch and Sander, 1983) and STRIDE (Frishman and Argos, 1995). DSSP and STRIDE assign each residue to one of multiple secondary structural states, including 310-, α-, π-helices and β-strands of sheet. For the structures in our dataset, we generate i49 atMasarykovauniverzitaonSeptember18,2011bioinformatics.oxfordjournals.orgDownloadedfrom [19:47 6/6/2011 Bioinformatics-btr240.tex] Page: i50 i43–i51 A.S.Konagurthu et al. Table 1. Geometric profiles of ideal secondary structures used to classify coarsely the delineation identified by the program. φ and ψ are average backbone dihedral angles. n is the periodicity of the local structure. ρ is the rise. p is the pitch Type φ ψ n ρ p 310-Helix −57.1 −69.7 3.0 2.0 6.0 α-Helix −57.8 −47.0 3.6 1.5 5.5 π-Helix −74.0 −4.0 4.4 1.1 5.0 β-Strand −139.0 135.0 2.0 3.4 6.8 Table 2. Percentage agreement of Helix and Strand assignments between various methods Comparison Helices (%) Strands (%) PMML (coarse)_vs_DSSP 79.0 83.3 PMML (coarse)_vs_STRIDE 79.3 83.1 PMML (refine)_vs_DSSP 92.6 92.4 PMML (refine)_vs_STRIDE 91.3 92.1 STRIDE_vs_DSSP 95.7 96.9 the respective secondary structure assignments using DSSP and STRIDE. We note that both these programs assign secondary structure definitions at a residue level, while the coarse assignment for our method described above is at a segment level. Therefore, to enable a comparison between the methods we assign all residues within a segment to the segment level secondary structure state. Table 2 gives the concordance of Helix6 and Strand assignments between DSSP, STRIDE, and our method, PMML. Although even a coarse segment level assignment by our method produced a satisfactory concordance with DSSP and STRIDE, there is still a disagreement of ∼15% between PMML and the other two methods. Inspecting these differences we note that the majority of them came from the terminal parts of the segments delineated by our program. Therefore, we refine the coarse level assignment produced by PMMLusing the hydrogen bonding patterns of residues within each segment to reassign the secondary structure state at a residue level. We use a simple proximity (of backbone nitrogen and carbonyl groups) and angle (of N, O, C atoms) based computation of hydrogen bonds. Comparing our refined assignments at a residue level with DSSP and STRIDE we notice a substantial improvement in the concordance of helix and stand assignments with DSSP and STRIDE. (See rows 3 and 4 in Table 2.) We emphasize that although PMML can be used to generate protein secondary structure assignments, its real aim is to generate concise representations of structures, irrespective of the nature of the segments of which they are composed. For instance, PMML could be applied to RNA structures without needing any appeal to the types of substructure anticipated. Manually evaluating the delineation of a large number of structures we notice that although PMML’s delineation identifies the regions of helix and strand consistently, there remain small discrepancies in assigning precise beginning and end residues 6 We do not distinguish between the three types of helices and treat them as one state. Fig. 3. Wall-eye stereo image of 1.8 Å crystal structure of oxidized Clostridium beijerinckii flavodoxin. Each delineated segment produced by PMML is shown in a different color. The elements of secondary structures, of helices and strands of sheet, were derived from the wwPDB file, 5NLL, and are shown in this figure as thick ribbons. The labels of various secondary structures are also shown. The bound FMN co-factor is shown at the top of the structure as thin lines. Table 3. The residue ranges of secondary structural elements (SSEs) in the structure of flavodoxin shown in Fig. 3 SSE wwPDB PMML β1 Lys2-Trp6 Met1-Tyr5 αA Asn11-Glu25 Asn11-Glu27 β2 Asn31–Asn34 Gly27-Ile33 αB Ile40-Asn45 Asn39-Glu46 β3 Ile48–Cys53 Asp47-Cys53 αC Phe66-Lys76 Glu65-Thr75 β4 Lys81–Tyr88 Gly79-Ser87 αD Lys94-Gly105 Gly91-Gly107 β5 Leu115–Gln118 Glu112-Gln118 αE Asp122-Ile136 Glu120-Gln126,Gln126-Ile136 The SSEs in the rows follow the order of their appearance along the chain of the protein from its N- to C-terminus. The column wwPDB gives the residue ranges of various SSEs as indicated in the wwPDB file 5NLL. The column PMML gives the corresponding residue ranges of the segmentation produced by PMML. of secondary structure elements as ascertained by an expert. To highlight these differences consider the following example of the delineation produced by PMML. Figure 3 shows the structure of oxidized Clostridium beijerinckii flavodoxin. This protein binds a cofactor, flavin mononucleotide (FMN). Flavodoxin is a small α/β protein, containing a 5-stranded parallel β-sheet (β1,...,β5), with two helices packed against each face of the sheet (αA,αE and αC,αD). There is also a short helix (αB) located near the N-terminus of the protein. (Fig. 3.) Different segments produced by PMML are shown in different colors. The elements of secondary structure shown as thick ribbons are the secondary structure assignments taken from the structure’s wwPDB file (5NLL). Table 3 gives the residue ranges (that is, start and end residues) for each secondary structural element (SSE) of the flavodoxin structure listed in its wwPDB file. The residue ranges of the corresponding segmentation produced by PMML is also presented in the table. Broadly, the program correctly assigns segments to the SSEs. However, minor differences can be i50 atMasarykovauniverzitaonSeptember18,2011bioinformatics.oxfordjournals.orgDownloadedfrom [19:47 6/6/2011 Bioinformatics-btr240.tex] Page: i51 i43–i51 Piecewise linear approximation of structures using MML observed in the locations of their start and end residues. In most cases, we notice an absolute difference of 1 or 2 residues in the N- or C- terminal regions of these SSEs. The segmentation in the regions around the SSEs αE, β2 and β5 show some discrepancies. The residue range from wwPDB corresponding to αE was approximated by PMML using 2 segments instead of one. The first segment is composed of roughly one turn of the helix at αE’s N-terminal end. This is understandable as this turn is substantially skewed from the main helical axis and, indeed, there is an interruption in the hydrogen bonding. However, the second segment composed of 11 residues in this region is consistent with the assignment in the wwPDB file. In the case of β2, the start location identified by PMML precedes the start location identfied in the wwPDB file by four residues. On inspecting the flavodoxin structure, there appears to be a backbone hydrogen bond between the carbonyl group of residue Asp29 and the nitrogen of Met1 (of strand β1), so the β2 strand may well start at residue Lys28 or Asp29. Similarly, for β5, the start location of the segment from PMML was identified to be three residues before the location identified in the wwPDB file, and inspecting the structure, we note the β−bulge in strand β5, and hydrogen bonds between atoms 80O···109N and 82N···109O; assignment of the start of the strand β5 to residue 109 is not indefensible. 8 CONCLUSION We have presented a novel and efficient method to delineate protein structures using the MML framework; MML is tolerant to measurement error and other inaccuracies. The model used in this work is independent of preconceived notions of what substructures are being sought to simplify the observed coordinate data. Our method maximizes the economy of representation while minimizing the loss of information, taking into account even the loop regions of proteins. Analysis of the delineations of a large number of protein structures suggests that the method is consistent in, among others, delineating standard secondary structures. The concise representations produced by this method have a potential use for rapid and accurate structure comparison and lookup. An implementation of our program is available from http://www.csse.monash.edu.au/~karun/pmml/. ACKNOWLEDGEMENTS We thank the anonymous referees for comments that improved the manuscript. L.A. and A.S.K. thank Nathan Hurst for useful pointers during the development of this work. Funding: ASK’s research is supported by Monash University’s Talent Enhancement and Larkins Fellowship. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council. Conflict of Interest: none declared. REFERENCES Abagyan,R.A. and Maiorov,V.N. (1988) A simple qualitative representation of polypeptide chain folds: comparison of protein tertiary structures. J. Biomol. Struct. Dyn., 5, 1267–1279. Banerjee,S. et al. (1996) A minimum description length polygonal approximation method. IBM Tech. Rep., RJ 10007, 1–19. Bellman,R. (1957) Dynamic Programming. Princeton University Press, Princeton, New Jersey. Berman,H.M. et al. (2002) The protein data bank. Acta Crystallogr. D Biol. Crystallogr., 58(Pt 6 No 1), 899–907. Chothia,C. et al. (1981) Helix to helix packing in proteins. Proc. Natl Acad. Sci. USA, 78, 4146–4150. Colloc’h,N. et al. (1993) Comparison of three algorithms for the assignment of secondary structure in proteins. Protein Eng., 6, 377–382. Cuff,J.A. and Barton,G.J. (1999) Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins, 34, 508–519. Dupuis,F. et al. (2004) Protein secondary structure assignment through Voronoi tessellation. Proteins, 55, 519–528. Edsall,J.T. et al. (1966) A proposal of standard conventions and nomenclature for the description of polypeptide conformations. J. Mol. Biol., 15, 399–407. Elias,P. (1975) Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory, 21, 194–203. Frishman,D. and Argos,P. (1995) Knowledge-based protein secondary structure assignment. Proteins, 23, 566–579. Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637. Kamat,A.P. and Lesk,A.M. (2007) Contact patterns between helices and strands of sheet define protein folding patterns. Proteins: Structure, Function, and Bioinformatics, 66, 869–876. Konagurthu,A.S. and Lesk,A.M. (2010) Concise tableau representation of protein folding patterns. J. Mol. Recogn., 23, 253–257. Konagurthu,A.S. et al. (2008) Structural search and retreival using tableau representation of protein folding patterns. Bioinformatics, 24, 645–651. Labesse,G. et al. (1997) P-SEA: a new efficient assignment of secondary structure from c alpha trace of proteins. Comput. Appl. Bio. Sci., 13, 291–295. Lesk,A.M. and Chothia,C. (1980) How different amino acid sequences determine similar protein structures: The structure and evolutionary dynamics of the globins. J. Mol. Biol., 136, 225–230. Lesk,A.M. (1995) Systematic representation of protein folding patterns. J. Mol. Graphics, 13, 159–164. Levitt,M. and Greer,J. (1977)Automatic identification of secondary structure in globular proteins. J. Mol. Biol., 114, 181–239. Majumdar,I. et al. (2005) PALSSE: A program to delineate linear secondary structural elements from protein structures. BMC Bioinformatics, 6, 202. Mizuguchi,K. and Go,N. (1995) Comparison of spatial arrangements of secondary structural elements in proteins. Protein Eng., 8, 353–362. Richardson,J.S. (1981) The anatomy and taxonomy of protein structure. Adv. Protein Chem., 34, 167–339. Richards,F.M. and Kundrot,C.E. (1988) Identification of structural motifs from protein coordinate data: secondary structure and first-level supersecondary structure. Proteins, 3, 71–78. Shannon,C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. Jrnl., 27, 379–423. Shi,S. et al. (2007) Searching for three-dimensional secondary structural patterns in proteins with ProSMoS. Bioinformatics, 23, 1331–1338. Sklenar,H. et al. (1989) Describing protein structure: a general algorithm yielding complete helicoidal parameters and a unique overall axis. Proteins, 6, 46–60. Srinivasan,R. and Rose,G.D. (1999) A physical basis for protein secondary structure. Proc. Natl Acad. Sci. USA, 96, 14258–14263. Taylor,W.R. et al. (1983) A elipsoidal approximation of protein shape. J. Mol. Graphics, 1, 30–38. Taylor,W.R. (2001) Defining linear segments in protein structures. J. Mol. Biol., 310, 1135–1150. Wallace,C.S. and Boulton,D.M. (1968) An information measure for classification. Comp. J., 11, 185–194. Wallace,C.S. (2005) Statistical and Inductive Inference using Minimum Message Length. Information Science and Statistics. Springer, New York. Wang,G. and Dunbrack,R. L. Jr. (2003) PISCES: a protein sequence culling server. Bioinformatics, 19, 1589–1591. i51 atMasarykovauniverzitaonSeptember18,2011bioinformatics.oxfordjournals.orgDownloadedfrom