RESEARCH ARTICLE
FiRES: A computational method for the de novo identification
of internal structure similarity in proteins
Claudia Alvarez-Carreño (1,2) | Gerardo Coello (3) | Marcelino Arciniega (1)
1 Department of Bioquímica y Biología Estructural, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, Mexico
2 School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
3 Unidad de Cómputo, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, Mexico
Correspondence
Claudia Alvarez-Carreño and Marcelino
Arciniega, Department of Bioquímica y
Biología Estructural, Instituto de Fisiología
Celular, Universidad Nacional Autónoma de
México, Address Circuito Exterior s/n, Ciudad
Universitaria, Apartado Postal 70-243, Mexico
City 04510, Mexico.
Email: calvarez@ifc.unam.mx (C.A.-C.) and
marciniega@ifc.unam.mx (M.A.)
Funding information
Dirección General de Asuntos del Personal
Académico, Universidad Nacional Autónoma
de México, Grant/Award Number: PAPIITDGAPA-IN213320;
Dirección General de
Cómputo y de Tecnologías de Información y
Comunicación, Universidad Nacional
Autónoma de México, Grant/Award Number:
LANCAD-UNAM-DGTIC-320; Universidad
Nacional Autónoma de México, Grant/Award
Number: IN213320
Peer Review
The peer review history for this article is
available at https://publons.com/publon/10.1002/prot.25886.
Abstract
Internal structure similarity in proteins can be observed at the domain and subdomain
levels. From an evolutionary perspective, structurally similar elements may arise
divergently by gene duplication and fusion events but may also be the product of
convergent evolution under physicochemical constraints. The characterization of proteins
that contain repeated structural elements has implications for many fields of
protein science including protein domain evolution, structure classification, structure
prediction, and protein engineering. FiRES (Find Repeated Elements in Structure) is
an algorithm that relies on a topology-independent structure alignment method to
identify repeating elements in protein structure. FiRES was tested against two hand-curated
databases of protein repeats: MALIDUP, for very divergent duplicated
domains; and RepeatsDB for short tandem repeats. The performance of FiRES was
compared to that of lalign, RADAR, HHrepID, CE-symm, ReUPred, and Swelfe. FiRES
was the method that most accurately detected proteins either with duplicated
domains (accuracy = 0.86) or with multiple repeated units (accuracy = 0.92). FiRES is
a new methodology for the discovery of proteins containing structurally similar elements.
The FiRES web server is publicly available at http://fires.ifc.unam.mx. The
scripts, results, and benchmarks from this study can be downloaded from
https://github.com/Claualvarez/fires.
KEYWORDS
internal structure similarity, protein domain, protein repeats, structural motif
1 | INTRODUCTION
Protein repeats consist of non-overlapping copies of either subdomain
elements or entire domains located within a single protein. These copies,
called repeated units, can be arranged in tandem or interspersed
throughout the sequence1-3
and may fold into similar three-dimensional
structures. Protein repeats can adopt a variety of native
conformations such as intrinsic disorder,4
globular domains and open
structures. For instance, six out of the 10 most prevailing globular
domains are formed by repeated units.5
These units have been studied
as remnants of hypothetical peptide-like predecessors of the first
folded proteins.5,6
Open solenoid domains are characteristically
formed by a stack of multiple short tandem repeats.2,7,8
Solenoid
domains appear to be an evolutionary adaptation more commonly
found in eukaryotes and are usually involved in protein-ligand and
[Correction added on 09 April 2020, after first online publication: Author name Alvarez-Carreño
Claudia has been updated to Claudia Alvarez-Carreño.]
Received: 28 June 2019 Revised: 12 November 2019 Accepted: 24 February 2020
DOI: 10.1002/prot.25886
Proteins. 2020;88:1169–1179. wileyonlinelibrary.com/journal/prot © 2020 Wiley Periodicals, Inc. 1169
protein-protein interactions.2,9
At the domain level, duplication, fusion,
and terminal losses have played a major role in the evolution of the
modern repertoire of protein structures and functions.10-12
Thus,
repeated units are functionally and structurally diverse, as are the evolutionary
mechanisms that preserve them.
Duplicated sequences accumulate point mutations as well as
insertions and deletions, which ultimately hide the trace of similarity
between them.3,6,9,13,14
Sequence divergence determines the amount
of structural variation displayed by protein repeats. However, three-dimensional
structure changes more slowly over evolutionary time
than sequence.15
For instance, distantly related domains maintain a
distinctive core of secondary structural elements even in the absence
of significant sequence similarity16
and circularly permutated proteins
frequently preserve the same overall three-dimensional disposition of
Cɑ atoms.17
In the case of tandem repeats, the disparity between
sequence divergence and structure conservation can be extreme.18
Thus, many studies have incorporated structure-based analysis to
facilitate the precise identification of very divergent polypeptide
chains. Non-sequential structure comparison algorithms, such as
CLICK,19
allow similarities between protein structures to be detected
irrespective of the topological connectivity of their secondary structural
elements.
Sequence- and structure-based methods have been developed to
identify internal similarity in proteins (Table 1). Algorithms that detect
similarity at the sequence level are particularly useful for the analysis of
protein repeats because their results facilitate homology inference.
Examples of such algorithms include lalign,20
HHrepID,21
RADAR,22
TPRpred,23
and TRUST.24
On the other hand, structure-based algorithms
can be advantageous for the discovery of remote homologs.
Structure-based repeat-detection algorithms (reviewed by Pellegrini25)
are broadly classified into two categories: (a) inference methods, such as
ReUPred,18
RepeatsDB-Lite,26
and IRIS,27
which are based on libraries
of already well-characterized reference units; and (b) de novo identification
methods, which do not rely on previous knowledge of the features
defining a repeated unit. The main strategies employed by de novo
identification methods are self-structure comparison, and detection of
periodicities using one or more structural parameters. Examples of self-structure
comparison methods include CE-symm,28
DAVROS,29
GANGSTA+,30
and SymD.31
Examples of pattern recognition algorithms
include Swelfe,32
which employs ɑ-angles; ProSTRIP,33
which calculates
dihedral angles; ConSole,8
which relies on contact maps; and TAPO,34
which detects periodicities of atomic coordinates and other parameters.
It should be noted, however, that structure similarity may be the product
of convergent evolution under functional and structural
constraints;35-37
thus, its identification does not unambiguously imply
homology. This is especially true at the subdomain level, where structural
similarity may represent energetically favorable conformations of
secondary structural elements.36,38,39
Therefore, the present study
focuses on similar structural elements, which include both homologous
and convergently evolved regions of internal structure similarity within
a protein.
Here, we present FiRES (Find Repeated Elements in Structure), a
computational protocol for the de novo identification of tandem and
non-tandem repetitive elements in protein structures. FiRES exploits a
topology-independent structure alignment method in order to detect
similar groups of elements. The performance of FiRES was assessed
TABLE 1 Algorithms for the study of protein repeats
Algorithm | Type of data | Description
lalign20 | Sequence | de novo identification of repeats
RADAR22 | Sequence | de novo identification of repeats
TRUST24 | Sequence | de novo identification of repeats
TPRpred23 | Sequence | TPRs, PPRs, and SEL1-like solenoid repeats
HHrepID21 | Sequence | de novo identification of repeats
ARD29 | Sequence | de novo identification of α-solenoid repeats
HMMER59/Pfam51 | Sequence, Pfam database | Reference-based identification of repeats
DAVROS29 | Structure | de novo identification of repeats
GANGSTA+30 | Structure | de novo identification of repeats
REPETITA60 | Structure | de novo identification of solenoid repeats
SymD31 | Structure | Identification of internal structure symmetry
ProSTRIP33 | Structure | de novo identification of repeats
RAPHAEL61 | Structure | de novo identification of solenoid repeats
CE-symm28 | Structure | Identification of internal structure symmetry and repeating units
ConSole8 | Structure | de novo identification of solenoid repeats
REPRO62 | Sequence | de novo identification of repeats
DeepSymmetry63 | Structure | de novo identification of tandem repeats
TAPO34 | Structure | de novo identification of tandem repeats
IRIS27 | Sequence or structure | Reference-based identification of repeats
RepeatsDB-Lite26 | Structure, RepeatsDB | Reference-based identification of short tandem repeats
ReUPred18 | Structure, RepeatsDB | Reference-based identification of short tandem repeats
Swelfe32 | Sequence or structure | de novo identification of repeats
Abbreviations: TPR, tetratricopeptide repeat; PPR, pentatricopeptide
repeat.
1170 ALVAREZ-CARREÑO ET AL.
on two types of data: proteins with short tandem repeats and proteins
with very divergent internal domain duplications. Finally, we show
that FiRES can be used for the discovery of proteins containing similar
structural elements with very low sequence identity (<20%), where
homology inference remains an open question.
2 | MATERIALS AND METHODS
The FiRES algorithm (Figure 1) searches internal structure similarity
within a protein by an iterative self-alignment process, which includes
a scoring system based on the template modeling score40
(TM-score).
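For reference, the TM-score can be computed from the distances between superimposed Cɑ pairs. The sketch below follows the published definition, TM = (1/L) Σ 1/(1 + (d_i/d_0)^2) with d_0 = 1.24 (L − 15)^(1/3) − 1.8. The function name, the choice of normalization length L, and the lower bound of 0.5 Å on d_0 are assumptions made here for illustration; they are not taken from the FiRES implementation.

```python
def tm_score(distances, norm_length):
    """TM-score for a set of aligned residue pairs.

    distances   -- distances (in angstroms) between superimposed C-alpha pairs
    norm_length -- chain length used for normalization (assumption: the text
                   does not state which length FiRES normalizes by)
    """
    if norm_length <= 0:
        raise ValueError("norm_length must be positive")
    # Length-dependent distance scale d0 from the TM-score definition.
    if norm_length > 15:
        d0 = 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    # Assumption: clamp d0 so that short fragments keep a usable scale.
    d0 = max(d0, 0.5)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / norm_length
```

A perfect superposition of half of a 100-residue chain gives a score of 0.5 under this definition, because every aligned pair contributes 1 and the sum is divided by the full normalization length.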
2.1 | Generation of query-target pairs
FiRES generates a series of alignments between a fragment of an
input protein structure and the input protein structure itself. To avoid
alignments at the diagonal, each fragment, referred to as query qi, is
aligned to a target region ti defined as the complement of qi over the
protein P:
$$t_i = P \setminus \{q_i\} \tag{1}$$
During the first iteration, the size of each query qi depends on the
total number N of secondary structural elements (SSEs) in the protein
P. To determine the number of SSEs, each residue in P is tagged with
a secondary structure label, namely helix, strand or loop. These tags
are assigned by the DSSP algorithm.41
Residues presenting geometrical
features that do not classify as helix or strand are treated as loops.
Consecutive residues with the same secondary structure label are
grouped into SSEs. The maximal number n of SSEs within each query
is calculated as follows:
$$n = \begin{cases} \operatorname{round}\left(\dfrac{N}{4}\right), & N < 28 \\ 7, & N \geq 28 \end{cases} \tag{2}$$
FIGURE 1 Flow diagram indicating the main steps of the FiRES algorithm [Color figure can be viewed at wileyonlinelibrary.com]
where N is the total number of SSEs in P and n is the maximum
number of SSEs in each query qi. The set of i queries is generated by
shifting the starting position of qi to the next helix or strand (Figure 1,
step 1).
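The query generation described in this section can be sketched as follows. This is a minimal illustration, not the FiRES source code: the three-letter labels ('H', 'E', 'L') stand in for reduced DSSP output, loops are counted as SSEs, halves are rounded up in Equation 2, and at least one SSE is placed in every query. All four choices, as well as the function names, are assumptions.

```python
from itertools import groupby


def group_sses(labels):
    """Group consecutive residues with the same label into SSEs.

    labels -- one character per residue: 'H' helix, 'E' strand, 'L' loop
    Returns a list of (label, first_residue, last_residue), 0-based, inclusive.
    """
    sses, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        sses.append((label, start, start + length - 1))
        start += length
    return sses


def max_sses_per_query(total_sses):
    """Equation 2: round(N / 4) when N < 28, otherwise 7."""
    if total_sses >= 28:
        return 7
    # Assumption: halves round up; the text does not state a tie rule.
    return int(total_sses / 4 + 0.5)


def generate_queries(labels):
    """Yield (query, target) residue-index lists, one per helix or strand."""
    sses = group_sses(labels)
    # Assumption: very small proteins still get one SSE per query.
    n = max(1, max_sses_per_query(len(sses)))
    for k, (label, first, _) in enumerate(sses):
        if label == "L":
            continue  # queries start at helices or strands only
        last = sses[k:k + n][-1][2]
        query = list(range(first, last + 1))
        # Equation 1: the target is the complement of the query over P.
        target = [r for r in range(len(labels)) if r < first or r > last]
        yield query, target
```

For a 20-residue toy chain labeled `"LLHHHHLLEEEELLHHHHLL"` there are seven SSEs, so n = 2 and three queries are produced, one starting at each helix or strand.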
2.2 | Local structure alignment
Query-target pairs are aligned using the CLICK algorithm with default
parameters.19
CLICK is a graph theory-based algorithm, which employs
the Cartesian coordinates of Cɑ atoms of the query and target structures
to form cliques. CLICK produces pairwise alignments by iteratively
matching increasingly larger cliques of points of the query and target
structures. For each query-target pair, CLICK returns pairs of Cɑ atoms
that render a low RMSD alignment. When the whole set of query-target
pairs has been aligned with CLICK, all pairs of matching residues are
stored in a single two-dimensional matrix (Figure 1, step 2).
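As an illustration of the data structure produced by this step, the sketch below accumulates the residue pairs returned by many query-target alignments into one square matrix. The input format (lists of 0-based index pairs), the use of 0/1 flags rather than counts, and the function name are assumptions; running CLICK itself is outside the scope of the example.

```python
def build_match_matrix(protein_length, alignments):
    """Collect matched residue pairs from many alignments into one matrix.

    alignments -- iterable of alignments, each a list of
                  (query_residue, target_residue) index pairs (0-based)
    Returns a protein_length x protein_length list of lists of 0/1 flags.
    """
    matrix = [[0] * protein_length for _ in range(protein_length)]
    for pairs in alignments:
        for q, t in pairs:
            if q == t:
                continue  # the diagonal is excluded by construction (Eq. 1)
            matrix[q][t] = 1
    return matrix
```

Pairs reported by more than one alignment are stored once, and the matrix is not mirrored, so cell (q, t) records only that query residue q was matched to target residue t.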
2.3 | Identification and evaluation of self-aligned
regions
From the two-dimensional matrix containing matching states, local
alignments are determined by a dynamic programming procedure,
which uses the Smith-Waterman42
scoring system and traceback
strategy (Figure 1, step 3). The following parameters were used to
generate the scoring matrix from which the local alignments are
determined: match +2; gap-opening −1; and no gap-extensions.
Only locally aligned regions with lengths over 10 residues are
maintained. By the end of this step, the first element ek of the
aligned pair (ek, e'k) becomes a new query element, for which a new
complementary target pair is generated by:
$$q_i = e_k, \qquad t_i = P \setminus \{e_k \cup e'_k\} \tag{3}$$
FIGURE 2 Schematic representation of the final evaluation and scoring step of FiRES. Internal structure similarity in the N-terminal TPR
domain of p67phox. A, Dot plot of the local structure alignments generated once the iterations over steps 2 and 3 converge. Orange arrowheads
indicate the terminal sites of the gap-extension process (˄ start, and ˅ end positions). B, Local structure alignment pairs that have passed the size
filter. The middle position of gapped and ungapped pairs is indicated by yellow and red squares, respectively. C, Clustering by transitivity. Two
different clusters are indicated. The first cluster includes pairs a, a', b, c, d', e', f', and g' (red arrows), the second cluster includes pairs h and h' (blue
arrows). D, Similar pairs of elements within the first cluster. The length of individual elements along the x-axis is indicated by colored stripes. E,
Visualization of the individual elements of the first cluster on the three-dimensional model (color code as in C). F, Table of results for 1W5M_A
New query-target pairs enter a loop over steps 2 and 3, until no
new matched residue pairs are found within a local alignment.
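The local alignment step can be illustrated with a Smith-Waterman-style dynamic program over a 0/1 matrix of matched residue pairs. The sketch below uses the stated scores (match +2, gap opening −1) and makes several assumptions that the text does not settle: "no gap-extensions" is read as a zero extension penalty, unmatched cells cannot be paired, and only the single best alignment is returned. The length filter (more than 10 residues) and the iteration over Equation 3 are not reproduced.

```python
NEG = float("-inf")


def local_align(match, gap_open=1, match_score=2):
    """Best local alignment over a 0/1 match matrix (Gotoh-style recurrences).

    Returns (score, pairs), where pairs lists the matched (i, j) cells.
    """
    rows, cols = len(match), len(match[0])
    # M: path ends on a matched cell; X: ends in a gap along i; Y: along j.
    M = [[NEG] * (cols + 1) for _ in range(rows + 1)]
    X = [[NEG] * (cols + 1) for _ in range(rows + 1)]
    Y = [[NEG] * (cols + 1) for _ in range(rows + 1)]
    back = {}  # (state, i, j) -> previous (state, i, j), or None at a start
    best, best_cell = 0, None

    def pick(candidates):
        return max(candidates, key=lambda c: c[0])

    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # Opening a gap costs gap_open; extending it costs nothing.
            X[i][j], back["X", i, j] = pick([
                (M[i - 1][j] - gap_open, ("M", i - 1, j)),
                (X[i - 1][j], ("X", i - 1, j))])
            Y[i][j], back["Y", i, j] = pick([
                (M[i][j - 1] - gap_open, ("M", i, j - 1)),
                (X[i][j - 1] - gap_open, ("X", i, j - 1)),
                (Y[i][j - 1], ("Y", i, j - 1))])
            if match[i - 1][j - 1]:
                # The 0 option restarts the alignment (local alignment).
                prev, src = pick([
                    (0, None),
                    (M[i - 1][j - 1], ("M", i - 1, j - 1)),
                    (X[i - 1][j - 1], ("X", i - 1, j - 1)),
                    (Y[i - 1][j - 1], ("Y", i - 1, j - 1))])
                M[i][j] = prev + match_score
                back["M", i, j] = src
                if M[i][j] > best:
                    best, best_cell = M[i][j], ("M", i, j)

    # Traceback from the best-scoring matched cell to the alignment start.
    pairs, cell = [], best_cell
    while cell is not None:
        state, i, j = cell
        if state == "M":
            pairs.append((i - 1, j - 1))
        cell = back[cell]
    return best, pairs[::-1]
```

With these assumptions, two diagonal runs of three and two matches separated by one unmatched diagonal cell are joined at the cost of two gap openings, giving a score of 5 × 2 − 2 = 8.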
2.4 | Final evaluation step and scoring
Once the number and length of candidate elements remain constant
through the iteration of steps 2 and 3, a gap extension process is initialized
(Figure 2). The goal of this process is to retrieve pairs of elements
that fold into similar three-dimensional structures, but that may be
formed by non-sequential SSEs. A gap is extended between two fragment
pairs if these are separated by <55 residues in sequence (Figure 2A).
Gapped and ungapped pairs over 20 residues length (Figure 2B) undergo
a final structure alignment, using CLICK. The TM-score of the CLICK-generated
superimpositions is calculated to assess the similarity between
candidate pairs of elements. Finally, elements are clustered by transitivity.
Elements are considered equivalent upon transitivity if the middle position
of any of the elements in a pair is at most two positions away on its
sequence representation from the middle position of an element in any
other pair (Figure 2B, C). Similar elements grouped by transitivity are displayed
with the same initial reference in the output (Figure 2E).
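Clustering by transitivity can be sketched with a union-find structure: two pairs of elements fall into the same cluster when the middle position of any element in one pair lies within two positions of the middle position of an element in the other pair, and clusters are then merged transitively. The representation of elements as inclusive (start, end) ranges, the midpoint formula (start + end) / 2, and the function name are assumptions made for this example.

```python
def cluster_by_transitivity(pairs, tolerance=2):
    """Cluster element pairs whose element midpoints nearly coincide.

    pairs -- list of ((start_a, end_a), (start_b, end_b)) residue ranges
    Returns a list of clusters, each a sorted list of indices into pairs.
    """
    parent = list(range(len(pairs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def midpoints(pair):
        # Assumption: the middle position is the mean of start and end.
        return [(start + end) / 2 for start, end in pair]

    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            if any(abs(m - n) <= tolerance
                   for m in midpoints(pairs[a])
                   for n in midpoints(pairs[b])):
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for index in range(len(pairs)):
        clusters.setdefault(find(index), []).append(index)
    return sorted(clusters.values())
```

Because merging is transitive, a pair can join a cluster without sharing a midpoint with every member; it only needs to share one with at least one pair already in the cluster.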
2.5 | Protein repeats datasets
Two databases were selected in order to evaluate the ability of the
methods to identify proteins with different types of repeats:
MALIDUP for proteins with duplicated domains and RepeatsDB for
proteins with multiple tandem repeats. To facilitate comparisons to
other sequence and structure-based methods, two data sets were
assembled from MALIDUP and RepeatsDB with entries fulfilling the
following requirements: (a) at least one of the reference units should be
longer than 15 amino acids; (b) the protein should contain only one
type of repeat; and (c) repeat units should not have circular permutations.
The first data set contains 137 proteins with duplicated domains
from the MALIDUP database.43
The second data set includes 3522
proteins with short-tandem repeats retrieved from RepeatsDB.44
PDB files were obtained from the PDB database using the script
get_pdb.py from Rosetta (www.rosettacommons.org). FASTA
sequences were obtained from UniProt45
using the corresponding
PDB code and chain.
2.6 | Database of no repeats
A database of proteins without internal sequence and structure similarity
was generated using three sequential filters. First, a database of non-symmetrical
protein structures was constructed. To this end, a subset of
the PDB composed of 28 337 non-redundant chains, as determined by
BLASTClust at 30% sequence identity, was evaluated by SymD.31
From a
total of 6088 structures that were considered non-symmetrical (Z-score
< 4), 3300 were randomly selected to continue to the next stage. Then,
two random groups were assembled with 300 and 3000 proteins to
resemble the sizes of MALIDUP and RepeatsDB, respectively. The second
filter consisted of excluding proteins that presented repeated sequence signatures
according to InterPro.46
These signatures include annotations from
CATH,47
Gene3D,48
CDD,49
PANTHER,50
Pfam,51
ProDom,52
PROSITE,53
SMART,54
SUPERFAMILY,55
and TIGRFAMs.56
The remaining proteins
were evaluated by all algorithms tested in this study (see Section 2.7, Performance evaluation).
Pairwise structure alignments of the predictions made by any of
the tested algorithms were performed with TM-align. Predicted repeat
pairs with >15 aligned residues and a TM-score higher than 0.5 were visually
inspected to evaluate the content, connectivity and orientation of their
SSEs. This inspection revealed the presence of 81 proteins with internal
structure similarity, which were removed from the final control sets (Suppl.
Table S1). Finally, two control sets, comprising 132 and 2125 proteins
with low probability of containing repeated elements, were established as
negative controls for MALIDUP and RepeatsDB, respectively.
2.7 | Performance evaluation
The ability of FiRES to identify repeated structural elements was compared
to that of three sequence-based and three structure-based
methods. Sequence-based algorithms consisted of lalign from FASTA version
36,20
RADAR22
version 1.3, and HHrepID21
version 1.0.0. Structure-based
algorithms consisted of ReUPred,18
Swelfe,32
and CE-symm28
version
2.0. In all cases, standalone versions of the software were used. To
evaluate the performance of HHrepID, lalign, and FiRES, which output multiple
answers, only their best-scoring results were considered. The results
of lalign, HHrepID, and FiRES were ranked based on E-value, P-value, and
TM-score, respectively. For sequence-based methods, only predictions
made on parts of the protein for which a structural model is available were
evaluated. To confirm structural similarity, the predicted units were evaluated
with TM-align.23
Alignments involving at least 50% of each unit and
rendering a TM-score higher than 0.5 were considered correct.
The methods were evaluated both at the protein and at the unit
level. At the protein level, a prediction was considered true positive if
at least half of the predicted units had a correct alignment to at least
one other predicted unit. This test evaluates if the methods can differentiate
between proteins with and without internal similarity regardless
of whether the predicted units correspond to the database definitions
or not. At the unit level, a predicted unit was considered true positive if
it had a correct alignment with at least one of the reference units as
reported by MALIDUP or RepeatsDB. At both levels, the performance
was evaluated by their true positive rate (TPR). Additionally, at the protein
level, true negative rate (TNR) and accuracy were calculated based
on the predictions of the methods in the control sets.
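The evaluation criteria above can be written out explicitly. The sketch below encodes the correctness rule for an alignment, the protein-level true positive rule, and the three summary rates, with accuracy taken as (TP + TN) / (P + N). The function names and the input representation are assumptions; the counts in the usage example are the FiRES values reported in Tables 2 and 4 for MALIDUP and control set 1.

```python
def is_correct_alignment(aligned, len_a, len_b, tm_score):
    """Alignment covers at least 50% of each unit and has TM-score > 0.5."""
    return (aligned >= 0.5 * len_a
            and aligned >= 0.5 * len_b
            and tm_score > 0.5)


def protein_is_true_positive(units, correct):
    """True if at least half of the predicted units align to another unit.

    units   -- list of predicted unit identifiers
    correct -- set of frozensets {u, v} for unit pairs with a correct alignment
    """
    if not units:
        return False
    supported = sum(
        any(frozenset((u, v)) in correct for v in units if v != u)
        for u in units)
    return 2 * supported >= len(units)


def rates(true_pos, positives, true_neg, negatives):
    """Return (TPR, TNR, accuracy) from protein-level counts."""
    tpr = true_pos / positives
    tnr = true_neg / negatives
    accuracy = (true_pos + true_neg) / (positives + negatives)
    return tpr, tnr, accuracy


# FiRES on MALIDUP (102 of 137) and control set 1 (129 of 132).
tpr, tnr, accuracy = rates(102, 137, 129, 132)
print(round(tpr, 2), round(tnr, 2), round(accuracy, 2))  # 0.74 0.98 0.86
```

The printed values agree with the FiRES entries for Benchmark 1, which supports reading the reported accuracy as the fraction of correctly classified proteins over the union of the repeat set and its control set.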
3 | RESULTS
3.1 | Benchmark test
The ability of FiRES, ReUPred, Swelfe, CE-symm, HHrepID, lalign, and
RADAR to detect proteins with very divergent duplicated domains
was tested using a subset of MALIDUP. Each method obtained similar
TPRs at the protein level and at the unit level (Table 2). At both levels,
FiRES obtained the highest TPR within the MALIDUP dataset. FiRES
correctly identified repeats in 102 out of 137 proteins, which is more
than twice the number of proteins identified by RADAR, HHrepID or
lalign (Table 2). CE-symm, which correctly identified 64 proteins, displayed
the second best TPR. Both FiRES and CE-symm rely on structural
alignments to detect internal similarity, which, in general,
makes them better suited to identify larger repeated units.28
However,
CE-symm detects similar units only when the interface
between units is conserved.28
Swelfe identified only 26 proteins in
the MALIDUP dataset and was clearly outperformed by its
sequence-based counterparts. ReUPred detected six proteins containing
duplicated domains. These six proteins are themselves
formed by the repetition of supersecondary structure elements,
namely ɑɑ- and ββ-hairpins, and βɑβ-elements. ReUPred is a method
that was specifically designed for the identification and classification
of solenoid proteins.18
Thus, it is not surprising that ReUPred
performs poorly on a set that exclusively includes duplicated
domains.
Although protein structure tends to be more conserved than
sequence, structural divergence of the units and of the interface
between the units constitutes a major challenge for structure-based
repeat identification methods. Collectively, all methods predicted
only 110 out of 137 proteins in MALIDUP. The 27 proteins that
remained undetected contain very divergent pairs of domains that
share <20% identical residues or that render an average TM-score
below 0.5 (Suppl. Table S2).
The RepeatsDB dataset tests the ability of the methods to identify
proteins containing units that are repeated multiple times. Proteins
in RepeatsDB have on average 7.7 repeated units. All algorithms
obtained better results at the protein level on the RepeatsDB dataset
than on the MALIDUP dataset (Table 2). However, at the unit level
the TPR of FiRES (0.18), Swelfe (0.03) and lalign (0.07) was very low,
compared to their TPR at the protein level. The TPR at the unit level
indirectly evaluates the ability of the different methods to assign
boundaries to the units. At this level, CE-symm achieved the highest
TPR (0.70), followed by HHrepID (0.49) and RADAR (0.39). FiRES
tends to output units that are longer than the reference units in
RepeatsDB. ReUPred, which was designed as a classification algorithm
for solenoid domains, has a very similar TPR both at the protein
and at the unit levels.
At the protein level, FiRES identified 91% of proteins in this set.
The TPR of the rest of the methods at the protein level remained
below 70% (Table 2). To gain further insights on the strengths and
weaknesses of the FiRES algorithm, the results were broken down
based on the classification of protein repeats (Table 3). The set of
structures within RepeatsDB that was analyzed here included repeats
within classes II, III, IV, and V (Table 3). FiRES displayed a high TPR
(above 80%) over a broad spectrum of repeats. The most notable
exceptions to this trend were sub-categories II.2) α helical coiled coil;
III.4) β-trefoil/β hairpins; IV.3) β-trefoil; and V.3) ɑ/β-Beads. In fact,
none of the algorithms achieved adequate results for subclasses II.2) α
helical coiled coil; III.4) β Trefoil/β Hairpins; and IV.3) β-trefoil. In contrast,
some types of repeats turned out to be relatively easy to identify.
For instance, almost all methods, apart from ReUPred, achieved
TPR above 60% for IV.6) ɑ-Barrel; IV.7) ɑ/β Barrel; and IV.9) ɑ/β Trefoil
(Table 3).
The ability of the methods to differentiate proteins with
repeated elements from proteins without internal similarity was
tested on a new database called database of no repeats. This database
was curated such that repeats of all types (short, long, tandem,
and non-tandem) are filtered out (see Methods). The database of no
repeats contains a total of 2257 non-redundant structures from
the PDB.
As expected, structure-based methods produced a high true negative
rate (Table 4). Structure-based methods directly evaluate structure,
as opposed to sequence-based methods, for which structure
similarity is a prediction. Remarkably, HHrepID, a sequence-based
method, accomplished a true negative rate of 0.98 for both control
sets (Table 4). In contrast, two thirds of the predictions made by
RADAR turned out to be false positive results. Overall, FiRES was the
method that most accurately differentiated between proteins with
and without internal structure similarity, followed by CE-symm and
TABLE 2 True positive rate of the detection methods
Structure-based methods: FiRES, ReUPred, Swelfe, CE-symm | Sequence-based methods: HHrepID, lalign, RADAR
Dataset | Level | FiRES | ReUPred | Swelfe | CE-symm | HHrepID | lalign | RADAR
MALIDUP | Protein (n = 137) | 102 (0.74) | 6 (0.04) | 26 (0.19) | 64 (0.47) | 34 (0.25) | 39 (0.28) | 28 (0.20)
MALIDUP | Unit (n = 274) | 185 (0.68) | 7 (0.03) | 55 (0.20) | 136 (0.50) | 80 (0.29) | 65 (0.24) | 55 (0.20)
RepeatsDB | Protein (n = 3522) | 3225 (0.92) | 669 (0.19) | 1524 (0.43) | 2421 (0.69) | 1704 (0.48) | 1608 (0.46) | 1748 (0.50)
RepeatsDB | Unit (n = 26 980) | 4852 (0.18) | 4960 (0.18) | 906 (0.03) | 18 755 (0.70) | 13 189 (0.49) | 1891 (0.07) | 10 626 (0.39)
HHrepID (Table 5). FiRES is a powerful method for the identification
of proteins that contain domain-size or subdomain-size similar structural
elements.
3.2 | Detecting similar elements with very low
sequence identity
During the construction of the database of no repeats, the algorithms
identified a total of 81 proteins with internal structure similarity.
Internal similarity within these proteins was not documented in InterPro.
More than half of these cases were identified only by FiRES,
whereas 21 were identified by FiRES and another method. Only
16 cases were identified by a method different from FiRES (Suppl.
Table S1). From the 44 results that were exclusive to FiRES, two
examples were selected to illustrate the use of FiRES to detect hidden
evolutionary relationships between structurally similar elements
(Figures 3 and 4 and Suppl. Methods). In both examples, the discrimination
between homology and analogy required a combination of
structure- and sequence-based methods.
TABLE 4 True negative rate of the detection methods
Structure-based methods: FiRES, ReUPred, Swelfe, CE-symm | Sequence-based methods: HHrepID, lalign, RADAR
Control set | FiRES | ReUPred | Swelfe | CE-symm | HHrepID | lalign | RADAR
Control 1 [N = 132] | TN: 129 (97.7%) | TN: 128 (97.0%) | TN: 132 (100%) | TN: 131 (99.2%) | TN: 129 (97.7%) | TN: 113 (85.6%) | TN: 48 (36.3%)
Control 2 [N = 2125] | TN: 2009 (94.5%) | TN: 2070 (97.4%) | TN: 2125 (100%) | TN: 2112 (99.4%) | TN: 2078 (97.7%) | TN: 1749 (82.3%) | TN: 867 (40%)
Note: Negative control sets for MALIDUP (control 1) and RepeatsDB (control 2).
TABLE 3 Number of correctly detected proteins with short tandem repeats in RepeatsDB
FiRES ReUPred Swelfe CE-symm HHrepID lalign RADAR
II.2 (N = 9) 4 (0.44) 3 (0.33) 0 0 0 0 0
III.1 (N = 321) 258 (0.80) 80 (0.25) 113 (0.35) 72 (0.22) 110 (0.34) 89 (0.28) 93 (0.29)
III.2 (N = 322) 308 (0.96) 83 (0.26) 247 (0.77) 231 (0.72) 190 (0.59) 254 (0.79) 212 (0.66)
III.3 (N = 863) 804 (0.93) 323 (0.37) 552 (0.64) 747 (0.87) 582 (0.67) 509 (0.59) 566 (0.66)
III.4 (N = 49) 29 (0.59) 2 (0.04) 14 (0.29) 14 (0.29) 10 (0.20) 14 (0.29) 10 (0.20)
III.5 (N = 57) 54 (0.95) 2 (0.04) 31 (0.54) 15 (0.26) 20 (0.35) 17 (0.30) 17 (0.30)
IV.1 (N = 523) 471 (0.90) 47 (0.09) 3 (0.01) 232 (0.44) 10 (0.02) 29 (0.05) 60 (0.11)
IV.2 (N = 77) 68 (0.88) 4 (0.05) 12 (0.16) 25 (0.32) 14 (0.18) 8 (0.10) 18 (0.23)
IV.3 (N = 24) 10 (0.42) 0 9 (0.38) 13 (0.54) 0 0 0
IV.4 (N = 780) 755 (0.97) 108 (0.14) 360 (0.46) 712 (0.91) 446 (0.57) 415 (0.53) 478 (0.61)
IV.5 (N = 177) 176 (0.99) 2 (0.01) 68 (0.38) 176 (0.99) 175 (0.99) 135 (0.76) 151 (0.85)
IV.6 (N = 5) 4 (0.80) 0 4 (0.80) 4 (0.80) 4 (0.80) 4 (0.80) 4 (0.80)
IV.7 (N = 5) 5 (1.00) 0 3 (0.60) 5 (1.00) 3 (0.60) 3 (0.60) 3 (0.60)
IV.8 (N = 102) 100 (0.98) 6 (0.06) 35 (0.34) 79 (0.77) 47 (0.46) 58 (0.57) 50 (0.49)
IV.9 (N = 15) 14 (0.93) 0 10 (0.67) 11 (0.73) 13 (0.87) 9 (0.60) 9 (0.60)
IV.10 (N = 45) 38 (0.84) 4 (0.09) 0 45 (1.00) 15 (0.33) 6 (0.13) 7 (0.16)
V.1 (N = 13) 13 (1.00) 2 (0.15) 9 (0.70) 10 (0.77) 8 (0.62) 9 (0.69) 7 (0.54)
V.2 (N = 37) 32 (0.87) 0 24 (0.65) 5 (0.14) 28 (0.76) 10 (0.27) 30 (0.81)
V.3 (N = 14) 6 (0.43) 0 2 (0.14) 11 (0.79) 2 (0.14) 2 (0.14) 3 (0.21)
V.4 (N = 41) 35 (0.85) 1 (0.02) 17 (0.41) 8 (0.20) 8 (0.20) 19 (0.46) 15 (0.37)
V.5 (N = 43) 41 (0.95) 2 (0.05) 11 (0.26) 6 (0.14) 19 (0.44) 18 (0.42) 15 (0.35)
Note: The highest true positive rate for each protein fold is highlighted in bold. II.2) α helical coiled coil; III.1) Solenoid; III.2) ɑ/β Solenoid; III.3) ɑ-Solenoid;
III.4) β Trefoil/β Hairpins; III.5) Anti-parallel β Layer/β Hairpins. IV.1) TIM-Barrel; IV.2) β-Barrel/β-Hairpins; IV.3) β-Trefoil; IV.4) β-Propeller; IV.5) ɑ/β Prism;
IV.6) ɑ-Barrel; IV.7) ɑ/β Barrel; IV.8) ɑ/β Propeller; IV.9) ɑ/β Trefoil; IV.10) Aligned prism. V.1) ɑ-Beads; V.2) β-Beads; V.3) ɑ/β-Beads; V.4) β Sandwich
beads; V.5) ɑ/β Sandwich.
Abbreviation: N, number of cases.
3.2.1 | Non-ribosomal peptide synthetase
(D1GLU5)
The non-ribosomal peptide synthetase from Streptomyces lydicus is
formed by an N-terminal MbtH domain (PF03621), a central AMP-binding
domain (PF00501) and a C-terminal AMP-binding_C domain
(PF13193). FiRES identified a repeated motif within the central AMP-binding
domain, with a TM-score of 0.59 (Figure 3). However, between
these two elements there are only 14% identical residues. These elements
were analyzed using sequence-based methods, which confirmed
sequence relationships between element 1 and element 2 (Supp.
Methods). Put together, these arguments suggest that the AMP-binding
domain originated from a duplicated motif.
3.2.2 | Elongation factor Tu GTP binding domain
of BipA (Q9L5X8)
FiRES detected a repeated element within the GTP-binding domain
(IPR005225) of the protein BipA from Vibrio parahaemolyticus. This
domain belongs to the P-loop NTPase superfamily. The structural
alignment of these elements produces an RMSD of 1.68 Å and a TM-score
of 0.55. The pair of similar elements within the GTP-binding
domain shares 10% identical residues, according to their structure-based
sequence alignment (Figure 4). However, sequence-based analyses
showed evidence to support that these structural motifs have a
common evolutionary origin (Supp. Methods).
4 | DISCUSSION
We tested FiRES and six other repeat identification methods on two
hand-curated databases: MALIDUP, for duplicated domains, and
RepeatsDB for short tandem repeats. MALIDUP and RepeatsDB contain
proteins with very divergent repeated units. In both cases, FiRES
obtained outstanding results at the protein level in terms of TPR, TNR
and accuracy. Furthermore, FiRES provided more consistent
results than lalign, RADAR, CE-symm, HHrepID, ReUPred, and
Swelfe in detecting proteins with different types of repeats. FiRES was
designed as a tool to detect internal structure similarity in proteins
TABLE 5 Accuracy of the detection methods at the protein level
Structure-based methods: FiRES, ReUPred, Swelfe, CE-symm | Sequence-based methods: HHrepID, lalign, RADAR
Benchmark | FiRES | ReUPred | Swelfe | CE-symm | HHrepID | lalign | RADAR
Benchmark 1 [N = 269] | 0.86 | 0.50 | 0.59 | 0.72 | 0.61 | 0.57 | 0.28
Benchmark 2 [N = 5647] | 0.93 | 0.49 | 0.65 | 0.80 | 0.67 | 0.59 | 0.46
Note: Benchmark 1: union of MALIDUP and control set 1; Benchmark 2: union of RepeatsDB and control set 2.
FIGURE 3 Structurally similar
elements identified by FiRES in D1GLU5
from Streptomyces lydicus (PDB code:
4GR4_A). The ribbon representation of
elements 1 (blue) and 2 (cyan) is
shown. A, Element 1 found within the
central AMP-binding domain. B, Element
2 also located within the central AMP-binding
domain. C, Structural alignment of
elements 1 and 2 [Color figure can be
viewed at wileyonlinelibrary.com]
FIGURE 4 Structurally similar
elements identified by FiRES in a P-loop
NTPase domain (PDB code: 3E3X_A). The
ribbon representation of elements 1 (blue)
and 2 (cyan) is shown. A, Element 1. B,
Element 2. C, Structural alignment of
elements 1 and 2 [Color figure can be
viewed at wileyonlinelibrary.com]
where these similarities remain unnoticed. FiRES is not intended for
classification of repetitive elements. However, the self-structure comparison
strategy employed by FiRES may lead to the development of
new methods for protein repeats classification.
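The protein-level metrics mentioned in this section (TPR, TNR and accuracy) can be computed from per-protein calls with the standard confusion-matrix definitions. The sketch below assumes that each protein carries a single boolean label (repeat-containing or control) and a single boolean prediction; the class name, function name, and toy identifiers are hypothetical and do not reproduce the benchmark data.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ProteinLevelCounts:
    tp: int = 0  # repeat-containing protein, repeat detected
    tn: int = 0  # control protein, nothing detected
    fp: int = 0  # control protein, repeat detected
    fn: int = 0  # repeat-containing protein, nothing detected

    def tpr(self) -> float:
        """True positive rate (sensitivity): TP / (TP + FN)."""
        return self.tp / (self.tp + self.fn)

    def tnr(self) -> float:
        """True negative rate (specificity): TN / (TN + FP)."""
        return self.tn / (self.tn + self.fp)

    def accuracy(self) -> float:
        """Fraction of proteins classified correctly."""
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)


def count_outcomes(truth: Dict[str, bool], predicted: Dict[str, bool]) -> ProteinLevelCounts:
    """Tally outcomes per protein; a missing prediction is treated as negative (assumption)."""
    counts = ProteinLevelCounts()
    for protein_id, has_repeat in truth.items():
        called = predicted.get(protein_id, False)
        if has_repeat and called:
            counts.tp += 1
        elif has_repeat and not called:
            counts.fn += 1
        elif not has_repeat and called:
            counts.fp += 1
        else:
            counts.tn += 1
    return counts


# Hypothetical toy benchmark: three repeat-containing proteins and two controls.
truth = {"p1": True, "p2": True, "p3": False, "p4": False, "p5": True}
predicted = {"p1": True, "p2": False, "p3": False, "p4": True, "p5": True}
counts = count_outcomes(truth, predicted)
print(counts.tpr(), counts.tnr(), counts.accuracy())
```

Reporting TPR and TNR alongside accuracy matters when positives and controls are unbalanced, since accuracy alone can be dominated by the larger class.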
Three key features were implemented in FiRES to produce highly
accurate results. First, the comparison of non-sequential structural
elements increases the sensitivity of FiRES because it enables the
identification of incomplete units, as well as of units with insertions,
deletions, circular permutations and other types of fold change. Second,
iterating the identification process makes it possible to
detect multiple types of repeated elements within the same protein
structure. Third, the last step of the algorithm is a time-consuming,
exhaustive structural comparison of each candidate pair, which renders
the FiRES algorithm highly specific.
The identification of structurally similar units that lack sequence
similarity can help elucidate remote homology relationships. Here, we
presented two examples in which FiRES, in combination with
state-of-the-art sequence similarity detection methods, provides new
insights into the evolution of protein domain structure. Besides being
of interest for evolutionary studies, protein repeats have been employed
to construct well-folded and stable protein chimeras.7,57 Individual
elements contribute to the functional properties of the chimeric protein,
making it possible to target specific biological activities throughout a
design process.58 Computational tools that identify internal similarity
within a protein structure can aid the design of new protein functions
by recombination of specific domains or subdomains. Consequently, FiRES
may have a positive impact on many fields of protein science.
ACKNOWLEDGMENTS
M.A. thanks the Programa de Apoyo a Proyectos de Investigación e
Innovación Tecnológica and the Dirección General de Cómputo y de
Tecnologías de Información y Comunicación of the Universidad Nacional
Autónoma de México for supporting this study through grants
PAPIIT-DGAPA-IN213320 and LANCAD-UNAM-DGTIC-320. G.C. acknowledges the
computing facilities at the Cell Physiology Institute of the Universidad
Nacional Autónoma de México. C.A.-C. was supported by a
Fulbright-García Robles Fellowship.
CONFLICT OF INTERESTS
The authors declare that they have no conflicts of interest with the
content of this article.
ORCID
Claudia Alvarez-Carreño https://orcid.org/0000-0002-1827-8946
Marcelino Arciniega https://orcid.org/0000-0002-7526-6941
REFERENCES
1. Andrade MA, Perez-Iratxeta C, Ponting CP. Protein repeats: structures, functions, and evolution. J Struct Biol. 2001;134(2–3):117-131. https://doi.org/10.1006/jsbi.2001.4392.
2. Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D. A census of protein repeats. J Mol Biol. 1999;293(1):151-160. https://doi.org/10.1006/jmbi.1999.3136.
3. Rajathei DM, Selvaraj S. Analysis of sequence repeats of proteins in the PDB. Comput Biol Chem. 2013;47:156-166. https://doi.org/10.1016/j.compbiolchem.2013.09.001.
4. Jorda J, Xue B, Uversky VN, Kajava AV. Protein tandem repeats - the more perfect, the less structured. FEBS J. 2010;277(12):2673-2682. https://doi.org/10.1111/j.1742-4658.2010.07684.x.
5. Söding J, Lupas AN. More than the sum of their parts: on the evolution of proteins from peptides. Bioessays. 2003;25(9):837-846. https://doi.org/10.1002/bies.10321.
6. Broom A, Doxey AC, Lobsanov YD, et al. Modular evolution and the origins of symmetry: reconstruction of a three-fold symmetric globular protein. Structure. 2012;20(1):161-171. https://doi.org/10.1016/j.str.2011.10.021.
7. Main ER, Lowe AR, Mochrie SG, Jackson SE, Regan L. A recurring theme in protein engineering: the design, stability and folding of repeat proteins. Curr Opin Struct Biol. 2005;15(4):464-471. https://doi.org/10.1016/j.sbi.2005.07.003.
8. Hrabe T, Godzik A. ConSole: using modularity of contact maps to locate solenoid domains in protein structures. BMC Bioinformatics. 2014;15:119. https://doi.org/10.1186/1471-2105-15-119.
9. Fournier D, Palidwor GA, Shcherbinin S, et al. Functional and genomic analyses of alpha-solenoid proteins. PLoS One. 2013;8(11):e79894. https://doi.org/10.1371/journal.pone.0079894.
10. Bornberg-Bauer E, Beaussart F, Kummerfeld SK, Teichmann SA, Weiner J. The evolution of domain arrangements in proteins and interaction networks. Cell Mol Life Sci. 2005;62(4):435-445. https://doi.org/10.1007/s00018-004-4416-1.
11. Moore AD, Björklund ÅK, Ekman D, Bornberg-Bauer E, Elofsson A. Arrangements in the modular evolution of proteins. Trends Biochem Sci. 2008;33(9):444-451. https://doi.org/10.1016/j.tibs.2008.05.008.
12. Nacher JC, Hayashida M, Akutsu T. The role of internal duplication in the evolution of multi-domain proteins. Biosystems. 2010;101(2):127-135. https://doi.org/10.1016/j.biosystems.2010.05.005.
13. Grishin NV. Fold change in evolution of protein structures. J Struct Biol. 2001;134(2–3):167-185. https://doi.org/10.1006/jsbi.2001.4335.
14. Russell RB, Ponting CP. Protein fold irregularities that hinder sequence analysis. Curr Opin Struct Biol. 1998;8(3):364-371. https://doi.org/10.1016/S0959-440X(98)80071-7.
15. Illergård K, Ardell DH, Elofsson A. Structure is three to ten times more conserved than sequence - a study of structural response in protein cores. Proteins Struct Funct Bioinform. 2009;77(3):499-508. https://doi.org/10.1002/prot.22458.
16. Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986;5(4):823-826. https://doi.org/10.1002/j.1460-2075.1986.tb04288.x.
17. Uliel S, Fliess A, Amir A, Unger R. A simple algorithm for detecting circular permutations in proteins. Bioinformatics. 1999;15(11):930-936. https://doi.org/10.1093/bioinformatics/15.11.930.
18. Hirsh L, Piovesan D, Paladin L, Tosatto SCE. Identification of repetitive units in protein structures with ReUPred. Amino Acids. 2016;48(6):1391-1400. https://doi.org/10.1007/s00726-016-2187-2.
19. Nguyen MN, Madhusudhan MS. Biological insights from topology independent comparison of protein 3D structures. Nucleic Acids Res. 2011;39(14):e94. https://doi.org/10.1093/nar/gkr348.
20. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 1988;85(8):2444-2448.
21. Biegert A, Söding J. De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics. 2008;24(6):807-814. https://doi.org/10.1093/bioinformatics/btn039.
22. Heger A, Holm L. Rapid automatic detection and alignment of repeats in protein sequences. Proteins Struct Funct Genet. 2000;41(2):224-237.
23. Karpenahalli MR, Lupas AN, Söding J. TPRpred: a tool for prediction of TPR-, PPR- and SEL1-like repeats from protein sequences. BMC Bioinformatics. 2007;8:2. https://doi.org/10.1186/1471-2105-8-2.
24. Szklarczyk R, Heringa J. Tracking repeats using significance and transitivity. Bioinformatics. 2004;20(suppl 1):311-317. https://doi.org/10.1093/bioinformatics/bth911.
25. Pellegrini M. Tandem repeats in proteins: prediction algorithms and biological role. Front Bioeng Biotechnol. 2015;3:143. https://doi.org/10.3389/fbioe.2015.00143.
26. Hirsh L, Paladin L, Piovesan D, Tosatto SCE. RepeatsDB-lite: a web server for unit annotation of tandem repeat proteins. Nucleic Acids Res. 2018;46(W1):W402-W407. https://doi.org/10.1093/nar/gky360.
27. Kao HY, Shih TH, Pai TW, Da Lu M, Hsu HH. A comprehensive system for identifying internal repeat substructures of proteins. Paper presented at: CISIS 2010 - 4th International Conference on Complex, Intelligent and Software Intensive Systems; 2010:689-693. https://doi.org/10.1109/CISIS.2010.92.
28. Bliven SE, Lafita A, Rose PW, Capitani G, Prlic A, Bourne PE. Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm. PLoS Comput Biol. 2019;15(4):e1006842. https://doi.org/10.1371/journal.pcbi.1006842.
29. Murray KB, Taylor WR, Thornton JM. Toward the detection and validation of repeats in protein structure. Proteins Struct Funct Bioinform. 2004;57(2):365-380. https://doi.org/10.1002/prot.20202.
30. Guerler A, Knapp E-W. Novel protein folds and their nonsequential structural analogs. Protein Sci. 2008;17(8):1374-1382. https://doi.org/10.1110/ps.035469.108.
31. Jha A, Flurchick KM, Bikdash M, Kc DB. Parallel-SymD: a parallel approach to detect internal symmetry in protein domains. Biomed Res Int. 2016;2016:4628592.
32. Abraham A, Rocha EPC, Pothier J. Swelfe: a detector of internal repeats in sequences and structures. Bioinformatics. 2008;24(13):1536-1537. https://doi.org/10.1093/bioinformatics/btn234.
33. Sabarinathan R, Basu R, Sekar K. ProSTRIP: a method to find similar structural repeats in three-dimensional protein structures. Comput Biol Chem. 2010;34(2):126-130. https://doi.org/10.1016/j.compbiolchem.2010.03.006.
34. Do Viet P, Roche DB, Kajava AV. TAPO: a combined method for the identification of tandem repeats in protein structures. FEBS Lett. 2015;589(19):2611-2619. https://doi.org/10.1016/j.febslet.2015.08.025.
35. Russell RB, Saqi MAS, Sayle RA, Bates PA, Sternberg MJE. Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J Mol Biol. 1997;269:423-439. https://doi.org/10.1006/jmbi.1997.1019.
36. Krishna SS, Grishin NV. Structurally analogous proteins do exist! Structure. 2004;12(7):1125-1127. https://doi.org/10.1016/j.str.2004.06.004.
37. Jung J, Lee B. Circularly permuted proteins in the protein structure database. Protein Sci. 2001;10:1881-1886. https://doi.org/10.1110/ps.05801.
38. Salem GM, Hutchinson EG, Orengo CA, Thornton JM. Correlation of observed fold frequency with the occurrence of local structural motifs. J Mol Biol. 1999;287(5):969-981. https://doi.org/10.1006/jmbi.1999.2642.
39. Cheng H, Kim B, Grishin NV. Discrimination between distant homologs and structural analogs: lessons from manually constructed, reliable data sets. J Mol Biol. 2008;377(4):1265-1278. https://doi.org/10.1016/j.jmb.2007.12.076.
40. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins Struct Funct Genet. 2004;57(4):702-710. https://doi.org/10.1002/prot.20264.
41. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577-2637. https://doi.org/10.1002/bip.360221211.
42. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195-197.
43. Cheng H, Kim B, Grishin NV. MALIDUP: a database of manually constructed structure alignments for duplicated domain pairs. Proteins Struct Funct Bioinform. 2007;70(4):1162-1166. https://doi.org/10.1002/prot.21783.
44. Paladin L, Hirsh L, Piovesan D, et al. RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures. Nucleic Acids Res. 2017;45(D1):D308-D312. https://doi.org/10.1093/nar/gkw1136.
45. Bateman A. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47(D1):D506-D515. https://doi.org/10.1093/nar/gky1049.
46. Mitchell AL, Attwood TK, Babbitt PC, et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 2019;47(D1):D351-D360. https://doi.org/10.1093/nar/gky1100.
47. Dawson NL, Lewis TE, Das S, et al. CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res. 2017;45(Database issue):D289-D295. https://doi.org/10.1093/nar/gkw1098.
48. Lewis TE, Sillitoe I, Dawson N, et al. Gene3D: extensive prediction of globular domains in proteins. Nucleic Acids Res. 2018;46(Database issue):D435-D439. https://doi.org/10.1093/nar/gkx1069.
49. Marchler-Bauer A, Bo Y, Han L, et al. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 2017;45(Database issue):D200-D203. https://doi.org/10.1093/nar/gkw1129.
50. Mi H, Huang X, Muruganujan A, et al. PANTHER version 11: expanded annotation data from gene ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Res. 2017;45(Database issue):D183-D189. https://doi.org/10.1093/nar/gkw1138.
51. El-Gebali S, Mistry J, Bateman A, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019;47(D1):D427-D432. https://doi.org/10.1093/nar/gky995.
52. Servant F, Bru C, Peyruc D, Kahn D. ProDom: automated clustering of homologous domains. Brief Bioinform. 2002;3(3):246-251.
53. Sigrist CJA, de Castro E, Cerutti L, et al. New and continuing developments at PROSITE. Nucleic Acids Res. 2013;41(D1):D344-D347. https://doi.org/10.1093/nar/gks1067.
54. Letunic I, Bork P. 20 years of the SMART protein domain annotation. Nucleic Acids Res. 2018;46(D1):D493-D496. https://doi.org/10.1093/nar/gkx922.
55. Wilson D, Pethica R, Zhou Y, et al. SUPERFAMILY - sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 2009;37(Database issue):D380-D386. https://doi.org/10.1093/nar/gkn762.
56. Haft DH, Loftus BJ, Richardson DL, et al. TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res. 2001;29(1):41-43.
57. Eisenbeis S, Höcker B. Evolutionary mechanism as a template for protein engineering. J Pept Sci. 2010;16(10):538-544. https://doi.org/10.1002/psc.1233.
58. Rico JAF, Höcker B. Design of chimeric proteins by combination of subdomain-sized fragments. Methods Enzymol. 2013;523:389-405. https://doi.org/10.1016/B978-0-12-394292-0.00018-7.
59. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7(10):e1002195. https://doi.org/10.1371/journal.pcbi.1002195.
60. Marsella L, Sirocco F, Trovato A, Seno F, Tosatto SCE. REPETITA: detection and discrimination of the periodicity of protein solenoid repeats by discrete Fourier transform. Bioinformatics. 2009;25:289-295. https://doi.org/10.1093/bioinformatics/btp232.
61. Walsh I, Sirocco FG, Minervini G, Di Domenico T, Ferrari C, Tosatto SCE. RAPHAEL: recognition, periodicity and insertion assignment of solenoid protein structures. Bioinformatics. 2012;28(24):3257-3264. https://doi.org/10.1093/bioinformatics/bts550.
62. Heringa J, Argos P. A method to recognize distant repeats in protein sequences. Proteins Struct Funct Bioinform. 1993;17(4):391-411. https://doi.org/10.1002/prot.340170407.
63. Pagès G, Grudinin S. DeepSymmetry: using 3D convolutional networks for identification of tandem repeats and internal symmetries in protein structures. Bioinformatics. 2019;35(24):5113-5120. https://doi.org/10.1093/bioinformatics/btz454.
SUPPORTING INFORMATION
Additional supporting information may be found online in the
Supporting Information section at the end of this article.
How to cite this article: Alvarez-Carreño C, Coello G,
Arciniega M. FiRES: A computational method for the de novo
identification of internal structure similarity in proteins.
Proteins. 2020;88:1169–1179. https://doi.org/10.1002/prot.25886