BIOINFORMATICS APPLICATIONS NOTE
Vol. 25 no. 15 2009, pages 1963-1965
doi:10.1093/bioinformatics/btp335
Sequence analysis
InterMap3D: predicting and visualizing co-evolving protein
residues
Rodrigo Gouveia-Oliveira, Francisco S. Roque, Rasmus Wernersson,
Thomas Sicheritz-Ponten, Peter W. Sackett, Anne Mølgaard and Anders G. Pedersen
Center for Biological Sequence Analysis, Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
Received on January 31, 2009; revised on April 29, 2009; accepted on May 26, 2009
Advance Access publication June 15, 2009
Associate Editor: Thomas Lengauer
ABSTRACT
Summary: InterMap3D predicts co-evolving protein residues and
plots them on the 3D protein structure. Starting with a single protein
sequence, InterMap3D automatically finds a set of homologous
sequences, generates an alignment and fetches the most similar
3D structure from the Protein Data Bank (PDB). It can also accept
a user-generated alignment. Based on the alignment, co-evolving
residues are then predicted using three different methods: Row and
Column Weighting of Mutual Information, Mutual Information/Entropy
and Dependency. Finally, InterMap3D generates high-quality images
of the protein with the predicted co-evolving residues highlighted.
Availability: http://www.cbs.dtu.dk/services/InterMap3D/
Contact: gorm@cbs.dtu.dk
1 INTRODUCTION
Co-evolution of amino acid residues occurs when two or more
residues in a protein exert selective pressure on each other, so that each
residue influences the evolution of the others. Co-evolution
has been proposed to occur in a variety of settings, but mostly
between amino acids that are close to each other in a protein's 3D
structure. For this reason, visualization of the location of co-evolving
sites in the 3D structure is of interest.
InterMap3D is a tool for detecting and visualizing co-evolving
residues that is also accessible to non-expert users. InterMap3D
is able to take a single sequence as input, from which it
automatically finds a set of homologous sequences, constructs a
multiple alignment, discovers co-evolving sites and produces as
output an image of the protein 3D structure with co-evolving
residues highlighted. The tool can also accept a user-generated
alignment. A manually curated dataset will most often be of better
quality than the automatically generated one, thus improving the
quality of the predictions. An additional goal with InterMap3D
is to make already existing methods for detection of co-evolution
available to the protein research community. While there has been
considerable interest in detecting co-evolving protein residues in
recent years, the dissemination of these methods has not been as active,
and only a few, such as Dependency (Tillier and Lui, 2003), CAPS
(Fares and McNally, 2006), CoMap (Dutheil and Galtier, 2007) and
PCOAT (Qi and Grishin, 2004) have been made available to the
general community. InterMap3D tries to remedy this situation by
providing several methods in a freely available web server, built in
a modular fashion that enables easy expansion to new methods.
2 IMPLEMENTATION
InterMap3D integrates several tools from the fields of biological
data representation, phylogeny and detection of co-evolution.
InterMap3D can take the user's alignment or create one from a single
sequence given by the user (in FASTA format). If the user cannot
provide an alignment, but only a sequence, InterMap3D compares
it with UniProt (Apweiler et al., 2004) via BLASTP. All significant
database hits covering a minimum of 50% of the protein length
are retrieved (this value can be set by the user). All compatible
homologs are then aligned using either MAFFT (Katoh et al., 2005),
MUSCLE (Edgar, 2004) or ClustalW (Thompson et al., 1994).
That alignment is then processed by MaxAlign (Gouveia-Oliveira
et al., 2007), diminishing the number of gapped columns in the
alignment, and passed to the tools predicting co-evolving residues.
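As an illustration of the coverage criterion described above, the following minimal sketch keeps only database hits whose aligned region spans at least 50% of the query. The `Hit` record, its field names and the example accessions are hypothetical and do not reflect InterMap3D's internal data model. Measuring coverage as the aligned query span divided by the query length is also an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLASTP hit (1-based, inclusive query coordinates)."""
    accession: str
    query_start: int
    query_end: int


def filter_by_coverage(hits, query_length, min_coverage=0.5):
    """Keep hits whose aligned query span covers at least min_coverage of the query."""
    kept = []
    for hit in hits:
        span = hit.query_end - hit.query_start + 1
        if span / query_length >= min_coverage:
            kept.append(hit)
    return kept


# Toy hits against a query of 200 residues (all values are made up).
hits = [
    Hit("HYPOTHETICAL_A", 1, 180),    # 180/200 = 90% coverage
    Hit("HYPOTHETICAL_B", 50, 120),   # 71/200 = 35.5% coverage
    Hit("HYPOTHETICAL_C", 101, 200),  # 100/200 = 50% coverage
]
selected = filter_by_coverage(hits, query_length=200)
print([h.accession for h in selected])  # ['HYPOTHETICAL_A', 'HYPOTHETICAL_C']
```

The threshold is exposed as a parameter because the text states that the user can change the 50% default.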
The prediction of co-evolving residues is done by one or more
of the three methods currently implemented in InterMap3D: Row
and Column Weighting of Mutual Information (RCW-MI) (Gouveia-Oliveira
and Pedersen, 2007), Mutual Information/Entropy (MI/E)
(Martin et al., 2005) and Dependency. Finally, the results are mapped
onto a 3D structure if possible, using the FeatureMap3D program
(Wernersson et al., 2006). Briefly, FeatureMap3D searches for the
most similar homologous protein with an experimentally determined
3D structure, and then uses PyMOL (Delano, 2002) to plot the
predicted pairs of co-evolving sites onto that structure. The final
result is a 3D image of the protein structure in several formats.
Highlighted in this image are the co-evolving pairs (or networks)
and also completely conserved sites, as co-evolution analysis cannot
rule out interactions between these. For proteins that have not
themselves been structurally characterized, the reliability of the
inferred locations of the co-evolving residues depends on the
sequence similarity between the query and the protein of known
structure. To help the user judge how representative a
structure is for a given protein sequence, it is possible to create a
figure of the structure, color-coded by sequence conservation.
From the output page, the user has several options for getting
additional information related to the analysis. This includes a PNG-format
plot of the labeled 3D structure, the corresponding PyMOL
script and PDB file, the alignment, etc. For each predicted pair of
co-evolving sites, the user can also plot a phylogenetic tree showing
how the amino acids present at these two sites change over the
tree (Zmasek and Eddy, 2001). If the same pair of amino acids arises
independently at several places in the tree, then co-evolution is more
likely to be real.
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
3 PREDICTION OF CO-EVOLUTION
There are currently three methods in InterMap3D for predicting
co-evolving pairs of residues: RCW-MI, MI/E and Dependency.
The user can also choose to use the intersection between predictions
produced by these methods.
MI/E removes the entropy dependency from the signal by dividing
MI by the joint entropy of the two sites:
$$
\mathrm{MI}/H(X,Y) = \frac{\sum_i \sum_j P(x_i, y_j)\,\log \dfrac{P(x_i, y_j)}{P(x_i)\,P(y_j)}}{-\sum_i \sum_j P(x_i, y_j)\,\log P(x_i, y_j)}
$$
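The MI/E ratio can be computed directly from two alignment columns. The sketch below is an illustration under stated assumptions, not InterMap3D's code: probabilities are estimated as plain observed frequencies, gap characters (if present) are treated like any other symbol, and a pair of fully conserved columns, for which the joint entropy is zero, is assigned a score of 0.0. The function name `mi_over_entropy` is hypothetical.

```python
from collections import Counter
from math import log


def mi_over_entropy(col_x, col_y):
    """Return MI(X;Y) / H(X,Y) for two equally long alignment columns."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))

    mi = 0.0
    joint_entropy = 0.0
    # Only residue pairs that are actually observed contribute to the sums.
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * log(p_xy / (p_x * p_y))
        joint_entropy -= p_xy * log(p_xy)

    # Fully conserved columns give zero joint entropy; the ratio is undefined
    # there, and returning 0.0 is a choice made for this sketch.
    if joint_entropy == 0.0:
        return 0.0
    return mi / joint_entropy


covarying = mi_over_entropy("AAVV", "LLII")    # residues change together
independent = mi_over_entropy("AAVV", "LILI")  # residues change independently
print(covarying, independent)
```

In the toy example, the perfectly covarying columns score close to 1 and the independent columns score 0, which are the two extremes of the ratio.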
RCW-MI and Dependency aim at removing both the phylogenetic
signal and the entropy dependency. RCW-MI does so by
assuming that the signal in most pairwise comparisons results
from both the phylogenetic signal and the entropy-driven signal.
Thus, it down-weights each pairwise MI score by the average MI of
each site:
$$
\mathrm{RCW}(A;B) = \frac{MI_{ij}}{\left(MI_{\cdot j} + MI_{i\cdot} - 2\,MI_{ij}\right)/(n-1)}
$$
where $MI_{ij}$ represents the MI between sites i and j, and a dot stands
for summation over all sites. RCW-MI also allows discarding
the top hits in each summation in order to accommodate
co-evolution involving more than two sites.
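The weighting above can be sketched for a small matrix of pairwise MI values. This is an illustrative reading of the formula, not the published implementation: n is taken to be the number of sites, the row and column totals skip the diagonal, and a pair whose background term is zero is given a score of 0.0. All three are assumptions. The optional discarding of top hits is not included, and the function name `rcw_mi` is hypothetical.

```python
def rcw_mi(mi):
    """Return RCW-weighted scores for a symmetric matrix of pairwise MI values."""
    n = len(mi)
    # Total MI of each site with all other sites. Because the matrix is
    # symmetric, the row total of a site equals its column total.
    totals = [sum(mi[i][k] for k in range(n) if k != i) for i in range(n)]
    scores = {}
    for i in range(n):
        for j in range(i + 1, n):
            # MI_ij is contained in both totals, so it is subtracted twice.
            background = (totals[i] + totals[j] - 2.0 * mi[i][j]) / (n - 1)
            scores[(i, j)] = mi[i][j] / background if background > 0 else 0.0
    return scores


# Toy MI matrix for four sites; sites 0 and 1 share a strong signal.
mi = [
    [0.0, 0.6, 0.1, 0.1],
    [0.6, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
scores = rcw_mi(mi)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

For the toy matrix, the pair (0, 1) receives the highest score because its MI is large relative to the background MI of the two sites involved.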
The Dependency method, on the other hand, considers that the
above weighting, used by RCW-MI, removes only the phylogenetic
signal, and uses a second weighting to account for the entropy-driven
signal, producing a set of best hits called S1. This set is then filtered,
yielding S2, which also contains information about P-values. In
InterMap3D, the output provided is the S1 set, while the S2 set can
be downloaded in text format.
For both the MI/E and RCW-MI methods, we also estimate
P-values for the predicted co-evolving sites. This is done using
a three-step, heuristic approach: first, we estimate a maximum
likelihood phylogenetic tree and substitution model parameters from
the processed alignment using the PhyML program (Guindon and
Gascuel, 2003). The tree and other fitted model parameters are
then used to generate a number of simulated alignments using
the program Seq-Gen (Rambaut and Grassly, 1997). (The default
number of simulated alignments is two, but this can be controlled
by the user.) By design, these alignments do not contain any
co-evolving site pairs. Finally, we compute the score distribution
from the simulated alignments, and use this as a null distribution
based on which P-values for real, biological scores can be estimated.
Specifically, this is done by fitting a generalized Pareto distribution
(GPD) to the right tail of the null distribution (the top 2%) and then
using the fitted GPD for computing tail-probabilities (i.e. P-values)
for each of the predicted co-evolving pairs. The GPD is well suited to
model tails of a wide variety of distributions (Coles, 2001). We here
use it as a heuristic shortcut for providing tail probabilities with much
finer resolution than what is supported by the empirical cumulative
distribution function based on the relatively few simulations we
perform. This way we partially avoid the computationally expensive
construction and analysis of a large number of simulated alignments,
while still having a sound basis for P-value estimation. GPD
fitting and computation of GPD tail probabilities are done using
the R packages ismev and evd (R Development Core Team, 2008;
Stephenson, 2002).
Fig. 1. Comparison of the performance of all the methods available in
InterMap3D (DEP: Dependency; combined methods are indicated by the
initials of the methods used). Performance was measured on a synthetic
dataset with mostly independently evolving sites, at four different rates. The
fraction of positives was calculated at the threshold of 20 best hits.
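The tail-probability step can be illustrated with a small, self-contained sketch. It departs from the text in several labeled ways: the fit uses simple method-of-moments estimators rather than the R packages ismev and evd, the null scores are drawn from an exponential distribution as a stand-in for scores from simulated alignments, and the function names `fit_gpd_tail` and `tail_p_value` are hypothetical. Only the overall idea follows the description above: fit a GPD to the top 2% of the null scores, then scale the fitted tail probability by the fraction of null scores above the threshold.

```python
import random
from math import exp
from statistics import mean, pvariance


def fit_gpd_tail(null_scores, tail_fraction=0.02):
    """Fit a GPD to the top tail_fraction of null scores (method of moments).

    Returns (threshold, shape, scale, fraction_above_threshold).
    """
    ordered = sorted(null_scores)
    n_tail = max(2, int(len(ordered) * tail_fraction))
    if len(ordered) <= n_tail:
        raise ValueError("too few null scores for the requested tail")
    threshold = ordered[-n_tail - 1]
    excesses = [s - threshold for s in ordered[-n_tail:]]
    m = mean(excesses)
    v = pvariance(excesses)
    if v == 0.0:
        raise ValueError("tail scores are all identical; cannot fit a GPD")
    ratio = m * m / v
    shape = 0.5 * (1.0 - ratio)      # from var = scale^2 / ((1-shape)^2 (1-2 shape))
    scale = 0.5 * m * (1.0 + ratio)  # from mean = scale / (1 - shape)
    return threshold, shape, scale, n_tail / len(ordered)


def tail_p_value(score, threshold, shape, scale, tail_prob):
    """Estimate P(null score > score) for a score at or above the threshold."""
    if score < threshold:
        raise ValueError("below threshold; use the empirical distribution instead")
    excess = score - threshold
    if abs(shape) < 1e-12:           # exponential limit of the GPD
        return tail_prob * exp(-excess / scale)
    base = 1.0 + shape * excess / scale
    if base <= 0.0:                  # beyond the end point of a bounded tail
        return 0.0
    return tail_prob * base ** (-1.0 / shape)


# Stand-in null distribution (an assumption; not output from Seq-Gen).
rng = random.Random(0)
null_scores = [rng.expovariate(1.0) for _ in range(5000)]
threshold, shape, scale, tail_prob = fit_gpd_tail(null_scores)
p = tail_p_value(threshold + 2.0, threshold, shape, scale, tail_prob)
print(f"threshold={threshold:.3f} shape={shape:.3f} scale={scale:.3f} p={p:.5f}")
```

Because the fitted tail is a smooth function, P-values well below 1/5000 can be reported even though only 5000 null scores are available. This is the finer resolution that the text attributes to the GPD approach.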
We compared the performance of the different methods and
their intersections using 100 simulated datasets of 64 taxa and 300
residues each, evolved along balanced trees. In each alignment,
there were 20 pairs of co-evolving residues. Residues were divided
into four classes evolving at different rates, both when evolving
independently and when co-evolving. The results, shown in Figure 1,
suggest that RCW-MI is the best method under these conditions. All
methods were very good at spotting pairs of residues co-evolving
slowly, but RCW-MI performed better for residues evolving at
intermediate rates.
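The evaluation at a fixed number of best hits can be sketched as follows. The exact definition of the "fraction of positives" used for Figure 1 is not spelled out in the text, so this sketch assumes it is the share of the top-ranked pairs that are truly co-evolving. The function name `fraction_true_in_top` and the toy scores are hypothetical.

```python
def fraction_true_in_top(scores, true_pairs, top_n=20):
    """Fraction of the top_n highest-scoring site pairs that truly co-evolve."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    if not ranked:
        return 0.0
    hits = sum(1 for pair in ranked if pair in true_pairs)
    return hits / len(ranked)


# Toy scores for four site pairs; two of them were simulated as co-evolving.
toy_scores = {(0, 1): 4.5, (1, 3): 2.0, (2, 3): 0.75, (0, 2): 0.33}
true_pairs = {(0, 1), (2, 3)}
fraction = fraction_true_in_top(toy_scores, true_pairs, top_n=2)
print(fraction)  # 0.5
```

With `top_n=2`, the two best hits are (0, 1) and (1, 3), of which only the first is a true pair, giving 0.5.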
ACKNOWLEDGEMENTS
The authors thank the Danish Center for Scientific Computing.
Funding: Foundation for Science and Technology, Portugal (grant
SFRH/BD/12448/2003 to R.G.-O.).
Conflict of Interest: none declared.
REFERENCES
Apweiler,R. et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids
Res., 32, D115-D119.
Coles,S.G. (2001) An Introduction to Statistical Modelling of Extreme Values. Springer,
London.
Delano,W. (2002) The PyMOL Graphics System. DeLano Scientific, San Carlos, CA.
Dutheil,J. and Galtier,N. (2007) Detecting groups of coevolving positions in a molecule:
a clustering approach. BMC Evol. Biol., 7, 242.
Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res., 32, 1792-1797.
Fares,M.A. and McNally,D. (2006) CAPS: coevolution analysis using protein
sequences. Bioinformatics, 22, 2821-2822.
Gouveia-Oliveira,R. and Pedersen,A.G. (2007) Finding coevolving amino acid residues
using row and column weighting of mutual information and multi-dimensional
amino acid representation. Alg. Mol. Biol., 2, 12.
Gouveia-Oliveira,R. et al. (2007) MaxAlign: maximizing usable data in an alignment.
BMC Bioinformatics, 8, 312.
Guindon,S. and Gascuel,O. (2003) A simple, fast, and accurate algorithm to estimate
large phylogenies by maximum likelihood. Syst. Biol., 52, 696-704.
Katoh,K. et al. (2005) MAFFT version 5: improvement in accuracy of multiple sequence
alignment. Nucleic Acids Res., 33, 511-518.
Martin,L.C. et al. (2005) Using information theory to search for co-evolving residues
in proteins. Bioinformatics, 21, 4116-4124.
Qi,Y. and Grishin,N.V. (2004) PCOAT: positional correlation analysis using multiple
methods. Bioinformatics, 20, 3697-3699.
R Development Core Team (2008) R: a language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
Rambaut,A. and Grassly,N.C. (1997) Seq-Gen: an application for the Monte Carlo
simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl.
Biosci., 13, 235-238.
Stephenson,A.G. (2002) evd: Extreme Value Distributions. R News, 2, 31-32.
Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
Tillier,E.R. and Lui,T.W. (2003) Using multiple interdependency to separate
functional from phylogenetic correlations in protein alignments. Bioinformatics, 19,
750-755.
Wernersson,R. et al. (2006) FeatureMap3D - a tool to map protein features and sequence
conservation onto homologous structures in the PDB. Nucleic Acids Res., 34,
W84-W88.
Zmasek,C.M. and Eddy,S.R. (2001) ATV: display and manipulation of annotated
phylogenetic trees. Bioinformatics, 17, 383-384.