Bioinformatics
Wojciech Makalowski, Pennsylvania State University, Philadelphia, Pennsylvania, USA
Bioinformatics may be defined as the development and/or application of computational
tools and approaches for expanding the use of biological data, including those to acquire,
store, organize, archive, analyze or visualize such data.
Introduction
Molecular biologists generate data at an unprecedented
speed (Figure 1). For example, an average bacterial
genome can be sequenced in one day by a sequencing
center (factory) and the analysis of the expression of
hundreds of genes in a single experiment has become
routine. The demands and opportunities for interpreting
the data are expanding more than ever. In this
context, a new discipline – bioinformatics – has
emerged as a strategic frontier between biology,
computer science and statistics. Although the term
‘bioinformatics’ does not appear in the literature until
around 1991, since the 1960s, long before anyone
thought to label this activity with a special term, some
evolutionarily oriented biologists (e.g. Margaret O.
Dayhoff, Russell F. Doolittle, Walter M. Fitch and
Masatoshi Nei) have been building databases, developing
algorithms and making biological discoveries by
sequence analysis. The ﬁeld is not mature and
therefore not very well deﬁned. Almost anybody active
in this ﬁeld has a different deﬁnition of bioinformatics
and the discussion of what does and what does not
belong to it may go for hours.
Classical bioinformatics
Fredj Tekaia at the Institut Pasteur offers this
deﬁnition of bioinformatics: ‘The mathematical, statistical
and computing methods that aim to solve
biological problems using deoxyribonucleic acid
(DNA) and amino acid sequences and related information.’
Most biologists talk about ‘doing bioinformatics’
when they use computers to store, retrieve,
analyze or predict the composition or the structure of
biomolecules, for example, nucleic acids and proteins.
These are the concerns of ‘classical’ bioinformatics,
dealing primarily with sequence analysis. It is a
mathematically interesting property of most large
biological molecules that they are polymers: ordered
chains of simpler molecular modules called monomers.
Each monomer molecule is of the same general class,
but each kind of monomer has its own well-deﬁned set
of characteristics. Many monomer molecules can be
joined together to form a single, far larger, macromolecule,
which has exquisitely speciﬁc informational
content and/or chemical properties. Therefore, the
monomers in a given macromolecule of DNA or
protein can be treated computationally as letters of an
alphabet put together in preprogrammed
arrangements to carry messages or do work in a cell.
New bioinformatics
The greatest achievement of biology, the Human
Genome Project (HGP), has recently been completed.
The project brought not only the ultimate blueprint of
our genetic material but also rapid development of
new technologies, for instance microarrays. As we
enter the ‘postgenomic’ era, the nature and priorities
of bioinformatics research and applications are changing,
for instance statistic analysis plays a more
prominent role.
Introductory article
Article contents
 Introduction
 Databases
 Sequence Analysis
 Nucleotide Sequence Assembly
 Gene Prediction
 Homology Search
 Protein Analysis
 Technical Aspects
 Education
 Conclusions
doi: 10.1038/npg.els.0005247
15
GenBank
Millions
MEDLINE
10
5
0
1965
1970
1975
1980
1985
1990
1995
2000
Figure 1 Growth of biological information measured by number of
entries in two major databases: GenBank – a collection of annotated
nucleotide sequences and MEDLINE – a collection of biomedical
literature.
Bioinformatics
ENCYCLOPEDIA OF LIFE SCIENCES & 2005, John Wiley & Sons, Ltd. www.els.net 1
Databases
Databases are an invaluable component of bioinformatics.
They can be deﬁned as large organized sets of
data usually linked with software that allows database
searching, extraction of the relevant information and
the update of their content. A good database should
enable easy access to the data and allow the extraction
of information in a way that all required data will be
extracted and none of unwanted data will be retrieved.
Biological databases can be divided into primary
and derivative databases. Primary databases store the
original information submitted by experimentalists.
Database staff organize but do not add additional or
alter information submitted by a biologist. Three
nucleic acid databases, DDBJ, EMBL and GenBank,
are the best known examples of primary databases.
Derivative databases may be human curated or
computationally derived. In such a case, they usually
represent a subset of primary database with added
value, such as correction of the data or adding
some new information. SWISS-PROT and UniGene
represent human-curated and computer-generated
derivative databases respectively.
Sequence Analysis
Development of technologies to determine the sequence
of the nucleotides in nucleic acids or amino
acids in proteins has created the need for sequence
analysis. It is a process of investigating the information
content of raw sequence data. This is not a trivial task
as the information may be obscured from the
researcher by long stretches of sequences that do not
carry any information (this is particularly true for
eukaryotic genomic sequences).
Nucleotide Sequence Assembly
The automation of DNA sequencing was a major
contributing factor to the success of the HGP in the
ﬁrst place. Despite all improvements, DNA sequencing
has one crucial limitation – no more than several
hundreds of nucleotides can be read in a single
reaction. To determine the sequence of a longer
molecule, it has to be chopped into smaller pieces,
sequenced and then the whole sequence of the long
molecule has to be deduced based on the nucleotide
sequence of shorter molecules. The process of the long
sequence deduction is called sequence assembly. Both
public and private sequencing projects used a shotgun
technique to produce raw sequencing data. The
difference lies in the size of the human genome
fragments that are subjected to the shotgun sequencing
process – about 200 kb long bacterial artiﬁcial
chromosomes (BACs) in the public project and the
whole human genome in the private one. The raw data,
after some cleaning, are used to assemble longer
contigs. First, all the sequences are checked for the
overlaps. The information about overlapping fragments
can be summarized in a graph structure. Then,
the problem of sequence assembly is to ﬁnd a path that
covers all fragments. Unfortunately, this is an NPhard
problem (which means that it requires nondeterministic
polynomial time to be solved) and
therefore all known algorithms are relatively slow.
Recently, a Eulerian approach (one of the algorithms
that uses graph theory) has been proposed to solve the
assembly problem but it remains to be shown if this
approach is applicable to large eukaryotic genomes.
One of the greatest challenges of the human genome
assembly is the recently duplicated fragments of a
genome. These fragments, which can be as long as
10 kb, have a sequence similarity higher than a sequencing
error rate. Therefore, the assembly software
treats them as an identical sequence. This causes the
known problem of collapsing legitimate duplicons into
one fragment. On the other hand, it is not uncommon
that some fragments are ‘artiﬁcially’ duplicated by the
assembly software, placing the same fragment in two
distinct chromosomal locations.
Gene Prediction
The assembled genome is the starting point, not the
goal, of the sequencing projects. The linear sequence is
useless unless annotated. The annotation process
reveals biological information hidden in the nucleotide
(or amino acid) sequence. Only approximately 2% of
the human genome codes for proteins, and the proteincoding
regions are not contiguous – exons are
interlaced with introns. Many techniques were developed
to identify exons. Computational methods can be
divided into homology-based or model-based. The
former methods work well if a homologous sequence
of a given gene is already known for other organisms.
The latter do not require any previous knowledge
about a given gene. Since many prokaryotic genomes
have been sequenced in the last several years, the
homology-based approach is a ﬁrst choice for these
organisms. In fact, most of the genes of the newly
sequenced prokaryotic genomes can be determined in
this way. Although for eukaryotic genomes modelbased
methods are still dominating, a comparative
genomic approach can add valuable information to
any method applied. None of the existing methods are
perfect and we are far from solving the gene prediction
problem. Inpractice, several methods should be applied
Bioinformatics
2
and experimental validation is required. Luckily, it
seems that computational methods can at least point
to the protein-coding region, but precise boundaries of
the exons are still quite difﬁcult to predict. The major
problem is that with increased sensitivity, speciﬁcity
declines. Therefore, to avoid too many false positives
we have to give up some sensitivity and allow a
fraction of exons not to be discovered.
About 5% of mammalian genes do not code for
proteins. In such a case, a transcribed molecule is a
functional ribonucleic acid (RNA), for example,
ribosomal RNAs (rRNAs), transfer RNAs (tRNAs)
and small nuclear RNAs (snRNAs). Identifying these
noncoding RNAs is much more complicated than
identifying protein-coding genes. They are usually
small, often present in multiple copies, have conserved
a secondary structure rather than a linear sequence,
and can be found in unusual locations, such as introns
of other genes. A mathematical model called ‘stochastic
context-free grammars’ (SCFGs) has been introduced
as a framework for RNA structure and sequence
alignment. A successful implementation of this approach
is the program tRNAscan-SE, which detects
tRNA genes with high sensitivity.
Homology Search
One of the ﬁrst questions asked by a biologist after a
new nucleotide molecule has been sequenced is
whether a sequence is already known. The best way
to ﬁnd this out is to compare a new sequence to all
known nucleotide sequences. With the rapid growth of
repository databases, an exhaustive comparison of the
database is practically impossible, for example comparison
of 1000 nt sequence with the whole GenBank
(261011
nt) would take 5555 h (232 days) on a single
CPU computer. The key idea of the homology search
is that most of the sequences in the database do not
match a query sequence (the sequence that is used for
the database search). Therefore, ﬁnding a fast way to
eliminate sequences that do not match would speed up
the whole process of the database comparison. There
are several methods for the database screening, but
two of them, FASTA and BLAST, have dominated
the whole ﬁeld (the original paper describing BLAST
algorithm with almost 14 000 citations is by far the
most cited paper in the history of biology). The
inﬂuence of both programs is so profound that all
database searching programs have converged to a
basic format: a graphical description of the results,
a list of top scoring sequences from the database and a
series of alignments for some of the top scoring sequences.
The common strategy of the database searching
program is fast screening to eliminate unrelated
sequences followed by a complete local alignment of
top scoring sequences. The important part of the
process is a statistical assessment of the obtained
alignment, in other words estimating the probability of
obtainingagivenalignmentbychance.Differentﬂavors
of programs can compare nucleotide or amino acid
sequence against nucleic acid or protein databases.
Interestingly, because of the larger alphabet and more
sophisticated scoring systems, protein-based searches
are usually more sensitive than nucleotide searches.
Protein Analysis
Once the amino acid sequence of a protein is
determined, a biologist would like to know the
function of that protein in a cell. Protein function
can be inferred based on an overall similarity to other
proteins of known function or based on the domains
or motifs present in a previously analyzed protein.
Finally, since structure is the most evolutionarily
conserved feature of a protein, structural similarity
between proteins may suggest functional similarity.
Protein similarity searches, for example using BLAST,
against a comprehensive database is the easiest and
most frequently utilized way to infer function. However,
there are some traps awaiting in the similarity
searches. Annotation of a protein in a database might
be faulty. For example, the annotation of a doublefunctioning
protein may be missing one of its
functions, two different proteins may have two
different names in the database, the annotation of
one protein may be mistakenly propagated on other
proteins. A protein may also change its function
during the evolution. For instance, birds’ lysozyme
and mammalian lactalbumin are homologous despite
their completely different function. Finally, protein
function may depend on the place where it is expressed.
Common domains or motifs shared between proteins
with relatively low (less than 20%) similarity may
also shed light on unknown protein function. Domains,
such as DNA binding domains, are functional
units located in a particular region of a protein that
perform a speciﬁc function. Functional domains
usually coincide with structural domains. Motifs are
sequence patterns common among proteins
performing a given function. A given motif presence
does not infer a common ancestry of proteins bearing
it, in other words, two proteins may possess the same
motif because of convergent evolution. Several computational
methods exist to deﬁne motifs in amino acid
sequences from the same protein family. The simplest
way is to look for the conserved amino acids in
columns of a multiple sequence alignment of the
family. Other methods do not require multiple sequence
alignment and include machine learning approach,
the hidden Markov models (HMM) and Gibbs
Bioinformatics
3
sampling approach. Newly sequenced genomes bring a
high number of proteins with unknown function. For
instance, out of 25 498 predicted genes in a weed thale
cress (Arabidopsis thaliana), only 30% have known
function and only 9% of proteins have their function
conﬁrmed experimentally. (See Gibbs Sampling and
Bayesian Inference; Hidden Markov Models.)
For proteins without similarity to any known
protein, further computational analysis has to be
done in order to obtain more biological information
on analyzed proteins. It is possible to predict transmembrane
proteins based on physicochemical properties
of amino acids forming a peptide chain. Statistical
analysis can help ﬁnd a signal peptide. Since the
function of a protein is determined by its threedimensional
structure, structurally similar proteins
are likely to perform similar function. As the number
of known protein structures continue to expand (as of
20 August 2002, there are 18 488 biological macromolecule
structures deposited in Protein Data Bank
(PDB)), the chance that a new gene product folds like a
known structure will continue to increase. Paradoxically,
the database noise continues to grow as well, and
ﬁnding the true matching fold may not be a trivial
task. One of the techniques to ﬁnd similar structure is
threading. This method is based on the assumption
that proteins are in a state of minimum free energy,
and that this energy may be computed for any given
structure. The energy computation takes into account
the compatibility of different amino acids at each
position in the structure. Given a function that can
evaluate the compatibility of a sequence with a
structural template, threading algorithms attempt to
minimize this function by considering various possible
sequences to structure alignments.
Technical Aspects
Most of the tasks in bioinformatics have to handle
large data sets and the HGP is not an exception. Very
few of them could be easily completed on the personal
computers and operating systems usually favored by
biologists, that is, MacOS on an Apple Macintosh or
MS Windows on an Intel-based PC. Consequently,
Unix platforms are ﬁrst choice for bioinformaticians
and most of the software related to the HGP was
developed under this operating system. The major
features of Unix that determined its popularity among
bioinformaticians are: support of high-end hardware
with large memory and robust multithreading, support
for large (42 GB) data ﬁles, availability of efﬁcient
scripting and data manipulation languages (pearl,
awk, python and unix shell are widely used for that)
and ﬁnally general stability of the operating system – a
well-maintained Unix machine needs rebooting only
for major hardware updates. The major providers of
high-end computers and operating system to the
bioinformatics community have been Sun Microsystems,
Silicon Graphics Inc. and Compaq. IBM has
recently invested in hardware and software development
for the life sciences as well. Linux, a freeware
version of Unix running on Intel-based machines, has
become very popular among bioinformaticians especially
for distributed computing (PC clusters or farms).
PC farms, because of the large numbers in which
they are manufactured, have a better cost/performance
ratio than high-end machines. The main drawback of
this approach is that the system management of PC
clusters is quite difﬁcult and they are prone to failure if
components are not chosen carefully. Nevertheless,
commercial solutions to the problem exist and one can
buy ready-to-run PC clusters. Also, commercial
versions of some crucial software, ready to run on
PC clusters, are available as well. These include
BLAST, hidden Markov models, sensitive sequence
alignment (implementations of Smith–Waterman
algorithm) and gene discovery systems. The Grid
computing will be likely the next step in the distributed
computing within the ever CPU-hungry
bioinformatics community, especially that major
players such as Sun Microsystems make a serious
effort to bring effortless solutions to this approach.
Although the internet was not invented for the
HGP, it did provide the right infrastructure at the right
time. Biologists depend heavily on the internet for data
dissemination and collaborative work. The websites
provide easy and convenient access to the data and
visualization of the data analysis results. Many
bioinformatics tools and HGP results are easily
accessible to the whole community, including those
using personal computers. (See Table 1 for a noncomprehensive
list of bioinformatics resources on
the web.)
Education
Despite the huge demand for the trained bioinformaticians,
until recently there were no formal training
programs in bioinformatics. Practically, all bioinformaticians
are either computer scientists who have
some idea about biology, or biologists who know how
to program. Interestingly, molecular evolutionary
geneticists make an especially easy transition, most
likely because of their background and training. Some
physicists, statisticians and mathematicians moved to
the ﬁeld attracted by either interesting, difﬁcult to
solve biological problems or simply by better job
prospects. There is an ongoing discussion of who
makes a better bioinformatician – retrained biologists
Bioinformatics
4
Table 1 Some bioinformatics resources on the web
Site Address Description
General sites and compilations of resources
NCBI http://www.ncbi.nlm.nih.gov/ Three homes of nucleic acid repository databases and
gateways to many resources
EBI http://www.ebi.ac.uk/index.html
DDBJ http://www.ddbj.nig.ac.jp/
KEGG http://www.genome.ad.jp/kegg/
kegg3.html
Kyoto encyclopedia of pathways
and maps
NAR http://www3.oup.co.uk/nar/database/c/ Compilation of databases
ExPASy http://www.expasy.ch/ ExPASy Molecular Biology Server
TreeLife http://tolweb.org/tree/phylogeny.html Tree of life taxonomy project
BCM Launcher http://searchlauncher.bcm.tmc.edu/ Launches numerous nucleotide and
protein tools
EMBL Tools http://www.embl-heidelberg.de/
Services/index.html
Various Computational Services at
EMBL-Heidelberg
The Biology WorkBench http://workbench.sdsc.edu/ Unified area for searching and analysis
at San Diego Supercomputer Center
(requires a free account)
MDC http://www.bioinf.mdc-berlin.de/ Bioinformatics tools at Delbruck Center
Posnania http://posnania.biotec.psu.edu/ Bioinformatic tools server at Penn State
Database searches and alignment tools
BLAST http://www.ncbi.nlm.nih.gov/BLAST NCBI gateway to db searches
FASTA http://www.ebi.ac.uk/fasta33/ EBI web interface to FASTA searches
BlastU http://www.proweb.org/proweb/
Tools/WU-blast.html
BLAST query against user FASTA database
Blat http://genome.cse.ucsc.edu/cgi-bin/
hgBlat?command ¼ start&db ¼ hg12
Quick alignment of query to human genome assembly
WABA http://www.cse.ucsc.edu/  kent/xenoAli/
xenAliTwo.html
Genomic to genomic comparisons
Pipmaker http://bio.cse.psu.edu/pipmaker/ Per cent identity plot alignment of
two genomic sequences
ClustalW http://www.ebi.ac.uk/clustalw/ General purpose multiple sequence
alignment program
Mulatalin http://prodes.toulouse.inra.fr/multalin/
multalin.html
Aligns multiple sequences with
flexible formatting
Genome browsers and ‘complete’ genomes
Genome Browser http://genome.cse.ucsc.edu/cgi-bin/
hgGateway?db ¼ hg12
Human genome assembly and
track browser at UCSC
Ensembl http://www.ensembl.org/ Ensembl Genome Browser
Genomic Biology http://www.ncbi.nlm.nih.gov/
Genomes/index.html
Genomic resources at NCBI
euGenes http://iubio.bio.indiana.edu:8089/man/ Five species map browser and 37 049 human gene reports
Vista http://sichuan.lbl.gov/ids-bin/VistaInput Visualization tool for long
genomic alignments
TIGR CMR http://www.tigr.org/tigr-scripts/
CMR2/CMRHomePage.spl
Comprehensive microbial resource and
genome browser
Sanger http://www.sanger.ac.uk/HGP/links.shtml Human genome project at Sanger Centre
Jax http://www.informatics.jax.org/ Mouse genomics at Jackson Laboratory
RatMap http://ratmap.gen.gu.se/ Rat genome database at Goteborg
Zfin http://zfin.org/cgi-bin/
webdriver?MIval ¼ aa-ZDB_home.apg
Zebra fish information network and database project
Exofish http://www.genoscope.cns.fr/externe/tetraodon/ Genome analysis of Tetraodon nigroviridis
Flybase http://flybase.bio.indiana.edu/ Database of Drosophila genomics at Indiana
WormBase http://www.wormbase.org/ Nematode mapping, sequencing and
phenotypic repository
Bioinformatics
5
Site Address Description
SGD http://genome-www.stanford.edu/
Saccharomyces/
Yeast genome database
TAIR http://arabidopsis.org/ The Arabidopsis information resource and tool center
GOLD http://igweb.integratedgenomics.com/GOLD/ Monitors genome projects worldwide
Gene prediction and feature finding
GenomeScan http://genes.mit.edu/genomescan.html Predicts genes incorporating protein homology
RGD http://rgd.mcw.edu/ Comparative mapping tool for rat–mouse–human synteny
FGENEH http://genomic.sanger.ac.uk/gf/gf.shtml Splice sites, coding exons, gene models,
promoter and poly A
Softberry http://www.softberry.com/berry.phtml Nucleotide sequence analysis, genes, promoters
TwinScan http://genes.cs.wustl.edu/query.html Gene prediction for eukaryotic
genomic sequences
WebGene http://www.itba.mi.cnr.it/webgene/ Tools for analysis of protein-coding gene structure
GrailEXP http://grail.lsd.ornl.gov/grailexp/ Exons, genes, promoters, poly A, CpG islands at ORNL
GeneWise2 http://www.sanger.ac.uk/Software/Wise2/
genewiseform.shtml
Compares protein to genomic with introns and frameshifts
Genie http://www.fruitfly.org/seq_tools/genie.html Finds multiexon genes only, trained on human
AAT http://genome.cs.mtu.edu/aat.html Analysis and annotation tool for finding genes
HMMgene http://www.cbs.dtu.dk/services/HMMgene/ Prediction of vertebrate and nematode genes
NetGene2 http://www.cbs.dtu.dk/services/NetGene2/ Neural network predictions of splice sites
GeneMark http://www.ebi.ac.uk/genemark/ Gene prediction server at EBI
GeneID http://www1.imim.es/geneid.html Gene prediction of selected signals and exons
RNAGenie http://gobi.lbl.gov/  steveh/rnagene/ Locates RNA genes from secondary structures
Vienna RNA http://www.tbi.univie.ac.at/  ivo/RNA/ Seven RNA analysis tools
PromoterScan http://bimas.dcrt.nih.gov/molbio/
proscan/index.html
Predicts promoters using homologies
with Pol II promoters
ORF Finder http://www.ncbi.nlm.nih.gov/gorf/gorf.html Open reading frame finder for ORFs
of cutoff size
Emboss http://www.ebi.ac.uk/emboss/cpgplot/ Predicts CpG islands and isochores
UTR http://bighost.area.ba.cnr.it/BIG/UTRHome/ Untranslated regions of eukaryotic mRNAs
AltSplice http://141.80.80.48/splice/ Database of alternate splicing in expressed sequence tags
and disease genes
Webcutter http://www.firstmarket.com/firstmarket/
cutter/cut2.html
Finds restriction endonuclease sites
FSED http://ir2lcb.cnrs-mrs.fr/d_fsed/fsed.html Frameshift error detection in new sequences
ERR_WISE http://www.bork.embl-heidelberg.de/
ERR_WISE/
Detection of frameshift sequencing errors
TfScan http://bioweb.pasteur.fr/seqanal/
interfaces/tfscan.html
Scans DNA sequences for transcription factors
RepeatMasker http://ftp.genome.washington.edu/
cgi-bin/RepeatMasker
Finds retroposons and repeats
Censor http://www.girinst.org/
Censor_Server-Data_Entry_Forms.html
Finds repeated elements, Repbase updates
Tandem http://tandem.biomath.mssm.edu/trf/trf.html Tandem repeat finder in DNA
TandemRepeats http://c3.biomath.mssm.edu/
trf.submit.options.html
Finds adjacent imperfect repeat
patterns in DNA
Tnoco http://www.biophys.uni-duesseldorf.de/
local/TINOCO/tinoco.html
Suggests secondary structure in
RNA or DNA
RNA_align http://www.csd.uwo.ca/  kzhang/
rna/rna_align.html
Aligns two RNA species from secondary and
tertiary structures
Protein tools
SwissTools http://www.expasy.ch/tools/ Large collection of protein tools and databases
SAPS http://www.ebi.ac.uk/saps/ Statistical analysis of protein sequences
Table 1 Continued
Bioinformatics
6
Site Address Description
Signalp http://www.cbs.dtu.dk/services/SignalP/ Predicts signal peptide cleavage sites
TMpred http://www.ch.embnet.org/software/
TMPRED_form.html
Predicts transmembrane segments and orientation
TMHMM http://www.cbs.dtu.dk/services/TMHMM/ Predicts transmembrane helices
TMAP http://www.mbb.ki.se/tmap/index.html Predicts membrane proteins from multiple alignment
COILS http://www.ch.embnet.org/software/
COILS_form.html
Prediction of coiled-coil regions in proteins
PREDATOR http://www-db.embl-heidelberg.de/jss/servlet/
de.embl.bk.wwwTools.GroupLeftEMBL/argos/
predator/predator_info.html
Secondary structure from multiple sequences
SSCP http://www.bork.embl-heidelberg.de/SSCP/ Predicts helix, strand and coil from composition
STRIDE http://www-db.embl-heidelberg.de/jss/servlet/
de.embl.bk.wwwTools.GroupLeftEMBL/
argos/stride/stride_info.html
Secondary structure from atomic coordinates
Jpred http://jura.ebi.ac.uk:8888/ Consensus method for secondary structure prediction
Interpro http://www.ebi.ac.uk/interpro/ Integrated resource of protein families,
domains and sites
Pfam http://www.cgr.ki.se/Pfam/ Alignments and hidden Markov models of
protein domains
Sift http://blocks.fhcrc.org/  pauline/SIFT.html Predicts tolerated and untolerated
protein substitutions
TransFac http://transfac.gbf.de/TRANSFAC/index.html Transcription factor database and tool collection
SMART http://smart.embl-heidelberg.de/ Simple modular architecture research tool
ProDom http://prodes.toulouse.inra.fr/prodom/
doc/prodom.html
Protein domain database
ScanProsite http://www.expasy.org/tools/scanprosite/ Scans a sequence against Prosite
ProfileScan http://hits.isb-sib.ch/cgi-bin/PFSCAN Scans sequence against profile databases
HMMER http://hmmer.wustl.edu/ Profile hidden Markov models for sequence analysis
SAM http://www.cse.ucsc.edu/research/
compbio/sam.html
Sequence alignment and modeling
system by HMM
Blimps http://bioweb.pasteur.fr/seqanal/motif/
blimps-uk.html
BLOCKS search tool
Motif http://motif.genome.ad.jp/ Searches for motifs in query sequence
Scop http://scop.mrc-lmb.cam.ac.uk/scop/ Structural classification of proteins
Pratt http://scop.mrc-lmb.cam.ac.uk/scop/ Search for conserved patterns in protein sequences
MAST http://meme.sdsc.edu/meme/website/mast.html Motif alignment search tool
Profam http://mips.gsf.de/proj/protfam/ Curated protein classification and homology domains
Dali http://www.ebi.ac Coordinates of query compared to PDB database
123D http://genomic.sanger.ac.uk/123D/
run123D.shtml
Predicts tertiary structure
PredictProt http://dodo.cpmc.columbia.edu/pp/
submit_adv.html
Threading, secondary, solvent and
transmembrane
CE http://cl.sdsc.edu/ce/all-to-all/all-to-all-r.html Finds structural alignments from PDB
SCWRL http://www.fccc.edu/research/labs/
dunbrack/scwrl/
Side-chain placement using a rotamer library
AA Contacts http://promoter.ics.uci.edu/BRNN-PRED/ Residue contacts, solvent accessibility
SwissPdbView http://www.expasy.ch/spdbv/ Deep view for manipulation and
analysis of PDB structures
Kinemages http://www.faseb.org/protein/kinemages/
kinpage.html
Interactive three-dimensional protein web display
RasMol http://www.umass.edu/microbio/
rasmol/index2.htm
Molecular visualization resource
Chime http://www.umass.edu/microbio/chime/ Interactive web plug-in for
3D molecular structures
Table 1 Continued
Bioinformatics
7
or retrained computer scientists. This dilemma will
most likely disappear soon, as a cohort of
scientists who have been trained as professional
bioinformaticians leave different universities with
newly established degrees in bioinformatics and
have a background in these traditionally disparate
disciplines.
Most of the universities that want to keep up to date
with modern trends have established some programs in
bioinformatics (see Table 2). Balancing a background
in biology with a background in computer science is
not easy. As it is unlikely that we can train
bioinformaticians with an exhaustive background in
both biology and computer science, we will continue to
Site Address Description
COMBOSA3D http://bioinformatics.org/combosa3d/
index_b.html
Combines alignment, protein three-dimensional and
Chime to color
Boxshade http://www.ch.embnet.org/software/
BOX_form.html
Pretty shading of multiple alignments
Table 1 Continued
Table 2 Selected graduate programs in bioinformatics
University URL Short description
Boston University http://bioinfo.bu.edu/ Offers an MSc and a PhD in
bioinformatics. The program has 30
faculties and a number of centers and
departments involved in the program
George Mason University http://www.krasnow.gmu.edu/
bioinformatics/bioinfresrch.htm
GMU offers a PhD in bioinformatics and
computational biology
University of Manchester School
of Biological Sciences
http://bioinf.man.ac.uk/ This college offers an MSc in bioinformatics and
computational molecular biology
Pasteur Institute http://www.pasteur.fr/formation/infobio/
infobio-en.html
The bioinformatics courses at the Pasteur
Institute are organized with the
collaboration of the two largest
Parisian scientific universities: Paris VI
‘Pierre et Marie Curie’ and Paris VII
‘Denis Diderot’
University of Pennsylvania http://www.cbil.upenn.edu/UPCB/ This program admits students into standard
PhD programs of either the Computer
and Information Science department,
or a biological department, but then
provides additional training in
computational biology
Rutgers University http://cmb.rutgers.edu/ This joint effort of Rutgers and
the University of Medicine and Dentistry
of New Jersey offers a PhD program
S-star http://s-star.org/main.html Online bioinformatics program
The WM Keck Center for
Computational Biology in
Houston, Texas
http://www.keckcenter.org/training.cfm The center offers studies in
computational biology through
three partner institutions: Baylor College
of Medicine, Rice University and the
University of Houston
University of Toronto http://p-b.med.utoronto.ca/ Multidepartment and affiliated
research institutes’ academic program
Iowa State University http://www.bcb.iastate.edu/ An interdisciplinary PhD program in
bioinformatics and computational
biology (BCB)
University of the Sciences
in Philadelphia
http://tonga.usip.edu/zauhar/
bioinformatics_program.html
MSc program in bioinformatics
University of Bielefeld, Germany http://www.cebitec.uni-bielefeld.de/
GradSchool/
International Graduate School in
bioinformatics and genome research
Chalmers University of
Technology, Gothenburg, Sweden
http://www.md.chalmers.se/
Stat/Bioinfo/Master/
International Master’s programme
in bioinformatics
Bioinformatics
8
produce two types of bioinformaticians: biology
oriented and more computer science oriented. The
ﬁrst group will be better suited for the biological data
analysis, the latter for more traditional computer
science tasks, for example algorithms development,
database design and software engineering. As bioinformatics
analysis increasingly depends on statistical
methods, statistics should be a signiﬁcant component
of bioinformatics education. The Canadian Genetic
Diseases Network recently published a white paper:
Bioinformatics Curriculum Recommendations for Undergraduate,
Graduate, and Professional Programs, which
recognizes the importance of the two tracks in
bioinformatics education (see Web Links).
Conclusions
The success of the HGP would not be possible without
the parallel development of bioinformatics techniques.
All the stages of the HGP, from cloning and
sequencing to disease gene discovery, heavily depend
on computer aid. Many methods and tools were
developed. However, there are even more challenges
awaiting. Despite tremendous effort, eukaryotic gene
prediction methods are far from being perfect. It is still
relatively difﬁcult to navigate in the vast amount of
information predicted by the HGP. New tools to make
these results conveniently available to all (including
hard wet bench) biologists are required. For the
moment, the ‘gene-centric’ approach dominates, but
to fully understand how an organism works, we need
the holistic approach to understand all of the
interactions between the different components. System
biology is a promising direction toward this goal.
Bioinformatics will continue to evolve and will
continue to revolutionize biology.
See also
Bioinformatics: Technical Aspects
Genetic Databases
Genetic Databases: Mining
Microarray Bioinformatics
Further Reading
Branden C-I and Tooze J (1999) Introduction to Protein Structure.
New York: Garland Publishing.
Mount DW (2001) Bioinformatics: Sequence and Genome Analysis.
Plainview, NY: Cold Spring Harbor Laboratory Press.
Web Links
Bioinformatics Curriculum. This document provides recommendations
for undergraduate, graduate and professional programs in
bioinformatics developed by a group of Canadian scientists
www.bioinformatics.ca/docs/whitepaper.pdf
Bioinformatics
9