The European Bioinformatics Institute’s data
resources 2014
Catherine Brooksbank*, Mary Todd Bergman, Rolf Apweiler, Ewan Birney and
Janet Thornton
European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambridge CB10 1SD, UK
Received September 17, 2013; Revised November 1, 2013; Accepted November 4, 2013
ABSTRACT
Molecular Biology has been at the heart of the ‘big
data’ revolution from its very beginning, and the
need for access to biological data is a common
thread running from the 1965 publication of
Dayhoff’s ‘Atlas of Protein Sequence and
Structure’ through the Human Genome Project in
the late 1990s and early 2000s to today’s population-scale
sequencing initiatives. The European
Bioinformatics Institute (EMBL-EBI; http://www.ebi.
ac.uk) is one of three organizations worldwide that
provides free access to comprehensive, integrated
molecular data sets. Here, we summarize the principles
underpinning the development of these public
resources and provide an overview of EMBL-EBI’s
database collection to complement the reviews of
individual databases provided elsewhere in this
issue.
INTRODUCTION
The molecular life sciences are becoming increasingly datadriven
and reliant on open-access databases (1). This is as
true of the applied sciences as it is of fundamental research:
in the past year, we have witnessed announcements that the
UK’s National Health Service will invest in sequencing the
genomes of up to 100 000 citizens (see http://www.gov.uk/
government/speeches/strategy-for-uk-life-sciences-one-yearon
and http://news.sciencemag.org/biology/2012/12/u.k.-
unveils-plan-sequence-whole-genomes-100000-patients);
the Faroe Islands are planning to sequence the genome of
every citizen who wishes to have this information (see
http://www.fargen.fo/en/), and large-scale metagenomics
projects are helping us to map the global biodiversity of
the oceans (2).
The European Bioinformatics Institute (EMBL-EBI),
part of the European Molecular Biology Laboratory,
makes these large-scale efforts possible. It helps scientists
deposit their research data into public collections,
produces value-added knowledge bases and makes its
entire holdings accessible to all, thereby enabling
millions of scientists worldwide to explore, analyse, interpret
and derive new knowledge from decades of scientiﬁc
endeavour.
Among its other roles (Appendix 1), EMBL-EBI has a
mission to provide free and open access to biomolecular
information, spanning scientiﬁc literature and the data
supporting it: DNA and protein sequences; biomolecules
and their structures, functions, reactions and interactions;
and practical tools for analysis and discovery.
These offerings include personally identiﬁable genetic
and phenotypic data resulting from biomedical research
projects—an area of growing importance as healthcare
systems embrace genomic medicine. Managing access to
these data sets is a high-priority activity at EMBL-EBI.
EMBL-EBI’s core resources are foundational members
of international consortia, which share data globally and
foster competitiveness among their members. Some of
these collaborations have a long history [e.g. the
International Nucleotide Sequence Database
Collaboration (INSDC) (3), the worldwide Protein Data
Bank (wwPDB) (4), UniProt (5) and Ensembl (6)]. Others,
driven by EMBL-EBI, are more recent [e.g. IMEx (7) for
protein interaction data; ProteomExchange (8) for protein
identiﬁcation data and COSMOS (9) for metabolomics
data]. The Global Alliance (10)—a large-scale international
effort to enable the secure sharing of genomic and
clinical data—is the most recent of these. Each of these
collaborations exempliﬁes the fundamental principles of
EMBL-EBI service provision (Appendix 2).
DESIGNED TO BE USED
EMBL-EBI embraces user-centred design (UCD), an
approach that focuses on the behaviour and needs of the
people who will actually use the product. UCD has been
successfully applied to design in many different domains,
although its application to bioinformatics services (11) is
*To whom correspondence should be addressed. Tel: +44 1223 492525; Fax: +44 1224 494468; Email: cath@ebi.ac.uk
D18–D25 Nucleic Acids Research, 2014, Vol. 42, Database issue Published online 23 November 2013
doi:10.1093/nar/gkt1206
ß The Author(s) 2013. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
atMasarykUniversityonApril8,2014http://nar.oxfordjournals.org/Downloadedfrom
relatively recent. The case for using UCD for bioinformatics
services is compelling: even major bioinformatics
resources are known to suffer from usability problems
(12), which prevent users from completing tasks (13).
By placing the user at the forefront of our minds as we
design, test and implement our services, we create more
useful and user-friendly resources. This approach has been
used to completely redesign the EMBL-EBI website—a
major project that has involved every team at EMBLEBI.
The redesign puts users at the centre of the
process, providing an intuitive new interface to EMBLEBI
services. It aims for consistent functionality without
stiﬂing the individual data resource brands.
EMBL-EBI’s search engine (14) displays results in an
organized manner, according to the central dogma of molecular
biology (i.e. DNA makes RNA makes protein).
This results in an uncluttered results ‘dashboard’ from
which users can explore genes, protein sequences, gene
expression, molecular structures and related scientiﬁc literature.
The search allows easy comparison of key information
for human, mouse, ﬂy and other species.
Bioinformatics services on the EMBL-EBI website are
displayed according to nine major themes (Figure 1; see
also http://www.ebi.ac.uk/services), which were informed
by user feedback. We have organized this review in the
same way.
LITERATURE
Access to the scientiﬁc literature is a basic requirement for
research. EMBL-EBI coordinates the development of
Europe PubMed Central Europe (PMC) (15) in
Figure 1. EMBL-EBI’s core data resources. The ﬁgure summarizes the resources described in this review; a fuller summary of EMBL-EBI’s data
resources and tools, which links to each resource, is available at http://www.ebi.ac.uk/services.
Nucleic Acids Research, 2014, Vol. 42, Database issue D19
atMasarykUniversityonApril8,2014http://nar.oxfordjournals.org/Downloadedfrom
collaboration with The University of Manchester (Mimas
and NaCTeM) and the British Library. PMC is part of
PMC International, which is coordinated by the US
National Center for Biotechnology Information and
includes PMC Canada. Launched in November 2012,
Europe PMC is funded through a collective of European
funders, coordinated by the Wellcome Trust.
Europe PMC combines the entire collection of PubMed
abstracts, PMC full-text articles, patent abstracts
(European, US and international), National Health
Service (NHS) clinical guidelines, Agricola records and
other record types, and builds innovative tools to help
researchers explore every aspect of the literature.
Because it is developed at EMBL-EBI, it is uniquely positioned
to link abstracts and articles seamlessly to the
underlying data. For example, Europe PMC is integrated
with UniProt, the Protein Data Bank in Europe (PDBe)
and the European Nucleotide Archive (ENA).
CROSS-DOMAIN TOOLS AND RESOURCES
The integration of different -omics data types requires the
consistent application of metadata to data sets derived
from the same sample. Sample metadata are managed
within EMBL-EBI’s BioSamples database (16), which
provides links to assays for speciﬁc samples (including reference
samples such as cell lines) and accepts direct submissions.
In 2013, the BioSamples database launched a new
user interface, application programming interface (API)
and submission-accessioning service.
ONTOLOGIES
Biologists and bioinformaticians look to ontologies and
other types of controlled vocabularies as a means of
standardizing the way data are described, queried and
analysed. Subtle differences in the use of terms and
phrases can hamper communication among scientists,
and can make automated data exchange prohibitively difﬁcult.
Ontologies address these issues, making information
in databases more readily human- and computer-readable.
The Gene Ontology (GO) (17) is a major bioinformatics
initiative to unify the representation of gene and geneproduct
attributes across all species. Groups participating
in the GO Consortium include major model organism
databases and other bioinformatics resource centres. At
EMBL-EBI, the GO editors play a key role in managing
the distributed task of developing and improving GO,
whereas the UniProt GO annotation (GOA) program
adds high-quality GOAs to proteins in the UniProt
Knowledgebase (UniProtKB) (18). Recent enhancements
include expanding the functionality of GO’s direct
ontology submission tool, TermGenie, which now
includes a ‘free form’ input for experienced users; integration
of the GO and ChEBI ontologies (19); and improvements
to the electronic GOA pipeline.
Another cross-cutting tool is the Experimental Factor
Ontology (EFO) (20), which began its life as a practical
means of categorizing gene expression data sets. In the
past year, it has broadened its application considerably
to support annotation of genome-wide association
studies (GWAS) and the integration of genomic and
disease data.
DNA AND RNA
Nucleotide sequence data are a central reference point
onto which many other types of information can be built.
The public record of nucleotide sequence data
The ENA (21) manages the staggering volumes of data
generated by next-generation sequencing. The ENA team
has developed CRAM, an openly accessible software
toolkit and ﬁle format for compressing sequence data,
leveraging the speciﬁc data properties of DNA sequence
(22). Ofﬁcially launched in November 2012, CRAM is a
community-led endeavour that is being incorporated into
existing tools and pipelines so that researchers can save on
local storage space. It also has the advantage of keeping the
public archives to a manageable size.
Reference genomes
Ensembl (6), produced jointly by EMBL-EBI and the
Wellcome Trust Sanger Institute, enables and advances
genome science by providing high-quality integrated annotation
on vertebrate genomes within a consistent and
accessible infrastructure. Ensembl’s new features include
the Variant Effect Predictor (23), which predicts the effects
of variant positions and alleles on overlapping transcripts
and regulatory regions. Ensembl features the genomes of
75 vertebrate species [the mountain gorilla (24) being a
notable recent addition]. It also houses the substantial
data sets produced by the ENCODE project (25).
Ensembl Genomes (26), launched in 2009 to expand
EMBL-EBI’s taxonomic coverage of reference genomes,
added a signiﬁcant number of new species to its database
in the past year. It now includes the genomes of biting
midges, butterﬂies (27), barley (28), wheat (29) and
>6000 bacterial species.
Ensembl Genomes also provides the underlying architecture
for several new community portals. Understanding
the basis of crop diseases was the driver behind the launch
of PhytoPath (http://www.phytopathdb.org), a new portal
for plant pathogen data. EMBL-EBI’s involvement in the
transPLANT project has spawned a new integrative portal
for plant genomics data (http://www.transplantdb.eu).
Ensembl Genomes has also made metabolic data for
>4000 bacterial genomes available through the Microme
portal (http://www.microme.eu).
Linking genotype to phenotype
EMBL-EBI’s philosophy is to make its data openly available
to the research community, but where personally
identiﬁable data are involved, it is important that we
honour the consent agreements under which patients
provide data, which nearly always exclude the use of
genetic data to identify individuals. The European
Genome-phenome Archive (EGA) is EMBL-EBI’s
service for permanent archiving and sharing of all types
of personally identiﬁable genetic and phenotypic data
resulting from biomedical research projects. The EGA
D20 Nucleic Acids Research, 2014, Vol. 42, Database issue
atMasarykUniversityonApril8,2014http://nar.oxfordjournals.org/Downloadedfrom
contains exclusive data collected from individuals whose
consent authorizes data release only for speciﬁc research
use or to bona ﬁde researchers. The EGA provides the
necessary security required to control access, maintain
patient conﬁdentiality and provide access for those researchers
and clinicians who are authorized to view the
data. In all cases, data access decisions are made by the
appropriate data access-granting organization (DAO) and
not by the EGA. An independent Ethics Committee audits
the EGA protocols and infrastructure.
Resequencing projects are providing vast amounts of
data that link genotype to phenotype, with the ultimate
goal of establishing the connections between genetic variation
and disease. Two resources reviewed in this NAR
database issue provide access to these data, one focusing
on GWAS, the other on knockout mice.
EMBL-EBI and the US National Human Genome
Research Institute (NHGRI) jointly develop the Catalog
of Published GWAS (30). The catalogue is a publicly
available manually curated collection of published
GWAS with a distinctive and dynamic visualization tool
that enables users to click on single-nucleotide polymorphism
(SNP)–trait associations mapped to chromosomal
locations. Each association is annotated with
terms from the EFO (see ‘Cross-domain tools and resources’
above) to help the user identify SNPs associated
with a speciﬁc phenotype.
The International Mouse Phenotyping Consortium
(IMPC) is building the ﬁrst comprehensive functional
catalogue of a mammalian genome (31). To do this, it is
creating a knockout mouse strain for every known
protein-coding gene—20 000 mouse strains in total—
using a rigorously standardized set of phenotyping protocols.
These strains will be made available in public
repositories, and data pertaining to each will be made
publicly available in near real time, along with open
tools for their analysis. Project data will be delivered
through a service (http://www.mousephenotype.org)
managed by the MPI2 consortium (EMBL-EBI, the
Wellcome Trust Sanger Institute, MRC Harwell). Users
will be able to search by term, gene, tissue or disease, so
they may identify associations between phenotype, gene
and protocol swiftly. The results are displayed using
the same principles and underlying architecture as
EMBL-EBI’s global search. The service is expected to
launch in early 2014.
Metagenomics
While the projects described above are accumulating
an ever greater depth of knowledge about the genomes
of long-studied organisms, another approach—
metagenomics (32)—increases ‘breadth’ of knowledge
by presupposing nothing about the identity of the organisms
present in a sample. EMBL-EBI’s ENA and
InterPro teams have created an integrated resource—the
Metagenomics Portal (http://www.ebi.ac.uk/metagenomics/)—that
allows researchers to submit, archive
and analyse genomic information from environments containing
many species. New functionality is being added
regularly in response to user demand.
EXPRESSION
The combination of transcriptomics, proteomics and
metabolomics data can provide a powerful basis
for deriving a system-based understanding of biological
systems. To facilitate such integration, EMBLEBI
is working towards the integration of the
ArrayExpress Archive (33), Expression Atlas (33),
Proteomics Identiﬁcations Database (PRIDE) (34)
and MetaboLights (35), EMBL-EBI’s newly launched
metabolomics database.
EMBL-EBI is developing a Baseline Expression Atlas,
which uses high-throughput sequencing-based expression
data to report ‘absolute’ gene expression levels, rather
than relative levels. Concurrently, signiﬁcant improvements
have been made to the ArrayExpress archive interface,
and the resource has accepted its millionth assay.
PRIDE (34) is a public resource for mass spectrometrybased
protein expression data. In 2013, PRIDE achieved
the successful and stable implementation of the
ProteomeXchange data workﬂow (http://www.
proteomexchange.org). As a result, data depositions
more than tripled in number of submissions and in volume.
The ﬁnal database making up this ‘expression and distribution
trinity’ is the newly launched MetaboLights
(35)—the ﬁrst general purpose open-access database for
metabolomics and its derived information.
MetaboLights includes a reference layer with information
about individual metabolites, their chemistry, spectroscopy
and biological roles, connected with a study
archive into which researchers deposit primary and
metadata on metabolomics studies.
PROTEINS
Protein sequence provides another ‘information hub’ for
the molecular biologist, onto which experimentally
validated information about the behaviour and localization
of proteins can be hung, and from which hypotheses
about structure and function may be generated.
UniProt (5), the uniﬁed resource of protein sequence
and functional information, is maintained by EMBLEBI
in collaboration with the Swiss Institute of
Bioinformatics and Universities of Georgetown and
Delaware. UniProt is closely integrated with Ensembl (6)
and Ensembl Genomes (26), and has generated new reference
proteome sets to match their genes in the reference
genomes. UniProt prioritizes the manual annotation of
experimental data for human and other reference proteomes
in collaboration with other worldwide resources,
ensuring the highest quality knowledge is available to researchers.
UniProt’s automatic annotation exploits the
results of manual annotation, resulting in a widening of
taxonomic and annotation depth.
EMBL-EBI’s protein resources have embraced UCD
(see ‘Designed to be used’). The UniProt development
team has designed new interfaces that enhance user interaction
with the website and facilitate access to data.
The UniProt content team has extended the databases to
accommodate a rapidly growing volume of data and to
incorporate variation and proteomics data.
Nucleic Acids Research, 2014, Vol. 42, Database issue D21
atMasarykUniversityonApril8,2014http://nar.oxfordjournals.org/Downloadedfrom
InterPro (36), EMBL-EBI’s database of protein
families, domains and motifs, has completely re-implemented
its back-end to optimize user query functionality.
A new user interface has been developed and tested with
users, and the results have informed the global EMBLEBI
website redesign process.
Pfam (37), a database of hidden Markov models and
alignments describing conserved protein families and
domains, is one of InterPro’s 11 member databases, and
is in the process of migrating from the Wellcome Trust
Sanger Institute to EMBL-EBI. Pfam’s latest release adds
real-time searches of DNA sequences for matches to Pfam
models, representative proteome sequence sets to provide
non-redundant views of alignments and annotations to
disease.
STRUCTURES
Understanding molecular structure is crucial to understanding
function. PDBe (38,39), the European arm of
the worldwide Protein Data Bank collaboration
(wwPDB), provides sophisticated tools for analysing
structures, several of which were improved signiﬁcantly
in the past year. These include tools that enhance the
analysis of nuclear magnetic resonance entries and many
improvements to EMDB (38), the European resource for
electron microscopy-based models. EMDB now has a new
search service and an interactive viewer for electron tomograms.
PDBe is increasingly integrated with other types of
information, including sequence data [through the SIFTS
service (40)] and the GO (17) (through a new module in
the PDBeXplore tool).
SYSTEMS
The genes and gene products encoded by genomes do not
act in isolation but do so in coordinated systems, often
containing protein, small molecule and oligonucleotide
or oligosaccharide components. EMBL-EBI’s molecular
systems resources enable researchers to build a holistic
view of life at the molecular level, building up from
enzymes and their mechanisms, through protein—
protein interactions and networks, to pathways and quantitative
models.
The Enzyme Portal (41,42), launched in February
2012, combines high-quality data from 10 previously
isolated databases and organizes information about
each enzyme in such a way that the user can ﬂip from
information about a single enzyme function to resolved
structures, reactions and pathways, substrates and
products, relevance to disease and relevant publications.
Users can also search the Enzyme Portal by protein
sequence.
IntAct (43), EMBL-EBI’s database of molecular interactions,
is now closely integrated with the MINT database
at the University of Rome, and is serving as the curation
platform for eight global partner organizations based in
Canada, India, Ireland, Italy, Singapore, UK and the
USA through the IMEx Consortium (7).
EMBL-EBI, the Ontario Institute of Cancer Research
and New York University Medical Center jointly develop
the Reactome (44,45) database of curated human
pathways. Reactome’s website (http://www.reactome.
org) has been completely redeveloped, and now features
a modular pathway browser, a comprehensive set of web
services and integrated molecular-interaction, structural
and expression data.
Submissions to the BioModels database (46), EMBLEBI’s
database of computational models of biological
processes, have more than doubled since 2011. A new
‘top-down’ approach to building quantitative models
has been implemented. Rather than building up from
the mechanistic details of a speciﬁc process, the new
approach uses pathways from data resources such as
Kyoto Encyclopedia of Genes and Genomes and
Reactome (44) as starting points. The BioModels
database is also one of several EMBL-EBI databases
that are actively involved in exposing their data to the
semantic web: there is now a resource description
framework (RDF) representation of the models in the
database, and users have access to powerful queries
using SPARQL.
CHEMICAL BIOLOGY
Chemical biology has been a major growth area for
EMBL-EBI in the past decade. Major drivers for
this growth have included the emergence and maturation
of computational systems biology (47) and public investment
in computational approaches to drug discovery
(48,49).
ChEMBL (50), EMBL-EBI’s database of drugs and
bioactive entities, now has a uniﬁed chemistry resource
lookup and registration system called UniChem (51).
The ChEMBL team supports the neglected-disease
community, and now provides one-stop access to all
data from the Medicines for Malaria Venture’s openaccess
MalariaBox (52) and other open-access malaria
research efforts, including new high-value malaria and
tuberculosis data sets. A version of ChEMBL has been
built using only open-source software, and this has been
made available as a virtual machine.
ChEBI, a resource for reference chemical structures,
nomenclature and ontological classiﬁcation, now offers a
standalone tool for classifying compounds that resemble
natural products. Several thousand natural products
have been added to the database, and a new version of
the SENECA tool, which helps to elucidate structures
for natural products, has been implemented. ChEBI has
also introduced new tools for searching (OntoQuery) and
analysis (BiNChE) of information contained in the ChEBI
Ontology.
USER TRAINING
It is essential that our users can access EMBL-EBI’s data
efﬁciently and get the most out of their own datasets
when comparing them with the public record. To that
end, EMBL-EBI provides an extensive user training programme
(http://www.ebi.ac.uk/training), coordinated and
funded centrally, but with input from all the resource
D22 Nucleic Acids Research, 2014, Vol. 42, Database issue
atMasarykUniversityonApril8,2014http://nar.oxfordjournals.org/Downloadedfrom
teams. In turn, as training activities offer a unique interface
between service developers and users, they are invaluable
in the evolution of existing resources and the creation
of new ones. EMBL-EBI’s diversifying community of
users is reﬂected in its user training offering. The programme,
courses and materials are created in response
to user demand, and cover the full spectrum of EMBLEBI’s
activities.
Face-to-face training courses reach 4000 users a
year—a small fraction of EMBL-EBI’s user community.
Train online (http://www.ebi.ac.uk/training/online/),
EMBL-EBI’s eLearning resource launched in 2011,
supplies on-demand instruction to users the world over,
in the form of free short courses designed for bench-based
biologists. Quick tours provide an overview of each of the
core data resources and show users where to go for more
information. Introductory courses explain some of the important
concepts behind the bioinformatics resources and
introduce key subject areas, such as functional genomics.
Walk-through courses provide a more in-depth exploration
of a resource, structured as tutorials with use
cases, guided examples and quizzes. Finally, video
courses are based on some of our most popular face-toface
training courses. They provide video lectures and accompanying
course materials.
Simply being able to discover relevant training courses
is challenging for many life scientists, and EMBL-EBI’s
training team has entered the realm of ‘training informatics’
through its involvement in the EMTRAIN project,
which launched on-courseÕ
(http://www.on-course.eu)—
a new resource for course seekers in the biomedical
sciences—in June 2012 (53).
CONCLUDING REMARKS
The foundations of EMBL-EBI’s data collection are comprehensive
archival collections of biomolecular information,
and our expanding and diversifying user base
demands that we serve a growing number of specialized
communities. To support research in the applied sciences,
we must provide access to data from medical sequencing
projects, with appropriately consented access to the data.
EMBL-EBI is dedicated to remaining user-focused and
developing interfaces and training opportunities that
enable discovery in all areas of molecular biology. Our
continuous interaction with our users is the driver that
enables us to feed back to our developers, which, in
turn, helps our services to remain relevant to our users’
needs.
ACKNOWLEDGEMENTS
The EMBL-EBI is indebted to the support of its funders:
EMBL’s member states, the European Commission, the
Wellcome Trust, the UK Research Councils, the US
National Institutes of Health and our industry partners.
The authors are also indebted to hundreds of thousands
of scientists who have submitted data and annotation
to the shared data collections. The authors would like
to thank the many colleagues who provided input to this
manuscript.
FUNDING
Funding for open access charge: Wellcome Trust.
Conﬂict of interest statement. None declared.
REFERENCES
1. Marx,V. (2013) Biology: the big challenges of big data. Nature,
498, 255–260.
2. Karsenti,E., Acinas,S.G., Bork,P., Bowler,C., De Vargas,C.,
Raes,J., Sullivan,M., Arendt,D., Benzoni,F., Claverie,J.M. et al.
(2011) A holistic approach to marine eco-systems biology.
PLoS Biol., 9, e1001177.
3. Nakamura,Y., Cochrane,G. and Karsch-Mizrachi,I. (2013) The
international nucleotide sequence database collaboration.
Nucleic Acids Res., 41, D21–D24.
4. Berman,H., Henrick,K., Nakamura,H. and Markley,J.L. (2007)
The worldwide Protein Data Bank (wwPDB): ensuring a
single, uniform archive of PDB data. Nucleic Acids Res., 35,
D301–D303.
5. The UniProt Consortium. (2013) Update on activities at the
Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res.,
41, D43–D47.
6. Flicek,P., Ahmed,I., Amode,M.R., Barrell,D., Beal,K., Brent,S.,
Carvalho-Silva,D., Clapham,P., Coates,G., Fairley,S. et al. (2013)
Ensembl 2013. Nucleic Acids Res., 41, D48–D55.
7. Orchard,S., Kerrien,S., Abbani,S., Aranda,B., Bhate,J., Bidwell,S.,
Bridge,A., Briganti,L., Brinkman,F.S., Cesareni,G. et al. (2012)
Protein interaction data curation: the International
Molecular Exchange (IMEx) consortium. Nat. Methods, 9,
345–350.
8. Vizcaı´no,J.A., Coˆ te´ ,R.G., Csordas,A., Dianes,J.A., Fabregat,A.,
Foster,J.M., Griss,J., Alpi,E., Birim,M., Contell,J. et al. (2013)
The Proteomics Identiﬁcations (PRIDE) database and
associated tools: status in 2013. Nucleic Acids Res., 41,
D1063–D1069.
9. Salek,R.M., Haug,K. and Steinbeck,C. (2013) Dissemination of
metabolomics results: role of MetaboLights and COSMOS.
Gigascience, 2, 8.
10. Hayden,E.C. (2013) Geneticists push for global data-sharing.
Nature, 498, 16–17.
11. Pavelin,K., Cham,J.A., De Matos,P., Brooksbank,C., Cameron,G.
and Steinbeck,C. (2012) Bioinformatics meets user-centred design:
a perspective. PLoS Comp. Biol., 8, e1002554.
12. Javahery,H., Seffah,A. and Radhakrishnan,T. (2004) Beyond
power: making bioinformatics tools user-centered. Commun.
ACM, 47, 58–63.
13. Bolchini,D., Finkelstein,A., Perrone,V. and Nagl,S. (2009) Better
bioinformatics through usability analysis. Bioinformatics, 25,
406–412.
14. McWilliam,H., Li,W., Uludag,M., Squizzato,S., Park,Y.M.,
Buso,N., Cowley,A.P. and Lopez,R. (2013) Analysis tool web
services from the EMBL-EBI. Nucleic Acids Res., 41,
W597–W600.
15. McEntyre,J.R., Ananiadou,S., Andrews,S., Black,W.J.,
Boulderstone,R., Buttery,P., Chaplin,D., Chevuru,S., Cobley,N.,
Coleman,L.A. et al. (2011) UKPMC: a full text article
resource for the life sciences. Nucleic Acids Res., 39, D58–D65.
16. Gostev,M., Faulconbridge,A., Brandizi,M., Fernandez-Banet,J.,
Sarkans,U., Brazma,A. and Parkinson,H. (2012) The BioSample
Database (BioSD) at the European Bioinformatics Institute.
Nucleic Acids Res., 40, D64–D70.
17. The Gene Ontology Consortium. (2012) The Gene
Ontology: enhancements for 2011. Nucleic Acids Res., 40,
D559–D564.
18. The Gene Ontology Consortium. (2013) Gene ontology
annotations and resources. Nucleic Acids Res., 41, D530–D535.
Nucleic Acids Research, 2014, Vol. 42, Database issue D23
atMasarykUniversityonApril8,2014http://nar.oxfordjournals.org/Downloadedfrom
19. Hill,D.P., Adams,N., Bada,M., Batchelor,C., Berardini,T.Z.,
Dietze,H., Drabkin,H.J., Ennis,M., Foulger,R.E., Harris,M.A.
et al. (2013) Dovetailing biology and chemistry: integrating
the gene ontology with the ChEBI chemical ontology.
BMC Genomics, 14, 513.
20. Malone,J., Holloway,E., Adamusiak,T., Kapushesky,M., Zheng,J.,
Kolesnikov,N., Zhukova,A., Brazma,A. and Parkinson,H. (2010)
Modeling sample variables with an experimental factor ontology.
Bioinformatics, 26, 1112–1118.
21. Cochrane,G., Alako,B., Amid,C., Bower,L., Cerden˜ o-Ta´ rraga,A.,
Cleland,I., Gibson,R., Goodgame,N., Jang,M., Kay,S. et al.
(2013) Facing growth in the European nucleotide archive.
Nucleic Acids Res., 41, D30–D35.
22. Hsi-Yang Fritz,M., Leinonen,R., Cochrane,G. and Birney,E.
(2011) Efﬁcient storage of high throughput DNA sequencing
data using reference-based compression. Genome Res., 21,
734–740.
23. McLaren,W., Pritchard,B., Rios,D., Chen,Y., Flicek,P. and
Cunningham,F. (2010) Deriving the consequences of genomic
variants with the Ensembl API and SNP effect predictor.
Bioinformatics, 26, 2069–2070.
24. Scally,A., Dutheil,J.Y., Hillier,L.W., Jordan,G.E., Goodhead,I.,
Herrero,J., Hobolth,A., Lappalainen,T., Mailund,T.,
Marques-Bonet,T. et al. (2012) Insights into hominid evolution
from the gorilla genome sequence. Nature, 483, 169–175.
25. The ENCODE Consortium, Bernstein,B.E., Birney,E., Dunham,I.,
Green,E.D., Gunter,C. and Snyder,M. (2012) An integrated
encyclopedia of DNA elements in the human genome. Nature,
489, 57–74.
26. Kersey,P.J., Staines,D.M., Lawson,D., Kulesha,E., Derwent,P.,
Humphrey,J.C., Hughes,D.S., Keenan,S., Kerhornou,A.,
Koscielny,G. et al. (2012) Ensembl Genomes: an integrative
resource for genome-scale data from non-vertebrate species.
Nucleic Acids Res., 40, D91–D97.
27. The Heliconius Genome Consortium. (2012) Butterﬂy genome
reveals promiscuous exchange of mimicry adaptations among
species. Nature, 487, 94–98.
28. The International Barley Genome Sequencing Consortium,
Mayer,K.F., Waugh,R., Brown,J.W., Schulman,A., Langridge,P.,
Platzer,M., Fincher,G.B., Muehlbauer,G.J., Sato,K. et al. (2012)
A physical, genetic and functional sequence assembly of the
barley genome. Nature, 491, 711–716.
29. Brenchley,R., Spannagl,M., Pfeifer,M., Barker,G.L., D’Amore,R.,
Allen,A.M., McKenzie,N., Kramer,M., Kerhornou,A., Bolser,D.
et al. (2012) Analysis of the bread wheat genome using
whole-genome shotgun sequencing. Nature, 491, 705–710.
30. Welter,D., MacArthur,J., Morales,J., Burdett,T., Hall,P.,
Junkins,H., Klemm,A., Xu,M., Flicek,P., Manolio,T. et al.
(2014) The NHGRI GWAS Catalog, a curated resource of
SNP-Trait associations. Nucleic Acids Res., [epub ahead
of print].
31. Koscielny,G., Yaikhom,G., Iyer,V., Meehan,T.F., Morgan,H.,
Atienza-Herrero,J., Blake,A., Chen,C.-K., Easty,R., Di Fenza,A.
et al. (2014) The International Mouse Phenotyping Consortium
(IMPC) web portal, a uniﬁed point of access for knockout mice
and related phenotyping data. Nucleic Acids Res., [epub ahead
of print].
32. Handelsman,J. (2004) Metagenomics: application of genomics to
uncultured microorganisms. Microbiol. Mol. Biol. Rev., 68,
669–685.
33. Rustici,G., Kolesnikov,N., Brandizi,M., Burdett,T., Dylag,M.,
Emam,I., Farne,A., Hastings,E., Ison,J., Keays,M. et al. (2013)
ArrayExpress update—trends in database growth and
links to data analysis tools. Nucleic Acids Res., 41,
D987–D990.
34. Vizcaı´no,J.A., Coˆ te´ ,R.G., Csordas,A., Dianes,J.A., Fabregat,A.,
Foster,J.M., Griss,J., Alpi,E., Birim,M., Contell,J. et al. (2013)
The PRoteomics IDEntiﬁcations (PRIDE) database and
associated tools: status in 2013. Nucleic Acids Res., 41,
D1063–D1069.
35. Haug,K., Salek,R.M., Conesa,P., Hastings,J., de Matos,P.,
Rijnbeek,M., Mahendraker,T., Williams,M., Neumann,S.,
Rocca-Serra,P. et al. (2013) MetaboLights—an open-access
general-purpose repository for metabolomics studies and
associated meta-data. Nucleic Acids Res., 41, D781–D786.
36. Hunter,S., Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A.,
Binns,D., Bork,P., Das,U., Daugherty,L., Duquenne,L. et al.
(2009) InterPro: the integrative protein signature database.
Nucleic Acids Res., 37, D211–D215.
37. Punta,M., Coggill,P.C., Eberhardt,R.Y., Mistry,J., Tate,J.,
Boursnell,C., Pang,N., Forslund,K., Ceric,G., Clements,J. et al.
(2012) The Pfam protein families database. Nucleic Acids Res.,
40, D290–D301.
38. Gutmanas,A., Oldﬁeld,T.J., Patwardhan,A., Sen,S., Velankar,S.
and Kleywegt,G.J. (2013) The role of structural bioinformatics
resources in the era of integrative structural biology. Acta
Crystallogr. D Biol. Crystallogr., 69, 710–721.
39. Velankar,S., Alhroub,Y., Best,C., Caboche,S., Conroy,M.J.,
Dana,J.M., Fernandez Montecelo,M.A., van Ginkel,G.,
Golovin,A., Gore,S.P. et al. (2012) PDBe: Protein Data Bank in
Europe. Nucleic Acids Res., 40, D445–D452.
40. Velankar,S., Dana,J.M., Jacobsen,J., van Ginkel,G., Gane,P.J.,
Luo,J., Oldﬁeld,T.J., O’Donovan,C., Martin,M.-J. and
Kleywegt,G.J. (2013) SIFTS: structure integration with function,
taxonomy and sequences resource. Nucleic Acids Res., 41,
D483–D489.
41. de Matos,P., Cham,J.A., Cao,H., Alca´ ntara,R., Rowland,F.,
Lopez,R. and Steinbeck,C. (2013) The Enzyme Portal: a case
study in applying user-centred design methods in bioinformatics.
BMC Bioinformatics, 14, 103.
42. Alca´ ntara,R., Onwubiko,J., Cao,H., Matos,P.D., Cham,J.A.,
Jacobsen,J., Holliday,G.L., Fischer,J.D., Rahman,S.A., Jassal,B.
et al. (2013) The EBI enzyme portal. Nucleic Acids Res., 41,
D773–D780.
43. Kerrien,S., Aranda,B., Breuza,L., Bridge,A., Broackes-Carter,F.,
Chen,C., Duesbury,M., Dumousseau,M., Feuermann,M., Hinz,U.
et al. (2012) The IntAct molecular interaction database in 2012.
Nucleic Acids Res., 40, D841–D846.
44. D’Eustachio,P. (2013) Pathway databases: making chemical and
biological sense of the genomic data ﬂood. Chem. Biol., 20,
629–635.
45. Haw,R. and Stein,L. (2012) Using the Reactome database. Curr.
Protoc. Bioinform., 38:8.7.1-8.7.23.
46. Chelliah,V., Laibe,C. and Le Nove` re,N. (2013) BioModels
Database: a repository of mathematical models of biological
processes. Methods Mol. Biol., 1021, 189–199.
47. Kitano,H. (2002) Computational systems biology. Nature, 420,
206–210.
48. A˚ rdal,C. and Røttingen,J.A. (2012) Open source drug discovery
in practice: a case study. PLoS Negl. Trop. Dis., 6, e1827.
49. Williams,A.J., Wilbanks,J. and Ekins,S. (2012) Why open drug
discovery needs four simple rules for licensing data and models.
PLoS Comp. Biol., 8, e1002706.
50. Willighagen,E.L., Waagmeester,A., Spjuth,O., Ansell,P.,
Williams,A.J., Tkachenko,V., Hastings,J., Chen,B. and Wild,D.J.
(2013) The ChEMBL database as linked open data.
J. Cheminform., 5, 23.
51. Chambers,J., Davies,M., Gaulton,A., Hersey,A., Velankar,S.,
Petryszak,R., Hastings,J., Bellis,L., McGlinchey,S. and
Overington,J.P. (2013) UniChem: a uniﬁed chemical
structure cross-referencing and identiﬁer tracking system.
J. Cheminform., 5, 3.
52. Spangenberg,T., Burrows,J.N., Kowalczyk,P., McDonald,S.,
Wells,T.N. and Willis,P. (2013) The open access malaria box: a
drug discovery catalyst for neglected diseases. PloS One, 8,
e62906.
53. Payton,A., Janko,C., Renn,O., Hardman,M. and EMTRAIN
Consortium. (2013) on-courseÕ
portal: a tool for in-service
training and career development for biomedical scientists.
Drug Discov. Today, 18, 803–806.
D24 Nucleic Acids Research, 2014, Vol. 42, Database issue
atMasarykUniversityonApril8,2014http://nar.oxfordjournals.org/Downloadedfrom
APPENDIX 1
EMBL-EBI’s mission
. To provide freely available data and bioinformatics
services to all facets of the scientiﬁc community.
. To contribute to the advancement of biology through
basic investigator-driven research.
. To provide advanced bioinformatics training to
scientists at all levels.
. Tohelpdisseminatecutting-edgetechnologiestoindustry.
. To coordinate biological data provision throughout
Europe.
APPENDIX 2
EMBL-EBI’s principles of service provision
Open: Our data and tools are freely available,
without restriction. The only exception is potentially
identiﬁable human genetic information, for which
access depends on research consent agreements.
Compatible: EMBL-EBI is a world leader in the
development of global bioinformatics standards,
which are key to data sharing.
Comprehensive: Thanks to our many data-sharing
agreements, EMBL-EBI resources are comprehensive
and up-to-date. We work with publishers to ensure
that biological data must be placed in a public
repository and cross-referenced in the relevant
publication.
Portable: All of our data and many of our software
systems can be downloaded and installed locally.
High quality: Our databases are enhanced through
annotation: highly qualiﬁed biologists add value to databases
by incorporating features of genes or proteins
from other sources, and automated annotation is
subjected to rigorous quality control.
Nucleic Acids Research, 2014, Vol. 42, Database issue D25
atMasarykUniversityonApril8,2014http://nar.oxfordjournals.org/Downloadedfrom