D10–D17 Nucleic Acids Research, 2021, Vol. 49, Database issue
Published online 23 October 2020
doi: 10.1093/nar/gkaa892

Database resources of the National Center for Biotechnology Information

Eric W. Sayers*, Jeffrey Beck, Evan E. Bolton, Devon Bourexis, James R. Brister, Kathi Canese, Donald C. Comeau, Kathryn Funk, Sunghwan Kim, William Klimke, Aron Marchler-Bauer, Melissa Landrum, Stacy Lathrop, Zhiyong Lu, Thomas L. Madden, Nuala O’Leary, Lon Phan, Sanjida H. Rangwala, Valerie A. Schneider, Yuri Skripchenko, Jiyao Wang, Jian Ye, Barton W. Trawick, Kim D. Pruitt and Stephen T. Sherry

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA

Received September 15, 2020; Revised September 25, 2020; Editorial Decision September 28, 2020; Accepted October 08, 2020

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 34 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface and NCBI datasets. Additional resources that were updated in the past year include PMC, Bookshelf, Genome Data Viewer, SRA, ClinVar, dbSNP, dbVar, Pathogen Detection, BLAST, Primer-BLAST, IgBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
INTRODUCTION

NCBI overview

The National Center for Biotechnology Information (NCBI), a center within the National Library of Medicine at the National Institutes of Health, was created in 1988 to develop information systems for molecular biology (1). In this article we provide a brief overview of the NCBI Entrez system of databases, followed by a summary of resources that we either introduced or significantly updated in the past year. We provide more complete discussions of NCBI resources on the home pages of individual databases, on the NCBI Learn page (https://www.ncbi.nlm.nih.gov/learn/) and in the NCBI Handbook (https://www.ncbi.nlm.nih.gov/books/NBK143764/).

The Entrez system

Entrez (2) is an integrated database retrieval system that provides access to a diverse set of 34 databases that together contain 3.0 billion records (Table 1 and Figure 1). Links to the web portal for each of these databases are provided on the Entrez global search page (https://www.ncbi.nlm.nih.gov/search/). Entrez supports text searching using simple Boolean queries, downloading of data in various formats and linking records between databases based on asserted relationships. The records retrieved in Entrez can be displayed in many formats and downloaded singly or in batches. An Application Programming Interface for Entrez functions (the E-utilities) is available, and detailed documentation is provided at https://eutils.ncbi.nlm.nih.gov/.

Data sources and collaborations

NCBI receives data from three sources: direct submissions from researchers, national and international collaborations or agreements with data providers and research consortia, and internal curation efforts. For example, NCBI manages the GenBank database (3) and participates with the EMBL-EBI European Nucleotide Archive (ENA) (4) and the DNA Data Bank of Japan (DDBJ) (5) as a partner in the International Nucleotide Sequence Database Collaboration (INSDC) (6).
Details about direct submission processes are available from the NCBI Submit page (www.ncbi.nlm.nih.gov/home/submit.shtml) and from the resource home pages (e.g. the GenBank page, www.ncbi.nlm.nih.gov/genbank/). More information about the various collaborations, agreements and curation efforts is also available through the home pages of the individual resources.

*To whom correspondence should be addressed. Tel: +1 301 496 2475; Fax: +1 301 480 9241; Email: sayers@ncbi.nlm.nih.gov

Published by Oxford University Press on behalf of Nucleic Acids Research 2020. This work is written by (a) US Government employee(s) and is in the public domain in the US.

Downloaded from https://academic.oup.com/nar/article/49/D1/D10/5937080 by guest on 24 February 2021

Table 1. The Entrez Databases (as of 9 September 2020)

Database                    Records        Description

Literature
  PubMed                    31 471 600     scientific and medical abstracts/citations
  PubMed Central            6 447 271      full-text journal articles
  NLM catalog               1 619 856      index of NLM collections
  Books                     825 385        books and reports
  MeSH                      300 500        ontology used for PubMed indexing

Genomes
  Nucleotide                429 731 711    DNA and RNA sequences
  BioSample                 14 628 076     descriptions of biological source materials
  SRA                       11 807 161     high-throughput DNA and RNA sequence read archive
  Taxonomy                  2 401 136      taxonomic classification and nomenclature catalog
  Assembly                  837 406        genome assembly information
  BioProject                458 893        biological projects providing data to NCBI
  Genome                    55 580         genome sequencing projects by organism
  BioCollections            8 138          museum, herbaria and other biorepository collections

Genes
  GEO Profiles              128 414 055    gene expression and molecular abundance profiles
  Gene                      28 377 759     collected information about gene loci
  GEO datasets              4 002 373      functional genomics studies
  PopSet                    350 627        sequence sets from phylogenetic and population studies
  HomoloGene                141 268        homologous gene sets for selected organisms

Genetics
  SNP                       720 643 623    short genetic variations
  dbVar                     6 030 887      genome structural variation studies
  ClinVar                   845 008        human variations of clinical significance
  MedGen                    335 277        medical genetics literature and links
  GTR                       76 814         genetic testing registry
  dbGaP                     1 397          genotype/phenotype interaction studies

Proteins
  Protein                   874 272 642    protein sequences
  Identical protein groups  329 946 078    protein sequences grouped by identity
  Protein clusters          1 137 329      sequence similarity-based protein clusters
  Structure                 167 650        experimentally-determined biomolecular structures
  Sparcle                   149 462        conserved domain architectures
  Conserved domains         59 951         conserved protein domains

Chemicals
  PubChem substance         285 048 146    deposited substance and chemical information
  PubChem compound          111 325 418    chemical information with structures, information and links
  PubChem BioAssay          1 229 071      bioactivity screening studies
  BioSystems                983 968        molecular pathways with links to genes, proteins and chemicals

RECENT DEVELOPMENTS

Literature updates

PubMed. After previewing an updated version of PubMed in 2019, we activated this updated version as the default system in May 2020. Among the numerous enhancements is a responsive layout that offers better support for accessing PubMed content on increasingly popular small-screen devices such as mobile phones and tablets. The interface is compatible with any screen size and provides a fresh, consistent look and feel throughout the application, no matter how one accesses it. Search results can now be sorted using ‘Best Match’ ordering that employs a machine learning algorithm to help users find relevant citations quickly (7). Search results also include ‘snippets’, highlighted text fragments from the article abstract that are selected based on their relatedness to the query. These snippets give users additional information to help them decide if an article is useful. Additional improvements to the interface make it easier to discover related content such as similar articles, references and citations.
Since 2012 PubMed has allowed users to search for a distinct author, and not merely for an author name, through an automatic author name disambiguation algorithm (8). We recently enhanced this algorithm by leveraging the significant growth of ORCID use in PubMed articles. Users can search PubMed directly with ORCIDs using the following syntax: 0000-0001-6166-3199[auid]. Additionally, to further strengthen how PubMed handles synonyms, we developed and implemented an updated algorithm to obtain word synonym pairs that improve PubMed indexing and retrieval (9).

Figure 1. Annual growth rates of the number of records in each Entrez database as of 9 September 2020 (bar chart; vertical axis: annual growth, 0% to 90%; horizontal axis: Entrez databases).

The updated version of PubMed takes advantage of several new technologies to improve the user experience. The underlying document data indexed in the updated version is a merger of content from PubMed, Bookshelf and PubMed Central (PMC). This combined dataset allows us to display relevant information not previously available in a PubMed record, such as reference citations from PMC. While legacy PubMed limited the number of variants for a wildcard (‘*’) search, PubMed is now capable of unlimited wildcard searches thanks to Solr (https://lucene.apache.org/solr/), the open-source enterprise search system that PubMed now uses for document indexing.
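The ORCID and wildcard search syntax described above can also be used programmatically. The following is a minimal sketch, assuming the standard E-utilities ESearch endpoint (esearch.fcgi) and its db, term and retmax parameters; the helper names orcid_term and esearch_url are illustrative and are not part of any NCBI library. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service (assumed standard endpoint).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def orcid_term(orcid: str) -> str:
    """Return a PubMed query term that searches by ORCID using the [auid] tag."""
    return f"{orcid}[auid]"


def esearch_url(term: str, db: str = "pubmed", retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for the given query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-encodes characters such as '[' , ']' and '*'.
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # ORCID search, using the identifier quoted in the text.
    print(esearch_url(orcid_term("0000-0001-6166-3199")))
    # Wildcard search combined with a Boolean operator (hypothetical query).
    print(esearch_url("coronavir* AND protease"))
```

Keeping URL construction separate from the network call makes the query easy to inspect before it is sent, for example with any HTTP client of the reader's choice.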
Users will find that PubMed now has greater scalability and reliability, provided not only by Solr, but also by the MongoDB storage solution and the modern cloud architecture that together ensure both redundancy between data centers and trustworthy backup environments. When visiting PubMed, users will enjoy a modern web experience using the latest web technologies and standards, all provided by the Django web framework.

PubMed Central (PMC). PMC continued to expand access to biomedical and life science literature over the past year, with the corpus now including more than 6 million journal articles and author manuscripts. Reflecting the NLM’s ongoing commitment to public access to research results supported by the NIH and other funding partners, we released a new NIH Manuscript Submission (NIHMS) system in January 2020. The new NIHMS system streamlines the author manuscript submission process to PMC and offers more transparent options to aid authors and investigators in avoiding processing delays and ensuring timely compliance with NIH policy. In the first 6 months after release, we received nearly 40 000 successful submissions (https://www.nihms.nih.gov/about/statistics/).

PMC launched the Public Health Emergency COVID-19 Initiative (https://www.ncbi.nlm.nih.gov/pmc/about/covid-19/) in March 2020, which helped enable the creation of the COVID-19 Open Research Dataset (CORD-19) hosted by the Allen Institute. This initiative resulted from a call made by the national science and technology advisors of a dozen countries, including the US, to support ongoing public health emergency response efforts. Specifically, this call asked publishers and societies to agree voluntarily to make their publications and supporting data related to COVID-19 and the novel coronavirus immediately accessible in PMC and other appropriate public repositories.
As of August 2020, this initiative included more than 50 publishers and has added or updated the licenses on 80 000 articles in PMC to support secondary re-use and analysis.

On 9 June 2020, NLM launched a pilot project to test the viability of allowing users to obtain from PMC preprints resulting from NIH-funded research (https://www.ncbi.nlm.nih.gov/pmc/about/nihpreprints/). The primary goal of this NIH Preprint Pilot is to explore approaches to increase the discoverability of early NIH research results. Following standard NLM practice, citations for these preprint records are available in PubMed to increase the discoverability of this content. To ensure transparency, large banners clearly identify these preprint records in PMC and PubMed. The banners explain that the papers have not been peer reviewed, and link to information about the pilot for additional context. The pilot will run for a minimum of 12 months and will focus on preprints that relate to the current COVID-19 pandemic. Lessons we learn during that time will inform future NLM efforts involving preprints.

Bookshelf. The NCBI Bookshelf provides free online access to over 8500 books and documents in life science and healthcare from over 150 content providers. In the past year, we migrated the content management of two large full-text toxicology databases into its archive. The first, LactMed, is an NLM database of over 1500 peer-reviewed summaries containing information on drugs and other chemicals to which breastfeeding mothers may be exposed. It also includes information on the levels of such substances in breast milk and infant blood, and the possible adverse effects in the nursing infant.
The second, LiverTox, is an NIDDK database of over 1100 peer-reviewed documents containing information on the diagnosis, cause, frequency, clinical patterns and management of liver injury attributable to both prescription and nonprescription medications and also selected herbal and dietary supplements. Migrating these full-text databases into the Bookshelf increases their discoverability in two ways: the citations for each full-text LactMed and LiverTox summary record are now available in PubMed, where they link to the full-text in Bookshelf; and the toxicological data for these records are seamlessly integrated in PubChem with the scientific evidence summarized and cited in the full-text documents in Bookshelf. Users can still search within LactMed or LiverTox, and the full-text, machine-readable records for both are available through the NLM LitArch Open Access subset for data mining and reuse.

Genome updates

NCBI Datasets. We have continued to develop improvements to sequence search through the introduction of Datasets, a new resource that enables users to easily gather content from across NCBI databases (https://www.ncbi.nlm.nih.gov/datasets/). Developed with the FAIR (findability, accessibility, interoperability and reusability) principles of data management in mind, Datasets allows users to create custom datasets through web, RESTful API, and command-line interfaces, and download them in structured file packages. As of this writing, Datasets supports queries for genomes and genes across a wide range of taxonomies, including quick access to SARS-CoV-2 genomes and proteins. Users searching for genomic data can assemble a package consisting of genome, transcript and protein sequences as well as annotations. The package will also include a data report with rich metadata. The Datasets web interface (Figure 2) allows users to browse genomes on a taxonomic tree and choose any set of complete eukaryotic genomes in the NCBI Assembly database.
The API and command-line interfaces provide access to prokaryotic and viral genomes in addition to eukaryotic genomes, and support searches with taxonomic identifiers or assembly accessions.

For gene searches, Datasets allows users to construct data tables based on either an NCBI Gene ID or a combination of organism and gene symbol. On the web interface, users can continue editing the set of genes on the table and then select a set of desired data columns before downloading the data. The API and command-line interfaces provide similar tables programmatically. A Python library and corresponding Jupyter notebooks allow users to explore and learn the Datasets API, and these are available on GitHub (https://github.com/ncbi/datasets).

Graphical sequence viewers. NCBI’s graphical sequence viewer tools visualize sequences, annotations and experimental data alignments archived in NCBI databases. These viewers include the NCBI Sequence Viewer (SV) and the Multiple Sequence Alignment Viewer (MSAV). SV is available as a standalone application (https://www.ncbi.nlm.nih.gov/projects/sviewer/) and also appears as a fully-functional embed on many NCBI pages, including Gene and SNP records and the NCBI Variation Viewer. SV is also the graphical viewer for Nucleotide and Protein records and the BLAST (10) and Primer-BLAST (11) results pages. MSAV (https://www.ncbi.nlm.nih.gov/projects/msaviewer/) displays multiple sequence alignments created by NCBI BLAST, COBALT and NCBI Virus pages, and also displays custom alignments from researchers.

In the last year, we have further enhanced the sequence and data download capabilities within SV (https://www.ncbi.nlm.nih.gov/tools/sviewer/). Users can now download gene, feature and NCBI SNP annotation data from the graphical view tool in multiple common formats that include GFF3, VCF and BED. Users can also copy short strings of sequence data directly to the clipboard and can also download larger ranges in FASTA or GenBank flat file format.
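Because the annotation download formats listed above use different coordinate conventions, the same feature is written with different numbers in each file. The sketch below is illustrative only and is not part of the Sequence Viewer. It assumes the conventions given in the public format specifications (GFF3 uses 1-based, inclusive start and end positions, whereas BED uses a 0-based start and an exclusive end), and the exon interval and sequence name are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval; the coordinate convention depends on the source format."""
    seqid: str
    start: int
    end: int


def gff3_to_bed(feature: Interval) -> Interval:
    """Convert 1-based inclusive (GFF3) coordinates to 0-based half-open (BED)."""
    return Interval(feature.seqid, feature.start - 1, feature.end)


def bed_to_gff3(feature: Interval) -> Interval:
    """Convert 0-based half-open (BED) coordinates to 1-based inclusive (GFF3)."""
    return Interval(feature.seqid, feature.start + 1, feature.end)


def bed_length(feature: Interval) -> int:
    """Length of a BED interval: end minus start, with no off-by-one correction."""
    return feature.end - feature.start


if __name__ == "__main__":
    exon_gff3 = Interval("chr1", 1001, 1100)  # hypothetical 100-base exon
    exon_bed = gff3_to_bed(exon_gff3)
    print(exon_bed)
    print(bed_length(exon_bed))
```

Only the start coordinate changes between the two conventions, which is why mixing them up typically produces features that are shifted or lengthened by exactly one base.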
Improved tooltips now report positional information that changes dynamically depending on the position of the cursor. For gene features, including transcripts and proteins, the tooltip provides the transcript, CDS, and/or protein position and exon or intron number. For BAM, SRA or BLAST alignments, the tooltip information also includes the ability to view any unaligned data, including insertions and 5′ and 3′ tails.

Recent enhancements to NCBI’s MSAV (https://www.ncbi.nlm.nih.gov/tools/msaviewer/) allow users to re-sort rows to find sequences of interest and then customize their display further by ‘hiding’ undesired sequences (such as partial or duplicate alignments) from view or from their SVG/PDF image output.

Genome data viewer. NCBI’s flagship genome browser, Genome Data Viewer (GDV) (https://www.ncbi.nlm.nih.gov/genome/gdv/), integrates the SV graphical display with a robust search/retrieval/analysis console. Users can view their own data next to NCBI tracks as references. In our endeavor to better support our users’ analyses and research needs, we released numerous enhancements in the past year. Of particular note is a dynamic sidebar that users can show or hide. Hiding the sidebar allows the graphical display to stretch for a wider view. Additional enhancements now allow navigation by assemblies, chromosomes and components directly from within the browser. Users interested in non-human variation data can now add SNP data from the European Variation Archive as tracks directly to the view using the tracks configuration menu in SV. Researchers can also add their own or third-party data by streaming data hosted on external URLs. GDV also supports connectivity to UCSC-style track hubs directly from the EBI Track Hub Registry. GDV supports visualization of most common bioinformatics formats including GFF3, bigWig, multiWig, bigBED and indexed BAM and VCF files.
Helpful tutorials to get started are available on the NCBI YouTube channel in the ‘NCBI Genome Data Viewer’ playlist. Additional documentation is available at https://www.ncbi.nlm.nih.gov/genome/gdv/browser/help/.

Figure 2. Landing page for the new NCBI Datasets product (https://www.ncbi.nlm.nih.gov/datasets/) that provides packaged downloads of genomic datasets using either a web interface, an API, or a UNIX/LINUX command-line tool.

Sequence Read Archive. NCBI maintains the NIH Sequence Read Archive (SRA), an archival database designed to support storage, retrieval and analysis of next-generation nucleotide sequence data. This archive now includes 8.8 petabytes of publicly available data and another 4.6 petabytes of controlled-access dbGaP data. While the archive holds tremendous promise for biomedical research, the sheer size of this dataset makes it difficult to store, retrieve and analyze. NCBI remains committed to continuing to expand SRA and improving access to the archive based on FAIR data principles. As part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation and Sustainability Initiative (https://datascience.nih.gov/strides/), NCBI is now maintaining the entire SRA on two commercial cloud platforms, Amazon Web Services (AWS) and Google Cloud Platform (GCP). The cloud environment offers many advantages for the transfer and analysis of data (12), and the availability of SRA on cloud platforms should facilitate large-scale computational operations and collaborations. This idea has been supported by early pilot experiments in which thousands of metagenomic samples were analyzed for organismal content using cloud-adapted workflows and software (13).
Additional examples are the COVID-focused datasets (including source and normalized SRA file formats) recently added to the AWS Public Dataset Program. These datasets provide researchers easy, no-cost access to more than 13 000 SRA runs that include Coronaviridae content identified by a k-mer-based approach using the SRA Taxonomy Analysis Tool.

We have made several additional updates that allow more effective use of cloud-hosted SRA data. By providing SRA run metadata and BioSample data on GCP BigQuery and AWS Athena, we give researchers more ways to identify sequence sets of interest. In addition to maintaining originally submitted source and SRA formatted data in the cloud, we are introducing normalized data formats that bin base quality scores to binary read assessments or align reads to references. These reduce file size and can expedite certain computations. We have also updated the SRA data location service and toolkit (https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit) to retrieve data from cloud locations in the desired data format.

Pathogens. The NCBI Pathogen Detection Project (https://www.ncbi.nlm.nih.gov/pathogens/) helps public health scientists investigate foodborne disease outbreaks by integrating pathogen genomic sequences obtained from cultured bacterial isolates and quickly clustering and identifying related sequences (14). As of August 2020, over 600 000 pathogen isolates covering 31 bacterial taxa and one emerging fungal pathogen, Candida auris, are actively being analyzed. We make these analysis results available in the Isolates Browser on a daily basis (https://www.ncbi.nlm.nih.gov/pathogens/isolates).
As part of our effort to improve the compliance of the pipeline results (15) with FAIR principles, we annotate the generated assemblies using PGAP (16), submit them to GenBank and incorporate them into the NCBI Assembly resource. These assemblies link back to the Isolates Browser and also to the SNP Tree Viewer if that particular isolate is a member of a clonally related cluster. The Isolates Browser allows users to subset isolates and then download the assembled sequence and/or annotation of the submitted assemblies.

Antimicrobial resistance (AMR) resources. The Pathogen Detection team has continued to improve and release updated resources for antimicrobial resistance (AMR) (https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/). The Reference Gene Catalog (https://www.ncbi.nlm.nih.gov/pathogens/isolates#/refgene/) now provides a searchable and browsable interface for two sets of genes: a ‘core’ AMR reference set of acquired genes and proteins as well as point mutations conferring AMR, and a ‘plus’ set of genes related to stress responses (acid, metal, heat and biocide) and virulence. The 16 July 2020 release included 6428 total proteins (5588 AMR proteins, 210 stress response proteins and 630 virulence proteins) as well as 682 point mutations. We also released an updated version of the AMRFinder software (17) called AMRFinderPlus that uses the reference set above (https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/). All isolates in the Pathogen Detection Isolates Browser other than C. auris are analyzed with AMRFinderPlus, and the three categories of genes (AMR, stress and virulence) are available in the Isolates Browser. Currently over 590 000 isolates have at least one identified AMR gene, over 510 000 have at least one identified stress response gene and over 210 000 have at least one identified virulence gene.
For the subset of isolates from the Isolates Browser that are both deposited in GenBank and that have genes identified by AMRFinderPlus, a new tabular viewer called the Microbial Browser for Genetic and Genomic Elements is available (MicroBIGG-E, https://www.ncbi.nlm.nih.gov/pathogens/isolates#/microbigge/). Every row in the MicroBIGG-E viewer displays a gene or point mutation that has been identified. This new interface provides easy access to the gene and contig sequences to facilitate further analyses. The Isolates Browser and MicroBIGG-E allow cross-browser selection, so that selections of isolates in the Isolates Browser enable selections of genes they encode in MicroBIGG-E and vice versa.

Genetics updates

ClinVar. ClinVar is an archive of submitted reports of relationships among human variations and phenotypes with supporting evidence. In December 2019, ClinVar reached the milestone of 1 million submitted records, representing more than half a million variants. To support ClinVar’s continued growth, we have improved the submission processing pipeline so that submitted data can be published faster. From October 2019 to August 2020, we added validation to the ClinVar Submission Portal so that all of the required fields in a file submission are validated before the file is submitted to ClinVar. As of August 2020, submitters also have the option to stop a submission and correct errors immediately, or to submit all records that pass validation and receive a report of any failures to correct later.

In January 2020, we added a new feature to ClinVar that allows a user to ‘follow’ a particular variant and be notified if the overall clinical interpretation in ClinVar changes, for example from a pathogenic category to a non-pathogenic one. This feature makes it easier for a laboratory to become aware of variants that may need to be re-evaluated, and for clinicians to know when they should contact their clinical testing laboratory and/or patient with new information.
dbSNP. The Database of Single Nucleotide Polymorphisms (dbSNP) is a repository of human genomic variations and frequency data that includes both common and rare single-nucleotide variations and other small-scale variations. In 2020 dbSNP released the new NCBI Allele Frequency Aggregator (ALFA) dataset (https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/). We calculated the ALFA frequency data from 98 500 dbGaP subjects for whom genotypes were available. Aggregated from 551 billion genotypes, the results include allele counts and frequencies for 443 million known variations and 4 million novel ones. Future ALFA releases will include additional dbGaP studies, and we expect the dataset to expand to over a billion variants from millions of subjects. The ALFA data are available as part of the dbSNP regular release (https://ftp.ncbi.nih.gov/snp/latest_release/) and also as a separate download (https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/). The ALFA data are also accessible through an API (https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/#api-queries).

dbVar. In the past year, dbVar added 20 new human structural variation studies, bringing the total in the database to 190 studies containing 6 million regions and 36 million variants. Some notable studies are the Decipher dataset (https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd183), gnomAD (https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd166/) and the NCBI Curated Common Structural Variants dataset (https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd186) that includes frequency data. dbVar continues to make it easier to find and use structural variation data by making selected datasets available on TrackHub for viewing in the NCBI GDV and other genome browsers, such as the UCSC browser. Tracks are available for the structural variants imported from ClinVar (https://www.ncbi.nlm.nih.gov/dbvar/content/clinvar_summary/#homepage) and for the NCBI Curated Common Structural Variants (nstd186).
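The kind of aggregation that underlies allele frequency data such as the ALFA dataset described in the dbSNP paragraph above can be illustrated with a short calculation: allele counts from several studies are summed, and each allele's frequency is its summed count divided by the total number of alleles observed at that site. This is a simplified sketch under that assumption, with hypothetical counts; it is not the ALFA pipeline and does not reproduce any dbSNP or dbGaP data.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_allele_counts(studies: Iterable[Dict[str, int]]) -> Counter:
    """Sum per-allele counts across studies for a single variant site."""
    total: Counter = Counter()
    for counts in studies:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total


def allele_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert allele counts to frequencies (count / total alleles observed)."""
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no genotyped alleles at this site")
    return {allele: c / n for allele, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical allele counts for one site in two studies.
    study_a = {"A": 150, "G": 50}
    study_b = {"A": 90, "G": 110}
    pooled = aggregate_allele_counts([study_a, study_b])
    print(dict(pooled))
    print(allele_frequencies(pooled))
```

Pooling counts before dividing weights each study by the number of alleles it contributes, which differs from averaging the per-study frequencies.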
BLAST updates

BLAST in the cloud. In the past year we have made some important changes to our BLAST databases. We released a new version of BLAST databases (v5) that are taxonomically aware, allowing users to limit a stand-alone BLAST+ search (18) by taxonomy using information stored in the database. To take advantage of this feature, users will need a recent version of the BLAST+ package (2.9.0 release or later). The older BLAST database version (v4) has been deprecated. The new BLAST databases are available on the NCBI FTP site as well as on GCP and AWS cloud providers, and all three sites now offer the same 23 databases. These databases range from a Betacoronavirus collection (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&BLAST_SPEC=Betacoronavirus) to RefSeq representative genomes for eukaryotes, prokaryotes and viruses. Also included are the default NCBI nucleotide collection (nt), the non-redundant protein (nr) database, and genome assemblies for human and mouse. On AWS and GCP, users can access these databases with the BLAST+ Docker package described at https://github.com/ncbi/blast_plus_docs. Additionally, four databases based on the NCBI Targeted Loci Project (including collections of 18S, 28S and prokaryote 16S ribosomal RNA genes along with a database of fungal internal transcribed spacers) have been added to the BLAST webpage and are also available as stand-alone databases on the NCBI FTP site and on GCP and AWS.

Primer-BLAST. Primer-BLAST (11) now allows users to design primers that are common for a group of highly similar sequences. This makes it easier for researchers to perform tasks such as amplifying multiple transcript variants for a single gene or detecting a group of highly related bacterial strains.
Another new feature allows users to set the maximal nucleotide match at the 3′ end of exon-exon junctions of transcripts to address non-specific polymerase chain reaction amplification.

IgBLAST. We have added several new features to IgBLAST (19). IgBLAST now supports annotations of the FWR4 region so that the entire V region can be annotated. Another new feature is that IgBLAST indicates if a sequence has a complete V(D)J region so that users can keep track of sequences with a full V region. Additional new functions include allowing users to extend J gene alignments at the 3′ end and to analyze sequences from an organism of their choice.

Protein updates

NCBI released an updated version (2.19.0) of iCn3D (20), a three-dimensional (3D) molecular structure viewer that runs directly in web browsers. Interactive iCn3D views are embedded in structure summary pages of NCBI’s Molecular Modeling Database (MMDB), and iCn3D visualizes the results of 3D structure comparisons computed by VAST+ as well as pairwise sequence-to-structure alignments computed by protein BLAST. iCn3D simultaneously displays 3D structures, 2D interaction schematics, alignments and protein/nucleotide sequences, as well as sequence annotations such as functional sites and conserved domain footprints. Sequence variations can be displayed for some human protein structures, and recently we have made sequence variation data accessible for selected structures of SARS-CoV-2 proteins. Other recently added features include extended visualization of 2D interaction networks between proteins and ligands or other proteins, visualization of electrostatic potentials as computed by Delphi (21) and visualization of membrane bilayer location relative to membrane protein structures as provided by OPM (22). Finally, a Jupyter notebook widget version of iCn3D (called icn3dpy) enables users to view 3D structures in a Jupyter environment. iCn3D is available at https://github.com/ncbi/icn3d, and novel features are demonstrated on the gallery page at https://www.ncbi.nlm.nih.gov/Structure/icn3d/icn3d.html#gallery.

Chemical updates

PubChem (23–25) (pubchem.ncbi.nlm.nih.gov) is a public chemical data repository at NCBI. Over the past year, PubChem expanded the scope of its information content by integrating data from more than 50 new data sources. Notably, the World Intellectual Property Organization (WIPO) provided PubChem with more than 16 million chemical structures searchable in its patent database called PATENTSCOPE (https://go.usa.gov/xdhfK). In addition, SpringerMaterials provided links to hundreds of chemical and physical properties for more than 32 000 compounds, helping users to quickly locate articles for the property in question (https://go.usa.gov/xvqfq). Another new source of data came from ToxNet, a collection of NLM databases that provided a wide range of toxicological information. These databases were retired last year and their content was integrated into PubChem (https://go.usa.gov/xfwyU). They include the Genetic Toxicology Data Bank (GeneTox), the Chemical Carcinogenesis Research Information System (CCRIS), the Hazardous Substances Data Bank (HSDB), ChemIDplus, LactMed and LiverTox. Finally, in response to the COVID-19 pandemic, PubChem created a special data collection containing data related to COVID-19 and SARS-CoV-2 (https://go.usa.gov/xfwmG). We gathered the data in this collection from authoritative and curated sources, and a link to the collection appears on the PubChem home page.

FOR FURTHER INFORMATION

The resources described here include documentation, other explanatory materials and references to collaborators and data sources on their respective web sites. An alphabetical list of NCBI resources is available from a link above the category list on the left side of the NCBI home page. The NCBI Help Manual and the NCBI Handbook (www.ncbi.nlm.nih.gov/books/NBK143764/), both available as links in the common page footer, describe the principal NCBI resources in detail. The NCBI Learn page (www.ncbi.nlm.nih.gov/learn/) provides links to documentation, tutorials, webinars, courses and upcoming conference exhibits. A variety of video tutorials is available on the NCBI YouTube channel, which can be accessed through links in the standard NCBI page footer. User-support staff are available to answer questions at info@ncbi.nlm.nih.gov, and users can view support articles at https://support.nlm.nih.gov. Updates on NCBI resources and database enhancements are described on the NCBI Insights blog (https://ncbiinsights.ncbi.nlm.nih.gov/), NCBI social media sites (Facebook, Twitter and LinkedIn) and the several mailing lists and RSS feeds that provide updates on services and databases. Links to these resources are in the NCBI page footer and on NCBI Insights.

ACKNOWLEDGEMENTS

The authors would like to thank all of the NCBI staff who through their dedicated efforts continue to allow NCBI to provide our full collection of services to the community.

FUNDING

Funding for open access charge: National Institutes of Health. Funding to pay the Open Access publication charges for this article was provided by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.

Conflict of interest statement. None declared.

REFERENCES

1. Sayers,E.W., Beck,J., Brister,J.R., Bolton,E.E., Canese,K., Comeau,D.C., Funk,K., Ketter,A., Kim,S., Kimchi,A. et al. (2020) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 48, D9–D16.
2. Schuler,G.D., Epstein,J.A., Ohkawa,H. and Kans,J.A. (1996) Entrez: molecular biology database and retrieval system. Methods Enzymol., 266, 141–162.
3.
Sayers,E.W., Cavanaugh,M., Clark,K., Ostell,J., Pruitt,K.D. and Karsch-Mizrachi,I. (2020) GenBank. Nucleic Acids Res., 48, D84–D86. 4. Amid,C., Alako,B.T.F., Balavenkataraman Kadhirvelu,V., Burdett,T., Burgin,J., Fan,J., Harrison,P.W., Holt,S., Hussein,A., Ivanov,E. et al. (2020) The European nucleotide archive in 2019. Nucleic Acids Res., 48, D70–D76. 5. Ogasawara,O., Kodama,Y., Mashima,J., Kosuge,T. and Fujisawa,T. (2020) DDBJ database updates and computational infrastructure enhancement. Nucleic Acids Res., 48, D45–D50. 6. Karsch-Mizrachi,I., Takagi,T., Cochrane,G. and International Nucleotide Sequence Database, C. (2018) The international nucleotide sequence database collaboration. Nucleic Acids Res., 46, D48–D51. 7. Fiorini,N., Canese,K., Starchenko,G., Kireev,E., Kim,W., Miller,V., Osipov,M., Kholodov,M., Ismagilov,R., Mohan,S. et al. (2018) Best Match: New relevance search for PubMed. PLoS Biol., 16, e2005343. 8. Liu,W., Islamaj Dogan,R., Kim,S., Comeau,D.C., Kim,W., Yeganova,L., Lu,Z. and Wilbur,W.J. (2014) Author Name Disambiguation for PubMed. J. Assoc. Inf. Sci. Technol., 65, 765–781. 9. Yeganova,L., Kim,S., Chen,Q., Balasanov,G., Wilbur,W.J. and Lu,Z. (2020) Better synonyms for enriching biomedical search. J. Am. Med. Inform. Assoc., in press. 10. Boratyn,G.M., Camacho,C., Cooper,P.S., Coulouris,G., Fong,A., Ma,N., Madden,T.L., Matten,W.T., McGinnis,S.D., Merezhuk,Y. et al. (2013) BLAST: a more efficient report with usability improvements. Nucleic Acids Res., 41, W29–W33. 11. Ye,J., Coulouris,G., Zaretskaya,I., Cutcutache,I., Rozen,S. and Madden,T.L. (2012) Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics, 13, 134. 12. Langmead,B. and Nellore,A. (2018) Cloud computing for genomic data analysis and collaboration. Nat. Rev. Genet., 19, 208–219. 13. Connor,R., Brister,R., Buchmann,J.P., Deboutte,W., Edwards,R., Marti-Carreras,J., Tisza,M., Zalunin,V., Andrade-Martinez,J., Cantu,A. et al. 
(2019) NCBI’s virus discovery Hackathon: Engaging research communities to identify cloud infrastructure requirements. Genes (Basel), 10, 714. 14. NCBI Resource Coordinators. (2017) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 45, D12–D17. 15. Sayers,E.W., Agarwala,R., Bolton,E.E., Brister,J.R., Canese,K., Clark,K., Connor,R., Fiorini,N., Funk,K., Hefferon,T. et al. (2019) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 47, D23–D28. 16. Tatusova,T., DiCuccio,M., Badretdin,A., Chetvernin,V., Nawrocki,E.P., Zaslavsky,L., Lomsadze,A., Pruitt,K.D., Borodovsky,M. and Ostell,J. (2016) NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res., 44, 6614–6624. 17. Feldgarden,M., Brover,V., Haft,D.H., Prasad,A.B., Slotta,D.J., Tolstoy,I., Tyson,G.H., Zhao,S., Hsu,C.H., McDermott,P.F. et al. (2019) Validating the AMRFinder tool and resistance gene database by using antimicrobial resistance Genotype-Phenotype correlations in a collection of Isolates. Antimicrob. Agents Chemother., 63, e00483-19. 18. Camacho,C., Coulouris,G., Avagyan,V., Ma,N., Papadopoulos,J., Bealer,K. and Madden,T.L. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421. 19. Ye,J., Ma,N., Madden,T.L. and Ostell,J.M. (2013) IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res., 41, W34–W40. 20. Wang,J., Youkharibache,P., Zhang,D., Lanczycki,C.J., Geer,R.C., Madej,T., Phan,L., Ward,M., Lu,S., Marchler,G.H et al. (2020) iCn3D, a web-based 3D viewer for sharing 1D/2D/3D representations of biomolecular structures. Bioinformatics, 36, 131–135. 21. Li,L., Li,C., Sarkar,S., Zhang,J., Witham,S., Zhang,Z., Wang,L., Smith,N., Petukh,M. and Alexov,E. (2012) DelPhi: a comprehensive suite for DelPhi software and associated resources. BMC Biophys., 5, 9. 22. Lomize,M.A., Pogozheva,I.D., Joo,H., Mosberg,H.I. and Lomize,A.L. 
(2012) OPM database and PPM web server: resources for positioning of proteins in membranes. Nucleic Acids Res., 40, D370–D376. 23. Kim,S. (2016) Getting the most out of PubChem for virtual screening. Expert. Opin. Drug Discov., 11, 843–855. 24. Kim,S., Thiessen,P.A., Bolton,E.E., Chen,J., Fu,G., Gindulyte,A., Han,L., He,J., He,S., Shoemaker,B.A et al. (2016) PubChem substance and compound databases. Nucleic Acids Res., 44, D1202–D1213. 25. Kim,S., Chen,J., Cheng,T., Gindulyte,A., He,J., He,S., Li,Q., Shoemaker,B.A., Thiessen,P.A., Yu,B et al. (2019) PubChem 2019 update: improved access to chemical data. Nucleic Acids Res., 47, D1102–D1109. Downloadedfromhttps://academic.oup.com/nar/article/49/D1/D10/5937080bygueston24February2021