Perspective

Perspectives on ENCODE

https://doi.org/10.1038/s41586-020-2449-8
Received: 7 December 2019
Accepted: 5 May 2020
Published online: 29 July 2020
Open access

The ENCODE Project Consortium*, Michael P. Snyder1,2✉, Thomas R. Gingeras3, Jill E. Moore4, Zhiping Weng4,5,6, Mark B. Gerstein7, Bing Ren8,9, Ross C. Hardison10, John A. Stamatoyannopoulos11,12,13, Brenton R. Graveley14, Elise A. Feingold15, Michael J. Pazin15, Michael Pagan15, Daniel A. Gilchrist15, Benjamin C. Hitz1, J. Michael Cherry1, Bradley E. Bernstein16, Eric M. Mendenhall17,18, Daniel R. Zerbino19, Adam Frankish19, Paul Flicek19 & Richard M. Myers18

The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression1. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.

The ENCODE Project was launched in 2003, as the first nearly complete human genome sequence was reported2. At that time, our understanding of the human genome was limited. For example, although 5% of the genome was known to be under purifying selection in placental mammals3,4, our knowledge of specific elements, particularly with regard to non-protein-coding genes and regulatory regions, was restricted to a few well-studied loci2,5.
ENCODE commenced as an ambitious effort to comprehensively annotate the elements in the human genome, such as genes, control elements, and transcript isoforms, and was later expanded to annotate the genomes of several model organisms. Mapping assays identified biochemical activities and thus candidate regulatory elements.

Analyses of the human genome in ENCODE proceeded in successive phases (Extended Data Fig. 1). Phase I (2003-2007) interrogated a specified 1% of the human genome in order to evaluate emerging technologies6. Half of this 1% was in regions of high interest, and the other half was chosen to sample the range of genomic features (such as G+C content and genes). Microarray-based assays were used to map transcribed regions, open chromatin, and regions associated with transcription factors and histone modification in a wide variety of cell lines, and these assays began to reveal the basic organizational features of the human genome and transcriptome.

Phase II (2007-2012) introduced sequencing-based technologies (for example, chromatin immunoprecipitation with sequencing (ChIP-seq) and RNA sequencing (RNA-seq)) that interrogated the whole human genome and transcriptome7. General assays such as transcript, open-chromatin and histone modification mapping were used on a wide variety of cell lines, while more specific assays, such as mapping transcription factor binding regions, were performed extensively on a smaller number of cell lines to provide detailed annotations on, and to investigate the relationships of, many regulatory proteins across the genome. Transcriptome analysis of subcellular compartments (the nucleus, cytosol and subnuclear compartments) of these cells enabled the locations of transcripts to be analysed7.

ENCODE phase III

ENCODE 3 (2012-2017) expanded production and added new types of assays8 (Fig. 1, Extended Data Fig. 1), which revealed landscapes of RNA binding and the 3D organization of chromatin via methods such as chromatin interaction analysis by paired-end tagging (ChIA-PET) and Hi-C chromosome conformation capture. Phases 2 and 3 delivered 9,239 experiments (7,495 in human and 1,744 in mouse) in more than 500 cell types and tissues, including mapping of transcribed regions and transcript isoforms, regions of transcripts recognized by RNA-binding proteins, transcription factor binding regions, and regions that harbour specific histone modifications, open chromatin, and 3D chromatin interactions. The results of all of these experiments are available at the ENCODE portal (http://www.encodeproject.org).

These efforts, combined with those of related projects and many other laboratories, have produced a greatly enhanced view of the human genome (Fig. 2), identifying 20,225 protein-coding and 37,595 noncoding genes (Fig. 2a), 2,157,387 open chromatin regions, 750,392 regions with modified histones (mono-, di- or tri-methylation of histone H3 at lysine 4 (H3K4me1, H3K4me2 or H3K4me3), or acetylation of histone H3 at lysine 27 (H3K27ac)), 1,224,154 regions bound by transcription factors and chromatin-associated proteins (Fig. 2c), 845,000 RNA subregions occupied by RNA-binding proteins, and more than 130,000 long-range interactions between chromatin loci. These annotations have greatly enhanced our view of the human genome from its original annotation in 2003 to a much richer and higher-resolution view (for example, Fig. 2d,e). Indeed, although the number of human protein-coding genes known has changed only modestly, the number of transcript isoforms, long noncoding RNAs (lncRNAs), and potential regulatory regions identified has increased greatly since the project began (Fig. 2a-c).

An important part of ENCODE 3 is that the regulatory mapping efforts have now been integrated and synthesized into the first version of an encyclopedia, highlighting a registry of 0.9 million cCREs in human and 0.3 million cCREs in mouse. Details can be found in the accompanying ENCODE paper8 and companion papers in this issue and other journals9–14.

[Figure 1: horizontal bar chart by year (tick labels to 2019); x axis: cumulative number of ENCODE assays (human and mouse), 0 to 8,000; legend: 3D chromatin structure, DNA accessibility, DNA methylation, RNA binding, Knockdown RNA-seq, Transcription, Histone ChIP-seq, TF ChIP-seq, Other, ENCODE4.]

Fig. 1 | ENCODE assays by year. Accumulations of assays over the three phases of ENCODE. 3D chromatin structure includes ChIA-PET (62 experiments), Hi-C (31), and chromatin conformation capture carbon copy (5C, 13). Chromatin accessibility includes DNase-seq (524), assay for transposase-accessible chromatin using sequencing (ATAC-seq, 129), transcription activator-like effector nuclease (TALEN)-modified DNase-seq (40), formaldehyde-assisted isolation of regulatory elements with sequencing (FAIRE-seq, 37) and micrococcal nuclease digestion with deep sequencing (MNase-seq, 2). DNA methylation includes DNAme arrays (259), WGBS (124), reduced-representation bisulfite sequencing (RRBS, 103), methylation-sensitive restriction enzyme sequencing (MRE-seq, 24) and methylated DNA immunoprecipitation coupled with next-generation sequencing (MeDIP-seq, 4). Histone modification includes ChIP-seq (1,605) on histone and modified histone targets. Knockdown transcription includes RNA-seq preceded by small interfering RNA (siRNA, 54), short hairpin RNA (shRNA, 531), clustered regularly interspaced short palindromic repeats (CRISPR, 50) or CRISPR interference (CRISPRi, 77). RNA binding includes enhanced cross-linking immunoprecipitation (eCLIP, 349), RNA bind-n-seq (158), RNA immunoprecipitation sequencing (RIP-seq, 158), RNA-binding protein immunoprecipitation-microarray profiling (RIP-chip, 32), individual nucleotide-resolution CLIP (iCLIP, 6) and Switchgear (2). Transcription includes RNA annotation and mapping of promoters for the analysis of gene expression (RAMPAGE, 155), cap analysis gene expression (CAGE, 78), RNA paired-end tag (RNA-PET, 31), microRNA-seq (114), microRNA counts (114), more classical RNA-seq (900) and RNA-microarray (170), including 112 experiments at single-cell resolution. Transcription factor (TF) binding is ChIP-seq on non-histone targets (2,443). Other assays include genotyping array (123), nascent DNA replication strand sequencing (Repli-seq, 104), replication strand arrays (Repli-chip, 63), tandem mass spectrometry (MS/MS, 14), genotyping by high-throughput sequencing (genotyping HTS, 12) and DNA-PET (6). Details are available at https://www.encodeproject.org.

1 Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA.
2 Cardiovascular Institute, Stanford School of Medicine, Stanford, CA, USA.
3 Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
4 University of Massachusetts Medical School, Program in Bioinformatics and Integrative Biology, Worcester, MA, USA.
5 Department of Thoracic Surgery, Clinical Translational Research Center, Shanghai Pulmonary Hospital, The School of Life Sciences and Technology, Tongji University, Shanghai, China.
6 Bioinformatics Program, Boston University, Boston, MA, USA.
7 Yale University, New Haven, CT, USA.
8 Ludwig Institute for Cancer Research, University of California, San Diego, La Jolla, CA, USA.
9 Center for Epigenomics, University of California, San Diego, La Jolla, CA, USA.
10 Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA.
11 Altius Institute for Biomedical Sciences, Seattle, WA, USA.
12 Department of Genome Sciences, University of Washington, Seattle, WA, USA.
13 Department of Medicine, University of Washington, Seattle, WA, USA.
14 Department of Genetics and Genome Sciences, Institute for Systems Genomics, UConn Health, Farmington, CT, USA.
15 National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
16 Broad Institute and Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.
17 Biological Sciences, University of Alabama in Huntsville, Huntsville, AL, USA.
18 HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA.
19 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Cambridge, UK.
*An alphabetical list of authors and their affiliations appears online. A formatted list of authors appears in the Supplementary Information.
✉e-mail: mpsnyder@stanford.edu

Nature | Vol 583 | 30 July 2020 | 693

Technology, quality control and standards

Reaching the present annotation required a substantial expansion of technology development, from ENCODE groups and others, as well as the establishment of standards to ensure that the data are reproducible and of high quality. Most ENCODE 2 assays used sequence-based readouts (for example, RNA-seq15,16 and ChIP-seq17,18) rather than the array-based methods19,20 used in the pilot phase, and in ENCODE 3, methods such as global mapping of 3D interactions13 and RNA-binding regions14 were added. Throughout the project, computational and visualization approaches were developed for mapping reads and integrating different data types (Supplementary Note 1).

A key feature of ENCODE is the application of data standards, including the use of independent replicates (separate experiments on two or more biological samples5,21), except when precluded by the limited availability of materials (for example, postmortem human tissues). Of the 8,699 ENCODE 2 and ENCODE 3 experiments, 6,101 have independent replicates.
Of equal importance was the use of well-characterized reagents, such as antibodies for mapping sites of transcription factor binding, chromatin modifications and protein-RNA interactions22. ENCODE developed protocols to test each antibody 'lot' to demonstrate its experimental suitability, captured extensive metadata, and implemented controlled vocabularies and ontologies. Standards for reagents, experimental data, and metadata are on the ENCODE website: https://www.encodeproject.org/data-standards/. Many metrics, including sequencing depth, mapping characteristics, replicate concordance, library complexity, and signal-to-noise ratio, were used to monitor the quality of each data set, and quality thresholds were applied21. A minority of experiments that fell short of the standards (for example, because of insufficiently validated antibodies) are still reported, but are marked with a badge to indicate that an issue was found. This is a compromise that favours having some data over none when an experiment did not meet ENCODE-defined thresholds.

An important component is uniform data processing. Data from the major ENCODE assays (ChIP-seq, DNase I hypersensitive sites sequencing (DNase-seq), RNA-seq, and whole-genome bisulfite sequencing (WGBS)) are uniformly processed, and the processing pipelines are available for users to apply to their own data, either by downloading the code from GitHub (http://github.com/ENCODE-DCC) or by accessing the pipelines at the DNAnexus cloud provider. The standards and pipelines will continue to evolve as new technologies arise and are implemented.
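
As an illustration of how such metrics and badges could work in practice, the following Python sketch computes one library-complexity measure, the non-redundant fraction of mapped reads, and attaches warning labels to an experiment instead of discarding it. The function names, the 0.8 cut-off and the badge wording are assumptions made for this example; they are not ENCODE's actual thresholds or pipeline code.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# A mapped read reduced to its position: (chromosome, 5' start, strand).
Read = Tuple[str, int, str]


def non_redundant_fraction(reads: Iterable[Read]) -> float:
    """Return distinct mapped positions divided by total mapped reads."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    return len(counts) / total


def audit(nrf: float, has_replicates: bool, min_nrf: float = 0.8) -> List[str]:
    """Collect badge labels; the experiment is flagged, not rejected.

    min_nrf is a hypothetical threshold chosen for illustration only.
    """
    badges = []
    if nrf < min_nrf:
        badges.append("low library complexity")
    if not has_replicates:
        badges.append("unreplicated")
    return badges


# Toy input: four reads, two of which share the same position and strand.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
]
nrf = non_redundant_fraction(reads)
print(nrf, audit(nrf, has_replicates=False))
```

Returning a list of labels rather than raising an error mirrors the compromise described above: data that miss a threshold remain available, with the shortfall made visible to the user.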
The ENCODE Consortium is a good example of how large-scale group efforts can have a large impact on the scientific community, and many other national and international projects have now formed, including the NIH Roadmap Epigenomics Program, The Cancer Genome Atlas (TCGA), the International Human Epigenome Consortium (IHEC), BLUEPRINT, the Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC), the Genotype and Tissue Expression Project (GTEx), PsychENCODE, Functional Annotation of Animal Genomes (FAANG), the Global Alliance for Genomics and Health (GA4GH), the 4D Nucleome Program (4DN), the Human Cell Atlas and the FANTOM consortium (Supplementary Note 1). ENCODE has engaged with most of these consortia to share standards for data quality control, submission, and uniform processing, and has helped to facilitate the use of common ontologies with some of them. Data from the now-completed NIH Roadmap Epigenomics Program have been reprocessed, are available in the ENCODE database, and are part of the Encyclopedia annotation. ENCODE continues to work with other consortia, individually and as part of the IHEC and GA4GH (for example, http://epishare-project.org), to increase data interoperability and the value of its resources.

[Figure residue, apparently from Fig. 2: panel on Ensembl/GENCODE releases; percentage values plotted by year (2003, 2007, 2012, 2019); genome browser views of chr16 labelled "2012: GRCh37/hg19, ENCODE 2" and "2019: ENCODE 2, Roadmap & ENCODE 3", each with open chromatin and TFBS tracks.]