Article

Expanded encyclopaedias of DNA elements in the human and mouse genomes

https://doi.org/10.1038/s41586-020-2493-4
Received: 26 August 2017
Accepted: 27 May 2020
Published online: 29 July 2020
Open access

The ENCODE Project Consortium*, Jill E. Moore, Michael J. Purcaro, Henry E. Pratt, Charles B. Epstein, Noam Shoresh, Jessika Adrian, Trupti Kawli, Carrie A. Davis, Alexander Dobin, Rajinder Kaul, Jessica Halow, Eric L. Van Nostrand, Peter Freese, David U. Gorkin, Yin Shen, Yupeng He, Mark Mackiewicz, Florencia Pauli-Behn, Brian A. Williams, Ali Mortazavi, Cheryl A. Keller, Xiao-Ou Zhang, Shaimae I. Elhajjajy, Jack Huey, Diane E. Dickel, Valentina Snetkova, Xintao Wei, Xiaofeng Wang, Juan Carlos Rivera-Mulia, Joel Rozowsky, Jing Zhang, Surya B. Chhetri, Jialing Zhang, Alec Victorsen, Kevin P. White, Axel Visel, Gene W. Yeo, Christopher B. Burge, Eric Lécuyer, David M. Gilbert, Job Dekker, John Rinn, Eric M. Mendenhall, Joseph R. Ecker, Manolis Kellis, Robert J. Klein, William S. Noble, Anshul Kundaje, Roderic Guigó, Peggy J. Farnham, J. Michael Cherry, Richard M. Myers, Bing Ren, Brenton R. Graveley, Mark B. Gerstein, Len A. Pennacchio, Michael P. Snyder, Bradley E. Bernstein, Barbara Wold, Ross C. Hardison, Thomas R. Gingeras, John A. Stamatoyannopoulos & Zhiping Weng

The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production.
To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE [1] and Roadmap Epigenomics [2] data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

The human genome comprises a vast repository of DNA-encoded instructions that are read, interpreted, and executed by the cellular protein and RNA machinery to enable the diverse functions of living cells and tissues. The ENCODE Project aims to delineate precisely and comprehensively the segments of the human and mouse genomes that encode functional elements [1,3-6]. Operationally, functional elements are defined as discrete, linearly ordered sequence features that specify molecular products (for example, protein-coding genes or noncoding RNAs) or biochemical activities with mechanistic roles in gene or genome regulation (for example, transcriptional promoters or enhancers) [5].
Commencing with the ENCODE Pilot Project in 2003 (which focused on a defined 1% of the human genome sequence [4]) and scaling to the entire genome in a production phase II that began in 2007 [1], ENCODE has applied a succession of state-of-the-art assays to identify likely functional elements with increasing precision across an expanding range of cellular and biological contexts. To capitalize on the value of the laboratory mouse, Mus musculus, for both comparative functional genomic analysis and modelling of human biology, a Mouse ENCODE Project of more limited scope was initiated in 2009 [6]. An accompanying Perspective [7] provides further context for the evolution of the ENCODE Project and describes how ENCODE data are being used to illuminate both basic biological and biomedical questions that intersect genome structure and function. Beginning in 2012, both the human and mouse ENCODE Projects initiated programs to broaden and deepen their respective efforts to discover and annotate functional elements, and to systematize the production, curation, and dissemination of ENCODE data with the aim of broadly empowering the scientific community.

*A list of affiliations appears at the end of the paper. An alphabetical list of authors and their affiliations appears online. A formatted list of authors is provided in the Supplementary Information.

Nature | Vol 583 | 30 July 2020 | 699

Table 1 | Summary of ENCODE3 production

Assay | Description and details | No. of experiments | No. of targets | No. of biosamples

DNA binding and chromatin modification
ChIP-seq | Chromatin immunoprecipitation sequencing: chromatin-associated proteins | 1,343 | 653 | 151
ChIP-seq | Chromatin immunoprecipitation sequencing: histone marks | 1,082 | 13 | 158

Transcription
RNA-seq | RNA sequencing: total RNA | 224 | - | 209
RNA-seq | RNA sequencing: polyA RNA | 116 | - | 106
RNA-seq | RNA sequencing: microRNA | 112 | - | 108
RNA-seq | RNA sequencing: small RNA | 86 | - | 85
Knockdown/knockout RNA sequencing | CRISPR | 50 | 28 | 2
Knockdown/knockout RNA sequencing | CRISPR interference | 77 | 74 | 1
Knockdown/knockout RNA sequencing | Short hairpin RNA | 523 | 253 | 2
Knockdown/knockout RNA sequencing | Small inhibitory RNA | 54 | 35 | 3
scRNA-seq | Single-cell RNA sequencing | 13 | - | 12
RAMPAGE | RNA annotation and mapping of promoters for the analysis of gene expression | 155 | - | 154

Chromatin accessibility
DNase-seq | DNase I cleavage site sequencing | 246 | - | 246
DNase-seq | DNase-seq of genetically modified cells | 46 | 28 | 1
ATAC-seq | Assay for transposase accessible chromatin using sequencing | 129 | - | 129

DNA methylation
WGBS | Whole-genome bisulfite sequencing | 132 | - | 129
DNAme array | DNA methylation profiling by array | 154 | - | 151

RNA binding
eCLIP | Enhanced UV crosslinking and immunoprecipitation of RNA binding proteins (RBPs) followed by sequencing to identify bound RNAs in cells | 170 | 117 | 3
RNA Bind-n-seq | In vitro method for quantifying RBP-RNA interactions and identifying binding motifs | 78 | 78 | -

3D chromatin structure
ChIA-PET | Chromatin interaction analysis by paired-end tag sequencing | 49 | 6 | 29
Hi-C | Genome-wide chromosome conformation capture (all-versus-all interactions) | 33 | - | 33

Replication timing
Repli-chip | Measures DNA replication timing using microarrays | 36 | - | 30
Repli-seq | Measures DNA replication timing using sequencing | 14 | - | 14

Control experiments were excluded from this table but can be found in Extended Data Table 1. Counts were obtained on 1 December 2019.
ENCODE data have served as an enabling interface between the human genome sequence and its application to biomedical research because of both the range of biological and biochemical features encompassed by ENCODE assays and the breadth and depth with which these assays have been applied across cell and tissue contexts. ENCODE has now expanded on both of these axes by (i) incorporating new assays such as RNA-binding-protein localization and chromatin looping; (ii) increasing the depths at which current assays such as transcription factor chromatin immunoprecipitation and sequencing (ChIP-seq) interrogate reference cell lines; and (iii) collecting data over a greatly expanded biological range, with an emphasis on primary cells and tissues. In addition, ENCODE has now incorporated and uniformly processed the substantial data from the Roadmap Epigenomics Project [2] that conform to ENCODE standards (see Methods).

Here, we describe the generation of nearly 6,000 new experiments (4,834 using human tissues or cells and 1,158 using mouse tissues or cells) in phase III that have extended previous phases of ENCODE in order to define and annotate diverse classes of functional elements in the human and mouse genomes (Table 1). Whereas many experiments during earlier phases of ENCODE used model cell lines, a major goal of phase III was to broaden coverage of primary cells and tissues. Together, the ENCODE-Roadmap Encyclopedia now encompasses 503 biological cell or tissue types from more than 1,369 biological sample sources (biosamples) (Extended Data Table 1). As a new feature of ENCODE, we have systematically integrated DNA accessibility and chromatin modification data to create a categorized registry of candidate cis-regulatory elements (cCREs) in both the human and mouse genomes. We have also developed a new web-based interface called SCREEN to facilitate access to the human and mouse registries and to facilitate their application to diverse biological problems.
Across multiple data types, the increase in the scale of experimental data has provided new insights into genome organization and function, and catalysed new capabilities for deriving biological understandings and principles, as illustrated below and detailed in accompanying papers [7-16]. In summary, we:
• Define core gene sets that correspond to major cell types using extensive new maps of RNA transcripts in a broad range of primary cell types [8].

[Figure residue: panels giving the number of experiments and of unique biosamples per assay category for human (transcription, chromatin accessibility, DNA binding, DNA methylation, RNA binding, 3D chromatin structure, replication timing) and mouse, and a genome-browser view on chr12 with GENCODE genes, cCREs, RAMPAGE, RNA-seq, DNase, H3K4me3, H3K27ac, CTCF, and WGBS tracks.]

[...] >1.64 throughout, and low otherwise. Considering the max-Z values across all biosamples but not the Z-scores in a specific biosample, cCREs were classified into seven states and five groups. A state stands for a specific high-low combination of a cCRE's H3K4me3, H3K27ac, and CTCF max-Z values; seven states are possible because at least one mark needs to have a high signal. For the group classification, we further took into account the genomic distance from the centre of the cCRE to the nearest TSS (<200 bp for TSS-overlapping, 200-2,000 bp for TSS-proximal, and >2,000 bp for TSS-distal). We defined TSSs as the 5' ends of all basic transcripts annotated by GENCODE (V24 for human and M18 for mouse).
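The state and distance definitions above can be illustrated with a minimal Python sketch. It is not the consortium's implementation: the function names are hypothetical, treating "high" as strictly greater than 1.64 is an assumption, and the handling of distances of exactly 200 bp and 2,000 bp simply follows the ranges as written in the text.

```python
from itertools import product

Z_THRESHOLD = 1.64  # threshold quoted in the text
MARKS = ("H3K4me3", "H3K27ac", "CTCF")


def is_high(z_score):
    """Assumption: 'high' means strictly greater than the threshold."""
    return z_score > Z_THRESHOLD


def agnostic_state(max_z):
    """Map per-mark max-Z values to a high/low tuple ordered as MARKS."""
    state = tuple("high" if is_high(max_z[mark]) else "low" for mark in MARKS)
    if "high" not in state:
        # The text requires at least one mark with a high signal.
        raise ValueError("no mark has a high max-Z; not one of the seven states")
    return state


def tss_category(distance_bp):
    """Bin the distance from the cCRE centre to the nearest TSS."""
    if distance_bp < 200:
        return "TSS-overlapping"
    if distance_bp <= 2000:
        return "TSS-proximal"
    return "TSS-distal"


# Every allowed high/low combination: 2**3 combinations minus the all-low case.
ALL_STATES = [s for s in product(("high", "low"), repeat=3) if "high" in s]

if __name__ == "__main__":
    example = {"H3K4me3": 3.2, "H3K27ac": 0.4, "CTCF": 1.9}
    print(len(ALL_STATES), agnostic_state(example), tss_category(850))
```

Enumerating the combinations makes the count of seven states explicit: three marks give eight high-low combinations, and the all-low combination is excluded.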
A cCRE was assigned to one of five mutually exclusive groups on the basis of its state and TSS proximity (Box 1): TSS-overlapping with promoter-like signatures (PLS); TSS-proximal with enhancer-like signatures (pELS); TSS-distal with enhancer-like signatures (dELS); not TSS-overlapping and with high DNase and H3K4me3 signals only (DNase-H3K4me3); and not TSS-overlapping and with high DNase and CTCF signals only (CTCF-only). Note that this set of seven states and five groups is defined across all biosamples, and therefore is cell-type agnostic.

We next defined cell type-specific state and group classifications. To classify cCREs in a particular biosample covered by all four core assays, we used the DNase, H3K4me3, H3K27ac, and CTCF Z-scores in that particular biosample. We had all four types of data for 25 human and 15 mouse biosamples. The cCREs in each of these biosamples were assigned to one of nine states: one low-DNase state, regardless of H3K4me3, H3K27ac, and CTCF Z-scores, and eight high-DNase states covering the high-low combinations of their H3K4me3, H3K27ac, and CTCF Z-scores. These eight high-DNase states were again combined with the distance from the nearest TSS to yield six mutually exclusive groups (PLS, pELS, dELS, DNase-H3K4me3, CTCF-only, and DNase-only) according to the classification diagram (Supplementary Fig. 2). The low-DNase state is included as the seventh group. Thus, in a particular biosample fully covered by all four core assays, cCREs were classified into nine states and seven groups.

Biosamples that are not fully covered by all four assays can also be used to define cCREs. To distinguish a low signal for a mark from missing data for that mark (that is, the assay was not performed for that mark in the biosample), we assign a confidence tier to each cCRE based on its supporting data (Box 1).
Tier 1 cCREs are supported by a high DNase signal plus at least one more high-signal mark in the same biosample; that is, these two high signals are concordantly observed in the same sample. Tier 1 cCREs were further separated into sub-tiers 1a and 1b, depending on whether the biosample that had high signals for this cCRE was fully covered by the four core assays (Box 1). Thus, all tier 1a cCREs are from the 25 human and 15 mouse biosamples that are fully covered by the four core assays, whereas tier 1b cCREs are from biosamples not fully covered by the four core assays. Tier 2 cCREs are supported by a high DNase signal in one biosample and a high signal for one more mark in a different biosample, but the concordance test could not be performed for the tier 2 cCREs owing to missing pertinent data for the cell type-agnostic classification of the cCRE. For example, for a tier 2 cCRE with a cell type-agnostic group classification of PLS, none of the biosamples with a high DNase signal at this cCRE had available H3K4me3 ChIP-seq data, and none of the biosamples with a high H3K4me3 signal at this cCRE had available DNase-seq data. There are also tier 3 and tier 4 cCREs, which were excluded from the current versions of the registries (see Supplementary Methods for details).

We also attempted to make group assignments for cCREs in a particular biosample that was not fully covered by the four core assays, making some approximations. The specific schemes are illustrated in Supplementary Fig. 3 and summarized as follows. For samples with DNase data, we classified elements using the available marks. For example, if a sample lacked H3K27ac (Supplementary Fig. 3e), its cCREs were assigned to the PLS and DNase-H3K4me3 groups but not the pELS or dELS groups. For biosamples lacking DNase data, we do not have the resolution to identify specific elements (Supplementary Fig. 3f).
Therefore, for these biosamples, we simply labelled each cCRE as having a high or low signal for every available assay. In these biosamples, cCREs with low H3K4me3, H3K27ac, or CTCF signals were labelled 'unclassified' because we were unable to classify them as low-DNase without DNase data. In both SCREEN and the downloadable files, biosamples lacking data are clearly labelled as such.

For average conservation score analysis on each set of cCREs (Extended Data Fig. 2b), we calculated the average phyloP [69] score (computed from the alignment of 100 vertebrate genomes; http://hgdownload.cse.ucsc.edu/goldenpath/hg38/phyloP100way/hg38.phyloP100way.bw) per base, ±250 bp from the centre of each cCRE. Homologous human and mouse cCREs were identified by liftOver [70] with a minimum match score of 0.5 (Extended Data Fig. 2c).

Testing cCREs with transgenic mouse assays

We selected regions containing cCRE-dELSs in three E11.5 mouse tissues (midbrain, hindbrain, and limb) for testing using E11.5 transgenic mouse assays. We excluded dELS-containing regions that overlapped any previously tested regions that were already in the VISTA database (http://enhancer.lbl.gov/). We ranked dELS-containing regions from the most to the least significant by the average rank of DNase and H3K27ac signals in the corresponding tissue and then selected regions from three segments of each tissue's ranked list (the top, around 1,500, and around 3,000 by rank). We used H3K27ac peaks (called using the ENCODE uniform processing pipeline) that overlapped the cCRE-dELSs to choose the boundaries of the tested regions. In total, we tested 151 regions across the three tissues (Supplementary Table 22). Transgenic mouse assays were performed in FVB/NCrl strain M. musculus animals (Charles River) as described previously [49]. In brief, predicted enhancers were PCR amplified and cloned into a plasmid upstream of a minimal Hsp68 promoter and a lacZ reporter gene.
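The region-selection step described above (ranking by the average rank of DNase and H3K27ac signals, then drawing from three segments of the ranked list) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout, the stable tie-breaking, and the `per_segment` window size are hypothetical and are not given in the text.

```python
def average_rank_order(regions):
    """Order region names from most to least significant by average rank.

    `regions` is a list of dicts with hypothetical keys 'name', 'dnase',
    and 'h3k27ac'; a larger signal receives a better (smaller) rank.
    """
    def rank_by(key):
        ordered = sorted(regions, key=lambda r: r[key], reverse=True)
        return {r["name"]: position for position, r in enumerate(ordered, start=1)}

    dnase_rank = rank_by("dnase")
    h3k27ac_rank = rank_by("h3k27ac")
    # Ties in average rank keep their input order (an assumption).
    return sorted(
        (r["name"] for r in regions),
        key=lambda name: (dnase_rank[name] + h3k27ac_rank[name]) / 2,
    )


def pick_segments(ordered, starts=(1, 1500, 3000), per_segment=5):
    """Take `per_segment` consecutive entries beginning at each 1-based rank."""
    return [ordered[start - 1:start - 1 + per_segment] for start in starts]


if __name__ == "__main__":
    toy = [
        {"name": "r1", "dnase": 5.0, "h3k27ac": 9.0},
        {"name": "r2", "dnase": 9.0, "h3k27ac": 9.5},
        {"name": "r3", "dnase": 1.0, "h3k27ac": 2.0},
    ]
    print(average_rank_order(toy))
```

Sampling from the top, middle, and lower parts of the ranked list lets validation rates be compared across signal strengths rather than only among the strongest predictions.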
The plasmids were pronuclear injected into fertilized mouse eggs, and the transgenic embryos were implanted into surrogate mothers, collected at E11.5, and stained for β-galactosidase activity. A predicted element was scored positive as an enhancer if at least three embryos had identical β-galactosidase staining in the same tissue. Conversely, a prediction was deemed inactive if no reproducible staining was observed and at least five embryos harbouring a transgene insertion were obtained.

Evaluating cCREs using public MPRA data

We downloaded the SNPs tested by MPRA [50] in human lymphoblastoid cells from Supplementary Table 1 of that study and reconstructed tested regions by generating a ±75-bp window around each SNP. We then intersected cCREs with these regions using bedtools intersect, requiring at least 25% of each cCRE to overlap. Of the cCREs that overlapped a tested region, we calculated the percentage that overlapped an MPRA+ region. We analysed all cCREs and GM12878-specific cCREs stratified by cCRE group.

Evaluating cCREs with public SuRE data

We downloaded SuRE peaks in human K562 cells from the Supplementary Data Set of an earlier study [51]. Using bedtools intersect, we compared the SuRE peaks with the hg38 cCREs lifted over to the hg19 genome version, counting the number of base pairs overlapping each cCRE or region of interest. We then calculated the total percentage of base pairs for each cCRE group that overlapped a SuRE peak.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability

All data are available on the ENCODE data portal: www.encodeproject.org.

Code availability

All code is available on GitHub from the links provided in the methods section. Code related to the Registry of cCREs can be found at https://github.com/weng-lab/ENCODE-cCREs. Code related to SCREEN can be found at https://github.com/weng-lab/SCREEN.

57. Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101-108 (2012).
58. Kanamori-Katayama, M. et al. Unamplified cap analysis of gene expression on a single-molecule sequencer. Genome Res. 21, 1150-1159 (2011).
59. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47 (D1), D766-D773 (2019).
60. Sloan, C. A. et al. ENCODE data at the ENCODE portal. Nucleic Acids Res. 44 (D1), D726-D732 (2016).
61. Lambert, N. et al. RNA Bind-n-Seq: quantitative assessment of the sequence and structural binding specificity of RNA binding proteins. Mol. Cell 54, 887-900 (2014).
62. Ram, O. et al. Combinatorial patterning of chromatin regulators uncovered by genome-wide location analysis in human cells. Cell 147, 1628-1639 (2011).
63. Cai, S. F., Chen, C.-W. & Armstrong, S. A. Drugging chromatin in cancer: recent advances and novel approaches. Mol. Cell 60, 561-570 (2015).
64. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760 (2009).
65. Rivera-Mulia, J. C. & Gilbert, D. M. Replication timing and transcriptional control: beyond cause and effect, part III. Curr. Opin. Cell Biol. 40, 168-178 (2016).
66. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
67. Langdon, W. B. Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks. BioData Min. 8, 1 (2015).
68. Landt, S. G. et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22, 1813-1831 (2012).
69. Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110-121 (2010).
70. Hinrichs, A. S. et al. The UCSC Genome Browser Database: update 2006. Nucleic Acids Res. 34, D590-D598 (2006).

Acknowledgements

We thank additional members of our laboratories and institutions who contributed to the experimental and analytical components of this project. We also thank the external advisors of the ENCODE Project for providing valuable input. This work was supported by grants from the NIH under U01HG007019, U01HG007033, U01HG007036, U01HG007037, U41HG006992, U41HG006993, U41HG006994, U41HG006995, U41HG006996, U41HG006997, U41HG006998, U41HG006999, U41HG007000, U41HG007001, U41HG007002, U41HG007003, U54HG006991, U54HG006997, U54HG006998, U54HG007004, U54HG007005, U54HG007010 and UM1HG009442.

Author contributions

See the consortium author list in the Supplementary Information for full details of author contributions. Data analysis coordination (data analysis): J.E.M., M.J.P., H.E.P., B.W., R.C.H., T.R.G., J.A.S., Z.W. Data production coordination (data production): C.B.E., N.S., J.A., T.K., C.A.D., A.D., R.K., J.H., E.L.V.N., P.F., D.U.G., Y.S., Y.H., M.M., F.P.-B., R.M.M., B.R., B.R.G., L.A.P., M.P.S., B.E.B., B.W., R.C.H., T.R.G., J.A.S. Data analysis leads (data analysis): J.E.M., M.J.P., H.E.P., X.-O.Z., S.I.E., J.H., J.R., J.Z., M.K., R.J.K., W.S.N., A.K., R.G., M.B.G., B.W., R.C.H., Z.W. Data production leads (data production): C.B.E., N.S., J.A., T.K., C.A.D., A.D., R.K., J.H., E.L.V.N., P.F., D.U.G., Y.S., Y.H., M.M., F.P.-B., B.A.W., A.M., C.A.K., S.B.C., J.Z., A.V., K.P.W., A.V., G.W.Y., C.B.B., E.L., D.M.G., J.D., J.R., E.M.M., J.R.E., P.J.F., R.M.M., B.R., B.R.G., L.A.P., M.P.S., B.E.B., B.W., R.C.H., T.R.G., J.A.S. Writing group: R.M.M., B.R., B.R.G., L.A.P., M.P.S., B.E.B., B.W., R.C.H., T.R.G., J.A.S., Z.W. Principal investigators (steering committee): J.M.C., R.M.M., B.R., B.R.G., M.P.S., B.E.B., T.R.G., J.A.S., Z.W.

Competing interests

B.E.B. declares outside interests in Fulcrum Therapeutics, 1CellBio, HiFiBio, Arsenal Biosciences, Cell Signaling Technologies, BioMillenia, and Nohla Therapeutics. P. Flicek is a member of the Scientific Advisory Boards of Fabric Genomics, Inc. and Eagle Genomics, Ltd. M.P.S. is cofounder of Personalis, SensOmics, Mirvie, Qbio, January, Filtricine, and Genome Heart. He serves on the scientific advisory board of these companies and Genapsys and Jupiter. Z. Weng is a cofounder of Rgenta Therapeutics and she serves on its scientific advisory board. G.W.Y. is co-founder, member of the Board of Directors, on the SAB, equity holder, and paid consultant for Locana and Eclipse BioInnovations, and a visiting professor at the National University of Singapore. G.W.Y.'s interests have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. E.L.V.N. is co-founder, member of the Board of Directors, on the SAB, equity holder, and paid consultant for Eclipse BioInnovations. E.L.V.N.'s interests have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. B.R. is a co-founder and member of SAB of Arima Genomics, Inc. The authors declare no other competing financial interests.

Additional information

Supplementary information is available for this paper at https://doi.org/10.1038/s41586-020-2493-4.
Correspondence and requests for materials should be addressed to J.M.C., R.M.M., B. Ren, B.R.G., M.B.G., L.A.P., M.P.S., B.E.B., B.W., R.C.H., T.R.G., J.A.S. or Z.W.
Peer review information Nature thanks Piero Carninci and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Reprints and permissions information is available at http://www.nature.com/reprints.
[Figure residue, Extended Data Fig. 1: panels a and b plot, for each cell type-agnostic group, the cell type-specific groups (PLS, pELS, DNase-H3K4me3, CTCF-only, DNase-only) as % of GRCh38 cCREs and % of mm10 cCREs; panels c and d are genome-browser views with GENCODE genes, cCREs, DNase, H3K4me3, H3K27ac, and CTCF tracks, labelled EH38E2652345, ATF6B, EH38E2459760, chr8: 100,907,000 to 100,910,000, and chr6: 32,112,500.]

Extended Data Fig. 1 | Classification of human cCREs is largely consistent across biosamples. a, b, For the 25 human (a) and 15 mouse (b) biosamples that were covered by all four core assays, we analysed how cCRE classification could differ between biosamples. For each cell type-agnostic group of cCREs, the bars indicate their group classification in specific biosamples, coloured by group as indicated. Black indicates a switch in the grouping, for example, from cell type-agnostic PLS to cell type-specific pELS or CTCF-only. c, d, Two example switches of cCRE grouping between different biosamples. c, EH38E2652345 is a cCRE-dELS that has high DNase, H3K4me3, and H3K27ac signals in bipolar spindle neurons. By contrast, in cell types at earlier stages of neuronal differentiation, such as embryonic stem cells, iPSCs, and neural progenitor cells, this cCRE only has high DNase and H3K4me3 signals, suggesting that in these cell types the cCRE may be a poised enhancer. d, EH38E2459760 is a cCRE-dELS that has high DNase, H3K27ac, and CTCF signals in H1-hESCs and iPSCs. However, in further differentiated cell types such as neural progenitors and bipolar spindle neurons, the H3K27ac signal decreases while the CTCF signal remains, and accordingly, EH38E2459760 is classified as a CTCF-only cCRE. In c and d, cCRE colours correspond to group classification defined in a and b. Grey cCREs have low DNase signals.
[Figure residue, Extended Data Fig. 2: panel labels include the number of cCREs by homology class (homology & cCRE, homology only, no homology) for human and mouse; GM12878 Pol II, EP300, and RAD21 ChIP-seq signal by cCRE group (PLS, pELS, dELS, DNase-H3K4me3, CTCF-only, DNase-only, low DNase, random); and distance from GRCh38 cCRE centre (-250 to 250 bp).]

Extended Data Fig. 2 | General properties of cCREs. a, Distributions of GRCh38 cCRE width in base pairs stratified by group classification. b, Average phyloP score in the ±250 bp from the centre of each cCRE stratified by cell type-agnostic cCRE group: PLS (red), pELS (orange), dELS (yellow), DNase-H3K4me3 (pink), and CTCF-only (blue). In grey are 500,000 300-bp control regions randomly selected from mappable regions of the human genome. c, Fractions of human and mouse cCREs with homology in the other species. In black (no homology) are cCREs that do not map to the other genome. In dark blue (homology only) are cCREs that map to the other genome but do not overlap a cCRE in that genome. In light blue (homology & cCRE) are cCREs that map to cCREs in the other genome, which then reciprocally map back to the original genome. d, Transcription factor ChIP-seq signals support the group classification of cCREs. Violin plots show the average Pol II, EP300, and RAD21 ChIP-seq signals for cCREs belonging to each cCRE group, along with values indicating median signal levels. All ChIP-seq data and cCREs are in GM12878 cells. Colours of violins indicate cCRE groups (PLS, red, N=17,119; pELS, orange, N=29,435; dELS, yellow, N=28,594; DNase-H3K4me3, pink, N=7,298; CTCF-only, blue, N=11,355; DNase-only, green, N=9,394; low-DNase, grey, N=823,340). Boxplots inside violins display median and first and third quartiles.

[Figure residue, Extended Data Fig. 3: axis and panel labels include % cCREs overlapping RAMPAGE peaks; % ChIP-seq peaks overlapping cCREs; GM12878 GRO-seq at PLS (coding strand, non-coding strand) and at dELS, against distance from cCRE centre (-2 kb to 2 kb); and % of FANTOM TSSs by transcript category (small RNAs, protein-coding mRNAs, divergent lncRNA, pseudogenes, structural RNAs, uncertain coding, short ncRNAs, antisense lncRNA, intergenic lncRNA, sense intronic lncRNA, sense overlap RNAs) for PLS, pELS, and dELS.]

Extended Data Fig. 3 | Summary of transcription and transcription factor binding at cCREs. a, Scatterplot depicting percent overlap of various groups of cCREs with RAMPAGE peaks in eight biosamples with matching data versus the median expression level (in RPM) of the overlapping RAMPAGE peaks. b, The vast majority of high-quality ChIP-seq peaks of chromatin-associated proteins (mostly transcription factors) overlap cell type-agnostic cCREs. The median overlap is 90% across all ChIP-seq experiments. c, d, GRO-seq signal in GM12878 averaged over all cCRE-PLSs (c, in red) and cCRE-dELSs (d, in yellow) in a ±2 kb window around cCRE centres. The GRO-seq signals around cCRE-PLSs were grouped by the orientation of their associated genes. The GRO-seq signals around cCRE-dELSs were grouped by genomic strands. Genomic background signal, computed as described in Supplementary Methods, is shown by the grey dashed lines and was approximately 0.02 for both strands in GM12878. e, Percentages of the transcription start sites of FANTOM CAGE-associated transcripts in the eleven FANTOM-defined categories that overlap cCRE-PLSs (red), cCRE-pELSs (orange), or cCRE-dELSs (yellow). The TSSs of the majority of coding-associated transcripts (protein-coding mRNAs and divergent lncRNAs) overlapped a cCRE-PLS, while the TSSs of the majority of eRNA-like noncoding RNAs (short ncRNAs, antisense lncRNAs, intergenic lncRNAs, sense intronic lncRNAs, and sense overlap RNAs) overlapped a cCRE-dELS.

[Figure residue, Extended Data Fig. 4: t-SNE 1 axis; legend for biosample type (primary cell, tissue, cell line, in vitro differentiated cells from ESCs and iPSCs), developmental time (E11.5, E12.5, E13.5, E14.5, E15.5, E16.5, P0), and tissue of origin (nervous system, blood and immune system, reproductive system, gastrointestinal system, embryo, pluripotent cells, other tissues).]

Extended Data Fig. 4 | t-SNE analysis of human and mouse biosamples based on the H3K27ac signals at their cCREs. To investigate the relationship among biosamples and their tissues or cell types of origin, we performed t-SNE based on the H3K27ac signal at the cCRE-dELSs (human: 667,599 and mouse: 209,041) across all biosamples (human: 228 and mouse: 66). a, Human biosamples formed seven main clusters as determined by K-means clustering. Cluster 1 comprises adult brain tissues and embryonic neurospheres. Cluster 2 comprises tissues from the adrenal gland, heart, leg muscle, and muscular samples of the gastrointestinal (GI) system. Cluster 3 comprises haematopoietic cells and immune tissues including the spleen and thymus.
Cluster 4 comprises tissues without strong muscle components, such as kidney, liver, and mucosa of the gastrointestinal system. Cluster 5 comprises embryonic stem cells, induced pluripotent stem cells and in vitro differentiated cells from these pluripotent cell types. This cluster also includes two outliers, the A673 and SK-N-MC cell lines. Cluster 6 comprises a mixture of cell lines and primary cells. Cluster 7 comprises tissues from embryonic structures such as the placenta and chorion. b, The mouse developmental tissue samples formed three large clusters: brain, liver (hepatic plus fetal haematopoietic systems), and other tissues. Related tissues cluster together, and several tissues (for example, the four brain regions, face, and limb) display a time-course-dependent arrangement of the samples.

[Panel b legend: other tissues (kidney, lung, heart); gastrointestinal system (stomach, intestine); extremities (limb, embryonic facial prominence).]

Extended Data Table 1 | Summary of data produced during ENCODE phase III (as of 1 December 2019)

The biosample-type columns (tissues, primary cells, cell lines, in vitro differentiated cells) give counts across all ENCODE data. "Exp." columns give the number of experiments; "Bios." columns give the number of surveyed biosamples.

Human

| Category | Assay | Tissues | Primary cells | Cell lines | In vitro diff. cells | Exp.: ENCODE phase III | Exp.: all ENCODE | Exp.: ENCODE + Roadmap | Bios.: ENCODE phase III | Bios.: all ENCODE | Bios.: ENCODE + Roadmap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Transcriptome | CAGE | 0 | 30 | 46 | 1 | 0 | 77 | 77 | 0 | 64 | 64 |
| | CRISPR RNA-seq | 0 | 0 | 50 | 0 | 50 | 50 | 50 | 2 | 2 | 2 |
| | CRISPRi RNA-seq | 0 | 0 | 77 | 0 | 77 | 77 | 77 | 1 | 1 | 1 |
| | RAMPAGE | 104 | 15 | 30 | 6 | 155 | 155 | 155 | 154 | 154 | 154 |
| | RNA-PET | 1 | 4 | 26 | 0 | 0 | 31 | 31 | 0 | 31 | 31 |
| | polyA depleted RNA-seq | 0 | 11 | 20 | 1 | 1 | 32 | 32 | 1 | 32 | 32 |
| | polyA RNA-seq | 189 | 51 | 110 | 21 | 38 | 143 | 371 | 28 | 105 | 301 |
| | small RNA-seq | 67 | 24 | 73 | 12 | 86 | 171 | 176 | 85 | 144 | 148 |
| | total RNA-seq | 113 | 57 | 45 | 8 | 196 | 221 | 223 | 191 | 216 | 217 |
| | microRNA counts | 24 | 1 | 8 | 5 | 38 | 38 | 38 | 38 | 38 | 38 |
| | microRNA-seq | 52 | 36 | 9 | 5 | 34 | 34 | 102 | 34 | 34 | 87 |
| | shRNA RNA-seq | 0 | 0 | 523 | 0 | 523 | 523 | 523 | 2 | 2 | 2 |
| | siRNA RNA-seq | 0 | 0 | 54 | 0 | 54 | 54 | 54 | 3 | 3 | 3 |
| | single cell RNA-seq | 0 | 5 | 2 | 0 | 7 | 7 | 7 | 6 | 6 | 6 |
| | RNA microarray | 3 | 66 | 94 | 7 | 0 | 170 | 170 | 0 | 145 | 145 |
| Transcriptional regulation and replication | DNase-seq | 369 | 143 | 161 | 30 | 196 | 388 | 703 | 196 | 366 | 649 |
| | genetic modification DNase-seq | 0 | 0 | 46 | 0 | 46 | 46 | 46 | 1 | 1 | 1 |
| | ATAC-seq | 48 | 0 | 0 | 0 | 48 | 48 | 48 | 48 | 48 | 48 |
| | DNAme array | 122 | 38 | 91 | 6 | 154 | 257 | 257 | 151 | 211 | 211 |
| | FAIRE-seq | 7 | 4 | 26 | 0 | 0 | 37 | 37 | 0 | 37 | 37 |
| | MNase-seq | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 0 | 2 | 2 |
| | MRE-seq | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 0 | 2 | 2 |
| | MeDIP-seq | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 0 | 2 | 2 |
| | RRBS | 17 | 27 | 57 | 2 | 0 | 103 | 103 | 0 | 94 | 94 |
| | WGBS | 78 | 7 | 18 | 14 | 48 | 48 | 117 | 45 | 45 | 109 |
| | ChIP-seq (TF) | 232 | 56 | 1891 | 26 | 1327 | 2205 | 2205 | 140 | 278 | 278 |
| | ChIP-seq (histone) | 798 | 480 | 583 | 230 | 518 | 863 | 2091 | 86 | 153 | 350 |
| | ChIP-seq (control) | 362 | 117 | 469 | 38 | 513 | 747 | 986 | 155 | 279 | 461 |
| | 5C | 0 | 0 | 13 | 0 | 0 | 13 | 13 | 0 | 11 | 11 |
| | ChIA-PET | 0 | 2 | 52 | 3 | 49 | 57 | 57 | 29 | 32 | 32 |
| | Hi-C | 8 | 6 | 19 | 0 | 33 | 33 | 33 | 33 | 33 | 33 |
| | Repli-chip | 0 | 4 | 14 | 27 | 36 | 45 | 45 | 30 | 39 | 39 |
| | Repli-seq | 0 | 12 | 92 | 0 | 14 | 104 | 104 | 14 | 104 | 104 |
| Post-transcriptional regulation via RBPs | RIP-chip | 0 | 0 | 32 | 0 | 0 | 32 | 32 | 0 | 5 | 5 |
| | RIP-seq | 0 | 0 | 15 | 0 | 7 | 15 | 15 | 2 | 2 | 2 |
| | RNA Bind-N-Seq | 0 | 0 | 0 | 78 | 78 | 78 | 78 | in vitro | in vitro | in vitro |
| | RNA Bind-N-Seq (control) | 0 | 0 | 0 | 80 | 80 | 80 | 80 | in vitro | in vitro | in vitro |
| | eCLIP | 2 | 0 | 168 | 0 | 170 | 170 | 170 | 3 | 3 | 3 |
| | eCLIP (control) | 2 | 0 | 177 | 0 | 179 | 179 | 179 | 3 | 3 | 3 |
| | iCLIP | 0 | 0 | 3 | 0 | 3 | 3 | 3 | 1 | 1 | 1 |
| | Switchgear, RNA binding protein | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 0 | 1 | 1 |
| Genotyping | DNA-PET | 0 | 0 | 6 | 0 | 0 | 6 | 6 | 0 | 2 | 2 |
| | genotyping array | 7 | 37 | 75 | 4 | 59 | 123 | 123 | 56 | 88 | 88 |
| | genotyping HTS | 8 | 0 | 2 | 0 | 10 | 10 | 10 | 5 | 5 | 5 |
| Proteome | MS-MS | 0 | 0 | 13 | 1 | 0 | 14 | 14 | 0 | 12 | 12 |
| Human Total | | | | | | 4,827 | 7,495 | 9,649 | 490 | 904 | 1,369 |

Mouse

| Category | Assay | Tissues | Primary cells | Cell lines | In vitro diff. cells | Exp.: ENCODE phase III | Exp.: all ENCODE | Bios.: ENCODE phase III | Bios.: all ENCODE |
|---|---|---|---|---|---|---|---|---|---|
| Transcriptome | polyA RNA-seq | 156 | 9 | 22 | 2 | 78 | 189 | 78 | 171 |
| | total RNA-seq | 5 | 18 | 9 | 0 | 28 | 32 | 18 | 21 |
| | microRNA counts | 77 | 0 | 0 | 0 | 77 | 77 | 77 | 77 |
| | microRNA-seq | 78 | 0 | 0 | 0 | 78 | 78 | 74 | 74 |
| | single cell RNA-seq | 3 | 3 | 0 | 0 | 6 | 6 | 6 | 6 |
| Transcriptional regulation and replication | DNase-seq | 67 | 13 | 22 | 3 | 50 | 105 | 50 | 103 |
| | ATAC-seq | 68 | 11 | 2 | 0 | 81 | 81 | 81 | 81 |
| | snATAC-seq | 8 | 0 | 0 | 0 | 8 | 8 | 8 | 8 |
| | MRE-seq | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 2 |
| | WGBS | 84 | 0 | 0 | 0 | 84 | 84 | 84 | 84 |
| | ChIP-seq (control) | 112 | 5 | 29 | 4 | 94 | 150 | 72 | 108 |
| | ChIP-seq (histone) | 630 | 18 | 66 | 6 | 564 | 720 | 72 | 101 |
| | ChIP-seq (TF) | 45 | 9 | 122 | 16 | 16 | 192 | 11 | 45 |
| | MeDIP-seq | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 2 |
| | Repli-chip | 0 | 3 | 7 | 8 | 0 | 18 | 0 | 17 |
| Mouse Total | | | | | | 1,164 | 1,744 | 144 | 276 |

Grand Total (experiments): 5,991 (ENCODE phase III); 9,239 (all ENCODE); 11,393 (ENCODE + Roadmap)

Nature Research Reporting Summary
Corresponding author(s): Zhiping Weng
Last updated by author(s): Dec 10, 2019

Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistical parameters including central tendency (e.g.
means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Our web collection on statistics for biologists contains articles on many of the points above.

Software and code
Policy information about availability of computer code

Data collection: All protocols are described in the Methods and Supplementary Methods sections of the manuscript and are available on GitHub.

Data analysis: The nearly six thousand experiments were processed using the applicable ENCODE processing pipelines, which are extensively documented on the ENCODE portal with pipeline schematics and software versions. All pipelines are also available via GitHub. To create the Registry of cCREs and run subsequent analyses we utilized the following commercial software: Bedtools v2.27.1, PRROC v1.3.1, UCSC Utilities (liftOver, bigWigAverageOverBed), DESeq2 v1.14.1. All custom code is available on GitHub.

For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement.
This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability

All ENCODE data are available at the ENCODE Portal (http://encodeproject.org).

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
[x] Life sciences   [ ] Behavioural & social sciences   [ ] Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size: We performed almost six thousand experiments on nearly 500 biosamples including tissues, primary cells, in vitro differentiated cells, and cell lines. No statistical methods were used to determine sample sizes.

Data exclusions: Each ENCODE experiment is subject to assay-specific quality control measurements, which are available on the ENCODE portal. To create the Registry of cCREs we selected all released DNase experiments with SPOT score > 0.3. To annotate cCREs, we selected one representative experiment per biosample, based on QC metrics, to account for assay redundancy.

Replication: The majority of all ENCODE assays require two successful replicates. In cases of biosample scarcity one replicate was performed, and these rare cases are clearly labeled at the ENCODE portal. For the mouse transgenic enhancer-reporter assays, a predicted element was scored positive as an enhancer if at least three embryos had identical β-galactosidase staining in the same tissue. Specific testing results for the 151 tested regions can be found in Supplemental Table 13 and at https://enhancer.lbl.gov/.

Randomization: No randomization was performed. This was not a clinical trial and therefore randomization is not relevant.

Blinding: No blinding was performed. This was not a clinical trial and therefore blinding is not relevant.

Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

Materials & experimental systems
Involved in the study: Antibodies; Eukaryotic cell lines; Animals and other organisms
n/a: Palaeontology; Human research participants; Clinical data

Methods
Involved in the study: ChIP-seq
n/a: Flow cytometry; MRI-based neuroimaging

Antibodies
Antibodies used: The > 3,000 antibodies that were used are listed on the ENCODE portal at https://www.encodeproject.org/search/?type=AntibodyLot&status=released. Each antibody page contains information about the supplier name, catalog number, clone name, lot number and dilution. Each experiment is linked with its corresponding antibody.

Validation: The > 3,000 antibodies that were used are listed on the ENCODE portal at https://www.encodeproject.org/search/?type=AntibodyLot&status=released. Each antibody page contains information about the antibody validation. Antibody characterization guidelines can be found here: https://www.encodeproject.org/documents/4bb40778-387a-47c4-ab24-cebe64ead5ae/@@download/attachment/ENCODE_Approved_Oct_2016_Histone_and_Chromatin_associated_Proteins_Antibody_Characterization_Guidelines.pdf

Eukaryotic cell lines
Policy information about cell lines
Cell line source(s): We performed assays on 168 cell lines in this study.
On the ENCODE data portal each experiment is linked to a specific biosample page with details about the sample source.

Authentication: We performed assays on 168 cell lines in this study. On the ENCODE data portal each experiment is linked to a specific biosample page with details about the sample being authenticated.

Mycoplasma contamination: Cell lines were not tested for mycoplasma contamination.

Commonly misidentified lines (See ICLAC register): No commonly misidentified cell lines were used.

Animals and other organisms
Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research

Laboratory animals: We performed assays on 119 mouse biosamples in this study. On the ENCODE data portal each experiment is linked to a specific biosample page with details about the sample source including species, strain, sex, and age.

Wild animals: None

Field-collected samples: None

Ethics oversight: Not required

Note that full information on the approval of the study protocol must also be provided in the manuscript.

ChIP-seq
Data deposition
- Confirm that both raw and final processed data have been deposited in a public database such as GEO.
- Confirm that you have deposited or provided access to graph files (e.g. BED files) for the called peaks.

Data access links (may remain private before publication): The ENCODE Portal

Files in database submission: The ENCODE Portal

Genome browser session (e.g. UCSC): Track hubs for our data are provided in the supplementary methods.

Methodology
Replicates, Sequencing depth, Antibodies, Peak calling parameters, Data quality, Software: See https://www.encodeproject.org/chip-seq/transcription_factor/ and https://www.encodeproject.org/chip-seq/histone/