ENCODE Histone Modifications Track Settings
 
ENCODE Histone Modification Peaks and Signal based on Uniform processing pipeline

Factor
H2A.Z
H3K27ac
H3K27me3
H3K36me3
H3K4me1
H3K4me2
H3K4me3
H3K79me2
H3K9ac
H3K9me1
H3K9me3
H4K20me1

Cell Line
GM12878
H1-hESC
K562
HeLa-S3
HepG2
HUVEC
AG04449
AG04450
AG09309
AG09319
AG10803
AoAF
BJ
Caco-2
GM06990
H7-hESC
HA-sp
HBMEC
HCF
HCFaa
HCM
HCPEpiC
HCT-116
HEEpiC
HEK293
HL-60
HMEC
HMF
HPAF
HPF
HRE
HRPEpiC
HSMM
HSMMtube
HVMF
Jurkat
MCF-7
NB4
NH-A
NHDF-Ad
NHDF-neo
NHEK
NHLF
NT2-D1
Osteobl
SAEC
SK-N-SH RA
U2OS

Subtracks:
View    Tier  Cell Line  Factor    Lab    Track Name
Peaks   1     GM12878    H2A.Z     Broad  GM12878 H2A.Z Peak Calls from Broad
Signal  1     GM12878    H2A.Z     Broad  GM12878 H2A.Z signal from Broad
Peaks   1     GM12878    H3K27ac   Broad  GM12878 H3K27ac Peak Calls from Broad
Signal  1     GM12878    H3K27ac   Broad  GM12878 H3K27ac signal from Broad
Peaks   1     GM12878    H3K27me3  Broad  GM12878 H3K27me3 Peak Calls from Broad
Peaks   1     GM12878    H3K27me3  UW     GM12878 H3K27me3 Peak Calls from UW
Signal  1     GM12878    H3K27me3  Broad  GM12878 H3K27me3 signal from Broad
Signal  1     GM12878    H3K27me3  UW     GM12878 H3K27me3 signal from UW
Peaks   1     GM12878    H3K36me3  Broad  GM12878 H3K36me3 Peak Calls from Broad
Peaks   1     GM12878    H3K36me3  UW     GM12878 H3K36me3 Peak Calls from UW
Signal  1     GM12878    H3K36me3  Broad  GM12878 H3K36me3 signal from Broad
Signal  1     GM12878    H3K36me3  UW     GM12878 H3K36me3 signal from UW
Peaks   1     GM12878    H3K4me1   Broad  GM12878 H3K4me1 Peak Calls from Broad
Signal  1     GM12878    H3K4me1   Broad  GM12878 H3K4me1 signal from Broad
Peaks   1     GM12878    H3K4me2   Broad  GM12878 H3K4me2 Peak Calls from Broad
Signal  1     GM12878    H3K4me2   Broad  GM12878 H3K4me2 signal from Broad
Peaks   1     GM12878    H3K4me3   Broad  GM12878 H3K4me3 Peak Calls from Broad
Peaks   1     GM12878    H3K4me3   UW     GM12878 H3K4me3 Peak Calls from UW
Signal  1     GM12878    H3K4me3   Broad  GM12878 H3K4me3 signal from Broad
Signal  1     GM12878    H3K4me3   UW     GM12878 H3K4me3 signal from UW
Peaks   1     GM12878    H3K79me2  Broad  GM12878 H3K79me2 Peak Calls from Broad
Signal  1     GM12878    H3K79me2  Broad  GM12878 H3K79me2 signal from Broad
Peaks   1     GM12878    H3K9ac    Broad  GM12878 H3K9ac Peak Calls from Broad
Signal  1     GM12878    H3K9ac    Broad  GM12878 H3K9ac signal from Broad
Peaks   1     GM12878    H3K9me3   Broad  GM12878 H3K9me3 Peak Calls from Broad
Signal  1     GM12878    H3K9me3   Broad  GM12878 H3K9me3 signal from Broad
Peaks   1     GM12878    H4K20me1  Broad  GM12878 H4K20me1 Peak Calls from Broad
Signal  1     GM12878    H4K20me1  Broad  GM12878 H4K20me1 signal from Broad
Peaks   1     K562       H2A.Z     Broad  K562 H2A.Z Peak Calls from Broad
Signal  1     K562       H2A.Z     Broad  K562 H2A.Z signal from Broad
Peaks   1     K562       H3K27ac   Broad  K562 H3K27ac Peak Calls from Broad
Signal  1     K562       H3K27ac   Broad  K562 H3K27ac signal from Broad
Peaks   1     K562       H3K27me3  Broad  K562 H3K27me3 Peak Calls from Broad
Peaks   1     K562       H3K27me3  UCD    K562 H3K27me3 Peak Calls from UCD
Peaks   1     K562       H3K27me3  UW     K562 H3K27me3 Peak Calls from UW
Signal  1     K562       H3K27me3  Broad  K562 H3K27me3 signal from Broad
Signal  1     K562       H3K27me3  UCD    K562 H3K27me3 signal from UCD
Signal  1     K562       H3K27me3  UW     K562 H3K27me3 signal from UW
Peaks   1     K562       H3K36me3  Broad  K562 H3K36me3 Peak Calls from Broad
Peaks   1     K562       H3K36me3  UW     K562 H3K36me3 Peak Calls from UW
Signal  1     K562       H3K36me3  Broad  K562 H3K36me3 signal from Broad
Signal  1     K562       H3K36me3  UW     K562 H3K36me3 signal from UW
Peaks   1     K562       H3K4me1   Broad  K562 H3K4me1 Peak Calls from Broad
Peaks   1     K562       H3K4me1   UCD    K562 H3K4me1 Peak Calls from UCD
Signal  1     K562       H3K4me1   Broad  K562 H3K4me1 signal from Broad
Signal  1     K562       H3K4me1   UCD    K562 H3K4me1 signal from UCD
Peaks   1     K562       H3K4me2   Broad  K562 H3K4me2 Peak Calls from Broad
Signal  1     K562       H3K4me2   Broad  K562 H3K4me2 signal from Broad
Peaks   1     K562       H3K4me3   Broad  K562 H3K4me3 Peak Calls from Broad
Peaks   1     K562       H3K4me3   UCD    K562 H3K4me3 Peak Calls from UCD
Peaks   1     K562       H3K4me3   UW     K562 H3K4me3 Peak Calls from UW
Signal  1     K562       H3K4me3   Broad  K562 H3K4me3 signal from Broad
Signal  1     K562       H3K4me3   UCD    K562 H3K4me3 signal from UCD
Signal  1     K562       H3K4me3   UW     K562 H3K4me3 signal from UW
Peaks   1     K562       H3K79me2  Broad  K562 H3K79me2 Peak Calls from Broad
Signal  1     K562       H3K79me2  Broad  K562 H3K79me2 signal from Broad
Peaks   1     K562       H3K9ac    Broad  K562 H3K9ac Peak Calls from Broad
Peaks   1     K562       H3K9ac    UCD    K562 H3K9ac Peak Calls from UCD
Signal  1     K562       H3K9ac    Broad  K562 H3K9ac signal from Broad
Signal  1     K562       H3K9ac    UCD    K562 H3K9ac signal from UCD
Signal  1     K562       H3K9me1   Broad  K562 H3K9me1 signal from Broad
Peaks   1     K562       H3K9me3   Broad  K562 H3K9me3 Peak Calls from Broad
Signal  1     K562       H3K9me3   Broad  K562 H3K9me3 signal from Broad
Peaks   1     K562       H4K20me1  Broad  K562 H4K20me1 Peak Calls from Broad
Signal  1     K562       H4K20me1  Broad  K562 H4K20me1 signal from Broad

Assembly: Human Feb. 2009 (GRCh37/hg19)


Note: ENCODE Project

Description

This set of data tracks represents a comprehensive set of human histone modifications based on ChIP-seq experiments generated by multiple production groups in the ENCODE Consortium. The data tracks represent peak calls (regions of enrichment) that were generated by the ENCODE Analysis Working Group (AWG) based on a uniform processing pipeline. The datasets are based on the January 2011 internal data freeze. These datasets were used in all downstream analysis pipelines by members of the ENCODE Consortium and are one of the primary sources of data referenced in the ENCODE Integrative analysis paper (ENCODE Project Consortium, 2012).

Methods

All ChIP-seq experiments were performed at least in duplicate and were scored against an appropriate control designated by the production groups (either input DNA or DNA obtained from a control immunoprecipitation). Submitted data were generally expected to meet an initial standard for inter-replicate consistency developed by the ENCODE Consortium to ensure an acceptable level of reproducibility: either four fifths (80%) of the top 40% of the targets identified from one replicate (using an acceptable scoring method) should overlap the list of targets from the other replicate, or target lists scored using all available reads from each replicate should share more than 75% of targets in common. In addition, a number of quality metrics for individual replicates, including measures of library complexity and signal enrichment, were calculated, and these are available for review (Kundaje et al., 2012a, Kundaje et al., 2012b). As sequencing has become more economical, minimum standards for the number of reads required for submission of data have been established and upgraded over the course of the ENCODE project. Datasets used in the analyses presented here complied with the minimum depth requirements at the time of submission.
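
The inter-replicate consistency standard above can be illustrated with a minimal Python sketch. It is not the Consortium's scoring code. It assumes that targets are simple identifiers (real pipelines compare genomic intervals by overlap), that the first criterion is checked from replicate 1 to replicate 2 only, and that "more than 75% in common" is measured against the smaller of the two lists; the function names are hypothetical.

```python
def top_fraction(ranked_targets, fraction):
    """Return the best-ranked `fraction` of a list that is sorted best-first."""
    n = int(len(ranked_targets) * fraction)
    return ranked_targets[:n]


def passes_consistency(rep1_ranked, rep2_ranked):
    """Return True if either of the two consistency criteria is met.

    Assumption: targets are hashable identifiers, so "overlap" is set membership.
    """
    rep1, rep2 = set(rep1_ranked), set(rep2_ranked)

    # Criterion A: at least 80% of the top 40% of replicate 1 is found in replicate 2.
    top1 = top_fraction(rep1_ranked, 0.40)
    crit_a = bool(top1) and sum(t in rep2 for t in top1) / len(top1) >= 0.80

    # Criterion B: the full lists share more than 75% of targets.
    # Assumption: the fraction is taken relative to the smaller list.
    smaller = min(len(rep1), len(rep2))
    crit_b = smaller > 0 and len(rep1 & rep2) / smaller > 0.75

    return crit_a or crit_b
```

For example, two replicates whose top-ranked targets agree pass the check even when their lower-ranked targets differ.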

A detailed description of the precise standards and considerations for evaluating the quality of ChIP-seq data and antibodies used for ChIP-seq is available (ENCODE Project Consortium, 2012).

We built a scoring pipeline for the uniform processing of all ChIP-seq experiments generated by the ENCODE Consortium. This pipeline was implemented on the EBI cluster, but can readily be ported to other computers. Uniform signal was generated by processing the aligned reads using the align2rawsignal "Wiggler" software (see http://code.google.com/p/align2rawsignal for details and settings). The method accounts for the depth of sequencing, the mappability of the genome (based on read length and ambiguous bases) and different fragment length shifts for the different datasets being combined. It also differentiates between positions that showed zero signal simply because they are unmappable and positions that are mappable but have no reads.
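
The distinction between unmappable positions and mappable positions with no reads can be illustrated with a toy Python sketch. This is not the Wiggler algorithm: it ignores fragment-length shifts and smoothing, and the function name, the reads-per-million scaling, and the use of `None` for unmappable positions are assumptions made for illustration only.

```python
def normalized_signal(read_counts, mappable, total_reads, per=1_000_000):
    """Depth-normalised signal per position.

    Unmappable positions are reported as None (no data), whereas mappable
    positions without reads are reported as 0.0 (true zero signal).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    signal = []
    for count, is_mappable in zip(read_counts, mappable):
        if not is_mappable:
            signal.append(None)  # no information at this position
        else:
            signal.append(count * per / total_reads)  # scale by sequencing depth
    return signal
```

Keeping "no data" separate from zero lets downstream analyses avoid treating unmappable regions as depleted regions.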

Reads from all ENCODE ChIP-seq experiments and matching controls were mapped to a standardized version of the GRCh37 (hg19) reference human genome sequence with the following modifications:

  • Mitochondrial sequence was included.
  • Alternate sequences were excluded.
  • Random contigs were excluded.
  • The female version of the genome was represented by the autosomes and chrX, whereas the male genome was represented by the autosomes, chrX, and chrY with the PAR regions masked.

Reads from experiments in cell lines labeled male or unknown were mapped to the male genome described above, while experiments in cell lines labeled female were mapped to the female version of the genome.
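
The choice of reference for a given cell line can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function name is hypothetical, `chrM` is the hg19 name for the mitochondrial sequence, and the masking of the PAR regions on chrY is only noted in a comment rather than implemented.

```python
# Autosomes chr1..chr22; random contigs and alternate sequences are simply not listed.
AUTOSOMES = ["chr%d" % i for i in range(1, 23)]


def reference_chromosomes(sex_label):
    """Return the chromosome set to map against for a cell line's sex label."""
    label = sex_label.strip().lower()
    if label == "female":
        return AUTOSOMES + ["chrX", "chrM"]
    if label in ("male", "unknown"):
        # In the male reference, the PAR regions of chrY are masked (not shown here).
        return AUTOSOMES + ["chrX", "chrY", "chrM"]
    raise ValueError("unrecognised sex label: %r" % sex_label)
```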

All Histone Modification ChIP-seq datasets were processed using a standardized pipeline. Mapped reads in the form of BAM files were downloaded from the UCSC ENCODE portal (ENCODE Project Consortium, 2011). Multi-mapping reads were discarded. The MACS (Zhang et al., 2008) peak caller (version 1.4) was used to identify peaks (regions of enrichment) by comparing each ChIP-seq experiment to a corresponding input DNA control experiment.

Since every ENCODE dataset is represented by at least two biological replicate experiments, we used a measure of consistency of peak calling results between replicates, known as the irreproducible discovery rate (IDR), in order to determine an optimal number of reproducible peaks (Li et al., 2011, Kundaje et al., 2012b). Peak calling was performed independently on each replicate of a ChIP-seq dataset. We used a relaxed peak calling threshold (p-value = 0.01) in order to obtain a large number of peaks that span true signal as well as noise (false identifications).

The IDR method analyzes a pair of replicates and considers peaks that are present in both replicates to belong to one of two populations: a reproducible signal group or an irreproducible noise group. Peaks from the reproducible group are expected to show relatively higher ranks and stronger rank-consistency across the replicates, relative to peaks in the irreproducible group. Based on these assumptions, a two-component probabilistic copula-mixture model is used to fit the bivariate peak rank distributions from the pairs of replicates. The method adaptively learns the degree of peak-rank consistency in the signal component and the proportion of peaks belonging to each component. The model can then be used to infer an IDR score for every peak that is found in both replicates. The IDR score of a peak represents the expected probability that the peak belongs to the noise component, and is based on its ranks in the two replicates. Hence, low IDR scores represent high-confidence peaks.

We used an IDR score threshold to obtain an optimal peak rank threshold on the replicate peak sets (cross-replicate threshold). The IDR threshold used was 1%. If a dataset had more than two replicates, all pairs of replicates were analyzed using the IDR method. We used the maximum peak rank threshold across all pairwise analyses as the final cross-replicate peak rank threshold. We then pooled reads from replicate datasets and used MACS to call peaks on the pooled data with a relaxed p-value threshold of 0.01.
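
The derivation of the cross-replicate rank threshold can be sketched in Python as follows. The sketch assumes that per-peak IDR scores have already been computed for each pair of replicates; the copula-mixture fitting itself is not shown. The function names and the use of "at or below" the cutoff are assumptions for illustration.

```python
def rank_threshold(idr_scores, idr_cutoff=0.01):
    """Number of peaks in one pairwise analysis with an IDR score at or below the cutoff."""
    return sum(1 for score in idr_scores if score <= idr_cutoff)


def cross_replicate_threshold(pairwise_idr_scores, idr_cutoff=0.01):
    """Maximum rank threshold over all pairwise replicate comparisons."""
    return max(rank_threshold(scores, idr_cutoff) for scores in pairwise_idr_scores)
```

With three replicates there are three pairwise analyses, and the largest of the three peak counts is taken as the final threshold.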

We used a projection step, in which we regressed the ranks of the meta-peaks from replicates onto the ranks of peaks in the pooled data that overlap the meta-peaks. Meta-peak ranks were based on their IDR scores, and peaks in the pooled data were ranked by their p-values. For each meta-peak from the replicates, we found all peaks in the pooled data that overlap it. We then paired the meta-peak with the best ranking peak from the set of overlapping pooled data peaks. A non-linear polynomial regression curve was fitted to define a relationship between the meta-peak ranks and the corresponding pooled peak ranks. We used this regression to project the meta-peak rank threshold corresponding to the IDR threshold to a corresponding rank threshold on pooled data peaks; the pooled peak list was truncated based on this rank threshold.
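
The pairing step that precedes the regression can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: peaks are `(chrom, start, end)` tuples with half-open coordinates, both input lists are already sorted best-first, and the function names are hypothetical. The polynomial curve fit on the resulting rank pairs is omitted.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def pair_meta_with_pooled(meta_peaks, pooled_peaks):
    """Pair each meta-peak with its best-ranking overlapping pooled peak.

    meta_peaks:   intervals sorted by IDR score, best first.
    pooled_peaks: intervals sorted by p-value, best first.
    Returns a list of (meta_rank, pooled_rank) pairs using 1-based ranks;
    meta-peaks without any overlapping pooled peak are skipped.
    """
    pairs = []
    for meta_rank, meta in enumerate(meta_peaks, start=1):
        ranks = [rank for rank, pooled in enumerate(pooled_peaks, start=1)
                 if overlaps(meta, pooled)]
        if ranks:
            pairs.append((meta_rank, min(ranks)))  # smallest rank is the best peak
    return pairs
```

The returned rank pairs are the points to which a regression curve would then be fitted.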

The final peak list consists of peaks in the pooled data that pass the projected rank threshold AND are also present in both replicates. Hence, each peak in the pooled set can be assigned an IDR score based on the meta-peak from the replicates that it overlaps with.
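
A minimal sketch of this final filter is given below. It assumes `(chrom, start, end)` tuples with half-open coordinates and a pooled peak list already sorted best-first by p-value; the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def final_peaks(pooled_ranked, rep1_peaks, rep2_peaks, projected_rank_threshold):
    """Keep pooled peaks that pass the rank threshold AND overlap a peak in both replicates."""
    kept = []
    for peak in pooled_ranked[:projected_rank_threshold]:
        in_rep1 = any(overlaps(peak, p) for p in rep1_peaks)
        in_rep2 = any(overlaps(peak, p) for p in rep2_peaks)
        if in_rep1 and in_rep2:
            kept.append(peak)
    return kept
```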

Any thresholds based on reproducibility of peak calling between biological replicates are bounded by the quality and enrichment of the worst replicate. Valuable signal is lost in cases for which a dataset has one replicate that is significantly worse in data quality than another replicate. Hence, we developed a rescue pipeline for such cases. In order to balance data quality between a set of replicates, we pooled mapped reads across all replicates of a dataset, and then randomly sampled (without replacement) two pseudo-replicates with equal numbers of reads. This sampling strategy tends to transfer signal from stronger replicates to the weaker replicates, thereby balancing cross-replicate data quality and sequencing depth. We then processed these pseudo-replicates using the IDR method in order to learn a rescue threshold. We found that for datasets with comparable replicates (based on independent measures of data quality), the rescue threshold and cross-replicate thresholds were very similar. However, for datasets with replicates of differing data quality, the rescue thresholds were often higher than the cross-replicate thresholds, and were able to capture true peaks that showed statistically significant and visually compelling ChIP-seq signal in one replicate but not in the other. Ultimately, for each dataset, we used the best of the cross-replicate and rescue thresholds to obtain a final consolidated optimal set of peaks.
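
The construction of pseudo-replicates can be sketched in Python as follows. This is a minimal illustration, not the pipeline's implementation: reads are represented as arbitrary list items, the function name and the `seed` parameter are hypothetical, and when the pooled read count is odd one read is left out so that both halves are equal in size.

```python
import random


def make_pseudo_replicates(replicate_reads, seed=0):
    """Pool reads from all replicates and split them into two equal random halves.

    Shuffling the pool and cutting it in two is equivalent to sampling
    without replacement, so no read appears in both pseudo-replicates.
    """
    pooled = [read for reads in replicate_reads for read in reads]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    rng.shuffle(pooled)
    half = len(pooled) // 2
    return pooled[:half], pooled[half:2 * half]
```

Because each pseudo-replicate draws from the pooled reads, a strong replicate contributes signal to both halves, which is the balancing effect described above.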

In order to make the replicates comparable in terms of library size/sequencing depth, reads from all replicates were pooled and subsampled to form two equally sized pseudo-replicates. We then used the same IDR procedure on the pseudo-replicates as we did for the original replicates (including the projection step), using an IDR threshold of 0.01.

The final set of peak calls is the larger of the pseudo-replicate peak set and the replicate peak set. The optimal set of peak calls contains peaks from the pooled data that are present in both replicates. However, since these are based on projected rank thresholds on the pooled data peaks (which are ranked by p-value), not all peaks have IDR scores passing the original IDR threshold, even though these peaks pass the projected rank threshold.

All peak sets were then screened against a specially curated empirical blacklist (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz) of regions in the human genome and peaks overlapping the blacklisted regions were discarded (Kundaje et al., 2012b). Briefly, these artifact regions typically show the following characteristics:

  • Unstructured and extreme artifactual high signal in sequenced input-DNA and control datasets as well as open chromatin datasets irrespective of cell type identity.
  • An extreme ratio of multi-mapping to unique mapping reads from sequencing experiments.
  • Overlap with pathological repeat regions such as centromeric, telomeric and satellite repeats that often have few unique mappable locations interspersed in repeats.
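
The blacklist screen can be sketched in Python as follows. This is a minimal illustration that assumes BED-style, tab-separated lines with 0-based half-open coordinates; the function names are hypothetical, and the coordinates in the usage example are made up rather than taken from the actual blacklist file.

```python
def parse_bed_line(line):
    """Parse the first three BED columns into a (chrom, start, end) tuple."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def remove_blacklisted(peaks, blacklist):
    """Discard every peak that overlaps any blacklisted region."""
    return [peak for peak in peaks
            if not any(overlaps(peak, region) for region in blacklist)]
```

A usage example with made-up regions:

```python
blacklist = [parse_bed_line("chr1\t1000\t2000\texample_region")]
peaks = [("chr1", 1500, 1600), ("chr1", 2000, 2100), ("chr2", 1500, 1600)]
kept = remove_blacklisted(peaks, blacklist)  # the first peak is discarded
```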

References

ENCODE Project Consortium. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 2011 Apr;9(4):e1001046.

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012 Sep 6;489(7414):57-74.

Kundaje A, Jung L, Kharchenko PV, Sidow A, Batzoglou S, Park PJ. Assessment of ChIP-seq data quality using strand cross-correlation analysis (submitted), 2012a.

Kundaje A, Li Q, Brown JB, Rozowsky J, Harmanci A, Wilder SP, Batzoglou S, Dunham I, Gerstein M, Birney E, et al. Reproducibility measures for automatic threshold selection and quality control in ChIP-seq datasets (submitted), 2012b.

Li QH, Brown JB, Huang HY, Bickel PJ. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 2011; 5(3):1752-1779.

Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column on the track configuration page and the download page. The full data release policy for ENCODE is available here.

There is no restriction on the use of these specific tracks.

Contact

Bradley Bernstein