TBA Cons Track Settings
 
TBA Conservation   (All ENCODE Comparative Genomics tracks)

Source data version: ENCODE Oct 2005 Freeze, June 2007
Assembly: Human May 2004 (NCBI35/hg17)

Description

This track displays different measurements of conservation based on the multiple sequence alignments of ENCODE regions generated by the Threaded Blockset Aligner (TBA) and shown in the TBA Alignment track. The conservation scoring used to create this track was generated by three programs:

  • phastCons (phylogenetic hidden-Markov model method)
  • GERP (Genomic Evolutionary Rate Profiling)
  • SCONE (from Harvard Genetics)

A related track, TBA Elements, shows multi-species conserved sequences (MCSs) based on the conservation measurements displayed in this track.

For details on the conservation scores generated by each program, refer to the individual Methods subsections.

Display Conventions and Configuration

The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. A subtrack may be hidden from view by checking the box to the left of the track name in the list. For more information about the graphical configuration options, click the Graph configuration help link.

Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different conservation scoring methods. See the Methods section for display information specific to each subtrack.

Methods

The methods used to create the TBA alignments in the ENCODE regions are described in the TBA Alignment track description.

PhastCons

The phastCons program predicts conserved elements and produces base-by-base conservation scores using a two-state phylogenetic hidden Markov model. The model consists of a state for conserved regions and a state for nonconserved regions, each of which is associated with a phylogenetic model. These two models are identical except that the branch lengths of the conserved phylogeny are multiplied by a scaling parameter rho (0 < rho < 1).

For determining the conservation for the ENCODE alignments, the nonconserved model was estimated from four-fold degenerate coding sites within the ENCODE regions using the program phyloFit. The parameter rho was then estimated by maximum likelihood, conditional on the nonconserved model, using the EM algorithm implemented in phastCons. Parameter estimation was based on a single large alignment, constructed by concatenating the alignments for all conserved regions.

PhastCons was run with the options --expected-lengths 15 and --target-coverage 0.01 to obtain the desired level of "smoothing" and a final coverage by conserved elements of 5%.
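As an illustration of how these two options relate to the two-state model, the following minimal sketch assumes the parameterization described in Siepel et al. (2005), in which the expected element length (omega) and the target coverage (gamma) determine the transition probabilities as mu = 1/omega (conserved to nonconserved) and nu = gamma * mu / (1 - gamma) (nonconserved to conserved). The function name is hypothetical and is not part of phastCons.

```python
def transition_params(expected_length, target_coverage):
    """Return (mu, nu) for a two-state conserved/nonconserved HMM.

    Assumed parameterization (after Siepel et al. 2005):
      mu = 1 / expected_length                          # conserved -> nonconserved
      nu = mu * target_coverage / (1 - target_coverage) # nonconserved -> conserved
    """
    if expected_length <= 0:
        raise ValueError("expected_length must be positive")
    if not 0.0 < target_coverage < 1.0:
        raise ValueError("target_coverage must be strictly between 0 and 1")
    mu = 1.0 / expected_length
    nu = mu * target_coverage / (1.0 - target_coverage)
    return mu, nu


# Values corresponding to the options quoted in the text
mu, nu = transition_params(expected_length=15, target_coverage=0.01)
```

Under these assumptions, the stationary probability of the conserved state, nu / (mu + nu), equals the target coverage. This value acts as a prior expectation; the coverage actually obtained from the data can differ from it.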

The conservation score at each base is the posterior probability that the base was generated by the conserved state of the phylo-HMM. It can be interpreted as the probability that the base is in a conserved element, given the assumptions of the model and the estimated parameters. Scores range from 0 to 1, with higher scores corresponding to higher levels of conservation.
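The posterior probability described here can be sketched with the standard forward-backward algorithm for a two-state HMM. This is a minimal illustration, not the phastCons implementation: the per-column likelihoods under the conserved and nonconserved phylogenetic models are assumed to be supplied by the caller (`emit_cons` and `emit_non`, hypothetical names), and the chain is assumed to start from its stationary distribution.

```python
def posterior_conserved(emit_cons, emit_non, mu, nu):
    """Posterior probability of the conserved state at each alignment column.

    emit_cons[i], emit_non[i]: likelihood of column i under the conserved and
        nonconserved phylogenetic models (assumed precomputed elsewhere).
    mu: P(conserved -> nonconserved); nu: P(nonconserved -> conserved).
    State 0 is conserved, state 1 is nonconserved.
    """
    n = len(emit_cons)
    trans = [[1.0 - mu, mu], [nu, 1.0 - nu]]       # trans[from][to]
    start = [nu / (mu + nu), mu / (mu + nu)]       # stationary distribution
    emit = list(zip(emit_cons, emit_non))

    # Forward pass, rescaled at every column to avoid numerical underflow
    fwd = []
    for i in range(n):
        if i == 0:
            cur = [start[s] * emit[0][s] for s in (0, 1)]
        else:
            prev = fwd[-1]
            cur = [emit[i][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                   for s in (0, 1)]
        norm = sum(cur)
        fwd.append([c / norm for c in cur])

    # Backward pass, also rescaled at every column
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = [sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
               for s in (0, 1)]
        norm = sum(cur)
        bwd[i] = [c / norm for c in cur]

    # Combine and normalize per column; the result is P(state 0 | all columns)
    post = []
    for f, b in zip(fwd, bwd):
        cons, non = f[0] * b[0], f[1] * b[1]
        post.append(cons / (cons + non))
    return post
```

When the two models explain every column equally well, the posterior falls back to the stationary probability of the conserved state; columns that are much more likely under the conserved model push the score toward 1.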

More details on phastCons can be found in Siepel et al. (2005), cited below.

GERP

The GERP score is the expected substitution rate minus the observed substitution rate at a particular human base. Scores are estimated on a column-by-column basis using multiple sequence alignments of mammalian genomic DNA. The scores are both positive and negative, with negative values (i.e., observed > expected) corresponding to neutral or unconstrained sites and positive values (i.e., observed < expected) corresponding to constrained or slowly evolving sites. The expected and observed rates are both calculated on a phylogenetic tree using the same fixed topology. The branch lengths of the expected tree are based on the average substitutions at neutral sites. The branch lengths of the observed tree, which is calculated separately for each human base, are based on the substitutions seen at the column of the multiple alignment at that base. Species that have gaps at a particular column are not considered in the scoring for that column.

Higher scores correspond to human bases in alignment columns with higher degrees of similarity, i.e. bases that have evolved slowly, some of which have been under purifying selection. The opposite holds true for swiftly evolving (low similarity) columns.

Scores are deterministic, given a maximum-likelihood model of nucleotide substitution, species topology, neutral tree, and alignment.
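The GERP score definition (expected rate minus observed rate, with gapped species left out) can be sketched as follows. This is an illustration under stated assumptions rather than the GERP implementation: the tree, species names, and branch lengths are made up, the expected rate is taken to be the total neutral branch length of the subtree connecting the species present in the column, and the observed rate is assumed to have been estimated elsewhere and passed in as a number.

```python
# Hypothetical neutral tree: child -> (parent, neutral branch length)
NEUTRAL_TREE = {
    "human": ("primates", 0.02),
    "chimp": ("primates", 0.02),
    "primates": ("root", 0.10),
    "mouse": ("rodents", 0.08),
    "rat": ("rodents", 0.09),
    "rodents": ("root", 0.25),
}


def leaf_sets(tree):
    """Map each non-root node to the set of leaf names at or below it."""
    parents = {parent for parent, _ in tree.values()}
    below = {node: set() for node in tree}
    for leaf in tree:
        if leaf in parents:
            continue  # internal node, not a leaf
        node = leaf
        while node in tree:  # walk up until the root is reached
            below[node].add(leaf)
            node = tree[node][0]
    return below


def expected_rate(tree, present):
    """Total neutral branch length of the subtree connecting `present` species."""
    present = set(present)
    below = leaf_sets(tree)
    total = 0.0
    for node, (_, length) in tree.items():
        k = len(below[node] & present)
        # A branch is kept only if it separates present species from each other
        if 0 < k < len(present):
            total += length
    return total


def gerp_like_score(tree, present, observed_rate):
    """Expected minus observed rate; positive values suggest constraint."""
    return expected_rate(tree, present) - observed_rate


# Column where rat is gapped and the observed rate is low
score = gerp_like_score(NEUTRAL_TREE, ["human", "chimp", "mouse"], 0.10)
```

In this toy example, dropping the gapped species shortens the expected tree, so a column is compared only against the neutral divergence of the species that actually align there.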

SCONE

SCONE is a probabilistic measure of purifying selection, expressed as a p-value for the hypothesis that a given position evolves neutrally. Its model of evolution considers both sequence-contextual effects on substitution rates and insertion/deletion events. This model may be used to compute the probability of any transitional event along a lineage.

The score is computed for any column in a multiple sequence alignment by first parsimoniously inferring the evolutionary history of the site, using a given phylogenetic tree with known branch lengths. Subsequently, transition probabilities are computed for each branch in the tree. A heuristic score is computed using the formula:

S = \ln\left( \frac{\prod_{i \in M} p_i}{\prod_{j \in C} p_j} \right)

where p_i and p_j are the transition probabilities computed for branches i and j, and M and C are the set of all branches in the tree that contain mutations and the set of all branches in the tree that do not contain mutations, respectively. This heuristic score serves to effectively partition sites according to the influence of purifying selection on the site.
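The heuristic score can be sketched as follows, assuming that the two products in the formula run over the per-branch transition probabilities mentioned in the preceding paragraph. The branch names and probability values are hypothetical, and the function is an illustration rather than the SCONE code.

```python
import math


def heuristic_score(branch_probs, mutated):
    """ln( prod_{i in M} p_i / prod_{j in C} p_j ), computed as a difference of log sums.

    branch_probs: branch name -> transition probability for that branch
        (assumed to come from the evolutionary model; values must be > 0).
    mutated: names of branches inferred to contain a mutation (the set M);
        every other branch in branch_probs is treated as the set C.
    """
    mutated = set(mutated)
    log_m = sum(math.log(p) for b, p in branch_probs.items() if b in mutated)
    log_c = sum(math.log(p) for b, p in branch_probs.items() if b not in mutated)
    return log_m - log_c


# Hypothetical column: one mutated branch, two unchanged branches
probs = {"human": 0.1, "mouse": 0.9, "dog": 0.8}
s = heuristic_score(probs, mutated={"human"})
```

Summing logarithms instead of multiplying probabilities avoids underflow on trees with many branches.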

This heuristic score is used to compute a p-value by comparing it against the expected distribution of neutral scores as determined by Monte-Carlo simulation. Forward simulation of evolution is performed along the phylogenetic tree using the SCONE model of mutation events, and the above heuristic score is computed for a simulated tree. Repeated simulation produces a distribution of scores that reflects the neutral expected distribution. A p-value score may be computed by counting the fraction of simulated heuristic scores that fall below the heuristic score for the site.
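The final counting step can be sketched as follows. The simulated scores are assumed to have been produced already by forward simulation under the neutral model; the numbers shown are placeholders, not real output.

```python
def empirical_p_value(site_score, simulated_scores):
    """Fraction of simulated neutral scores that fall below the site's score."""
    if not simulated_scores:
        raise ValueError("at least one simulated score is required")
    below = sum(1 for s in simulated_scores if s < site_score)
    return below / len(simulated_scores)


# Placeholder neutral distribution and site score
neutral = [0.1, 0.2, 0.6, 0.9]
p = empirical_p_value(0.5, neutral)
```

The resolution of such a p-value is limited by the number of simulations: with N simulated scores, the smallest nonzero value is 1/N.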

Credits

PhastCons was developed by Adam Siepel, Cold Spring Harbor Laboratory, while at the Haussler lab at UCSC.

GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford).

SCONE was developed by Saurabh Asthana in the lab of Shamil Sunyaev at Harvard Medical School and Brigham & Women's Hospital (Department of Medicine/Division of Genetics).

TBA was provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group.

The GERP data for this track was generated by Greg Cooper. The PhastCons data was generated by Elliott Margulies, with assistance from Adam Siepel. The SCONE data was generated by Saurabh Asthana.

References

Asthana S, Roytberg M, Stamatoyannopoulos J, Sunyaev S. Analysis of Sequence Conservation at Nucleotide Resolution. PLoS Comput. Biol. 2007 Dec 28;3(12):e254.

Blanchette M, Kent WJ, Reimer C, Elnitski L, Smit A, Roskin K, Baertsch R, Rosenbloom KR, Clawson H, Green ED, et al. Aligning Multiple Genomic Sequences With the Threaded Blockset Aligner. Genome Res. 2004 Apr;14(4):708-15.

Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005 Jul;15(7):901-13. Epub 2005 Jun 17.

Margulies EH, Blanchette M, NISC Comparative Sequencing Program, Haussler D, Green ED. Identification and characterization of multi-species conserved sequences. Genome Res. 2003 Dec;13(12):2507-18.

Siepel A, Bejerano G, Pedersen JS, Hinrichs A, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. Epub 2005 Jul 15.