Description
This track displays multi-species conserved sequences (MCSs)
derived from phastCons, binCons,
GERP (Genomic Evolutionary Rate Profiling),
and SCONE conservation scoring
of Threaded Blockset Aligner (TBA) multiple sequence alignments in the
ENCODE regions.
The multiple sequence alignments may be viewed in the TBA Alignment
track. Another related track, TBA Cons, shows the conservation scoring.
The descriptions accompanying these tracks detail
the methods used to create the alignments and conservation scoring.
Display Conventions and Configuration
The locations of conserved elements are indicated by blocks in the graphical
display. This composite annotation track consists of several subtracks that
show conserved elements derived by the various methods listed above.
To view only selected subtracks, uncheck the boxes next to the tracks
you wish to hide.
The display may also be filtered to show only those items
with unnormalized scores that meet or exceed a certain threshold. To set a
threshold, type the minimum score into the text box at the top of the
description page.
Display characteristics specific to certain subtracks are described in the
respective Methods sections below.
Methods
PhastCons-based Elements
The predicted MCSs are segments of the alignment that are likely to
have been "generated" by the conserved state of the phylo-HMM,
i.e., maximal segments in which the maximum-likelihood (Viterbi)
path remains in the conserved state.
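As an illustration of "maximal segments in which the path remains in the conserved state", the following minimal sketch extracts such runs from a per-column state path. The function name, the single-letter state labels, and the half-open coordinate convention are assumptions made for this example; the sketch does not perform Viterbi decoding and is not the phastCons implementation.

```python
from itertools import groupby

def conserved_segments(path, conserved_state="C"):
    """Return half-open (start, end) intervals of maximal conserved runs.

    `path` is a hypothetical per-column state sequence, such as an
    already-decoded Viterbi path with one label per alignment column.
    """
    segments = []
    pos = 0
    for state, run in groupby(path):
        length = sum(1 for _ in run)
        if state == conserved_state:
            segments.append((pos, pos + length))
        pos += length
    return segments

# "N" = non-conserved, "C" = conserved (labels are illustrative only)
print(conserved_segments("NNCCCNCCNN"))  # [(2, 5), (6, 8)]
```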
BinCons-based Elements
The binCons score is based on the cumulative binomial probability of
detecting the observed number of identical bases (or greater) in
sliding 25 bp windows (moving one bp at a time) between the
reference sequence and each other species, given the neutral rate
at four-fold degenerate sites. Neutral rates are calculated
separately at each targeted region. For targets with no gene annotations,
the average percent identity across all alignable sequence was instead used
to weight the individual species binomial scores; this latter
weighting scheme was found to closely match 4D weights.
The negative log of these p-values was then averaged across all
human-referenced pairwise combinations, and each base was assigned the score
of the highest-scoring 25 bp window overlapping it. The score shown is a
ranked percentile normalized between 0 and 1 across all
ENCODE regions, such that the top 5% of most conserved sequence across all ENCODE
regions has a score of 0.95 or greater, the top 10% has a score of 0.9 or
greater, and so on.
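The window scoring and percentile normalization described above can be sketched as follows. This is a simplified illustration, not the binCons code: the function names, the `neutral_identity` parameter (the per-base probability of identity under the neutral rate), its default value of 0.6, and the choice of base-10 logarithm are all assumptions, and the species weighting and averaging steps are omitted.

```python
from bisect import bisect_right
from math import comb, log10

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def window_score(identical, window=25, neutral_identity=0.6):
    """Negative log10 p-value of at least `identical` matches in one window.

    `neutral_identity` is a hypothetical stand-in for the identity
    probability implied by the neutral rate at four-fold degenerate sites.
    """
    return -log10(binomial_tail(identical, window, neutral_identity))

def percentile_ranks(scores):
    """Fraction of scores less than or equal to each score, in (0, 1]."""
    ordered = sorted(scores)
    n = len(ordered)
    return [bisect_right(ordered, s) / n for s in scores]

# A fully identical window scores higher than a partially identical one.
print(window_score(25) > window_score(20))
print(percentile_ranks([3, 1, 2, 4]))
```

With this ranking, a base whose rank is 0.95 or greater lies in roughly the top 5% of scored bases, which mirrors the normalization described in the text.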
For each ENCODE target, a conservation score threshold was picked to match
the number of conserved bases predicted by phastCons, an alternative method
for measuring conservation. This latter method has been found slightly more
reliable for predicting the expected fraction of conserved sequence
in each target. Clusters of bases
that exceeded the given conservation score threshold were designated
as MCSs. The minimum length of an MCS is 25
bases. Strict cutoffs were used: if even one base fell below the
conservation score threshold, it separated an MCS into two distinct
regions.
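The strict-cutoff rule can be illustrated with a short sketch: runs of bases above the threshold become MCSs if they are at least 25 bases long, and a single low-scoring base splits a run. The function name, the half-open coordinates, and the treatment of a score exactly equal to the threshold (counted here as not exceeding it) are assumptions for this example.

```python
def call_mcs(scores, threshold, min_length=25):
    """Return half-open (start, end) runs where every base exceeds `threshold`.

    Runs shorter than `min_length` are discarded. A single base at or
    below the threshold ends the current run (strict cutoff).
    """
    elements = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                elements.append((start, i))
            start = None
    # Close a run that extends to the end of the region.
    if start is not None and len(scores) - start >= min_length:
        elements.append((start, len(scores)))
    return elements

# 30 high bases, one low base, 24 high (too short), one low, 25 high
scores = [0.9] * 30 + [0.1] + [0.9] * 24 + [0.1] + [0.9] * 25
print(call_mcs(scores, threshold=0.5))  # [(0, 30), (56, 81)]
```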
More details on binCons can be found in Margulies et al. (2003)
cited below.
GERP-based Elements
GERP elements are scored according to the inferred intensity
of purifying selection
and are measured as "rejected substitutions" (RSs). RSs capture the
magnitude of difference between the number of "observed" substitutions
(estimated using maximum likelihood) and the number that would be
"expected" under a neutral model of evolution.
The RS is displayed as part of the item name.
Items with higher RSs are displayed in a darker shade of blue. The score shown
on the details page, which has been scaled by 300 for display purposes, is
generally not as accurate as the RS count that is part of the item name.
"Constrained elements" are identified as those groups
of consecutive human bases that have an observed rate of evolution that is
smaller than the expected rate. These groups of columns are merged if they
are less than a few nucleotides apart and are scored according to the sum of
the site-by-site difference between observed and expected rates (RS).
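A minimal sketch of this grouping, merging, and scoring procedure is given below. It is not the GERP implementation: the function name, the `max_gap` value standing in for "a few nucleotides", and the choice to include the bridged gap columns in the RS sum are assumptions made for illustration. The two input lists are assumed to have equal length.

```python
def constrained_elements(observed, expected, max_gap=3):
    """Return (start, end, rs) tuples with half-open coordinates.

    `observed` and `expected` are per-column substitution rates.
    `max_gap` is a hypothetical value for "a few nucleotides".
    """
    # Step 1: maximal runs of columns with observed < expected.
    runs, start = [], None
    for i, (obs, exp) in enumerate(zip(observed, expected)):
        if obs < exp:
            if start is None:
                start = i
        elif start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(observed)])

    # Step 2: merge runs separated by fewer than `max_gap` columns.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < max_gap:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # Step 3: RS = sum of (expected - observed) over the element,
    # including any bridged gap columns (an assumption of this sketch).
    return [
        (s, e, sum(expected[i] - observed[i] for i in range(s, e)))
        for s, e in merged
    ]

observed = [1.0, 0.2, 0.2, 1.0, 0.2, 1.0, 1.0, 1.0, 1.0, 0.2]
expected = [0.5] * 10
for element in constrained_elements(observed, expected):
    print(element)
```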
Permutations of the actual alignments were analyzed, and the "constrained
elements" identified in these permuted alignments were treated as
"false positives". Subsequently, an RS threshold was picked such
that the total length of "false positive" constrained elements
(identified in the permuted alignments) was less than 5% of the length of
constrained elements identified in the actual alignment.
Thus, all annotated constrained elements are significant at better
than 95% confidence, and the total fraction of the ENCODE regions
annotated as constrained is 5-7%.
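The threshold selection can be sketched as a simple search. The source does not state how candidate thresholds were searched, so the scan over observed RS values, the function name, and the `(length, rs)` input format are assumptions; only the criterion (permuted length below 5% of actual length) comes from the description above.

```python
def pick_rs_threshold(real, permuted, max_ratio=0.05):
    """Return the smallest RS threshold meeting the false-positive criterion.

    `real` and `permuted` are lists of (length, rs) pairs for elements
    found in the actual and permuted alignments. Returns None if no
    candidate threshold satisfies the criterion.
    """
    # Candidate thresholds: the distinct RS values of the real elements.
    for t in sorted({rs for _, rs in real}):
        real_len = sum(length for length, rs in real if rs >= t)
        perm_len = sum(length for length, rs in permuted if rs >= t)
        if real_len > 0 and perm_len / real_len < max_ratio:
            return t
    return None

# Made-up element lengths and RS values, for illustration only.
real = [(10, 5.0), (50, 20.0), (100, 60.0)]
permuted = [(10, 4.0), (8, 6.0), (5, 18.0)]
print(pick_rs_threshold(real, permuted))  # 20.0
```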
SCONE-based Elements
SCONE provides p-value scores per base. Constrained elements are
defined based on SCONE site-specific scores as follows. An additive
score is first defined as the sum of (-log p + log t) along an
interval, where p is the SCONE score and t is some threshold value for
conservation. This additive score may be treated as a random walk;
elements are defined as the intervals between local minima and maxima
along this walk. Subsequently, a cutoff is set for the additive scores
for each element. This cutoff is chosen such that the elements scoring
above the cutoff for a random sequence of scores drawn from a uniform
distribution on [0,1], with threshold t = 0.25, will cover no more than
0.25% of the sequence.
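The additive score and the random-walk view can be sketched as follows. This is one literal reading of the description, taking each element as the rise from a local minimum of the walk to the next local maximum; it is not the SCONE implementation. The function names, the use of the natural logarithm, and the half-open base coordinates are assumptions, and p-values are assumed to lie in (0, 1].

```python
from itertools import accumulate
from math import log

def increments(p_values, t=0.25):
    """Per-base contribution -log p + log t; positive when p < t."""
    return [-log(p) + log(t) for p in p_values]

def walk(p_values, t=0.25):
    """Cumulative additive score, starting at 0 before the first base."""
    return [0.0] + list(accumulate(increments(p_values, t)))

def ascents(p_values, t=0.25):
    """Return (start, end, score) for each rise from a local minimum
    to the following local maximum; (start, end) is half-open in bases."""
    w = walk(p_values, t)
    out, start = [], None
    for i in range(1, len(w)):
        if w[i] > w[i - 1]:
            if start is None:
                start = i - 1
        elif start is not None:
            out.append((start, i - 1, w[i - 1] - w[start]))
            start = None
    if start is not None:
        out.append((start, len(w) - 1, w[-1] - w[start]))
    return out

# Made-up per-base p-values, for illustration only.
for element in ascents([0.5, 0.01, 0.05, 0.9, 0.1]):
    print(element)
```

Elements whose additive score falls below the chosen cutoff would then be discarded; calibrating that cutoff against uniformly distributed random scores is not shown here.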
Credits
PhastCons was developed by
Adam Siepel, Cold Spring Harbor Laboratory, while at the
Haussler Lab at UCSC.
BinCons was developed by Elliott Margulies of NHGRI, while at the
Eric Green lab.
BinCons and phastCons MCS data were contributed by Elliott Margulies,
with assistance from Adam Siepel of UCSC.
GERP was developed primarily by Greg Cooper in the lab of
Arend Sidow
at Stanford University (Depts. of Pathology and Genetics), in close collaboration
with Eric Stone (Biostatistics, NC State), and George Asimenos and
Eugene Davydov in the lab of
Serafim Batzoglou
(Dept. of Computer Science, Stanford).
SCONE was developed by Saurabh Asthana in the lab of Shamil
Sunyaev at Harvard Medical School and Brigham & Women's Hospital
(Department of Medicine/Division of Genetics).
TBA was provided by Minmei Hou, Scott Schwartz and Webb Miller of the
Penn State Bioinformatics
Group.
References
See the TBA Alignment and TBA Cons tracks for references.