Problematic Sites Track Settings
 
Problematic sites where masking or caution is recommended for analysis

Data version: 2020-07-29


Note: Updated Aug. 31, 2020

Description

Attempts to infer phylogenetic relationships, sites under selection, or evidence of recombination from SARS-CoV-2 genome sequences can be led astray by sequencing errors, contamination, and hypermutable sites. To make reliable inferences, it is important to identify probable errors and susceptible sites within the genome sequences, consider carefully how they might affect the planned analysis, and, where appropriate, exclude problematic sites from that analysis.

This track shows locations in the SARS-CoV-2 genome that have been identified as problematic for analysis for various reasons. They have been collected in the GitHub repository https://github.com/W-L/ProblematicSites_SARS-CoV2/. Locations are separated into two subtracks and colored according to severity:

  • Mask: Problems are expected to affect most types of analysis, so it is recommended to mask out these sites before analysis.
  • Caution: Some types of analysis may be affected while other types may not; caution is recommended.
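
As a rough illustration of what masking means in practice, the sketch below replaces a set of positions in a sequence with N. It assumes the sequence is already aligned to the reference so that positions are 1-based reference coordinates; the function name mask_sites and the toy sequence are hypothetical and are not part of the track or the repository.

```python
def mask_sites(sequence: str, positions: set[int], mask_char: str = "N") -> str:
    """Return a copy of `sequence` with the given 1-based positions masked.

    Assumption: `sequence` is aligned to the reference genome, so a
    position in `positions` refers to the same column in `sequence`.
    """
    chars = list(sequence)
    for pos in positions:
        # Silently skip positions that fall outside the sequence.
        if 1 <= pos <= len(chars):
            chars[pos - 1] = mask_char
    return "".join(chars)


# Toy example: mask the first two bases and the last base.
masked = mask_sites("ACGTACGTAC", {1, 2, 10})
print(masked)
```

Sites in the Caution subtrack would not normally be masked wholesale; whether to include them depends on the analysis being run.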

Locations are labeled with the following terms to indicate the type of potential problem:

  • ambiguous: Sites which show an excess of ambiguous basecalls relative to the number of alternative alleles, often emerging from a single country or sequencing laboratory
  • amended: Previous sequencing errors which now appear to have been fixed in the latest versions of the GISAID sequences, at least in sequences from some of the sequencing laboratories
  • highly_ambiguous: Sites with a very high proportion of ambiguous characters, relative to the number of alternative alleles
  • highly_homoplasic: Positions which are extremely homoplasic; it is not always clear whether these are hypermutable sites or sequencing artefacts
  • homoplasic: Homoplasic sites, with many mutation events needed to explain a relatively small alternative allele count
  • interspecific_contamination: Cases (only one instance as of July 2020) in which the known sequencing issue is due to contamination from genetic material that does not have SARS-CoV-2 origin
  • nanopore_adapter: Cases in which the known sequencing issue is due to the adapter sequences in nanopore reads
  • narrow_src: Variants which are found in sequences from only a few sequencing labs (usually two or three), possibly as a consequence of the same artefact reproduced independently
  • neighbour_linked: Proximal variants displaying near perfect linkage
  • seq_end: Alignment ends are affected by low coverage and high error rates (masking recommended, but might be more stringent than necessary)
  • single_src: Only observed in samples from a single laboratory
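
The labels above can be tallied with a few lines of code. The sketch below is illustrative only: it assumes a VCF-style layout in which the FILTER column carries the recommended action (mask or caution) and an INFO key named EXC carries a comma-separated list of labels. That layout should be checked against the actual file, and the three sample rows are invented for the example.

```python
from collections import Counter

# Invented sample rows in an ASSUMED layout: FILTER = action, INFO EXC = labels.
SAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
REF\t1\t.\tA\t.\t.\tmask\tEXC=seq_end
REF\t100\t.\tC\tT\t.\tmask\tEXC=homoplasic,single_src
REF\t200\t.\tG\tT\t.\tcaution\tEXC=ambiguous
"""


def parse_sites(vcf_text):
    """Return a list of (position, action, labels) tuples from VCF-style text."""
    sites = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        pos = int(fields[1])
        action = fields[6]
        # Turn "KEY=VALUE;KEY=VALUE" into a dict, ignoring flag-only entries.
        info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
        labels = [x for x in info.get("EXC", "").split(",") if x and x != "."]
        sites.append((pos, action, labels))
    return sites


sites = parse_sites(SAMPLE_VCF)
label_counts = Counter(label for _, _, labels in sites for label in labels)
print(label_counts)
```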

Methods

Multiple groups applied various methods (De Maio, Walker et al.; De Maio, Gozashti et al.; Turakhia et al.) to identify sites that were homoplasic, likely contaminated, likely sequencing errors, and/or observed in multiple virus lineages by only one or a few laboratories. They contributed their observations and recommendations to the GitHub repository https://github.com/W-L/ProblematicSites_SARS-CoV2/. UCSC downloaded the collection, split the sites into Mask and Caution subsets depending on the recommended action, and reformatted the data for display in the Genome Browser.
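
The split-and-reformat step might look roughly like the sketch below. It is not the script UCSC used; it only shows the coordinate conversion involved when 1-based site positions become BED-style intervals (0-based start, exclusive end). The function name split_to_bed is hypothetical, and the chromosome name NC_045512v2 is an assumption about how the reference sequence is named in the browser.

```python
def split_to_bed(sites, chrom="NC_045512v2"):
    """Group (position, action) pairs into BED lines keyed by action.

    Positions are 1-based; BED intervals are 0-based and half-open,
    so a single site at position p becomes the interval [p - 1, p).
    """
    subsets = {"mask": [], "caution": []}
    for pos, action in sorted(sites):
        if action in subsets:  # ignore any unrecognised action
            subsets[action].append(f"{chrom}\t{pos - 1}\t{pos}")
    return subsets


# Invented positions, for illustration only.
bed = split_to_bed([(1, "mask"), (200, "caution"), (100, "mask")])
for action, lines in bed.items():
    print(action, lines)
```

Plain BED text like this would still need to be converted to bigBed before it could serve as a track file.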

Data Access

The original data file was downloaded from GitHub: https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf. You can download the bigBed files underlying this track (problematicSites*.bb) from our Download Server. The data can be explored interactively with the Table Browser or the Data Integrator, and accessed from scripts through our API.
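
For scripted access, a request to the API could be assembled as in the sketch below. Several details are assumptions to verify against the API documentation: the endpoint https://api.genome.ucsc.edu/getData/track, the genome name wuhCor1, and the subtrack name problematicSitesMask (guessed from the problematicSites*.bb file pattern). The fetch helper is defined but not called, so the example runs without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# ASSUMED endpoint; check the API documentation before relying on it.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(track, genome="wuhCor1"):
    """Build a track-data request URL for the given (assumed) track name."""
    return f"{API_BASE}?{urlencode({'genome': genome, 'track': track})}"


def fetch_track(track, genome="wuhCor1"):
    """Fetch and decode the JSON response (requires network access)."""
    with urlopen(build_track_url(track, genome), timeout=30) as response:
        return json.load(response)


mask_url = build_track_url("problematicSitesMask")
print(mask_url)
```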

References

De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N. Issues with SARS-CoV-2 sequencing data. virological.org. 2020 May 5.

De Maio N, Gozashti L, Turakhia Y, Walker C, Lanfear R, Corbett-Detig R, Goldman N. Updated analysis with data from 12th June 2020. virological.org. 2020 July 14.

Turakhia Y, Thornlow B, Gozashti L, Hinrichs AS, Fernandes JD, Haussler D, Corbett-Detig R. Stability of SARS-CoV-2 Phylogenies. bioRxiv. 2020 June 9.