Schema for GENCODE VM23 - GENCODE VM23 Comprehensive Transcript Set (only Basic displayed by default)
  Database: mm10    Primary Table: knownGene    Row Count: 142,446   Data last updated: 2019-09-19
Format description: Transcript from default gene set in UCSC browser
field       example               SQL type          info    description
name        ENSMUST00000193812.1  varchar(255)      values  Name of gene
chrom       chr1                  varchar(255)      values  Reference sequence chromosome or scaffold
strand      +                     char(1)           values  + or - for strand
txStart     3073252               int(10) unsigned  range   Transcription start position (or end position for minus strand item)
txEnd       3074322               int(10) unsigned  range   Transcription end position (or start position for minus strand item)
cdsStart    3073252               int(10) unsigned  range   Coding region start (or end position if for minus strand item)
cdsEnd      3073252               int(10) unsigned  range   Coding region end (or start position if for minus strand item)
exonCount   1                     int(10) unsigned  range   Number of exons
exonStarts  3073252,              longblob                  Exon start positions (or end positions for minus strand item)
exonEnds    3074322,              longblob                  Exon end positions (or start positions for minus strand item)
proteinID                         varchar(40)       values  UniProt display ID, UniProt accession, or RefSeq protein ID
alignID     uc287gdb.1            varchar(255)      values  Unique identifier (GENCODE transcript ID for GENCODE Basic)
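
The schema above can be read as a 12-column record. The following is a minimal Python sketch of parsing one row, assuming the row is tab-separated with columns in schema order (the usual layout of a table dump, but an assumption here). The names KnownGeneRow, parse_int_list and parse_known_gene_line are hypothetical helpers, not part of any UCSC tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KnownGeneRow:
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]
    protein_id: str
    align_id: str


def parse_int_list(blob: str) -> List[int]:
    """Parse a comma-separated list such as '3073252,' (trailing comma allowed)."""
    return [int(x) for x in blob.split(",") if x]


def parse_known_gene_line(line: str) -> KnownGeneRow:
    """Parse one row; assumes tab-separated fields in schema order."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    row = KnownGeneRow(
        name=f[0], chrom=f[1], strand=f[2],
        tx_start=int(f[3]), tx_end=int(f[4]),
        cds_start=int(f[5]), cds_end=int(f[6]),
        exon_count=int(f[7]),
        exon_starts=parse_int_list(f[8]),
        exon_ends=parse_int_list(f[9]),
        protein_id=f[10], align_id=f[11],
    )
    # Sanity check: both exon lists should have exonCount entries.
    if not (len(row.exon_starts) == len(row.exon_ends) == row.exon_count):
        raise ValueError("exonCount does not match the exon lists")
    return row


# A sample row from this page, with tabs restored between fields.
line = (
    "ENSMUST00000070533.4\tchr1\t-\t3214481\t3671498\t3216021\t3671348\t3\t"
    "3214481,3421701,3670551,\t3216968,3421901,3671498,\tQ5GH67\tuc007aeu.1\n"
)
row = parse_known_gene_line(line)
print(row.name, row.exon_starts, row.exon_ends)
```

The exon lists are stored as comma-terminated strings, so the parser drops the empty item after the final comma. An empty proteinID is kept as an empty string.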

Connected Tables and Joining Fields
      mm10.bioCycPathway.kgID (via knownGene.name)
      mm10.ccdsKgMap.geneId (via knownGene.name)
      mm10.ceBlastTab.query (via knownGene.name)
      mm10.dmBlastTab.query (via knownGene.name)
      mm10.drBlastTab.query (via knownGene.name)
      mm10.foldUtr3.name (via knownGene.name)
      mm10.foldUtr5.name (via knownGene.name)
      mm10.hgBlastTab.query (via knownGene.name)
      mm10.keggPathway.kgID (via knownGene.name)
      mm10.kg10ToKg11.newId (via knownGene.name)
      mm10.kg11ToKg12.newId (via knownGene.name)
      mm10.kgAlias.kgID (via knownGene.name)
      mm10.kgColor.kgID (via knownGene.name)
      mm10.kgProtAlias.kgID (via knownGene.name)
      mm10.kgProtMap2.qName (via knownGene.name)
      mm10.kgSpAlias.kgID (via knownGene.name)
      mm10.kgTargetAli.qName (via knownGene.name)
      mm10.kgTxInfo.name (via knownGene.name)
      mm10.kgXref.kgID (via knownGene.name)
      mm10.knownAttrs.kgID (via knownGene.name)
      mm10.knownBlastTab.query (via knownGene.name)
      mm10.knownBlastTab.target (via knownGene.name)
      mm10.knownCanonical.transcript (via knownGene.name)
      mm10.knownCds.name (via knownGene.name)
      mm10.knownGeneMrna.name (via knownGene.name)
      mm10.knownGenePep.name (via knownGene.name)
      mm10.knownIsoforms.transcript (via knownGene.name)
      mm10.knownToEnsembl.name (via knownGene.name)
      mm10.knownToKeggEntrez.name (via knownGene.name)
      mm10.knownToLocusLink.name (via knownGene.name)
      mm10.knownToLynx.name (via knownGene.name)
      mm10.knownToMrna.name (via knownGene.name)
      mm10.knownToMrnaSingle.name (via knownGene.name)
      mm10.knownToPfam.name (via knownGene.name)
      mm10.knownToRefSeq.name (via knownGene.name)
      mm10.knownToSuper.gene (via knownGene.name)
      mm10.knownToTag.name (via knownGene.name)
      mm10.knownToVisiGene.name (via knownGene.name)
      mm10.knownToWikipedia.name (via knownGene.name)
      mm10.rnBlastTab.query (via knownGene.name)
      mm10.scBlastTab.query (via knownGene.name)
      mm10.ucscScop.ucscId (via knownGene.name)
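
Each entry above names a field that can be matched against knownGene.name. The following is a minimal sketch of such a join for kgXref.kgID, using Python's built-in sqlite3 module and toy in-memory tables rather than the real database. The geneSymbol column and the symbol value inserted below are assumptions for illustration, and the real tables hold more columns than shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy versions of the two tables, reduced to a few columns.
conn.executescript(
    """
    CREATE TABLE knownGene (name TEXT, chrom TEXT, strand TEXT,
                            txStart INTEGER, txEnd INTEGER);
    CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
    """
)
conn.executemany(
    "INSERT INTO knownGene VALUES (?, ?, ?, ?, ?)",
    [
        ("ENSMUST00000193812.1", "chr1", "+", 3073252, 3074322),
        ("ENSMUST00000070533.4", "chr1", "-", 3214481, 3671498),
    ],
)
# Illustrative cross-reference row; the symbol value is an assumption.
conn.executemany(
    "INSERT INTO kgXref VALUES (?, ?)",
    [("ENSMUST00000070533.4", "Xkr4")],
)

# Join on the documented key pair: kgXref.kgID = knownGene.name.
rows = conn.execute(
    """
    SELECT g.name, x.geneSymbol, g.chrom, g.txStart, g.txEnd
    FROM knownGene AS g
    JOIN kgXref AS x ON x.kgID = g.name
    ORDER BY g.txStart
    """
).fetchall()

for r in rows:
    print(r)
```

An inner join drops transcripts that have no matching cross-reference row. A LEFT JOIN would keep them with an empty symbol.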

Sample Rows
 
name                  chrom  strand  txStart  txEnd    cdsStart  cdsEnd   exonCount  exonStarts                exonEnds                  proteinID  alignID
ENSMUST00000193812.1  chr1   +       3073252  3074322  3073252   3073252  1          3073252,                  3074322,                             uc287gdb.1
ENSMUST00000082908.1  chr1   +       3102015  3102125  3102015   3102015  1          3102015,                  3102125,                             uc287gdc.1
ENSMUST00000162897.1  chr1   -       3205900  3216344  3205900   3205900  2          3205900,3213608,          3207317,3216344,                     uc287gdd.1
ENSMUST00000159265.1  chr1   -       3206522  3215632  3206522   3206522  2          3206522,3213438,          3207317,3215632,                     uc007aet.2
ENSMUST00000070533.4  chr1   -       3214481  3671498  3216021   3671348  3          3214481,3421701,3670551,  3216968,3421901,3671498,  Q5GH67     uc007aeu.1
ENSMUST00000195335.1  chr1   -       3365730  3368549  3365730   3365730  1          3365730,                  3368549,                             uc287gdf.1
ENSMUST00000192336.1  chr1   -       3375555  3377788  3375555   3375555  1          3375555,                  3377788,                             uc287gdg.1
ENSMUST00000194099.1  chr1   -       3464976  3467285  3464976   3464976  1          3464976,                  3467285,                             uc287gdh.1
ENSMUST00000161581.1  chr1   +       3466586  3513553  3466586   3466586  2          3466586,3513404,          3466687,3513553,                     uc287gdi.1
ENSMUST00000192973.1  chr1   -       3512450  3514507  3512450   3512450  1          3512450,                  3514507,                             uc287gdj.1

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.
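
As a small worked example of the note above, the sketch below converts a stored (start, end) pair into a 1-based position string. It assumes the usual half-open convention, in which the start is 0-based and the end is exclusive, so only the start is shifted. The function names are hypothetical.

```python
def to_one_based_position(chrom: str, start0: int, end: int) -> str:
    """Convert a 0-based, half-open interval to a 1-based, closed position string."""
    # Only the start moves: a 0-based exclusive end equals a 1-based inclusive end.
    return f"{chrom}:{start0 + 1}-{end}"


def interval_length(start0: int, end: int) -> int:
    """Length in bases of a 0-based, half-open interval."""
    return end - start0


# txStart and txEnd of the first sample row on this page.
print(to_one_based_position("chr1", 3073252, 3074322))  # chr1:3073253-3074322
print(interval_length(3073252, 3074322))                # 1070
```

With half-open intervals the length is simply end minus start, with no "+1" correction.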

GENCODE VM23 (knownGene) Track Description
 

Description

The GENCODE track is composed of all the gene models in the GENCODE VM23 release. By default, only the basic gene set is displayed, which is a subset of the comprehensive gene set. The basic set represents transcripts that GENCODE believes will be useful to the majority of users. The track includes protein-coding genes, non-coding RNA genes, and pseudogenes, though pseudogenes are not displayed by default.

The following table provides statistics for the VM23 release, derived from the GTF file that contains annotations only on the main chromosomes. More information on how these statistics were generated can be found on the GENCODE site.

GENCODE VM23 Release Stats
Genes                                         Observed  Transcripts                           Observed
Protein-coding genes                          21,849    Protein-coding transcripts            59,188
Long non-coding RNA genes                     13,201    - full length protein-coding          45,391
Small non-coding RNA genes                    6,108     - partial length protein-coding       13,797
Pseudogenes                                   13,681    Nonsense mediated decay transcripts   7,200
Immunoglobulin/T-cell receptor gene segments  700       Long non-coding RNA loci transcripts  18,339

For more information on the different gene tracks, see our Genes FAQ.

Display Conventions and Configuration

By default, this track displays only the basic GENCODE set, splice variants, and non-coding genes. It includes options to display the comprehensive GENCODE set and pseudogenes. To customize these options, the respective boxes can be checked or unchecked at the top of this description page. Our FAQ includes examples of how to display a single transcript per gene and how to switch between the basic and comprehensive gene sets.

This track also includes a variety of labels which identify the transcripts when visibility is set to "full" or "pack". Gene symbols (e.g. NIPA1) are displayed by default, but additional options include GENCODE Transcript ID (ENSMUST00000052204.5), UCSC Known Gene ID (uc009hdu.2), and UniProt Display ID (Q8BHK1). Additional information about gene and transcript names can be found in our FAQ.

This track, in general, follows the display conventions for gene prediction tracks. The exons for putative non-coding genes and untranslated regions are represented by relatively thin blocks, while those for coding open reading frames are thicker. The following color key is used:

  • Black -- feature has a corresponding entry in the Protein Data Bank (PDB)
  • Dark blue -- transcript has been reviewed or validated by either the RefSeq or SwissProt staff
  • Medium blue -- other RefSeq transcripts
  • Light blue -- non-RefSeq transcripts
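
The thin and thick blocks described above can be derived from the exon and CDS fields of a row. The following is a minimal sketch, not the browser's own drawing code: it intersects each exon with the coding region and reports the coding parts as "thick" and the remainder as "thin". It assumes 0-based half-open intervals and treats cdsStart equal to cdsEnd as "no coding region", which matches the sample rows that have no proteinID. The function name is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def split_thin_thick(
    exon_starts: List[int],
    exon_ends: List[int],
    cds_start: int,
    cds_end: int,
) -> Tuple[List[Interval], List[Interval]]:
    """Return (thin, thick) blocks: non-coding and coding parts of each exon."""
    thin: List[Interval] = []
    thick: List[Interval] = []
    for start, end in zip(exon_starts, exon_ends):
        # Overlap of this exon with the coding region [cds_start, cds_end).
        c_start, c_end = max(start, cds_start), min(end, cds_end)
        if c_start < c_end:
            if start < c_start:
                thin.append((start, c_start))   # untranslated part before the CDS
            thick.append((c_start, c_end))
            if c_end < end:
                thin.append((c_end, end))       # untranslated part after the CDS
        else:
            thin.append((start, end))           # exon has no coding overlap
    return thin, thick


# Values from the sample row for ENSMUST00000070533.4 on this page.
thin, thick = split_thin_thick(
    [3214481, 3421701, 3670551],
    [3216968, 3421901, 3671498],
    3216021,
    3671348,
)
print("thin: ", thin)
print("thick:", thick)
```

The split is done in genomic order and ignores strand, so on a minus-strand transcript the first thin block is the 3' UTR rather than the 5' UTR.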

This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. There is also an option to display the data as a density graph, which can be helpful for visualizing the distribution of items over a region.

Methods

All the GENCODE genes in the comprehensive set are downloaded from the GENCODE website. Data from other sources are correlated with the GENCODE data to build the knownTo_ tables.

Related Data

The GENCODE Genes transcripts are annotated in numerous tables. These include tables that link GENCODE Genes transcripts to external datasets (such as knownToLocusLink, which maps GENCODE Genes transcripts to Entrez identifiers, previously known as Locus Link identifiers), and tables that detail some property of GENCODE Genes transcript sequences (such as knownToPfam, which identifies any Pfam domains found in the GENCODE Genes protein-coding transcripts).

One can see a full list of the associated tables by clicking the View table schema link at the top of the page, or in the Table Browser by selecting GENCODE Genes from the track menu; this list is then available on the table menu. Note that some of these tables refer to GENCODE Genes by its table name knownGene, sometimes abbreviated as known or kg. While the complete set of annotation tables is too long to describe, some of the more important tables are described below.

  • kgXref identifies the RefSeq, SwissProt, Rfam, or tRNA sequences (if any) which are associated with each transcript.
  • knownToRefSeq identifies the RefSeq transcript that each GENCODE Genes transcript is most closely associated with. That RefSeq transcript is the RefSeq transcript that the GENCODE Genes transcript overlaps at the most bases.
  • knownGeneMrna contains the genomic sequence for each of the GENCODE Genes models. This may not be the same as the actual mRNA used to validate the gene model.
  • knownGenePep contains the protein sequences derived from the knownGeneMrna transcript sequences. Any protein-level annotations, such as the contents of the knownToPfam table, are based on these sequences.
  • knownIsoforms maps each transcript to a cluster ID, which identifies a cluster of isoforms of the same gene.
  • knownCanonical identifies the canonical isoform of each cluster ID or gene using the Ensembl gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.
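
The fallback order described for knownCanonical can be sketched as a three-tier selection. This is an illustration only, not the UCSC pipeline. The Isoform record and its flags are hypothetical, and breaking ties within the APPRIS and BASIC tiers by length is an assumption, since the text above does not say how a single transcript is picked within those tiers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Isoform:
    name: str
    length: int                       # transcript length in bases (hypothetical field)
    is_appris_principal: bool = False
    is_basic: bool = False


def pick_canonical(cluster: List[Isoform]) -> Isoform:
    """Pick one isoform: APPRIS principal first, then BASIC, then any isoform."""
    if not cluster:
        raise ValueError("cluster must contain at least one isoform")
    tiers = (
        [t for t in cluster if t.is_appris_principal],
        [t for t in cluster if t.is_basic],
        list(cluster),
    )
    for tier in tiers:
        if tier:
            # Assumption: within a tier, prefer the longest isoform.
            return max(tier, key=lambda t: t.length)
    raise AssertionError("unreachable: the last tier is never empty")


# Hypothetical cluster with made-up names and lengths.
cluster = [
    Isoform("tx_long", 5000),
    Isoform("tx_basic", 3000, is_basic=True),
    Isoform("tx_appris", 2000, is_appris_principal=True, is_basic=True),
]
print(pick_canonical(cluster).name)
```

In this toy cluster the APPRIS-tagged isoform wins even though it is the shortest, because length only matters within a tier.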

Data access

GENCODE Genes and its associated tables can be explored interactively using the Table Browser or the Data Integrator. The data are also all available as downloadable files. For example, if you would like to download the entire GENCODE Genes set as seen in the View table schema page, the knownGene.txt.gz file in the downloads directory contains a compressed version of the data. All the tables can also be queried directly from our public MySQL servers. Information on accessing this data through MySQL can be found on our help page as well as on our blog.
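
As one way to work with the downloadable file, the sketch below streams a gzip-compressed table dump and counts transcripts per chromosome. It assumes the file is tab-separated with columns in the schema order shown on this page, so that chrom is the second column. The function name is hypothetical, and the local file path is whatever the file was saved as.

```python
import gzip
from collections import Counter


def count_transcripts_per_chrom(path: str) -> Counter:
    """Count rows per chromosome in a gzip-compressed, tab-separated dump."""
    counts: Counter = Counter()
    # "rt" opens the compressed file in text mode, one row per line.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            counts[fields[1]] += 1  # chrom is the 2nd column in the schema
    return counts
```

For a locally saved copy, count_transcripts_per_chrom("knownGene.txt.gz").most_common(5) would list the five chromosomes with the most transcripts. Reading line by line keeps memory use low.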

Credits

The GENCODE Genes track was produced at UCSC from the GENCODE comprehensive gene set using a computational pipeline developed by Jim Kent and Brian Raney.

References

Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012 Sep;22(9):1760-74. PMID: 22955987; PMC: PMC3431492

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9. PMID: 16925838; PMC: PMC1810553

A full list of GENCODE publications is available at The GENCODE Project web site.

Data Release Policy

GENCODE data are available for use without restrictions.