Complete RefSeq genome annotation results represented in the UCSC Genome Browser

NCBI’s RefSeq project provides comprehensive annotation of the human and other eukaryotic genomes through a combination of curation and an evidence-based eukaryotic genome annotation pipeline. Our curated records, ‘Known RefSeqs’, can be identified by the accession prefix (NM_, NR_, NG_, NP_). Model RefSeq records (XM_, XR_, and XP_ accession prefixes) are predicted based on transcript evidence (RNA-Seq and more) and protein support from Known RefSeqs, Swiss-Prot, and select INSDC records.
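The prefix convention above is easy to apply in code. The following is a minimal sketch, not an NCBI tool: the function name `classify_refseq` and the labels it returns are hypothetical, and the example accessions in the usage note are illustrative only.

```python
# Accession prefixes as listed in the paragraph above.
KNOWN_PREFIXES = ("NM_", "NR_", "NG_", "NP_")   # curated "Known RefSeqs"
MODEL_PREFIXES = ("XM_", "XR_", "XP_")          # predicted "Model RefSeqs"


def classify_refseq(accession: str) -> str:
    """Return 'known', 'model', or 'other' based on the accession prefix.

    The version suffix (e.g. '.6') does not affect the result.
    """
    acc = accession.strip().upper()
    if acc.startswith(KNOWN_PREFIXES):
        return "known"
    if acc.startswith(MODEL_PREFIXES):
        return "model"
    # Any prefix not named in the text above falls through here.
    return "other"
```

For example, `classify_refseq("NM_000546.6")` returns `"known"`, while an accession beginning with `XM_` returns `"model"`.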

We recognize that many scientists access genome annotation data from one of three sources – NCBI, Ensembl, or UCSC. NCBI provides access to the human (and other) genome annotation results in the Genome Data Viewer, by BLAST and FTP, and per gene in NCBI’s Gene resource. Ensembl provides RefSeq annotation information based directly on the FTP content that NCBI releases. In the past, UCSC has provided a partial dataset of RefSeq human genome annotation content by aligning Known RefSeq transcripts to the genome using BLAT. With this approach, additional model RefSeq transcript variants, non-transcribed pseudogenes, and immunoglobulin and T-cell receptor regions were not available through UCSC services. In rare cases, the independent alignment method produced small differences in exon structure compared to NCBI’s placement details, as well as ambiguous placements for transcripts originating from very similar paralogs that are uniquely placed within the NCBI dataset.

We are very pleased to announce the availability of the complete RefSeq human genome annotation product for the GRCh38 assembly in the University of California, Santa Cruz (UCSC) Genome Browser. NCBI and UCSC staff have worked closely to define an improved data exchange process, and NCBI now provides RefSeq genome annotation and alignment data directly so that the UCSC Genome Browser reflects the RefSeq product more completely. This resolves issues of incomplete data and conflicting placement details between UCSC displays and NCBI displays.

This initial release is for the human reference genome (GRCh38) and does not include NCBI RefSeq annotation for GRCh38 patches added since the initial GRCh38 release. We anticipate working with UCSC to expand the number of organisms covered in the future.

NCBI-provided RefSeq data is included in the “NCBI RefSeq” composite track. For the following tracks, the alignments and coordinates are provided by RefSeq:
• RefSeq All – curated and predicted transcript annotations
• RefSeq Curated – curated annotations (transcripts with NM_ and NR_ accessions)
• RefSeq Predicted – predicted annotations (transcripts with XM_ and XR_ accessions)
• RefSeq Other – annotations not included in RefSeq All such as pseudogenes or other loci
• RefSeq Alignments – alignments of transcripts to the genome provided by RefSeq

By default, only the “RefSeq Curated” subtrack is activated within the “NCBI RefSeq” track, but you may wish to activate the other subtracks to view the complete dataset.
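The subtracks can also be retrieved programmatically. The sketch below builds a request URL for the UCSC Genome Browser REST API. Nothing in it comes from this post: the endpoint, the `hg38` database name for GRCh38, and the table names mapped to each subtrack are assumptions that should be checked against UCSC's own documentation, and the helper names and example coordinates are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint of the UCSC REST API (not stated in this post).
API_BASE = "https://api.genome.ucsc.edu/getData/track"

# Assumed UCSC table names for the subtracks listed above.
SUBTRACK_TABLES = {
    "RefSeq All": "ncbiRefSeq",
    "RefSeq Curated": "ncbiRefSeqCurated",
    "RefSeq Predicted": "ncbiRefSeqPredicted",
    "RefSeq Other": "ncbiRefSeqOther",
    "RefSeq Alignments": "ncbiRefSeqPsl",
}


def build_track_url(subtrack, chrom, start, end, genome="hg38"):
    """Build a query URL for one subtrack over a genomic region."""
    table = SUBTRACK_TABLES[subtrack]  # raises KeyError for unknown names
    params = [
        ("genome", genome),
        ("track", table),
        ("chrom", chrom),
        ("start", start),
        ("end", end),
    ]
    # Parameters are joined with semicolons, as in UCSC's API examples.
    query = ";".join(f"{key}={quote(str(value))}" for key, value in params)
    return f"{API_BASE}?{query}"


def fetch_track(subtrack, chrom, start, end, genome="hg38"):
    """Fetch the region as parsed JSON (requires network access)."""
    url = build_track_url(subtrack, chrom, start, end, genome)
    with urlopen(url) as response:
        return json.load(response)
```

Calling `build_track_url("RefSeq Curated", "chr17", 7661779, 7687550)` only constructs the URL, so it can be inspected before any request is sent; `fetch_track` performs the actual download.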

A huge thank you to the UCSC Genome Browser staff for adding RefSeq annotation as provided by NCBI.
