Complete RefSeq genome annotation results represented in the UCSC Genome Browser


NCBI’s RefSeq project provides comprehensive annotation of the human and other eukaryotic genomes through a combination of curation and an evidence-based eukaryotic genome annotation pipeline. Our curated records, ‘Known RefSeqs’, can be identified by the accession prefix (NM_, NR_, NG_, NP_). Model RefSeq records (XM_, XR_, and XP_ accession prefixes) are predicted based on transcript evidence (RNA-Seq and more) and protein support from Known RefSeqs, Swiss-Prot, and select INSDC records.
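The prefix convention above lends itself to a simple programmatic check. The following is a minimal Python sketch that sorts an accession into the Known or Model category using only the prefixes listed in this post. The function name `classify_refseq_accession`, the category labels, and the sample accession strings are illustrative assumptions, not part of any NCBI tool.

```python
# Prefixes as listed in this post; other RefSeq prefixes exist and
# are reported here simply as "other".
KNOWN_PREFIXES = ("NM_", "NR_", "NG_", "NP_")   # curated 'Known RefSeqs'
MODEL_PREFIXES = ("XM_", "XR_", "XP_")          # predicted 'Model RefSeqs'


def classify_refseq_accession(accession: str) -> str:
    """Return 'known', 'model', or 'other' based on the accession prefix.

    Hypothetical helper for illustration only.
    """
    acc = accession.strip().upper()
    if acc.startswith(KNOWN_PREFIXES):
        return "known"
    if acc.startswith(MODEL_PREFIXES):
        return "model"
    return "other"


if __name__ == "__main__":
    # Sample strings are placeholders chosen to show each branch.
    for acc in ["NM_000000.1", "XR_000000.1", "ABC123"]:
        print(acc, classify_refseq_accession(acc))
```

A check like this can be useful when filtering a mixed list of transcripts before comparing curated and predicted annotations.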

We recognize that many scientists access genome annotation data from one of three sources – NCBI, Ensembl, or UCSC. NCBI provides access to the human (and other) genome annotation results in the Genome Data Viewer, by BLAST and FTP, and per gene in NCBI’s Gene resource. Ensembl provides RefSeq annotation information based directly on the FTP content that NCBI releases. In the past, UCSC provided a partial dataset of RefSeq human genome annotation content by aligning Known RefSeq transcripts to the genome using BLAT. With this approach, additional model RefSeq transcript variants, non-transcribed pseudogenes, and immunoglobulin and T-cell receptor regions were not available through UCSC services. In rare cases, the independent alignment method produced small differences in exon structure compared to NCBI’s placement details, as well as some ambiguous placements for transcripts originating from very similar paralogs that are uniquely placed within the NCBI dataset.

We are very pleased to announce the availability of the complete RefSeq human genome annotation product for the GRCh38 assembly in the University of California, Santa Cruz (UCSC) Genome Browser. NCBI and UCSC staff have worked closely to define an improved data exchange process, and NCBI is now providing RefSeq genome annotation and alignment data so that the UCSC Genome Browser reflects the RefSeq product more completely. This resolves issues of incomplete data and conflicting placement details between UCSC displays and NCBI displays.

This initial release is for the human reference genome (GRCh38) and does not include NCBI RefSeq annotation for GRCh38 patches added since the initial GRCh38 release. We anticipate working with UCSC to expand the number of organisms covered in the future.

NCBI-provided RefSeq data is included in the “NCBI RefSeq” composite track. For the following tracks, the alignments and coordinates are provided by RefSeq:
• RefSeq All – curated and predicted transcript annotations
• RefSeq Curated – curated annotations (transcripts with NM_ and NR_ accessions)
• RefSeq Predicted – predicted annotations (transcripts with XM_ and XR_ accessions)
• RefSeq Other – annotations not included in RefSeq All, such as pseudogenes or other loci
• RefSeq Alignments – alignments of transcripts to the genome provided by RefSeq

By default, only the “RefSeq Curated” subtrack is activated within the “NCBI RefSeq” track, but you may wish to activate the other subtracks to view the complete dataset.

A huge thank you to the UCSC Genome Browser staff for adding RefSeq annotation as provided by NCBI.
