RefSeq release 94 with MANE and RefSeq Select markup, protein name evidence, and improved [Candida] auris assembly


RefSeq release 94 is now available through NCBI web services, FTP, and NCBI’s Entrez programming utilities (E-utilities).

This full release incorporates genomic, transcript, and protein data available as of May 13, 2019, and contains 200,311,267 records, including 141,839,334 proteins, 26,534,602 RNAs, and sequences from 91,873 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.

Special announcements:

New RefSeq attributes for human MANE and RefSeq Select transcripts

We recently announced that NCBI and EBI have released a matched annotation set (MANE v0.5) of representative transcripts for human protein-coding genes, as part of the MANE (Matched Annotation from NCBI and EBI) project. This collaboration aims to identify a single representative or “Select” transcript for each protein-coding human gene, and to update RefSeq and Ensembl annotation for these so they match. The project will continue to match additional transcripts between the two datasets.

We have now added markup to the RefSeq transcript and protein records for this dataset as KEYWORDS and RefSeq-Attributes.

KEYWORDS:

  • “MANE Select” – representative transcripts that are matched between RefSeq and Ensembl [Example: NM_001238]
  • “RefSeq Select” – additional representative transcripts for other loci that have not yet been matched [Example: NM_000572]

RefSeq-Attributes:

  • “MANE Ensembl match” – the matching Ensembl transcript and protein identifiers [Example: NM_001238]
  • “RefSeq Select criteria” – the criteria by which the transcript was chosen as Select by RefSeq processing [Example: NM_000572]
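
To check these keywords programmatically, one option is to fetch a record's GenBank flat file through E-utilities and read its KEYWORDS field. The following is a minimal sketch, not an official NCBI client: the helper names (`efetch_url`, `parse_keywords`, `select_status`) and the sample record text with its accession are hypothetical, and the parser assumes the usual flat-file layout in which keywords are separated by semicolons and the field ends with a period.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore"):
    """Build an efetch URL that requests a GenBank flat file as plain text."""
    params = {"db": db, "id": accession, "rettype": "gb", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_keywords(flatfile_text):
    """Return the entries of the KEYWORDS field as a list of strings."""
    collected = []
    in_keywords = False
    for line in flatfile_text.splitlines():
        if line.startswith("KEYWORDS"):
            in_keywords = True
            collected.append(line[len("KEYWORDS"):].strip())
        elif in_keywords and line.startswith(" "):
            # Continuation lines of a wrapped field are indented.
            collected.append(line.strip())
        elif in_keywords:
            break  # The next field has started.
    joined = " ".join(collected).rstrip(".")
    return [k.strip() for k in joined.split(";") if k.strip()]


def select_status(flatfile_text):
    """Return "MANE Select", "RefSeq Select", or None for a record."""
    keywords = parse_keywords(flatfile_text)
    for label in ("MANE Select", "RefSeq Select"):
        if label in keywords:
            return label
    return None


# Hypothetical, heavily shortened record used only to exercise the parser.
SAMPLE = (
    "LOCUS       XX_000000\n"
    "DEFINITION  Illustrative record, not real data.\n"
    "KEYWORDS    RefSeq; MANE Select.\n"
    "SOURCE      Homo sapiens (human)\n"
)

if __name__ == "__main__":
    print(efetch_url("NM_001238"))
    print(select_status(SAMPLE))
```

In practice the text returned from the efetch URL would be passed to `select_status` in place of `SAMPLE`.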

Evidence information for prokaryotic RefSeq protein names

We now provide information on how we determined the names for the non-redundant prokaryotic RefSeq proteins (WP_ accession prefix). A new comment on the record, “Evidence-For-Name-Assignment”, contains the curated evidence used to assert the protein name. Curated evidence includes protein family hidden Markov models (TIGRFAMs, NCBI HMMs, Pfams), BlastRules, or CDD architectures. [Examples: WP_137287854.1 and WP_130206135.1]
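
For readers who want to pull this comment out of downloaded records, here is a rough sketch of a parser. It rests on an assumption this post does not spell out: that the comment appears in the flat file as a structured comment bracketed by `##Evidence-For-Name-Assignment-START##` and `##Evidence-For-Name-Assignment-END##` markers, with one `key :: value` pair per line. The function name, the field names, and every value in the sample text are hypothetical placeholders.

```python
def parse_structured_comment(flatfile_text, name):
    """Collect the key/value pairs of one structured comment block.

    Assumes the block is delimited by ##<name>-START## and ##<name>-END##
    and that each field is written as "key :: value" on its own line.
    Returns an empty dict when the block is absent.
    """
    start = "##" + name + "-START##"
    end = "##" + name + "-END##"
    fields = {}
    inside = False
    for raw in flatfile_text.splitlines():
        line = raw.strip()
        # The first comment line carries the COMMENT field label.
        if line.startswith("COMMENT"):
            line = line[len("COMMENT"):].strip()
        if line == start:
            inside = True
            continue
        if line == end:
            break
        if inside and "::" in line:
            key, value = line.split("::", 1)
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical sample; field names and values are placeholders, not real data.
SAMPLE_COMMENT = (
    "COMMENT     Illustrative protein record.\n"
    "            ##Evidence-For-Name-Assignment-START##\n"
    "            Evidence Category  :: EXAMPLE_CATEGORY\n"
    "            Evidence Accession :: EXAMPLE_0001\n"
    "            ##Evidence-For-Name-Assignment-END##\n"
    "FEATURES             Location/Qualifiers\n"
)

if __name__ == "__main__":
    evidence = parse_structured_comment(
        SAMPLE_COMMENT, "Evidence-For-Name-Assignment"
    )
    for key, value in evidence.items():
        print(key, "=", value)
```

Returning a plain dictionary keeps the sketch independent of whichever fields the comment actually carries, so it should need no changes if the set of fields differs from the placeholders used here.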

Over the next two months, we plan to add this comment to all WP_ proteins that were named based on curated evidence. We will also link to pages with more information on the evidence and the matching proteins.

Updated assembly for pathogenic budding yeast, [Candida] auris

The RefSeq genome sequence for the emerging drug-resistant fungal pathogen [Candida] auris has been updated to the assembly GCA_002775015.1, from strain B11221. This assembly is of higher quality than the previous C. auris genome assembly (GCA_001189475.1) represented in RefSeq. The new RefSeq genome assembly is also one of the C. auris reference assemblies used by the CDC.
