This full release incorporates genomic, transcript, and protein data available as of May 13, 2019, and contains 200,311,267 records, including 141,839,334 proteins, 26,534,602 RNAs, and sequences from 91,873 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.
New RefSeq attributes for human MANE and RefSeq Select transcripts
We recently announced that NCBI and EBI have released a matched annotation set (MANE v0.5) of representative transcripts for human protein-coding genes as part of the MANE (Matched Annotation from NCBI and EBI) project. This collaboration aims to identify a single representative or “Select” transcript for each protein-coding human gene, and to update the RefSeq and Ensembl annotations for these transcripts so that they match. The project will continue to match additional transcripts between the two datasets.
We have now added markup for this dataset to the RefSeq transcript and protein records, as KEYWORDS and RefSeq-Attributes:
- “MANE Select” – representative transcripts that are matched between RefSeq and Ensembl [Example: NM_001238]
- “RefSeq Select” – additional representative transcripts for other loci that have not yet been matched [Example: NM_000572]
- “MANE Ensembl match” – the matching Ensembl transcript and protein identifiers [Example: NM_001238]
- “RefSeq Select criteria” – the criteria by which the transcript was chosen as Select by RefSeq processing [Example: NM_000572]
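As an illustration of how the new KEYWORDS markup might be used, the following is a minimal Python sketch that reads the KEYWORDS field from a GenBank-format flat-file record and reports whether the record is tagged “MANE Select” or “RefSeq Select”. The function names (`parse_keywords`, `select_status`) and the sample record are hypothetical and are not taken from any NCBI tool or real record; the sketch assumes keywords are separated by semicolons and the field ends with a period.

```python
def parse_keywords(flatfile_text):
    """Return the keywords from the KEYWORDS field of one flat-file record."""
    collected = []
    in_keywords = False
    for line in flatfile_text.splitlines():
        if line.startswith("KEYWORDS"):
            in_keywords = True
            collected.append(line[len("KEYWORDS"):].strip())
        elif in_keywords and line.startswith(" "):
            # Continuation lines of a field are indented.
            collected.append(line.strip())
        elif in_keywords:
            # The next field label ends the KEYWORDS field.
            break
    joined = " ".join(collected).rstrip(".")
    return [kw.strip() for kw in joined.split(";") if kw.strip()]


def select_status(keywords):
    """Return 'MANE Select', 'RefSeq Select', or None for a keyword list."""
    for tag in ("MANE Select", "RefSeq Select"):
        if tag in keywords:
            return tag
    return None


# Hypothetical record fragment, for illustration only (not a real entry).
SAMPLE_RECORD = """\
LOCUS       EXAMPLE_1               1000 bp    mRNA    linear   PRI 13-MAY-2019
DEFINITION  Hypothetical example record.
KEYWORDS    RefSeq; MANE Select.
SOURCE      Homo sapiens (human)
"""

if __name__ == "__main__":
    print(select_status(parse_keywords(SAMPLE_RECORD)))
```

Matching whole keywords rather than searching for a substring avoids confusing the “MANE Select” keyword with other text that merely contains the word “Select”.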
Evidence information for prokaryotic RefSeq protein names
We are now providing information on how we determined the names for the non-redundant prokaryotic RefSeq proteins (WP_ accession prefix). A new comment on the record, “Evidence-For-Name-Assignment”, contains the curated evidence used to assert the protein name. Curated evidence includes protein family hidden Markov models (TIGRFAMs, NCBI HMMs, Pfams), BlastRules, or CDD architectures. [Examples: WP_137287854.1 and WP_130206135.1]
Over the next two months, we plan to add this comment to all WP_ proteins whose names were assigned based on curated evidence. We will also add links to pages with more information on the evidence and the matching proteins.
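For readers who want to extract this comment programmatically, here is a rough Python sketch. It assumes the comment follows the usual GenBank structured-comment layout: a `##Name-START##` line, `key :: value` lines, and a `##Name-END##` line. That layout is an assumption, not something stated in this announcement. The function name `parse_structured_comment` is hypothetical, and the field names and values in the sample are placeholders rather than content copied from a real record.

```python
def parse_structured_comment(comment_text, name):
    """Return the key/value pairs of one named structured comment block."""
    start = f"##{name}-START##"
    end = f"##{name}-END##"
    fields = {}
    inside = False
    for raw in comment_text.splitlines():
        line = raw.strip()
        if line == start:
            inside = True
        elif line == end:
            break
        elif inside and "::" in line:
            # Split on the first '::' only, so values may contain '::'.
            key, value = line.split("::", 1)
            fields[key.strip()] = value.strip()
    return fields


# Placeholder comment text, for illustration only (not from a real record).
SAMPLE_COMMENT = """\
REFSEQ: placeholder introductory text.

##Evidence-For-Name-Assignment-START##
Evidence Category  :: HMM
Evidence Accession :: EXAMPLE0001
##Evidence-For-Name-Assignment-END##
"""

if __name__ == "__main__":
    evidence = parse_structured_comment(
        SAMPLE_COMMENT, "Evidence-For-Name-Assignment"
    )
    for key, value in evidence.items():
        print(f"{key}: {value}")
```

The function returns an empty dictionary when the named block is absent, which is the expected result for records that have not yet received the comment.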
Updated assembly for pathogenic budding yeast, [Candida] auris
The RefSeq genome sequence for the emerging drug-resistant fungal pathogen [Candida] auris has been updated to the assembly GCA_002775015.1, from strain B11221. This assembly is of higher quality than the previous genome assembly from C. auris (GCA_001189475.1) represented in RefSeq. The new RefSeq genome assembly is also one of the C. auris reference assemblies used by the CDC.