RefSeq Release 215

RefSeq release 215 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of November 7, 2022, and contains 335,372,031 records, including 244,583,657 proteins and sequences from 125,116 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
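For readers who script their access, the sketch below shows one way to build an E-utilities EFetch request for a single RefSeq record. The helper name `build_efetch_url` and the accession `NM_000546` are illustrative choices, not part of the release announcement. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Return an EFetch URL for one accession (hypothetical helper).

    No network request is made here; pass the result to the HTTP
    client of your choice.
    """
    params = {
        "db": db,             # Entrez database, e.g. "nuccore" or "protein"
        "id": accession,      # RefSeq accession such as NM_... or NP_...
        "rettype": rettype,   # record format to return
        "retmode": "text",    # plain-text output
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# NM_000546 is used purely as an illustrative accession.
url = build_efetch_url("NM_000546")
print(url)
```

For protein records, the same helper could be called with `db="protein"` and a protein accession.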

Changes to checksum files

To better support our RefSeq FTP Release users, we are improving the checksum files available to verify successful downloads. For the bi-monthly and weekly release files, we have switched from CRC to MD5 checksums. Using MD5 checksums provides consistency with the checksum files being added for the GenBank FTP release files. The checksums are available in the “release###.files.installed” file in the release-catalog directory. We have also added MD5 checksums for the daily release files (see example here).
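As a minimal sketch of how a download might be verified against these checksums, the Python below computes a file's MD5 and compares it with a listed value. The layout of the checksum file is an assumption here (one entry per line, checksum first and file name last, separated by whitespace). Check the actual file and adjust the parser if it differs. The function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Map file name -> expected MD5.

    ASSUMPTION: each line looks like '<md5> <file name>'. This layout is
    not confirmed by the announcement; adapt it to the real file.
    """
    checksums = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            checksums[parts[-1]] = parts[0].lower()
    return checksums


def verify_download(path, name, checksums):
    """Return True only if the file is listed and its MD5 matches."""
    expected = checksums.get(name)
    return expected is not None and md5_of_file(path) == expected
```

A file that is missing from the checksum list is reported as unverified rather than silently accepted, which is usually the safer default for bulk downloads.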

Human CCDS Release 24

An updated dataset of human protein-coding regions from the Consensus Coding Sequence (CCDS) collaboration, CCDS Release 24, is now available. This CCDS set was generated by comparing RefSeq Annotation Release 110 and Ensembl Release 108. This update adds 2,746 new CCDS IDs and 237 new genes compared to the last human CCDS build (Release 22, 2018). CCDS Release 24 includes a total of 35,608 CCDS IDs that correspond to 19,107 genes, with 48,062 protein sequences from RefSeq and 47,762 from Ensembl.

New eukaryotic genome annotations

This release includes new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 30 species, including:

  • Pere David’s macaque annotation release 100, based on new assembly ASM2454274v1 (GCF_024542745.1)
  • Golden spiny mouse annotation release 100, based on new assembly mAcoRus1.1 (GCF_903995435.1)
  • California condor annotation release 100, based on updated assembly ASM1813914v2 (GCF_018139145.2)
  • European seabass annotation release 100, based on new assembly dlabrax2021 (GCF_905237075.1)
  • Pea annotation release 100, based on new assembly CAAS_Psat_ZW6_1.0 (GCF_024323335.1)
  • Fall armyworm annotation release 102, based on updated assembly AGI-APGP_CSIRO_Sfru_2.0 (GCF_023101765.2)

Increase in size of ASN.1 files

Currently, the size of uncompressed ASN.1 files in the release is capped at 500 MB. Starting in January 2023, the limit will be increased to 2 GB per ASN.1 file. This change will reduce the total number of files in the release.

Update of prokaryote phylum names

As previously announced, NCBI Taxonomy will begin updating phylum names for prokaryotes in January 2023. Informal phylum names in long use (e.g., Firmicutes, Proteobacteria) will be changed to newly formalized names (e.g., Bacillota, Pseudomonadota, respectively). This update affects over 40 NCBI TaxIDs at phylum rank. The rollout will take several weeks to complete. Note that the flatfiles in the next RefSeq release (January 2023) may contain a partial update of phylum names.
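Because flatfiles may carry a mix of old and new phylum names during the rollout, downstream scripts that compare or group by phylum may want to normalize names first. The sketch below illustrates the idea with only the two renames named above. The mapping is deliberately incomplete, and the function names are hypothetical. A complete mapping would need to come from NCBI Taxonomy.

```python
# Only the two renames mentioned in this announcement are listed here.
# ASSUMPTION: the remaining phylum renames would be added from NCBI Taxonomy.
PHYLUM_RENAMES = {
    "Firmicutes": "Bacillota",
    "Proteobacteria": "Pseudomonadota",
}


def normalize_phylum(name):
    """Return the formalized phylum name, or the input if no rename is known."""
    return PHYLUM_RENAMES.get(name, name)


def same_phylum(first, second):
    """Compare two phylum labels after normalizing old names to new ones."""
    return normalize_phylum(first) == normalize_phylum(second)


print(normalize_phylum("Firmicutes"))
print(same_phylum("Proteobacteria", "Pseudomonadota"))
```

Matching on NCBI TaxIDs rather than on names avoids the problem where TaxIDs are available, since only the names are described as changing.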

Plasmid sequences

We are considering revising the set of sequences included in the plasmid bin to add plasmids from WGS sequences.

RefSeq supports the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms.

Join our mailing list to keep up to date with RefSeq and other CGR news.
