This full release incorporates genomic, transcript, and protein data available as of November 1, 2021, and contains 296,293,486 records, including 215,655,378 proteins, 41,751,205 RNAs, and sequences from 114,396 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.
New eukaryotic genome annotations
This release includes new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 41 species, including:
- West African lungfish annotation release 100, based on new assembly PAN1.0 (GCF_019279795.1)
  - large genome (40 Gb), ~13 times the size of the human genome
- Wheat annotation release 100, based on updated assembly IWGSC_CS_RefSeq_v2.1 (GCF_018294505.1)
- Himalayan honeybee annotation release 100, based on new assembly ASM1406632v1 (GCF_014066325.1)
- Komodo dragon annotation release 100, based on new assembly ASM479886v1 (GCF_004798865.1)
- Brown bear annotation release 101, based on updated assembly ASM358476v2 (GCF_003584765.2)
- Leopard cat annotation release 100, based on new assembly Fcat_Pben_1.1_paternal_pri (GCF_016509475.1)
NCBI to assign 64-bit numeric GIs in November 2021
NCBI’s GI sequence identifiers will soon exceed the range of 32-bit numbers. NCBI will therefore begin assigning larger (64-bit) numeric GIs to the remaining sequence types that still receive these identifiers. This change is expected as soon as November 15th, 2021, but could occur earlier if data submission volumes are unexpectedly high.
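For users who maintain their own software that stores GIs, the following is a minimal Python sketch of what the change could mean in practice. It assumes, purely for illustration, that a downstream tool keeps GIs in a signed 32-bit field; the announcement does not say how any particular tool stores them. The helper names (`fits_int32`, `pack_gi`, `unpack_gi`) and the sample GI value are hypothetical.

```python
import struct

# Largest value a signed 32-bit integer can hold (assumed storage type).
INT32_MAX = 2**31 - 1


def fits_int32(gi: int) -> bool:
    """Return True if the GI can be stored in a signed 32-bit field."""
    return 0 <= gi <= INT32_MAX


def pack_gi(gi: int) -> bytes:
    """Serialize a GI as a little-endian signed 64-bit integer (8 bytes)."""
    return struct.pack("<q", gi)


def unpack_gi(data: bytes) -> int:
    """Read back a GI written by pack_gi."""
    return struct.unpack("<q", data)[0]


if __name__ == "__main__":
    hypothetical_gi = 2_500_000_000  # made-up value above the 32-bit range

    print(fits_int32(hypothetical_gi))

    # Packing into a 32-bit field fails for such a value ...
    try:
        struct.pack("<i", hypothetical_gi)
    except struct.error as exc:
        print("32-bit pack failed:", exc)

    # ... while a 64-bit field round-trips it unchanged.
    print(unpack_gi(pack_gi(hypothetical_gi)) == hypothetical_gi)
```

Python integers are not limited in size, so the risk in Python code is mostly at the boundaries: binary file formats, database column types, and fixed-width array types. Code in languages with fixed-width integers may need its GI variables widened as well.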
RefSeq assembly information
We are considering adding information to the RefSeq FTP release catalog about the RefSeq assembly for each sequence. We welcome your comments on information that would be useful to you.
We are also considering revising the set of sequences included in the plasmid bin to add plasmids from WGS sequences.