Tag: RefSeq

RefSeq Release 217

RefSeq release 217 is now available online and from the FTP site. You can access RefSeq data through NCBI Datasets.

What’s included in this release?

As of March 8, 2023, this full release incorporates genomic, transcript, and protein data, containing:

  • 348,351,219 records
  • 254,500,694 proteins
  • 50,975,429 RNAs
  • sequences from 130,837 organisms

The release is provided in several directories, both as a complete dataset and divided into logical groupings.

New annotations in RefSeq!

In December and January, the NCBI Eukaryotic Genome Annotation Pipeline released twenty-nine new annotations in RefSeq for the following organisms:

  • Acinonyx jubatus (cheetah)
  • Anopheles cruzii (mosquito)
  • Anopheles moucheti (mosquito)
  • Bicyclus anynana (squinting bush brown)
  • Budorcas taxicolor (takin)
  • Carassius gibelio (silver crucian carp)
  • Citrus sinensis (sweet orange)
  • Crassostrea angulata (Portuguese oyster)
  • Culex pipiens pallens (northern house mosquito)
  • Drosophila gunungcola (fruit fly)
  • Galleria mellonella (greater wax moth)
  • Gossypium arboreum (tree cotton)
  • Gossypium raimondii (Peruvian cotton)
  • Harpia harpyja (harpy eagle)
  • Hemicordylus capensis (graceful crag lizard)
  • Lactuca sativa (garden lettuce)
  • Mercenaria mercenaria (northern quahog)
  • Mya arenaria (softshell)
  • Octopus bimaculoides (California two-spot octopus)
  • Oncorhynchus keta (chum salmon)
  • Pangasianodon hypophthalmus (striped catfish)
  • Panonychus citri (citrus red mite)
  • Panthera uncia (snow leopard)
  • Peromyscus californicus insignis (California mouse)
  • Podarcis raffonei (Aeolian wall lizard)
  • Populus trichocarpa (black cottonwood)
  • Scomber japonicus (chub mackerel)
  • Tympanuchus pallidicinctus (lesser prairie-chicken)
  • Vigna angularis (adzuki bean)

Announcing New Names for Eukaryotic Genome Annotations in RefSeq!

The RefSeq eukaryotic genome annotation pipeline (EGAP) is moving to a new annotation naming format that can be used to unambiguously reference both the genome assembly and the RefSeq annotation. This will improve clarity when reporting the data you use and make the data more FAIR (Findable, Accessible, Interoperable, and Reusable). The new naming convention applies to all eukaryotic annotations released after December 15, 2022.

Historically, RefSeq EGAP has used an integer to identify a particular annotation release, such as Homo sapiens Annotation Release 110. This method provides no information on the assembly used for the annotation. In the new RefSeq naming system, annotation releases are designated by a combination of the assembly identifier (e.g., GCF_000001405.40) and an annotation name (e.g., RS_2022_04). The annotation name consists of an RS prefix to indicate a RefSeq annotation, followed by the year and month in which it was generated: RS_YYYY_MM. You should always use the annotation name in combination with the corresponding assembly accession.version, for example, GCF_026419915.1-RS_2022_12 (as shown in Figure 1). This ensures that you’re always using the name that defines a specific annotation for a specific genome assembly. If you use only part of the name, it will be ambiguous.

Figure 1. The annotation section of the Datasets Genome page for the assembly bHarHar1 for the harpy eagle (Harpia harpyja) showing the new annotation release GCF_026419915.1-RS_2022_12.
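
If you track annotation names in your own scripts, a small validator can catch names that are missing the assembly or the annotation part. The following is a minimal sketch, not an NCBI tool: the function name parse_annotation_name, the AnnotationName type, and the regular expression are illustrative assumptions. In particular, the pattern assumes a GCF_ accession with nine digits, as in the two examples above, joined to the RS_YYYY_MM name by a hyphen.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred only from the examples in this post:
# GCF_ + nine digits + .version, a hyphen, then RS_YYYY_MM.
ANNOTATION_NAME = re.compile(
    r"^(?P<accession>GCF_\d{9})\.(?P<version>\d+)"
    r"-RS_(?P<year>\d{4})_(?P<month>0[1-9]|1[0-2])$"
)


class AnnotationName(NamedTuple):
    """Hypothetical container for the two parts of a full annotation name."""
    assembly: str    # assembly accession.version, e.g. GCF_026419915.1
    annotation: str  # annotation name, e.g. RS_2022_12
    year: int
    month: int


def parse_annotation_name(name: str) -> Optional[AnnotationName]:
    """Return the parsed parts, or None if the name is partial or malformed."""
    m = ANNOTATION_NAME.match(name)
    if m is None:
        return None
    return AnnotationName(
        assembly=f"{m['accession']}.{m['version']}",
        annotation=f"RS_{m['year']}_{m['month']}",
        year=int(m["year"]),
        month=int(m["month"]),
    )


if __name__ == "__main__":
    # The full name parses; either part alone is rejected as ambiguous.
    print(parse_annotation_name("GCF_026419915.1-RS_2022_12"))
    print(parse_annotation_name("RS_2022_12"))
    print(parse_annotation_name("GCF_026419915.1"))
```

Returning None for a partial name mirrors the advice above: neither the assembly accession nor the RS_YYYY_MM name identifies an annotation on its own.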

RefSeq Release 216

RefSeq release 216 is now available online, from the FTP site, and through NCBI’s new resource, Datasets.

This full release incorporates genomic, transcript, and protein data available as of January 9, 2023, and contains 342,395,932 records, including 249,868,639 proteins, 49,869,497 RNAs, and sequences from 128,299 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.

Updated bacterial and archaeal reference genomes collection now available!

An updated bacterial and archaeal reference genome collection is available! This collection of 17,163 genomes was built by selecting exactly one genome assembly for each species among the 272,000+ prokaryotic genomes in RefSeq, except for E. coli, for which two assemblies were selected as references.

A total of 497 species are included in this collection for the first time. In addition, compared to the October 2022 set, 174 species are represented by a better assembly, and 15 species were removed because of changes in NCBI Taxonomy or uncertainty in their species assignment. The criteria for selecting one assembly for a given species from all the assemblies available for it in RefSeq include assembly contiguity and completeness, and the quality of the RefSeq annotation. See the documentation for details.

We have updated the nucleotide BLAST RefSeq reference genomes database (fourth in the menu) as well as the database on the Microbial Nucleotide BLAST page to reflect these changes. You can also run BLAST searches against the proteins annotated on these reference genomes (RefSeq Select proteins database, second in the menu).

New RefSeq Annotations!

In October and November, the NCBI Eukaryotic Genome Annotation Pipeline released thirty-one new annotations in RefSeq, including for the following organisms:

  • Acanthochromis polyacanthus (spiny chromis)
  • Acomys russatus (golden spiny mouse)
  • Andrographis paniculata (eudicot)
  • Antechinus flavipes (yellow-footed antechinus)
  • Apodemus sylvaticus (European woodmouse)
  • Apus apus (common swift)
  • Arachis duranensis (eudicot)

RefSeq Release 215

RefSeq release 215 is now available online, from the FTP site, and through NCBI’s Entrez Programming Utilities (E-utilities).

This full release incorporates genomic, transcript, and protein data available as of November 7, 2022, and contains 335,372,031 records, including 244,583,657 proteins and sequences from 125,116 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.

CCDS Release 24

An updated dataset of human protein-coding regions from the Consensus Coding Sequence (CCDS) collaboration

Are you interested in a set of high-quality human coding regions (CDS) with equivalent annotation in NCBI’s RefSeq and EMBL-EBI’s (European Molecular Biology Laboratory-European Bioinformatics Institute) Ensembl annotations? Check out the new CCDS Release 24! This CCDS set was generated by comparing RefSeq Annotation Release 110 and Ensembl Release 108.

This update adds 2,746 new CCDS IDs and 237 new genes compared to the last human CCDS build (Release 22, 2018). CCDS Release 24 includes a total of 35,608 CCDS IDs that correspond to 19,107 genes, with 48,062 protein sequences from RefSeq and 47,762 from Ensembl.

The new CCDS release is available on FTP for bulk download and on the CCDS webpage if you are looking for data on individual genes.

New annotations in RefSeq!

In August and September, the NCBI Eukaryotic Genome Annotation Pipeline released thirty-eight new annotations in RefSeq, including for the following organisms:

  • Adelges cooleyi (spruce gall adelgid)
  • Aethina tumida (small hive beetle)
  • Anopheles aquasalis (mosquito)
  • Anopheles maculipalpis (mosquito)
  • Anthonomus grandis grandis (boll weevil)
  • Aphis gossypii (cotton aphid)
  • Bactrocera neohumeralis (fly)
  • Bombus affinis (bee)
  • Bombus huntii (bee)
  • Cataglyphis hispanica (ant)
  • Cygnus atratus (black swan)

Now available: Updated prokaryote representative genomes collection

An updated bacterial and archaeal representative genomes collection is available! We selected a total of 16,665 of the 262,000 prokaryotic assemblies in RefSeq to represent their respective species. For the first time, more complete assemblies (as calculated by CheckM) were ranked higher than less complete assemblies. A ranked list of the criteria for selecting representative assemblies is available.