Tag: RefSeq

Recent RefSeq annotations: barn owl, monarch butterfly and more

In February and March, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Amblyraja radiata (thorny skate)
  • Catharus ustulatus (Swainson’s thrush)
  • Chelonoidis abingdonii (Abingdon island giant tortoise)
  • Chiroxiphia lanceolata (lance-tailed manakin)
  • Danaus plexippus plexippus (monarch butterfly)
  • Daphnia magna (crustacean)
  • Drosophila grimshawi (fly)
  • Drosophila mojavensis (fly)
  • Drosophila sechellia (fly)
  • Homo sapiens (human)
  • Hylobates moloch (silvery gibbon)
  • Lontra canadensis (North American river otter)
  • Lynx canadensis (Canada lynx)
  • Nasonia vitripennis (jewel wasp)
  • Odontomachus brunneus (ant)
  • Petromyzon marinus (sea lamprey)
  • Phocoena sinus (vaquita)
  • Rattus rattus (black rat)
  • Rhinolophus ferrumequinum (greater horseshoe bat)
  • Strigops habroptila (Kakapo)
  • Taeniopygia guttata (zebra finch)
  • Tyto alba (Barn owl)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

The next RefSeq FTP release number will skip to 200

NCBI’s Reference Sequence (RefSeq) FTP release numbers will jump to 200 for the next release, skipping the numbers 100-199. The current release, from March 2020, is release 99. The next bimonthly release, in May 2020, will be release 200. This change avoids overlap with the release numbers of the completely independent RefSeq annotation releases for the eukaryotic genomes we annotate, which are currently in the range 100-109 (for example, Mus musculus Annotation Release 108).
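For scripts that track FTP release numbers, the numbering rule described above can be sketched as a small helper. This is only an illustration of the stated rule; the function name and the constant are hypothetical and not part of any NCBI tool.

```python
# Numbers skipped by the RefSeq FTP release series, as stated in the post:
# releases go from 99 directly to 200.
SKIPPED = range(100, 200)


def next_ftp_release(current: int) -> int:
    """Return the RefSeq FTP release number that follows `current`."""
    if current in SKIPPED:
        raise ValueError(f"{current} is not a valid FTP release number")
    candidate = current + 1
    # Jump past the skipped block if a plain increment would land inside it.
    if candidate in SKIPPED:
        candidate = SKIPPED.stop
    return candidate
```

A pipeline that sorts releases numerically keeps working unchanged, but any code that assumes consecutive numbers (for example, computing "releases since N" by subtraction) would need a rule like this one.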

Protein family models used by PGAP are now available for download

A new release of the NCBI protein family profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins with HMMER to identify their function.
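As a rough sketch of such a search, the Python below builds an hmmsearch command and reads its tabular output. It assumes HMMER 3 is installed with `hmmsearch` on the PATH and that the downloaded models have been combined into a single HMM file. The file names and helper names are hypothetical, and no score cutoffs are applied here, so hits would still need filtering.

```python
import subprocess
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Hit(NamedTuple):
    protein: str   # target sequence name
    model: str     # query HMM name
    evalue: float  # full-sequence E-value
    score: float   # full-sequence bit score


def hmmsearch_command(hmm_file: str, fasta_file: str, table_file: str) -> List[str]:
    """Build an hmmsearch call that writes a per-target table (--tblout)."""
    return ["hmmsearch", "--tblout", table_file, hmm_file, fasta_file]


def run_hmmsearch(hmm_file: str, fasta_file: str, table_file: str) -> None:
    """Run the search; the human-readable report on stdout is discarded."""
    subprocess.run(
        hmmsearch_command(hmm_file, fasta_file, table_file),
        check=True,
        stdout=subprocess.DEVNULL,
    )


def parse_tblout(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield hits from --tblout lines, skipping '#' comments and blank lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Columns: target, target acc, query, query acc, E-value, score, ...
        yield Hit(fields[0], fields[2], float(fields[4]), float(fields[5]))


def best_hit_per_protein(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep the highest-scoring model for each protein."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.protein not in best or hit.score > best[hit.protein].score:
            best[hit.protein] = hit
    return best


# Hypothetical usage (file names are placeholders):
#   run_hmmsearch("models.hmm", "proteins.faa", "hits.tbl")
#   with open("hits.tbl") as handle:
#       best = best_hit_per_protein(parse_tblout(handle))
```

Taking the top-scoring model per protein is a simplification chosen for brevity; it is not a description of how PGAP itself selects names.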

The HMMs are used as hints for the structural annotation of protein-coding genes in bacterial genomes. They are also one of the sources for the names assigned to PGAP-annotated proteins, as presented in the Evidence-For-Name-Assignment comment block of RefSeq protein records (see, for example, WP_004152100.1).

The collection comprises 12,753 HMMs that were built at NCBI, and 4,486 TIGRFAM HMMs whose ownership was transferred to NCBI in April 2018. In addition to the HMM profiles and seed alignments, a tab-delimited file containing the product names and other attributes added to the HMMs by curators is available.

  • 85% of models were assigned a product name that can be transferred to proteins hit by the model.
  • 7,702 models have gene symbols.
  • 14,508 are supported by at least one publication.
  • 6,266 are assigned an Enzyme Commission number.
  • 617 represent antimicrobial resistance proteins.
  • Product names added to 4,686 Pfam HMMs owned by EMBL-EBI and used for functional annotation by PGAP are also included.

A total of 57 million RefSeq prokaryotic proteins have been named based on these curated HMMs. They can be identified with the Entrez query "meta Evidence-For-Name-Assignment"[Properties] AND "Evidence Category=HMM"[Text Word]. See an example and more information on web displays of HMMs in a previous post.
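The same query can also be sent through E-utilities. The sketch below only builds an ESearch URL for the protein database; it makes no network request. The helper name and the default `retmax` are illustrative choices, and the query string uses straight quotes.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The Entrez query quoted in the post.
HMM_NAMED_QUERY = (
    '"meta Evidence-For-Name-Assignment"[Properties] '
    'AND "Evidence Category=HMM"[Text Word]'
)


def esearch_url(term: str, db: str = "protein", retmax: int = 20) -> str:
    """Return an ESearch URL for `term`; the caller decides how to fetch it."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url(HMM_NAMED_QUERY))
```

The query can be narrowed by appending further Entrez terms, such as an organism filter, before building the URL.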

Fifteen new NCBI annotations in RefSeq: flies, harbor seal and more

In January and February, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Aythya fuligula (tufted duck)
  • Camelus ferus (Wild Bactrian camel)
  • Corvus moneduloides (New Caledonian crow)
  • Coturnix japonica (Japanese quail)
  • Drosophila ananassae (fly)
  • Drosophila virilis (fly)
  • Etheostoma spectabile (orangethroat darter)
  • Hylobates moloch (silvery gibbon)
  • Mustela erminea (ermine)
  • Nematostella vectensis (starlet sea anemone)
  • Nomia melanderi (Alkali bee)
  • Phoca vitulina (harbor seal)
  • Sapajus apella (Tufted capuchin)
  • Thamnophis elegans (Western terrestrial garter snake)
  • Xiphophorus hellerii (green swordtail)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

RefSeq Release 99 is public

RefSeq release 99 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of March 2, 2020, and contains 231,402,293 records, including 167,278,920 proteins, 29,869,155 RNAs, and sequences from 99,842 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.
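To put these totals in context, the short sketch below compares them with the figures from the release 98 announcement (data as of January 6, 2020). The numbers are copied from the two announcements; the dictionary keys and function name are illustrative only.

```python
# Totals quoted in the release 98 and release 99 announcements.
RELEASE_98 = {
    "records": 223_560_051,
    "proteins": 161_133_441,
    "RNAs": 29_134_515,
    "organisms": 98_406,
}
RELEASE_99 = {
    "records": 231_402_293,
    "proteins": 167_278_920,
    "RNAs": 29_869_155,
    "organisms": 99_842,
}


def growth(old: dict, new: dict) -> dict:
    """Return {category: (absolute increase, percent increase)}."""
    return {
        key: (new[key] - old[key], 100.0 * (new[key] - old[key]) / old[key])
        for key in old
    }


for name, (delta, percent) in growth(RELEASE_98, RELEASE_99).items():
    print(f"{name:>9}: +{delta:,} ({percent:.1f}%)")
```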

New ribosomal RNA BLAST databases available on the web BLAST service and for download

We have a curated set of ribosomal RNA (rRNA) reference sequences (Targeted Loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacterial and archaeal) and fungal samples (Table 1). To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section to the nucleotide BLAST page, which makes it convenient to quickly identify source organisms (Figure 1).

Database | BioProjects | Sequences
16S ribosomal RNA (Bacteria and Archaea) | PRJNA33317, PRJNA33175 | 20,845
18S ribosomal RNA sequences (SSU) from Fungi type and reference material | PRJNA39195 | 2,337
28S ribosomal RNA sequences (LSU) from Fungi type and reference material | PRJNA51803 | 5,185
Internal transcribed spacer region (ITS) from Fungi and Oomycete type and reference material | PRJNA177353, PRJNA362621 | 10,874

Table 1. NCBI curated targeted rRNA sequences now available as BLAST databases.
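For the downloadable databases, a local search might look like the sketch below. It assumes BLAST+ is installed with `blastn` on the PATH and that the 16S database has been downloaded and unpacked where BLAST can find it. The database name `16S_ribosomal_RNA` is an assumption that should be checked against the downloaded files, and the helper names are hypothetical.

```python
import subprocess
from typing import List, NamedTuple


class BlastHit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float
    bitscore: float


def blastn_command(query_fasta: str, db: str = "16S_ribosomal_RNA",
                   max_hits: int = 5) -> List[str]:
    """Build a blastn call that prints tabular output (-outfmt 6)."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db,  # assumed database name; verify locally
        "-outfmt", "6",
        "-max_target_seqs", str(max_hits),
    ]


def parse_outfmt6(text: str) -> List[BlastHit]:
    """Parse default -outfmt 6 columns (qseqid, sseqid, pident, ..., evalue, bitscore)."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        hits.append(BlastHit(fields[0], fields[1], float(fields[2]),
                             float(fields[10]), float(fields[11])))
    return hits


def identify(query_fasta: str) -> List[BlastHit]:
    """Run blastn and return the parsed hits for the query file."""
    result = subprocess.run(blastn_command(query_fasta),
                            capture_output=True, text=True, check=True)
    return parse_outfmt6(result.stdout)
```

Because these databases contain only curated type and reference material, the subject of a top hit maps to a sequence with a verified organism source, which is what makes this set convenient for identification.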

Important changes coming to prokaryotic Reference and Representative genome assemblies

We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.

  • We will reduce the set of Reference assemblies to the 15 whose annotation is provided by outside experts (Table 1) and re-annotate the other 105 current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose Reference status.
  • We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.

Important changes to the genomes FTP site in February

We have added the latest NCBI Eukaryotic Genome Annotation Pipeline results for the more than 580 species that we annotate to the genomes/refseq directory on the genomes FTP area. As we announced in December, we will stop publishing annotation results to the genus_species directories (example: genomes/Xenopus_tropicalis) on the genomes FTP site effective February 1, 2020. We will also move existing genus_species directories to genomes/archive/old_refseq during the month of February.

Figure 1. The Assembly page for the Xenopus tropicalis UCB Xtro 10.0 (GCF_000004195.4) showing the blue download button. Annotation results such as the RefSeq transcript alignments that can be downloaded from the web page are now also under the genomes/refseq directory on the FTP site. The FTP path to the .bam alignment files is in red.

These FTP changes do not affect the Assembly download function. As always, you can download assembly data using the blue Download button on the web pages (Figure 1).
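For scripts with hard-coded genus_species paths, the directory move described above can be sketched as a path rewrite. This assumes each directory keeps its name and internal layout under genomes/archive/old_refseq, which the announcement does not spell out. The function name and the set of non-species directory names are hypothetical.

```python
import posixpath

ARCHIVE_ROOT = "genomes/archive/old_refseq"

# Top-level names under genomes/ that are not genus_species directories.
# Only the two named in the announcement are listed; others would need adding.
NOT_SPECIES = {"refseq", "archive"}


def archived_path(old_path: str) -> str:
    """Map genomes/<Genus_species>[/...] to its assumed archive location."""
    parts = old_path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "genomes":
        raise ValueError(f"not a genomes/ path: {old_path!r}")
    if parts[1] in NOT_SPECIES:
        raise ValueError(f"not a genus_species directory: {old_path!r}")
    # Assumption: everything after genomes/ is preserved under the archive.
    return posixpath.join(ARCHIVE_ROOT, *parts[1:])
```

Under that assumption, `archived_path("genomes/Xenopus_tropicalis")` gives `genomes/archive/old_refseq/Xenopus_tropicalis`. Current annotation results should be read from genomes/refseq instead.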

 

RefSeq Release 98 is public

RefSeq release 98 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of January 6, 2020, and contains 223,560,051 records, including 161,133,441 proteins, 29,134,515 RNAs, and sequences from 98,406 organisms.

The release is provided in several directories, both as a complete dataset and divided into logical groupings.

Read on for several important announcements.
