Protein family models used by PGAP are now available for download

A new release of the NCBI protein family profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins with HMMER to identify their function.
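A search like this can be scripted. The sketch below is only an illustration: it assembles an hmmsearch command line and parses the per-target table that HMMER 3 writes with --tblout, keeping the best-scoring model for each protein. The file names, model names, accessions, and sample rows are made-up placeholders, not part of the release.

```python
def hmmsearch_command(hmm_file, protein_fasta, tblout, evalue="1e-5"):
    """Build an hmmsearch (HMMER 3) command line as an argument list.

    --tblout writes one row per target sequence; -E sets the reporting
    E-value threshold. File names are supplied by the caller.
    """
    return ["hmmsearch", "--tblout", tblout, "-E", evalue, hmm_file, protein_fasta]


def parse_tblout(text):
    """Parse hmmsearch --tblout text into a list of hit dictionaries.

    In hmmsearch output the target is the protein and the query is the HMM.
    The first 18 columns are whitespace-delimited; the rest is a description.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(None, 18)
        hits.append({
            "protein": fields[0],
            "model": fields[2],
            "model_accession": fields[3],
            "evalue": float(fields[4]),   # full-sequence E-value
            "score": float(fields[5]),    # full-sequence bit score
        })
    return hits


def best_hit_per_protein(hits):
    """Keep the highest-scoring model for each protein."""
    best = {}
    for hit in hits:
        current = best.get(hit["protein"])
        if current is None or hit["score"] > current["score"]:
            best[hit["protein"]] = hit
    return best


# Made-up rows in --tblout layout, for illustration only.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias ...
protA - modelX ACC_X 1.2e-30 105.3 0.1 2.0e-30 104.6 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein
protA - modelY ACC_Y 3.0e-10 40.2 0.0 4.0e-10 39.8 0.0 1.0 1 0 0 1 1 1 1 hypothetical protein
protB - modelY ACC_Y 5.0e-50 170.9 0.2 6.0e-50 170.5 0.2 1.0 1 0 0 1 1 1 1 another protein
"""

if __name__ == "__main__":
    # Placeholder file names; substitute the downloaded HMM library and your FASTA.
    print(" ".join(hmmsearch_command("pgap_models.hmm", "my_proteins.faa", "hits.tbl")))
    for protein, hit in best_hit_per_protein(parse_tblout(SAMPLE_TBLOUT)).items():
        print(protein, hit["model"], hit["score"])
```

Choosing the single best-scoring model per protein is a simplification made for this sketch; it is not a description of how PGAP itself assigns names.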

The HMMs are used as hints for the structural annotation of protein-coding genes in bacterial genomes. They are also one of the sources for the names assigned to PGAP-annotated proteins, which are presented in the Evidence-For-Name-Assignment comment block of RefSeq protein records (see, for example, WP_004152100.1).

The collection comprises 12,753 HMMs that were built at NCBI, and 4,486 TIGRFAM HMMs whose ownership was transferred to NCBI in April 2018. In addition to the HMM profiles and seed alignments, a tab-delimited file containing the product names and other attributes added to the HMMs by curators is available.

  • 85% of models were assigned a product name that can be transferred to proteins hit by the model.
  • 7,702 models have gene symbols.
  • 14,508 are supported by at least one publication.
  • 6,266 are assigned an Enzyme Commission number.
  • 617 represent antimicrobial resistance proteins.
  • Product names added to 4,686 Pfam HMMs owned by EMBL-EBI and used for functional annotation by PGAP are also included.

A total of 57 million RefSeq prokaryotic proteins have been named based on these curated HMMs and can be identified with the Entrez query "meta Evidence-For-Name-Assignment"[Properties] AND "Evidence Category=HMM"[Text Word]. See an example and more information on web displays of HMMs in a previous post.
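The same query can also be sent programmatically. As a minimal sketch, assuming the E-utilities ESearch endpoint and the protein database, the snippet below builds a properly encoded request URL for the query quoted above. The retmax value and JSON return mode are choices made for this illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The Entrez query from the post, written with straight quotes.
HMM_NAMED_QUERY = (
    '"meta Evidence-For-Name-Assignment"[Properties] '
    'AND "Evidence Category=HMM"[Text Word]'
)


def build_esearch_url(term, db="protein", retmax=20):
    """Return an ESearch URL for the given Entrez query.

    urlencode handles the quotes, brackets, spaces, and '=' in the term.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Prints the URL only; fetching it (for example with urllib.request)
    # is left to the reader.
    print(build_esearch_url(HMM_NAMED_QUERY))
```

To narrow the results, an organism or other field restriction can be appended to the term with AND before building the URL.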

Fifteen new NCBI annotations in RefSeq: flies, harbor seal and more

In January and February, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Aythya fuligula (tufted duck)
  • Camelus ferus (wild Bactrian camel)
  • Corvus moneduloides (New Caledonian crow)
  • Coturnix japonica (Japanese quail)
  • Drosophila ananassae (fly)
  • Drosophila virilis (fly)
  • Etheostoma spectabile (orangethroat darter)
  • Hylobates moloch (silvery gibbon)
  • Mustela erminea (ermine)
  • Nematostella vectensis (starlet sea anemone)
  • Nomia melanderi (alkali bee)
  • Phoca vitulina (harbor seal)
  • Sapajus apella (tufted capuchin)
  • Thamnophis elegans (western terrestrial garter snake)
  • Xiphophorus hellerii (green swordtail)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

RefSeq Release 99 is public

RefSeq release 99 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of March 2, 2020, and contains 231,402,293 records, including 167,278,920 proteins, 29,869,155 RNAs, and sequences from 99,842 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.


New ribosomal RNA BLAST databases available on the web BLAST service and for download

We have a curated set of ribosomal RNA (rRNA) reference sequences (Targeted Loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacteria and archaea) and fungal samples (Table 1). To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section on the nucleotide BLAST page for these targeted sequences that makes it convenient to quickly identify source organisms (Figure 1).

Database | BioProjects | Sequences
16S ribosomal RNA (Bacteria and Archaea) | PRJNA33317, PRJNA33175 | 20,845
18S ribosomal RNA sequences (SSU) from Fungi type and reference material | PRJNA39195 | 2,337
28S ribosomal RNA sequences (LSU) from Fungi type and reference material | PRJNA51803 | 5,185
Internal transcribed spacer region (ITS) from Fungi and Oomycete type and reference material | PRJNA177353, PRJNA362621 | 10,874

Table 1. NCBI curated targeted rRNA sequences now available as BLAST databases.
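For readers who download the databases and run BLAST+ locally, a workflow along the following lines may be a useful starting point. This is a sketch under assumptions: the database name 16S_ribosomal_RNA, the file names, and the sample rows are placeholders chosen for illustration, so check the names of the files you actually downloaded. The parser expects the standard 12-column tabular output (-outfmt 6) and reports the top hit per query by bit score.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blastn_command(query_fasta, db="16S_ribosomal_RNA", out="hits.tsv"):
    """Build a blastn command line; the database name is an assumption."""
    return [
        "blastn", "-query", query_fasta, "-db", db,
        "-outfmt", "6", "-max_target_seqs", "5", "-out", out,
    ]


def parse_outfmt6(text):
    """Parse tab-delimited BLAST output into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        row = dict(zip(BLAST_FIELDS, line.split("\t")))
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        rows.append(row)
    return rows


def top_hit_per_query(rows):
    """Return the highest bit-score hit for each query sequence."""
    best = {}
    for row in rows:
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Made-up rows for illustration only; REF_A, REF_B, REF_C are not real accessions.
SAMPLE_HITS = (
    "read1\tREF_A\t99.5\t1400\t7\t0\t1\t1400\t1\t1400\t0.0\t2540\n"
    "read1\tREF_B\t97.1\t1400\t40\t1\t1\t1400\t1\t1399\t0.0\t2350\n"
    "read2\tREF_C\t98.0\t1350\t27\t0\t1\t1350\t5\t1354\t0.0\t2400\n"
)

if __name__ == "__main__":
    print(" ".join(blastn_command("my_16S_reads.fasta")))
    for query, hit in top_hit_per_query(parse_outfmt6(SAMPLE_HITS)).items():
        print(query, hit["sseqid"], hit["pident"])
```

A top hit by bit score is only a first pass at identifying a source organism; percent identity and alignment length should be reviewed before assigning a name.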

Important changes coming to prokaryotic Reference and Representative genome assemblies

We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.

  • We will reduce the number of Reference assemblies to 15 that have annotation provided by outside experts (Table 1) and re-annotate the 105 other current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose reference status.
  • We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.


Important changes to the genomes FTP site in February

We have added the latest NCBI Eukaryotic Genome Annotation Pipeline results for the more than 580 species that we annotate to the genomes/refseq directory on the genomes FTP area. As we announced in December, we will stop publishing annotation results to the genus_species directories (example: genomes/Xenopus_tropicalis) on the genomes FTP site effective February 1, 2020. We will also move existing genus_species directories to genomes/archive/old_refseq during the month of February.

Figure 1. The Assembly page for the Xenopus tropicalis UCB Xtro 10.0 (GCF_000004195.4) showing the blue download button. Annotation results such as the RefSeq transcript alignments that can be downloaded from the web page are now also under the genomes/refseq directory on the FTP site. The FTP path to the .bam alignment files is in red.

These FTP changes do not affect the Assembly download function. As always, you can download assembly data using the blue Download button on the web pages (Figure 1).

 

RefSeq Release 98 is public

RefSeq release 98 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of January 6, 2020, and contains 223,560,051 records, including 161,133,441 proteins, 29,134,515 RNAs, and sequences from 98,406 organisms.

The release is provided in several directories as a complete dataset and as divided by logical groupings.

Read on for several important announcements.


RefSeq Release 97 is public

RefSeq release 97 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of November 4, 2019, and contains 219,407,891 records, including 157,639,958 proteins, 28,730,283 RNAs, and sequences from 97,407 organisms.

The release is provided in several directories as a complete dataset and also as divided by logical groupings.


August-October 2019 RefSeq annotations: mouse, firefly and more


Recently, the NCBI Eukaryotic Genome Annotation Pipeline has released new annotations in RefSeq for the following organisms:

  • Aedes albopictus (Asian tiger mosquito)
  • Aquila chrysaetos chrysaetos (golden eagle)
  • Archocentrus centrarchus (flier cichlid)
  • Calypte anna (Anna’s hummingbird)
  • Camarhynchus parvulus (bird)
  • Camelus dromedarius (Arabian camel)
  • Cannabis sativa (hemp)
  • Chanos chanos (milkfish)
