A new release of the NCBI protein families profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of Hidden Markov models (HMM) against your favorite prokaryotic proteins to identify their function using hmmer.
The HMMs are used as hints for the structural annotation of protein-coding genes in bacterial genomes and are also one of the sources for the names assigned to PGAP-annotated proteins presented in the Evidence-For-Name-Assignment comment block of RefSeq protein records (See for example, WP_004152100.1).
The collection comprises 12,753 HMMs that were built at NCBI, and 4,486 TIGRFAM HMMs whose ownership was transferred to NCBI in April 2018. In addition to the HMM profiles and seed alignments, a tab-delimited file containing the product names and other attributes added to the HMMs by curators is available.
85% of models were assigned a product name that can be transferred to proteins hit by the model.
7702 models have gene symbols.
14508 are supported by a least one publication.
6266 are assigned an Enzyme Commission number.
617 represent anti-microbial resistance proteins.
Product names added to 4,686 PFAM HMMs owned by EBI-EMBL and used for functional annotation by PGAP are also included.
A total of 57 million RefSeq prokaryotic proteins have been named based on these curated HMMs, and can be identified with the Entrez query “meta Evidence-For-Name-Assignment”[Properties] AND “Evidence Category=HMM”[Text Word]. See an example and more information on web displays of HMMs in a previous post.
RefSeq release 99 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of March 2, 2020, and contains 231,402,293 records, including 167,278,920 proteins, 29,869,155 RNAs, and sequences from 99,842 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
We have a curated set of ribosomal RNA (rRNA) reference sequences (Targeted Loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacteria and archaea) and fungal samples (Table 1). To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section on the nucleotide BLAST page for these targeted sequences that makes it convenient to quickly identify source organisms (Figure 1)
We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.
We will reduce the number of Reference assemblies to 15 that have annotation provided by outside experts (Table 1) and re-annotate the 105 other current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose reference status.
We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.
We have added the latest NCBI Eukaryotic Genome Annotation Pipeline results for the more than 580 species that we annotate to the genomes/refseq directory on the genomes FTP area. As we announced in December, we will stop publishing annotation results to the genus_species directories (example: genomes/Xenopus_tropicalis) on the genomes FTP site effective February 1, 2020. We will also move existing genus_species directories to genomes/archive/old_refseq during the month of February.Figure 1. The Assembly page for the Xenopus tropicalis UCB Xtro 10.0 (GCF_000004195.4) showing the blue download button. Annotation results such as the RefSeq transcript alignments that can be downloaded from the web page are now also under the genomes/refseq directory on the FTP site. The FTP path to the .bam alignment files is in red.
These FTP changes do not affect the Assembly download function. As always, you can download assembly data using the blue Download button on the web pages (Figure 1).
RefSeq release 98 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of January 6, 2020, and contains 223,560,051 records, including 161,133,441 proteins, 29,134,515 RNAs, and sequences from 98,406 organisms.
The release is provided in several directories as a complete dataset and as divided by logical groupings.