The bacterial and archaeal representative genome collection has been updated! We selected a total of 14,912 of the 224,000 prokaryotic RefSeq assemblies to represent their respective species. The collection has grown by 8% since April 2021 and now includes Candidatus and endosymbiont species (Figure 1), which account for 303 and 140, respectively, of the 1,077 newly added species. In addition, 719 species are represented by a better assembly, and 70 species were removed because of changes in NCBI Taxonomy or uncertainty in their species assignment.
Figure 1. Graphical view of a portion of the RefSeq Representative assembly for the bedbug endosymbiont Candidatus Wolbachia massiliensis isolate PL13.
Release 6.0 of the NCBI hidden Markov models (HMMs), available on our FTP site, contains 15,247 models supported at NCBI. We created 80 new HMMs and consolidated the collection by removing 2,151 HMMs that were nearly identical to another model. Release 6.0 also incorporates 12,656 Pfam models from release 34 that apply to prokaryotic proteins. You can use the HMMER sequence analysis package to search the collection against your favorite prokaryotic proteins to identify their function. We have also added more specific names or associated EC numbers, gene symbols, and publications to over 500 HMMs.
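As a rough sketch of what such a search could look like when driven from Python, the snippet below builds an hmmsearch command line and parses the tabular output written by the --tblout option, keeping the best-scoring model for each protein. It assumes HMMER 3 is installed and that the models have been downloaded as a single HMM file. The helper names, file names, and sample rows are hypothetical and are there only for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    protein: str          # target sequence name (column 1 of --tblout)
    model: str            # query HMM name (column 3)
    model_accession: str  # query HMM accession (column 4)
    evalue: float         # full-sequence E-value (column 5)
    score: float          # full-sequence bit score (column 6)


def build_hmmsearch_command(hmm_file: str, protein_fasta: str, tblout: str) -> List[str]:
    """Return an hmmsearch argument list; the HMM file precedes the sequence file."""
    return ["hmmsearch", "--noali", "--tblout", tblout, hmm_file, protein_fasta]


def parse_tblout(lines: Iterable[str]) -> List[Hit]:
    """Parse hmmsearch --tblout rows, skipping comment and blank lines."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hits.append(
            Hit(fields[0], fields[2], fields[3], float(fields[4]), float(fields[5]))
        )
    return hits


def best_hit_per_protein(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep the highest-scoring model for each protein."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.protein)
        if current is None or hit.score > current.score:
            best[hit.protein] = hit
    return best


# Made-up rows in the --tblout column layout; names and accessions are placeholders.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias ...
protA - modelX FAKE00001.1 1.2e-50 170.3 0.1 1.5e-50 170.0 0.1 1.0 1 0 0 1 1 1 1 made-up description
protA - modelY FAKE00002.1 3.0e-10 40.2 0.0 4.0e-10 39.8 0.0 1.0 1 0 0 1 1 1 1 made-up description
protB - modelY FAKE00002.1 2.0e-30 105.5 0.2 2.5e-30 105.1 0.2 1.0 1 0 0 1 1 1 1 made-up description
"""
```

To run a real search, you could pass the list returned by build_hmmsearch_command to subprocess.run(..., check=True) and then feed the lines of the resulting table file to parse_tblout. Picking the top bit score per protein is only one simple way to choose among overlapping hits; it is not necessarily how PGAP assigns names.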
Gene Ontology (GO) term attributes are now available for 20% of HMMs (see Figure 1 below). We added most of these based on existing mappings, but our experts are working on creating more associations. In the fall, we’ll begin propagating GO terms from HMMs to annotated genomes and proteins!
RefSeq Release 206 is now available. This release includes the following:
Updated human genome Annotation Release 109.20210514
Annotation Release 109.20210514 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report is available here. The annotation products are available in the sequence databases and on the FTP site.
We are happy to announce an updated bacterial and archaeal representative genome collection! We have selected 13,835 among 214,000 prokaryotic RefSeq assemblies to represent their respective species. The collection has increased by 6% since December 2020. About 950 species are represented for the first time, 476 species are represented by a better assembly, and 170 species were removed because of changes in NCBI Taxonomy or uncertainty in their species assignment.
Join us on May 19, 2021 at 12 PM Eastern Time to learn how to use the new RAPT pilot service to assemble and annotate public or private Illumina genomic reads sequenced from bacterial or archaeal isolates at the click of a button. RAPT consists of two major components, the genome assembler SKESA and the Prokaryotic Genome Annotation Pipeline (PGAP), and produces an annotated genome of quality comparable to RefSeq in a couple of hours.
Date and time: Wed, May 19, 2021 12:00 PM – 12:45 PM EDT
NCBI staff will be presenting virtual posters at the Cold Spring Harbor Laboratory Biology of Genomes Meeting, May 11-14, 2021. The posters will cover the following topics: 1) a cloud-ready suite of tools (PGAP, RAPT, and SKESA) for assembling and annotating prokaryotic genomes, 2) Datasets, a new set of services for downloading genome assemblies and annotations, and 3) updates on NCBI RefSeq eukaryotic genome annotation and the Genome Data Viewer (GDV). Read more below for the full abstracts.
We have updated the bacterial and archaeal representative genome collection! The current collection contains over 13,000 assemblies selected from the 203,000 prokaryotic RefSeq assemblies to represent their respective species. The collection has increased by 11% since August 2020. We’ve included about 1,400 species for the first time, have used better assemblies for 1,177 species, and have removed 65 species because of changes in NCBI Taxonomy or uncertainty in their species assignment.
We have updated the collection of representative genome assemblies for Bacteria and Archaea. As announced in April, this set is now recalculated three times a year. We selected a total of 11,727 prokaryotic assemblies to represent their respective species among the 192,000 assemblies in RefSeq. Six hundred and thirty-five species were included in the collection for the first time, while 395 organisms from undefined species (such as Bacillus bacterium) were removed. For 18% of bacterial and archaeal species, we were able to choose a higher-quality representative than in the previous set, thanks to improvements in the selection logic. The selection is now based on assembly length, the number of pseudo CDSs called in the PGAP annotation, the number of scaffolds, whether Gene IDs are available in the Gene database for the assembly that is currently representative, and type strain status. You can see the exact criteria in order of importance on the Prokaryotic RefSeq Genomes page. Now that the new selection process is in place, we expect future updates to have fewer changes. We will replace a representative only if the assembly has changed RefSeq status or if a substantially better assembly becomes available.
You can download the reference and representative set from the Assembly resource. If you are interested in the annotation on these genomes, you can limit searches to proteins annotated on representative genomes by adding “refseq_select[filter]” to any query in the Protein database. For example, you can find all proteins annotated on representative genomes in the genus Klebsiella by using the query: “Klebsiella[organism] AND refseq_select[filter]”. A BLAST database of proteins annotated on representative genomes will be coming soon. Stay tuned!
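If you would rather run this kind of query from a script, here is a small sketch that composes the same search term and turns it into an E-utilities ESearch URL for the Protein database. The helper names are our own illustrative choices, and the retmax value is arbitrary. The snippet only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def representative_protein_term(organism: str) -> str:
    """Limit an organism search to proteins annotated on representative genomes."""
    return f"{organism}[organism] AND refseq_select[filter]"


def esearch_url(term: str, db: str = "protein", retmax: int = 20) -> str:
    """Build an ESearch URL for the given term (no network request is made)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


term = representative_protein_term("Klebsiella")
url = esearch_url(term)
```

Fetching the URL, for example with urllib.request.urlopen, should return the matching protein identifiers as JSON. For anything beyond occasional queries, check the E-utilities usage guidelines on request rates and API keys first.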
Release 3.0 of the NCBI protein family models used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
The 3.0 release contains 17,350 models: 12,864 HMMs built at NCBI (111 more than in release 2.0) and 4,486 TIGRFAM HMMs. In addition, since release 2.0, we have assigned product names to over 2,000 Pfam HMMs, bringing the total to 6,698 Pfam HMMs with names that can be transferred by PGAP to the annotated proteins they hit. You can access a table of these product names from the release directory.
Figure 1. The evidence for name assignment for type III secretion system (T3SS) translocon subunit SctB (NF038055) showing the protein matches. Species-specific names for this highly variable component of T3SS include YopD, EspB, IpaC, SipC, etc. Instead, we used the standard moniker for core genes of T3SS, Sct (secretion and cellular translocation; PMID 26520801, PMID 9618447), providing a unified nomenclature for this secretion system.
A new release of the NCBI protein family profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using HMMER.
The HMMs are used as hints for the structural annotation of protein-coding genes in bacterial genomes and are also one of the sources for the names assigned to PGAP-annotated proteins, as presented in the Evidence-For-Name-Assignment comment block of RefSeq protein records (see, for example, WP_004152100.1).
The collection comprises 12,753 HMMs that were built at NCBI, and 4,486 TIGRFAM HMMs whose ownership was transferred to NCBI in April 2018. In addition to the HMM profiles and seed alignments, a tab-delimited file containing the product names and other attributes added to the HMMs by curators is available.
85% of models were assigned a product name that can be transferred to proteins hit by the model.
7,702 models have gene symbols.
14,508 are supported by at least one publication.
6,266 are assigned an Enzyme Commission number.
617 represent antimicrobial resistance proteins.
Product names added to 4,686 Pfam HMMs owned by EMBL-EBI and used for functional annotation by PGAP are also included.
A total of 57 million RefSeq prokaryotic proteins have been named based on these curated HMMs, and can be identified with the Entrez query “meta Evidence-For-Name-Assignment”[Properties] AND “Evidence Category=HMM”[Text Word]. See an example and more information on web displays of HMMs in a previous post.