Do you work with or study prokaryotic proteins? As previously announced, we’ve been adding Gene Ontology (GO) terms to RefSeq prokaryotic protein sequence records (example below) to standardize the language used to describe the functions of genes and their products. Over 100 million RefSeq proteins from prokaryotes now have at least one GO term, a 55% increase since we started propagating GO terms from Conserved Domain Database (CDD) architectures in March. Continue reading “Gene Ontology (GO) Terms on 100M+ RefSeq Prokaryotic Protein Sequence Records”
RefSeq Release 216
Prokaryotic phylum name changes coming soon!
Beginning in the first week of January 2023, NCBI Taxonomy will initiate changes to prokaryote phylum names in accordance with the recent inclusion of rank ‘phylum’ in the International Code of Nomenclature for Prokaryotes (ICNP). We first announced this update that involves changes to 42 NCBI taxa about a year ago. We will change several names that have long been in use (e.g., Firmicutes, Proteobacteria) to newly formalized names (e.g., Bacillota, Pseudomonadota) that may be unfamiliar to some.
You will still see the previous names on records and can search using them, but they will not be displayed as prominently as before. The organism names on Entrez records will not change (e.g., Bacillus subtilis). However, we will update the phylum names on the displayed lineages for ~276 million records (see an example in Figure 1 below). Continue reading “Prokaryotic phylum name changes coming soon!”
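If you keep phylum names in your own tables or scripts, a small lookup can bring stored lineages in line with the new names while still accepting the old ones as search terms. The sketch below is only an illustration: it covers just the two renames named above, and the "; "-separated lineage string and both function names are assumptions made for this example, not an NCBI format or API.

```python
# Old-to-new phylum names. Only the two pairs named in the announcement are
# listed here; a complete table would cover all 42 affected taxa.
PHYLUM_RENAMES = {
    "Firmicutes": "Bacillota",
    "Proteobacteria": "Pseudomonadota",
}


def update_lineage(lineage: str, sep: str = "; ") -> str:
    """Return the lineage with any renamed phylum replaced by its new name.

    The "; "-separated lineage layout is an assumption for this sketch.
    """
    names = lineage.split(sep)
    return sep.join(PHYLUM_RENAMES.get(name, name) for name in names)


def matches_phylum(query: str, phylum: str) -> bool:
    """True if a search term matches a phylum by either its old or new name."""
    return PHYLUM_RENAMES.get(query, query) == PHYLUM_RENAMES.get(phylum, phylum)


example = "Bacteria; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus"
print(update_lineage(example))
```

Names at other ranks pass through unchanged, so the genus Bacillus in the example lineage is untouched.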
NCBI hidden Markov models (HMM) release 10.0 now available!
Release 10.0 of the NCBI Hidden Markov models (HMM) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
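As a rough illustration of such a search, HMMER's hmmsearch can write a per-sequence hit table with its --tblout option, for example `hmmsearch --tblout hits.tbl models.hmm proteins.faa` (the three file names are placeholders). The sketch below reads that table and keeps the highest-scoring model for each protein. The sample rows, protein accessions and model names are made up for the example, and picking the top bit score is a simplification chosen here; it is not how PGAP assigns names.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    target: str     # protein sequence name
    model: str      # HMM name (the query in an hmmsearch run)
    model_acc: str  # HMM accession
    evalue: float   # full-sequence E-value
    score: float    # full-sequence bit score


def parse_tblout(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per data row of an hmmsearch --tblout file."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 18 whitespace-separated columns, then a free-text description.
        f = line.split(maxsplit=18)
        yield Hit(f[0], f[2], f[3], float(f[4]), float(f[5]))


def best_hits(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep the highest-scoring model for each protein."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.target not in best or hit.score > best[hit.target].score:
            best[hit.target] = hit
    return best


# Made-up rows in --tblout layout, for illustration only.
SAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias ...
WP_000000001.1 - SctB_example NF038055.1 1.2e-80 270.5 0.1 1.5e-80 270.2 0.1 1.0 1 0 0 1 1 1 1 example protein
WP_000000001.1 - other_model NF000000.1 3.0e-10 40.2 0.0 4.0e-10 39.8 0.0 1.0 1 0 0 1 1 1 1 example protein
WP_000000002.1 - other_model NF000000.1 5.0e-25 88.0 0.2 6.0e-25 87.7 0.2 1.0 1 0 0 1 1 1 1 another example protein
"""

for protein, hit in best_hits(parse_tblout(SAMPLE.splitlines())).items():
    print(protein, hit.model, hit.model_acc, hit.evalue)
```

To run it on real output, pass an open file handle for the table in place of `SAMPLE.splitlines()`.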
The 10.0 release contains 15,360 models maintained by NCBI, including 228 that are new since 9.0, 99 that were modified significantly, and 205 that were assigned better names, EC numbers, Gene Ontology (GO) terms, gene symbols or publications. You can search and view the details for these in the Protein Family Model collection, which also includes conserved domain architectures and BlastRules, and find all RefSeq proteins they name.
GO terms associated with HMMs are now propagated to CDSs and proteins annotated with PGAP. In case you missed it, see our previous blog post on this topic.
ASM Microbe 2022 was a success!
NCBI had the pleasure of attending and participating in this year’s American Society for Microbiology (ASM) Microbe conference, June 9-13 in Washington, D.C. NCBI staff participated in activities and events throughout the conference. Over 4,500 attendees gathered in the exhibit hall and joined a variety of poster presentations and talks!
Reflections from a few of our NCBI experts
“It was a great honor for me to receive the ASM Elizabeth O. King Lecturer Award. Thank you to my colleagues, without whom so much of my work would not have been possible, and to all of those who attended my presentation on Making Genomics Accessible to Aid Public Health and Research.”
~Michael Feldgarden, Ph.D. Continue reading “ASM Microbe 2022 was a success!”
NCBI hidden Markov models (HMM) release 8.0 now available!
Release 8.0 of the NCBI Hidden Markov models (HMM), used by the Prokaryotic Genome Annotation Pipeline (PGAP), is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
The 8.0 release contains 15,358 models, including 160 that are new since 7.0. In addition, we have added better names, EC numbers, Gene Ontology (GO) terms, gene symbols or publications to over 550 existing HMMs. You can search and view the details for these in the Protein Family Model collection, which also includes conserved domain architectures and BlastRules, and find all RefSeq proteins they name.
GO terms associated with HMMs are now propagated to coding sequences and proteins annotated with PGAP. In case you missed it, see our previous blog post on this topic.
Updated protein family models used by PGAP available for download
Release 3.0 of the NCBI protein family models used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
The 3.0 release contains 17,350 models: 12,864 HMMs built at NCBI (111 more than in release 2.0) and 4,486 TIGRFAM HMMs. In addition, since release 2.0, we have assigned product names to over 2,000 Pfam HMMs, bringing the total to 6,698 Pfam HMMs with names that can be transferred by PGAP to the annotated proteins they hit. You can access a table of these product names from the release directory.
Figure 1. The evidence for name assignment for type III secretion system (T3SS) translocon subunit SctB (NF038055) showing the protein matches. Species-specific names for this highly variable component of T3SS include YopD, EspB, IpaC, SipC, etc. Instead, we used the standard moniker for core genes of T3SS, Sct, Secretion and cellular translocation (PMID 26520801, PMID 9618447), providing a unified nomenclature for this secretion system. Continue reading “Updated protein family models used by PGAP available for download”
Enhanced prokaryote type strain report now with details on needed type strain data
The Prokaryote type strain report provides information on type strains for over 18,000 species. We revised and expanded the report to make it easier to identify cases where sequencing or establishing type material would have the biggest impact on improving prokaryote taxonomy and accurate identification. These cases include species with designated type strains but without a sequenced type strain assembly, and species without designated type material. We hope that the community will prioritize sequencing type strains for the former set of species (Table 1) and establishing a neotype or reftype, where applicable (as defined in Ciufo et al. 2018), for the latter set (Table 2).
Other changes from the old format file are detailed in a recent genomes announce post.
| Scientific Name | Type material/co-identical strains | Assemblies |
|---|---|---|
| Burkholderia ubonensis | CCUG:48852, CIP:1070, … | 308 |
| Escherichia albertii | Albert 19982, BCCM/LMG:20976, … | 181 |
| Xanthomonas perforans | ATCC:BAA-983, DSM:18975, … | 153 |
| Listeria innocua | ATCC:33090, BCCM/LMG:11387, … | 106 |
| Streptococcus iniae | ATCC:29178, BCCM/LMG:14520, … | 94 |
| Vibrio lentus | CECT:5110, CIP:107166, … | 87 |
| Vibrio cyclitrophicus | ATCC:700982, BCCM/LMG:21359, … | 83 |
| Pseudomonas coronafaciens | BCCM/LMG:5060, CFPB:2216, … | 77 |
| Aliivibrio fischeri | ATCC:7744, BCCM/LMG:4414, … | 66 |
| Xanthomonas fragariae | ATCC:33239, BCCM/LMG:708, … | 61 |
Table 1. The top 10 candidate species for sequencing type strains, sorted by the number of assemblies. These have designated type strains but no type strain assembly. We generated the list by sorting by “number of assemblies from type materials per species”, then by decreasing “number of assemblies per taxon”, then filtering out “type materials and coidentical strains” = “na”.
Table 2. The top 10 candidates for proposing a reftype assembly, or neotype where applicable, sorted by the number of assemblies. These species have no designated type strain. We generated the list by selecting for “type materials and coidentical strains” = “na”, “number of assemblies from type materials per species” = 0, and sorting by decreasing “number of assemblies per taxon”, then filtering out Candidatus.
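The two selections described in the table captions can be sketched in a few lines of Python. The three quoted column names come from the captions; the "scientific name" key, the in-memory row layout, the function names and the "Example species" rows are assumptions made for this illustration, so check the report's own header before adapting it. Selecting rows with zero type-material assemblies is this sketch's reading of "no type strain assembly".

```python
TYPE_COL = "type materials and coidentical strains"
TYPE_ASM_COL = "number of assemblies from type materials per species"
ASM_COL = "number of assemblies per taxon"
NAME_COL = "scientific name"  # assumed key; the report's header may differ


def by_assemblies(rows, top):
    """Sort rows by decreasing assembly count and keep the first `top`."""
    return sorted(rows, key=lambda r: int(r[ASM_COL]), reverse=True)[:top]


def sequencing_candidates(rows, top=10):
    """Table 1 style: designated type strain, but no type strain assembly."""
    picked = [r for r in rows
              if r[TYPE_COL] != "na" and int(r[TYPE_ASM_COL]) == 0]
    return by_assemblies(picked, top)


def designation_candidates(rows, top=10):
    """Table 2 style: no designated type material, Candidatus excluded."""
    picked = [r for r in rows
              if r[TYPE_COL] == "na"
              and int(r[TYPE_ASM_COL]) == 0
              and not r[NAME_COL].startswith("Candidatus")]
    return by_assemblies(picked, top)


# Two rows echo Table 1; the "Example" rows are invented for the sketch.
rows = [
    {NAME_COL: "Escherichia albertii", TYPE_COL: "Albert 19982",
     TYPE_ASM_COL: "0", ASM_COL: "181"},
    {NAME_COL: "Burkholderia ubonensis", TYPE_COL: "CCUG:48852",
     TYPE_ASM_COL: "0", ASM_COL: "308"},
    {NAME_COL: "Example species A", TYPE_COL: "na",
     TYPE_ASM_COL: "0", ASM_COL: "50"},
    {NAME_COL: "Candidatus Example B", TYPE_COL: "na",
     TYPE_ASM_COL: "0", ASM_COL: "900"},
    {NAME_COL: "Example species C", TYPE_COL: "ATCC:00000",
     TYPE_ASM_COL: "2", ASM_COL: "400"},
]

print([r[NAME_COL] for r in sequencing_candidates(rows)])
print([r[NAME_COL] for r in designation_candidates(rows)])
```

To work from the downloaded report, the rows could be read with `csv.DictReader` once the delimiter and header names have been confirmed.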
Please contact email@example.com if you want to provide information about missing type strains.
Expanded average nucleotide identity analysis now available for prokaryotic genome assemblies
As we described in an earlier post, GenBank uses average nucleotide identity (ANI) analysis to find and correct misidentified prokaryotic genome assemblies. You can now access ANI data for the more than 600,000 GenBank bacterial and archaeal genome assemblies through a downloadable report (ANI_report_prokaryotes.txt) available from the genomes/ASSEMBLY_REPORTS area of the FTP site. The README describes the contents of the report in detail. You can use the ANI data to evaluate the taxonomic identity of genome assemblies of interest for yourself.
The new ANI_report_prokaryotes.txt replaces the older ANI_report_bacteria.txt in the same directory. We are no longer updating the ANI_report_bacteria.txt file and will remove it after 31st May 2020.
New release of the Prokaryotic Genome Annotation Pipeline with updated tRNAscan and protein models
A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is now available on GitHub. This release uses a new and improved version of tRNAscan (tRNAscan-SE:2.0.4) and includes our most up-to-date Hidden Markov Model and BlastRule collections for naming proteins.
Remember that you can submit the results of PGAP to GenBank. Or, if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors mode to get a preliminary annotation.
See our previous post and our documentation for details on how to set up and run PGAP yourself.
Try PGAP and let us know how you like it!
RefSeq release 216 is now available online, from the FTP site, and through NCBI’s new resource, Datasets.
This full release incorporates genomic, transcript, and protein data available as of January 9, 2023, and contains 342,395,932 records, including 249,868,639 proteins, 49,869,497 RNAs, and sequences from 128,299 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings. Continue reading “RefSeq Release 216”