Release 3.0 of the NCBI protein family models used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
The 3.0 release contains 17,350 models: 12,864 HMMs built at NCBI (111 more than in release 2.0) and 4,486 TIGRFAM HMMs. In addition, since release 2.0, we have assigned product names to over 2,000 Pfam HMMs, bringing the total to 6,698 Pfam HMMs with names that can be transferred by PGAP to the annotated proteins they hit. You can access a table of these product names from the release directory.
Figure 1. The evidence for name assignment for type III secretion system (T3SS) translocon subunit SctB (NF038055), showing the protein matches. Species-specific names for this highly variable component of T3SS include YopD, EspB, IpaC, and SipC. Instead, we used the standard moniker for core genes of T3SS, Sct (secretion and cellular translocation; PMID 26520801, PMID 9618447), providing a unified nomenclature for this secretion system.
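As a rough illustration of searching the models against a set of proteins with HMMER, the Python sketch below builds an `hmmsearch` command line and parses the tabular output that `--tblout` produces. The file names, the E-value threshold, and the helper names `build_hmmsearch_command` and `parse_tblout` are assumptions made for this example; they are not part of the release or of PGAP.

```python
def build_hmmsearch_command(hmm_file, protein_fasta, tblout_path,
                            evalue=1e-5, cpus=4):
    """Return an argument list for HMMER's hmmsearch.

    The E-value cutoff is an arbitrary illustrative choice, not the
    threshold PGAP applies.
    """
    return [
        "hmmsearch",
        "-E", str(evalue),        # report sequences at or below this E-value
        "--cpu", str(cpus),       # number of worker threads
        "--tblout", tblout_path,  # one line per (protein, model) hit
        hmm_file,                 # the downloaded HMM library
        protein_fasta,            # your proteins in FASTA format
    ]


def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into a list of dictionaries.

    For hmmsearch, the "target" columns describe the protein and the
    "query" columns describe the model.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 18 whitespace-separated columns, then a free-text description
        fields = line.split(None, 18)
        if len(fields) < 18:
            continue  # malformed line
        hits.append({
            "protein": fields[0],
            "model_name": fields[2],
            "model_accession": fields[3],
            "evalue": float(fields[4]),  # full-sequence E-value
            "score": float(fields[5]),   # full-sequence bit score
        })
    return hits
```

The argument list can be passed to `subprocess.run(cmd, check=True)` once HMMER is installed, and the resulting table can then be read with `parse_tblout(open(tblout_path))`.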
We have updated the collection of representative and reference assemblies for Bacteria and Archaea to better reflect the taxonomic breadth of the prokaryotes in RefSeq. We chose the 11,478 representative assemblies in the new collection from the 180,000+ prokaryotic assemblies in RefSeq today. We have selected one representative or reference assembly for every species based on several criteria, including contiguity, completeness, and whether the assembly is from type material. We have also updated the reference and representative microbial BLAST database to reflect these changes. This reference and representative set will be updated three times a year to reflect changes in RefSeq. In addition, as we announced on Feb 14, we have reduced the number of reference genome assemblies (the subset of representative assemblies with annotation provided by outside experts) to 15. See the list in our previous post. We have re-annotated the 104 assemblies that are no longer reference with our Prokaryotic Genome Annotation Pipeline (PGAP).
A new release of the NCBI protein family profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using HMMER.
The HMMs are used as hints for the structural annotation of protein-coding genes in bacterial genomes and are also one of the sources for the names assigned to PGAP-annotated proteins presented in the Evidence-For-Name-Assignment comment block of RefSeq protein records (See for example, WP_004152100.1).
The collection comprises 12,753 HMMs that were built at NCBI, and 4,486 TIGRFAM HMMs whose ownership was transferred to NCBI in April 2018. In addition to the HMM profiles and seed alignments, a tab-delimited file containing the product names and other attributes added to the HMMs by curators is available.
85% of models were assigned a product name that can be transferred to proteins hit by the model.
7,702 models have gene symbols.
14,508 are supported by at least one publication.
6,266 are assigned an Enzyme Commission number.
617 represent anti-microbial resistance proteins.
Product names added to 4,686 Pfam HMMs owned by EMBL-EBI and used for functional annotation by PGAP are also included.
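Counts like those above can be recomputed from the tab-delimited attribute file. The following Python sketch shows one way to do so, assuming the file has a header row; the column names used in the usage note (`product_name`, `gene_symbol`) are hypothetical placeholders, so check the header of the actual file before relying on them.

```python
import csv
import io


def summarize_attributes(tsv_text, columns):
    """Count how many rows have a non-empty value in each named column.

    tsv_text is the content of a tab-delimited file with a header row.
    Returns (total_rows, {column: rows_with_a_value}).
    Columns that are absent from the header are counted as empty.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    counts = {}
    for col in columns:
        counts[col] = sum(
            1 for row in rows if (row.get(col) or "").strip()
        )
    return len(rows), counts
```

For example, `summarize_attributes(open(path).read(), ["product_name", "gene_symbol"])` would return the number of models and how many of them carry a value in each of those (assumed) columns.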
A total of 57 million RefSeq prokaryotic proteins have been named based on these curated HMMs, and can be identified with the Entrez query “meta Evidence-For-Name-Assignment”[Properties] AND “Evidence Category=HMM”[Text Word]. See an example and more information on web displays of HMMs in a previous post.
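The same Entrez query can also be sent programmatically through the NCBI E-utilities `esearch` endpoint. The Python sketch below only builds the request URL and performs no network access; the helper name `build_esearch_url` and the choice of `retmax` are assumptions for illustration.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The query from the post, written with straight quotes
HMM_NAMED_QUERY = (
    '"meta Evidence-For-Name-Assignment"[Properties] '
    'AND "Evidence Category=HMM"[Text Word]'
)


def build_esearch_url(term, db="protein", retmax=20):
    """Return an esearch URL for the given Entrez query.

    retmax limits how many record identifiers are returned;
    retmode=json asks for a JSON response.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)
```

Calling `build_esearch_url(HMM_NAMED_QUERY)` gives a URL that can be opened in a browser or fetched with any HTTP client; the total number of matching proteins is reported in the response alongside the identifiers.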
Next week, NCBI staff will attend AGBT in Marco Island, Florida. On Tuesday, February 25, 2020, three posters from NCBI staff will be on display from 4:40 p.m. – 6:10 p.m. during the Poster Session and Wine Reception in the Banyan and Calusa Ballroom Foyers, Levels 1 and 3. Read on to learn a little bit about what we’ll be presenting.
A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) with several important features is now available on GitHub.
In response to several requests, we have added the option of running PGAP with Singularity, Podman, or any other Docker-compatible executable you wish to use.
We have also lifted the requirement for internet access in case you have privacy concerns. To run the pipeline without internet access, set the flag
Are you unsure about the identity of the organism you sequenced? We’ve added the Taxonomy-Check module to help you. This module will confirm the organism name or suggest a new taxonomic assignment through average nucleotide identity comparison with type material assemblies from GenBank. The check is currently an optional validation step prior to PGAP.
Try these new features and let us know what you think! Or submit your PGAP-annotated assembly to GenBank. And remember that if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors flag to get a preliminary annotation.
We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.
We will reduce the number of Reference assemblies to the 15 that have annotation provided by outside experts (Table 1) and re-annotate the 105 other current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose reference status.
We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.
Check out the latest videos on YouTube to learn how to best use NCBI graphical viewers, SRA, PGAP, and other resources.
Genome Data Viewer: Analyzing Remote BAM Alignment Files and Other Tips
This video shows you how to upload remote BAM files, and succinctly demonstrates handy viewer settings, such as Pileup display options, and highlights the very helpful tooltips in the Genome Data Viewer (GDV). There’s also a brief blog post on the same topic.
On Wednesday, December 11, 2019 at 12 PM, NCBI staff will present a webinar that will show you how to use NCBI’s PGAP (https://github.com/ncbi/pgap) on your own data to predict genes on bacterial and archaeal genomes using the same inputs and applications used inside NCBI. You can run PGAP on your own machine, on a compute farm, or in the Cloud. Plus, you can now submit genome sequences annotated by your copy of PGAP to GenBank. Attend the webinar to learn more!
Date and time: Wed, Dec 11, 2019 12:00 PM – 12:45 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is now available on GitHub. This release uses a new and improved version of tRNAscan (tRNAscan-SE:2.0.4) and includes our most up-to-date Hidden Markov Model and BlastRule collections for naming proteins.
Remember that you can submit the results of PGAP to GenBank. Or, if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors mode to get a preliminary annotation.
We have released a new version of the Prokaryotic Genome Annotation Pipeline (PGAP), available on GitHub. The new release includes the ability to ignore pre-annotation validation errors (--ignore-all-errors). This new feature allows you to produce a preliminary annotation for a draft version of the genome, even one that contains vector and adapter sequences or that is outside of the size range for the species. This draft annotation should be helpful with your ongoing work on the genome assembly. Please keep in mind that these pre-annotations and assemblies with contaminants or other errors are not suitable for submission to GenBank.
Another new feature allows you to provide the name of the consortium that generated the assembly and annotation so that this information appears in the final GenBank records. For more details, consult our guidelines on input files.
See our previous post and our documentation for details on how to obtain and run PGAP yourself.
Next on our to-do list is a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned!