Release 12.0 of the NCBI protein profile hidden Markov models (HMMs) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available for download. You can search your favorite prokaryotic proteins against this collection with the HMMER sequence analysis package to identify their function.
The 12.0 release contains:
15,849 HMMs maintained by NCBI
271 new HMMs since release 11.0
1,248 HMMs with better names, EC numbers, Gene Ontology (GO) terms, gene symbols or publications
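If you run HMMER's hmmsearch with the --tblout option (for example, hmmsearch --tblout hits.tbl followed by the HMM file and your protein FASTA file), you get a whitespace-delimited table of hits. The sketch below shows one way to read such a table and keep the highest-scoring model for each protein. The file name, the sample rows, and the model names and accessions (model_X, HYP00001.1, and so on) are made up for illustration. Picking the top bit score is only a simple heuristic, not the way PGAP assigns names.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    target: str     # protein (target sequence) name
    model: str      # HMM (query) name
    accession: str  # HMM accession, "-" if absent
    evalue: float   # full-sequence E-value
    score: float    # full-sequence bit score


def parse_tblout(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per data row of an hmmsearch --tblout table."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 18 fixed columns, then a free-text description that may contain spaces
        f = line.split(maxsplit=18)
        yield Hit(f[0], f[2], f[3], float(f[4]), float(f[5]))


def best_hit_per_protein(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep the hit with the highest full-sequence bit score for each protein."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if h.target not in best or h.score > best[h.target].score:
            best[h.target] = h
    return best


# Made-up rows in the tblout column layout (not real search results).
SAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias ...
prot_A - model_X HYP00001.1 1.2e-50 170.3 0.1 1.5e-50 170.0 0.1 1.0 1 0 0 1 1 1 1 example protein one
prot_A - model_Y HYP00002.1 3.0e-10 40.2 0.0 4.1e-10 39.8 0.0 1.1 1 0 0 1 1 1 1 example protein one
prot_B - model_Y HYP00002.1 4.0e-80 265.9 0.2 4.6e-80 265.7 0.2 1.0 1 0 0 1 1 1 1 example protein two
"""

if __name__ == "__main__":
    for protein, hit in best_hit_per_protein(parse_tblout(SAMPLE.splitlines())).items():
        print(protein, hit.model, hit.accession, hit.score)
```

To use this on a real search, pass an open file handle for your own --tblout file to parse_tblout instead of SAMPLE.splitlines().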
We are happy to announce a new version of the stand-alone Prokaryotic Genome Annotation Pipeline (PGAP). This version helps you interpret your results by providing an estimate of the completeness and contamination of your PGAP-annotated genome assembly using CheckM.
CheckM uses the presence of a set of lineage-specific genes for the species provided or the species returned by the taxonomy check (--taxcheck, --auto-correct-tax). The higher the completeness and the lower the contamination, the better the assembly is! If contamination is a concern, please try FCS-GX, a highly sensitive tool for detecting foreign contaminants in prokaryotic and eukaryotic genome assemblies.
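To give a feel for what these two numbers mean, here is a deliberately simplified sketch. It treats completeness as the share of expected marker genes seen at least once, and contamination as the number of extra copies relative to the number of expected markers. CheckM's own calculation is more involved than this per-marker count, so treat the function and the marker names (m1 to m4) as illustrative assumptions rather than a reimplementation.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def completeness_contamination(
    marker_hits: Iterable[str], marker_set: Set[str]
) -> Tuple[float, float]:
    """Return (completeness %, contamination %) under a simplified per-marker model.

    marker_hits: one entry per marker gene copy detected in the assembly.
    marker_set:  the markers expected once each in a genome of this lineage.
    """
    if not marker_set:
        raise ValueError("marker_set must not be empty")
    counts = Counter(m for m in marker_hits if m in marker_set)
    # Markers seen at least once count toward completeness.
    found = sum(1 for m in marker_set if counts[m] >= 1)
    # Every copy beyond the first counts toward contamination.
    extra = sum(counts[m] - 1 for m in marker_set if counts[m] > 1)
    n = len(marker_set)
    return 100.0 * found / n, 100.0 * extra / n


if __name__ == "__main__":
    expected = {"m1", "m2", "m3", "m4"}   # hypothetical marker names
    observed = ["m1", "m2", "m2", "m3"]   # m2 found twice, m4 missing
    print(completeness_contamination(observed, expected))
```

In this toy case one of four markers is missing and one appears twice, so both a gap in the assembly and a possible foreign sequence show up in the numbers.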
This new release also contains code changes that improve prediction of some long genes, especially in low complexity regions. And, as with every release, PGAP incorporates incremental improvements from expert curators of the Protein Family Model collection that increase the precision of PGAP’s structural and functional annotation.
We are excited to announce two improvements to the Read assembly and Annotation Pipeline Tool (RAPT), which allows you to assemble genomic reads for bacterial or archaeal isolates and annotate their genes at the click of a button.
Improved taxonomic assignment
Now RAPT verifies the scientific name you provide with the reads, and corrects it as needed with the Average Nucleotide Identity (ANI) tool, which compares your genome to type strain assemblies in GenBank to place it in the taxonomic tree. So, even if you only have a rough idea of the species you have sequenced, input datasets tailored to your genome will be used for the annotation and you will get the best possible gene set from RAPT.
This version of PGAP offers a more streamlined experience to users who are uncertain about the taxonomic classification of the genomes they wish to annotate. Adding one flag to the command (--auto-correct-tax) results in the override of the species name provided on input if the taxonomy verification process predicts a different organism with high confidence.
RefSeq prokaryotic genomes and proteins are now annotated with Gene Ontology (GO) terms. Over the years we have received many requests to add GO terms to the annotations we provide. We heard you!
We are embarking on this adventure and starting to assign terms from the Biological Process, Molecular Function and Cellular Component ontologies to the genomes and proteins we annotate with the Prokaryotic Genome Annotation Pipeline (PGAP). Because the Gene Ontologies are hierarchical, these annotations will make it easier to compare gene content across genomes at different levels of specificity and will eventually allow GO term enrichment analysis. GO terms are now associated with coding sequence (CDS) features on newly submitted genomes (see Figure 1). They will progressively appear on genomes that are already in RefSeq as these get reannotated (about once a year). We expect all RefSeq genomes to have some GO terms by the spring of 2023.
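The idea of comparing gene content at different levels of specificity can be sketched in a few lines. Two genomes may carry different specific terms that nevertheless share a broader parent term. The term identifiers below (TOY:...) and the parent links are invented for illustration and are not real GO identifiers. A real analysis would load the relationships from the ontology itself.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical child -> parents map standing in for a small piece of an ontology.
PARENTS: Dict[str, List[str]] = {
    "TOY:transport": ["TOY:process"],
    "TOY:ion_transport": ["TOY:transport"],
    "TOY:sugar_transport": ["TOY:transport"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand(terms: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given terms plus all of their ancestors."""
    expanded = set(terms)
    for term in list(expanded):
        expanded |= ancestors(term, parents)
    return expanded


if __name__ == "__main__":
    genome_a = {"TOY:ion_transport"}
    genome_b = {"TOY:sugar_transport"}
    print("shared specific terms:", genome_a & genome_b)
    print("shared after expansion:", expand(genome_a, PARENTS) & expand(genome_b, PARENTS))
```

The two toy genomes share no specific term, but after adding ancestors they overlap at the broader transport and process levels.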
Release 4.0 of the NCBI hidden Markov models (HMMs) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search your favorite prokaryotic proteins against this collection with the HMMER sequence analysis package to identify their function.
This release contains 17,443 models, including 94 new models since the last release. We have also updated names and added EC numbers and gene symbols to over 100 models. You can search and view the details of these HMMs in the newly deployed Protein Family Model collection, which also includes conserved domain architectures and BlastRules. The collection also lets you find all RefSeq proteins named by these profiles. See our recent post for more details.
The new Protein Family Model resource (Figure 1) provides a way for you to search across the evidence used by the NCBI annotation pipelines to name and classify proteins. You can find protein families by gene symbol, protein function, and many other terms. You have access to related proteins in the family and publications describing members. Protein Family Models includes protein profile hidden Markov models (HMMs) and BlastRules for prokaryotes, and conserved domain architectures for prokaryotes and eukaryotes. The HMMs in the collection include Pfam models and TIGRFAMs, as well as models developed at NCBI either de novo or from NCBI protein clusters. Each of the BlastRules (PMCID: 5753331) consists of one or more model proteins of known biological function with BLAST identity and coverage cutoffs. The conserved domain architectures are based on BLAST-compatible Position Specific Score Matrices (PSSMs) that constitute the NCBI Conserved Domain database.
Figure 1. Protein Family Model resource pages. Top panel: home page. Middle panel: selected results summaries from a fielded search for the DnaK gene product (DnaK[Gene Symbol]). Bottom panel: a portion of an HMM record for DnaK derived from NCBI Protein Clusters (NF009946). The record also includes PubMed citations and HMMER analyses showing the RefSeq proteins named by this method.
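As described above, a BlastRule pairs model proteins with BLAST identity and coverage cutoffs. A minimal sketch of how such a cutoff check could look follows. The class, the field names, the cutoff values, and the choice to measure coverage over the model protein are all assumptions made for illustration. They are not taken from the actual rule definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRule:
    """Hypothetical stand-in for a rule: a name plus two percentage cutoffs."""
    name: str
    min_identity: float  # minimum percent identity of the alignment
    min_coverage: float  # minimum percent of the model protein aligned


def alignment_coverage(start: int, end: int, model_length: int) -> float:
    """Percent of the model protein spanned by a 1-based, inclusive alignment."""
    if model_length <= 0 or start < 1 or end < start or end > model_length:
        raise ValueError("invalid alignment coordinates")
    return 100.0 * (end - start + 1) / model_length


def rule_applies(rule: BlastRule, identity: float, coverage: float) -> bool:
    """True only when the hit meets both the identity and the coverage cutoff."""
    return identity >= rule.min_identity and coverage >= rule.min_coverage


if __name__ == "__main__":
    # Illustrative values only; real rules carry their own curated cutoffs.
    rule = BlastRule(name="example protein", min_identity=90.0, min_coverage=95.0)
    cov = alignment_coverage(start=1, end=97, model_length=100)
    print(rule.name, rule_applies(rule, identity=93.5, coverage=cov))
```

Requiring both cutoffs at once keeps a short but highly similar fragment, or a long but weakly similar alignment, from inheriting the model protein's name.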