Tag: Protein Family Model

New in RAPT: Better taxonomic assignment and GO annotation

New in RAPT: Better taxonomic assignment and GO annotation

We are excited to announce two improvements to the Read assembly and Annotation Pipeline Tool (RAPT), which allows you to assemble genomic reads for bacterial or archaeal isolates and annotate their genes at the click of a button.

Improved taxonomic assignment

Now RAPT verifies the scientific name you provide with the reads, and corrects it as needed with the Average Nucleotide Identity (ANI) tool, which compares your genome to type strain assemblies in GenBank to place it in the taxonomic tree. So, even if you only have a rough idea of the species you have sequenced, input datasets tailored to your genome will be used for the annotation and you will get the best possible gene set from RAPT. Continue reading “New in RAPT: Better taxonomic assignment and GO annotation”

New version of PGAP available now!

We are happy to announce the release of a new version of the stand-alone Prokaryotic Genome Annotation Pipeline (PGAP).

This version of PGAP offers a more streamlined experience to users who are uncertain about the taxonomic classification of the genomes they wish to annotate. Adding one flag to the command (--auto-correct-tax) results in the override of the species name provided on input if the taxonomy verification process predicts a different organism with high confidence. Continue reading “New version of PGAP available now!”

Bacterial and archaeal genomes with GO terms in RefSeq!

RefSeq prokaryotic genomes and proteins are now annotated with Gene Ontology (GO) terms. Over the years we have received many requests to add GO terms to the annotations we provide. We heard you!

We are embarking on this adventure and starting to place terms from the Biological Process, Molecular Function and Cellular Component ontologies to genomes and proteins we annotate with the Prokaryotic Genome Annotation Pipeline (PGAP). Because of the hierarchical nature of the Gene Ontologies, these annotations will help the comparison of gene content across genomes at variable levels of specificity and eventually allow GO term enrichment analysis. GO terms are now associated with coding sequence (CDS) features on newly-submitted genomes (See Figure 1). They will progressively appear on genomes that are already in RefSeq as these get reannotated (about once a year). We expect all RefSeq genomes to have some GO terms by the spring of 2023.

Continue reading “Bacterial and archaeal genomes with GO terms in RefSeq!”

NCBI hidden Markov models (HMM) release 4.0 now available!

Release 4.0 of the NCBI hidden Markov models (HMM) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.

This release contains 17,443 models, including 94 new models since the last release. We have also updated names and added EC numbers and  gene symbols to over 100 models. You can search and view the details of these HMMs in the newly deployed Protein Family Model collection that also includes conserved domain architectures and BlastRules  and allows you to find all RefSeq proteins named by these profiles. See our recent post for more details.

The Protein Family Model resource is now available!

The new Protein Family Model resource  (Figure 1) provides a way for you to search across the evidence used by the NCBI annotation pipelines to name and classify proteins. You can find protein families by gene symbol, protein function, and many other terms. You have access to related proteins in the family and publications describing members. Protein Family Models includes protein profile hidden Markov models (HMMs) and BlastRules for prokaryotes, and conserved domain architectures for prokaryotes and eukaryotes. The HMMs in the collection include Pfam models, TIGRFAMs as well as models developed at NCBI either de novo, or from NCBI protein clusters.  Each of the BlastRules (PMCID: 5753331) consists of one or more model proteins of known biological function with BLAST identity and coverage cutoffs.  The conserved domain architectures are based on BLAST-compatible Position Specific Score Matrices  (PSSMs) that constitute the NCBI Conserved Domain database.Figure 1. Protein Family Model resource pages. Top panel.  Home page. Middle  panel, selected results summaries from a fielded search for the DnaK gene product (DnaK[Gene Symbol]). Bottom panel, a portion of an HMM record for DnaK derived from NCBI Protein Clusters (NF009946). The record also includes PubMed citations and HMMER analyses showing the RefSeq proteins named by this method.

Continue reading “The Protein Family Model resource is now available!”