Release 8.0 of the NCBI Hidden Markov models (HMM), used by the Prokaryotic Genome Annotation Pipeline (PGAP), is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
The 8.0 release contains 15,358 models, including 160 that are new since 7.0. In addition, we have added better names, EC numbers, Gene Ontology (GO) terms, gene symbols or publications to over 550 existing HMMs. You can search and view the details for these in the Protein Family Model collection, which also includes conserved domain architectures and BlastRules, and find all RefSeq proteins they name.
GO terms associated with HMMs are now propagated to coding sequences and proteins annotated with PGAP. In case you missed it, see our previous blog post on this topic.
This version of PGAP offers a more streamlined experience to users who are uncertain about the taxonomic classification of the genomes they wish to annotate. Adding one flag to the command (--auto-correct-tax) results in the override of the species name provided on input if the taxonomy verification process predicts a different organism with high confidence. Continue reading “New version of PGAP available now!”→
A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is available on GitHub. With this release, you can expect:
Incremental improvements in structural annotation, driven by increased weight of GeneMarkS2+ab initio models at loci with only weak evidence, such as low identity and low coverage protein alignments or partial HMM signatures.
Better structural annotation and more specific functional annotation as a result of the incorporation of PFAM 34 and extensive curation of HMMs, BlastRules and Conserved Domain architectures by NCBI experts.
Fewer overly stringent calls by the taxonomy verification module for several species, including the human pathogens Listeria monocytogenes, Campylobacter lari, and Vibrio vulnificus. This is a result of manual review and adjustment of the minimum percent identity thresholds used by the Average Nucleotide Identity tool.
Multiple bug fixes. Notably, users of Azure Debian 10 machines can now run PGAP successfully, as we have incorporated GeneMarkS2+ compiled under Linux kernel 3 into the PGAP image.
Release 7.0 of the NCBI Hidden Markov models (HMM), used by the Prokaryotic Genome Annotation Pipeline (PGAP), is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
Figure 1. Recently added HMM-based Protein Family Model for the histidine-histamine antiporter family (NF040512), with GO terms (framed in red).
RefSeq release 208 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of September 7, 2021, and contains 288,903,207 records, including 210,703,648 proteins, 40,213,945 RNAs, and sequences from 113,002 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings. Continue reading “RefSeq Release 208 is available!”→
The NCBI Hidden Markov models (HMM) 6.0 release, available on our FTP site, has 15,247 models supported at NCBI. We created 80 more new HMMs and consolidated the collection by removing 2,151 HMMs that were nearly identical to another. Release 6.0 also incorporates 12,656 PFAM from release 34 that apply to prokaryotic proteins. You can use the HMMER sequence analysis package to search the collection against your favorite prokaryotic proteins to identify their function. We have also added more specific names or associated EC number, gene symbols and publication to over 500 HMMs.
Gene Ontology (GO) term attributes are now available for 20% of HMM models (see Figure 1 below). We added most of these based on existing mappings, but our experts are working on creating more associations. Starting in the fall, we’ll start propagating GO terms from HMMs to annotated genomes and proteins!
Release 4.0 of the NCBI hidden Markov models (HMM) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
This release contains 17,443 models, including 94 new models since the last release. We have also updated names and added EC numbers and gene symbols to over 100 models. You can search and view the details of these HMMs in the newly deployed Protein Family Model collection that also includes conserved domain architectures and BlastRules and allows you to find all RefSeq proteins named by these profiles. See our recent post for more details.
The new Protein Family Model resource (Figure 1) provides a way for you to search across the evidence used by the NCBI annotation pipelines to name and classify proteins. You can find protein families by gene symbol, protein function, and many other terms. You have access to related proteins in the family and publications describing members. Protein Family Models includes protein profile hidden Markov models (HMMs) and BlastRules for prokaryotes, and conserved domain architectures for prokaryotes and eukaryotes. The HMMs in the collection include Pfam models, TIGRFAMs as well as models developed at NCBI either de novo, or from NCBI protein clusters. Each of the BlastRules (PMCID: 5753331) consists of one or more model proteins of known biological function with BLAST identity and coverage cutoffs. The conserved domain architectures are based on BLAST-compatible Position Specific Score Matrices (PSSMs) that constitute the NCBI Conserved Domain database.Figure 1. Protein Family Model resource pages. Top panel. Home page. Middle panel, selected results summaries from a fielded search for the DnaK gene product (DnaK[Gene Symbol]). Bottom panel, a portion of an HMM record for DnaK derived from NCBI Protein Clusters (NF009946). The record also includes PubMed citations and HMMER analyses showing the RefSeq proteins named by this method.