The American Society for Microbiology (ASM) Microbe conference is back and will take place in person June 9th-13th in Washington, D.C.
NCBI staff member Dr. Michael Feldgarden will be recognized by ASM with an award for his research. Other NCBI staff will present posters on NCBI resources and will also be available at our booth (#1128) to address your questions. Drop by to see what’s new and provide your feedback. We hope to see you there! Check out NCBI’s schedule of activities: Continue reading “Come see NCBI at the ASM Microbe Conference 2022”→
We are excited to announce two improvements to the Read assembly and Annotation Pipeline Tool (RAPT), which allows you to assemble genomic reads for bacterial or archaeal isolates and annotate their genes at the click of a button.
Improved taxonomic assignment
Now RAPT verifies the scientific name you provide with the reads, and corrects it as needed with the Average Nucleotide Identity (ANI) tool, which compares your genome to type strain assemblies in GenBank to place it in the taxonomic tree. So, even if you only have a rough idea of the species you have sequenced, input datasets tailored to your genome will be used for the annotation and you will get the best possible gene set from RAPT. Continue reading “New in RAPT: Better taxonomic assignment and GO annotation”→
Release 8.0 of the NCBI Hidden Markov models (HMM), used by the Prokaryotic Genome Annotation Pipeline (PGAP), is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
The 8.0 release contains 15,358 models, including 160 that are new since 7.0. In addition, we have added better names, EC numbers, Gene Ontology (GO) terms, gene symbols or publications to over 550 existing HMMs. You can search and view the details for these in the Protein Family Model collection, which also includes conserved domain architectures and BlastRules, and find all RefSeq proteins they name.
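As a rough illustration of the search described above: HMMER's `hmmsearch` can write a per-sequence hit table with its `--tblout` option. The Python sketch below parses such a table and keeps the best-scoring model for each protein. The file names in the comment and the model names, accessions, and protein identifiers in the sample are hypothetical placeholders, not real NCBI accessions or results.

```python
# Sketch: pick the best-scoring HMM for each protein from hmmsearch --tblout output.
# A table like this could be produced with, for example:
#   hmmsearch --tblout hits.tbl models.hmm proteins.faa
# (models.hmm and proteins.faa are placeholder file names).

# Made-up sample rows in the tblout layout: 18 whitespace-separated columns,
# followed by a free-text description of the target sequence.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias  ...
prot_1  -  model_A  HYP0001.1  1.2e-50  170.3  0.1  1.5e-50  170.0  0.1  1.0  1  0  0  1  1  1  1  hypothetical protein
prot_1  -  model_B  HYP0002.1  3.0e-10  40.2  0.0  4.0e-10  39.8  0.0  1.1  1  0  0  1  1  1  1  hypothetical protein
prot_2  -  model_B  HYP0002.1  5.0e-80  265.9  0.3  6.0e-80  265.6  0.3  1.0  1  0  0  1  1  1  1  another protein
"""


def best_hits(tblout_text):
    """Return {protein: (model_name, model_accession, evalue, score)} for the top-scoring model."""
    best = {}
    for line in tblout_text.splitlines():
        # Skip blank lines and the comment lines that hmmsearch writes.
        if not line.strip() or line.startswith("#"):
            continue
        # In hmmsearch output the target is the protein and the query is the HMM.
        fields = line.split(maxsplit=18)
        protein, model, accession = fields[0], fields[2], fields[3]
        evalue, score = float(fields[4]), float(fields[5])
        if protein not in best or score > best[protein][3]:
            best[protein] = (model, accession, evalue, score)
    return best


if __name__ == "__main__":
    for protein, (model, acc, evalue, score) in sorted(best_hits(SAMPLE_TBLOUT).items()):
        print(f"{protein}\t{model}\t{acc}\t{evalue:g}\t{score}")
```

Keeping only the top-scoring model per protein is one simple choice among several. Depending on your goal, you may prefer to apply score cutoffs or keep every significant hit.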
GO terms associated with HMMs are now propagated to coding sequences and proteins annotated with PGAP. In case you missed it, see our previous blog post on this topic.
This version of PGAP offers a more streamlined experience to users who are uncertain about the taxonomic classification of the genomes they wish to annotate. Adding one flag to the command (--auto-correct-tax) overrides the species name provided as input when the taxonomy verification process predicts a different organism with high confidence. Continue reading “New version of PGAP available now!”→
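The override behavior described for --auto-correct-tax can be pictured with a small Python sketch. This is not PGAP code. The `TaxCheck` structure, its field names, and the example species pairing are assumptions made for illustration, and the way PGAP actually measures confidence is not described here.

```python
from dataclasses import dataclass


@dataclass
class TaxCheck:
    """Outcome of a taxonomy check (hypothetical structure, for illustration only)."""
    submitted_name: str
    predicted_name: str
    high_confidence: bool


def name_for_annotation(check, auto_correct_tax=False):
    """Return the species name to annotate with.

    The submitted name is replaced only when the option is on, the prediction
    differs from the submitted name, and the prediction is high confidence.
    """
    differs = check.predicted_name != check.submitted_name
    if auto_correct_tax and differs and check.high_confidence:
        return check.predicted_name
    return check.submitted_name


if __name__ == "__main__":
    # Made-up example: the submitter's guess disagrees with the prediction.
    check = TaxCheck("Campylobacter lari", "Campylobacter jejuni", high_confidence=True)
    print(name_for_annotation(check))                         # keeps the submitted name
    print(name_for_annotation(check, auto_correct_tax=True))  # uses the predicted name
```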
RefSeq prokaryotic genomes and proteins are now annotated with Gene Ontology (GO) terms. Over the years we have received many requests to add GO terms to the annotations we provide. We heard you!
We are embarking on this adventure by starting to assign terms from the Biological Process, Molecular Function and Cellular Component ontologies to the genomes and proteins we annotate with the Prokaryotic Genome Annotation Pipeline (PGAP). Because the Gene Ontology is hierarchical, these annotations will help you compare gene content across genomes at variable levels of specificity and will eventually allow GO term enrichment analysis. GO terms are now associated with coding sequence (CDS) features on newly submitted genomes (see Figure 1). They will progressively appear on genomes that are already in RefSeq as these get reannotated (about once a year). We expect all RefSeq genomes to have some GO terms by the spring of 2023.
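To illustrate why a hierarchy allows comparison at variable levels of specificity, here is a minimal Python sketch. It uses a toy ontology: the term identifiers, the parent-child links, and the genome annotations are invented for the example and are not real GO identifiers or RefSeq data.

```python
# Toy ontology: child term -> parent terms. IDs and links are invented for
# illustration; they are not real GO identifiers or relationships.
PARENTS = {
    "T3:DNA repair": ["T2:DNA metabolic process"],
    "T4:DNA recombination": ["T2:DNA metabolic process"],
    "T2:DNA metabolic process": ["T1:biological process"],
    "T5:transport": ["T1:biological process"],
    "T1:biological process": [],
}

# Made-up annotations: genome -> gene -> terms assigned to that gene.
ANNOTATIONS = {
    "genome_A": {"gene1": ["T3:DNA repair"], "gene2": ["T5:transport"]},
    "genome_B": {"gene1": ["T4:DNA recombination"]},
}


def with_ancestors(terms):
    """Return the given terms plus all of their ancestors in the toy ontology."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS.get(term, []))
    return seen


def genomes_with_term(annotations, term):
    """List genomes with at least one gene annotated with `term` or one of its descendants."""
    return sorted(
        genome
        for genome, gene_terms in annotations.items()
        if any(term in with_ancestors(terms) for terms in gene_terms.values())
    )


if __name__ == "__main__":
    # Specific level: only genome_A carries the repair term.
    print(genomes_with_term(ANNOTATIONS, "T3:DNA repair"))
    # Broader level: both genomes share the parent term.
    print(genomes_with_term(ANNOTATIONS, "T2:DNA metabolic process"))
```

The two genomes look different at the most specific level but alike one level up, which is the kind of comparison a hierarchical vocabulary makes possible.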
A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is available on GitHub. With this release, you can expect:
Incremental improvements in structural annotation, driven by increased weight of GeneMarkS2+ ab initio models at loci with only weak evidence, such as low identity and low coverage protein alignments or partial HMM signatures.
Better structural annotation and more specific functional annotation as a result of the incorporation of PFAM 34 and extensive curation of HMMs, BlastRules and Conserved Domain architectures by NCBI experts.
Fewer overly stringent calls by the taxonomy verification module for several species, including the human pathogens Listeria monocytogenes, Campylobacter lari, and Vibrio vulnificus. This is a result of manual review and adjustment of the minimum percent identity thresholds used by the Average Nucleotide Identity tool.
Multiple bug fixes. Notably, users of Azure Debian 10 machines can now run PGAP successfully, as we have incorporated GeneMarkS2+ compiled under Linux kernel 3 into the PGAP image.
Release 7.0 of the NCBI Hidden Markov models (HMM), used by the Prokaryotic Genome Annotation Pipeline (PGAP), is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
Figure 1. Recently added HMM-based Protein Family Model for the histidine-histamine antiporter family (NF040512), with GO terms (framed in red).
The NCBI Hidden Markov models (HMM) 6.0 release, available on our FTP site, has 15,247 models supported at NCBI. We created 80 new HMMs and consolidated the collection by removing 2,151 HMMs that were nearly identical to another model. Release 6.0 also incorporates 12,656 PFAM models from release 34 that apply to prokaryotic proteins. You can use the HMMER sequence analysis package to search the collection against your favorite prokaryotic proteins to identify their function. We have also added more specific names or associated EC numbers, gene symbols and publications to over 500 HMMs.
Gene Ontology (GO) term attributes are now available for 20% of HMMs (see Figure 1 below). We added most of these based on existing mappings, and our experts are working on creating more associations. In the fall, we’ll start propagating GO terms from HMMs to annotated genomes and proteins!
RefSeq release 207 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of July 12, 2021, and contains 285,425,070 records, including 209,035,492 proteins, 39,039,901 RNAs, and sequences from 112,462 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings. Continue reading “RefSeq release 207 is available!”→
We have re-annotated all RefSeq genomes for Escherichia coli, Mycobacterium tuberculosis, Bacillus subtilis, Acinetobacter pittii, and Campylobacter jejuni using the most recent release of PGAP. You will find that more genes now have gene symbols (e.g. recA). Your feedback indicated that the lack of symbols was an impediment to comparative analysis, so we hope that this improvement will help.
The number of re-annotated genomes is 25,619 for E. coli, 470 for B. subtilis, 6,828 for M. tuberculosis, 316 for A. pittii, and 1,829 for C. jejuni. On average, the increase in gene symbols is 30% in E. coli, 110% in B. subtilis, 57% in M. tuberculosis, 94% in A. pittii and 62% in C. jejuni (see Figure 1). After re-annotation, on average, 73% of PGAP-annotated E. coli genes and 79% of B. subtilis genes have symbols (35% for M. tuberculosis, 40% for A. pittii and 46% for C. jejuni). We assigned symbols to the annotated genes by calculating the orthologs between the genome of interest and the reference assembly for the species, and transferring the symbols from the reference genes to their orthologs in the annotated genomes.
Figure 1: Average and standard deviation of the number of genes annotated with symbols per genome, in the previous (blue) and the current annotation (orange).
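The symbol transfer step described above can be sketched in a few lines of Python, assuming the ortholog pairs have already been computed (that calculation is not shown). The locus tags, the ortholog table, and the counts passed to `percent_increase` are made up for illustration. The function names are hypothetical and are not part of PGAP.

```python
def transfer_symbols(orthologs, reference_symbols):
    """Map each annotated gene to the symbol of its reference ortholog, when one exists.

    orthologs: {annotated gene -> reference gene} (assumed to be precomputed)
    reference_symbols: {reference gene -> gene symbol}
    """
    return {
        gene: reference_symbols[ref_gene]
        for gene, ref_gene in orthologs.items()
        if ref_gene in reference_symbols
    }


def percent_increase(before, after):
    """Percent change in the number of genes with symbols between two annotations."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    # Made-up reference assembly: REF_0003 has no symbol.
    reference_symbols = {"REF_0001": "recA", "REF_0002": "dnaA"}
    # Made-up ortholog pairs for one annotated genome.
    orthologs = {"G_10": "REF_0001", "G_11": "REF_0002", "G_12": "REF_0003"}

    print(transfer_symbols(orthologs, reference_symbols))
    # With made-up counts: 100 genes with symbols before, 130 after.
    print(percent_increase(100, 130))
```

Genes whose reference ortholog has no symbol, like `G_12` here, simply stay without one.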