Release 6.0 of the NCBI hidden Markov models (HMMs), available on our FTP site, has 15,247 models supported at NCBI. We created 80 new HMMs and consolidated the collection by removing 2,151 HMMs that were nearly identical to another model. Release 6.0 also incorporates 12,656 Pfam models from release 34 that apply to prokaryotic proteins. You can use the HMMER sequence analysis package to search the collection against your favorite prokaryotic proteins to identify their function. We have also added more specific names or associated EC numbers, gene symbols, and publications to over 500 HMMs.
Gene Ontology (GO) term attributes are now available for 20% of HMMs (see Figure 1 below). We added most of these based on existing mappings, but our experts are working on creating more associations. Starting in the fall, we’ll propagate GO terms from HMMs to annotated genomes and proteins!
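As a rough illustration of the HMMER search mentioned above, the Python sketch below builds an `hmmsearch` command line and parses the table that HMMER writes with `--tblout`. The file names, the E-value threshold, and the helper names `build_hmmsearch_cmd` and `parse_tblout` are assumptions made for this example; they are not part of the release.

```python
def build_hmmsearch_cmd(hmm_file, protein_fasta, tblout, evalue=1e-5, cpus=4):
    """Return an hmmsearch argument list (hypothetical paths and threshold)."""
    return [
        "hmmsearch",
        "--tblout", tblout,   # per-sequence hits as a whitespace-delimited table
        "-E", str(evalue),    # report sequences at or below this E-value
        "--cpu", str(cpus),   # number of worker threads
        hmm_file,             # the HMM collection (queries)
        protein_fasta,        # your proteins (targets)
    ]


def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into a list of dicts, one per hit."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 18 fixed columns, then a free-text description of the target
        fields = line.rstrip("\n").split(None, 18)
        hits.append({
            "protein": fields[0],          # target name (the protein)
            "model": fields[2],            # query name (the HMM)
            "model_accession": fields[3],  # query accession
            "evalue": float(fields[4]),    # full-sequence E-value
            "score": float(fields[5]),     # full-sequence bit score
            "description": fields[18] if len(fields) > 18 else "",
        })
    return hits
```

With HMMER installed, the command could be run with `subprocess.run(build_hmmsearch_cmd("models.hmm", "proteins.faa", "hits.tbl"), check=True)` and the resulting `hits.tbl` passed line by line to `parse_tblout`.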
RefSeq release 207 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of July 12, 2021, and contains 285,425,070 records, including 209,035,492 proteins, 39,039,901 RNAs, and sequences from 112,462 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
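For readers who prefer the E-utilities route, here is a minimal Python sketch that builds an EFetch request URL for one or more RefSeq accessions. The helper name `efetch_url` and the example accession are assumptions for illustration; the sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accessions, db="protein", rettype="fasta", retmode="text"):
    """Build an EFetch URL for a list of accessions (hypothetical helper)."""
    params = {
        "db": db,                    # Entrez database to query
        "id": ",".join(accessions),  # comma-separated accession list
        "rettype": rettype,          # record format, e.g. FASTA
        "retmode": retmode,          # plain text rather than XML
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Example accession chosen only for illustration.
url = efetch_url(["NP_000537.3"])
```

The resulting URL can be opened with `urllib.request.urlopen(url)`. When sending many requests, follow NCBI's usage guidelines on request rates and API keys.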
We have re-annotated all RefSeq genomes for Escherichia coli, Mycobacterium tuberculosis, Bacillus subtilis, Acinetobacter pittii, and Campylobacter jejuni using the most recent release of PGAP. You will find that more genes now have gene symbols (e.g. recA). Your feedback indicated that the lack of symbols was an impediment to comparative analysis, so we hope that this improvement will help.
The number of re-annotated genomes is 25,619 for E. coli, 470 for B. subtilis, 6,828 for M. tuberculosis, 316 for A. pittii, and 1,829 for C. jejuni. On average, the increase in gene symbols is 30% in E. coli, 110% in B. subtilis, 57% in M. tuberculosis, 94% in A. pittii, and 62% in C. jejuni (see Figure 1). After re-annotation, on average, 73% of PGAP-annotated E. coli genes and 79% of B. subtilis genes have symbols (35% for M. tuberculosis, 40% for A. pittii, and 46% for C. jejuni). We assigned symbols to the annotated genes by calculating the orthologs between the genome of interest and the reference assembly for the species, and then transferring the symbols from the reference genes to their orthologs in the annotated genomes.
Figure 1: Average and standard deviation of the number of genes annotated with symbols per genome, in the previous (blue) and the current annotation (orange).
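The symbol-transfer step described above can be pictured with a small Python sketch. It assumes the ortholog pairs have already been computed and are given as a simple mapping; the function names and the toy identifiers in the usage note are hypothetical, and this is not PGAP's actual implementation.

```python
def transfer_symbols(orthologs, reference_symbols):
    """Copy symbols from reference genes to their orthologs.

    orthologs: dict mapping a gene ID in the annotated genome to the ID of
               its ortholog in the reference assembly (assumed precomputed).
    reference_symbols: dict mapping reference gene IDs to gene symbols.
    """
    return {
        gene: reference_symbols[ref_gene]
        for gene, ref_gene in orthologs.items()
        if ref_gene in reference_symbols  # skip reference genes without a symbol
    }


def fraction_with_symbols(gene_ids, symbols):
    """Fraction of annotated genes that received a symbol."""
    if not gene_ids:
        return 0.0
    return sum(1 for gene in gene_ids if gene in symbols) / len(gene_ids)


def percent_increase(before, after):
    """Percent change in the number of genes with symbols."""
    return (after - before) / before * 100.0
```

For example, `transfer_symbols({"g1": "ref1"}, {"ref1": "recA"})` assigns the symbol `recA` to gene `g1`, and a genome that goes from 2 to 3 genes with symbols shows a 50% increase.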
We are happy to announce that a new version of PGAP is available. This version will annotate 20 to 25% more genes with symbols (e.g. recA) on the assembled genomes of key species, compared to previous versions.
Do you need an easy way to analyze a bacterium you just isolated? The latest version of NCBI’s Read assembly and Annotation Pipeline Tool (RAPT) is a pilot web service for the assembly and gene annotation of public or private Illumina genomic reads sequenced from bacterial or archaeal isolates.
We’ll be giving a webinar on webRAPT on May 19 where you can learn more, but you can test it out now.
Get started with the click of a button
RAPT is simple to use.
1. If you’re working with NIH’s Sequence Read Archive (SRA) and have an SRA accession, enter it in the first box below (Figure 1a) or upload a file of sequencing reads in the second box (Figure 1b).
Join us on May 19, 2021 at 12 PM Eastern Time to learn how to use the new RAPT pilot service to assemble and annotate public or private Illumina genomic reads sequenced from bacterial or archaeal isolates at the click of a button. RAPT consists of two major components, the genome assembler SKESA and the Prokaryotic Genome Annotation Pipeline (PGAP), and produces an annotated genome of quality comparable to RefSeq in a couple of hours.
Date and time: Wed, May 19, 2021 12:00 PM – 12:45 PM EDT
NCBI staff will be presenting virtual posters at the Cold Spring Harbor Laboratory Biology of Genomes Meeting, May 11-14, 2021. The posters will cover the following topics: 1) a cloud-ready suite of tools (PGAP, RAPT, and SKESA) for assembling and annotating prokaryotic genomes, 2) Datasets, a new set of services for downloading genome assemblies and annotations, and 3) updates on NCBI RefSeq eukaryotic genome annotation and the Genome Data Viewer (GDV). Read more below for the full abstracts.
NCBI’s genome Assembly has a number of significant improvements!
Assembly records now have a link to Primer-BLAST making it easy to design primers in the context of a specific eukaryote genome assembly. Figure 1 shows the Assembly page for the Genome Reference Consortium Mouse Build 39 (GRCm39) with the link to Primer-BLAST.
There is a new release of the Read assembly and Annotation Pipeline Tool (RAPT) available from our GitHub site. RAPT is a one-step application for the genome assembly and gene annotation of archaeal and bacterial isolates that can run on your local computer or the Google Cloud Platform (GCP). With this new release, jobs will run twice as fast as with the December release. For example, we have assembled and annotated a Salmonella enterica genome in under an hour on a 16-CPU machine with the new release.
We have also added several new features based on your feedback including:
The --stop-on-errors flag, which stops the process if there is evidence from the average nucleotide identity check of a sample mix-up or contamination by other bacteria.
The ability to accept forward and reverse reads of paired-end runs in separate files. These can be compressed (gzip) files.
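Before submitting separate forward and reverse files, it can help to confirm that both contain the same number of reads. The Python sketch below is one way to do that; it assumes standard four-line FASTQ records, and the function names are hypothetical and not part of RAPT.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, plain or gzip-compressed.

    Assumes each record spans exactly four lines (no wrapped sequences).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def check_read_pair(forward_path, reverse_path):
    """Return the shared read count, or raise if the two files disagree."""
    n_forward = count_fastq_reads(forward_path)
    n_reverse = count_fastq_reads(reverse_path)
    if n_forward != n_reverse:
        raise ValueError(
            f"read counts differ: {n_forward} forward vs {n_reverse} reverse"
        )
    return n_forward
```

A mismatch usually means one of the files was truncated or the two files come from different runs.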
Finally, thanks to all who came to our webinar in December and provided their comments! For those who couldn’t join us, you can now view the recording on our YouTube channel.
Release 4.0 of the NCBI hidden Markov models (HMM) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
This release contains 17,443 models, including 94 new models since the last release. We have also updated names and added EC numbers and gene symbols to over 100 models. You can search and view the details of these HMMs in the newly deployed Protein Family Model collection that also includes conserved domain architectures and BlastRules and allows you to find all RefSeq proteins named by these profiles. See our recent post for more details.