A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is available on GitHub. With this release, you can expect:
Incremental improvements in structural annotation, driven by increased weight of GeneMarkS2+ ab initio models at loci with only weak evidence, such as low-identity, low-coverage protein alignments or partial HMM signatures.
Better structural annotation and more specific functional annotation as a result of the incorporation of PFAM 34 and extensive curation of HMMs, BlastRules and Conserved Domain architectures by NCBI experts.
Fewer overly stringent calls by the taxonomy verification module for several species, including the human pathogens Listeria monocytogenes, Campylobacter lari, and Vibrio vulnificus. This is a result of manual review and adjustment of the minimum percent identity thresholds used by the Average Nucleotide Identity tool.
Multiple bug fixes. Notably, users of Azure Debian 10 machines can now run PGAP successfully, as we have incorporated GeneMarkS2+ compiled under Linux kernel 3 into the PGAP image.
Release 7.0 of the NCBI Hidden Markov models (HMMs), used by the Prokaryotic Genome Annotation Pipeline (PGAP), is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
Figure 1. Recently added HMM-based Protein Family Model for the histidine-histamine antiporter family (NF040512), with GO terms (framed in red).
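If you run such a search yourself, one way to work with the results is to ask HMMER for its tabular per-target output (the `--tblout` option of `hmmsearch`) and filter it with a short script. The sketch below is only an illustration, not part of PGAP: the file names in the comment, the `parse_tblout` helper, the E-value cutoff, and the two sample rows (including the model name `example_model`) are all made up for the example.

```python
from typing import List, NamedTuple

# Hypothetical invocation that would produce a table like SAMPLE below
# (file names are placeholders, not real NCBI file names):
#   hmmsearch --tblout hits.tbl models.hmm proteins.faa


class Hit(NamedTuple):
    target: str      # protein sequence name
    query: str       # HMM name
    query_acc: str   # HMM accession
    evalue: float    # full-sequence E-value
    score: float     # full-sequence bit score


def parse_tblout(text: str, max_evalue: float = 1e-5) -> List[Hit]:
    """Parse hmmsearch --tblout text and keep hits at or below max_evalue."""
    hits = []
    for line in text.splitlines():
        # Comment and header lines start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        # 18 whitespace-separated columns, then a free-text description.
        fields = line.split(None, 18)
        if len(fields) < 18:
            continue
        hit = Hit(fields[0], fields[2], fields[3],
                  float(fields[4]), float(fields[5]))
        if hit.evalue <= max_evalue:
            hits.append(hit)
    # Best-scoring hits first.
    return sorted(hits, key=lambda h: h.score, reverse=True)


# Made-up rows in tblout layout, for illustration only.
SAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias  ...
prot_A - example_model NF040512.1 1.2e-80 265.3 0.1 1.5e-80 265.0 0.1 1.0 1 0 0 1 1 1 1 made-up protein A
prot_B - example_model NF040512.1 0.02 12.4 0.0 0.05 11.1 0.0 1.2 1 0 0 1 1 0 0 made-up protein B
"""

HITS = parse_tblout(SAMPLE)
for h in HITS:
    print(f"{h.target}\t{h.query_acc}\t{h.evalue:.1e}\t{h.score}")
```

A fixed E-value cutoff is the simplest filter to show here; the right threshold for your own data is a judgment call and may differ from the value assumed above.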
If you’ve ever tried searching for a genomic location in NCBI’s Genome Data Viewer (GDV) or Variation Viewer and found that your search term didn’t work, it’s time to try again! We recently expanded support for searches in our genome browsers using non-NCBI identifiers such as HGVS patterns (e.g. NM_001318787.2:c.2258G>A) and Ensembl IDs. You can also search by chromosome coordinates, cytogenetic band, assembly scaffold/component, disease/phenotype, dbSNP identifier, or RefSeq transcript/protein accession. We’ve gathered example searches in the table below.
When you search by single coordinate, SNP or dbVar ID, or HGVS, the browser view zooms to the location of the search result. A marker is automatically created to identify the searched position. For HGVS, the marker is labelled with the corresponding rsID, if there is one.
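To get a feel for how differently these identifier styles are shaped, here is a rough sketch that sorts a search term into one of the categories mentioned above using regular expressions. It is not how GDV or Variation Viewer interprets searches; the `classify` function, the category names, and the patterns are assumptions made for illustration, and apart from the HGVS expression quoted above the example terms are illustrative only.

```python
import re

# Simplified, hypothetical patterns; real identifiers have more variants.
# Order matters: HGVS is checked before the plain RefSeq accession.
PATTERNS = [
    ("hgvs", re.compile(r"[A-Z]{2}_\d+\.\d+:[cgmnpr]\.\S+")),
    ("coordinates", re.compile(r"(?:chr)?(?:\d{1,2}|X|Y|MT?):\d[\d,]*(?:-\d[\d,]*)?")),
    ("cytogenetic_band", re.compile(r"(?:\d{1,2}|X|Y)[pq]\d+(?:\.\d+)?")),
    ("dbsnp", re.compile(r"rs\d+")),
    ("ensembl", re.compile(r"ENS[A-Z]*\d{11}(?:\.\d+)?")),
    ("refseq_accession", re.compile(r"[A-Z]{2}_\d+(?:\.\d+)?")),
]


def classify(term: str) -> str:
    """Return the first category whose pattern matches the whole term."""
    term = term.strip()
    for name, pattern in PATTERNS:
        if pattern.fullmatch(term):
            return name
    # Anything else is treated as free text, e.g. a disease or phenotype.
    return "text"


for example in [
    "NM_001318787.2:c.2258G>A",      # HGVS expression quoted in the post
    "chr7:117,120,000-117,130,000",  # illustrative coordinate range
    "17q21.31",
    "rs328",
    "ENSG00000139618",
    "NM_000059.4",
    "cystic fibrosis",
]:
    print(f"{example} -> {classify(example)}")
```

Checking the HGVS pattern before the plain accession pattern matters because an HGVS expression begins with an accession; with the order reversed the code would still work under `fullmatch`, but keeping the most specific pattern first makes the intent clearer.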
As always, please contact us if you have additional questions or suggestions about this or any other feature in GDV or Variation Viewer. You can use the Feedback button on the page or write to the NCBI Help Desk directly.
Missed a few videos on YouTube? Here’s the latest from our channel.
Customize the MSA Viewer to Make Your Analysis Easier
We’re constantly improving the Multiple Sequence Alignment (MSA) Viewer. This video demonstrates several new and popular features, including the ability to change data columns, hide selected rows, analyze polymorphisms, and more.
We have added a new function to IgBLAST on the Web. You can now search immunoglobulin (Ig) nucleotide sequences against the Constant region (C) gene database (Figure 1) to determine the Ig isotypes including subtypes (IgM, IgG, IgA1, etc.). The isotype information is reported in the rearrangement summary table, and the C gene region is displayed in the alignment section. This feature is now available on the IgBLAST web service for human and mouse sequences with possible expansion to other organisms in the future. The feature is not yet implemented for the standalone IgBLAST package.
GenBank release 246.0 (11/2/2021) is now available on the NCBI FTP site. This release has 16.1 trillion bases and 2.57 billion records.
The current release has 233,642,893 traditional records containing 1,014,763,752,113 base pairs of sequence data. There are also 1,721,064,101 WGS records containing 14,599,101,574,547 base pairs of sequence data, 508,319,391 bulk-oriented TSA records containing 449,891,016,597 base pairs of sequence data, and 107,569,935 bulk-oriented TLS records containing 40,168,874,815 base pairs of sequence data.
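The headline figures follow directly from these four divisions. As a quick check, the short sketch below adds up the per-division counts quoted above; the variable names are ours, and the numbers are copied from this post rather than read from the release files.

```python
# (records, base pairs) per division, as quoted in the release note above.
DIVISIONS = {
    "traditional": (233_642_893, 1_014_763_752_113),
    "WGS": (1_721_064_101, 14_599_101_574_547),
    "TSA": (508_319_391, 449_891_016_597),
    "TLS": (107_569_935, 40_168_874_815),
}

total_records = sum(records for records, _ in DIVISIONS.values())
total_bases = sum(bases for _, bases in DIVISIONS.values())

print(f"records: {total_records:,} ({total_records / 1e9:.2f} billion)")
print(f"bases:   {total_bases:,} ({total_bases / 1e12:.1f} trillion)")
```

The totals come to 2,570,596,320 records and 16,103,925,218,072 bases, which round to the 2.57 billion records and 16.1 trillion bases stated for the release.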
RefSeq release 209 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of November 1, 2021, and contains 296,293,486 records, including 215,655,378 proteins, 41,751,205 RNAs, and sequences from 114,396 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
In June, we announced the arrival of PMC Labs, where you can test drive the work underway to create a more modern PMC website. Since then, we’ve continued to talk to users, gather input, and make ongoing adjustments based on your feedback.
We hope that the planned updates will create an easier navigation and reading experience, while keeping all the features you use most within PMC. If you haven’t had a chance to try out the changes, there’s still time to give input using the green feedback button in the lower right-hand corner of the site.
NCBI’s Genome Data Viewer (GDV) now supports visualization and analysis of nearly 400 submitter-annotated chromosome-level assemblies from the INSDC (GenBank/ENA/DDBJ). These submitter-annotated assemblies join more than 1,200 NCBI RefSeq-annotated assemblies available in GDV for hundreds of eukaryotes, spanning fungi, plants, fish, insects, and all major model organisms.
Figure 1. Submitter-annotated Malus domestica (apple) assembly displayed in GDV. GDV provides submitter-provided gene annotation, as well as some additional tracks including interspersed repeats identified by RepeatMasker and six-frame translations (not shown). Red boxes indicate useful tools and panels including a search box, an exon navigator, and interfaces to add user data and conduct NCBI BLAST searches.
The Genome Data Viewer (GDV) is now the comprehensive NCBI genome browser. The development of GDV led to a few different types of genome browsers along the way, each one originally delivering visual displays for particular datasets. We developed the 1000 Genomes Browser for variation data from the 1000 Genomes project, the dbGaP Data Browser for controlled-access sequence read alignment data, and the GeT-RM browser for Genome in a Bottle (GIAB) data.
The data displayed in these three browsers are now either obsolete or largely accessible from GDV or other NCBI resources. Moreover, unlike GDV, these older browsers are no longer under active development, and their data have not been updated to meet the changing needs of the communities they were developed to serve. For these reasons, we will retire these browsers in April 2022. Please see the details below for more information on the data displayed in these browsers and how to access and display these data now through GDV and other means.