Prokaryotic Genome Annotation Pipeline (PGAP) now produces results suitable for submission to GenBank

We are happy to announce that you can now submit genome sequences annotated by your own local copy of the standalone Prokaryotic Genome Annotation Pipeline (PGAP) to GenBank.

How does it work? Download PGAP from GitHub, provide some basic information and the FASTA sequences for your genome, and run the pipeline on your own machine, a compute farm, or the cloud. PGAP will produce annotation consistent with that of NCBI’s internal PGAP. Submit the resulting annotated genome to GenBank through the genome submission portal, and get an accession back.

As with any other submitted assembly, PGAP-annotated genomes will be screened for foreign contaminants and vector sequences at submission. Annotated assemblies that don’t pass the screen may need to be modified. We are developing an automated process to handle these edits!

We are also working on other improvements to standalone PGAP, such as a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned for new developments!


May 15, 2019 Webinar: Using taxonomic information and other improvements in standalone BLAST+ (2.9.0) and the v5 databases

Next Wednesday, May 15, 2019 at 11AM, NCBI staff will show you how to use the latest version of standalone BLAST+ (2.9.0) and the new accession-based DBv5 databases with built-in taxonomy information. You will learn how to limit searches to taxonomic groups and to retrieve sequences from the database by taxonomy without having to download an identifier list. You will also learn about additional improvements in the BLAST databases and programs that make them compatible with the new PDB identifiers and gi-less proteins from the Pathogen Detection Project.

Date and time: Wed, May 15, 2019 11:00 AM – 11:30 AM EDT


After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

Over 1 billion records in GenBank release 231

GenBank release 231.0 (4/19/2019) is now available on the NCBI FTP site. This release has 5.03 terabases and 1.54 billion records.

The release has 212,775,414 traditional records containing 321,680,566,570 base pairs of sequence data. There are also 993,732,214 WGS records containing 4,421,986,382,065 base pairs of sequence data, 311,247,136 bulk-oriented TSA records containing 277,118,019,688 base pairs of sequence data, and 24,240,761 bulk-oriented TLS records containing 9,623,321,565 base pairs of sequence data.
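The headline totals can be cross-checked against the per-division figures quoted above. The following is a minimal sketch of that arithmetic; the dictionary layout and variable names are illustrative only and are not part of any NCBI tool.

```python
# Per-division counts as quoted in the release announcement:
# division name -> (number of records, number of base pairs)
divisions = {
    "traditional": (212_775_414, 321_680_566_570),
    "WGS": (993_732_214, 4_421_986_382_065),
    "TSA": (311_247_136, 277_118_019_688),
    "TLS": (24_240_761, 9_623_321_565),
}

# Sum records and bases across all four divisions.
total_records = sum(records for records, _ in divisions.values())
total_bases = sum(bases for _, bases in divisions.values())

# Express the totals in the units used by the announcement.
billions_of_records = total_records / 1e9
terabases = total_bases / 1e12

print(f"{total_records:,} records ({billions_of_records:.2f} billion)")
print(f"{total_bases:,} bases ({terabases:.2f} terabases)")
```

The sums come to 1,541,995,525 records and 5,030,408,289,888 base pairs, which round to the 1.54 billion records and 5.03 terabases stated for the release.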

Searching for orthologous genes at NCBI

NCBI is testing a new way to find and retrieve orthologous vertebrate genes. To find orthologs, enter a gene symbol (e.g. RAG1) or a gene symbol combined with a taxonomic group (e.g. primate RAG1). Then select the matching entry from the suggestions menu, or select the orthologs option (e.g. Rag1 orthologs) to see all orthologs. Your search will return a results link to the set of orthologs provided by NCBI’s Gene resource. Click on the results link to see information for that ortholog group (Figure 1).


Figure 1. Search for Rag1 orthologs showing the link to the set of RAG1 genes from vertebrates.

Proposed changes to AGP files for genome assemblies

If you are a consumer or producer of AGP (A Golden Path) files for genome assemblies, please read on. We’d like your feedback on the proposed changes described here.

As you know, AGP files are used to describe the structure of certain genome assemblies. The AGP file format has not kept up with changes in sequencing technology or International Sequence Database Collaboration (INSDC) feature usage. NCBI is therefore proposing to extend the current AGP v2.0 specification to add new linkage evidence types and a gap type of “contamination” as detailed below and described in the AGP v2.1 proposed specification.

Recent enhancements to BLAST+ (2.9.0): built-in taxonomy and access to proteins from the Pathogen Detection Project

We have recently improved the BLAST+ applications to take full advantage of the version 5 BLAST databases (BLASTDBv5), which include built-in taxonomic information for sequences and no longer rely on integer sequence identifiers (gi numbers).
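As a rough illustration of a taxonomy-limited search, the sketch below assembles a BLAST+ command line that uses the -taxids option to restrict results to given NCBI taxonomy IDs. It only builds and prints the command; it does not run BLAST. The helper name build_blast_command, the database name my_v5_db, and the file query.fasta are hypothetical placeholders, and the sketch assumes a version 5 database and a BLAST+ release that supports -taxids.

```python
from typing import List, Sequence


def build_blast_command(
    program: str, db: str, query: str, taxids: Sequence[int]
) -> List[str]:
    """Return an argument list for a taxonomy-limited BLAST+ search.

    Hypothetical helper for illustration; it does not execute anything.
    """
    if not taxids:
        raise ValueError("at least one taxonomy ID is required")
    # -taxids takes a comma-separated list of NCBI taxonomy IDs.
    taxid_arg = ",".join(str(t) for t in taxids)
    return [program, "-db", db, "-query", query, "-taxids", taxid_arg]


# Example: limit a protein search to human sequences (taxonomy ID 9606).
# "my_v5_db" and "query.fasta" are placeholder names, not real resources.
cmd = build_blast_command("blastp", "my_v5_db", "query.fasta", [9606])
print(" ".join(cmd))
```

Keeping the command as an argument list rather than a single string means it could later be passed to subprocess.run without shell quoting concerns, once real database and query paths are substituted.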

With the latest version of BLAST, you can now:

NCBI on YouTube: Request access to controlled data in dbGaP

Do you need access to controlled data in the database of Genotypes and Phenotypes (dbGaP)? This short video will show you how to request data today!

dbGaP archives and distributes the data and results from studies that have investigated the interaction of genotype and phenotype in humans. Responsible stewardship of controlled-access data subject to the NIH GDS Policy is shared among the NIH, the investigators approved to access the data, and the investigators’ institutions.

Conserved Domain Database (CDD) 3.17 is now available

The latest version of the Conserved Domain Database contains 3,272 new or updated NCBI-curated domains and now mirrors Pfam version 31 as well as models from NCBIfams, a collection of protein family hidden Markov models (HMMs) for improving bacterial genome annotation. A fine-grained classification of the major facilitator superfamily has also been added. You can find this updated content on the CDD FTP site.

