Recent enhancements to BLAST+ (2.9.0): built-in taxonomy and access to proteins from the Pathogen Detection Project

We have recently improved the BLAST+ applications to take full advantage of the version 5 BLAST databases (BLASTDBv5), which include built-in taxonomic information for sequences and no longer rely on integer sequence identifiers (gi numbers).

With the latest version of BLAST, you can now:

  • Limit your searches by taxonomy using information built into the BLAST databases
  • Limit searches more efficiently when using a list of sequence accessions
  • Retrieve sequences by taxonomy from the BLAST database with blastdbcmd
  • Search PDB proteins with identifiers up to four characters long. You can read more about the PDB changes in our Structure database documentation.
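
As a rough illustration of the taxonomy features listed above, the Python sketch below assembles command lines for a taxonomy-limited `blastp` search and for retrieving sequences by taxonomy with `blastdbcmd`. It only builds the argument lists and does not run BLAST. The database name `my_v5_db`, the file names, and the helper function names are hypothetical placeholders; the `-taxids` option is assumed to behave as described in the BLAST+ 2.9.0 documentation, which you should check for how higher-level taxids are handled.

```python
from typing import List, Sequence


def _join_taxids(taxids: Sequence[int]) -> str:
    """Return taxonomy IDs as the comma-delimited string BLAST+ expects."""
    if not taxids:
        raise ValueError("at least one taxonomy ID is required")
    return ",".join(str(t) for t in taxids)


def blastp_taxid_command(query: str, db: str, taxids: Sequence[int], out: str) -> List[str]:
    """Build a blastp command limited to the given taxonomy IDs.

    Assumes `db` is a BLASTDBv5 database; v4 databases lack the
    built-in taxonomy information this option relies on.
    """
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-taxids", _join_taxids(taxids),
        "-out", out,
    ]


def blastdbcmd_taxid_command(db: str, taxids: Sequence[int], out: str) -> List[str]:
    """Build a blastdbcmd command that retrieves sequences by taxonomy ID."""
    return [
        "blastdbcmd",
        "-db", db,
        "-taxids", _join_taxids(taxids),
        "-out", out,
    ]


if __name__ == "__main__":
    # 9606 is the NCBI taxonomy ID for Homo sapiens.
    # All file and database names here are hypothetical placeholders.
    search = blastp_taxid_command("query.fasta", "my_v5_db", [9606], "hits.txt")
    fetch = blastdbcmd_taxid_command("my_v5_db", [9606], "human.fasta")
    print(" ".join(search))
    print(" ".join(fetch))
```

If BLAST+ is installed and a v5 database is available, either list could be passed to `subprocess.run(cmd, check=True)`. To limit a search by a list of accessions instead, the `-seqidlist` option takes a file of sequence identifiers.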

Only BLASTDBv5 supports these new features. The new BLAST databases also contain accession-based (gi-less) proteins from important high-throughput genome sequencing projects that are not available in the previous version of the BLAST databases. These include proteins from the annotation of assemblies produced by large-scale pathogen surveillance efforts that are part of the NCBI Pathogen Detection Project, as well as proteins from large-scale metagenomics surveillance. With the v5 databases, you can run BLAST searches against all proteins from these assemblies to find the proteins of interest.

For more information on the new database version, BLASTDBv5 (download), see the previous NCBI Insights article and the recording of our webinar. We will continue to update the BLAST databases in their current version (BLASTDBv4) until September 2019.

Bioinformatics paper uses NCBI open data to analyze drug response

A study (PMID: 28158543) published in the July 2017 issue of Bioinformatics collects, classifies, and analyzes single nucleotide variants (SNVs) that may affect response to currently approved drugs. The authors identified 2,640 SNVs of interest, most of which occur rarely in populations (minor allele frequency <0.01).

The researchers used protein sequence alignment tools and mined open data from multiple information resources accessed through E-utilities, including PubChem Compound (Kim et al., 2016 PMID: 26400175), NCBI Gene (Maglott et al., 2014 PMID: 25355515), NCBI Protein (Sayers, 2013), MMDB (Madej et al., 2012 PMID: 22135289), PDB (Berman et al., 2000 PMID: 10592235), dbSNP (Sherry et al., 2001 PMID: 11125122), and ClinVar (Landrum et al., 2016 PMID: 26582918).
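
To give a sense of what access through E-utilities looks like, here is a minimal Python sketch that builds an ESummary request URL for a set of record IDs. It is not the method used in the study; the helper name `esummary_url` is hypothetical, and the sketch only constructs the URL without sending a request.

```python
from typing import Iterable, Union
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esummary_url(db: str, ids: Iterable[Union[int, str]], retmode: str = "json") -> str:
    """Build an ESummary URL for the given Entrez database and record IDs.

    `esummary_url` is an illustrative helper, not part of any NCBI library.
    """
    id_list = ",".join(str(i) for i in ids)
    if not id_list:
        raise ValueError("at least one record ID is required")
    params = {"db": db, "id": id_list, "retmode": retmode}
    # urlencode percent-encodes the comma separators, which the service accepts.
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # PMID of the Bioinformatics study mentioned in this post.
    print(esummary_url("pubmed", [28158543]))
```

The same pattern applies to the other Entrez databases by changing the `db` value. The URL can be fetched with any HTTP client, subject to NCBI's usage guidelines for E-utilities.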

Questions, comments, and other feedback may be sent to Yanli Wang.