We have made some recent improvements to the BLAST+ applications that take full advantage of the version 5 BLAST databases (BLASTDBv5), which include built in taxonomic information for sequences and no longer rely on the integer sequence identifiers (gi numbers).
With the latest version of BLAST, you can now:
By the end of 2018, GenBank and other INSDC members will expand the accession formats used for sequencing projects. We have assigned almost all the possible accession numbers using the current, shorter formats. Using these longer formats will allow us to expand accession ranges and give us greater capacity.
The expanded format for Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS) sequencing projects will use a six-letter Project Code prefix and a two-digit Assembly-Version number followed by 7, 8, or 9 digits (for example, AAAAAA020000001).
Non-WGS/TLS/TSA nucleotide sequences currently use a “2+6” format, two-letter prefix followed by six digits. This format will be expanded to eight digits.
Protein sequences currently use a “3+5” accession format. By the end of 2018, this format will use seven digits.
You will need to adjust any processing methods to accommodate these new identifier formats. Please write to the helpdesk with any questions about the new formats.
As you may have read in previous posts, NCBI is phasing out sequence GIs and transitioning to accession.version identifiers. To help you prepare for this transition, we created sample BLAST databases that will help you make code changes to your programs and workflows for the switch to accession identifiers.
The sample databases, env_nr_v5 and tsa_nr_v5, are on FTP.
If you have any questions or concerns, please contact our Help Desk.
As you may have read in previous posts, NCBI is in the process of changing the way we handle GI numbers for sequence records. In short, we are moving to a time when accession.version identifiers, rather than GI numbers, will be the primary identifiers for sequence records.
In a previous post, we outlined a method for converting GI numbers (used to identify sequence records) to accession.version identifiers. That method used the E-utility EFetch and is capable of handling cases where you have no more than a few thousand GI numbers to convert.
What if you have more?
We now have a bulk conversion resource that will allow you to handle very large jobs. The resource consists of a Python script coupled with a database file (about 40 GB uncompressed). You’ll need to download both of these files (gi2accession.py and gi2acc_lmdb.gz) to local disk, and then you can process as needed.
As you may have read in previous posts, NCBI is in the process of changing the way we handle GI numbers for sequence records.
In short, we are moving to a time when accession.version identifiers, rather than GI numbers, will be the primary identifiers for sequence records.
As part of this transition, an obvious question for any of you currently using GI numbers is how to convert a GI number to an accession.version, so that you can make appropriate updates. The good news is that it’s pretty easy if you have no more than a few thousand GIs to convert.
NCBI has announced that we will be changing the way we handle GI numbers for sequence records in September 2016. (Read more, in case you missed it).
In this post, we’ll address a key question:
What is the future of existing GI numbers?
The short answer is that nothing is happening to these GI numbers.
If a nucleotide or protein record already has a GI, it will continue to have that GI indefinitely. You will also be able to retrieve such a record using its GI either on the NCBI web site or using the E-utilities.
Moreover, GIs will remain part of the XML and ASN.1 formats of sequence records.
If not GIs, then what?
Accession.version identifiers. All sequence records, both new and old, will have a unique accession.version identifier.
Existing records will keep the accessions they already have; new sequences will only receive an accession.version identifier.
So what’s all the fuss about?
Stay tuned for additional posts about this topic, and please contact us if you have questions.
You may have heard that NCBI is changing the way we handle GI numbers for sequence records in September 2016. Well, you heard right! Here’s the announcement, in case you missed it.
There are a number of issues raised by these changes, but we’re going to answer two questions in this post:
- What pieces of your code will break in September?
- Are GI numbers gone for good?