You may have heard that NCBI is changing the way we handle GI numbers for sequence records in September 2016. Well, you heard right! Here’s the announcement, in case you missed it.
There are a number of issues raised by these changes, but we’re going to answer two questions in this post:
- What pieces of your code will break in September?
- Are GI numbers gone for good?
What pieces of your code will actually break in September?
- Any code that parses GI numbers from sequence flat files (from web, FTP, E-utilities or any other NCBI source) will break. Why? Because the GI numbers will no longer be there.
- Any code that parses GI numbers from NCBI FASTA records (again, from any NCBI source) will break. Why? Same reason. The GI numbers will no longer be in the FASTA definition lines.
Only those two locations are affected. If your code isn’t parsing from either of those places, you’ll be fine.
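For code that reads FASTA definition lines, one way to survive the change is to accept both the old GI-prefixed form and the new accession-first form. The following is a minimal Python sketch, not NCBI code: the function name `parse_defline`, the regular expressions and the example accession are illustrative assumptions. It assumes old-style lines look like `>gi|<number>|<db>|<accession.version>|` followed by a description.

```python
import re

# Assumed old-style defline: >gi|<GI>|<db>|<accession.version>| description
_OLD_DEFLINE = re.compile(r"^>gi\|(\d+)\|[^|]+\|([^|\s]+)\|?\s*(.*)$")
# Assumed new-style defline: >accession.version description
_NEW_DEFLINE = re.compile(r"^>(\S+)\s*(.*)$")


def parse_defline(line):
    """Return the GI (or None), accession.version and description of a defline."""
    line = line.rstrip("\r\n")
    match = _OLD_DEFLINE.match(line)
    if match:
        return {
            "gi": int(match.group(1)),
            "accession": match.group(2),
            "description": match.group(3),
        }
    match = _NEW_DEFLINE.match(line)
    if match:
        # No GI in the line: key everything on accession.version instead.
        return {
            "gi": None,
            "accession": match.group(1),
            "description": match.group(2),
        }
    raise ValueError("not a FASTA definition line: %r" % line)
```

Downstream code can then use the `accession` field as its identifier and treat `gi` as optional, so the same parser works on files downloaded before and after the change.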
Keep in mind that these changes affect all sequence records from any NCBI database (including GenBank, RefSeq, BLAST databases, Nuccore, NucEST, NucGSS, Protein and Popset).
Of course, there’s one kind of “code” we haven’t mentioned yet – and that’s you. If you rely on reading GI numbers from GenBank flat files or FASTA records, or copying them or searching for them in those records, that “code” will fail in September. Why? Same reason. They won’t be there.
So are the GI numbers gone for good?
No! They are still part of the data record, and you will still be able to use them to retrieve the record on the web or using the E-utilities, indefinitely. They will remain in XML and ASN.1 data presentations, and will only be removed from flat files and FASTA.
However, more and more new sequence records will not be assigned a GI number, and so will never be retrievable using GI methods. But records that currently have a GI will always have that GI.
Stay tuned for additional posts about these changes, and please let us know if you have questions.
15 thoughts on “NCBI is Phasing Out Sequence GIs – Here’s What You Need to Know”
After GIs are phased out, what should be provided for makeblastdb -taxid_map?
Accessions now work with -taxid_map as of the 2.4.0 blast+ release in June 2016.
Will the -gilist option in blastdb_aliastool be replaced with accession.version?
Yes, we will be adding support for accession.version to blastdb_aliastool. Check the NCBI News, BLAST News (on the BLAST home page) and our social media (Facebook, Twitter) for updates on this.
Regarding the phasing out of GI numbers at NCBI, I am concerned about how I can use the accession.version identifier to (e)fetch content from nuccore.
I mainly use efetch to obtain FASTA sequences in text format in a lab-based web server environment, and as long as GIs served as UIDs, I could obtain sequences without trouble and pipe them on for further processing.
Under the new schema, how could this work?
Thank you in advance for your response!
This is supposed to work by adding “&idtype=acc” and “&rettype=acc” to your queries, but as of today, it’s still not working.
Can anyone @NCBI please give us a heads up as to when it should start working?
As a follow up: “&idtype=acc” is working. Just not for “&usehistory=y”, as described in the webinar’s slides.
The code supporting the new &idtype parameter has not yet been released. We are expecting it to be released soon and will blog about it when it happens. For several years EFetch has supported accession.version identifiers in the &id parameter, and so it may appear that &idtype is working when it is simply existing EFetch functionality. The true changes will be to ESummary, EPost, and ELink, which currently accept only GI numbers (for sequences) in &id. Once &idtype is released, they will accept accession.version like EFetch does now. In addition, ESearch will output accession.version identifiers with &idtype=acc.
Thank you for the clarification.
So for now it is OK to get the IDs from “esearch” and then use “efetch” to “translate” the resulting GIs to Accession numbers, right? At least until “&idtype=acc” is released. Because this approach I mentioned feels “hackish”. I’ll be watching this space closely for the follow up so I can implement “&idtype” in my tool.
Yes, what you propose is fine. It depends on what you’re starting with. If you already have a list of GIs, then use EFetch by itself to translate them to accessions. ESearch is really only required if you have a text query and want to use that to pull GIs. In that case, you could add &usehistory=y to the ESearch call, and then pass the WebEnv and query_key params from the ESearch result to EFetch (with &rettype=acc). This avoids using a (possibly large) literal list of GIs.
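The recipe in the reply above can be sketched in Python roughly as follows. This is a hedged illustration rather than an official client: the helper names (`esearch_url`, `parse_history`, `efetch_acc_url`, `efetch_acc_url_for_ids`) are hypothetical, the code only builds URLs and parses an ESearch XML response without making any network calls, and it assumes the response carries `WebEnv` and `QueryKey` elements directly under the root.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term):
    """ESearch URL for a text query, storing the results on the history server."""
    params = {"db": db, "term": term, "usehistory": "y"}
    return "%s/esearch.fcgi?%s" % (EUTILS, urlencode(params))


def parse_history(esearch_xml):
    """Pull WebEnv and QueryKey out of an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    web_env = root.findtext("WebEnv")
    query_key = root.findtext("QueryKey")
    if web_env is None or query_key is None:
        raise ValueError("response has no history data; was usehistory=y set?")
    return web_env, query_key


def efetch_acc_url(db, web_env, query_key):
    """EFetch URL that returns accession.version for a stored result set."""
    params = {
        "db": db,
        "WebEnv": web_env,
        "query_key": query_key,
        "rettype": "acc",
    }
    return "%s/efetch.fcgi?%s" % (EUTILS, urlencode(params))


def efetch_acc_url_for_ids(db, gis):
    """EFetch URL that translates an explicit list of GIs to accessions."""
    params = {"db": db, "id": ",".join(str(gi) for gi in gis), "rettype": "acc"}
    return "%s/efetch.fcgi?%s" % (EUTILS, urlencode(params))
```

A caller would fetch `esearch_url("nuccore", query)`, pass the response text to `parse_history`, and then fetch the URL from `efetch_acc_url`. For a short list of GIs that is already in hand, `efetch_acc_url_for_ids` skips the ESearch step entirely.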
Thank you very much for the recipe. Has anything changed since the last post? Namely, has rettype/idtype = “acc” been implemented? ESearch seems to be returning GIs only, but at the same time the GIs have disappeared from many annotations. This greatly complicates queries followed by batch downloads (and, if I am not mistaken, we cannot download newly posted sequences, since no new GIs are assigned). Thank you.
We did release the new &idtype parameter very recently, and will be providing a webinar about this on Jan 31 (register here: https://attendee.gotowebinar.com/register/7530877675754064131). We’re also preparing a post that will appear on this blog. If you can provide more details about what you’re trying to do (write to firstname.lastname@example.org), we’d be happy to provide advice.
Thank you! I have already used the new &idtype parameter and it works as advertised! It is way better than the “hacky” “translation” step I had implemented on my end.