As you may have read in previous posts, NCBI is in the process of changing the way we handle GI numbers for sequence records.
In short, we are moving to a time when accession.version identifiers, rather than GI numbers, will be the primary identifiers for sequence records.
As part of this transition, an obvious question for any of you currently using GI numbers is how to convert a GI number to an accession.version, so that you can make appropriate updates. The good news is that it’s pretty easy if you have no more than a few thousand GIs to convert.
Use EFetch to convert GIs
You can use the NCBI E-utility EFetch:
For those of you unfamiliar with EFetch, let’s break that down a bit. The first part of the URL is fixed:
“db=nuccore” establishes that you’ll be downloading data from the nuccore database. If you want to convert protein GIs, then you would use “db=protein” instead of “db=nuccore”.
Next comes the &id parameter with a list of GI numbers separated by commas:
This requests data for two records: GIs 663070995 and 568815587.
Finally comes the real trick – setting the &rettype parameter to “acc”.
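To see how the three pieces fit together, here is a minimal Python sketch that assembles such a URL. It assumes the standard public E-utilities address for EFetch as the fixed first part; the helper name `build_efetch_url` is made up for illustration and is not part of any NCBI library.

```python
from urllib.parse import urlencode

# Assumption: the standard public EFetch endpoint is the fixed part of the URL.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(gis, db="nuccore"):
    """Return an EFetch URL that asks for the accession.version of each GI.

    Hypothetical helper: use db="protein" for protein GIs.
    """
    params = {
        "db": db,                                # database to query
        "id": ",".join(str(gi) for gi in gis),   # comma-separated GI list
        "rettype": "acc",                        # one accession.version per line
    }
    # safe="," keeps the commas in the id list literal instead of "%2C".
    return f"{EFETCH_BASE}?{urlencode(params, safe=',')}"


url = build_efetch_url([663070995, 568815587])
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the plain-text accession list.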
Setting &rettype to “acc” defines an output format in which each line contains the accession.version of a single GI, and the order of the lines in the output matches the order of the GIs in the URL. In this case the result would be:
NM_001178.5
NC_000011.10
The result indicates that the accession.version for GI 663070995 is “NM_001178.5” and the accession.version for GI 568815587 is “NC_000011.10”. If one of your input GIs is invalid, then you’ll get a blank line in the output file:
Notice the GI “100” between the two previous GIs. There is no record with GI 100, so the second line of the output is blank, sitting between “NM_001178.5” and “NC_000011.10”.
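Because the output keeps the input order and uses a blank line for an invalid GI, you can pair each GI with its result by position. The sketch below is one way to do that in Python. The function name is hypothetical, and the sample text is typed in by hand from the description above rather than fetched from NCBI. It also assumes the response contains exactly one line per requested GI, and it raises an error otherwise rather than guessing.

```python
def pair_gis_with_accessions(gis, efetch_text):
    """Map each GI to its accession.version, or to None if the GI is invalid.

    Hypothetical helper: relies on the output having one line per GI,
    in request order, with a blank line standing in for an invalid GI.
    """
    lines = efetch_text.splitlines()
    if len(lines) != len(gis):
        raise ValueError(
            f"expected {len(gis)} lines but got {len(lines)}; refusing to guess"
        )
    # A blank (or whitespace-only) line means no record exists for that GI.
    return {gi: (line.strip() or None) for gi, line in zip(gis, lines)}


# Sample response typed in by hand: GI 100 has no record, so line 2 is blank.
sample_text = "NM_001178.5\n\nNC_000011.10\n"
mapping = pair_gis_with_accessions([663070995, 100, 568815587], sample_text)
print(mapping)
```

Storing `None` for invalid GIs makes them easy to report afterwards without shifting the remaining results out of position.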
You can list approximately 250 GIs in a single URL (using HTTPS GET), but you can put several thousand in an HTTPS POST call. Just be sure to adhere to our usage guidelines. No more than 3 calls per second, please!
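If you have more GIs than fit in one URL, a sketch like the following splits them into batches, sends each batch as an HTTPS POST, and pauses between calls to stay under three requests per second. Several details here are assumptions rather than NCBI recommendations: the endpoint is the standard public EFetch address, the batch size of 2000 is just one reading of “several thousand”, and the 0.34 second pause is a simple way to respect the limit. All function names are made up for illustration.

```python
import time
import urllib.request
from urllib.parse import urlencode

# Assumption: the standard public EFetch endpoint.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def post_efetch(gis, db="nuccore"):
    """POST one batch of GIs to EFetch and return the raw text response."""
    body = urlencode(
        {"db": db, "id": ",".join(str(gi) for gi in gis), "rettype": "acc"}
    ).encode("ascii")
    # Supplying `data` makes urllib send a POST instead of a GET.
    with urllib.request.urlopen(EFETCH_BASE, data=body) as response:
        return response.read().decode("utf-8")


def convert_in_batches(gis, batch_size=2000, min_interval=0.34,
                       fetch=post_efetch, sleep=time.sleep):
    """Return one raw response per batch, pausing between consecutive calls.

    `fetch` and `sleep` are parameters so the loop can be tried offline.
    """
    responses = []
    for index, batch in enumerate(chunked(gis, batch_size)):
        if index > 0:
            sleep(min_interval)  # at least 0.34 s apart: under 3 calls/second
        responses.append(fetch(batch))
    return responses

# Example (performs real network requests, so it is left commented out):
# texts = convert_in_batches([663070995, 568815587])
```

Each returned text holds one line per GI in that batch, so the results can be matched back to the input GIs batch by batch.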
For more information about doing these conversions, please view our recent webinar on this topic.
5 thoughts on “Converting GI Numbers to Accession.version”
Great post, it does clarify some previous issues, thank you very much for the clarification.
My question is in the cases when we have more than a “few thousand” GIs.
I have implemented a workaround in my code – a “translation” step to convert GIs to Accession.version numbers. It is very similar to the example you provide here.
The catch is that this extra step takes quite a while to complete with a large number of sequences (say, 1M) – in fact, my application now takes *nearly* as long to translate GIs to Accession.version as to download the sequence records. It is also “stressing” your servers twice as much as before, at least as far as connection requests go.
Is there any way to retrieve sequence identifiers as accession.version directly when using esearch? IMO that would be the true solution to the problem, as adding a “translation” step is just a workaround.
Once again, thanks for keeping us, the API users, in the loop of the changes you are making! It is an effort that is really appreciated.
Thanks for your comments. We describe a solution to the large download case in a more recent post: https://ncbiinsights.ncbi.nlm.nih.gov/2016/12/23/converting-lots-of-gi-numbers-to-accession-version/
Basically, you can download a large database and accompanying python script that will allow you to do the conversions rapidly on local machines. You may also want to view this webinar: https://youtu.be/Zf-gwXVzU4E
What about converting it the other way around? EFetch and EPost (through Biopython) require GI numbers, not accessions.
You can also use EFetch for this:
The order of the GIs in the XML output will match the order of the accessions in &id.
The larger point is that an increasing number of new sequences will not be assigned GI numbers, so you will need to rely on accession.version identifiers for retrieval. Please see the other relevant posts linked to this one for more information.