NCBI is Phasing Out Sequence GIs – Here’s What You Need to Know

You may have heard that NCBI is changing the way we handle GI numbers for sequence records in September 2016. Well, you heard right! Here’s the announcement, in case you missed it.

There are a number of issues raised by these changes, but we’re going to answer two questions in this post:

What pieces of your code will break in September?
Are GI numbers gone for good?

What pieces of your code will actually break in September?

Any code that parses GI numbers from sequence flat files (from web, FTP, E-utilities or any other NCBI source) will break. Why? Because the GI numbers will no longer be there.
Any code that parses GI numbers from NCBI FASTA records (again, from any NCBI source) will break. Why? Same reason. The GI numbers will no longer be in the FASTA definition lines.

Only those two locations are affected. If your code isn’t parsing from either of those places, you’ll be fine.
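
If your code does parse FASTA definition lines, one way to prepare is to accept both layouts and key on accession.version instead of the GI. The following is a minimal sketch, not an NCBI-provided parser: the function name `parse_defline`, the two regular expressions, and the example identifiers are all assumptions made for illustration. It also assumes the new definition line starts with the accession.version followed by the description.

```python
import re

# Illustrative deflines (the identifiers are made up):
#   old style: >gi|12345|ref|NM_000000.1| example record
#   new style: >NM_000000.1 example record
OLD_STYLE = re.compile(r"^>gi\|(\d+)\|[a-z]+\|([^|\s]+)\|?\s*(.*)$")
NEW_STYLE = re.compile(r"^>(\S+)\s*(.*)$")


def parse_defline(line):
    """Return (accession_version, gi_or_None, description) for a FASTA defline."""
    line = line.rstrip("\r\n")
    match = OLD_STYLE.match(line)
    if match:
        gi, accession, description = match.groups()
        return accession, int(gi), description
    match = NEW_STYLE.match(line)
    if match:
        accession, description = match.groups()
        return accession, None, description
    raise ValueError(f"not a FASTA definition line: {line!r}")
```

Code written this way treats the GI as optional, so it keeps working whether or not the GI is present in the definition line.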

Keep in mind that these changes affect all sequence records from any NCBI database (including GenBank, RefSeq, BLAST databases, Nuccore, NucEST, NucGSS, Protein and Popset).

Of course, there’s one kind of “code” we haven’t mentioned yet – and that’s you. If you rely on reading GI numbers from GenBank flat files or FASTA records, or copying them or searching for them in those records, that “code” will fail in September. Why? Same reason. They won’t be there.

So are the GI numbers gone for good?

No! They are still part of the data record, and you will still be able to use them to retrieve the record on the web or using the E-utilities, indefinitely. They will remain in XML and ASN.1 data presentations, and will only be removed from flat files and FASTA.

However, more and more new sequence records will not be assigned a GI number, and so will never be retrievable using GI methods. But records that currently have a GI will always have that GI.

Stay tuned for additional posts about these changes, and please let us know if you have questions.


15 thoughts on “NCBI is Phasing Out Sequence GIs – Here’s What You Need to Know”

Yanfei Zhou says:

August 4, 2016 at 2:14 pm

After GIs are phased out, what should be provided for makeblastdb -taxid_map?

1. NCBI Staff says:
  
  August 30, 2016 at 3:31 pm
  
  Accessions now work with -taxid_map as of the 2.4.0 blast+ release in June 2016.
  
Kathryn Napier says:

August 17, 2016 at 3:28 am

Will the command -gilist in blastdb_aliastool be replaced with accession.version?

1. NCBI Staff says:
  
  August 30, 2016 at 3:34 pm
  
  Yes, we will be adding support for accession.version to blastdb_aliastool. Check the NCBI News, BLAST News (on the BLAST home page) and our social media (Facebook, Twitter) for updates on this.
  
Nik Sis says:

September 1, 2016 at 4:49 pm

Regarding the phasing out of GI numbers at NCBI, I am concerned about how I can use the accession.version identifier to (e)fetch content from nuccore.

I mainly use efetch to obtain FASTA sequences in text format in a lab-based web server environment. As long as GIs served as UIDs, I could obtain sequences without problems and pipe them on for further processing.

Under the new schema, how could this work?

Thank you in advance for your response!

1. Stunts says:
  
  October 24, 2016 at 10:03 am
  
  This is supposed to work by adding “&idtype=acc” and “&rettype=acc” to your queries, but as of today, it’s still not working.
  Can anyone @NCBI please give us a heads up as to when it should start working?
  Thanks.
  
  1. Stunts says:
    
    October 24, 2016 at 11:43 am
    
    As a follow up: “&idtype=acc” is working. Just not for “&usehistory=y”, as described in the webinar’s slides.
    
  2. NCBI Staff says:
    
    October 24, 2016 at 1:25 pm
    
    The code supporting the new &idtype parameter has not yet been released. We are expecting it to be released soon and will blog about it when it happens. For several years EFetch has supported accession.version identifiers in the &id parameter, and so it may appear that &idtype is working, when it simply is existing functionality for EFetch. The true changes will be to ESummary, EPost, and ELink, which only accept GI numbers (for sequences) in &id. Once &idtype is released, they will accept accession.version like EFetch does now. In addition, ESearch will output accession.version identifiers with &idtype=acc.
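
As an illustration of the parameters described in this reply, the sketch below builds ESearch and ESummary request URLs with &idtype=acc. It is a hedged example and not NCBI's code: the helper names `esearch_url` and `esummary_url` and the example accessions are hypothetical, and it only constructs the URLs without contacting the server.

```python
from urllib.parse import urlencode

# Public E-utilities base URL.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, use_accessions=True):
    """Build an ESearch URL; with idtype=acc the output lists accession.version."""
    params = {"db": db, "term": term}
    if use_accessions:
        params["idtype"] = "acc"
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def esummary_url(db, ids, use_accessions=True):
    """Build an ESummary URL whose id list holds accession.version identifiers."""
    params = {"db": db, "id": ",".join(ids)}
    if use_accessions:
        params["idtype"] = "acc"
    return f"{EUTILS}/esummary.fcgi?{urlencode(params)}"
```

For example, `esummary_url("nuccore", ["NM_000000.1"])` yields a URL ending in `db=nuccore&id=NM_000000.1&idtype=acc`, where the accession is a made-up placeholder.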
    
  3. Stunts says:
    
    October 24, 2016 at 4:41 pm
    
    Thank you for the clarification.
    So for now it is OK to get the IDs from “esearch” and then use “efetch” to “translate” the resulting GIs to Accession numbers, right? At least until “&idtype=acc” is released. Because this approach I mentioned feels “hackish”. I’ll be watching this space closely for the follow up so I can implement “&idtype” in my tool.
    
  4. NCBI Staff says:
    
    November 9, 2016 at 3:27 pm
    
    Yes, what you propose is fine. It depends on what you’re starting with. If you have a list of GI’s already, then use EFetch by itself to translate to accessions. ESearch is really only required if you have a text query and want to use that to pull GI’s. In that case, you could add &usehistory=y to the ESearch call, and then pass the WebEnv and query_key params from the ESearch result to EFetch (with &rettype=acc). This avoids using a (possibly large) literal list of GI’s.
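
The history-based workflow described in this reply can be sketched as follows. This is an assumption-laden illustration, not an official client: the helper names are hypothetical, the code only builds URLs and parses an ESearch XML response that you have already downloaded, and it assumes the response carries `WebEnv` and `QueryKey` elements directly under the root element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Public E-utilities base URL.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_history_url(db, term):
    """Step 1: run a text query and store the result set on the history server."""
    params = {"db": db, "term": term, "usehistory": "y"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def read_history(esearch_xml):
    """Step 2: pull WebEnv and QueryKey out of the ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    webenv = root.findtext("WebEnv")
    query_key = root.findtext("QueryKey")
    if not webenv or not query_key:
        raise ValueError("ESearch response has no history information")
    return webenv, query_key


def efetch_accessions_url(db, webenv, query_key):
    """Step 3: ask EFetch for accessions instead of passing a literal GI list."""
    params = {"db": db, "WebEnv": webenv, "query_key": query_key, "rettype": "acc"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

Fetching the two URLs (for instance with `urllib.request.urlopen`) is left out on purpose so the sketch stays independent of network access.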
    
Pingback: The Future of Existing GI Numbers at NCBI | NCBI Insights
Ondrej C says:

December 6, 2016 at 3:36 pm

Thank you very much for the recipe. Has anything changed since the last post? Namely, has rettype / idtype = “acc” been implemented? ESearch seems to be returning GIs only, but at the same time the GIs have disappeared from many annotations. This greatly complicates queries followed by batch downloads (and, if I am not mistaken, we cannot download newly posted sequences, since no new GIs are assigned). Thank you.

1. NCBI Staff says:
  
  January 24, 2017 at 9:52 am
  
  We did release the new &idtype parameter very recently, and will be providing a webinar about this on Jan 31 (register here: https://attendee.gotowebinar.com/register/7530877675754064131). We’re also preparing a post that will appear on this blog. If you can provide more details about what you’re trying to do (write to info@ncbi.nlm.nih.gov), we’d be happy to provide advice.
  
  1. Stunts says:
    
    March 2, 2017 at 6:56 pm
    
    Thank you! I have already used the new &idtype parameter and it works as advertised! It is way better than the “hacky” “translation” step I had implemented on my end.
    
Pingback: NCBI Insights : NCBI’s GI sequence identifiers will soon exceed 32-bit numbers. Are you and your software ready?
