You may have heard that NCBI is changing the way we handle GI numbers for sequence records in September 2016. Well, you heard right! Here’s the announcement, in case you missed it.
These changes raise a number of issues, but we’re going to answer just two questions in this post:
- What pieces of your code will break in September?
- Are GI numbers gone for good?
What pieces of your code will actually break in September?
- Any code that parses GI numbers from sequence flat files (from web, FTP, E-utilities or any other NCBI source) will break. Why? Because the GI numbers will no longer be there.
- Any code that parses GI numbers from NCBI FASTA records (again, from any NCBI source) will break. Why? Same reason. The GI numbers will no longer be in the FASTA definition lines.
Only those two locations are affected. If your code isn’t parsing from either of those places, you’ll be fine.
Keep in mind that these changes affect all sequence records from any NCBI database (including GenBank, RefSeq, BLAST databases, Nuccore, NucEST, NucGSS, Protein and Popset).
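As a rough sketch of how parsing code might be made tolerant of both layouts, the Python below treats the GI as optional and keys on the accession.version instead. It assumes the common old-style definition line `>gi|<number>|<db>|<accession.version>|...` and a flat-file VERSION line of the form `VERSION  <accession.version>  GI:<number>`. The function names, the `SeqId` type, and the sample values are hypothetical and for illustration only, and less common identifier layouts (for example PDB chain identifiers) are not handled.

```python
import re
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    """Hypothetical container for the identifiers found on one line."""
    accession: str       # accession.version, e.g. "NM_000546.5"
    gi: Optional[int]    # None when the line carries no GI


# Old-style FASTA defline: >gi|<number>|<db>|<accession.version>|...
_OLD_DEFLINE = re.compile(r"^>gi\|(\d+)\|[a-z]+\|([^|\s]+)\|")
# New-style FASTA defline: >accession.version title
_NEW_DEFLINE = re.compile(r"^>(\S+)")
# Flat-file VERSION line, with or without the trailing GI field
_VERSION = re.compile(r"^VERSION\s+(\S+)(?:\s+GI:(\d+))?")


def parse_defline(line: str) -> SeqId:
    """Read a FASTA definition line in either the old or the new layout."""
    m = _OLD_DEFLINE.match(line)
    if m:
        return SeqId(accession=m.group(2), gi=int(m.group(1)))
    m = _NEW_DEFLINE.match(line)
    if m:
        return SeqId(accession=m.group(1), gi=None)
    raise ValueError(f"not a FASTA definition line: {line!r}")


def parse_version_line(line: str) -> SeqId:
    """Read a flat-file VERSION line; the GI field is optional."""
    m = _VERSION.match(line)
    if not m:
        raise ValueError(f"not a VERSION line: {line!r}")
    gi = int(m.group(2)) if m.group(2) else None
    return SeqId(accession=m.group(1), gi=gi)


# Illustrative input only; the identifiers are sample values.
print(parse_defline(">gi|120407068|ref|NM_000546.5| Homo sapiens TP53"))
print(parse_defline(">NM_000546.5 Homo sapiens TP53"))
print(parse_version_line("VERSION     NM_000546.5"))
```

Downstream code can then key on `accession` and treat `gi` as extra information that may be absent.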
Of course, there’s one kind of “code” we haven’t mentioned yet – and that’s you. If you rely on reading GI numbers from GenBank flat files or FASTA records, or copying them or searching for them in those records, that “code” will fail in September. Why? Same reason. They won’t be there.
So are the GI numbers gone for good?
No! They are still part of the data record, and you will still be able to use them to retrieve the record on the web or through the E-utilities, indefinitely. They will remain in the XML and ASN.1 data presentations, and will be removed only from flat files and FASTA.
However, a growing number of new sequence records will not be assigned a GI number, and so will never be retrievable by GI. Records that currently have a GI will always keep that GI.
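Because existing GIs remain valid for retrieval, one possible migration path is to look up the accession.version for each GI you have stored and switch to that. The sketch below only builds E-utilities EFetch URLs and makes no network call. It assumes EFetch's `rettype=acc` text output for sequence databases; the helper name `efetch_url` and the sample identifiers are hypothetical.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(seq_id, db="nuccore", rettype="fasta"):
    """Build an EFetch URL for a GI or an accession.version (hypothetical helper)."""
    params = {
        "db": db,
        "id": str(seq_id),
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


# Fetch a record as FASTA using a stored GI (sample value).
print(efetch_url(120407068))

# Ask for the accession.version that corresponds to the same GI
# (assumes rettype=acc is available for this database).
print(efetch_url(120407068, rettype="acc"))

# New records may have no GI at all, so retrieval by accession.version
# is the form that works for every record.
print(efetch_url("NM_000546.5"))
```

The URLs can be passed to any HTTP client, for example `urllib.request.urlopen`.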
Stay tuned for additional posts about these changes, and please let us know if you have questions.