Tag: GenInfo Identifier (GI)

RefSeq release 209 is available

RefSeq release 209 is now available online, from the FTP site, and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of November 1, 2021, and contains 296,293,486 records, including 215,655,378 proteins, 41,751,205 RNAs, and sequences from 114,396 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

NCBI will assign 64-bit numeric GIs by November 15th. Update affected software!

As announced last month, NCBI will begin assigning larger (64-bit) numeric ‘GIs’ to the remaining sequence types that still receive these identifiers. This change is expected as soon as November 15, 2021, but could occur earlier if data submission volumes are unexpectedly high. This is a reminder that all organizations and developers using our products should review their software for any remaining reliance on GIs and for compatibility with these larger identifiers.

How do you know if your software or organization may be impacted?

If you have built custom software that interfaces with NCBI data and consumes a sequence database UID (i.e., a GI), processes the GI from an ASN.1 or XML product, or processes the GI from any tabular product on FTP, you should review all code to ensure that the new, longer, 64-bit GIs will be handled properly. To ensure a smooth transition and the best overall experience, please update to the latest versions of NCBI-provided programmatic and command line tools. Alternatively, you could update your code to use accession.version identifiers instead of GIs.
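
One place to start such a review is anywhere a GI is stored in a fixed-width field. The sketch below is only an illustration using Python's standard struct module; the helper names (fits_int32, pack_gi, parse_gi) and the sample value are hypothetical and are not part of any NCBI tool.

```python
import struct

INT32_MAX = 2**31 - 1  # 2,147,483,647, the signed 32-bit limit


def fits_int32(gi: int) -> bool:
    """Return True if the GI can be stored in a signed 32-bit field."""
    return 0 < gi <= INT32_MAX


def pack_gi(gi: int) -> bytes:
    """Pack a GI as a signed 64-bit little-endian integer (8 bytes)."""
    return struct.pack("<q", gi)


def parse_gi(text: str) -> int:
    """Parse a GI from text, for example a column in a tabular file."""
    gi = int(text.strip())
    if gi <= 0:
        raise ValueError(f"GI must be positive: {text!r}")
    return gi


# Hypothetical GI just past the 32-bit limit, for illustration only.
large_gi = parse_gi("2147483648")
print(fits_int32(large_gi))    # a 32-bit field cannot hold this value
print(len(pack_gi(large_gi)))  # a 64-bit field uses 8 bytes
```

Python integers do not overflow on their own, so the risk in Python code usually sits at the boundaries: binary formats, database column types, and array dtypes that were sized for 32-bit values.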

NCBI is here to help the community as we make this change. Stay tuned here or follow NCBI Twitter where we will share updates and additional information, such as a final confirmation of the projected cutover date.

Please contact info@ncbi.nlm.nih.gov with any questions about this change or to determine if any software you are using is affected.

NCBI’s GI sequence identifiers will soon exceed 32-bit numbers. Are you and your software ready?

In 2016, NCBI announced that it was curtailing its display of its numeric ‘GI’ in popular sequence data formats such as FASTA and GenBank flatfiles. Due to the continued growth of GenBank, NCBI will soon begin assigning GIs exceeding the signed 32-bit threshold of 2,147,483,647 for those remaining sequence types that still receive these identifiers.

NCBI has updated products including the Entrez system, GenBank (Nucleotide), BLAST™ and the C++ Toolkit to prepare for that moment by upgrading GI-related code and APIs to accept 64-bit integers. This changeover is projected for late 2021. Stay tuned for additional communications from NCBI and take note of the following information if you think you may be impacted.

For a seamless transition, all organizations and developers using our products should review software for any remaining reliance on GIs and compatibility with these larger identifiers. Note that this update requires no changes to submission procedures or assignment of accessions.

BLAST is transitioning to accession.version-based databases

As you may have read in previous posts, NCBI is phasing out sequence GIs and transitioning to accession.version identifiers. To help you prepare for this transition, we created sample BLAST databases that will help you make code changes to your programs and workflows for the switch to accession identifiers.

The sample databases, env_nr_v5 and tsa_nr_v5, are on FTP.

If you have any questions or concerns, please contact our Help Desk.

Converting Lots of GI Numbers to Accession.version

As you may have read in previous posts, NCBI is in the process of changing the way we handle GI numbers for sequence records. In short, we are moving to a time when accession.version identifiers, rather than GI numbers, will be the primary identifiers for sequence records.

In a previous post, we outlined a method for converting GI numbers (used to identify sequence records) to accession.version identifiers. That method used the E-utility EFetch and is capable of handling cases where you have no more than a few thousand GI numbers to convert.
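
As a rough illustration of an EFetch-based conversion, the sketch below builds request URLs for batches of GIs. It assumes the EFetch rettype=acc option, which returns accession.version identifiers as plain text; the batch size of 200, the helper names, and the sample GI values are assumptions made for this example and are not taken from the earlier post.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
BATCH_SIZE = 200  # assumed batch size; adjust to suit your workload


def batches(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def efetch_acc_url(gis, db="nuccore"):
    """Build an EFetch URL requesting accession.version for the given GIs."""
    params = {
        "db": db,
        "id": ",".join(str(gi) for gi in gis),
        "rettype": "acc",
    }
    # Keep commas unescaped so the ID list stays readable in the URL.
    return f"{EFETCH_URL}?{urlencode(params, safe=',')}"


def fetch_accessions(gis, db="nuccore"):
    """Fetch accession.version identifiers, one per output line (needs network)."""
    with urlopen(efetch_acc_url(gis, db)) as response:
        return response.read().decode("utf-8").split()


# Placeholder GI values, for illustration only.
gi_list = list(range(1000, 1450))
urls = [efetch_acc_url(batch) for batch in batches(gi_list)]
print(len(urls))
```

The example only constructs URLs; call fetch_accessions on each batch to perform the requests, pausing between them to stay within the E-utilities usage guidelines.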

What if you have more?

We now have a bulk conversion resource that will allow you to handle very large jobs. The resource consists of a Python script coupled with a database file (about 40 GB uncompressed). You’ll need to download both of these files (gi2accession.py and gi2acc_lmdb.gz) to local disk, and then you can process as needed.

Converting GI Numbers to Accession.version

As you may have read in previous posts, NCBI is in the process of changing the way we handle GI numbers for sequence records.

In short, we are moving to a time when accession.version identifiers, rather than GI numbers, will be the primary identifiers for sequence records.

As part of this transition, an obvious question for any of you currently using GI numbers is how to convert a GI number to an accession.version, so that you can make appropriate updates. The good news is that it’s pretty easy if you have no more than a few thousand GIs to convert.

The Future of Existing GI Numbers at NCBI

NCBI has announced that, in September 2016, we will be changing the way we handle GI numbers for sequence records. (Read more, in case you missed it.)

In this post, we’ll address a key question:

What is the future of existing GI numbers?

The short answer is that nothing is happening to these GI numbers.

If a nucleotide or protein record already has a GI, it will continue to have that GI indefinitely. You will also be able to retrieve such a record using its GI either on the NCBI web site or using the E-utilities.

Moreover, GIs will remain part of the XML and ASN.1 formats of sequence records.

If not GIs, then what?

Accession.version identifiers. All sequence records, both new and old, will have a unique accession.version identifier.

Existing records will keep the accessions they already have; new sequences will only receive an accession.version identifier.
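
If your code needs to work with the two parts of an accession.version identifier separately, a small parser is usually enough. The following is a minimal sketch; the type and function names are hypothetical, and the sample identifier is used only for illustration.

```python
from typing import NamedTuple


class AccessionVersion(NamedTuple):
    accession: str
    version: int


def parse_accession_version(identifier: str) -> AccessionVersion:
    """Split 'ACCESSION.VERSION' into its accession and integer version."""
    accession, sep, version = identifier.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return AccessionVersion(accession, int(version))


parsed = parse_accession_version("NM_000546.6")
print(parsed.accession, parsed.version)
```

Splitting on the last period keeps the parser independent of any particular accession prefix format.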

So what’s all the fuss about?

Two things:

GIs will no longer appear on flat file or FASTA data displays after September 2016. The GIs will still exist, but they won’t be visible.
More and more new sequence records will not be assigned a GI. This means that over time, you will be missing more and more new sequences if you only use GIs.
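
Code that reads the GI from a FASTA definition line is one common place where these changes show up. The sketch below accepts both an older 'gi|...' style definition line and a newer one that begins with the accession.version; the function name is hypothetical, and the sample definition lines are illustrative rather than copied from a specific NCBI file.

```python
from typing import Optional, Tuple


def parse_defline(defline: str) -> Tuple[Optional[int], str]:
    """Return (gi, accession.version) from a FASTA definition line.

    The GI is None when the definition line does not carry one.
    """
    tokens = defline.lstrip(">").split(None, 1)
    if not tokens:
        raise ValueError("empty FASTA definition line")
    fields = tokens[0].split("|")
    if fields[0] == "gi" and len(fields) >= 4:
        # Older style: gi|<number>|<db>|<accession.version>|
        return int(fields[1]), fields[3]
    if len(fields) >= 2:
        # Bar-delimited without a GI: <db>|<accession.version>|
        return None, fields[1]
    # Newer style: the identifier is the accession.version itself.
    return None, fields[0]


old_style = ">gi|568815597|ref|NC_000001.11| Homo sapiens chromosome 1"
new_style = ">NC_000001.11 Homo sapiens chromosome 1"
print(parse_defline(old_style))
print(parse_defline(new_style))
```

Keying downstream logic on the accession.version value, which both styles provide, means the same code keeps working whether or not a GI is present.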

Stay tuned for additional posts about this topic, and please contact us if you have questions.

NCBI is Phasing Out Sequence GIs – Here’s What You Need to Know

You may have heard that NCBI is changing the way we handle GI numbers for sequence records in September 2016. Well, you heard right! Here’s the announcement, in case you missed it.

There are a number of issues raised by these changes, but we’re going to answer two questions in this post:

What pieces of your code will break in September?
Are GI numbers gone for good?
