Converting Lots of GI Numbers to Accession.version

As you may have read in previous posts, NCBI is in the process of changing the way we handle GI numbers for sequence records. In short, we are moving to a time when accession.version identifiers, rather than GI numbers, will be the primary identifiers for sequence records.

In a previous post, we outlined a method for converting GI numbers to accession.version identifiers. That method uses the E-utility EFetch and works well when you have no more than a few thousand GI numbers to convert.

What if you have more?

We now have a bulk conversion resource that will allow you to handle very large jobs. The resource consists of a Python script coupled with a database file (about 40 GB uncompressed). You’ll need to download both of these files (gi2accession.py and gi2acc_lmdb.gz) to local disk, and then you can process as needed.

The files are available here: ftp.ncbi.nlm.nih.gov/genbank/livelists/gi2acc_mapping/.
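As a rough illustration of the download step, here is a minimal Python 3 sketch. The file names and directory come from this post; the `ftp://` scheme, the helper names (`file_url`, `gunzip`, `fetch_all`), and the name of the decompressed file are assumptions made for the example, so check the README for the layout the script actually expects.

```python
import gzip
import os
import shutil
import urllib.request

# Directory and file names are taken from the post; the ftp:// scheme is assumed.
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/genbank/livelists/gi2acc_mapping/"
FILES = ["gi2accession.py", "gi2acc_lmdb.gz"]


def file_url(name, base=BASE_URL):
    """Build the download URL for one file in the mapping directory."""
    return base.rstrip("/") + "/" + name


def gunzip(src, dest, chunk_size=1024 * 1024):
    """Decompress src to dest in chunks so the large database never sits in memory."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dest


def fetch_all(target_dir="."):
    """Download both files, then decompress the database (hypothetical helper)."""
    for name in FILES:
        urllib.request.urlretrieve(file_url(name), os.path.join(target_dir, name))
    # Assumption: the decompressed file is simply named without the .gz suffix.
    gunzip(os.path.join(target_dir, "gi2acc_lmdb.gz"),
           os.path.join(target_dir, "gi2acc_lmdb"))
```

Calling `fetch_all()` starts the download into the current directory; given the size of the database, expect it to take a while.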

The script works in two modes: interactive and bulk.

Interactive

$ ./gi2accession.py

gi: 42

42  CAA44840.1  416

After entering the GI number, the script responds with the GI, the corresponding accession.version, and the length of the sequence (in residues).

Bulk

$ ./gi2accession.py < list_of_gis.txt

In this case, the script will accept an input stream of GI numbers (e.g., from a file, one per line) and then output a line for each GI with the same three columns as above.
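If you redirect the bulk output to a file, the three columns are easy to load into your own code. The following Python 3 sketch assumes the columns are separated by whitespace, as in the interactive example above; the names `GiRecord` and `parse_gi2acc_output` are hypothetical and are not part of the NCBI script.

```python
from typing import Iterable, Iterator, NamedTuple


class GiRecord(NamedTuple):
    gi: int          # GI number
    accession: str   # accession.version
    length: int      # sequence length in residues


def parse_gi2acc_output(lines: Iterable[str]) -> Iterator[GiRecord]:
    """Yield one record per well-formed output line, skipping anything else."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # blank lines, prompts, or other unexpected output
        gi, accession, length = fields
        if not (gi.isdigit() and length.isdigit()):
            continue  # first and third columns should be integers
        yield GiRecord(int(gi), accession, int(length))
```

For example, after running `./gi2accession.py < list_of_gis.txt > mapping.txt`, you could build a lookup table with `{r.gi: r.accession for r in parse_gi2acc_output(open("mapping.txt"))}`.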

Further instructions for using the script are in a README file in the FTP directory.

Please be aware that you’ll need about 40 GB of free disk space, along with Python 2.7 or higher and the Python lmdb package.
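Before starting the download, it can help to confirm that the disk space and the lmdb package are in place. This is a minimal sketch written for Python 3; the function name `check_prerequisites` is hypothetical, and treating "about 40 GB" as 40 × 10^9 bytes is an assumption.

```python
import importlib.util
import shutil

# "About 40 GB" from the post; decimal gigabytes are assumed here.
REQUIRED_BYTES = 40 * 10**9


def check_prerequisites(path=".", required_bytes=REQUIRED_BYTES):
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    free = shutil.disk_usage(path).free
    if free < required_bytes:
        problems.append(
            "only %.1f GB free at %r; about %.0f GB needed"
            % (free / 1e9, path, required_bytes / 1e9)
        )
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("lmdb") is None:
        problems.append("Python package 'lmdb' is not installed")
    return problems
```

Printing the result of `check_prerequisites("/path/to/download/dir")` lists anything that still needs attention.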

Let us know if you have comments or questions about this resource.

6 thoughts on “Converting Lots of GI Numbers to Accession.version”

  1. Hi,

    Could you check this entry?

    GI Accession
    1680002 AH007344.1
    1680091 AH003807.1

    For example, the record for AH007344.1 lists 340 nt of sequence across 10 segments,

    but in this directory (ftp.ncbi.nlm.nih.gov/genbank/livelists/gi2acc_mapping/)

    the length given for AH007344.1 is 1362.

    Why do these two values for the same record not match?

    Thanks

  2. The Python script does not work under Python 3: it uses the old-style ‘print “…”’ statement instead of the new ‘print(“…”)’ function. 😛
