As you may have read in previous posts, NCBI is in the process of changing the way we handle GI numbers for sequence records. In short, we are moving to a time when accession.version identifiers, rather than GI numbers, will be the primary identifiers for sequence records.
In a previous post, we outlined a method for converting GI numbers to accession.version identifiers. That method uses the E-utility EFetch and works well when you have no more than a few thousand GI numbers to convert.
What if you have more?
We now have a bulk conversion resource that will allow you to handle very large jobs. The resource consists of a Python script coupled with a database file (about 40 GB uncompressed). You’ll need to download both of these files (gi2accession.py and gi2acc_lmdb.gz) to a local disk, and then you can run conversions locally.
The files are available here: ftp.ncbi.nlm.nih.gov/genbank/livelists/gi2acc_mapping/.
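Because the database is about 40 GB once uncompressed, it is worth decompressing it as a stream rather than reading it into memory. The following is only a minimal sketch of one way to do that in Python 3. The function name decompress_gzip and the output file name in the usage comment are assumptions made for illustration; check the README for the file name the script actually expects.

```python
import gzip
import shutil


def decompress_gzip(src_path, dest_path, chunk_size=16 * 1024 * 1024):
    """Stream-decompress a .gz file to dest_path in fixed-size chunks.

    Only chunk_size bytes are held in memory at a time, so this is
    suitable for files far larger than available RAM.
    """
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest, chunk_size)


# Hypothetical usage; the output name below is an assumption, not taken
# from the README:
# decompress_gzip("gi2acc_lmdb.gz", "gi2acc_lmdb.db")
```

The standard gunzip command-line tool does the same job if you prefer to stay in the shell.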
The script works in two modes: interactive and bulk.
Interactive mode (example for GI 42):
42 CAA44840.1 416
After entering the GI number, the script responds with the GI, the corresponding accession.version, and the length of the sequence (in residues).
Bulk mode:
./gi2accession.py < list_of_gis.txt
In this case, the script will accept an input stream of GI numbers (e.g., from a file, one per line) and then output a line for each GI with the same three columns as above.
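If you want to use the bulk output in your own Python code, the three columns are easy to load into a lookup table. The sketch below assumes that every output line has exactly three whitespace-separated fields, as in the example above. How the script reports GI numbers it cannot find is not covered here, so the parser simply raises an error on any line of a different shape. The names GiRecord, parse_line and parse_output are made up for this example.

```python
from typing import Dict, Iterable, NamedTuple


class GiRecord(NamedTuple):
    gi: int
    accession_version: str
    length: int  # sequence length in residues


def parse_line(line: str) -> GiRecord:
    """Parse one output line of the form 'GI accession.version length'."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError("expected 3 columns, got %d: %r" % (len(fields), line))
    gi, accession_version, length = fields
    return GiRecord(int(gi), accession_version, int(length))


def parse_output(lines: Iterable[str]) -> Dict[int, GiRecord]:
    """Build a GI -> record mapping from an iterable of output lines."""
    records = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_line(line)
        records[record.gi] = record
    return records
```

For example, after redirecting the script's output to a file with ./gi2accession.py < list_of_gis.txt > converted.txt, you could call parse_output(open("converted.txt")) and look up any GI in the resulting dictionary.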
Further instructions for using the script are in a README file in the FTP directory.
Please be aware that you’ll need about 40 GB of disk space, along with Python 2.7 or higher and the Python lmdb package.
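Before starting a large download, you may want to confirm that the target machine meets these requirements. The following is a rough sketch, not part of the NCBI distribution. It assumes Python 3.3 or later for the helper itself (shutil.disk_usage is not available in Python 2.7), and the function name check_requirements and its defaults are invented for illustration.

```python
import importlib.util
import shutil


def check_requirements(target_dir=".", required_gb=40.0):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []

    # Free space on the filesystem that holds target_dir, in decimal GB.
    free_gb = shutil.disk_usage(target_dir).free / 1e9
    if free_gb < required_gb:
        problems.append(
            "only %.1f GB free in %s, need about %.0f GB"
            % (free_gb, target_dir, required_gb)
        )

    # The conversion script depends on the Python lmdb package.
    if importlib.util.find_spec("lmdb") is None:
        problems.append("Python package 'lmdb' is not installed")

    return problems
```

Calling check_requirements("/path/to/data") returns a list of messages describing anything that is missing, so an empty list means you are ready to download.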
Let us know if you have comments or questions about this resource.