NCBI offers extensive collections of sequences through its BLAST services (http://blast.ncbi.nlm.nih.gov) for comparing and identifying DNA, RNA and protein sequences. NCBI now deposits descriptions of these sequence collections, known as BLAST databases, in a special database called blastdbinfo that you can access through the Entrez Programming Utilities (E-Utilities). Using blastdbinfo, you can enable a program to find an appropriate database and then send BLAST searches to that database using either the BLAST URL API or standalone BLAST (installed locally).
Given the size of modern sequence databases, finding the complete genome sequence for a bacterium among the many other partial sequences can be a challenge. In addition, if you want to download sequences for many bacterial species, an automated solution might be preferable.
In this post we’ll discuss how to download bacterial genomes programmatically for a list of species using the E-utilities, the application programming interface (API) to NCBI’s Entrez system of databases. We’ll also take advantage of NCBI’s redesigned Genome database, which links all genome sequences for a given species to one record, making it easy to obtain the desired sequences once you find the right Genome record. In principle you can apply the procedure below to other simple genomes that are represented by a single sequence. Future posts will address additional considerations that apply to complex, eukaryotic genomes.