The Tasmanian devil (Sarcophilus harrisii), the last remaining large marsupial carnivore, now faces extinction because of a strange and deadly infection: a transmissible cancer known as Devil Facial Tumor Disease. These tumor infections are apparently passed to other devils through bites during mating or during squabbles over carrion when devils gather to feed. In this unusual situation, the cancer cells themselves are the infectious agent.
The failure of devil immune systems to recognize and destroy the foreign tumor cells may be related to a decline in genetic diversity and may serve as a warning about the vulnerability of species with reduced gene pools. The advent of next-generation sequencing has provided an unprecedented opportunity to track the spread and identify the origin of this unusual transmissible cancer, as well as to examine the population structure of an endangered mammal and generate a complete genome sequence for this unique marsupial.
Given the size of modern sequence databases, finding the complete genome sequence for a bacterium among the many other partial sequences can be a challenge. In addition, if you want to download sequences for many bacterial species, an automated solution might be preferable.
In this post we’ll discuss how to download bacterial genomes programmatically for a list of species using the E-utilities, the application programming interface (API) to NCBI’s Entrez system of databases. We’ll also take advantage of NCBI’s redesigned Genome database, which links all genome sequences for a given species to one record, making it easy to obtain the desired sequences once you find the right Genome record. In principle you can apply the procedure below to other simple genomes that are represented by a single sequence. Future posts will address additional considerations that apply to complex, eukaryotic genomes.
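As a rough illustration of the approach described above, the sketch below chains three E-utilities calls for each species: ESearch against the Genome database, ELink from Genome to the nucleotide database, and EFetch to retrieve FASTA. It is a minimal sketch, not the procedure from this post. The helper names (`build_url`, `download_genomes`, and so on) are hypothetical, the `[orgn]` search field and the `genome` to `nuccore` link are assumptions about how the records are indexed, and because no link name is specified, the ELink step may return more sequences than the complete genome alone.

```python
"""Sketch: fetch genome sequences for a list of species via NCBI E-utilities."""
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_url(tool, **params):
    """Return the URL for one E-utilities call (tool is e.g. 'esearch')."""
    return f"{EUTILS}{tool}.fcgi?{urllib.parse.urlencode(params)}"


def http_get(url):
    """Perform an HTTP GET and return the response body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def esearch_ids(xml_text):
    """Extract the matching UIDs from an ESearch XML result."""
    return [e.text for e in ET.fromstring(xml_text).findall("./IdList/Id")]


def elink_ids(xml_text):
    """Extract the linked UIDs from an ELink XML result."""
    return [e.text for e in ET.fromstring(xml_text).findall(".//LinkSetDb/Link/Id")]


def download_genomes(species, get=http_get):
    """Return {species name: FASTA text} for each species with linked sequences.

    `get` is injectable so the request logic can be exercised offline.
    """
    fasta = {}
    for name in species:
        # Step 1: find the Genome record(s) for this organism (assumed field tag).
        search_xml = get(build_url("esearch", db="genome", term=f"{name}[orgn]"))
        genome_ids = esearch_ids(search_xml)
        if not genome_ids:
            continue
        # Step 2: follow links from Genome to nucleotide sequence records.
        link_xml = get(build_url("elink", dbfrom="genome", db="nuccore",
                                 id=",".join(genome_ids)))
        seq_ids = elink_ids(link_xml)
        if not seq_ids:
            continue
        # Step 3: download the linked sequences as FASTA text.
        fasta[name] = get(build_url("efetch", db="nuccore", id=",".join(seq_ids),
                                    rettype="fasta", retmode="text"))
    return fasta
```

Calling `download_genomes(["Escherichia coli", "Bacillus subtilis"])` would issue live requests. NCBI limits clients without an API key to three requests per second, so a real script working through a long species list should pause between calls.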
If you’re a protein researcher, you may want to find homologs of a protein of interest on the basis of its sequence. Doing so can provide insights into what the protein does and how it does it, and may identify proteins with known three-dimensional structures that can serve as models for the protein of interest. The Conserved Domains Database (CDD) groups proteins that have strong sequence similarity to protein domain fingerprints and allows you to search these groups with any protein sequence. Such searches are often more sensitive than standard BLAST searches because the scoring matrices are tuned to locate important functional sites and sequence motifs that are highly conserved within the domain. You can then use the results to explore the evolutionary relationships of these proteins or to identify their important sequence and structural features.
Here is a method to find protein sequences from many organisms that contain a particular conserved domain:
Over the past several months, you may have noticed a warning message if you’ve accessed the NCBI site using Microsoft’s Internet Explorer web browser:
This message has caused concern among some users about exactly what changed on January 1, 2013, and whether they will still be able to access PubMed and other NCBI resources. We hope that this post will address some of the more common questions.