An easy way to speed up your BLAST analysis is to search a smaller database targeted to sequences of interest. We’ll describe here a few ways to create such custom databases on the BLAST web pages. For this Quick Tip we’ll use the pages in the Basic BLAST section of the BLAST home page.
BLAST parent databases
Generating a custom database begins with selecting the appropriate parent database. The BLAST Guide provides database descriptions to help with choosing a database. You select the parent in the Database pull-down menu, shown in Figure 1. Selecting the database is really your first opportunity to customize.
For example, the default nucleotide database is nt, but if you want a non-redundant set of transcript sequences, select the refseq_rna database, a subset of nt. After choosing a parent database, you have two primary options for refining it: the Organism field and the Entrez Query field. In addition, exclude check boxes let you remove model sequences or those from environmental samples. You can customize any of these settings in the ‘Choose Search Set’ section of the BLAST form (Figure 2).
The Organism filter is more versatile than you might think, allowing you to choose and combine any of the taxonomic nodes found in NCBI’s Taxonomy database. For example, here is a sampling of terms in the zebrafish lineage that you can select in the Organism autocomplete field: zebrafish, Danio rerio, bony fishes, teleostei, and vertebrates. As the examples in Figure 2 show, by expanding the Organism boxes you can also search sequences from more than one node or exclude one or more groups or species.
Entrez Query is the real powerhouse for customization. In BLAST, you can use any query that works when searching the NCBI Nucleotide and Protein databases. In fact, we recommend that you test your queries on the Nucleotide or Protein pages first to be sure you are retrieving the sequences you want. For help writing Entrez queries, see Entrez Sequences Help or the more concise Search Field Descriptions for Sequence Database.
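If you prefer to check a query from a script rather than on the Nucleotide or Protein pages, one option is the E-utilities ESearch service. The sketch below only builds the request URL. The helper name `esearch_count_url` is made up for illustration, and the query shown is the one from Example 4. Opening the printed URL in a browser should report how many records the query matches.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(term, db="nuccore"):
    """Build an ESearch URL that asks only for the hit count of an Entrez query.

    db="nuccore" targets the Nucleotide database; use db="protein" for Protein.
    (Hypothetical helper, not part of BLAST or any NCBI library.)
    """
    params = {"db": db, "term": term, "rettype": "count", "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


# Entrez query taken from Example 4 of this post.
url = esearch_count_url("mitochondrion[Filter] NOT refseq[Filter]")
print(url)
```

A count of zero, or one far larger than expected, is a sign that the query needs adjusting before you paste it into the Entrez Query field.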
Two further filters are available as check boxes for excluding two categories of sequences (‘Models’ and ‘Uncultured/Environmental samples’). You can apply these check boxes on top of any parent or custom database you have set up. Environmental sample sequences are abundant in some of the nucleotide databases but don’t have precise taxonomic assignments. Model transcript and protein sequences are based mainly on the analysis of genomic DNA and may have no experimental support. Depending on your search goals, you may want to exclude either or both of these categories of sequences.
Custom Database Examples
Some example custom database settings are shown in Figure 2 and in the list below. You can click on the hyperlinked examples in the list to access BLAST forms with these custom databases pre-set. You can save any page that you have created by clicking the ‘Bookmark’ link in the upper right of the BLAST form.
Example 1: Search only human chromosome 22
Parent database: NCBI Genomes (chromosome)
Entrez query: NC_000022[Accession]
Example 2: Search all mammalian assembled genomes, except human
Parent database: Reference genomic sequences (refseq_genomic)
Organism selection: mammals (taxid:40674)
exclude: human (taxid:9606)
Example 3: Search RefSeq proteins from enterobacteria with molecular weights between 25 kD and 35 kD
Parent database: Reference proteins (refseq_protein)
Organism selection: enterobacteria (taxid:91347)
Entrez query: 25000:35000[Molecular Weight]
Example 4: Search only submitted mitochondrial sequences from non-insect arthropods and gastropods
Parent database: Nucleotide collection (nr/nt)
Organism selection: arthropods (taxid:6656)
gastropods (taxid:6448)
exclude: insects (taxid:6960)
Entrez query: mitochondrion[Filter] NOT refseq[Filter]
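The same kind of restriction can be approximated outside the web form. The sketch below folds the Organism selections of Example 4 into a single Entrez query using `txid…[Organism:exp]` terms, then assembles a remote `blastn` command line with the BLAST+ `-entrez_query` option. Treat it as an illustration under assumptions: the helper names and the `query.fa` file are hypothetical, and we have not confirmed that the web form builds exactly this query from its Organism boxes.

```python
import shlex


def organism_clause(include, exclude=()):
    """Join taxids into one parenthesized Entrez clause.

    [Organism:exp] matches the named node and everything below it.
    """
    clause = "(" + " OR ".join(f"txid{t}[Organism:exp]" for t in include) + ")"
    for t in exclude:
        clause += f" NOT txid{t}[Organism:exp]"
    return f"({clause})"


def custom_query(include=(), exclude=(), entrez=""):
    """Combine organism limits and a free Entrez query with AND."""
    parts = []
    if include:
        parts.append(organism_clause(include, exclude))
    if entrez:
        parts.append(f"({entrez})")
    return " AND ".join(parts)


# Example 4: arthropods and gastropods, excluding insects, mitochondrial, non-RefSeq.
query = custom_query(
    include=[6656, 6448],
    exclude=[6960],
    entrez="mitochondrion[Filter] NOT refseq[Filter]",
)

# -entrez_query only works together with -remote; query.fa is a placeholder file name.
command = ["blastn", "-remote", "-db", "nt", "-query", "query.fa",
           "-entrez_query", query]
print(shlex.join(command))
```

Each part is wrapped in explicit parentheses because Entrez evaluates Boolean operators from left to right, so an unparenthesized mix of OR, NOT and AND can silently change which sequences are searched.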
For technical assistance on BLAST, write to firstname.lastname@example.org.