An easy way to speed up your BLAST analysis is to search a smaller database targeted to sequences of interest. We’ll describe here a few ways to create such custom databases on the BLAST web pages. For this Quick Tip we’ll use the pages in the Basic BLAST section of the BLAST home page.
BLAST parent databases
Generating a custom database begins with selecting the appropriate parent database. The BLAST Guide provides database descriptions to help with choosing a database. You select the parent in the Database pull-down menu, shown in Figure 1. Selecting the database is really your first opportunity to customize.
For example, the default nucleotide database is nt, but if you want a non-redundant set of transcript sequences, select the refseq_rna database, a subset of nt. After choosing a parent database, you have two primary options to refine it: the Organism field and the Entrez Query field. In addition, there are exclude check boxes to remove model sequences or those from environmental samples. You can customize any of these settings in the ‘Choose Search Set’ section of the BLAST form (Figure 2).
The Organism filter is more versatile than you might think, allowing you to choose and combine any of the taxonomic nodes found in NCBI’s Taxonomy database. For example, here is a sampling of terms in the zebrafish lineage that you can select in the Organism autocomplete field: zebrafish, Danio rerio, bony fishes, teleostei, and vertebrates. As shown in the examples in Figure 2, by expanding the Organism boxes you may also search sequences from more than one node or exclude one or more groups or species.
Entrez Query is the real powerhouse for customization. In the Entrez Query field, you can use any query that works when searching the NCBI Nucleotide and Protein databases. In fact, we recommend that you test your queries on the Nucleotide or Protein pages first to be sure you are retrieving the sequences you want. For help writing Entrez queries, see Entrez Sequences Help or the more concise Search Field Descriptions for Sequence Database.
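If you prefer to test a query from a script rather than on the Nucleotide or Protein pages, one option is the NCBI E-utilities ESearch service. The sketch below only builds the request URL and does not contact NCBI. The function name build_esearch_url and its defaults are illustrative assumptions, not part of BLAST; "nuccore" and "protein" are the E-utilities names for the Nucleotide and Protein databases.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", retmax=0):
    """Return an ESearch URL for an Entrez query (hypothetical helper).

    With retmax=0 the response carries the record count but no ID list,
    which is enough to check that a query retrieves a sensible number
    of sequences before pasting it into the Entrez Query field.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


# Example: the accession query used to limit a search to one chromosome.
url = build_esearch_url("NC_000022[Accession]")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, returns a small JSON document that includes the number of matching records.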
There are also two filters available as check boxes for excluding two categories of sequences (‘Models’ and ‘Uncultured/Environmental samples’). These check boxes can be used to additionally modify any parent or custom databases you have created. Environmental sample sequences are abundant in some of the nucleotide databases but don’t have precise taxonomic assignments. Model transcript and protein sequences are based mainly on the analysis of genomic DNA and may have no experimental support. Depending on your search goals, you may want to exclude either or both of these categories of sequences.
Custom Database Examples
Some example custom database settings are shown in Figure 2 and in the list below. You can click on the hyperlinked examples in the list to access BLAST forms with these custom databases pre-set. You can save any page that you have created by clicking the ‘Bookmark’ link in the upper right of the BLAST form.
Example 1: Search only human chromosome 22
Parent database: NCBI Genomes (chromosome)
Entrez query: NC_000022[Accession]
Example 2: Search all mammalian assembled genomes, except human
Parent database: Reference genomic sequences (refseq_genomic)
Organism selection: mammals (taxid:40674)
exclude: human (taxid:9606)
Example 3: Search RefSeq proteins from enterobacteria with molecular weights between 25 kD and 35 kD.
Parent database: Reference proteins (refseq_protein)
Organism selection: enterobacteria (taxid:91347)
Entrez query: 25000:35000[Molecular Weight]
Example 4: Search only submitted mitochondrial sequences from non-insect arthropods and gastropods.
Parent database: Nucleotide collection (nr/nt)
Organism selection: arthropods (taxid:6656)
gastropods (taxid:6448)
exclude: insects (taxid:6960)
Entrez query: mitochondrion[Filter] NOT refseq[Filter]
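To preview roughly which records a combination of Organism and Entrez Query settings would select, the settings can be written as a single Entrez term and tried on the Nucleotide or Protein pages. The sketch below assumes that an Organism selection corresponds to a txidNNN[Organism:exp] term in Entrez; the helper name custom_db_term is hypothetical, and the result does not capture the choice of parent database.

```python
def custom_db_term(include_taxids=(), exclude_taxids=(), entrez_query=""):
    """Combine organism selections and an Entrez query into one Entrez term.

    Assumption: each Organism box maps to txid<ID>[Organism:exp], which
    matches the taxon and everything below it in the Taxonomy tree.
    """
    parts = []
    if include_taxids:
        # Several included taxa are alternatives, so join them with OR.
        included = " OR ".join(
            f"txid{taxid}[Organism:exp]" for taxid in include_taxids
        )
        parts.append(f"({included})")
    if entrez_query:
        parts.append(f"({entrez_query})")
    if not parts:
        raise ValueError("need at least one included taxon or an Entrez query")
    term = " AND ".join(parts)
    # Each excluded taxon is removed from the set built so far.
    for taxid in exclude_taxids:
        term += f" NOT txid{taxid}[Organism:exp]"
    return term


# Example 2: mammals except human.
example_2 = custom_db_term(include_taxids=[40674], exclude_taxids=[9606])

# Example 4: mitochondrial, non-RefSeq sequences from arthropods and
# gastropods, excluding insects (taxids as listed in the example).
example_4 = custom_db_term(
    include_taxids=[6656, 6448],
    exclude_taxids=[6960],
    entrez_query="mitochondrion[Filter] NOT refseq[Filter]",
)
print(example_2)
print(example_4)
```

The counts returned by Entrez for such a term are only a guide, since BLAST applies the restriction within the chosen parent database rather than across all of Nucleotide or Protein.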
For technical assistance on BLAST, write to firstname.lastname@example.org.
6 thoughts on “Making Custom Databases for Web BLAST”
What if I need to do this locally? I am running hundreds of thousands of psiblast searches on a giant dataset of protein variants, and it is both A) impractical to handle this remotely, and B) a goal of the project to do the computational work on our local cluster. But I haven’t found an easy way to restrict searches to specific taxa when using the psiblast command-line tool, nor have I found a documented way to pare my local BLAST nr protein DB down to just the taxa of interest. Any advice about this matter would be greatly appreciated.
You can restrict local searches to subsets by organism using the GI list option (-gilist), which the command-line help describes as:
Restrict search of database to list of GI’s
One way to generate a GI list for a particular taxon is through the Entrez system (http://www.ncbi.nlm.nih.gov/protein/). For example, search for
then use the “Send to” menu on the results page to save the results in GI list format to a file in the BLAST directory. If you have additional questions or need more help with this, please write to email@example.com
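As a rough illustration of the suggestion above, a local psiblast run restricted to a saved GI list might be assembled like this. The file names (variants.fasta, taxon.gi, hits.txt) and the helper psiblast_gilist_command are hypothetical; -query, -db, -gilist and -out are standard BLAST+ options. The sketch only builds and prints the command line; it does not run BLAST.

```python
import shlex


def psiblast_gilist_command(query, db, gilist, out):
    """Build a psiblast argument list restricted to a GI list file.

    All paths are placeholders supplied by the caller; the list can be
    passed to subprocess.run() on a machine with BLAST+ installed.
    """
    return [
        "psiblast",
        "-query", query,    # FASTA file with the query sequence
        "-db", db,          # local BLAST database, e.g. nr
        "-gilist", gilist,  # file of GIs saved from Entrez, one per line
        "-out", out,        # where to write the results
    ]


cmd = psiblast_gilist_command("variants.fasta", "nr", "taxon.gi", "hits.txt")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when a cluster job script fills in many different file names.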
What if I have a taxon ID list instead of a gilist?
(Also, just to be clear, I know about the -entrez_query option, but that option only works when querying a remote DB, which I specifically want to avoid doing.)
See my reply about the gilist option above.
Is it also possible to filter the locally saved nr database and delete entries that I don’t need, instead of always including the ‘-gilist’ option? This would save a lot of disk space.