The ongoing sequencing revolution has resulted in exponential growth of the NCBI BLAST databases. The default BLAST nucleotide database (nt), the most popular Web BLAST database, currently contains 903 billion letters and continues to grow rapidly, having doubled in size in the last year. This growth will cause longer search times, reduced capacity, and more delays in updating the database. In the not-too-distant future, searching the entire nt database on the web will no longer be possible unless we modify the database scope and composition.
Because of these concerns, we want to make the default Web BLAST nucleotide database smaller and more efficient. Some options are to:
- Change its composition to improve the quality of sequence entries included
- Take steps to slow its growth rate
- Divide it into several databases by biological or functional categories
Any restructuring of the database may follow one of these strategies and will take into account your feedback about how you use BLAST.
To begin getting feedback, we have created a page with four new databases based on taxonomic categories (Figure 1).
Please try the test databases and let us know what you think!
Figure 1. Database selection list on the new test page. This page has a possible database option that divides nt into four smaller databases by taxonomic/functional categories. Inset: The Feedback button where you can provide your feedback about these databases and how you use BLAST.
Please try searching with the test databases and tell us about your experience. How do you use the default nucleotide database? Do you mainly need quick matches to highly annotated records, or are you trying to find distantly related homologs to coding regions?
Use the Feedback button on the results page and send us your thoughts and comments about BLAST databases.
BLAST supports the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms.
Join our mailing list to keep up to date with BLAST and other CGR news.
If you have questions, please reach out to us at email@example.com.
2 thoughts on “Re-evaluating the BLAST Nucleotide Database (nt)”
Two top databases could be initially selected, either NT or RefseqNT. After this initial selection, taxids could be selected.
If the large number of eukaryotic taxids constitutes a problem, limiting to family or order could perhaps be acceptable.
On another issue: for blastp, it would be great to have taxid-based databases that incorporate the sequences from NR, RefSeq, and TSA. I frequently download the NR and TSA -pep databases to select sequences for my desired taxids, then remove redundancy with CD-HIT, and use these databases for local blastp. It would save time for me and for your servers if I could download only what I need.
Regards, and many thanks for your incredible work,
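For readers who want to try a similar local workflow, the steps described in the comment above (select sequences by taxid, remove redundancy with CD-HIT, then run blastp locally) can be sketched as a short Python script that only assembles the command lines. This is a rough illustration, not the commenter's actual procedure: the function name `plan`, the file names, the example taxids, and the 0.95 identity threshold are all assumptions, and the flags shown reflect our understanding of BLAST+ and CD-HIT, so please confirm them with each tool's `-help` output. In particular, taxid filtering needs a version 5 BLAST database, and whether `-taxids` also matches descendant taxa may depend on your BLAST+ version.

```python
import shlex
from typing import List, Sequence


def plan(taxids: Sequence[int], source_db: str, query: str,
         prefix: str = "subset", identity: float = 0.95) -> List[List[str]]:
    """Return command lines for a taxid-limited local blastp run.

    Hypothetical helper: nothing is executed here, the commands are only built.
    """
    if not taxids:
        raise ValueError("at least one taxid is required")
    if not 0.7 <= identity <= 1.0:
        # The fixed word size below (-n 5) is the CD-HIT setting for
        # identity thresholds of 0.7 and above.
        raise ValueError("identity must be between 0.7 and 1.0")

    raw = f"{prefix}.fasta"           # sequences pulled from the source database
    clustered = f"{prefix}.nr.fasta"  # non-redundant set written by CD-HIT
    taxid_arg = ",".join(str(t) for t in taxids)

    return [
        # 1. Extract FASTA records for the chosen taxids from a local database.
        ["blastdbcmd", "-db", source_db, "-taxids", taxid_arg,
         "-outfmt", "%f", "-out", raw],
        # 2. Cluster at the chosen identity and keep one representative each.
        ["cd-hit", "-i", raw, "-o", clustered, "-c", str(identity), "-n", "5"],
        # 3. Build a small protein database from the representatives.
        ["makeblastdb", "-in", clustered, "-dbtype", "prot", "-out", prefix],
        # 4. Search the small database, writing tabular output.
        ["blastp", "-query", query, "-db", prefix,
         "-outfmt", "6", "-out", f"{prefix}.blastp.tsv"],
    ]


# Example: taxids 7227 (Drosophila melanogaster) and 7460 (Apis mellifera).
for cmd in plan([7227, 7460], "nr", "query.faa", prefix="insects"):
    print(shlex.join(cmd))
```

Printing the commands instead of running them makes it easy to review each step before committing to a large download or a long clustering job. Note that this sketch still assumes the full source database is available locally, which is exactly the cost the comment above hopes to avoid.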
Thank you for your suggestions. I have passed these along to the BLAST team. Please write to firstname.lastname@example.org if you would like to be part of the BLAST testing community.