What is Magic-BLAST and why are we excited about it?
Magic-BLAST is a BLAST tool, but it’s unlike any other.
It aligns next-generation sequencing reads, both DNA and RNA-seq. It implements the aligner algorithm from MAGIC (Zhang et al., 2015), a trusted pipeline, but uses the well-tested and supported BLAST infrastructure. We think it’s like putting two great things together, such as having your favorite ice cream in your morning coffee.
We’re so excited about it that we even wrote an article that compares Magic-BLAST to a few other aligners on several data sets.
If you look at the figures in our article, we think you’ll see that Magic-BLAST excels at finding introns and processing ultra-long sequences. It can also handle high levels of mismatches as well as compositionally biased DNA. Finally, you’ll see that Magic-BLAST works in a lot of relevant situations in which current aligners won’t. If our results got your attention, here is our documentation, which includes a cookbook with a few examples.
What about BLASTN and MegaBLAST? How is Magic-BLAST different?
You can think of traditional BLAST programs as working one query at a time against a reference database, producing multiple refined matches of your special query and allowing high levels of mismatches, large gaps, and reporting on the statistics of the match. By default, you’ll get alignments for the 100 best reference sequences.
An aligner like Magic-BLAST may try to align 100 million or more reads (those are its queries), so it can’t spend a lot of time carefully working on each read. Instead, each read is aligned once against the reference database. For the most part, it’s not going to tolerate high levels of mismatches or long gaps. However, it can consider paired-end reads when doing the placement, and it will attempt to identify and align around splice junctions if you have RNA-seq. It also works only with DNA sequences. Finally, Magic-BLAST won’t give you a traditional BLAST report to study.
BLASTN and MegaBLAST are good for situations where you want to run a query or multiple queries against a database to find similar sequences (perhaps from other organisms) or even to just verify the identity of your sequences. However, if you have 100 million reads, you probably don’t want the top 100 matches for each read, since that’s going to be a lot of alignments. Whatever you do, don’t hit print on that BLAST report!
Magic-BLAST uses the same infrastructure you’re familiar with
Magic-BLAST uses a traditional BLAST database to hold the reference sequences. That database is easy to create from FASTA or retrieve from the NCBI FTP site. You might even have a database or two on your local disk.
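As a rough sketch of what "easy to create from FASTA" looks like in practice, the snippet below assembles a makeblastdb command line for a nucleotide database. The file name my_reference.fa and the database name my_reference are placeholders we made up for illustration, and the helper function is ours, not part of BLAST+.

```python
import shlex


def makeblastdb_command(fasta_path, db_name):
    """Build a makeblastdb command line for a nucleotide BLAST database."""
    return [
        "makeblastdb",
        "-in", fasta_path,    # input FASTA file (placeholder name below)
        "-dbtype", "nucl",    # nucleotide database
        "-parse_seqids",      # keep sequence identifiers retrievable
        "-out", db_name,      # base name for the database files
    ]


cmd = makeblastdb_command("my_reference.fa", "my_reference")
print(shlex.join(cmd))
```

If BLAST+ is installed, the list can be handed to subprocess.run(cmd, check=True) to actually build the database; here we only print the command so the sketch runs anywhere.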
The BLAST database compresses DNA sequences 4-to-1, so it doesn’t take up a lot of disk space or memory. It can also hold sequence metadata like taxonomy, sequence length, identifiers, and titles. All the sequences and metadata can be retrieved with the blastdbcmd executable (see https://www.ncbi.nlm.nih.gov/books/NBK279690/).
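To give a feel for where a 4-to-1 ratio comes from, here is a toy sketch that packs four unambiguous bases into one byte at two bits per base. The code table and the pack and unpack helpers are our own illustration and not the actual on-disk BLAST database format, which also has to deal with things like ambiguity codes.

```python
# Illustrative 2-bit codes; an assumption for this sketch, not the real file layout.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an unambiguous DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking stays positional.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(packed, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "ACGTTGCAAC"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
assert unpack(packed, len(seq)) == seq
```

Note that the sequence length has to be stored separately, since the last byte may be only partly filled.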
You can also use this same database for BLASTN, MegaBLAST, or TBLASTN searches if you want. Magic-BLAST can also align against a FASTA file or even an accession, but a BLAST database will give you the fastest searches.
Magic-BLAST can also use an SRA accession as your query, so you don’t need to download a FASTQ file of 100 million reads to get started. It can also align against any BLAST database that the other BLAST programs use.
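Here is a minimal sketch of how those two ways of supplying reads might look on the command line, assuming the magicblast options -sra, -query, -db, -num_threads, and -out as described in the Magic-BLAST documentation. The accession SRRXXXXXXX, the file names, and the helper function are placeholders of ours.

```python
import shlex


def magicblast_command(db, sra=None, query=None, out="alignments.sam", threads=1):
    """Build a magicblast command line for reads from an SRA run or a local file."""
    if (sra is None) == (query is None):
        raise ValueError("give exactly one of sra or query")
    cmd = ["magicblast"]
    # Reads come either straight from SRA or from a local file.
    cmd += ["-sra", sra] if sra else ["-query", query]
    cmd += ["-db", db, "-num_threads", str(threads), "-out", out]
    return cmd


# Placeholder accession and database name; substitute your own.
sra_cmd = magicblast_command("my_reference", sra="SRRXXXXXXX", threads=4)
file_cmd = magicblast_command("my_reference", query="reads.fa")
print(shlex.join(sra_cmd))
print(shlex.join(file_cmd))
```

The sketch only prints the commands. With Magic-BLAST installed, either list could be passed to subprocess.run(cmd, check=True). Check the documentation for the options that matter for your data, such as paired-end input and FASTQ format.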
Want to align your reads against the 50 million sequences in the NCBI nucleotide collection (nt) or some ad hoc database of viruses that you created? Magic-BLAST will not get in your way.
Has anybody tested Magic-BLAST?
Magic-BLAST was used in some NCBI hackathons. This allowed us to see how the hackathon participants used it and to discuss their needs with them. There are several rapidly prototyped workflows using Magic-BLAST on the NCBI hackathons GitHub. There is also a hackathon paper that describes use of Magic-BLAST in extracting antimicrobial resistance information from metagenomes.
Zhang, W., Yu, Y., Hertwig, F., Thierry-Mieg, J., Thierry-Mieg, D., Wang, J., Furlanello, C., Devanarayan, V., Cheng, J., Deng, Y. et al. (2015) Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biol, 16, 133.