ElasticBLAST is a new way to BLAST large numbers of queries, faster and on the cloud. Here are the top three reasons you should use ElasticBLAST:
1. ElasticBLAST can handle much LARGER queries!
ElasticBLAST can search query sets that have hundreds to millions of sequences and against BLAST databases of all sizes.
2. ElasticBLAST is FASTER
ElasticBLAST distributes your searches across multiple cloud instances to process them simultaneously. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+.
3. ElasticBLAST is EASY to run on the cloud
ElasticBLAST is easy to set up using our step-by-step instructions (Amazon Web Services (AWS), Google Cloud Platform (GCP)) and allows you to leverage the power of the cloud. Once configured, it manages the software and database installation, handles partitioning of the BLAST workload among the various instances, and deallocates cloud resources when the searches are done.
ElasticBLAST is a new tool that helps you run BLAST searches on the cloud. ElasticBLAST is perfect for you if you have thousands to millions of queries to our Basic Local Alignment Search Tool (BLAST ®), or if you want to use cloud infrastructure for your searches. ElasticBLAST can handle large searches that are not appropriate for NCBI web BLAST, and it runs them more quickly than stand-alone BLAST+.
ElasticBLAST works on two of the current NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) partners: Amazon Web Services (AWS) and Google Cloud Platform (GCP). ElasticBLAST works by distributing your searches across multiple cloud instances to process them in parallel. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+. ElasticBLAST can handle millions of queries, and it also supports most BLAST+ options and programs.
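To make the idea of distributing a large query set more concrete, here is a minimal sketch of how a FASTA query file could be split into batches that separate workers then search independently. This is an illustration only, not ElasticBLAST's actual implementation: the function names `parse_fasta` and `make_batches` and the `max_letters` batch limit are assumptions made up for this example.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def make_batches(records: Iterable[Record], max_letters: int) -> List[List[Record]]:
    """Group records into batches holding at most max_letters letters.

    A single sequence longer than max_letters still gets its own batch.
    The limit is a hypothetical tuning knob for this sketch.
    """
    batches: List[List[Record]] = []
    current: List[Record] = []
    current_len = 0
    for header, seq in records:
        if current and current_len + len(seq) > max_letters:
            batches.append(current)
            current = []
            current_len = 0
        current.append((header, seq))
        current_len += len(seq)
    if current:
        batches.append(current)
    return batches
```

Each batch could then be handed to a different cloud instance, which is the general reason a distributed search can finish sooner than one machine working through the whole query set.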
Making it easier to run BLAST on the cloud
ElasticBLAST reduces the barrier to using the cloud by creating and managing cloud resources for you. It manages the software and database installation, handles partitioning of the BLAST workload among the various instances and deallocates cloud resources when the searches are done. For example, ElasticBLAST will select the best cloud instance type for your search based on the database metadata that provides database size and memory needs (Figure 1). You can also manually select the instance type if you prefer.
Fig. 1: JSON metadata for the 16S_ribosomal_RNA database. The “bytes-to-cache” information helps ElasticBLAST pick out an instance with the appropriate capacity.
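As a rough illustration of how database metadata can drive the choice of instance, the sketch below reads a "bytes-to-cache" value from a JSON document and picks the smallest machine with enough memory. The instance names, their memory sizes, and the 20% headroom factor are hypothetical values chosen for the example, and the selection rule shown is an assumption rather than the logic ElasticBLAST itself uses.

```python
import json

# Hypothetical instance catalog: (name, memory in GiB). Not real cloud offerings.
INSTANCE_TYPES = [
    ("small-16gb", 16),
    ("medium-64gb", 64),
    ("large-256gb", 256),
]


def pick_instance(metadata_json: str, headroom: float = 1.2) -> str:
    """Return the smallest instance whose memory covers the database cache needs.

    headroom is an assumed safety margin on top of "bytes-to-cache".
    """
    metadata = json.loads(metadata_json)
    needed_gib = metadata["bytes-to-cache"] * headroom / 2**30
    for name, memory_gib in sorted(INSTANCE_TYPES, key=lambda item: item[1]):
        if memory_gib >= needed_gib:
            return name
    raise ValueError(f"no instance type offers {needed_gib:.1f} GiB of memory")


if __name__ == "__main__":
    # Made-up metadata document: a database that needs 20 GiB cached in memory.
    example = json.dumps({"dbname": "example_db", "bytes-to-cache": 20 * 2**30})
    print(pick_instance(example))
```

Picking the smallest instance that fits keeps costs down while still leaving room for the database to be held in memory.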
ElasticBLAST can access the 28 NCBI databases available on AWS and GCP. These are the same databases that are also available from the NCBI FTP site. For instance, databases available on the two cloud providers include the RefSeq Eukaryotic Representative Genomes database, the 16S database based on Targeted Loci, and the human and mouse genome databases.
You can also provide your own databases, and you can produce the metadata needed to select an instance through a Python script that comes with ElasticBLAST.
ElasticBLAST can perform a variety of searches with query sets that range from hundreds to millions of sequences and BLAST databases of all sizes. Table 1 shows ElasticBLAST searches with query sets that range up to billions of letters using a variety of BLAST databases.
Table 1: Sample ElasticBLAST searches. This table demonstrates the breadth of searches supported by ElasticBLAST. Additionally, the first row demonstrates the ability of ElasticBLAST to use many CPUs (3200) on a cloud provider at once to complete a task in hours that would have taken days on a single machine.
Because ElasticBLAST runs on cloud providers, using it will incur some cost. Based on current cost structures on AWS and GCP, in most cases these costs are quite small. For example, a protein search with a query of about 20 million residues against a database of about 20 billion residues can cost less than $5. Even a larger search with a query of 3-4 billion DNA bases can cost only around $50. Both cloud services include the option to bid on instances for less than full price, which can result in significant savings, and ElasticBLAST can be configured to request such instances. Your costs will vary based on many factors, and we encourage you to explore these options with the individual cloud providers. Both AWS and GCP also offer a free tier or time-limited trial of their cloud services, and information about using ElasticBLAST with the free tiers is available as well.
ElasticBLAST is a cloud-native package developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) with support from the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
Join us on February 16, 2022 at 12 PM US eastern time to learn about ElasticBLAST, a new tool that runs your BLAST searches on cloud hardware, using the standard BLAST command-line package. You will hear about the benefits of ElasticBLAST, which include speed and ease of use. You will also see some practical applications of this tool and how you can try it out yourself.
The Sequence Read Archive (SRA) is the National Institutes of Health’s (NIH) primary repository for raw, high-throughput sequencing data, containing both controlled- and open-access datasets that continue to grow exponentially. SRA is managed by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), and the data are available from NCBI’s servers as well as through two cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (GCP). Cloud access was made possible by support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
Due to SRA’s exponential growth in size, data in the cloud environment are currently partitioned into hot and cold storage to keep SRA sustainable and accessible. Per industry standards, data in hot storage are immediately accessible; because hot storage is more expensive to host, we make efforts to align this distribution method with our most frequently requested datasets. The less frequently requested datasets are available in cold storage, which may not be immediately accessible. Fret not! SRA is constantly evolving to meet our users’ needs. NCBI’s Cloud Data Delivery Service (CDDS) now allows you to get public and controlled-access data delivered from cold and hot storage directly to your chosen cloud bucket in just a few hours. The minor cost is currently handled by NCBI, but certain limits apply: within a 30-day request cycle, users can request up to 5 TB from cold storage and 20 TB from hot storage to their cloud bucket.
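If you are planning several deliveries, it can help to track how much of each 30-day allowance you have already used. The short sketch below applies the limits stated above (5 TB from cold storage, 20 TB from hot storage). It is only a planning aid under those stated numbers: the function names are made up for this example, and it does not call or represent any CDDS interface.

```python
from typing import Dict

# Limits per 30-day request cycle, as stated in the text above (in terabytes).
LIMITS_TB: Dict[str, float] = {"cold": 5.0, "hot": 20.0}


def remaining_quota(used_tb: Dict[str, float]) -> Dict[str, float]:
    """Return the terabytes still available per storage tier in this cycle."""
    return {
        tier: max(limit - used_tb.get(tier, 0.0), 0.0)
        for tier, limit in LIMITS_TB.items()
    }


def can_request(used_tb: Dict[str, float], tier: str, size_tb: float) -> bool:
    """Check whether a new request of size_tb fits in the tier's remaining quota."""
    if tier not in LIMITS_TB:
        raise ValueError(f"unknown storage tier: {tier!r}")
    return size_tb <= remaining_quota(used_tb)[tier]


if __name__ == "__main__":
    used = {"cold": 4.0, "hot": 12.5}
    print(remaining_quota(used))
    print(can_request(used, "cold", 1.5))
```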
Adelaide Rhodes, Ph.D., from the Customer Experience team and Adam Stine, SRA Curator, will co-lead the workshop, which will introduce attendees to powerful metadata searches using BigQuery on Google Cloud Platform (GCP) and Athena on Amazon Web Services (AWS) to speed up analytic workflows with the NIH’s Sequence Read Archive (SRA).
Cloud-based query services with expanded metadata options for SRA help researchers to find the target data more quickly than ever before. The workshop will be a mix of training in Structured Query Language (SQL), demos on the cloud console and hands-on exercises in Jupyter notebooks with examples to help researchers understand how to build searches in SQL. Researchers who attend this workshop will learn how to extract specific data sets as well as how to conduct exploratory analysis of the entirety of the SRA data available in the cloud.
Both BigQuery and Athena require SQL but no prior SQL experience is required. By the end of this workshop you will know how to run cloud metadata queries using SQL to find SRA data based on parameters that are of interest to you.
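For readers who have never written SQL, the sketch below shows the general shape of a metadata query: filter rows on the parameters you care about, then summarize. It runs against a tiny in-memory SQLite table so it can be tried anywhere. The table name, column names, and rows are toy data invented for this example and are not the real SRA cloud tables, and the BigQuery and Athena SQL dialects differ from SQLite in some details.

```python
import sqlite3
from typing import List, Tuple


def build_toy_table() -> sqlite3.Connection:
    """Create an in-memory table of made-up sequencing-run metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata (acc TEXT, organism TEXT, platform TEXT, mbases INTEGER)"
    )
    conn.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?, ?)",
        [
            ("TOY0001", "Homo sapiens", "ILLUMINA", 1200),
            ("TOY0002", "Homo sapiens", "OXFORD_NANOPORE", 800),
            ("TOY0003", "Mus musculus", "ILLUMINA", 500),
            ("TOY0004", "Homo sapiens", "ILLUMINA", 300),
        ],
    )
    return conn


def find_runs(conn: sqlite3.Connection, organism: str, platform: str) -> List[str]:
    """Return accessions matching an organism and sequencing platform."""
    rows = conn.execute(
        "SELECT acc FROM metadata WHERE organism = ? AND platform = ? ORDER BY acc",
        (organism, platform),
    )
    return [acc for (acc,) in rows]


def bases_per_organism(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Exploratory summary: total megabases per organism."""
    rows = conn.execute(
        "SELECT organism, SUM(mbases) FROM metadata GROUP BY organism ORDER BY organism"
    )
    return list(rows)


if __name__ == "__main__":
    conn = build_toy_table()
    print(find_runs(conn, "Homo sapiens", "ILLUMINA"))
    print(bases_per_organism(conn))
```

The first query mirrors extracting a specific data set, while the second mirrors exploratory analysis across the whole table.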
To enhance machine access to biomedical literature and drive impactful analyses and reuse, the National Library of Medicine (NLM) is pleased to announce the availability of the PubMed Central (PMC) Article Datasets on Amazon Web Services (AWS) Registry of Open Data as part of AWS’s Open Data Sponsorship Program (ODP). These datasets collectively span 4 million of PMC’s 7 million (total) full-text scientific articles.
Come visit us virtually to learn about new NCBI data access, tools, and best practices at the Bioinformatics Open Source Conference (BOSC), part of the ISMB/ECCB online conference, from July 29 – 30, 2021. We will be presenting virtual posters on NCBI resources, offering a Birds of a Feather discussion, and participating in the BOSC CollaborationFest (CoFest) following the conference, where you can take part in a hands-on evaluation of ElasticBLAST.
NCBI Posters, July 29, 2021, 11:20 – 12:20 PM EDT
All posters will be presented on Thursday afternoon. You can see complete abstracts on the ISMB/ECCB BOSC schedule.
Nuala O’Leary will talk about NCBI Datasets, a new resource for fast, easy access to NCBI sequence data. You will learn about the new interface and new tools to access reference genomes, genes, and orthologs using web-based and programmatic tools.
Adelaide Rhodes will present Open access NCBI cloud resources to accelerate scientific insights, where you can learn about recent developments in transferring > 20 petabytes of NCBI Sequence Read Archive (SRA) data to the cloud.
Deacon Sweeney will describe the web RAPT service for assembling and annotating bacterial genomes at the click of a button in RAPT, The Read assembly and Annotation Pipeline Tool: building a prokaryotic genome annotation package for users of all backgrounds.
Roberto Vera Alvarez will talk about best practices for using cloud tools for transcriptomics in his poster Transcriptome annotation in the cloud: complexity, best practices, and cost.
Greg Boratyn will discuss improvements to the BLAST-based short read aligner, Magic-BLAST, in Recent improvements in Magic-BLAST 1.6.
Visit Christiam Camacho’s poster ElasticBLAST: Using the power of the cloud to speed up science to get an introduction to ElasticBLAST, a Kubernetes-based approach for high throughput BLAST tasks. Join us following the conference in the CoFest to try out ElasticBLAST yourself and provide input. See the section on the CoFest below and our companion post.
Birds of a Feather, July 29, 2021, 11:20 – 12:20 PM EDT
We will host a Birds of a Feather public feedback session on Thursday, where you can provide feedback and participate in discussions on all aspects of NCBI’s new data access options: NCBI Datasets, SRA, BLAST, and the Genome Data Viewer (GDV), our genome browser for sequence visualization. We welcome your input! Come and see us!
CollaborationFest (CoFest), July 31 – August 1, 2021
The ElasticBLAST team will attend the BOSC CoFest following the conference. Sign up to participate on July 31 and August 1 to get an in-depth orientation and an opportunity to test the capabilities of ElasticBLAST on the Amazon Web Services (AWS) cloud. You do not have to register for the conference to attend the CoFest. See our post on the CoFest for more information.
Join the BLAST team at the virtual CollaborationFest (July 31 – August 1, 2021) after the BOSC 2021 conference to help test and improve ElasticBLAST, a new cloud-based tool designed to speed up high throughput BLAST searches. We would love to have your help with real-world testing of our alpha release of ElasticBLAST with your own workflows and data. You may sign up for the CoFest even if you aren’t registered for BOSC 2021.
Here are suggestions for how you can participate. See the FAQs below for additional information.
Try it out and let us know how well it works. You can be blunt.
We’re bringing exciting developments to our user community at the 2021 Galaxy Community Conference (GCC 2021), which is virtual this year!
We start with hosting NCBI’s first ever GCC training week tutorial co-written by Jon Trow, Ph.D. – Sequence Read Archive (SRA): Subject Matter Expert and Adelaide Rhodes, Ph.D. – Cloud: Subject Matter Expert. This tutorial will become a permanent addition to the Galaxy Training Network. The tutorial, “SRA Aligned Read Format (SARF) to Speed Up SARS-CoV-2 Data Analysis”, has detailed instructions and a video demonstration on how to search SRA metadata for SARFs and download them into Galaxy workflows. We will be available via Slack during Office Hours for live virtual interactions.