Join the BLAST team at the virtual CollaborationFest (July 31 to August 1, 2021) after the BOSC 2021 conference to help test and improve ElasticBLAST, a new cloud-based tool designed to speed up high-throughput BLAST searches. We would love to have your help with real-world testing of our alpha release of ElasticBLAST with your own workflows and data. You may sign up for the CoFest even if you aren’t registered for BOSC 2021.
Here are suggestions for how you can participate. See the FAQs below for additional information.
Try it out and let us know how well it works. You can be blunt.
Write a script to make ElasticBLAST part of your workflow.
Try to process ElasticBLAST results with cloud-native tools. Here is an example.
Bring your own high-throughput BLAST search problem to use with ElasticBLAST! Please discuss it with us first to make sure you don’t blow our budget and get the ElasticBLAST team in trouble!
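As a starting point for the scripting suggestion above, here is a minimal sketch of how a workflow script might generate an ElasticBLAST configuration file and the matching command lines. It assumes the INI layout (`[cloud-provider]`, `[cluster]`, `[blast]`) and the `submit`, `status`, and `delete` subcommands described in the ElasticBLAST documentation; the alpha release may differ, so check the current docs before relying on any key. The bucket names, database, and file name below are hypothetical placeholders.

```python
import configparser
import io


def build_config(program, db, queries, results,
                 aws_region="us-east-1", num_nodes=2, options=""):
    """Return the text of an ElasticBLAST-style INI configuration.

    Section and key names follow the public ElasticBLAST documentation
    as we understand it; treat them as assumptions, not a specification.
    """
    # interpolation=None keeps '%' characters in BLAST options literal.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg["cloud-provider"] = {"aws-region": aws_region}
    cfg["cluster"] = {"num-nodes": str(num_nodes)}
    blast = {"program": program, "db": db,
             "queries": queries, "results": results}
    if options:
        blast["options"] = options
    cfg["blast"] = blast
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


def elastic_blast_commands(cfg_path):
    """Build (but do not run) the command lines for one search."""
    return [["elastic-blast", action, "--cfg", cfg_path]
            for action in ("submit", "status", "delete")]


if __name__ == "__main__":
    # Hypothetical buckets and database, for illustration only.
    text = build_config(
        program="blastp",
        db="refseq_protein",
        queries="s3://my-bucket/queries/proteins.fsa",
        results="s3://my-bucket/results/run1",
        options="-evalue 0.01 -outfmt 7",
    )
    print(text)
    for cmd in elastic_blast_commands("run1.ini"):
        print(" ".join(cmd))
```

The script only prints the configuration and commands, so nothing is launched in the cloud by accident. A real workflow would write the text to a file and pass each command to `subprocess.run` after the budget discussion mentioned above.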
We’re bringing exciting developments to our user community at the 2021 Galaxy Community Conference (GCC 2021), which is virtual this year!
Dr. Jon Trow, SRA Subject Matter Expert; Dr. Adelaide Rhodes, Cloud Subject Matter Expert
We start by hosting NCBI’s first-ever GCC training week tutorial, co-written by Jon Trow, Ph.D., Sequence Read Archive (SRA) Subject Matter Expert, and Adelaide Rhodes, Ph.D., Cloud Subject Matter Expert. This tutorial will become a permanent addition to the Galaxy Training Network. The tutorial, “SRA Aligned Read Format (SARF) to Speed Up SARS-CoV-2 Data Analysis”, has detailed instructions and a video demonstration on how to search SRA metadata for SARFs and download them into Galaxy workflows. We will be available via Slack during Office Hours for live virtual interactions.
The NIH NCBI Sequence Read Archive (SRA) on AWS, containing all public SRA data, is now live! This data is hosted on Amazon Web Services (AWS) under the Open Data Sponsorship Program (ODP) with support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative.
The National Library of Medicine’s (NLM) National Center for Biotechnology Information (NCBI) and Amazon Web Services (AWS) are happy to announce that the controlled- and public-access Sequence Read Archive (SRA), one of the world’s largest repositories of raw next-generation sequencing data, will be freely accessible from Amazon S3 via the Open Data Sponsorship Program (ODP) as of January 2021. The SRA is currently hosted by NLM at the National Institutes of Health (NIH).
Join us on December 9, 2020 to learn about containerized BLAST+ in Docker, which is ready to use locally and in the cloud. We are staging BLAST databases with some cloud providers, which makes it even easier to run containerized BLAST as part of a pipeline in the cloud. In this webinar, you will learn about the advantages of containerized BLAST and see how to use it in some practical examples. You will also learn about Elastic BLAST, a cloud application that is useful for aligning extremely large numbers of sequences against BLAST databases.
Date and time: Wed, December 9, 2020 12:00 PM – 12:45 PM EST
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
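To give a flavor of what running containerized BLAST looks like, the sketch below assembles a `docker run` command line for the public `ncbi/blast` image. It assumes the mount points used in the image's published examples (`/blast/blastdb`, `/blast/queries`, `/blast/results`); the host directory, database name, and file names are hypothetical, and the function only builds the command rather than executing it.

```python
import shlex
from pathlib import PurePosixPath


def docker_blast_command(program, db, query_file, out_file,
                         host_dir, image="ncbi/blast"):
    """Build a `docker run` argument list for containerized BLAST+.

    The container-side paths are assumptions based on the image's
    documented examples; adjust them if your image differs.
    """
    host = PurePosixPath(host_dir)
    return [
        "docker", "run", "--rm",
        # Databases and queries are mounted read-only, results read-write.
        "-v", f"{host / 'blastdb'}:/blast/blastdb:ro",
        "-v", f"{host / 'queries'}:/blast/queries:ro",
        "-v", f"{host / 'results'}:/blast/results:rw",
        image,
        program,
        "-query", f"/blast/queries/{query_file}",
        "-db", db,
        "-out", f"/blast/results/{out_file}",
    ]


if __name__ == "__main__":
    # Hypothetical host directory, database, and file names.
    cmd = docker_blast_command(
        program="blastp",
        db="swissprot",
        query_file="query.fsa",
        out_file="blastp.out",
        host_dir="/data/blast",
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps quoting explicit, so the same list can be passed to `subprocess.run` locally or embedded in a pipeline step on a cloud instance.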
Join us December 2 to learn how to use the Read assembly and Annotation Pipeline Tool (RAPT). With RAPT, you can assemble and annotate a microbial genome right out of the sequencing machine! Provide short genomic reads or an SRA run as input, and get back the sequence annotated with a complete gene set. The assembly is built with SKESA and annotated with PGAP. RAPT also verifies the taxonomic assignment of the genome with the Average Nucleotide Identity tool. In this webinar, you will learn how you can run RAPT on your own machine or on the Google Cloud Platform.
Date and time: Wed, December 2, 2020 12:00 PM – 12:45 PM EST
NIH’s Sequence Read Archive (SRA) is the largest, most diverse collection of next-generation sequencing data from human, non-human, and microbial sources. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
SRA currently contains more than 36 petabytes (PB) of data and is projected to grow to 43 PB by 2023. Though the value of this resource grows with each new sample, the exponential growth experienced over the last decade (Figure 1) threatens SRA sustainability. The storage footprint is growing more costly to maintain and the data more difficult to use at scale. The situation has reached a tipping point. SRA must be refactored to support FAIR data principles into the future.
Figure 1. SRA data has grown exponentially over the last decade.
NIH remains committed to the SRA and hopes to establish a long-range plan for sustained resource growth. One model under consideration would make normalized working files without Base Quality Scores (BQS) readily available through cloud platforms and NCBI FTP sites, while larger source files and normalized files with BQS would be distributed on cloud platforms based on prevalent use cases and usage demands. Further details regarding data formats are available here.
It is critical that as an SRA user, you participate in the review and testing of proposed data formats and infrastructure by commenting on how these developments impact your data usage. NIH has prepared a Request for Information (RFI) that details planned developments and would greatly appreciate feedback from the scientific community.
Join us on May 20th to learn how to use Google’s BigQuery to quickly search the data from the Sequence Read Archive (SRA) in the cloud to speed up your bioinformatic research and discovery projects. BigQuery is a tool for exploring cloud-based data tables with SQL-like queries. In this webinar, we’ll introduce you to using BigQuery to mine SRA submitter-supplied metadata and the results of taxonomic analysis for SRA runs. You’ll see real-world case studies that demonstrate how to find key information about SRA runs and identify data sets for your own analysis pipelines.
Date and time: Wed, May 20, 2020 12:00 PM – 12:45 PM EDT
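For readers who want to try a metadata search before the webinar, here is a minimal sketch of a parameterized BigQuery query. The table path `nih-sra-datastore.sra.metadata` and the column names (`acc`, `organism`, `assay_type`, `platform`, `releasedate`) are assumptions based on the public SRA cloud documentation, so verify them against the current schema. Running the query needs the `google-cloud-bigquery` package and Google Cloud credentials, which is why the client import is kept inside `run_query`.

```python
# Assumed location of the SRA metadata table in BigQuery.
METADATA_TABLE = "nih-sra-datastore.sra.metadata"


def build_metadata_query(limit=10):
    """Return SQL that lists runs for one organism and assay type.

    @organism and @assay_type are BigQuery named parameters, so user
    input is never pasted into the SQL text itself.
    """
    limit = int(limit)  # raises ValueError on non-numeric input
    return (
        "SELECT acc, organism, assay_type, platform, releasedate\n"
        f"FROM `{METADATA_TABLE}`\n"
        "WHERE organism = @organism AND assay_type = @assay_type\n"
        "ORDER BY releasedate DESC\n"
        f"LIMIT {limit}"
    )


def run_query(sql, params):
    """Run the query with string parameters and return rows as dicts."""
    from google.cloud import bigquery  # imported lazily; needs credentials

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(name, "STRING", value)
            for name, value in params.items()
        ]
    )
    rows = client.query(sql, job_config=job_config).result()
    return [dict(row.items()) for row in rows]


if __name__ == "__main__":
    sql = build_metadata_query(limit=5)
    print(sql)
    # Uncomment once credentials are configured (values are examples only):
    # for row in run_query(sql, {"organism": "Homo sapiens",
    #                            "assay_type": "RNA-Seq"}):
    #     print(row)
```

Because submitter-supplied values vary in spelling and case, an exact match on `organism` may miss runs; treat the result as a first pass, not a complete set.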
Now that the Sequence Read Archive (SRA) is publicly available on the cloud, you can harness the power of high-performance cloud computing to analyze all the data you wish without having to download a single byte. To help you programmatically find datasets of interest to you, we’ve loaded BigQuery with the SRA Metadata Table, which contains the descriptive information supplied at the time of sequence submission. Searches of the SRA Metadata Table depend on the quality and consistency of the metadata as submitted, which means it can sometimes be a challenge to identify a complete and relevant set of suitable sequences. However, the Taxonomy Analysis Table can be a useful tool to overcome this challenge. Here’s why.
NCBI indexes SRA runs with one or more taxonomy terms when species-specific sequence k-mers are matched in the submitted sequences. The Taxonomy Analysis Table (tax_analysis) thus becomes a catalog of all taxonomic IDs detected in every run, based on the specificity and accuracy characteristics of these unique hashes sampled from reference genomes. We have now added the Taxonomy Analysis Table to BigQuery so you can filter hundreds of thousands of runs by this calculated taxonomic content to gather target datasets. Use this in conjunction with the BigQuery Taxonomy Table (which connects scientific names to taxonomic IDs) and link back to the BigQuery Metadata Table.
Explore/link to these four new tables in BigQuery:
tax_analysis_info: a summary table for the results of the SRA Taxonomy Analysis Tool (STAT).
tax_analysis: the taxonomy analysis table, which you can use to locate any number of runs based on k-mer hits to a particular organism or branch in a taxonomic tree.
taxonomy: the NCBI Taxonomy database, where you can locate the taxid based on organism names.
kmer: contains k-mers mapped to a particular organism and allows you to explore organismal content further. You can leverage the kmer tables in your downstream analysis by building custom k-mer libraries.
Figure 1. SRA runs found using the taxonomy tables and BigQuery for taxid:694002, Betacoronavirus.
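A search like the one in Figure 1 could be sketched as follows. This is an illustration under stated assumptions, not the exact query behind the figure: the table paths, the `tax_id` and `total_count` columns, and the join on `acc` are taken from our reading of the public SRA BigQuery schema, and the `min_count` threshold is an arbitrary example value.

```python
# Assumed BigQuery table paths; verify against the current SRA schema.
TAX_ANALYSIS_TABLE = "nih-sra-datastore.sra_tax_analysis_tool.tax_analysis"
METADATA_TABLE = "nih-sra-datastore.sra.metadata"


def build_taxid_query(tax_id, min_count=1000, limit=20):
    """Return SQL listing runs whose k-mer hits include `tax_id`.

    The taxonomy analysis table supplies the run accessions, and the
    join back to the metadata table adds the submitter-supplied fields.
    """
    # Coerce to int so only numeric values reach the SQL text.
    tax_id, min_count, limit = int(tax_id), int(min_count), int(limit)
    return (
        "SELECT t.acc, t.total_count, m.organism, m.assay_type\n"
        f"FROM `{TAX_ANALYSIS_TABLE}` AS t\n"
        f"JOIN `{METADATA_TABLE}` AS m ON m.acc = t.acc\n"
        f"WHERE t.tax_id = {tax_id} AND t.total_count >= {min_count}\n"
        "ORDER BY t.total_count DESC\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # 694002 is the Betacoronavirus taxid used in Figure 1.
    print(build_taxid_query(694002))
```

The printed SQL can be pasted into the BigQuery console. Because the filter uses calculated taxonomic content instead of the submitted organism name, it can surface runs whose metadata never mentions the organism at all.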
We are actively working on new tools and ways to help you use the cloud to access and compute on SRA data. We are piloting this new feature in BigQuery, and plan to add this information to Amazon Web Services (AWS) Athena soon.