NIH’s Sequence Read Archive (SRA) is the largest, most diverse collection of next generation sequencing data from human, non-human and microbial sources. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
SRA currently contains more than 36 petabytes (PB) of data and is projected to grow to 43 PB by 2023. Though the value of this resource grows with each new sample, the exponential growth experienced over the last decade (Figure 1) threatens SRA sustainability. The storage footprint is growing more costly to maintain and the data more difficult to use at scale. The situation has reached a tipping point. SRA must be refactored to support FAIR data principles into the future.
Figure 1. SRA data has grown exponentially over the last decade.
NIH remains committed to the SRA and hopes to establish a long-range plan for sustained resource growth. Considerations include a model wherein normalized working files without Base Quality Scores (BQS) are readily available through cloud platforms and NCBI FTP sites, while larger source files and normalized files with BQS are distributed on cloud platforms based on prevalent use cases and usage demands. Further details regarding data formats are available here.
It is critical that as an SRA user, you participate in the review and testing of proposed data formats and infrastructure by commenting on how these developments impact your data usage. NIH has prepared a Request for Information (RFI) that details planned developments and would greatly appreciate feedback from the scientific community.
Join us on May 20th to learn how to use Google’s BigQuery to quickly search the data from the Sequence Read Archive (SRA) in the cloud to speed up your bioinformatic research and discovery projects. BigQuery is a tool for exploring cloud-based data tables with SQL-like queries. In this webinar, we’ll introduce you to using BigQuery to mine SRA submitter-supplied metadata and the results of taxonomic analysis for SRA runs. You’ll see real-world case studies that demonstrate how to find key information about SRA runs and identify data sets for your own analysis pipelines.
Date and time: Wed, May 20, 2020 12:00 PM – 12:45 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
Now that the Sequence Read Archive (SRA) is publicly available on the cloud, you can harness the power of high-performance cloud computing to analyze all the data you wish without having to download a single byte. To help you programmatically find datasets of interest to you, we’ve loaded BigQuery with the SRA Metadata Table, which contains the descriptive information supplied at the time of sequence submission. Searches of the SRA Metadata Table depend on the quality and consistency of the metadata as submitted, which means it can sometimes be a challenge to identify a complete and relevant set of suitable sequences. However, the Taxonomy Analysis Table can be a useful tool to overcome this challenge. Here’s why.
NCBI indexes SRA runs with one or more taxonomy terms when species-specific sequence k-mers are matched in the submitted sequences. The Taxonomy Analysis Table (tax_analysis) thus becomes a catalog of all taxonomic IDs detected in every run, based on the specificity and accuracy characteristics of these unique hashes sampled from reference genomes. We have now added the Taxonomy Analysis Table to BigQuery so you can filter hundreds of thousands of runs by this calculated taxonomic content to gather target datasets. Use this in conjunction with the BigQuery Taxonomy Table (which connects scientific names to taxonomic IDs) and link back to the BigQuery Metadata Table.
Explore/link to these four new tables in BigQuery:
tax_analysis_info: a summary table for the results of the SRA Taxonomy Analysis Tool (STAT)
tax_analysis: use the taxonomy analysis table to locate any number of runs based on k-mer hits to a particular organism or branch in a taxonomic tree.
taxonomy: NCBI Taxonomy database where you can locate the taxid based on organism names.
kmer: contains k-mers mapped to a particular organism and allows you to explore organismal content further. You can leverage the kmer tables in your downstream analysis by building custom k-mer libraries.
Figure 1. SRA runs found using the taxonomy tables and BigQuery for taxid:694002, Betacoronavirus.
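To make the table linkage described above concrete, here is a minimal sketch that builds a query joining the Taxonomy Analysis Table to the Metadata Table for a single taxid. The fully qualified table names and the column names (acc, tax_id, total_count, organism) are assumptions for illustration, so check them against the current BigQuery schema before use. The helper build_runs_by_taxid_query is hypothetical and only assembles the SQL text; it does not contact BigQuery.

```python
# Table and column names below are assumptions; verify them against the
# current BigQuery schema before running the query.
TAX_ANALYSIS_TABLE = "nih-sra-datastore.sra_tax_analysis_tool.tax_analysis"
METADATA_TABLE = "nih-sra-datastore.sra.metadata"


def build_runs_by_taxid_query(tax_id: int, limit: int = 100) -> str:
    """Return SQL text listing runs whose taxonomy analysis includes tax_id."""
    # Coerce to int so only numeric literals are interpolated into the SQL.
    tax_id = int(tax_id)
    limit = int(limit)
    return (
        "SELECT m.acc, m.organism, t.total_count\n"
        f"FROM `{TAX_ANALYSIS_TABLE}` AS t\n"
        f"JOIN `{METADATA_TABLE}` AS m ON m.acc = t.acc\n"
        f"WHERE t.tax_id = {tax_id}\n"
        "ORDER BY t.total_count DESC\n"
        f"LIMIT {limit}"
    )


# 694002 is the Betacoronavirus taxid mentioned in the figure caption.
query = build_runs_by_taxid_query(694002, limit=20)
print(query)
```

The printed SQL can be pasted into the BigQuery console. If you use the google-cloud-bigquery client library, it can also be passed to the client's query method, assuming your GCP credentials and project are already configured.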
We are actively working on new tools and ways to help you use the cloud to access and compute on SRA data. We are piloting this new feature in BigQuery, and plan to add this information to Amazon Web Services (AWS) Athena soon.
We recently announced that we made all of the Sequence Read Archive (SRA) publicly available on two cloud platforms. This archive of genetic sequences is a treasure trove of information and the cloud environments provide high-performance computing capabilities via a GCP or AWS account – right from your own device. High-throughput sequencing has made generating data extremely fast and inexpensive, which has fueled the rapid growth of SRA. Putting it on the cloud makes it possible to analyze “the high-throughput, unassembled sequence data, across all such sequences”.
So, what are some of the potential discoveries that await? To investigate some of the possibilities, we have held a series of codeathons to see if known and unknown viruses could be found lurking within SRA cloud datasets. Spoiler alert – they are! And just recently, a team from Stanford reported that they were able to identify a 2019-nCoV-like Coronavirus in pangolins by examining data sets identified via a meta-metagenomic search of SRA and downloaded using the SRA Toolkit. One challenge this team faced was downloading the datasets: 2.5TB corresponding to approximately 10^13 bases took over 48 hours to gather. How might cloud-based SRA tools have made this task easier/faster? Here’s how:
BigQuery: allows native cloud programmatic access to, and search of, SRA metadata in the cloud.
SRA Toolkit: enables retrieval and reading of sequencing files from the SRA datasets in the cloud, as well as writing files into the same format.
Coming soon to the cloud are tools for large scale BLAST processing and a Read Alignment and Annotation Pipeline Tool (RAPT). These tools allow the data to be analyzed directly in the cloud, eliminating the need for download to local storage for analysis.
Also in the works is a mechanism to provide better access to taxonomic content of SRA runs as calculated by NCBI tools.
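For a sense of scale, the download figures quoted above (2.5TB in a little over 48 hours) can be turned into an average transfer rate. The short sketch below does that arithmetic. It assumes decimal terabytes (10^12 bytes) and treats 48 hours as a lower bound on the elapsed time, so the result is an upper bound on the sustained rate.

```python
def average_throughput_mb_per_s(total_bytes: float, hours: float) -> float:
    """Average transfer rate in megabytes (10^6 bytes) per second."""
    seconds = hours * 3600
    return total_bytes / seconds / 1e6


TOTAL_BYTES = 2.5e12  # 2.5 TB, assuming decimal terabytes
HOURS = 48            # the download took "over 48 hours"

rate = average_throughput_mb_per_s(TOTAL_BYTES, HOURS)
print(f"Average rate: at most about {rate:.1f} MB/s")
```

That works out to roughly 14.5 MB/s at best. Analyzing the data in place on the cloud avoids this transfer step entirely.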
We are continually adding new functionality to better support your cloud workflows and are happy to help! Contact us at email@example.com if you have questions or need help getting started. If you need assistance setting up GCP or AWS, please follow the steps in our how-to videos on YouTube.
The National Library of Medicine (NLM) is pleased to announce that all controlled-access and publicly available data in SRA is now available through Google Cloud Platform (GCP) and Amazon Web Services (AWS). To access the data please visit our SRA in the Cloud webpage where you will find links to our new SRA Toolkit and other access methods.
The SRA data available in the two clouds currently totals more than 14 petabytes and consists of all data in the SRA format as well as some data in its original submission format. Since May 2019, NCBI has been putting all submitted SRA data on the GCP and AWS clouds in both the submitted format and our converted SRA format. We have also been moving previously submitted original format data to the clouds and expect to complete that process in 2021.
If you’re interested in visualizing and analyzing genomic data, then you’ll want to check out a new way to run Genome Workbench: in the cloud! Genome Workbench is a desktop application (for both Windows and Mac) that lets you analyze genomic data in one place. You can run tools such as BLAST, create views such as multiple sequence alignments, and much more. You can run Genome Workbench in a cloud environment from your local desktop computer. This manual will show you how.
There are many advantages to using Genome Workbench in the cloud:
You can easily compare your data to the complete GenBank and RefSeq datasets without needing to download them
You can run BLAST searches against standard databases or any custom databases you’ve assembled in the cloud
All of the data (e.g. FASTA, BAM, GFF files) remain in the cloud with no need for local copies
NCBI is pleased to announce a single-cell focused codeathon at the New York Genome Center, January 15-17. To apply, please complete the application form by December 30, 2019. Read on if you need more information about the event.
We are pleased to announce the second installment of the Virus Hunting Codeathon that will take place from November 4-6, 2019 at the University of Maryland in College Park.
The NCBI will help run this bioinformatics codeathon, hosted by the UMIACS and CBCB at the University of Maryland. The purpose of this event is to continue developing techniques, code, and pipelines to identify known, taxonomically definable, and novel viruses from metagenomic datasets on cloud infrastructure.
This event is for researchers, including students and postdocs, who are already engaged in the use of bioinformatics data or in the development of pipelines for virological analyses from high-throughput experiments. We especially encourage people who have experience in Computational Virus Hunting or related fields to participate. The event is open to anyone selected for the codeathon and willing to travel to College Park (see below).
Fast, federated indexing
Genome graphs for viruses
Approximate taxonomic analysis
Domain/HMM Boundary and Taxonomic Refinement
Bringing together approximate taxonomy and domain models
Sequence data quality metrics
We will provide the final list of projects before the codeathon starts.
In modern biomedical research, you often need to analyze very large datasets. This may require computing and storage capacity that exceeds what you have available locally. Working in a cloud environment where you can provision nearly limitless computing power, gain access to enormous data sets, and pay for only what you need is a great option in these cases.
To help with these tasks, NCBI is now providing a Docker version of NCBI BLAST that you can use on the cloud. This implementation will help you work with large volumes of sequence data and the set of NCBI BLAST databases. The BLAST Docker image makes using BLAST on the cloud much more convenient.
Installation and maintenance of the BLAST programs and databases is all handled by Docker.
Integration with other tools in your pipelines is easier.
NCBI BLAST databases are pre-loaded on the Google Cloud, providing fast access.
While we have tested the Docker image on the Google Cloud, the Docker image will allow BLAST to run equally well on any Docker-enabled platform, such as another cloud platform or your local computer, and you can still use the cloud-installed BLAST databases.
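As a rough illustration of what running BLAST through Docker can look like, the sketch below assembles a docker run command line for a single search. The image name ncbi/blast, the container paths under /blast, and the host directory layout are assumptions for illustration, and blast_docker_command is a hypothetical helper rather than part of any NCBI tool. The code only builds and prints the command; it does not start a container.

```python
import shlex
from pathlib import PurePosixPath


def blast_docker_command(program, query_name, db_name, out_name,
                         workdir, image="ncbi/blast"):
    """Build a `docker run` argument list that runs one BLAST search."""
    workdir = PurePosixPath(workdir)
    return [
        "docker", "run", "--rm",
        # Host directories mounted into the container (assumed layout).
        "-v", f"{workdir / 'blastdb'}:/blast/blastdb:ro",
        "-v", f"{workdir / 'queries'}:/blast/queries:ro",
        "-v", f"{workdir / 'results'}:/blast/results:rw",
        image,
        program,
        # Standard BLAST+ options: query file, database name, output file.
        "-query", f"/blast/queries/{query_name}",
        "-db", db_name,
        "-out", f"/blast/results/{out_name}",
    ]


# Hypothetical file names and working directory, for illustration only.
cmd = blast_docker_command("blastn", "query.fsa", "nt", "hits.out",
                           "/home/user/blast")
print(shlex.join(cmd))
```

Building the command as a list of arguments keeps it safe to pass to subprocess.run without shell quoting issues once Docker is installed and the databases are in place.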