Still waiting for an analysis pipeline that can uniformly process raw sequence data produced by a variety of sequencing platforms? Your wait is over! Announcing the SARS-CoV-2 Variant Calling Pipeline, which is now operational and optimized to provide support for multiple sequencing platforms including, Illumina, Oxford Nanopore, and PacBio.
ElasticBLAST is a new way to BLAST large numbers of queries, faster and on the cloud. Here are the top three reasons you should use ElasticBLAST:
1. ElasticBLAST can handle much LARGER queries!
ElasticBLAST can search query sets that have hundreds to millions of sequences and against BLAST databases of all sizes.
2. ElasticBLAST is FASTER
ElasticBLAST distributes your searches across multiple cloud instances to process them simultaneously. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+.
3. ElasticBLAST is EASY to run on the cloud
ElasticBLAST is easy to set up using our step-by-step instructions (Amazon Web Services (AWS), Google Cloud Platform (GCP))andallows you to leverage the power of the cloud. Once configured, itmanages the software and database installation, handles partitioning of the BLAST workload among the various instances, and deallocates cloud resources when the searches are done.
One impetus for development of the dashboard is that unassembled SRA data cannot be processed through Pango tools, and many SARS-CoV-2 samples are only represented in SRA. The Pango nomenclature is being used by researchers and public health agencies worldwide to track the transmission and spread of SARS-CoV-2, including variants of concern. Thus, we developed a uniform approach to making variant calls from SRA records and assigning Pangolin lineages on the basis of these results. This means that submission groups do not have to go through the effort of creating assemblies. Continue reading “Introducing SARS-CoV-2 Variants Overview, NLM’s latest tool in the fight against COVID-19 “→
The volume of biological data being generated by the scientific community is growing exponentially, reflecting technological advances and research activities. This increase in available data has great promise for pushing scientific discovery but also introduces new challenges that scientific communities need to address. The National Institutes of Health’s (NIH) Sequence Read Archive (SRA), which is maintained by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), is a rapidly growing public database that researchers use to improve scientific discovery across all domains of life. As part of the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, over 36 petabytes of “next generation” (raw and SRA-formatted) sequencing data is accessible to anybody via two cloud service providers.
To help address the challenges of conducting large-scale analysis of -omic data in the SRA and similar databases, the Department of Energy (DOE) Office of Biological and Environmental Research (BER), the NIH Office of Data Science Strategy (ODSS), and NCBI, held a virtual workshop on June 8, 2021, on Emerging Solutions in Petabyte Scale Sequence Search. The workshop brought together experts from DOE national labs, research institutions, and universities across the world.
SRA data growth over time. Databases like the NIH Sequence Read Archive are growing rapidly and are used extensively by scientific communities. As these databases grow, so do their potential scientific value, but work must be done to ensure ease of access.
The NIH NCBI Sequence Read Archive (SRA) on AWS, containing all public SRA data, is now live! This data is hosted on Amazon Web Services (AWS) under the Open Data Sponsorship Program (ODP) with support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative.
National Library of Medicine’s (NLM) National Center for Biotechnology Information (NCBI) and Amazon Web Services (AWS) are happy to announce that the controlled- and public-access Sequence Read Archive (SRA)–one of the world’s largest repositories of raw next generation sequencing data–will be freely accessible from Amazon S3 via the Open Data Sponsorship Program (ODP) as of January 2021. The SRA is currently hosted by NLM at the National Institutes of Health (NIH).
While searching for SARS-CoV-2 sequences, have you longed for a COVID-focused SRA dataset? Great news — now there is one! We are happy to announce the addition of COVID-focused datasets (including source and normalized SRA file formats) to the AWS Public Dataset Program. These data can now be explored at the Registry of Open Data on AWS.
Researchers can now access more than 13K SRA runs that include Coronaviridae (CoV) content identified by a kmer-based approach to organismal content identification using the SRA Taxonomy Analysis Tool.
NIH’s Sequence Read Archive (SRA) is the largest, most diverse collection of next generation sequencing data from human, non-human and microbial sources. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
SRA currently contains more than 36 petabytes (PB) of data and is projected to grow to 43 PB by 2023. Though the value of this resource grows with each new sample, the exponential growth experienced over the last decade (Figure 1) threatens SRA sustainability. The storage footprint is growing more costly to maintain and the data more difficult to use at scale. The situation has reached a tipping point. SRA must be refactored to support FAIR data principles into the future.
Figure 1. SRA data has grown exponentially over the last decade.
NIH remains committed to the SRA and hopes to establish a long-range plan for sustained resource growth. Considerations include a model wherein normalized working files without Base Quality Scores (BQS) are readily available through cloud platforms and NCBI FTP sites, and larger source files and normalized files with base quality scores will be distributed on cloud platforms based on prevalent use cases and usage demands. Further details regarding data formats are available here.
It is critical that as an SRA user, you participate in the review and testing of proposed data formats and infrastructure by commenting on how these developments impact your data usage. NIH has prepared a Request for Information (RFI) that details planned developments and would greatly appreciate feedback from the scientific community.
Join us on May 20th to learn how to use Google’s BigQuery to quickly search the data from the Sequence Read Archive (SRA) in the cloud to speed up your bioinformatic research and discovery projects. BigQuery is a tool for exploring cloud-based data tables with SQL-like queries. In this webinar, we’ll introduce you to using BigQuery to mine SRA submitter-supplied metadata and the results of taxonomic analysis for SRA runs. You’ll see real-world case studies that demonstrate how to find key information about SRA runs and identify data sets for your own analysis pipelines.
Date and time: Wed, May 20, 2020 12:00 PM – 12:45 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.