To make it easier for you to find and access Sequence Read Archive (SRA) data, we are re-organizing and improving our cloud storage systems.
Beginning April 2023, we will move the SARS-CoV-2 normalized data and source files from the COVID-19 data buckets on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to the NIH NCBI SRA on AWS registry. We will also remove the SARS-CoV-2 original format data from AWS and GCP COVID-19 buckets and make them available in AWS cold storage. If you need these data, you can request them using the Cloud Data Delivery Service (CDDS).
Where and how will I be able to access SARS-CoV-2 normalized data after this change occurs?
To ensure a smooth transition, we want you to have enough time to adjust your scripts and pipelines to minimize disruption to your analyses. Continue reading “Streamlining Access to SRA COVID-19 Datasets on the Cloud”
Still waiting for an analysis pipeline that can uniformly process raw sequence data produced by a variety of sequencing platforms? Your wait is over! Announcing the SARS-CoV-2 Variant Calling Pipeline, which is now operational and optimized to provide support for multiple sequencing platforms including, Illumina, Oxford Nanopore, and PacBio.
This new pipeline can make allele frequency calls equal to or above 15%. See our publication preprint and our GitHub repository for more details. This optimized pipeline is a result of the efforts of the COVID-19 research community, led by the NIH Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) initiative, a public-private partnership for a coordinated research strategy to support and speed up the development of COVID-19 treatments and vaccines. Continue reading “Announcing the NCBI SARS-CoV-2 Variant Calling Pipeline and Related Data Products”
NCBI Microbial Pathogen and SARS-CoV-2 Resources in the Cloud
Get hands-on experience with NCBI Pathogen Detection and SARS-CoV-2 Surveillance data in the cloud. No prior cloud experience necessary!
NCBI staff are presenting a workshop at the American Society for Microbiology Next-Generation Sequencing (ASM NGS) 2022 Meeting on Sunday, October 16, 2022 from 10 am – 3 pm ET (with a 1 hour break) to help conference attendees learn about two NCBI cloud-hosted resources, Pathogen Detection and SARS-CoV-2 Genome Sequence datasets. Continue reading “NCBI Workshop at the ASM NGS 2022 Meeting”
ElasticBLAST is a new way to BLAST large numbers of queries, faster and on the cloud. Here are the top three reasons you should use ElasticBLAST:
1. ElasticBLAST can handle much LARGER queries!
ElasticBLAST can search query sets that have hundreds to millions of sequences and against BLAST databases of all sizes.
2. ElasticBLAST is FASTER
ElasticBLAST distributes your searches across multiple cloud instances to process them simultaneously. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+.
3. ElasticBLAST is EASY to run on the cloud
ElasticBLAST is easy to set up using our step-by-step instructions (Amazon Web Services (AWS), Google Cloud Platform (GCP)) and allows you to leverage the power of the cloud. Once configured, it manages the software and database installation, handles partitioning of the BLAST workload among the various instances, and deallocates cloud resources when the searches are done.
ElasticBLAST also selects the instance (i.e., machine) type for you based on database size. Of course, you can also choose the instance type manually if you prefer. Continue reading “Top 3 reasons to use ElasticBLAST”
The National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) has released a new resource, called the SARS-CoV-2 Variants Overview, that aggregates data related to SARS-CoV-2 variants from sequences available in NCBI’s GenBank and Sequence Read Archive (SRA) databases.
SARS-CoV-2 Variants Overview, a freely available online dashboard, was developed with guidance from the TRACE Working Group as part of NLM’s participation in the National Institutes of Health (NIH) Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) initiative, a public-private partnership for a coordinated research strategy to support and speed up the development of COVID-19 treatments and vaccines.
One impetus for development of the dashboard is that unassembled SRA data cannot be processed through Pango tools, and many SARS-CoV-2 samples are only represented in SRA. The Pango nomenclature is being used by researchers and public health agencies worldwide to track the transmission and spread of SARS-CoV-2, including variants of concern. Thus, we developed a uniform approach to making variant calls from SRA records and assigning Pangolin lineages on the basis of these results. This means that submission groups do not have to go through the effort of creating assemblies. Continue reading “Introducing SARS-CoV-2 Variants Overview, NLM’s latest tool in the fight against COVID-19 “
The volume of biological data being generated by the scientific community is growing exponentially, reflecting technological advances and research activities. This increase in available data has great promise for pushing scientific discovery but also introduces new challenges that scientific communities need to address. The National Institutes of Health’s (NIH) Sequence Read Archive (SRA), which is maintained by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), is a rapidly growing public database that researchers use to improve scientific discovery across all domains of life. As part of the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, over 36 petabytes of “next generation” (raw and SRA-formatted) sequencing data is accessible to anybody via two cloud service providers.
To help address the challenges of conducting large-scale analysis of -omic data in the SRA and similar databases, the Department of Energy (DOE) Office of Biological and Environmental Research (BER), the NIH Office of Data Science Strategy (ODSS), and NCBI, held a virtual workshop on June 8, 2021, on Emerging Solutions in Petabyte Scale Sequence Search. The workshop brought together experts from DOE national labs, research institutions, and universities across the world.
SRA data growth over time. Databases like the NIH Sequence Read Archive are growing rapidly and are used extensively by scientific communities. As these databases grow, so do their potential scientific value, but work must be done to ensure ease of access.
Continue reading “Tackling Petabyte Scale Sequence Search Challenges”
The NIH NCBI Sequence Read Archive (SRA) on AWS, containing all public SRA data, is now live! This data is hosted on Amazon Web Services (AWS) under the Open Data Sponsorship Program (ODP) with support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative.
Continue reading “The wait is over… NIH’s Public Sequence Read Archive is now open access on the cloud”
National Library of Medicine’s (NLM) National Center for Biotechnology Information (NCBI) and Amazon Web Services (AWS) are happy to announce that the controlled- and public-access Sequence Read Archive (SRA)–one of the world’s largest repositories of raw next generation sequencing data–will be freely accessible from Amazon S3 via the Open Data Sponsorship Program (ODP) as of January 2021. The SRA is currently hosted by NLM at the National Institutes of Health (NIH).
Continue reading “NIH’s Sequence Read Archive to be made available on AWS’s Open Data Sponsorship Program”
While searching for SARS-CoV-2 sequences, have you longed for a COVID-focused SRA dataset? Great news — now there is one! We are happy to announce the addition of COVID-focused datasets (including source and normalized SRA file formats) to the AWS Public Dataset Program. These data can now be explored at the Registry of Open Data on AWS.
Researchers can now access more than 13K SRA runs that include Coronaviridae (CoV) content identified by a kmer-based approach to organismal content identification using the SRA Taxonomy Analysis Tool.
Continue reading “NIH’s COVID-focused Sequence Read Archive (SRA) datasets are now open access on AWS!”
NIH’s Sequence Read Archive (SRA) is the largest, most diverse collection of next generation sequencing data from human, non-human and microbial sources. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
SRA currently contains more than 36 petabytes (PB) of data and is projected to grow to 43 PB by 2023. Though the value of this resource grows with each new sample, the exponential growth experienced over the last decade (Figure 1) threatens SRA sustainability. The storage footprint is growing more costly to maintain and the data more difficult to use at scale. The situation has reached a tipping point. SRA must be refactored to support FAIR data principles into the future.
Figure 1. SRA data has grown exponentially over the last decade.
NIH remains committed to the SRA and hopes to establish a long-range plan for sustained resource growth. Considerations include a model wherein normalized working files without Base Quality Scores (BQS) are readily available through cloud platforms and NCBI FTP sites, and larger source files and normalized files with base quality scores will be distributed on cloud platforms based on prevalent use cases and usage demands. Further details regarding data formats are available here.
It is critical that as an SRA user, you participate in the review and testing of proposed data formats and infrastructure by commenting on how these developments impact your data usage. NIH has prepared a Request for Information (RFI) that details planned developments and would greatly appreciate feedback from the scientific community.
Continue reading “We want to hear from you about changes to NIH’s Sequence Read Archive data format and storage”