Streamlining Access to SRA COVID-19 Datasets on the Cloud

To make it easier for you to find and access Sequence Read Archive (SRA) data, we are reorganizing and improving our cloud storage systems.

Beginning in April 2023, we will move the SARS-CoV-2 normalized data and source files from the COVID-19 data buckets on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to the NIH NCBI SRA on AWS registry. We will also remove the SARS-CoV-2 original format data from the AWS and GCP COVID-19 buckets and make them available in AWS cold storage. If you need these data, you can request them using the Cloud Data Delivery Service (CDDS).

Where and how will I be able to access SARS-CoV-2 normalized data after this change occurs?

To ensure a smooth transition, we want you to have enough time to adjust your scripts and pipelines to minimize disruption to your analyses.  

  • For SRA Toolkit users, the latest version of the Toolkit is configured to automatically locate data in their new locations.  
  • For those not using SRA Toolkit, we recommend updating pipelines to look for SARS-CoV-2 SRA normalized files in the COVID AWS bucket first (s3://sra-pub-sars-cov2) and, if the file is not found there, to look in the SRA AWS ODP bucket (s3://sra-pub-run-odp).
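
For pipelines that do not use SRA Toolkit, the two-bucket lookup order described above could be sketched roughly as follows. This is only an illustration, not an official NCBI tool: the function names are hypothetical, the sketch assumes the same object key is valid in both buckets (the key layout is not specified here, so check it for your files), and the default existence check assumes the third-party boto3 package is installed and that the buckets allow anonymous reads.

```python
from typing import Callable, Optional, Sequence

# Bucket names come from the announcement; the order is the recommended search order.
COVID_BUCKET = "sra-pub-sars-cov2"
ODP_BUCKET = "sra-pub-run-odp"
SEARCH_ORDER = (COVID_BUCKET, ODP_BUCKET)


def s3_object_exists(bucket: str, key: str) -> bool:
    """Check for an object with an anonymous (unsigned) HEAD request.

    Assumption: boto3 is installed and the bucket permits public reads.
    """
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config
    from botocore.exceptions import ClientError

    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        # A missing key can surface as 403 when listing is not permitted,
        # so 403 is treated as "not found" here as well.
        if err.response["Error"]["Code"] in ("403", "404", "NoSuchKey", "NotFound"):
            return False
        raise


def locate_normalized_file(
    key: str,
    exists: Callable[[str, str], bool] = s3_object_exists,
    buckets: Sequence[str] = SEARCH_ORDER,
) -> Optional[str]:
    """Return the S3 URI of the first bucket that holds `key`, or None."""
    for bucket in buckets:
        if exists(bucket, key):
            return f"s3://{bucket}/{key}"
    return None
```

Passing the existence check in as a parameter keeps the lookup order easy to test offline and lets a pipeline swap in its own S3 client. The `key` argument is whatever object key your pipeline already uses for a given run.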

Other SARS-CoV-2 data assets, including Variant Calling Format (VCF) data and metadata tables that are products of our dedicated SARS-CoV-2 Variant Calling Pipeline, will continue to be available in the COVID-19 Genome Sequence dataset on AWS and GCP. 

The SARS-CoV-2 Variant Calling Pipeline and the SRA data on AWS and GCP are supported by the NIH Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) and Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiatives. 

Questions? 

We appreciate your understanding and cooperation as we work to improve access to our data on the cloud. Please contact our help desk with any questions or concerns.  

