NIH’s COVID-focused Sequence Read Archive (SRA) datasets are now open access on AWS!

While searching for SARS-CoV-2 sequences, have you longed for a COVID-focused SRA dataset? Great news — now there is one! We are happy to announce the addition of COVID-focused datasets (including source and normalized SRA file formats) to the AWS Public Dataset Program. These data can now be explored at the Registry of Open Data on AWS.

Researchers can now access more than 13,000 SRA runs that include Coronaviridae (CoV) content. These runs were identified with the SRA Taxonomy Analysis Tool, which uses a k-mer-based approach to identify the organismal content of each run.

Rapid and reliable access to COVID-19 data is paramount to support research and management of the SARS-CoV-2 outbreak. Including this dataset in the AWS Public Dataset Program lets researchers access and egress critical datasets at no cost, helping them get straight to the science. The data are publicly accessible natively from S3, so researchers can download and analyze them locally or compute on them directly in the cloud.
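For readers who want to script a download, here is a minimal sketch of fetching a publicly readable S3 object over HTTPS using only the Python standard library. The bucket name and object key below are hypothetical placeholders, not the dataset's actual location; look up the real bucket and layout on the Registry of Open Data on AWS before running it. The sketch assumes the object allows anonymous reads, so no AWS credentials are involved.

```python
import shutil
import urllib.request
from typing import Optional
from urllib.parse import quote


def public_s3_url(bucket: str, key: str, region: Optional[str] = None) -> str:
    """Build a virtual-hosted-style HTTPS URL for an object in a public S3 bucket."""
    if region is None:
        host = f"{bucket}.s3.amazonaws.com"
    else:
        host = f"{bucket}.s3.{region}.amazonaws.com"
    # Percent-encode the key but keep "/" so the object path stays readable.
    return f"https://{host}/{quote(key, safe='/')}"


def download_public_object(bucket: str, key: str, dest: str,
                           region: Optional[str] = None) -> str:
    """Stream a publicly readable object to a local file and return the local path."""
    url = public_s3_url(bucket, key, region)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all in memory
    return dest


if __name__ == "__main__":
    # Hypothetical placeholders: replace with the bucket and key listed on the
    # Registry of Open Data on AWS.
    BUCKET = "example-public-bucket"
    KEY = "path/to/EXAMPLE_RUN.sra"
    print(public_s3_url(BUCKET, KEY))
```

Calling `download_public_object(BUCKET, KEY, "EXAMPLE_RUN.sra")` would then save the file locally. For bulk transfers or in-cloud analysis, the AWS command-line tools or an SDK are likely a better fit than plain HTTPS requests.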

Work is currently underway to host this dataset on additional public data cloud platforms. Stay tuned!

Figure 1. NCBI’s COVID-19 Genome Sequence Dataset on Registry of Open Data on AWS.

If your research interests extend beyond Coronaviridae (CoV), you can explore the entire SRA dataset, which is hosted by NCBI at the NLM and on GCP and AWS as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

Getting started is easy! Refer to our detailed guidance and watch our how-to videos on YouTube to get started on AWS. Write to us at sra@ncbi.nlm.nih.gov to let us know what you think and how we can better serve your needs!
