NIH’s Sequence Read Archive to be made available on AWS’s Open Data Sponsorship Program

The National Library of Medicine’s (NLM) National Center for Biotechnology Information (NCBI) and Amazon Web Services (AWS) are happy to announce that the controlled- and public-access Sequence Read Archive (SRA), one of the world’s largest repositories of raw next-generation sequencing data, will be freely accessible from Amazon S3 via the Open Data Sponsorship Program (ODP) as of January 2021. The SRA is currently hosted by NLM at the National Institutes of Health (NIH).

Established in 2009, the SRA is NIH’s primary repository for raw next-generation sequencing data. It currently hosts over 36 petabytes of controlled- and public-access sequence data and is growing exponentially. This rate of growth presents unique challenges for storing the database efficiently and keeping it accessible. To that end, NIH released a Request for Information (RFI) inviting the biomedical research community to provide input on next steps for the future of the SRA. A major theme of this RFI is reducing the size of the SRA by eliminating base quality scores (BQS), which would make the data more efficient to work with and store at scale. Moving the SRA to AWS’s ODP provides an avenue to retain BQS and maintain the normalized ETL+BQS data format, while making it simpler for researchers to locate and retrieve SRA data.

While the SRA is slated to transition to the ODP in January 2021, NLM already maintains two additional S3 buckets hosted by ODP with support from the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. The existing buckets contain 1) 250 TB of coronavirus genome sequence data and 2) public SRA data in original format from select, high-value, and newly released studies. AWS users can use Amazon Athena to query the publicly accessible SRA metadata bucket s3://sra-pub-metadata-us-east-1/sra/metadata for accessions of interest, or directly interrogate the SRA bucket for a specific SRA submission or set of submissions, and then pull the data directly into a cloud-based genomics workflow. You can also view our recent webinar on using Athena to access SRA. The SRA Toolkit and the cloud data delivery service are additional mechanisms to access SRA files of interest to you. Refer to the detailed documentation on SRA in the cloud to learn more.
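
As a rough illustration of the Athena route described above, the following Python sketch builds a query for accessions of interest and submits it with boto3. It is a minimal sketch under stated assumptions, not NCBI's documented procedure: it assumes you have already created an Athena database (here called `sra`) with a table (here called `metadata`) over the metadata bucket, that the table exposes columns named `acc`, `organism`, and `assay_type`, and that you own a results bucket for Athena output. The helper names `build_accession_query` and `start_query` are hypothetical.

```python
# Public SRA metadata location named in this post.
SRA_METADATA_S3 = "s3://sra-pub-metadata-us-east-1/sra/metadata"


def _quote(value):
    """Wrap a string in single quotes, doubling embedded quotes for SQL."""
    return "'" + value.replace("'", "''") + "'"


def build_accession_query(organism, assay_type, table="metadata", limit=10):
    """Build an Athena SQL query returning accessions for one organism/assay.

    The table and column names (acc, organism, assay_type) are assumptions
    about how the metadata table was set up in your account.
    """
    return (
        f"SELECT acc FROM {table} "
        f"WHERE organism = {_quote(organism)} "
        f"AND assay_type = {_quote(assay_type)} "
        f"LIMIT {int(limit)}"
    )


def start_query(sql, output_location, database="sra", region="us-east-1"):
    """Submit the query to Athena and return its execution ID.

    Requires boto3 and AWS credentials; output_location is an S3 URI in a
    bucket you control (hypothetical, e.g. "s3://my-athena-results/").
    """
    import boto3  # imported lazily so the query builder works without it

    client = boto3.client("athena", region_name=region)
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Only prints the SQL; call start_query(...) yourself once the
    # database, table, and results bucket exist in your account.
    print(build_accession_query("Homo sapiens", "RNA-Seq"))
```

The query builder is kept separate from the submission step so the SQL can be inspected, or pasted into the Athena console, before anything is run against your account.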

New to AWS? We have a short video tutorial to help you get started, with detailed guidance available here. Write to us at sra@ncbi.nlm.nih.gov to let us know how we can better serve your research needs.

 
