NIH’s Cloud Data Delivery Service: SRA Delivers Even More Big Data to your Cloud Bucket

The Sequence Read Archive (SRA) is the National Institutes of Health’s (NIH) primary repository for raw, high-throughput sequencing data, containing both controlled- and open-access datasets that continue to grow exponentially. SRA is managed by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), and the data are available from NCBI’s servers as well as through two cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (GCP). Cloud access was made possible by support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

Due to SRA’s exponential growth in size, data in the cloud environment are currently partitioned into hot and cold storage to keep SRA sustainable and accessible. Per industry standards, data in hot storage are immediately accessible; because hot storage is more expensive to host, we try to reserve it for our most frequently requested datasets. The less frequently requested datasets are available in cold storage, which may not be immediately accessible.

Fret not! SRA is constantly evolving to meet our users’ needs. NCBI’s Cloud Data Delivery Service (CDDS) now allows you to get public and controlled-access data delivered from cold and hot storage directly to your chosen cloud bucket in just a few hours. The minor cost of delivery is currently covered by NCBI, but certain limits apply: within a 30-day request cycle, users can request up to 5 TB from cold storage and 20 TB from hot storage to be delivered to their cloud bucket.
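To make the 30-day limits concrete, here is a minimal Python sketch that checks whether a planned set of runs would fit within the 5 TB cold-storage and 20 TB hot-storage allowances. It is an illustration only, not part of CDDS: the `Run` class, the function names, and the example accessions are hypothetical, and it assumes decimal terabytes (10^12 bytes), which the limits above do not specify. The cloud data delivery screen remains the authoritative source for run sizes and your remaining allowance.

```python
from dataclasses import dataclass

# Assumption: 1 TB = 10**12 bytes; the stated limits do not specify the unit.
TB = 10**12

# Limits per 30-day request cycle, as stated in the text above.
LIMITS_BYTES = {"cold": 5 * TB, "hot": 20 * TB}


@dataclass
class Run:
    """Hypothetical record for one SRA run selected for delivery."""
    accession: str
    size_bytes: int
    tier: str  # "hot" or "cold"


def remaining_quota(runs, already_used=None):
    """Return bytes left per storage tier after delivering `runs`.

    `already_used` maps a tier name to bytes already requested in the
    current 30-day cycle. A negative result means the limit is exceeded.
    """
    used = dict.fromkeys(LIMITS_BYTES, 0)
    if already_used:
        used.update(already_used)
    for run in runs:
        used[run.tier] += run.size_bytes
    return {tier: LIMITS_BYTES[tier] - used[tier] for tier in LIMITS_BYTES}


def fits_within_limits(runs, already_used=None):
    """True if the request stays within both the hot and cold limits."""
    return all(left >= 0 for left in remaining_quota(runs, already_used).values())


if __name__ == "__main__":
    # Placeholder accessions and sizes, for illustration only.
    planned = [
        Run("SRR_EXAMPLE_1", 3 * TB, "cold"),
        Run("SRR_EXAMPLE_2", 4 * TB, "hot"),
    ]
    print(fits_within_limits(planned))
    print(remaining_quota(planned))
```

Tracking hot and cold totals separately mirrors how the limits are stated: a request can be well under the hot-storage allowance and still exceed the cold-storage one.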

Access to this rich genomic sequence resource is open to the public. Use of CDDS requires an account with AWS or GCP, and users may incur associated costs for egress, storage, and/or compute, which may vary by cloud provider. Working with data in the cloud offers speed, reliability, and access to data that may otherwise be difficult to obtain. Use of cloud services is not required, however; SRA will continue to support existing methods of retrieving data using the NCBI website and SRA Toolkit.

How does CDDS work?

Select files of interest using the SRA Run Selector, then deliver the data to the cloud by clicking the button under ‘Cloud Data Delivery’ (Fig. 1).

Figure 1. Within the SRA Run Selector, select ‘Deliver Data’ (marked in red) to launch the Cloud Data Delivery Service (CDDS) and begin a ‘Deliver Data’ request (see Figure 2).

On the following screen, confirm a few important details like bucket location and file type(s), then click the ‘Deliver Data’ button (Fig. 2).

Figure 2. Confirm important details about your request and submit your data transfer request by clicking the ‘Deliver Data’ button.

Depending upon size and cloud provider, files may take up to 48 hours to transfer, but most requests will be fulfilled much faster. The system will send an email notification once the data have been delivered. The cloud data delivery screen will show you various details about the file types and sizes in your requested runs before you finalize your request. The monthly cloud data transfer limits are generous and rarely exceeded, but if you have questions about your limit, please contact SRA for help at sra@ncbi.nlm.nih.gov.
