Improving how SRA data is distributed

NCBI will incrementally streamline the Sequence Read Archive (SRA) data distribution model over the next year as SRA Lite becomes the standard SRA file format. This simplified format reduces the average file size, making analysis and storage of large datasets more efficient. SRA is the largest publicly available repository of high-throughput sequencing data and is available through cloud providers and NCBI servers. Depending on how you currently access SRA data, your experience may change. If you use the SRA Toolkit, you can continue to set your location and file format preferences and let the toolkit select the best distribution point for your location.

SRA formats

NCBI continually evaluates the data format and distribution model to lower storage costs and support faster data transfers, downloads, and analyses. With the transition to SRA Lite, NCBI will provide the data in three formats:

  • SRA Lite Files: SRA Lite is produced by assessing overall read quality and setting a per-read quality flag. In the resulting files, every read carries a “Read_Filter” flag with a value of pass or reject.
  • SRA Normalized Files: This is the format that has been provided since the inception of the SRA. It contains base calls, full base quality scores, and alignments.
  • Original Submitted Files: The original submitted files are not in a normalized format and may be large.
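To make the per-read flag concrete, here is a minimal Python sketch of how a downstream script might include or exclude rejected reads. It is illustrative only: the `Read` record, its field names, and the `select_reads` helper are hypothetical and are not part of the SRA Toolkit or any NCBI API. The only detail taken from the description above is that each read carries a filter value of pass or reject.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Read:
    """Hypothetical in-memory stand-in for one read from an SRA Lite file."""
    name: str
    bases: str
    read_filter: str  # assumed to be either "pass" or "reject"


def select_reads(reads: Iterable[Read], include_rejected: bool = False) -> Iterator[Read]:
    """Yield reads flagged "pass"; also yield rejected reads if requested."""
    for read in reads:
        if include_rejected or read.read_filter == "pass":
            yield read


# Made-up example data, not real SRA records.
reads = [
    Read("r1", "ACGT", "pass"),
    Read("r2", "NNNN", "reject"),
    Read("r3", "GGCA", "pass"),
]

passing = [r.name for r in select_reads(reads)]
everything = [r.name for r in select_reads(reads, include_rejected=True)]
```

With the sample data above, `passing` keeps only the reads flagged pass, while `everything` retains all three. For real data, the documentation describes the supported way to include or exclude rejected reads.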

Data dissemination and access

Over the next year, we will incrementally increase dissemination of SRA Lite files on NCBI servers and cloud platforms as we work through all the data. As part of this work, we will consolidate access to the SRA Normalized file format to a single storage location. You can access the three file formats from several storage platforms using different services, as outlined below.

  • SRA Lite Files: SRA Lite files will become the standard SRA file format used by the SRA Toolkit, which will automatically determine the most efficient distribution point for these files.
  • SRA Normalized Files: SRA Normalized files are accessible through the Amazon Web Services Open Data Platform (AWS ODP) using the SRA Toolkit and AWS tools. 
  • Original Submitted Files: Original submitted files can be easily retrieved from cold storage using NCBI’s Cloud Data Delivery Service (CDDS). 

During the transition: 

  • A complete copy of SRA, in a mix of SRA Normalized and SRA Lite formats, will be available in Google Cloud Platform (GCP) and on premises at NCBI.
  • AWS will host all original submitted files and SRA Normalized files, along with an increasing volume of data in SRA Lite format.

The move to SRA Lite should be complete by spring of 2023. As market and technical conditions change, we will continue to monitor and refine our processes. 

To learn more about the SRA Lite methodology, see sample output, and find instructions on how to include or exclude rejected reads, please review our documentation or check out our SRA Lite blog.

Questions? 

If you have any questions or would like to provide feedback, please contact us.
