We want to hear from you about changes to NIH’s Sequence Read Archive data format and storage

NIH’s Sequence Read Archive (SRA) is the largest, most diverse collection of next generation sequencing data from human, non-human and microbial sources. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

SRA currently contains more than 36 petabytes (PB) of data and is projected to grow to 43 PB by 2023. Though the value of this resource grows with each new sample, the exponential growth experienced over the last decade (Figure 1) threatens SRA sustainability. The storage footprint is growing more costly to maintain and the data more difficult to use at scale. The situation has reached a tipping point. SRA must be refactored to support FAIR data principles into the future.

Figure 1. SRA data has grown exponentially over the last decade.

NIH remains committed to the SRA and hopes to establish a long-range plan for sustained resource growth. One model under consideration would make normalized working files without Base Quality Scores (BQS) readily available through cloud platforms and NCBI FTP sites, while larger source files and normalized files with BQS would be distributed on cloud platforms according to prevalent use cases and usage demands. Further details regarding data formats are available here.

It is critical that you, as an SRA user, participate in the review and testing of proposed data formats and infrastructure by commenting on how these developments impact your data usage. NIH has prepared a Request for Information (RFI) that details planned developments and would greatly appreciate feedback from the scientific community.

The RFI seeks comments from you on any or all of the following topics:

  1. How you are currently engaging with SRA, considering:
    1. Pipelines and tools you are using with SRA data.
    2. Formats of SRA data required for your current analyses, particularly whether and how you use BQS.
  2. The potential usability and usefulness of SRA normalized data format without BQS.
  3. Possible use cases for new formats with SRA read data stored in alignments without BQS.
  4. Specific value to you of having the original format (as submitted) SRA data available for research.
  5. Whether you are currently using or planning to use SRA data in the cloud and the factors influencing that decision, such as tools, components, or accessories that would facilitate using the data in the cloud.
  6. How the proposed hybrid model for SRA data storage and retrieval in the cloud would impact your current or future research workflows.
  7. Any other topics that NIH might consider to maximize the use and value of SRA in the cloud.

We encourage you to respond to the RFI and share your perspective so that it can be factored into optimizing the solutions. This RFI is relevant to anyone interacting with SRA data in any capacity, such as data generation, storage, submission, curation, or analysis. Please submit comments electronically by July 17, 2020.
