NIH’s Sequence Read Archive (SRA) is the largest, most diverse collection of next-generation sequencing data from human, non-human, and microbial sources. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
SRA currently contains more than 36 petabytes (PB) of data and is projected to grow to 43 PB by 2023. Though the value of this resource grows with each new sample, the exponential growth experienced over the last decade (Figure 1) threatens SRA sustainability. The storage footprint is growing more costly to maintain and the data more difficult to use at scale. The situation has reached a tipping point. SRA must be refactored to support FAIR data principles into the future.
Figure 1. SRA data has grown exponentially over the last decade.
NIH remains committed to the SRA and hopes to establish a long-range plan for sustained resource growth. One model under consideration would make normalized working files without Base Quality Scores (BQS) readily available through cloud platforms and NCBI FTP sites, while distributing larger source files and normalized files with BQS on cloud platforms according to prevalent use cases and usage demands. Further details regarding data formats are available here.
It is critical that you, as an SRA user, participate in the review and testing of the proposed data formats and infrastructure by commenting on how these developments would affect your data usage. NIH has prepared a Request for Information (RFI) that details the planned developments and would greatly appreciate feedback from the scientific community.