The Sequence Read Archive slims down your data with SRA Lite

In response to your requests for more compact data that is faster to deliver, NIH’s Sequence Read Archive (SRA) now offers a new data format, SRA Lite (Figure 1). SRA Lite supports reliable and faster data transfer, downloads, and analysis using current tools. SRA Lite replaces the submitted base quality score (BQS) with a simplified read quality score, reducing the average read size by ~60% for more efficient analysis and storage of large datasets. This format was designed to reflect improvements in next-generation sequencing that include increases in average read length and sequence coverage. Indeed, the data has improved enough that removing some quality scores can increase genotype accuracy (PMCID: PMC4439189).

Figure 1. FASTQ dumped from SRA Lite format and the SRA configuration dialog. The FASTQ has the quality score for each base set to 30 (‘?’ in the ASCII encoding).  Select “Prefer SRA Lite files with simplified base Quality scores” in the SRA configuration dialog to use SRA Lite.
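
To make the caption's "30 (‘?’ in the ASCII encoding)" concrete, here is a minimal Python sketch of the conversion. It assumes the common Phred+33 FASTQ convention, which is consistent with a score of 30 appearing as ‘?’. The function names are ours for illustration and are not part of the SRA Toolkit.

```python
# Assumption: FASTQ qualities use the Phred+33 (Sanger) ASCII offset.
PHRED_OFFSET = 33


def phred_to_char(score: int) -> str:
    """Return the FASTQ character that encodes a Phred quality score."""
    return chr(score + PHRED_OFFSET)


def char_to_phred(symbol: str) -> int:
    """Return the Phred quality score encoded by one FASTQ character."""
    return ord(symbol) - PHRED_OFFSET


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


if __name__ == "__main__":
    print(phred_to_char(30))       # ?
    print(char_to_phred("?"))      # 30
    print(error_probability(30))   # roughly 1 error in 1,000 base calls
```

Under that convention, a quality line made up entirely of ‘?’ says that every base in the read carries the same nominal score of 30.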

Curious about the details? The SRA Lite formatting process assesses overall read quality and sets a per-read quality flag. In the resulting files, all reads have a “Read_Filter” flag with a value of pass or reject. In this manner, we will continue to provide normalized working files that remain compatible with applications expecting base quality scores. These files will have a new file extension, “.sralite”, so they are easily identifiable. To learn more about the SRA Lite methodology, see sample output, and find instructions on how to include or exclude rejected reads, please review our documentation.
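
As a rough illustration of the idea, the Python sketch below shows how a single per-read flag can be expanded into a uniform per-base quality string, and how rejected reads might be dropped downstream. This is not the toolkit's implementation. The record layout (a dict with a `read_filter` key) is hypothetical, and the score of 3 used for rejected reads is an assumption that this post does not state; only the score of 30 shown in Figure 1 comes from the post.

```python
# Illustrative only: not how the SRA Toolkit is implemented.
PHRED_OFFSET = 33  # assumption: Phred+33 FASTQ encoding


def simplified_quality_line(read_length: int, passed: bool,
                            pass_score: int = 30,
                            reject_score: int = 3) -> str:
    """Build a FASTQ quality line in which every base gets the same score.

    pass_score=30 matches Figure 1; reject_score=3 is an assumed placeholder.
    """
    score = pass_score if passed else reject_score
    return chr(score + PHRED_OFFSET) * read_length


def keep_passing(records: list[dict]) -> list[dict]:
    """Keep only records whose (hypothetical) read_filter field is 'pass'."""
    return [rec for rec in records if rec["read_filter"] == "pass"]


if __name__ == "__main__":
    reads = [
        {"name": "read1", "seq": "ACGTAC", "read_filter": "pass"},
        {"name": "read2", "seq": "NNNNAC", "read_filter": "reject"},
    ]
    for rec in reads:
        passed = rec["read_filter"] == "pass"
        print(rec["name"], simplified_quality_line(len(rec["seq"]), passed))
    print([rec["name"] for rec in keep_passing(reads)])
```

Because every base in a read shares one score, downstream tools that expect a quality line still receive one, while the file itself only needs to store a single flag per read.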

Wondering how you can access SRA Lite? The latest version of the SRA toolkit (version 2.11.2) allows you to select a preference for SRA Lite files through the SRA configuration (Figure 1). In cases where such a file doesn’t exist, you will be served SRA Normalized Format with full base quality scores. Unless you both upgrade to the latest toolkit version and update your configuration settings, SRA will continue to preferentially serve SRA Normalized Format files (.sra file extension) and only fall back to SRA Lite when a Normalized Format file is not available. We encourage you to upgrade to version 2.11.2 of the SRA toolkit and set your preference accordingly.

Concerned about quality scores changing? Please let us know: sra@ncbi.nlm.nih.gov! Also be aware that our open access NIH NCBI Sequence Read Archive on AWS will continue to offer SRA Normalized Format for the foreseeable future.
