The Sequence Read Archive (SRA), NCBI’s largest growing repository of molecular data, archives raw sequencing data and alignment information from high-throughput sequencing platforms, including Roche 454 GS Systems®, Illumina’s Genome Analyzer®, and Complete Genomics® systems.
Researchers commonly use SRA data to make discoveries by comparing data sets. You can compare data sets through the SRA web interface, but if you want to integrate SRA downloads and file conversions into an existing pipeline, or you simply prefer a command-line interface, we recommend using the SRA Toolkit.
Figure 1. The SRA Toolkit and GitHub download pages.
This open-source toolkit can be downloaded from the SRA Toolkit webpage or from GitHub/NCBI and is available for the major operating systems. The GitHub link also provides the source code if you would like to compile the toolkit yourself. Wherever you download the toolkit from, you will find instructions for installation and use, as well as an FAQ page.
The toolkit’s command-line executables allow you to stream data from the NCBI/SRA servers for direct analysis or transform the data into common text formats, such as FASTQ or SAM. After applying for access, users can also get restricted-access data from dbGaP, with functions for decrypting and encrypting metadata (for example, phenotype data). Since some of these data files can be exceptionally large, the command-line tools make downloading them that much easier.
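As a rough illustration of the streaming and conversion steps described above, here is a minimal shell sketch. It assumes the toolkit's bin directory is on your PATH and uses the toolkit's fastq-dump and sam-dump programs. The function names (peek_fastq, to_fastq, to_sam) are made up for this example, and the options you need may differ with your toolkit version and your data.

```shell
# Assumption: the SRA Toolkit's bin directory is already on PATH.
# The three function names below are hypothetical helpers, not toolkit commands.

# Stream the first N spots of a run to the screen as FASTQ, without saving a file.
# Usage: peek_fastq <run accession> <number of spots>
peek_fastq() {
    fastq-dump --stdout -X "$2" "$1"
}

# Convert a whole run to FASTQ files in the given directory.
# --split-files writes paired reads to separate files.
# Usage: to_fastq <run accession> <output directory>
to_fastq() {
    fastq-dump --split-files --outdir "$2" "$1"
}

# Write a run as SAM text. This is most useful for runs that were
# submitted with alignment information.
# Usage: to_sam <run accession> <output file>
to_sam() {
    sam-dump "$1" > "$2"
}
```

For example, `peek_fastq ERR039542 5` would print the first five spots of that run, which is a quick way to check a run before converting all of it.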
The SRA Toolkit can also be used to run BLAST searches against archived NGS data. The blastn_vdb application lets you compare a FASTA sequence of interest against specific SRA accessions. For example, if you have a set of run accessions (the ‘ERR’ accession numbers in the command below) and you want to search them for a particular FASTA sequence, the command would look like this:
./blastn_vdb -db "ERR039542 ERR047215 ERR039539 ERR039540" -query nt.test -out test.out
where “nt.test” is the file containing the FASTA sequence of interest and “test.out” is the output file that will contain the search results.
An important note: The command format above works only when you run it from the bin directory of the SRA Toolkit and the FASTA query file is located in that directory too. Otherwise, you must specify file paths to the blastn_vdb executable and to the query file instead.
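To make the note above concrete, here is a minimal shell sketch that calls blastn_vdb by its full path, so the query file can live anywhere. The directory in TOOLKIT_BIN and the function name run_blast_vdb are assumptions for illustration. Only the -db, -query, and -out options come from the command shown earlier.

```shell
# Assumed install location; change this to wherever you unpacked the toolkit.
TOOLKIT_BIN="${TOOLKIT_BIN:-$HOME/sratoolkit/bin}"

# Hypothetical helper, not part of the toolkit.
# Usage: run_blast_vdb "<run accessions>" <query FASTA file> <results file>
run_blast_vdb() {
    accessions=$1   # space-separated run accessions, passed as one -db value
    query=$2        # path to the FASTA query; it does not need to be in bin
    out=$3          # path of the results file to create

    # Stop early with a clear message if the query file cannot be read.
    if [ ! -r "$query" ]; then
        echo "Cannot read query file: $query" >&2
        return 1
    fi

    # Same options as the command in the article, with full paths.
    "$TOOLKIT_BIN/blastn_vdb" -db "$accessions" -query "$query" -out "$out"
}
```

A call such as `run_blast_vdb "ERR039542 ERR047215 ERR039539 ERR039540" ~/queries/nt.test ~/results/test.out` would then run the same search from any working directory. The two paths in that call are placeholders.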
We hope you’ll find this resource helpful and beneficial to your research needs. If you have questions or comments regarding the SRA Toolkit, please email firstname.lastname@example.org and we’d be happy to help!
3 thoughts on “SRA Toolkit: the SRA database at your fingertips”
What do you mean by “restricted-access data from dbGaP”? Does this mean access to TCGA data types?
This refers to the controlled-access data in dbGaP. These are data sets that contain sensitive data, like personal health information.
We have individual-level phenotypes for over 1 million people in dbGaP, many with cancer phenotypes. That said, we do not support TCGA data. For TCGA data, users should go to https://tcga-data.nci.nih.gov/tcga/.