Do you work with human-derived sequence data? Do you often struggle with the need to determine if your data is free of human sequence and therefore suitable for public distribution? We encourage submitters to screen for and remove contaminating human reads from data files prior to submission to SRA. To support investigators in this effort, we offer a tool to remove human sequence contamination from your SRA submissions!
Human Read Removal Tool (HRRT)
The Human Read Removal Tool (HRRT; also known as the Human Scrubber) is available on GitHub and DockerHub. The HRRT is based on the SRA Taxonomy Analysis Tool (STAT). It takes a fastq file as input and produces a fastq.clean file as output, in which all reads identified as potentially of human origin are masked with ‘N’.
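To make the masking step concrete, here is a minimal Python sketch of what "masked with ‘N’" means for a fastq record. It is not the HRRT implementation: the set of flagged read IDs simply stands in for whatever the tool's human-read detection reports, and the names mask_flagged_reads and count_masked are hypothetical helpers introduced only for illustration.

```python
from typing import Iterable, Iterator, Set


def _records(lines: Iterable[str]):
    """Group FASTQ lines into (header, sequence, separator, quality) tuples.

    Assumes the simple 4-lines-per-read layout; an incomplete trailing
    record is silently dropped.
    """
    stripped = (line.rstrip("\n") for line in lines)
    return zip(*[iter(stripped)] * 4)


def mask_flagged_reads(lines: Iterable[str], flagged_ids: Set[str]) -> Iterator[str]:
    """Yield FASTQ lines with the sequence of every flagged read replaced by N.

    flagged_ids is an assumption: a stand-in for the reads a real
    classifier would identify as potentially human.
    """
    for header, seq, sep, qual in _records(lines):
        parts = header[1:].split()
        read_id = parts[0] if parts else ""
        if read_id in flagged_ids:
            # Keep the read length and quality line; hide only the bases.
            seq = "N" * len(seq)
        yield from (header, seq, sep, qual)


def count_masked(lines: Iterable[str]) -> int:
    """Count reads whose sequence consists entirely of N."""
    return sum(1 for _, seq, _, _ in _records(lines) if seq and set(seq) == {"N"})


# Toy input: two reads, the second one flagged.
fastq = [
    "@read1 sample", "ACGT", "+", "IIII",
    "@read2", "GGCCA", "+", "IIIII",
]
cleaned = list(mask_flagged_reads(fastq, {"read2"}))
masked_total = count_masked(cleaned)
```

In this toy example the record for read2 keeps its header and quality line, but its sequence becomes NNNNN. A helper like count_masked can be run on a real fastq.clean file to see how many reads were masked in full before you upload.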
You may also request that NCBI apply the HRRT to all SRA data linked to your submitted BioProject (more information below). Once requested, all data previously submitted to the BioProject will be queued for scrubbing, and any future data submitted to the BioProject will be automatically scrubbed at load time.
This tool can be particularly helpful when a submission could be contaminated with human reads not consented for public display. Clinical pathogen and human metagenome samples are common submission types that benefit from applying the Human Scrubber tool.
For more information on genome data sharing policies, consult with institutional review boards and the NIH Genomic Data Sharing Policy. It is the responsibility of submitting parties to ensure that they have appropriate consent for human sequence data to be distributed publicly without access controls.
How do I apply HRRT to my SRA submission?
If you want the HRRT applied to your SRA submission, please email the SRA help desk and request that the HRRT be activated for your BioProject. Please include your BioProject accession or Submission ID in the request to avoid delays. Submit your sequence data at least one week prior to your desired release date to ensure sufficient time for screening.
The DockerHub and GitHub repositories contain a minimal test that checks that all components are working properly. Additionally, the core scrubber binary (aligns_to) is covered by a Continuous Integration (CI) regimen that runs automated tests on every code change.
For more information, read our recent STAT publication.
If you have questions or would like to provide feedback, please reach out to the SRA help desk.
2 thoughts on “Scrubbing human sequence contamination from Sequence Read Archive (SRA) submissions”
When I deposited some Amplicon-Seq data (with off-target human reads), I found that I could maximize cleaning up the data by running cutadapt first.
In that situation, I believe I also deposited only the reads that did not align to the human reference in a joint human+virus alignment.
While I would usually expect that raw data should be deposited, this reduced the amount of content to deposit (at least as of a few years ago). So, if the human reads are off-target, then perhaps it is worth considering the effect of adapter trimming upstream of HRRT (and the GitHub code can help with testing that out locally, before uploading to the SRA)?
Thank you for the suggestion!