Announcing the NCBI SARS-CoV-2 Variant Calling Pipeline and Related Data Products

Still waiting for an analysis pipeline that can uniformly process raw sequence data produced by a variety of sequencing platforms? Your wait is over! Announcing the SARS-CoV-2 Variant Calling Pipeline, which is now operational and optimized to support multiple sequencing platforms, including Illumina, Oxford Nanopore, and PacBio.

This new pipeline can call variants present at allele frequencies of 15% or higher. See our publication preprint and our GitHub repository for more details. This optimized pipeline is the result of efforts by the COVID-19 research community, led by the NIH Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) initiative, a public-private partnership for a coordinated research strategy to support and speed up the development of COVID-19 treatments and vaccines.

SRA files contain a comprehensive snapshot of all genetic variability in a sample, including all major and minor variants. However, these files can be very large and require pre-processing to distill information. As part of ACTIV TRACE, we have implemented consistent processing for SARS-CoV-2 SRA files: we align SARS-CoV-2 reads to a viral reference, call genetic variants in the viral samples, and provide this information in easy-to-use formats, including the Variant Call Format (VCF).

You can access VCF files, both raw and in SPDI format, for free through the Registry of Open Data on Amazon Web Services (AWS) and the Google Cloud Platform (GCP) Public Dataset Program with support from the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. You can also retrieve the supporting metadata using the annotated variations table in AWS Athena or GCP BigQuery.
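To illustrate how the two representations relate, here is a minimal Python sketch that builds an SPDI-style string (Sequence:Position:Deletion:Insertion) from the fixed columns of a VCF record. It is only an approximation of the idea: the function name `vcf_to_spdi` is hypothetical, the record is illustrative rather than taken from a specific run, and no allele normalization is performed, so the SPDI files distributed by NCBI may differ for insertions and deletions.

```python
def vcf_to_spdi(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build an SPDI-style string from one VCF record with a single ALT allele.

    Assumption: alleles are copied as-is, with no normalization of indels.
    """
    # VCF POS is 1-based, while the SPDI position is 0-based,
    # so the position shifts down by one.
    return f"{chrom}:{pos - 1}:{ref}:{alt}"


# Illustrative record: an A-to-G substitution at position 23403
# of the SARS-CoV-2 reference NC_045512.2.
print(vcf_to_spdi("NC_045512.2", 23403, "A", "G"))
# prints NC_045512.2:23402:A:G
```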

VCFs support downstream analyses including visualization:

You can use VCF files in a wide variety of analyses, including sequence visualization. In the example below, we streamed the following SARS-CoV-2 VCF file into NCBI Sequence Viewer: https://sra-pub-sars-cov2.s3.amazonaws.com/vcf/SRR17781870/SRR17781870.ref.snpeff.vcf. As shown in Figure 1, VCFs allow for a quick visualization of the variations concentrated in the SARS-CoV-2 spike protein region.

Figure 1. The spike protein region of the SARS-CoV-2 reference sequence displayed in the graphical sequence viewer. The VCF file for run SRR17781870 is loaded as the bottom track (boxed in red) and shows the variants (red rectangles) in this region.
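If you would rather inspect a VCF programmatically than in a viewer, the sketch below shows one way to list the variants that fall in the spike region. It relies only on the standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT). The spike gene coordinates (21563 to 25384 on NC_045512.2) are our assumption about the reference annotation, the helper names are hypothetical, and the sample records are made up for illustration rather than copied from run SRR17781870.

```python
from typing import Iterable, Iterator, List, NamedTuple

# Assumption: spike (S) gene span on reference NC_045512.2, 1-based inclusive.
SPIKE_START, SPIKE_END = 21563, 25384


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping headers and blank lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        yield Variant(chrom, int(pos), ref, alt)


def in_region(variants: Iterable[Variant], start: int, end: int) -> List[Variant]:
    """Keep variants whose POS lies within [start, end]."""
    return [v for v in variants if start <= v.pos <= end]


# Made-up records for illustration only.
SAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "NC_045512.2\t241\t.\tC\tT\t.\tPASS\t.\n"
    "NC_045512.2\t23403\t.\tA\tG\t.\tPASS\t.\n"
    "NC_045512.2\t23604\t.\tC\tA\t.\tPASS\t.\n"
)

spike_variants = in_region(parse_vcf(SAMPLE_VCF.splitlines()), SPIKE_START, SPIKE_END)
print([v.pos for v in spike_variants])
# prints [23403, 23604]
```

To try this on real data, pass the lines of a downloaded VCF file to `parse_vcf` in place of `SAMPLE_VCF.splitlines()`.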

We would love to hear how you use VCFs in your analyses. Please write to us to share your use cases and workflows, or to ask any questions.
