The National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) has released a new resource, called the SARS-CoV-2 Variants Overview, that aggregates data related to SARS-CoV-2 variants from sequences available in NCBI’s GenBank and Sequence Read Archive (SRA) databases.
The SARS-CoV-2 Variants Overview, a freely available online dashboard, was developed with guidance from the TRACE Working Group as part of NLM’s participation in the National Institutes of Health (NIH) Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) initiative, a public-private partnership for a coordinated research strategy to support and speed up the development of COVID-19 treatments and vaccines.
One impetus for developing the dashboard is that unassembled SRA data cannot be processed through Pango tools, and many SARS-CoV-2 samples are represented only in SRA. The Pango nomenclature is used by researchers and public health agencies worldwide to track the transmission and spread of SARS-CoV-2, including variants of concern. We therefore developed a uniform approach to making variant calls from SRA records and assigning Pango lineages on the basis of these results. This means that submitting groups do not have to go through the effort of creating assemblies.
Furthermore, our standardized analysis approach should result in higher quality data with fewer artifacts introduced through data processing. Records for SARS-CoV-2 sequence data from GenBank and SRA are processed through the NCBI SARS-CoV-2 Variant Calling pipeline, which matches and deduplicates records that come from the same sample and then assigns each record to a Pango lineage based on the presence of required mutations. The processing pipeline and output files, such as VCFs, are freely available (open access) in the NIH NCBI Sequence Read Archive on AWS, made available with support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative.
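To make the two pipeline steps described above concrete, here is a minimal Python sketch of deduplicating records by sample and then assigning a lineage from required mutations. It is an illustration under stated assumptions, not the NCBI pipeline itself: the lineage names, the mutation labels (`m1`, `m2`, ...), the record layout, the "first record wins" deduplication rule, and the "most specific match wins" tie-break are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple

# Hypothetical lineage definitions: every listed mutation must be present.
# Names and mutation labels are placeholders, not real Pango definitions.
REQUIRED_MUTATIONS: Dict[str, FrozenSet[str]] = {
    "lineage-A": frozenset({"m1", "m2"}),
    "lineage-A.1": frozenset({"m1", "m2", "m3"}),
    "lineage-B": frozenset({"m4"}),
}

# A record is (sample_id, source_database, called_mutations).
Record = Tuple[str, str, Set[str]]


def deduplicate(records: Iterable[Record]) -> Dict[str, FrozenSet[str]]:
    """Keep one mutation set per sample id (assumption: first record wins)."""
    by_sample: Dict[str, FrozenSet[str]] = {}
    for sample_id, _source, mutations in records:
        by_sample.setdefault(sample_id, frozenset(mutations))
    return by_sample


def classify(mutations: FrozenSet[str]) -> Optional[str]:
    """Return the lineage whose required mutations are all present.

    If several lineages match, prefer the one with the most required
    mutations (assumption: the most specific definition wins).
    """
    matches = [
        name for name, required in REQUIRED_MUTATIONS.items()
        if required <= mutations
    ]
    if not matches:
        return None
    return max(matches, key=lambda name: len(REQUIRED_MUTATIONS[name]))


if __name__ == "__main__":
    records = [
        ("s1", "GenBank", {"m1", "m2", "m3"}),
        ("s1", "SRA", {"m1", "m2", "m3"}),  # same sample, second source
        ("s2", "SRA", {"m4", "m9"}),
        ("s3", "SRA", {"m7"}),
    ]
    for sample_id, muts in deduplicate(records).items():
        print(sample_id, classify(muts))
```

A sample with no matching definition is left unclassified (`None`) rather than forced into the nearest lineage, which keeps the counts for tracked lineages conservative.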
The SARS-CoV-2 Variants Overview provides quick visualizations of the geographical prevalence of SARS-CoV-2 sequences classified as Variants Being Monitored (VBM), Variants of Interest (VOI), Variants of Concern (VOC), or Variants of High Consequence (VOHC) by the US Centers for Disease Control and Prevention (CDC) in the USA (Fig. 1a) and around the world (Fig. 1b). Charts also display lineage frequencies over time in the USA or world (Fig. 3).
Fig. 1a: Geographic prevalence of various SARS-CoV-2 lineages within the USA is displayed through a choropleth map. Clicking on a location displays a popover table to provide chronological counts of unique samples classified as VBM, VOI, VOC, or VOHC lineages in the selected location, the total number of samples for a lineage from the location since the beginning of the pandemic, as well as the total number of unique samples from that location in that month. The example shows data for Pennsylvania.
Fig. 1b: Geographic prevalence of various SARS-CoV-2 lineages within the world (not including USA data) is displayed through a choropleth map. Clicking on a location displays a popover table to provide chronological counts of unique samples classified as VBM, VOI, VOC, or VOHC lineages in the selected location, the total number of samples for a lineage from the location since the beginning of the pandemic, as well as the total number of unique samples from that location in that month.
Details for variants within the tracked SARS-CoV-2 lineages are provided via a dedicated info-panel per lineage, which gives users a quick view of the lineage-defining mutations and of lineage-associated mutations that may fall in important epitopes (Fig. 2). Links to NCATS Open Data Portal resources allow easy access to research evidence for the effect of mutations on efficacy of therapeutics. Work is underway to provide data from additional sources to support interpretation of the significance of mutations and lineages, so stay tuned!
Fig. 2: Info-panels dedicated to SARS-CoV-2 variants allow quick navigation to individual cards for each of the lineages, which are grouped by the WHO labels when available. Example shows the lineages included in the Alpha group. Also included within the lineage cards are links to the NCATS Open Data Portal for access to research evidence for the effect of mutations on efficacy of therapeutics.
Lineage-frequency charts show the proportion of unique sequences by lineage over recent months, highlighting lineages which are growing and may deserve further investigation. The charts show proportions, but the actual counts are also provided through pop-ups, since the number of samples sequenced varies over time.
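The relationship between proportions and counts can be sketched in a few lines of Python. This is only an illustration with made-up input, assuming each unique sample has already been reduced to a hypothetical `(month, lineage)` pair; it is not how the dashboard itself is implemented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def lineage_frequencies(
    samples: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, Tuple[int, float]]]:
    """Return {month: {lineage: (count, proportion)}}.

    `samples` yields (month, lineage) pairs, one per unique sample.
    The proportion is the lineage count divided by all samples that month.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for month, lineage in samples:
        counts[month][lineage] += 1

    result: Dict[str, Dict[str, Tuple[int, float]]] = {}
    for month in sorted(counts):
        total = sum(counts[month].values())
        result[month] = {
            lineage: (n, n / total) for lineage, n in counts[month].items()
        }
    return result


if __name__ == "__main__":
    # Made-up data: lineage "A" has the same share in both months,
    # but the second month has half as many sequenced samples.
    samples = [
        ("2021-05", "A"), ("2021-05", "A"), ("2021-05", "B"), ("2021-05", "other"),
        ("2021-06", "A"), ("2021-06", "B"),
    ]
    for month, by_lineage in lineage_frequencies(samples).items():
        print(month, by_lineage)
```

In this toy input the share of lineage "A" is identical in both months even though it rests on two samples in one month and a single sample in the other, which is why showing counts next to proportions matters when sequencing volume changes.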
Fig. 3: Lineage-frequency charts for the US and the world show the proportion of total samples that are classified as one of the tracked lineages. World data excludes US records. Clicking on an area opens a popover showing the proportions and counts of records in recent months, and clicking on a date below the chart opens a popover showing proportions and counts for all tracked lineages for that month as well as the total number of samples collected in that month.
In addition to using the URL to navigate directly to the SARS-CoV-2 Variants Overview, there are additional paths to this interface via the NCBI Virus home page or the SARS-CoV-2 Data Hub (Fig. 4a and 4b, respectively).
Fig. 4a: Quick link to SARS-CoV-2 Variants Overview from the NCBI Virus homepage.
Fig. 4b: Quick link to SARS-CoV-2 Variants Overview from the SARS-CoV-2 Data Hub.
We are always looking for ways to improve, so please send in comments, questions, or suggestions via the yellow Feedback button available on the bottom right of the screen when using this resource. Don’t forget to include contact info if you would like us to respond to your comment. You can also contact us via email. We would love to hear from you!