NCBI staff will be presenting talks and a poster on accessing SARS-CoV-2 data at NCBI and in the Cloud at the American Society for Virology 2021 virtual conference, July 19-23, 2021.
One important way the National Library of Medicine (NLM) is responding to the ongoing public health emergency is through the COVID-19 Initiative. This public-private cooperation between NLM and more than 50 scholarly publishers and societies allows you to access over 100,000 articles on COVID-19, SARS-CoV-2 and other coronaviruses through PubMed Central (PMC). This collection includes recently published discoveries, a history of coronavirus reports for comparison, and international (globally comprehensive) content, and it captures the breadth of research, analysis, and commentary. We make these articles available in human- and machine-readable formats to support public accessibility and analysis by researchers.
You can search this public health emergency collection in PMC or download the collection through the PMC Open Access Subset. The collection spans:
- More than half a century of research, including articles from the 1960s through the present (more than 60% of the articles included thus far were published in 2020; Figure 1, top panel);
- Several languages, including content in English (~95%), German, French, and Spanish;
- Many publication types, more than half of them research or review articles (Figure 1, bottom panel).
Figure 1. The Public Health Emergency Collection articles by decade of publication (top panel) and by publication type (bottom panel).
People have viewed or downloaded articles in this PMC collection more than 80 million times since March, reflecting the great demand for such an open and centralized collection. Artificial intelligence organizations, such as the Allen Institute for AI, builders of the COVID-19 Open Research Dataset (CORD-19), have also used the collection to develop new text and data mining techniques that can help answer high-priority scientific questions related to COVID-19.
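For readers who want to query PMC programmatically, here is a minimal sketch that builds a search URL for the NCBI E-utilities ESearch endpoint. The search term and the `build_pmc_search_url` helper are illustrative assumptions, not an official recipe for retrieving this collection; the code only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pmc_search_url(term, retmax=20):
    """Return an ESearch URL that searches the PMC database for `term`.

    `build_pmc_search_url` is a hypothetical helper written for this example.
    """
    params = {
        "db": "pmc",          # search PubMed Central
        "term": term,         # free-text query (illustrative)
        "retmode": "json",    # ask for a JSON response
        "retmax": retmax,     # maximum number of IDs to return
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_pmc_search_url("SARS-CoV-2", retmax=5)
print(url)
```

Fetching the URL with any HTTP client should return a list of matching PMC IDs. For bulk downloads, the PMC Open Access Subset mentioned above is the better fit.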
The COVID-19 pandemic has drawn attention to the human host genes associated with SARS-CoV-2 entry and to the elements that regulate expression of these genes. At NCBI, we have prioritized curation of experimentally validated regulatory elements for these genes in the RefSeq Functional Elements project. Our annotations include several enhancers, promoters, cis-regulatory elements and protein binding sites, among other feature types. We have annotated 236 regulatory features for 27 distinct biological regions in the latest human Annotation Release (109.20200522), including regulatory elements for the ABO, ACE2, ANPEP, CD209, CLEC4G, CLEC4M, CTSL, DPP4, and TMPRSS2 genes.
You can view our regulatory element to target gene linkages in the regulatory interactions track using our recently announced track hub. You can also see the biological regions and features tracks. These have functional and descriptive metadata, including biological region summaries, experimental evidence types, publication support and more.
The example in Figure 1 shows RefSeq Functional Element feature annotation in NCBI’s Genome Data Viewer (GDV) for the ABO gene region (GRCh38, NW_009646201.1: 73,864-103,789), the determiner of the human ABO blood group. A genome-wide association study recently identified non-coding ABO variants associated with COVID-19 disease severity (PMID:32558485), which map to some of the RefSeq Functional Elements in this region.
Figure 1. The human ABO gene region in the NCBI GDV displaying the RefSeq Functional Element features. The biological regions aggregate track shows underlying feature annotation for an ABO upstream enhancer (LOC112637023), promoter region (LOC112679202), +5.8 intron 1 enhancer (LOC112679198), a 3′ regulatory region (LOC112639999), and a +36.0 downstream enhancer (LOC112637025). Functional Element features include numerous enhancers, promoters, cis-regulatory elements and protein / transcription factor binding sites.
We have more information about RefSeq Functional Elements on our website, including data download and extraction options. Stay tuned to NCBI Insights and other NCBI social media for future announcements about RefSeq Functional Elements!
The National Library of Medicine and its partners in the International Nucleotide Sequence Database Collaboration (INSDC) have joined together to issue a statement encouraging the scientific community to submit their SARS-CoV-2 sequences to INSDC databases. The databases offer broad open access and integrated data, literature and tools – features that we believe are critical as the research community works together to understand and combat COVID-19. Read the full statement below.
The databases of the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) capture, organize, preserve and present nucleotide sequence data as part of the open scientific record. INSDC member institutions – the EMBL European Bioinformatics Institute (EMBL-EBI), the NIG DNA Data Bank of Japan (NIG-DDBJ) and the National Library of Medicine’s National Center for Biotechnology Information at NIH (NCBI) – are committed to the continued delivery of this critical element of scientific infrastructure.
The global COVID-19 crisis has brought an urgent need for the rapid open sharing of data relating to the outbreak. Most importantly, access to sequence data from the SARS-CoV-2 viral genome is essential for our understanding of the biology and spread of COVID-19. To aid in that effort, all three INSDC members have prioritized processing of SARS-CoV-2 sequence data and have streamlined the submission process.
Availability of data through INSDC databases provides:
- Rapid open access – INSDC quickly makes submitted data freely available to everyone, without restrictions on reuse
- Linkage of raw sequence read data to genome assemblies, providing researchers with the ability to validate the integrity of assemblies and investigate asserted mutations and changes in genome sequences
- Integration of SARS-CoV-2 sequences with the entirety of INSDC data, including related coronavirus genome sequences, enabling comparison across species
- Linkage of sequences to the published literature
- Tools – INSDC partners provide integrated data analysis tools, such as BLAST, enhancing the discovery process
In support of the global response to the COVID-19 crisis, the INSDC calls upon the research community to:
- Submit raw SARS-CoV-2 data to the databases of the INSDC
- Submit consensus/assembled SARS-CoV-2 data to the databases of the INSDC
- Provide information relating to the sequenced isolate or sample as part of the sequence submission; minimally the time and place of isolation/sampling and an isolate/sample identifier should be provided to maximize the value of the sequences.
- In cases where scientists have already established submissions to other databases, these submissions should continue in parallel to the INSDC submission
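The minimal sample information requested above (time and place of sampling plus an isolate or sample identifier) can be checked before submission with a short script. This is only a sketch: the field names `isolate`, `collection_date` and `geo_loc_name` and the `missing_metadata` helper are assumptions chosen for illustration, and the example records are made up.

```python
# Field names are assumptions for this example; use whatever names your
# submission template actually requires.
REQUIRED_FIELDS = ("isolate", "collection_date", "geo_loc_name")


def missing_metadata(record):
    """Return the required fields that are absent or blank in `record`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Made-up records used only to demonstrate the check.
complete = {
    "isolate": "example-isolate-1",
    "collection_date": "2020-03-15",
    "geo_loc_name": "USA",
}
incomplete = {"isolate": "example-isolate-2", "collection_date": " "}

print(missing_metadata(complete))
print(missing_metadata(incomplete))
```

A check like this only catches empty fields. It does not validate date formats or place names, which the submission portals check themselves.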
The integration of INSDC databases with the global bioinformatics data infrastructure, including tools, secondary databases, compute capacity and curation processes, assures the rapid dissemination of data and drives its maximal impact.
In addition to these fundamental roles of INSDC member institutions in the sharing of viral sequence data, each institution has rapidly established COVID-19-specific programs and resources: the European COVID-19 Data Platform from EMBL-EBI, the DDBJ’s Research Data Resources on New Coronavirus and the NCBI SARS-CoV-2 Resources. These resources both demonstrate the connectedness of INSDC databases to broader bioinformatics initiatives and serve to add immediate value to COVID-19 research.
While searching for SARS-CoV-2 sequences, have you longed for a COVID-focused SRA dataset? Great news — now there is one! We are happy to announce the addition of COVID-focused datasets (including source and normalized SRA file formats) to the AWS Public Dataset Program. These data can now be explored at the Registry of Open Data on AWS.
Researchers can now access more than 13K SRA runs that include Coronaviridae (CoV) content identified by a k-mer-based approach to organismal content identification using the SRA Taxonomy Analysis Tool.
NCBI Datasets has a simple, new way to get Coronaviridae data, including from SARS-CoV-2 (Figure 1). The data package includes genomic, protein and CDS sequences, annotation and a comprehensive data report for all complete genomes. You can also target your search to major taxonomic ranks within Coronaviridae.
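To give a feel for what a k-mer-based approach means, here is a toy sketch: it indexes the k-mers of a few reference sequences by taxon and counts how many k-mers of a read match each taxon. This illustrates the general idea only and is not how the SRA Taxonomy Analysis Tool is implemented. The sequences, taxon labels, function names and the choice of k = 4 are all invented for the example.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of `seq`."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify_read(read, index, k):
    """Count, per taxon, how many distinct k-mers of `read` hit the index."""
    hits = Counter()
    for kmer in kmers(read, k):
        for taxon in index.get(kmer, ()):
            hits[taxon] += 1
    return hits


# Made-up reference sequences, far shorter than any real genome.
references = {"taxon_A": "ACGTACGTTGCA", "taxon_B": "TTTTGGGGCCCC"}
index = build_index(references, k=4)
hits = classify_read("ACGTTGCA", index, k=4)
print(hits.most_common())
```

Real tools use much longer k-mers, compact data structures and taxonomy-aware scoring, but the core step is the same: look up each k-mer from a read in a precomputed index.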
Interested in a specific protein? The SARS-CoV-2 protein page allows you to choose a protein and download the corresponding sequences, annotation and representative structures from all annotated genomes (Figure 2).
Looking for programmatic access? NCBI Datasets offers the same Coronaviridae genomic data and SARS-CoV-2 protein data through a command-line tool and a RESTful API. These tools support additional filtering, including the ability to download only those genomes released after a date you specify.
Are you trying to keep up with the rapidly growing number of biological resources associated with the SARS-CoV-2 virus and the related disease, COVID-19? There’s a new page to help you find SARS-CoV-2-related content available at NCBI (Figure 1). This new site will help bench scientists, bioinformaticians, clinicians, and others connect with the information they need to study SARS-CoV-2 and end the COVID-19 pandemic.
Figure 1. The new SARS-CoV-2 resources page providing access to data submissions, literature, molecular information, and clinical resources.
Figure 1. The SARS-CoV-2 submission landing page, where you can submit to GenBank or SRA. You can also view other resources related to SARS-CoV-2.
Quickly and easily add your SARS-CoV-2 sequence data to the growing public archive with new, special features and support from NCBI. Our new SARS-CoV-2 sequence submission landing page will help you get started. GenBank submissions are accessioned and released in approximately 1-2 working days, and Sequence Read Archive (SRA) submissions are typically processed and released within hours. Submission is simple!
Are you interested in mining literature about COVID-19 and the novel SARS-CoV-2 virus? You may want to check out the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a collection of more than 13,000 full-text articles that focus on COVID-19 and coronaviruses and that were assembled from PMC, the WHO, bioRxiv, and medRxiv. To produce this dataset, the National Library of Medicine partnered with colleagues from the Allen Institute for AI, the Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Kaggle, Microsoft, and the White House Office of Science and Technology Policy (OSTP).
CORD-19 is available from the Allen Institute and will be updated weekly as new articles become available. The article data are formatted in JSON, making the collection ideal for computational methods such as data mining, machine learning, and natural language processing. We hope this collection serves as a call to action for the community to improve our understanding of coronaviruses and the human diseases they cause. Have a look and let us know what you think!
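Because the articles are distributed as JSON, pulling out text for mining takes only a few lines. The sketch below assumes each article is a JSON object with a `paper_id`, a `metadata.title`, and a `body_text` list of paragraphs that each carry a `text` field; check that layout against the release you download. The sample document and the `extract_text` helper are invented for illustration.

```python
import json

# A made-up document in the assumed layout; real files are much larger.
SAMPLE = """
{
  "paper_id": "example-0001",
  "metadata": {"title": "An illustrative title"},
  "body_text": [
    {"section": "Introduction", "text": "First paragraph."},
    {"section": "Methods", "text": "Second paragraph."}
  ]
}
"""


def extract_text(doc):
    """Collect the ID, title and joined body paragraphs of one article."""
    title = doc.get("metadata", {}).get("title", "")
    paragraphs = [p.get("text", "") for p in doc.get("body_text", [])]
    return {
        "paper_id": doc.get("paper_id"),
        "title": title,
        "text": "\n".join(paragraphs),
    }


record = extract_text(json.loads(SAMPLE))
print(record["title"])
print(record["text"])
```

To process a full download, the same function could be applied to each file after loading it with `json.load`, before passing the text to a tokenizer or search index.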
As the global health emergency around severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, formerly 2019-nCoV) continues, we are still playing a key role in providing the biomedical community with free and easy access to genome sequences from the coronavirus. You can quickly access these data through the NCBI search (Figure 1).
Figure 1. NCBI search results for the term “SARS-COV-2” showing the schematic map of the viral assembly and annotation and buttons that link to the data in the NCBI Virus resource, a specialized BLAST page that searches Betacoronavirus sequences, and the reference assembly download. The bottom panel provides links to the CDC website for COVID-19 information and a link to GenBank®/SRA sequence data.