While searching for SARS-CoV-2 sequences, have you longed for a COVID-focused SRA dataset? Great news — now there is one! We are happy to announce the addition of COVID-focused datasets (including source and normalized SRA file formats) to the AWS Public Dataset Program. These data can now be explored at the Registry of Open Data on AWS.
Researchers can now access more than 13K SRA runs that contain Coronaviridae (CoV) content. These runs were identified with the SRA Taxonomy Analysis Tool, which uses a k-mer-based approach to determine the organismal content of a run.
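To give a feel for what k-mer-based content identification means, here is a toy Python sketch. It is not the SRA Taxonomy Analysis Tool's implementation: the sequences are made up, the k-mer length of 8 is an arbitrary choice for illustration, and the function names `kmers` and `kmer_hit_fraction` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hit_fraction(read, reference_kmers, k):
    """Fraction of the read's distinct k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:  # read shorter than k: nothing to compare
        return 0.0
    return len(read_kmers & reference_kmers) / len(read_kmers)


# Made-up sequences for illustration only; not real viral data.
K = 8
reference = "ATGGCTAGCTAGGATCCGATCGATCGTAGCTAGCTAGGCTA"
reference_index = kmers(reference, K)

matching_read = "GGATCCGATCGATCG"   # copied from inside the reference
unrelated_read = "TTTTTTTTTTTTTTT"  # shares no 8-mer with the reference

print(kmer_hit_fraction(matching_read, reference_index, K))
print(kmer_hit_fraction(unrelated_read, reference_index, K))
```

A read whose k-mers mostly appear in a taxon's reference set is evidence that the run contains that taxon. A production tool would add much more, such as a large multi-taxon index and handling of sequencing errors.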
NCBI Datasets has a simple, new way to get Coronaviridae data, including from SARS-CoV-2 (Figure 1). The data package includes genomic, protein and CDS sequences, annotation and a comprehensive data report for all complete genomes. You can also target your search to major taxonomic ranks within Coronaviridae.
Interested in a specific protein? The SARS-CoV-2 protein page allows you to choose a protein and download the corresponding sequences, annotation and representative structures from all annotated genomes (Figure 2).
Looking for programmatic access? NCBI Datasets offers the same Coronaviridae genomic data and SARS-CoV-2 protein data through a command-line tool and a RESTful API. These tools support additional filtering, including the ability to download only those genomes released after a date you specify.
NCBI is pleased to announce ongoing enhancements to submission of SARS-CoV-2 assembled genomes to GenBank, including a streamlined workflow on the web and a new API option. Both new options mean that you can receive accessions for SARS-CoV-2 data submissions more quickly!
A streamlined workflow with improved interface and enhanced validation on both web and API saves you time and effort and, most importantly, makes it possible to get SARS-CoV-2 accession numbers and public release of data within hours. In addition, we automatically annotate all SARS-CoV-2 genomes to produce standardized, consistent annotation which saves you time and benefits researchers who find your data valuable.
Are you trying to keep up with the rapidly growing number of biological resources associated with the SARS-CoV-2 virus and the related disease, COVID-19? There’s a new page to help you find SARS-CoV-2-related content available at NCBI (Figure 1). This new site will help bench scientists, bioinformaticians, clinicians, and others connect with the information they need to study SARS-CoV-2 and end the COVID-19 pandemic.
Figure 1. The new SARS-CoV-2 resources page providing access to data submissions, literature, molecular information, and clinical resources.
Figure 1. The SARS-CoV-2 submission landing page, where you can submit to GenBank or SRA. You can also view other resources related to SARS-CoV-2.
Quickly and easily add your SARS-CoV-2 sequence data to the growing public archive with new, special features and support from NCBI. Our new SARS-CoV-2 sequence submission landing page will help you get started. GenBank submissions are accessioned and released in approximately 1-2 working days, and Sequence Read Archive (SRA) submissions are typically processed and released within hours. Submission is simple!
We recently announced that we made all of the Sequence Read Archive (SRA) publicly available on two cloud platforms. This archive of genetic sequences is a treasure trove of information and the cloud environments provide high-performance computing capabilities via a GCP or AWS account – right from your own device. High-throughput sequencing has made generating data extremely fast and inexpensive, which has fueled the rapid growth of SRA. Putting it on the cloud makes it possible to analyze “the high-throughput, unassembled sequence data, across all such sequences”.
So, what are some of the potential discoveries that await? To investigate some of the possibilities, we have held a series of codeathons to see if known and unknown viruses could be found lurking within SRA cloud datasets. Spoiler alert – they are! And just recently, a team from Stanford reported that they were able to identify a 2019-nCoV-like Coronavirus in pangolins by examining data sets identified via a meta-metagenomic search of SRA and downloaded using the SRA Toolkit. One challenge this team faced was downloading the datasets: 2.5TB corresponding to approximately 10^13 bases took over 48 hours to gather. How might cloud-based SRA tools have made this task easier/faster? Here’s how:
BigQuery: allows native, programmatic access in the cloud to SRA metadata, including searches based on that metadata.
SRA Toolkit: enables retrieval and reading of sequencing files from the SRA datasets in the cloud, and writing of files in the same format.
Coming soon to the cloud are tools for large-scale BLAST processing and a Read Alignment and Annotation Pipeline Tool (RAPT). These tools allow the data to be analyzed directly in the cloud, eliminating the need for download to local storage for analysis.
Also in the works is a mechanism to provide better access to taxonomic content of SRA runs as calculated by NCBI tools.
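As a rough idea of what the BigQuery metadata search mentioned above could look like, here is a minimal Python sketch that only builds the SQL text. Everything specific in it is an assumption rather than something stated in this post: the table name `nih-sra-datastore.sra.metadata`, the column names `acc`, `organism` and `mbases`, and the helper name `build_sra_metadata_query`. Check the current SRA cloud documentation before relying on any of them.

```python
# Assumed table name for the SRA metadata in BigQuery; verify before use.
SRA_METADATA_TABLE = "nih-sra-datastore.sra.metadata"


def build_sra_metadata_query(limit=100):
    """Build a parameterized query listing runs for one organism.

    The organism is passed as the named query parameter @organism rather
    than being pasted into the SQL text.
    """
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT acc, organism, mbases\n"  # assumed column names
        f"FROM `{SRA_METADATA_TABLE}`\n"
        "WHERE organism = @organism\n"
        "ORDER BY mbases DESC\n"
        f"LIMIT {limit}"
    )


sql = build_sra_metadata_query(limit=10)
print(sql)

# Running the query needs a cloud project and credentials, so it is only
# outlined here, using the google-cloud-bigquery client library:
#
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   config = bigquery.QueryJobConfig(query_parameters=[
#       bigquery.ScalarQueryParameter("organism", "STRING", "Manis javanica"),
#   ])
#   for row in client.query(sql, job_config=config).result():
#       print(row.acc, row.mbases)
```

A search like this returns run accessions without moving any sequence data. The matching runs can then be read with the SRA Toolkit inside the same cloud environment.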
We are continually adding new functionality to better support your cloud workflows and are happy to help! Contact us at email@example.com if you have questions or need help getting started. If you need assistance setting up GCP or AWS, please follow the steps in our how-to videos on YouTube.
Are you interested in mining literature about COVID-19 and the novel SARS-CoV-2 virus? You may want to check out the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a collection of more than 13,000 full-text articles that focus on COVID-19 and coronaviruses and that were assembled from PMC, the WHO, bioRxiv, and medRxiv. To produce this dataset, the National Library of Medicine partnered with colleagues from the Allen Institute for AI, the Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Kaggle, Microsoft, and the White House Office of Science and Technology Policy (OSTP).
CORD-19 is available from the Allen Institute and will be updated weekly as new articles become available. The article data are formatted in JSON, making the collection ideal for computational methods such as data mining, machine learning, and natural language processing. We hope this collection serves as a call to action for the community to improve our understanding of coronaviruses and the human diseases they cause. Have a look and let us know what you think!
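Because the articles are JSON, a few lines of Python are enough to start exploring one. The sketch below parses a made-up record. The field names (`paper_id`, `metadata.title`, and `body_text` entries with `section` and `text`) are an assumption about the CORD-19 layout, not something stated in this post, so compare them with the schema file shipped with the dataset. The helper name `extract_sections` is hypothetical.

```python
import json

# Made-up record for illustration; the field names follow the assumed layout.
SAMPLE_ARTICLE = """
{
  "paper_id": "example-0001",
  "metadata": {"title": "An example coronavirus article"},
  "body_text": [
    {"section": "Introduction", "text": "Coronaviruses infect many hosts."},
    {"section": "Results", "text": "We describe a hypothetical result."}
  ]
}
"""


def extract_sections(article):
    """Group body paragraphs by section name, keeping document order."""
    sections = {}
    for paragraph in article.get("body_text", []):
        name = paragraph.get("section", "")
        sections.setdefault(name, []).append(paragraph.get("text", ""))
    return {name: " ".join(parts) for name, parts in sections.items()}


article = json.loads(SAMPLE_ARTICLE)
title = article.get("metadata", {}).get("title", "")
sections = extract_sections(article)
word_count = sum(len(text.split()) for text in sections.values())

print(title)
for name, text in sections.items():
    print(f"{name}: {text}")
print("words:", word_count)
```

To run this over the real collection, replace the sample string with `json.load` on each downloaded file. The per-section text can then go straight into a text-mining or natural language processing pipeline.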
As the global health emergency around the Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, formerly 2019-nCoV) continues, we continue to play a key role in providing the biomedical community free and easy access to genome sequences from the coronavirus. You can quickly access these data through the NCBI search (Figure 1).
Figure 1. NCBI search results for the term “SARS-COV-2” showing the schematic map of the viral assembly and annotation and buttons that link to the data in the NCBI Virus resource, a specialized BLAST page that searches Betacoronavirus sequences, and the reference assembly download. The bottom panel provides links to the CDC website for COVID-19 information and a link to GenBank®/SRA sequence data.
Get rapid access to Wuhan coronavirus (2019-nCoV) sequence data from the current outbreak as it becomes available. We will continue to update the page with newly released data.
The complete annotated genome sequence of the novel coronavirus associated with the outbreak of pneumonia in Wuhan, China is now available from GenBank for free and easy access by the global biomedical community. Figure 1 shows the relationship of the Wuhan virus to selected coronaviruses.
Figure 1. Phylogenetic tree showing the relationship of Wuhan-Hu-1 (circled in red) to selected coronaviruses. Nucleotide alignment was done with MUSCLE 3.8. The phylogenetic tree was estimated with MrBayes 3.2.6 with parameters for GTR+g+i. The scale bar indicates estimated substitutions per site, and all branch support values are 99.3% or higher.