SARS-CoV-2 genomic data is critical for monitoring viral spread and evolution during the COVID-19 pandemic, identifying newly emerging variants, and developing and evaluating countermeasures. As of September 2022, over 13 million SARS-CoV-2 genomes have been sequenced across the world, making it the most sequenced pathogen ever. A cornerstone of genomic analysis is building a phylogeny, which shows how individual isolates relate to the rest of the sequenced genomes. However, the volume of SARS-CoV-2 genomes presents novel opportunities beyond phylogenies, as well as computational challenges to traditional methods of genomic analysis and visualization. Continue reading “NCBI-NIAID Beyond Phylogenies Codeathon was a success!”
Tag: Genomics
Come see NCBI at the ASM Microbe Conference 2022
The American Society for Microbiology (ASM) Microbe conference is back and scheduled to take place in person, June 9th-13th in Washington, D.C.
NCBI staff member Dr. Michael Feldgarden will be recognized by ASM with an award for his research. Other NCBI staff will present posters on NCBI resources and will also be available at our booth (#1128) to address your questions. Drop by to see what’s new and provide your feedback. We hope to see you there! Check out NCBI’s schedule of activities: Continue reading “Come see NCBI at the ASM Microbe Conference 2022”
RefSeq Release 205 is available!
RefSeq release 205 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of March 1, 2021, and contains 269,975,565 records, including 197,232,209 proteins, 36,514,168 RNAs, and sequences from 108,257 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
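For readers who want to reach RefSeq records through E-utilities, here is a minimal sketch that builds an ESearch request URL. The `esearch.fcgi` endpoint and the `db`, `term`, `retmax`, and `retmode` parameters are part of E-utilities; the helper name `build_esearch_url` and the example search term are illustrative assumptions, not part of the release announcement.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="protein", retmax=5):
    """Return an ESearch URL for `term` in the Entrez database `db`.

    Hypothetical helper for illustration; it only builds the URL and
    does not contact NCBI.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative term: restrict to RefSeq records for one organism.
url = build_esearch_url(
    'srcdb_refseq[PROP] AND "Drosophila melanogaster"[Organism]'
)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a JSON list of matching record identifiers. Heavy users may also want to supply an API key, which this sketch omits.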
Improved access to SARS-CoV-2 data
NCBI Datasets has a simple, new way to get Coronaviridae data, including from SARS-CoV-2 (Figure 1). The data package includes genomic, protein and CDS sequences, annotation and a comprehensive data report for all complete genomes. You can also target your search to major taxonomic ranks within Coronaviridae.

Interested in a specific protein? The SARS-CoV-2 protein page allows you to choose a protein and download the corresponding sequences, annotation and representative structures from all annotated genomes (Figure 2).

Looking for programmatic access? NCBI Datasets offers the same Coronaviridae genomic data and SARS-CoV-2 protein data through a command-line tool and a RESTful API. These tools support additional filtering, including the ability to download only those genomes released after a date you specify.
We appreciate your feedback. Try NCBI Datasets and let us know what you think!
May 20 webinar: Exploring SRA metadata in the cloud with BigQuery
Join us on May 20th to learn how to use Google’s BigQuery to quickly search the data from the Sequence Read Archive (SRA) in the cloud to speed up your bioinformatic research and discovery projects. BigQuery is a tool for exploring cloud-based data tables with SQL-like queries. In this webinar, we’ll introduce you to using BigQuery to mine SRA submitter-supplied metadata and the results of taxonomic analysis for SRA runs. You’ll see real-world case studies that demonstrate how to find key information about SRA runs and identify data sets for your own analysis pipelines.
- Date and time: Wed, May 20, 2020 12:00 PM – 12:45 PM EDT
- Register
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
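To give a flavor of the SQL-like queries mentioned above, here is a minimal sketch that builds a parameterized query against the public SRA metadata table. It assumes the table is exposed as `nih-sra-datastore.sra.metadata` with `acc`, `organism`, `platform`, and `mbases` columns; the helper names and the example organism and platform values are illustrative. Running the query requires the `google-cloud-bigquery` package and GCP credentials, so that step is kept in a separate function that is not called here.

```python
def build_sra_metadata_query(limit=10):
    """Return SQL that lists the largest runs for one organism/platform.

    Assumes the public table `nih-sra-datastore.sra.metadata`; the
    @organism and @platform placeholders are BigQuery named parameters.
    """
    return (
        "SELECT acc, organism, platform, mbases\n"
        "FROM `nih-sra-datastore.sra.metadata`\n"
        "WHERE organism = @organism AND platform = @platform\n"
        "ORDER BY mbases DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql, params):
    """Run `sql` with string parameters (needs google-cloud-bigquery)."""
    from google.cloud import bigquery  # imported lazily; optional dependency

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(name, "STRING", value)
            for name, value in params.items()
        ]
    )
    rows = client.query(sql, job_config=job_config).result()
    return [dict(row.items()) for row in rows]


sql = build_sra_metadata_query(limit=5)
# Example values only; substitute the organism and platform you need.
params = {"organism": "Homo sapiens", "platform": "ILLUMINA"}
print(sql)
```

With credentials configured, `run_query(sql, params)` would return the matching rows as dictionaries. Passing values as query parameters rather than formatting them into the SQL string avoids quoting problems.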
Flies Are A-buzzing in RefSeq!
Are you interested in comparative genomics or other studies using Drosophila genomics?
Then don’t miss our online poster #568A at TAGC 2020 Online (no meeting registration required). Also, tune in to the online Q&A session on Monday, April 27 at 12:00 – 12:30 pm EDT.
What’s happening? In coordination with FlyBase, we are transitioning almost all of the RefSeq Drosophila assemblies to annotation produced primarily by NCBI’s eukaryotic genome annotation pipeline. We’ll continue to use the FlyBase annotation for Drosophila melanogaster (soon to be updated to Release 6.32), but we’ll annotate the other species using available RNA-seq datasets and our latest software. This will allow us to provide consistent, high-quality annotations across the full spectrum of Drosophila species, and also rapidly provide annotations as new high-quality assemblies become available. Another benefit is that these annotations will be available in the full suite of NCBI resources, including nucleotide, protein, BLAST, Gene, Genome Data Viewer, Genomes, Assembly, and more. You can download these annotation data from the NCBI genomes FTP site or you can try the new NCBI Datasets tool. By special request, we’re making orthology data relative to D. melanogaster available on the Gene FTP site, and plan to expose that data in our public pages in the future.
SRA cloud sequences hold the promise of additional discoveries related to COVID-19
We recently announced that we made all of the Sequence Read Archive (SRA) publicly available on two cloud platforms. This archive of genetic sequences is a treasure trove of information and the cloud environments provide high-performance computing capabilities via a GCP or AWS account – right from your own device. High-throughput sequencing has made generating data extremely fast and inexpensive, which has fueled the rapid growth of SRA. Putting it on the cloud makes it possible to analyze “the high-throughput, unassembled sequence data, across all such sequences”.
So, what are some of the potential discoveries that await? To investigate some of the possibilities, we have held a series of codeathons to see if known and unknown viruses could be found lurking within SRA cloud datasets. Spoiler alert – they are! And just recently, a team from Stanford reported that they were able to identify a 2019-nCoV-like coronavirus in pangolins by examining data sets identified via a meta-metagenomic search of SRA and downloaded using the SRA Toolkit. One challenge this team faced was downloading the datasets: 2.5 TB, corresponding to approximately 10^13 bases, took over 48 hours to gather. How might cloud-based SRA tools have made this task easier and faster? Here’s how:
- BigQuery: provides native, programmatic cloud access to SRA metadata and lets you search on it.
- SRA Toolkit: retrieves and reads sequencing files from the SRA datasets in the cloud and writes files in the same format.
- Coming soon to the cloud are tools for large-scale BLAST processing and a Read Alignment and Annotation Pipeline Tool (RAPT). These tools allow the data to be analyzed directly in the cloud, eliminating the need to download it to local storage for analysis.
- Also in the works is a mechanism to provide better access to taxonomic content of SRA runs as calculated by NCBI tools.
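As a rough illustration of the SRA Toolkit step in the list above, the sketch below assembles the `prefetch` and `fasterq-dump` commands for a single run and only executes them if the toolkit is installed. The helper names, the output directory, the thread count, and the accession `SRR000001` are placeholders chosen for this example, not values from the study described above.

```python
import shutil
import subprocess


def sra_fetch_commands(accession, outdir="fastq", threads=4):
    """Return the SRA Toolkit commands to fetch and convert one run."""
    return [
        # Download the run into the local SRA cache.
        ["prefetch", accession],
        # Convert the cached run to FASTQ files in `outdir`.
        ["fasterq-dump", accession, "--outdir", outdir,
         "--threads", str(threads)],
    ]


def run_if_available(commands):
    """Run the commands only when the first tool is on PATH."""
    if shutil.which(commands[0][0]) is None:
        print("SRA Toolkit not found on PATH; commands were not run.")
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True


# Placeholder accession; replace with runs found through a metadata search.
commands = sra_fetch_commands("SRR000001")
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_if_available(commands)` on a cloud virtual machine keeps the transfer inside the cloud provider's network, which is the scenario the list above describes.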
We are continually adding new functionality to better support your cloud workflows and are happy to help! Contact us at sra@ncbi.nlm.nih.gov if you have questions or need help getting started. If you need assistance setting up GCP or AWS, please follow the steps in our how-to videos on YouTube.
April 8 Webinar: Accelerate genomics discovery with SRA in the cloud
On Wednesday, April 8, 2020 at 12 PM, NCBI staff will show you how to leverage the cloud to speed up your research and discovery. You’ll be introduced to new and existing tools and data including BigQuery, SRA Toolkit, and more. You’ll hear about real workflows in the cloud, featuring an example of the work NCBI was able to accomplish in the cloud using SRA data and a case study from an SRA cloud customer.
By the end of this webinar, you will know where to look for new cloud products from NCBI, access help information to get you started, and will see how to run your analyses efficiently in the cloud.
- Date and time: Wed, Apr 8, 2020 12:00 PM – 12:45 PM EDT
- Register
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
View BAM alignments in the NCBI genome browsers and sequence viewers sorted by haplotype tag
NCBI’s genome browsers and graphical sequence viewers now allow you to view BAM alignments sorted by haplotype tag. This option is useful for analyzing variants within a sequenced sample and can help you detect or validate structural variants.
Figure 1. Remote BAM alignment data sorted by haplotype tag in the Genome Data Viewer. The remote BAM file was added through the “User Data and Track Hubs” feature in GDV. You can load the remote BAM for this example through https://go.usa.gov/xpM9c. The sorted display shows that haplotype 1 contains a significant deletion in this region relative to haplotype 2 and the reference genome assembly. Aligned reads not assigned a haplotype tag in the BAM file are grouped under the heading “haplotype not set” (not shown).
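To make the grouping concrete, here is a small sketch of how reads can be bucketed by haplotype. It assumes the haplotype is stored in the optional `HP:i:` tag, as phasing tools commonly write it, and it works on SAM-format text lines rather than a binary BAM file. The function names and the three example reads are made up for illustration; this is not the viewers' own implementation.

```python
from collections import defaultdict


def haplotype_of(sam_line):
    """Return the integer HP tag of a SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags follow the 11 mandatory SAM columns.
    for tag in fields[11:]:
        if tag.startswith("HP:i:"):
            return int(tag[5:])
    return None


def group_by_haplotype(sam_lines):
    """Map a haplotype label to the names of the reads carrying it."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        hp = haplotype_of(line)
        label = f"haplotype {hp}" if hp is not None else "haplotype not set"
        groups[label].append(line.split("\t", 1)[0])
    return dict(groups)


# Made-up records: two tagged reads and one without an HP tag.
reads = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tHP:i:1",
    "read2\t0\tchr1\t105\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tHP:i:2",
    "read3\t0\tchr1\t110\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF",
]
groups = group_by_haplotype(reads)
for label in sorted(groups):
    print(label, groups[label])
```

Reads without the tag fall into the same “haplotype not set” bucket that the caption mentions.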
December 11 Webinar: Running the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) on your own data
On Wednesday, December 11, 2019 at 12 PM, NCBI staff will present a webinar that will show you how to use NCBI’s PGAP (https://github.com/ncbi/pgap) on your own data to predict genes on bacterial and archaeal genomes using the same inputs and applications used inside NCBI. You can run PGAP on your own machine, a compute farm, or in the cloud. Plus, you can now submit genome sequences annotated by your copy of PGAP to GenBank. Attend the webinar to learn more!
- Date and time: Wed, Dec 11, 2019 12:00 PM – 12:45 PM EST
- Register
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.