Do you need to download a lot of genomic data? Maybe you need all primate reference genomes, or maybe you need just a few really big genomes? Prior to the advent of NCBI Datasets, downloading such a large amount of data could be a frustrating and time-consuming experience involving failed downloads and custom scripts.
NCBI Datasets makes large genome downloads simpler, faster, and more reliable. You don’t have to write a script. You can be sure you get all the data requested. And sharing the data is easier than ever. Figure 1 shows an example data download process using Datasets.
Figure 1. Downloading and processing genomic data using NCBI Datasets. The example shows downloading the set of RefSeq primate assemblies through the Datasets web interface. Since the downloaded files would exceed 15 GB, the file comes as a “dehydrated bag”: a small, easily downloaded, zipped file with metadata and links to download the data. After unzipping the bag, you can “rehydrate” the dehydrated files (fill them with the corresponding data) using the datasets command-line tool.
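The unzip-then-rehydrate step described in the caption can be scripted. The following is a minimal Python sketch, not an official recipe: the function names and file paths are hypothetical, and the `datasets rehydrate --directory` invocation is assumed from the tool's documentation, so check `datasets rehydrate --help` for your installed version before relying on it.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path


def unpack_bag(zip_path, dest_dir):
    """Unzip the dehydrated bag into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as bag:
        bag.extractall(dest)
    return dest


def rehydrate_command(bag_dir):
    """Build the command line for the datasets tool.

    Assumption: the tool accepts `rehydrate --directory <dir>`.
    """
    return ["datasets", "rehydrate", "--directory", str(bag_dir)]


def rehydrate(zip_path, dest_dir):
    """Unpack the bag, then fetch the data if the datasets tool is installed."""
    bag_dir = unpack_bag(zip_path, dest_dir)
    if shutil.which("datasets") is None:
        raise RuntimeError("datasets command-line tool not found on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(rehydrate_command(bag_dir), check=True)
    return bag_dir
```

For example, `rehydrate("primates.zip", "primates")` would unpack a bag saved under the hypothetical name primates.zip and then download the sequence data into the primates directory.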
The Prokaryote type strain report provides information on type strains for over 18,000 species. We revised and expanded the report to make it easier to identify cases where sequencing or establishing type material would have the biggest impact on improving prokaryote taxonomy and accurate identification. These cases include species with designated type strains but without a sequenced type strain assembly, and species without designated type material. We hope that the community will prioritize sequencing type strains for the former set of species (Table 1) and establishing a neotype or reftype, where applicable (as defined in Ciufo et al. 2018), for the latter set (Table 2).
Table 1. The top 10 candidate species for sequencing type strains, sorted by the number of assemblies. These have designated type strains but no type strain assembly. We generated the list by sorting by “number of assemblies from type materials per species”, then by decreasing “number of assemblies per taxon”, then filtering out “type materials and coidentical strains” = “na”.
Table 2. The top 10 candidates for proposing a reftype assembly, or neotype where applicable, sorted by the number of assemblies. These species have no designated type strain. We generated the list by selecting for “type materials and coidentical strains” = “na”, “number of assemblies from type materials per species” = 0, and sorting by decreasing “number of assemblies per taxon”, then filtering out Candidatus.
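The selection steps in the two table captions can be reproduced on the report once it is loaded as a list of rows. This is a rough Python sketch under stated assumptions: the three quoted column names come from the captions, but the "species name" column and the row layout (one dict per species) are hypothetical. For Table 1, the sketch expresses the sort on the type-material assembly count as an explicit filter for zero, which is an interpretation of the caption.

```python
TYPE_COL = "type materials and coidentical strains"
TYPE_ASM_COL = "number of assemblies from type materials per species"
ASM_COL = "number of assemblies per taxon"
NAME_COL = "species name"  # hypothetical column name, not from the report


def sequencing_candidates(rows, top=10):
    """Table 1: designated type strain exists, but no type strain assembly."""
    kept = [
        r for r in rows
        if r[TYPE_COL] != "na" and int(r[TYPE_ASM_COL]) == 0
    ]
    # Most-sequenced species first.
    kept.sort(key=lambda r: int(r[ASM_COL]), reverse=True)
    return kept[:top]


def reftype_candidates(rows, top=10):
    """Table 2: no designated type material, Candidatus entries excluded."""
    kept = [
        r for r in rows
        if r[TYPE_COL] == "na"
        and int(r[TYPE_ASM_COL]) == 0
        and not r[NAME_COL].startswith("Candidatus")
    ]
    kept.sort(key=lambda r: int(r[ASM_COL]), reverse=True)
    return kept[:top]
```

Both functions leave the input rows untouched and return at most `top` entries, so the same loaded report can feed both tables.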
As we described in an earlier post, GenBank uses average nucleotide identity (ANI) analysis to find and correct misidentified prokaryotic genome assemblies. You can now access ANI data for the more than 600,000 GenBank bacterial and archaeal genome assemblies through a downloadable report (ANI_report_prokaryotes.txt) available from the genomes/ASSEMBLY_REPORTS area of the FTP site. The README describes the contents of the report in detail. You can use the ANI data to evaluate the taxonomic identity of genome assemblies of interest for yourself.
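Once the report is downloaded, a few lines of Python are enough to look up assemblies of interest. The sketch below assumes the file is tab-delimited with a single header line that may begin with "#"; that layout and every column name used in the example are assumptions, so consult the README for the actual columns.

```python
def read_report(handle):
    """Yield one dict per data row of a tab-delimited report.

    Assumption: the first non-empty line is the header, optionally
    prefixed with '#'.
    """
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if header is None:
            header = line.lstrip("# ").split("\t")
            continue
        yield dict(zip(header, line.split("\t")))


def select_rows(rows, column, value):
    """Return the rows whose `column` equals `value`."""
    return [r for r in rows if r.get(column) == value]
```

A typical use would be `with open("ANI_report_prokaryotes.txt") as fh: rows = list(read_report(fh))`, followed by `select_rows(rows, column, value)` with a column name taken from the README.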
The new ANI_report_prokaryotes.txt replaces the older ANI_report_bacteria.txt in the same directory. We are no longer updating the ANI_report_bacteria.txt file and will remove it after 31st May 2020.
We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.
We will reduce the number of Reference assemblies to 15 that have annotation provided by outside experts (Table 1) and re-annotate the 105 other current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose reference status.
We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.
We have added the latest NCBI Eukaryotic Genome Annotation Pipeline results for the more than 580 species that we annotate to the genomes/refseq directory on the genomes FTP area. As we announced in December, we will stop publishing annotation results to the genus_species directories (example: genomes/Xenopus_tropicalis) on the genomes FTP site effective February 1, 2020. We will also move existing genus_species directories to genomes/archive/old_refseq during the month of February.
Figure 1. The Assembly page for the Xenopus tropicalis UCB Xtro 10.0 (GCF_000004195.4) showing the blue download button. Annotation results such as the RefSeq transcript alignments that can be downloaded from the web page are now also under the genomes/refseq directory on the FTP site. The FTP path to the .bam alignment files is in red.
These FTP changes do not affect the Assembly download function. As always, you can download assembly data using the blue Download button on the web pages (Figure 1).
Check out the latest videos on YouTube to learn how to best use NCBI graphical viewers, SRA, PGAP, and other resources.
Genome Data Viewer: Analyzing Remote BAM Alignment Files and Other Tips
This video shows you how to upload remote BAM files, and succinctly demonstrates handy viewer settings, such as Pileup display options, and highlights the very helpful tooltips in the Genome Data Viewer (GDV). There’s also a brief blog post on the same topic.
Get rapid access to Wuhan coronavirus (2019-nCoV) sequence data from the current outbreak as it becomes available. We will continue to update the page with newly released data.
The complete annotated genome sequence of the novel coronavirus associated with the outbreak of pneumonia in Wuhan, China is now available from GenBank for free and easy access by the global biomedical community. Figure 1 shows the relationship of the Wuhan virus to selected coronaviruses.
Figure 1. Phylogenetic tree showing the relationship of Wuhan-Hu-1 (circled in red) to selected coronaviruses. Nucleotide alignment was done with MUSCLE 3.8. The phylogenetic tree was estimated with MrBayes 3.2.6 with parameters for GTR+g+i. The scale bar indicates estimated substitutions per site, and all branch support values are 99.3% or higher.
We’re constantly making improvements to the NCBI genome Assembly resource. This post points out some recent advances, highlighted in Figure 1 and described in more detail below.
Figure 1. New improvements to the Assembly web pages. The results page showing the surveillance project filter (lower left), which excludes 28,220 Klebsiella pneumoniae assemblies from the Pathogen Detection Project, and the Download Assemblies button with a link to the File type description (circled in red, upper right). For other improvements in the Download Assemblies menu see our recent post.
If you’re interested in visualizing and analyzing genomic data, then you’ll want to check out a new way to run Genome Workbench: in the cloud! Genome Workbench is a desktop application (for both Windows and Mac) that lets you analyze genomic data in one place. You can run tools such as BLAST, create views such as multiple sequence alignments, and much more. You can run Genome Workbench in a cloud environment from your local desktop computer. This manual will show you how.
There are many advantages to using Genome Workbench in the cloud:
You can easily compare your data to the complete GenBank and RefSeq datasets without needing to download them
You can run BLAST searches against standard databases or any custom databases you’ve assembled in the cloud
All of the data (e.g. FASTA, BAM, GFF files) remain in the cloud with no need for local copies
You can now download new file types for species recently annotated by the NCBI Eukaryotic Genome Annotation Pipeline from the Assembly web pages and from the genomes/refseq FTP area. The new file types include alignments of annotated transcripts to the assembly in BAM format, all models predicted by Gnomon, and, for species that have been annotated multiple times, files characterizing the feature-by-feature differences between the current and the previous annotation.