Expanded average nucleotide identity analysis now available for prokaryotic genome assemblies

As we described in an earlier post, GenBank uses average nucleotide identity (ANI) analysis to find and correct misidentified prokaryotic genome assemblies. You can now access ANI data for the more than 600,000 GenBank bacterial and archaeal genome assemblies through a downloadable report (ANI_report_prokaryotes.txt) available from the genomes/ASSEMBLY_REPORTS area of the FTP site. The README describes the contents of the report in detail. You can use the ANI data to evaluate the taxonomic identity of genome assemblies of interest for yourself.
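As a minimal sketch of working with the report, the Python below reads a tab-delimited file with a header row and selects rows by the value in one column. The column names, accessions, and status values shown are hypothetical placeholders, not taken from the report itself; the README is the authority on the actual columns and their values.

```python
import csv
import io

# Hypothetical stand-in for a few lines of ANI_report_prokaryotes.txt.
# Column names, accessions, and status values are placeholders; check the README.
SAMPLE = (
    "# genbank-accession\torganism-name\ttaxonomy-check-status\n"
    "GCA_000000001.1\tExample organism A\tOK\n"
    "GCA_000000002.1\tExample organism B\tFailed\n"
)


def read_report(handle):
    """Yield one dict per data row of a tab-delimited report with a header line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = None
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if header is None:
            # Assumption: the first non-blank line is the header,
            # optionally prefixed with "#".
            header = [name.lstrip("#").strip() for name in fields]
            continue
        yield dict(zip(header, fields))


def select_rows(rows, column, value):
    """Return the rows whose `column` equals `value`."""
    return [row for row in rows if row.get(column) == value]


rows = list(read_report(io.StringIO(SAMPLE)))
flagged = select_rows(rows, "taxonomy-check-status", "Failed")
for row in flagged:
    print(row["genbank-accession"], row["organism-name"])
```

To run this on the downloaded file, open it with `open(path, newline="")` and pass the file object to `read_report` in place of the in-memory sample.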

The new ANI_report_prokaryotes.txt replaces the older ANI_report_bacteria.txt in the same directory. We are no longer updating the ANI_report_bacteria.txt file and will remove it after 31st May 2020.

Important changes coming to prokaryotic Reference and Representative genome assemblies

We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.

  • We will reduce the set of Reference assemblies to the 15 that have annotation provided by outside experts (Table 1) and re-annotate the 105 other current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose Reference status.
  • We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.

Important changes to the genomes FTP site in February

We have added the latest NCBI Eukaryotic Genome Annotation Pipeline results for the more than 580 species that we annotate to the genomes/refseq directory on the genomes FTP area. As we announced in December, we will stop publishing annotation results to the genus_species directories (example: genomes/Xenopus_tropicalis) on the genomes FTP site effective February 1, 2020. We will also move existing genus_species directories to genomes/archive/old_refseq during the month of February.

Figure 1. The Assembly page for the Xenopus tropicalis UCB Xtro 10.0 (GCF_000004195.4) showing the blue download button. Annotation results such as the RefSeq transcript alignments that can be downloaded from the web page are now also under the genomes/refseq directory on the FTP site. The FTP path to the .bam alignment files is in red.

These FTP changes do not affect the Assembly download function. As always, you can download assembly data using the blue Download button on the web pages (Figure 1).
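For scripted access, a small sketch of how an assembly accession can be turned into an FTP directory prefix is shown below. It assumes the accession-based genomes/all layout, in which the nine digits of the accession are split into three-digit subdirectories; that layout is not described in this post, and the function name is hypothetical, so check the FTP site's own README before relying on it.

```python
import re


def assembly_ftp_prefix(accession):
    """Build the assumed genomes/all directory prefix for an assembly accession.

    Assumption: directories follow genomes/all/<GCA|GCF>/ddd/ddd/ddd, where the
    nine accession digits are split into groups of three.
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.(\d+)", accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, digits = match.group(1), match.group(2)
    # Split the nine digits into three chunks of three.
    chunks = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join(["genomes", "all", prefix, *chunks])


example_prefix = assembly_ftp_prefix("GCF_000004195.4")
print(example_prefix)
```

The function returns only the prefix; the final directory name also includes the assembly name, which this sketch does not try to guess.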

 

NCBI on YouTube: Get the most out of NCBI resources with these videos

Check out the latest videos on YouTube to learn how to best use NCBI graphical viewers, SRA, PGAP, and other resources.

Genome Data Viewer: Analyzing Remote BAM Alignment Files and Other Tips

This video shows you how to upload remote BAM files, demonstrates handy viewer settings such as Pileup display options, and highlights the helpful tooltips in the Genome Data Viewer (GDV). There’s also a brief blog post on the same topic.

Novel coronavirus complete genome from the Wuhan outbreak now available in GenBank

Updated!

Get rapid access to Wuhan coronavirus (2019-nCoV) sequence data from the current outbreak as it becomes available. We will continue to update the page with newly released data.

The complete annotated genome sequence of the novel coronavirus associated with the outbreak of pneumonia in Wuhan, China is now available from GenBank for free and easy access by the global biomedical community. Figure 1 shows the relationship of the Wuhan virus to selected coronaviruses.

Figure 1. Phylogenetic tree showing the relationship of Wuhan-Hu-1 (circled in red) to selected coronaviruses. Nucleotide alignment was done with MUSCLE 3.8. The phylogenetic tree was estimated with MrBayes 3.2.6 with parameters for GTR+G+I. The scale bar indicates estimated substitutions per site, and all branch support values are 99.3% or higher.

Recent improvements to the genome Assembly resource

We’re constantly making improvements to the NCBI genome Assembly resource. This post points out some recent advances, highlighted in Figure 1 and described in more detail below.

Figure 1. New improvements to the Assembly web pages. The results page showing the surveillance project filter (lower left), which excludes 28,220 Klebsiella pneumoniae assemblies from the Pathogen Detection Project, and the Download Assemblies button with a link to the File type description (circled in red, upper right). For other improvements in the Download Assemblies menu see our recent post.

Genome Workbench is now in the cloud!

If you’re interested in visualizing and analyzing genomic data, then you’ll want to check out a new way to run Genome Workbench: in the cloud! Genome Workbench is a desktop application (for both Windows and Mac) that lets you analyze genomic data in one place. You can run tools such as BLAST, create views such as multiple sequence alignments, and much more. You can now run Genome Workbench in a cloud environment from your local desktop computer. This manual will show you how.

There are many advantages to using Genome Workbench in the cloud:

  • You can easily compare your data to the complete GenBank and RefSeq datasets without needing to download them
  • You can run BLAST searches against standard databases or any custom databases you’ve assembled in the cloud
  • All of the data (e.g. FASTA, BAM, GFF files) remain in the cloud with no need for local copies
  • You won’t pay egress fees for downloading data

Give it a try and let us know how it goes!

New download files and FTP directories for genome assemblies

You can now download new file types for species recently annotated by the NCBI Eukaryotic Genome Annotation Pipeline from the Assembly web pages and from the genomes/refseq FTP area. The new file types include alignments of annotated transcripts to the assembly in BAM format, all models predicted by Gnomon, and (for species that have been annotated multiple times) files characterizing the feature-by-feature differences between the current and the previous annotation.

December 11 Webinar: Running the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) on your own data

On Wednesday, December 11, 2019 at 12 PM, NCBI staff will present a webinar that will show you how to use NCBI’s PGAP (https://github.com/ncbi/pgap) on your own data to predict genes on bacterial and archaeal genomes using the same inputs and applications used inside NCBI. You can run PGAP on your own machine, on a compute farm, or in the cloud. Plus, you can now submit genome sequences annotated by your copy of PGAP to GenBank. Attend the webinar to learn more!

  • Date and time: Wed, Dec 11, 2019 12:00 PM – 12:45 PM EST
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

New release of the Prokaryotic Genome Annotation Pipeline with updated tRNAscan and protein models

A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is now available on GitHub. This release uses a new and improved version of tRNAscan (tRNAscan-SE:2.0.4) and includes our most up-to-date Hidden Markov Model and BlastRule collections for naming proteins.

Remember that you can submit the results of PGAP to GenBank. Or, if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors mode to get a preliminary annotation.
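As a rough sketch, a script that launches PGAP could assemble the command line as below, adding `--ignore-all-errors` only when a preliminary annotation is wanted. The launcher name `./pgap.py`, the `-o` output option, and the positional input YAML file are assumptions about a typical local setup, and `build_pgap_command` is a hypothetical helper; check the PGAP documentation or the launcher's help output for the options your version accepts.

```python
import shlex


def build_pgap_command(input_yaml, output_dir, preliminary=False,
                       launcher="./pgap.py"):
    """Return the PGAP command as a list of arguments.

    Assumptions: the launcher script, the -o option, and the positional
    input YAML reflect a typical setup and may differ between versions.
    """
    cmd = [launcher, "-o", output_dir]
    if preliminary:
        # Continue past pre-annotation validation errors to get a
        # preliminary annotation.
        cmd.append("--ignore-all-errors")
    cmd.append(input_yaml)
    return cmd


cmd = build_pgap_command("input.yaml", "my_results", preliminary=True)
# Show the command as it would be typed in a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with file names that contain spaces.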

See our previous post and our documentation for details on how to set up and run PGAP yourself.

Try PGAP and let us know how you like it!