Validating genome assemblies submitted to GenBank using an ANI-based workflow
Average Nucleotide Identity (ANI) analysis is a useful tool for verifying the taxonomic identity of prokaryotic genomes. As part of the NCBI bacterial genome submission process, GenBank performs ANI analyses to compare submitted prokaryotic genome assemblies against reference data generated from type strains. You can learn more about the relevant workflow and about type strain curation in our publications (PMC6978984 and PMC4383940).
We use genomes obtained from type strains (type assemblies) in computational comparisons, for example using ANI to reclassify or modify existing taxonomy with reasonable confidence. The taxonomy check status for all 1.3 million bacterial genome assemblies is summarized in the ANI_report_prokaryotes.txt file available from the ASSEMBLY_REPORTS FTP directory. The README file describes the contents of the report in detail. You can run ANI on your genome on its own or in the context of annotation. Find more information here.
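As a rough illustration of how the report could be summarized once downloaded, the sketch below tallies the values of one column in a tab-delimited file. The column name `taxonomy-check-status`, the status values, and the sample accessions are assumptions made for this example; check the README for the actual column names and value meanings before relying on them.

```python
import csv
import io
from collections import Counter


def tally_status(handle, status_column="taxonomy-check-status"):
    """Count how many rows carry each value of one column in a tab-delimited report.

    `status_column` is an assumed column name; consult the README for the real one.
    """
    counts = Counter()
    idx = None
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if idx is None:
            # Treat the first non-empty line as the header, which may start with "#".
            header = [name.lstrip("# ").strip() for name in row]
            if status_column not in header:
                raise KeyError(f"column {status_column!r} not found in header")
            idx = header.index(status_column)
            continue
        if len(row) > idx:
            counts[row[idx]] += 1
    return counts


# Made-up sample rows standing in for the real report (hypothetical accessions).
sample = io.StringIO(
    "# genbank-accession\ttaxonomy-check-status\n"
    "ACC_0001\tOK\n"
    "ACC_0002\tInconclusive\n"
    "ACC_0003\tOK\n"
)
counts = tally_status(sample)
print(counts)
```

To run this against the downloaded report, pass an open file handle instead of the in-memory sample, for example `open("ANI_report_prokaryotes.txt", newline="")`.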
Progress
Since 2017, we have corrected the taxonomy of over 7,000 new submissions before acceptance into GenBank and of over 1,800 genomes post-submission. We are currently correcting the taxonomy of over 500 more genomes. Using our ANI method, we have reclassified over 97% of all Enterobacter. Current data are available from our FTP site. See the README for each file for more information about the contents.
Taxonomy
Over one million bacterial genomes have confirmed taxonomy. However, 4,400 bacterial genomes are assigned to the wrong organism, and using ANI we found 3,500 bacterial genomes to be contaminated by other bacteria. For 137,000 bacterial genomes, there is not sufficient evidence to confirm taxonomy. To solve these problems, we need your help to increase the number of assemblies of type strains available as taxonomic reference points!
Type strains in need of community input
ANI processes rely on high-quality genome sequences of type strains. Currently, there are 19,439 genome assemblies from type material for 13,919 validly published species. However, many species still do not have any genome from type material. These are high-priority candidates for sequencing.
Table 1 lists the top ten species for which we have the most genome assemblies but no type strain assembly. Please consider sequencing and submitting the genome for these type strains.
Table 1: Top 10 species and type strains in need of a genome assembly
Species | Type strains |
Planktomarina temperata | DSM 22400, JCM 18269, RCA23 |
Coxiella burnetii | ATCC VR 615 |
Weissella confusa | ATCC 10881, DSM 20196, JCM 1093 |
Streptococcus iniae | ATCC 29178, BCCM/LMG 14520, DSM 20576 |
Fusicatenibacter saccharivorans | DSM 26062, JCM 18507, YIT 12554 |
Providencia stuartii | ATCC 29914, BCCM/LMG 3260, DSM 4539 |
Chlamydia suis | ATCC VR 1474, S45 |
Fusobacterium nucleatum | ATCC 25586, DSM 15643, JCM 8532 |
Vibrio breoganii | BCCM/LMG 23858, CECT 7222, RD 15.11 |
Francisella orientalis | BCCM/LMG 24544, DSM 21254, Ehime-1 |
This document has a complete list of prokaryotic genomes without a type strain assembly. See the README for details about the contents.
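For readers who want to build a similar ranking from their own assembly list, here is a minimal sketch of the idea behind Table 1: count assemblies per species, drop every species that has at least one assembly from type material, and keep the species with the most assemblies. The function name, the `(species, is_type)` record layout, and the toy records are all hypothetical and are not taken from the NCBI files or from the workflow NCBI actually uses.

```python
from collections import defaultdict


def species_missing_type_assembly(records, top_n=10):
    """Rank species that have assemblies but none from type material.

    `records` is an iterable of (species_name, is_type_assembly) pairs,
    a hypothetical layout chosen for this example.
    """
    totals = defaultdict(int)
    has_type = set()
    for species, is_type in records:
        totals[species] += 1
        if is_type:
            has_type.add(species)
    # Keep only species with no type assembly at all.
    missing = [(s, n) for s, n in totals.items() if s not in has_type]
    # Most assemblies first; ties broken alphabetically for a stable order.
    missing.sort(key=lambda item: (-item[1], item[0]))
    return missing[:top_n]


# Toy records with placeholder species names (not real data).
toy = (
    [("Species A", False)] * 3
    + [("Species B", True), ("Species B", False)]
    + [("Species C", False)]
)
ranked = species_missing_type_assembly(toy, top_n=2)
print(ranked)
```

Species B is excluded because one of its assemblies is flagged as type material, so only Species A and Species C remain in the ranking.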
Thanks for this interesting post.
A bit surprised however to see Xanthomonas fragariae in this Top 10 when the type strain was sequenced in 2017 (Gétaz et al., 2017) with accession number LT853882.
A good reminder to do what is necessary to get this reflected in the metadata of this genome!
Thanks for pointing this out. I’ll pass this along to our RefSeq prokaryote group for their input.