Validating genome assemblies submitted to GenBank using an ANI-based workflow
Average Nucleotide Identity (ANI) analysis is a useful tool for verifying the taxonomic identity of prokaryotic genomes. As part of the NCBI bacterial genome submission process, GenBank performs ANI analyses to compare submitted prokaryotic genome assemblies against reference data generated from type strains. You can learn more about the relevant workflow and about type strain curation in our publications (PMC6978984 and PMC4383940).
We use genomes obtained from type strains (type assemblies) in computational comparisons, for example, using ANI to reclassify or modify existing taxonomy with reasonable confidence. The taxonomy check status for all 1.3 million bacterial genome assemblies is summarized in the ANI_report_prokaryotes.txt file, available from the ASSEMBLY_REPORTS FTP directory. The README file describes the contents of the report in detail. You can run ANI on your genome on its own or in the context of annotation. Find more information here.
Since 2017, we have corrected the taxonomy of over 7,000 new submissions before acceptance into GenBank and of over 1,800 genomes after submission. We are currently correcting the taxonomy of over 500 more genomes. Using our ANI method, we have reclassified over 97% of all Enterobacter. Current data are available from our FTP site. See the README for each file for more information about the contents.
Over one million bacterial genomes have confirmed taxonomy. However, 4,400 bacterial genomes are assigned to the wrong organism, and using ANI we found 3,500 bacterial genomes to be contaminated by other bacteria. For 137,000 bacterial genomes, there is not sufficient evidence to confirm taxonomy. To solve these problems, we need your help to increase the number of type strain assemblies available as taxonomic reference points!
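If you would like to tally taxonomy check statuses yourself from a local copy of ANI_report_prokaryotes.txt, the following is a minimal sketch, not a description of our pipeline. It assumes the report is tab-delimited with a single header line, and that the status lives in a column named `taxonomy-check-status`; that column name, the other column names, and the status values and accessions in the sample are assumptions made for illustration, so check the README for the actual layout.

```python
import csv
import io
from collections import Counter


def tally_status(lines, status_column="taxonomy-check-status"):
    """Count rows per value of `status_column` in a tab-delimited report.

    `status_column` is an assumed column name; adjust it to match the README.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Drop a leading "# " from the first header field, if present.
    header[0] = header[0].lstrip("# ")
    idx = header.index(status_column)
    counts = Counter()
    for row in reader:
        if len(row) > idx:  # skip short or blank lines
            counts[row[idx]] += 1
    return counts


# Made-up sample rows; accessions and status values are placeholders.
sample = io.StringIO(
    "# genbank-accession\tspecies-name\ttaxonomy-check-status\n"
    "ACC_1\tExample species A\tOK\n"
    "ACC_2\tExample species A\tOK\n"
    "ACC_3\tExample species B\tInconclusive\n"
    "ACC_4\tExample species C\tFailed\n"
)
sample_counts = tally_status(sample)
for status, n in sample_counts.most_common():
    print(status, n)
```

To run it on the real report, pass an open file handle instead of the sample, for example `tally_status(open("ANI_report_prokaryotes.txt", encoding="utf-8"))`.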
Type strains in need of community input
ANI processes rely on high-quality genome sequences of type strains. Currently there are 19,439 genome assemblies from type material for 13,919 validly published species. However, many species still lack any genome assembly from type material. These are high-priority candidates for sequencing.
Table 1 lists the top ten species for which we have the most genome assemblies but no type strain assembly. Please consider sequencing and submitting the genome for these type strains.
Table 1: Top 10 species and type strains in need of a genome assembly
|Species|Type strain identifiers|
|Francisella tularensis|ATCC 6223, B-38, GIEM Schu|
|Streptococcus iniae|ATCC 29178, DSM 20576, LMG 14520|
|Vibrio cyclitrophicus|ATCC 700982, LMG 21359, NBRC 107756|
|Weissella confusa|ATCC 10881, DSM 20196, LMG 9497|
|Coxiella burnetii|ATCC VR 615|
|Providencia stuartii|ATCC 29914, DSM 4539, LMG 3260|
|Xanthomonas fragariae|ATCC 33239, DSM 3587, LMG 708|
|Chlamydia suis|ATCC VR 1474, S45|
|Fusobacterium nucleatum|ATCC 25586, DSM 15643, LMG 13131|
|Leptospira weilii|ATCC 43285, DSM 22357, Celledoni|
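A ranking like the one in Table 1 (species with the most genome assemblies but no type strain assembly) could be sketched as follows. This is an illustrative outline, not the method we used to build the table. The function name, the `(species, is_type_assembly)` record shape, and the sample data are all hypothetical.

```python
from collections import Counter


def species_lacking_type_assembly(records, top_n=10):
    """Rank species that have assemblies but none from type material.

    `records` is an iterable of (species_name, is_type_assembly) pairs,
    a hypothetical shape chosen for this sketch.
    """
    totals = Counter()
    has_type = set()
    for species, is_type in records:
        totals[species] += 1
        if is_type:
            has_type.add(species)
    # Keep only species with no type assembly at all.
    candidates = [(s, n) for s, n in totals.items() if s not in has_type]
    # Most assemblies first; break ties alphabetically for stable output.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:top_n]


# Made-up records for illustration only.
records = [
    ("Species A", False), ("Species A", False), ("Species A", False),
    ("Species B", True), ("Species B", False),
    ("Species B", False), ("Species B", False),
    ("Species C", False), ("Species C", False),
]
ranking = species_lacking_type_assembly(records, top_n=2)
print(ranking)  # [('Species A', 3), ('Species C', 2)]
```

Species B is excluded despite having the most assemblies, because one of them comes from type material.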