Average Nucleotide Identity (ANI) for assembly validation

Validating genome assemblies submitted to GenBank using an ANI-based workflow

Average Nucleotide Identity (ANI) analysis is a useful tool for verifying the taxonomic identity of prokaryotic genomes. As part of the NCBI bacterial genome submission process, GenBank performs ANI analyses to compare submitted prokaryotic genome assemblies against reference data generated from type strains. You can learn more about the relevant workflow and about type strain curation in our publications (PMC6978984 and PMC4383940).

We use genomes obtained from type strains (type assemblies) in computational comparisons, for example, using ANI to reclassify or modify existing taxonomy with reasonable confidence. The taxonomy check status for all 1.3 million bacterial genome assemblies is summarized in the ANI_report_prokaryotes.txt file available from the ASSEMBLY_REPORTS FTP directory. The README file describes the contents of the report in detail. You can run ANI on your genome on its own or in the context of annotation. Find more information here.
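As a rough illustration of how such a report might be summarized, the sketch below tallies assemblies by taxonomy check status. It assumes the report is tab-delimited with a header line, and the column name `taxonomy-check-status`, the accession values, and the status values in the sample are placeholders chosen for illustration; check the README for the actual layout before adapting it.

```python
import csv
import io
from collections import Counter


def tally_status(report_text, status_column="taxonomy-check-status"):
    """Count how many assemblies fall under each value of the status column.

    The column name is an assumption; confirm it against the README.
    """
    # Drop a leading "#" (and spaces) on the header line so column names are clean.
    cleaned = report_text.lstrip("# ")
    reader = csv.DictReader(io.StringIO(cleaned), delimiter="\t")
    return Counter(row[status_column] for row in reader)


# Made-up rows for illustration only; these are not real report entries.
sample = (
    "# accession\ttaxonomy-check-status\n"
    "ASM_0001\tOK\n"
    "ASM_0002\tInconclusive\n"
    "ASM_0003\tOK\n"
    "ASM_0004\tFailed\n"
)

counts = tally_status(sample)
for status, n in counts.most_common():
    print(status, n)
```

To run this on a downloaded copy of the report, read the file into a string and pass it to `tally_status` in place of `sample`.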

Progress

Since 2017, we have corrected the taxonomy for over 7,000 new submissions before acceptance into GenBank and corrected over 1,800 genomes post-submission. Currently, we are in the process of correcting the taxonomy of over 500 genomes. Using our ANI method, we have reclassified over 97% of all Enterobacter. Current data are available from our FTP site. See the README for each file for more information about the contents.

Taxonomy

Over one million bacterial genomes have confirmed taxonomy. However, 4,400 bacterial genomes are assigned to the wrong organism, and using ANI we found 3,500 bacterial genomes to be contaminated by other bacteria. For 137,000 bacterial genomes, there is not sufficient evidence to confirm taxonomy. To solve these problems, we need your help to increase the number of type strain assemblies available as taxonomic reference points!

Type strains in need of community input

ANI processes rely on high-quality genome sequences of type strains. Currently, there are 19,439 genome assemblies from type material for 13,919 validly published species. However, many species still do not have any genome from type material. These are high-priority candidates for sequencing.

Table 1 lists the top ten species for which we have the most genome assemblies but no type strain assembly. Please consider sequencing and submitting the genome for these type strains.

Table 1: Top 10 species and type strains in need of a genome assembly

Species                   Type strains
Francisella tularensis    ATCC 6223, B-38, GIEM Schu
Streptococcus iniae       ATCC 29178, DSM 20576, LMG 14520
Vibrio cyclitrophicus     ATCC 700982, LMG 21359, NBRC 107756
Weissella confusa         ATCC 10881, DSM 20196, LMG 9497
Coxiella burnetii         ATCC VR 615
Providencia stuartii      ATCC 29914, DSM 4539, LMG 3260
Xanthomonas fragariae     ATCC 33239, DSM 3587, LMG 708
Chlamydia suis            ATCC VR 1474, S45
Fusobacterium nucleatum   ATCC 25586, DSM 15643, LMG 13131
Leptospira weilii         ATCC 43285, DSM 22357, Celledoni

This document has a complete list of prokaryotic genomes without a type strain assembly. See the README for details about the contents.
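The selection behind a list like Table 1 (species with the most assemblies but no type strain assembly) can be sketched as follows. This is a minimal illustration, not the procedure NCBI uses: the function name, the input shape (pairs of species name and a type-strain flag), and the records are all hypothetical.

```python
from collections import Counter


def species_needing_type_assembly(assemblies, top_n=10):
    """Rank species by assembly count, keeping only those with no type-strain assembly.

    `assemblies` is an iterable of (species_name, is_from_type_strain) pairs.
    """
    totals = Counter()
    has_type = set()
    for species, from_type in assemblies:
        totals[species] += 1
        if from_type:
            has_type.add(species)
    candidates = [(sp, n) for sp, n in totals.items() if sp not in has_type]
    # Sort by descending count, then by name so ties are ordered deterministically.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:top_n]


# Hypothetical records for illustration; the species names are placeholders.
records = [
    ("Species A", False), ("Species A", False), ("Species A", False),
    ("Species B", True), ("Species B", False),
    ("Species C", False), ("Species C", False),
    ("Species D", False),
]

ranking = species_needing_type_assembly(records, top_n=2)
print(ranking)
```

Here "Species B" is excluded because one of its assemblies comes from a type strain, so it already has a taxonomic reference point.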

2 thoughts on “Average Nucleotide Identity (ANI) for assembly validation”

  1. Thanks for this interesting post.
    A bit surprised however to see Xanthomonas fragariae in this Top 10 when the type strain was sequenced in 2017 (Gétaz et al., 2017) with accession number LT853882.
    A good reminder to do what is necessary to get this reflected in the metadata of this genome!
