GenBank submitters, is your genome assembly within the expected size range?

Validation issues can delay the processing of your submissions to GenBank. To avoid one type of delay, use the new “expected genome size” API to check the length of your genome assembly before submission.

The API compares the size of submitted genome assemblies to the expected genome size range for the species to identify outliers that can result from errors such as:

  • incorrect organism assignment
  • metagenome submitted as an organism genome
  • targeted sub-genome assembly not flagged as partial genome representation
  • gross contamination with other sequences

You can check in advance for these possible problems using the API. The API takes two inputs: the taxid for the species (taxid = Taxonomy ID; see our Taxonomy quick start guide on how to find the taxid for a given species) and the length of your assembly, excluding gaps and runs of Ns. It returns XML with the expected length, the acceptable range, and a status that tells you whether your assembly is too large, too small, or within the acceptable range. Look for <length_status>within_range</length_status>, which confirms that your sequence passes the test!
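
If you need to work out the input length from a FASTA file, a minimal Python sketch might look like the one below. The function name `ungapped_length` is hypothetical, and the sketch assumes that "excluding gaps and runs of Ns" can be approximated by skipping every N and every `-` character; NCBI's exact definition of a gap may differ, so treat the result as an estimate.

```python
from typing import Iterable


def ungapped_length(fasta_lines: Iterable[str]) -> int:
    """Count sequence characters in FASTA text, skipping header lines,
    N/n bases, and '-' gap characters.

    Assumption: every N counts as gap sequence. This is a simplification
    and may not match NCBI's own ungapped-length calculation exactly.
    """
    total = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += sum(1 for base in line if base not in "Nn-")
    return total
```

To use it on a file, pass an open file handle, for example `with open("assembly.fasta") as handle: print(ungapped_length(handle))`, where `assembly.fasta` is a placeholder file name.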

Try the following examples:

https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=1773&length=4.41M
https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=562&length=7221235
https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=5476&length=5.72M
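
The same requests can also be made from a script. The following Python sketch builds the URL from the two query parameters shown in the examples above and pulls the `length_status` value out of the returned XML. The function names are hypothetical, and the sketch assumes only what is described here: that the response is XML containing a `length_status` element with no XML namespace. It does not rely on any other element names.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Optional

BASE_URL = "https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size"


def build_url(species_taxid: int, length: int) -> str:
    """Build the request URL from the species taxid and assembly length."""
    query = urllib.parse.urlencode(
        {"species_taxid": species_taxid, "length": length}
    )
    return f"{BASE_URL}?{query}"


def parse_length_status(xml_text: str) -> Optional[str]:
    """Return the text of the first <length_status> element, or None if absent.

    Assumption: the element carries no XML namespace.
    """
    root = ET.fromstring(xml_text)
    element = next(root.iter("length_status"), None)
    if element is None or element.text is None:
        return None
    return element.text.strip()


def check_genome_size(
    species_taxid: int, length: int, timeout: float = 30.0
) -> Optional[str]:
    """Query the API over the network and return the length_status value."""
    url = build_url(species_taxid, length)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_length_status(response.read().decode("utf-8"))
```

For example, `check_genome_size(562, 7221235)` sends the second request listed above and returns whatever status the service reports for that length.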

For more information, see the Genome Size Check documentation.
