GenBank release 234.0 (10/14/2019) is now available on the NCBI FTP site. This release has 6.69 trillion bases and 1.68 billion records.
The release has 216,763,706 traditional records containing 386,197,018,538 base pairs of sequence data. There are also 1,097,629,174 WGS records containing 5,985,250,251,028 base pairs of sequence data, 342,811,151 bulk-oriented TSA records containing 305,371,891,408 base pairs of sequence data, and 27,460,978 bulk-oriented TLS records containing 10,848,455,369 base pairs of sequence data.
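As a quick sanity check, the headline totals can be reproduced by summing the four divisions listed above. This is only a sketch; the dictionary keys are illustrative labels, and the counts are copied from the release figures in this post.

```python
# Per-division counts for release 234.0, copied from the text above.
# Each entry is (records, base_pairs); the keys are illustrative labels.
divisions = {
    "traditional": (216_763_706, 386_197_018_538),
    "WGS": (1_097_629_174, 5_985_250_251_028),
    "TSA": (342_811_151, 305_371_891_408),
    "TLS": (27_460_978, 10_848_455_369),
}

total_records = sum(records for records, _ in divisions.values())
total_bases = sum(bases for _, bases in divisions.values())

# Express the totals in the units used in the headline.
records_in_billions = round(total_records / 1e9, 2)
bases_in_trillions = round(total_bases / 1e12, 2)

print(f"{bases_in_trillions} trillion bases, {records_in_billions} billion records")
```

The rounded totals agree with the 6.69 trillion bases and 1.68 billion records quoted for the release.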
Validation issues can delay the processing of your submissions to GenBank. To avoid one type of delay, use the new “expected genome size” API to check the length of your genome assembly before submission.
The API compares the size of submitted genome assemblies to the expected genome size range for the species to identify outliers that can result from errors such as:
- incorrect organism assignment
- metagenome submitted as an organism genome
- targeted sub-genome assembly not flagged as partial genome representation
- gross contamination with other sequences
You can check in advance for these possible problems using the API. The API takes two inputs: the taxid for the species (taxid = Taxonomy ID; see our Taxonomy quick start guide on how to find the taxid for a given species) and the length of your assembly (excluding gaps and runs of Ns). It returns XML with the expected length, the acceptable range, and a status that tells you whether your assembly is too large, too small, or within the acceptable range. Look for <length_status>within_range</length_status>, which confirms that your sequence passes the test!
Try the following examples:
For more information, see the Genome Size Check documentation.
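If you would rather script the check, the sketch below shows one way to build a query and read the status from the XML reply. Only the <length_status> element and the within_range value come from this post. The endpoint URL is a placeholder, and the parameter names and the <response> wrapper in the sample reply are assumptions; take the real values from the Genome Size Check documentation.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint, not the real service URL.
BASE_URL = "https://example.invalid/expected_genome_size"


def build_query(species_taxid: int, length: int) -> str:
    """Build a query URL; the parameter names here are hypothetical."""
    params = {"species_taxid": species_taxid, "length": length}
    return f"{BASE_URL}?{urlencode(params)}"


def length_status(xml_text: str) -> Optional[str]:
    """Return the text of the first <length_status> element, if any."""
    root = ET.fromstring(xml_text)
    node = next(root.iter("length_status"), None)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def passes_size_check(xml_text: str) -> bool:
    """True only when the reply reports within_range."""
    return length_status(xml_text) == "within_range"


# Hypothetical reply: only <length_status> is taken from the post;
# the <response> wrapper is made up for illustration.
sample_reply = "<response><length_status>within_range</length_status></response>"

# Taxid 562 is Escherichia coli; the length is an illustrative value.
print(build_query(562, 5_000_000))
print(passes_size_check(sample_reply))
```

Remember that the length you pass should exclude gaps and runs of Ns, as described above.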
We have released a new version of the Prokaryotic Genome Annotation Pipeline (PGAP), available on GitHub. The new release includes the ability to ignore pre-annotation validation errors (--ignore-all-errors). This new feature allows you to produce a preliminary annotation for a draft version of the genome, even one that contains vector and adapter sequences or that is outside of the size range for the species. This draft annotation should be helpful with your ongoing work on the genome assembly. Please keep in mind that these pre-annotations and assemblies with contaminants or other errors are not suitable for submission to GenBank.
Another new feature allows you to provide the name of the consortium that generated the assembly and annotation so that this information appears in the final GenBank records. For more details, consult our guidelines on input files.
See our previous post and our documentation for details on how to obtain and run PGAP yourself.
Next on our to-do list is a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned!
GenBank release 233.0 (8/21/2019) is now available on the NCBI FTP site. This release has 6.26 terabases and 1.65 billion records.
The release has 213,865,349 traditional records containing 366.7 billion base pairs of sequence data. There are also 1.07 billion WGS records containing 5.6 trillion base pairs of sequence data, 331.3 million bulk-oriented TSA records containing 294.7 billion base pairs of sequence data, and 26 million bulk-oriented TLS records containing 10.5 billion base pairs of sequence data.
In July 2018, NCBI announced plans to retire the EST and GSS databases, and we have now implemented these changes. We will continue to accept submissions of EST and GSS sequences, but will no longer provide special processes for these sequence types. If you want to submit EST and GSS data, please use tbl2asn. For further details, please visit https://www.ncbi.nlm.nih.gov/genbank/dbest/ or https://www.ncbi.nlm.nih.gov/genbank/dbgss/ or contact email@example.com.
We thank all past and present submitters of EST and GSS data for the invaluable benefit these data have provided to numerous genomic sequencing projects over the years. Please let us know if you have any questions or concerns about these changes!
GenBank release 232.0 (6/20/2019) is now available on the NCBI FTP site. This release has 5.47 terabases and 1.58 billion records.
The release has 213 million traditional records containing 329.8 billion base pairs of sequence data. There are also 1 billion WGS records containing 4.8 trillion base pairs of sequence data, 319.9 million bulk-oriented TSA records containing 285.3 billion base pairs of sequence data, and 25 million bulk-oriented TLS records containing 10 billion base pairs of sequence data.
We are happy to announce that you can now submit your genome sequences annotated by your own local copy of the standalone Prokaryotic Genome Annotation Pipeline (PGAP) to GenBank.
How does it work? Download PGAP from GitHub, provide some basic information and the FASTA sequences for your genome sequence, and run the pipeline on your own machine, compute farm or the cloud. PGAP will produce annotation consistent with NCBI’s internal PGAP. Submit the resulting annotated genome to GenBank through the genome submission portal, and get an accession back.
As with any other submitted assembly, PGAP-annotated genomes will be screened for foreign contaminants and vector sequences at submission. Any annotated assemblies that don’t pass may need to be modified. We are developing an automated process to handle these edits!
We are also working on other improvements to stand-alone PGAP such as a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned for new developments!
GenBank release 231.0 (4/19/2019) is now available on the NCBI FTP site. This release has 5.03 terabases and 1.54 billion records.
The release has 212,775,414 traditional records containing 321,680,566,570 base pairs of sequence data. There are also 993,732,214 WGS records containing 4,421,986,382,065 base pairs of sequence data, 311,247,136 bulk-oriented TSA records containing 277,118,019,688 base pairs of sequence data, and 24,240,761 bulk-oriented TLS records containing 9,623,321,565 base pairs of sequence data.
GenBank release 230.0 (2/15/2019) with 4.74 terabases and 1.47 billion records is now available from the NCBI FTP site (flatfiles, ASN.1). There are two notable changes with this release. Because we have increased the target maximum uncompressed file size, the number of files dropped by about 1,000. We are also now assigning expanded WGS and protein accessions. WGS accessions may now have a six-letter Project Code prefix and a two-digit Assembly-Version number, followed by seven, eight, or nine digits, for example AAAABB010000001. Protein accessions may now have a three-letter prefix followed by seven digits, for example EAA0000001. See sections 1.3.1 and 1.3.2 of the Release Notes for details.
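If your scripts validate accessions, they may need updating for the expanded formats. The sketch below turns the description above into regular expressions. It assumes the letters are uppercase, as in the two examples, and it covers only the expanded forms described here; the function and pattern names are illustrative, and the Release Notes remain the authoritative definition.

```python
import re
from typing import Optional

# Expanded WGS: six-letter Project Code, two-digit Assembly-Version,
# then seven, eight, or nine digits (ASSUMPTION: uppercase letters only).
WGS_EXPANDED = re.compile(r"[A-Z]{6}\d{2}\d{7,9}")

# Expanded protein: three letters followed by seven digits.
PROTEIN_EXPANDED = re.compile(r"[A-Z]{3}\d{7}")


def classify_expanded_accession(accession: str) -> Optional[str]:
    """Return which expanded format the accession matches, or None."""
    if WGS_EXPANDED.fullmatch(accession):
        return "wgs_expanded"
    if PROTEIN_EXPANDED.fullmatch(accession):
        return "protein_expanded"
    return None


# The two examples given in the announcement.
for acc in ("AAAABB010000001", "EAA0000001"):
    print(acc, classify_expanded_accession(acc))
```

Accessions in the earlier, shorter formats return None here, so a real validator would keep its existing patterns alongside these.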
Do you have Norovirus sequence data to submit to GenBank? Try out the newly-released improvements in our submission service for Norovirus data! The new service offers the following advantages:
- Faster processing and shorter time to accession numbers
- Improved user interface
- Automatic Feature annotation
Figure 1. The submission portal page showing the new option for submitting Norovirus data.
Begin a new Norovirus submission or see how to get started submitting other data to GenBank.
GenBank accepts a wide range of data to support scientific discovery and analysis on sequences from all branches of life.