How does it work? Download PGAP from GitHub, provide some basic information and the FASTA sequences for your genome, and run the pipeline on your own machine, on a compute farm, or in the cloud. PGAP will produce annotation consistent with NCBI’s internal PGAP. Submit the resulting annotated genome to GenBank through the genome submission portal, and get an accession back.
As with any other submitted assembly, PGAP-annotated genomes will be screened for foreign contaminants and vector sequences at submission. Any annotated assemblies that don’t pass may need to be modified. We are developing an automated process to handle these edits!
We are also working on other improvements to stand-alone PGAP such as a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned for new developments!
GenBank release 231.0 (4/19/2019) is now available on the NCBI FTP site. This release has 5.03 terabases and 1.54 billion records.
The release has 212,775,414 traditional records containing 321,680,566,570 base pairs of sequence data. There are also 993,732,214 WGS records containing 4,421,986,382,065 base pairs of sequence data, 311,247,136 bulk-oriented TSA records containing 277,118,019,688 base pairs of sequence data, and 24,240,761 bulk-oriented TLS records containing 9,623,321,565 base pairs of sequence data.
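As a quick sanity check, the per-category counts quoted above can be summed to confirm the headline totals for release 231.0. This is only an arithmetic sketch using the numbers from this announcement; the dictionary names are illustrative.

```python
# Per-category counts for GenBank release 231.0, as quoted in the announcement.
records = {
    "traditional": 212_775_414,
    "WGS": 993_732_214,
    "TSA (bulk-oriented)": 311_247_136,
    "TLS (bulk-oriented)": 24_240_761,
}
base_pairs = {
    "traditional": 321_680_566_570,
    "WGS": 4_421_986_382_065,
    "TSA (bulk-oriented)": 277_118_019_688,
    "TLS (bulk-oriented)": 9_623_321_565,
}

total_records = sum(records.values())
total_bases = sum(base_pairs.values())

# Express the totals in the units used in the headline figures.
terabases = round(total_bases / 1e12, 2)
billion_records = round(total_records / 1e9, 2)

print(f"{terabases} terabases, {billion_records} billion records")
```

The sums come to 5,030,408,289,888 base pairs and 1,541,995,525 records, which round to the 5.03 terabases and 1.54 billion records stated for the release.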
GenBank release 230.0 (2/15/2019) with 4.74 terabases and 1.47 billion records is now available from the NCBI FTP site (flatfiles, ASN.1). There are two notable changes with this release. Because we have increased the target maximum uncompressed file size, the number of files dropped by about 1,000. We are also now assigning expanded WGS and protein accessions. WGS accessions may now have a six-letter Project Code prefix and a two-digit Assembly-Version number, followed by seven, eight, or nine digits, for example AAAABB010000001. Protein accessions may now have three letters followed by seven digits, for example EAA0000001. See sections 1.3.1 and 1.3.2 of the Release Notes for details.
GenBank release 229.0 (12/15/2018) has 211,281,415 traditional records (including non-bulk-oriented TSA) containing 285,688,542,186 base pairs of sequence data. There are also 773,773,190 WGS records containing 3,656,719,423,096 base pairs of sequence data, 274,845,473 bulk-oriented TSA records containing 248,592,892,188 base pairs of sequence data, and 20,924,588 bulk-oriented TLS records containing 8,511,829,281 base pairs of sequence data.
To ensure that taxonomic information on genome assemblies is as accurate as possible, NCBI will use average nucleotide identity (ANI) analysis to correct existing public records in GenBank.
We will contact submitters of records found to be misidentified and provide reports with ANI information based on comparison to type strains. If there is no objection, the taxonomic change will be made, and a structured comment will be added to the record.
In cases where a genome assembly was not submitted with a binomial name (e.g., Bacillus sp. 123) but was found to match a known species with high confidence, the strain will be merged with the binomial in the taxonomy database. This will occur as part of the normal maintenance of merged taxonomic names. The submitter will not be contacted, but a structured comment indicating the change will be added to the record.
A paper in the International Journal of Systematic and Evolutionary Microbiology presents the method NCBI scientists used to review all prokaryotic genome assemblies in GenBank, as well as the current status of GenBank verifications and recent developments in confirming species assignments in new genome submissions.
As previously announced, GenBank and other INSDC members will expand the accession formats used for sequencing projects by the end of this year. We’re introducing these new formats to accommodate the growth of Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS) sequencing projects. More details about those changes are available on NCBI Insights.
You may have to adjust your code and databases to accommodate the new formats’ longer length. In particular, the first line of the flatfile format, referred to as the LOCUS line, includes the “Locus Name” (usually identical to the accession number), which may now grow to as long as 20 characters. See section 3.4.4 of the GenBank release notes for examples of how the LOCUS line might change.
Since 2003, the GenBank release notes have recommended that flatfile parsers use a token-based approach that splits on whitespace, to accommodate changes like the one described in section 3.4.4. If your flatfile parsers rely solely on column positions, you may have to modify them. From our internal testing, it appears that BioPython and BioPerl properly handle most of the examples shown in section 3.4.4 and only have issues with the last theoretical examples, where the sequence length no longer ends at position 40. We recommend adjusting code to accommodate those theoretical examples for future-proofing.
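To illustrate the whitespace-token approach, here is a minimal Python sketch of a LOCUS line parser that does not depend on column positions. It is not NCBI code and not how BioPython or BioPerl do it. It assumes the usual token order (LOCUS, locus name, length, unit, molecule type, optional topology, division, date), and the sample line is made up for illustration; check section 3.4.4 of the release notes for the authoritative layout.

```python
from typing import NamedTuple, Optional


class LocusLine(NamedTuple):
    name: str
    length: int
    unit: str                # "bp" for nucleotide records
    molecule: str
    topology: Optional[str]  # e.g. "linear" or "circular"; may be absent
    division: str
    date: str


def parse_locus_line(line: str) -> LocusLine:
    """Parse a LOCUS line by splitting on whitespace, not by column."""
    tokens = line.split()
    # Minimum: LOCUS, name, length, unit, molecule, division, date.
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError(f"not a recognizable LOCUS line: {line!r}")
    name, length, unit = tokens[1], int(tokens[2]), tokens[3]
    # The last two tokens are assumed to be division and date;
    # whatever sits between the unit and those is molecule (+ topology).
    middle, division, date = tokens[4:-2], tokens[-2], tokens[-1]
    molecule = middle[0]
    topology = middle[1] if len(middle) > 1 else None
    return LocusLine(name, length, unit, molecule, topology, division, date)


# Hypothetical LOCUS line with a long, expanded-format locus name.
sample = "LOCUS       AAAAAA020000001 1234 bp    DNA     linear   BCT 15-DEC-2018"
parsed = parse_locus_line(sample)
print(parsed.name, parsed.length, parsed.division)
```

Because the fields are located by token order, a longer locus name or a sequence length that shifts past position 40 does not change the result, as long as at least one space still separates the tokens.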
Please write to the helpdesk with any questions about the new formats.
GenBank release 227.0 (8/13/2018) has 208,831,050 traditional records (including non-bulk-oriented TSA) containing 260,806,936,411 base pairs of sequence data. There are also 665,309,765 WGS records containing 3,204,855,013,281 base pairs of sequence data, 249,295,386 bulk-oriented TSA records containing 225,520,004,678 base pairs of sequence data, and 15,822,538 bulk-oriented TLS records containing 6,077,824,493 base pairs of sequence data.
By the end of 2018, GenBank and other INSDC members will expand the accession formats used for sequencing projects. We have assigned almost all the possible accession numbers using the current, shorter formats. Using these longer formats will allow us to expand accession ranges and give us greater capacity.
The expanded format for Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS) sequencing projects will use a six-letter Project Code prefix and a two-digit Assembly-Version number followed by 7, 8, or 9 digits (for example, AAAAAA020000001).
Non-WGS/TLS/TSA nucleotide sequences currently use a “2+6” format: a two-letter prefix followed by six digits. This format will be expanded to a two-letter prefix followed by eight digits.
Protein sequences currently use a “3+5” accession format. By the end of 2018, this format will use seven digits.
You will need to adjust any processing methods to accommodate these new identifier formats. Please write to the helpdesk with any questions about the new formats.
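As one way to adjust processing code, the formats described above can be written as regular expressions. The Python sketch below covers only the formats named in this post (it is not a complete list of INSDC accession formats), and the function name and labels are illustrative. AAAAAA020000001 comes from the text above; the other sample identifiers are made up.

```python
import re
from typing import Optional

# Patterns for the formats described in this post only.
ACCESSION_PATTERNS = {
    # Six-letter Project Code, two-digit Assembly-Version, then 7 to 9 digits.
    "wgs_expanded": re.compile(r"[A-Z]{6}\d{2}\d{7,9}"),
    # Non-WGS/TLS/TSA nucleotide: "2+6" today, expanding to two letters + 8 digits.
    "nucleotide_current": re.compile(r"[A-Z]{2}\d{6}"),
    "nucleotide_expanded": re.compile(r"[A-Z]{2}\d{8}"),
    # Protein: "3+5" today, expanding to three letters + 7 digits.
    "protein_current": re.compile(r"[A-Z]{3}\d{5}"),
    "protein_expanded": re.compile(r"[A-Z]{3}\d{7}"),
}


def classify_accession(accession: str) -> Optional[str]:
    """Return the label of the first pattern matching the whole string."""
    for label, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession):
            return label
    return None


for acc in ["AAAAAA020000001", "AB123456", "AB12345678", "EAA00001", "EAA0000001"]:
    print(acc, "->", classify_accession(acc))
```

Using `fullmatch` rather than `match` keeps a longer identifier from being mistaken for a shorter format that happens to share its first characters. Code that stores accessions in fixed-width database columns would also need those widths increased.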