GenBank release 230 available, changes to number of files, expanded accessions


GenBank release 230.0 (2/15/2019), with 4.74 terabases and 1.47 billion records, is now available from the NCBI FTP site (flatfiles, ASN.1). There are two notable changes with this release. Because we have increased the target maximum uncompressed file size, the number of files dropped by about 1,000. We are also now assigning expanded WGS and protein accessions. WGS accessions may now have a six-letter Project Code prefix and a two-digit Assembly-Version number, followed by seven, eight, or nine digits, for example AAAABB010000001. Protein accessions may now have a three-letter prefix followed by seven digits, for example EAA0000001. See sections 1.3.1 and 1.3.2 of the Release Notes for details.

The release has 212,260,377 traditional records containing 303,709,510,632 base pairs of sequence data. There are also 945,019,312 WGS records containing 4,164,513,961,679 base pairs of sequence data, 294,772,430 bulk-oriented TSA records containing 263,936,885,705 base pairs of sequence data, and 23,259,929 bulk-oriented TLS records containing 9,146,836,085 base pairs of sequence data.
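The headline totals follow from the four component counts. As a quick check, this small Python sketch (the variable names are illustrative) sums the figures quoted above:

```python
# Record and base-pair counts quoted above for GenBank release 230.0.
COMPONENTS = {
    "traditional": (212_260_377, 303_709_510_632),
    "WGS": (945_019_312, 4_164_513_961_679),
    "TSA (bulk)": (294_772_430, 263_936_885_705),
    "TLS (bulk)": (23_259_929, 9_146_836_085),
}

total_records = sum(records for records, _ in COMPONENTS.values())
total_bases = sum(bases for _, bases in COMPONENTS.values())

print(f"{total_records:,} records")
print(f"{total_bases:,} base pairs ({total_bases / 1e12:.2f} terabases)")
```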

During the 64 days between the close dates for GenBank Releases 229.0
and 230.0, the traditional portion of GenBank grew by 18,020,968,446
basepairs and 978,962 sequence records. During that same period,
25,301 records were updated. An average of 15,691 ‘traditional’ records
were added and/or updated per day.
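
The per-day figure appears to be the sum of added and updated records divided by the number of days, truncated to a whole number. That reading is an assumption on our part, but it reproduces the quoted value. A minimal Python sketch, with a helper name chosen only for illustration:

```python
def average_records_per_day(added: int, updated: int, days: int) -> int:
    """Average records added or updated per day, truncated to an integer."""
    return (added + updated) // days


# Figures for the 64 days between releases 229.0 and 230.0.
per_day = average_records_per_day(added=978_962, updated=25_301, days=64)
print(f"{per_day:,}")
```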

Between releases 229.0 and 230.0, the WGS component of GenBank grew by
507,794,538,583 basepairs and by 171,246,122 sequence records, the TSA component grew by 15,343,993,517 basepairs and by 19,926,957 sequence records, and the TLS component grew by 635,006,804 basepairs and by 2,335,341 sequence records.

For downloading purposes, please keep in mind that the uncompressed GenBank release 230.0 flatfiles require roughly 964 GB (sequence files only). The ASN.1 data require approximately 773 GB.

For additional release information, see the README files in either of
the directories linked above, and the Release Notes.

New Norovirus GenBank Submission Service


Do you have Norovirus sequence data to submit to GenBank? Try out the newly-released improvements in our submission service for Norovirus data! The new service offers the following advantages:

  • Faster processing and shorter time to accession numbers
  • Improved user interface
  • Automatic Feature annotation

Figure 1. The submission portal page showing the new option for submitting Norovirus data.

Begin a new Norovirus submission or see how to get started submitting other data to GenBank.

GenBank accepts a wide range of data to support scientific discovery and analysis on sequences from all branches of life.

GenBank reaches over 4 terabases of data in release 229


GenBank release 229.0 (12/15/2018) has 211,281,415 traditional records (including non-bulk-oriented TSA) containing 285,688,542,186 base pairs of sequence data. There are also 773,773,190 WGS records containing 3,656,719,423,096 base pairs of sequence data, 274,845,473 bulk-oriented TSA records containing 248,592,892,188 base pairs of sequence data, and 20,924,588 bulk-oriented TLS records containing 8,511,829,281 base pairs of sequence data.


NCBI to correct existing taxonomic information on public GenBank records with average nucleotide identity analysis


To ensure that taxonomic information on genome assemblies is as accurate as possible, NCBI will use average nucleotide identity (ANI) analysis to correct existing public records in GenBank.

We will contact submitters of records found to be misidentified and provide reports with ANI information based on comparison to type strains.  If there is no objection, the taxonomic change will be made, and a structured comment will be added to the record.

In cases where a genome assembly was not submitted with a binomial name (for example, Bacillus sp. 123) but was found to match a known species with high confidence, the strain will be merged with the binomial in the taxonomy database. This will occur as part of the normal maintenance of merged taxonomic names. The submitter will not be contacted, but a structured comment indicating the change will be added to the record.

A paper in the International Journal of Systematic and Evolutionary Microbiology presents the method NCBI scientists used to review all prokaryotic genome assemblies in GenBank, as well as the current status of GenBank verifications and recent developments in confirming species assignments in new genome submissions.

Join NCBI at PAG in San Diego, January 12–16, 2019


Next week, NCBI staff will attend the Plant and Animal Genome (PAG) Conference. We have several activities planned, including 1 booth (#223), 4 workshops, 1 talk and 2 posters.

Read on to learn more about what you can look forward to if you’re attending PAG this year. (Note: The listed times are Pacific time.)


Adapting flatfile parsers for GenBank’s new accession formats


As previously announced, GenBank and other INSDC members will expand the accession formats used for sequencing projects by the end of this year. We're introducing these new formats to accommodate the growth of Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS) sequencing projects. More details about those changes are available on NCBI Insights.

You may have to adjust your code and databases to accommodate the new formats’ longer length. In particular, the first line of the flatfile format, referred to as the LOCUS line, includes the “Locus Name” (usually identical to the accession number), which may now grow to as long as 20 characters. See section 3.4.4 of the GenBank release notes for examples of how the LOCUS line might change.

Since 2003, the GenBank release notes have recommended that flatfile parsers use a whitespace-separated tokens approach to accommodate changes like the one described in section 3.4.4. If your flatfile parsers rely solely on position, you may have to make modifications. From our internal testing, it appears BioPython and BioPerl properly handle most of the examples shown in section 3.4.4, and only have issues with the last theoretical examples where the sequence length no longer ends at position 40. We do recommend adjusting code to accommodate those theoretical examples for future-proofing.
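
To illustrate the whitespace-separated tokens approach, here is a minimal Python sketch of a LOCUS line parser. It is not the BioPython or BioPerl implementation. The `LocusLine` type, the `parse_locus_line` function, and the sample line are made up for this example, and the token order it assumes (name, length, unit, molecule type and optional topology, division, date) should be checked against section 3.4.4 before relying on it.

```python
from typing import NamedTuple, Tuple


class LocusLine(NamedTuple):
    name: str
    length: int
    unit: str                 # e.g. "bp" for nucleotide records
    middle: Tuple[str, ...]   # molecule type and optional topology, as found
    division: str
    date: str


def parse_locus_line(line: str) -> LocusLine:
    """Parse a LOCUS line by whitespace-separated tokens, not by column."""
    tokens = line.split()
    if len(tokens) < 6 or tokens[0] != "LOCUS":
        raise ValueError(f"not a LOCUS line: {line!r}")
    return LocusLine(
        name=tokens[1],
        length=int(tokens[2]),
        unit=tokens[3],
        middle=tuple(tokens[4:-2]),  # whatever sits between unit and division
        division=tokens[-2],
        date=tokens[-1],
    )


# Made-up line with a long locus name and a sequence length that runs
# past the usual column; a position-based parser could misread it.
sample = "LOCUS       AAAABB010000001 123456789012 bp    DNA     linear   BCT 15-FEB-2019"
locus = parse_locus_line(sample)
print(locus.name, locus.length, locus.division)
```

Because the fields are taken from the token list rather than from fixed columns, a longer locus name or sequence length shifts nothing that the parser depends on.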

Please write to the helpdesk with any questions about the new formats.

GenBank release 227 available through FTP, BLAST & Entrez


GenBank release 227.0 (8/13/2018) has 208,831,050 traditional records (including non-bulk-oriented TSA) containing 260,806,936,411 base pairs of sequence data. There are also 665,309,765 WGS records containing 3,204,855,013,281 base pairs of sequence data, 249,295,386 bulk-oriented TSA records containing 225,520,004,678 base pairs of sequence data, and 15,822,538 bulk-oriented TLS records containing 6,077,824,493 base pairs of sequence data.


GenBank will start using expanded accession formats by December 2018


By the end of 2018, GenBank and other INSDC members will expand the accession formats used for sequencing projects. We have assigned almost all the possible accession numbers using the current, shorter formats. Using these longer formats will allow us to expand accession ranges and give us greater capacity.

The expanded format for Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS) sequencing projects will use a six-letter Project Code prefix and a two-digit Assembly-Version number followed by 7, 8, or 9 digits (for example, AAAAAA020000001).

Non-WGS/TLS/TSA nucleotide sequences currently use a “2+6” format, two-letter prefix followed by six digits. This format will be expanded to eight digits.

Protein sequences currently use a “3+5” accession format. By the end of 2018, this format will use seven digits.

You will need to adjust any processing methods to accommodate these new identifier formats.  Please write to the helpdesk with any questions about the new formats.
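
As a starting point for such adjustments, the formats described in this post can be written as regular expressions. This Python sketch is an illustration only: the pattern labels and the `classify_accession` helper are our own names, the patterns cover just the formats mentioned above, and version suffixes such as ".1" or any other accession styles are not handled.

```python
import re
from typing import Optional

# Patterns built only from the formats described in this post.
ACCESSION_PATTERNS = {
    # Six-letter Project Code, two-digit Assembly-Version, then 7-9 digits.
    "WGS/TSA/TLS (expanded)": re.compile(r"[A-Z]{6}\d{2}\d{7,9}"),
    "nucleotide (2+6)": re.compile(r"[A-Z]{2}\d{6}"),
    "nucleotide (2+8)": re.compile(r"[A-Z]{2}\d{8}"),
    "protein (3+5)": re.compile(r"[A-Z]{3}\d{5}"),
    "protein (3+7)": re.compile(r"[A-Z]{3}\d{7}"),
}


def classify_accession(accession: str) -> Optional[str]:
    """Return the label of the first pattern matching the whole string."""
    for label, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession):
            return label
    return None


print(classify_accession("AAAAAA020000001"))
```

Matching the whole string with `fullmatch` keeps a longer accession from being mistaken for a shorter format that happens to share its prefix.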

GenBank release 225: Over 1 billion sequence records stored!


GenBank release 225.0 (4/14/2018) has 208,452,303 traditional records (including non-bulk-oriented TSA) containing 260,189,141,631 base pairs of sequence data. In addition, there are 621,379,029 WGS records containing 2,784,740,996,536 base pairs of sequence data, 227,364,990 TSA records containing 205,232,396,043 base pairs of sequence data, and 14,782,654 TLS records containing 5,612,769,448 base pairs of sequence data.

During the 60 days between the close dates for GenBank releases 224.0 and 225.0, the traditional portion of GenBank grew by 6,558,433,533 base pairs and by 1,411,748 sequence records. During that same period, 86,960 records were updated – an average of 24,978 records added or updated per day.


The NCBI BioCollections database links specimen vouchers and sequence records to home institutions


A paper in the January 2018 issue of Database describes the NCBI BioCollections database, a curated dataset of metadata for culture collections, museums, herbaria and other natural history collections connected to sequence records in GenBank. The BioCollections database was established to allow the association of specimen vouchers and related sequence records to their home institutions. This process also allows back-linking from the home institution for quick identification of all records originating from each collection.

The rapidly growing set of GenBank submissions frequently includes records that are derived from specimen vouchers.  Correct identification of the specimens studied, along with a method to associate the sample with its institution, is critical to the outcome of related studies and analyses.

New repository records are added to the database if they are submitted to the International Nucleotide Sequence Database Collaboration (INSDC) along with sequence data. Each record now provides information about the institution that houses the collection, standard Institution Code, mailing address, and associated webpage if available.

The BioCollections database is maintained and curated by the Taxonomy group at NCBI.