GenBank reaches over 4 terabytes of data in release 229

GenBank release 229.0 (12/15/2018) has 211,281,415 traditional records (including non-bulk-oriented TSA) containing 285,688,542,186 base pairs of sequence data. There are also 773,773,190 WGS records containing 3,656,719,423,096 base pairs of sequence data, 274,845,473 bulk-oriented TSA records containing 248,592,892,188 base pairs of sequence data, and 20,924,588 bulk-oriented TLS records containing 8,511,829,281 base pairs of sequence data.
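As a quick check on these figures, the four components can be summed to get the combined size of the release. The sketch below is illustrative only: the numbers are copied from the paragraph above, while the dictionary layout and the `totals` helper are hypothetical names chosen for this example. The combined base-pair count is presumably what the "over 4 tera" headline refers to.

```python
# Figures copied from the release 229.0 summary: (records, base pairs).
# The component labels and this layout are illustrative, not an NCBI format.
COMPONENTS = {
    "traditional": (211_281_415, 285_688_542_186),
    "WGS": (773_773_190, 3_656_719_423_096),
    "TSA (bulk)": (274_845_473, 248_592_892_188),
    "TLS (bulk)": (20_924_588, 8_511_829_281),
}


def totals(components):
    """Return (total records, total base pairs) across all components."""
    records = sum(r for r, _ in components.values())
    bases = sum(b for _, b in components.values())
    return records, bases


total_records, total_bases = totals(COMPONENTS)
print(f"{total_records:,} records, {total_bases:,} base pairs")
print(f"{total_bases / 1e12:.2f} trillion base pairs")
```

WGS dominates the total: it alone accounts for more than 3.6 of the roughly 4.2 trillion base pairs.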

During the 61 days between the close dates for GenBank releases 228.0 and 229.0, the traditional portion of GenBank grew by 6,020,252,054 base pairs and 1,624,779 sequence records.

During that same period, 96,194 records were updated. An average of 28,213 traditional records were added or updated per day.
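The daily average follows from the figures above, assuming it is computed as added plus updated records divided by the 61-day interval. A minimal sketch of that arithmetic (the function name `daily_average` is hypothetical):

```python
# Figures from the text: 61 days between release close dates,
# 1,624,779 records added and 96,194 records updated.
DAYS = 61
ADDED = 1_624_779
UPDATED = 96_194


def daily_average(added, updated, days):
    """Average number of records added or updated per day."""
    return (added + updated) / days


avg = daily_average(ADDED, UPDATED, DAYS)
print(f"{round(avg):,} records per day")
```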

Between releases 228.0 and 229.0, the WGS component of GenBank grew by 212,547,280,889 base pairs and by 51,334,662 sequence records. The TSA component grew by 12,717,318,590 base pairs and by 14,918,059 sequence records. The TLS component grew by 76,716,368 base pairs and by 172,300 sequence records.
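To put these increments in proportion, the size of each component at release 228.0 can be recovered by subtracting the growth from the 229.0 total, which in turn gives a relative growth rate. This is a sketch under one assumption: that the TSA and TLS growth figures refer to the same bulk-oriented components whose totals are quoted for release 229.0. The helper name `relative_growth` is hypothetical.

```python
# (base pairs in release 229.0, base pairs added since release 228.0),
# both copied from the text. Pairing the TSA/TLS growth with the
# bulk-oriented totals is an assumption of this sketch.
BASE_PAIRS = {
    "traditional": (285_688_542_186, 6_020_252_054),
    "WGS": (3_656_719_423_096, 212_547_280_889),
    "TSA": (248_592_892_188, 12_717_318_590),
    "TLS": (8_511_829_281, 76_716_368),
}


def relative_growth(current, added):
    """Return (previous size, growth as a percentage of the previous size)."""
    previous = current - added
    return previous, 100.0 * added / previous


growth = {name: relative_growth(cur, add) for name, (cur, add) in BASE_PAIRS.items()}
for name, (previous, pct) in growth.items():
    print(f"{name}: {previous:,} bp in 228.0, grew by {pct:.2f}%")
```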

The total number of sequence data files increased by 33 with this release. The divisions are as follows:

  • BCT: 24 new files, now a total of 566
  • CON: 3 new files, now a total of 375
  • ENV: 2 new files, now a total of 105
  • EST: 3 new files, now a total of 489
  • INV: 5 new files, now a total of 117
  • PAT: 7 new files, now a total of 347
  • PLN: 9 new files, now a total of 241
  • VRT: 1 new file, now a total of 95

For downloading purposes, please keep in mind that the uncompressed GenBank release 229.0 flatfiles require roughly 934 GB (sequence files only). The ASN.1 data require approximately 760 GB.
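Before starting a download of this size, it can help to confirm that the target volume has enough free space. The following is a minimal sketch using Python's standard `shutil.disk_usage`; it assumes that "GB" means decimal gigabytes (10^9 bytes), which the text does not specify, and the helper names `required_bytes` and `has_room` are hypothetical. It does not account for the compressed files that also occupy space during download.

```python
import shutil

GB = 10**9          # assumption: decimal gigabytes; the text does not say
FLATFILE_GB = 934   # uncompressed flatfiles, sequence files only (from the text)
ASN1_GB = 760       # uncompressed ASN.1 data (from the text)


def required_bytes(flatfiles=True, asn1=False):
    """Approximate uncompressed size in bytes of the selected data sets."""
    gb = (FLATFILE_GB if flatfiles else 0) + (ASN1_GB if asn1 else 0)
    return gb * GB


def has_room(path, flatfiles=True, asn1=False):
    """True if the volume holding `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes(flatfiles, asn1)


print("Room for flatfiles in current directory:", has_room("."))
```

Passing `asn1=True` as well checks for both data sets together, roughly 1,694 GB.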

More information about GenBank release 229.0 is available in the release notes, as well as in the README files in the genbank and ASN.1 (ncbi-asn1) directories on FTP. See Section 1.4.1 of the release notes for details about future accession format changes for WGS/TSA/TLS sequencing projects, and for protein sequences.
