Month: November 2020

GenBank 240.0 is available and surpasses 10 trillion base pairs!

GenBank release 240.0 (10/28/2020) is now available on the NCBI FTP site. This release has 10.33 trillion bases and 2.17 billion records.

The current release has 219,055,207 traditional records containing 698,688,094,046 base pairs of sequence data. There are also 1,432,874,252 WGS records containing 9,215,815,569,509 base pairs of sequence data, 435,968,379 bulk-oriented TSA records containing 382,996,662,270 base pairs of sequence data, and 78,177,358 bulk-oriented TLS records containing 28,814,798,868 base pairs of sequence data.
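As a quick sanity check, the four component figures above can be summed to confirm the headline totals. This is only an illustrative sketch: the variable names are made up for the example, and the numbers are copied directly from the paragraph above.

```python
# Figures copied from the release announcement: (records, base pairs).
# The dictionary and variable names are hypothetical, chosen for this example.
components = {
    "traditional": (219_055_207, 698_688_094_046),
    "WGS": (1_432_874_252, 9_215_815_569_509),
    "TSA": (435_968_379, 382_996_662_270),
    "TLS": (78_177_358, 28_814_798_868),
}

total_records = sum(records for records, _ in components.values())
total_bases = sum(bases for _, bases in components.values())

print(f"{total_records:,} records ({total_records / 1e9:.2f} billion)")
print(f"{total_bases:,} base pairs ({total_bases / 1e12:.2f} trillion)")
```

The sums round to 2.17 billion records and 10.33 trillion bases, matching the totals quoted for the release.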

Growth between releases

During the 71 days between the close dates for GenBank Releases 239.0 and 240.0, the ‘traditional’ portion of GenBank grew by 44,631,024,497 base pairs and by 412,969 sequence records. During that same period, 94,006 records were updated. An average of 7,140 ‘traditional’ records were added and/or updated per day.
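The daily average follows from the added and updated record counts divided by the number of days between close dates. A minimal sketch of that arithmetic, using only the figures quoted above (the variable names are illustrative):

```python
# Figures copied from the paragraph above; names are hypothetical.
days_between_releases = 71
records_added = 412_969
records_updated = 94_006

records_touched = records_added + records_updated
average_per_day = records_touched / days_between_releases

print(f"{records_touched:,} records added or updated")
print(f"about {average_per_day:,.0f} per day")
```

The result agrees with the quoted average of 7,140 records per day.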

Between releases 239.0 and 240.0, the WGS component of GenBank grew by 374,166,158,857 base pairs and by 24,751,365 sequence records. The TSA component of GenBank grew by 16,027,711,110 base pairs and by 18,443,812 sequence records. The TLS component of GenBank grew by 989,739,370 base pairs and by 2,495,201 sequence records.

The total number of sequence data files increased by 107 with this release. The divisions are as follows:

  • BCT: 22 new files, now a total of 512
  • CON: 1 new file, now a total of 218
  • INV: 2 new files, now a total of 97
  • PAT: 1 new file, now a total of 213
  • PLN: 47 new files, now a total of 594
  • PRI: 10 new files, now a total of 45
  • ROD: 15 new files, now a total of 56
  • VRL: 5 new files, now a total of 44
  • VRT: 4 new files, now a total of 214
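The per-division counts in the list above can be added up to confirm the overall increase of 107 files. The snippet below is just a sketch of that check; the dictionary name is an assumption made for the example, and the values are taken from the list.

```python
# New sequence data files per division, copied from the list above.
# The dictionary name is hypothetical.
new_files_by_division = {
    "BCT": 22,
    "CON": 1,
    "INV": 2,
    "PAT": 1,
    "PLN": 47,
    "PRI": 10,
    "ROD": 15,
    "VRL": 5,
    "VRT": 4,
}

total_new_files = sum(new_files_by_division.values())
print(f"{total_new_files} new sequence data files")
```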

Delivery of GenBank 240.0 was delayed by two weeks

A power surge at the NCBI data center and subsequent downtime for a critical disk storage system led to a nearly two-week delay in the delivery of the data files for GenBank 240.0. There were no data losses, and public-facing systems remained available. However, between the direct impacts of the outage and subsequent efforts to resume processing pipelines, the GenBank release timeline was significantly pushed back. Our apologies for the delay!

Upcoming Changes

New /ncRNA_class value: circRNA

  • The allowed values for the /ncRNA_class qualifier have been extended to include “circRNA”, for circular RNA molecules. This change will appear no earlier than GenBank Release 242.0 in February 2021.

New /circular_RNA qualifier

  • Complementing the new “circRNA” ncRNA class, a new /circular_RNA qualifier will be introduced in (or after) GenBank Release 242.0 in February 2021.

Additional Information

For downloading purposes, please keep in mind that the uncompressed GenBank Release 240.0 sequence data flatfiles require roughly 1,524 GB. The ASN.1 data files require approximately 958 GB.

More information about GenBank release 240.0 is available in the release notes, as well as in the README files in the genbank and ASN.1 (ncbi-asn1) directories on FTP.

November 18 Webinar: A new way to prepare genome submissions using NCBI’s Genome Workbench!

Join us November 18 to learn how to use Genome Workbench, NCBI’s sequence annotation and analysis package, to prepare genome submissions for GenBank. This webinar will help you prepare for the upcoming retirement of the Sequin submission tool in January 2021. You will learn how to use Genome Workbench’s Submission Wizard, Validation and Submitter Reports, Flat File View, and Graphical Sequence View to prepare your annotated genome submission to GenBank and to find and fix any problems before submitting.

  • Date and time: Wed, November 18, 2020 12:00 PM – 12:45 PM EST
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

NCBI RefSeq and Ensembl/GENCODE taking MANE mainstream with v0.92!

NCBI and EBI have been hard at work on our joint MANE collaboration, providing a set of representative transcripts for human protein-coding genes that are identically annotated in the NCBI RefSeq and Ensembl/GENCODE annotation sets and exactly match the GRCh38 reference assembly. We’re pleased to announce MANE v0.92, now covering 16,865 genes or ~88% of known human protein-coding genes.

In particular, we’ve focused on clinically relevant genes, and MANE Select now includes 99% of genes with high gene-disease validity. This release also includes 43 extra transcripts labeled “MANE Plus Clinical” that we’ve chosen to aid in clinical reporting, for example, when there are additional pathogenic variants not covered in the MANE Select transcript. While it’s critical to consider other alternatively spliced transcripts for variant interpretation or functional analyses, the MANE Select and MANE Plus Clinical transcripts provide a common foundation for clinical reporting and for other analyses that benefit from using just one well-supported transcript or protein per gene.
