Tag: GenBank

Four new options to simplify your SARS-CoV-2 submissions

Based on community feedback, we have recently added several improvements to the SARS-CoV-2 GenBank submission process. To save you time, NCBI completes feature annotation for you, so a SARS-CoV-2 GenBank submission requires only a FASTA file and source metadata. Here are other new features that simplify your submission workflow.

Automatically remove failed sequences from a submission: On the web, a single click lets you opt in to automatic removal of failed sequences (Figure 1) so that the rest of your sequences can be swiftly accessioned! A report provided after the submission lists your failed sequences and points out potential sequence problems so that you can take a closer look after your error-free sequences are released. This option is also available for submission via FTP.

Need to set up FTP submissions? The NCBI team is here to help. Contact gb-admin@ncbi.nlm.nih.gov.

Figure 1. GenBank submission page showing the option to remove sequences with processing errors.

GenBank release 245.0

GenBank release 245.0 (8/18/2021) is now available on the NCBI FTP site. This release has 15.31 trillion bases and 2.49 billion records.

The current release has 231,982,592 traditional records containing 940,513,260,726 base pairs of sequence data. There are also 1,653,427,055 WGS records containing 13,888,187,863,722 base pairs of sequence data, 498,305,045 bulk-oriented TSA records containing 440,578,422,611 base pairs of sequence data, and 106,995,218 bulk-oriented TLS records containing 39,930,167,315 base pairs of sequence data.
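
If you want to confirm that the component counts add up to the headline figures, a minimal Python sketch follows. The numbers are copied from the paragraph above; the names `components` and `release_totals` are illustrative and not part of any NCBI tool.

```python
# Component totals for GenBank release 245.0, copied from the text above.
# Each entry is (number of records, number of base pairs).
components = {
    "traditional": (231_982_592, 940_513_260_726),
    "WGS": (1_653_427_055, 13_888_187_863_722),
    "TSA": (498_305_045, 440_578_422_611),
    "TLS": (106_995_218, 39_930_167_315),
}


def release_totals(parts):
    """Sum records and base pairs across all components."""
    records = sum(r for r, _ in parts.values())
    bases = sum(b for _, b in parts.values())
    return records, bases


total_records, total_bases = release_totals(components)
print(f"{total_bases / 1e12:.2f} trillion bases")   # 15.31 trillion bases
print(f"{total_records / 1e9:.2f} billion records")  # 2.49 billion records
```

The same function can be used with the component counts of any other release listed on this page.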

GenBank release 244.0

GenBank release 244.0 (6/26/2021) is now available on the NCBI FTP site. This release has 14.78 trillion bases and 2.46 billion records.

The current release has 227,888,889 traditional records containing 866,009,790,959 base pairs of sequence data. There are also 1,632,796,606 WGS records containing 13,442,974,346,437 base pairs of sequence data, 494,641,358 bulk-oriented TSA records containing 436,594,941,165 base pairs of sequence data, and 102,662,929 bulk-oriented TLS records containing 38,198,113,354 base pairs of sequence data.

GenBank release 243.0

GenBank release 243.0 (5/26/2021) is now available on the NCBI FTP site. This release has 14.03 trillion bases and 2.40 billion records.

The current release has 227,123,201 traditional records containing 832,400,799,511 base pairs of sequence data. There are also 1,590,670,459 WGS records containing 12,732,048,052,023 base pairs of sequence data, 481,154,920 bulk-oriented TSA records containing 425,076,483,459 base pairs of sequence data, and 102,395,753 bulk-oriented TLS records containing 37,998,534,461 base pairs of sequence data. 

New class value and qualifier in GenBank release 242.0 accommodate circular RNA molecules

GenBank release 242.0 (2/16/2021) is now available on the NCBI FTP site and through Entrez and BLAST. This release has 13.49 trillion bases and 2.34 billion records.

Growth between releases

During the 57 days between the close dates for GenBank Releases 241.0 and 242.0, the ‘traditional’ portion of GenBank grew by 53,287,389,099 base pairs and by 4,773,649 sequence records. During that same period, 65,699 records were updated. An average of 84,901 ‘traditional’ records were added and/or updated per day.

Between releases 241.0 and 242.0, the WGS component of GenBank grew by 439,874,781,594 base pairs and by 45,942,354 sequence records. During the same period, the TSA component of GenBank grew by 15,398,434,562 base pairs and by 16,753,622 sequence records. Finally, the TLS component of GenBank grew by 597,613,549 base pairs and by 2,091,409 sequence records.

GenBank release 241.0

GenBank release 241.0 (12/21/2020) is now available on the NCBI FTP site. This release has 12.98 trillion bases and 2.27 billion records.

The current release has 221,467,827 traditional records containing 723,003,822,007 base pairs of sequence data. There are also 1,517,995,689 WGS records containing 11,830,842,428,018 base pairs of sequence data, 446,397,378 bulk-oriented TSA records containing 392,206,975,386 base pairs of sequence data, and 88,039,152 bulk-oriented TLS records containing 33,036,509,446 base pairs of sequence data.

GenBank 240.0 is available and surpasses 10 trillion basepairs!

GenBank release 240.0 (10/28/2020) is now available on the NCBI FTP site. This release has 10.33 trillion bases and 2.17 billion records.

The current release has 219,055,207 traditional records containing 698,688,094,046 base pairs of sequence data. There are also 1,432,874,252 WGS records containing 9,215,815,569,509 base pairs of sequence data, 435,968,379 bulk-oriented TSA records containing 382,996,662,270 base pairs of sequence data, and 78,177,358 bulk-oriented TLS records containing 28,814,798,868 base pairs of sequence data.

Growth between releases

During the 71 days between the close dates for GenBank Releases 239.0 and 240.0, the ‘traditional’ portion of GenBank grew by 44,631,024,497 basepairs and by 412,969 sequence records. During that same period, 94,006 records were updated. An average of 7,140 ‘traditional’ records were added and/or updated per day.
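
The per-day figure can be reproduced with a short calculation. The sketch below assumes the average is (records added + records updated) divided by the number of days, rounded to the nearest whole record; that reading matches the numbers quoted here, but the formula itself is our assumption, and `daily_average` is an illustrative name.

```python
def daily_average(added, updated, days):
    """Average number of records added and/or updated per day.

    Assumes the average is (added + updated) / days, rounded to the
    nearest whole record.
    """
    return round((added + updated) / days)


# Release 239.0 -> 240.0: 412,969 added and 94,006 updated over 71 days.
print(daily_average(412_969, 94_006, 71))  # 7140
```

The same formula gives 84,901 for the 57 days between releases 241.0 and 242.0, and 26,675 for the 60 days between releases 238.0 and 239.0.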

Between releases 239.0 and 240.0, the WGS component of GenBank grew by 374,166,158,857 basepairs and by 24,751,365 sequence records. The TSA component of GenBank grew by 16,027,711,110 basepairs and by 18,443,812 sequence records. The TLS component of GenBank grew by 989,739,370 basepairs and by 2,495,201 sequence records.

The total number of sequence data files increased by 107 with this release. The divisions are as follows:

  • BCT: 22 new files, now a total of 512
  • CON: 1 new file, now a total of 218
  • INV: 2 new files, now a total of 97
  • PAT: 1 new file, now a total of 213
  • PLN: 47 new files, now a total of 594
  • PRI: 10 new files, now a total of 45
  • ROD: 15 new files, now a total of 56
  • VRL: 5 new files, now a total of 44
  • VRT: 4 new files, now a total of 214

Delivery of GenBank 240.0 was delayed by two weeks

A power surge at the NCBI data center and subsequent downtime for a critical disk storage system led to a nearly two-week delay in the delivery of the data files for GenBank 240.0. There were no data losses, and public-facing systems remained available. However, between the direct impacts of the outage and subsequent efforts to resume processing pipelines, the GenBank release timeline was significantly pushed back. Our apologies for the delay!

Upcoming Changes

New /ncRNA_class value: circRNA

  • The allowed values for the /ncRNA_class qualifier have been extended to include “circRNA”, for circular RNA molecules. This change will appear no earlier than GenBank Release 242.0 in February 2021.

New /circular_RNA qualifier

  • Complementing the new “circRNA” ncRNA class, a new /circular_RNA qualifier will be introduced in (or after) GenBank Release 242.0 in February 2021.

Additional Information

For downloading purposes, please keep in mind that the uncompressed GenBank Release 240.0 sequence data flatfiles require roughly 1,524 GB. The ASN.1 data files require approximately 958 GB.
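
Before starting a full download, it can help to check that the target disk has enough room. The sketch below uses Python's standard `shutil.disk_usage`. The sizes come from the paragraph above; treating 1 GB as 10^9 bytes is an assumption, and `has_room` is an illustrative helper, not an NCBI utility.

```python
import shutil

# Uncompressed sizes for GenBank release 240.0, from the text above.
FLATFILES_GB = 1_524
ASN1_GB = 958


def has_room(path, required_gb, bytes_per_gb=10**9):
    """Return True if the filesystem holding `path` has enough free space.

    Assumes 1 GB = 10**9 bytes; pass bytes_per_gb=2**30 for binary units.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * bytes_per_gb


# Check the current directory against both file sets combined.
print(has_room(".", FLATFILES_GB + ASN1_GB))
```

Note that these figures are for the uncompressed files, so the space needed during download and decompression may differ.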

More information about GenBank release 240.0 is available in the release notes, as well as in the README files in the genbank and ASN.1 (ncbi-asn1) directories on FTP.

GenBank release 239 is available

GenBank release 239.0 (8/18/2020) is now available on the NCBI FTP site. This release has 9.89 trillion bases and 2.12 billion records.

The current release has 218,642,238 traditional records containing 654,057,069,549 base pairs of sequence data. There are also 1,408,122,887 WGS records containing 8,841,649,410,652 base pairs of sequence data, 417,524,567 bulk-oriented TSA records containing 366,968,951,160 base pairs of sequence data, and 75,682,157 bulk-oriented TLS records containing 27,825,059,498 base pairs of sequence data.

Growth between releases

During the 60 days between the close dates for GenBank Releases 238.0 and 239.0, the ‘traditional’ portion of GenBank grew by 226,233,810,648 basepairs and by 1,520,005 sequence records. During that same period, 80,474 records were updated. An average of 26,675 ‘traditional’ records were added and/or updated per day.

Between releases 238.0 and 239.0, the WGS component of GenBank grew by 727,603,148,494 basepairs and by 105,270,272 sequence records. The TSA component of GenBank grew by 7,021,242,098 basepairs and by 7,799,517 sequence records. The TLS component of GenBank grew by 324,424,370 basepairs and by 618,976 sequence records.

The total number of sequence data files increased by 425 with this release. The divisions are as follows:

  • BCT: 37 new files, now a total of 490
  • ENV: 2 new files, now a total of 62
  • INV: 9 new files, now a total of 95
  • MAM: 5 new files, now a total of 76
  • PAT: 7 new files, now a total of 212
  • PLN: 321 new files, now a total of 547
  • PRI: 1 new file, now a total of 35
  • ROD: 7 new files, now a total of 41
  • VRL: 2 new files, now a total of 38
  • VRT: 35 new files, now a total of 182

Note: The unusually large increase in the number of PLN-division files is due to an influx of multiple sets of near-gigabase-scale chromosomal records for wheat (Triticum aestivum) and barley (Hordeum vulgare subsp. vulgare).

For downloading purposes, please keep in mind that the uncompressed GenBank Release 239.0 sequence data flatfiles require roughly 1,461 GB. The ASN.1 data files require approximately 938 GB.

More information about GenBank release 239.0 is available in the release notes, as well as in the README files in the genbank and ASN.1 (ncbi-asn1) directories on FTP.

INSDC Statement on SARS-CoV-2 sequence data sharing during COVID-19

The National Library of Medicine and its partners in the International Nucleotide Sequence Database Collaboration (INSDC) have joined together to issue a statement encouraging the scientific community to submit their SARS-CoV-2 sequences to INSDC databases. The databases offer broad open access and integrated data, literature and tools – features that we believe are critical as the research community works together to understand and combat COVID-19. Read the full statement below.


The databases of the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) capture, organize, preserve and present nucleotide sequence data as part of the open scientific record. INSDC member institutions – the EMBL European Bioinformatics Institute (EMBL-EBI), the NIG DNA Data Bank of Japan (NIG-DDBJ) and the National Library of Medicine’s National Center for Biotechnology Information at NIH (NCBI) – are committed to the continued delivery of this critical element of scientific infrastructure.

The global COVID-19 crisis has brought an urgent need for the rapid open sharing of data relating to the outbreak. Most importantly, access to sequence data from the SARS-CoV-2 viral genome is essential for our understanding of the biology and spread of COVID-19. To aid in that effort, all three INSDC members have prioritized processing of SARS-CoV-2 sequence data and have streamlined the submission process.

Availability of data through INSDC databases provides:

    • Rapid open access – INSDC quickly makes submitted data freely available to everyone, without restrictions on reuse
    • Linkage of raw sequence read data to genome assemblies, providing researchers with the ability to validate the integrity of assemblies and investigate asserted mutations and changes in genome sequences
    • Integration of SARS-CoV-2 sequences with the entirety of INSDC data, including related coronavirus genome sequences, enabling comparison across species
    • Linkage of sequences to the published literature
    • Tools – INSDC partners provide integrated data analysis tools, such as BLAST, enhancing the discovery process

In support of the global response to the COVID-19 crisis, the INSDC calls upon the research community to:

    • Submit raw SARS-CoV-2 data to the databases of the INSDC
    • Submit consensus/assembled SARS-CoV-2 data to the databases of the INSDC
    • Provide information relating to the sequenced isolate or sample as part of the sequence submission; at a minimum, the time and place of isolation/sampling and an isolate/sample identifier should be provided to maximize the value of the sequences
    • Continue any established submissions to other databases in parallel with the INSDC submission

The integration of INSDC databases with the global bioinformatics data infrastructure, including tools, secondary databases, compute capacity and curation processes, assures the rapid dissemination of data and drives its maximal impact.

In addition to these fundamental roles of INSDC member institutions in the sharing of viral sequence data, each institution has rapidly established COVID-19-specific programs and resources: the European COVID-19 Data Platform from EMBL-EBI, the DDBJ’s Research Data Resources on New Coronavirus and the NCBI SARS-CoV-2 Resources. These resources both demonstrate the connectedness of INSDC databases to broader bioinformatics initiatives and serve to add immediate value to COVID-19 research.

Guy Cochrane (EMBL-EBI), Ilene Karsch-Mizrachi (NCBI-NLM-NIH), & Masanori Arita (DDBJ) on behalf of the International Nucleotide Sequence Database Collaboration

GenBank release 238 is available

GenBank release 238.0 (6/19/2020) is now available on the NCBI FTP site. This release has 8.93 trillion bases and 2 billion records.

The current release has 217,122,233 traditional records containing 427,823,258,901 base pairs of sequence data. There are also 1,302,852,615 WGS records containing 8,114,046,262,158 base pairs of sequence data, 409,725,050 bulk-oriented TSA records containing 359,947,709,062 base pairs of sequence data, and 75,063,181 bulk-oriented TLS records containing 27,500,635,128 base pairs of sequence data.