The current release has 227,888,889 traditional records containing 866,009,790,959 base pairs of sequence data. There are also 1,632,796,606 WGS records containing 13,442,974,346,437 base pairs of sequence data, 494,641,358 bulk-oriented TSA records containing 436,594,941,165 base pairs of sequence data, and 102,662,929 bulk-oriented TLS records containing 38,198,113,354 base pairs of sequence data.
The current release has 227,123,201 traditional records containing 832,400,799,511 base pairs of sequence data. There are also 1,590,670,459 WGS records containing 12,732,048,052,023 base pairs of sequence data, 481,154,920 bulk-oriented TSA records containing 425,076,483,459 base pairs of sequence data, and 102,395,753 bulk-oriented TLS records containing 37,998,534,461 base pairs of sequence data.
GenBank release 242.0 (2/16/2021) is now available on the NCBI FTP site and through Entrez and BLAST. This release has 13.49 trillion bases and 2.34 billion records.
Growth between releases
During the 57 days between the close dates for GenBank Releases 241.0 and 242.0, the ‘traditional’ portion of GenBank grew by 53,287,389,099 base pairs and by 4,773,649 sequence records. During that same period, 65,699 records were updated. An average of 84,901 ‘traditional’ records were added and/or updated per day.
Between releases 241.0 and 242.0, the WGS component of GenBank grew by 439,874,781,594 base pairs and by 45,942,354 sequence records. During the same period, the TSA component of GenBank grew by 15,398,434,562 base pairs and by 16,753,622 sequence records. Finally, the TLS component of GenBank grew by 597,613,549 base pairs and by 2,091,409 sequence records.
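The per-day average and the overall growth can be reproduced from the counts quoted above. The short Python sketch below is only an illustration: the variable names are ours, and rounding to the nearest whole record is an assumption about how the published average was derived.

```python
# Figures quoted above for GenBank Releases 241.0 -> 242.0.
DAYS_BETWEEN_CLOSE_DATES = 57
traditional_added = 4_773_649
traditional_updated = 65_699

# Average 'traditional' records added and/or updated per day.
# Rounding to the nearest record is an assumption.
per_day = round(
    (traditional_added + traditional_updated) / DAYS_BETWEEN_CLOSE_DATES
)

# Growth per component: (base pairs, sequence records).
growth = {
    "traditional": (53_287_389_099, 4_773_649),
    "WGS": (439_874_781_594, 45_942_354),
    "TSA": (15_398_434_562, 16_753_622),
    "TLS": (597_613_549, 2_091_409),
}
total_bp = sum(bp for bp, _ in growth.values())
total_records = sum(n for _, n in growth.values())

print(f"{per_day:,} 'traditional' records added and/or updated per day")
print(f"{total_bp:,} base pairs and {total_records:,} records added overall")
```

Summing the four components this way gives the net growth of the whole release, which the announcement reports only per component.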
The current release has 221,467,827 traditional records containing 723,003,822,007 base pairs of sequence data. There are also 1,517,995,689 WGS records containing 11,830,842,428,018 base pairs of sequence data, 446,397,378 bulk-oriented TSA records containing 392,206,975,386 base pairs of sequence data, and 88,039,152 bulk-oriented TLS records containing 33,036,509,446 base pairs of sequence data.
The current release has 219,055,207 traditional records containing 698,688,094,046 base pairs of sequence data. There are also 1,432,874,252 WGS records containing 9,215,815,569,509 base pairs of sequence data, 435,968,379 bulk-oriented TSA records containing 382,996,662,270 base pairs of sequence data, and 78,177,358 bulk-oriented TLS records containing 28,814,798,868 base pairs of sequence data.
Growth between releases
During the 71 days between the close dates for GenBank Releases 239.0 and 240.0, the ‘traditional’ portion of GenBank grew by 44,631,024,497 base pairs and by 412,969 sequence records. During that same period, 94,006 records were updated. An average of 7,140 ‘traditional’ records were added and/or updated per day.
Between releases 239.0 and 240.0, the WGS component of GenBank grew by 374,166,158,857 base pairs and by 24,751,365 sequence records. The TSA component of GenBank grew by 16,027,711,110 base pairs and by 18,443,812 sequence records. The TLS component of GenBank grew by 989,739,370 base pairs and by 2,495,201 sequence records.
The total number of sequence data files increased by 107 with this release. The divisions are as follows:
- BCT: 22 new files, now a total of 512
- CON: 1 new file, now a total of 218
- INV: 2 new files, now a total of 97
- PAT: 1 new file, now a total of 213
- PLN: 47 new files, now a total of 594
- PRI: 10 new files, now a total of 45
- ROD: 15 new files, now a total of 56
- VRL: 5 new files, now a total of 44
- VRT: 4 new files, now a total of 214
Delivery of GenBank 240.0 was delayed by two weeks
A power surge at the NCBI data center and subsequent downtime for a critical disk storage system led to a nearly two-week delay in the delivery of the data files for GenBank 240.0. There were no data losses, and public-facing systems remained available. However, between the direct impacts of the outage and subsequent efforts to resume processing pipelines, the GenBank release timeline was significantly pushed back. Our apologies for the delay!
New /ncRNA_class value: circRNA
- The allowed values for the /ncRNA_class qualifier have been extended to include “circRNA”, for circular RNA molecules. This change will not appear until (or after) GenBank Release 242.0 in February 2021.
New /circular_RNA qualifier
- Complementing the new “circRNA” ncRNA class, a new qualifier will be introduced in (or after) GenBank Release 242.0 in February 2021.
For downloading purposes, please keep in mind that the uncompressed GenBank Release 240.0 sequence data flatfiles require roughly 1,524 GB. The ASN.1 data files require approximately 958 GB.
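Before starting a download, it may help to confirm that the target volume can hold the uncompressed files. The following Python sketch uses the sizes quoted above; the helper names (`has_room`, `check_target`) are hypothetical, and treating 1 GB as 10^9 bytes is an assumption, since the announcement does not define the unit.

```python
import shutil

# Sizes quoted above for Release 240.0. Treating 1 GB as 10**9 bytes
# is an assumption; the announcement does not define the unit.
FLATFILES_GB = 1_524
ASN1_GB = 958
BYTES_PER_GB = 10**9


def has_room(required_gb: int, free_bytes: int) -> bool:
    """Return True if free_bytes can hold required_gb of uncompressed data."""
    return free_bytes >= required_gb * BYTES_PER_GB


def check_target(path: str = ".") -> dict:
    """Report whether the volume holding `path` can fit each file set."""
    free = shutil.disk_usage(path).free
    return {
        "flatfiles": has_room(FLATFILES_GB, free),
        "asn1": has_room(ASN1_GB, free),
    }


if __name__ == "__main__":
    print(check_target("."))
```

The check covers the uncompressed data only; the compressed downloads need additional room unless they are removed as they are unpacked.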
GenBank release 239.0 (8/18/2020) is now available on the NCBI FTP site. This release has 9.89 trillion bases and 2.12 billion records.
The current release has 218,642,238 traditional records containing 654,057,069,549 base pairs of sequence data. There are also 1,408,122,887 WGS records containing 8,841,649,410,652 base pairs of sequence data, 417,524,567 bulk-oriented TSA records containing 366,968,951,160 base pairs of sequence data, and 75,682,157 bulk-oriented TLS records containing 27,825,059,498 base pairs of sequence data.
Growth between releases
During the 60 days between the close dates for GenBank Releases 238.0 and 239.0, the ‘traditional’ portion of GenBank grew by 226,233,810,648 base pairs and by 1,520,005 sequence records. During that same period, 80,474 records were updated. An average of 26,675 ‘traditional’ records were added and/or updated per day.
Between releases 238.0 and 239.0, the WGS component of GenBank grew by 727,603,148,494 base pairs and by 105,270,272 sequence records. The TSA component of GenBank grew by 7,021,242,098 base pairs and by 7,799,517 sequence records. The TLS component of GenBank grew by 324,424,370 base pairs and by 618,976 sequence records.
The total number of sequence data files increased by 425 with this release. The divisions are as follows:
- BCT: 37 new files, now a total of 490
- ENV: 2 new files, now a total of 62
- INV: 9 new files, now a total of 95
- MAM: 5 new files, now a total of 76
- PAT: 7 new files, now a total of 212
- PLN: 321 new files, now a total of 547
- PRI: 1 new file, now a total of 35
- ROD: 7 new files, now a total of 41
- VRL: 2 new files, now a total of 38
- VRT: 35 new files, now a total of 182
Note: The unusually large increase in the number of PLN-division files is due to an influx of multiple sets of near-gigabase-scale chromosomal records for wheat (Triticum aestivum) and barley (Hordeum vulgare subsp. vulgare).
For downloading purposes, please keep in mind that the uncompressed GenBank Release 239.0 sequence data flatfiles require roughly 1,461 GB. The ASN.1 data files require approximately 938 GB.
The National Library of Medicine and its partners in the International Nucleotide Sequence Database Collaboration (INSDC) have joined together to issue a statement encouraging the scientific community to submit their SARS-CoV-2 sequences to INSDC databases. The databases offer broad open access and integrated data, literature and tools – features that we believe are critical as the research community works together to understand and combat COVID-19. Read the full statement below.
The databases of the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) capture, organize, preserve and present nucleotide sequence data as part of the open scientific record. INSDC member institutions – the EMBL European Bioinformatics Institute (EMBL-EBI), the NIG DNA Data Bank of Japan (NIG-DDBJ) and the National Library of Medicine’s National Center for Biotechnology Information at NIH (NCBI) – are committed to the continued delivery of this critical element of scientific infrastructure.
The global COVID-19 crisis has brought an urgent need for the rapid open sharing of data relating to the outbreak. Most importantly, access to sequence data from the SARS-CoV-2 viral genome is essential for our understanding of the biology and spread of COVID-19. To aid in that effort, all three INSDC members have prioritized processing of SARS-CoV-2 sequence data and have streamlined the submission process.
Availability of data through INSDC databases provides:
- Rapid open access – INSDC quickly makes submitted data freely available to everyone, without restrictions on reuse
- Linkage of raw sequence read data to genome assemblies, providing researchers with the ability to validate the integrity of assemblies and investigate asserted mutations and changes in genome sequences
- Integration of SARS-CoV-2 sequences with the entirety of INSDC data, including related coronavirus genome sequences, enabling comparison across species
- Linkage of sequences to the published literature
- Tools – INSDC partners provide integrated data analysis tools, such as BLAST, enhancing the discovery process
In support of the global response to the COVID-19 crisis, the INSDC calls upon the research community to:
- Submit raw SARS-CoV-2 data to the databases of the INSDC
- Submit consensus/assembled SARS-CoV-2 data to the databases of the INSDC
- Provide information relating to the sequenced isolate or sample as part of the sequence submission; minimally the time and place of isolation/sampling and an isolate/sample identifier should be provided to maximize the value of the sequences.
- In cases where scientists have already established submissions to other databases, these submissions should continue in parallel to the INSDC submission
The integration of INSDC databases with the global bioinformatics data infrastructure, including tools, secondary databases, compute capacity and curation processes, assures the rapid dissemination of data and drives its maximal impact.
In addition to these fundamental roles of INSDC member institutions in the sharing of viral sequence data, each institution has rapidly established COVID-19-specific programs and resources: the European COVID-19 Data Platform from EMBL-EBI, the DDBJ’s Research Data Resources on New Coronavirus and the NCBI SARS-CoV-2 Resources. These resources both demonstrate the connectedness of INSDC databases to broader bioinformatics initiatives and serve to add immediate value to COVID-19 research.
GenBank release 238.0 (6/19/2020) is now available on the NCBI FTP site. This release has 8.93 trillion bases and 2 billion records.
The current release has 217,122,233 traditional records containing 427,823,258,901 base pairs of sequence data. There are also 1,302,852,615 WGS records containing 8,114,046,262,158 base pairs of sequence data, 409,725,050 bulk-oriented TSA records containing 359,947,709,062 base pairs of sequence data, and 75,063,181 bulk-oriented TLS records containing 27,500,635,128 base pairs of sequence data.
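The headline totals for Release 238.0 (8.93 trillion bases and 2 billion records) follow from adding the four components listed above. A minimal Python sketch of that arithmetic, with illustrative variable names of our own:

```python
# Per-component figures quoted above for GenBank Release 238.0:
# (records, base pairs).
components = {
    "traditional": (217_122_233, 427_823_258_901),
    "WGS": (1_302_852_615, 8_114_046_262_158),
    "TSA": (409_725_050, 359_947_709_062),
    "TLS": (75_063_181, 27_500_635_128),
}

total_records = sum(records for records, _ in components.values())
total_bases = sum(bases for _, bases in components.values())

print(f"{total_bases / 1e12:.2f} trillion bases")
print(f"{total_records / 1e9:.2f} billion records")
```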
NCBI is pleased to announce ongoing enhancements to submission of SARS-CoV-2 assembled genomes to GenBank, including a streamlined workflow on the web and a new API option. Both new options mean that you can receive accessions for SARS-CoV-2 data submissions more quickly!
A streamlined workflow with improved interface and enhanced validation on both web and API saves you time and effort and, most importantly, makes it possible to get SARS-CoV-2 accession numbers and public release of data within hours. In addition, we automatically annotate all SARS-CoV-2 genomes to produce standardized, consistent annotation, which saves you time and benefits researchers who find your data valuable.
As we described in an earlier post, GenBank uses average nucleotide identity (ANI) analysis to find and correct misidentified prokaryotic genome assemblies. You can now access ANI data for the more than 600,000 GenBank bacterial and archaeal genome assemblies through a downloadable report (ANI_report_prokaryotes.txt) available from the genomes/ASSEMBLY_REPORTS area of the FTP site. The README describes the contents of the report in detail. You can use the ANI data to evaluate the taxonomic identity of genome assemblies of interest for yourself.
The new ANI_report_prokaryotes.txt replaces the older ANI_report_bacteria.txt in the same directory. We are no longer updating the ANI_report_bacteria.txt file and will remove it after 31st May 2020.
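As a rough illustration of working with the downloaded report, the Python sketch below reads a tab-delimited file into dictionaries and lists the rows where two columns disagree. It assumes the report is tab-delimited with a header line that may start with "#". The column names in the sample (`assembly`, `declared_species`, `best_match_species`) are placeholders invented for this example, not the real schema; the README describes the actual columns.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def read_report(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row of a tab-delimited report.

    Assumes the first non-empty line is the header, optionally
    prefixed with '#'. Check the README for the real layout.
    """
    header = None
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if header is None:
            row[0] = row[0].lstrip("#")
            header = [cell.strip() for cell in row]
            continue
        yield dict(zip(header, row))


# Made-up sample: column names and values are placeholders,
# not the real report schema.
SAMPLE = (
    "# assembly\tdeclared_species\tbest_match_species\n"
    "ASM_A\tSpecies one\tSpecies one\n"
    "ASM_B\tSpecies two\tSpecies three\n"
)

rows = list(read_report(io.StringIO(SAMPLE)))
mismatched = [
    r["assembly"]
    for r in rows
    if r["declared_species"] != r["best_match_species"]
]
print(mismatched)
```

For the real file, pass a handle from `open(path, newline="")` on your local copy of ANI_report_prokaryotes.txt and replace the placeholder column names with those given in the README.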