The current release has 221,467,827 traditional records containing 723,003,822,007 base pairs of sequence data. There are also 1,517,995,689 WGS records containing 11,830,842,428,018 base pairs of sequence data, 446,397,378 bulk-oriented TSA records containing 392,206,975,386 base pairs of sequence data, and 88,039,152 bulk-oriented TLS records containing 33,036,509,446 base pairs of sequence data.
The current release has 219,055,207 traditional records containing 698,688,094,046 base pairs of sequence data. There are also 1,432,874,252 WGS records containing 9,215,815,569,509 base pairs of sequence data, 435,968,379 bulk-oriented TSA records containing 382,996,662,270 base pairs of sequence data, and 78,177,358 bulk-oriented TLS records containing 28,814,798,868 base pairs of sequence data.
Growth between releases
During the 71 days between the close dates for GenBank Releases 239.0 and 240.0, the ‘traditional’ portion of GenBank grew by 44,631,024,497 base pairs and by 412,969 sequence records. During that same period, 94,006 records were updated. An average of 7,140 ‘traditional’ records were added and/or updated per day.
Between releases 239.0 and 240.0, the WGS component of GenBank grew by 374,166,158,857 base pairs and by 24,751,365 sequence records. The TSA component of GenBank grew by 16,027,711,110 base pairs and by 18,443,812 sequence records. The TLS component of GenBank grew by 989,739,370 base pairs and by 2,495,201 sequence records.
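The growth figures for the traditional portion can be reproduced from the record and base-pair totals quoted for Releases 239.0 and 240.0. The sketch below is only an illustration: the variable and function names are made up for this example, and it assumes the daily average is simply (new records + updated records) divided by the number of days, which happens to match the stated figure.

```python
# Totals for the 'traditional' portion, copied from the release announcements.
TRADITIONAL_239 = {"records": 218_642_238, "base_pairs": 654_057_069_549}
TRADITIONAL_240 = {"records": 219_055_207, "base_pairs": 698_688_094_046}
UPDATED_RECORDS = 94_006          # records updated between the two close dates
DAYS_BETWEEN_CLOSE_DATES = 71


def growth(previous, current):
    """Return the per-field difference between two releases."""
    return {key: current[key] - previous[key] for key in current}


def daily_average(new_records, updated_records, days):
    """Assumed formula: records added plus records updated, per day."""
    return (new_records + updated_records) / days


delta = growth(TRADITIONAL_239, TRADITIONAL_240)
per_day = daily_average(delta["records"], UPDATED_RECORDS, DAYS_BETWEEN_CLOSE_DATES)

print(f"{delta['records']:,} new records, {delta['base_pairs']:,} new base pairs")
print(f"about {per_day:,.0f} records added or updated per day")
```

The same two helpers could be pointed at the WGS, TSA, or TLS totals to check those components.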
The total number of sequence data files increased by 107 with this release. The divisions are as follows:
- BCT: 22 new files, now a total of 512
- CON: 1 new file, now a total of 218
- INV: 2 new files, now a total of 97
- PAT: 1 new file, now a total of 213
- PLN: 47 new files, now a total of 594
- PRI: 10 new files, now a total of 45
- ROD: 15 new files, now a total of 56
- VRL: 5 new files, now a total of 44
- VRT: 4 new files, now a total of 214
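As a quick sanity check, the per-division counts listed above can be summed and compared with the stated overall increase. This is a minimal sketch; the dictionary name is hypothetical and the values are copied from the list.

```python
# New sequence data files per division for Release 240.0, as listed above.
NEW_FILES_240 = {
    "BCT": 22, "CON": 1, "INV": 2, "PAT": 1, "PLN": 47,
    "PRI": 10, "ROD": 15, "VRL": 5, "VRT": 4,
}

total_new = sum(NEW_FILES_240.values())
largest_division = max(NEW_FILES_240, key=NEW_FILES_240.get)

print(f"{total_new} new files in total; most were added to {largest_division}")
```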
Delivery of GenBank 240.0 was delayed by two weeks
A power surge at the NCBI data center and subsequent downtime for a critical disk storage system led to a nearly two-week delay in the delivery of the data files for GenBank 240.0. There were no data losses, and public-facing systems remained available. However, between the direct impacts of the outage and subsequent efforts to resume processing pipelines, the GenBank release timeline was significantly pushed back. Our apologies for the delay!
New /ncRNA_class value: circRNA
- The allowed values for the /ncRNA_class qualifier have been extended to include “circRNA”, for circular RNA molecules. This change will take effect in (or after) GenBank Release 242.0 in February 2021.
New /circular_RNA qualifier
- Complementing the new “circRNA” ncRNA class, a new qualifier will be introduced in (or after) GenBank Release 242.0 in February 2021.
For downloading purposes, please keep in mind that the uncompressed GenBank Release 240.0 sequence data flatfiles require roughly 1,524 GB. The ASN.1 data files require approximately 958 GB.
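Before starting a full download, it may help to confirm that the target volume has room for both file sets. The following is a rough sketch, not an NCBI tool: it assumes "GB" means 10^9 bytes (the announcement does not say), the helper names are invented for this example, and it checks the current directory only as a placeholder for the real download location.

```python
import shutil

# Uncompressed sizes quoted for Release 240.0.
FLATFILES_GB = 1_524
ASN1_GB = 958
BYTES_PER_GB = 10**9  # assumption: decimal gigabytes


def required_bytes(*sizes_gb):
    """Total space needed, in bytes, for the given sizes in GB."""
    return sum(sizes_gb) * BYTES_PER_GB


def has_room(path, needed_bytes):
    """True if the filesystem holding `path` has at least `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes


needed = required_bytes(FLATFILES_GB, ASN1_GB)
print(f"need about {needed / BYTES_PER_GB:,.0f} GB; enough room here: {has_room('.', needed)}")
```

The compressed files on the FTP site are smaller than these figures, so this check is conservative for the download itself and relevant mainly once the files are unpacked.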
GenBank release 239.0 (8/18/2020) is now available on the NCBI FTP site. This release has 9.89 trillion bases and 2.12 billion records.
The current release has 218,642,238 traditional records containing 654,057,069,549 base pairs of sequence data. There are also 1,408,122,887 WGS records containing 8,841,649,410,652 base pairs of sequence data, 417,524,567 bulk-oriented TSA records containing 366,968,951,160 base pairs of sequence data, and 75,682,157 bulk-oriented TLS records containing 27,825,059,498 base pairs of sequence data.
Growth between releases
During the 60 days between the close dates for GenBank Releases 238.0 and 239.0, the ‘traditional’ portion of GenBank grew by 226,233,810,648 base pairs and by 1,520,005 sequence records. During that same period, 80,474 records were updated. An average of 26,675 ‘traditional’ records were added and/or updated per day.
Between releases 238.0 and 239.0, the WGS component of GenBank grew by 727,603,148,494 base pairs and by 105,270,272 sequence records. The TSA component of GenBank grew by 7,021,242,098 base pairs and by 7,799,517 sequence records. The TLS component of GenBank grew by 324,424,370 base pairs and by 618,976 sequence records.
The total number of sequence data files increased by 425 with this release. The divisions are as follows:
- BCT: 37 new files, now a total of 490
- ENV: 2 new files, now a total of 62
- INV: 9 new files, now a total of 95
- MAM: 5 new files, now a total of 76
- PAT: 7 new files, now a total of 212
- PLN: 321 new files, now a total of 547
- PRI: 1 new file, now a total of 35
- ROD: 7 new files, now a total of 41
- VRL: 2 new files, now a total of 38
- VRT: 35 new files, now a total of 182
Note: The unusually large increase in the number of PLN-division files is due to an influx of multiple sets of near-gigabase-scale chromosomal records for wheat (Triticum aestivum) and barley (Hordeum vulgare subsp. vulgare).
For downloading purposes, please keep in mind that the uncompressed GenBank Release 239.0 sequence data flatfiles require roughly 1,461 GB. The ASN.1 data files require approximately 938 GB.
The National Library of Medicine and its partners in the International Nucleotide Sequence Database Collaboration (INSDC) have joined together to issue a statement encouraging the scientific community to submit their SARS-CoV-2 sequences to INSDC databases. The databases offer broad open access and integrated data, literature and tools – features that we believe are critical as the research community works together to understand and combat COVID-19. Read the full statement below.
The databases of the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) capture, organize, preserve and present nucleotide sequence data as part of the open scientific record. INSDC member institutions – the EMBL European Bioinformatics Institute (EMBL-EBI), the NIG DNA Data Bank of Japan (NIG-DDBJ) and the National Library of Medicine’s National Center for Biotechnology Information at NIH (NCBI) – are committed to the continued delivery of this critical element of scientific infrastructure.
The global COVID-19 crisis has brought an urgent need for the rapid open sharing of data relating to the outbreak. Most importantly, access to sequence data from the SARS-CoV-2 viral genome is essential for our understanding of the biology and spread of COVID-19. To aid in that effort, all three INSDC members have prioritized processing of SARS-CoV-2 sequence data and have streamlined the submission process.
Availability of data through INSDC databases provides:
- Rapid open access – INSDC quickly makes submitted data freely available to everyone, without restrictions on reuse
- Linkage of raw sequence read data to genome assemblies, providing researchers with the ability to validate the integrity of assemblies and investigate asserted mutations and changes in genome sequences
- Integration of SARS-CoV-2 sequences with the entirety of INSDC data, including related coronavirus genome sequences, enabling comparison across species
- Linkage of sequences to the published literature
- Tools – INSDC partners provide integrated data analysis tools, such as BLAST, enhancing the discovery process
In support of the global response to the COVID-19 crisis, the INSDC calls upon the research community to:
- Submit raw SARS-CoV-2 data to the databases of the INSDC
- Submit consensus/assembled SARS-CoV-2 data to the databases of the INSDC
- Provide information relating to the sequenced isolate or sample as part of the sequence submission; minimally the time and place of isolation/sampling and an isolate/sample identifier should be provided to maximize the value of the sequences.
- In cases where scientists have already established submissions to other databases, these submissions should continue in parallel to the INSDC submission
The integration of INSDC databases with the global bioinformatics data infrastructure, including tools, secondary databases, compute capacity and curation processes, assures the rapid dissemination of data and drives its maximal impact.
In addition to these fundamental roles of INSDC member institutions in the sharing of viral sequence data, each institution has rapidly established COVID-19-specific programs and resources: the European COVID-19 Data Platform from EMBL-EBI, the DDBJ’s Research Data Resources on New Coronavirus and the NCBI SARS-CoV-2 Resources. These resources both demonstrate the connectedness of INSDC databases to broader bioinformatics initiatives and serve to add immediate value to COVID-19 research.
GenBank release 238.0 (6/19/2020) is now available on the NCBI FTP site. This release has 8.93 trillion bases and 2 billion records.
The current release has 217,122,233 traditional records containing 427,823,258,901 base pairs of sequence data. There are also 1,302,852,615 WGS records containing 8,114,046,262,158 base pairs of sequence data, 409,725,050 bulk-oriented TSA records containing 359,947,709,062 base pairs of sequence data, and 75,063,181 bulk-oriented TLS records containing 27,500,635,128 base pairs of sequence data.
NCBI is pleased to announce ongoing enhancements to submission of SARS-CoV-2 assembled genomes to GenBank, including a streamlined workflow on the web and a new API option. Both new options mean that you can receive accessions for SARS-CoV-2 data submissions more quickly!
A streamlined workflow with improved interface and enhanced validation on both web and API saves you time and effort and, most importantly, makes it possible to get SARS-CoV-2 accession numbers and public release of data within hours. In addition, we automatically annotate all SARS-CoV-2 genomes to produce standardized, consistent annotation which saves you time and benefits researchers who find your data valuable.
As we described in an earlier post, GenBank uses average nucleotide identity (ANI) analysis to find and correct misidentified prokaryotic genome assemblies. You can now access ANI data for the more than 600,000 GenBank bacterial and archaeal genome assemblies through a downloadable report (ANI_report_prokaryotes.txt) available from the genomes/ASSEMBLY_REPORTS area of the FTP site. The README describes the contents of the report in detail. You can use the ANI data to evaluate the taxonomic identity of genome assemblies of interest for yourself.
The new ANI_report_prokaryotes.txt replaces the older ANI_report_bacteria.txt in the same directory. We are no longer updating the ANI_report_bacteria.txt file and will remove it after 31st May 2020.
GenBank release 237.0 (4/21/2020) is now available on the NCBI FTP site. This release has over 8.58 trillion bases and 1.95 billion records.
The release has 216,531,829 traditional records containing 415,770,027,949 base pairs of sequence data. There are also 1,267,547,429 WGS records containing 7,788,133,221,338 base pairs of sequence data, 396,392,280 bulk-oriented TSA records containing 349,692,751,528 base pairs of sequence data, and 65,521,132 bulk-oriented TLS records containing 24,615,270,313 base pairs of sequence data.
During the 63 days between the close dates for GenBank Releases 236.0 and 237.0, the ‘traditional’ portion of GenBank grew by 16,393,173,077 base pairs and by 317,614 sequence records. During that same period, 55,268 records were updated. An average of 5,919 ‘traditional’ records were added and/or updated per day.
Between releases 236.0 and 237.0, the WGS component of GenBank grew by 819,141,955,586 base pairs and by 60,826,741 sequence records. The TSA component of GenBank grew by 8,698,462,463 base pairs and by 9,747,409 sequence records. The TLS component of GenBank grew by 10,945,592,117 base pairs and by 31,483,761 sequence records.
The total number of sequence data files increased by 59 with this release. The divisions are as follows:
- BCT: 14 new files, now a total of 432
- CON: 1 new file, now a total of 217
- ENV: 1 new file, now a total of 60
- INV: 6 new files, now a total of 86
- MAM: 15 new files, now a total of 64
- PLN: 8 new files, now a total of 212
- VRT: 14 new files, now a total of 175
For downloading purposes, the uncompressed GenBank release 237.0 flat files require roughly 1,142 GB, including the sequence files and the *.txt files. The ASN.1 data files require approximately 844 GB.
The 2020 Nucleic Acids Research database issue features papers from NCBI staff on GenBank, ClinVar and more. These papers are also available on PubMed. To read an article, click on the PMID number listed below.
“Database resources of the National Center for Biotechnology Information”
by Eric W Sayers, Jeff Beck, J Rodney Brister, Evan E Bolton, Kathi Canese et al. (PMID: 31602479)
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 38 distinct databases. This article provides a brief overview of the NCBI Entrez system of databases, followed by a summary of resources that were either introduced or significantly updated in the past year, including PubMed, PMC, Bookshelf, BLAST databases and more!
As the global health emergency around severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, formerly 2019-nCoV) continues, we are playing a key role in providing the biomedical community with free and easy access to genome sequences from the coronavirus. You can quickly access these data through the NCBI search (Figure 1).
Figure 1. NCBI search results for the term “SARS-COV-2” showing the schematic map of the viral assembly and annotation and buttons that link to the data in the NCBI Virus resource, a specialized BLAST page that searches Betacoronavirus sequences, and the reference assembly download. The bottom panel provides links to the CDC website for COVID-19 information and a link to GenBank®/SRA sequence data.