Tag: GenBank

GenBank release 239 is available

GenBank release 239.0 (8/18/2020) is now available on the NCBI FTP site. This release has 9.89 trillion bases and 2.12 billion records.

The current release has 218,642,238 traditional records containing 654,057,069,549 base pairs of sequence data. There are also 1,408,122,887 WGS records containing 8,841,649,410,652 base pairs of sequence data, 417,524,567 bulk-oriented TSA records containing 366,968,951,160 base pairs of sequence data, and 75,682,157 bulk-oriented TLS records containing 27,825,059,498 base pairs of sequence data.
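The headline totals can be reproduced from the four per-category figures above. Here is a minimal Python sketch that uses only the numbers quoted in this post; the dictionary layout and variable names are illustrative choices, not part of any NCBI tool or file format.

```python
# Record and base-pair counts for GenBank release 239.0, as quoted above.
# The dictionary layout is an illustrative choice, not an NCBI format.
release_239 = {
    "traditional": (218_642_238, 654_057_069_549),
    "WGS": (1_408_122_887, 8_841_649_410_652),
    "TSA": (417_524_567, 366_968_951_160),
    "TLS": (75_682_157, 27_825_059_498),
}

# Sum each column across the four categories.
total_records = sum(records for records, _ in release_239.values())
total_bases = sum(bases for _, bases in release_239.values())

print(f"{total_bases / 1e12:.2f} trillion bases")
print(f"{total_records / 1e9:.2f} billion records")
```

Rounded to two decimal places, the sums match the 9.89 trillion bases and 2.12 billion records given in the summary.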

Growth between releases

During the 60 days between the close dates for GenBank Releases 238.0 and 239.0, the ‘traditional’ portion of GenBank grew by 226,233,810,648 base pairs and by 1,520,005 sequence records. During that same period, 80,474 records were updated. An average of 26,675 ‘traditional’ records were added and/or updated per day.
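The growth and per-day figures follow directly from the counts quoted on this page: the base-pair growth is the difference between the traditional totals for releases 239.0 and 238.0, and the daily average is new plus updated records divided by the 60-day interval. A short Python check, with illustrative variable names:

```python
# Traditional base-pair totals quoted in the release 239 and 238 posts.
bases_239 = 654_057_069_549
bases_238 = 427_823_258_901

# Counts for the 60-day interval, as stated in the paragraph above.
days = 60
new_records = 1_520_005    # traditional records added
updated_records = 80_474   # traditional records updated

growth = bases_239 - bases_238
per_day = round((new_records + updated_records) / days)

print(f"growth: {growth:,} base pairs")
print(f"added and/or updated per day: {per_day:,}")
```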

Between releases 238.0 and 239.0, the WGS component of GenBank grew by 727,603,148,494 base pairs and by 105,270,272 sequence records. The TSA component of GenBank grew by 7,021,242,098 base pairs and by 7,799,517 sequence records. The TLS component of GenBank grew by 324,424,370 base pairs and by 618,976 sequence records.

The total number of sequence data files increased by 425 with this release. The divisions are as follows:

  • BCT: 37 new files, now a total of 490
  • ENV: 2 new files, now a total of 62
  • INV: 9 new files, now a total of 95
  • MAM: 5 new files, now a total of 76
  • PAT: 7 new files, now a total of 212
  • PLN: 321 new files, now a total of 547
  • PRI: 1 new file, now a total of 35
  • ROD: 7 new files, now a total of 41
  • VRL: 2 new files, now a total of 38
  • VRT: 35 new files, now a total of 182

Note: The unusually large increase in the number of PLN-division files is due to an influx of multiple sets of near-gigabase-scale chromosomal records for wheat (Triticum aestivum) and barley (Hordeum vulgare subsp. vulgare).

For downloading purposes, please keep in mind that the uncompressed GenBank Release 239.0 sequence data flatfiles require roughly 1,461 GB. The ASN.1 data files require approximately 938 GB.
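Before starting a download of this size, it can help to confirm that the target filesystem has enough free space. The sketch below uses Python's standard `shutil.disk_usage`; the `has_room` helper and the `"."` target directory are hypothetical, and it assumes the quoted sizes are decimal gigabytes (10^9 bytes), which the post does not specify.

```python
import shutil

# Approximate uncompressed sizes quoted above, in GB.
FLATFILE_GB = 1_461
ASN1_GB = 938


def has_room(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_gb` GB free (assumption: 1 GB = 10**9 bytes)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    target = "."  # hypothetical download directory
    for label, size in [("flatfiles", FLATFILE_GB), ("ASN.1", ASN1_GB)]:
        status = "enough" if has_room(target, size) else "not enough"
        print(f"{label}: {status} free space for ~{size} GB")
```

The compressed files on the FTP site are smaller than these figures, so treat the check as an upper bound for the unpacked data only.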

More information about GenBank release 239.0 is available in the release notes, as well as in the README files in the genbank and ASN.1 (ncbi-asn1) directories on FTP.

INSDC Statement on SARS-CoV-2 sequence data sharing during COVID-19

The National Library of Medicine and its partners in the International Nucleotide Sequence Database Collaboration (INSDC) have joined together to issue a statement encouraging the scientific community to submit their SARS-CoV-2 sequences to INSDC databases. The databases offer broad open access and integrated data, literature and tools – features that we believe are critical as the research community works together to understand and combat COVID-19. Read the full statement below.


The databases of the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) capture, organize, preserve and present nucleotide sequence data as part of the open scientific record. INSDC member institutions – the EMBL European Bioinformatics Institute (EMBL-EBI), the NIG DNA Data Bank of Japan (NIG-DDBJ) and the National Library of Medicine’s National Center for Biotechnology Information at NIH (NCBI) – are committed to the continued delivery of this critical element of scientific infrastructure.

The global COVID-19 crisis has brought an urgent need for the rapid open sharing of data relating to the outbreak. Most importantly, access to sequence data from the SARS-CoV-2 viral genome is essential for our understanding of the biology and spread of COVID-19. To aid in that effort, all three INSDC members have prioritized processing of SARS-CoV-2 sequence data and have streamlined the submission process.

Availability of data through INSDC databases provides:

    • Rapid open access – INSDC quickly makes submitted data freely available to everyone, without restrictions on reuse
    • Linkage of raw sequence read data to genome assemblies, providing researchers with the ability to validate the integrity of assemblies and investigate asserted mutations and changes in genome sequences
    • Integration of SARS-CoV-2 sequences with the entirety of INSDC data, including related coronavirus genome sequences, enabling comparison across species
    • Linkage of sequences to the published literature
    • Tools – INSDC partners provide integrated data analysis tools, such as BLAST, enhancing the discovery process

In support of the global response to the COVID-19 crisis, the INSDC calls upon the research community to:

    • Submit raw SARS-CoV-2 data to the databases of the INSDC
    • Submit consensus/assembled SARS-CoV-2 data to the databases of the INSDC
    • Provide information relating to the sequenced isolate or sample as part of the sequence submission; minimally the time and place of isolation/sampling and an isolate/sample identifier should be provided to maximize the value of the sequences.
    • In cases where scientists have already established submissions to other databases, these submissions should continue in parallel to the INSDC submission

The integration of INSDC databases with the global bioinformatics data infrastructure, including tools, secondary databases, compute capacity and curation processes, assures the rapid dissemination of data and drives its maximal impact.

In addition to these fundamental roles of INSDC member institutions in the sharing of viral sequence data, each institution has rapidly established COVID-19-specific programs and resources: the European COVID-19 Data Platform from EMBL-EBI, the DDBJ’s Research Data Resources on New Coronavirus and the NCBI SARS-CoV-2 Resources. These resources both demonstrate the connectedness of INSDC databases to broader bioinformatics initiatives and serve to add immediate value to COVID-19 research.

Guy Cochrane (EMBL-EBI), Ilene Karsch-Mizrachi (NCBI-NLM-NIH), & Masanori Arita (DDBJ) on behalf of the International Nucleotide Sequence Database Collaboration

GenBank release 238 is available

GenBank release 238.0 (6/19/2020) is now available on the NCBI FTP site. This release has 8.93 trillion bases and 2 billion records.

The current release has 217,122,233 traditional records containing 427,823,258,901 base pairs of sequence data. There are also 1,302,852,615 WGS records containing 8,114,046,262,158 base pairs of sequence data, 409,725,050 bulk-oriented TSA records containing 359,947,709,062 base pairs of sequence data, and 75,063,181 bulk-oriented TLS records containing 27,500,635,128 base pairs of sequence data.

New GenBank submission options for SARS-CoV-2 submitters

NCBI is pleased to announce ongoing enhancements to submission of SARS-CoV-2 assembled genomes to GenBank, including a streamlined workflow on the web and a new API option. Both new options mean that you can receive accessions for SARS-CoV-2 data submissions more quickly!

A streamlined workflow with improved interface and enhanced validation on both web and API saves you time and effort and, most importantly, makes it possible to get SARS-CoV-2 accession numbers and public release of data within hours. In addition, we automatically annotate all SARS-CoV-2 genomes to produce standardized, consistent annotation which saves you time and benefits researchers who find your data valuable.

Expanded average nucleotide identity analysis now available for prokaryotic genome assemblies

As we described in an earlier post, GenBank uses average nucleotide identity (ANI) analysis to find and correct misidentified prokaryotic genome assemblies. You can now access ANI data for the more than 600,000 GenBank bacterial and archaeal genome assemblies through a downloadable report (ANI_report_prokaryotes.txt) available from the genomes/ASSEMBLY_REPORTS area of the FTP site. The README describes the contents of the report in detail. You can use the ANI data to evaluate the taxonomic identity of genome assemblies of interest for yourself.

The new ANI_report_prokaryotes.txt replaces the older ANI_report_bacteria.txt in the same directory. We are no longer updating the ANI_report_bacteria.txt file and will remove it after 31st May 2020.
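If you want to work with the report programmatically, a generic reader for a tab-delimited text file is a reasonable starting point. The sketch below is not based on the actual file layout: it assumes a single header line, optionally prefixed with `#`, followed by tab-separated rows, and the column names and values in the sample are made-up placeholders. Check the README for the real column names before adapting it.

```python
import csv
import io


def read_report(handle):
    """Yield one dict per data row of a tab-delimited report.

    Assumption: the first line is a header, optionally prefixed with '#'.
    """
    header = handle.readline().lstrip("#").strip().split("\t")
    header = [name.strip() for name in header]
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row:  # skip blank lines
            yield dict(zip(header, row))


# Made-up two-row sample; the column names and values are placeholders,
# not the real columns of ANI_report_prokaryotes.txt.
sample = io.StringIO(
    "# accession\torganism\tstatus\n"
    "ASM_0001\tExample bacterium A\tOK\n"
    "ASM_0002\tExample bacterium B\tMismatch\n"
)

# Collect the rows whose (hypothetical) status column is not "OK".
flagged = [r["accession"] for r in read_report(sample) if r["status"] != "OK"]
print(flagged)
```

For a downloaded copy, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")`, where `path` is wherever you saved the report.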

GenBank release 237 is available

GenBank release 237.0 (4/21/2020) is now available on the NCBI FTP site. This release has 8.58 trillion bases and 1.95 billion records.

The release has 216,531,829 traditional records containing 415,770,027,949 base pairs of sequence data. There are also 1,267,547,429 WGS records containing 7,788,133,221,338 base pairs of sequence data, 396,392,280 bulk-oriented TSA records containing 349,692,751,528 base pairs of sequence data, and 65,521,132 bulk-oriented TLS records containing 24,615,270,313 base pairs of sequence data.

During the 63 days between the close dates for GenBank Releases 236.0 and 237.0, the ‘traditional’ portion of GenBank grew by 16,393,173,077 base pairs and by 317,614 sequence records. During that same period, 55,268 records were updated. An average of 5,919 ‘traditional’ records were added and/or updated per day.

Between releases 236.0 and 237.0, the WGS component of GenBank grew by 819,141,955,586 base pairs and by 60,826,741 sequence records. The TSA component of GenBank grew by 8,698,462,463 base pairs and by 9,747,409 sequence records. The TLS component of GenBank grew by 10,945,592,117 base pairs and by 31,483,761 sequence records.

The total number of sequence data files increased by 59 with this release. The divisions are as follows:

  • BCT: 14 new files, now a total of 432
  • CON: 1 new file, now a total of 217
  • ENV: 1 new file, now a total of 60
  • INV: 6 new files, now a total of 86
  • MAM: 15 new files, now a total of 64
  • PLN: 8 new files, now a total of 212
  • VRT: 14 new files, now a total of 175

For downloading purposes, the uncompressed GenBank release 237.0 flat files require roughly 1,142 GB, including the sequence files and the *.txt files. The ASN.1 data files require approximately 844 GB.

More information about GenBank release 237.0 is available in the Release Notes, as well as in the README files in the GenBank and ASN.1 (ncbi-asn1) directories on FTP.

Read about NCBI resources in 2020 Nucleic Acids Research database issue

The 2020 Nucleic Acids Research database issue features papers from NCBI staff on GenBank, ClinVar and more. These papers are also available on PubMed. To read an article, click on the PMID number listed below.

“Database resources of the National Center for Biotechnology Information”

by Eric W Sayers, Jeff Beck, J Rodney Brister, Evan E Bolton, Kathi Canese et al. (PMID: 31602479)

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 38 distinct databases. This article provides a brief overview of the NCBI Entrez system of databases, followed by a summary of resources that were either introduced or significantly updated in the past year, including PubMed, PMC, Bookshelf, BLAST databases and more!

Rapid access to SARS-CoV-2 data from the current public health emergency

As the global health emergency around the Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, formerly 2019-nCoV) continues, we continue to play a key role in providing the biomedical community free and easy access to genome sequences from the coronavirus. You can quickly access these data through the NCBI search (Figure 1).

Figure 1. NCBI search results for the term “SARS-COV-2” showing the schematic map of the viral assembly and annotation and buttons that link to the data in the NCBI Virus resource, a specialized BLAST page that searches Betacoronavirus sequences, and the reference assembly download. The bottom panel provides links to the CDC website for COVID-19 information and a link to GenBank®/SRA sequence data.

Dengue virus submission improvements now live!

When there is an outbreak of dengue fever in the world, it’s critical that viral genomic sequence data be submitted by researchers and made available to analyze as soon as possible.  You can now submit Dengue virus sequences to GenBank using a new workflow (Figure 1) in the Submission Portal designed to help make these data available as soon as possible.  The streamlined process, similar to the one described in a previous post for animal mitochondrial COX1 sequences, has an improved interface, enhanced validation, and automatic annotation that saves you time and effort.

Figure 1. The Submission Portal pages for targeted sequence submission workflows. Top panel. The new submission page for entering the workflow. Bottom panel. Submission Portal page with the Dengue virus submission option selected (boxed in red).  The service has options for other targeted submissions including mitochondrial COX1 from multicellular animals (metazoa), ribosomal RNA (rRNA), rRNA-ITS, Influenza virus, and Norovirus sequences.

This update is part of a larger and ongoing effort to consolidate GenBank submissions in a central location. In addition to Dengue virus data, you can also submit Influenza A, B, C and Norovirus sequences as well as other targeted sequences including mitochondrial COX1 genes from multicellular animals (metazoa), ribosomal RNA (rRNA), and rRNA-ITS through the options on the Submission Portal. You should submit other types of sequence data, including other virus sequences, to GenBank using BankIt or tbl2asn.

You can use the search feature on the Submission Portal to find the appropriate submission tool for your data.

Novel coronavirus complete genome from the Wuhan outbreak now available in GenBank

Updated!

Get rapid access to Wuhan coronavirus (2019-nCoV) sequence data from the current outbreak as it becomes available. We will continue to update the page with newly released data.

The complete annotated genome sequence of the novel coronavirus associated with the outbreak of pneumonia in Wuhan, China is now available from GenBank for free and easy access by the global biomedical community. Figure 1 shows the relationship of the Wuhan virus to selected coronaviruses.

Figure 1.  Phylogenetic tree showing the relationship of Wuhan-Hu-1 (circled in red) to selected coronaviruses. Nucleotide alignment was done with MUSCLE 3.8. The phylogenetic tree was estimated with MrBayes 3.2.6 with parameters for GTR+g+i.  The scale bar indicates estimated substitutions per site, and all branch support values are 99.3% or higher.
