GenBank release 236 is available

GenBank release 236.0 (2/20/2020) is now available on the NCBI FTP site. This release has over 7.72 trillion bases and 1.84 billion records.

The release has 216,214,215 traditional records containing 399,376,854,872 base pairs of sequence data. There are also 1,206,720,688 WGS records containing 6,968,991,265,752 base pairs of sequence data, 386,644,871 bulk-oriented TSA records containing 340,994,289,065 base pairs of sequence data, and 34,037,371 bulk-oriented TLS records containing 13,669,678,196 base pairs of sequence data.
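
The headline totals can be cross-checked by summing the four components listed above. The following is a minimal Python sketch; the dictionary layout and variable names are illustrative, and only the figures themselves come from this announcement:

```python
# Component figures copied from the release 236.0 announcement:
# component name -> (records, base pairs)
components = {
    "traditional": (216_214_215, 399_376_854_872),
    "WGS": (1_206_720_688, 6_968_991_265_752),
    "TSA": (386_644_871, 340_994_289_065),
    "TLS": (34_037_371, 13_669_678_196),
}

total_records = sum(records for records, _ in components.values())
total_bases = sum(bases for _, bases in components.values())

print(f"{total_records:,} records ({total_records / 1e9:.2f} billion)")
print(f"{total_bases:,} bases ({total_bases / 1e12:.2f} trillion)")
```

The sums come to 1,843,617,145 records and 7,723,032,087,885 bases, which agree with the rounded figures of 1.84 billion records and 7.72 trillion bases.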

During the 70 days between the close dates for GenBank Releases 235.0 and 236.0, the ‘traditional’ portion of GenBank grew by 10,959,596,863 base pairs and by 881,195 sequence records. During that same period, 62,552 records were updated. An average of 13,482 ‘traditional’ records were added and/or updated per day.
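
The daily average follows from the added and updated record counts divided by the number of days. A short Python sketch of that arithmetic, with variable names chosen here for illustration:

```python
# Figures for the 'traditional' portion, taken from the paragraph above.
days_between_releases = 70
records_added = 881_195
records_updated = 62_552

# Records added and/or updated per day, averaged over the interval.
per_day = (records_added + records_updated) / days_between_releases

print(f"{per_day:,.0f} records added and/or updated per day")
```

Rounded to the nearest whole record, this reproduces the stated average of 13,482.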

Between releases 235.0 and 236.0, the WGS component of GenBank grew by 691,440,065,062 base pairs and by 79,696,818 sequence records. The TSA component of GenBank grew by 15,561,272,936 base pairs and by 19,451,027 sequence records. The TLS component of GenBank grew by 2,389,081,582 base pairs and by 5,810,191 sequence records.

The VRT component of GenBank decreased due to the suppression of 40 chromosomal records for the Coregonus sp. ‘balchen’ genome, with 2.1 Gbp of sequence data. This organism is already represented by underlying sequence contigs plus chromosomal CON-division/scaffold records built from those contigs. The 40 suppressed records are redundant with those scaffolds, and their suppression resulted in fewer VRT-division files.

The total number of sequence data files increased by 48 with this release. The divisions are as follows:

  • BCT: 17 new files, now a total of 418
  • CON: 4 new files, now a total of 216
  • ENV: 1 new file, now a total of 59
  • MAM: 10 new files, now a total of 49
  • PAT: 2 new files, now a total of 204
  • PLN: 18 new files, now a total of 204
  • VRL: 1 new file, now a total of 36
  • VRT: 5 fewer files, now a total of 161
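
The net change of 48 files can be checked against the per-division list above, and the file counts before this release follow by subtraction. A minimal Python sketch, assuming an illustrative dictionary layout; the numbers are those listed above:

```python
# Division -> (change in file count, total after release 236.0)
division_changes = {
    "BCT": (17, 418),
    "CON": (4, 216),
    "ENV": (1, 59),
    "MAM": (10, 49),
    "PAT": (2, 204),
    "PLN": (18, 204),
    "VRL": (1, 36),
    "VRT": (-5, 161),  # fewer files after the Coregonus suppression
}

net_change = sum(delta for delta, _ in division_changes.values())

# File counts before this release: new total minus the change.
previous_totals = {
    name: total - delta for name, (delta, total) in division_changes.items()
}

print(f"Net change in sequence data files: {net_change:+d}")
for name, before in previous_totals.items():
    print(f"{name}: {before} -> {division_changes[name][1]}")
```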

For downloading purposes, the uncompressed GenBank release 236.0 flat files require roughly 1,117 GB, including the sequence files and the *.txt files.

More information about GenBank release 236.0 is available in the Release Notes, as well as in the README files in the GenBank and ASN.1 (ncbi-asn1) directories on FTP.

The entire corpus of the Sequence Read Archive (SRA) is now live on two cloud platforms!

The National Library of Medicine (NLM) is pleased to announce that all controlled-access and publicly available data in SRA is now available through Google Cloud Platform (GCP) and Amazon Web Services (AWS). To access the data, please visit our SRA in the Cloud webpage, where you will find links to our new SRA Toolkit and other access methods.

The SRA data available in the two clouds currently totals more than 14 petabytes and consists of all data in the SRA format as well as some data in its original submission format. Since May 2019, NCBI has been putting all submitted SRA data on the GCP and AWS clouds in both the submitted format and our converted SRA format. We have also been moving previously submitted original format data to the clouds and expect to complete that process in 2021.

New ribosomal RNA BLAST databases available on the web BLAST service and for download

We have a curated set of ribosomal RNA (rRNA) reference sequences (Targeted Loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacteria and archaea) and fungal samples (Table 1). To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section to the nucleotide BLAST page, which makes it convenient to quickly identify source organisms (Figure 1).

Database | BioProjects | Sequences
16S ribosomal RNA (Bacteria and Archaea) | PRJNA33317, PRJNA33175 | 20,845
18S ribosomal RNA sequences (SSU) from Fungi type and reference material | PRJNA39195 | 2,337
28S ribosomal RNA sequences (LSU) from Fungi type and reference material | PRJNA51803 | 5,185
Internal transcribed spacer region (ITS) from Fungi and Oomycete type and reference material | PRJNA177353, PRJNA362621 | 10,874

Table 1. NCBI curated targeted rRNA sequences now available as BLAST databases.

NCBI staff to present 3 posters at Advances in Genome Biology and Technology (AGBT), February 2020

Next week, NCBI staff will attend AGBT in Marco Island, Florida. On Tuesday, February 25, 2020, three posters from NCBI staff will be on display from 4:40 p.m. – 6:10 p.m. during the Poster Session and Wine Reception in the Banyan and Calusa Ballroom Foyers, Levels 1 and 3. Read on to learn a little bit about what we’ll be presenting.

New PGAP release with Singularity, no-internet option, and Taxonomy Check

A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) with several important features is now available on GitHub.

  • In response to several requests, we have added the option of running PGAP with Singularity, Podman, or any other Docker-compatible executable you wish to use.
  • We have also lifted the requirement for internet access in case you have privacy concerns. To run the pipeline without internet access, set the flag --no-internet.
  • Are you unsure about the identity of the organism you sequenced? We’ve added the Taxonomy-Check module to help you. This module will confirm the organism name or suggest a new taxonomic assignment through average nucleotide identity comparison with type material assemblies from GenBank. The check is currently an optional validation step prior to PGAP.

Try these new features and let us know what you think! Or submit your PGAP-annotated assembly to GenBank. And remember that if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors flag to get a preliminary annotation.
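
The two flags mentioned in this post can be combined on one command line. Below is a minimal Python sketch that only assembles the argument list and does not run anything. The launcher name ./pgap.py, the input file name input.yaml, and the helper build_pgap_command are assumptions made for illustration; a real run will likely need further arguments, so check the PGAP documentation before using it.

```python
import shlex


def build_pgap_command(input_path, no_internet=False, ignore_all_errors=False,
                       launcher="./pgap.py"):
    """Assemble a PGAP command line as a list of arguments.

    The launcher path and input_path are hypothetical placeholders; only
    the --no-internet and --ignore-all-errors flags come from the post.
    """
    command = [launcher]
    if no_internet:
        # Run without internet access, for example for privacy reasons.
        command.append("--no-internet")
    if ignore_all_errors:
        # Get a preliminary annotation even if pre-annotation validation fails.
        command.append("--ignore-all-errors")
    command.append(input_path)
    return command


cmd = build_pgap_command("input.yaml", no_internet=True, ignore_all_errors=True)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so the same list could later be handed to subprocess.run without shell quoting issues.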

Important changes coming to prokaryotic Reference and Representative genome assemblies

We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.

  • We will reduce the number of Reference assemblies to 15 that have annotation provided by outside experts (Table 1) and re-annotate the 105 other current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose reference status.
  • We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.

Opportunity for experienced data scientists at the National Library of Medicine: NIH DATA Scholar Program

The National Library of Medicine (NLM) is pleased to announce the Data and Technology Advancement (DATA) National Service Scholar Program, a new opportunity for experienced data and computer scientists and engineers to tackle biomedical data challenges in partnership with the NIH Office of Data Science Strategy (ODSS).

The one- to two-year positions will be based in NIH offices located in Bethesda. DATA Scholars will lead transformative NIH projects for architecting search across petabyte-scale genomic Sequence Read Archive (SRA) data. They will:

  • pioneer sequence search strategies against the entire corpus of NIH’s SRA data to stimulate novel approaches to advance data analysis and accelerate biological discoveries.
  • develop methods to execute sequence-based searches, including those that involve machine-learning or other AI approaches.
  • directly communicate technical and project-related information with NIH senior leadership.
  • collaborate with other DATA Scholars and the NIH data science community across broad disciplinary boundaries.
  • engage with policymakers, top researchers, and industry partners.

Applicants should possess technical skills in areas such as artificial intelligence, cloud computing, data engineering, data science, database management, project management, software design, supercomputing, and/or bioinformatics. Industry experience is desired. Applicants should have an M.D., Ph.D. or equivalent doctoral degree and have advanced experience in data science or related fields.

Applications are due April 30, 2020. For more information and details on how to apply, visit our full job announcement.

DHHS and NIH are Equal Opportunity Employers. Applications from women, minorities, and persons with disabilities are strongly encouraged.

Try out our new table download options from the NCBI genome browsers and sequence viewers!

Have you ever wanted a list of the genes you’re looking at in the browser – maybe to give you a starting point for candidate gene analysis, or to cross-reference with other data?

In response to your feedback and helpful discussions with you, we’re excited to announce a new option to download gene annotation data directly from the web sequence viewers and browsers.

This new feature lets you get a table of gene names, coordinates and other helpful information from your genomic region of interest.

Go to the Download menu on the toolbar of the graphical viewer to find options for getting sequence and annotation data.

Computational Medicine Codeathon and AWS workshop at Chapel Hill in March

NIH is pleased to announce a computational medicine-focused codeathon. To apply, please complete the application form by February 25, 2020. We will also be offering a free workshop, AWS Technical Essentials, the day before the codeathon. Read on for more information about the event.

Important changes to the genomes FTP site in February

We have added the latest NCBI Eukaryotic Genome Annotation Pipeline results for the more than 580 species that we annotate to the genomes/refseq directory on the genomes FTP area. As we announced in December, we will stop publishing annotation results to the genus_species directories (example: genomes/Xenopus_tropicalis) on the genomes FTP site effective February 1, 2020. We will also move existing genus_species directories to genomes/archive/old_refseq during the month of February.

Figure 1. The Assembly page for the Xenopus tropicalis UCB Xtro 10.0 assembly (GCF_000004195.4), showing the blue Download button. Annotation results such as the RefSeq transcript alignments that can be downloaded from the web page are now also under the genomes/refseq directory on the FTP site. The FTP path to the .bam alignment files is in red.

These FTP changes do not affect the Assembly download function. As always, you can download assembly data using the blue Download button on the web pages (Figure 1).
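
If you have scripts that refer to the old genus_species directories, the move described above suggests a simple path rewrite. The following Python sketch assumes that each directory keeps its name when it is placed under genomes/archive/old_refseq; that layout is an assumption, not something this post confirms, and the helper archived_location is hypothetical.

```python
from pathlib import PurePosixPath

# Archive root named in the announcement.
ARCHIVE_ROOT = PurePosixPath("genomes/archive/old_refseq")


def archived_location(old_path: str) -> str:
    """Map genomes/<Genus_species> to its assumed archive location.

    Assumes the directory name is preserved under the archive root.
    """
    path = PurePosixPath(old_path)
    if len(path.parts) != 2 or path.parts[0] != "genomes":
        raise ValueError(f"expected genomes/<Genus_species>, got {old_path!r}")
    return str(ARCHIVE_ROOT / path.name)


print(archived_location("genomes/Xenopus_tropicalis"))
```

Verify the resulting path on the FTP site before relying on it, since the archive layout may differ from this assumption.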