GenBank 240.0 is available and surpasses 10 trillion basepairs!

GenBank release 240.0 (10/28/2020) is now available on the NCBI FTP site. This release has 10.33 trillion bases and 2.17 billion records.

The current release has 219,055,207 traditional records containing 698,688,094,046 base pairs of sequence data. There are also 1,432,874,252 WGS records containing 9,215,815,569,509 base pairs of sequence data, 435,968,379 bulk-oriented TSA records containing 382,996,662,270 base pairs of sequence data, and 78,177,358 bulk-oriented TLS records containing 28,814,798,868 base pairs of sequence data.
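The headline figures can be recovered from the four component totals above. A minimal sketch of that arithmetic, using only the numbers quoted in this post (the dictionary layout and variable names are illustrative):

```python
# (records, base pairs) for each GenBank component, as quoted above.
components = {
    "traditional": (219_055_207, 698_688_094_046),
    "WGS": (1_432_874_252, 9_215_815_569_509),
    "TSA": (435_968_379, 382_996_662_270),
    "TLS": (78_177_358, 28_814_798_868),
}

total_records = sum(records for records, _ in components.values())
total_bases = sum(bases for _, bases in components.values())

print(f"{total_records / 1e9:.2f} billion records")  # 2.17 billion records
print(f"{total_bases / 1e12:.2f} trillion bases")    # 10.33 trillion bases
```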

Growth between releases

During the 71 days between the close dates for GenBank Releases 239.0 and 240.0, the ‘traditional’ portion of GenBank grew by 44,631,024,497 basepairs and by 412,969 sequence records. During that same period, 94,006 records were updated. An average of 7,140 ‘traditional’ records were added and/or updated per day.
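The daily average follows from the added and updated record counts divided by the 71-day interval. A short sketch of the calculation, assuming the average is truncated to a whole number (variable names are illustrative):

```python
# Figures for the 'traditional' portion between Releases 239.0 and 240.0.
days_between_releases = 71
records_added = 412_969
records_updated = 94_006

# Records added and/or updated per day, truncated to a whole number.
per_day = (records_added + records_updated) // days_between_releases
print(per_day)  # 7140
```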

Between releases 239.0 and 240.0, the WGS component of GenBank grew by 374,166,158,857 basepairs and by 24,751,365 sequence records. The TSA component of GenBank grew by 16,027,711,110 basepairs and by 18,443,812 sequence records. The TLS component of GenBank grew by 989,739,370 basepairs and by 2,495,201 sequence records.

The total number of sequence data files increased by 107 with this release. The divisions are as follows:

  • BCT: 22 new files, now a total of 512
  • CON: 1 new file, now a total of 218
  • INV: 2 new files, now a total of 97
  • PAT: 1 new file, now a total of 213
  • PLN: 47 new files, now a total of 594
  • PRI: 10 new files, now a total of 45
  • ROD: 15 new files, now a total of 56
  • VRL: 5 new files, now a total of 44
  • VRT: 4 new files, now a total of 214

Delivery of GenBank 240.0 was delayed by two weeks

A power surge at the NCBI data center and subsequent downtime for a critical disk storage system led to a nearly two-week delay in the delivery of the data files for GenBank 240.0. There were no data losses, and public-facing systems remained available. However, between the direct impacts of the outage and subsequent efforts to resume processing pipelines, the GenBank release timeline was significantly pushed back. Our apologies for the delay!

Upcoming Changes

New /ncRNA_class value: circRNA

  • The allowed values for the /ncRNA_class qualifier have been extended to include “circRNA”, for circular RNA molecules. This change will take effect in or after GenBank Release 242.0 in February 2021.

New /circular_RNA qualifier

  • Complementing the new “circRNA” ncRNA class, a new qualifier will be introduced in (or after) GenBank Release 242.0 in February 2021.

Additional Information

For downloading purposes, please keep in mind that the uncompressed GenBank Release 240.0 sequence data flatfiles require roughly 1,524 GB. The ASN.1 data files require approximately 958 GB.
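Before starting a download, it may help to confirm that the target volume can hold the uncompressed files. A minimal sketch using the sizes quoted above; the decimal definition of a gigabyte, the 10% headroom, and the `has_room` helper are assumptions made for illustration:

```python
import shutil

FLATFILE_GB = 1524  # uncompressed sequence data flatfiles, from this post
ASN1_GB = 958       # ASN.1 data files, from this post
GB = 10**9          # assumption: decimal gigabytes


def has_room(free_bytes: int, required_gb: int, headroom: float = 0.10) -> bool:
    """Return True if free_bytes covers required_gb plus a safety margin."""
    return free_bytes >= required_gb * GB * (1 + headroom)


if __name__ == "__main__":
    # Check the volume that holds the current working directory.
    free = shutil.disk_usage(".").free
    print("Room for flatfiles:", has_room(free, FLATFILE_GB))
    print("Room for ASN.1 files:", has_room(free, ASN1_GB))
```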

More information about GenBank release 240.0 is available in the release notes, as well as in the README files in the genbank and ASN.1 (ncbi-asn1) directories on FTP.

November 18 Webinar: A new way to prepare genome submissions using NCBI’s Genome Workbench!

Join us November 18 to learn how to use Genome Workbench, NCBI’s sequence annotation and analysis package, to prepare genome submissions for GenBank. This webinar will help you prepare for the upcoming retirement of the Sequin submission tool in January 2021. You will learn how to use Genome Workbench’s Submission Wizard, Validation and Submitter Reports, Flat File View, and Graphical Sequence View to prepare your annotated genome submission to GenBank and to find and fix any problems before submitting.

  • Date and time: Wed, November 18, 2020 12:00 PM – 12:45 PM EST
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

NCBI RefSeq and Ensembl/GENCODE taking MANE mainstream with v0.92!

NCBI and EBI have been hard at work on our joint MANE collaboration, which provides a set of representative transcripts for human protein-coding genes that are identically annotated in the NCBI RefSeq and Ensembl/GENCODE annotation sets and exactly match the GRCh38 reference assembly. We’re pleased to announce MANE v0.92, now covering 16,865 genes, or ~88% of known human protein-coding genes.

In particular, we’ve focused on clinically relevant genes, and MANE Select now includes 99% of genes with high gene-disease validity. This release also includes 43 extra transcripts labeled “MANE Plus Clinical” that we’ve chosen to aid in clinical reporting, for example, when there are additional pathogenic variants not covered in the MANE Select transcript. While it’s critical to consider other alternatively spliced transcripts for variant interpretation or functional analyses, the MANE Select and MANE Plus Clinical transcripts provide a common foundation for clinical reporting and for other analyses that benefit from using just one well-supported transcript or protein per gene.

IgBLAST 1.17 is now available with improved identification of productive V gene sequences

A new release of IgBLAST (1.17), the popular package for classifying and analyzing immunoglobulin and T cell receptor sequences, is now available on the web and from the FTP site. The updated package is better at identifying productive V gene sequences. We added a new field, “V frame shift”, to the IgBLAST output to indicate whether the V gene translation frame contains a frameshift. We have also updated the definition of a productive V(D)J sequence to exclude those with internal frameshifts (Figure 1).

Figure 1. A portion of the web IgBLAST output showing the new “V frame shift” field. The results of this field now inform the classification of the sequence as Productive (Yes / No).

See the new IgBLAST manual on the NCBI GitHub site for more information on setting up and running IgBLAST.

New feature in the dbGaP submission portal: Automated study metadata

dbGaP has recently released a new feature to simplify submissions and provide study accessions faster. This video provides a quick overview of the new feature. 

Our new study config webform lets a study submitter enter important study summary information online, including the study description, inclusion/exclusion criteria, history, attribution, and associated publications, and instantly preview the study config content and study accession on their dbGaP study report page. The fields for study design and type, PMIDs, Genes, MeSH terms, and associated Clinical Trials have built-in help and validation to ensure that the information provided is complete and searchable by users looking for that data.

The database of Genotypes and Phenotypes (dbGaP) provides controlled access to the data and results from studies that have investigated the interaction of genotype and phenotype in humans. dbGaP assigns stable, unique identifiers to studies and subsets of information from those studies, including documents, individual phenotypic variables, tables of trait data, sets of genotype data, computed phenotype-genotype associations, and groups of study subjects who have given similar consents for use of their data.

Figure 1. dbGaP summary statistics

The submissions made to dbGaP represent the best and latest research in topic areas such as cardiovascular diseases, diabetes, autism spectrum disorders, precision medicine and many more. Submitters are central to the success of dbGaP and sharing of genomic research across the broader scientific community. Our submission portal serves as a central place to collect multiple components of a research study, including the metadata/summary and associated phenotype, genotype, and sequence data.

Programmatic access to Gene data using Datasets command-line and API

In March, we announced NCBI Datasets, a new resource that lets you easily retrieve and download data from across NCBI databases. Did you know you can now fetch NCBI Gene data programmatically using the NCBI Datasets API or command-line tool? Quickly retrieve both metadata and gene sequence data for multiple Gene records, including transcripts and proteins, in one shell command or API request. The API documentation is a good way to get started with programmatic access (Figure 1).

Figure 1. The Datasets API documentation showing a demonstration retrieving Gene metadata using RefSeq mRNA accessions. The API returns a readily processed JSON object.
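As a rough illustration of the kind of request shown in Figure 1, the sketch below builds a Gene metadata request from RefSeq mRNA accessions and parses the JSON response. The base URL, API version, and `/gene/accession/` path are assumptions that may not match the version of the API current when you read this, and the helper names and example accessions are illustrative only; consult the API documentation for the authoritative routes.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL and version of the Datasets REST API. Verify against
# the API documentation before relying on it.
BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def gene_by_accession_url(accessions):
    """Build a Gene metadata URL for one or more RefSeq accessions."""
    joined = ",".join(urllib.parse.quote(acc, safe="") for acc in accessions)
    return f"{BASE_URL}/gene/accession/{joined}"


def fetch_gene_metadata(accessions, timeout=30):
    """Request Gene metadata and return the parsed JSON object."""
    request = urllib.request.Request(
        gene_by_accession_url(accessions),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Illustrative accessions; requires network access.
    print(gene_by_accession_url(["NM_021803.4", "NM_000546.6"]))
```

Keeping URL construction separate from the network call makes the request easy to inspect before any data is fetched.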

New RefSeq annotations for mouse, maize, sunflower and more!

In August and September, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Amphiprion ocellaris (clown anemonefish)
  • Anopheles stephensi (Asian malaria mosquito)
  • Aplysia californica (California sea hare)
  • Bactrocera oleae (olive fruit fly)
  • Branchiostoma floridae (Florida lancelet)
  • Egretta garzetta (little egret)
  • Folsomia candida (springtail)
  • Fundulus heteroclitus (mummichog)
  • Halichoerus grypus (gray seal)
  • Helianthus annuus (common sunflower)
  • Homo sapiens (human)
  • Lynx canadensis (Canada lynx)
  • Molossus molossus (Pallas’s mastiff bat)
  • Monomorium pharaonis (pharaoh ant)
  • Mus musculus (house mouse)
  • Myotis myotis (bat)
  • Neolamprologus brichardi (lyretail cichlid)
  • Oncorhynchus keta (chum salmon)
  • Onychomys torridus (southern grasshopper mouse)
  • Oryzias melastigma (Indian medaka)
  • Phyllostomus discolor (pale spear-nosed bat)
  • Rousettus aegyptiacus (Egyptian rousette)
  • Sander lucioperca (pike-perch)
  • Zea mays (maize)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

Learn more about the annotation of the new mouse reference assembly, GRCm39, here. This is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38.

New PubMed updates and retirement of legacy PubMed on October 31

The new PubMed has been the default since May, and more than 99% of you are using the new site. The recent NLM technical bulletin has details on features that we have added to the new PubMed based on your requests.

Legacy PubMed, which has been available in parallel with the new PubMed, will be taken down after October 31, 2020. We will continue to provide API access to PubMed through the E-utilities, which use the legacy system, for the foreseeable future, until we can transition to an API that accesses the new system.
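For readers who rely on that API access, here is a minimal sketch of a PubMed search through the E-utilities ESearch endpoint. The helper names and the example query are illustrative; heavier use should follow NCBI's usage guidelines, which this sketch does not cover.

```python
import json
import urllib.parse
import urllib.request

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pubmed_esearch_url(term, retmax=20):
    """Build an ESearch URL that returns matching PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def search_pubmed(term, retmax=20, timeout=30):
    """Run the search and return the parsed JSON response."""
    with urllib.request.urlopen(pubmed_esearch_url(term, retmax), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Illustrative query; requires network access to actually run a search.
    print(pubmed_esearch_url("GenBank", retmax=5))
```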

We understand that it can take time to adapt to changes and find favorite features in a new interface. Several learning and training resources are available to help you use the new PubMed.

Structure viewer iCn3D 2.20.0 is available with new features including viewing an electrostatic potential map!

The NCBI structure viewer iCn3D 2.20.0 is now available on the NCBI web site and from GitHub. You can now view the electrostatic potential map for any subset of 3D structures within 30,000 atoms. The potential is calculated using the DelPhi program by solving a linear Poisson-Boltzmann equation. You can show the potential on a surface or show an equipotential map. The potential map shows the effect of charges on molecular interactions qualitatively.
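For reference, one common textbook form of the linearized Poisson-Boltzmann equation is shown below. It is written in Gaussian units, and the notation is an assumption rather than something taken from the iCn3D or DelPhi documentation: \phi is the electrostatic potential, \varepsilon the position-dependent dielectric constant, \rho the fixed charge density of the molecule, and \bar{\kappa} a screening term that depends on salt concentration and vanishes where ions cannot reach.

```latex
\nabla \cdot \left[ \varepsilon(\mathbf{r}) \, \nabla \phi(\mathbf{r}) \right]
  - \bar{\kappa}^{2}(\mathbf{r}) \, \phi(\mathbf{r})
  = -4\pi \rho(\mathbf{r})
```

With no salt, the screening term is zero and the equation reduces to the Poisson equation.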

The example in Figure 1 below shows the electrostatic potential for the binding of Gleevec to the human Abl2 protein. This new feature can be accessed from the menu “Analysis > DelPhi Potential.” You can also download the PQR file format with assigned partial charges.

Figure 1: 3GVU: The crystal structure of human ABL2 in complex with GLEEVEC. The ligand shows the -25 mV (red) and +25 mV (blue) equipotential map with a grid size of 65, a salt concentration of 0.15 M, and pH 7. The protein shows the surface potential with a gradient from -75 mV (red) to +75 mV (blue).

NCBI Presents Two Online CoLabs at ASHG 2020!

Two up-and-coming NCBI resources will be featured in videos, surveys and live events at the American Society of Human Genetics (ASHG) 2020 Annual Meeting. Come and watch on-demand videos in the CoLab Theater. Then let us know what you think and how you use, or might use, these resources by either taking an online survey or joining us for the CoLab Live! events on Thursday, October 29, 2020.
