About NCBI Staff

The National Center for Biotechnology Information (NCBI), a division of the U.S. National Library of Medicine, provides access to scientific and biomedical databases, develops software tools for analyzing molecular data, and performs research in computational biology.

GenBank release 237 is available

GenBank release 237.0 (4/21/2020) is now available on the NCBI FTP site. This release has over 8.58 trillion bases and 1.95 billion records.

The release has 216,531,829 traditional records containing 415,770,027,949 base pairs of sequence data. There are also 1,267,547,429 WGS records containing 7,788,133,221,338 base pairs of sequence data, 396,392,280 bulk-oriented TSA records containing 349,692,751,528 base pairs of sequence data, and 65,521,132 bulk-oriented TLS records containing 24,615,270,313 base pairs of sequence data.
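
The headline totals can be cross-checked against the per-component figures. The short Python sketch below only sums the numbers quoted in this announcement; the dictionary layout and variable names are illustrative and not part of any NCBI tool.

```python
# Record and base-pair counts quoted above for GenBank release 237.0.
# Each value is (records, base_pairs).
components = {
    "traditional": (216_531_829, 415_770_027_949),
    "WGS": (1_267_547_429, 7_788_133_221_338),
    "TSA": (396_392_280, 349_692_751_528),
    "TLS": (65_521_132, 24_615_270_313),
}

total_records = sum(records for records, _ in components.values())
total_bases = sum(bases for _, bases in components.values())

print(f"{total_records:,} records")  # 1,945,992,670 (about 1.95 billion)
print(f"{total_bases:,} bases")      # 8,578,211,271,128 (about 8.58 trillion)
```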

During the 63 days between the close dates for GenBank Releases 236.0 and 237.0, the ‘traditional’ portion of GenBank grew by 16,393,173,077 base pairs and by 317,614 sequence records. During that same period, 55,268 records were updated. An average of 5,919 ‘traditional’ records were added and/or updated per day.
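
The daily average follows directly from the added and updated record counts. A minimal sketch of that arithmetic, using only the figures in the paragraph above (variable names are illustrative):

```python
# Figures quoted above for the 'traditional' portion of GenBank.
DAYS_BETWEEN_RELEASES = 63
records_added = 317_614
records_updated = 55_268

# Average number of records added and/or updated per day.
per_day = (records_added + records_updated) / DAYS_BETWEEN_RELEASES
print(round(per_day))  # 5919
```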

Between releases 236.0 and 237.0, the WGS component of GenBank grew by 819,141,955,586 base pairs and by 60,826,741 sequence records. The TSA component of GenBank grew by 8,698,462,463 base pairs and by 9,747,409 sequence records. The TLS component of GenBank grew by 10,945,592,117 base pairs and by 31,483,761 sequence records.

The total number of sequence data files increased by 59 with this release. The divisions are as follows:

  • BCT: 14 new files, now a total of 432
  • CON: 1 new file, now a total of 217
  • ENV: 1 new file, now a total of 60
  • INV: 6 new files, now a total of 86
  • MAM: 15 new files, now a total of 64
  • PLN: 8 new files, now a total of 212
  • VRT: 14 new files, now a total of 175
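
As a quick consistency check, the per-division counts above add up to the stated increase of 59 files. The sketch below also derives the file count each division would have had in the previous release, on the assumption that no files were removed or merged between releases; the data structure is illustrative only.

```python
# (new_files, current_total) per division, as listed above.
divisions = {
    "BCT": (14, 432),
    "CON": (1, 217),
    "ENV": (1, 60),
    "INV": (6, 86),
    "MAM": (15, 64),
    "PLN": (8, 212),
    "VRT": (14, 175),
}

total_new = sum(new for new, _ in divisions.values())
print(total_new)  # 59

# Assumption: the previous total is simply current total minus new files.
previous_totals = {name: total - new for name, (new, total) in divisions.items()}
print(previous_totals["MAM"])  # 49
```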

For downloading purposes, the uncompressed GenBank release 237.0 flat files require roughly 1142 GB, including the sequence files and the *.txt files. The ASN.1 data files require approximately 844 GB.

More information about GenBank release 237.0 is available in the Release Notes, as well as in the README files in the GenBank and ASN.1 (ncbi-asn1) directories on FTP.

Canonical SPDI notation now in ClinVar

Did you know that you can see canonical SPDI notation – SPDI notation expressed on the GRCh38 chromosomal sequence – in ClinVar?

Figure 1. The canonical SPDI is provided within the “Variant details” tab. This is just one of many ways to see the notation.

This allows you to easily make connections between output from NCBI’s Variation Services and ClinVar data.
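
For readers who want to work with the notation programmatically, an SPDI expression has four colon-separated parts: Sequence, Position, Deletion, and Insertion, with a 0-based position. The Python sketch below splits such a string into its fields. The `SPDI` class, the `parse_spdi` helper, and the example variant are hypothetical illustrations, not part of ClinVar or the Variation Services API.

```python
from typing import NamedTuple


class SPDI(NamedTuple):
    sequence: str   # sequence accession with version
    position: int   # 0-based position where the deletion starts
    deletion: str   # deleted sequence (may also be given as a length)
    insertion: str  # inserted sequence


def parse_spdi(text: str) -> SPDI:
    """Split a 'sequence:position:deletion:insertion' string into fields."""
    parts = text.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}")
    sequence, position, deletion, insertion = parts
    return SPDI(sequence, int(position), deletion, insertion)


# Made-up example on a GRCh38 chromosome 1 accession, for illustration only.
variant = parse_spdi("NC_000001.11:100:A:G")
print(variant.sequence, variant.position)  # NC_000001.11 100
```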


New feature added to Primer-BLAST to better design primers for expression assays

We’ve added a new feature (Max 3′ match), shown in Figure 1, to Primer-BLAST that limits the length of 3′ exon matches when designing exon-exon spanning primers. This makes it less likely that primers specifically designed to amplify transcripts will also amplify genomic DNA contamination in expression assays.


Figure 1. The new “Max 3′ match” option that limits the size of the 3′ match for exon-exon junction primers. This option helps avoid primers that may also produce product from genomic DNA.

Flies Are A-buzzing in RefSeq!

Are you interested in comparative genomics or other studies using Drosophila genomics?

Then don’t miss our online poster #568A at TAGC 2020 Online (no meeting registration required). Also, tune in to the online Q&A session on Monday, April 27 at 12:00 – 12:30 pm EDT.

What’s happening? In coordination with FlyBase, we are transitioning almost all of the RefSeq Drosophila assemblies to annotation produced primarily by NCBI’s eukaryotic genome annotation pipeline. We’ll continue to use the FlyBase annotation for Drosophila melanogaster (soon to be updated to Release 6.32), but we’ll annotate the other species using available RNA-seq datasets and our latest software. This will allow us to provide consistent, high-quality annotations across the full spectrum of Drosophila species, and also to rapidly provide annotations as new high-quality assemblies become available. Another benefit is that these annotations will be available in the full suite of NCBI resources, including nucleotide, protein, BLAST, Gene, Genome Data Viewer, Genomes, Assembly, and more. You can download these annotation data from the NCBI genomes FTP site or try the new NCBI Datasets tool. By special request, we’re making orthology data relative to D. melanogaster available on the Gene FTP site, and we plan to expose those data in our public pages in the future.


Recalculation of prokaryotic reference and representative genome assemblies

We have updated the collection of representative and reference assemblies for Bacteria and Archaea to better reflect the taxonomic breadth of the prokaryotes in RefSeq. We chose the 11,478 representative assemblies in the new collection from the 180,000+ prokaryotic assemblies in RefSeq today. We selected one representative or reference assembly for every species based on several criteria, including contiguity, completeness, and whether the assembly is from type material. We have also updated the reference and representative microbial BLAST database to reflect these changes. This reference and representative set will be updated three times a year to reflect changes in RefSeq. In addition, as we announced on Feb 14, we have reduced the number of reference genome assemblies (the subset of representative assemblies with annotation provided by outside experts) to 15. See the list in our previous post. We have re-annotated the 104 assemblies that are no longer reference with our Prokaryotic Genome Annotation Pipeline (PGAP).

Recent RefSeq annotations: barn owl, monarch butterfly and more

In February and March, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Amblyraja radiata (thorny skate)
  • Catharus ustulatus (Swainson’s thrush)
  • Chelonoidis abingdonii (Abingdon island giant tortoise)
  • Chiroxiphia lanceolata (lance-tailed manakin)
  • Danaus plexippus plexippus (monarch butterfly)
  • Daphnia magna (crustacean)
  • Drosophila grimshawi (fly)
  • Drosophila mojavensis (fly)
  • Drosophila sechellia (fly)
  • Homo sapiens (human)
  • Hylobates moloch (silvery gibbon)
  • Lontra canadensis (North American river otter)
  • Lynx canadensis (Canada lynx)
  • Nasonia vitripennis (jewel wasp)
  • Odontomachus brunneus (ant)
  • Petromyzon marinus (sea lamprey)
  • Phocoena sinus (vaquita)
  • Rattus rattus (black rat)
  • Rhinolophus ferrumequinum (greater horseshoe bat)
  • Strigops habroptila (kakapo)
  • Taeniopygia guttata (zebra finch)
  • Tyto alba (barn owl)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

Streamlined submission of SARS-CoV-2 data with rapid turnaround

Figure 1. The SARS-CoV-2 submission landing page, where you can submit to GenBank or SRA. You can also view other resources related to SARS-CoV-2.

Quickly and easily add your SARS-CoV-2 sequence data to the growing public archive with new, special features and support from NCBI. Our new SARS-CoV-2 sequence submission landing page will help you get started. GenBank submissions are accessioned and released in approximately 1-2 working days, and Sequence Read Archive (SRA) submissions are typically processed and released within hours. Submission is simple!


April 22 Webinar on NCBI’s ALFA: allele frequency data for variant analysis and interpretation

On Wednesday, April 22, 2020 at 12 PM, join NCBI staff to learn how results from the Allele Frequency Aggregator (ALFA) project will help you interpret the biological impact of common and rare sequence variants. ALFA’s initial release includes analysis of genotype data from ~100K unrestricted dbGaP subjects and provides high-quality allele frequency data now displayed on relevant dbSNP records. In this webinar, you will learn about the data in the recent ALFA release, see how to access the data from the web and FTP, and see how to retrieve data programmatically by position, gene, and other attributes using E-utilities and the Variation Services API in Python.

  • Date and time: Wed, Apr 22, 2020 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
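
To give a rough idea of what programmatic retrieval with E-utilities in Python can look like, the sketch below only builds an ESearch request URL for the dbSNP database and does not contact NCBI. It is not taken from the webinar. The `build_esearch_url` helper is hypothetical, and the `BRCA1[GENE]` search term, including its field tag, is an assumption chosen for illustration.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "snp", retmax: int = 20) -> str:
    """Return an ESearch URL; this helper is illustrative, not an NCBI API."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Assumed search term: dbSNP records associated with a gene symbol.
url = build_esearch_url("BRCA1[GENE]")
print(url)
```

Fetching the URL with an HTTP client of your choice would return matching record identifiers as JSON. Consult the E-utilities usage guidelines on request rates and API keys before sending requests in bulk.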

The next RefSeq FTP release number will skip to 200

NCBI’s Reference Sequence (RefSeq) FTP release numbers will increment to 200 for the next release and skip over the numbers 100-199. The current release, from March 2020, is release 99. The next bi-monthly release, in May 2020, will be release 200. This change avoids overlap with the release numbers of the completely independent RefSeq annotation releases for the eukaryotic genomes we annotate, which are currently in the range 100-109, for example Mus musculus Annotation Release 108.