First annotation of Pacific white shrimp

NCBI announces Annotation Release 100 of the Pacific white shrimp (Penaeus vannamei) genome in RefSeq, based on the assembly (GCF_003789085.1) submitted by the Institute of Oceanology, Chinese Academy of Sciences. The Pacific white shrimp is one of the most important shrimp species in fisheries and aquaculture, and it is the first decapod to have its genome annotated by NCBI. We predicted 24,987 protein-coding genes with evidence from the alignment of six billion RNA-Seq reads and homology with invertebrate proteins. This annotation will enable genomic research in this commercially important species.

You can download the annotated assembly or browse and search it in the Genome Data Viewer.

Please visit our Eukaryotic RefSeq Genome Annotation Status page to see more annotations in progress.


GenBank reaches over 4 terabytes of data in release 229

GenBank release 229.0 (12/15/2018) has 211,281,415 traditional records (including non-bulk-oriented TSA) containing 285,688,542,186 base pairs of sequence data. There are also 773,773,190 WGS records containing 3,656,719,423,096 base pairs of sequence data, 274,845,473 bulk-oriented TSA records containing 248,592,892,188 base pairs of sequence data, and 20,924,588 bulk-oriented TLS records containing 8,511,829,281 base pairs of sequence data.
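
As a quick check on the headline figure, the four base-pair counts quoted above can be summed. The short Python sketch below does only that arithmetic; the dictionary name and category labels are illustrative and not part of any NCBI tool.

```python
# (records, base pairs) per category, copied from the release 229.0 figures above.
release_229 = {
    "traditional": (211_281_415, 285_688_542_186),
    "WGS": (773_773_190, 3_656_719_423_096),
    "bulk-oriented TSA": (274_845_473, 248_592_892_188),
    "bulk-oriented TLS": (20_924_588, 8_511_829_281),
}

# Sum each column across the four categories.
total_records = sum(records for records, _ in release_229.values())
total_bases = sum(bases for _, bases in release_229.values())

print(f"{total_records:,} records, {total_bases:,} base pairs")
print(f"about {total_bases / 1e12:.2f} trillion base pairs")
```

Together the four categories come to roughly 4.2 trillion base pairs, with WGS records contributing most of the total.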

The TIGRFAM collection of protein family Hidden Markov Models moves to NCBI

NCBI has been asked to take over ownership and maintenance of the TIGRFAM collection of Hidden Markov Models (HMMs), which is widely used for the annotation of prokaryotic genomes. The TIGRFAMs are a collection of curated protein families started in 1998 at The Institute for Genomic Research (TIGR), precursor to the J. Craig Venter Institute (JCVI). This collection is publicly available under a Creative Commons license and downloadable from NCBI. We have already made hundreds of improvements to TIGRFAM names and descriptions, and we will continue to make regular updates.

Check out improved tooltips in NCBI’s genome browsers and sequence displays!

We’ve recently improved the tooltips for gene features in NCBI’s graphical sequence displays, both in Genome Data Viewer (GDV) and on many resource pages, such as Gene and dbSNP. The enhanced tooltips include quick details and helpful links about the feature and its gene.

Figure 1. Merged transcript and CDS pair tooltip.

Pangenomics in the Cloud hackathon, March 25-27, 2019

We are pleased to announce the first-ever pangenomics, graphs, and haplotypes hackathon.

From March 25-27, 2019, NCBI will help run a bioinformatics hackathon in Santa Cruz, California, hosted by the University of California, Santa Cruz (UCSC). Potential topics include:

  • Building large-scale graphs from pangenomes using several assembly methods
  • Simplification of mapping
  • Resolving haplotypes
  • Identification of population-specific structural variants
  • Defining haplotype-specific expression, visualization, and coordination with the GRC

NCBI to correct existing taxonomic information on public GenBank records with average nucleotide identity analysis

To ensure that taxonomic information on genome assemblies is as accurate as possible, NCBI will use average nucleotide identity (ANI) analysis to correct existing public records in GenBank.

We will contact submitters of records found to be misidentified and provide reports with ANI information based on comparison to type strains. If there is no objection, the taxonomic change will be made, and a structured comment will be added to the record.

In cases where a genome assembly was not submitted with a binomial name (e.g., Bacillus sp. 123) but was found to match a known species with high confidence, the strain will be merged with the binomial in the taxonomy database. This will occur as part of the normal maintenance of merged taxonomic names. The submitter will not be contacted, but a structured comment indicating the change will be added to the record.
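
The two cases above can be read as a small decision rule. The Python sketch below is only an illustration of that reading, not NCBI's implementation: the 95% cutoff, the `Assembly` fields, and the function names are all assumptions, since the announcement gives neither a threshold nor any code. It also simplifies by comparing each assembly only to its single closest type strain.

```python
from dataclasses import dataclass

# Assumed cutoff for a "high confidence" species match.
# The announcement does not state a number; 95.0 is a placeholder.
ANI_SPECIES_CUTOFF = 95.0


@dataclass
class Assembly:
    declared_name: str      # organism name on the record (hypothetical field)
    best_type_species: str  # species of the closest type-strain assembly
    ani_percent: float      # ANI to that type strain


def has_binomial(name: str) -> bool:
    """Return True when the name has a species epithet rather than 'sp.'."""
    parts = name.split()
    return len(parts) >= 2 and parts[1] != "sp."


def plan_action(a: Assembly) -> str:
    """Map one assembly to the follow-up described in the text."""
    if a.ani_percent < ANI_SPECIES_CUTOFF:
        return "no confident match: leave record unchanged"
    if not has_binomial(a.declared_name):
        # No binomial was submitted: merge without contacting the submitter.
        return f"merge into {a.best_type_species}; add structured comment"
    declared_species = " ".join(a.declared_name.split()[:2])
    if declared_species == a.best_type_species:
        return "name confirmed"
    # Binomial disagrees with the ANI result: the submitter is contacted first.
    return f"contact submitter with ANI report; propose {a.best_type_species}"


# Made-up example values, for illustration only.
print(plan_action(Assembly("Bacillus sp. 123", "Bacillus subtilis", 98.7)))
```

Keeping the cutoff as a named constant makes it easy to see that the outcome depends on an assumed value rather than a published one.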

A paper in the International Journal of Systematic and Evolutionary Microbiology presents the method NCBI scientists used to review all prokaryotic genome assemblies in GenBank, as well as the current status of GenBank verifications and recent developments in confirming species assignments in new genome submissions.

NCBI to retire the UniGene database

In July 2019, we will retire the UniGene database and take down the web interface.

UniGene was originally implemented as a gene-oriented grouping of transcript sequences in the absence of a reference genome for a broad range of organisms. We added genome-based grouping later.

UniGene has since been used as a source of approximate expression profiles, an index of available cDNA clones, and as a guide to transcript-oriented resource design. However, with the advent of short read sequencing, fewer and fewer ESTs are submitted to NCBI every year, and reference genomes are available for most organisms with a sizable research community. Consequently, the usage of and need for UniGene has dropped significantly.

Although we will retire the web interfaces, the most recent UniGene builds will remain available on NCBI’s FTP site. Web traffic to UniGene entries will redirect to the relevant gene entries when those are available. When that’s not possible, web requests will be routed to either a representative nucleotide sequence entry or a helpful Entrez query against nucleotide records.

Please contact us with any comments or concerns, or if you need help using UniGene data.