Tag: Genome annotation

Improved access to SARS-CoV-2 data

NCBI Datasets has a simple, new way to get Coronaviridae data, including data from SARS-CoV-2 (Figure 1). The data package includes genomic, protein and CDS sequences, annotation and a comprehensive data report for all complete genomes. You can also target your search to major taxonomic ranks within Coronaviridae.

Figure 1 – SARS-CoV-2 page within NCBI Datasets showing statistics as of June 16, 2020.

Interested in a specific protein? The SARS-CoV-2 protein page allows you to choose a protein and download the corresponding sequences, annotation and representative structures from all annotated genomes (Figure 2).

Figure 2 – SARS-CoV-2 protein page within NCBI Datasets showing annotations on the SARS-CoV-2 reference genome.

Looking for programmatic access? NCBI Datasets offers the same Coronaviridae genomic data and SARS-CoV-2 protein data through a command-line tool and a RESTful API. These tools support additional filtering, including the ability to download only those genomes released after a date you specify.
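
To make the release-date filter concrete, here is a minimal Python sketch of the same idea applied locally to a list of genome records. The record layout, field names and accessions are hypothetical and are not taken from the NCBI Datasets data report; the sketch does not call the command-line tool or the API.

```python
from datetime import date

# Hypothetical records: field names and accessions are illustrative only.
records = [
    {"accession": "EXAMPLE_0001", "release_date": "2020-03-15"},
    {"accession": "EXAMPLE_0002", "release_date": "2020-06-02"},
    {"accession": "EXAMPLE_0003", "release_date": "2020-06-10"},
]


def released_after(genomes, cutoff):
    """Return the records whose release date is strictly later than cutoff."""
    return [
        g for g in genomes
        if date.fromisoformat(g["release_date"]) > cutoff
    ]


recent = released_after(records, date(2020, 6, 1))
print([g["accession"] for g in recent])
```

Whether the real tools treat the cutoff date as inclusive or exclusive is not stated here, so the strict comparison above is an assumption.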

We appreciate your feedback. Try NCBI Datasets and let us know what you think!

Orthologs Are A-Swimming and A-Buzzing in RefSeq!

Previously we wrote about improvements to Drosophila annotations in RefSeq. We’re excited to report that we’re also improving how we compute and report orthology data for fish and insects to help you find evolutionarily related genes across species. Currently, when we annotate a vertebrate genome using our in-house eukaryotic genome annotation pipeline, we use a robust process that identifies 1:1 orthologs relative to human using a combination of BLAST comparisons and local synteny. These results are available in NCBI Gene and our new Ortholog pages, and also on Gene’s FTP site. We also use the data to apply human gene and protein names to orthologs in other species, providing a very rich annotation for hundreds of vertebrates.


For fish, we’re now using a two-layer process. First, most of the fish now have 1:1 orthologs identified vs zebrafish, which typically results in identifying 50% more orthologs. Second, if we’ve identified a human ortholog for the zebrafish gene, then we also report the human gene. We’ve also switched primarily to applying gene symbols and names from zebrafish instead of human, mostly provided by the Zebrafish Information Network (ZFIN), to other fish orthologs. The end result is more ortholog connections and better nomenclature. For example, many fish have two related homeobox genes meis2a and meis2b, compared to the single MEIS2 gene in human. Our updated process has allowed us to identify and properly name meis2a and meis2b in 73 and 40 fish species, respectively.
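
The two-layer lookup for fish can be pictured with a small Python sketch. This only illustrates the chaining logic described above, not the pipeline's implementation; the gene identifiers and both lookup tables are hypothetical placeholders.

```python
from typing import NamedTuple, Optional


class OrthologCall(NamedTuple):
    zebrafish: Optional[str]  # layer 1: 1:1 ortholog in zebrafish
    human: Optional[str]      # layer 2: human ortholog of that zebrafish gene


# Hypothetical lookup tables with placeholder identifiers.
fish_to_zebrafish = {
    "fishX_gene1": "zf_geneA",
    "fishX_gene2": "zf_geneB",
}
zebrafish_to_human = {
    "zf_geneA": "HUMAN_GENE_A",  # zf_geneB has no human ortholog here
}


def two_layer_orthologs(gene_id):
    """Look up the zebrafish ortholog, then the human ortholog via zebrafish."""
    zebrafish = fish_to_zebrafish.get(gene_id)
    human = zebrafish_to_human.get(zebrafish) if zebrafish is not None else None
    return OrthologCall(zebrafish, human)


for gene in ("fishX_gene1", "fishX_gene2", "fishX_gene3"):
    print(gene, two_layer_orthologs(gene))
```

A gene with a zebrafish ortholog but no human one still gets a zebrafish-derived connection, which is where the extra ortholog calls would come from.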

RefSeq release 200 is public

RefSeq release 200 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.
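
For E-utilities access, a request for a single RefSeq record could look like the Python sketch below. It builds an EFetch URL from the standard `db`, `id`, `rettype` and `retmode` parameters; the helper names are our own, and the accession in the usage note is only an example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL that returns one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_BASE + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download the FASTA text for an accession (performs a network request)."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")


print(efetch_url("NC_045512.2"))
```

Calling `fetch_fasta("NC_045512.2")` would request the SARS-CoV-2 reference genome; for more than occasional use, check NCBI's usage guidelines on request rates and API keys.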

This full release incorporates genomic, transcript, and protein data available as of May 4, 2020, and contains 237,381,664 records, including 171,643,729 proteins, 31,244,247 RNAs, and sequences from 100,605 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Other announcements:

The number of organisms in RefSeq crosses 100,000!
The current RefSeq release contains 100,605 distinct species or taxa, with a net increase of 763 species since Release 99. This milestone coincides with the 100th release, though the current release number is 200 (see below). Note that the number of prokaryotic species (bacteria and archaea) has decreased due to a clean-up that mainly removed unclassified bacteria and assemblies from metagenome-assembled genomes (MAGs).
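
As a quick check of that net increase, the short Python sketch below subtracts the Release 99 totals from the Release 200 totals, using the figures quoted in the two release announcements on this page.

```python
# Totals as quoted in the Release 99 and Release 200 announcements.
release_99 = {
    "records": 231_402_293,
    "proteins": 167_278_920,
    "rnas": 29_869_155,
    "organisms": 99_842,
}
release_200 = {
    "records": 237_381_664,
    "proteins": 171_643_729,
    "rnas": 31_244_247,
    "organisms": 100_605,
}

# Growth per category between the two releases.
growth = {key: release_200[key] - release_99[key] for key in release_200}

for key, delta in growth.items():
    print(f"{key}: +{delta:,}")
```

The organism difference comes out to 763, matching the net increase stated above.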

The FTP release number has skipped to 200
As previously announced, NCBI’s Reference Sequence (RefSeq) FTP release number has incremented to 200 for this release, skipping the numbers 100-199. The previous release, from March 2020, was release 99. This change avoids overlap with the release numbers of the independently numbered RefSeq annotation releases for the eukaryotic genomes we annotate, which are currently in the range 100-109 (for example, Mus musculus Annotation Release 108).

NCBI Protein Families
A new release of the NCBI protein family profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can use HMMER to search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function.
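
A search like this might be scripted as in the Python sketch below, which assembles an `hmmsearch` command line and parses HMMER's tabular (`--tblout`) output. The file names, the E-value cutoff and the sample hit line are hypothetical; the sketch assumes HMMER 3 is installed and does not run it.

```python
def hmmsearch_command(hmm_file, protein_fasta, tblout, evalue=1e-5):
    """Build an hmmsearch command; the E-value cutoff is an assumed choice."""
    return ["hmmsearch", "-E", str(evalue), "--tblout", tblout,
            hmm_file, protein_fasta]


def parse_tblout(lines):
    """Parse per-target hits from --tblout text, skipping '#' comment lines."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # 18 whitespace-delimited columns, then a free-text description.
        fields = line.split(maxsplit=18)
        hits.append({
            "protein": fields[0],
            "model": fields[2],
            "model_accession": fields[3],
            "evalue": float(fields[4]),
            "score": float(fields[5]),
        })
    return hits


# Hypothetical file names and a made-up hit line, for illustration only.
print(" ".join(hmmsearch_command("profiles.hmm", "proteins.faa", "hits.tbl")))
sample = [
    "# target name  accession  query name  accession  E-value  score ...",
    "protA - example_fam MODEL_0001.1 1e-50 170.2 0.1 1.2e-50 170.0 0.1 "
    "1.0 1 0 0 1 1 1 1 hypothetical protein",
]
print(parse_tblout(sample))
```

Passing the command list to `subprocess.run(..., check=True)` would execute the search once the input files exist.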

Recalculation of Prokaryotic Reference and Representative Genome Assemblies
We have updated the collection of reference and representative assemblies for Bacteria and Archaea to better reflect the taxonomic breadth of the prokaryotes in RefSeq. We have selected one reference or representative assembly for every species based on several criteria including contiguity, completeness, and whether the assembly is from type material.
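
To illustrate picking one assembly per species, here is a toy Python sketch. The criteria named above (type material, completeness, contiguity) are used as a sort key, but the priority order, the field names and the example data are all assumptions made for illustration; the actual selection procedure is not described here.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    accession: str
    species: str
    contig_n50: int           # contiguity: larger is better
    completeness: float       # fraction between 0 and 1
    from_type_material: bool


def pick_representatives(assemblies):
    """Pick one assembly per species using an assumed priority order."""
    best = {}
    for asm in assemblies:
        # Assumed ranking: type material, then completeness, then contiguity.
        key = (asm.from_type_material, asm.completeness, asm.contig_n50)
        current = best.get(asm.species)
        if current is None or key > current[0]:
            best[asm.species] = (key, asm)
    return {species: asm.accession for species, (_, asm) in best.items()}


# Hypothetical assemblies with placeholder accessions.
example = [
    Assembly("ASM_A", "Species one", 500_000, 0.97, False),
    Assembly("ASM_B", "Species one", 200_000, 0.95, True),
    Assembly("ASM_C", "Species two", 900_000, 0.99, False),
]
print(pick_representatives(example))
```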

Future change: Mouse Reference Assembly Update
A full assembly update for the mouse GRCm38.p6 reference assembly is expected to be released in 2020 by the GRC. We anticipate updating the mouse RefSeq annotation to the new GRCm39 assembly this summer, for either RefSeq FTP Release 201 or 202.


Flies Are A-buzzing in RefSeq!

Are you interested in comparative genomics or other studies using Drosophila genomics?

Then don’t miss our online poster #568A at TAGC 2020 Online (no meeting registration required). Also, tune in to the online Q&A session on Monday, April 27 at 12:00 – 12:30 pm EDT.

What’s happening? In coordination with FlyBase, we are transitioning almost all of the RefSeq Drosophila assemblies to annotation produced primarily by NCBI’s eukaryotic genome annotation pipeline. We’ll continue to use the FlyBase annotation for Drosophila melanogaster (soon to be updated to Release 6.32), but we’ll annotate the other species using available RNA-seq datasets and our latest software. This will allow us to provide consistent, high-quality annotations across the full spectrum of Drosophila species, and also rapidly provide annotations as new high-quality assemblies become available. Another benefit is that these annotations will be available in the full suite of NCBI resources, including nucleotide, protein, BLAST, Gene, Genome Data Viewer, Genomes, Assembly, and more. You can download these annotation data from the NCBI genomes FTP site or you can try the new NCBI Datasets tool. By special request, we’re making orthology data relative to D. melanogaster available on the Gene FTP site, and plan to expose that data in our public pages in the future.

RefSeq Release 99 is public

RefSeq release 99 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of March 2, 2020, and contains 231,402,293 records, including 167,278,920 proteins, 29,869,155 RNAs, and sequences from 99,842 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

New PGAP release with Singularity, no-internet option, and Taxonomy Check

A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) with several important features is now available on Github.

  • In response to several requests we have added the option of running PGAP with Singularity, Podman or any other Docker-compatible executable you wish to use.
  • We have also lifted the requirement for internet access in case you have privacy concerns. To run the pipeline without internet access, set the flag
  • Are you unsure about the identity of the organism you sequenced? We’ve added the Taxonomy-Check module to help you. This module will confirm the organism name or suggest a new taxonomic assignment through average nucleotide identity comparison with type material assemblies from GenBank. The check is currently an optional validation step prior to PGAP.

Try these new features and let us know what you think! Or submit your PGAP-annotated assembly to GenBank. And remember that if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors flag to get a preliminary annotation.

December 11 Webinar: Running the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) on your own data

On Wednesday, December 11, 2019 at 12 PM, NCBI staff will present a webinar that will show you how to use NCBI’s PGAP (https://github.com/ncbi/pgap) on your own data to predict genes on bacterial and archaeal genomes using the same inputs and applications used inside NCBI. You can run PGAP on your own machine, on a compute farm, or in the Cloud. Plus, you can now submit genome sequences annotated by your copy of PGAP to GenBank. Attend the webinar to learn more!

  • Date and time: Wed, Dec 11, 2019 12:00 PM – 12:45 PM EST
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

CCDS Release 23 for Mouse Now in Entrez Gene

Are you interested in high quality genomic annotations for human and mouse? Check out the Consensus Coding Sequence (CCDS) project! Release 23 of the CCDS project is now available in Entrez Gene. This release compares NCBI’s Mus musculus annotation release 108 to Ensembl’s annotation release 98. This update adds 1,570 new CCDS records and 175 genes to the mouse CCDS dataset. In total, release 23 includes 27,219 CCDS records that correspond to 20,486 genes.

RefSeq Release 96: complete re-annotation of mouse genome and new human annotation

You can now access RefSeq release 96 online, from the FTP site, and through NCBI’s Entrez programming utilities (E-utilities).

This full release incorporates genomic, transcript, and protein data available as of September 9, 2019, and contains 213,863,503 records, including 152,910,397 proteins, 28,017,380 RNAs, and sequences from 94,946 organisms.

The release is provided as a complete dataset and also in several directories divided by logical groupings.

Special announcements:

1. New Mus musculus (house mouse) Annotation Release 108

The latest annotation run for Mus musculus, 108, is a complete re-annotation of the mouse GRCm38.p6 assembly that incorporates ongoing curation work and new computed models based on extensive long-read transcriptome data.
See the annotation report for details. You can access these annotation products through the sequence databases and on the FTP site.

2. Updated Homo sapiens Annotation Release 109.20190905

Annotation Release 109.20190905 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report has details. You can access the annotation products from the sequence databases or download the data from the FTP site. We will continue to update the human genome annotation frequently so that we can incorporate ongoing curation work including the MANE project and other curation activities. See our post on the increased frequency of annotation for more information on the new schedule.

3. dbSNP Human Build 153

The short variations (SNPs) annotated on human RefSeq transcripts and RefSeqGene records now incorporate data from dbSNP build 153.