RefSeq release 91 is public


RefSeq release 91 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of November 5, 2018. It contains 179,672,083 records, including 125,530,811 proteins, 24,447,570 RNAs, and sequences from 85,308 organisms.

The release is provided in several directories as a complete dataset and as divided by logical groupings.
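For readers who want programmatic access, here is a minimal sketch of how a request to the E-utilities EFetch endpoint might be assembled for a single RefSeq record. The helper name `efetch_url` and the example accession `NM_000546` are chosen only for illustration and do not come from the release announcement. The code builds the URL without making a network call.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="fasta", retmode="text"):
    """Build an EFetch URL for one accession (hypothetical helper, no network call)."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Illustrative accession only; substitute the record you need.
print(efetch_url("NM_000546"))
```

The resulting URL can be passed to any HTTP client. For bulk downloads of a whole release, the FTP directories are likely the better fit.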

Matched Annotation by NCBI and EMBL-EBI (MANE): a new joint venture to define a set of representative transcripts for human protein-coding genes


The RefSeq project at the NCBI and the Ensembl/GENCODE project at EMBL-EBI have provided independent high-quality human reference gene datasets to biologists since the sequencing of the human genome. Now we’re joining together on an exciting new project we’re calling Matched Annotation from the NCBI and EMBL-EBI, or MANE, to provide a matched set of well-supported transcripts for human protein-coding genes and define one representative transcript for each gene. Both RefSeq and Ensembl will continue to provide a rich set of alternate transcripts per gene.

The MANE project builds on the successful CCDS collaboration (PMCID: PMC5753299) and incorporates feedback from RefSeq and Ensembl/GENCODE users who requested a common reference transcript dataset including one or a few key transcripts for each gene where the RefSeq and Ensembl/GENCODE transcripts are identical in length and sequence, and completely match the human reference genome sequence. We expect to later expand the project to include a larger subset of full-length transcripts that more fully represent the functional complexity of many genes. We’re leveraging public deep sequencing datasets to optimize 5’ and 3’ UTR endpoints to more accurately reflect transcriptional processes. To pick representative transcripts, we’ve developed computational methods to evaluate and integrate transcript expression levels, protein conservation, support from archived transcript submissions, clinical relevance, and other factors. Complex genes are reviewed by annotation experts from both groups, who agree on a representative transcript and often improve both annotation sets in the process.

The unified, high-quality transcript set provided by the MANE project will simplify the task of choosing a transcript for comparative genomics, clinical reporting, and basic research. When integrated across different public genome resources, this minimal, identically annotated transcript set will eliminate the need to choose between RefSeq and Ensembl/GENCODE datasets for genomic analyses. This will also make it easy for researchers who currently prefer one dataset over the other to exchange data or translate coordinates (or HGVS variation expressions) between RefSeq and Ensembl annotation results. Furthermore, the perfect alignment of all MANE transcripts to GRCh38 will make the set compatible with NGS-based sequencing technologies and other resources that use the latest and highest-quality reference human genome assembly available.

Our goal is for the final MANE dataset to be stable, although individual sequences and the dataset as a whole will be versioned and allow for future updates and expansions as needed to incorporate significant new data and additional curation. We plan to release a partial “beta” transcript set by the end of the year for testing, and a large sequence update in the next few months to refine 5’ and 3’ RefSeq transcript ends and match the GRCh38 sequence. Ensembl plans to release similar updates in spring 2019.

We’re looking forward to your feedback! Next week, we will be presenting the project at the annual American Society of Human Genetics (ASHG) meeting in San Diego, CA, USA. Please attend our talks scheduled in the Genome Reference Consortium (GRC) workshop on Tuesday, October 16, at 1:00 PM, and in the Importance of Isoform Expression in Variant Interpretation Session (#94) on Saturday, October 20, at 9:15 AM. You can also visit us at the NCBI or Ensembl booths and posters throughout the meeting or send us feedback at info@ncbi.nlm.nih.gov. We’re looking forward to your valuable input on our new initiative!

RefSeq release 90 is public


RefSeq release 90 is accessible online, via FTP and through NCBI’s programming utilities.

This full release incorporates genomic, transcript, and protein data available as of September 10, 2018. It contains 173,956,003 records, including 121,138,769 proteins, 23,838,676 RNAs, and sequences from 84,276 organisms.

The release is provided in several directories as a complete dataset and as divided by logical groupings.
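To put the release statistics in context, the following sketch compares the record, protein, and organism counts quoted for release 90 with those quoted for release 91. The dictionary layout and the `growth` helper are illustrative assumptions. Only the counts themselves come from the announcements.

```python
# Counts as quoted in the release 90 and release 91 announcements.
release_90 = {"records": 173_956_003, "proteins": 121_138_769, "organisms": 84_276}
release_91 = {"records": 179_672_083, "proteins": 125_530_811, "organisms": 85_308}


def growth(old, new):
    """Return {category: (absolute increase, percent increase)} (hypothetical helper)."""
    return {
        key: (new[key] - old[key], 100.0 * (new[key] - old[key]) / old[key])
        for key in old
    }


for key, (delta, pct) in growth(release_90, release_91).items():
    print(f"{key}: +{delta:,} ({pct:.2f}%)")
```

The same comparison works for any pair of releases, provided the counts are entered exactly as published.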

May – July annotations in RefSeq: ants, Chinese alligator & more


In recent months, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Alligator sinensis (Chinese alligator)
  • Athalia rosae (coleseed sawfly)
  • Bubalus bubalis (water buffalo)
  • Camponotus floridanus (Florida carpenter ant)
  • Canis lupus dingo (dingo)
  • Harpegnathos saltator (Jerdon’s jumping ant)
  • Melanaphis sacchari (aphid)
  • Pelodiscus sinensis (Chinese soft-shelled turtle)
  • Pogonomyrmex barbatus (red harvester ant)
  • Pomacea canaliculata (gastropod)
  • Sipha flava (yellow sugarcane aphid)
  • Theropithecus gelada (gelada)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

RefSeq release 89 is public


RefSeq release 89 is accessible online, via FTP and through NCBI’s programming utilities. This full release incorporates genomic, transcript, and protein data available as of July 9, 2018. It contains 163,859,625 records, including 113,429,348 proteins, 23,029,67 RNAs and sequences from 81,345 organisms. The release is in several directories as a complete dataset and as divided by logical groupings.

April and May annotations in RefSeq: cow, bonobo and more


In April and May, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Bos taurus (cattle)
  • Cephus cinctus (wheat stem sawfly)
  • Citrus sinensis (sweet orange)
  • Cynara cardunculus cardunculus (eudicot)
  • Cynoglossus semilaevis (tongue sole)
  • Gallus gallus (chicken)
  • Kryptolebias marmoratus (mangrove rivulus)
  • Macaca nemestrina (pig-tailed macaque)
  • Maylandia zebra (zebra mbuna)
  • Medicago truncatula (barrel medic)
  • Pan paniscus (pygmy chimpanzee)
  • Pteropus alecto (black flying fox)
  • Python bivittatus (Burmese python)
  • Ricinus communis (castor bean)
  • Temnothorax curvispinosus (ant)
  • Tetranychus urticae (two-spotted spider mite)
  • Ziziphus jujuba (common jujube)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

Improved annotation of Streptomyces RefSeq genomes


We’ve completed the RefSeq reannotation of over 1,000 Streptomyces genomes! The genomes were reannotated using the Prokaryotic Genome Annotation Pipeline (PGAP). PGAP detected nearly 100% of the genes from known families that encode ribosomally synthesized and post-translationally modified peptide (RiPP) natural products, despite their small size, using a set of over 30 hidden Markov models (HMMs) built by RefSeq biocurators. Of the 354 lasso peptides now present in Streptomyces RefSeq genomes, 251 (over 70%) were annotated for the first time.

If you are aware of any class of RiPP precursor in Streptomyces that was not found in our recent re-annotation, please contact us through the NCBI Help Desk, and we will add new HMMs to the rules we use to find and annotate RiPP precursor genes.

RefSeq release 88 available


RefSeq release 88 is now accessible online, via FTP and through NCBI’s programming utilities. This full release incorporates genomic, transcript, and protein data available as of May 14, 2018. It contains 160,224,355 records, including 110,333,800 proteins, 22,461,378 RNAs, and sequences from 79,448 organisms. The release is in several directories as a complete dataset and as divided by logical groupings.

This release incorporates dbSNP release 151, which nearly doubles the number of SNPs annotated on the human GRCh38 genome, with matching increases in the size of the human nucleotide flatfile (.gbff) records.

Starting in November 2018, SNP variation features will no longer be in RefSeq genome assembly records. The RefSeq release notes have information about this change.