RefSeq release 92 updates 10,000 human transcripts


RefSeq release 92 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of January 4, 2019. It contains 185,738,687 records, including 130,366,644 proteins, 25,088,890 RNAs, and sequences from 86,867 organisms. The release is provided in several directories as a complete dataset and as divided by logical groupings.
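For readers who want to try the E-utilities route, here is a minimal sketch of how a request for a single RefSeq record could be assembled with EFetch. The helper name `build_efetch_url` and the example accession are assumptions chosen for illustration; they are not part of the release announcement. The sketch only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

# Base address of the EFetch utility in NCBI's E-utilities
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Return an EFetch URL for one accession (hypothetical helper; no request is sent)."""
    params = {
        "db": db,            # Entrez database, e.g. "nuccore" or "protein"
        "id": accession,     # RefSeq accession to retrieve
        "rettype": rettype,  # record format, here FASTA
        "retmode": "text",   # plain-text response
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# Illustrative accession only; substitute the record you need
url = build_efetch_url("NM_000546")
print(url)
```

The resulting URL can be opened in a browser or passed to `urllib.request.urlopen`. For retrieving large portions of a release, the FTP directories are likely the better fit, since E-utilities requests are subject to NCBI's rate limits.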


Join NCBI at PAG in San Diego, January 12–16, 2019


Next week, NCBI staff will attend the Plant and Animal Genome (PAG) Conference. We have several activities planned, including 1 booth (#223), 4 workshops, 1 talk and 2 posters.

Read on to learn more about what you can look forward to if you’re attending PAG this year. (Note: The listed times are Pacific time.)


RefSeq release 91 is public


RefSeq release 91 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of November 5, 2018. It contains 179,672,083 records, including 125,530,811 proteins, 24,447,570 RNAs, and sequences from 85,308 organisms.

The release is provided in several directories as a complete dataset and as divided by logical groupings.


Matched Annotation by NCBI and EMBL-EBI (MANE): a new joint venture to define a set of representative transcripts for human protein-coding genes


The RefSeq project at the NCBI and the Ensembl/GENCODE project at EMBL-EBI have provided independent high-quality human reference gene datasets to biologists since the sequencing of the human genome. Now we’re joining together on an exciting new project we’re calling Matched Annotation from the NCBI and EMBL-EBI, or MANE, to provide a matched set of well-supported transcripts for human protein-coding genes and define one representative transcript for each gene. Both RefSeq and Ensembl will continue to provide a rich set of alternate transcripts per gene.

The MANE project builds on the successful CCDS collaboration (PMCID: PMC5753299) and incorporates feedback from RefSeq and Ensembl/GENCODE users who requested a common reference transcript dataset. This dataset would include one or a few key transcripts for each gene, where the RefSeq and Ensembl/GENCODE transcripts are identical in length and sequence and completely match the human reference genome sequence. We expect to later expand the project to include a larger subset of full-length transcripts that more fully represent the functional complexity of many genes. We’re leveraging public deep sequencing datasets to optimize 5’ and 3’ UTR endpoints to more accurately reflect transcriptional processes. To pick representative transcripts, we’ve developed computational methods to evaluate and integrate transcript expression levels, protein conservation, support from archived transcript submissions, clinical relevance, and other factors. Annotation experts from both groups review complex genes to agree on a representative transcript, and this review often leads to improvements in both annotation sets.

The unified, high-quality transcript set provided by the MANE project will simplify the task of choosing a transcript for comparative genomics, clinical reporting, and basic research. When integrated across different public genome resources, this minimal, identically annotated transcript set will eliminate the need to choose between RefSeq and Ensembl/GENCODE datasets for genomic analyses. This will also make it easy for researchers who currently prefer one dataset over the other to exchange data or translate coordinates (or HGVS variation expressions) between RefSeq and Ensembl annotation results. Furthermore, the perfect alignment of all MANE transcripts to GRCh38 will make the set compatible with NGS-based sequencing technologies and other resources that use the latest and highest-quality reference human genome assembly available.

Our goal is for the final MANE dataset to be stable, although individual sequences and the dataset as a whole will be versioned to allow for future updates and expansions as needed to incorporate significant new data and additional curation. We plan to release a partial “beta” transcript set by the end of the year for testing, and a large sequence update in the next few months to refine 5’ and 3’ RefSeq transcript ends and match the GRCh38 sequence. Ensembl plans to release similar updates in spring 2019.

We’re looking forward to your feedback! Next week, we will be presenting the project at the annual American Society of Human Genetics (ASHG) meeting in San Diego, CA, USA. Please attend our talks scheduled in the Genome Reference Consortium (GRC) workshop on Tuesday, October 16, at 1:00 PM, and in the Importance of Isoform Expression in Variant Interpretation Session (#94) on Saturday, October 20, at 9:15 AM. You can also visit us at the NCBI or Ensembl booths and posters throughout the meeting or send us feedback at info@ncbi.nlm.nih.gov. We’re looking forward to your valuable input on our new initiative!

RefSeq release 90 is public


RefSeq release 90 is accessible online, via FTP and through NCBI’s programming utilities.

This full release incorporates genomic, transcript, and protein data available as of September 10, 2018. It contains 173,956,003 records, including 121,138,769 proteins, 23,838,676 RNAs, and sequences from 84,276 organisms.

The release is provided in several directories as a complete dataset and as divided by logical groupings.

May – July annotations in RefSeq: ants, Chinese alligator & more


In recent months, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Alligator sinensis (Chinese alligator)
  • Athalia rosae (coleseed sawfly)
  • Bubalus bubalis (water buffalo)
  • Camponotus floridanus (Florida carpenter ant)
  • Canis lupus dingo (dingo)
  • Harpegnathos saltator (Jerdon’s jumping ant)
  • Melanaphis sacchari (aphid)
  • Pelodiscus sinensis (Chinese soft-shelled turtle)
  • Pogonomyrmex barbatus (red harvester ant)
  • Pomacea canaliculata (gastropod)
  • Sipha flava (yellow sugarcane aphid)
  • Theropithecus gelada (gelada)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

RefSeq release 89 is public


RefSeq release 89 is accessible online, via FTP and through NCBI’s programming utilities. This full release incorporates genomic, transcript, and protein data available as of July 9, 2018. It contains 163,859,625 records, including 113,429,348 proteins, 23,029,67 RNAs, and sequences from 81,345 organisms. The release is provided in several directories as a complete dataset and as divided by logical groupings.