The RefSeq project at the NCBI and the Ensembl/GENCODE project at EMBL-EBI have provided independent high-quality human reference gene datasets to biologists since the sequencing of the human genome. Now we’re joining together on an exciting new project we’re calling Matched Annotation from the NCBI and EMBL-EBI, or MANE, to provide a matched set of well-supported transcripts for human protein-coding genes and to define one representative transcript for each gene. Both RefSeq and Ensembl will continue to provide a rich set of alternate transcripts per gene.
The MANE project builds on the successful CCDS collaboration (PMCID: PMC5753299) and incorporates feedback from RefSeq and Ensembl/GENCODE users who requested a common reference transcript dataset. This dataset would include one or a few key transcripts for each gene, where the RefSeq and Ensembl/GENCODE transcripts are identical in length and sequence and completely match the human reference genome sequence. We expect to later expand the project to include a larger subset of full-length transcripts that more fully represent the functional complexity of many genes. We’re leveraging public deep sequencing datasets to optimize 5’ and 3’ UTR endpoints so that they more accurately reflect transcriptional processes. To pick representative transcripts, we’ve developed computational methods to evaluate and integrate transcript expression levels, protein conservation, support from archived transcript submissions, clinical relevance, and other factors. Annotation experts from both groups review complex genes to agree on a representative transcript, and this review often leads to improvements in both annotation sets.
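To make the matching criterion concrete, here is a minimal sketch of how transcripts from two annotation sets could be paired when they are identical. It is not the MANE pipeline. It assumes both sets are annotated on the same assembly and that transcript sequences are derived directly from the genome, so that identical exon coordinates imply identical length and sequence. The `Transcript` class, the function names, and the accessions are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical, simplified transcript record (not a real RefSeq/Ensembl schema)."""
    accession: str   # placeholder ID, not a real accession
    chrom: str
    strand: str      # "+" or "-"
    exons: tuple     # ((start, end), ...), 0-based half-open, sorted by start


def structure_key(tx):
    """Everything that must agree for two transcripts to count as identical here."""
    return (tx.chrom, tx.strand, tx.exons)


def spliced_length(tx):
    """Length of the transcript after joining its exons."""
    return sum(end - start for start, end in tx.exons)


def matched_pairs(refseq, ensembl):
    """Return (refseq_accession, ensembl_accession) pairs with identical exon structure."""
    by_structure = {structure_key(t): t for t in ensembl}
    pairs = []
    for r in refseq:
        e = by_structure.get(structure_key(r))
        if e is not None:
            pairs.append((r.accession, e.accession))
    return pairs


# Made-up coordinates and IDs, for illustration only.
refseq = [
    Transcript("NM_EXAMPLE1.1", "chr1", "+", ((100, 200), (300, 450))),
    Transcript("NM_EXAMPLE2.1", "chr1", "+", ((100, 200), (300, 400))),
]
ensembl = [
    Transcript("ENST_EXAMPLE1.1", "chr1", "+", ((100, 200), (300, 450))),
    Transcript("ENST_EXAMPLE3.1", "chr1", "+", ((90, 200), (300, 450))),  # 5' end differs
]

pairs = matched_pairs(refseq, ensembl)
```

In this toy data only the first RefSeq transcript has an exact counterpart. The third Ensembl transcript differs by ten bases at its 5’ end, which is the kind of endpoint difference that has to be reconciled before a pair can be considered matched.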
The unified, high-quality transcript set provided by the MANE project will simplify the task of choosing a transcript for comparative genomics, clinical reporting, and basic research. When integrated across different public genome resources, this minimal, identically annotated transcript set will eliminate the need to choose between RefSeq and Ensembl/GENCODE datasets for genomic analyses. It will also make it easier for researchers who currently prefer one dataset over the other to exchange data or translate coordinates (or HGVS variation expressions) between RefSeq and Ensembl annotation results. Furthermore, the perfect alignment of all MANE transcripts to GRCh38 will make the set compatible with next-generation sequencing (NGS) technologies and other resources that use the latest and highest-quality human reference genome assembly available.
Our goal is for the final MANE dataset to be stable. Individual sequences and the dataset as a whole will be versioned, however, to allow for future updates and expansions as needed to incorporate significant new data and additional curation. We plan to release a partial “beta” transcript set by the end of the year for testing, and a large sequence update in the next few months to refine the 5’ and 3’ ends of RefSeq transcripts and match them to the GRCh38 sequence. Ensembl plans to release similar updates in spring 2019.
We’re looking forward to your feedback! Next week, we will be presenting the project at the annual American Society of Human Genetics (ASHG) meeting in San Diego, CA, USA. Please attend our talks scheduled in the Genome Reference Consortium (GRC) workshop on Tuesday, October 16, at 1:00 PM, and in the Importance of Isoform Expression in Variant Interpretation session (#94) on Saturday, October 20, at 9:15 AM. You can also visit us at the NCBI or Ensembl booths and posters throughout the meeting or send us feedback at firstname.lastname@example.org. We’re looking forward to your valuable input on our new initiative!