An updated dataset of human protein-coding regions from the Consensus Coding Sequence (CCDS) collaboration
Are you interested in a set of high-quality human coding regions (CDS) with equivalent annotation in NCBI’s RefSeq and in Ensembl from EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute)? Check out the new CCDS Release 24! This CCDS set was generated by comparing RefSeq Annotation Release 110 and Ensembl Release 108.
This update adds 2,746 new CCDS IDs and 237 new genes compared to the last human CCDS build (Release 22, 2018). CCDS Release 24 includes a total of 35,608 CCDS IDs that correspond to 19,107 genes, with 48,062 protein sequences from RefSeq and 47,762 from Ensembl.
CCDS is a collaborative project between the following:
- National Library of Medicine (NLM), NCBI, RefSeq group
- European Bioinformatics Institute (EBI), Ensembl/GENCODE group
- HUGO Gene Nomenclature Committee (HGNC) (nomenclature authorities for human genome annotations)
- Mouse Genome Informatics (MGI) (nomenclature authorities for mouse genome annotations)
Since 2005, the CCDS set has served as the gold standard for mouse and human protein-coding annotation. Each CCDS ID represents a unique coding region that has equivalent CDS annotation in the RefSeq and Ensembl/GENCODE annotation sets. While the data is generated largely by computational methods and CCDS IDs are assigned primarily based on coding region matches, CCDS is backed by a team of curators drawn from all collaborating groups who help maintain data quality. Additional details on the CCDS workflow are available on the CCDS webpage and in the latest CCDS paper.
We have come a long way from the first human CCDS build in 2005, which had 14,795 CCDS IDs from 13,142 genes. In recent years, while the growth of the gene count in CCDS has slowed down, there is a steady increase in the number of genes with multiple CCDS IDs (i.e., we are adding multiple protein isoforms within a gene). Based on sequence data obtained from long-read sequencing technologies, we expect this trend to continue with the addition of more alternatively spliced protein-coding transcripts in both annotation sets.
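To put the growth in multi-isoform genes in rough numbers, the short sketch below divides the CCDS ID counts by the gene counts quoted in this post for the first build and for Release 24. The dictionary and function names are illustrative only, and an average says nothing about how the IDs are distributed across individual genes.

```python
# Counts quoted in this post: (CCDS IDs, genes) for each human build.
builds = {
    "first build (2005)": (14_795, 13_142),
    "Release 24": (35_608, 19_107),
}


def ids_per_gene(ccds_ids: int, genes: int) -> float:
    """Average number of CCDS IDs per gene (a coarse summary only)."""
    return ccds_ids / genes


for name, (ccds_ids, genes) in builds.items():
    print(f"{name}: {ids_per_gene(ccds_ids, genes):.2f} CCDS IDs per gene")
```

On these figures the average rises from about 1.13 to about 1.86 CCDS IDs per gene, which is consistent with the gene count growing more slowly than the number of annotated coding regions.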
How is CCDS different from MANE?
You may be familiar with Matched Annotation from NCBI and EMBL-EBI (MANE, PMID: 35388217), a more recent collaboration between NCBI and EMBL-EBI. The MANE set was developed to provide a set of reference transcripts for clinical variant reporting and other research applications.
The main difference (no pun intended!) between MANE and CCDS is that MANE provides a single representative transcript (MANE Select) per protein-coding gene. This transcript is chosen based on multiple biological criteria and is meant to be used as a universal standard for reporting known clinical variants. Each MANE Select includes a RefSeq transcript and an Ensembl transcript with identical end-to-end annotation. Both retain their identifiers but can be used synonymously. CCDS, on the other hand, includes all the coding regions in a gene that have equivalent annotation, based on automated comparison of the RefSeq and Ensembl/GENCODE annotations. Each coding region is assigned a unique identifier (CCDS ID). While the RefSeq and Ensembl transcripts within a CCDS ID match in the coding region, they may differ in the untranslated regions (UTRs).
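The distinction between a coding-region match and an end-to-end match can be illustrated with a toy comparison. This is a minimal sketch and not the actual CCDS or MANE pipeline: the `Transcript` record, the coordinate convention (1-based, inclusive), the placeholder accessions, and the helper functions are all hypothetical, and the real workflows apply additional quality checks and curation.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[int, int]  # (start, end); assumed 1-based and inclusive


@dataclass(frozen=True)
class Transcript:
    accession: str               # placeholder accession, not a real record
    chrom: str
    strand: str
    exons: Tuple[Interval, ...]  # genomic exon intervals, UTRs included
    cds_start: int               # leftmost genomic coordinate of the CDS
    cds_end: int                 # rightmost genomic coordinate of the CDS


def cds_segments(tx: Transcript) -> Tuple[Interval, ...]:
    """Clip each exon to the CDS range; exons lying wholly in a UTR drop out."""
    segments = []
    for start, end in sorted(tx.exons):
        clipped_start = max(start, tx.cds_start)
        clipped_end = min(end, tx.cds_end)
        if clipped_start <= clipped_end:
            segments.append((clipped_start, clipped_end))
    return tuple(segments)


def same_cds(a: Transcript, b: Transcript) -> bool:
    """CCDS-style check in this sketch: only the coding segments must agree."""
    return (a.chrom, a.strand) == (b.chrom, b.strand) and (
        cds_segments(a) == cds_segments(b)
    )


def same_end_to_end(a: Transcript, b: Transcript) -> bool:
    """MANE-style check in this sketch: coding segments and all exons agree."""
    return same_cds(a, b) and tuple(sorted(a.exons)) == tuple(sorted(b.exons))


# Two made-up transcripts with the same coding region but different UTR ends.
refseq_tx = Transcript("NM_EXAMPLE.1", "chr1", "+",
                       ((100, 200), (300, 400), (500, 650)), 150, 550)
ensembl_tx = Transcript("ENST_EXAMPLE.1", "chr1", "+",
                        ((90, 200), (300, 400), (500, 700)), 150, 550)

print(same_cds(refseq_tx, ensembl_tx))         # coding regions agree
print(same_end_to_end(refseq_tx, ensembl_tx))  # UTR boundaries differ
```

In this made-up case the pair would satisfy the coding-region comparison but not the end-to-end one, which mirrors why transcripts sharing a CCDS ID can still differ in their UTRs.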
Although the two sets are different, work on the MANE collaboration over the last four years involved several annotation updates, including updates to existing coding regions and the creation of new transcripts representing novel coding regions, which led to the improvement and growth of the human CCDS set.
We hope you find the new CCDS release useful for your research. Please contact us if you have questions or comments about CCDS.