There’s a new RefSeq annotation available for the human genome, and it’s quite an update!
About the release
Annotation release 109.20190607 is the first release on our new bimonthly annotation schedule, as announced in a previous post. The annotated sequences are the latest for the GRCh38 patch 13 assembly, GRCh38.p13 (GCF_000001405.39). The chromosome backbone sequences remain the same, but we’ve added 45 patch sequences representing novel and improved sequences that the Genome Reference Consortium will incorporate into the primary assembly in the future. The new annotation places the latest curated RefSeq transcripts and functional elements on the genome but keeps the same model dataset as annotation release 109, except where models have been replaced by curated RefSeqs or as a result of other review. We are also flagging MANE and other RefSeq Select transcripts. Read on for more details on these improvements. You can download the updated annotation here!
MANE and RefSeq Select annotation included
The release incorporates data from the Matched Annotation from the NCBI and EMBL-EBI (MANE) project, a collaboration with the Ensembl-GENCODE group to provide a matched set of well-supported transcripts for human protein-coding genes and define one representative transcript for each gene. Our annotation includes the MANE Select subset containing 10,277 matched transcripts covering ~53% of protein-coding genes. These matched transcripts have an Ensembl db_xref on the mRNA and CDS features of the genome sequence (Figure 1).
Figure 1. The mRNA and CDS features for the MANE Select transcript (NM_001238.4) for the CCNE1 gene on the chromosome 19 record (NC_000019.10), showing the cross-references (db_xref) to the Ensembl identifiers.
The identical Ensembl transcript and protein are now shown in the ##RefSeq-Attributes## block on the transcript sequence records. The transcript record also has a “MANE Select” keyword indicating that the RefSeq transcript is identical to an Ensembl transcript, and that both NCBI and EMBL-EBI have chosen it as the representative transcript for this gene (Figure 2).
Figure 2. The MANE Select transcript record (NM_001238.4) for the CCNE1 gene showing the MANE Select keyword and the MANE Ensembl match RefSeq Attribute.
For those of you eager to use a more complete set, we’ve also marked up our Select transcript picks for the other 47% of protein-coding genes with a “RefSeq Select” keyword. If you want to know how we picked a particular transcript as Select, each has a “RefSeq Select criteria” attribute that describes the support. If you’re using our RefSeq genome annotation in either GFF3 or GTF form, the Ensembl db_xrefs will show up in column 9 for those transcripts and proteins that are matched. There’s also a “tag” attribute that labels the features for the “MANE Select” or “RefSeq Select” transcripts.
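To make the column 9 description concrete, here is a minimal Python sketch that pulls the Select-tagged mRNA features out of GFF3 lines. It rests on several assumptions that are not stated in this post: that the cross-references sit in a Dbxref attribute with an "Ensembl:" prefix, that the label is stored as tag=MANE Select or tag=RefSeq Select, and that a transcript_id attribute is present. The function names are hypothetical, and the sample lines use placeholder coordinates, a placeholder GeneID, and a placeholder Ensembl identifier rather than real data. GTF uses a different attribute syntax, so this sketch covers GFF3 only.

```python
from urllib.parse import unquote

SELECT_TAGS = ("MANE Select", "RefSeq Select")


def parse_gff3_attributes(column9):
    """Split a GFF3 column 9 string into {key: [values]}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        # GFF3 separates multiple values with commas and percent-encodes
        # reserved characters inside each value.
        attrs[key] = [unquote(v) for v in value.split(",")]
    return attrs


def select_transcripts(gff3_lines):
    """Yield (transcript_id, tag, ensembl_ids) for Select-tagged mRNA features."""
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_gff3_attributes(cols[8])
        tags = [t for t in attrs.get("tag", []) if t in SELECT_TAGS]
        if not tags:
            continue
        # Assumed layout: Dbxref=Ensembl:<id>,GeneID:<id>,...
        ensembl_ids = [
            xref.split(":", 1)[1]
            for xref in attrs.get("Dbxref", [])
            if xref.startswith("Ensembl:")
        ]
        transcript_id = attrs.get("transcript_id", ["?"])[0]
        yield transcript_id, tags[0], ensembl_ids


# Hypothetical sample lines: coordinates, GeneID, NM_000000.0, and the
# Ensembl identifier are placeholders, not values from the real annotation.
SAMPLE_LINES = [
    "##gff-version 3",
    "NC_000019.10\tBestRefSeq\tmRNA\t1\t1000\t.\t+\t.\t"
    "ID=rna-NM_001238.4;Dbxref=Ensembl:ENST00000000000.1,GeneID:0;"
    "tag=MANE Select;transcript_id=NM_001238.4",
    "NC_000019.10\tBestRefSeq\tmRNA\t2000\t3000\t.\t+\t.\t"
    "ID=rna-NM_000000.0;Dbxref=GeneID:0;"
    "tag=RefSeq Select;transcript_id=NM_000000.0",
    "NC_000019.10\tBestRefSeq\tmRNA\t4000\t5000\t.\t+\t.\t"
    "ID=rna-NM_000000.1;Dbxref=GeneID:0;transcript_id=NM_000000.1",
]

for row in select_transcripts(SAMPLE_LINES):
    print(row)
```

To run this against a downloaded annotation file, pass an open file handle in place of SAMPLE_LINES, and check a few real lines first to confirm the attribute names match these assumptions.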
Major upgrades to the annotation
The annotation itself has had a major upgrade! There are nearly 5,000 new curated transcripts (with NM_ or NR_ accessions), and over 32,000 transcripts have had a sequence change as part of the MANE project. This represents half of the curated dataset! We primarily adjusted the UTRs to pick precisely defined ends based on high-throughput CAGE data from the FANTOM consortium and polyA-seq data from many sources, then matched transcripts to the GRCh38 sequence. These changes include the 10,277 MANE transcripts that have also been updated at Ensembl, along with many additional transcripts updated to standardize the RefSeq dataset as much as possible. Finally, we have annotated over 2,000 more RefSeq Functional Element features, expanding this dataset to over 8,900 gene regulatory and other features from over 3,700 biological regions.