The annotation of human assemblies GRCh38.p14 and T2T-CHM13v2.0
We are happy to announce the first de novo annotation of human T2T-CHM13v2.0, the gapless assembly generated by the T2T Consortium, and the full re-annotation of the human reference assembly, GRCh38.p14. We hope the results will serve both those eager to explore newly sequenced regions of the genome, including telomeres and centromeres, and those interested in refreshing their interpretation of the human reference in light of recently curated transcripts and the new transcriptomic and other data incorporated in the annotation.
Annotation of these two assemblies, referred to as Homo sapiens Annotation Release 110 (AR110), is in RefSeq and available for download and for browsing in GDV. This annotation incorporated 82,862 curated RefSeq transcripts and experimental evidence in the form of 9.7 billion RNA-seq reads, nearly 83 million PacBio and Oxford Nanopore long transcriptome reads, 8.6 million ESTs, and 345,700 GenBank cDNAs and corresponding proteins. Details, including statistics on the annotation products, the input data used in the pipeline, and intermediate alignment results, can be found here.
Annotated MANE Select transcripts
The annotation of GRCh38.p14 includes all 19,062 MANE Select transcripts that are in MANE release v1.0. The MANE project is a collaboration between the National Library of Medicine’s (NLM) National Center for Biotechnology Information (NCBI) and the EMBL’s European Bioinformatics Institute (EMBL-EBI) and aims at establishing a set of identically annotated RefSeq and Ensembl/GENCODE protein-coding transcripts to promote consistency in clinical variant reporting and facilitate data exchange. While the full benefits of MANE are best realized by using GRCh38.p14, the RefSeq transcripts in the MANE dataset are also annotated on T2T-CHM13v2.0, allowing users interested in variant analysis on T2T-CHM13v2.0 to readily interpret variation in the context of MANE transcripts. Transcript-genome alignment files in BAM format are provided to help with adjusting for sequence differences present in T2T-CHM13v2.0.
Refined 5-prime and 3-prime transcript coordinates
One of the highlights of this annotation is the higher accuracy of transcription start sites (TSSs) and polyA sites. TSSs were obtained by identifying clusters of aligned reads from the FANTOM5 Human Cap Analysis of Gene Expression (CAGE) project in or near the 5-prime regions of co-located annotated transcripts. PolyA-Seq data produced by multiple studies were used in a similar manner to annotate polyA sites at the 3-prime ends of the transcripts.
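To make the idea of read-cluster-based TSS refinement concrete, here is a minimal sketch. It is not the annotation pipeline's actual method: the function names, the gap and window sizes, and the minimum read count are all hypothetical, and the sketch assumes plus-strand reads on a single sequence. It groups the 5-prime ends of aligned reads into clusters, keeps the best-supported cluster near an annotated transcript start, and reports the most frequent position in that cluster.

```python
from collections import Counter


def cluster_positions(positions, max_gap=20):
    """Group positions into clusters; a new cluster starts whenever the
    gap to the previous (sorted) position exceeds max_gap."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def refine_tss(annotated_start, read_starts, window=100, max_gap=20, min_reads=3):
    """Return the most frequent read start in the largest cluster found
    within `window` bases of the annotated start. If no cluster has at
    least `min_reads` reads, return the annotated start unchanged.
    All thresholds are illustrative assumptions."""
    nearby = [p for p in read_starts if abs(p - annotated_start) <= window]
    supported = [c for c in cluster_positions(nearby, max_gap) if len(c) >= min_reads]
    if not supported:
        return annotated_start
    best = max(supported, key=len)
    return Counter(best).most_common(1)[0][0]


# Hypothetical 5-prime read ends around a transcript annotated to start at 1010.
example_reads = [1000, 1002, 1002, 1002, 1005, 1060, 1061, 5000]
refined = refine_tss(1010, example_reads)
```

The same approach could in principle be applied to PolyA-Seq read ends at the 3-prime end of a transcript, with the search window centered on the annotated transcript end instead.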
Extra gene content in T2T-CHM13v2.0
Our analysis is still preliminary, but more than 400 protein-coding genes annotated on T2T-CHM13v2.0 do not have an equivalent in GRCh38.p14. These include genes in well-characterized families such as the B melanoma antigen genes BAGE4 and BAGE in the chromosome 13 juxtacentromeric region, the cancer/testis antigen family members GAGE4, GAGE5 and GAGE7 on chromosome X, mucin 3B (MUC3B) on chromosome 7 (see Figure 1), and the double homeobox gene DUX5 on the short arm of chromosome 14.
Figure 1. Annotation of the MUC3B gene in the mucin region on chromosome 7. The gene is not annotated on the current reference, GRCh38.p14.
More than 700 ribosomal RNA genes are annotated on T2T-CHM13v2.0 but not on GRCh38.p14. These are found in the middle of the short arms of chromosomes 13, 14, 15, 21, and 22, as expected from the literature. A total of 75 rDNA cassettes are annotated on chr 13; 15 on chr 14; 50 on chr 15; 51 on chr 21; and 20 on chr 22. Finally, 91 5S rRNAs are annotated on chr 1. In addition, 86 transfer RNAs absent from GRCh38.p14 were identified on T2T-CHM13v2.0 chr 1.
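As a rough check on how the per-chromosome counts relate to the "more than 700" figure, the tally below sums the rDNA cassettes quoted above. The conversion from cassettes to genes is an assumption that the post does not state: it supposes each cassette contributes three annotated rRNA genes (18S, 5.8S, and 28S).

```python
# rDNA cassette counts per chromosome, as quoted in the text.
cassettes = {"chr13": 75, "chr14": 15, "chr15": 50, "chr21": 51, "chr22": 20}

# ASSUMPTION (not stated in the post): each cassette is annotated with
# three rRNA genes (18S, 5.8S, 28S).
RRNAS_PER_CASSETTE = 3

# 5S rRNA genes on chr 1, as quoted in the text.
five_s_rrnas = 91

total_cassettes = sum(cassettes.values())
cassette_rrnas = total_cassettes * RRNAS_PER_CASSETTE
total_rrnas = cassette_rrnas + five_s_rrnas

print(f"{total_cassettes} cassettes, {cassette_rrnas} cassette rRNAs, "
      f"{total_rrnas} rRNA genes including 5S")
```

Under that assumption, the 211 cassettes plus the 5S genes give a total consistent with the stated figure of more than 700 rRNA genes.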
New patches in GRCh38.p14
A total of 53 fix patches and 18 novel patches were added to GRCh38.p14 (more information here). Notably, a fix patch in the region including MUC16 allows a complete representation of the gene (using the newly created transcript NM_001401501.1) with three additional coding exons. The chromosome sequence in the GRCh38.p14 Primary Assembly results in two non-synonymous amino-acid changes in the FUT3 gene, one of which is known to inactivate the encoded enzyme. A novel patch (NW_025791810.1:16467..30706) allows the representation of the enzymatically active form of the protein.
Finally, if you are still using GRCh37.p13, we have released an updated annotation for this older version of the reference assembly, which includes the curated RefSeq transcripts that were used for AR110. It is available for download and for browsing in GDV.