In an earlier blog post, we discussed how sequence updates in GRCh38, the most recent version of the human reference genome, filled in a gap in human chromosome 17 near position 21,300K and expanded the region by 500K (500,000 base pairs). In this post, we revisit the same region, this time with an emphasis on how GRCh38 also improved the gene annotation.
Figure 1. Annotation of a region of chromosome 17 near the KCNJ12 and KCNJ18 genes. Top panel: Annotation release 105 on GRCh37.p13 represented by a configured graphic display of sequence record NC_000017.10. Bottom panel: Annotation release 106 on assembly GRCh38 represented by a configured graphic display of sequence record NC_000017.11. New gene models are circled.
Figure 1 shows a narrower area that corresponds to components AC068418.5 and AC233702.5 on GRCh38. The graphic display is configured so that it shows annotated gene models without the corresponding transcripts and proteins. The two assemblies share component AC068418.5 along with the five gene models annotated on it. That the same sequence would have the same annotation over time might seem an obvious outcome, but this is not always the case. Annotations on the same sequence (same assembly) can change from one annotation release to another if new transcript data support a new gene model, and this process of gathering and presenting new evidence for gene models is one of the major purposes of new annotation releases on a given assembly.
Progressing from AC068418.5 towards the gap in GRCh37.p13, the gene annotation diverges. Nothing can be annotated within the GRCh37.p13 gap, because there is no sequence there to annotate. But in GRCh38/Annotation release 106, where this gap has been filled by AC233702.5 (along with other new sequences), a new gene designated as KCNJ18 now appears. KCNJ18 (Gene ID: 100134444) is a member of the inwardly-rectifying potassium channel subfamily and was recently discovered by J. Devon Ryan and his colleagues from the University of California. They also reported evidence that the gene is associated with a muscle disorder (PMCID: PMC2885139). The transcript sequence of the gene was deposited in GenBank in 2008 and updated in 2010 (FJ434338.2). Improvements in the new assembly now allow this transcript sequence to align to the assembled genome sequence, and thus KCNJ18 has found its place on the human genome.
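The comparison between the two annotation releases can be pictured as a set operation on the gene models of each component. The sketch below is only a toy illustration, not how the annotation pipeline works. The function name `compare_releases` and the gene names `gene_1` to `gene_5` are hypothetical placeholders for the five shared gene models, and placing KCNJ18 on AC233702.5 is an assumption made for the example.

```python
# Toy comparison of gene models per component between two annotation releases.
# ASSUMPTIONS: gene_1..gene_5 are placeholder names for the five shared gene
# models; assigning KCNJ18 to component AC233702.5 is illustrative only.

def compare_releases(old, new):
    """Return {component: (shared, gained, lost)} as sets of gene names."""
    report = {}
    for component in sorted(set(old) | set(new)):
        old_genes = old.get(component, set())
        new_genes = new.get(component, set())
        report[component] = (
            old_genes & new_genes,   # annotated in both releases
            new_genes - old_genes,   # only in the newer release
            old_genes - new_genes,   # only in the older release
        )
    return report


shared_models = {"gene_1", "gene_2", "gene_3", "gene_4", "gene_5"}

# Release 105 on GRCh37.p13: the gap region has no sequence, so no component.
release_105 = {"AC068418.5": set(shared_models)}

# Release 106 on GRCh38: the gap is filled, and a new gene model appears.
release_106 = {
    "AC068418.5": set(shared_models),
    "AC233702.5": {"KCNJ18"},
}

report = compare_releases(release_105, release_106)
for component, (shared, gained, lost) in report.items():
    print(component, "shared:", len(shared), "gained:", sorted(gained))
```

In this toy data, the shared component shows no gained or lost models, while the component that fills the gap contributes KCNJ18 as a gained model.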
One might predict from the above discussion that if a researcher had searched for FJ434338.2 in GRCh37.p13, they would have found nothing because of the gap in the assembly. In fact, the genomic sequence was available in GRCh37.p13 on NW_003315950.2, a separate sequence record that is one of the fix patches that the GRC released during the five-year period between the releases of GRCh37 and GRCh38. A fix patch is a region where the sequence has been improved or corrected between assemblies. Now, in GRCh38, the sequence of the fix patch has been integrated into chromosome 17 in the region that we have just examined.
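The sequence records mentioned in this post (NC_000017.10, NC_000017.11, NW_003315950.2, FJ434338.2) all use the accession.version format, in which the number after the dot increases when the sequence of a record changes. As a minimal sketch, the following Python splits such identifiers and checks whether one is a later version of the same record. The names `VersionedAccession`, `parse_accession`, and `is_newer_version` are hypothetical helpers written for this example, not part of any NCBI tool.

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str
    version: int


def parse_accession(identifier: str) -> VersionedAccession:
    """Split an 'accession.version' string such as 'NC_000017.11'."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdecimal():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return VersionedAccession(accession, int(version))


def is_newer_version(older: str, newer: str) -> bool:
    """True if both identifiers share an accession and 'newer' has a higher version."""
    a, b = parse_accession(older), parse_accession(newer)
    return a.accession == b.accession and b.version > a.version


# Chromosome 17 in GRCh37.p13 versus GRCh38: same accession, new version.
print(parse_accession("NC_000017.11"))
print(is_newer_version("NC_000017.10", "NC_000017.11"))

# The fix patch is a separate record, so it is not a version of chromosome 17.
print(is_newer_version("NC_000017.10", "NW_003315950.2"))
```

The check only compares identifiers. It says nothing about what changed in the sequence between versions.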
KCNJ18 is one of numerous genes for which a gap closure allowed a new gene model to be placed on the genome. The following are other examples:
Gaps, however, are not the only problem in genomic assemblies. While small-scale deletions or insertions usually allow gene model placement on the genome, they often cause misalignments between the genome and transcript sequences. Some examples where correcting deletions or insertions improved gene annotations in GRCh38 are the following:
We hope this provides a starting point for exploring the improvements in GRCh38. You can find more information about the new release at the links below.