Sequence updates in human assembly GRCh38: improving gene annotation

In an earlier blog post, we discussed how sequence updates in GRCh38, the most recent version of the human reference genome, filled in a gap in human chromosome 17 near position 21,300K and expanded the region by 500K (500,000 base pairs). In this post, we will again consider this same region, but with an emphasis now on how GRCh38 also improved the gene annotation.


Figure 1. Annotation of a region of chromosome 17 near the KCNJ12 and KCNJ18 genes. Top panel: Annotation release 105 on GRCh37.p13 represented by a configured graphic display of sequence record NC_000017.10. Bottom panel: Annotation release 106 on assembly GRCh38 represented by a configured graphic display of sequence record NC_000017.11. New gene models are circled. 

Figure 1 shows a narrower area that corresponds to components AC068418.5 and AC233702.5 on GRCh38. The graphic display is configured so that it shows annotated gene models without the corresponding transcripts and proteins. The two assemblies share component AC068418.5 along with the five gene models annotated on it.  That the same sequence would have the same annotation over time might seem an obvious outcome, but this is not always the case. Annotations on the same sequence (same assembly) can change from one annotation release to another if new transcript data support a new gene model, and this process of gathering and presenting new evidence for gene models is one of the major purposes of new annotation releases on a given assembly.

Progressing from AC068418.5 towards the gap in GRCh37.p13, the gene annotation diverges. Obviously, nothing can be annotated within the GRCh37.p13 gap. But in GRCh38/Annotation release 106, where this gap has been filled by AC233702.5 (along with other new sequences), a new gene designated as KCNJ18 now appears. KCNJ18 (Gene ID: 100134444) is a member of the inwardly-rectifying channel subfamily that J. Devon Ryan and his colleagues from the University of California recently discovered. They also reported evidence that the gene is associated with a muscle disorder (PMCID: PMC2885139). The transcript sequence of the gene was deposited in GenBank in 2008 and updated in 2010 (FJ434338.2). Improvements in the new assembly now allow this transcript sequence to align to the assembled genome sequence, and thus KCNJ18 has found its place on the human genome.

One might predict from the above discussion that if a researcher had searched for FJ434338.2 in GRCh37.p13, they would have found nothing because of the gap in the assembly. In fact, the genomic sequence was available in GRCh37.p13 on NW_003315950.2, a separate sequence record that is one of the fix patches that the GRC released during the five-year period between the releases of GRCh37 and GRCh38. A fix patch is a region where the sequence has been improved or corrected between assemblies. Now, in GRCh38, the sequence of the fix patch has been integrated into chromosome 17 in the region that we have just examined.
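For readers who want to pull up the records mentioned here themselves, the following is a minimal sketch of how one might build NCBI E-utilities (EFetch) request URLs for them. The accessions come from this post; the helper name efetch_fasta_url and the choice of FASTA output are our own assumptions for illustration. The sketch only constructs the URLs and does not download anything.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession):
    """Build an EFetch URL requesting a nucleotide record as FASTA text.

    Hypothetical helper for illustration; it performs no network access.
    """
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,      # accession.version, e.g. "FJ434338.2"
        "rettype": "fasta",   # assumed output format for this sketch
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


# Accessions taken from the discussion above.
ACCESSIONS = {
    "KCNJ18 transcript": "FJ434338.2",
    "GRCh37.p13 fix patch": "NW_003315950.2",
    "chromosome 17, GRCh38": "NC_000017.11",
}

for label, accession in ACCESSIONS.items():
    print(label, efetch_fasta_url(accession), sep="\t")
```

Note that the chromosome 17 record is very large, so in practice one would probably fetch the transcript or the fix patch rather than the whole chromosome.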

The KCNJ18 gene is one of numerous genes where a gap closure allowed placements of new gene models on the genome.  The following are other examples:

Gaps, however, are not the only problem in genomic assemblies. While small-scale deletions or insertions usually allow gene model placement on the genome, they often cause misalignments between the genome and transcript sequences.  Some examples where correcting deletions or insertions improved gene annotations in GRCh38 are the following:

We hope this provides a starting point for exploring the improvements in GRCh38. You can find more information about the new release at the links below.

Advice for NIH Grantees: How to comply with the NIH Public Access Policy

“The NIH public access policy requires scientists to submit final peer-reviewed journal manuscripts that arise from NIH funds to PubMed Central immediately upon acceptance for publication.”

To comply with NIH Public Access Policy, here are the steps you should take:

Determine if the Public Access Policy applies to your publication

Generally, the NIH Public Access Policy applies to any peer-reviewed journal article that was accepted for publication on or after April 7, 2008 and that arose from NIH funding in Fiscal Year 2008 or later.

Determine Applicability for Your Publication

What does the NIH consider to be a ‘journal’?

Review your publication agreement

Before you sign a publication agreement or similar copyright transfer agreement, first make sure that the agreement allows the paper to be posted to PubMed Central (PMC) in accordance with the NIH Public Access Policy.

New SciENcv Features Allow Users To Create and Download Multiple Biosketches

NCBI’s recent update to the SciENcv feature in MyNCBI gives researchers the ability to create multiple biosketches for grants from federal agencies engaged in scientific research, allowing a more tailored and convenient approach to the grant application process.

What is SciENcv?

SciENcv (Science Experts Network Curriculum Vitae) is designed to help researchers assemble an NIH biosketch by extracting information from NIH eRA Commons and PubMed. The SciENcv interagency working group includes NIH, as well as DOD, DOE, EPA, NSF, USDA and the Smithsonian. You can access SciENcv if you have a My NCBI account. My NCBI accounts are free and offer many useful features, such as saving searches, automated e-mail alerts and My Bibliography.

Create your biosketch

Based on user suggestions, we’ve made it possible to create biosketches in three ways: from scratch, from an external source, or by duplicating an existing profile (see Figure 1). While the eRA Commons data feed is currently the only external data option, we plan on adding other external data sources in a future release of SciENcv.

Figure 1. Three ways to create your NIH biosketches in SciENcv

Sequence updates in human genome assembly GRCh38: filling in the gaps

In a previous blog post, we explained several important concepts about the human reference genome.  We presented a region of human chromosome 17 as an example of a location where the genome sequence was not fully assembled.  In this post, we are going to revisit the same gapped region to see how the Genome Reference Consortium (GRC) changed this part of the genome in GRCh38, the updated human reference assembly released in December 2013.  This region represents just one of the more than 1,000 changes and improvements that the GRC introduced in GRCh38.

Early Developments in the PubMed Commons Pilot

It’s been an exciting and productive time since the PubMed Commons beta launch. We’ve learned a great deal, both here working under the hood and from the conversations in social media and blog posts.

We are working on answers to the questions that people are asking, both through our Twitter account and by revising and expanding the information on the PubMed Commons page, which we will do soon. We will also try out a Twitter chat, so keep an eye on @PubMedCommons for the announcement.

There are now about 1,000 people signed up in the Commons. Remember, any author in PubMed can join, from anywhere in the world. Check out our step-by-step guide. Once you are in, you can invite others. So please spread the word!

The Human Reference Genome – Understanding the New Genome Assemblies

What is a genome assembly?

The haploid human genome consists of 22 autosomal chromosomes and the Y and the X chromosomes. Each of the chromosomes represents a single DNA molecule, a sequence of millions of nucleotide bases.  These molecules are linear, so one might expect that we should represent each chromosome by a single, continuous sequence. Unfortunately, this is not the case for two main reasons: 1) because of the nature of genomic DNA and the limitations of our sequencing methods, some parts of the genome remain unsequenced, and 2) emerging evidence suggests that some regions of the genome vary so much between individual people that they cannot be represented as a single sequence. In response to this, modern genomic data sets present a model of the genome known as a genome assembly. This post will introduce the basic concepts of how we produce such assemblies as well as some basic vocabulary.

What does NCBI’s Internet Explorer 7 warning mean?

Over the past several months, you may have noticed a warning message if you’ve accessed the NCBI site using Microsoft’s Internet Explorer web browser:

Internet Explorer Warning

If you have been using Internet Explorer version 7, or version 8 in “compatibility mode”, to surf the web, you may have seen this warning at the top of NCBI webpages.

This message has caused some concern among some users about exactly what changed on January 1, 2013 and whether or not they will still be able to access PubMed and other NCBI resources.  We hope that this post will address some of the more common questions.