In an earlier blog post, we discussed how sequence updates in GRCh38, the most recent version of the human reference genome, filled in a gap in human chromosome 17 near position 21,300K and expanded the region by 500K (500,000 base pairs). In this post, we will again consider this same region, but with an emphasis now on how GRCh38 also improved the gene annotation.
In a previous blog post, we explained several important concepts about the human reference genome. We presented a region of human chromosome 17 as an example of a location where the genome sequence was not fully assembled. In this post, we are going to revisit the same gapped region to see how the Genome Reference Consortium (GRC) changed this part of the genome in GRCh38, the updated human reference assembly released in December 2013. This region represents just one of the more than 1,000 changes and improvements that the GRC introduced in GRCh38.
In late December 2013, the Genome Reference Consortium (GRC) released an updated version of the human reference genome assembly, GRCh38, and submitted these new sequences to GenBank. This is the first time in four years that a new major version of the human genome has become available to the genomics community.
Perhaps you’ve been working on data mapped to the previous assembly, GRCh37, which became available in March 2009, or maybe you are still using an even earlier version, such as NCBI36 from March 2006. Is there a way to reduce the time and effort required to reanalyze your data in the context of the new assembly?
Yes! It’s NCBI’s Genome Remapping Service, or NCBI Remap for short.
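To give a feel for what converting coordinates between assemblies involves, here is a minimal Python sketch. It is not a description of how NCBI Remap is implemented, and it does not call any NCBI service; the function name `remap_position`, the block layout, and all coordinates are hypothetical toy values. It assumes we already have a list of aligned blocks between an old and a new assembly, and that a 100-base gap in the old assembly was replaced by 600 bases of sequence in the new one, so everything downstream shifts by 500.

```python
from typing import List, Optional, Tuple

# Each block is (old_start, old_end, new_start), 1-based and inclusive.
# These blocks are hypothetical toy data, not real GRCh37/GRCh38 alignments.
Block = Tuple[int, int, int]


def remap_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Translate a position on the old assembly to the new assembly.

    Returns None when the position falls outside every aligned block,
    for example inside a gap that had no sequence in the old assembly.
    """
    for old_start, old_end, new_start in blocks:
        if old_start <= pos <= old_end:
            # Keep the same offset from the start of the aligned block.
            return new_start + (pos - old_start)
    return None


# Old assembly: sequence 1-1000, gap 1001-1100, sequence 1101-2000.
# New assembly: the gap is filled with 600 bases (1001-1600),
# so the downstream block now starts at 1601 instead of 1101.
toy_blocks = [
    (1, 1000, 1),
    (1101, 2000, 1601),
]

print(remap_position(500, toy_blocks))   # upstream of the gap: unchanged
print(remap_position(1500, toy_blocks))  # downstream of the gap: shifted by 500
print(remap_position(1050, toy_blocks))  # inside the old gap: no mapping
```

Real assemblies differ in many places at once, so a real remapping relies on genome-wide alignments between the two assemblies rather than on a single offset like the one in this toy example.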
This month marks a major event in the realm of human genome research: the release of a new assembly of the genome, GRCh38. It has been over four years since the last major release (GRCh37 in March 2009), and we are going to explore several aspects of this new assembly in a series of blog posts over the coming weeks. In this initial post, we will give an overview of the data flow so that you will understand how NCBI received the data, where the data reside at NCBI, and what genome annotations you can expect from NCBI in the near future.
What is a genome assembly?
The haploid human genome consists of 22 autosomes plus the X and Y sex chromosomes. Each chromosome is a single DNA molecule, a sequence of millions of nucleotide bases. These molecules are linear, so one might expect each chromosome to be represented by a single, continuous sequence. Unfortunately, this is not the case, for two main reasons: 1) because of the nature of genomic DNA and the limitations of our sequencing methods, some parts of the genome remain unsequenced, and 2) emerging evidence suggests that some regions of the genome vary so much between individual people that they cannot be represented as a single sequence. For these reasons, modern genomic data sets present a model of the genome known as a genome assembly. This post will introduce the basic concepts of how we produce such assemblies, along with some basic vocabulary.
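As a rough illustration of the idea that a chromosome in an assembly is a model built from sequenced pieces and unsequenced gaps, here is a minimal Python sketch. The class names `Contig` and `Gap`, the helper functions, and the toy sequences are all assumptions made for this example; they do not reflect the file formats or data structures that the GRC or NCBI actually use.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Contig:
    """A stretch of continuous, sequenced bases (hypothetical toy type)."""
    name: str
    sequence: str


@dataclass
class Gap:
    """An unsequenced region with an estimated length in bases."""
    length: int


Component = Union[Contig, Gap]


def render(components: List[Component]) -> str:
    """Join the components into one string, writing each gap as a run of N."""
    parts = []
    for comp in components:
        if isinstance(comp, Gap):
            parts.append("N" * comp.length)
        else:
            parts.append(comp.sequence)
    return "".join(parts)


def gap_intervals(components: List[Component]) -> List[Tuple[int, int]]:
    """Return the 1-based, inclusive coordinates of every gap."""
    intervals = []
    pos = 0  # number of bases placed so far
    for comp in components:
        length = comp.length if isinstance(comp, Gap) else len(comp.sequence)
        if isinstance(comp, Gap):
            intervals.append((pos + 1, pos + length))
        pos += length
    return intervals


# A toy "chromosome": two sequenced pieces separated by a 4-base gap.
toy_chromosome = [
    Contig("contig_1", "ACGTAC"),
    Gap(4),
    Contig("contig_2", "GGCTA"),
]

print(render(toy_chromosome))         # ACGTACNNNNGGCTA
print(gap_intervals(toy_chromosome))  # [(7, 10)]
```

Filling a gap in a later assembly version then amounts to replacing a `Gap` component with newly sequenced bases, which changes the coordinates of everything downstream of it.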