Sequence updates in human genome assembly GRCh38: filling in the gaps

In a previous blog post, we explained several important concepts about the human reference genome.  We presented a region of human chromosome 17 as an example of a location where the genome sequence was not fully assembled.  In this post, we are going to revisit the same gapped region to see how the Genome Reference Consortium (GRC) changed this part of the genome in GRCh38, the updated human reference assembly released in December 2013.  This region represents just one of the more than 1,000 changes and improvements that the GRC introduced in GRCh38.

First, we’ll examine corresponding regions of chromosome 17 in the previous (GRCh37.p13) and current (GRCh38) reference assemblies (Figure 1). In GRCh37.p13, the region spanning 21,200K to 21,700K contained a gap of unknown size (the blank area with no components in the figure), whose length was arbitrarily set to 100K. The ‘K’ indicates a kilobase pair, or 1,000 bp.

Figure 1: Updates to the reference human genome assembly in a region of chromosome 17. Top panel: A region of chromosome 17 (NC_000017.10) from the GRCh37.p13 assembly showing the components. Bottom panel: The corresponding region of chromosome 17 (NC_000017.11) in the GRCh38 assembly showing the new components. Components shared between the two builds are marked with checks. Components not present in GRCh38 are marked with an “X”. The labeled components AC233702.5 and ABBA01006765.1 are two of the 11 new components in GRCh38.

GRCh38 contains new information in this region, including 11 new components that expanded the area by about 500K. This expansion resulted in a change in coordinates for all downstream genes and features on chromosome 17. Moreover, three of the components in GRCh37.p13 within this region are no longer in GRCh38, and therefore any genes or features annotated on those components have moved somewhere else in the new build or have been deleted.
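To make the coordinate shift concrete, here is a minimal sketch in Python. It is a toy model, not how any NCBI tool works: it assumes a single edit at one hypothetical boundary position and uses the approximate 500K expansion quoted above. The feature names, positions, and the `shift_downstream` helper are all invented for illustration, and features that overlap the edited region are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str   # hypothetical feature label
    start: int  # 1-based start coordinate, in bp
    end: int    # 1-based end coordinate, in bp


def shift_downstream(feature, edit_position, net_change):
    """Shift a feature by net_change if it starts after edit_position.

    Toy model: features at or before the edit keep their coordinates.
    """
    if feature.start > edit_position:
        return Feature(feature.name,
                       feature.start + net_change,
                       feature.end + net_change)
    return feature


# Illustrative values only: the boundary is assumed, and 500K is the
# approximate expansion mentioned in the text.
EDIT_POSITION = 21_700_000
NET_CHANGE = 500_000

upstream = Feature("hypothetical_upstream_gene", 10_000_000, 10_050_000)
downstream = Feature("hypothetical_downstream_gene", 30_000_000, 30_050_000)

for feature in (upstream, downstream):
    moved = shift_downstream(feature, EDIT_POSITION, NET_CHANGE)
    print(feature.name, feature.start, "->", moved.start)
```

In practice an assembly update involves many such changes along a chromosome, so a single fixed offset is not enough to convert real coordinates between builds.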

Major changes such as these require researchers to assess how the updates affect any genes of interest in the many regions that have changed in GRCh38. To make this easier, NCBI offers a Genome Remapping Service that converts mapping data from one build to another, as described in a previous post. You can use this tool to confirm that the region on chromosome 17 between 21,200K and 21,700K in GRCh37.p13 roughly corresponds to the region between 21,300K and 22,200K on chromosome 17 in GRCh38.
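If you want to prepare a region like this one as an input interval, the sketch below converts the "K" notation used in this post into base pairs and writes a BED-style line. The helper names, the "chr17" label, and the feature name are assumptions made for illustration. BED is one widely used interval format; check the remapping service's documentation for the formats and sequence names it accepts.

```python
def k_to_bp(text):
    """Convert a coordinate such as '21,200K' or '21,200 K' to base pairs."""
    cleaned = text.replace(",", "").replace(" ", "").upper()
    if not cleaned.endswith("K"):
        raise ValueError(f"expected a value ending in K, got {text!r}")
    return int(cleaned[:-1]) * 1_000


def to_bed_line(chrom, start_1based, end_1based, name):
    """Format a 1-based, inclusive interval as a BED line.

    BED uses a 0-based start and an exclusive end, so only the start
    is decremented.
    """
    return f"{chrom}\t{start_1based - 1}\t{end_1based}\t{name}"


# The GRCh37.p13 region discussed above; the boundaries are rounded.
start = k_to_bp("21,200K")
end = k_to_bp("21,700K")

# "chr17" and the feature name are hypothetical labels for this example.
bed_line = to_bed_line("chr17", start, end, "GRCh37_gapped_region")
print(bed_line)
```

Because the boundaries quoted in the post are rounded to the nearest 100K, the resulting interval is only an approximation of the region shown in Figure 1.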

In a future post, we will take a closer look at this region of chromosome 17 to see how the gene annotations changed as a result of the new components.

The Tasmanian Devil 2: The tumor and Tasmanian devil mitochondrial genomes

The Tasmanian devil (Sarcophilus harrisii), the last remaining large marsupial carnivore, now faces extinction because of a strange and deadly infection, a transmissible cancer known as Transmissible Devil Facial Tumor Disease (TDFTD). In a previous NCBI Insights post, we discussed gene expression data from the tumors that established their neural origin and showed the tumors were likely derived from Schwann cells. In this post, we’ll consider some of the genome sequencing projects in the NCBI databases and explore evidence that the tumor originated in a different individual than the affected animal, supporting the idea that the tumor cells themselves are infectious agents.

NCBI’s Genome Remapping Service assists in the transition to the new human genome reference assembly (GRCh38)

In late December 2013, the Genome Reference Consortium (GRC) released an updated version of the human reference genome assembly, GRCh38, and submitted these new sequences to GenBank. This is the first time in four years that a new major version of the human genome has become available to the genomics community.

Perhaps you’ve been working on data mapped to the previous assembly (GRCh37) that became available in March 2009, or maybe you are still using an even earlier version, such as NCBI36 from March 2006. Is there a way to reduce the amount of time and effort required to reanalyze your data in the context of the new assembly?

Yes! It’s NCBI’s Genome Remapping Service, or NCBI Remap for short.

A Librarian’s Guide to NCBI — an intensive training course for medical librarians to be offered April 2014

The NCBI, in partnership with the National Library of Medicine Training Center (NTC), will offer the Librarian’s Guide to NCBI course on the NIH campus in April 2014. This will be the second presentation of the course; it was previously offered in the spring of 2013 (NCBI Insights April 11 and May 6, 2013). After the course, we will post lecture slides and hands-on practical exercises on the education area of the NCBI FTP site, and video tutorials of the course lectures will be available on the NCBI YouTube channel. Materials from the 2013 course are available, as well as lecture videos for the expression module.

Introducing the New Human Genome Assembly: GRCh38

This month marks a major event in the realm of human genome research: the release of a new assembly of the genome, GRCh38. It has been over four years since the last major release (GRCh37 in March 2009), and we are going to explore several aspects of this new assembly in a series of blog posts over the coming weeks. In this initial post, we will give an overview of the data flow so that you will understand how NCBI received the data, where the data reside at NCBI, and what genome annotations you can expect from NCBI in the near future.

Making Custom Databases for Web BLAST

An easy way to speed up your BLAST analysis is to search a smaller database targeted to sequences of interest. We’ll describe here a few ways to create such custom databases on the BLAST web pages.  For this Quick Tip we’ll use the pages in the Basic BLAST section of the BLAST home page.

BLAST parent databases

Generating a custom database begins with selecting the appropriate parent database. The BLAST Guide provides database descriptions to help with choosing a database.  You select the parent in the Database pull-down menu, shown in Figure 1. Selecting the database is really your first opportunity to customize.

Figure 1. The database selection pull-down lists: top panel, nucleotide databases; bottom panel, protein databases

Setting Up Automatic NCBI Searches and New Record Alerts

Do you regularly perform PubMed searches to find new articles on your topic of interest?

Would you like to know when new sequence records become available for your gene?

Is it important to be alerted when new bioactivity assays are available with inhibitor data for your enzyme?

With a free My NCBI account, you can easily set up a series of e-mail alerts to notify you of such new information. You can read more about the many other functions of My NCBI.

Here’s how to set up these alerts:

NCBI’s 25th Anniversary and The Jim Gray eScience Award

November 2013 marks 25 years since the founding of the National Center for Biotechnology Information (NCBI).

In honor of NCBI’s 25th anniversary, United States Senator Ben Cardin read a statement into the Congressional Record recognizing years of service in providing access to biomedical and genomic information to enhance the world’s science and health.

On November 1st, an awards and recognition program was held on the NIH Campus in Bethesda, Maryland, to commemorate this occasion.

Tony Hey, Ph.D., Vice President of Microsoft Research, presenting the Jim Gray eScience award to David Lipman, M.D., Director of the NCBI.

At this event, Tony Hey, PhD, Vice President of Microsoft Research, presented NCBI Director David Lipman, MD, with the Jim Gray eScience Award, which recognizes researchers who have made outstanding contributions to the field of data-intensive computing in the pursuit of open, supportive, and collaborative research models.

Early Developments in the PubMed Commons Pilot

It’s been an exciting and productive time since the PubMed Commons beta launch. We’ve learned a great deal, both here working under the hood and from the conversations in social media and blog posts.

We are working on answers to the questions people are asking, both through our Twitter account and by revising and expanding the information on the PubMed Commons page, which we will update soon. We will also try out a Twitter chat, so keep an eye on @PubMedCommons for the announcement.

There are now about 1,000 people signed up in the Commons. Remember, any author in PubMed can join, from anywhere in the world. Check out our step-by-step guide. Once you are in, you can invite others. So please spread the word!

Joining PubMed Commons: A Step-by-step Guide

In our previous post we wrote about a new service called PubMed Commons that allows researchers to add comments to individual PubMed records. As we described in that post, PubMed Commons is currently a beta pilot release, and requires interested people to join the system before they can view or add comments. This post will describe how to join PubMed Commons.
