NCBI and EBI have been hard at work on our joint MANE collaboration, providing a set of representative transcripts for human protein-coding genes that are identically annotated in the NCBI RefSeq and Ensembl/GENCODE annotation sets and exactly match the GRCh38 reference assembly. We’re pleased to announce MANE v0.92, now covering 16,865 genes or ~88% of known human protein-coding genes.
In particular, we’ve focused on clinically relevant genes and MANE Select now includes 99% of genes with high gene-disease validity. This release also includes 43 extra transcripts labeled “MANE Plus Clinical” that we’ve chosen to aid in clinical reporting, for example, when there are additional pathogenic variants not covered in the MANE Select transcript. While it’s critical to consider other alternatively-spliced transcripts for variant interpretation or functional analyses, the MANE Select and MANE Plus Clinical transcripts provide a common foundation for clinical reporting, and other analyses that benefit from using just one well-supported transcript or protein per gene.
A total of 20,203 protein-coding genes and 17,871 non-coding genes were annotated.
The number of annotated curated transcripts increased by 17% and genes with two or more curated alternative variants increased by 8%.
The annotation includes 6,862 features and 2,075 GeneIDs for non-genic functional elements, such as regulatory regions and known structural elements. For example, see the opsin locus control region (OPSIN-LCR).
To continue providing efficient and timely processing, annotation, and dissemination of data, dbSNP’s architecture and process flow have been redesigned. The technical redesign prepares the database for increasing data volumes and providing timely, effective and trustworthy reference SNP results as submission rates continue to increase.
Highlights of the new system include:
Use of data objects instead of a relational database
Improved algorithms for clustering data into unique Reference SNPs
Automation of the entire process to provide timely releases
Guaranteed data consistency across dbSNP data accessed using web-based products or downloaded content, such as VCF and FTP files
In an earlier blog post, we discussed how sequence updates in GRCh38, the most recent version of the human reference genome, filled in a gap in human chromosome 17 near position 21,300K and expanded the region by 500K (500,000 base pairs). In this post, we will again consider this same region, but with an emphasis now on how GRCh38 also improved the gene annotation.
In a previous blog post, we explained several important concepts about the human reference genome. We presented a region of human chromosome 17 as an example of a location where the genome sequence was not fully assembled. In this post, we are going to revisit the same gapped region to see how the Genome Reference Consortium (GRC) changed this part of the genome in GRCh38, the updated human reference assembly released in December 2013. This region represents just one of the more than 1,000 changes and improvements that the GRC introduced in GRCh38.
In late December 2013, the Genome Reference Consortium (GRC) released an updated version of the human reference genome assembly, GRCh38, and submitted these new sequences to GenBank. This is the first time in four years that a new major version of the human genome has become available to the genomics community.
Perhaps you’ve been working on data mapped to the previous assembly (GRCh37) that became available in March 2009, or maybe you are still using an even earlier version, such as NCBI36 from March 2006. Is there a way to reduce the amount of time and effort required to reanalyze your data in the context of the new assembly?
This month marks a major event in the realm of human genome research: the release of a new assembly of the genome, GRCh38. It has been over four years since the last major release (GRCh37 in March 2009), and we are going to explore several aspects of this new assembly in a series of blog posts over the coming weeks. In this initial post, we will give an overview of the data flow so that you will understand how NCBI received the data, where the data are at NCBI and what genome annotations you can expect from NCBI in the near future. Continue reading “Introducing the New Human Genome Assembly: GRCh38”→
The haploid human genome consists of 22 autosomal chromosomes and the Y and the X chromosomes. Each of the chromosomes represents a single DNA molecule, a sequence of millions of nucleotide bases. These molecules are linear, so one might expect that we should represent each chromosome by a single, continuous sequence.
Unfortunately, this is not the case for two main reasons: 1) because of the nature of genomic DNA and the limitations of our sequencing methods, some parts of the genome remain unsequenced, and 2) emerging evidence suggests that some regions of the genome vary so much between individual people that they cannot be represented as a single sequence.
In response to this, modern genomic data sets present a model of the genome known as a genome assembly. This post will introduce the basic concepts of how we produce such assemblies as well as some basic vocabulary.