If you’ve been searching in Gene, Nucleotide, Protein, Genome or Assembly databases, you’ve probably noticed the new search experience we introduced in September to interpret several common language searches and offer improved results. We’re excited to announce we’ve added as-you-type suggestions to the search bar in these databases.
Here’s a peek at the new menu in the NCBI Gene database.
Figure 1. Typing into the search box brings up automatic suggestions of the most popular queries.
Next Wednesday, November 14, 2018, NCBI staff will show you how to use NCBI’s genome browsers and other resources to interpret variants. The graphical displays of Genome Data Viewer (GDV) and Variation Viewer offer an interactive experience that allows you to explore NCBI’s rich collection of annotations, datasets and literature for deciphering your variant-associated data. In this presentation, we’ll step through case studies and show you how to quickly display relevant NCBI track sets, including the new RefSeq Functional Elements track; upload a file or remotely-hosted dataset and display it as a track; and use browser tracks to identify known variants, then assess variant functional and clinical significance and allele frequency. You will also learn how to navigate from the browsers to NCBI resources such as ClinVar, dbSNP and PubMed for additional variant information.
Date and time: Wed, Nov 14, 2018 12:00 PM – 12:45 PM EST
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
The RefSeq project at the NCBI and the Ensembl/GENCODE project at EMBL-EBI have provided independent high-quality human reference gene datasets to biologists since the sequencing of the human genome. Now we’re joining together on an exciting new project we’re calling Matched Annotation from the NCBI and EMBL-EBI or MANE, to provide a matched set of well-supported transcripts for human protein-coding genes and define one representative transcript for each gene. Both RefSeq and Ensembl will continue to provide a rich set of alternate transcripts per gene.
The MANE project builds on the successful CCDS collaboration (PMCID: PMC5753299) and incorporates feedback from RefSeq and Ensembl/GENCODE users who requested a common reference transcript dataset including one or a few key transcripts for each gene where the RefSeq and Ensembl/GENCODE transcripts are identical in length and sequence, and completely match the human reference genome sequence. We expect to later expand the project to include a larger subset of full-length transcripts that more fully represent the functional complexity of many genes. We’re leveraging public deep sequencing datasets to optimize 5’ and 3’ UTR endpoints to more accurately reflect transcriptional processes. To pick representative transcripts, we’ve developed computational methods to evaluate and integrate transcript expression levels, protein conservation, support from archived transcript submissions, clinical relevance, and other factors. For complex genes, annotation experts from both groups review the candidates to agree on a representative transcript, a process that often leads to improvements in both annotation sets.
The unified, high-quality transcript set provided by the MANE project will simplify the task of choosing a transcript for comparative genomics, clinical reporting, and basic research. When integrated across different public genome resources, this minimal, identically annotated transcript set will eliminate the need to choose between RefSeq and Ensembl/GENCODE datasets for genomic analyses. This will also make it easy for researchers who currently prefer one dataset over the other to exchange data or translate coordinates (or HGVS variation expressions) between RefSeq and Ensembl annotation results. Furthermore, the perfect alignment of all MANE transcripts to GRCh38 will make the set compatible with NGS-based sequencing technologies and other resources that use the latest and highest-quality reference human genome assembly available.
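To make the coordinate-translation point concrete, here is a minimal sketch of what converting an HGVS coding expression could look like once a matched transcript pair exists. It assumes that, because matched transcripts are identical in length and sequence, the `c.` part of the expression carries over unchanged and only the accession needs to be swapped. The lookup table, the accessions in it, and the function name are hypothetical placeholders, not real MANE data or an NCBI tool.

```python
import re

# Hypothetical lookup table of matched transcript pairs.
# These accessions are placeholders, not real MANE entries.
MATCHED_TRANSCRIPTS = {
    "NM_000000.0": "ENST00000000000.0",
}

# Splits "ACCESSION:c.CHANGE" into its two parts.
HGVS_CODING = re.compile(r"^(?P<accession>[^:]+):(?P<change>c\..+)$")


def refseq_to_ensembl_hgvs(expression, table=MATCHED_TRANSCRIPTS):
    """Swap the RefSeq accession for its matched Ensembl transcript ID.

    Assumes the two transcripts are identical, so the c. coordinates
    are reused as-is rather than recomputed.
    """
    match = HGVS_CODING.match(expression)
    if match is None:
        raise ValueError(f"not a coding HGVS expression: {expression!r}")
    accession = match.group("accession")
    if accession not in table:
        raise KeyError(f"no matched transcript for {accession}")
    return f"{table[accession]}:{match.group('change')}"


print(refseq_to_ensembl_hgvs("NM_000000.0:c.76A>T"))
```

Transcripts outside the matched set would still need a proper coordinate-mapping tool, which is why the sketch raises an error instead of guessing.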
Our goal is for the final MANE dataset to be stable, although individual sequences and the dataset as a whole will be versioned and allow for future updates and expansions as needed to incorporate significant new data and additional curation. We plan to release a partial “beta” transcript set by the end of the year for testing, and a large sequence update in the next few months to refine 5’ and 3’ RefSeq transcript ends and match the GRCh38 sequence. Ensembl plans to release similar updates in spring 2019.
We’re looking forward to your feedback! Next week, we will be presenting the project at the annual American Society of Human Genetics (ASHG) meeting in San Diego, CA, USA. Please attend our talks scheduled in the Genome Reference Consortium (GRC) workshop on Tuesday, October 16, at 1:00 PM, and in the Importance of Isoform Expression in Variant Interpretation Session (#94) on Saturday, October 20, at 9:15 AM. You can also visit us at the NCBI or Ensembl booths and posters throughout the meeting or send us feedback at email@example.com. We’re looking forward to your valuable input on our new initiative!
Earlier this year, we announced the release of a new and improved search feature that interprets plain language to give better results for common searches. This feature, originally developed in NCBI Labs and later released on the NCBI All Databases search, is now available across several NCBI resources: Nucleotide, Protein, Gene, Genome, and Assembly. Whether you are searching for a specific gene or for a whole genome, you will now retrieve NCBI’s best results regardless of the database you search.
The image below shows the results of a search for human INS in the Nucleotide database. Even though this is a Nucleotide search, the results include relevant information from Gene, Protein, and Taxonomy, plus links to the NCBI reference sequences (RefSeq) as well as access to BLAST and the insulin gene region in NCBI’s genome browser, the Genome Data Viewer.
Figure 1. The new natural language search result in the Nucleotide database from a search for human INS.
Try out this new search capability and let us know what you think. And keep visiting the NCBI Labs search page to try our latest experiments, which we’ll also announce here on NCBI Insights.
Professors, we know you’re busy — really, really busy. You have to develop and teach your courses and labs, coordinate and run your journal clubs and seminars, direct your lab’s research efforts, write grants and publications, counsel and mentor your students, and stay current on everything related to your teaching and research topics.
NCBI has information that can help with all of this, but there are so many interesting records and so little time to organize them. Sign up (Help) for or log in (Help) to your free NCBI Account and let us help you get started and get organized!
Read on – or watch the video embedded below – to learn more about what you can do with your NCBI Account.
The Consensus Coding Sequence (CCDS) update that compares NCBI’s Homo sapiens annotation release 109 to Ensembl’s release 92 is now reflected in Gene. This update adds 894 new CCDS IDs and adds 154 genes to the human CCDS set. CCDS release 22 includes a total of 33,397 CCDS IDs that correspond to 19,033 GeneIDs.
The CCDS project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long-term goal is to support convergence towards a standard set of gene annotations.
You can now download human annotation release 109 on FTP or explore it in the Genome Data Viewer, in the Gene database, and with BLAST.
Highlights in release 109:
- A total of 20,203 protein-coding genes and 17,871 non-coding genes were annotated.
- The number of annotated curated transcripts increased by 17% and genes with two or more curated alternative variants increased by 8%.
- The annotation includes 6,862 features and 2,075 GeneIDs for non-genic functional elements, such as regulatory regions and known structural elements. For example, see the opsin locus control region (OPSIN-LCR).
A study (PMID: 28158543) published in the July 2017 issue of Bioinformatics collects, classifies and analyzes single nucleotide variants (SNVs) that may affect response to currently approved drugs. The authors identified 2,640 SNVs of interest, most of which occur rarely in populations (minor allele frequency <0.01).
The researchers used protein sequence alignment tools and mined open data from multiple information resources accessed through E-utilities, including PubChem Compound (Kim et al., 2016 PMID: 26400175), NCBI Gene (Maglott et al., 2014 PMID: 25355515), NCBI Protein (Sayers, 2013), MMDB (Madej et al., 2012 PMID: 22135289), PDB (Berman et al., 2000 PMID: 10592235), dbSNP (Sherry et al., 2001 PMID: 11125122), and ClinVar (Landrum et al., 2016 PMID: 26582918).
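If you haven’t used E-utilities before, the sketch below shows roughly how a search request against one of these databases is put together. It only builds an ESearch URL and does not contact NCBI. The study’s actual queries are not given here, so the search term is purely illustrative, and the helper name `build_esearch_url` is our own.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for one Entrez database.

    db     -- Entrez database name, e.g. "gene", "snp", "clinvar"
    term   -- the search query text
    retmax -- maximum number of record IDs to return
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query only; not taken from the study.
print(build_esearch_url("gene", "CYP2C19[sym] AND human[orgn]"))
```

Fetching the URL returns a list of record IDs, which can then be passed to other E-utilities such as ESummary or EFetch. For anything beyond occasional requests, check NCBI’s usage guidelines on request rates and API keys.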
Questions, comments, and other feedback may be sent to Yanli Wang.
Last February, we added gene expression data to Gene. Now, you can access these data in a few new ways.
Figure 1. The expression teaser text from the human CYP2C19 gene record. CYP2C19 is a phase-one drug-metabolism gene expressed in liver and other organs/tissues involved in metabolizing drugs and other xenobiotics.
Expression pattern “teasers” in Summary
We’ve added a brief sentence describing the expression pattern to the Summary section. This teaser sentence describes tissue-specific expression of the gene, with a link to the complete description that appears in the Expression section.
For ease in accessing the orthology data subset, a new gene_orthologs FTP file has been created on the Gene FTP site. The file uses the same format as the gene_group file. As of January 31, 2018, the gene_group FTP file no longer includes orthologs.
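Here is a minimal sketch of reading the new file into a lookup of orthologs per gene. It assumes the tab-delimited gene_group layout with five columns (tax_id, GeneID, relationship, Other_tax_id, Other_GeneID) and a header line starting with `#`; please check the header of the file you download before relying on this. The sample rows use made-up GeneIDs.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the assumed five-column layout.
SAMPLE = (
    "#tax_id\tGeneID\trelationship\tOther_tax_id\tOther_GeneID\n"
    "9606\t1\tOrtholog\t10090\t100\n"
    "9606\t1\tOrtholog\t10116\t200\n"
    "9606\t2\tOrtholog\t10090\t300\n"
)


def load_orthologs(handle):
    """Map each GeneID to a list of (other_tax_id, other_gene_id) pairs."""
    orthologs = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the '#'-prefixed header line.
        if not row or row[0].startswith("#"):
            continue
        _tax_id, gene_id, relationship, other_tax_id, other_gene_id = row[:5]
        if relationship != "Ortholog":
            continue
        orthologs[gene_id].append((other_tax_id, other_gene_id))
    return dict(orthologs)


orthologs = load_orthologs(io.StringIO(SAMPLE))
print(orthologs["1"])
```

For the real file, open it in text mode (using `gzip.open(path, "rt")` if it is compressed) and pass the handle to `load_orthologs` in place of the sample.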