Now we’re joining together on an exciting new project we’re calling Matched Annotation from the NCBI and EMBL-EBI, or MANE, to provide a matched set of well-supported transcripts for human protein-coding genes and to define one representative transcript for each gene. Both RefSeq and Ensembl will continue to provide a rich set of alternate transcripts per gene.
Earlier this year, we announced the release of a new and improved search feature that interprets plain language to give better results for common searches. This feature, originally developed in NCBI Labs and later released on the NCBI All Databases search, is now available across several NCBI resources: Nucleotide, Protein, Gene, Genome, and Assembly. Whether you are searching for a specific gene or for a whole genome, you will now retrieve NCBI’s best results regardless of the database you search.
The image below shows the results for a search for human INS in the Nucleotide database. Even though this is a Nucleotide search, the results include relevant information from Gene, Protein, Taxonomy, plus links to the NCBI reference sequences (RefSeq) as well as access to BLAST and the insulin gene region in NCBI’s genome browser, the Genome Data Viewer.

Figure 1. The new natural language search result in the Nucleotide database from a search for human INS.
Try out this new search capability and let us know what you think. And keep visiting the NCBI Labs search page to try our latest experiments, which we’ll also announce here on NCBI Insights.
Professors, we know you’re busy — really, really busy. You have to develop and teach your courses and labs, coordinate and run your journal clubs and seminars, direct your lab’s research efforts, write grants and publications, counsel and mentor your students, and stay current on everything related to your teaching and research topics.
NCBI has information that can help with all of this, but there are so many interesting records and so little time to organize them. Sign up for or log in to your free NCBI Account and let us help you get started and get organized!
Read on – or watch the video embedded below – to learn more about what you can do with your NCBI Account.
The Consensus Coding Sequence (CCDS) update that compares NCBI’s Homo sapiens annotation release 109 to Ensembl’s release 92 is now reflected in Gene. This update adds 894 new CCDS IDs and 154 genes to the human CCDS set. CCDS release 22 includes a total of 33,397 CCDS IDs that correspond to 19,033 GeneIDs.
The CCDS project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long-term goal is to support convergence towards a standard set of gene annotations.
Highlights in release 109:
- A total of 20,203 protein-coding genes and 17,871 non-coding genes were annotated.
- The number of annotated curated transcripts increased by 17%, and the number of genes with two or more curated alternative variants increased by 8%.
- The annotation includes 6,862 features and 2,075 GeneIDs for non-genic functional elements, such as regulatory regions and known structural elements. For example, see the opsin locus control region (OPSIN-LCR).
A study (PMID: 28158543) published in the July 2017 issue of Bioinformatics collects, classifies, and analyzes single nucleotide variants (SNVs) that may affect response to currently approved drugs. The authors identified 2,640 SNVs of interest, most of which occur rarely in populations (minor allele frequency <0.01).
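To make the "rare" cutoff concrete, here is a minimal Python sketch of filtering variants by minor allele frequency (MAF). It is not the study's pipeline. The `Variant` class, the function names, and the three example records are hypothetical and exist only for illustration; the one detail taken from the text above is the 0.01 threshold.

```python
from dataclasses import dataclass

# Threshold taken from the text above: variants with MAF < 0.01 count as rare.
RARE_MAF_THRESHOLD = 0.01


@dataclass(frozen=True)
class Variant:
    """Hypothetical record; the identifiers below are not real dbSNP entries."""
    rsid: str
    maf: float  # minor allele frequency in some reference population


def is_rare(variant, threshold=RARE_MAF_THRESHOLD):
    """Return True when the variant's MAF is strictly below the threshold."""
    return variant.maf < threshold


def split_by_frequency(variants, threshold=RARE_MAF_THRESHOLD):
    """Partition variants into (rare, common) lists, preserving input order."""
    rare = [v for v in variants if is_rare(v, threshold)]
    common = [v for v in variants if not is_rare(v, threshold)]
    return rare, common


# Made-up example data, for illustration only.
EXAMPLE_VARIANTS = [
    Variant("rsEXAMPLE1", 0.002),
    Variant("rsEXAMPLE2", 0.15),
    Variant("rsEXAMPLE3", 0.01),  # exactly at the cutoff, so not rare
]

if __name__ == "__main__":
    rare, common = split_by_frequency(EXAMPLE_VARIANTS)
    print("rare:", [v.rsid for v in rare])
    print("common:", [v.rsid for v in common])
```

Because the comparison is strict, a variant sitting exactly at 0.01 falls into the common group, which matches the "<0.01" wording above.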
The researchers used protein sequence alignment tools and mined open data from multiple information resources accessed through E-utilities, including PubChem Compound (Kim et al., 2016 PMID: 26400175), NCBI Gene (Maglott et al., 2014 PMID: 25355515), NCBI Protein (Sayers, 2013), MMDB (Madej et al., 2012 PMID: 22135289), PDB (Berman et al., 2000 PMID: 10592235), dbSNP (Sherry et al., 2001 PMID: 11125122), and ClinVar (Landrum et al., 2016 PMID: 26582918).
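If you have not used E-utilities before, the sketch below shows roughly what a request looks like. It only builds ESearch and ESummary URLs and does not contact NCBI. The helper names, the example query, and the choice of JSON output are our own assumptions and are not taken from the study.

```python
from urllib.parse import urlencode

# Public base URL for the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL that returns record IDs matching `term` in `db`."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_esummary_url(db, ids):
    """Build an ESummary URL for a list of record IDs in `db`."""
    params = {"db": db, "id": ",".join(str(i) for i in ids), "retmode": "json"}
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative query: the human insulin gene in the Gene database.
    print(build_esearch_url("gene", "INS[sym] AND human[orgn]"))
    # GeneID 3630 is human INS.
    print(build_esummary_url("gene", [3630]))
```

Fetching either URL (for example with `urllib.request.urlopen`) returns the JSON response. For anything beyond a few requests, check NCBI's usage guidelines on request rates and API keys first.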
Questions, comments, and other feedback may be sent to Yanli Wang.
Last February, we added gene expression data to Gene. Now, you can access these data in a few new ways.
Expression pattern “teasers” in Summary
We’ve added a brief sentence describing the expression pattern to the Summary section. This teaser sentence describes tissue-specific expression of the gene, with a link to the complete description that appears in the Expression section.
For ease in accessing the orthology data subset, a new gene_orthologs FTP file has been created on the Gene FTP site. The file uses the same format as the gene_group file. As of January 31, 2018, the gene_group FTP file no longer includes orthologs.
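Here is a minimal Python sketch of reading a file in this format. It assumes the tab-delimited layout we understand gene_group to use: a header line starting with `#`, followed by the columns tax_id, GeneID, relationship, Other_tax_id, and Other_GeneID. Check the README on the Gene FTP site before relying on that layout. The function name and the GeneIDs in the sample rows are made up for illustration; only the taxonomy IDs (9606 human, 10090 mouse, 10116 rat) are real.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the assumed five-column layout.
SAMPLE_ROWS = (
    "#tax_id\tGeneID\trelationship\tOther_tax_id\tOther_GeneID\n"
    "9606\t111\tOrtholog\t10090\t222\n"
    "9606\t111\tOrtholog\t10116\t333\n"
    "9606\t444\tOrtholog\t10090\t555\n"
)


def read_orthologs(handle):
    """Map (tax_id, GeneID) to a list of (Other_tax_id, Other_GeneID) pairs."""
    orthologs = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the '#'-prefixed header.
        if not row or row[0].startswith("#"):
            continue
        tax_id, gene_id, relationship, other_tax_id, other_gene_id = row[:5]
        if relationship != "Ortholog":
            continue
        key = (int(tax_id), int(gene_id))
        orthologs[key].append((int(other_tax_id), int(other_gene_id)))
    return dict(orthologs)


if __name__ == "__main__":
    table = read_orthologs(io.StringIO(SAMPLE_ROWS))
    for gene, partners in table.items():
        print(gene, "->", partners)
```

To read the real file, pass an open text handle instead of the `io.StringIO` sample; if the download is gzip-compressed, `gzip.open(path, "rt")` gives you one.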
The protein interactions dataset now has:
- 8,005 interactions,
- 16,215 interaction descriptions,
- 3,859 proteins encoded by 3,757 human genes,
- and 6,822 publications.
The replication interactions dataset now has:
- 1,595 interactions,
- 1,854 interaction descriptions,
- 1,583 proteins encoded by 1,583 human genes,
- and 229 publications.