In March, we announced NCBI Datasets, a new resource that lets you easily retrieve and download data from across NCBI databases. Did you know you can now fetch NCBI Gene data programmatically using the NCBI Datasets API or command-line tool? Quickly retrieve both metadata and gene sequence data for multiple Gene records including transcripts and proteins in one shell command or API request. The API documentation is a good way to get started with programmatic access (Figure 1).
Figure 1. The Datasets API documentation showing a demonstration retrieving Gene metadata using RefSeq mRNA accessions. The API returns a readily processed JSON object.
NCBI Datasets now offers Gene tables: customizable tables of the genes you specify, with key gene information, and the ability to easily download a dataset of genomic, transcript and protein sequences.
Drag and drop a list of Gene IDs or gene symbols, and the data table shows your genes with up to 15 columns of metadata, including genomic coordinates, RefSeq transcript and protein accessions, Ensembl IDs and UniProt accessions, and other gene information. You can browse and select items in your table on the web, or download everything to your computer for later analysis (Figure 1).
Interested in human genes involved in COVID-19 biology? NCBI’s RefSeq group has been hard at work compiling a set of human genes with roles in coronavirus infection and disease. You can now see and search for these genes and their regulatory elements in NCBI Gene and RefSeq.
Figure 1. Top section of the human ACE2 record in the Gene database. COVID-19 information can be found in the Summary and Annotation information sections.
We’ve added several new enhancements to the RefSeq Functional Elements dataset, which provides genome annotation and richly annotated RefSeq and Gene records for experimentally validated non-genic functional regions in human and mouse. Read on to see what we’ve done!
Are you interested in high quality genomic annotations for human and mouse? Check out the Consensus Coding Sequence (CCDS) project! Release 23 of the CCDS project is now available in Entrez Gene. This release compares NCBI’s Mus musculus annotation release 108 to Ensembl’s annotation release 98. This update adds 1,570 new CCDS records and 175 genes to the mouse CCDS dataset. In total, release 23 includes 27,219 CCDS records that correspond to 20,486 genes.
Genome Workbench version 3 is a major upgrade, including the addition of the Genome Submission Wizard. This video guides you through the wizard, from uploading your genome data file to completion of the submitter report, which is ready to submit to GenBank using tools such as Submission Portal or BankIt. Note: An on-line tutorial is under “Manuals” on the Genome Workbench home page.
On Wednesday, September 11, 2019 at 12 PM, NCBI staff will present a webinar for people with limited experience working with gene and sequence information. You will learn about the kinds of data available for genes and sequences, how to select the most informative records, and how to find related genes and sequences using pre-computed information and the BLAST sequence search service.
Date and time: Wed, Sep 11, 2019 12:00 PM – 12:30 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
As we previously announced, we planned to retire the UniGene web pages at the end of July, 2019. All UniGene pages now redirect to this post. We have also removed links to UniGene from the NCBI home page and other resources.
Although the web pages are no longer available, you will still be able to download the final UniGene builds as static content from the FTP site. You will also be able to match UniGene cluster numbers to Gene records by searching Gene with UniGene cluster numbers. For best results, restrict to the “UniGene Cluster Number” field rather than all fields in Gene. For example, a search with Mm.2108[UniGene Cluster Number] finds the mouse transthyretin Gene record (Ttr). You can use the advanced search page to help construct these searches. Keep in mind that the Gene record contains selected Reference Sequences and GenBank mRNA sequences rather than the larger set of expressed sequences in the UniGene cluster.
Please write to us with any comments, concerns, or if you need help using UniGene data.
We recently showed you a new a way to search for and view sets of orthologous genes from vertebrates. You can now get an additional set of search results that we are calling similar genes. These are related through protein architecture to the orthologous gene set and include genes from all metazoans and selected plant, fungal, and protist species. You can quickly find related genes within a species, compare them to those from other annotated metazoan genomes, and have access to other useful gene resources. To find a set of similar genes, enter a gene symbol or select the gene symbol + orthologs option from the selections menu.
For example if you search for ‘AGO2 orthologs‘, in addition to the link to orthologs from vertebrates, you’ll get a link to a set of similar genes (Genes with similar protein architectures) across a broad evolutionary spectrum that includes genes from invertebrates, fungi, and green plants (Figure 1).
Figure 1. Genes with similar protein architectures to AGO2. The original search was AGO2 orthologs, which brings up the suggestion box with the links to similar genes as well as the AGO2 vertebrate orthologs. The similar genes include entries from a broad taxonomic range of eukaryotic organisms.
If you search for ‘GH1‘, you’ll get a link to similar genes that includes members of the growth hormone family that are not part of NCBI’s vertebrate ortholog set.
Figure 2. The human subset of genes with similar protein architectures to GH1 showing other members (paralogs) of the GH1 gene family (GH2, CSH1, CSH2, CSHL1). These are not included in the ortholog set.
Try out the following searches and follow the links to the Genes with similar protein architectures
We have a new and improved search experience for viral genes from select human pathogens. When you search for a virus such as HIV-1 (more examples below), you now get an interactive graphical representation of the viral genome where you can see all the annotated viral proteins in context. Clicking on the gene / protein objects allows you to access sequences, publications, and analysis tools for the selected protein. This new feature is designed to help you quickly find information relevant to your research on clinically important viruses.Figure 1. Top: The virus genome graphic result for a search with HIV-1 with access to analysis tools, downloads, and relevant results in the Genome and Virus resources. Bottom: The result obtained by clicking the env gene graphic, which provides links to protein and nucleotide sequences, the literature, analysis tools, and downloads.
Try it out using the following example searches and let us know what you think!