Programmatic access to Gene data using Datasets command-line and API

In March, we announced NCBI Datasets, a new resource that lets you easily retrieve and download data from across NCBI databases. Did you know you can now fetch NCBI Gene data programmatically using the NCBI Datasets API or command-line tool? Quickly retrieve both metadata and sequence data, including transcripts and proteins, for multiple Gene records in one shell command or API request. The API documentation is a good way to get started with programmatic access (Figure 1).

Figure 1. The Datasets API documentation showing a demonstration retrieving Gene metadata using RefSeq mRNA accessions. The API returns a readily processed JSON object.
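As a rough illustration of the kind of request shown in Figure 1, here is a minimal Python sketch that builds a URL for retrieving Gene metadata from RefSeq mRNA accessions. The base URL, API version, and path layout are assumptions not taken from this post and may differ from the current service, the function names are hypothetical, and the accessions are only examples, so check the API documentation before relying on any of it.

```python
import json
import urllib.request
from urllib.parse import quote

# ASSUMPTION: base URL and version of the Datasets REST API; confirm in the API docs.
DATASETS_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def gene_metadata_url(accessions):
    """Build a request URL for Gene metadata from RefSeq accessions.

    ASSUMPTION: the service accepts a comma-separated list of accessions
    under a /gene/accession/ path.
    """
    if not accessions:
        raise ValueError("at least one accession is required")
    # Percent-encode each accession, then join them with commas.
    joined = ",".join(quote(acc.strip(), safe="") for acc in accessions)
    return f"{DATASETS_BASE}/gene/accession/{joined}"


def fetch_gene_metadata(accessions, timeout=30):
    """Send the request and parse the JSON response (needs network access)."""
    with urllib.request.urlopen(gene_metadata_url(accessions), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Example accessions, used only to show the URL shape.
    print(gene_metadata_url(["NM_021803.4", "NM_000546.6"]))
```

Keeping URL construction separate from the network call makes it easy to inspect the request before sending it. The shape of the returned JSON is not assumed here, so explore the parsed object rather than hard-coding field names.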

NCBI Datasets now provides downloads of gene data for more than 30 thousand organisms

NCBI Datasets now offers Gene tables: customizable tables of the genes you specify, with key gene information, and the ability to easily download a dataset of genomic, transcript and protein sequences.

Drag and drop a list of Gene IDs or gene symbols, and the data table shows your genes with up to 15 columns of metadata, including genomic coordinates, RefSeq transcript and protein accessions, Ensembl IDs and UniProt accessions, and other gene information. You can browse and select items in your table on the web, or download everything to your computer for later analysis (Figure 1).

Figure 1. The Data tables web download. Top panel. Enter or upload a list of gene identifiers or symbols. Bottom panel. The resulting table display allows you to browse results, download the table or the sequence data for the genes (genomic, transcripts, proteins).

The latest in COVID-19 related human gene annotation now in NCBI RefSeq and Gene

Interested in human genes involved in COVID-19 biology? NCBI’s RefSeq group has been hard at work compiling a set of human genes with roles in coronavirus infection and disease. You can now see and search for these genes and their regulatory elements in NCBI Gene and RefSeq.

Figure 1. Top section of the human ACE2 record in the Gene database. COVID-19 information can be found in the Summary and Annotation information sections.

New interaction data, downloads and track hub available for RefSeq Functional Elements

We’ve added several new enhancements to the RefSeq Functional Elements dataset, which provides genome annotation and richly annotated RefSeq and Gene records for experimentally validated non-genic functional regions in human and mouse. Read on to see what we’ve done!

CCDS Release 23 for Mouse Now in Entrez Gene

Are you interested in high quality genomic annotations for human and mouse? Check out the Consensus Coding Sequence (CCDS) project! Release 23 of the CCDS project is now available in Entrez Gene. This release compares NCBI’s Mus musculus annotation release 108 to Ensembl’s annotation release 98. This update adds 1,570 new CCDS records and 175 genes to the mouse CCDS dataset. In total, release 23 includes 27,219 CCDS records that correspond to 20,486 genes.

NCBI on YouTube: new videos on PubMed, My Bibliography, sequence data and more

Here are the latest videos on our YouTube channel. Subscribe to get alerts for new videos.

Introducing the Genome Submission Wizard in Genome Workbench v3.0

Genome Workbench version 3 is a major upgrade that adds the Genome Submission Wizard. This video guides you through the wizard, from uploading your genome data file to completing the submitter report, which is ready to submit to GenBank using tools such as Submission Portal or BankIt. Note: An online tutorial is under “Manuals” on the Genome Workbench home page.

September 11 Webinar: A beginner’s guide to genes and sequences at NCBI

On Wednesday, September 11, 2019 at 12 PM, NCBI staff will present a webinar for people with limited experience working with gene and sequence information. You will learn about the kinds of data available for genes and sequences, how to select the most informative records, and how to find related genes and sequences using pre-computed information and the BLAST sequence search service.

The UniGene web pages are now retired

As we previously announced, we planned to retire the UniGene web pages at the end of July 2019. All UniGene pages now redirect to this post. We have also removed links to UniGene from the NCBI home page and other resources.

Although the web pages are no longer available, you can still download the final UniGene builds as static content from the FTP site. You can also match UniGene cluster numbers to Gene records by searching Gene with UniGene cluster numbers. For best results, restrict the search to the “UniGene Cluster Number” field rather than all fields in Gene. For example, a search with Mm.2108[UniGene Cluster Number] finds the mouse transthyretin Gene record (Ttr). You can use the advanced search page to help construct these searches. Keep in mind that the Gene record contains selected Reference Sequences and GenBank mRNA sequences rather than the larger set of expressed sequences in the UniGene cluster.
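If you have many cluster numbers to look up, the field-restricted search can be scripted. The Python sketch below builds the query term and a search URL for the Gene database. It assumes the E-utilities ESearch endpoint, which this post does not mention, and the function names are hypothetical; treat it as one possible approach rather than a recommended workflow.

```python
from urllib.parse import urlencode

# ASSUMPTION: searching through the E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def unigene_term(cluster):
    """Restrict a UniGene cluster number to the 'UniGene Cluster Number' field."""
    return f"{cluster}[UniGene Cluster Number]"


def gene_search_url(cluster):
    """Build an ESearch URL that searches Gene for one UniGene cluster number."""
    params = {
        "db": "gene",                   # search the Gene database
        "term": unigene_term(cluster),  # field-restricted query
        "retmode": "json",              # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # The example cluster from the text: mouse transthyretin (Ttr).
    print(unigene_term("Mm.2108"))
    print(gene_search_url("Mm.2108"))
```

The same term string can be pasted into the Gene search box on the web, so the helper is useful even if you do not script the request.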

Please write to us with any comments, concerns, or if you need help using UniGene data.

A new way to find an expanded set of similar genes

We recently showed you a new way to search for and view sets of orthologous genes from vertebrates. You can now get an additional set of search results that we are calling similar genes. These are related through protein architecture to the orthologous gene set and include genes from all metazoans and selected plant, fungal, and protist species. You can quickly find related genes within a species, compare them to those from other annotated metazoan genomes, and access other useful gene resources. To find a set of similar genes, enter a gene symbol or select the gene symbol + orthologs option from the selections menu.

For example, if you search for ‘AGO2 orthologs’, in addition to the link to orthologs from vertebrates, you’ll get a link to a set of similar genes (Genes with similar protein architectures) across a broad evolutionary spectrum that includes genes from invertebrates, fungi, and green plants (Figure 1).

Figure 1. Genes with similar protein architectures to AGO2. The original search was AGO2 orthologs, which brings up the suggestion box with the links to similar genes as well as the AGO2 vertebrate orthologs. The similar genes include entries from a broad taxonomic range of eukaryotic organisms.

If you search for ‘GH1’, you’ll get a link to similar genes that includes members of the growth hormone family that are not part of NCBI’s vertebrate ortholog set.

Figure 2. The human subset of genes with similar protein architectures to GH1 showing other members (paralogs) of the GH1 gene family (GH2, CSH1, CSH2, CSHL1). These are not included in the ortholog set.

Try out the following searches and follow the links to the Genes with similar protein architectures.

Please let us know what you think!

Genome context graphic now in virus search results

We have a new and improved search experience for viral genes from select human pathogens. When you search for a virus such as HIV-1 (more examples below), you now get an interactive graphical representation of the viral genome where you can see all the annotated viral proteins in context. Clicking on the gene / protein objects allows you to access sequences, publications, and analysis tools for the selected protein. This new feature is designed to help you quickly find information relevant to your research on clinically important viruses.

Figure 1. Top: The virus genome graphic result for a search with HIV-1 with access to analysis tools, downloads, and relevant results in the Genome and Virus resources. Bottom: The result obtained by clicking the env gene graphic, which provides links to protein and nucleotide sequences, the literature, analysis tools, and downloads.

Try it out using the following example searches and let us know what you think!