Programmatic access to Gene data using the Datasets command-line tool and API

In March, we announced NCBI Datasets, a new resource that lets you easily retrieve and download data from across NCBI databases. Did you know you can now fetch NCBI Gene data programmatically using the NCBI Datasets API or command-line tool? You can retrieve both metadata and sequence data, including transcripts and proteins, for multiple Gene records in a single shell command or API request. The API documentation is a good way to get started with programmatic access (Figure 1).

Figure 1. The Datasets API documentation showing a demonstration retrieving Gene metadata using RefSeq mRNA accessions. The API returns a readily processed JSON object.

  • If you already know the gene symbols for the genes you want, you can use those!
  • If you have the gene IDs from the NCBI website, you can use those!
  • If you want Gene metadata related to RefSeq nucleotide or protein records, you can easily get this using RefSeq accessions (Figure 1)!
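As a rough illustration of these three lookup styles, the sketch below builds one request URL per identifier type. The base URL, the API version in it, the path layout, and the comma-separated identifier lists are all assumptions made for this example, and gene_url is a hypothetical helper, not part of any NCBI library; check the API documentation for the current endpoints before relying on them. Symbol lookups take a taxon as well.

```python
from urllib.parse import quote

# Assumption: base URL, version, and path layout are illustrative only.
# Confirm them against the current Datasets API documentation.
BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def gene_url(kind, values, taxon=None):
    """Build a Gene metadata request URL (hypothetical helper).

    kind   -- "id", "symbol", or "accession"
    values -- identifiers of that kind, sent together in one request
    taxon  -- required for symbol lookups, ignored otherwise
    """
    if kind not in ("id", "symbol", "accession"):
        raise ValueError(f"unknown identifier kind: {kind!r}")
    if kind == "symbol" and not taxon:
        raise ValueError("symbol lookups need a taxon")

    # Percent-encode each identifier, then join them with commas.
    joined = ",".join(quote(str(v), safe="") for v in values)
    url = f"{BASE_URL}/gene/{kind}/{joined}"
    if kind == "symbol":
        url += f"/taxon/{quote(str(taxon), safe='')}"
    return url


print(gene_url("symbol", ["BRCA1", "TP53"], taxon="human"))
print(gene_url("id", [672, 7157]))
print(gene_url("accession", ["NM_007294.4"]))
```

Each URL could then be fetched with any HTTP client; the response is the JSON object described in Figure 1.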

Searching by gene symbol also requires your taxon of interest, since gene symbols are not unique across taxa. If you have multiple taxa, you can fetch the datasets in batches, one taxon at a time, using an ordinary loop in a script.
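One way to organize such a multi-taxon job is to group the symbols by taxon first and then split each group into batches. The sketch below does only that bookkeeping; batches_by_taxon is a hypothetical helper, and the default batch size is an arbitrary assumption rather than a documented limit.

```python
from collections import defaultdict


def batches_by_taxon(pairs, batch_size=100):
    """Group (taxon, symbol) pairs by taxon and yield (taxon, symbols) batches.

    batch_size is an arbitrary choice for illustration, not an API limit.
    """
    grouped = defaultdict(list)
    for taxon, symbol in pairs:
        # Skip duplicates so each symbol is requested once per taxon.
        if symbol not in grouped[taxon]:
            grouped[taxon].append(symbol)

    for taxon, symbols in grouped.items():
        for start in range(0, len(symbols), batch_size):
            yield taxon, symbols[start:start + batch_size]


wanted = [
    ("human", "BRCA1"),
    ("mouse", "Brca1"),
    ("human", "TP53"),
    ("human", "BRCA1"),  # duplicate, dropped
]

for taxon, symbols in batches_by_taxon(wanted, batch_size=2):
    print(taxon, ",".join(symbols))
# prints:
# human BRCA1,TP53
# mouse Brca1
```

Each (taxon, symbols) batch then maps onto a single command-line call or API request for that taxon.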

For more information and some useful examples, visit the Gene command line documentation or the Gene API documentation. We also have Jupyter Notebooks that rely on the Datasets Python library and show what Gene data you can retrieve. You can find these and other useful code libraries in the Datasets Python library repository on GitHub.
