The Datasets command-line tool now provides ortholog data

Note: Please see our more recent post about the new Datasets command-line clients and the documentation on how to get orthologs using the newer client. The command lines below do not work in the current datasets client (NCBI Datasets CLI v14).

You can now get gene ortholog data with the NCBI Datasets command-line tool by providing a gene ID, gene symbol, or RefSeq nucleotide or protein accession. Data are available for vertebrates and insects. The vertebrate orthologs include a specialized set for fish. (See our recent post for more information on the orthologs for fish and insects.)

You can retrieve metadata for gene orthologs in JSON format, or you can download a compressed (zip) archive containing both metadata and sequences (Figure 1).

Figure 1. Command lines that use a gene symbol (BRCA1) to retrieve mammalian ortholog metadata (top, JSON metadata shown in part in the image) and sequences (bottom).

For example, if you want the mammalian orthologs of the human BRCA1 gene, you can use the following summary command to get metadata for these genes:

datasets summary ortholog symbol BRCA1 --taxon human --taxon-filter mammals > brca1-mammals.json

The gene metadata includes gene names and synonyms, genomic coordinates, RefSeq transcript and protein data, as well as Ensembl and UniProt accessions and other gene information.
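Once the summary has been saved, a quick way to see what it contains is to load it and list its top-level structure. The following is a minimal Python sketch that assumes only that brca1-mammals.json (the file written by the summary command) is valid JSON. It makes no assumption about the field names in the metadata, and the helper name `describe_json` is ours for illustration, not part of the Datasets tool.

```python
import json
from pathlib import Path


def describe_json(value):
    """Return a one-level description of a parsed JSON value.

    No Datasets-specific field names are assumed: for an object we map each
    key to the type name of its value, and for an array we report its length.
    """
    if isinstance(value, dict):
        return {key: type(item).__name__ for key, item in value.items()}
    if isinstance(value, list):
        return {"<array length>": len(value)}
    return {"<scalar>": type(value).__name__}


if __name__ == "__main__":
    # File name taken from the summary command in this post.
    path = Path("brca1-mammals.json")
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            summary = json.load(handle)
        for key, kind in describe_json(summary).items():
            print(f"{key}: {kind}")
    else:
        print(f"{path} not found; run the summary command first")
```

Starting from this overview, you can drill into whichever keys the file actually contains rather than relying on field names guessed in advance.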

If you want the sequences, use the datasets download command to download a zip archive that includes gene, transcript, and protein sequences as well as metadata in tabular and JSON lines formats:

datasets download ortholog symbol BRCA1 --taxon human --taxon-filter mammals --filename
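After the download finishes, you may want to check what the archive holds before unpacking it. The sketch below uses only the Python standard library to list the files in a zip archive and to parse one JSON Lines file from it. The archive path and member name are parameters you supply; this post does not spell out the archive's internal layout, so the code does not assume one, and the function names are ours for illustration.

```python
import json
import zipfile


def list_members(archive_path):
    """Return the names of all files stored in a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()


def read_json_lines(archive_path, member):
    """Parse one JSON Lines member of the archive into a list of records.

    `member` is the name of a file inside the archive, as reported by
    list_members(); each non-blank line is expected to be one JSON value.
    """
    records = []
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member) as handle:
            for raw_line in handle:
                line = raw_line.decode("utf-8").strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

A typical session would call `list_members()` on the downloaded archive first, pick out the JSON Lines metadata file from the printed names, and then pass that name to `read_json_lines()`.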

See our help documentation for more information on using the datasets command-line tool to access ortholog data.

