The Datasets command-line tool now provides ortholog data

You can now get gene ortholog data with the NCBI Datasets command-line tool by providing a gene ID, gene symbol, or RefSeq nucleotide or protein accession. Data are available for vertebrates and insects. The vertebrate orthologs include a specialized set for fish. (See our recent post for more information on the orthologs for fish and insects.)

You can retrieve metadata for gene orthologs in JSON format, or you can download a compressed (zip) archive containing both metadata and sequences (Figure 1).

Figure 1. Command lines that use a gene symbol (BRCA1) to retrieve mammalian ortholog metadata (top; JSON metadata shown in part in the image) and sequences (bottom).

For example, if you want the mammalian orthologs of the human BRCA1 gene, you can use the following summary command to get metadata for these genes:

datasets summary ortholog symbol BRCA1 --taxon human --taxon-filter mammals > brca1-mammals.json
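
Because the summary command writes its output through a shell redirect, it can be worth confirming that the file is non-empty and parses as JSON before you pass it to other tools. The following is a minimal sketch, not part of the datasets tool: the helper name check_json is hypothetical, and it assumes that python3 is on your PATH and that the summary output is a single JSON document.

```shell
# Hypothetical helper (not part of the datasets tool): succeeds only if the
# file exists, is non-empty, and parses as a single JSON document.
# Assumes python3 is available on PATH.
check_json() {
    [ -s "$1" ] && python3 -m json.tool "$1" > /dev/null 2>&1
}

# Usage, after running the summary command above:
# check_json brca1-mammals.json && echo "metadata parses as JSON"
```

An empty or truncated file, for example from an interrupted request, makes the helper return a non-zero status, so it can guard later steps in a script.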

The gene metadata includes gene names and synonyms, genomic coordinates, RefSeq transcript and protein data, as well as Ensembl and UniProt accessions and other gene information.

If you want the sequences, use the datasets download command to download a zip archive that includes gene, transcript, and protein sequences as well as metadata in tabular and JSON lines formats:

datasets download ortholog symbol BRCA1 --taxon human --taxon-filter mammals --filename
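
Once the download finishes, you may want to see what the archive contains before extracting it. The sketch below is an illustration only: the helper name list_archive is hypothetical, it assumes python3 is on your PATH, and you need to pass the path of the zip file you chose for the download.

```shell
# Hypothetical helper (not part of the datasets tool): list the files in a
# zip archive without extracting them. Assumes python3 is available on PATH.
list_archive() {
    python3 -m zipfile -l "$1"
}

# Usage, where the argument is the path of the archive you downloaded:
# list_archive path/to/your-archive.zip
```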

See our help documentation for more information on using the datasets command-line tool to access ortholog data.

