Orthologs Are A-Swimming and A-Buzzing in RefSeq!

Previously, we wrote about improvements to Drosophila annotations in RefSeq. We’re excited to report that we’re also improving how we compute and report orthology data for fish and insects, to help you find evolutionarily related genes across species. Currently, when we annotate a vertebrate genome with our in-house eukaryotic genome annotation pipeline, a robust process identifies 1:1 orthologs to human genes using a combination of BLAST comparisons and local synteny. These results are available in NCBI Gene, on our new Ortholog pages, and on Gene’s FTP site. We also use the data to apply human gene and protein names to orthologs in other species, providing rich annotation for hundreds of vertebrates.


For fish, we’re now using a two-layer process. First, we identify 1:1 orthologs to zebrafish for most of the fish, which typically identifies 50% more orthologs. Second, if we’ve identified a human ortholog for the zebrafish gene, we also report the human gene. We’ve also switched to primarily applying zebrafish gene symbols and names, most of them provided by the Zebrafish Information Network (ZFIN), rather than human ones to orthologs in other fish. The end result is more ortholog connections and better nomenclature. For example, many fish have two related homeobox genes, meis2a and meis2b, compared to the single MEIS2 gene in human. Our updated process has allowed us to identify and properly name meis2a and meis2b in 73 and 40 fish species, respectively.
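
The two-layer lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the annotation pipeline itself: the function name, the record type, and every gene identifier below are hypothetical placeholders, and the real process relies on BLAST comparisons and local synteny rather than ready-made lookup tables.

```python
from typing import Dict, List, NamedTuple, Optional


class OrthologReport(NamedTuple):
    """One reported row for a gene in some other fish (hypothetical structure)."""
    fish_gene: str
    zebrafish_ortholog: str
    human_ortholog: Optional[str]
    symbol: Optional[str]


def two_layer_orthologs(
    fish_to_zebrafish: Dict[str, str],
    zebrafish_to_human: Dict[str, str],
    zebrafish_symbols: Dict[str, str],
) -> List[OrthologReport]:
    """Layer 1: fish gene -> zebrafish ortholog. Layer 2: zebrafish -> human, if known."""
    reports = []
    for fish_gene, zf_gene in sorted(fish_to_zebrafish.items()):
        # The human ortholog is reported only when one exists for the zebrafish gene.
        human = zebrafish_to_human.get(zf_gene)
        # Nomenclature is taken from the zebrafish gene rather than from human.
        symbol = zebrafish_symbols.get(zf_gene)
        reports.append(OrthologReport(fish_gene, zf_gene, human, symbol))
    return reports


# Toy inputs; all identifiers are made up for illustration.
fish_to_zebrafish = {"fish_gene_1": "zf_gene_a", "fish_gene_2": "zf_gene_b"}
zebrafish_to_human = {"zf_gene_a": "HUMAN_GENE_A"}
zebrafish_symbols = {"zf_gene_a": "genea", "zf_gene_b": "geneb"}

reports = two_layer_orthologs(fish_to_zebrafish, zebrafish_to_human, zebrafish_symbols)
for report in reports:
    print(report)
```

In this toy run, the second fish gene still gets a zebrafish ortholog and a zebrafish-derived symbol even though no human ortholog is known, which is the kind of connection a human-only comparison would miss.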


We’ve also released an orthology dataset on the Gene FTP site for over 100 insects compared to Drosophila melanogaster. The average insect has orthologs identified for half of its coding genes, and that increases to over 80% for most Drosophila species. We’re still working on exposing these data on the web, but for starters you can use the new nomenclature report (gene_info) and gene ontology data from D. melanogaster, or get proteins from a set of orthologs using gene2refseq.
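
As a rough sketch of the last point, the snippet below collects protein accessions for a D. melanogaster gene and its orthologs by joining an orthology table against gene2refseq-style rows. Several things here are assumptions rather than facts from this post: the orthology column names (`#tax_id`, `GeneID`, `relationship`, `Other_tax_id`, `Other_GeneID`), the trimmed three-column gene2refseq layout, the use of `-` for a missing accession, and all GeneIDs, accessions, and non-melanogaster taxonomy IDs, which are invented placeholders. Please check the README on the Gene FTP site for the actual file layouts before adapting it.

```python
import csv
import io
from collections import defaultdict

# Toy tab-delimited tables held in memory; real files would be downloaded
# from the Gene FTP site. 7227 is the taxonomy ID of D. melanogaster; every
# other ID and accession below is a made-up placeholder.
ORTHOLOGS_TSV = "\n".join("\t".join(row) for row in [
    ("#tax_id", "GeneID", "relationship", "Other_tax_id", "Other_GeneID"),
    ("7227", "1001", "Ortholog", "9999991", "2001"),
    ("7227", "1001", "Ortholog", "9999992", "3001"),
    ("7227", "1002", "Ortholog", "9999991", "2002"),
])

GENE2REFSEQ_TSV = "\n".join("\t".join(row) for row in [
    ("#tax_id", "GeneID", "protein_accession.version"),
    ("7227", "1001", "PROT_A.1"),
    ("7227", "1001", "PROT_A.2"),
    ("9999991", "2001", "PROT_B.1"),
    ("9999992", "3001", "-"),  # assumed convention: "-" means no protein
    ("9999991", "2002", "PROT_C.1"),
])


def read_tsv(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def ortholog_gene_ids(ortholog_rows, ref_gene_id):
    """Return the reference GeneID plus the GeneIDs of all its orthologs."""
    gene_ids = {ref_gene_id}
    for row in ortholog_rows:
        if row["GeneID"] == ref_gene_id and row["relationship"] == "Ortholog":
            gene_ids.add(row["Other_GeneID"])
    return gene_ids


def proteins_by_gene(refseq_rows, gene_ids):
    """Map each GeneID in gene_ids to its sorted protein accessions."""
    proteins = defaultdict(set)
    for row in refseq_rows:
        accession = row["protein_accession.version"]
        if row["GeneID"] in gene_ids and accession != "-":
            proteins[row["GeneID"]].add(accession)
    return {gene: sorted(accs) for gene, accs in proteins.items()}


gene_ids = ortholog_gene_ids(read_tsv(ORTHOLOGS_TSV), "1001")
result = proteins_by_gene(read_tsv(GENE2REFSEQ_TSV), gene_ids)
print(result)
```

The full gene2refseq file is large, so for real use it would make more sense to stream it line by line and keep only rows whose GeneID is in the ortholog set, rather than loading everything into memory as this toy version does.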

You can learn more about these datasets at TAGC 2020 Online, which includes poster presentations on the fish (#739A) and insect (#568A) datasets (no meeting registration required). Please let us know if you have comments or questions about these data and tools. We’d love to hear from you!

