Reduced redundancy. Faster searches. More diverse proteins and organisms in your BLAST results. Check out our new ClusteredNR database – derived from the default BLAST protein nr database by clustering sequences at 90% identity / 90% length (details below). Get quicker results and access to information about the distribution of your hits across a wider range of organisms and evolutionary distances.
You can choose the ClusteredNR database in the Choose Search Set section of the BLAST submission form where you normally pick the BLAST database. Simply select the Experimental databases radio button. You can also select the checkbox to search both ClusteredNR and the standard nr at the same time allowing you to compare results (Figure 1).
You can now get gene ortholog data using the NCBI Datasetscommand-line tool using a gene ID, gene symbol, or RefSeq nucleotide or protein accession. Data are available for vertebrates and insects. The vertebrate orthologs includes a specialized set for fish. (See our recent post for more information on the orthologs for fish and insects.)
You can retrieve metadata for gene orthologs in JSON Format, or you can download a compressed (zip) archive containing both metadata and sequences (Figure 1).
Figure 1. Command-lines that use a gene symbol (BRCA1) to retrieve mammalian ortholog metadata (top, JSON metadata shown in part in the image) and sequences (bottom).
Previously we wrote about improvements to Drosophila annotations in RefSeq. We’re excited to report that we’re also improving how we compute and report orthology data for fish and insects to help you find evolutionarily related genes across species. Currently when we annotate a vertebrate genome using our in-house eukaryotic genome annotation pipeline, we have a robust process that identifies 1:1 orthologs vs human using a combination of BLAST comparisons and local synteny. These results are available in NCBI Gene and our new Ortholog pages, and also on Gene’s FTP site. We also use the data to apply human gene and protein names to orthologs in other species, providing a very rich annotation for hundreds of vertebrates.
For fish, we’re now using a two-layer process. First, most of the fish now have 1:1 orthologs identified vs zebrafish, which typically results in identifying 50% more orthologs. Second, if we’ve identified a human ortholog for the zebrafish gene, then we also report the human gene. We’ve also switched primarily to applying gene symbols and names from zebrafish instead of human, mostly provided by the Zebrafish Information Network (ZFIN), to other fish orthologs. The end result is more ortholog connections and better nomenclature. For example, many fish have two related homeobox genes meis2a and meis2b, compared to the single MEIS2 gene in human. Our updated process has allowed us to identify and properly name meis2a and meis2b in 73 and 40 fish species, respectively.
We recently showed you a new a way to search for and view sets of orthologous genes from vertebrates. You can now get an additional set of search results that we are calling similar genes. These are related through protein architecture to the orthologous gene set and include genes from all metazoans and selected plant, fungal, and protist species. You can quickly find related genes within a species, compare them to those from other annotated metazoan genomes, and have access to other useful gene resources. To find a set of similar genes, enter a gene symbol or select the gene symbol + orthologs option from the selections menu.
For example if you search for ‘AGO2 orthologs‘, in addition to the link to orthologs from vertebrates, you’ll get a link to a set of similar genes (Genes with similar protein architectures) across a broad evolutionary spectrum that includes genes from invertebrates, fungi, and green plants (Figure 1).
Figure 1. Genes with similar protein architectures to AGO2. The original search was AGO2 orthologs, which brings up the suggestion box with the links to similar genes as well as the AGO2 vertebrate orthologs. The similar genes include entries from a broad taxonomic range of eukaryotic organisms.
If you search for ‘GH1‘, you’ll get a link to similar genes that includes members of the growth hormone family that are not part of NCBI’s vertebrate ortholog set.
Figure 2. The human subset of genes with similar protein architectures to GH1 showing other members (paralogs) of the GH1 gene family (GH2, CSH1, CSH2, CSHL1). These are not included in the ortholog set.
Try out the following searches and follow the links to the Genes with similar protein architectures
NCBI is testing a new way to find and retrieve orthologous vertebrate genes. To find orthologs enter a gene symbol (e.g. RAG1) or a gene symbol combined with a taxonomic group (e.g. primate RAG1). Select the matching entry from the suggestions menu or you can select the orthologs option (e.g. Rag1 orthologs) to see all orthologs. Your search will return a results link to the set of orthologs provided by NCBI’s Gene resource. Click on the results link to see information for that ortholog group (Figure 1).
Figure 1. Search for Rag1 orthologs showing the link to the set of RAG1 genes from vertebrates.
If you’re a protein researcher, one thing you may want to do is to find homologs for a protein of interest on the basis of its sequence. This can provide insights into what the protein does and how it does it, and may identify proteins with known three-dimensional structures that can serve as models for the protein of interest. The Conserved Domains Database (CDD) groups proteins that have strong sequence similarity to protein domain fingerprints and allows you to search these groups with any protein sequence. Such searches are often more sensitive than standard BLAST searches since the scoring matrices used are tuned to locate important functional sites and sequence motifs that are highly conserved within the domain. You can then use the results to explore the evolutionary relationships of these proteins or identify these important sequence and structural features.
Here is a method to find protein sequences from many organisms that contain a particular conserved domain: