Effective June 2023, the HomoloGene records will redirect to the Datasets Gene Table
Do you use HomoloGene to view and download data? You can now access updated homology data from NCBI Datasets through the Datasets Gene Table with connections to NCBI Orthologs. Go directly from a HomoloGene record to the Datasets Gene Table that will give you access to up-to-date sequence data and metadata. NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
The Datasets Gene Table links to the NCBI Ortholog interface (Figure 1), which provides the following data:
- Orthology data based on an updated algorithm that identifies orthologs spanning > 500 vertebrate species
- Similar gene data based on protein architectures that spans all eukaryotes
Continue reading “New Way to View and Download Related Genes”
As we previously announced, we are offering a ClusteredNR protein database on the web BLAST service that provides faster searches, greater taxonomic reach, and easier-to-interpret results than the traditional nr database. We’ve added some new features to the results that make ClusteredNR even more useful by allowing analyses within each cluster, including the ability to:
- Align the query to the members of the cluster.
- Display the cluster alignment in Tree View and MSA View.
- Submit the cluster to COBALT to generate a true multiple sequence alignment of the members.
- Display a BLAST Taxonomy Report to see the taxonomic distribution of the sources of the members.
Figure 1 shows you how to access these in-cluster analysis options. The new Cluster Taxonomy report is shown in Figure 2. Try ClusteredNR yourself: follow this link to set up a search!
Continue reading “Try out the latest BLAST ClusteredNR database results. Now with in-cluster analyses!”
Reduced redundancy. Faster searches. More diverse proteins and organisms in your BLAST results. Check out our new ClusteredNR database – derived from the default BLAST protein nr database by clustering sequences at 90% identity / 90% length (details below). Get quicker results and access to information about the distribution of your hits across a wider range of organisms and evolutionary distances.
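To make the 90% identity / 90% length idea concrete, here is a minimal Python sketch of one plausible reading of such a threshold. It is not NCBI's clustering code: the `PairwiseAlignment` class, the `meets_threshold` function, and the exact definitions of identity and length coverage used here are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class PairwiseAlignment:
    """Summary of an alignment between two proteins (hypothetical structure)."""
    identical: int      # aligned residue pairs that are identical
    aligned_pairs: int  # total aligned residue pairs (gaps excluded)
    len_a: int          # full length of the first sequence
    len_b: int          # full length of the second sequence


def meets_threshold(aln: PairwiseAlignment,
                    min_identity: float = 0.90,
                    min_coverage: float = 0.90) -> bool:
    """Return True if the pair passes both cutoffs.

    Assumed definitions (not taken from the post):
      identity = identical pairs / aligned pairs
      coverage = aligned pairs / length of the longer sequence
    """
    identity = aln.identical / aln.aligned_pairs
    coverage = aln.aligned_pairs / max(aln.len_a, aln.len_b)
    return identity >= min_identity and coverage >= min_coverage


if __name__ == "__main__":
    # 95% identical over an alignment covering about 95% of the longer sequence
    print(meets_threshold(PairwiseAlignment(95, 100, 105, 100)))
    # Same identity, but the alignment covers only half of the longer sequence
    print(meets_threshold(PairwiseAlignment(95, 100, 200, 100)))
```

Under these assumed definitions, a short fragment that matches part of a long protein perfectly would still fail the length cutoff, which is the intuition behind requiring both conditions.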
You can choose the ClusteredNR database in the Choose Search Set section of the BLAST submission form where you normally pick the BLAST database. Simply select the Experimental databases radio button. You can also select the checkbox to search both ClusteredNR and the standard nr at the same time allowing you to compare results (Figure 1).
Figure 1. The ‘Choose Search Set’ section of the BLAST submission form. Selecting the Experimental databases radio button chooses ClusteredNR. You can also perform simultaneous searches against the clustered and the standard nr by checking ‘Select to compare standard and experimental database.’
Continue reading “New ClusteredNR database: faster searches and more informative BLAST results”
Note: Please see our more recent post about the new Datasets command-line clients and the documentation on how to get orthologs using the newer client. The command-lines below do not work in the current datasets client (NCBI Datasets CLIv14).
You can now use the NCBI Datasets command-line tool to get gene ortholog data by gene ID, gene symbol, or RefSeq nucleotide or protein accession. Data are available for vertebrates and insects. The vertebrate orthologs include a specialized set for fish. (See our recent post for more information on the orthologs for fish and insects.)
You can retrieve metadata for gene orthologs in JSON format, or you can download a compressed (zip) archive containing both metadata and sequences (Figure 1).
Figure 1. Command-lines that use a gene symbol (BRCA1) to retrieve mammalian ortholog metadata (top, JSON metadata shown in part in the image) and sequences (bottom).
Continue reading “The Datasets command-line tool now provides ortholog data”
Previously we wrote about improvements to Drosophila annotations in RefSeq. We’re excited to report that we’re also improving how we compute and report orthology data for fish and insects to help you find evolutionarily related genes across species. Currently when we annotate a vertebrate genome using our in-house eukaryotic genome annotation pipeline, we have a robust process that identifies 1:1 orthologs vs human using a combination of BLAST comparisons and local synteny. These results are available in NCBI Gene and our new Ortholog pages, and also on Gene’s FTP site. We also use the data to apply human gene and protein names to orthologs in other species, providing a very rich annotation for hundreds of vertebrates.
For fish, we’re now using a two-layer process. First, most of the fish now have 1:1 orthologs identified vs zebrafish, which typically results in identifying 50% more orthologs. Second, if we’ve identified a human ortholog for the zebrafish gene, then we also report the human gene. We’ve also switched primarily to applying gene symbols and names from zebrafish instead of human, mostly provided by the Zebrafish Information Network (ZFIN), to other fish orthologs. The end result is more ortholog connections and better nomenclature. For example, many fish have two related homeobox genes meis2a and meis2b, compared to the single MEIS2 gene in human. Our updated process has allowed us to identify and properly name meis2a and meis2b in 73 and 40 fish species, respectively.
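The two-layer lookup described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the annotation pipeline itself: the function `report_orthologs`, the two lookup tables, and every gene identifier in them are hypothetical placeholders.

```python
from typing import Dict, Optional

# Hypothetical toy tables; real ortholog assignments come from the
# annotation pipeline, not from dictionaries like these.
FISH_TO_ZEBRAFISH: Dict[str, str] = {
    "fish_gene_1": "zf_gene_a",
    "fish_gene_2": "zf_gene_b",
}
ZEBRAFISH_TO_HUMAN: Dict[str, str] = {
    "zf_gene_a": "HUMAN_GENE_A",  # zf_gene_b has no human ortholog here
}


def report_orthologs(fish_gene: str,
                     fish_to_zebrafish: Dict[str, str],
                     zebrafish_to_human: Dict[str, str]
                     ) -> Dict[str, Optional[str]]:
    """Layer 1: look up the zebrafish ortholog of a fish gene.
    Layer 2: if that zebrafish gene has a human ortholog, report it too."""
    zebrafish_gene = fish_to_zebrafish.get(fish_gene)
    human_gene = None
    if zebrafish_gene is not None:
        human_gene = zebrafish_to_human.get(zebrafish_gene)
    return {"zebrafish": zebrafish_gene, "human": human_gene}


if __name__ == "__main__":
    for gene in ("fish_gene_1", "fish_gene_2", "fish_gene_3"):
        print(gene, report_orthologs(gene, FISH_TO_ZEBRAFISH, ZEBRAFISH_TO_HUMAN))
```

In this toy setup, `fish_gene_2` still gets a zebrafish ortholog even though no human ortholog is known, which mirrors why going through zebrafish first can yield more ortholog connections than comparing against human alone.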
Continue reading “Orthologs Are A-Swimming and A-Buzzing in RefSeq!”
We recently showed you a new way to search for and view sets of orthologous genes from vertebrates. You can now get an additional set of search results that we are calling similar genes. These are related through protein architecture to the orthologous gene set and include genes from all metazoans and selected plant, fungal, and protist species. You can quickly find related genes within a species, compare them to those from other annotated metazoan genomes, and have access to other useful gene resources. To find a set of similar genes, enter a gene symbol or select the gene symbol + orthologs option from the selections menu.
For example, if you search for ‘AGO2 orthologs’, in addition to the link to orthologs from vertebrates, you’ll get a link to a set of similar genes (Genes with similar protein architectures) across a broad evolutionary spectrum that includes genes from invertebrates, fungi, and green plants (Figure 1).
Figure 1. Genes with similar protein architectures to AGO2. The original search was AGO2 orthologs, which brings up the suggestion box with the links to similar genes as well as the AGO2 vertebrate orthologs. The similar genes include entries from a broad taxonomic range of eukaryotic organisms.
If you search for ‘GH1’, you’ll get a link to similar genes that includes members of the growth hormone family that are not part of NCBI’s vertebrate ortholog set.
Figure 2. The human subset of genes with similar protein architectures to GH1 showing other members (paralogs) of the GH1 gene family (GH2, CSH1, CSH2, CSHL1). These are not included in the ortholog set.
Try out the following searches and follow the links to the Genes with similar protein architectures
Please let us know what you think!
NCBI is testing a new way to find and retrieve orthologous vertebrate genes. To find orthologs, enter a gene symbol (e.g. RAG1) or a gene symbol combined with a taxonomic group (e.g. primate RAG1). Select the matching entry from the suggestions menu, or select the orthologs option (e.g. Rag1 orthologs) to see all orthologs. Your search will return a results link to the set of orthologs provided by NCBI’s Gene resource. Click on the results link to see information for that ortholog group (Figure 1).
Figure 1. Search for Rag1 orthologs showing the link to the set of RAG1 genes from vertebrates.
Continue reading “Searching for orthologous genes at NCBI”
If you’re a protein researcher, one thing you may want to do is to find homologs for a protein of interest on the basis of its sequence. This can provide insights into what the protein does and how it does it, and may identify proteins with known three-dimensional structures that can serve as models for the protein of interest. The Conserved Domains Database (CDD) groups proteins that have strong sequence similarity to protein domain fingerprints and allows you to search these groups with any protein sequence. Such searches are often more sensitive than standard BLAST searches since the scoring matrices used are tuned to locate important functional sites and sequence motifs that are highly conserved within the domain. You can then use the results to explore the evolutionary relationships of these proteins or identify these important sequence and structural features.
Here is a method to find protein sequences from many organisms that contain a particular conserved domain:
Continue reading “Using Conserved Domains to Find Protein Homologs”