Using Conserved Domains to Find Protein Homologs

If you’re a protein researcher, you may want to find homologs of a protein of interest on the basis of its sequence. Homologs can provide insights into what the protein does and how it does it, and may include proteins with known three-dimensional structures that can serve as models for the protein of interest. The Conserved Domains Database (CDD) groups proteins that have strong sequence similarity to protein domain fingerprints and allows you to search these groups with any protein sequence. Such searches are often more sensitive than standard BLAST searches because the scoring matrices are tuned to locate important functional sites and sequence motifs that are highly conserved within the domain. You can then use the results to explore the evolutionary relationships of these proteins or to identify these important sequence and structural features.

Here is a method to find protein sequences from many organisms that contain a particular conserved domain:

1. If you have a Protein sequence record for your gene of interest, click on “Identify Conserved Domains” in the “Analyze this sequence” section on the right-hand side of the page.

2. The resulting Conserved Domains Summary page shows a brief view that summarizes the identity and location of regions matching the amino acid fingerprint (a Position-Specific Scoring Matrix, or PSSM) for particular protein domains and domain families.

Please note that the definitions of these domains come from several sources (NCBI curation efforts, SMART, Pfam, and TIGRFAM). You can look at all of the conserved domains that match this region by clicking on “View full result.” Clicking on any of the bars will take you to a record that describes that particular domain as reported by the submitting organization (NCBI, SMART, Pfam, TIGR).

In either the “Brief view” or the “Full result” view, the “Specific hit” shown at the top is the domain that contains the most curated information. These domains are often curated by the NCBI Conserved Domain curation staff and have accessions that begin with “cd.” If you mouse over this top-most bar, you’ll get a preview of the full Conserved Domain record.
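To make the idea of a position-specific scoring matrix from step 2 more concrete, here is a minimal Python sketch. It is not how CDD builds or scores its models; the four-sequence alignment, the uniform background frequencies, and the pseudocount are all made-up assumptions chosen to keep the example small. It only shows the general principle: each column of a domain alignment gets its own set of residue scores, and a query is scored by sliding that matrix along the sequence.

```python
import math

# Toy alignment of a 4-residue motif (hypothetical data, not taken from CDD).
ALIGNMENT = ["GKST", "GKSS", "GKTT", "AKST"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                 # assumption: one pseudocount spread over all residues


def build_pssm(alignment):
    """Return one {residue: log-odds score} dict per alignment column."""
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):
        scores = {}
        for residue in ALPHABET:
            count = column.count(residue) + PSEUDOCOUNT * BACKGROUND
            freq = count / (n_seqs + PSEUDOCOUNT)
            scores[residue] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def best_window(pssm, sequence):
    """Slide the PSSM along the sequence; return (best_score, start_index)."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col[res] for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

Calling `best_window(build_pssm(ALIGNMENT), "MAAGKSTLLV")` picks out the window that starts at the `GKST` stretch, because residues that are conserved in a column score positively there and residues never seen in that column score negatively.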

3. Click on the top bar to go to the Conserved Domain record page, which describes what is known about the function of your domain.

This page has a Links box with hyperlinks to relevant records in other databases. The “Specific Protein” link retrieves Protein database records that have a high degree of similarity to this conserved domain. The “Related Protein” link retrieves protein sequences that are less similar to the domain than the “Specific Protein” records and that may contain this domain or a functionally related domain.

4. Click on either the “Specific Protein” or “Related Protein” link to retrieve the related records in the Protein database.
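If you would rather script steps 3 and 4 than click through them, the NCBI E-utilities ELink service can follow links from one Entrez database to another. The sketch below only builds the request URL and does not contact NCBI. Treat the details as assumptions to verify: the default link name `cdd_protein` is my best guess at the link behind these buttons (EInfo for the `cdd` database lists the link names actually available), and the UID `12345` is a placeholder, since ELink expects the numeric CDD identifier rather than a “cd” accession.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_cdd_to_protein_link_url(cdd_uid, linkname="cdd_protein"):
    """Build an ELink URL that asks for Protein records linked to a CDD record.

    cdd_uid  -- numeric CDD UID (placeholder values are used in this sketch)
    linkname -- assumed link name; confirm it with EInfo before relying on it
    """
    params = {
        "dbfrom": "cdd",      # source database
        "db": "protein",      # target database
        "id": str(cdd_uid),
        "linkname": linkname,
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


# Placeholder UID, for illustration only.
example_url = build_cdd_to_protein_link_url(12345)
```

Keeping the URL construction in its own function makes it easy to inspect the request before sending it with whatever HTTP client you prefer.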

You can further filter these records to display only those from the Reference Sequence (RefSeq) project, which contains curated, non-redundant sequences representing the best-understood, most representative sequence for each biological molecule. To do this, click on the “RefSeq” link in the upper-right corner of the page.

In addition, if you are just interested in finding a putative functional homolog for your protein in a particular organism, you can filter your search result using the “Top Organisms” portlet on the right-hand side of the page.
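The RefSeq and organism filters described above also have query-syntax equivalents that you can type into the Protein search box or pass to E-utilities. Here is a rough sketch that assembles such a query and an ESearch URL without sending it. The helper names are hypothetical, and the text term is just an example: a text search is not the same record set that the “Specific Protein” link returns, so this approximates the filters rather than reproducing the link.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_protein_term(base_term, organism=None, refseq_only=True):
    """Combine a base query with optional RefSeq and organism restrictions."""
    parts = [f"({base_term})"]
    if refseq_only:
        parts.append("refseq[filter]")             # RefSeq records only
    if organism:
        parts.append(f'"{organism}"[Organism]')    # restrict to one organism
    return " AND ".join(parts)


def build_esearch_url(term, retmax=100):
    """Build an ESearch URL for the Protein database (no request is sent)."""
    params = {"db": "protein", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example: RefSeq zebrafish proteins matching an illustrative text term.
term = build_protein_term("protein kinase", organism="Danio rerio")
url = build_esearch_url(term, retmax=20)
```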

To download sequences for further analysis, click on the “Send to” link in the upper-right corner of the page, select the File radio button, and then choose the record format.

If you want to perform evolutionary analysis and/or create a phylogenetic tree of the retrieved sequences, we suggest that you download the FASTA-formatted Reference Sequence records. This file can be used as the data source for most phylogenetic tree analysis programs or for an alignment program that can display results as a phylogenetic tree (such as COBALT).
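Before handing a downloaded FASTA file to an alignment or tree program, it can help to check what is in it. The following is a small sketch, assuming a plain FASTA file whose header lines start with the accession. It keeps only records whose accessions carry a RefSeq protein prefix (such as NP_, XP_, YP_, or WP_) and drops exact duplicate sequences. The function names are illustrative, and the accessions in the example are made up.

```python
REFSEQ_PROTEIN_PREFIXES = ("NP_", "XP_", "YP_", "WP_", "AP_")


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def refseq_unique(records):
    """Keep RefSeq protein records, skipping sequences already seen."""
    seen = set()
    kept = []
    for header, seq in records:
        accession = header.split()[0]
        if not accession.startswith(REFSEQ_PROTEIN_PREFIXES):
            continue
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept


# Made-up records for illustration only.
sample = (
    ">NP_000001.1 example protein [Homo sapiens]\nMKT\nLLV\n"
    ">AAA00001.1 example protein\nMKTAAA\n"
    ">XP_000002.1 example protein\nMKTLLV\n"
)
kept = refseq_unique(parse_fasta(sample))
```

Removing identical sequences is a judgment call: it keeps trees readable, but you may prefer to keep duplicates when they come from different organisms you want represented.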

Bonus Tip: You may want to explore the NCBI-curated “cd” records in CDD a bit further. In addition to full descriptions of the function of the domain, they contain links to relevant literature in PubMed and the NCBI Bookshelf, data about the taxonomic distribution of the domain, links to molecular pathways (BioSystems) in which proteins with this domain are known to participate, and solved 3D structure models (Structure) for the domain that often include annotation identifying key functional and regulatory residues.

For More Information:
