Evidence for naming the protein now on non-redundant RefSeq records (WP_ accessions)

We are now showing the curated evidence used for assigning names and, where possible, gene symbols, publications, and Enzyme Commission numbers for nearly 70% (83 million) of microbial RefSeq proteins. This evidence includes a hierarchical collection of curated hidden Markov model (HMM)-based and BLAST-based protein families, and conserved domain architectures.

On a protein record such as WP_004152100.1, you can follow the link (NF033727.1) in the Evidence Accession field of the Evidence-For-Name-Assignment comment block (Figure 1) to find out more about the naming evidence, including the thresholds used for defining a match and access to all the prokaryotic proteins that match the evidence (Figure 2).

Figure 1: The Evidence-For-Name-Assignment block on WP_004152100.1. The name “arsenite efflux transporter metallochaperone ArsD” is based on its match to the evidence NF033727.1, a hidden Markov model that defines a family of arsenite efflux transporter metallochaperones. Proteins named for this evidence also inherit publications and a gene symbol (arsD) from NF033727.1.
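If you work with downloaded flat-file records rather than the web page, the Evidence Accession field can be pulled out of the comment block with a few lines of Python. This is a minimal sketch, not an NCBI tool: the `##...-START##` / `##...-END##` delimiters, the `Key :: Value` layout, and the "Evidence Category" field in the sample text are assumptions about how the block appears in a flat file, and `parse_evidence_block` is a hypothetical helper name.

```python
import re

# Illustrative sample only. The delimiter lines and the "Evidence Category"
# field are assumptions about the flat-file layout; only the block name,
# the "Evidence Accession" field, and NF033727.1 come from the post.
SAMPLE_COMMENT = """\
            ##Evidence-For-Name-Assignment-START##
            Evidence Category  :: HMM
            Evidence Accession :: NF033727.1
            ##Evidence-For-Name-Assignment-END##
"""

BLOCK_PATTERN = re.compile(
    r"##Evidence-For-Name-Assignment-START##(.*?)"
    r"##Evidence-For-Name-Assignment-END##",
    re.DOTALL,
)


def parse_evidence_block(text):
    """Return the block's 'Key :: Value' pairs as a dict ({} if no block)."""
    match = BLOCK_PATTERN.search(text)
    if match is None:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        key, separator, value = line.partition("::")
        if separator:  # skip blank lines and lines without '::'
            fields[key.strip()] = value.strip()
    return fields


evidence = parse_evidence_block(SAMPLE_COMMENT)
print(evidence.get("Evidence Accession"))
```

Records without the block return an empty dictionary, which is a simple way to separate proteins named from curated evidence from those that are not.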

Figure 2: Naming evidence NF033727.1, a hidden Markov model. The top part of the page contains a short text description of the protein family defined by the evidence, the thresholds a protein must meet to be included in that family, and the publications associated with the protein family. The lower part of the page lists the RefSeq proteins in the family, either named by the present evidence (left tab) or named using evidence with higher precedence (right tab). You can filter and download the list too!

Sixty-nine percent of available prokaryotic RefSeq proteins now have the Evidence-For-Name-Assignment comment block. The remaining 31% are not yet covered by the evidence system and are named based on BLAST hits to a non-curated collection of protein cluster representatives.

What does this mean for you?

  • You can better differentiate proteins whose functional annotation is based on curated evidence from those named by BLAST hits to a non-curated database. The query “Evidence-For-Name-Assignment[Properties]” in the Protein resource returns all proteins with names based on curated evidence.
  • You can find and download all archaeal and bacterial proteins that are matched to the same evidence.
  • You can get your publication cited on protein records by providing NCBI better names for a protein.
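
The query from the first bullet can also be run programmatically. The sketch below only builds a request URL for the NCBI E-utilities ESearch service; it assumes the `[Properties]` query behaves the same through E-utilities as it does in the Protein web interface, and `build_esearch_url` is a hypothetical helper name. No request is sent.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Build an ESearch URL for the Protein database (hypothetical helper)."""
    params = {
        "db": "protein",    # search the Protein resource
        "term": term,       # the Entrez query string
        "retmax": retmax,   # maximum number of IDs to return
        "retmode": "json",  # ask for a JSON response
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = build_esearch_url("Evidence-For-Name-Assignment[Properties]")
print(url)
```

Fetching the URL, for example with `urllib.request.urlopen`, should return the matching record IDs. The sketch leaves that step out so it runs without network access.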

We welcome your input! Please send your suggestions and feedback to the NCBI Help Desk.

Conserved Domain Database (CDD) 3.17 is now available

The latest version of the Conserved Domain Database contains 3,272 new or updated NCBI-curated domains and now mirrors Pfam version 31 as well as models from NCBIfams, a collection of protein family hidden Markov models (HMMs) for improving bacterial genome annotation. A fine-grained classification of the major facilitator superfamily has also been added. You can find this updated content on the CDD FTP site.


Using Conserved Domains to Find Protein Homologs

If you’re a protein researcher, you may want to find homologs of a protein of interest on the basis of its sequence. This can provide insights into what the protein does and how it does it, and may identify proteins with known three-dimensional structures that can serve as models for the protein of interest. The Conserved Domain Database (CDD) groups proteins that have strong sequence similarity to protein domain fingerprints and allows you to search these groups with any protein sequence. Such searches are often more sensitive than standard BLAST searches because the scoring matrices used are tuned to locate important functional sites and sequence motifs that are highly conserved within the domain. You can then use the results to explore the evolutionary relationships of these proteins or to identify these important sequence and structural features.

Here is a method to find protein sequences from many organisms that contain a particular conserved domain:

Continue reading