Identical Protein Groups: Non-redundant access to protein records

Have you ever searched the NCBI Protein database and been overwhelmed by the number of sequences returned? Have you tried searching with a protein name, expecting that to greatly limit the results, only to be presented with many sequences, all with the same name? It’s a common problem now that sequence databases are expanding rapidly, driven by large-scale genomic sequencing of similar organisms. Redundancy in the sequence databases is high and only getting worse.

To address this, in 2013 NCBI released the WP records, which collect identical protein sequences annotated on bacterial genomes. In 2014, NCBI released the Identical Protein Reports on Protein records, which display information about all other proteins identical to that protein. Now, we are releasing a new resource: Identical Protein Groups (IPG). IPG offers several features:

  • Records representing each unique protein sequence in the NCBI databases
  • Coding regions from GenBank, RefSeq, SwissProt, and PDB
  • Record titles derived from the highest quality record in the group
  • Nucleotide coordinate mapping for each coding region (for GenBank and RefSeq records)
  • Search filtering options including the source database, taxonomy, and the size of the group

Let’s say that you’re interested in glutamate dehydrogenase from E. coli. Searching the Protein database currently returns over 2,600 records. The same search in IPG returns only 267 records (Figure 1).

Figure 1. IPG results for glutamate dehydrogenase from E. coli. The upper panel shows the four records returned after limiting by group size (“Protein count bins” set to >1,000). The lower panel shows the information presented when you click on the third record: the sequences identical to WP_000373044.1, along with links to their coding sequences.

You can narrow the IPG results further using the filters on the left. For example, you might limit to groups having more than 1,000 sequences, as these tend to represent more commonly found sequences. In this case, four groups are returned. Within these records, tables allow easy access to the coding regions in the different genomes for the proteins in the group.
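
If you prefer to script the Protein-versus-IPG comparison described above, the sketch below builds the two E-utilities ESearch requests rather than running them. It is a minimal illustration, not an official recipe: the query string and its `[Organism]` field tag are assumptions about how you might phrase the glutamate dehydrogenase search, the helper names are hypothetical, and the counts you get back today will differ from the figures quoted in this post.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed phrasing of the example search; adjust the term to suit your needs.
QUERY = "glutamate dehydrogenase AND Escherichia coli[Organism]"


def build_esearch_url(db, term, retmax=0):
    """Return an ESearch URL for one Entrez database (hypothetical helper).

    retmax=0 asks for the hit count only, without a list of record IDs.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def comparison_urls(term=QUERY):
    """Build the same search against the Protein and IPG databases."""
    return {db: build_esearch_url(db, term) for db in ("protein", "ipg")}


if __name__ == "__main__":
    # Print the URLs; fetching each one returns JSON that includes a hit count.
    for db, url in comparison_urls().items():
        print(db, url)
```

Requesting each URL and comparing the `count` values in the two JSON responses should show the same kind of reduction as the web search, since IPG returns one record per unique sequence.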
