Have you ever searched the NCBI Protein database and been overwhelmed by the number of sequences returned? Have you tried searching with a protein name, thinking that would greatly limit the results, only to still be presented with many sequences (all with the same name)? It’s a common problem now that sequence databases are expanding rapidly, driven by large-scale genomic sequencing of similar organisms. Redundancy in the sequence databases is high and only getting worse.
To address this, in 2013 NCBI released the WP records, which collect identical protein sequences annotated on bacterial genomes. In 2014, NCBI released the Identical Protein Reports on Protein records, which display information about all other proteins identical to the protein being viewed. Now, we are releasing a new resource: Identical Protein Groups (IPG). IPG offers several features:
- Records representing each unique protein sequence in the NCBI databases
- Coding regions from GenBank, RefSeq, SwissProt, and PDB
- Record titles derived from the highest quality record in the group
- Nucleotide coordinate mapping for each coding region (for GenBank and RefSeq records)
- Search filtering options including the source database, taxonomy, and the size of the group
Let’s say that you’re interested in glutamate dehydrogenase from E. coli. Searching the Protein database currently returns over 2,600 records. The same search in IPG returns only 267 records (Figure 1).
You can narrow the IPG results further using the filters on the left. For example, you might limit the results to groups having more than 1,000 sequences, as these tend to represent more commonly found sequences. In this case, four groups are returned. Within these records, tables allow easy access to the coding regions in the different genomes for the proteins in the group.
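If you prefer to compare the Protein and IPG record counts from a script rather than the web page, the sketch below shows one possible approach using the NCBI E-utilities ESearch service. It is only an illustration: the helper names (`esearch_url`, `parse_count`, `compare_counts`) are made up for this example, the query string you pass in is your own choice and may not match the exact web search described above, and the counts you get will change as the databases grow.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term):
    """Build an ESearch URL for one Entrez database (hypothetical helper)."""
    params = {"db": db, "term": term, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


def parse_count(payload):
    """Extract the total hit count from an ESearch JSON response string."""
    return int(json.loads(payload)["esearchresult"]["count"])


def compare_counts(term, databases=("protein", "ipg")):
    """Return {database: hit count} for the same query in each database.

    Makes one network request per database.
    """
    counts = {}
    for db in databases:
        with urlopen(esearch_url(db, term)) as response:
            counts[db] = parse_count(response.read().decode("utf-8"))
    return counts
```

Calling `compare_counts("glutamate dehydrogenase Escherichia coli")`, for instance, would return a dictionary with one count for `protein` and one for `ipg`; that query text is an assumption, so adjust it to match the search you actually ran.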