Reduced redundancy. Faster searches. More diverse proteins and organisms in your BLAST results. Check out our new ClusteredNR database – derived from the default BLAST protein nr database by clustering sequences at 90% identity / 90% length (details below). Get quicker results and access to information about the distribution of your hits across a wider range of organisms and evolutionary distances.
You can choose the ClusteredNR database in the Choose Search Set section of the BLAST submission form, where you normally pick the BLAST database. Simply select the Experimental databases radio button. You can also select the checkbox to search both ClusteredNR and the standard nr at the same time, which lets you compare the results (Figure 1).
Figure 1. The ‘Choose Search Set’ section of the BLAST submission form. Selecting the Experimental databases radio button chooses ClusteredNR. You can also perform simultaneous searches against the clustered and the standard nr by checking ‘Select to compare standard and experimental database.’
Why make a clustered database?
As sequencing technology has advanced, the standard protein nr database has grown rapidly and is now very large, with over 300 million sequences. Growth of the data has not been evenly distributed across all organisms and classes of proteins. Some organisms are overrepresented, as are some types of proteins. This imbalance can make results harder to interpret. In some cases, your results may be dominated by the same kind of protein from the source organism of your query or from a very closely related species. Clustering the database produces a smaller database that better represents the diversity of organisms and proteins in the original database. Reducing the size of the database also improves search speed. As new proteins are added, many will join pre-existing clusters, so the clustered database will grow more slowly than the parent nr.
More about clusters
We generate ClusteredNR from the standard protein nr database with MMseqs2 so each cluster contains proteins that are more than 90% identical to each other and within 90% of the length of the longest member. We select a single well-annotated protein that indicates the function of the proteins in the cluster as the lead or representative protein. The title of the representative protein is the title that shows in the BLAST results. Each cluster may contain sequences for multiple organisms (species). On the BLAST results, clusters are identified by the name of the organism for the title protein as well as the most recent common ancestor taxon for all organisms in the cluster. This makes it clear when the cluster includes multiple species. You can expand a cluster on your BLAST results to view and download a report or the sequences of all member proteins, and you can also perform a BLAST alignment of all the members of the cluster (Figure 2).
Figure 2. An expanded cluster from the results of a ClusteredNR search. The representative sequence at the top is an M-type creatine kinase (NP_990838.1) from chicken. The cluster contains 14 members from 13 different species of birds. You can download information about the set from the expanded cluster including a list of the members in text or CSV format or the sequence records in FASTA or GenBank format. You can also align the members of the cluster with BLAST by clicking the ‘Show Alignment’ link.
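To make the cluster rules described above concrete, here is a minimal Python sketch of the two ideas: the 90% identity / 90% length thresholds and the most recent common ancestor taxon for a cluster. It is an illustration only, not the MMseqs2 algorithm or the NCBI pipeline. The Member record, the function names, and the example accessions are hypothetical, identity is assumed to be precomputed, "within 90% of the length of the longest member" is read here as "at least 90% as long as the longest member", and the lineages are abbreviated.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Member:
    """One protein in a cluster (hypothetical record, not an NCBI data format)."""
    accession: str
    length: int                # protein length in residues
    identity: float            # fractional identity to the cluster, 0.0-1.0 (assumed precomputed)
    lineage: Tuple[str, ...]   # taxon names ordered from root to species


def meets_thresholds(member: Member, longest_length: int,
                     min_identity: float = 0.90,
                     min_length_fraction: float = 0.90) -> bool:
    """Return True if the member is more than 90% identical and is at least
    90% as long as the longest member (our reading of the length rule)."""
    long_enough = member.length >= min_length_fraction * longest_length
    return member.identity > min_identity and long_enough


def most_recent_common_ancestor(lineages: Sequence[Sequence[str]]) -> Optional[str]:
    """Return the deepest taxon shared by every lineage, or None if there is none."""
    common = None
    for taxa in zip(*lineages):    # walk all lineages in step, starting at the root
        if len(set(taxa)) != 1:    # the lineages diverge at this level
            break
        common = taxa[0]
    return common


# Made-up members with abbreviated lineages for two bird species.
members = [
    Member("example_1", 380, 0.97,
           ("Eukaryota", "Chordata", "Aves", "Galliformes", "Gallus gallus")),
    Member("example_2", 372, 0.94,
           ("Eukaryota", "Chordata", "Aves", "Passeriformes", "Taeniopygia guttata")),
]
longest = max(m.length for m in members)
print(all(meets_thresholds(m, longest) for m in members))        # True
print(most_recent_common_ancestor([m.lineage for m in members]))  # Aves
```

In this toy cluster the two species belong to different orders, so the shared taxon reported for the cluster is the class Aves, which is the kind of label that signals a multi-species cluster in the results.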
Examples with ClusteredNR
Here are two simple searches that show how ClusteredNR expands taxonomic coverage and gives a better overview of the distribution of related proteins compared to a search against nr.
The first example uses the DnaK chaperone from Escherichia coli (NP_414555.1) as a query. In the search against the standard nr, nearly all of the matches are to proteins from other E. coli genome assemblies (nr results). The same search against ClusteredNR shows matches to clusters from a wide range of bacteria (ClusteredNR results).
Human B-type creatine kinase
Another example is a search with the human B-type creatine kinase (NP_001814.2). The creatine kinases are a small family of related proteins in animals that includes the B-type, M-type, U-type, and S-type enzymes. However, the nr results show matches only to B-type proteins from placental mammals (nr results). The search against ClusteredNR finds clustered matches that include several different creatine kinases (paralogs) from all groups of vertebrates, including mammals, birds, lizards and snakes, amphibians, and all classes of fishes (ClusteredNR results).
We are working to provide more tools for examining and aligning the cluster members. We are also adding more download options. ClusteredNR may also be an option for translating (blastx) searches in the future.
We hope the ClusteredNR database helps you identify protein sequences and find homologs. As always, we want to hear from you! If you have any questions or feedback, please write to firstname.lastname@example.org