New ClusteredNR database: faster searches and more informative BLAST results

Reduced redundancy. Faster searches. More diverse proteins and organisms in your BLAST results. Check out our new ClusteredNR database – derived from the default BLAST protein nr database by clustering sequences at 90% identity / 90% length (details below).  Get quicker results and access to information about the distribution of your hits across a wider range of organisms and evolutionary distances.

Searching ClusteredNR

You can choose the ClusteredNR database in the Choose Search Set section of the BLAST submission form, where you normally pick the BLAST database. Simply select the Experimental databases radio button. You can also select the checkbox to search both ClusteredNR and the standard nr at the same time, which lets you compare the results (Figure 1).

Figure 1. The ‘Choose Search Set’ section of the BLAST submission form. Selecting the Experimental databases radio button chooses ClusteredNR. You can also perform simultaneous searches against the clustered and the standard nr by checking ‘Select to compare standard and experimental database.’

Why make a clustered database?

As sequencing technology has advanced, the standard protein nr database has grown rapidly and is now very large, with over 300 million sequences. Growth of the data has not been evenly distributed across all organisms and classes of proteins. Some organisms are overrepresented, as are some types of proteins. This can create a challenge when interpreting results.  In some cases, your results may be dominated by the same kind of protein from the source organism for your query or a very closely related species. Clustering the database produces a smaller database that better represents the diversity of organisms and proteins in the original database. Reducing the size of the database also improves search speed. As new proteins are added, many will join pre-existing clusters, and the growth of the clustered database will not be as rapid as the parent nr.

More about clusters

We generate ClusteredNR from the standard protein nr database with MMseqs2 so each cluster contains proteins that are more than 90% identical to each other and within 90% of the length of the longest member.  We select a single well-annotated protein that indicates the function of the proteins in the cluster as the lead or representative protein. The title of the representative protein is the title that shows in the BLAST results. Each cluster may contain sequences for multiple organisms (species). On the BLAST results, clusters are identified by the name of the organism for the title protein as well as the most recent common ancestor taxon for all organisms in the cluster. This makes it clear when the cluster includes multiple species. You can expand a cluster on your BLAST results to view and download a report or the sequences of all member proteins, and you can also perform a BLAST alignment of all the members of the cluster (Figure 2).
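As a rough illustration of the two thresholds described above, here is a minimal Python sketch of a membership check. It is not the MMseqs2 algorithm and not the actual ClusteredNR pipeline; the function name and parameters are hypothetical. It assumes the percent identity has already been computed by an alignment tool, and it reads "within 90% of the length of the longest member" as "at least 90% of that length", which is an interpretation.

```python
def meets_cluster_criteria(identity, member_length, longest_length,
                           min_identity=0.90, min_length_fraction=0.90):
    """Return True if a protein passes both thresholds described in the text.

    identity        -- fractional identity to the cluster (0.0 to 1.0),
                       assumed to be precomputed by an alignment tool
    member_length   -- length of the candidate protein, in residues
    longest_length  -- length of the longest protein in the cluster
    """
    if member_length <= 0 or longest_length <= 0:
        raise ValueError("lengths must be positive")
    # Fraction of the longest member's length that the candidate covers.
    length_fraction = member_length / longest_length
    # "More than 90% identical" is treated as a strict inequality (assumption).
    return identity > min_identity and length_fraction >= min_length_fraction


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(meets_cluster_criteria(0.95, 370, 381))  # similar and similar in length
    print(meets_cluster_criteria(0.95, 300, 381))  # similar but too short
```

The defaults are exposed as parameters so the same check can be tried with other cutoffs, such as the 75% identity mentioned in the comments below.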

Figure 2. An expanded cluster from the results of a ClusteredNR search.  The representative sequence at the top is an M-type creatine kinase (NP_990838.1) from chicken. The cluster contains 14 members from 13 different species of birds. You can download information about the set from the expanded cluster including a list of the members in text or CSV format or the sequence records in FASTA or GenBank format. You can also align the members of the cluster with BLAST by clicking the ‘Show Alignment’ link.

Examples with ClusteredNR

Here are two simple searches that show how ClusteredNR expands taxonomic coverage and gives a better overview of the distribution of related proteins compared to a search against nr.

dnaK chaperonin

The first example uses the dnaK chaperonin from Escherichia coli (NP_414555.1) as a query. In the search against the standard nr, nearly all of the matches are to proteins from other E. coli genome assemblies (nr results). The same search against ClusteredNR shows matches to clusters from a wide range of bacteria (ClusteredNR results).

Human B-type creatine kinase

Another example is a search with the human B-type creatine kinase (NP_001814.2). The creatine kinases are a small family of several related proteins in animals, including the B-type, M-type, U-type, and S-type. However, the nr results show matches only to B-type proteins from placental mammals (nr results). The search against ClusteredNR finds clustered matches that include several different creatine kinases (paralogs) from all groups of vertebrates, including mammals, birds, lizards and snakes, amphibians, and all classes of fishes (ClusteredNR results).

Future development

We are working to provide more tools for examining and aligning the cluster members. We are also adding more download options. ClusteredNR may also be an option for translating (blastx) searches in the future.

We hope the ClusteredNR database helps you identify protein sequences and find homologs. As always, we want to hear from you! If you have any questions or feedback, please write to blast-help@ncbi.nlm.nih.gov.

9 thoughts on “New ClusteredNR database: faster searches and more informative BLAST results”

  1. I’d like to have a downloadable version too. Have you considered creating more with smaller % identity cutoffs? We make our own with a 75% cutoff to speed up searches even more. It takes awfully long to cluster the complete nr set though. I think I’ll re-cluster your 90% database to 75% after it becomes available.

    1. Thanks for your interest. We’re exploring different clustering cutoffs. We should have a standalone database available for download this fall.

      1. I see the database files are now available for download at https://ftp.ncbi.nlm.nih.gov/blast/db/experimental/nr_cluster_seq.tar.gz which is great!
        Are you going to update that regularly?
        I will have to extract the sequences from it as a FASTA to re-cluster with 75% cutoff. It’d be nice if you could provide the FASTA file for download too, in the same format as /blast/db/FASTA/nr.gz with the accompanying md5 file. That’d help automate its download after updates. A plain text file with the cluster members would also be helpful.

      2. Thanks for asking. Right now we are not updating it regularly because it is experimental and we are still acquiring computational resources to build it faster, but we plan to update it on a regular schedule eventually. We probably will not provide a separate FASTA file. You can generate a FASTA file of the representative sequences from the formatted ClusteredNR database using the BLAST program blastdbcmd.
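The blastdbcmd step mentioned in the reply above could be scripted roughly as follows. This is a sketch under assumptions, not an official recipe: the database name nr_cluster_seq is inferred from the tarball name in the thread and may differ from the actual database name, the output path and both function names are hypothetical, and the BLAST+ tools must be installed and on your PATH.

```python
import subprocess


def build_blastdbcmd_args(db_name, out_path):
    """Build a blastdbcmd command line that dumps every sequence as FASTA."""
    return [
        "blastdbcmd",
        "-db", db_name,      # name of the formatted BLAST database (assumed)
        "-entry", "all",     # export all sequences in the database
        "-outfmt", "%f",     # FASTA output
        "-out", out_path,    # file to write the sequences to
    ]


def dump_fasta(db_name, out_path):
    """Run blastdbcmd; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_blastdbcmd_args(db_name, out_path), check=True)


if __name__ == "__main__":
    # Assumed database name and hypothetical output path; adjust to your setup.
    dump_fasta("nr_cluster_seq", "nr_cluster_seq.fasta")
```

Keeping the argument list in its own function makes it easy to inspect or log the command before running it against a large database.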