Apply to attend October 2022 interactive, hands-on workshops
Want to learn more about NCBI resources and how to implement our cutting-edge tools in your research? NCBI offers a variety of educational opportunities, including workshops, webinars, codeathons, tutorials, and more!
We are excited to announce our upcoming virtual workshop series for October 2022. Our interactive, hands-on workshops are taught by experienced NCBI Education Faculty. Applications are open to the public; however, each workshop will accept a limited number of participants to facilitate the best possible educational experience.
Conserved Domain Search (CD Search) results now show domain architecture information and other annotations that further characterize predicted domain and protein function. These include links to PubMed, Gene Ontology (GO) terms, Enzyme Commission (EC) numbers, and the SPARCLE Domain Architecture Viewer. You can use these links on the results to find literature (PubMed), assign biological roles and protein function (GO and EC), and find proteins with the same domain architecture (Domain Architecture Viewer). These annotations are currently available for a limited number of architectures, but we will continue to add them as part of our curation effort.
As we previously announced, we are offering a ClusteredNR protein database on the web BLAST service that provides faster searches, greater taxonomic reach, and easier-to-interpret results than the traditional nr database. We’ve added some new features to the results that make ClusteredNR even more useful by supporting analyses within each cluster. You can now:
Align the query to the members of the cluster.
Display a Tree View and an MSA View of the cluster alignment.
Submit the cluster to COBALT to generate a true multiple sequence alignment of the members.
Display a BLAST Taxonomy Report to see the taxonomic distribution of the sources of the members.
Figure 1 shows you how to access these in-cluster analysis options. The new Cluster Taxonomy report is shown in Figure 2. Try ClusteredNR yourself: follow this link to set up a search!
As part of an ongoing effort to modernize and improve your experience, NLM’s NCBI Datasets is introducing all-new genome pages. These pages make it easier for you to browse and download genome sequence and metadata, and navigate to tools such as the Genome Data Viewer (GDV) and BLAST.
To get started, search NCBI Datasets by assembly accession (e.g., GCF_016699485.2), assembly name (e.g., bGalGal1.mat.broiler.GRCg7b), WGS accession (e.g., JAENSK01), or species name + genome (e.g., chicken genome), and click on the title in the box. See the top red arrow in Figure 1 below where we search for ‘chicken genome’.
Figure 1: Finding the chicken reference assembly. A search for ‘chicken genome’ returns a box that provides a quick link to the new genome page (middle red arrow). From there, the download button (bottom red arrow) allows you to select the files you need (see ‘Download Package’ window on the left) along with a detailed metadata report that includes all the metadata on the web page.
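The identifier types listed above follow recognizable formats. The Python sketch below guesses which kind of identifier a search string looks like. It is only an illustration: the patterns are inferred from the examples in this post and from the general shape of assembly and WGS accessions, and the function `classify_query` is a hypothetical helper, not the matching logic that NCBI Datasets uses.

```python
import re

# Patterns inferred from the example identifiers in this post.
# They are illustrative assumptions, not NCBI Datasets' own search logic.
ASSEMBLY_ACCESSION = re.compile(r"^GC[AF]_\d{9}\.\d+$")      # e.g. GCF_016699485.2
WGS_ACCESSION = re.compile(r"^[A-Z]{4}(?:[A-Z]{2})?\d{2}$")  # e.g. JAENSK01


def classify_query(query: str) -> str:
    """Guess which kind of identifier a search string looks like."""
    text = query.strip()
    if ASSEMBLY_ACCESSION.match(text):
        return "assembly accession"
    if WGS_ACCESSION.match(text):
        return "WGS accession"
    if text.lower().endswith(" genome"):
        return "species name + genome"
    # Assembly names have no fixed format, so they fall through to here.
    return "assembly name or other text"
```

For instance, `classify_query("GCF_016699485.2")` is recognized as an assembly accession, while a free-form assembly name such as `bGalGal1.mat.broiler.GRCg7b` falls through to the last category.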
An updated bacterial and archaeal representative genomes collection is available! A total of 16,105 assemblies among the 249,000 prokaryotic assemblies in RefSeq were selected to represent their respective species. The collection has grown by 3.7% since January 2022. A total of 706 species are represented for the first time. In addition, 186 species are represented by a better assembly, and 124 species were removed because of changes in NCBI Taxonomy or uncertainty in their species assignment.
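The figures above can be cross-checked with a little arithmetic. The sketch below assumes that each species has exactly one representative assembly, so the 186 replaced assemblies do not change the total. The announcement does not state this assumption or the January 2022 count, so treat the derived numbers as an estimate.

```python
# Figures quoted in the announcement.
current_total = 16_105
newly_represented = 706
removed = 124

# Assumption: one representative assembly per species, so replacing an
# assembly with a better one (186 cases) leaves the total unchanged.
net_change = newly_represented - removed
previous_total = current_total - net_change
growth_percent = 100 * net_change / previous_total

print(f"Net change: {net_change}")
print(f"Estimated previous total: {previous_total}")
print(f"Estimated growth: {growth_percent:.2f}%")
```

Under this assumption the estimated growth comes out at roughly 3.7%, in line with the figure quoted above.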
Reduced redundancy. Faster searches. More diverse proteins and organisms in your BLAST results. Check out our new ClusteredNR database – derived from the default BLAST protein nr database by clustering sequences at 90% identity / 90% length (details below). Get quicker results and access to information about the distribution of your hits across a wider range of organisms and evolutionary distances.
You can choose the ClusteredNR database in the Choose Search Set section of the BLAST submission form where you normally pick the BLAST database. Simply select the Experimental databases radio button. You can also select the checkbox to search both ClusteredNR and the standard nr at the same time, which lets you compare results (Figure 1).
BLAST+ 2.13.0 adds several important new features, including SRA BLAST programs, ARM Linux executables, and the ability to produce database metadata. It also brings some important improvements and a few bug fixes. You can download the new BLAST release from the FTP site.
SRA / WGS BLAST (blastn_vdb, tblastn_vdb)
Beginning with this release, the BLAST distribution includes the SRA BLAST programs blastn_vdb and tblastn_vdb, which can search SRA and WGS projects directly without the need to build a BLAST database. See the BLAST documentation on how to use these programs with WGS projects.
Starting with BLAST+ 2.13.0, the makeblastdb program generates an additional file with the file extension .njs for nucleotide databases or .pjs for protein databases. These files contain BLAST database metadata in JSON format. See the BLAST database metadata section in the BLAST User Manual for an example. Many tools can read JSON easily, which makes BLAST databases more compliant with FAIR principles.
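Because the metadata file is plain JSON, a few lines of Python are enough to read it. The sketch below is a minimal example, not an official NCBI tool. The helper names `load_db_metadata` and `summarize` are hypothetical, and apart from "bytes-to-cache" the key names are assumptions that you should check against the example in the BLAST User Manual.

```python
import json
from pathlib import Path


def load_db_metadata(path):
    """Read a BLAST database metadata file (.njs or .pjs) into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(metadata):
    """Pick out a few fields; a key missing from the file yields None."""
    # Apart from "bytes-to-cache", these key names are assumptions.
    keys = ("dbname", "dbtype", "number-of-sequences", "bytes-to-cache")
    return {key: metadata.get(key) for key in keys}
```

You could call `summarize(load_db_metadata("mydb.njs"))`, where `mydb.njs` stands in for a metadata file that makeblastdb produced for your own database.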
See the release notes for more details on improvements and bug fixes for the release.
Important reminder about usage reporting
As we announced previously, BLAST can report limited usage information back to NCBI. This information shows us whether the community is using BLAST+, and therefore whether it is worth maintaining and developing. It also allows us to focus our development efforts on the most used aspects of BLAST+. Please help us improve BLAST by allowing BLAST to share information about your search. The BLAST privacy statement provides details on the information collected, how it is used, and how to opt out of reporting if you don’t want to participate.
ElasticBLAST is a new tool that helps you run BLAST searches on the cloud. ElasticBLAST is perfect for you if you have thousands to millions of queries to run with our Basic Local Alignment Search Tool (BLAST®), or if you want to use cloud infrastructure for your searches. ElasticBLAST can handle large searches that are not appropriate for NCBI web BLAST, and it runs them more quickly than stand-alone BLAST+.
ElasticBLAST works on two of the current NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) partners: Amazon Web Services (AWS) and Google Cloud Platform (GCP). ElasticBLAST works by distributing your searches across multiple cloud instances to process them in parallel. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+. ElasticBLAST can handle millions of queries, and it also supports most BLAST+ options and programs.
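To give a feel for what distributing a search involves, the sketch below splits a set of FASTA queries into batches of roughly equal size, each of which could be sent to a different instance. This is only an illustration of the general idea. It is not ElasticBLAST's actual implementation, and the function names and the `max_letters` parameter are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batch_queries(records: Iterable[Record], max_letters: int) -> Iterator[List[Record]]:
    """Group records so each batch holds at most max_letters letters.

    A single sequence longer than max_letters still gets its own batch.
    """
    batch, letters = [], 0
    for header, seq in records:
        if batch and letters + len(seq) > max_letters:
            yield batch
            batch, letters = [], 0
        batch.append((header, seq))
        letters += len(seq)
    if batch:
        yield batch
```

Batching by total letters rather than by number of sequences keeps the work per batch more even when query lengths vary widely.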
Making it easier to run BLAST on the cloud
ElasticBLAST reduces the barrier to using the cloud by creating and managing cloud resources for you. It manages the software and database installation, handles partitioning of the BLAST workload among the various instances, and deallocates cloud resources when the searches are done. For example, ElasticBLAST will select the best cloud instance type for your search based on the database metadata that provides database size and memory needs (Figure 1). You can also manually select the instance type if you prefer.
Fig. 1: JSON metadata for the 16S_ribosomal_RNA database. The “bytes-to-cache” information helps ElasticBLAST pick out an instance with the appropriate capacity.
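The sketch below shows one way a "bytes-to-cache" value could drive the choice of instance: pick the smallest machine whose memory covers the database plus some headroom. It is a simplified illustration under stated assumptions, not the selection logic ElasticBLAST uses. The instance names, memory sizes, and the `headroom` factor are all invented for the example.

```python
# Hypothetical catalogue: the names and memory sizes (in GiB) are invented
# for illustration and do not correspond to real AWS or GCP instance types.
INSTANCE_TYPES = [("small", 8), ("medium", 32), ("large", 128)]


def pick_instance(bytes_to_cache, instance_types=INSTANCE_TYPES, headroom=1.5):
    """Return the smallest instance whose memory fits the database.

    headroom is an assumed safety factor that leaves memory for the
    operating system and the BLAST process itself.
    """
    needed_gib = bytes_to_cache * headroom / 2**30
    for name, memory_gib in sorted(instance_types, key=lambda item: item[1]):
        if memory_gib >= needed_gib:
            return name
    raise ValueError("no instance type has enough memory for this database")
```

A small database such as a 16S collection would fit the smallest entry in this toy catalogue, while a database of 20 GiB would need the next size up.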
ElasticBLAST can access the 28 NCBI databases available on AWS and GCP. These are the same databases that are also available from the NCBI FTP site. For instance, databases available on the two cloud providers include the RefSeq Eukaryotic Representative Genomes database, the 16S database based on Targeted Loci, and the human and mouse genome databases.
You can also provide your own databases, and you can produce the metadata needed to select an instance through a Python script that comes with ElasticBLAST.
ElasticBLAST can perform a variety of searches with query sets that range from hundreds to millions of sequences and BLAST databases of all sizes. Table 1 shows ElasticBLAST searches with query sets that range up to billions of letters using a variety of BLAST databases.
Table 1: Sample ElasticBLAST searches. This table demonstrates the breadth of searches supported by ElasticBLAST. Additionally, the first row demonstrates the ability of ElasticBLAST to use many CPUs (3200) on a cloud provider at once to complete a task in hours that would have taken days on a single machine.
Because ElasticBLAST runs on cloud providers, using it will incur some cost. Based on current cost structures on AWS and GCP, in most cases these costs are quite small. For example, a protein search with a query of about 20 million residues against a database of about 20 billion residues can cost less than $5. Even a larger search with a query of 3-4 billion DNA bases can cost only around $50. Both cloud services include the option to bid on instances for less than full price, which can result in significant savings. ElasticBLAST can be configured to request such instances. Your costs will obviously vary based on many factors, and we encourage you to explore these options with the individual cloud providers. Also, both AWS and GCP offer a free tier or time-limited trial of their cloud services, and you can find information about using ElasticBLAST with the free tiers here.
ElasticBLAST is a cloud-native package developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) with support from the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.