The ongoing sequencing revolution has resulted in exponential growth of the NCBI BLAST databases. The default BLAST nucleotide database (nt), the most popular Web BLAST database, currently contains 903 billion letters and continues to grow rapidly, having doubled in size over the past year. This growth will cause longer search times, reduced capacity, and more delays in updating the database. In the not-too-distant future, searching the entire nt database on the web will no longer be possible unless we modify the database scope and composition.
Because of the above concerns, we want to make the default Web BLAST nucleotide database smaller and more efficient. Some options are to:
Change its composition to improve the quality of sequence entries included
Take steps to slow its growth rate
Divide it into several databases by biological or functional categories
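To make the growth concern concrete, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that nt keeps doubling once a year from its current 903 billion letters; that constant rate is an extrapolation, not an NCBI forecast, and the function name is hypothetical.

```python
def projected_letters(current_letters, years, doubling_time_years=1.0):
    """Project database size assuming steady exponential growth.

    ASSUMPTION: the doubling time stays constant, which real growth may not do.
    """
    return current_letters * 2 ** (years / doubling_time_years)


# Current size of the default nt database, as stated in the post.
NT_LETTERS_NOW = 903e9

for years in (1, 2, 3):
    size = projected_letters(NT_LETTERS_NOW, years)
    print(f"after {years} year(s): {size / 1e12:.2f} trillion letters")
```

Under that assumption, nt would reach about 1.8 trillion letters after one year and about 7.2 trillion after three, which is why slowing growth or splitting the database is on the table.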
RefSeq release 215 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of November 7, 2022, and contains 335,372,031 records, including 244,583,657 proteins and sequences from 125,116 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
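For readers who want to reach RefSeq records through E-utilities, the following Python sketch builds an ESearch request restricted to RefSeq entries. It is a minimal illustration, not an official client: the helper names are hypothetical, the choice of the protein database and of "Homo sapiens" in the usage note is just an example, and the `srcdb_refseq[PROP]` filter is used here on the assumption that it limits results to RefSeq records.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def refseq_term(organism):
    """Build an Entrez query for one organism, limited to RefSeq records."""
    return f"{organism}[Organism] AND srcdb_refseq[PROP]"


def build_esearch_url(term, db="protein", retmax=5):
    """Return an ESearch URL that asks for JSON output."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def fetch_record_count(url):
    """Send the request and return the number of matching records.

    Requires network access; not called at import time.
    """
    with urlopen(url) as response:
        result = json.load(response)
    return int(result["esearchresult"]["count"])
```

Calling `fetch_record_count(build_esearch_url(refseq_term("Homo sapiens")))` would return the current number of matching protein records. The count changes with each release, so no expected value is shown here.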
NLM’s NCBI Datasets announces the release of version 14 of our command-line interface (CLI) tools, datasets and dataformat. This release (CLI v14.0.0) contains many improvements that are inspired by your feedback. It’s now easier than ever to browse and format metadata, generate customized tables, and download data packages. We hope these updates will improve your experience!
We will present a variety of talks and posters featuring our clinical and human genetic resources, as well as genome products and tools. We are excited to introduce the NIH Comparative Genomics Resource (CGR), a multi-year National Library of Medicine (NLM) project to maximize the impact of eukaryotic research organisms and their genomic data resources on biomedical research. If you’re interested in providing feedback that will be used to help drive CGR forward, consider joining our round table discussion.
Check out NCBI’s schedule of activities and events:
In October 2022, NCBI Datasets will release version 14 of our datasets and dataformat command-line tools. This release will contain breaking changes to the command syntax and to the content of the data packages and data reports. Thank you for the feedback that inspired these new features. We hope they will improve your experience!
We will continue to support CLI v13.x, although new features and improvements will be exclusive to CLI v14.0.0 and later releases.
A new version of the Conserved Domain Database (CDD) is now available. Version 3.20 contains 1,614 new or updated NCBI/CDD-curated domains and now mirrors Pfam version 34 as well as new models from the NCBIfam collection. Fine-grained classifications of the [(+)ssRNA] virus RNA-dependent RNA polymerase catalytic domain, RING-finger/U-box, dimerization/docking domains of the cAMP-dependent protein kinase regulatory subunit, and Galactose/rhamnose-binding lectin domain superfamily have been added, along with many other new models.
We have significantly increased the fraction of CD-Search and interactive Batch CD-Search queries that yield results showing conserved domain architecture information and attributes that further characterize protein function through links to information-rich resources such as Enzyme Commission (EC) numbers, Gene Ontology (GO) terms, PubMed IDs, and identifiers from the CAZy, TCDB, and MEROPS databases. See our earlier post for additional details. You can access CDD and find updated content on the CDD FTP site at CDD version 3.20.
RefSeq release 214 is now available online, from the FTP site, and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of September 12, 2022, and contains 328,588,569 records, including 239,609,016 proteins, 47,387,931 RNAs, and sequences from 123,394 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
Foreign contamination screening
Introducing the new Foreign Contamination Screen (FCS) tool! If you produce assembled genomes, check out FCS, a tool you can run yourself to improve your genome assemblies and facilitate high-quality data submissions to GenBank. FCS is part of the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms. See our previous blog post to learn how FCS enhances contaminant detection sensitivity.
Learn about the NIH Comparative Genomics Resource (CGR) Project
The Biodiversity Genomics conference will take place virtually, October 2-7, 2022. This event is hosted by the Earth BioGenome Project and is open and free for all to attend.
NCBI staff will present a variety of recorded talks and posters highlighting various elements of the NIH Comparative Genomics Resource (CGR), including NCBI Datasets and the Comparative Genome Viewer (CGV). CGR is a multi-year National Library of Medicine (NLM) project to maximize the impact of eukaryotic research organisms and their genomic data resources on biomedical research. NCBI is charged with leading CGR development and engaging genomics communities. The CGR project will facilitate reliable comparative genomics analyses for all eukaryotic organisms in collaboration with the genomics community.
Conserved Domain Search (CD-Search) results now show domain architecture information and other annotations that further characterize predicted domain and protein function. These include links to PubMed, Gene Ontology (GO) terms, Enzyme Commission (EC) numbers, and the SPARCLE Domain Architecture Viewer. You can use these links on the results page to find literature (PubMed), assign biological roles and protein function (GO and EC), and find proteins with the same domain architecture (Domain Architecture Viewer). These annotations are currently available for a limited number of architectures, but we will continue to add them as part of our curation effort.