Author: NCBI Staff

Conserved Domain Database version 3.20 is available!

A new version of the Conserved Domain Database (CDD) is now available. Version 3.20 contains 1,614 new or updated NCBI/CDD-curated domains and now mirrors Pfam version 34 as well as new models from the NCBIfam collection. Fine-grained classifications of the [(+)ssRNA] virus RNA-dependent RNA polymerase catalytic domain, RING-finger/U-box, dimerization/docking domains of the cAMP-dependent protein kinase regulatory subunit, and Galactose/rhamnose-binding lectin domain superfamily have been added, along with many other new models.

We have significantly increased the fraction of CD-Search and interactive Batch CD-Search queries that return conserved domain architecture information. These results also carry attributes that further characterize protein function through links to information-rich resources such as Enzyme Commission (EC) numbers, Gene Ontology (GO) terms, PubMed IDs, and identifiers from the CAZy, TCDB, and MEROPS databases. See our earlier post for additional details. You can access CDD and find the updated content for version 3.20 on the CDD FTP site.

Database statistics for CDD version 3.20:

Models    Source
64,234    Total models from all Source Databases
          (organized into 4,541 multi-model Superfamilies)
18,882    NCBI CDD curation effort
 1,125    NCBIfams
 1,009    SMART v6.0
19,178    PFAM v34
 4,871    COGs v1.0
10,140    NCBI Protein Clusters
 4,488    TIGRFAM v15
59,693    Total models form the default CD-Search database
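As a quick consistency check on these figures, the seven per-source counts can be summed and compared with the two totals. The Python snippet below is only a sketch: the numbers are re-entered by hand from the statistics above, and the variable names are illustrative rather than anything CDD provides.

```python
# Per-source model counts, copied by hand from the CDD 3.20 statistics.
source_models = {
    "NCBI CDD curation effort": 18_882,
    "NCBIfams": 1_125,
    "SMART v6.0": 1_009,
    "PFAM v34": 19_178,
    "COGs v1.0": 4_871,
    "NCBI Protein Clusters": 10_140,
    "TIGRFAM v15": 4_488,
}
total_all_sources = 64_234
multi_model_superfamilies = 4_541

# Sum the per-source counts and compare with the reported grand total.
default_db_models = sum(source_models.values())
remainder = total_all_sources - default_db_models

print(f"Sum of per-source models: {default_db_models:,}")
print(f"Total minus that sum:     {remainder:,}")
```

The sum reproduces the 59,693 figure for the default CD-Search database, and the remainder equals the 4,541 multi-model superfamilies. That is consistent with superfamily models being counted in the 64,234 total, although the post does not state this explicitly.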

CD Search is part of the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms.

Join our mailing list to keep up to date with CD Search and other CGR news.

RefSeq release 214 is available!

RefSeq release 214 is now available online, from the FTP site, and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of September 12, 2022, and contains 328,588,569 records, including 239,609,016 proteins, 47,387,931 RNAs, and sequences from 123,394 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Foreign contamination screening
Introducing the new Foreign Contamination Screen (FCS) tool! If you produce assembled genomes, check out FCS, a tool you can run yourself to improve your genome assemblies and facilitate high-quality data submissions to GenBank. FCS is part of the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms. See our previous blog post to learn how FCS enhances contaminant detection sensitivity.

Join NCBI virtually at the Biodiversity Genomics 2022 conference

Learn about the NIH Comparative Genomics Resource (CGR) Project

The Biodiversity Genomics conference will take place virtually, October 2-7, 2022. This event is hosted by the Earth BioGenome Project and is open and free for all to attend.

NCBI staff will present a variety of recorded talks and posters highlighting various elements of the NIH Comparative Genomics Resource (CGR), including NCBI Datasets and the Comparative Genome Viewer (CGV). CGR is a multi-year National Library of Medicine (NLM) project to maximize the impact of eukaryotic research organisms and their genomic data resources to biomedical research. NCBI is charged with leading CGR development and engaging genomics communities. The CGR project will facilitate reliable comparative genomics analyses for all eukaryotic organisms in collaboration with the genomics community.

Check out NCBI’s schedule of activities to learn more about CGR.

NCBI hidden Markov models (HMM) release 10.0 now available!

Release 10.0 of the NCBI hidden Markov models (HMMs) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available for download. You can search your favorite prokaryotic proteins against this collection with the HMMER sequence analysis package to identify their function.
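A search like the one described above might be scripted as follows. This is a minimal Python sketch, assuming HMMER is installed locally; the file names (`ncbi_models.hmm`, `my_proteins.faa`, `hits.tbl`) and the helper functions are hypothetical placeholders, not names used by the NCBI release. It only uses the basic `hmmsearch <hmmfile> <seqdb>` invocation with HMMER's `--tblout` option.

```python
import shutil
import subprocess


def build_hmmsearch_command(hmm_file, protein_fasta, table_out):
    """Return the argument list for a basic hmmsearch run.

    --tblout asks HMMER to write a per-sequence hit table to table_out.
    """
    return ["hmmsearch", "--tblout", table_out, hmm_file, protein_fasta]


def run_search(command):
    """Run the command, failing early if HMMER is not installed."""
    if shutil.which(command[0]) is None:
        raise RuntimeError("hmmsearch not found on PATH; install HMMER first")
    subprocess.run(command, check=True)


# Hypothetical file names: substitute the downloaded HMM library and your proteins.
command = build_hmmsearch_command("ncbi_models.hmm", "my_proteins.faa", "hits.tbl")
print(" ".join(command))
# run_search(command)  # uncomment once the input files are in place
```

Score or E-value thresholds are left at HMMER's defaults here; choosing cutoffs appropriate to each model is a separate decision that this sketch does not attempt.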

The 10.0 release contains 15,360 models maintained by NCBI, including 228 that are new since 9.0, 99 that were modified significantly, and 205 that were assigned better names, EC numbers, Gene Ontology (GO) terms, gene symbols or publications. You can search and view the details for these in the Protein Family Model collection, which also includes conserved domain architectures and BlastRules, and find all RefSeq proteins they name.

GO terms associated with HMMs are now propagated to CDSs and proteins annotated with PGAP. In case you missed it, see our previous blog post on this topic.

Coming soon: Updated PubMed E-utilities!

Important Note: This release is being postponed and will go live Monday, November 21, 2022.

PubMed will be moving to an updated version of the E-utilities API on November 15, 2022. As previously announced, this updated version of E-utilities will use the same technology as the web version of PubMed released in 2020, so search results returned by the updated ESearch E-utility will now match those of the PubMed.gov website.

This update only affects E-utility calls with &db=pubmed. There are no changes to the E-utilities for other databases. You can refer to our previous post or watch our recorded webinar for more details on this update.
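To see which calls fall under this change, it can help to look at how an ESearch request is put together. The Python sketch below only builds the request URL and does not contact NCBI; the helper name `esearch_url` and the example search term are illustrative assumptions, while `db`, `term`, and `retmax` are standard ESearch parameters.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retmax=20):
    """Build an ESearch URL; calls with db=pubmed are the ones affected."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative search term only.
url = esearch_url("conserved domain database")
print(url)
```

Passing a different `db` value to the same helper produces a call that, per the note above, is not affected by the update.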

New ClinVar graphical display

Maps clinically significant variants by gene and position!

ClinVar is a freely accessible, public archive of reports of the relationships between human variations and phenotypes, with supporting evidence at NLM/NCBI. To help you access your variants of interest quickly, ClinVar is offering an experimental release of an all-new visualization tool in the search results. This graphical display provides an overview of variants when you search by gene or genomic region (Figures 1 and 2).

Currently the graphical display is implemented as an experiment and appears for only 10 percent of searches by gene or genomic region, but the links in this post will show the display so you can try it out. Alternatively, to bring up the graphical display for your own gene or genomic region search, edit the URL in the address bar to change the default gr=0 to gr=1. For example, the following URL will show the graphical display:

https://www.ncbi.nlm.nih.gov/clinvar/?gr=1&term=DSG2[gene]

Note that you can only get the graphical display with gene or genomic region searches. For other types of searches, you will see the table only.
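If you would rather switch the parameter programmatically than edit the address bar, something like the following Python sketch could do it. The function name `with_graphical_display` is made up for illustration; the only facts taken from this post are the `gr` parameter and the example DSG2 search URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_graphical_display(clinvar_url):
    """Return the same ClinVar search URL with gr set to 1."""
    parts = urlsplit(clinvar_url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["gr"] = "1"  # overrides gr=0, or adds gr if it is missing
    # Keep the square brackets of field tags such as [gene] readable.
    new_query = urlencode(query, safe="[]")
    return urlunsplit(parts._replace(query=new_query))


print(with_graphical_display(
    "https://www.ncbi.nlm.nih.gov/clinvar/?gr=0&term=DSG2[gene]"
))
```

As noted above, the parameter only has an effect for gene or genomic region searches; other search types still show the table only.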

Gene search display

The display for a gene search highlights small variants within the gene. Large structural variants are also marked as a single dot in the middle of the variation. The interactive display shows the placement of variants on the gene and their clinical significance and allows you to zoom in or pan right / left and limit results to variants in a chosen gene. Figure 1 shows the graphical display as it appears at the top of the search results for the desmoglein 2 (DSG2) gene and how to filter and navigate to variants of interest (Search ClinVar: DSG2[gene]).

Figure 1 (A-D). Graphical display views in ClinVar for variants in DSG2, a gene with many known pathogenic variants

A. Graphical view showing all variants for the DSG2 gene.  Results default to the GRCh37 assembly. You can change to the GRCh38 assembly by clicking the arrow at the upper left (circled in red).

B. You can zoom in by mousing over the 8th exon in the gene diagram, which activates a pop-up menu that allows you to re-display only this region by following the link (red box).

C. Refreshed result for the 8th exon of DSG2 showing a number of variants including pathogenic, benign, and ones with conflicting interpretations of pathogenicity. You can select the filters on the left-hand side of the ClinVar result to limit to variants with characteristics of interest, for example Conflicting Interpretations of pathogenicity.

D. Variants in exon 8 of DSG2 filtered for conflicting interpretations of pathogenicity. You can retrieve individual variants by mousing over the graphic to activate the pop-up menu and following the link (red box).

Celebrating 1 Year of NCBI Virtual Outreach Events

We launched the NCBI Virtual Outreach Event series in the fall of 2021 to expand our online outreach to a worldwide audience of people who use NCBI resources for biological/biomedical research, science education, and clinical applications. Our virtual outreach events include interactive workshops, webinars, and codeathons. In the past year, we have hosted 34 virtual events and served over 1,600 participants (Figure 1).

Top 3 reasons to use ElasticBLAST

ElasticBLAST is a new way to BLAST large numbers of queries, faster and on the cloud. Here are the top three reasons you should use ElasticBLAST:

1. ElasticBLAST can handle much LARGER queries! 

ElasticBLAST can search query sets that have hundreds to millions of sequences and against BLAST databases of all sizes.

2. ElasticBLAST is FASTER

ElasticBLAST distributes your searches across multiple cloud instances to process them simultaneously. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+.

3. ElasticBLAST is EASY to run on the cloud

ElasticBLAST is easy to set up using our step-by-step instructions (Amazon Web Services (AWS), Google Cloud Platform (GCP)) and allows you to leverage the power of the cloud. Once configured, it manages the software and database installation, handles partitioning of the BLAST workload among the various instances, and deallocates cloud resources when the searches are done.

ElasticBLAST also selects the instance (i.e., machine) type for you based on database size. Of course, you can also choose the instance type manually if you prefer.

Announcing GenBank Release 251.0

GenBank release 251.0 (8/15/2022) is now available on the NCBI FTP site. This release has 19.55 trillion bases and 2.94 billion records. The current release has 239,915,786 traditional records containing 1,492,800,704,497 base pairs of sequence data. There are also 2,024,099,677 WGS records containing 17,511,809,676,629 base pairs of sequence data, 560,196,830 bulk-oriented TSA records containing 497,501,380,386 base pairs of sequence data, and 115,103,527 bulk-oriented TLS records containing 43,852,280,645 base pairs of sequence data. 
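The headline totals follow from the four per-division figures. As a worked check, the Python sketch below adds them up; the numbers are copied by hand from the paragraph above, and the dictionary layout is just one convenient way to hold them.

```python
# (records, base pairs) per division, copied by hand from the release note.
divisions = {
    "traditional": (239_915_786, 1_492_800_704_497),
    "WGS": (2_024_099_677, 17_511_809_676_629),
    "TSA": (560_196_830, 497_501_380_386),
    "TLS": (115_103_527, 43_852_280_645),
}

total_records = sum(records for records, _ in divisions.values())
total_bases = sum(bases for _, bases in divisions.values())

print(f"{total_records:,} records (~{total_records / 1e9:.2f} billion)")
print(f"{total_bases:,} bases (~{total_bases / 1e12:.2f} trillion)")
```

The sums come to 2,939,315,820 records and 19,545,964,042,157 bases, which round to the 2.94 billion records and 19.55 trillion bases quoted for the release.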

Announcing new links and annotations on Conserved Domain Search results!

Conserved Domain Search (CD Search) results now show domain architecture information and other annotations that further characterize predicted domain and protein function. These include links to PubMed, Gene Ontology (GO) terms, Enzyme Commission (EC) numbers, and the SPARCLE Domain Architecture Viewer. You can use these links on the results to find literature (PubMed), assign biological roles and protein function (GO and EC), and find proteins with the same domain architecture (Domain Architecture Viewer). These annotations are currently available for a limited number of architectures, but we will continue to add them as part of our curation effort.

Figure 1 shows the results of an example CD Search with these new links. Note that you can use the GO and EC information provided to retrieve protein models with these annotations from the Protein Family Models database, for example GO:0030246[GOTermId] (molecular function: carbohydrate binding) or 2.7.11.1[ECNumber] (non-specific serine/threonine protein kinase).

Figure 1. Conserved Domain Database search results for a hypothetical protein (XP_007132600.1) from the common bean (Phaseolus vulgaris). The results classify the protein as a plant receptor-like protein kinase. The results also show the EC number and the GO terms associated with this domain architecture, a link to a PubMed citation for the protein family (receptor-like protein kinases), and a link to the Domain Architecture Viewer for G-type lectin S-receptor-like serine/threonine-protein kinases. The Domain Architecture Viewer shows other proteins from the NCBI databases with the same domain architecture (order, number and types of domains).