Tag: Protein

Updated protein family models used by PGAP available for download

Release 3.0 of the NCBI protein family models used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
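As a rough illustration of such a search, the sketch below assembles an `hmmsearch` command (part of HMMER) and parses its tabular `--tblout` output. The file names, the E-value threshold, and the sample hit line are placeholders chosen for illustration and are not part of the release; the helper functions are hypothetical, not an NCBI tool.

```python
import shlex


def build_hmmsearch_command(hmm_file, protein_fasta, table_out, evalue=1e-5):
    """Return the hmmsearch argument list; nothing is executed here.

    All file names and the E-value threshold are illustrative assumptions.
    """
    return [
        "hmmsearch",
        "--tblout", table_out,  # write one line per reported target sequence
        "-E", str(evalue),      # report targets at or below this E-value
        hmm_file,               # query: the HMM collection
        protein_fasta,          # target: your protein sequences
    ]


def parse_tblout(lines):
    """Yield one dict per hit from hmmsearch --tblout output lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 18 whitespace-separated columns followed by a free-text description
        fields = line.split(maxsplit=18)
        yield {
            "protein": fields[0],          # target sequence name
            "model": fields[2],            # HMM name
            "model_accession": fields[3],  # HMM accession
            "evalue": float(fields[4]),    # full-sequence E-value
            "score": float(fields[5]),     # full-sequence bit score
        }


# A made-up hit line in --tblout layout; every value is illustrative only.
SAMPLE_TBLOUT = (
    "# target name accession query name accession E-value score bias ...\n"
    "protA - example_model NF038055.1 1.2e-50 170.3 0.1 "
    "1.5e-50 170.0 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein\n"
)

if __name__ == "__main__":
    cmd = build_hmmsearch_command("models.hmm", "proteins.faa", "hits.tbl")
    print(shlex.join(cmd))
    for hit in parse_tblout(SAMPLE_TBLOUT.splitlines()):
        print(hit["protein"], hit["model_accession"], hit["evalue"])
```

To run the search for real, the argument list can be passed to `subprocess.run` once HMMER is installed and the model file has been downloaded.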

The 3.0 release contains 17,350 models: 12,864 HMMs built at NCBI (111 more than in release 2.0) and 4,486 TIGRFAM HMMs. In addition, since release 2.0, we have assigned product names to over 2,000 Pfam HMMs, bringing the total to 6,698 Pfam HMMs with names that PGAP can transfer to the annotated proteins they hit. You can access a table of these product names from the release directory.

Figure 1. The evidence for name assignment for type III secretion system (T3SS) translocon subunit SctB (NF038055), showing the protein matches. Species-specific names for this highly variable component of T3SS include YopD, EspB, IpaC, and SipC. Instead of these, we used the standard moniker for core genes of T3SS, Sct (secretion and cellular translocation; PMID 26520801, PMID 9618447), which provides a unified nomenclature for this secretion system.

Improved access to SARS-CoV-2 data

NCBI Datasets has a simple new way to get Coronaviridae data, including data from SARS-CoV-2 (Figure 1). The data package includes genomic, protein, and CDS sequences, annotation, and a comprehensive data report for all complete genomes. You can also target your search to major taxonomic ranks within Coronaviridae.

Figure 1 – SARS-CoV-2 page within NCBI Datasets showing statistics as of June 16, 2020.

Interested in a specific protein? The SARS-CoV-2 protein page allows you to choose a protein and download the corresponding sequences, annotation and representative structures from all annotated genomes (Figure 2).

Figure 2 – SARS-CoV-2 protein page within NCBI Datasets showing annotations on the SARS-CoV-2 reference genome.

Looking for programmatic access? NCBI Datasets offers the same Coronaviridae genomic data and SARS-CoV-2 protein data through a command-line tool and a RESTful API. These tools support additional filtering, including the ability to download only those genomes released after a date you specify.

We appreciate your feedback. Try NCBI Datasets and let us know what you think!

New viral protein domain models for annotation of coronaviruses

NLM’s Conserved Domain Database (CDD) now includes 153 new viral protein domain family models for the annotation of coronaviruses. Examples include models for the S1 subunit of coronavirus Spike proteins (cd21527), the nucleocapsid (N) protein of coronavirus (cd21595), and the coronavirus RNA-dependent RNA polymerase (cd21530).

Each curated domain model consists of a multiple sequence alignment containing conserved sequence features that may have been confirmed experimentally, plus links to relevant publications. When available, the domain models include 3D structures with links to interactive 3D views and interacting partners.

Check out this tabular summary of SARS-CoV-2 gene products for links to matching conserved domain models and representative 3D protein structures.

Want to view these alignments in 3D space? We’ve updated iCn3D, a web-based 3D structure viewer, with new rendering, annotation, and alignment features.  Read more about how you can use iCn3D to view and analyze SARS-CoV-2-related structures.

Don’t forget to review our SARS-CoV-2 resources page to keep up to date on other coronavirus data at NCBI!

Conserved Domain Database (CDD) v. 3.18 is now available

The latest version of the Conserved Domain Database contains 2,128 new or updated NCBI-curated domains and now mirrors Pfam version 32 as well as models from NCBIfams, a collection of protein family hidden Markov models (HMMs) for improving bacterial genome annotation. We have also added fine-grained classifications of the cupin and PBP1 superfamilies. You can find this updated content on the CDD FTP site. Read on for detailed release statistics.

Protein family models used by PGAP are now available for download

A new release of the NCBI protein family profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function using HMMER.

The HMMs are used as hints for the structural annotation of protein-coding genes in bacterial genomes. They are also one of the sources for the names assigned to PGAP-annotated proteins, as presented in the Evidence-For-Name-Assignment comment block of RefSeq protein records (see, for example, WP_004152100.1).

The collection comprises 12,753 HMMs that were built at NCBI, and 4,486 TIGRFAM HMMs whose ownership was transferred to NCBI in April 2018. In addition to the HMM profiles and seed alignments, a tab-delimited file containing the product names and other attributes added to the HMMs by curators is available.

  • 85% of models were assigned a product name that can be transferred to proteins hit by the model.
  • 7,702 models have gene symbols.
  • 14,508 are supported by at least one publication.
  • 6,266 are assigned an Enzyme Commission number.
  • 617 represent antimicrobial resistance proteins.
  • Product names added to 4,686 Pfam HMMs owned by EMBL-EBI and used for functional annotation by PGAP are also included.

A total of 57 million RefSeq prokaryotic proteins have been named based on these curated HMMs and can be identified with the Entrez query "meta Evidence-For-Name-Assignment"[Properties] AND "Evidence Category=HMM"[Text Word]. See an example and more information on web displays of HMMs in a previous post.
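For readers who prefer to run this query programmatically, here is a minimal sketch that builds an E-utilities ESearch URL for it. The choice of the `protein` database, the `retmax` value, and the JSON return mode are assumptions made for illustration; the function name is hypothetical, and the sketch only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The Entrez query quoted in the post, written with straight quotes.
HMM_NAMED_PROTEINS_QUERY = (
    '"meta Evidence-For-Name-Assignment"[Properties] '
    'AND "Evidence Category=HMM"[Text Word]'
)


def build_esearch_url(term, db="protein", retmax=20):
    """Return an ESearch URL for `term`; db and retmax are illustrative."""
    params = {
        "db": db,           # Entrez database to search
        "term": term,       # the query string, URL-encoded by urlencode
        "retmax": retmax,   # maximum number of IDs to return
        "retmode": "json",  # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url(HMM_NAMED_PROTEINS_QUERY))
```

The resulting URL can be fetched with any HTTP client; NCBI's usage guidelines for E-utilities (request rate limits, API keys) apply.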

New search helps you find prokaryotic proteins

The latest improvement in the NCBI search experience is designed to help you quickly find microbial proteins. Now when you search for a prokaryotic protein name such as recombinase RecA in NCBI’s sequence databases or in the All databases search, a high-quality representative protein sequence is highlighted in a panel at the top of the results page (Figure 1).

The result panel also allows you to quickly link to related resources such as NCBI’s new pages for protein family models, Identical Protein Groups, and SPARCLE, NCBI’s protein domain architecture resource. We also provide as-you-type suggestions so you don’t have to type out some of the long names.

Figure 1.  The result for a search with recombinase RecA. The panel provides access to analysis tools, downloads, and relevant links to the protein family, the RefSeq protein, the identical protein group, and citations in PubMed.

Try these protein name searches, or your own, and use the as-you-type suggestions to assist your searches.

Please let us know how you like these results!

New results for organelle genome searches

As part of our ongoing effort to improve your search experience, we’ve made it easier for you to find the sequence of your favorite organelle genome plus all the information and data associated with it. To find organelle genomes, search for an organism name combined with an organelle description, for example human mitochondrion, tomato chloroplast, or Toxoplasma gondii RH apicoplast.

A new results panel will appear with links to the organelle genome sequence, annotated genes, and related phylogenetic and population studies. The panel appears with these searches in an All Databases search or within any of NCBI’s sequence databases, including Gene, Nucleotide, Protein, Genome, and Assembly. For the human mitochondrial genome, a graphical schematic of the genome allows you to navigate to individual mitochondrially encoded genes (Figure 1).

Figure 1. The organelle genome results for a search with human mitochondrion. The panel provides access to analysis tools, downloads, and other relevant results. Clicking any of the gene objects on the genome graphic leads to the relevant Gene record, for example Gene ID: 4512 in the case of COX1.

Try it out using the following example searches and let us know what you think!

September 11 Webinar: A beginner’s guide to genes and sequences at NCBI

On Wednesday, September 11, 2019 at 12 PM, NCBI staff will present a webinar for people with limited experience working with gene and sequence information. You will learn about the kinds of data available for genes and sequences, how to select the most informative records, and how to find related genes and sequences using pre-computed information and the BLAST sequence search service.

  • Date and time: Wed, Sep 11, 2019 12:00 PM – 12:30 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

Evidence for naming the protein now on non-redundant RefSeq records (WP_ accessions)

We are now showing the curated evidence used for assigning names and, where possible, gene symbols, publications, and Enzyme Commission numbers on nearly 70% (83 million) of microbial RefSeq proteins. This evidence includes a hierarchical collection of curated hidden Markov model (HMM)-based and BLAST-based protein families, and conserved domain architectures.

A new way to find an expanded set of similar genes

We recently showed you a new way to search for and view sets of orthologous genes from vertebrates. You can now get an additional set of search results that we are calling similar genes. These are related through protein architecture to the orthologous gene set and include genes from all metazoans and selected plant, fungal, and protist species. You can quickly find related genes within a species, compare them to those from other annotated metazoan genomes, and access other useful gene resources. To find a set of similar genes, enter a gene symbol or select the gene symbol + orthologs option from the selections menu.

For example, if you search for ‘AGO2 orthologs’, in addition to the link to orthologs from vertebrates, you’ll get a link to a set of similar genes (Genes with similar protein architectures) across a broad evolutionary spectrum that includes genes from invertebrates, fungi, and green plants (Figure 1).

Figure 1. Genes with similar protein architectures to AGO2. The original search was AGO2 orthologs, which brings up the suggestion box with the links to similar genes as well as the AGO2 vertebrate orthologs. The similar genes include entries from a broad taxonomic range of eukaryotic organisms.

If you search for ‘GH1’, you’ll get a link to similar genes that includes members of the growth hormone family that are not part of NCBI’s vertebrate ortholog set.

Figure 2. The human subset of genes with similar protein architectures to GH1, showing other members (paralogs) of the GH1 gene family (GH2, CSH1, CSH2, CSHL1). These are not included in the ortholog set.

Try out the following searches and follow the links to the Genes with similar protein architectures.

Please let us know what you think!