Tag: Comparative Genomics Resource (CGR)

Connect with NCBI at ASHG 2022

Join us October 25-29 in Los Angeles, CA

We are looking forward to seeing you in person at the American Society of Human Genetics (ASHG) annual meeting, October 25-29, 2022, in Los Angeles, California.

We will present a variety of talks and posters featuring our clinical and human genetic resources, as well as genome products and tools. We are excited to introduce the NIH Comparative Genomics Resource (CGR), a multi-year National Library of Medicine (NLM) project to maximize the impact of eukaryotic research organisms and their genomic data resources on biomedical research. If you’re interested in providing feedback that will help drive CGR forward, consider joining our round table discussion.

Check out NCBI’s schedule of activities and events: 

Coming soon! Changes to NCBI Datasets command-line tool in version 14 (CLIv14.0.0)

In October 2022, NCBI Datasets will release version 14 of our datasets and dataformat command-line tools. This release will contain breaking changes to the command syntax and to the content of the data packages and data reports. Thank you for the feedback that inspired these new features. We hope they will improve your experience!

We will continue to support CLI v13.x, although new features and improvements will be exclusive to CLI v14.0.0 and later releases.

NCBI Datasets supports the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms. Join our mailing list to keep up to date with NCBI Datasets and other CGR news.

More details

How is version 14 of the Datasets command-line tools (CLI v14.x) different from CLI v13.x and previous versions?

Conserved Domain Database version 3.20 is available!

A new version of the Conserved Domain Database (CDD) is now available. Version 3.20 contains 1,614 new or updated NCBI/CDD-curated domains and now mirrors Pfam version 34 as well as new models from the NCBIfam collection. Fine-grained classifications of the [(+)ssRNA] virus RNA-dependent RNA polymerase catalytic domain, RING-finger/U-box, dimerization/docking domains of the cAMP-dependent protein kinase regulatory subunit, and Galactose/rhamnose-binding lectin domain superfamily have been added, along with many other new models.

We have significantly increased the fraction of CD-Search and interactive Batch CD-Search queries that yield results showing conserved domain architecture information and attributes that further characterize protein function through links to information-rich resources such as Enzyme Commission (EC) numbers, Gene Ontology (GO) terms, PubMed IDs, and identifiers from the CAZy, TCDB, and MEROPS databases. See our earlier post for additional details. You can access CDD and find updated content on the CDD FTP site at CDD version 3.20.

Database statistics for CDD version 3.20:

Models   Source
64,234   Total models from all source databases
         (organized into 4,541 multi-model superfamilies)
18,882   NCBI CDD curation effort
 1,125   NCBIfams
 1,009   SMART v6.0
19,178   PFAM v34
 4,871   COGs v1.0
10,140   NCBI Protein Clusters
 4,488   TIGRFAM v15
59,693   Total models forming the default CD-Search database
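As a quick sanity check on the statistics above, the per-source counts can be added up in a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table, while the variable and function names are made up for this example. The second relation it checks (total models from all sources equals the default-database total plus the superfamily count) is an observation about the published numbers, not a definition stated by CDD.

```python
# Per-source model counts copied from the CDD v3.20 statistics above.
source_counts = {
    "NCBI CDD curation effort": 18_882,
    "NCBIfams": 1_125,
    "SMART v6.0": 1_009,
    "PFAM v34": 19_178,
    "COGs v1.0": 4_871,
    "NCBI Protein Clusters": 10_140,
    "TIGRFAM v15": 4_488,
}
default_db_total = 59_693    # models in the default CD-Search database
superfamilies = 4_541        # multi-model superfamilies
all_sources_total = 64_234   # total models from all source databases


def check_totals(counts, default_total, superfamily_count, grand_total):
    """Return two booleans: whether the per-source counts sum to the
    default-database total, and whether adding the superfamily count
    to that total gives the grand total (an observed relation only)."""
    source_sum = sum(counts.values())
    return (
        source_sum == default_total,
        default_total + superfamily_count == grand_total,
    )


sources_ok, grand_ok = check_totals(
    source_counts, default_db_total, superfamilies, all_sources_total
)
print("Per-source counts match default database total:", sources_ok)
print("Default total plus superfamilies matches grand total:", grand_ok)
```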

CD Search is part of the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms.

Join our mailing list to keep up to date with CD Search and other CGR news.

RefSeq release 214 is available!

RefSeq release 214 is now available online, from the FTP site, and through NCBI’s Entrez programming utilities, E-utilities.
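For readers who want to try the E-utilities route, here is a minimal sketch that builds an EFetch request URL for a single RefSeq protein record. It assumes the standard public E-utilities base URL and the common `db`, `id`, `rettype`, and `retmode` parameters; the helper name `efetch_url` is made up for this example, and the accession is the common bean protein mentioned elsewhere on this page. The sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the public E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, accession, rettype="fasta", retmode="text"):
    """Build an EFetch URL for one record (illustrative helper)."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = efetch_url("protein", "XP_007132600.1")
print(url)

# To actually download the record, pass the URL to an HTTP client, e.g.:
#   from urllib.request import urlopen
#   with urlopen(url) as response:
#       fasta_text = response.read().decode()
```

If you script many requests, check NCBI's usage guidelines for E-utilities first, since request rate limits apply.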

This full release incorporates genomic, transcript, and protein data available as of September 12, 2022, and contains 328,588,569 records, including 239,609,016 proteins, 47,387,931 RNAs, and sequences from 123,394 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Foreign contamination screening
Introducing the new Foreign Contamination Screen (FCS) tool! If you produce assembled genomes, check out FCS, a tool you can run yourself to improve your genome assemblies and facilitate high-quality data submissions to GenBank. FCS is part of the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms. See our previous blog post to learn how FCS enhances contaminant detection sensitivity.

Join NCBI virtually at the Biodiversity Genomics 2022 conference

Learn about the NIH Comparative Genomics Resource (CGR) Project

The Biodiversity Genomics conference will take place virtually, October 2-7, 2022. This event is hosted by the Earth BioGenome Project and is open and free for all to attend.

NCBI staff will present a variety of recorded talks and posters highlighting various elements of the NIH Comparative Genomics Resource (CGR), including NCBI Datasets and the Comparative Genome Viewer (CGV). CGR is a multi-year National Library of Medicine (NLM) project to maximize the impact of eukaryotic research organisms and their genomic data resources on biomedical research. NCBI is charged with leading CGR development and engaging genomics communities. The CGR project will facilitate reliable comparative genomics analyses for all eukaryotic organisms in collaboration with the genomics community.

Check out NCBI’s schedule of activities to learn more about CGR:

Announcing new links and annotations on Conserved Domain Search results!

Conserved Domain Search (CD Search) results now show domain architecture information and other annotations that further characterize predicted domain and protein function. These include links to PubMed, Gene Ontology (GO) terms, Enzyme Commission (EC) numbers, and the SPARCLE Domain Architecture Viewer. You can use these links on the results to find literature (PubMed), assign biological roles and protein function (GO and EC), and find proteins with the same domain architecture (Domain Architecture Viewer).  These annotations are currently available for a limited number of architectures, but we will continue to add them  as part of our curation effort.

Figure 1 shows the results of an example CD Search with these new links. Note that you can use the GO and EC information provided to retrieve protein models with these annotations from the Protein Family Models database, for example GO:0030246[GOTermId] (molecular function: carbohydrate binding) or 2.7.11.1[ECNumber] (non-specific serine/threonine protein kinase).

Figure 1. Conserved Domain Database search results for a hypothetical protein (XP_007132600.1) from the common bean (Phaseolus vulgaris). The results classify the protein as a plant receptor-like protein kinase. The results also show the EC number and the GO terms associated with this domain architecture, a link to a PubMed citation for the protein family (receptor-like protein kinases), and a link to the Domain Architecture Viewer for G-type lectin S-receptor-like serine/threonine-protein kinases. The Domain Architecture Viewer shows other proteins from the NCBI databases with the same domain architecture (order, number and types of domains).
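The GO and EC query terms mentioned above follow a simple "value[Field]" pattern, so they are easy to assemble programmatically. The sketch below formats and lightly validates such terms. The field tags `GOTermId` and `ECNumber` come from the examples in this post; the helper names and the validation patterns are assumptions made for illustration, not part of any NCBI library.

```python
import re
from urllib.parse import quote_plus

# Loose format checks (assumed patterns): GO IDs are "GO:" plus seven digits;
# EC numbers have four dot-separated parts, where later parts may be "-".
GO_ID = re.compile(r"GO:\d{7}")
EC_NUMBER = re.compile(r"\d+(\.(\d+|-)){3}")


def go_term_query(go_id):
    """Format a GO identifier as a fielded query term."""
    if not GO_ID.fullmatch(go_id):
        raise ValueError(f"not a GO identifier: {go_id!r}")
    return f"{go_id}[GOTermId]"


def ec_number_query(ec_number):
    """Format an EC number as a fielded query term."""
    if not EC_NUMBER.fullmatch(ec_number):
        raise ValueError(f"not an EC number: {ec_number!r}")
    return f"{ec_number}[ECNumber]"


for term in (go_term_query("GO:0030246"), ec_number_query("2.7.11.1")):
    # quote_plus makes the term safe to place in a URL query string.
    print(term, "->", quote_plus(term))
```

The formatted terms can be pasted directly into the Protein Family Models search box; the URL-encoded form is only needed if you embed a term in a link.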

New annotations in RefSeq

In June and July, the NCBI Eukaryotic Genome Annotation Pipeline released twenty-six new annotations in RefSeq for the following organisms:

  • Anopheles coluzzii (mosquito)
  • Anopheles funestus (African malaria mosquito)
  • Astyanax mexicanus (Mexican tetra)
  • Athalia rosae (coleseed sawfly)
  • Bactrocera dorsalis (oriental fruit fly)
  • Brassica napus (rape)
  • Brienomyrus brachyistius (bony fish)
  • Canis lupus dingo (dingo)
  • Caretta caretta (loggerhead turtle)
  • Dendroctonus ponderosae (mountain pine beetle)
  • Epinephelus fuscoguttatus (brown-marbled grouper)
  • Lagopus muta (rock ptarmigan)
  • Marmota marmota marmota (Alpine marmot)
  • Nematostella vectensis (starlet sea anemone)
  • Ostrea edulis (bivalve)
  • Panthera uncia (snow leopard)
  • Plutella xylostella (diamondback moth)
  • Pyrus x bretschneideri (Chinese white pear)
  • Rhincodon typus (whale shark)
  • Rhipicephalus sanguineus (brown dog tick)
  • Solanum stenotomum (eudicot)
  • Solanum verrucosum (eudicot)
  • Sphaerodactylus townsendi (lizard)
  • Stegostoma fasciatum (shark)
  • Triticum urartu (monocot)
  • Ziziphus jujuba (common jujube)

Foreign Contamination Screen (FCS) tool for GenBank submissions

We are excited to introduce a Foreign Contamination Screen (FCS) tool that you can now run yourself, with enhanced contaminant detection sensitivity to improve your genome assemblies and facilitate high-quality data submissions to GenBank. If you submit genome assembly data to GenBank, the FCS tool is for you!

What is the FCS tool?

FCS, a quality assurance process used to make data suitable for submission, consists of two parts: FCS-adaptor and FCS-GX. FCS-adaptor searches for short sequences that are used as part of the lab preparation process and sometimes wind up in the final assembly by mistake. FCS-GX searches for sequences from a wide range of organisms including bacteria, fungi, protists, viruses, and others to identify sequences that don’t look like they are from the intended organism. In each case, you receive a report of the coordinates and identities of potential contaminants to be reviewed and removed (see Figure 1 for a sample report of the FCS-GX summary output). Both tools are designed to screen both eukaryote and prokaryote genomes.

Figure 1. FCS-GX report showing the summary of contamination identified in a tomato genome. The output indicates there are 83 sequences, adding up to 381 kb total length, to be removed from a mix of insect, fungal, and bacterial sources.
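To make the idea of a contamination summary concrete, here is a small sketch that tallies the number of affected sequences and the total contaminated length per source, similar in spirit to the summary described above. The record layout (`ContaminantHit` with sequence ID, coordinates, and source label) and the example values are hypothetical; they are not the actual FCS-GX report format, so consult the FCS documentation for the real column definitions.

```python
from collections import defaultdict
from typing import NamedTuple


class ContaminantHit(NamedTuple):
    """One flagged region (hypothetical layout, not the FCS-GX format)."""
    seq_id: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    source: str  # e.g. "insect", "fungal", "bacterial"


def summarize(hits):
    """Return {source: (number of distinct sequences, total bases flagged)}."""
    summary = defaultdict(lambda: {"sequences": set(), "bases": 0})
    for hit in hits:
        entry = summary[hit.source]
        entry["sequences"].add(hit.seq_id)
        entry["bases"] += hit.end - hit.start + 1
    return {src: (len(e["sequences"]), e["bases"]) for src, e in summary.items()}


# Made-up example data for illustration only.
hits = [
    ContaminantHit("contig_1", 1, 5000, "bacterial"),
    ContaminantHit("contig_2", 101, 600, "fungal"),
    ContaminantHit("contig_3", 1, 1200, "bacterial"),
]

for source, (n_seqs, n_bases) in summarize(hits).items():
    print(f"{source}: {n_seqs} sequence(s), {n_bases} bases flagged")
```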

How do I use FCS?

FCS is available from GitHub. Simply download the two programs (FCS-adaptor and FCS-GX), and follow a few steps as outlined in the Quickstart. Both tools are also easy and inexpensive to run on commercial clouds such as Amazon Web Services (AWS) or Google Cloud Platform (GCP), and can screen genomes in a fraction of the time of other approaches. 

Why is FCS important?

High-quality data is necessary to reach accurate conclusions in research. With FCS, rapid detection of contaminants from foreign organisms in assembled genomes helps ensure that high-value data is submitted and available for reuse. We’ve already used FCS-GX to remove over one hundred megabases of contaminants and thousands of erroneous genes and proteins from previously submitted eukaryote genomes to make the data more useful for all.

We want to hear from you!

We will update the FCS tool based on your feedback, so try it out and let us know what you think. Please contact us with comments and suggestions.

FCS is part of the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms.

Join our mailing list to keep up to date with FCS and other CGR news.

Try out the latest BLAST ClusteredNR database results. Now with in-cluster analyses!

As we previously announced, we are offering a ClusteredNR protein database on the web BLAST service that provides faster searches, greater taxonomic reach, and easier-to-interpret results than the traditional nr database. We’ve added some new features to the results that make ClusteredNR even more useful by allowing analyses within each cluster, including the ability to:

    • Align the query to the members of the cluster.
    • Display Tree View and MSA View of the cluster alignment.
    • Submit the cluster to COBALT to generate a true multiple sequence alignment of the members.
    • Display a BLAST Taxonomy Report to see the taxonomic distribution of the sources of the members.

Figure 1 shows you how to access these in-cluster analysis options. The new Cluster Taxonomy report is shown in Figure 2. Try ClusteredNR yourself: follow this link to set up a search!

NLM’s all-new NCBI Datasets genome table is now available

We are excited to introduce new and useful updates to the Datasets genome table that let you quickly find and download a genome dataset including genome, transcript and protein sequence, annotation, and a data report.

The new genome table includes many new features and benefits (see Figure 1). With the new genome table you can:

  • Find all current genomes, including metagenomes
  • View multiple taxa such as birds and bees, or polyphyletic groups like fish
  • Easily find genomes with NCBI RefSeq annotations
  • Get more accurate genome counts, since each row now represents a single genome with GenBank and RefSeq accessions for that genome in the same row
  • Customize your downloads to include either GenBank or RefSeq files, or both
  • Download tables or data packages
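The point about more accurate genome counts can be illustrated with a small sketch: when each row holds one genome with both its GenBank and RefSeq accessions, counting rows counts genomes, whereas counting accessions would count some genomes twice. The `GenomeRow` class, the organism labels, and the accessions below are placeholders invented for this example and do not reflect the actual Datasets table schema or real assemblies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenomeRow:
    """One genome per row (hypothetical structure for illustration)."""
    organism: str
    genbank_accession: Optional[str]
    refseq_accession: Optional[str]


# Placeholder rows; accessions are not real assemblies.
rows = [
    GenomeRow("Organism A", "GCA_000000001.1", "GCF_000000001.1"),
    GenomeRow("Organism B", "GCA_000000002.1", None),
    GenomeRow("Organism C", "GCA_000000003.2", "GCF_000000003.2"),
]


def count_genomes(table):
    """Each row is a single genome, so the row count is the genome count."""
    return len(table)


def count_accessions(table):
    """Counting accessions instead would count paired genomes twice."""
    return sum(
        acc is not None
        for row in table
        for acc in (row.genbank_accession, row.refseq_accession)
    )


def with_refseq(table):
    """Select genomes that have a RefSeq accession in their row."""
    return [row for row in table if row.refseq_accession is not None]


print("Genomes:", count_genomes(rows))
print("Accessions:", count_accessions(rows))
print("With RefSeq accession:", [row.organism for row in with_refseq(rows)])
```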
