Tag: Comparative Genomics Resource (CGR)

New Improvements! Try out our Foreign Contamination Screen (FCS) Tool

Want to submit high-quality data quickly and easily to GenBank? Check out our Foreign Contamination Screen (FCS) tool, a quality assurance process that you can run yourself. FCS offers enhanced contaminant detection sensitivity to improve your genome assemblies and facilitate high-quality data submissions to GenBank. We recently made several improvements to make the tool even easier to use! 

What’s New?
  • Now quicker and easier to run!  
  • Decontaminate your genome with just one extra step. 
    • Save the removed sequences in a separate file, if desired.  
  • More accurate!  
  • Find more contaminants with improved coverage of prokaryotes, protists, and more. 
  • Screen your genome on the cloud in minutes. 

New Way to View and Download Related Genes

Effective June 2023, the HomoloGene records will redirect to the Datasets Gene Table

Do you use HomoloGene to view and download data? You can now access updated homology data from NCBI Datasets through the Datasets Gene Table with connections to NCBI Orthologs. Go directly from a HomoloGene record to the Datasets Gene Table that will give you access to up-to-date sequence data and metadata. NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.

The Datasets Gene Table links to the NCBI Ortholog interface (Figure 1), which provides the following data:

  • Orthology data based on an updated algorithm that identifies orthologs spanning more than 500 vertebrate species
  • Similar gene data based on protein architectures that spans all eukaryotes

Read About NCBI Resources in 2023 Nucleic Acids Research Database Issue

The 2023 Nucleic Acids Research Database Issue features papers from NCBI staff on GenBank, Conserved Domain Database, and more. The citations are available in PubMed, with the full text available in PubMed Central (PMC). To read an article, click on the PMCID number listed below.

RefSeq Release 217

RefSeq release 217 is now available online and from the FTP site. You can access RefSeq data through NCBI Datasets.

What’s included in this release?

As of March 8, 2023, this full release incorporates genomic, transcript, and protein data, containing:

  • 348,351,219 records
  • 254,500,694 proteins
  • 50,975,429 RNAs
  • sequences from 130,837 organisms

The release is provided in several directories as a complete dataset and divided by logical groupings.

New & Improved NCBI Datasets Genome and Assembly Pages

Legacy pages will be redirected effective June 2023

In June 2023, NCBI’s Assembly and Genome record pages will be redirected to new Datasets pages as part of our ongoing effort to modernize and improve your user experience. NCBI Datasets is a new resource that makes it easier to find and download genome data.

We will update the following pages:
  • The NCBI Assembly pages will be redirected to the new Datasets Genome pages that describe assembled genomes and provide links to related NCBI tools such as Genome Data Viewer and BLAST.
  • The NCBI Genome pages will be redirected to the Datasets Taxonomy pages that provide a taxonomy-focused portal to genes, genomes, and additional NCBI resources.
  • During this transition, you will have the option to return to the legacy Genome and Assembly pages. 

Now Available! More Mammalian Cross-Species Alignments in the Comparative Genome Viewer (CGV)

In response to your feedback, we’ve made more whole genome cross-species alignments available in NCBI’s Comparative Genome Viewer (CGV). You can use these alignments to explore genome rearrangements between species. You can also zoom in to analyze regions of conserved gene synteny.

There are over 20 new cross-species alignments available, including human-mouse, mouse-rat, human-chimp, human-cattle, dog-cat, and others! These cross-species alignments provide additional opportunities to explore evolutionary relationships at the genomic and gene levels. We will add more cross-species alignments in the coming months.

The latest cross-species alignments added to CGV include imports from the UCSC Genomics Institute, as well as those generated at NCBI.

Check out two examples of cross-species whole-genome alignments in CGV below (Figure 1).

Figure 1. Whole genome alignments between (A) mouse and human (GRCm39 vs. GRCh38.p14) and (B) cat and dog (F.catus_Fca126_mat1.0 vs. ROS_Cfam_1.0). Colored bands connect aligned regions; green indicates same orientation, blue indicates opposite orientation.

When you zoom in on an alignment (Figure 2), you can compare gene annotation on the two assemblies and see the extent of conservation of synteny. You can also see which genes are missing from one or the other assembly, indicating changes in sequence or differences in annotation.

New annotations in RefSeq!

In December and January, the NCBI Eukaryotic Genome Annotation Pipeline released twenty-nine new annotations in RefSeq for the following organisms:

  • Acinonyx jubatus (cheetah)
  • Anopheles cruzii (mosquito)
  • Anopheles moucheti (mosquito)
  • Bicyclus anynana (squinting bush brown)
  • Budorcas taxicolor (takin)
  • Carassius gibelio (silver crucian carp)
  • Citrus sinensis (sweet orange)
  • Crassostrea angulata (Portuguese oyster)
  • Culex pipiens pallens (northern house mosquito)
  • Drosophila gunungcola (fruit fly)
  • Galleria mellonella (greater wax moth)
  • Gossypium arboreum (tree cotton)
  • Gossypium raimondii (Peruvian cotton)
  • Harpia harpyja (harpy eagle)
  • Hemicordylus capensis (graceful crag lizard)
  • Lactuca sativa (garden lettuce)
  • Mercenaria mercenaria (northern quahog)
  • Mya arenaria (softshell)
  • Octopus bimaculoides (California two-spot octopus)
  • Oncorhynchus keta (chum salmon)
  • Pangasianodon hypophthalmus (striped catfish)
  • Panonychus citri (citrus red mite)
  • Panthera uncia (snow leopard) (pictured)
  • Peromyscus californicus insignis (California mouse)
  • Podarcis raffonei (Aeolian wall lizard)
  • Populus trichocarpa (black cottonwood)
  • Scomber japonicus (chub mackerel)
  • Tympanuchus pallidicinctus (lesser prairie-chicken)
  • Vigna angularis (adzuki bean)

Announcing New Names for Eukaryotic Genome Annotations in RefSeq!

The RefSeq eukaryotic genome annotation pipeline (EGAP) is moving to a new annotation naming format that can be used to unambiguously reference both the genome assembly and the RefSeq annotation. This will improve clarity when reporting the data you use and make the data more FAIR (Findable, Accessible, Interoperable, and Reusable). The new naming convention applies to all eukaryotic annotations released after December 15, 2022.

Historically, RefSeq EGAP has used an integer to identify a particular annotation release, such as Homo sapiens Annotation Release 110. This method provides no information on the assembly used for the annotation. In the new RefSeq naming system, annotation releases are designated by a combination of the assembly identifier (e.g., GCF_000001405.40) and an annotation name (e.g., RS_2022_04). The annotation name consists of an RS prefix, which indicates a RefSeq annotation, followed by the year and month in which it was generated, in the form RS_YYYY_MM. You should always use the annotation name in combination with the corresponding assembly accession.version, for example, GCF_026419915.1-RS_2022_12 (as shown in Figure 1). This ensures that you’re always using the name that defines a specific annotation for a specific genome assembly. If you use only part of the name, it will be ambiguous.

Figure 1. The annotation section of the Datasets Genome page for the assembly bHarHar1 for the harpy eagle (Harpia harpyja) showing the new annotation release GCF_026419915.1-RS_2022_12.
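
If you handle these names in scripts, it can help to check that a string carries both parts before you record it. The following is a minimal sketch, not an NCBI tool: the function name, the class name, and the assumption that an assembly accession is GCF_ followed by nine digits are ours, based only on the examples above.

```python
import re
from typing import NamedTuple

# Combined name as described above: <accession>.<version>-RS_<YYYY>_<MM>
# Assumption: the accession is "GCF_" plus nine digits, as in the examples
# in the text; adjust the pattern if you meet names that differ.
ANNOTATION_NAME = re.compile(
    r"(?P<accession>GCF_\d{9})\.(?P<version>\d+)-RS_(?P<year>\d{4})_(?P<month>\d{2})"
)


class AnnotationName(NamedTuple):
    """Hypothetical container for the two halves of a full annotation name."""
    assembly: str    # accession.version, e.g. GCF_026419915.1
    annotation: str  # annotation name, e.g. RS_2022_12
    year: int
    month: int


def parse_annotation_name(name: str) -> AnnotationName:
    """Split a full annotation name; reject partial or malformed names."""
    match = ANNOTATION_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a full assembly-annotation name: {name!r}")
    year, month = int(match["year"]), int(match["month"])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {name!r}")
    return AnnotationName(
        assembly=f"{match['accession']}.{match['version']}",
        annotation=f"RS_{match['year']}_{match['month']}",
        year=year,
        month=month,
    )


parsed = parse_annotation_name("GCF_026419915.1-RS_2022_12")
print(parsed.assembly, parsed.annotation)
```

Passing only RS_2022_12 or only GCF_026419915.1 raises a ValueError, which mirrors the point above that a partial name is ambiguous.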

Now Available! Add your favorite organism(s) to your BLAST ClusteredNR searches

Do you currently add organism names to focus your searches when using the standard BLAST nr database? You can now focus your searches by organism with the BLAST ClusteredNR database and get faster results with a better overview of protein homologs in a wider range of organisms. Your searches will be restricted to protein clusters that contain one or more sequences from the organism(s) you add.

ClusteredNR results

A search of the ClusteredNR database (results) using human myoglobin (NP_005359.1) as a query and limited to Cetacea (whales & dolphins) returns clusters containing all the whale myoglobin matches present in a search of standard nr, as well as matches to clusters containing cytoglobin (Figure 1 A). These significant cytoglobin matches are not shown in the standard nr results with the Cetacea limit, which are dominated by matches to proteins from a single species, Physeter catodon (sperm whale) (Figure 1 B).

RefSeq Release 216

RefSeq release 216 is now available online, from the FTP site, and through NCBI’s new resource, Datasets.

This full release incorporates genomic, transcript, and protein data available as of January 9, 2023, and contains 342,395,932 records, including 249,868,639 proteins, 49,869,497 RNAs, and sequences from 128,299 organisms. The release is provided in several directories as a complete dataset and also divided by logical groupings.
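
To see how much RefSeq grew between the two releases announced on this page, you can compare the quoted counts directly. The snippet below is a small illustrative sketch that uses only the figures stated in the release 216 and release 217 announcements; the dictionary keys and the function name are our own.

```python
# Counts copied from the release 216 and release 217 announcements on this page.
RELEASE_216 = {
    "records": 342_395_932,
    "proteins": 249_868_639,
    "RNAs": 49_869_497,
    "organisms": 128_299,
}
RELEASE_217 = {
    "records": 348_351_219,
    "proteins": 254_500_694,
    "RNAs": 50_975_429,
    "organisms": 130_837,
}


def growth(old: dict, new: dict) -> dict:
    """Return (absolute increase, percent increase) for each category."""
    return {
        key: (new[key] - old[key], 100 * (new[key] - old[key]) / old[key])
        for key in old
    }


for key, (delta, percent) in growth(RELEASE_216, RELEASE_217).items():
    print(f"{key:>10}: +{delta:,} ({percent:.2f}%)")
```

By these figures, release 217 adds 5,955,287 records, 4,632,055 proteins, 1,105,932 RNAs, and 2,538 organisms relative to release 216.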