Tag: Assembly

Introducing the new NCBI Datasets Genomes page

The updated NCBI Datasets Genomes page now has genome data for all domains of life, including bacterial and viral genomes.

The genomes table (Figure 1) now offers filters for:

  • Reference genomes — switch it on to only show reference or representative genomes
  • Annotated — switch it on to only show annotated genomes
  • Assembly level — use the assembly level slider to select higher-quality genomes
  • Year released — use the slider to limit your results to recent genomes

In addition, the new Actions column connects you to NCBI’s Genome Data Viewer, BLAST, and Assembly. The Text filter box lets you search by the name of the assembly, species/infraspecies, or submitter.

Figure 1. The new Datasets Genomes page with primate assemblies showing the STATUS switches (reference genomes, annotated); expanded filters section with ASSEMBLY LEVEL and YEAR RELEASED sliding selectors; and the Actions column menu with access to Assembly details, BLAST, the Genome Data Viewer, and Download options.

Vertebrate Genome Project genome assemblies annotated by NCBI

NCBI is an active partner of the Vertebrate Genomes Project (VGP), which recently published a series of papers on the initial results of its effort to sequence all 70,000 vertebrate species. See the VGP press release for more details. To date, this project has submitted over 130 diploid chromosome-level assemblies to NCBI’s GenBank and the European Nucleotide Archive. NCBI has annotated 94 of the VGP assemblies from 85 species using the NCBI Eukaryotic Genome Annotation Pipeline.

These sequence and annotation data are available through NCBI web resources including Gene, Assembly, Nucleotide, Protein, and Datasets and are included in the GenBank and RefSeq releases. You can browse the assemblies in the Genome Data Viewer and download metadata, sequence, and annotation data for the latest assemblies in the VGP BioProject using the NCBI Datasets command-line tools.
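
As a rough sketch of that download step, the Python snippet below assembles a `datasets download genome accession` command for a BioProject. The accession PRJNA489243 (assumed here to be the VGP BioProject), the output filename, and the helper name are illustrative assumptions rather than details from this post. The command is only built and printed, not run; check `datasets download genome accession --help` for the options your installed version supports.

```python
import shlex

# Assumption: PRJNA489243 stands in for the VGP BioProject accession; verify before use.
VGP_BIOPROJECT = "PRJNA489243"


def build_datasets_command(accession, filename="vgp_genomes.zip"):
    """Return the argument list for an NCBI Datasets genome download (hypothetical helper)."""
    return [
        "datasets", "download", "genome", "accession", accession,
        "--filename", filename,
    ]


command = build_datasets_command(VGP_BIOPROJECT)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(command))
```

Building the argument list separately from running it makes it easy to inspect the command first, or to pass the list to `subprocess.run` once the Datasets tools are installed.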

Assembly database passes 1 million genome assemblies!

The NCBI Assembly database now provides sequence and metadata for more than 1 million genome assemblies from over 85,000 different species.

Assembly crossed the 1 million genome assemblies milestone on Sunday, April 18, 2021 (Figure 1).

Figure 1. Assembly status and growth. More than 1 million assemblies are now searchable through the NCBI web site (top panel). The number of genome assemblies at NCBI has grown rapidly in the past decade.

Improvements to NCBI Assembly

NCBI’s Assembly resource has a number of significant improvements!

Assembly records now have a link to Primer-BLAST making it easy to design primers in the context of a specific eukaryote genome assembly.  Figure 1 shows the Assembly page for the Genome Reference Consortium Mouse Build 39 (GRCm39) with the link to Primer-BLAST.

Figure 1. The Assembly page for the mouse reference genome (GCF_000001635.27), showing the new Run Primer-BLAST link, which loads the assembly as a database in the Primer-BLAST search (bottom), and the new expandable note sections, Genome-Annotation-Data in this case.

Prokaryotic representative genomes updated — now over 13 thousand assemblies!

We have updated the bacterial and archaeal representative genome collection!  The current collection contains over 13,000 assemblies selected from the 203,000 prokaryotic RefSeq assemblies to represent their respective species. The collection has increased by 11% since August 2020.  We’ve included about 1,400 species for the first time, have used better assemblies for 1,177 species, and have removed 65 species because of changes in NCBI Taxonomy or uncertainty in their species assignment.

We have also updated the Representative Genomes Database on the Microbial Nucleotide BLAST page, as well as the RefSeq Representative Genome Database on basic nucleotide BLAST, to reflect these changes.

Updated and improved collection of RefSeq representative genome assemblies now available

We have updated the collection of representative genome assemblies for Bacteria and Archaea. As announced in April, this set is now recalculated three times a year. We selected a total of 11,727 prokaryotic assemblies to represent their respective species among the 192,000 assemblies in RefSeq. Six hundred and thirty-five species were included in the collection for the first time, while 395 organisms from undefined species (such as Bacillus bacterium) were removed. We were able to choose a higher-quality representative than in the previous set for 18% of bacterial and archaeal species thanks to improvements in the selection logic. The selection is now based on assembly length, the number of pseudo CDSs called in the PGAP annotation, the number of scaffolds, whether Gene IDs are available in the Gene database for the assembly that is currently representative, and type strain status. You can see the exact criteria in order of importance on the Prokaryotic RefSeq Genomes page. Now that the new selection process is in place, we expect future updates to have fewer changes. We will replace a representative only if the assembly has changed RefSeq status or if a substantially better assembly becomes available.

We have updated the database on the Microbial Nucleotide BLAST page, as well as the RefSeq Representative Genome Database on basic nucleotide BLAST, to reflect these changes.

You can download the reference and representative set from the Assembly resource. If you are interested in the annotation on these genomes, you can limit searches to proteins annotated on representative genomes by adding “refseq_select[filter]” to any query in the Protein database. For example, you can find all proteins annotated on representative genomes in the genus Klebsiella by using the query: “Klebsiella[organism] AND refseq_select[filter]”. A BLAST database of proteins annotated on representative genomes will be coming soon. Stay tuned!
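
The same Protein query can also be expressed programmatically. The sketch below builds an E-utilities `esearch` URL for the query quoted above; the helper name and the `retmax` value are illustrative assumptions, and the snippet only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_protein_search_url(organism, retmax=20):
    """Build an esearch URL for proteins on representative genomes (hypothetical helper)."""
    # Same query string as in the text, with the organism filled in.
    term = f"{organism}[organism] AND refseq_select[filter]"
    params = {"db": "protein", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_protein_search_url("Klebsiella")
print(url)
```

Fetching the printed URL, for example with `urllib.request.urlopen`, should return the matching Protein record identifiers in JSON form.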

Enhanced prokaryote type strain report now with details on needed type strain data

The Prokaryote type strain report provides information on type strains for over 18,000 species. We revised and expanded the report to make it easier to identify cases where sequencing or establishing type material would have the biggest impact on improving prokaryote taxonomy and accurate identification. These cases include species with designated type strains but without a sequenced type strain assembly and species without designated type material. We hope that the community will prioritize sequencing type strains for the former set of species (Table 1) and establishing a neotype or reftype, where applicable (as defined in Ciufo et al. 2018), for the latter set (Table 2).

Other changes from the old format file are detailed in a recent genomes announce post.

Scientific Name            Type material/co-identical strains  Assemblies
Burkholderia ubonensis     CCUG:48852, CIP:1070, …             308
Escherichia albertii       Albert 19982, BCCM/LMG:20976, …     181
Xanthomonas perforans      ATCC:BAA-983, DSM:18975, …          153
Listeria innocua           ATCC:33090, BCCM/LMG:11387, …       106
Streptococcus iniae        ATCC:29178, BCCM/LMG:14520, …       94
Vibrio lentus              CECT:5110, CIP:107166, …            87
Vibrio cyclitrophicus      ATCC:700982, BCCM/LMG:21359, …      83
Pseudomonas coronafaciens  BCCM/LMG:5060, CFPB:2216, …         77
Aliivibrio fischeri        ATCC:7744, BCCM/LMG:4414, …         66
Xanthomonas fragariae      ATCC:33239, BCCM/LMG:708, …         61

Table 1. The top 10 candidate species for sequencing type-strains sorted by the number of assemblies. These have designated type strains but no type strain assembly. We generated the list by sorting by “number of assemblies from type materials per species”, then by decreasing “number of assemblies per taxon”, then filtering out “type materials and coidentical strains” = “na”.

Table 2. The top 10 candidates for proposing a reftype assembly, or neotype where applicable, sorted by the number of assemblies. These species have no designated type strain. We generated the list by selecting for “type materials and coidentical strains” = “na”, “number of assemblies from type materials per species” = 0, and sorting by decreasing “number of assemblies per taxon”, then filtering out Candidatus.
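
The two selections described in the table captions can be sketched in Python as follows. This is one reading of those steps, not the code used to build the tables: the rows are made-up placeholders, the `Scientific Name` key and the function names are assumptions, and only the three quoted column names come from the captions.

```python
NAME = "Scientific Name"  # assumed key for the species name
TYPE = "type materials and coidentical strains"
FROM_TYPE = "number of assemblies from type materials per species"
PER_TAXON = "number of assemblies per taxon"


def sequencing_candidates(rows, top=10):
    """Table 1 style: a type strain is designated but has no assembly."""
    picked = [r for r in rows if r[TYPE] != "na" and r[FROM_TYPE] == 0]
    return sorted(picked, key=lambda r: r[PER_TAXON], reverse=True)[:top]


def reftype_candidates(rows, top=10):
    """Table 2 style: no designated type material, Candidatus excluded."""
    picked = [
        r for r in rows
        if r[TYPE] == "na" and r[FROM_TYPE] == 0
        and not r[NAME].startswith("Candidatus")
    ]
    return sorted(picked, key=lambda r: r[PER_TAXON], reverse=True)[:top]


# Placeholder rows for illustration only; these are not values from the report.
rows = [
    {NAME: "Species A", TYPE: "X:1", FROM_TYPE: 0, PER_TAXON: 5},
    {NAME: "Species B", TYPE: "X:2", FROM_TYPE: 0, PER_TAXON: 40},
    {NAME: "Species C", TYPE: "X:3", FROM_TYPE: 2, PER_TAXON: 90},
    {NAME: "Species D", TYPE: "na", FROM_TYPE: 0, PER_TAXON: 7},
    {NAME: "Candidatus Example", TYPE: "na", FROM_TYPE: 0, PER_TAXON: 30},
    {NAME: "Species F", TYPE: "na", FROM_TYPE: 0, PER_TAXON: 15},
]

print([r[NAME] for r in sequencing_candidates(rows)])
print([r[NAME] for r in reftype_candidates(rows)])
```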

Please contact info@ncbi.nlm.nih.gov if you want to provide information about missing type-strains.

Expanded average nucleotide identity analysis now available for prokaryotic genome assemblies

As we described in an earlier post, GenBank uses average nucleotide identity (ANI) analysis to find and correct misidentified prokaryotic genome assemblies. You can now access ANI data for the more than 600,000 GenBank bacterial and archaeal genome assemblies through a downloadable report (ANI_report_prokaryotes.txt) available from the genomes/ASSEMBLY_REPORTS area of the FTP site. The README describes the contents of the report in detail. You can use the ANI data to evaluate the taxonomic identity of genome assemblies of interest for yourself.

The new ANI_report_prokaryotes.txt replaces the older ANI_report_bacteria.txt in the same directory. We are no longer updating the ANI_report_bacteria.txt file and will remove it after 31st May 2020.
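
For working with the report in a script, a minimal sketch is shown below. It assumes the file is tab-delimited with a single header line that may start with `#`, and that it sits at the URL given in the code; the README remains the authoritative description. The column names in the sample text are placeholders, not the report's real columns.

```python
import csv

# Assumption: location of the report under the genomes/ASSEMBLY_REPORTS area.
ANI_REPORT_URL = (
    "https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/"
    "ANI_report_prokaryotes.txt"
)


def read_tab_report(text):
    """Parse tab-delimited text whose first line is a header, optionally prefixed with '#'."""
    lines = text.splitlines()
    # Drop a leading '#' and any spaces before splitting the header into column names.
    header = lines[0].lstrip("# ").split("\t")
    reader = csv.reader(lines[1:], delimiter="\t")
    # Skip empty lines and pair each value with its column name.
    return [dict(zip(header, fields)) for fields in reader if fields]


# Placeholder content with made-up column names, standing in for the real file.
sample = "# accession\tspecies\tscore\nACC_1\tExample species\t99.1\n"
records = read_tab_report(sample)
print(records)
```

To apply this to the real report, the downloaded text (for example from `urllib.request.urlopen(ANI_REPORT_URL)`) can be passed to `read_tab_report` in place of the sample.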

Recalculation of prokaryotic reference and representative genome assemblies

We have updated the collection of representative and reference assemblies for Bacteria and Archaea to better reflect the taxonomic breadth of the prokaryotes in RefSeq. We chose the 11,478 representative assemblies in the new collection from the 180,000+ prokaryotic assemblies in RefSeq today. We have selected one representative or reference assembly for every species based on several criteria including contiguity, completeness, and whether the assembly is from type material. We have also updated the reference and representative microbial BLAST database to reflect these changes. This reference and representative set will be updated three times a year to reflect changes in RefSeq. In addition, as we announced on Feb 14, we have reduced the number of reference genome assemblies (the subset of representative assemblies with annotation provided by outside experts) to 15. See the list in our previous post. We have re-annotated the 104 assemblies that are no longer reference with our Prokaryotic Genome Annotation Pipeline (PGAP).

Important changes coming to prokaryotic Reference and Representative genome assemblies

We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.

  • We will reduce the number of Reference assemblies to 15 that have annotation provided by outside experts (Table 1) and re-annotate the 105 other current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose reference status.
  • We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.
