Tag: nucleotide BLAST

Updated and improved collection of RefSeq representative genome assemblies now available

We have updated the collection of representative genome assemblies for Bacteria and Archaea. As announced in April, this set is now recalculated three times a year. We selected a total of 11,727 prokaryotic assemblies to represent their respective species among the 192,000 assemblies in RefSeq. Six hundred and thirty-five species were included in the collection for the first time, while 395 organisms from undefined species (such as Bacillus bacterium) were removed. For 18% of bacterial and archaeal species, we were able to choose a higher-quality representative than in the previous set, thanks to improvements in the selection logic. The selection is now based on assembly length, the number of pseudo CDSs called in the PGAP annotation, the number of scaffolds, whether Gene IDs are available in the Gene database for the assembly that is currently representative, and type strain status. You can see the exact criteria, in order of importance, on the Prokaryotic RefSeq Genomes page. Now that the new selection process is in place, we expect future updates to bring fewer changes. We will replace a representative only if its assembly has changed RefSeq status or if a substantially better assembly becomes available.

We have updated the database on the Microbial Nucleotide BLAST page, as well as the basic nucleotide BLAST RefSeq Representative Genome Database, to reflect these changes.

You can download the reference and representative set from the Assembly resource. If you are interested in the annotation on these genomes, you can limit searches to proteins annotated on representative genomes by adding “refseq_select[filter]” to any query in the Protein database. For example, you can find all proteins annotated on representative genomes in the genus Klebsiella by using the query “Klebsiella[organism] AND refseq_select[filter]”. A BLAST database of proteins annotated on representative genomes will be coming soon. Stay tuned!
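The Protein database query above can also be assembled programmatically. The following is a minimal sketch, assuming you want to send the query to the NCBI E-utilities `esearch` endpoint; the helper names `representative_protein_query` and `esearch_url` are illustrative and not part of any NCBI library. It only builds the query string and URL and makes no network request.

```python
from urllib.parse import urlencode

# E-utilities esearch endpoint (public NCBI service).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def representative_protein_query(organism: str) -> str:
    """Build the Entrez query from the post for a given organism name."""
    return f"{organism}[organism] AND refseq_select[filter]"


def esearch_url(term: str, db: str = "protein", retmax: int = 20) -> str:
    """Return an esearch URL for the term (hypothetical helper, no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    term = representative_protein_query("Klebsiella")
    print(term)
    print(esearch_url(term))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching Protein record identifiers. Keeping the URL construction separate from the request makes the query easy to inspect before you send it.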

New ribosomal RNA BLAST databases available on the web BLAST service and for download

We have a curated set of ribosomal RNA (rRNA) reference sequences (Targeted Loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacteria and archaea) and fungal samples (Table 1). To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section to the nucleotide BLAST page, which makes it convenient to quickly identify source organisms (Figure 1).

Database | BioProjects | Sequences
16S ribosomal RNA (Bacteria and Archaea) | PRJNA33317, PRJNA33175 | 20,845
18S ribosomal RNA sequences (SSU) from Fungi type and reference material | PRJNA39195 | 2,337
28S ribosomal RNA sequences (LSU) from Fungi type and reference material | PRJNA51803 | 5,185
Internal transcribed spacer region (ITS) from Fungi and Oomycete type and reference material | PRJNA177353, PRJNA362621 | 10,874

Table 1. NCBI curated targeted rRNA sequences now available as BLAST databases.
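For readers who download one of these databases and search it locally, the sketch below assembles a BLAST+ `blastn` command line. It assumes BLAST+ is installed and that the downloaded 16S database is named `16S_ribosomal_RNA`; that name is an assumption here, so check it against the name of the files you actually downloaded. The helper `blastn_command` and the file names are hypothetical.

```python
def blastn_command(query_fasta, db="16S_ribosomal_RNA", out="hits.tsv", max_hits=10):
    """Return a blastn argument list for a local search.

    The database name is an assumption; the flags are standard BLAST+ options.
    """
    return [
        "blastn",
        "-query", query_fasta,               # FASTA file with your rRNA sequence(s)
        "-db", db,                           # local BLAST database to search
        "-outfmt", "6",                      # tabular output, one hit per line
        "-max_target_seqs", str(max_hits),   # cap the number of reported subjects
        "-out", out,                         # file that will receive the hits
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed or pasted into a shell;
    # pass the list to subprocess.run(...) to execute it.
    print(" ".join(blastn_command("query.fasta")))
```

Returning the arguments as a list avoids shell quoting problems if you later run the command with `subprocess.run`.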

The new BLAST results are now the default view

As you may know, we have been offering new BLAST results (Figure 1) as a test page since April. In response to your positive reception, and after incorporating many improvements that you suggested, we made the new results the default today, August 1, 2019.

You will still be able to access the traditional results for several months. This will give you additional time, if you need it, to adjust your workflows or teaching materials to the new display.


Troubleshooting GenBank Submissions: Annotating the Coding Region (CDS)

This article is intended for GenBank data submitters with a basic knowledge of BLAST who submit sequence data from protein-coding genes.

One of the most common problems when submitting DNA or RNA sequence data from protein-coding genes to GenBank is failing to add information about the coding region (often abbreviated as CDS) or defining the CDS incorrectly. Incomplete or incorrect CDS information will prevent you from having accession numbers assigned to your submission data set. However, there is a procedure that will help you troubleshoot any problems with the CDS feature annotation: running a BLAST analysis of your sequences before you submit your data.
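To make "incorrectly defining the CDS" concrete, here is a small local sanity check. It is not the BLAST procedure this article describes, only a complementary sketch under stated assumptions: the standard genetic code, a complete CDS on the plus strand, and 1-based inclusive coordinates. The function name `check_cds` is illustrative. Partial coding regions are legitimate in GenBank, so treat the messages as flags to review rather than as errors.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(sequence, start, end):
    """Return a list of potential problems for a CDS at start..end (1-based, inclusive)."""
    cds = sequence[start - 1:end].upper().replace("U", "T")
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("does not begin with ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems


if __name__ == "__main__":
    # A CDS spanning positions 3..11 of a toy sequence.
    print(check_cds("CCATGAAATAA", 3, 11))
```

An empty list means none of these simple checks fired; it does not confirm that the annotation is correct, which is what the BLAST comparison is for.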

NOTE: We have changed the BLAST search results displays since publishing this blog post. For updated guidance on using Nucleotide BLAST (blastn) to help you troubleshoot coding region annotation, see the articles in the NCBI Support Center.

Here’s how to use nucleotide BLAST (blastn) and the formatting options menu to analyze, interpret and troubleshoot your submissions:

1. To start the BLAST analysis, go to the BLAST homepage and select “nucleotide blast”.

Figure 1. Select “nucleotide blast”.
