We have updated the collection of representative genome assemblies for Bacteria and Archaea. As announced in April, this set is now recalculated three times a year. We selected a total of 11,727 prokaryotic assemblies to represent their respective species among the 192,000 assemblies in RefSeq. Six hundred and thirty-five species were included in the collection for the first time, while 395 organisms from undefined species (such as Bacillus bacterium) were removed. We were able to choose a higher-quality representative than in the previous set for 18% of bacterial and archaeal species, thanks to improvements in the selection logic. The selection is now based on the assembly length, the number of pseudo CDSs called in the PGAP annotation, the number of scaffolds, whether Gene IDs are available in the Gene database for the assembly that is currently representative, and type strain status. You can see the exact criteria in order of importance on the Prokaryotic RefSeq Genomes page. Now that the new selection process is in place, we expect future updates to have fewer changes. We will replace a representative only if the assembly has changed RefSeq status or if a substantially better assembly becomes available.
You can download the reference and representative set from the Assembly resource. If you are interested in the annotation on these genomes, you can limit searches to proteins annotated on representative genomes by adding “refseq_select[filter]” to any query in the Protein database. For example, you can find all proteins annotated on representative genomes in the genus Klebsiella by using the query: “Klebsiella[organism] AND refseq_select[filter]”. A BLAST database of proteins annotated on representative genomes will be coming soon. Stay tuned!
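If you prefer to run this kind of query from a script, the sketch below shows one way to build the same search term and turn it into a request URL for the NCBI E-utilities ESearch endpoint. It only constructs the URL and does not contact NCBI. The helper names `build_protein_query` and `build_esearch_url` are hypothetical and chosen for illustration, and the `retmax` and `retmode` values are assumptions you may want to change.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_protein_query(organism: str) -> str:
    """Return a Protein database query limited to representative genomes.

    Hypothetical helper: it appends the refseq_select[filter] term
    described above to an organism search.
    """
    return f"{organism}[organism] AND refseq_select[filter]"


def build_esearch_url(term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for the Protein database (no request is sent).

    retmax and retmode=json are illustrative choices, not requirements.
    """
    params = {
        "db": "protein",
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }
    # urlencode percent-encodes the brackets and spaces in the term.
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    term = build_protein_query("Klebsiella")
    print(term)
    print(build_esearch_url(term))
```

Opening the printed URL in a browser, or fetching it with an HTTP client, should return the matching protein record IDs. If you send many requests, check the E-utilities usage guidelines first.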
We have a curated set of ribosomal RNA (rRNA) reference sequences (Targeted Loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacterial and archaeal) and fungal samples (Table 1). To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section to the nucleotide BLAST page, which makes it convenient to quickly identify source organisms (Figure 1).
As you may know, we have been offering new BLAST results (Figure 1) as a test page since April. In response to your positive reception and after incorporating many improvements that you suggested, we made the new results the default today, August 1, 2019.
You will still be able to access the traditional results for several months. This will give you additional time, if you need it, to adjust your workflows or teaching materials to the new display.
This article is intended for GenBank data submitters with a basic knowledge of BLAST who submit sequence data from protein-coding genes.
One of the most common problems when submitting DNA or RNA sequence data from protein-coding genes to GenBank is failing to add information about the coding region (often abbreviated as CDS) or incorrectly defining the CDS. Incomplete or incorrect CDS information will prevent you from having accession numbers assigned to your submission data set, but there is a procedure that will help you troubleshoot any problems with the CDS feature annotation: doing a BLAST analysis with your sequences before you submit your data.
NOTE: We have changed BLAST search results displays since publishing this blog. For updated guidance on using Nucleotide BLAST (blastn) to help you troubleshoot coding region annotation, see the articles in the NCBI Support Center.
Here’s how to use nucleotide BLAST (blastn) and the formatting options menu to analyze, interpret and troubleshoot your submissions:
1. To start the BLAST analysis, go to the BLAST homepage and select “nucleotide blast”.