Have you ever searched the NCBI Protein database and been overwhelmed with the number of sequences returned? Have you tried searching with a protein name, thinking that would greatly limit the results, only to still be presented with many sequences (all with the same name)? It’s a common problem in this time of greatly expanding sequence databases powered by large-scale genomic sequencing of similar organisms. Redundancy in the sequence databases is high and only getting worse.
To address this, in 2013 NCBI released the WP records, which collect identical protein sequences annotated on bacterial genomes. In 2014, NCBI released the Identical Protein Reports on Protein records, which displays information about all other proteins identical to that protein. Now, we are releasing a new resource: Identical Protein Groups (IPG). IPG offers several features:
GenBank release 220.0 (6/18/2017) has 201,663,568 traditional records containing 234,997,362,623 base pairs of sequence data. In addition, there are 487,891,767 WGS records containing 2,164,683,993,369 base pairs of sequence data, 176812130 TSA records containing 158,112,969,073 base pairs of sequence data, and 1,628,475 TLS records containing 824,191,338 base pairs of sequence data.
This blog post is directed toward Assembly users.
A new “Download assemblies” button is now available in the Assembly database. This makes it easy to download data for multiple genomes without having to write scripts.
For example, you can run a search in Assembly and use check boxes (see left side of screenshot below) to refine the set of genome assemblies of interest. Then, just open the “Download assemblies” menu, choose the source database (GenBank or RefSeq), choose the file type, and start the download. An archive file will be saved to your computer that can be expanded into a folder containing your selected genome data files.
GenBank release 219.0 (4/14/2017) has 200,877,884 traditional records containing 231,824,951,552 base pairs of sequence data. In addition, there are 451,840,147 WGS records containing 2,035,032,639,807 base pairs of sequence data, 165,068,542 TSA records containing 149,038,907,599 base pairs of sequence data, as well as 1,438,349 TLS records containing 636,923,295 base pairs of sequence data.
This article is intended for GenBank data submitters with a basic knowledge of BLAST who submit sequence data from protein-coding genes.
One of the most common problems when submitting DNA or RNA sequence data from protein-coding genes to GenBank is failing to add information about the coding region (often abbreviated as CDS) or incorrectly defining the CDS. Incomplete or incorrect CDS information will prevent you from having accession numbers assigned to your submission data set, but there is a procedure that will help you troubleshoot any problems with the CDS feature annotation: doing a BLAST analysis with your sequences before you submit your data.
Here’s how to use nucleotide BLAST (blastn) and the formatting options menu to analyze, interpret and troubleshoot your submissions:
1. To start the BLAST analysis, go to the BLAST homepage and select “nucleotide blast”.
Figure 1. Select “nucleotide blast”.
In the July 19, 2013 issue of the journal Science, an interesting article describes the discovery and characterization of two “giant” viruses that are proposed to comprise the first members of the “Pandoravirus” genus.
Nadege Philippe and co-workers obtained the viruses from sediment samples in Chile and Australia and found that they have no morphological resemblance to any previously defined virus families. The investigators isolated the genomes of these viruses and sequenced them using a variety of NextGen methodologies. They then assembled the reads into contigs and characterized them using various sequence similarity algorithms (including NCBI’s BLAST and CD-Search). Interestingly, while related to each other, the genomes were not similar to those of any other organism or virus. Additionally, 93% of protein-coding sequences had no recognizable homologs.
Submitting sequences to GenBank can seem complicated at first, but starting with a solid foundation in the form of a properly formatted file will make the process go smoothly.
Before submitting sequence data to GenBank, the data must be formatted correctly, the most common file format being FASTA. This post will show you how to create a FASTA file for submitting single- and multiple-nucleotide sequences.
Submitters can upload FASTA-formatted sequence files using NCBI’s stand-alone software Sequin, command line tbl2asn or our web-based submission tool BankIt.
The image below depicts a single sequence in FASTA format. For multiple sequences, such as those of population or phylogenetic studies, environmental samples, and batch sequences of the same gene, create the file using the steps below and put the set of sequences together in a single FASTA file.
Here is how to create the FASTA file: