This article is intended for GenBank data submitters with a basic knowledge of BLAST who submit sequence data from protein-coding genes.
One of the most common problems when submitting DNA or RNA sequence data from protein-coding genes to GenBank is omitting information about the coding region (often abbreviated as CDS) or defining the CDS incorrectly. Incomplete or incorrect CDS information will prevent accession numbers from being assigned to your submission data set. You can troubleshoot problems with the CDS feature annotation by running a BLAST analysis on your sequences before you submit your data.
NOTE: We have changed BLAST search results displays since publishing this blog. For updated guidance on using Nucleotide BLAST (blastn) to help you troubleshoot coding region annotation, see the articles in the NCBI Support Center.
Here’s how to use nucleotide BLAST (blastn) and the formatting options menu to analyze, interpret and troubleshoot your submissions:
1. To start the BLAST analysis, go to the BLAST homepage and select “nucleotide blast”, then enter your sequence and run the search.
2. Once the nucleotide BLAST results are available, open the Formatting options menu and check the CDS feature box.
3. If you want to see only the differences between your sequence and records in the blastn database, change “Pairwise” to “Pairwise with dots for identities”.
4. Click the blue “Reformat” button for the new settings to take effect.
When you look at your BLAST results, you will see how your sequence aligns (over its entire length or in part) to records in the blastn default NR database. You need to know how long your sequence data are and carefully look at the 5′ and 3′ ends to see if alignments go through the entire length of your sequence data.
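To make the end-of-sequence check concrete, here is a minimal Python sketch. The function name and the example numbers are hypothetical; the start and end positions are the ones you would read off the “Query” row of an alignment.

```python
def unaligned_ends(query_length, query_start, query_end):
    """Return (nt unaligned at the 5' end, nt unaligned at the 3' end).

    Positions are 1-based and inclusive, as in the BLAST alignment display.
    """
    if not 1 <= query_start <= query_end <= query_length:
        raise ValueError("expected 1 <= query_start <= query_end <= query_length")
    return query_start - 1, query_length - query_end


# Hypothetical example: a 700 nt sequence whose alignment covers nt 13 to 700.
five_prime, three_prime = unaligned_ends(700, 13, 700)
print(f"Unaligned at 5' end: {five_prime} nt; at 3' end: {three_prime} nt")
```

A nonzero value at either end means the alignment stops short of your full sequence, so that end deserves a closer look.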
With the CDS feature box checked, the displayed alignments expand from pairs of two rows to groups of four rows.
The middle two rows (“Query” and “Sbjct”) are the nucleotide sequence data, the bottom row is the CDS feature from the BLAST NR database record that matches your sequence, and the top row is an on-the-fly translation of your sequence using the Standard Genetic Code. (Blastn will always use the Standard Genetic Code. With blastx, you can specify the appropriate genetic code.) Since this sequence is from a mitochondrial gene, the CDS in the database record has W for tryptophan (highlighted in pink in Figure 3) where the translation of your sequence has *. Mitochondrial genetic codes read the TGA codon as tryptophan, whereas the Standard Genetic Code reads it as a stop codon. This is not a problem; it simply illustrates the different genetic codes being used for translation.
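The W versus * difference is easy to reproduce outside BLAST. The sketch below is an illustration, not part of BLAST itself: it builds the Standard Genetic Code and the Vertebrate Mitochondrial Code from their amino acid strings and translates the same made-up fragment with both. The fragment and the function names are assumptions chosen for the example.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO_AAS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def codon_table(amino_acids):
    """Map each codon (in TCAG order) to its one-letter amino acid."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


STANDARD = codon_table(STANDARD_AAS)
VERT_MITO = codon_table(VERT_MITO_AAS)


def translate(seq, table):
    """Translate frame 1 of seq; unknown codons become X, a partial last codon is dropped."""
    seq = seq.upper().replace("U", "T")
    return "".join(table.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


# Made-up fragment containing an in-frame TGA codon.
fragment = "ATGGCATGACTATAA"
print(translate(fragment, STANDARD))   # MA*L*
print(translate(fragment, VERT_MITO))  # MAWL*
```

The only difference between the two translations is the TGA codon, which is the same pattern you see when a mitochondrial CDS is compared with the blastn on-the-fly translation.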
With the CDS feature box checked, you can determine the nt number range for the coding region that is present in your sequence. In the previous image, the coding region starts at nt 67 with the Met ATG start codon. Not shown is the 3′ end, where * indicates the termination, or stop, codon.
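If it helps to see how such a range is counted, the following sketch reports the 1-based nucleotide range from the first ATG to the first in-frame stop codon. It is a simplification under stated assumptions: the sequence is invented (66 nt of filler so that the ATG falls at nt 67), the function name is hypothetical, and the first ATG is not always the true start codon, so treat the result as a candidate to compare against the BLAST alignment rather than as the answer.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons in the Standard Genetic Code


def cds_range(seq):
    """Return the 1-based (start, end) from the first ATG to the first in-frame stop.

    Returns None if there is no ATG, and (start, None) if no in-frame stop
    is found, which would suggest a CDS that is partial at the 3' end.
    """
    seq = seq.upper()
    start = seq.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return start + 1, i + 3
    return start + 1, None


# Invented example: 66 nt upstream, then ATG GCA TTG TAA, then 3 trailing nt.
example = "C" * 66 + "ATGGCATTGTAA" + "GGG"
print(cds_range(example))  # (67, 78)
```

The end coordinate includes the stop codon, which matches how a complete CDS feature range is normally given.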
If your sequence cannot be translated into a protein sequence that matches a blastn NR database record, you will see consecutive, staggered pink letters, with the letters representing amino acids not aligned on top of each other. The dash (“-”) in the Query row indicates a missing base in your sequence, which shifts the reading frame and results in the wrong protein translation from the nucleotide sequence.
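A small sketch shows why a single missing base has such a large effect. The sequence below is invented and the helper names are hypothetical; the translation uses the Standard Genetic Code, as blastn does.

```python
from itertools import product

# Standard Genetic Code, amino acids listed for codons in TCAG order.
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = dict(zip(("".join(c) for c in product("TCAG", repeat=3)), AAS))


def translate(seq):
    """Translate frame 1 of seq; a partial codon at the end is dropped."""
    seq = seq.upper()
    return "".join(STANDARD.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


intact = "ATGGCATTGAAACCCGGGTAA"       # ATG GCA TTG AAA CCC GGG TAA
missing_base = intact[:4] + intact[5:]  # same sequence with the fifth base deleted

print(translate(intact))        # MALKPG*
print(translate(missing_base))  # MD*NPG
```

Every codon after the deletion is read in the wrong frame, and in this example the shift also creates a premature stop codon.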
Another common problem when adding the CDS feature is designating the wrong strand. Sequence data can be submitted from the strand opposite to the one used for transcription and translation in the biological cell, but the CDS feature must match the strand of the nucleotide sequence being submitted.
Several clues in the blastn alignments can indicate that your sequence is from the opposite strand. Blastn always marks your sequence as the “Plus” strand. Look at the “Sbjct” nt and aa numbers: do they start with larger numbers and decrease toward the end of the alignment? Look at where the amino acid M is present. Does an ATG codon correspond to the M (plus strand) or does a CAT codon correspond to the M amino acid (which would indicate that the sequence data are from the “opposite” strand)?
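The ATG versus CAT clue follows directly from reverse complementation, as this short sketch illustrates. The coding sequence is invented and the function name is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# The start codon read from the opposite strand.
print(reverse_complement("ATG"))  # CAT

# Invented CDS (ATG GCT TTG TAA) as it would appear if submitted from the minus strand.
plus_strand = "ATGGCTTTGTAA"
minus_strand = reverse_complement(plus_strand)
print(minus_strand)  # TTACAAAGCCAT
```

In the minus-strand version the sequence begins with TTA (the reverse complement of the TAA stop codon) and ends with CAT, which is why a CAT aligned to M points to the opposite strand.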
If your sequence is from a eukaryotic genome, look at the beginning and end of the intronic segments (indicated with “~”). Does the intron start with “GT” and end with “AG”, or does the intron have “CT” and “AC” at the beginning and end? If the introns have “CT” and “AC”, this indicates that your sequence is from the “opposite” or “minus” strand. To add the CDS feature for this situation, make sure that you mark it as being from the “Minus” strand.
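The intron check can be written down as a simple rule. This sketch assumes canonical GT...AG introns only; the function name and the example intron are hypothetical, and non-canonical splice sites do occur, so an “unclear” result is not proof of an error.

```python
def intron_strand(intron):
    """Guess the strand from the first and last two bases of an intron.

    Returns "plus" for GT...AG, "minus" for CT...AC (the reverse
    complement of GT...AG), and "unclear" for anything else.
    """
    s = intron.upper()
    if s.startswith("GT") and s.endswith("AG"):
        return "plus"
    if s.startswith("CT") and s.endswith("AC"):
        return "minus"
    return "unclear"


# Invented intron and its reverse complement.
print(intron_strand("GTAAGTCCTTTTCAG"))  # plus
print(intron_strand("CTGAAAAGGACTTAC"))  # minus
```

As with the other clues, check several introns and several alignments before deciding which strand to mark for the CDS feature.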
It is important to look at more than one alignment with a single NR database record; base any conclusions on multiple, consistent comparisons with your sequence. The GenBank Submission staff will not assign accession numbers to sequences with translation problems. Sometimes the 5′ and 3′ ends contain short frameshifts. The GenBank staff will recognize that the few affected amino acids are not the correct protein sequence, even when the frameshift(s) introduce no premature stop codons.
One more tip: If you want to run this search more quickly and focus on the most useful records (those with coding sequence features), you can add an Entrez query before running the blastn program to limit the BLAST NR database records to those with a CDS feature. You can also limit the retrieved records to newer NR database records by adding a date restriction to the Entrez query. The latter restriction helps because annotation styles and standards change, and newer records are better examples of how to add feature annotation for your sequence.
Once you have blastn results, you can see how the matching NR database accession number is annotated. Click on the “GenBank” link in the “Range” row to see the feature annotation for the nt range that matches your sequence.
If you have sequencing differences that result in reading frameshifts or incorrect protein translation, you need to check the original sequencing trace reads to determine if the sequence differences are real. Some sequencing technologies are less accurate at the beginning or end of a sequence, so you may trim those inaccurate ends. You may also need to re-sequence some of the samples or remove them from your data set before submission.
Though this blog post focuses on blastn and the CDS feature box, you can also use blastx (translating a nucleotide sequence and seeing which protein NR database records match the translation) to help in troubleshooting your sequence data submission.
More details about using BLAST to troubleshoot your GenBank submission are available in two webinars:
- The first webinar, Troubleshooting GenBank Submissions, Coding Region Annotations, focused on intronless protein-coding gene data submissions.
- The second webinar, Troubleshooting GenBank Submissions, Eukaryotic CDS Annotation, focused on coding region feature annotation for multi-part CDS features.
  - Part 1 introduces troubleshooting eukaryotic gene data GenBank submissions. Most eukaryotic genes contain introns that complicate describing the CDS.
  - Part 2 describes corresponding GenBank display records and breaks down the Features section.
  - Part 3 explains how BLAST can be used to identify CDS sequences within your nucleotide sequence, with three different data set examples.
  - Part 4 outlines the steps to add CDS information in GenBank BankIt and Sequin submission tools for both individual features and with Feature Table files. Criteria to determine which submission tool to use are included. Finally, we offer additional data set troubleshooting suggestions.
If you have difficulty accessing YouTube, the videos can be found in MP4 format via FTP.