How To Format Sequence Data For GenBank Submissions


Submitting sequences to GenBank can seem complicated at first, but starting with a solid foundation in the form of a properly formatted file will make the process go smoothly.

Sequence data must be formatted correctly before you submit it to GenBank, and the most common file format is FASTA. This post will show you how to create a FASTA file for submitting a single nucleotide sequence or a set of multiple nucleotide sequences.

Submitters can upload FASTA-formatted sequence files using NCBI’s stand-alone software Sequin, the command-line program tbl2asn, or our web-based submission tool BankIt.

The image below depicts a single sequence in FASTA format. For multiple sequences, such as those of population or phylogenetic studies, environmental samples, and batch sequences of the same gene, create the file using the steps below and put the set of sequences together in a single FASTA file.

[Image: a single sequence in FASTA format]

Here is how to create the FASTA file:

1) We strongly recommend that you use a text editor. If you use a word processing program, you must save the file as plain ASCII text in order to retain the FASTA format.

2) Create a short, unique sequence ID (SeqID) for each sequence. These SeqIDs function as placeholders until GenBank assigns accession numbers to replace them.

The following is an example of a good SeqID: 1234_abc

  • You can also use a unique isolate number, unique clone number, or other simple unique IDs.
  • Please limit the SeqID to 25 characters or fewer. Brackets (“[]”) are not allowed in the SeqID.
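The SeqID rules above are easy to check automatically before you build the file. The following is a minimal sketch, not an NCBI tool: the function names are hypothetical, and the whitespace check is an assumption based on step 3, where a space separates the SeqID from the bracketed source information.

```python
def seqid_problems(seqid):
    """Return a list of problems with a proposed SeqID (empty if none found)."""
    problems = []
    if not seqid:
        problems.append("SeqID is empty")
    if len(seqid) > 25:
        problems.append("SeqID is longer than 25 characters")
    if "[" in seqid or "]" in seqid:
        problems.append("SeqID contains brackets")
    # Assumption: whitespace would end the SeqID early, because a space
    # separates the SeqID from the bracketed source information.
    if any(ch.isspace() for ch in seqid):
        problems.append("SeqID contains whitespace")
    return problems


def duplicate_seqids(seqids):
    """Return the SeqIDs that occur more than once, in first-seen order."""
    seen, dupes = set(), []
    for seqid in seqids:
        if seqid in seen and seqid not in dupes:
            dupes.append(seqid)
        seen.add(seqid)
    return dupes


print(seqid_problems("1234_abc"))                         # no problems found
print(duplicate_seqids(["clone_1", "clone_2", "clone_1"]))  # clone_1 is repeated
```

Running both checks over every SeqID in a batch catches overlong or repeated IDs before the file is uploaded.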

3) Type the greater-than symbol (>) and then the SeqID. Then press the SPACE key on your keyboard. To ensure the FASTA file will be read by Sequin or BankIt, a single space is required before entering the [organism=genus species] information.

Example:

>Seq_123 [organism=Homo sapiens] [isolate=456]

4) Use square brackets around the formatted organism data like this: [organism=Genus species]

Add other source information, such as clone, isolate, breed, and cultivar, in brackets in the same way. A list of additional source modifiers is found here: http://www.ncbi.nlm.nih.gov/Sequin/modifiers.html

5) Add a brief description of the sequence and then press the return or enter key on your keyboard to create a hard return to the next line.
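If you are generating many definition lines, steps 3 through 5 can be scripted. This is a sketch under stated assumptions rather than an official tool: make_defline is a hypothetical helper, and it simply joins the SeqID, the bracketed organism, any other bracketed modifiers, and an optional description with single spaces.

```python
def make_defline(seqid, organism, modifiers=None, description=""):
    """Build one definition line: >SeqID [organism=...] [key=value] description."""
    parts = [">" + seqid, "[organism=" + organism + "]"]
    # Other source modifiers (isolate, clone, breed, cultivar, ...) each get
    # their own pair of square brackets.
    for key, value in (modifiers or {}).items():
        parts.append("[" + key + "=" + str(value) + "]")
    if description:
        parts.append(description)
    # Single spaces between the parts; no space between ">" and the SeqID.
    return " ".join(parts)


print(make_defline("Seq_123", "Homo sapiens", {"isolate": "456"}))
# prints: >Seq_123 [organism=Homo sapiens] [isolate=456]
```

The printed line matches the example in step 3; the hard return at the end of the line is added when the line is written to the file.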

6) Enter the nucleotide sequence and press the return or enter key on your keyboard to create a hard return to the next line.

7) For multiple sequences, repeat steps 2-6 until all sequences for the set are in the file.

8) Save the file as .txt (plain ASCII text).
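The steps above can also be sketched as a short script that writes a set of sequences to one plain-text file. Everything here is illustrative: write_fasta and the file name sequences.txt are hypothetical, the two short sequences are placeholder data, and wrapping sequence lines at 70 characters is an assumption for readability, not a requirement stated in this post.

```python
def write_fasta(path, records, width=70):
    """Write (definition_line, sequence) pairs to a plain ASCII text file."""
    # encoding="ascii" raises UnicodeEncodeError on any non-ASCII character,
    # so curly quotes or similar word-processor residue cannot slip in.
    with open(path, "w", encoding="ascii") as handle:
        for defline, sequence in records:
            handle.write(defline + "\n")
            seq = "".join(sequence.split())  # drop stray spaces and line breaks
            # Assumption: wrap long sequences at `width` characters per line.
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")


# Placeholder records for illustration only.
records = [
    (">Seq_123 [organism=Homo sapiens] [isolate=456]", "ATGGCCATTGTAATGGGCCGC"),
    (">Seq_124 [organism=Homo sapiens] [isolate=457]", "ATGGCCATTGTTATGGGCCGC"),
]
write_fasta("sequences.txt", records)
```

Each record produces one definition line followed by its sequence, so the whole set ends up in a single FASTA file as described in step 7.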

Look for a future Quick Tips blog post on creating a source modifier file for multiple sequences or sequences that have many source modifiers.
