Tag: GenBank

Rapid access to SARS-CoV-2 data from the current public health emergency

As the global health emergency around severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, formerly 2019-nCoV) continues, we still play a key role in providing the biomedical community with free and easy access to genome sequences from the coronavirus. You can quickly access these data through the NCBI search (Figure 1).

Figure 1. NCBI search results for the term “SARS-COV-2” showing the schematic map of the viral assembly and annotation and buttons that link to the data in the NCBI Virus resource, a specialized BLAST page that searches Betacoronavirus sequences, and the reference assembly download. The bottom panel provides links to the CDC website for COVID-19 information and a link to GenBank®/SRA sequence data.

Dengue virus submission improvements now live!

During a dengue fever outbreak, it’s critical that researchers submit viral genomic sequence data and that the data become available for analysis as soon as possible. You can now submit Dengue virus sequences to GenBank using a new workflow (Figure 1) in the Submission Portal designed to speed the release of these data. The streamlined process, similar to the one described in a previous post for animal mitochondrial COX1 sequences, offers an improved interface, enhanced validation, and automatic annotation, all of which save you time and effort.

Figure 1. The Submission Portal pages for targeted sequence submission workflows. Top panel. The new submission page for entering the workflow. Bottom panel. Submission Portal page with the Dengue virus submission option selected (boxed in red).  The service has options for other targeted submissions including mitochondrial COX1 from multicellular animals (metazoa), ribosomal RNA (rRNA), rRNA-ITS, Influenza virus, and Norovirus sequences.

This update is part of a larger and ongoing effort to consolidate GenBank submissions in a central location. In addition to Dengue virus data, you can also submit Influenza A, B, and C virus and Norovirus sequences, as well as other targeted sequences including mitochondrial COX1 genes from multicellular animals (metazoa), ribosomal RNA (rRNA), and rRNA-ITS, through the options on the Submission Portal. You should submit other types of sequence data, including other virus sequences, to GenBank using BankIt or tbl2asn.

You can use the search feature on the Submission Portal to find the appropriate submission tool for your data.

Novel coronavirus complete genome from the Wuhan outbreak now available in GenBank

Updated!

Get rapid access to Wuhan coronavirus (2019-nCoV) sequence data from the current outbreak as it becomes available. We will continue to update the page with newly released data.

The complete annotated genome sequence of the novel coronavirus associated with the outbreak of pneumonia in Wuhan, China is now available from GenBank for free and easy access by the global biomedical community. Figure 1 shows the relationship of the Wuhan virus to selected coronaviruses.

Figure 1. Phylogenetic tree showing the relationship of Wuhan-Hu-1 (circled in red) to selected coronaviruses. Nucleotide alignment was done with MUSCLE 3.8. The phylogenetic tree was estimated with MrBayes 3.2.6 using the GTR+G+I substitution model. The scale bar indicates estimated substitutions per site, and all branch support values are 99.3% or higher.

GenBank release 235

GenBank release 235.0 (12/11/2019) is now available on the NCBI FTP site. This release has 7 trillion bases and 1.74 billion records.

The current release has 215,333,020 traditional records containing 388,417,258,009 base pairs of sequence data. There are also 1,127,023,870 WGS records containing 6,277,551,200,690 base pairs of sequence data, 367,193,844 bulk-oriented TSA records containing 325,433,016,129 base pairs of sequence data, and 28,227,180 bulk-oriented TLS records containing 11,280,596,614 base pairs of sequence data.
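
As a quick arithmetic check, the headline figures follow from summing the four divisions listed above. The short Python sketch below uses only the counts quoted in this post; the dictionary layout and variable names are illustrative, not part of any NCBI tool.

```python
# Record and base-pair counts for GenBank release 235.0, copied from the text above.
# Each entry is (records, base pairs).
release_235 = {
    "traditional": (215_333_020, 388_417_258_009),
    "WGS": (1_127_023_870, 6_277_551_200_690),
    "TSA": (367_193_844, 325_433_016_129),
    "TLS": (28_227_180, 11_280_596_614),
}

# Sum records and bases across all four divisions.
total_records = sum(records for records, _ in release_235.values())
total_bases = sum(bases for _, bases in release_235.values())

print(f"{total_records:,} records (~{total_records / 1e9:.2f} billion)")
print(f"{total_bases:,} bases (~{total_bases / 1e12:.2f} trillion)")
```

The sums come to 1,737,777,914 records and 7,002,682,071,442 bases, which round to the 1.74 billion records and 7 trillion bases quoted above.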

Mitochondrial COX1 submission improvements now live in submission portal!

GenBank submitters, you can now submit mitochondrial COX1 (cytochrome c oxidase subunit I; COI) sequence data from multicellular animals (metazoa) using a new workflow (Figure 1) with an improved interface, enhanced validation, and automatic COX1 CDS feature annotation. Once you have submitted mitochondrial COX1 data using this tool, you’ll have a single, helpful page to reference your submission information: accession number(s), COX1 submission status, relevant files, and more. You can also fix any errors from this page.

Figure 1. Submission Portal page with the mitochondrial COX1 submission option selected (boxed in red).  The service has options for other targeted submissions including ribosomal RNA (rRNA), rRNA-ITS, Influenza virus, and Norovirus sequences.

Feature propagation in BankIt: easily annotate many sequences at once for GenBank submission

Do you need a quick way to annotate features on a set of similar sequences for your GenBank submission? You can now submit sequences from the same region or gene in an alignment format in BankIt and use the new ‘Feature propagation’ option (Figure 1) to apply features from a single sequence to the other aligned sequences. You simply annotate one sequence and then copy that annotation across all the sequences in your submission.

Here’s how you can propagate features in three easy steps:

  1. Provide nucleotide sequences in an alignment format.
  2. Select a sequence and annotate it.
  3. Propagate the features and edit results.

New release of the Prokaryotic Genome Annotation Pipeline with updated tRNAscan and protein models

A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is now available on GitHub. This release uses a new and improved version of tRNAscan (tRNAscan-SE:2.0.4) and includes our most up-to-date Hidden Markov Model and BlastRule collections for naming proteins.

Remember that you can submit the results of PGAP to GenBank. Or, if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors mode to get a preliminary annotation.

See our previous post and our documentation for details on how to set up and run PGAP yourself.

Try PGAP and let us know how you like it!

GenBank release 234 is available

GenBank release 234.0 (10/14/2019) is now available on the NCBI FTP site. This release has 6.69 trillion bases and 1.68 billion records.

The release has 216,763,706 traditional records containing 386,197,018,538 base pairs of sequence data. There are also 1,097,629,174 WGS records containing 5,985,250,251,028 base pairs of sequence data, 342,811,151 bulk-oriented TSA records containing 305,371,891,408 base pairs of sequence data, and 27,460,978 bulk-oriented TLS records containing 10,848,455,369 base pairs of sequence data.
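
The same per-division counts also show how the release is distributed across divisions. The sketch below, which uses only the numbers quoted in this post, totals the release and reports each division's share of the bases; the dictionary layout and the `base_share` helper are illustrative names.

```python
# Record and base-pair counts for GenBank release 234.0, copied from the text above.
# Each entry is (records, base pairs).
release_234 = {
    "traditional": (216_763_706, 386_197_018_538),
    "WGS": (1_097_629_174, 5_985_250_251_028),
    "TSA": (342_811_151, 305_371_891_408),
    "TLS": (27_460_978, 10_848_455_369),
}

total_records = sum(records for records, _ in release_234.values())
total_bases = sum(bases for _, bases in release_234.values())


def base_share(division):
    """Return the fraction of all bases in the release held by one division."""
    return release_234[division][1] / total_bases


print(f"{total_records:,} records, {total_bases:,} bases")
for division in release_234:
    print(f"{division}: {base_share(division):.1%} of bases")
```

The totals are 1,684,665,009 records and 6,687,667,616,343 bases, matching the rounded 1.68 billion and 6.69 trillion above, with WGS accounting for the large majority of the bases.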

GenBank submitters, is your genome assembly within the expected size range?

Validation issues can delay the processing of your submissions to GenBank. To avoid one type of delay, use the new “expected genome size” API to check the length of your genome assembly before submission.

The API compares the size of submitted genome assemblies to the expected genome size range for the species to identify outliers that can result from errors such as:

  • incorrect organism assignment
  • metagenome submitted as an organism genome
  • targeted sub-genome assembly not flagged as partial genome representation
  • gross contamination with other sequences

You can check in advance for these possible problems using the API. The API takes two inputs: the taxid for the species (the Taxonomy ID; see our Taxonomy quick start guide on how to find the taxid for a given species) and the length of your assembly (excluding gaps and runs of Ns). It returns XML with the expected length, the acceptable range, and a status that tells you whether your assembly is too large, too small, or within the acceptable range. Look for <length_status>within_range</length_status>, which confirms that your assembly passes the test!

Try the following examples:

https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=1773&length=4.41M
https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=562&length=7221235
https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=5476&length=5.72M
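
If you would rather script the check, here is a minimal Python sketch using only the standard library. It builds the same kind of URL as the examples above and pulls the length_status value out of the returned XML. Several things here are assumptions rather than documented behavior: the function names are made up for illustration, the parser relies only on the length_status element mentioned above and assumes it carries no XML namespace, and `ungapped_length` simply skips every N and "-" character, which may differ slightly from how NCBI counts gaps and runs of Ns.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URLs above.
BASE_URL = "https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size"


def ungapped_length(sequences):
    """Count bases across all sequences, skipping N/n and '-' characters.

    Assumption: every N is treated as gap sequence, a simplification of
    "excluding gaps and runs of Ns".
    """
    return sum(1 for seq in sequences for base in seq if base not in "Nn-")


def build_url(species_taxid, length):
    """Build the request URL for a species taxid and an assembly length."""
    query = urlencode({"species_taxid": species_taxid, "length": length})
    return f"{BASE_URL}?{query}"


def parse_length_status(xml_text):
    """Return the text of the first <length_status> element, or None if absent."""
    root = ET.fromstring(xml_text)
    element = next(root.iter("length_status"), None)
    if element is None or element.text is None:
        return None
    return element.text.strip()


def check_genome_size(species_taxid, length, timeout=30):
    """Query the API over the network and return the reported length status."""
    with urlopen(build_url(species_taxid, length), timeout=timeout) as response:
        return parse_length_status(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Reproduces the second example URL above without making a network call.
    print(build_url(562, 7221235))
```

Calling `check_genome_size(562, 7221235)` would send the second example request and return whatever status the service reports; a value of `within_range` means the assembly passes.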

For more information, see the Genome Size Check documentation.

New release of the Prokaryotic Genome Annotation Pipeline now available

We have released a new version of the Prokaryotic Genome Annotation Pipeline (PGAP), available on GitHub. The new release includes the ability to ignore pre-annotation validation errors (--ignore-all-errors). This new feature allows you to produce a preliminary annotation for a draft version of the genome, even one that contains vector and adapter sequences or that is outside of the size range for the species. This draft annotation should help with your ongoing work on the genome assembly. Please keep in mind that these pre-annotations and assemblies with contaminants or other errors are not suitable for submission to GenBank.

Another new feature allows you to provide the name of the consortium that generated the assembly and annotation so that this information appears in the final GenBank records. For more details, consult our guidelines on input files.

See our previous post and our documentation for details on how to obtain and run PGAP yourself.

Next on our to-do list is a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned!