The 2018 Nucleic Acids Research database issue features several papers from NCBI staff that cover the status and future of databases including CCDS, ClinVar, GenBank and RefSeq. These papers are also available on PubMed. To read an article, click on the PMID listed below.
GenBank release 223.0 (12/15/2017) has 206,293,625 traditional records (including non-bulk-oriented TSA) containing 249,722,163,594 base pairs of sequence data. In addition, there are 551,063,065 WGS records containing 2,466,098,053,327 base pairs of sequence data, 201,559,502 TSA records containing 181,394,660,188 base pairs of sequence data, and 12,695,198 TLS records containing 4,458,042,616 base pairs of sequence data.
NCBI now offers a flu sequence submission wizard that makes submissions easier and will provide you with accession numbers sooner. To get started, sign in to NCBI, go to the Submission Portal and choose the link for “Ribosomal RNA (rRNA), rRNA-ITS or Influenza sequences” from the GenBank section.
GenBank release 222.0 (10/14/2017) has 203,953,682 traditional records (including non-bulk-oriented TSA) containing 244,914,705,468 base pairs of sequence data. In addition, there are 508,825,331 WGS records containing 2,318,156,361,999 base pairs of sequence data, 192,754,804 TSA records containing 172,909,268,535 base pairs of sequence data, and 9,479,460 TLS records containing 2,993,818,315 base pairs of sequence data.
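The base-pair counts quoted for releases 222.0 and 223.0 can be totaled and compared with a few lines of Python. This is only a sketch: the dictionary layout and the function names are illustrative, and summing the four categories assumes that they do not overlap.

```python
# Base-pair counts copied from the release 222.0 and 223.0 announcements.
RELEASES = {
    "222.0": {
        "traditional": 244_914_705_468,
        "WGS": 2_318_156_361_999,
        "TSA": 172_909_268_535,
        "TLS": 2_993_818_315,
    },
    "223.0": {
        "traditional": 249_722_163_594,
        "WGS": 2_466_098_053_327,
        "TSA": 181_394_660_188,
        "TLS": 4_458_042_616,
    },
}


def total_base_pairs(release):
    """Sum the base pairs across all categories of one release."""
    return sum(RELEASES[release].values())


def growth_percent(old, new):
    """Per-category percentage growth in base pairs from release old to new."""
    return {
        category: 100.0 * (RELEASES[new][category] - RELEASES[old][category])
        / RELEASES[old][category]
        for category in RELEASES[new]
    }


if __name__ == "__main__":
    for name in sorted(RELEASES):
        print(f"Release {name}: {total_base_pairs(name):,} base pairs in total")
    for category, pct in growth_percent("222.0", "223.0").items():
        print(f"{category}: {pct:.1f}% growth")
```

The same pattern extends to earlier releases by adding their figures to the dictionary.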
GenBank release 221.0 (8/13/2017) has 203,180,606 traditional records containing 240,343,378,258 base pairs of sequence data. In addition, there are 499,965,722 WGS records containing 2,242,294,609,510 base pairs of sequence data, 186,777,106 TSA records containing 167,045,663,417 base pairs of sequence data, and 1,628,475 TLS records containing 824,191,338 base pairs of sequence data.
Have you ever searched the NCBI Protein database and been overwhelmed with the number of sequences returned? Have you tried searching with a protein name, thinking that would greatly limit the results, only to still be presented with many sequences (all with the same name)? It’s a common problem in this time of greatly expanding sequence databases powered by large-scale genomic sequencing of similar organisms. Redundancy in the sequence databases is high and only getting worse.
To address this, in 2013 NCBI released the WP records, which collect identical protein sequences annotated on bacterial genomes. In 2014, NCBI released the Identical Protein Reports on Protein records, which display information about all other proteins identical to the protein in a given record. Now, we are releasing a new resource: Identical Protein Groups (IPG). IPG offers several features:
GenBank release 220.0 (6/18/2017) has 201,663,568 traditional records containing 234,997,362,623 base pairs of sequence data. In addition, there are 487,891,767 WGS records containing 2,164,683,993,369 base pairs of sequence data, 176,812,130 TSA records containing 158,112,969,073 base pairs of sequence data, and 1,628,475 TLS records containing 824,191,338 base pairs of sequence data.
This blog post is directed toward Assembly users.
A new “Download assemblies” button is now available in the Assembly database. This makes it easy to download data for multiple genomes without having to write scripts.
For example, you can run a search in Assembly and use check boxes (see left side of screenshot below) to refine the set of genome assemblies of interest. Then, just open the “Download assemblies” menu, choose the source database (GenBank or RefSeq), choose the file type, and start the download. An archive file will be saved to your computer that can be expanded into a folder containing your selected genome data files.
GenBank release 219.0 (4/14/2017) has 200,877,884 traditional records containing 231,824,951,552 base pairs of sequence data. In addition, there are 451,840,147 WGS records containing 2,035,032,639,807 base pairs of sequence data, 165,068,542 TSA records containing 149,038,907,599 base pairs of sequence data, as well as 1,438,349 TLS records containing 636,923,295 base pairs of sequence data.
This article is intended for GenBank data submitters with a basic knowledge of BLAST who submit sequence data from protein-coding genes.
One of the most common problems when submitting DNA or RNA sequence data from protein-coding genes to GenBank is failing to add information about the coding region (often abbreviated as CDS) or defining the CDS incorrectly. Incomplete or incorrect CDS information will prevent accession numbers from being assigned to your submission data set. One procedure can help you troubleshoot problems with the CDS feature annotation: running a BLAST analysis on your sequences before you submit your data.
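A quick local check can catch some of the simplest CDS problems before you turn to BLAST. The following is a minimal sketch in Python, not part of any NCBI submission tool. The function name `check_cds` and the specific checks are illustrative assumptions, and it handles only a single-interval CDS on the plus strand using the stop codons of the standard genetic code. A legitimately partial CDS will also trigger the start and stop warnings.

```python
# Stop codons of the standard genetic code only (an assumption; other
# genetic codes use different stop codons).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(sequence, start, end):
    """Return a list of problems for a plus-strand CDS.

    start and end are 1-based, inclusive coordinates on sequence.
    An empty list means none of these simple checks failed.
    """
    if not (1 <= start <= end <= len(sequence)):
        return ["CDS coordinates fall outside the sequence"]

    cds = sequence[start - 1:end].upper()
    problems = []

    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]

    if codons and codons[0] != "ATG":
        problems.append("CDS does not begin with ATG")
    if codons and codons[-1] not in STOP_CODONS:
        problems.append("CDS does not end with a stop codon")

    # Any stop codon before the final codon interrupts the reading frame.
    internal = [i + 1 for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    if internal:
        problems.append(f"internal stop codon at codon position(s) {internal}")

    return problems


if __name__ == "__main__":
    example = "GGATGAAATAGCC"  # made-up sequence for illustration
    print(check_cds(example, 3, 11))
    print(check_cds(example, 2, 11))
```

A check like this only confirms that the interval looks like an open reading frame. It cannot tell you whether the CDS is the right one, which is what the BLAST comparison against known sequences is for.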
Here’s how to use nucleotide BLAST (blastn) and the formatting options menu to analyze, interpret and troubleshoot your submissions:
1. To start the BLAST analysis, go to the BLAST homepage and select “nucleotide blast”.