Tag: GenBank

Introducing Pebblescout: Index and Search Petabyte-Scale Sequence Resources Faster than Ever

NCBI is excited to introduce Pebblescout, a pilot web service that lets you search for sequence matches in very large nucleotide databases, such as runs in the NIH Sequence Read Archive (SRA) and assemblies for whole genome shotgun sequencing projects in GenBank, faster and more efficiently!

Pebblescout uses short segments of your query sequences to identify database records with matches. Matches are based on the frequency of a segment’s occurrence in a database. For each query, the result is a list of matching records ranked by the informativeness of the matching segments.
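To make the idea of ranking by segment informativeness concrete, here is a toy sketch in Python. It is not Pebblescout's algorithm or index format, which this post does not describe. The segment length `K`, the function names, the sample records, and the inverse-frequency weighting are all assumptions chosen for illustration: a segment found in few records counts for more than one found in many.

```python
import math
from collections import defaultdict

K = 5  # hypothetical segment length, kept small for the toy data below


def segments(sequence: str, k: int = K) -> set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(records: dict[str, str]) -> dict[str, set[str]]:
    """Map each segment to the IDs of the records that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for record_id, sequence in records.items():
        for segment in segments(sequence):
            index[segment].add(record_id)
    return index


def search(query: str, index: dict[str, set[str]], record_count: int):
    """Score records by the query segments they share, weighting rare segments higher."""
    scores: dict[str, float] = defaultdict(float)
    for segment in segments(query):
        hits = index.get(segment, ())
        if not hits:
            continue
        # Assumed weighting: the fewer records contain a segment, the more it counts.
        weight = math.log(1 + record_count / len(hits))
        for record_id in hits:
            scores[record_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up records, not real database entries.
records = {
    "rec1": "ACGTACGTTAGC",
    "rec2": "TTTTACGTAAAA",
    "rec3": "GGGGCCCCGGGG",
}
index = build_index(records)
results = search("TACGTTAGC", index, len(records))
for record_id, score in results:
    print(record_id, round(score, 3))
```

In this toy run, `rec1` shares every query segment and ranks first, `rec2` shares a single segment that also occurs in `rec1` and ranks second, and `rec3` shares none and is not listed.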

GenBank Release 257.0 is Available!

GenBank release 257.0 (8/15/2023) is now available on the NCBI FTP site. This release has 25.10 trillion bases and 3.69 billion records.

The current release has:

  • 246,119,175 traditional records containing 2,112,058,517,945 base pairs of sequence data
  • 2,631,493,489 WGS records containing 22,294,446,104,543 base pairs of sequence data
  • 686,271,945 bulk-oriented TSA records containing 646,176,166,908 base pairs of sequence data
  • 124,421,006 bulk-oriented TLS records containing 48,289,699,026 base pairs of sequence data

During the 59 days between the close dates for GenBank Releases 256.0 and 257.0, the traditional portion of GenBank grew by 145,578,541,799 base pairs and by 2,558,312 sequence records. We updated 34,840 records during that same period. We added and/or updated an average of 43,952 traditional records per day!
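The headline figures for release 257.0 can be reproduced from the numbers quoted above. The short Python check below sums the four divisions and recomputes the daily average. The variable names are illustrative, and every value is copied from this post.

```python
# Figures copied from the GenBank 257.0 announcement: (records, base pairs).
divisions = {
    "traditional": (246_119_175, 2_112_058_517_945),
    "WGS": (2_631_493_489, 22_294_446_104_543),
    "TSA": (686_271_945, 646_176_166_908),
    "TLS": (124_421_006, 48_289_699_026),
}

total_records = sum(records for records, _ in divisions.values())
total_bases = sum(bases for _, bases in divisions.values())

# Traditional-division activity between the 256.0 and 257.0 close dates.
days_between_releases = 59
records_added = 2_558_312
records_updated = 34_840
records_per_day = (records_added + records_updated) / days_between_releases

print(f"{total_records / 1e9:.2f} billion records")
print(f"{total_bases / 1e12:.2f} trillion bases")
print(f"{records_per_day:,.0f} traditional records added or updated per day")
```

The totals come to about 3.69 billion records and 25.10 trillion bases, and the daily average rounds to 43,952, matching the figures stated in the release announcement.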

Using Average Nucleotide Identity (ANI) to Expose Potentially Problematic Taxonomic Merges

Help us improve our microbial taxonomy

NCBI uses Average Nucleotide Identity (ANI) to evaluate the taxonomic classification of prokaryotic genomes submitted to GenBank. As part of this effort, we identified heterotypic synonyms that fail to match each other with high ANI, and we invite you to help us evaluate these cases.

What is Heterotypic Synonymy?

Heterotypic synonymy refers to two or more names for different taxa (such as species) that were described independently but have been subsequently merged into a single taxon. The merged taxon will generally be referred to by the oldest name.

table2asn: An Updated, More Powerful Command-Line Program

As part of our ongoing effort to enhance your experience, NCBI is excited to promote table2asn, an updated, more powerful command-line program that creates sequence records for submission to GenBank. table2asn replaces the older, now-obsolete tool tbl2asn; it operates very similarly and adds new features. The program is used most frequently to create annotated eukaryotic or prokaryotic genome files for submission and was released several years ago as table2asn_gff.

Important Note: The older tbl2asn is no longer available for download from the FTP site, so please download table2asn to get the newest version of this powerful tool. Effective June 1, 2024, we will no longer accept tbl2asn-created genome submissions. 

What’s new?

table2asn has added functionality compared to tbl2asn!

GenBank Release 256.0 is Available!

GenBank release 256.0 (6/21/2023) is now available on the NCBI FTP site. This release has 24.45 trillion bases and 3.66 billion records.

The current release has:

  • 243,560,863 traditional records containing 1,966,479,976,146 base pairs of sequence data
  • 2,611,654,455 WGS records containing 21,791,125,594,114 base pairs of sequence data
  • 683,922,756 bulk-oriented TSA records containing 643,127,590,034 base pairs of sequence data
  • 122,798,571 bulk-oriented TLS records containing 47,302,831,210 base pairs of sequence data

Growth between releases

During the 63 days between the close dates for GenBank Releases 255.0 and 256.0, the traditional portion of GenBank grew by 139,733,657,333 base pairs and by 1,005,927 sequence records. We updated 107,417 records during that same period. We added and/or updated an average of 17,672 traditional records per day!

NCBI Virus: Mutation-Based Search for SARS-CoV-2 Data

Millions of SARS-CoV-2 samples from around the world have been made publicly available as assembled and unassembled sequence data in GenBank and the Sequence Read Archive (SRA). Now you can find sequences with a particular mutation by searching with the protein and the amino acid change (e.g., S:F486V). Visit our SARS-CoV-2 Variants Overview on NCBI Virus and click on the Mutations tab to get started (Figure 1).

Figure 1: SARS-CoV-2 Variants Overview. Arrows indicate important features on the page, including the “Lineages” and “Mutations” tabs to switch between views, the search box, and the information box describing the mutation format. The results are also indicated, including a summary of the total records found that contain the searched term as well as the results table.
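A search term such as S:F486V combines a protein name with an amino acid change. The Python sketch below splits a term of that shape into its parts. The pattern is an assumption based only on the single example in this post (protein, colon, reference residue, position, new residue); it does not cover deletions, insertions, or any other syntax the NCBI Virus search may accept. The `Mutation` type and `parse_mutation` function are hypothetical names, not part of any NCBI tool.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    protein: str
    reference: str  # residue in the reference sequence
    position: int   # 1-based position in the protein
    alternate: str  # residue observed in the variant


# Assumed shape: <protein>:<reference residue><position><alternate residue>
MUTATION_PATTERN = re.compile(
    r"^(?P<protein>[A-Za-z0-9]+):(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z])$"
)


def parse_mutation(term: str) -> Mutation:
    """Split a simple substitution term such as 'S:F486V' into its parts."""
    match = MUTATION_PATTERN.match(term.strip())
    if match is None:
        raise ValueError(f"unrecognized mutation term: {term!r}")
    return Mutation(match["protein"], match["ref"], int(match["pos"]), match["alt"])


parsed = parse_mutation("S:F486V")
print(parsed)
```

For S:F486V, this yields protein S, reference residue F, position 486, and alternate residue V.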

Now Available! Access Data from the Human Pangenome Research Consortium (HPRC) at NCBI

Have you ever wondered how your genetic make-up is different from your neighbor’s? The National Human Genome Research Institute (NHGRI)-funded Human Pangenome Research Consortium (HPRC) has built an initial version of a pangenome reference – a collection of new human reference genome sequences representing 47 individuals from across the globe. Pangenome graphs relate the sequences from the different genomes to one another. The pangenome allows researchers to compare these DNA sequences and get a more detailed view of the range of human genetic variation. This is the first step toward the HPRC’s goal of building a pangenome reference composed of the genomes of 350 individuals from diverse genetic backgrounds.

Revolutionize your research with the NIH Comparative Genomics Resource (CGR)

Unlock the full potential of eukaryotic research organisms and their genomic data with the National Institutes of Health (NIH) Comparative Genomics Resource (CGR). CGR facilitates reliable comparative genomics analyses through community collaboration as well as an NCBI toolkit of interconnected, interoperable data and tools.   

Comparative genomics is a field of study that uses the genomes of many different organisms to help us understand basic biological processes and human disease. NCBI is developing CGR to help researchers take full advantage of the rapidly growing number of eukaryotic organisms that, due to recent technological advances, now have sequenced genomes and associated data that can be used in these types of studies. Its NCBI toolkit offers new and modern resources for such analyses, and its emphasis on community collaboration brings new opportunities to share and connect data.

GenBank Release 255.0 is Available!

GenBank release 255.0 (4/21/2023) is now available on the NCBI FTP site. This release has 23.44 trillion bases and 3.48 billion records.

The current release has:

  • 242,554,936 traditional records containing 1,826,746,318,813 base pairs of sequence data
  • 2,440,470,464 WGS records containing 20,926,504,760,221 base pairs of sequence data
  • 678,332,682 bulk-oriented TSA records containing 636,291,358,227 base pairs of sequence data
  • 121,186,672 bulk-oriented TLS records containing 46,567,924,833 base pairs of sequence data

Growth between releases

During the 58 days between the close dates for GenBank Releases 254.0 and 255.0, the traditional portion of GenBank grew by 95,444,070,395 base pairs and by 724,301 sequence records. We updated 172,014 records during that same period. We added and/or updated an average of 15,453 traditional records per day!

Coming Soon! Including Sample Location and Collection Date and Time for Sequences Submitted to GenBank and SRA

As previously announced, in collaboration with our partners at the International Nucleotide Sequence Database Collaboration (INSDC), we will begin to systematically gather ‘location of collection’ and ‘date and time of collection’ for sequence data submitted to GenBank and the Sequence Read Archive (SRA). Gathering information about where and when a biological sample was collected aligns with other global sequence submission standardization efforts and will increase the utility of data made available through GenBank and SRA. These changes will be implemented in a phased approach through December 2024.

What’s new?

Sequence data submitted to GenBank and the SRA will need to include information about location and date and time of sample collection. These metadata will be entered using the pre-existing fields ‘country’ and ‘collection_date.’ Minimum information for these fields is described below. We encourage submitters to provide additional details when available.