Author: NCBI Staff

December 2 Webinar: Using the new Read assembly and Annotation Pipeline Tool (RAPT) to assemble and annotate microbial genomes

Join us December 2 to learn how to use the Read assembly and Annotation Pipeline Tool (RAPT). With RAPT, you can assemble and annotate a microbial genome right out of the sequencing machine! Provide short genomic reads or an SRA run as input, and get back the sequence annotated with a complete gene set. The assembly is built with SKESA and annotated with PGAP. RAPT also verifies the taxonomic assignment of the genome with the Average Nucleotide Identity tool. In this webinar, you will learn how to run RAPT on your own machine or on the Google Cloud Platform.

  • Date and time: Wed, December 2, 2020 12:00 PM – 12:45 PM EST
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

Search NCBI’s Pathogen Detection websites with simple keywords

We’ve redesigned the filters on NCBI’s Pathogen Detection websites to make searching easier!

For example, say you wanted to search for outbreak isolates related to flour. Before the filters were redesigned, you’d have to know that some of the available metadata terms include “flour”, “All-purpose Wheat flour”, and “wheat flour”, along with seventeen other terms. Now, you can see all of your available options after typing in your search term and select only those that are relevant to your search.

Figure 1. Isolates Browser. Use the “Filters” button to see and search all filters.

BLAST+ 2.11.0 now available with limited usage reporting to help improve BLAST

The BLAST+ 2.11.0 release is now available from our FTP site. With this release, BLAST+ provides usage reports to NCBI to help us improve BLAST. This information is limited to the name of the BLAST program, some basic database metadata, a few BLAST parameters, as well as the number and total size of your queries (Figure 1).

Figure 1. An example of the report sent back to NCBI from the 2.11.0 BLAST programs.

RefSeq Release 203 now available

RefSeq release 203 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of November 2, 2020, and contains 256,340,911 records, including 186,482,096 proteins, 34,176,314 RNAs, and sequences from 105,349 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.
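For readers who want to reach RefSeq records through E-utilities, here is a minimal sketch of building an ESearch request URL. The search term is only an illustrative assumption (a RefSeq filter combined with an organism name), the helper name is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for the given Entrez database and query term.

    This helper is hypothetical; it only formats the URL and does not
    contact NCBI.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query (an assumption, not taken from the release notes):
# RefSeq nucleotide records for mouse.
url = build_esearch_url("nuccore", "srcdb_refseq[PROP] AND Mus musculus[ORGN]")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; NCBI's usage guidelines for E-utilities apply to real requests.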

Other announcements: 

RefSeq annotation of mouse GRCm39
RefSeq has finished its initial annotation of the new mouse reference assembly, GRCm39, recently released by the Genome Reference Consortium. This is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38, resolving over 400 issues, almost doubling the scaffold N50, closing almost half the gaps, and adding 1.9 Mb of sequence.

The annotation report for annotation release 109 is available here.

The annotation products are available in the sequence databases and on the FTP site.

New eukaryotic genome annotations
In addition to mouse (GRCm39), this release contains new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 27 species, including:

  • Pallas’s mastiff bat annotation release 100, based on the assembly mMolMol1.p (GCF_014108415.1)
  • Myotis myotis bat annotation release 100, based on the assembly mMyoMyo1.p (GCF_014108235.1)
  • southern grasshopper mouse annotation release 100, based on the new assembly mOncTor1.1 (GCF_903995425.1)
  • American pika annotation release 102, based on the new assembly OchPri4.0 (GCF_014633375.1)
  • pharaoh ant annotation release 102, based on the new assembly ASM1337386v2 (GCF_013373865.1)
  • olive fruit fly annotation release 101, based on the assembly MU_Boleae_v2 (GCF_001188975.3)

Updated human genome Annotation Release 105.20201022 (GRCh37.p13)
Annotation Release 105.20201022 is an annotation update for the previous human reference assembly, GRCh37.p13 (hg19). This update is not part of the RefSeq FTP release, but the annotation products are available in the sequence databases and on the genomes FTP site.

COVID-19 related human gene annotation now in NCBI RefSeq and Gene
The RefSeq group has compiled a set of human genes with roles in coronavirus infection and disease. You can now see and search for these genes and their regulatory elements in NCBI Gene and RefSeq.

Matched Annotation by NCBI and EMBL-EBI (MANE) version 0.92
NCBI RefSeq and Ensembl/GENCODE announced MANE v0.92, which covers 16,865 genes or ~88% of known human protein-coding genes.

NCBI Datasets

NCBI Datasets now provides downloads of gene data for more than 30 thousand organisms.

Human GRCh37 (hg19) RefSeq annotation update 

The NCBI RefSeq group has been in overdrive, making improvements to our human genome annotation and reference transcript and protein sets, with 8,000 new and 15,000 updated transcripts in the last year alone! That’s about 30% of our curated transcript dataset (the transcripts with NM_ and NR_ accessions), with a big focus on transcripts that are well-expressed, have conserved exons, or are transcribed from new promoters.

With all these improvements, we’ve been updating the RefSeq annotation of GRCh38.p13 every quarter. But what about GRCh37 (hg19), which many of you still use?

Genome Workbench Submission Wizard to replace Sequin for prokaryotic and eukaryotic genome submissions in January 2021

If you use Sequin to submit prokaryotic or eukaryotic genome sequences to GenBank, you need to be aware that Sequin will be retired in January 2021. Genome Workbench’s Submission Wizard, which is already available for submitting annotated genomes, will be the submission tool to use for annotated genomes going forward.

Genome Workbench is desktop software that offers a rich set of integrated tools for studying and analyzing genetic data. You can explore and compare data from multiple sources, including the NCBI databases or your own private data. The Submission Wizard, available since 2019, allows you to prepare submissions of single genomes where all sequences come from the same organism. This interface (Figure 1) is particularly valuable for:

  1. Eukaryotic genomes with annotations, for example, those prepared with tbl2asn
  2. Prokaryotic genomes annotated by non-NCBI tools, including Prokka and RAST

Please register to attend our webinar on November 18 to see how to use Genome Workbench to prepare a submission. 

(Note: You should continue to submit organelle and viral genomes using BankIt. Please visit the Submission Portal page for information on other submission options.)

Figure 1. Genome Workbench and Submission Wizard. Once the Sequence Editing package is enabled, the Submission menu can open the Genome Submission Wizard, which prompts you to upload sequence data and presents a tabbed set of forms for entering information about the submission. The Wizard validates the submission and provides editing capabilities for correcting errors.

GenBank 240.0 is available and surpasses 10 trillion basepairs!

GenBank release 240.0 (10/28/2020) is now available on the NCBI FTP site. This release has 10.33 trillion bases and 2.17 billion records.

The current release has 219,055,207 traditional records containing 698,688,094,046 base pairs of sequence data. There are also 1,432,874,252 WGS records containing 9,215,815,569,509 base pairs of sequence data, 435,968,379 bulk-oriented TSA records containing 382,996,662,270 base pairs of sequence data, and 78,177,358 bulk-oriented TLS records containing 28,814,798,868 base pairs of sequence data.
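As a quick cross-check, the headline figures for the release (10.33 trillion bases and 2.17 billion records) can be reproduced by summing the four components listed above. This is only a sketch of the arithmetic using the numbers quoted in this post; the variable names are illustrative.

```python
# (records, base pairs) for each GenBank component, as quoted in this post.
components = {
    "traditional": (219_055_207, 698_688_094_046),
    "WGS": (1_432_874_252, 9_215_815_569_509),
    "TSA": (435_968_379, 382_996_662_270),
    "TLS": (78_177_358, 28_814_798_868),
}

total_records = sum(records for records, _ in components.values())
total_bases = sum(bases for _, bases in components.values())

print(f"{total_records / 1e9:.2f} billion records")  # 2.17 billion records
print(f"{total_bases / 1e12:.2f} trillion bases")    # 10.33 trillion bases
```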

Growth between releases

During the 71 days between the close dates for GenBank Releases 239.0 and 240.0, the ‘traditional’ portion of GenBank grew by 44,631,024,497 basepairs and by 412,969 sequence records. During that same period, 94,006 records were updated. An average of 7,140 ‘traditional’ records were added and/or updated per day.
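The daily average quoted above follows from the added and updated record counts over the 71-day window. A short sketch of that calculation, using only the figures in this post:

```python
days_between_releases = 71
records_added = 412_969
records_updated = 94_006

# Average number of 'traditional' records added and/or updated per day.
per_day = (records_added + records_updated) / days_between_releases
print(round(per_day))  # 7140
```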

Between releases 239.0 and 240.0, the WGS component of GenBank grew by 374,166,158,857 basepairs and by 24,751,365 sequence records. The TSA component of GenBank grew by 16,027,711,110 basepairs and by 18,443,812 sequence records. The TLS component of GenBank grew by 989,739,370 basepairs and by 2,495,201 sequence records.

The total number of sequence data files increased by 107 with this release. The divisions are as follows:

  • BCT: 22 new files, now a total of 512
  • CON: 1 new file, now a total of 218
  • INV: 2 new files, now a total of 97
  • PAT: 1 new file, now a total of 213
  • PLN: 47 new files, now a total of 594
  • PRI: 10 new files, now a total of 45
  • ROD: 15 new files, now a total of 56
  • VRL: 5 new files, now a total of 44
  • VRT: 4 new files, now a total of 214
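The per-division counts above add up to the stated increase of 107 files. A small sketch of the tally, with the division codes and counts copied from the list:

```python
# New sequence data files per GenBank division in release 240.0.
new_files = {
    "BCT": 22, "CON": 1, "INV": 2, "PAT": 1, "PLN": 47,
    "PRI": 10, "ROD": 15, "VRL": 5, "VRT": 4,
}

total_new_files = sum(new_files.values())
print(total_new_files)  # 107
```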

Delivery of GenBank 240.0 was delayed by two weeks

A power surge at the NCBI data center and subsequent downtime for a critical disk storage system led to a nearly two-week delay in the delivery of the data files for GenBank 240.0. There were no data losses, and public-facing systems remained available. However, between the direct impacts of the outage and subsequent efforts to resume processing pipelines, the GenBank release timeline was significantly pushed back. Our apologies for the delay!

Upcoming Changes

New /ncRNA_class value: circRNA

  • The allowed values for the /ncRNA_class qualifier have been extended to include “circRNA”, for circular RNA molecules. This change will appear no earlier than GenBank Release 242.0 in February 2021.

New /circular_RNA qualifier

  • Complementing the new “circRNA” ncRNA class, a new /circular_RNA qualifier will be introduced in (or after) GenBank Release 242.0 in February 2021.

Additional Information

For downloading purposes, please keep in mind that the uncompressed GenBank Release 240.0 sequence data flatfiles require roughly 1,524 GB. The ASN.1 data files require approximately 958 GB.
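Before starting a download of this size, it may help to confirm that the target volume has enough free space. The sketch below uses Python's standard `shutil.disk_usage`; the helper name is hypothetical, and treating 1 GB as 10^9 bytes is an assumption, since the post does not say which convention the sizes use.

```python
import shutil

# Approximate uncompressed sizes quoted for GenBank Release 240.0, in GB.
REQUIRED_GB = {"flatfiles": 1_524, "asn1": 958}

BYTES_PER_GB = 10**9  # Assumption: decimal gigabytes.


def has_room(path, required_gb):
    """Return True if the volume holding `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * BYTES_PER_GB


for name, size_gb in REQUIRED_GB.items():
    status = "enough space" if has_room(".", size_gb) else "not enough space"
    print(f"{name}: need about {size_gb} GB, {status} in the current directory")
```

The check covers only the uncompressed data; leaving extra headroom for the compressed downloads themselves is a sensible precaution.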

More information about GenBank release 240.0 is available in the release notes, as well as in the README files in the genbank and ASN.1 (ncbi-asn1) directories on FTP.

November 18 Webinar: A new way to prepare genome submissions using NCBI’s Genome Workbench!

Join us November 18 to learn how to use Genome Workbench, NCBI’s sequence annotation and analysis package, to prepare genome submissions for GenBank. This webinar will help you prepare for the upcoming retirement of the Sequin submission tool in January 2021. You will learn how to use Genome Workbench’s Submission Wizard, Validation and Submitter Reports, Flat File View, and Graphical Sequence View to prepare your annotated genome submission to GenBank and to find and fix any problems before submitting.

  • Date and time: Wed, November 18, 2020 12:00 PM – 12:45 PM EST
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

NCBI RefSeq and Ensembl/GENCODE taking MANE mainstream with v0.92!

NCBI and EBI have been hard at work on our joint MANE collaboration, providing a set of representative transcripts for human protein-coding genes that are identically annotated in the NCBI RefSeq and Ensembl/GENCODE annotation sets and exactly match the GRCh38 reference assembly. We’re pleased to announce MANE v0.92, now covering 16,865 genes or ~88% of known human protein-coding genes.

In particular, we’ve focused on clinically relevant genes, and MANE Select now includes 99% of genes with high gene-disease validity. This release also includes 43 extra transcripts labeled “MANE Plus Clinical” that we’ve chosen to aid in clinical reporting, for example, when there are additional pathogenic variants not covered in the MANE Select transcript. While it’s critical to consider other alternatively spliced transcripts for variant interpretation or functional analyses, the MANE Select and MANE Plus Clinical transcripts provide a common foundation for clinical reporting and for other analyses that benefit from using just one well-supported transcript or protein per gene.

IgBLAST 1.17 is now available with improved identification of productive V gene sequences

A new release of IgBLAST (1.17), the popular package for classifying and analyzing immunoglobulin and T cell receptor sequences, is now available on the web and from the FTP site. The updated package is better at identifying productive V gene sequences. We added a new field, “V frame shift”, to the IgBLAST output to indicate whether the V gene translation frame contains a frame shift. We have also updated the definition of a productive V(D)J sequence to exclude those with internal frame shifts (Figure 1).

Figure 1. A portion of the web IgBLAST output showing the new “V frame shift” field. The results of this field now inform the classification of the sequence as Productive (Yes / No).

See the new IgBLAST manual on the NCBI GitHub site for more information on setting up and running IgBLAST.