Read assembly and Annotation Pipeline Tool (RAPT) is available for use and testing

We are excited to launch a beta version of RAPT, the Read assembly and Annotation Pipeline Tool, a one-step application for the genome assembly and gene annotation of archaeal and bacterial isolates. Start from an Illumina run in SRA or on your local machine and get a fully annotated genome!

A RAPT Docker container includes SKESA, a high-accuracy assembler for short reads; PGAP, the annotation pipeline written in the Common Workflow Language (CWL) and used by RefSeq; and cwltool, the reference implementation for CWL. A RAPT release also includes a set of reference data that are critical for a quality annotation. RAPT can be executed with Docker, Singularity, or Podman on any local or remote machine meeting basic requirements. For users of the Google Cloud Platform, RAPT can be launched from the Google Cloud Shell without configuring a virtual machine in advance.

To learn more about RAPT, register for our upcoming webinar.

Questions? Interested in becoming a beta tester? Contact us!

RAPT is available here.

New Columns added to the web BLAST Descriptions Table

In response to your requests, we have added new columns to the Descriptions Table for the web BLAST output. The new columns are Scientific Name, Common Name, Taxid, and Accession Length. Common Name and Accession Length are now part of the default display. You can click ‘Select columns’ or ‘Manage columns’ to add or remove columns from the display (Figure 1). Your preferences will be saved for your next visit to BLAST, and when you download your results, the columns you have displayed will be included.

Figure 1. The web BLAST Descriptions Table with all possible columns. You can remove columns through the ‘Manage columns’ menu. If you are not displaying any non-default columns, you can add them using the same menu, which is then titled ‘Select columns’.

Customize columns in NCBI’s Multiple Sequence Alignment Viewer

We’re excited to report that researchers using the NCBI Multiple Sequence Alignment Viewer (MSAV) can now add or remove columns from the alignment view. In this way, you can choose to show only columns with data relevant for analysis of the sequences in your alignment.

When you arrive at an MSA alignment view, you’ll see columns for the Sequence ID (e.g., sequence accession number), Start and End of the alignment, and the organism (species name).

Sometimes, the information in these default columns isn’t the most useful information for sorting through the alignment. In the example above, all the sequences are from the same organism, so looking at the Organism column won’t help in figuring out the differences among the different sequences in the alignment.

December 2 Webinar: Using the new Read assembly and Annotation Pipeline Tool (RAPT) to assemble and annotate microbial genomes

Join us December 2 to learn how to use the Read assembly and Annotation Pipeline Tool (RAPT). With RAPT, you can assemble and annotate a microbial genome right out of the sequencing machine! Provide the short genomic reads or an SRA run as input, and get back the sequence annotated with a complete gene set. The assembly is built with SKESA and annotated with PGAP. RAPT also verifies the taxonomic assignment of the genome with the Average Nucleotide Identity tool. In this webinar, you will learn how you can run RAPT on your own machine or on the Google Cloud Platform.

  • Date and time: Wed, December 2, 2020 12:00 PM – 12:45 PM EST
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

Search NCBI’s Pathogen Detection websites with simple keywords

We’ve redesigned the filters on NCBI’s Pathogen Detection websites to make searching easier!

For example, say you wanted to search for outbreak isolates related to flour. Before the filters were redesigned, you’d have to know that some of the available metadata terms include “flour”, “All-purpose Wheat flour”, and “wheat flour”, along with seventeen other terms. Now, you can see all of your available options after typing in your search term and select only those that are relevant to your search.

Figure 1. Isolates Browser. Use the “Filters” button to see and search all filters.

BLAST+ 2.11.0 now available with limited usage reporting to help improve BLAST

The BLAST+ 2.11.0 release is now available from our FTP site. With this release, BLAST+ provides usage reports to NCBI to help us improve BLAST. This information is limited to the name of the BLAST program, some basic database metadata, a few BLAST parameters, as well as the number and total size of your queries (Figure 1).

Figure 1. An example of the report sent back to NCBI from the 2.11.0 BLAST programs.
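If you launch BLAST+ from a script and prefer not to send usage reports, a minimal Python sketch follows. It assumes that BLAST+ honors a `BLAST_USAGE_REPORT` environment variable as an opt-out, which the announcement above does not state, so check the BLAST+ documentation for your version. The query, database, and output names are hypothetical placeholders.

```python
import os

def blast_env_without_usage_report(base_env=None):
    """Return a copy of the environment with BLAST usage reporting switched off."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: BLAST+ 2.11.0 and later read this variable as an opt-out.
    env["BLAST_USAGE_REPORT"] = "false"
    return env

# Hypothetical command line; file and database names are placeholders.
cmd = ["blastn", "-query", "query.fasta", "-db", "my_db", "-out", "results.txt"]
env = blast_env_without_usage_report()

# To actually run BLAST with this environment (requires BLAST+ installed):
# import subprocess
# subprocess.run(cmd, env=env, check=True)
```

Passing a modified copy of the environment to the child process keeps the setting local to that one BLAST run and leaves the rest of your shell session unchanged.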

RefSeq Release 203 now available

RefSeq release 203 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of November 2, 2020, and contains 256,340,911 records, including 186,482,096 proteins, 34,176,314 RNAs, and sequences from 105,349 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
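As a rough illustration of programmatic access through E-utilities, the sketch below builds an EFetch request URL for a single RefSeq record. The helper name `efetch_url` is made up for this example, and the accession shown is only an illustrative input; substitute the record you need.

```python
from urllib.parse import urlencode

# Base address of the E-utilities EFetch service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL that returns one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}?{urlencode(params)}"

# Illustrative accession only; any RefSeq nucleotide accession can be used.
url = efetch_url("NM_000546.6")
print(url)
```

The sketch only constructs the URL and makes no network request. To retrieve the record, pass the URL to an HTTP client of your choice and keep within NCBI's request-rate guidelines.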

Other announcements: 

RefSeq annotation of mouse GRCm39
RefSeq has finished its initial annotation of the new mouse reference assembly, GRCm39, recently released by the Genome Reference Consortium. This is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38, resolving over 400 issues, almost doubling the scaffold N50, closing almost half the gaps, and adding 1.9 Mb of sequence.

The annotation report for annotation release 109 is available here.

The annotation products are available in the sequence databases and on the FTP site.

New eukaryotic genome annotations
In addition to mouse (GRCm39), this release contains new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 27 species, including:

  • Pallas’s mastiff bat annotation release 100, based on the assembly mMolMol1.p (GCF_014108415.1)
  • Myotis myotis bat annotation release 100, based on the assembly mMyoMyo1.p (GCF_014108235.1)
  • southern grasshopper mouse annotation release 100, based on the new assembly mOncTor1.1 (GCF_903995425.1)
  • American pika annotation release 102, based on the new assembly OchPri4.0 (GCF_014633375.1)
  • pharaoh ant annotation release 102, based on the new assembly ASM1337386v2 (GCF_013373865.1)
  • olive fruit fly annotation release 101, based on the assembly MU_Boleae_v2 (GCF_001188975.3)

Updated human genome Annotation Release 105.20201022 (GRCh37.p13)
Annotation Release 105.20201022 is an annotation update for the previous human reference assembly, GRCh37.p13 (hg19). This update is not part of the RefSeq FTP release, but the annotation products are available in the sequence databases and on the genomes FTP site.

COVID-19 related human gene annotation now in NCBI RefSeq and Gene
The RefSeq group has compiled a set of human genes with roles in coronavirus infection and disease. You can now see and search for these genes and their regulatory elements in NCBI Gene and RefSeq.

Matched Annotation by NCBI and EMBL-EBI (MANE) version 0.92
NCBI RefSeq and Ensembl/GENCODE announced MANE v0.92, which covers 16,865 genes or ~88% of known human protein-coding genes.

NCBI Datasets

NCBI Datasets now provides downloads of gene data for more than 30 thousand organisms.

Human GRCh37 (hg19) RefSeq annotation update 

The NCBI RefSeq group has been in overdrive, making improvements to our human genome annotation and reference transcript and protein sets, with 8,000 new and 15,000 updated transcripts in the last year alone! That’s about 30% of our curated transcript dataset (the transcripts with NM_ and NR_ accessions), with a big focus on transcripts that are well-expressed, have conserved exons, or are transcribed from new promoters.

With all these improvements, we’ve been updating the RefSeq annotation of GRCh38.p13 every quarter. But what about GRCh37 (hg19), which many of you still use?

Genome Workbench Submission Wizard to replace Sequin for prokaryotic and eukaryotic genome submissions in January 2021

If you use Sequin to submit prokaryotic or eukaryotic genome sequences to GenBank, you need to be aware that Sequin will be retired in January 2021. Genome Workbench’s Submission Wizard, which is already available for submitting annotated genomes, will be the submission tool to use for annotated genomes going forward.

Genome Workbench is desktop software that offers a rich set of integrated tools for studying and analyzing genetic data. You can explore and compare data from multiple sources, including the NCBI databases or your own private data. The Submission Wizard, available since 2019, allows you to prepare submissions of single genomes where all sequences come from the same organism. This interface (Figure 1) is particularly valuable for:

  1. Eukaryotic genomes with annotations, for example those prepared with tbl2asn
  2. Prokaryotic genomes annotated by non-NCBI tools, including Prokka and RAST

Please register to attend our webinar on November 18 to see how to use Genome Workbench to prepare a submission. 

(Note: You should continue to submit organelle and viral genomes using BankIt. Please visit the Submission Portal page for information on other submission options.)

Figure 1. Genome Workbench and Submission Wizard. Once the Sequence Editing package is enabled, the Submission menu can open the Genome Submission Wizard, which prompts you to upload sequence data and presents a tabbed set of forms for entering information about the submission. The Wizard validates the submission and provides editing capabilities for correcting errors.

GenBank 240.0 is available and surpasses 10 trillion basepairs!

GenBank release 240.0 (10/28/2020) is now available on the NCBI FTP site. This release has 10.33 trillion bases and 2.17 billion records.

The current release has 219,055,207 traditional records containing 698,688,094,046 base pairs of sequence data. There are also 1,432,874,252 WGS records containing 9,215,815,569,509 base pairs of sequence data, 435,968,379 bulk-oriented TSA records containing 382,996,662,270 base pairs of sequence data, and 78,177,358 bulk-oriented TLS records containing 28,814,798,868 base pairs of sequence data.
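The headline totals can be checked against the four components listed above. This is only a quick arithmetic sketch using the figures quoted in this post; the variable names are illustrative.

```python
# (records, base pairs) for each GenBank 240.0 component, as quoted above.
components = {
    "traditional": (219_055_207, 698_688_094_046),
    "WGS": (1_432_874_252, 9_215_815_569_509),
    "TSA": (435_968_379, 382_996_662_270),
    "TLS": (78_177_358, 28_814_798_868),
}

total_records = sum(records for records, _ in components.values())
total_bases = sum(bases for _, bases in components.values())

print(f"{total_records:,} records = {total_records / 1e9:.2f} billion")
print(f"{total_bases:,} bases = {total_bases / 1e12:.2f} trillion")
```

The sums come to 2,166,075,196 records and 10,326,315,124,693 base pairs, matching the rounded 2.17 billion records and 10.33 trillion bases.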

Growth between releases

During the 71 days between the close dates for GenBank Releases 239.0 and 240.0, the ‘traditional’ portion of GenBank grew by 44,631,024,497 basepairs and by 412,969 sequence records. During that same period, 94,006 records were updated. An average of 7,140 ‘traditional’ records were added and/or updated per day.

Between releases 239.0 and 240.0, the WGS component of GenBank grew by 374,166,158,857 basepairs and by 24,751,365 sequence records. The TSA component of GenBank grew by 16,027,711,110 basepairs and by 18,443,812 sequence records. The TLS component of GenBank grew by 989,739,370 basepairs and by 2,495,201 sequence records.
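The daily average quoted for ‘traditional’ records follows from the added and updated counts over the 71-day window. A short sketch of that arithmetic, with illustrative variable names:

```python
days_between_releases = 71
records_added = 412_969
records_updated = 94_006

# Added and updated records are counted together, as in the text above.
records_touched = records_added + records_updated
per_day = records_touched / days_between_releases

print(records_touched)   # 506975
print(round(per_day))    # 7140
```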

The total number of sequence data files increased by 107 with this release. The divisions are as follows:

  • BCT: 22 new files, now a total of 512
  • CON: 1 new file, now a total of 218
  • INV: 2 new files, now a total of 97
  • PAT: 1 new file, now a total of 213
  • PLN: 47 new files, now a total of 594
  • PRI: 10 new files, now a total of 45
  • ROD: 15 new files, now a total of 56
  • VRL: 5 new files, now a total of 44
  • VRT: 4 new files, now a total of 214
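As a cross-check, the per-division counts in the list above add up to the stated increase of 107 files. A small sketch using the listed numbers, which also derives the file count each division had before this release:

```python
# division: (new files in 240.0, total files after 240.0), from the list above
divisions = {
    "BCT": (22, 512), "CON": (1, 218), "INV": (2, 97),
    "PAT": (1, 213), "PLN": (47, 594), "PRI": (10, 45),
    "ROD": (15, 56), "VRL": (5, 44), "VRT": (4, 214),
}

new_files = sum(new for new, _ in divisions.values())
previous_totals = {name: total - new for name, (new, total) in divisions.items()}

print(new_files)               # 107
print(previous_totals["PLN"])  # 547
```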

Delivery of GenBank 240.0 was delayed by two weeks

A power surge at the NCBI data center and subsequent downtime for a critical disk storage system led to a nearly two-week delay in the delivery of the data files for GenBank 240.0. There were no data losses, and public-facing systems remained available. However, between the direct impacts of the outage and subsequent efforts to resume processing pipelines, the GenBank release timeline was significantly pushed back. Our apologies for the delay!

Upcoming Changes

New /ncRNA_class value: circRNA

  • The allowed values for the /ncRNA_class qualifier have been extended to include “circRNA”, for circular RNA molecules. This change will appear no earlier than GenBank Release 242.0 in February 2021.

New /circular_RNA qualifier

  • Complementing the new “circRNA” ncRNA class, a new qualifier will be introduced in (or after) GenBank Release 242.0 in February 2021.

Additional Information

For downloading purposes, please keep in mind that the uncompressed GenBank Release 240.0 sequence data flatfiles require roughly 1,524 GB. The ASN.1 data files require approximately 958 GB.
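Before starting a download, you may want to confirm that the target volume has enough free space. The sketch below uses the sizes quoted above and Python's standard `shutil.disk_usage`. It assumes decimal gigabytes (10^9 bytes), which the release notes may define differently, and the helper name `has_room` is illustrative.

```python
import shutil

FLATFILES_GB = 1_524   # uncompressed sequence data flatfiles
ASN1_GB = 958          # ASN.1 data files
BYTES_PER_GB = 10**9   # assumption: decimal gigabytes

def has_room(path, required_gb):
    """Return True if the volume holding `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * BYTES_PER_GB

both_gb = FLATFILES_GB + ASN1_GB
print(f"Flatfiles plus ASN.1: {both_gb:,} GB")
print("Room for flatfiles here:", has_room(".", FLATFILES_GB))
```

Storing both formats would take roughly 2,482 GB by these figures. Compressed downloads are smaller, so treat the result as an upper bound for the unpacked data.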

More information about GenBank release 240.0 is available in the release notes, as well as in the README files in the genbank and ASN.1 (ncbi-asn1) directories on FTP.