Month: May 2020

Download high-quality graphics from the NCBI Multiple Sequence Alignment Viewer (MSAV)

You can now download publication-quality graphic images of the alignment displayed in the NCBI Multiple Sequence Alignment Viewer (Figure 1). Load sequence alignments into the viewer from BLAST or COBALT results, or upload alignment files directly. Once you have the alignment set in the viewer, choose the “Printer-friendly PDF/SVG” option in the Download menu on the toolbar to save the image. The PDF and SVG files contain vector graphics suitable for presentation and publication.

Figure 1. The image download options in the MSAV. You can adjust the desired coordinate range and choose to download a PDF or SVG image. You can also preview the PDF download. Choose simplified color shading to improve compatibility with some graphics programs.

The downloaded image will show the coordinate range you requested and will include all the rows in the alignment.

Please contact us through the Feedback link on the MSA Viewer or write to the NCBI Help Desk to provide feedback and let us know how we can make the NCBI Multiple Sequence Alignment Viewer work better for you.

Orthologs Are A-Swimming and A-Buzzing in RefSeq!

Previously we wrote about improvements to Drosophila annotations in RefSeq. We’re excited to report that we’re also improving how we compute and report orthology data for fish and insects to help you find evolutionarily related genes across species. Currently when we annotate a vertebrate genome using our in-house eukaryotic genome annotation pipeline, we have a robust process that identifies 1:1 orthologs vs human using a combination of BLAST comparisons and local synteny. These results are available in NCBI Gene and our new Ortholog pages, and also on Gene’s FTP site. We also use the data to apply human gene and protein names to orthologs in other species, providing a very rich annotation for hundreds of vertebrates.


For fish, we’re now using a two-layer process. First, most of the fish now have 1:1 orthologs identified vs zebrafish, which typically results in identifying 50% more orthologs. Second, if we’ve identified a human ortholog for the zebrafish gene, then we also report the human gene. We’ve also switched primarily to applying gene symbols and names from zebrafish instead of human, mostly provided by the Zebrafish Information Network (ZFIN), to other fish orthologs. The end result is more ortholog connections and better nomenclature. For example, many fish have two related homeobox genes meis2a and meis2b, compared to the single MEIS2 gene in human. Our updated process has allowed us to identify and properly name meis2a and meis2b in 73 and 40 fish species, respectively.
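The two-layer process described above can be sketched as a pair of lookups. This is a minimal illustration rather than the annotation pipeline itself: the species label, locus tags, and table contents are hypothetical, and only the gene symbols meis2a, meis2b, and MEIS2 come from the text.

```python
# Layer 1 (hypothetical data): fish gene -> 1:1 zebrafish ortholog.
# "fish_sp1" and the locus tags are placeholders, not real identifiers.
FISH_TO_ZEBRAFISH = {
    ("fish_sp1", "locus_001"): "meis2a",
    ("fish_sp1", "locus_002"): "meis2b",
}

# Layer 2 (hypothetical data): zebrafish gene -> human ortholog, when known.
# Which zebrafish gene is paired with human MEIS2 is not stated in the text;
# the single entry here only exercises both branches of the lookup.
ZEBRAFISH_TO_HUMAN = {"meis2a": "MEIS2"}


def annotate_fish_gene(species, locus):
    """Return the symbol and ortholog links for one fish gene, or None."""
    zebrafish_gene = FISH_TO_ZEBRAFISH.get((species, locus))
    if zebrafish_gene is None:
        return None  # no zebrafish ortholog, so nothing to transfer
    return {
        "symbol": zebrafish_gene,  # name is taken from zebrafish
        "zebrafish_ortholog": zebrafish_gene,
        # Reported only if the zebrafish gene has a human ortholog.
        "human_ortholog": ZEBRAFISH_TO_HUMAN.get(zebrafish_gene),
    }
```

The human link is reached only through the zebrafish gene, so a fish gene with a zebrafish ortholog but no human counterpart still receives a zebrafish-derived name.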

Continue reading “Orthologs Are A-Swimming and A-Buzzing in RefSeq!”

Expanded average nucleotide identity analysis now available for prokaryotic genome assemblies

As we described in an earlier post, GenBank uses average nucleotide identity (ANI) analysis to find and correct misidentified prokaryotic genome assemblies. You can now access ANI data for the more than 600,000 GenBank bacterial and archaeal genome assemblies through a downloadable report (ANI_report_prokaryotes.txt) available from the genomes/ASSEMBLY_REPORTS area of the FTP site. The README describes the contents of the report in detail. You can use the ANI data to evaluate the taxonomic identity of genome assemblies of interest for yourself.

The new ANI_report_prokaryotes.txt replaces the older ANI_report_bacteria.txt in the same directory. We are no longer updating the ANI_report_bacteria.txt file and will remove it after 31st May 2020.
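As a rough sketch of how you might screen a downloaded report, the snippet below parses tab-delimited text and lists rows where two columns disagree. The column names and sample rows are invented placeholders, and the assumption that the first line is a header (optionally prefixed with "#") should be checked against the README, which defines the real layout.

```python
import csv


def read_report(text):
    """Parse tab-delimited report text into a list of dicts keyed by column."""
    lines = text.splitlines()
    # Assumption: the first line is the header, optionally prefixed with "#".
    header = lines[0].lstrip("#").strip().split("\t")
    return list(csv.DictReader(lines[1:], fieldnames=header, delimiter="\t"))


def mismatched_rows(rows, declared_col, best_match_col):
    """Return rows whose declared value differs from the best-match value."""
    return [row for row in rows if row[declared_col] != row[best_match_col]]


# Placeholder data: accessions, column names, and values are made up.
SAMPLE = (
    "# assembly\tdeclared_species\tbest_match_species\tani\n"
    "GCA_000000001.1\tSpecies A\tSpecies A\t98.7\n"
    "GCA_000000002.1\tSpecies A\tSpecies B\t96.1\n"
)

rows = read_report(SAMPLE)
suspects = mismatched_rows(rows, "declared_species", "best_match_species")
```

To use this on the real file, read its text from disk and substitute the column names documented in the README.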

RefSeq release 200 is public

RefSeq release 200 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of May 4, 2020, and contains 237,381,664 records, including 171,643,729 proteins, 31,244,247 RNAs, and sequences from 100,605 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
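For programmatic access through E-utilities, a search request is simply a URL. The sketch below builds an ESearch URL restricted to RefSeq records for one organism; the function name and the choice of organism are illustrative, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_esearch_url(db, organism, retmax=20):
    """Build an ESearch URL for RefSeq records from a given organism."""
    # srcdb_refseq[PROP] limits results to RefSeq; [ORGN] is the organism field.
    term = f"srcdb_refseq[PROP] AND {organism}[ORGN]"
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Example: RefSeq proteins from zebrafish (organism chosen for illustration).
url = refseq_esearch_url("protein", "Danio rerio")
```

Fetching the URL with any HTTP client returns the matching record IDs; please follow NCBI's usage guidelines for request rates.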

Other announcements:

The number of organisms in RefSeq crosses 100,000!
The current RefSeq release contains 100,605 distinct species or taxa, with a net increase of 763 species since Release 99. This milestone coincides with the 100th release, though the current release number is 200 (see below). Note that there is a decrease in the number of species for prokaryotes (bacteria and archaea) due to a clean-up that mainly removed unclassified bacteria and assemblies from Metagenome-Assembled Genomes (MAGs).

The FTP release number has skipped to 200
As previously announced, NCBI’s Reference Sequence (RefSeq) FTP release number has incremented to 200 for this release, skipping over the numbers 100-199. The previous release, in March 2020, was release 99. This change avoids overlap with the release numbers of the independently numbered RefSeq annotation releases for the eukaryotic genomes we annotate, which are currently in the range 100-109, for example Mus musculus Annotation Release 108.

NCBI Protein Families
A new release of the NCBI protein family profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can use HMMER to search this collection of hidden Markov models (HMMs) against your favorite prokaryotic proteins to identify their function.
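One way to run such a search is with HMMER's hmmsearch program. The sketch below only assembles the command line; the file names are hypothetical placeholders, and it assumes hmmsearch is installed and that you have already downloaded the profile collection.

```python
import shlex


def hmmsearch_command(hmm_file, protein_fasta, table_out, evalue=None):
    """Build an hmmsearch command: profiles searched against protein sequences."""
    # --tblout writes a per-sequence hit table that is easy to parse.
    cmd = ["hmmsearch", "--tblout", table_out]
    if evalue is not None:
        cmd += ["-E", str(evalue)]  # optional E-value reporting threshold
    # Positional arguments: the HMM file first, then the sequence file.
    cmd += [hmm_file, protein_fasta]
    return cmd


# Hypothetical file names, for illustration only.
cmd = hmmsearch_command("profiles.hmm", "my_proteins.faa", "hits.tbl", evalue=1e-5)
printable = shlex.join(cmd)  # a shell-safe string for logging or copy-paste
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run the search.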

Recalculation of Prokaryotic Reference and Representative Genome Assemblies
We have updated the collection of reference and representative assemblies for Bacteria and Archaea to better reflect the taxonomic breadth of the prokaryotes in RefSeq. We have selected one reference or representative assembly for every species based on several criteria including contiguity, completeness, and whether the assembly is from type material.

Future change: Mouse Reference Assembly Update
A full assembly update for the mouse GRCm38.p6 reference assembly is expected to be released in 2020 by the GRC. We anticipate updating the mouse RefSeq annotation to the new GRCm39 assembly this summer, for either RefSeq FTP Release 201 or 202.


Changing of the Guard: A New Acting Director for NCBI

We wanted to take a moment to announce an important internal development at NCBI. After an illustrious, 32-year career, Dr. James Ostell retired from federal service on March 31, 2020.

Dr. James Ostell

Dr. Ostell (or “Jim” as we all know him) came to NCBI at its very inception in 1988 and spent the majority of his time at NCBI as the Chief of the Information Engineering Branch. In this role he was responsible for designing, building, and deploying virtually all of the public production services that NCBI provides. In 2017, he became NCBI’s second Director, and championed initial efforts to move NCBI services to cloud environments. During his long tenure, Jim oversaw the growth of NCBI from a handful of people wondering how to confront the coming era of biological data to a vibrant center of some 700 staff serving more than 7 million users each day. We celebrate Jim’s leadership in building these services that continue to provide free and reliable access to data that are critical to biomedical research and the NIH mission to enhance human health.

We are also pleased to welcome Dr. Stephen Sherry as the new Acting Director of NCBI.

Dr. Stephen Sherry

Dr. Sherry (or “Steve”) joined NCBI in 1998 and has led the development of several NCBI resources including dbSNP, dbVar, dbGaP, ClinVar, and SRA. He has also played a central role in the ongoing move of the SRA dataset onto cloud architectures. Steve has long-standing interests in storing population genetic data in ways that make these data useful to researchers while preserving the privacy of study participants.

As we wish Jim a fond farewell, we hope you will join us in welcoming Steve to this new role.

Identify conditions in ClinVar and Genetic Testing Registry with MONDO IDs

In support of data sharing efforts, NCBI’s ClinVar and Genetic Testing Registry (GTR) now accept submissions that use MONDO IDs to identify conditions.

To submit to ClinVar, download our updated spreadsheet templates and enter MONDO as the Condition ID type. Note: The updated template is necessary only if you identify the condition by MONDO ID, not by name.

GTR submitters can use MONDO IDs to identify phenotypes in the clinical tests submitted via spreadsheet, and Mondo phenotype names in both clinical and research test submissions.

Continue reading “Identify conditions in ClinVar and Genetic Testing Registry with MONDO IDs”

May 20 webinar: Exploring SRA metadata in the cloud with BigQuery

Join us on May 20th to learn how to use Google’s BigQuery to quickly search the data from the Sequence Read Archive (SRA) in the cloud to speed up your bioinformatic research and discovery projects. BigQuery is a tool for exploring cloud-based data tables with SQL-like queries. In this webinar, we’ll introduce you to using BigQuery to mine SRA submitter-supplied metadata and the results of taxonomic analysis for SRA runs. You’ll see real-world case studies that demonstrate how to find key information about SRA runs and identify data sets for your own analysis pipelines.

  • Date and time: Wed, May 20, 2020 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.
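To give a flavor of the SQL-like queries mentioned above, here is a minimal sketch that builds a query string for SRA metadata. The table name `nih-sra-datastore.sra.metadata` and the column names `acc`, `organism`, and `assay_type` are assumptions that should be checked against the current SRA cloud documentation; the snippet builds text only and does not contact BigQuery.

```python
import re

# Assumed public table name for SRA metadata in BigQuery; verify before use.
SRA_METADATA_TABLE = "nih-sra-datastore.sra.metadata"

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def sra_runs_by_organism_sql(columns, limit=10):
    """Build a query selecting the given columns for one organism."""
    if not columns:
        raise ValueError("at least one column is required")
    for name in columns:
        # Only plain identifiers are accepted, to keep the SQL well-formed.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain column name: {name!r}")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        f"SELECT {', '.join(columns)}\n"
        f"FROM `{SRA_METADATA_TABLE}`\n"
        "WHERE organism = @organism\n"  # named query parameter, bound at run time
        f"LIMIT {limit}"
    )


# Column names here are assumptions, not confirmed by the announcement.
sql = sra_runs_by_organism_sql(["acc", "organism", "assay_type"], limit=5)
```

The organism value is left as the named parameter `@organism`, so it can be supplied separately when the query is run instead of being pasted into the SQL text.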

Structure viewer iCn3D 2.15.0 with new rendering, annotation, and alignment features

iCn3D 2.15.0 is now available on the NCBI web site and as a release on GitHub. To use the updated web application, retrieve any structure from the Molecular Modeling Database (MMDB), open the structure summary page, and click the button for “full-featured 3D viewer” in the molecular graphic. For example, you can retrieve structures that contain the term SARS-COV-2, click on a structure of interest, then follow the link for “full-featured 3D viewer.” You can also open iCn3D and use the “File” menu to retrieve a structure by its ID, for example 6MOJ, or to open a structure file from your local computer.

Figure 1. iCn3D showing the structure of the SARS-COV-2 spike protein (6MOJ) with custom coloring of conserved residues and a multiple sequence alignment of other coronavirus spike proteins. The ability to apply custom color to specific residues or chains and the ability to add multiple alignments as tracks are some of the new features available in 2.15.0.

Continue reading “Structure viewer iCn3D 2.15.0 with new rendering, annotation, and alignment features”

A new version of IgBLAST (1.16.0) is here!

We’ve released a new version (1.16.0) of IgBLAST, the popular NCBI package for classifying and analyzing immunoglobulin (IG) and T cell receptor (TCR) variable domain sequences. Version 1.16.0 has three new improvements.

  1. Added the ability to extend the J gene alignment at the 3′ end of the region (Figure 1). This allows you to view the unaligned bases that otherwise would not be included because of low sequence similarity.

Figure 1. The new “extend alignment at the 3′ end” option on the IgBLAST web form. The command line option is ‘-extend_align3end’.

Continue reading “A new version of IgBLAST (1.16.0) is here!”
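On the command line, the new option can be added to an ordinary igblastn call. The sketch below assembles such a command under two assumptions: that ‘-extend_align3end’ is a bare switch taking no value, and that the germline database and query file names (all hypothetical here) exist locally.

```python
def igblastn_command(query_fasta, v_db, d_db, j_db, extend_3prime=True):
    """Build an igblastn command, optionally extending the alignment at the 3' end."""
    cmd = [
        "igblastn",
        "-query", query_fasta,
        "-germline_db_V", v_db,
        "-germline_db_D", d_db,
        "-germline_db_J", j_db,
    ]
    if extend_3prime:
        # Option named in the release notes; assumed to take no argument.
        cmd.append("-extend_align3end")
    return cmd


# Hypothetical file and database names, for illustration only.
cmd = igblastn_command("reads.fasta", "human_V", "human_D", "human_J")
```

Running the list with `subprocess.run(cmd, check=True)` would start the search, provided IgBLAST 1.16.0 or later is installed.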