New ribosomal RNA BLAST databases available on the web BLAST service and for download

We have a curated set of ribosomal RNA (rRNA) reference sequences (Targeted Loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacteria and archaea) and fungal samples (Table 1). To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section on the nucleotide BLAST page that makes it convenient to quickly identify source organisms (Figure 1).

Database | BioProjects | Sequences
16S ribosomal RNA (Bacteria and Archaea) | PRJNA33317, PRJNA33175 |
18S ribosomal RNA sequences (SSU) from Fungi type and reference material | PRJNA39195 | 2,337
28S ribosomal RNA sequences (LSU) from Fungi type and reference material | PRJNA51803 | 5,185
Internal transcribed spacer region (ITS) from Fungi and Oomycete type and reference material | PRJNA177353, PRJNA362621 |

Table 1. NCBI curated targeted rRNA sequences now available as BLAST databases.
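For readers who download the databases instead of using the web service, a search could be scripted along the following lines. This is a minimal sketch, not an official recipe: it assumes the BLAST+ `blastn` program is installed, that the downloaded 16S database is named `16S_ribosomal_RNA` (an assumption about the download name), and that `sample_16S.fasta` and `hits.tsv` are hypothetical file names.

```python
import shlex


def build_blastn_command(query_fasta, db="16S_ribosomal_RNA",
                         out_tsv="hits.tsv", max_hits=5):
    """Build the argument list for a blastn search against a local rRNA database.

    The database name is an assumption; use whatever name your download has.
    """
    return [
        "blastn",
        "-query", query_fasta,              # FASTA file with the unknown sequence(s)
        "-db", db,                          # local BLAST database to search
        "-outfmt", "6",                     # tabular output, one hit per line
        "-max_target_seqs", str(max_hits),  # keep only the top few matches
        "-out", out_tsv,                    # write results to this file
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before running it.
    print(shlex.join(build_blastn_command("sample_16S.fasta")))
```

The list returned by `build_blastn_command` can be passed to `subprocess.run` once BLAST+ and the database are in place.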

NCBI staff to present 3 posters at Advances in Genome Biology and Technology (AGBT), February 2020

Next week, NCBI staff will attend AGBT in Marco Island, Florida. On Tuesday, February 25, 2020, three posters from NCBI staff will be on display from 4:40 p.m. – 6:10 p.m. during the Poster Session and Wine Reception in the Banyan and Calusa Ballroom Foyers, Levels 1 and 3. Read on to learn a little bit about what we’ll be presenting.


New PGAP release with Singularity, no-internet option, and Taxonomy Check

A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) with several important features is now available on GitHub.

  • In response to several requests, we have added the option of running PGAP with Singularity, Podman, or any other Docker-compatible executable you wish to use.
  • We have also lifted the requirement for internet access in case you have privacy concerns. To run the pipeline without internet access, set the flag
  • Are you unsure about the identity of the organism you sequenced? We’ve added the Taxonomy-Check module to help you. This module will confirm the organism name or suggest a new taxonomic assignment through average nucleotide identity comparison with type material assemblies from GenBank. The check is currently an optional validation step prior to PGAP.

Try these new features and let us know what you think! Or submit your PGAP-annotated assembly to GenBank. And remember that if you are still improving the assembly and your genome doesn’t pass the pre-annotation validation, you can use the --ignore-all-errors flag to get a preliminary annotation.
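As a rough illustration of the preliminary-annotation case, the command line could be assembled as in the sketch below. Only the `--ignore-all-errors` flag comes from this post; the `./pgap.py` wrapper name, the `-o` output option, the positional input YAML file, and the file names in the usage example are assumptions, so check them against the PGAP documentation before use.

```python
def build_pgap_command(input_yaml, output_dir, ignore_all_errors=False,
                       pgap_script="./pgap.py"):
    """Build the argument list for a PGAP run.

    Assumed, not confirmed by the post: the wrapper script name, the -o
    option for the output directory, and the positional input YAML file.
    """
    cmd = [pgap_script, "-o", output_dir]
    if ignore_all_errors:
        # Flag named in the post: continue past pre-annotation validation
        # failures to obtain a preliminary annotation.
        cmd.append("--ignore-all-errors")
    cmd.append(input_yaml)
    return cmd


if __name__ == "__main__":
    # Hypothetical input and output names.
    print(" ".join(build_pgap_command("my_genome/input.yaml", "my_results",
                                      ignore_all_errors=True)))
```

Keeping the flag behind an explicit argument makes it harder to leave it on by accident once the assembly is ready for submission.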

Important changes coming to prokaryotic Reference and Representative genome assemblies

We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020.

  • We will reduce the number of Reference assemblies to 15 that have annotation provided by outside experts (Table 1) and re-annotate the 105 other current Reference assemblies using the latest Prokaryotic Genome Annotation Pipeline (PGAP) software. The re-annotated assemblies will lose reference status.
  • We will reassess and revise the set of Representative assemblies so that there is one assembly per species to better reflect the taxonomic diversity of the RefSeq bacterial and archaeal assemblies.


Opportunity for experienced data scientists at the National Library of Medicine: NIH DATA Scholar Program

The National Library of Medicine (NLM) is pleased to announce the Data and Technology Advancement (DATA) National Service Scholar Program, a new opportunity for experienced data and computer scientists and engineers to tackle biomedical data challenges in partnership with the NIH Office of Data Science Strategy (ODSS).

The one- to two-year positions will be based in NIH offices located in Bethesda. DATA Scholars will lead transformative NIH projects for architecting search across petabyte-scale genomic Sequence Read Archive (SRA) data:

  • pioneer sequence search strategies against the entire corpus of NIH’s SRA data to stimulate novel approaches to advance data analysis and accelerate biological discoveries.
  • develop methods to execute sequence-based searches, including those that involve machine-learning or other AI approaches.
  • directly communicate technical and project-related information with NIH senior leadership.
  • collaborate with other DATA Scholars and the NIH data science community across broad disciplinary boundaries.
  • engage with policymakers, top researchers, and industry partners.

Applicants should possess technical skills in areas such as artificial intelligence, cloud computing, data engineering, data science, database management, project management, software design, supercomputing, and/or bioinformatics. Industry experience is desired. Applicants should have an M.D., Ph.D. or equivalent doctoral degree and have advanced experience in data science or related fields.

Applications are due April 30, 2020. For more information and details on how to apply, visit our full job announcement.

DHHS and NIH are Equal Opportunity Employers. Applications from women, minorities, and persons with disabilities are strongly encouraged.

Try out our new table download options from the NCBI genome browsers and sequence viewers!

Have you ever wanted a list of the genes you’re looking at in the browser – maybe to give you a starting point for candidate gene analysis, or to cross-reference with other data?

In response to your feedback and helpful discussions with you, we’re excited to announce a new option to download gene annotation data directly from the web sequence viewers and browsers.

This new feature lets you get a table of gene names, coordinates and other helpful information from your genomic region of interest.

Go to the Download menu on the toolbar of the graphical viewer to find options for getting sequence and annotation data.



Computational Medicine Codeathon and AWS workshop at Chapel Hill in March

NIH is pleased to announce a computational medicine-focused codeathon. To apply, please complete the application form by February 25, 2020. We will also be offering a free workshop, AWS Technical Essentials, the day before the codeathon. Read on for more information about the event.

Important changes to the genomes FTP site in February

We have added the latest NCBI Eukaryotic Genome Annotation Pipeline results for the more than 580 species that we annotate to the genomes/refseq directory on the genomes FTP area. As we announced in December, we will stop publishing annotation results to the genus_species directories (example: genomes/Xenopus_tropicalis) on the genomes FTP site effective February 1, 2020. We will also move existing genus_species directories to genomes/archive/old_refseq during the month of February.

Figure 1. The Assembly page for the Xenopus tropicalis UCB Xtro 10.0 (GCF_000004195.4) showing the blue download button. Annotation results such as the RefSeq transcript alignments that can be downloaded from the web page are now also under the genomes/refseq directory on the FTP site. The FTP path to the .bam alignment files is in red.

These FTP changes do not affect the Assembly download function. As always, you can download assembly data using the blue Download button on the web pages (Figure 1).
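Scripts that still point at the old genus_species directories will need updating. The sketch below shows one way to rewrite such a path to the archive location named above. It assumes each directory keeps its name when moved under genomes/archive/old_refseq, which this post does not spell out, and the function name is hypothetical.

```python
from pathlib import PurePosixPath

# Archive location named in the announcement.
ARCHIVE_ROOT = PurePosixPath("genomes/archive/old_refseq")


def archived_path(old_path):
    """Map an old genus_species FTP path to its assumed archive location.

    Assumption: the directory name is preserved under the archive root.
    """
    parts = PurePosixPath(old_path.strip("/")).parts
    if len(parts) < 2 or parts[0] != "genomes":
        raise ValueError(f"not a genomes path: {old_path!r}")
    if parts[1] in ("archive", "refseq"):
        raise ValueError(f"not a genus_species directory: {old_path!r}")
    # Re-root everything after 'genomes/' under the archive directory.
    return str(ARCHIVE_ROOT.joinpath(*parts[1:]))


if __name__ == "__main__":
    print(archived_path("genomes/Xenopus_tropicalis"))
```

Current annotation results live under genomes/refseq instead, so the archive path is only appropriate for scripts that need the older files.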


Read about NCBI resources in 2020 Nucleic Acids Research database issue

The 2020 Nucleic Acids Research database issue features papers from NCBI staff on GenBank, ClinVar and more. These papers are also available on PubMed. To read an article, click on the PMID number listed below.

“Database resources of the National Center for Biotechnology Information”

by Eric W Sayers, Jeff Beck, J Rodney Brister, Evan E Bolton, Kathi Canese et al. (PMID: 31602479)

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 38 distinct databases. This article provides a brief overview of the NCBI Entrez system of databases, followed by a summary of resources that were either introduced or significantly updated in the past year, including PubMed, PMC, Bookshelf, BLAST databases and more!


NLM announces Curation at Scale Workshop

Data curation plays a critical role in today’s biomedical research and ensures scientific data will be accessible for future research and reuse. To improve the speed and scope of manual curation, automated and computer-assisted methods are increasingly in demand.

The National Library of Medicine (NLM) is pleased to announce a two-day Curation at Scale Workshop, to be held on April 27-28, 2020 on the National Institutes of Health (NIH) campus in Bethesda, Maryland, USA. The NLM workshop featuring invited speakers will bring together biocurators, developers of automated curation methods, and other stakeholders, and will offer an opportunity to learn more about the current status of biomedical data curation, to share your research and your challenges, and to discuss the implementation of advanced computational techniques in scientific data curation. We invite participants from academia, government, publishers, and industry interested in the methods and tools employed in curation of biomedical data to attend this exciting workshop. Participants are encouraged to submit an abstract for consideration for poster presentation.

Poster abstract submission deadline: March 6, 2020

Registration deadline: April 17, 2020