Author: NCBI Staff

A new service to evaluate the quality of your assembled genome!

Are you wondering about the quality of a human, mouse or rat genome that you have assembled?

We offer a new service for evaluating the completeness, correctness, and base accuracy of your human, mouse or rat genome assembly compared to a reference assembly. You simply provide NCBI with one or more assemblies in FASTA format and we will do an annotation-based evaluation of the genome(s) using the expert-curated, high-confidence RefSeq transcripts for the species.

Continue reading “A new service to evaluate the quality of your assembled genome!”

Updated prokaryotic representative genome collection

The bacterial and archaeal representative genome collection has been updated! We selected a total of 14,912 of the 224,000 prokaryotic RefSeq assemblies to represent their respective species. The collection has grown by 8% since April 2021 and now includes Candidatus and endosymbiont species (Figure 1), which constitute 303 and 140, respectively, of the 1,077 newly added species. In addition, 719 species are now represented by a better assembly, and 70 species were removed because of changes in NCBI Taxonomy or uncertainty in their species assignment.

Figure 1. Graphical view of a portion of the RefSeq Representative assembly for the bedbug endosymbiont Candidatus Wolbachia massiliensis isolate PL13.

Continue reading “Updated prokaryotic representative genome collection”

NCBI Presentations at Biodiversity Genomics 2021 Highlight Growing Support for Comparative Genomics

The National Center for Biotechnology Information (NCBI) has several speakers at the upcoming Biodiversity Genomics Conference from September 27 to October 1, 2021.

Valerie Schneider, head of NCBI’s SeqPlus Program and Deputy Director for Sequence Offerings, will present a poster discussing how NCBI’s new comparative genome research focus will enable researchers to explore all eukaryotic research organisms, find related organisms and support additional organism-specific resources that a specific community may have or wish to develop.

Nuala O’Leary, Product Owner, NCBI Datasets, will present the latest developments for Datasets, a beta resource that supports intuitive and flexible access to genome data for a broad range of taxa via a redesigned website and command-line tools.

Adelaide Rhodes, Cloud Subject Matter Expert in Education, will present two case studies that emphasize the ease of navigating the new Datasets website as well as the use of command-line tools to speed up data discovery for genes and genomes of interest.

Terence Murphy, Product Owner, NCBI RefSeq, will present a new tool for genome providers to identify contamination in newly assembled sequences with high sensitivity, specificity, and performance.

The Biodiversity Genomics Conference brings together a global audience to celebrate achievements in genome sequencing across the eukaryotic tree of life, explore current challenges and solutions, and develop strategies for sequencing and data sharing in the upcoming decade of biodiversity genomics. NCBI has several programs that support the needs of this scientific research group.

Four new options to simplify your SARS-CoV-2 submissions

We have recently added several exciting improvements to the SARS-CoV-2 GenBank submission process based on community feedback. To save you time, NCBI completes feature annotation for you, which means SARS-CoV-2 GenBank submission only requires a FASTA file and source metadata. Here are other new features to ease and simplify your submission workflow.

Automatically remove failed sequences from a submission: On the web, a single click lets you opt in to automatic removal of failed sequences (Figure 1) so that the rest of your sequences can be swiftly accessioned! A report provided after the submission lists your failed sequences and points out potential sequence problems so that you can take a closer look after your error-free sequences are released. This option is also available for submission via FTP.

Need to set up FTP submissions? The NCBI team is here to help. Contact

Figure 1. GenBank submission page showing the option to remove sequences with processing errors.

Continue reading “Four new options to simplify your SARS-CoV-2 submissions”

NIH’s Cloud Data Delivery Service: SRA Delivers Even More Big Data to your Cloud Bucket

The Sequence Read Archive (SRA) is the National Institutes of Health’s (NIH) primary repository for raw, high-throughput sequencing data, containing both controlled- and open-access datasets that continue to grow exponentially. SRA is managed by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), and the data are available from NCBI’s servers as well as through two cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (GCP). Cloud access was made possible by support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

Due to SRA’s exponential growth in size, data in the cloud environment are currently partitioned into hot and cold storage to keep SRA sustainable and accessible. Per industry standards, data in hot storage are immediately accessible; because hot storage is more expensive to host, we try to reserve it for our most frequently requested datasets. The less frequently requested datasets are available in cold storage, which may not be immediately accessible. Fret not! SRA is constantly evolving to meet our users’ needs. NCBI’s Cloud Data Delivery Service (CDDS) now allows you to get public and controlled-access data delivered from cold and hot storage directly to your chosen cloud bucket in just a few hours. The minor cost is currently handled by NCBI, but certain limits apply: within a 30-day request cycle, you can request up to 5 TB from cold storage and 20 TB from hot storage for delivery to your cloud bucket.
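The per-cycle limits are easy to track before you place a request. Here is a minimal sketch, assuming you keep your own tally of what you have already requested in the current 30-day cycle. The `CycleUsage` class and `can_request` function are hypothetical helpers written for illustration; they are not part of CDDS or any NCBI tool, and only the 5 TB and 20 TB figures come from this post.

```python
from dataclasses import dataclass

# Limits quoted in the post, per 30-day request cycle.
LIMITS_TB = {"cold": 5.0, "hot": 20.0}


@dataclass
class CycleUsage:
    """Hypothetical tally of data already requested in the current cycle."""
    cold_tb: float = 0.0
    hot_tb: float = 0.0


def can_request(usage: CycleUsage, size_tb: float, storage: str) -> bool:
    """Return True if a new request of size_tb stays within the cycle limit."""
    if storage not in LIMITS_TB:
        raise ValueError(f"storage must be one of {sorted(LIMITS_TB)}")
    if size_tb <= 0:
        raise ValueError("size_tb must be positive")
    already_used = usage.cold_tb if storage == "cold" else usage.hot_tb
    return already_used + size_tb <= LIMITS_TB[storage]


usage = CycleUsage(cold_tb=4.0, hot_tb=12.5)
fits_cold = can_request(usage, 1.0, "cold")   # 4.0 + 1.0 is exactly the cold limit
over_cold = can_request(usage, 1.5, "cold")   # 5.5 TB would exceed the cold limit
fits_hot = can_request(usage, 7.5, "hot")     # 12.5 + 7.5 is exactly the hot limit
```

Hot and cold storage are tracked separately here because the post gives each its own limit; how CDDS itself accounts for a cycle may differ.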

Continue reading “NIH’s Cloud Data Delivery Service: SRA Delivers Even More Big Data to your Cloud Bucket”

Fungal Disease Awareness Week: fungal pathogen data and literature at NCBI

This post is in support of the CDC’s Fungal Disease Awareness Week — September 20-24, 2021.

The impact of fungal diseases on human health has often been neglected, but increased association of fungal infections with severe illness and death during the COVID-19 pandemic has brought fungal diseases into the spotlight.

According to the CDC, the most common fungal co-infections in patients with COVID-19 include aspergillosis or invasive candidiasis, including healthcare-associated infection from Candida auris. Other reported diseases are mucormycosis, coccidioidomycosis, and cryptococcosis. Aspergillosis is commonly caused by Aspergillus fumigatus, mucormycosis by Rhizopus species, coccidioidomycosis by Coccidioides immitis and C. posadasii, and cryptococcosis by Cryptococcus neoformans.

This post explores several NCBI resources that have relevant information about the fungal pathogens implicated in these COVID-19 related illnesses.

Assembled genomes

Correctly identified and annotated genome assemblies available for the fungal taxa implicated in co-infections in COVID-19 patients are summarized in the table below. These and many other fungi are also available as curated RefSeq genome assemblies.

Continue reading “Fungal Disease Awareness Week: fungal pathogen data and literature at NCBI”

RefSeq Release 208 is available!

RefSeq release 208 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
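For readers who have not used E-utilities before, here is a minimal sketch of how a request for a single RefSeq record can be put together. It only builds the URL for the `efetch` endpoint and does not contact NCBI. The `efetch_url` helper is a hypothetical name chosen for this example, and the accession `NM_000546` is just an illustrative RefSeq transcript, not something specific to release 208.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession: str, db: str = "nuccore", rettype: str = "fasta") -> str:
    """Build an efetch URL that asks for one record as plain text."""
    params = {
        "db": db,            # Entrez database to query
        "id": accession,     # accession or UID of the record
        "rettype": rettype,  # requested record format
        "retmode": "text",   # plain-text response
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = efetch_url("NM_000546")
print(url)

# To actually download the record, something like the following should work,
# subject to NCBI's usage guidelines for request rates:
#
#   from urllib.request import urlopen
#   with urlopen(url) as response:
#       fasta_text = response.read().decode("utf-8")
```

Building the URL separately from fetching it makes it easy to inspect the request first, or to paste it into a browser.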

This full release incorporates genomic, transcript, and protein data available as of September 7, 2021, and contains 288,903,207 records, including 210,703,648 proteins, 40,213,945 RNAs, and sequences from 113,002 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Continue reading “RefSeq Release 208 is available!”

New in NCBI Datasets: Species pages and species browser

NCBI Datasets introduces species pages and species browser! The species pages summarize taxon information and provide access to genomic data, including reference genomes. For example, see Figure 1, the Nothobranchius furzeri (turquoise killifish) species page.

Figure 1: Nothobranchius furzeri species page. The browse species button will take you to the species browser. 

Continue reading “New in NCBI Datasets: Species pages and species browser”

Learn the best way to find data in NIH’s Sequence Read Archive (SRA) on the cloud

NCBI will present a workshop at the American Society of Human Genetics (ASHG) as part of the society’s conference activities in 2021. The workshop is scheduled for Wednesday, September 15, 2021.

Register now!

Adelaide Rhodes, Ph.D., from the Customer Experience team and Adam Stine, Ph.D., SRA Curator, will co-lead the workshop, which will introduce attendees to powerful metadata searches with BigQuery on Google Cloud Platform (GCP) and Athena on Amazon Web Services (AWS) to speed up analytic workflows using the NIH’s Sequence Read Archive (SRA).

Cloud-based query services with expanded metadata options for SRA help researchers find their target data more quickly than ever before. The workshop will be a mix of training in Structured Query Language (SQL), demos on the cloud console, and hands-on exercises in Jupyter notebooks with examples to help researchers understand how to build searches in SQL. Researchers who attend this workshop will learn how to extract specific datasets as well as how to conduct exploratory analysis of the entirety of the SRA data available in the cloud.

Both BigQuery and Athena are queried with SQL, but no prior SQL experience is required. By the end of this workshop, you will know how to run cloud metadata queries using SQL to find SRA data based on parameters that are of interest to you.
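To give a feel for what such a metadata query looks like, here is a small sketch that runs entirely on your own machine. It uses Python's built-in `sqlite3` module and a tiny in-memory table as a stand-in for the cloud metadata table. The table name, the column names (`acc`, `organism`, `assay_type`, `platform`, `mbases`), and the accessions are assumptions made up for this example, and the `?` placeholders are SQLite syntax; BigQuery and Athena have their own parameter conventions.

```python
import sqlite3

# In-memory stand-in for a run-level metadata table (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata ("
    "acc TEXT, organism TEXT, assay_type TEXT, platform TEXT, mbases INTEGER)"
)

# Made-up rows; these accessions do not refer to real SRA runs.
rows = [
    ("SRR0000001", "Homo sapiens", "WGS", "ILLUMINA", 90000),
    ("SRR0000002", "Homo sapiens", "RNA-Seq", "ILLUMINA", 4000),
    ("SRR0000003", "Mus musculus", "WGS", "PACBIO_SMRT", 60000),
    ("SRR0000004", "Homo sapiens", "WGS", "OXFORD_NANOPORE", 120000),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?, ?)", rows)

# Find whole-genome runs for one organism, largest first.
QUERY = """
SELECT acc, platform, mbases
FROM metadata
WHERE organism = ? AND assay_type = ?
ORDER BY mbases DESC
"""
hits = conn.execute(QUERY, ("Homo sapiens", "WGS")).fetchall()

for acc, platform, mbases in hits:
    print(acc, platform, mbases)

# Exploratory summary: how many runs per assay type?
counts = conn.execute(
    "SELECT assay_type, COUNT(*) FROM metadata "
    "GROUP BY assay_type ORDER BY assay_type"
).fetchall()
```

The same `SELECT ... WHERE ... ORDER BY` and `GROUP BY` patterns carry over to the cloud services; mainly the table reference and the way you submit the query change.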

Adam Stine, Ph.D., SRA Curator
Adelaide Rhodes, Ph.D., Customer Experience