Four new options to simplify your SARS-CoV-2 submissions

We have recently added several exciting improvements to the SARS-CoV-2 GenBank submission process based on community feedback. To save you time, NCBI completes feature annotation for you, which means SARS-CoV-2 GenBank submission only requires a FASTA file and source metadata. Here are other new features to ease and simplify your submission workflow.

Automatically remove failed sequences from a submission: On the web, a single click lets you opt in to automatic removal of failed sequences (Figure 1) so that the rest of your sequences can be accessioned without delay. A report provided after the submission lists your failed sequences and points out potential sequence problems so that you can take a closer look after your error-free sequences are released. This option is also available for submission via FTP.

Need to set up FTP submissions? The NCBI team is here to help. Contact gb-admin@ncbi.nlm.nih.gov.

Figure 1. GenBank submission page showing the option to remove sequences with processing errors.

Continue reading “Four new options to simplify your SARS-CoV-2 submissions”

NIH’s Cloud Data Delivery Service: SRA Delivers Even More Big Data to your Cloud Bucket

The Sequence Read Archive (SRA) is the National Institutes of Health’s (NIH) primary repository for raw, high-throughput sequencing data, containing both controlled- and open-access datasets that continue to grow exponentially. SRA is managed by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), and the data are available from NCBI’s servers as well as through two cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (GCP). Cloud access was made possible by support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

Due to SRA’s exponential growth in size, data in the cloud environment are currently partitioned in hot and cold storage to keep SRA sustainable and accessible. Per industry standards, data in hot storage are immediately accessible; because hot storage is more expensive to host, we make efforts to align this distribution method with our most frequently requested datasets. The less frequently requested datasets are available in cold storage, which may not be immediately accessible. Fret not! SRA is constantly evolving to meet our users’ needs. NCBI’s Cloud Data Delivery Service (CDDS) now allows you to get public and controlled-access data delivered from cold and hot storage directly to your chosen cloud bucket in just a few hours. The minor cost is currently handled by NCBI but certain limits apply; within a 30-day request cycle, users are able to request up to 5TB from cold storage and 20TB from hot storage to their cloud bucket.
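As a rough illustration of the request limits described above, the following Python sketch checks whether a new delivery request would still fit within a 30-day cycle. Only the 5 TB (cold) and 20 TB (hot) figures come from this post; the function names and the idea of tracking usage yourself are assumptions made for the example, not part of the CDDS interface.

```python
# Limits quoted in the post for one 30-day request cycle, in terabytes.
LIMITS_TB = {"cold": 5, "hot": 20}


def remaining_tb(tier, already_requested_tb):
    """Return how many TB can still be requested from a storage tier.

    `tier` is "cold" or "hot"; `already_requested_tb` is what you have
    requested so far in the current cycle (hypothetical bookkeeping).
    """
    return max(LIMITS_TB[tier] - already_requested_tb, 0)


def fits_quota(tier, already_requested_tb, new_request_tb):
    """Return True if the new request stays within the tier's limit."""
    return new_request_tb <= remaining_tb(tier, already_requested_tb)


print(fits_quota("cold", 3.0, 2.0))  # exactly reaches the 5 TB cold limit
print(fits_quota("cold", 3.0, 2.5))  # would exceed the 5 TB cold limit
```

A check like this can be useful before batching many accessions into one request, since cold and hot storage are counted separately.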

Continue reading “NIH’s Cloud Data Delivery Service: SRA Delivers Even More Big Data to your Cloud Bucket”

Fungal Disease Awareness Week: fungal pathogen data and literature at NCBI

This post is in support of the CDC’s Fungal Disease Awareness Week — September 20-24, 2021.

The impact of fungal diseases on human health has often been neglected, but increased association of fungal infections with severe illness and death during the COVID-19 pandemic has brought fungal diseases into the spotlight.

According to the CDC, the most common fungal co-infections in patients with COVID-19 include aspergillosis and invasive candidiasis, including healthcare-associated infections caused by Candida auris. Other reported diseases are mucormycosis, coccidioidomycosis, and cryptococcosis. Aspergillosis is commonly caused by Aspergillus fumigatus, mucormycosis by Rhizopus species, coccidioidomycosis by Coccidioides immitis and C. posadasii, and cryptococcosis by Cryptococcus neoformans.

This post explores several NCBI resources that have relevant information about the fungal pathogens implicated in these COVID-19 related illnesses.

Assembled genomes

Correctly identified and annotated genome assemblies available for the fungal taxa implicated as co-infections in COVID-19 patients are summarized in the table below. These and many other fungi are also available as curated RefSeq genome assemblies.

Continue reading “Fungal Disease Awareness Week: fungal pathogen data and literature at NCBI”

RefSeq Release 208 is available!

RefSeq release 208 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
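For readers who want to retrieve RefSeq records programmatically, here is a minimal Python sketch that builds an E-utilities EFetch request URL. It only constructs the URL and does not contact NCBI. The accession `NM_000546` is used purely as an example, and the helper name `efetch_url` is an assumption for illustration.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL that requests one record as plain text.

    `accession` is any sequence accession; `db` and `rettype` default to
    a nucleotide record in FASTA format.
    """
    params = {
        "db": db,
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Example accession chosen for illustration only.
print(efetch_url("NM_000546"))
```

The resulting URL can be passed to any HTTP client. If you send many requests, follow NCBI's usage guidelines for E-utilities.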

This full release incorporates genomic, transcript, and protein data available as of September 7, 2021, and contains 288,903,207 records, including 210,703,648 proteins, 40,213,945 RNAs, and sequences from 113,002 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Continue reading “RefSeq Release 208 is available!”

New in NCBI Datasets: Species pages and species browser

NCBI Datasets introduces species pages and a species browser! The species pages summarize taxon information and provide access to genomic data, including reference genomes. For example, see Figure 1, the Nothobranchius furzeri (turquoise killifish) species page.

Figure 1: Nothobranchius furzeri species page. The browse species button will take you to the species browser. 

Continue reading “New in NCBI Datasets: Species pages and species browser”

Learn the best way to find data in NIH’s Sequence Read Archive (SRA) on the cloud

NCBI will present a workshop at the American Society of Human Genetics (ASHG) 2021 meeting as part of its conference activities. The workshop is scheduled for Wednesday, September 15, 2021.

Register now!

Adelaide Rhodes, Ph.D., from the Customer Experience team and Adam Stine, Ph.D., SRA Curator, will co-lead the workshop, which will introduce attendees to powerful metadata searches using BigQuery on Google Cloud Platform (GCP) and Athena on Amazon Web Services (AWS) to speed up analytic workflows that use NIH’s Sequence Read Archive (SRA).

Cloud-based query services with expanded metadata options for SRA help researchers to find the target data more quickly than ever before. The workshop will be a mix of training in Structured Query Language (SQL), demos on the cloud console and hands-on exercises in Jupyter notebooks with examples to help researchers understand how to build searches in SQL. Researchers who attend this workshop will learn how to extract specific data sets as well as how to conduct exploratory analysis of the entirety of the SRA data available in the cloud.

Both BigQuery and Athena use SQL, but no prior SQL experience is required. By the end of this workshop, you will know how to run cloud metadata queries using SQL to find SRA data based on parameters that are of interest to you.
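To give a feel for the kind of metadata query the workshop covers, here is a small Python sketch that runs a SQL search against a toy, in-memory table using the standard `sqlite3` module. The table name `metadata`, its columns (`acc`, `organism`, `assay_type`, `platform`), and the sample rows are all assumptions invented for this example. The real cloud tables live in BigQuery and Athena, which have their own client libraries and parameter syntax.

```python
import sqlite3

# Toy stand-in for a cloud metadata table. Accessions and column names
# are hypothetical and exist only for this illustration.
ROWS = [
    ("SRR0000001", "Homo sapiens", "WGS", "ILLUMINA"),
    ("SRR0000002", "Homo sapiens", "RNA-Seq", "ILLUMINA"),
    ("SRR0000003", "Mus musculus", "WGS", "OXFORD_NANOPORE"),
    ("SRR0000004", "Homo sapiens", "WGS", "PACBIO_SMRT"),
]

# Select runs for one organism and one assay type.
QUERY = """
SELECT acc, platform
FROM metadata
WHERE organism = ? AND assay_type = ?
ORDER BY acc
"""


def find_runs(conn, organism, assay_type):
    """Return (accession, platform) pairs matching the given filters."""
    return conn.execute(QUERY, (organism, assay_type)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata "
    "(acc TEXT, organism TEXT, assay_type TEXT, platform TEXT)"
)
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", ROWS)

hits = find_runs(conn, "Homo sapiens", "WGS")
print(hits)
```

The same `SELECT ... WHERE ...` pattern carries over to the cloud query services; only the connection code and the table reference change.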

Adam Stine, Ph.D., SRA Curator
Adelaide Rhodes, Ph.D., Customer Experience

 

Sept 22 Webinar: Using NCBI Datasets command-line tools to access data and metadata for genomes

Join us on September 22, 2021, at 12 PM Eastern Time to learn how to use the datasets command-line tools (datasets and dataformat) to access, filter, download, and format data and metadata for genomes. Through examples from eukaryotes and the SARS-CoV-2 coronavirus, you will see how to use metadata to filter for genome sequences with desired properties, such as genomes with high contig N50 values.
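For readers unfamiliar with contig N50, the following Python sketch shows how the statistic is computed from a list of contig lengths and how it could be used as a filter. This is a generic illustration of the metric, not the datasets tool's own code; the assembly names and contig lengths are made up.

```python
def contig_n50(lengths):
    """Return the contig N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def filter_by_n50(assemblies, min_n50):
    """Return names of assemblies whose contig N50 is at least `min_n50`."""
    return [
        name
        for name, lengths in assemblies.items()
        if contig_n50(lengths) >= min_n50
    ]


# Hypothetical assemblies with the same total length (290) but
# very different contiguity.
assemblies = {
    "asm_a": [80, 70, 50, 40, 30, 20],
    "asm_b": [10] * 29,
}
print(contig_n50(assemblies["asm_a"]))  # 70
print(filter_by_n50(assemblies, 50))    # ['asm_a']
```

Two assemblies of equal size can differ widely in N50, which is why it is a common proxy for assembly contiguity.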

  • Date and time: Wed, September 22, 2021 12:00 PM – 12:45 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI webinars playlist on the NLM YouTube channel. You can learn about future webinars on the Webinars and Courses page.

NCBI’s GI sequence identifiers will soon exceed 32-bit numbers. Are you and your software ready?

In 2016, NCBI announced that it was curtailing its display of its numeric ‘GI’ in popular sequence data formats such as FASTA and GenBank flatfiles. Due to the continued growth of GenBank, NCBI will soon begin assigning GIs exceeding the signed 32-bit threshold of 2,147,483,647 for those remaining sequence types that still receive these identifiers.

NCBI has updated products including the Entrez system, GenBank (Nucleotide), BLAST™, and the C++ Toolkit to prepare for that moment by upgrading GI-related code and APIs to accept 64-bit integers. This changeover is projected for late 2021. Stay tuned for additional communications from NCBI and take note of the following information if you think you may be impacted.
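To see why software may need updating, consider this small Python sketch. It shows that a GI one past the signed 32-bit maximum of 2,147,483,647 cannot be stored in a 32-bit field but round-trips cleanly through a 64-bit one. The helper names are assumptions for illustration; your own code may store GIs in database columns, binary files, or typed arrays where the same limit applies.

```python
import struct

INT32_MAX = 2_147_483_647  # signed 32-bit maximum


def fits_int32(gi):
    """Return True if the GI fits in a signed 32-bit integer."""
    return -INT32_MAX - 1 <= gi <= INT32_MAX


def pack_gi(gi):
    """Serialize a GI as a little-endian signed 64-bit integer."""
    return struct.pack("<q", gi)


def unpack_gi(data):
    """Read back a GI written by pack_gi."""
    return struct.unpack("<q", data)[0]


big_gi = INT32_MAX + 1  # first value past the 32-bit threshold

# Packing into a 32-bit field ("<i") fails for the larger GI.
try:
    struct.pack("<i", big_gi)
    overflowed = False
except struct.error:
    overflowed = True

print(fits_int32(big_gi))                    # False
print(overflowed)                            # True
print(unpack_gi(pack_gi(big_gi)) == big_gi)  # True
```

Python's own integers are not limited to 32 bits, so problems typically show up at boundaries such as fixed-width binary formats, database column types, or code in languages with fixed-size integers.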

For a seamless transition, all organizations and developers using our products should review software for any remaining reliance on GIs and compatibility with these larger identifiers. Note that this update requires no changes to submission procedures or assignment of accessions.

Continue reading “NCBI’s GI sequence identifiers will soon exceed 32-bit numbers. Are you and your software ready?”

PubMed Central Article Datasets are Now Available on the Cloud

To enhance machine access to biomedical literature and drive impactful analyses and reuse, the National Library of Medicine (NLM) is pleased to announce the availability of the PubMed Central (PMC) Article Datasets on Amazon Web Services (AWS) Registry of Open Data as part of AWS’s Open Data Sponsorship Program (ODP). These datasets collectively span 4 million of PMC’s 7 million (total) full-text scientific articles.

Figure 1. NCBI PMC Article Datasets on Registry of Open Data on AWS.

Continue reading “PubMed Central Article Datasets are Now Available on the Cloud”