We are excited to announce our new NCBI Virtual Workshop Series! This workshop series aims to engage and educate people who use NCBI resources for their biological/biomedical research, science education, and clinical application efforts.
We have recently added several exciting improvements to the SARS-CoV-2 GenBank submission process based on community feedback. To save you time, NCBI completes feature annotation for you, which means SARS-CoV-2 GenBank submission only requires a FASTA file and source metadata. Here are other new features to ease and simplify your submission workflow.
Automatically remove failed sequences from a submission: On the web, a single click lets you opt in to automatic removal of failed sequences (Figure 1) so that the rest of your sequences can be swiftly accessioned! A report provided after the submission lists your failed sequences and points out potential sequence problems so that you can take a closer look after your error-free sequences are released. This option is also available for submission via FTP.
Need to set up FTP submissions? The NCBI team is here to help. Contact firstname.lastname@example.org.
Figure 1. GenBank submission page showing the option to remove sequences with processing errors.
The Sequence Read Archive (SRA) is the National Institutes of Health’s (NIH) primary repository for raw, high-throughput sequencing data, containing both controlled- and open-access datasets that continue to grow exponentially. SRA is managed by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), and the data are available from NCBI’s servers as well as through cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (GCP). Cloud access was made possible by support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
Due to SRA’s exponential growth in size, data in the cloud environment are currently partitioned into hot and cold storage to keep SRA sustainable and accessible. Per industry standards, data in hot storage are immediately accessible; because hot storage is more expensive to host, we try to keep our most frequently requested datasets there. The less frequently requested datasets are available in cold storage, which may not be immediately accessible. Fret not! SRA is constantly evolving to meet our users’ needs. NCBI’s Cloud Data Delivery Service (CDDS) now allows you to get public and controlled-access data delivered from cold and hot storage directly to your chosen cloud bucket in just a few hours. The minor cost is currently handled by NCBI, but certain limits apply: within a 30-day request cycle, users can request up to 5 TB from cold storage and 20 TB from hot storage to their cloud bucket.
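To make the request limits concrete, here is a minimal Python sketch that checks a planned set of deliveries against the 5 TB cold and 20 TB hot limits quoted above. The `DeliveryRequest` class and both function names are hypothetical, invented for this illustration; they are not part of CDDS or any NCBI tool, and the sketch assumes you already know the size and storage tier of each dataset.

```python
from dataclasses import dataclass

# Limits quoted in the post, per 30-day request cycle (in terabytes).
COLD_LIMIT_TB = 5.0
HOT_LIMIT_TB = 20.0


@dataclass
class DeliveryRequest:
    """Hypothetical record of one planned delivery (not an NCBI structure)."""
    accession: str
    size_tb: float
    storage: str  # "hot" or "cold"


def remaining_quota(requests):
    """Return (cold_remaining_tb, hot_remaining_tb) for one 30-day cycle."""
    cold_used = sum(r.size_tb for r in requests if r.storage == "cold")
    hot_used = sum(r.size_tb for r in requests if r.storage == "hot")
    return COLD_LIMIT_TB - cold_used, HOT_LIMIT_TB - hot_used


def fits_in_cycle(requests):
    """True when neither storage tier goes over its limit."""
    cold_left, hot_left = remaining_quota(requests)
    return cold_left >= 0 and hot_left >= 0
```

A negative value from `remaining_quota` shows how far a plan overshoots a tier, which can help decide which requests to defer to the next cycle.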
This post is in support of the CDC’s Fungal Disease Awareness Week — September 20-24, 2021.
The impact of fungal diseases on human health has often been neglected, but increased association of fungal infections with severe illness and death during the COVID-19 pandemic has brought fungal diseases into the spotlight.
According to the CDC, the most common fungal co-infections in patients with COVID-19 include aspergillosis or invasive candidiasis, including healthcare-associated infection from Candida auris. Other reported diseases are mucormycosis, coccidioidomycosis, and cryptococcosis. Aspergillosis is commonly caused by Aspergillus fumigatus, mucormycosis by Rhizopus species, coccidioidomycosis by Coccidioides immitis and C. posadasii, and cryptococcosis by Cryptococcus neoformans.
This post explores several NCBI resources that have relevant information about the fungal pathogens implicated in these COVID-19 related illnesses.
Correctly identified and annotated genome assemblies available for the fungal taxa implicated in co-infections in COVID-19 patients are summarized in the table below. These and many other fungi are also available as curated RefSeq genome assemblies.
RefSeq Release 208 is available! This full release incorporates genomic, transcript, and protein data available as of September 7, 2021, and contains 288,903,207 records, including 210,703,648 proteins, 40,213,945 RNAs, and sequences from 113,002 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.
NCBI Datasets introduces species pages and species browser! The species pages summarize taxon information and provide access to genomic data, including reference genomes. For example, see Figure 1, the Nothobranchius furzeri (turquoise killifish) species page.
Figure 1: Nothobranchius furzeri species page. The browse species button will take you to the species browser.
NCBI will present a workshop at the American Society of Human Genetics (ASHG) as part of their conference activities in 2021. The workshop is scheduled for Wednesday, September 15, 2021.
Adelaide Rhodes, Ph.D., from the Customer Experience team and Adam Stine, SRA Curator, will co-lead the workshop, which will introduce attendees to powerful metadata searches on BigQuery on Google Cloud Platform (GCP) and Athena on Amazon Web Services (AWS) to speed up analytic workflows using the NIH’s Sequence Read Archive (SRA).
Cloud-based query services with expanded metadata options for SRA help researchers to find the target data more quickly than ever before. The workshop will be a mix of training in Structured Query Language (SQL), demos on the cloud console and hands-on exercises in Jupyter notebooks with examples to help researchers understand how to build searches in SQL. Researchers who attend this workshop will learn how to extract specific data sets as well as how to conduct exploratory analysis of the entirety of the SRA data available in the cloud.
Both BigQuery and Athena use SQL, but no prior SQL experience is required. By the end of this workshop, you will know how to run cloud metadata queries using SQL to find SRA data based on parameters that are of interest to you.
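As a rough idea of what such a metadata query can look like, the following Python sketch assembles a BigQuery-style SQL string. It is illustrative only and is not taken from the workshop materials: the function name is hypothetical, and the table name `nih-sra-datastore.sra.metadata` and the columns `acc`, `organism`, `assay_type`, and `mbases` are assumptions that should be checked against the current SRA cloud metadata schema. The code only builds the query text; it does not connect to any cloud service.

```python
def build_sra_metadata_query(organism, assay_type, limit=10,
                             table="nih-sra-datastore.sra.metadata"):
    """Build a BigQuery-style SQL string for an SRA metadata search.

    The table and column names are assumptions for illustration only.
    """
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")

    def quote(value):
        # Keep the sketch simple: refuse characters that would need escaping.
        # Real code should use the query-parameter support of the client library.
        if "'" in value or "\\" in value:
            raise ValueError("quotes and backslashes are not supported here")
        return "'" + value + "'"

    return (
        "SELECT acc, organism, assay_type, mbases\n"
        f"FROM `{table}`\n"
        f"WHERE organism = {quote(organism)}\n"
        f"  AND assay_type = {quote(assay_type)}\n"
        "ORDER BY mbases DESC\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Print a query for whole-genome sequencing runs of one organism.
    print(build_sra_metadata_query("Candida auris", "WGS", limit=5))
```

The printed text can be pasted into a cloud console query editor. Athena quotes table names differently from BigQuery, so the `FROM` clause would need adjusting there.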
Join us on September 22, 2021, at 12 PM Eastern Time to learn to use the datasets command-line tools (datasets and dataformat) to access, filter, download, and format data and metadata for genomes. Through examples from eukaryotes and the SARS-CoV-2 coronavirus, you will see how to use metadata to filter for genome sequences with desired properties such as genomes with high contig N50 values.
- Date and time: Wed, September 22, 2021 12:00 PM – 12:45 PM EDT
After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI webinars playlist on the NLM YouTube channel. You can learn about future webinars on the Webinars and Courses page.
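For readers unfamiliar with the contig N50 statistic mentioned in the webinar description, here is a small Python sketch of how it is computed and how a minimum-N50 filter might work. It does not call the datasets or dataformat tools. The function names and the dictionary layout (assembly label mapped to a list of contig lengths) are hypothetical choices for this example.

```python
def contig_n50(lengths):
    """Return the contig N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    Returns 0 for an empty list."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def filter_by_n50(assemblies, min_n50):
    """Keep assemblies whose contig N50 is at least min_n50.

    `assemblies` maps a label to a list of contig lengths (hypothetical layout).
    Returns a dict of label -> N50 for the assemblies that pass.
    """
    n50s = {name: contig_n50(lengths) for name, lengths in assemblies.items()}
    return {name: n50 for name, n50 in n50s.items() if n50 >= min_n50}
```

In practice, the N50 values would come from the assembly metadata rather than from raw contig lengths; the calculation is shown only to clarify what a "high contig N50" filter selects for.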
In 2016, NCBI announced that it was curtailing its display of its numeric ‘GI’ in popular sequence data formats such as FASTA and GenBank flatfiles. Due to the continued growth of GenBank, NCBI will soon begin assigning GIs exceeding the signed 32-bit threshold of 2,147,483,647 for those remaining sequence types that still receive these identifiers.
NCBI has updated products including the Entrez system, GenBank (Nucleotide), BLAST™, and the C++ Toolkit to prepare for that moment by upgrading GI-related code and APIs to accept 64-bit integers. This changeover is projected for late 2021. Stay tuned for additional communications from NCBI and take note of the following information if you think you may be impacted.
For a seamless transition, all organizations and developers using our products should review software for any remaining reliance on GIs and compatibility with these larger identifiers. Note that this update requires no changes to submission procedures or assignment of accessions.
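One way to see why software needs reviewing is to store a GI just past the 32-bit threshold in a fixed-width field. The Python sketch below is illustrative only: the function names are hypothetical and the GI values are made-up numbers chosen to sit at the boundary, not real sequence identifiers. It uses the standard-library `struct` module to contrast signed 32-bit and 64-bit storage.

```python
import struct

INT32_MAX = 2**31 - 1  # 2,147,483,647, the signed 32-bit threshold


def fits_int32(gi):
    """True if the GI can be stored in a signed 32-bit integer."""
    return -(2**31) <= gi <= INT32_MAX


def pack_gi64(gi):
    """Store a GI as a signed 64-bit little-endian value (8 bytes)."""
    return struct.pack("<q", gi)


def unpack_gi64(data):
    """Read back a GI stored with pack_gi64."""
    return struct.unpack("<q", data)[0]


if __name__ == "__main__":
    for gi in (INT32_MAX, INT32_MAX + 1):  # made-up boundary values
        try:
            struct.pack("<i", gi)  # signed 32-bit field
            status = "fits in 32 bits"
        except struct.error:
            status = "overflows a 32-bit field"
        print(gi, status, "| 64-bit round trip ok:",
              unpack_gi64(pack_gi64(gi)) == gi)
```

Similar checks apply wherever GIs pass through fixed-width types, such as database integer columns, binary file formats, or `int` variables in compiled languages.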
To enhance machine access to biomedical literature and drive impactful analyses and reuse, the National Library of Medicine (NLM) is pleased to announce the availability of the PubMed Central (PMC) Article Datasets on Amazon Web Services (AWS) Registry of Open Data as part of AWS’s Open Data Sponsorship Program (ODP). These datasets collectively span 4 million of PMC’s 7 million (total) full-text scientific articles.