The National Center for Biotechnology Information (NCBI) has several speakers at the upcoming Biodiversity Genomics Conference from September 27 to October 1, 2021.
Valerie Schneider, head of NCBI’s SeqPlus Program and Deputy Director for Sequence Offerings, will present a poster discussing how NCBI’s new comparative genome research focus will enable researchers to explore all eukaryotic research organisms, find related organisms and support additional organism-specific resources that a specific community may have or wish to develop.
Nuala O’Leary, Product Owner, NCBI Datasets will present the latest developments for Datasets, a beta resource that supports intuitive and flexible access to genome data for a broad range of taxa via a redesigned website and command-line tools.
Adelaide Rhodes, Cloud Subject Matter Expert in Education, will present two case studies that emphasize the ease of navigating the new Datasets website as well as the use of command line tools to speed up data discovery for genes and genomes of interest.
Terence Murphy, Product Owner, NCBI RefSeq will present a new tool for genome providers to identify contamination in newly assembled sequences with high sensitivity, specificity, and performance.
The Biodiversity Genomics Conference brings together a global audience to celebrate achievements in genome sequencing across the eukaryotic tree of life, explore current challenges and solutions, and to develop strategies for sequencing and data sharing in the upcoming decade of biodiversity genomics. NCBI has several programs that support the needs of this scientific research group.
We are excited to announce our new NCBI Virtual Workshop Series! This workshop series aims to engage and educate people who use NCBI resources for their biological/biomedical research, science education, and clinical application efforts.
We have recently added several exciting improvements to the SARS-CoV-2 GenBank submission process based on community feedback. To save you time, NCBI completes feature annotation for you, which means SARS-CoV-2 GenBank submission only requires a FASTA file and source metadata. Here are other new features to ease and simplify your submission workflow.
Automatically remove failed sequences from a submission: On the web, a single click lets you opt-in to automatic removal of failed sequences (Figure 1) so that the rest of your sequences can be swiftly accessioned! A report provided after the submission lists your failed sequences and points out potential sequence problems so that you can take a closer look after your error-free sequences are released. This option is also available for submission via FTP.
The Sequence Read Archive (SRA) is the National Institute of Health’s (NIH) primary repository for raw, high-throughput sequencing data, containing both controlled- and open-access datasets that continue to grow exponentially. SRA is managed by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), and the data are available from NCBI’s servers as well as through cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (GCP). Cloud access was made possible by support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
Due to SRA’s exponential growth in size, data in the cloud environment are currently partitioned in hot and cold storage to keep SRA sustainable and accessible. Per industry standards, data in hot storage are immediately accessible; because hot storage is more expensive to host, we make efforts to align this distribution method with our most frequently requested datasets. The less frequently requested datasets are available in cold storage, which may not be immediately accessible. Fret not! SRA is constantly evolving to meet our users’ needs. NCBI’s Cloud Data Delivery Service (CDDS) now allows you to get public and controlled-access data delivered from cold and hot storage directly to your chosen cloud bucket in just a few hours. The minor cost is currently handled by NCBI but certain limits apply; within a 30-day request cycle, users are able to request up to 5TB from cold storage and 20TB from hot storage to their cloud bucket.
The impact of fungal diseases on human health has often been neglected, but increased association of fungal infections with severe illness and death during the COVID-19 pandemic has brought fungal diseases into the spotlight.
According to the CDC, the most common fungal co-infections in patients with COVID-19 include aspergillosis or invasive candidiasis including healthcare-associated infection from Candida auris. Other reported diseases are mucormycosis, coccidioidomycosis and cryptococcosis. Aspergillosis is commonly caused by Aspergillus fumigatus, mucormycosis by Rhizopus species, coccidioidomycosis by Coccidioides immitis and C. posadasii and cryptococcosis by Cryptococcus neoformans.
This post explores several NCBI resources that have relevant information about the fungal pathogens implicated in these COVID-19 related illnesses.
Correctly identified and annotated genome assemblies are available for the fungal taxa implicated as co-infections in COVID-19 patients are summarized in table below. These and many other fungi are also available as curated RefSeq genome assemblies.
RefSeq release 208 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.
This full release incorporates genomic, transcript, and protein data available as of September 7, 2021, and contains 288,903,207 records, including 210,703,648 proteins, 40,213,945 RNAs, and sequences from 113,002 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings. Continue reading “RefSeq Release 208 is available!”→
NCBI Datasets introduces species pages and species browser! The species pages summarize taxon information and provide access to genomic data, including reference genomes. For example, see Figure 1, the Nothobranchius furzeri (turquoise killifish) species page.
Figure 1: Nothobranchius furzeri species page. The browse species button will take you to the species browser.
Adelaide Rhodes, Ph.D. from the Customer Experience team and Adam Stine, SRA Curator will co-lead the workshop, which will introduce attendees to powerful metadata searches on BigQuery on Google Cloud Platform (GCP) and Athena on Amazon Web Services (AWS) to speed up analytic workflows using the NIH’s Sequence Read Archive (SRA).
Cloud-based query services with expanded metadata options for SRA help researchers to find the target data more quickly than ever before. The workshop will be a mix of training in Structured Query Language (SQL), demos on the cloud console and hands-on exercises in Jupyter notebooks with examples to help researchers understand how to build searches in SQL. Researchers who attend this workshop will learn how to extract specific data sets as well as how to conduct exploratory analysis of the entirety of the SRA data available in the cloud.
Both BigQuery and Athena require SQL but no prior SQL experience is required. By the end of this workshop you will know how to run cloud metadata queries using SQL to find SRA data based on parameters that are of interest to you.
Join us on September 22, 2021 at 12PM eastern time learn to use the datasets command-line tools (datasets and dataformat) to access, filter, download, and format data and metadata for genomes. Through examples from eukaryotes and the SARS-CoV-2 coronavirus, you will see how to use metadata to filter for genome sequences with desired properties such as genomes with high contig N50 values.
Date and time: Wed, September 22, 2021 12:00 PM – 12:45 PM EDT
In 2016, NCBI announced that it was curtailing its display of its numeric ‘GI’ in popular sequence data formats such as FASTA and GenBank flatfiles. Due to the continued growth of GenBank, NCBI will soon begin assigning GIs exceeding the signed 32-bit threshold of 2,147,483,647 for those remaining sequence types that still receive these identifiers.
NCBI has updated products including Entrez system, GenBank (Nucleotide), BLAST™ and the C++ Toolkit to prepare for that moment by upgrading GI-related code and APIs to accept 64-bit integers. This change over is projected for late 2021. Stay tuned for additional communications from NCBI and take note of the following information if you think you may be impacted.