The impact of fungal diseases on human health has often been neglected, but increased association of fungal infections with severe illness and death during the COVID-19 pandemic has brought fungal diseases into the spotlight.
According to the CDC, the most common fungal co-infections in patients with COVID-19 include aspergillosis and invasive candidiasis, including healthcare-associated infection with Candida auris. Other reported diseases are mucormycosis, coccidioidomycosis and cryptococcosis. Aspergillosis is commonly caused by Aspergillus fumigatus, mucormycosis by Rhizopus species, coccidioidomycosis by Coccidioides immitis and C. posadasii, and cryptococcosis by Cryptococcus neoformans.
This post explores several NCBI resources that have relevant information about the fungal pathogens implicated in these COVID-19 related illnesses.
Correctly identified and annotated genome assemblies for the fungal taxa implicated in co-infections in COVID-19 patients are summarized in the table below. These and many other fungi are also available as curated RefSeq genome assemblies.
During the COVID-19 pandemic, an often-heard refrain in the arena of public health was “Testing, testing, testing!”. Testing for the presence of the SARS-CoV-2 virus in patients with symptoms or potential exposure, or for the presence of antibodies to the virus in patients who had recovered from the disease, took on vital importance in efforts to curb its spread. Last fall, the NIH Genetic Testing Registry (GTR) expanded its scope to include molecular and serology tests for microorganisms impacting human health and disease. It now contains 70+ tests for COVID-19.
There are 54 molecular genetic tests that detect viral RNA from individual samples or pools using nucleic acid amplification technologies. While most of the tests detect the SARS-CoV-2 viral RNA alone, 8 tests detect multiple bacterial or viral markers as part of a panel. Two tests detect viral variants in a targeted variant analysis of the whole viral genome. Sixteen serologic tests detect antibodies to SARS-CoV-2.
Interested in human genes involved in COVID-19 biology? NCBI’s RefSeq group has been hard at work compiling a set of human genes with roles in coronavirus infection and disease. You can now see and search for these genes and their regulatory elements in NCBI Gene and RefSeq.
Figure 1. Top section of the human ACE2 record in the Gene database. COVID-19 information can be found in the Summary and Annotation information sections.
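For readers who prefer to search programmatically, here is a minimal sketch of how a Gene search could be expressed as an E-utilities ESearch URL. The helper name gene_search_url is hypothetical, and the choice of the [Gene Name] and [Organism] field tags and JSON output is an assumption about what a typical query would use, not a description of how the RefSeq group compiled its gene set. The code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(symbol, organism="Homo sapiens"):
    """Build an ESearch URL for one gene symbol in the Gene database.

    The field tags used here are an assumption about a reasonable query;
    adjust them to match the search you actually want to run.
    """
    term = f"{symbol}[Gene Name] AND {organism}[Organism]"
    params = {"db": "gene", "term": term, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Two of the host genes named in this post.
    for symbol in ("ACE2", "TMPRSS2"):
        print(gene_search_url(symbol))
```

Opening one of the printed URLs in a browser should return the matching Gene identifiers, which can then be looked up in the Gene database.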
The COVID-19 pandemic has drawn attention to the human host genes associated with SARS-CoV-2 entry and to the elements that regulate expression of these genes. At NCBI, we have prioritized curation of experimentally validated regulatory elements for these genes in the RefSeq Functional Elements project. Our annotations include several enhancers, promoters, cis-regulatory elements and protein binding sites, among other feature types. We have annotated 236 regulatory features for 27 distinct biological regions in the latest human Annotation Release (109.20200522), including regulatory elements for the ABO, ACE2, ANPEP, CD209, CLEC4G, CLEC4M, CTSL, DPP4, and TMPRSS2 genes.
You can view our regulatory element to target gene linkages in the regulatory interactions track using our new track hub that we recently announced. You can also see the biological regions and features tracks. These have functional and descriptive metadata, including biological region summaries, experimental evidence types, publication support and more.
The example in Figure 1 shows RefSeq Functional Element feature annotation in NCBI’s Genome Data Viewer (GDV) for the region of the ABO gene (GRCh38, NW_009646201.1: 73,864-103,789), the gene that determines the human ABO blood group. A genome-wide association study recently identified non-coding ABO variants associated with COVID-19 disease severity (PMID:32558485), which map to some of the RefSeq Functional Elements in this region.
Figure 1. The human ABO gene region in the NCBI GDV displaying the RefSeq Functional Element features. The biological regions aggregate track shows underlying feature annotation for an ABO upstream enhancer (LOC112637023), promoter region (LOC112679202), +5.8 intron 1 enhancer (LOC112679198), a 3′ regulatory region (LOC112639999), and a +36.0 downstream enhancer (LOC112637025). Functional Element features include numerous enhancers, promoters, cis-regulatory elements and protein / transcription factor binding sites.
We have more information about RefSeq Functional Elements on our website, including data download and extraction options. Stay tuned to NCBI Insights and other NCBI social media for future announcements about RefSeq Functional Elements!
The National Library of Medicine and its partners in the International Nucleotide Sequence Database Collaboration (INSDC) have joined together to issue a statement encouraging the scientific community to submit their SARS-CoV-2 sequences to INSDC databases. The databases offer broad open access and integrated data, literature and tools – features that we believe are critical as the research community works together to understand and combat COVID-19. Read the full statement below.
The databases of the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) capture, organize, preserve and present nucleotide sequence data as part of the open scientific record. INSDC member institutions – the EMBL European Bioinformatics Institute (EMBL-EBI), the NIG DNA Data Bank of Japan (NIG-DDBJ) and the National Library of Medicine’s National Center for Biotechnology Information at NIH (NCBI) – are committed to the continued delivery of this critical element of scientific infrastructure.
The global COVID-19 crisis has brought an urgent need for the rapid open sharing of data relating to the outbreak. Most importantly, access to sequence data from the SARS-CoV-2 viral genome is essential for our understanding of the biology and spread of COVID-19. To aid in that effort, all three INSDC members have prioritized processing of SARS-CoV-2 sequence data and have streamlined the submission process.
Availability of data through INSDC databases provides:
Rapid open access – INSDC quickly makes submitted data freely available to everyone, without restrictions on reuse
Linkage of raw sequence read data to genome assemblies, providing researchers with the ability to validate the integrity of assemblies and investigate asserted mutations and changes in genome sequences
Integration of SARS-CoV-2 sequences with the entirety of INSDC data, including genome sequences of related coronaviruses, enabling comparison across species
Linkage of sequences to the published literature
Tools – INSDC partners provide integrated data analysis tools, such as BLAST, enhancing the discovery process
In support of the global response to the COVID-19 crisis, the INSDC calls upon the research community to:
Submit raw SARS-CoV-2 data to the databases of the INSDC
Submit consensus/assembled SARS-CoV-2 data to the databases of the INSDC
Provide information relating to the sequenced isolate or sample as part of the sequence submission; minimally the time and place of isolation/sampling and an isolate/sample identifier should be provided to maximize the value of the sequences.
In cases where scientists have already established submissions to other databases, these submissions should continue in parallel to the INSDC submission
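The minimal-metadata request above (time and place of isolation or sampling, plus an isolate or sample identifier) can be checked before a record is submitted. The sketch below is only an illustration: the field names isolate, collection_date and country are hypothetical labels chosen for this example, not the exact attribute names that any INSDC submission system requires.

```python
# Hypothetical field names standing in for the three minimal items the
# statement asks for: an identifier, a sampling time, and a sampling place.
REQUIRED_FIELDS = ("isolate", "collection_date", "country")


def missing_minimal_metadata(record):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


if __name__ == "__main__":
    # Made-up example record with no collection date.
    sample = {"isolate": "example-isolate-01", "country": "USA"}
    print(missing_minimal_metadata(sample))
```

An empty list means the record carries all three minimal items. Any names that are returned identify what still needs to be filled in.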
The integration of INSDC databases with the global bioinformatics data infrastructure, including tools, secondary databases, compute capacity and curation processes, assures the rapid dissemination of data and drives its maximal impact.
In addition to these fundamental roles of INSDC member institutions in the sharing of viral sequence data, each institution has rapidly established COVID-19-specific programs and resources: the European COVID-19 Data Platform from EMBL-EBI, the DDBJ’s Research Data Resources on New Coronavirus and the NCBI SARS-CoV-2 Resources. These resources both demonstrate the connectedness of INSDC databases to broader bioinformatics initiatives and serve to add immediate value to COVID-19 research.
Guy Cochrane (EMBL-EBI), Ilene Karsch-Mizrachi (NCBI-NLM-NIH), & Masanori Arita (DDBJ) on behalf of the International Nucleotide Sequence Database Collaboration
While searching for SARS-CoV-2 sequences, have you longed for a COVID-focused SRA dataset? Great news — now there is one! We are happy to announce the addition of COVID-focused datasets (including source and normalized SRA file formats) to the AWS Public Dataset Program. These data can now be explored at the Registry of Open Data on AWS.
Researchers can now access more than 13K SRA runs that include Coronaviridae (CoV) content, identified with the SRA Taxonomy Analysis Tool, which uses a k-mer-based approach to identify organismal content.
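To give a feel for what "k-mer-based" means, here is a toy sketch that flags reads sharing at least one k-mer with a reference sequence. It illustrates the general idea only. It is not the algorithm used by the SRA Taxonomy Analysis Tool, and the sequences, the function names and the value k=8 are all made up for this example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_reads_hit(reads, reference, k=8):
    """Fraction of reads sharing at least one k-mer with the reference."""
    if not reads:
        return 0.0
    ref_kmers = kmers(reference, k)
    hits = sum(1 for read in reads if kmers(read, k) & ref_kmers)
    return hits / len(reads)


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
    reads = [reference[5:20], "TTTTTTTTTTTTTTTT"]
    print(fraction_reads_hit(reads, reference))
```

Real tools work with much longer k-mers and with large indexed dictionaries that map k-mers to taxa, but the core step is the same: break each read into k-mers and look them up.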
Are you trying to keep up with the rapidly growing number of biological resources associated with the SARS-CoV-2 virus and the related disease, COVID-19? There’s a new page to help you find SARS-CoV-2-related content available at NCBI (Figure 1). This new site will help bench scientists, bioinformaticians, clinicians, and others connect with the information they need to study SARS-CoV-2 and end the COVID-19 pandemic.
Figure 1. The new SARS-CoV-2 resources page providing access to data submissions, literature, molecular information, and clinical resources.
Figure 1. The SARS-CoV-2 submission landing page, where you can submit to GenBank or SRA. You can also view other resources related to SARS-CoV-2.
Quickly and easily add your SARS-CoV-2 sequence data to the growing public archive with new, special features and support from NCBI. Our new SARS-CoV-2 sequence submission landing page will help you get started. GenBank submissions are accessioned and released in approximately 1-2 working days, and Sequence Read Archive (SRA) submissions are typically processed and released within hours. Submission is simple!
We recently announced that we made all of the Sequence Read Archive (SRA) publicly available on two cloud platforms. This archive of genetic sequences is a treasure trove of information and the cloud environments provide high-performance computing capabilities via a GCP or AWS account – right from your own device. High-throughput sequencing has made generating data extremely fast and inexpensive, which has fueled the rapid growth of SRA. Putting it on the cloud makes it possible to analyze “the high-throughput, unassembled sequence data, across all such sequences”.
So, what are some of the potential discoveries that await? To investigate some of the possibilities, we have held a series of codeathons to see if known and unknown viruses could be found lurking within SRA cloud datasets. Spoiler alert – they are! And just recently, a team from Stanford reported that they were able to identify a 2019-nCoV-like coronavirus in pangolins by examining data sets identified via a meta-metagenomic search of SRA and downloaded using the SRA Toolkit. One challenge this team faced was downloading the datasets: 2.5 TB, corresponding to approximately 10^13 bases, took over 48 hours to gather. How might cloud-based SRA tools have made this task easier and faster? Here’s how:
BigQuery: allows native programmatic access to SRA metadata in the cloud, including searches based on that metadata.
SRA Toolkit: enables retrieval and reading of sequencing files from the SRA datasets in the cloud, and writing of files in the same format.
Coming soon to the cloud are tools for large-scale BLAST processing and a Read Alignment and Annotation Pipeline Tool (RAPT). These tools allow the data to be analyzed directly in the cloud, eliminating the need for download to local storage for analysis.
Also in the works is a mechanism to provide better access to taxonomic content of SRA runs as calculated by NCBI tools.
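As a rough illustration of the BigQuery metadata search mentioned above, the sketch below composes a standard SQL query against the public SRA metadata table. The table path nih-sra-datastore.sra.metadata and the column names acc, organism and mbases are assumptions to check against the current SRA cloud documentation, and the helper name sra_metadata_query is hypothetical. The code only builds the query text. Running it requires a GCP account and either the BigQuery console or a client library.

```python
def sra_metadata_query(organism, limit=10):
    """Compose a BigQuery SQL query listing SRA runs for one organism.

    The table and column names are assumptions for illustration. Verify
    them against the SRA cloud documentation before running the query.
    """
    # Reject characters that could break out of the quoted SQL literal.
    if "'" in organism or "\\" in organism:
        raise ValueError("organism name contains unsupported characters")
    limit = int(limit)
    return (
        "SELECT acc, organism, mbases\n"
        "FROM `nih-sra-datastore.sra.metadata`\n"
        f"WHERE organism = '{organism}'\n"
        "ORDER BY mbases DESC\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(sra_metadata_query("Severe acute respiratory syndrome coronavirus 2", 5))
```

Because the search runs against metadata in the cloud, only the runs that match need to be fetched afterwards, rather than terabytes of candidate datasets.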
We are continually adding new functionality to better support your cloud workflows and are happy to help! Contact us at email@example.com if you have questions or need help getting started. If you need assistance setting up GCP or AWS, please follow the steps in our how-to videos on YouTube.
Are you interested in mining literature about COVID-19 and the novel SARS-CoV-2 virus? You may want to check out the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a collection of more than 13,000 full-text articles that focus on COVID-19 and coronaviruses and that were assembled from PMC, the WHO, bioRxiv, and medRxiv. To produce this dataset, the National Library of Medicine partnered with colleagues from the Allen Institute for AI, the Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Kaggle, Microsoft, and the White House Office of Science and Technology Policy (OSTP).
CORD-19 is available from the Allen Institute and will be updated weekly as new articles become available. The article data are formatted in JSON, making the collection ideal for computational methods such as data mining, machine learning, and natural language processing. We hope this collection serves as a call to action for the community to improve our understanding of coronaviruses and the human diseases they cause. Have a look and let us know what you think!
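Because the articles are JSON, a few lines of code are enough to start exploring them. The sketch below parses one article-like record and reports its title and body length. The record is a made-up example, and the field names paper_id, metadata, title and body_text are assumptions about the layout. Check them against the schema file shipped with the release you download.

```python
import json

# A made-up record that mimics the assumed layout of one article file.
SAMPLE = """
{
  "paper_id": "example-0001",
  "metadata": {"title": "A hypothetical coronavirus study"},
  "body_text": [
    {"section": "Introduction", "text": "Body paragraph one."},
    {"section": "Methods", "text": "Body paragraph two."}
  ]
}
"""


def summarize_article(raw_json):
    """Return the id, title, and body size of one article record."""
    doc = json.loads(raw_json)
    body = doc.get("body_text", [])
    return {
        "paper_id": doc.get("paper_id"),
        "title": doc.get("metadata", {}).get("title", ""),
        "n_body_paragraphs": len(body),
        "n_body_words": sum(len(p.get("text", "").split()) for p in body),
    }


if __name__ == "__main__":
    print(summarize_article(SAMPLE))
```

The same function could be applied to each file in the collection to build a simple index before moving on to text mining or machine learning.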