CORD-19: A New Machine Readable COVID-19 Literature Dataset

Are you interested in mining literature about COVID-19 and the novel SARS-CoV-2 virus? You may want to check out the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a collection of more than 13,000 full-text articles that focus on COVID-19 and coronaviruses and that were assembled from PMC, the WHO, bioRxiv, and medRxiv. To produce this dataset, the National Library of Medicine partnered with colleagues from the Allen Institute for AI, the Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Kaggle, Microsoft, and the White House Office of Science and Technology Policy (OSTP).

CORD-19 is available from the Allen Institute and will be updated weekly as new articles become available. The article data are formatted in JSON, making the collection ideal for computational methods such as data mining, machine learning, and natural language processing. We hope this collection serves as a call to action for the community to improve our understanding of coronaviruses and the human diseases they cause. Have a look and let us know what you think!
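
Because the articles are distributed as JSON, a few lines of code are enough to start exploring one. The sketch below is illustrative only: it assumes each article is a JSON object with a "paper_id", a "metadata" object holding a "title", and a "body_text" list of paragraphs that each carry a "text" field. Those field names, the sample record, and the helper summarize_article are assumptions made for this example, so check them against the files you download.

```python
import json

# Hypothetical, hand-written record that imitates the assumed layout.
# It is not real CORD-19 data.
SAMPLE = """
{
  "paper_id": "example-0001",
  "metadata": {"title": "An example title"},
  "body_text": [
    {"section": "Introduction", "text": "First paragraph."},
    {"section": "Methods", "text": "Second paragraph."}
  ]
}
"""


def summarize_article(raw_json):
    """Return the ID, title, paragraph count, and word count of one article."""
    article = json.loads(raw_json)

    # .get() with defaults keeps the function from failing on missing fields.
    title = article.get("metadata", {}).get("title", "")
    paragraphs = [p.get("text", "") for p in article.get("body_text", [])]
    full_text = "\n".join(paragraphs)

    return {
        "paper_id": article.get("paper_id"),
        "title": title,
        "n_paragraphs": len(paragraphs),
        "n_words": len(full_text.split()),
    }


summary = summarize_article(SAMPLE)
print(summary)
```

To run the same function over a downloaded file, read the file's contents into a string and pass that string in place of SAMPLE.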

Rapid access to SARS-CoV-2 (Wuhan coronavirus) data from the current public health emergency

As the global health emergency around severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, formerly 2019-nCoV, Wuhan coronavirus) continues, we are playing a key role in giving the biomedical community free and easy access to genome sequences from the coronavirus. You can quickly access these data through the NCBI search (Figure 1).

Figure 1. NCBI search results for the term “wuhan coronavirus” showing the buttons that link to the data in the NCBI Virus resource, GenBank®/SRA, and a specialized BLAST page that searches Betacoronavirus sequences.
