To enhance machine access to biomedical literature and drive impactful analyses and reuse, the National Library of Medicine (NLM) is pleased to announce the availability of the PubMed Central (PMC) Article Datasets on the Amazon Web Services (AWS) Registry of Open Data as part of AWS’s Open Data Sponsorship Program (ODP). These datasets collectively span 4 million of the 7 million full-text scientific articles in PMC.
Making these articles available through this distribution channel is part of the National Institutes of Health’s (NIH’s) commitment to host large datasets and bring together computational tools and cloud technologies in ways that support open access, interoperability, and collaborative analyses. With support from the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, NLM’s National Center for Biotechnology Information (NCBI) also hosts the public Sequence Read Archive, the COVID-19 Genome Sequence Dataset, and BLAST databases on AWS ODP. Adding the PMC article datasets co-locates literature, sequence data, and alignment tools, which supports more comprehensive genomic analyses in the cloud and accelerates discovery and insights.
NLM has supported retrieval and download of machine-readable open access journal articles in PMC through the PMC Open Archives Initiative (PMC-OAI) and FTP (file transfer protocol) services for nearly two decades. Over time, such access was expanded to include peer-reviewed, accepted author manuscripts supported by NIH and other research funding organizations; coronavirus-related literature; and open access preprints with NIH support. This expansion to the cloud makes it easier to identify and analyze the most current biomedical and life science articles available for text mining and reuse. Researchers can access and retrieve millions of full-text articles at no cost, with faster retrieval times, and analyze them in the cloud or locally. The cloud access option is available in addition to the existing PMC-OAI and FTP services.
The PMC Article Datasets are available in XML and plain text formats and span centuries of scientific communications, including publicly funded research results. Below is an overview of the datasets housed in the AWS S3 bucket located at arn:aws:s3:::pmc-oa-opendata:
- The PMC Open Access Subset includes more than 3.4 million articles and preprints that are made available under Creative Commons license terms that allow reuse.
- The PMC Author Manuscript Dataset consists of more than 700,000 accepted author manuscripts that are available for text mining.
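As a rough illustration of how the bucket named above might be explored, the following Python sketch derives the bucket name from the ARN, builds a listing URL for the S3 REST interface, and parses a listing response. It assumes the bucket permits anonymous HTTPS requests and is reachable through the generic virtual-hosted endpoint; both are assumptions, not details given in this announcement. The helper names (`bucket_from_arn`, `listing_url`, `parse_listing`, `fetch_listing`) are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# ARN as given in the announcement.
BUCKET_ARN = "arn:aws:s3:::pmc-oa-opendata"

# XML namespace used by S3 ListObjectsV2 responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def bucket_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 bucket ARN (arn:aws:s3:::name)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3" or not parts[5]:
        raise ValueError(f"not an S3 bucket ARN: {arn!r}")
    return parts[5]


def listing_url(bucket: str, prefix: str = "", max_keys: int = 1000) -> str:
    """Build a ListObjectsV2 URL that lists one 'directory' level.

    Assumption: the generic virtual-hosted endpoint works for this bucket
    and anonymous listing is allowed.
    """
    query = urlencode({
        "list-type": "2",
        "delimiter": "/",
        "prefix": prefix,
        "max-keys": str(max_keys),
    })
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text: str):
    """Return (sub-prefixes, object keys) from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    prefixes = [p.findtext(f"{S3_NS}Prefix")
                for p in root.findall(f"{S3_NS}CommonPrefixes")]
    keys = [c.findtext(f"{S3_NS}Key")
            for c in root.findall(f"{S3_NS}Contents")]
    return prefixes, keys


def fetch_listing(url: str) -> str:
    """Download a listing over HTTPS (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")
```

Calling `parse_listing(fetch_listing(listing_url(bucket_from_arn(BUCKET_ARN))))` would, under the assumptions above, return the top-level prefixes and keys of the bucket. Responses are paginated, so a full crawl would also need to follow the continuation token that S3 returns for truncated listings.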
In addition to full-text articles, these datasets contain corrections, retractions, and expressions of concern. These can be identified by the article-type values in the available files; retracted articles can also be identified in the file lists.
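The article-type check described above might look like the following Python sketch. It assumes the XML files carry an `article-type` attribute on an un-namespaced `<article>` element and that the relevant values are `correction`, `retraction`, and `expression-of-concern`; these follow common JATS usage and should be verified against the actual files. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed article-type values for notices; verify against the real files.
NOTICE_TYPES = {"correction", "retraction", "expression-of-concern"}


def article_type(xml_text: str):
    """Return the article-type attribute of the <article> element, or None."""
    root = ET.fromstring(xml_text)
    # The <article> element is usually the root, but search descendants
    # as well in case it is wrapped in a container element.
    node = root if root.tag == "article" else root.find(".//article")
    return node.get("article-type") if node is not None else None


def is_notice(xml_text: str) -> bool:
    """True if the document is a correction, retraction, or expression of concern."""
    return article_type(xml_text) in NOTICE_TYPES


# Hypothetical minimal document, for illustration only.
SAMPLE = '<article article-type="retraction"><front/></article>'

if __name__ == "__main__":
    print(article_type(SAMPLE), is_notice(SAMPLE))
```

A text-mining pipeline could apply `is_notice` to each downloaded file and route notices separately from research articles, rather than mixing them into the same corpus.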