NCBI’s Open Data – A Source of Experimental Data for Important Discoveries

On a typical day, researchers download about 30 terabytes of data from NCBI in an effort to make discoveries. NCBI began providing online access to data in the early 1990s, starting with the GenBank database of DNA sequences. Over the years we’ve greatly expanded the types and quantity of data available. You can now find on our site descriptions and data from experimental studies such as next-generation sequencing projects, bioactivity assays for small molecules, microarray datasets and genome-wide association studies.

The White House recently recognized these efforts by presenting NCBI Director David J. Lipman with the “Open Science” Champions of Change Award [1]. The scientific community has also recognized the benefits of open data. Access to this information serves as a source of both original and supplemental data for exploration and validation [2-4], which improves the power of experimental data [5] while increasing the speed and decreasing the cost of discovery [6].

In this post, we summarize three recent cases where researchers used data from an NCBI resource/database to make significant discoveries.

GEO CASE STUDY: Identifying Common Genes and Networks in Multi-Organ Fibrosis

Fibrotic diseases are responsible for 45% of deaths in the developed world and occur in many different organ systems including lung, heart, liver and kidneys. Despite the diverse tissues that manifest the disease, Wenzke, et al. set out to discover common gene expression profiles that might characterize a common fibrotic etiology. In the process they hoped to identify a set of therapeutic targets that might be used to combat all forms of human fibrotic disease.

Wenzke, et al. downloaded data from nine experimental studies of seven different fibrotic diseases from the GEO database. In performing the meta-analysis, they focused on the top 10 genes that were significantly up- or down-regulated in at least five of the datasets and were able to identify a specific role for each in the pathology of the disease. By performing a meta-analysis on such varied data, this group was able to discover and validate potential new “therapeutic targets for slowing or even revers[ing] fibrotic activity.”
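The recurrence filter described above (keep genes that are significantly regulated in at least five datasets, then take the top 10) can be illustrated with a short Python sketch. This is not the authors' pipeline: the function name, the gene labels and the toy data are hypothetical, and the sketch assumes that each dataset has already been reduced to a list of significant genes with a direction, and that a gene must change in the same direction to count as recurrent.

```python
from collections import Counter


def recurrent_genes(datasets, min_datasets=5, top_n=10):
    """Return (gene, direction, count) for genes called significant in the
    same direction in at least `min_datasets` datasets, most frequent first.

    `datasets` is a list of dicts mapping a gene symbol to "up" or "down".
    Requiring a consistent direction is an assumption made for this sketch.
    """
    counts = Counter()
    for calls in datasets:
        for gene, direction in calls.items():
            counts[(gene, direction)] += 1

    hits = [
        (gene, direction, n)
        for (gene, direction), n in counts.items()
        if n >= min_datasets
    ]
    # Sort by descending count, then by name so ties are deterministic.
    hits.sort(key=lambda hit: (-hit[2], hit[0], hit[1]))
    return hits[:top_n]


# Hypothetical toy input: nine datasets with placeholder gene names.
toy_datasets = [
    {"GENE_A": "up", "GENE_B": "down", "GENE_C": "up"},
    {"GENE_A": "up", "GENE_B": "down", "GENE_C": "up"},
    {"GENE_A": "up", "GENE_B": "down", "GENE_C": "up"},
    {"GENE_A": "up", "GENE_B": "down", "GENE_C": "down"},
    {"GENE_A": "up", "GENE_B": "down", "GENE_C": "down", "GENE_D": "up"},
    {"GENE_A": "up", "GENE_C": "down", "GENE_D": "up"},
    {"GENE_D": "up"},
    {"GENE_D": "up"},
    {},
]

result = recurrent_genes(toy_datasets)
print(result)
```

In this toy input only GENE_A and GENE_B pass the filter. GENE_C is significant in six datasets but in conflicting directions, and GENE_D appears in only four. A real meta-analysis would also need normalization and significance testing within each dataset before this step.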

PubChem CASE STUDY: Predicting Adverse Drug Reactions Using Publicly Available PubChem BioAssay Data

According to Pouliot and colleagues, serious adverse drug reactions are “estimated to account for more than 2 million incidents requiring hospitalization annually and more than 100,000 deaths [each year] in the United States.” The successful development of a method to predict adverse drug reactions could promote patient safety by enabling closer monitoring of patients enrolled in clinical trials.

Pouliot, et al. combined the power of PubChem’s BioAssay study data and PubChem Compound’s annotations with pharmacovigilance data from the Canadian Adverse Drug Reaction database and the System Organ Class (SOC) annotations identified by the Medical Dictionary for Regulatory Activities. Several PubChem BioAssay studies were characterized as pertinent studies representing each of 19 organ systems. Of the 508 BioAssays searched, 37 mapped strongly to nine specific SOCs, including gastrointestinal disorders, nervous system disorders, cardiac disorders, immune system disorders and blood and lymphatic system disorders. Data for eight current chemotherapeutics were used as the basis for an examination of the assay data with similar compounds that appeared to affect these SOC models. Five of the drugs (75%) were correctly predicted to have an impact on the function of specific organ systems, based on comparisons with documented adverse drug reactions. In addition, three preclinical drugs were assessed and predictions were made to promote awareness for future studies. The authors conclude that these predictions should serve as an indicator of specific symptoms to look out for during clinical trials and pharmacovigilance studies, and should be used to promote awareness of potentially severe adverse reactions that might need intervention to enhance patient safety.

dbGaP CASE STUDY: Prediction of Susceptibility to Major Depression by a Model of Interactions of Multiple Functional Genetic Variants and Environmental Factors


Major Depressive Disorder (MDD) is the most common psychiatric disorder, the third and fourth leading cause of death in the age groups of 15-24 and 25-44 years, respectively, and is estimated to cause $100 billion in economic burden annually. While MDD has been suggested to “run in families,” few studies have been performed with enough power to convincingly identify genetic variations that might predispose someone to MDD. Wong, et al. had performed a genome-wide association study looking at pharmacogenomic markers and depression and were able to reuse data from many of the participants in an effort to discover functional genetic variants that might enable the prediction of susceptibility to MDD. They also were hoping to identify specific genes and biological pathways that could serve as novel therapeutic targets.

Wong, et al. had gathered phenotype and microarray data for 278 MDD patients and 321 controls of Mexican-American descent, but they needed to dramatically increase the power of their analysis by gathering additional data. Using data from NCBI’s dbGaP database, they augmented their study with data on another 1,862 MDD patients and 1,857 unaffected controls. This increase in participant numbers enabled them to look for correlations among the MDD patients with exposure to smoking and alcohol abuse/dependence, and for effects related to age, gender and marital status. Using this population, they were able to narrow their search to 15 pathogenic non-synonymous genetic variations predicted to affect specific, relevant metabolic pathways. Among the population, 11 genetic variations appeared to have a dominant inheritance pattern, with 6 inherited as a recessive trait. Several pathways and genes known to be important in MDD were validated in this study, and several others not previously known were identified. The authors conclude that the information developed from analysis of this augmented clinical data set, and from future planned studies, may serve as a starting point for the development of new classes of MDD diagnostics and therapeutics.


  1. “Dr. David Lipman Receives White House ‘Open Science’ Champions of Change Award on Behalf of NCBI.” (NCBI News Story – June 20, 2013)
  2. “Discovering biological connections between experimental conditions based on common patterns of differential gene expression.” (PMCID 3203354)
  3. “Mind the dbGAP: The Application of Data Mining to Identify Biological Mechanisms.” (PMCID 3086918)
  4. “Exploiting PubChem for Virtual Screening.” (PMCID 3117665)
  5. “Including Additional Controls from Public Databases Improves the Power of a Genome-Wide Association Study.” (PMCID 3171281)
  6. “Data archiving is a good investment.” (PMID 21593852)

See these NCBI resources for more information:
