dbGaP: Data and analyses from millions of study participants, samples, and trillions of genotypes!

Are you familiar with the well-known Framingham Heart Study, a multi-generation study of residents of Framingham, Massachusetts, begun in 1948? Much of what is now known about the impact of genetics, lifestyle, and diet on cardiovascular health and disease has come from this research study. (See PMC4159698 for a historical perspective.) Did you know that data from this study and over 2,000 other studies that demonstrate the relationships between genotypes and medical outcomes or other phenotypes are available from NCBI’s Database of Genotypes and Phenotypes (dbGaP)?

dbGaP was established in 2007 as a repository of human data from large-scale studies. You can access data from more than 2.8 million study participants who have provided over 3.3 million molecular samples. You can retrieve patient-level phenotypic (e.g., demographic, clinical, exposure) data, molecular (e.g., called genotypes, omics, sequence) data, and the results of association analyses from genome-scale case-control and longitudinal studies of heritable diseases.

What types of studies and data are available in dbGaP?

dbGaP contains a wide range of studies and types of data, all relating to human genetic and phenotypic measurements. Most dbGaP data are from NIH-funded research, but recently we have expanded to include non-NIH funded studies. An easy way to find dbGaP Studies, Phenotype and Molecular Datasets, Variables, Analyses and Documents is through the dbGaP Advanced Search (Figure 1). The interface allows you to filter results by different characteristics depending on the tab you choose.

Figure 1. The dbGaP Advanced Search interface. Tabs at the top of the web interface allow you to select the studies, datasets, analyses, etc. of interest. Filters (facets) appear on the left (see inset). Click on filters to select values and narrow the results. Links on the study summary pages provide direct access to data. Top panel: Studies tab and the corresponding filter categories. Bottom panel: Molecular data tab results with the Study (Framingham SHARe) and Markerset Source (Affymetrix) filters applied.

For example, selecting the Study Disease/Focus filter in the dbGaP Advanced Search allows you to filter studies by human diseases or conditions. Table 1 shows a sample of the dbGaP studies focused on human diseases.

Study Focus                 Number of Studies
Cardiovascular Diseases     66
Breast Neoplasms            63
Type II Diabetes            37
Asthma                      31
Autism Spectrum Disorder    19
Parkinson Disease           19
Schizophrenia               15
Hypertension                13
COVID-19                    9

Table 1. Selected disease conditions and the number of dbGaP studies for each, as shown in the Advanced Search display. Each count links to the corresponding dbGaP study summaries.

Study designs include longitudinal cohort, case, case-control, family/twin/trios, tumor vs. normal, clinical trial, and many others. Molecular data types include whole genome, whole exome, and transcriptome sequencing, gene expression analyses, SNP array genotyping, and many more. Phenotypic data may range from habits and lifestyle choices, to presence or absence of a disease diagnosis, to more specific clinical measurements such as blood pressure and blood glucose levels. The lists of molecular and phenotype datasets from the Framingham study show some of the kinds of data available.

How do I access data in dbGaP?

There are two access tiers in dbGaP: 1) Public Access and 2) Controlled Access. Anyone may access public data without making a request. Public data include study and variable metadata, summaries, and even some preliminary association analyses for studies. For example, you may access the variable summary-level data, data dictionaries, and study documents, such as protocols and questionnaires, from the public FTP site, or browse these data from the public pages and Advanced Search (Figure 1). Association analysis data are available via the analysis pages or through the Phenotype Genotype Integrator (PheGenI), a browser for variations, genes, and the associated traits (phenotypes), with P-values indicating the strength of the genotype-to-phenotype association. For many studies you can see genetic population ancestry calculated from participant genotypes using GRAF-pop. Figure 2 shows the GRAF-pop analysis of genotypes for participants in the Women’s Health Initiative study.

Figure 2. GRAF-pop analysis of Women’s Health Initiative genotypes. The colors are the participants’ reported ancestry. Note that many of the reported genotypes fall well outside the 95% range (yellow ellipses) for that ancestry in dbGaP. 
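If you download a public data dictionary, a short script can list the variables it describes. The following is only a minimal sketch in Python: the sample XML, its element names (data_table, variable, name, description, type, value), and the placeholder accessions are assumptions made for illustration, and real dbGaP data dictionary files may be laid out differently. The function name summarize_variables is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified data dictionary used only for illustration.
# Element names and accessions are assumptions; check a real file before relying on them.
SAMPLE_DATA_DICT = """<data_table id="pht000000.v1" study_id="phs000000.v1">
  <variable id="phv00000001.v1">
    <name>SEX</name>
    <description>Sex of participant</description>
    <type>encoded value</type>
    <value code="1">Male</value>
    <value code="2">Female</value>
  </variable>
  <variable id="phv00000002.v1">
    <name>SBP</name>
    <description>Systolic blood pressure</description>
    <type>integer</type>
  </variable>
</data_table>"""


def summarize_variables(xml_text):
    """Return one dict per <variable> element found in the XML text."""
    root = ET.fromstring(xml_text)
    summaries = []
    for var in root.iter("variable"):
        summaries.append({
            "id": var.get("id"),
            "name": var.findtext("name", default=""),
            "description": var.findtext("description", default=""),
            "type": var.findtext("type", default=""),
            # Map each coded value to its label; empty if the variable is not coded.
            "codes": {v.get("code"): (v.text or "").strip()
                      for v in var.findall("value")},
        })
    return summaries


if __name__ == "__main__":
    for item in summarize_variables(SAMPLE_DATA_DICT):
        print(item["id"], item["name"], item["type"], item["codes"])
```

To try this on a file you have downloaded, read its contents into a string and pass it to summarize_variables, adjusting the element names to match the actual file.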

The controlled-access portion of dbGaP provides de-identified individual-level genotype, sequence, omics, and phenotype data. Once you find a study of interest, you can follow the Authorized Access link on records to apply for access. Requirements for gaining access to these individual-level data vary by study. See Tips for Preparing a Successful Data Access Request for more details about applying for controlled access. Your request will be reviewed by an NIH Data Access Committee for compliance with any data use limitations. Once you have been granted access, you can retrieve controlled-access data from the study, including phenotype, omics, images, and supporting next-gen sequencing data from the Sequence Read Archive (SRA).

How can I submit my own study data to dbGaP?

See the Submission Flowchart for information on submitting your study data to dbGaP.

Please write to the dbGaP staff for more information on how to get started with access or to the submission staff for help with submitting data.

Stay tuned to learn more! We will provide more information about dbGaP in the coming months!
