The ALFA dataset: New aggregated allele frequency from dbGaP and dbSNP now available

NIH’s data sharing policy now allows unrestricted access to genomic summary results for data from NCBI’s Database of Genotypes and Phenotypes (dbGaP). Pooled allele frequency data from dbSNP and the dbGaP summary results are available as the new Allele Frequency Aggregator (ALFA) dataset. The ALFA dataset includes aggregated and harmonized array chip genotyping, exome, and genome sequencing data. The ALFA data are open access and freely available for you to incorporate into your workflows and applications from the dbSNP web pages (Figure 1), through FTP, and through the Variation Services API. dbGaP currently has data for more than 2 million study subjects, approximately 1 million of whom have genotype data that are suitable for input into the ALFA dataset. The first release of ALFA contains data on about 100,000 subjects, and we hope to complete processing of data on the other 925,000 subjects within the next year. This volume and variety of data promises unprecedented opportunities to identify genetic factors that influence health and disease. Register to attend our April 22 webinar and read on to learn more.

Figure 1. ALFA allele frequencies for a variant (rs4988235) in the promoter of the lactase gene, showing frequency differences across populations.

dbGaP contains the results of over 1,200 studies that have investigated the interaction of genotype and phenotype. The database has over two million subjects and hundreds of millions of variants, along with thousands of phenotypes and molecular assay data. The harmonized ALFA data will allow the wider scientific community to access allele frequencies for millions of variants in dbGaP. Only dbGaP studies that have been approved by the submitting institutions for sharing of summary statistics are included in the open-access ALFA dataset. Genotype and associated individual-level data remain accessible through dbGaP authorized access.

The first release of the ALFA data (March 2020) includes Minor Allele Frequencies (MAF) for more than 447 million sites with data in dbSNP and more than 4 million novel sites from 99 thousand subjects across 42 dbGaP studies. We inferred ancestry using GRAF-pop and computed allele frequencies for 12 major populations including African, African American, East Asian, South Asian, Hispanic, European, and other origins. We conducted extensive quality checks to ensure that only high-quality data are used as input. Our analysis showed that, overall, ALFA data are consistent with MAF data previously reported in gnomAD and 1000 Genomes for the same variants. In addition to providing previously reported frequency sources, ALFA has additional frequency data for novel and existing variants in dbSNP and ClinVar that have not been reported in 1000 Genomes, gnomAD, ExAC, or TOPMed. Visit the ALFA project page to learn more about the release summary and how we generate the data.
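To make the idea of aggregated, per-population allele frequencies concrete, here is a minimal Python sketch. It is not the ALFA pipeline: the function names, the input layout, and the allele counts are all hypothetical, and it assumes ancestry has already been assigned to each subject (for example by a tool such as GRAF-pop) and that every study reports allele counts per population for a single biallelic site.

```python
from collections import defaultdict


def pooled_allele_frequencies(study_counts):
    """Pool allele counts across studies and return frequencies per population.

    study_counts is an iterable of (population, allele, count) tuples.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for population, allele, count in study_counts:
        totals[population][allele] += count

    frequencies = {}
    for population, allele_counts in totals.items():
        total = sum(allele_counts.values())
        if total == 0:
            continue  # no observed alleles, so no frequency can be computed
        frequencies[population] = {
            allele: count / total for allele, count in allele_counts.items()
        }
    return frequencies


def minor_allele(allele_freqs):
    """Return (allele, frequency) for the less common allele at a biallelic site."""
    allele = min(allele_freqs, key=allele_freqs.get)
    return allele, allele_freqs[allele]


# Made-up counts for one site, reported by two hypothetical studies.
example_counts = [
    ("European", "G", 300), ("European", "A", 700),    # study 1
    ("European", "G", 100), ("European", "A", 900),    # study 2
    ("East Asian", "G", 990), ("East Asian", "A", 10),  # study 2
]

for population, freqs in pooled_allele_frequencies(example_counts).items():
    print(population, freqs, "minor allele:", minor_allele(freqs))
```

Pooling the raw counts before dividing, rather than averaging each study's frequency, weights every study by the number of alleles it actually observed. That is one reasonable design choice for a sketch like this, not a description of how ALFA itself combines studies.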

We anticipate that the volume of data will grow and could reach over a billion variants and trillions of genotypes from millions of subjects combined across all dbGaP studies. We will add new studies to future dbSNP quarterly build releases and will recompute the allele frequencies for all studies with the new data.

Please contact us at suggest@ncbi.nlm.nih.gov with any feedback or questions about this new dataset.

3 thoughts on “The ALFA dataset: New aggregated allele frequency from dbGaP and dbSNP now available”

  1. Some samples are duplicated within the same dbGaP dataset or across multiple datasets. I’m wondering how NCBI handled these duplicates.
