ClinVar, NCBI’s archive of submitted associations between alleles in the human genome and diseases or phenotypes, is now producing XML files that aggregate all submitted disease/phenotype information by variant (or set of variants) for public release via FTP bulk download. The new product, called ClinVarVariationRelease, is currently in beta release and will move to full release in early September 2017.
ClinVar represents interpretations by many contributors about the relationship between an allele in the human genome and a particular disease or phenotype. Not unexpectedly, there are many complexities and nuances in these relationships. For example, the association between an allele and one disease may be strongly supported by multiple lines of evidence, but the allele may also have less clear associations with other diseases, each with its own caveats or levels of evidence. To provide ClinVar users with that full set of information clearly and accurately, the ClinVar bulk download data products have until now listed each allele–disease pair, and its supporting evidence, separately.
Some ClinVar users, however, found that the bulk data product was not ideal for their specific purpose. For example, a clinician may start with an allele and need to quickly see a summary of all that is known about it in a single record. The ClinVar team thus developed a summary format to meet that need.
Automatically computing a summary that correctly preserves the strength and uncertainty of each allele–phenotype relationship is difficult. We therefore developed such a view first on the ClinVar interactive website, and we have used that view to collect feedback from users and our many collaborators. This allowed us to assess the tradeoffs in synthesizing the data and to refine our model summary. The result is now available as a data product on the ClinVar FTP site.
Data in ClinVarVariationRelease is aggregated by the Variation ID, which represents the variant or set of variants that were interpreted for clinical or functional significance. Each aggregate record will be assigned an accession number consisting of the prefix VCV (Variation in ClinVar) followed by nine digits: the Variation ID padded with leading zeros. This file will make it easier for users to access all data for a variant or set of variants, across all diseases reported for it. Also available are the XSD for ClinVarVariationRelease and the ClinVar Data Dictionary.
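The accession scheme described above can be illustrated with a short Python sketch. The function names and the example Variation ID are hypothetical, and the check that IDs are non-negative is an assumption; only the VCV prefix and the nine-digit zero padding come from the description above.

```python
import re

VCV_PREFIX = "VCV"
VCV_DIGITS = 9  # the Variation ID is zero-padded to nine digits


def variation_id_to_vcv(variation_id: int) -> str:
    """Build a VCV accession from a Variation ID (hypothetical helper)."""
    # Assumption: Variation IDs are non-negative and fit in nine digits.
    if not 0 <= variation_id < 10 ** VCV_DIGITS:
        raise ValueError(f"Variation ID out of range: {variation_id}")
    return f"{VCV_PREFIX}{variation_id:0{VCV_DIGITS}d}"


def vcv_to_variation_id(accession: str) -> int:
    """Recover the Variation ID from a VCV accession (hypothetical helper)."""
    match = re.fullmatch(r"VCV([0-9]{9})", accession)
    if match is None:
        raise ValueError(f"Not a VCV accession: {accession!r}")
    return int(match.group(1))


# Example with a made-up Variation ID:
print(variation_id_to_vcv(12345))           # VCV000012345
print(vcv_to_variation_id("VCV000012345"))  # 12345
```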
In the beta release, all VCVs will have version 1; the version numbers will not increase until the production release.
We will retain ClinVarFullRelease, the archive of ClinVar data aggregated by RCV, the accession number assigned to a variant-disease pair. Updates to ClinVarVariationRelease in the beta phase will use the same snapshot of data as the weekly update for ClinVarFullRelease.
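To make the difference between the two aggregation levels concrete, here is a minimal Python sketch that groups variant-disease pairs by Variation ID. The accessions, Variation IDs, and condition names are made-up illustration data, and the tuple layout is an assumption rather than the structure of either XML file.

```python
from collections import defaultdict

# Hypothetical variant-disease pairs, one per RCV-style record:
# (RCV accession, Variation ID, condition name)
rcv_records = [
    ("RCV000000001", 101, "Condition A"),
    ("RCV000000002", 101, "Condition B"),
    ("RCV000000003", 202, "Condition A"),
]


def aggregate_by_variation(records):
    """Group variant-disease pairs under their shared Variation ID."""
    grouped = defaultdict(list)
    for rcv_accession, variation_id, condition in records:
        grouped[variation_id].append((rcv_accession, condition))
    return dict(grouped)


by_variation = aggregate_by_variation(rcv_records)
for variation_id, pairs in sorted(by_variation.items()):
    # One entry per variant, listing every condition reported for it.
    print(variation_id, pairs)
```

In this toy data, Variation ID 101 ends up with two conditions in a single entry, which is the kind of per-variant view the new file is meant to provide.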
New features in ClinVarVariationRelease include:
- explicit elements to distinguish records for simple alleles vs haplotypes vs genotypes
- explicit elements to distinguish between variants that were directly interpreted vs variants that were interpreted only as part of a haplotype or genotype (i.e. “included” variants). The clinical significance for included variants is indicated as “no interpretation for the single variant”.
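As a rough sketch of how these distinctions might be read from the file, the Python example below classifies records in a tiny hand-written XML sample. The element and attribute names used here (`VariationArchive`, `InterpretedRecord`, `IncludedRecord`, `SimpleAllele`, `Haplotype`, `Genotype`, `Accession`) are assumptions made for illustration; the XSD for ClinVarVariationRelease is the authoritative source for the actual names and nesting.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; element names are assumed, not taken from the XSD.
SAMPLE = """
<Release>
  <VariationArchive Accession="VCV000000101">
    <InterpretedRecord><SimpleAllele/></InterpretedRecord>
  </VariationArchive>
  <VariationArchive Accession="VCV000000202">
    <InterpretedRecord><Haplotype/></InterpretedRecord>
  </VariationArchive>
  <VariationArchive Accession="VCV000000303">
    <IncludedRecord><SimpleAllele/></IncludedRecord>
  </VariationArchive>
</Release>
"""

RECORD_KINDS = ("InterpretedRecord", "IncludedRecord")
VARIATION_KINDS = ("SimpleAllele", "Haplotype", "Genotype")


def classify(archive):
    """Return (record kind, variation kind) for one record, or (None, None)."""
    for record_kind in RECORD_KINDS:
        record = archive.find(record_kind)
        if record is None:
            continue
        for variation_kind in VARIATION_KINDS:
            if record.find(variation_kind) is not None:
                return record_kind, variation_kind
    return None, None


def classify_all(xml_text):
    """Map each accession in the document to its classification."""
    root = ET.fromstring(xml_text.strip())
    return {a.get("Accession"): classify(a) for a in root.iter("VariationArchive")}


for accession, kinds in classify_all(SAMPLE).items():
    print(accession, kinds)
```

For the full bulk file, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading the whole document into memory.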
Some features are not yet included in ClinVarVariationRelease but will be added before the production release:
- a history indicating accessions that were merged into the current accession (Replaces element)
- a section to map the submitted name or identifier for the interpreted condition to the corresponding name used in ClinVar and to the MedGen concept unique identifier (CUI)
- a complementary file of deleted VCV accessions
- support for certain types of variant sets that are not yet in the release: diplotypes, variant sets of unknown phase, and variant sets on different chromosomes
To help us improve the product, we would appreciate your feedback during the six-week beta release. Please send feedback and error reports to email@example.com.