Would you like to compare and analyze your data with known structural variants (SV) in NCBI’s database of genomic structural variation (dbVar)? Now there are easy-to-use files containing non-redundant (NR) deletions, duplications, and insertions aggregated from across studies in dbVar. The files are available for human assembly versions GRCh37 and GRCh38. Descriptions of the NR data are available on GitHub.
The NR files are available for FTP download in BED, BEDPE, and custom tab-separated formats, designed to be compatible with many popular tools and browsers. To help users get started, we have developed tutorials for the UCSC Genome Browser, the Galaxy web-based analysis platform, the NCBI Sequence Viewer, and command-line BEDtools.
An upcoming release will add annotations such as genes, regulatory regions, and more. Have a favorite annotation you’d like to see? Send us your suggestions by contacting dbVar directly or by opening a GitHub issue. We also welcome comments and other suggestions for improvement.
dbVar non-redundant SV (NR SV) datasets include more than 2.2 million deletions, 1.1 million insertions, and 300,000 duplications. These data are aggregated from over 150 studies including 1000 Genomes Phase 3, Simons Genome Diversity Project, ClinGen, ExAC, and others. You can use NR SV data files to filter and annotate variants in a broad range of applications:
- Clinicians can easily filter patients’ genome data to find SV that overlap with variants previously reported as clinically significant.
- Researchers can compare the results of their own genome-wide SV surveys with dbVar NR data to identify variants that are novel or rare, those which may be pathogenic, and in some cases obtain allele frequencies for matching variants. Users can also annotate SV data with NR SV and other genomic annotations to prioritize those variants most likely to impact biological function.
- Developers of variant analysis pipelines can use dbVar NR data to help identify novel variants, calibrate their algorithms, or simply integrate the data into downstream analysis tools and workflows.
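The overlap filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not dbVar's own method: it assumes both inputs are tab-separated with BED-style columns (chromosome, 0-based start, end, name), and the records, names, and the 50% reciprocal-overlap threshold are hypothetical. Check the column layout of the NR files against the descriptions on GitHub before adapting it.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive
    name: str


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Parse the first four BED columns; skip blank, comment, and track lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        name = fields[3] if len(fields) > 3 else "."
        intervals.append(Interval(fields[0], int(fields[1]), int(fields[2]), name))
    return intervals


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of (shared / len(a)) and (shared / len(b))."""
    len_a, len_b = a.end - a.start, b.end - b.start
    # Zero-length records (e.g. insertion points) need a different matching
    # rule, so this sketch simply reports no overlap for them.
    if a.chrom != b.chrom or len_a <= 0 or len_b <= 0:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / len_a, shared / len_b)


def match_known(calls: List[Interval], known: List[Interval],
                min_ro: float = 0.5) -> List[Tuple[str, str, float]]:
    """Brute-force comparison; fine for small inputs, too slow for millions of records."""
    hits = []
    for call in calls:
        for ref in known:
            ro = reciprocal_overlap(call, ref)
            if ro >= min_ro:
                hits.append((call.name, ref.name, ro))
    return hits


# Hypothetical records standing in for user calls and NR deletions.
user_calls = parse_bed([
    "chr1\t1000\t2000\tcall1",
    "chr1\t5000\t6000\tcall2",
])
known_svs = parse_bed([
    "# hypothetical known deletions",
    "chr1\t1100\t2100\tknown_A",
    "chr2\t5000\t6000\tknown_B",
])

hits = match_known(user_calls, known_svs)
for call_name, known_name, ro in hits:
    print(f"{call_name} matches {known_name} (reciprocal overlap {ro:.2f})")
```

With these made-up records, only call1 is reported, matching known_A at a reciprocal overlap of 0.90; call2 has no counterpart and would be a candidate novel variant. For the full NR files, an indexed approach such as BEDtools is a better fit than this nested loop.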
dbVar’s NR SV reference data are updated monthly. These updates include new database submissions. We welcome your feedback on the content and usability of these files so that we can improve them.
For more information, please see our GitHub site, which includes brief tutorials and access to NR SV datasets by FTP.
As of July 2018, a new set of standalone variation services replaces the variant matching functions of Variation Reporter. Variation Reporter was a tool designed to search human sequence variation data by location and to report matching variants found in dbSNP, dbVar, and ClinVar.
The new services are faster, better at handling variants in repeat regions, and scalable to accommodate the continued explosive growth of variation volume. You can find more information about the services in the initial blog post and online SPDI document.
To report issues with these new services or to provide comments, please write to email@example.com.
If you have any specific questions about the NCBI site in general, contact us at firstname.lastname@example.org.
We appreciate your continued support and interaction with the NCBI tools.
dbVar has generated datasets of known structural variants (SV) for comparison with user data to aid variant calling, analysis, and interpretation.
Files containing Non-Redundant (NR) deletions, insertions, and duplications are now available on GitHub. Additional separate files include preliminary annotations of overlap with ACMG59 genes. All files are in tab-delimited text format.
We encourage you to test these files and provide feedback, either on GitHub or by email.
NCBI’s database of structural variation, dbVar, has a restructured FTP directory. The old directories can be found in archive. In the new structure, we have:
- added aggregated VCF files by assembly
- named files based on major assembly and region or call
- replaced study-specific directories with file-type directories
- renamed “.tab” files to “.tsv”
- moved old human and all non-human files to archive
Refer to README.ftp for full details of the new GVF, VCF, TSV, and XML files.
RefSeq release 85 is now accessible online, via FTP and through NCBI’s programming utilities. This full release incorporates genomic, transcript, and protein data available as of November 6, 2017, and contains 146,710,309 records, including 100,043,962 proteins, 20,905,608 RNAs, and sequences from 73,996 organisms. The release is provided in several directories as a complete dataset and as divided by logical groupings. See the RefSeq release notes for more information.
Starting in March 2018, SNP variation features will no longer be in RefSeq genome assembly records – chromosome and contig records with NC_, NT_, NW_ and AC_ accession prefixes. This change affects both the ASN.1 and flatfile records. Because the number of variants is already enormous and still growing, removing SNP features from these large genomic records will significantly reduce the size of RefSeq FTP files and make downloading and processing easier. We will continue to include SNPs on NG_-prefixed genomic records, and transcript (NM_, NR_, XM_, XR_) and protein (NP_, XP_, YP_) sequences.
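For pipelines that read SNP features from RefSeq records, the prefix rule above can be expressed as a small check. This is only a sketch based on the prefixes listed in this announcement; the function name is hypothetical, the accession strings are illustrative, and any prefix not named in the post is reported as unknown rather than guessed.

```python
from typing import Optional

# Prefixes taken from the announcement above.
ASSEMBLY_PREFIXES = ("NC_", "NT_", "NW_", "AC_")           # SNP features removed
RETAINED_PREFIXES = ("NG_", "NM_", "NR_", "XM_", "XR_",    # genomic and transcript
                     "NP_", "XP_", "YP_")                  # protein


def snp_features_expected(accession: str) -> Optional[bool]:
    """Return False if SNP features are dropped for this record type,
    True if they are retained, and None if the post does not cover it."""
    acc = accession.strip().upper()
    if acc.startswith(ASSEMBLY_PREFIXES):
        return False
    if acc.startswith(RETAINED_PREFIXES):
        return True
    return None


# Illustrative accession strings only.
for acc in ("NC_000001.11", "NG_007400.1", "NM_000546.6", "WP_000000001.1"):
    print(acc, snp_features_expected(acc))
```

Returning None for unlisted prefixes keeps the check conservative: the caller has to decide how to treat record types the announcement does not mention.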
Reminder: As of September 2017, NCBI has stopped accepting submissions for non-human SNPs in dbSNP and dbVar. RefSeq flatfiles will stop presenting non-human variant data in November 2017.
Subscribe to the refseq-announce listserv for regular updates on RefSeq.
Copy number variants (CNVs) from ExAC’s publication are now available at dbVar as nstd151. The data include approximately 50,000 CNV regions identified from 60,000 human exomes, providing a deep survey of common and rare copy number variation affecting protein-coding sequences in the human genome.
dbVar provides FTP files in VCF, GVF, and CSV formats; these include placements on GRCh37 as well as remapped placements on GRCh38. Tutorials for working with the different formats are also available.
Follow the dbVar RSS feed for information on monthly releases.
RefSeq release 84 is now accessible online, via FTP and through NCBI’s programming utilities.
This full release incorporates genomic, transcript, and protein data available as of September 11, 2017, and contains 140,627,690 records, including 95,563,598 proteins, 20,356,598 RNAs, and sequences from 72,965 organisms.
The release is provided in several directories as a complete dataset and as divided by logical groupings. See the RefSeq release notes for more information.
Phasing out support for non-human organisms
As of September 1, 2017, the dbSNP and dbVar databases have stopped accepting submissions for non-human organisms. Submissions for non-human variation will now be accepted by the European Variation Archive, one of our partners in the International Nucleotide Sequence Database Collaboration (INSDC).
RefSeq release 83 is now accessible online, via FTP and through NCBI’s programming utilities. This full release incorporates genomic, transcript, and protein data available as of July 17, 2017, and contains 132,052,465 records, including 88,385,530 proteins, 19,634,664 RNAs, and sequences from 71,356 organisms. The release is provided in several directories as a complete dataset and as divided by logical groupings. More information about RefSeq release 83 is available in the release notes.
NCBI will phase out support for non-human organisms in the dbSNP and dbVar databases. These databases will stop accepting submissions for non-human SNPs in September 2017. The interactive websites for these databases and related NCBI services, including RefSeq flatfiles, will stop presenting non-human variant data in November 2017.