In little more than a year, dbSNP human data have doubled twice: from 150 million Reference SNP (rs) records to 325 million in Build 150, and then to more than 650 million rs records in Build 151. Of these, 580 million rs records have frequency data in Build 151. This explosive growth makes dbSNP the world’s largest public human variation database. Current trends suggest that large-scale whole-genome sequencing (WGS) and whole-exome sequencing (WES) projects will discover millions of new variations in the next few years.
Build 151 was released in March 2018. The data are available for web search and FTP download.
NCBI’s dbSNP houses variation and frequency data from large-scale projects including 1000 Genomes, GO-ESP, ExAC, gnomAD, TOPMed and HLI, as well as focused studies like locus-specific databases (LSDB) and clinical sources. The rs records are annotated on RefSeq genomes, mRNA and protein sequences and integrated with other NCBI resources (e.g., Assembly, Gene, RefSeq, PubMed, and BioProject). The database is used worldwide in personal genomics and medical genetics, and for managing, annotating and analyzing variation data.
dbSNP is moving to a new design, and several new products are ready for testing: new JSON data files, the RefSNP page, and an API.
New JSON data files
The human Build 151 release is the last build that will provide relational database table dumps on the FTP site. In future build releases, dbSNP data will instead be available as a cumulative file of RefSNP objects in JSON format. These JSON files are available now so that users can begin migration and testing. Tutorials for parsing JSON are on GitHub.
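As a starting point for migration testing, here is a minimal sketch of reading RefSNP objects from one of these files. It rests on assumptions not stated above: that the file holds one JSON object per line, that it may be bz2-compressed, and that each object carries its rs number under a key named `refsnp_id`. The helper names and the example file name are hypothetical, so check them against the actual files and the GitHub tutorials before relying on this.

```python
import bz2
import json
from typing import Iterator, List


def iter_refsnp_objects(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the file.

    Assumption: one RefSNP object per line; files ending in .bz2 are
    bz2-compressed, anything else is read as plain text.
    """
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def refsnp_ids(path: str, id_key: str = "refsnp_id") -> List[str]:
    """Collect rs identifiers, skipping objects that lack the assumed key."""
    return [
        "rs" + str(obj[id_key])
        for obj in iter_refsnp_objects(path)
        if id_key in obj
    ]


if __name__ == "__main__":
    # Hypothetical file name; substitute a file downloaded from the FTP site.
    for rs in refsnp_ids("refsnp-sample.json.bz2"):
        print(rs)
```

Reading line by line keeps memory use flat, which matters for a cumulative file covering hundreds of millions of records.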
A study (PMID: 28158543) published in the July 2017 issue of Bioinformatics collects, classifies and analyzes single nucleotide variants (SNVs) that may affect response to currently approved drugs. They identified 2,640 SNVs of interest, most of which occur rarely in populations (minor allele frequency <0.01).
The researchers used protein sequence alignment tools and mined open data from multiple information resources accessed through E-utilities including PubChem Compound (Kim et al., 2016 PMID: 26400175), NCBI Gene (Maglott et al., 2014 PMID: 25355515), NCBI Protein (Sayers, 2013), MMDB (Madej et al., 2012 PMID: 22135289), PDB (Berman et al., 2000 PMID: 10592235), dbSNP (Sherry et al., 2001 PMID: 11125122), and ClinVar (Landrum et al., 2016 PMID: 26582918).
Questions, comments, and other feedback may be sent to Yanli Wang.
RefSeq release 85 is now accessible online, via FTP and through NCBI’s programming utilities. This full release incorporates genomic, transcript, and protein data available as of November 6, 2017, and contains 146,710,309 records, including 100,043,962 proteins, 20,905,608 RNAs, and sequences from 73,996 organisms. The release is provided in several directories as a complete dataset and as divided by logical groupings. See the RefSeq release notes for more information.
Starting in March 2018, SNP variation features will no longer be in RefSeq genome assembly records – chromosome and contig records with NC_, NT_, NW_ and AC_ accession prefixes. This change affects both the ASN.1 and flatfile records. Because the number of variants is already enormous and still growing, removing SNP features from these large genomic records will significantly reduce the size of RefSeq FTP files and make downloading and processing easier. We will continue to include SNPs on NG_-prefixed genomic records, and transcript (NM_, NR_, XM_, XR_) and protein (NP_, XP_, YP_) sequences.
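For pipelines that read SNP features from RefSeq records, the prefix rule above can be sketched as a small lookup. The function name and the three-way return value are illustrative choices rather than part of any NCBI tool; the prefix lists are taken directly from the paragraph above.

```python
from typing import Optional

# Genome assembly records (chromosomes and contigs) that no longer carry SNP features.
DROPPED_PREFIXES = ("NC_", "NT_", "NW_", "AC_")

# NG_ genomic, transcript, and protein records that continue to include SNPs.
RETAINED_PREFIXES = ("NG_", "NM_", "NR_", "XM_", "XR_", "NP_", "XP_", "YP_")


def snp_features_expected(accession: str) -> Optional[bool]:
    """Return False for records losing SNP features, True for records keeping
    them, and None when the prefix is not covered by the announcement."""
    prefix = accession.strip().upper()[:3]
    if prefix in DROPPED_PREFIXES:
        return False
    if prefix in RETAINED_PREFIXES:
        return True
    return None


if __name__ == "__main__":
    for acc in ("NC_000001.11", "NG_008855.1", "NM_000237.3", "ZZ_000001.1"):
        print(acc, snp_features_expected(acc))
```

Returning `None` for unlisted prefixes, rather than guessing, keeps the check limited to what the announcement actually covers.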
Reminder: As of September 2017, NCBI has stopped accepting submissions for non-human SNPs in dbSNP and dbVar. RefSeq flatfiles will stop presenting non-human variant data in November 2017.
Subscribe to the refseq-announce listserv for regular updates on RefSeq.
RefSeq release 84 is now accessible online, via FTP and through NCBI’s programming utilities.
This full release incorporates genomic, transcript, and protein data available as of September 11, 2017, and contains 140,627,690 records, including 95,563,598 proteins, 20,356,598 RNAs, and sequences from 72,965 organisms.
The release is provided in several directories as a complete dataset and as divided by logical groupings. See the RefSeq release notes for more information.
Phasing out support for non-human organisms
As of September 1, 2017, the dbSNP and dbVar databases have stopped accepting submissions for non-human organisms. Submissions for non-human variation will now be accepted by the European Variation Archive, one of our partners in the International Nucleotide Sequence Database Collaboration (INSDC).
NCBI dbSNP is pleased to announce a newly designed Reference SNP (RefSNP, rs) Report webpage to provide enhanced performance and presentation for access to individual RefSNP records. This Alpha version of the report enables browsing of submitted and computed RefSNP variant data from the redesigned dbSNP build system.
Figure 1. The dbSNP RefSNP Report Alpha for rs268.
RefSeq release 83 is now accessible online, via FTP and through NCBI’s programming utilities. This full release incorporates genomic, transcript, and protein data available as of July 17, 2017, and contains 132,052,465 records, including 88,385,530 proteins, 19,634,664 RNAs, and sequences from 71,356 organisms. The release is provided in several directories as a complete dataset and as divided by logical groupings. More information about RefSeq release 83 is available in the release notes.
NCBI will phase out support for non-human organisms in the dbSNP and dbVar databases. These databases will stop accepting submissions for non-human SNPs in September 2017. The interactive websites for these databases and related NCBI services, including RefSeq flatfiles, will stop presenting non-human variant data in November 2017.
To continue providing efficient and timely processing, annotation, and dissemination of data, dbSNP’s architecture and process flow have been redesigned. The technical redesign prepares the database to handle increasing data volumes and to provide timely, effective and trustworthy reference SNP results as submission rates continue to increase.
Highlights of the new system include:
- Use of data objects instead of a relational database
- Improved algorithms for clustering data into unique Reference SNPs
- Automation of the entire process to provide timely releases
- Guaranteed data consistency across dbSNP data accessed using web-based products or downloaded content, such as VCF and FTP files
This blog post is directed toward people who use dbSNP and dbVar, particularly those who submit non-human data to the two databases.
dbSNP and dbVar archive, process, display and report information related to germline and somatic variations from multiple species. These two databases have grown rapidly as sequencing and other discovery technologies have evolved, and now contain nearly two billion variants from over 360 species.
Based on projected growth and the resources required to archive and distribute the data, continued support for all organisms will become unsustainable for NCBI in the near future. Therefore, NCBI will phase out support for all non-human organisms in dbSNP and dbVar, and will support only human variation.