dbVar now provides easy-to-use human non-redundant SV reference datasets to aid the interpretation of structural variants


dbVar non-redundant SV (NR SV) datasets include more than 2.2 million deletions, 1.1 million insertions, and 300,000 duplications. These data are aggregated from over 150 studies including 1000 Genomes Phase 3, Simons Genome Diversity Project, ClinGen, ExAC, and others. You can use NR SV data files to filter and annotate variants in a broad range of applications:

  1. Clinicians can easily filter patients’ genome data to find SVs that overlap with variants previously reported as clinically significant.
  2. Researchers can compare the results of their own genome-wide SV surveys with dbVar NR data to identify variants that are novel or rare, those which may be pathogenic, and in some cases obtain allele frequencies for matching variants. Users can also annotate SV data with NR SV and other genomic annotations to prioritize those variants most likely to impact biological function.
  3. Developers of variant analysis pipelines can use dbVar NR data to help identify novel variants, calibrate their algorithms, or simply integrate the data into downstream analysis tools and workflows.
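The filtering use cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SV` record, the 1-based inclusive coordinates, the 50% reciprocal-overlap threshold, and the sample values are all hypothetical and are not taken from the NR SV files, whose actual layout is described on the GitHub site.

```python
from typing import NamedTuple


class SV(NamedTuple):
    """Hypothetical in-memory SV record (not the NR SV file layout)."""
    chrom: str
    start: int    # assumed 1-based, inclusive
    end: int      # assumed 1-based, inclusive
    sv_type: str  # e.g. "deletion", "insertion", "duplication"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start + 1), shared / (b.end - b.start + 1))


def split_known_and_novel(calls, reference, min_overlap=0.5):
    """Split calls into those matching a reference SV and those that do not.

    A call "matches" when a reference SV of the same type reaches the
    reciprocal-overlap threshold (an assumed, commonly used criterion).
    """
    known, novel = [], []
    for call in calls:
        matched = any(
            call.sv_type == ref.sv_type
            and reciprocal_overlap(call, ref) >= min_overlap
            for ref in reference
        )
        (known if matched else novel).append(call)
    return known, novel


# Made-up example data, for illustration only.
reference = [
    SV("chr1", 1000, 2000, "deletion"),
    SV("chr2", 500, 900, "duplication"),
]
calls = [
    SV("chr1", 1100, 2050, "deletion"),
    SV("chr3", 10, 400, "deletion"),
]
known, novel = split_known_and_novel(calls, reference)
```

The nested loop is fine for small inputs; with millions of reference SVs, an interval index or a sorted sweep would be the more practical choice.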

dbVar’s NR SV reference data are updated monthly. These updates include new database submissions. We welcome your feedback on the content and usability of these files so that we can improve them.

For more information, please see our GitHub site, which includes brief tutorials and access to NR SV datasets by FTP.

Going to ASHG? Here’s a sneak peek at our ClinVar poster


This October, some NCBI staff will head out to present at the American Society of Human Genetics (ASHG) conference in sunny San Diego. Below, we give you an inside scoop on the ClinVar poster that we’ll present at ASHG.

Want to learn more about how you can submit phenotype and functional data? Or access the data?

Have we hooked you yet?

Head to Poster 1492T “Increasing phenotypic and functional evidence in ClinVar” on Thursday, Oct. 18 from 3 PM to 4 PM. (Exhibit Hall, Ground Floor)

RefSeq release 90 is public


RefSeq release 90 is accessible online, via FTP and through NCBI’s programming utilities.

This full release incorporates genomic, transcript, and protein data available as of September 10, 2018. It contains 173,956,003 records, including 121,138,769 proteins, 23,838,676 RNAs, and sequences from 84,276 organisms.

The release is provided in several directories as a complete dataset and as divided by logical groupings.

GenBank release 227 available through FTP, BLAST & Entrez


GenBank release 227.0 (8/13/2018) has 208,831,050 traditional records (including non-bulk-oriented TSA) containing 260,806,936,411 base pairs of sequence data. There are also 665,309,765 WGS records containing 3,204,855,013,281 base pairs of sequence data, 249,295,386 bulk-oriented TSA records containing 225,520,004,678 base pairs of sequence data, and 15,822,538 bulk-oriented TLS records containing 6,077,824,493 base pairs of sequence data.

GenBank will start using expanded accession formats by December 2018


By the end of 2018, GenBank and other INSDC members will expand the accession formats used for sequencing projects. We have assigned almost all the possible accession numbers using the current, shorter formats. Using these longer formats will allow us to expand accession ranges and give us greater capacity.

The expanded format for Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS) sequencing projects will use a six-letter Project Code prefix and a two-digit Assembly-Version number followed by 7, 8, or 9 digits (for example, AAAAAA020000001).

Non-WGS/TLS/TSA nucleotide sequences currently use a “2+6” format: a two-letter prefix followed by six digits. This format will be expanded to a two-letter prefix followed by eight digits.

Protein sequences currently use a “3+5” accession format: a three-letter prefix followed by five digits. By the end of 2018, this format will be expanded to use seven digits.

You will need to adjust any processing methods to accommodate these new identifier formats. Please write to the helpdesk with any questions about the new formats.
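As one way to check whether a pipeline copes with the formats described in this post, the sketch below transcribes them into regular expressions. It covers only the formats named here; the upper-case requirement, the stripping of a trailing `.version` suffix, and the names `ACCESSION_PATTERNS` and `classify_accession` are assumptions made for illustration, not an official validator.

```python
import re
from typing import Optional

# Patterns transcribed from the formats described in this post.
# Upper-case-only letters are an assumption of this sketch.
ACCESSION_PATTERNS = {
    # six-letter Project Code + two-digit Assembly-Version + 7 to 9 digits
    "WGS/TSA/TLS (expanded)": re.compile(r"[A-Z]{6}\d{2}\d{7,9}"),
    "nucleotide (2+6)": re.compile(r"[A-Z]{2}\d{6}"),
    "nucleotide (2+8, expanded)": re.compile(r"[A-Z]{2}\d{8}"),
    "protein (3+5)": re.compile(r"[A-Z]{3}\d{5}"),
    "protein (3+7, expanded)": re.compile(r"[A-Z]{3}\d{7}"),
}


def classify_accession(accession: str) -> Optional[str]:
    """Return a label for the matching format, or None if nothing matches."""
    # Assumption: a trailing ".N" sequence version is ignored if present.
    core = accession.strip().split(".", 1)[0].upper()
    for label, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(core):
            return label
    return None


wgs_label = classify_accession("AAAAAA020000001")  # example from this post
```

Code that stores accessions in fixed-width fields, or that validates them with a fixed digit count, is the most likely place to need changes.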

Improved Search Now Available Across NCBI Databases


Earlier this year, we announced the release of a new and improved search feature that interprets plain language to give better results for common searches. This feature, originally developed in NCBI Labs and later released on the NCBI All Databases search, is now available across several NCBI resources: Nucleotide, Protein, Gene, Genome, and Assembly. Whether you are searching for a specific gene or for a whole genome, you will now retrieve NCBI’s best results regardless of the database you search.

The image below shows the results of a search for human INS in the Nucleotide database. Even though this is a Nucleotide search, the results include relevant information from Gene, Protein, and Taxonomy, plus links to the NCBI reference sequences (RefSeq) as well as access to BLAST and the insulin gene region in NCBI’s genome browser, the Genome Data Viewer.

Figure 1. The new natural language search result in the Nucleotide database from a search for human INS.

Try out this new search capability and let us know what you think. And keep visiting the NCBI Labs search page to try our latest experiments, which we’ll also announce here on NCBI Insights.

 

September 12 NCBI Minute: Release Plan for NCBI API Keys


Update: Webinar is now on September 12!

If you already registered for the September 5 date, you are automatically registered for September 12. You do not need to re-register. We welcome anyone else who would like to register.

As previously announced, NCBI has introduced API keys for the E-utilities. You will soon want to start using API keys in your E-utilities calls, as these will allow the fastest access to NCBI databases. In this webinar, we will review how API keys work and provide a schedule of brief testing periods and the timing of the full release of API key functionality.

Date and time: Wed, Sep 12, 2018 12:00 PM – 12:30 PM EDT

Register here: https://bit.ly/2v0wFMl

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

(Webinar re-scheduled to September 12 because the presenter was called away unexpectedly.)
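For readers who have not yet used a key, here is a minimal sketch of how one is attached to an E-utilities request through the `api_key` query parameter. The helper name `build_esearch_url`, the search term, and the placeholder key are illustrative assumptions; the snippet only builds the URL and does not contact NCBI.

```python
from typing import Optional
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db: str, term: str, api_key: Optional[str] = None) -> str:
    """Build an ESearch URL, adding the api_key parameter only when given."""
    params = {"db": db, "term": term}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}esearch.fcgi?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder; substitute the key from your NCBI account.
url = build_esearch_url("nucleotide", "human INS", api_key="YOUR_API_KEY")
```

Keeping the key in one helper like this makes it easy to load it from an environment variable instead of hard-coding it in scripts.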

Improved search for prokaryotic assemblies and genes


We have made many improvements to the search functionality on NCBI’s global search page that will benefit users trying to find prokaryotic assemblies and genes. These improvements aim to highlight the best results and provide links to related NCBI content, so you don’t have to sift through pages of results and navigate between different NCBI resources.

Standalone variation services replace Variation Reporter


As of July 2018, a new set of standalone variation services replaces the variant matching functions of Variation Reporter. Variation Reporter was a tool designed to search human sequence variation data by location and to report matching variants found in dbSNP, dbVar, and ClinVar.

The new services are faster, better at handling variants in repeat regions, and scalable to accommodate the continued explosive growth of variation volume. You can find more information about the services in the initial blog post and online SPDI document.
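As a rough illustration of the SPDI notation mentioned above, the sketch below splits a string of the form Sequence:Position:Deletion:Insertion into its four fields. The `SPDI` tuple, the `parse_spdi` helper, and the example variant are hypothetical; consult the SPDI document for the authoritative definition, including how the position is counted and what the deletion field may contain.

```python
from typing import NamedTuple


class SPDI(NamedTuple):
    sequence: str   # sequence identifier, e.g. a RefSeq accession.version
    position: int   # position on that sequence (see the SPDI document)
    deletion: str   # deleted field, kept as text in this sketch
    insertion: str  # inserted sequence


def parse_spdi(text: str) -> SPDI:
    """Split a Sequence:Position:Deletion:Insertion string into its fields."""
    parts = text.strip().split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}")
    sequence, position, deletion, insertion = parts
    if not sequence or not position.isdigit():
        raise ValueError(f"malformed sequence or position in {text!r}")
    return SPDI(sequence, int(position), deletion, insertion)


# Made-up variant, for illustration only.
variant = parse_spdi("NC_000001.11:12345:C:T")
```

This parser checks only the overall shape of the string; it does not validate the sequence identifier or the alleles against a reference.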

If you would like to report any issues related to these new services and/or would like to provide comments, please write to snp-admin@ncbi.nlm.nih.gov.

If you have any specific questions about the NCBI site in general, contact us at info@ncbi.nlm.nih.gov.

We appreciate your continued support and interaction with the NCBI tools.