NCBI is pleased to announce a Biomedical Data Science Codeathon in collaboration with Carnegie Mellon in Pittsburgh, PA on January 8-10, 2020.
We’re specifically seeking people with experience working with complex diseases, precision medicine, and genomic analyses. If this describes you, please apply! This event is for researchers, including students and postdocs, who are already engaged in the use of bioinformatics data or in the development of pipelines for large scale genomic analyses from high-throughput experiments. The event is open to anyone selected for the codeathon and willing to travel to Pittsburgh.
GenBank release 234.0 (10/14/2019) is now available on the NCBI FTP site. This release has 6.69 trillion bases and 1.68 billion records.
The release has 216,763,706 traditional records containing 386,197,018,538 base pairs of sequence data. There are also 1,097,629,174 WGS records containing 5,985,250,251,028 base pairs of sequence data, 342,811,151 bulk-oriented TSA records containing 305,371,891,408 base pairs of sequence data, and 27,460,978 bulk-oriented TLS records containing 10,848,455,369 base pairs of sequence data.
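The headline totals can be checked against the per-division figures. The short sketch below simply sums the counts quoted in this post; the dictionary and function names are illustrative and not part of any NCBI tool.

```python
# Record and base-pair counts copied from the release figures quoted above.
# The names RELEASE_234 and release_totals are hypothetical, for illustration only.
RELEASE_234 = {
    "traditional": (216_763_706, 386_197_018_538),
    "WGS": (1_097_629_174, 5_985_250_251_028),
    "TSA": (342_811_151, 305_371_891_408),
    "TLS": (27_460_978, 10_848_455_369),
}


def release_totals(divisions):
    """Return (total_records, total_bases) summed over all divisions."""
    total_records = sum(records for records, _ in divisions.values())
    total_bases = sum(bases for _, bases in divisions.values())
    return total_records, total_bases


records, bases = release_totals(RELEASE_234)
print(f"{records:,} records ({records / 1e9:.2f} billion)")
print(f"{bases:,} bases ({bases / 1e12:.2f} trillion)")
# 1,684,665,009 records (1.68 billion)
# 6,687,667,616,343 bases (6.69 trillion)
```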
You now have access to bulk settings options for track hubs in the Genome Data Viewer (GDV) and Sequence Viewer. These settings allow you to pick the default tracks that load into the viewer from your chosen track hub. You can access the bulk options menu by clicking the collapsed menu or “hamburger” icon (stack of horizontal bars) at the right end of the track grouping in the Configure Track Hubs dialog (Figure 1).
Figure 1. The Configure Track Hubs dialog in GDV. You can activate the bulk settings menu for a connected track hub by clicking the hamburger icon at the right of the track grouping. Clicking Select Default tracks checks all of the tracks in that grouping, Smoothed PhyloCSF in this case.
The action menu (Figure 1) now contains Collections and My Bibliography, allowing you to manage and share groups of citations. After running a search, you will also find a “Create alert” link under the search box that lets you set up automatic My NCBI email updates for your search.
Figure 1. New PubMed search result page showing the new “Create alert” link and updated action menu.
Going forward, we will continue to develop new features leading up to the time when this new version of PubMed will replace the legacy PubMed. As this progresses, we would love to hear what you think about these new additions! Please use the “Feedback” button (available on every page of the new PubMed) to submit your comments, questions, or concerns.
The latest dbVar data release includes the Genome in a Bottle benchmark structural variant (SV) callset (pre-print Zook et al. 2019) – a highly scrutinized, carefully curated set of 12,745 sequence-resolved deletions, insertions, and delins variants from Personal Genome Project Ashkenazi trio son HG002. The data serve as a robust benchmark standard with which to measure the performance of sequencing and variant-calling pipelines. It “reliably identifies both false negatives and false positives in high-quality SV callsets” (pre-print Zook et al. 2019) that are based on short-, linked-, and long-read sequencing as well as optical mapping.
Genome Workbench version 3 is a major upgrade, including the addition of the Genome Submission Wizard. This video guides you through the wizard, from uploading your genome data file to completion of the submitter report, which is ready to submit to GenBank using tools such as Submission Portal or BankIt. Note: An on-line tutorial is under “Manuals” on the Genome Workbench home page.
If you download data from the SRA (Sequence Read Archive) FTP site, we would encourage you to try the SRA Toolkit. This is particularly true if you use the SRA Fuse/FTP site at ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant, which the SRA team will decommission on December 1, 2019.
The SRA Toolkit offers several advantages for downloading SRA data, including greater flexibility in specifying the data you need as well as access to public SRA data in the cloud. If you’re new to the Toolkit, you may want to start with these instructions.
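As a rough illustration of a Toolkit-based download, the sketch below builds the two command lines commonly used to fetch a run and convert it to FASTQ (`prefetch`, then `fasterq-dump`). It assumes the SRA Toolkit is installed and on your PATH. The helper name, the run accession `SRR000001`, and the output directory are placeholders chosen for this example; the linked instructions remain the authoritative guide.

```python
import shlex


def sra_download_commands(accession, outdir="."):
    """Build Toolkit command lines for one run accession (hypothetical helper).

    The commands are returned as argument lists so they can be printed for
    review or passed to subprocess.run() once the Toolkit is installed.
    """
    accession = accession.strip()
    if not accession:
        raise ValueError("an SRA run accession is required")
    return [
        ["prefetch", accession],                          # download the run
        ["fasterq-dump", accession, "--outdir", outdir],  # convert to FASTQ
    ]


# SRR000001 and "fastq" are placeholders for this sketch.
for cmd in sra_download_commands("SRR000001", outdir="fastq"):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands before running them keeps the example safe to try on a machine without the Toolkit.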
If you have any questions or concerns about downloading SRA data, please contact email@example.com. We’d love to hear from you!
You can now download images in both PDF and Scalable Vector Graphics (SVG) formats from our Sequence Viewer and genome browsers such as the Genome Data Viewer! SVG files are ideal for editing in image editors and provide high quality graphics for publications, posters, and presentations. Both the PDF and SVG files that you download contain vector graphics for high fidelity images.
You can download image files by choosing the “Printer-Friendly PDF/SVG” option under the Tools menu from any Graphical Sequence Viewer application (Figure 1).
Figure 1. Printer friendly download options from the graphical view in the Genome Data Viewer. You can download either PDF or SVG formats, which are easily edited in standard graphics applications.
The new design for ClinVar pages is now our default view! Thank you for the feedback on the new design while it was under development. The redesigned pages have several new features described in a previous post. The current post highlights some of these improvements in the new ClinVar, including the separate variant and condition views, retrieving specific versions of records, and support for ClinVar variant accessions and XML in the E-Utilities.
Using the new ClinVar variant (VCV) and condition (RCV) views
One important improvement in ClinVar is the separate variant-centric and condition-centric views, represented by VCV and RCV accessions, respectively. The VCV record shows ClinVar data aggregated by a variant or set of variants (haplotype). The RCV aggregates conditions reported for a particular variant or set of variants. These two pages are especially useful in cases where there are different interpretations for a variant, as the examples below show.
BRCA2 variant: hereditary breast and ovarian cancer
Variants in the BRCA2 gene may cause hereditary breast and ovarian cancer. However, there are many different terms that represent “hereditary breast and ovarian cancer” or related conditions. If you look at an RCV record for only one term, such as “Breast ovarian cancer, familial, 2”, you may only see that the variant has been interpreted as Likely pathogenic. Using the VCV record, you can view all of the interpretations for this variant, so that you see that the variant has been interpreted as both Likely pathogenic for “Breast ovarian cancer, familial, 2” and Uncertain significance for “Hereditary breast and ovarian cancer syndrome” (Figure 1). Aggregating conditions on the VCV record makes it clear that the variant is likely pathogenic for some forms of hereditary breast cancer.
Figure 1. Aggregating by condition on the VCV record for NM_000059.3:c.67G>A makes clear that the variant is likely pathogenic for some forms of hereditary breast cancer even though the interpretation is uncertain for one breast cancer syndrome.
SCN5A variant: Brugada syndrome and Long QT syndrome 3
Variants in the SCN5A gene may cause two different arrhythmogenic disorders: Brugada syndrome and Long QT syndrome 3. For the coding region variant VCV000067672.1, you can see that there seem to be conflicting interpretations of pathogenicity (Figure 2). But when you look at the interpretations for each disorder using the Conditions tab, you’ll see that these apparently conflicting interpretations are for different disorders (conditions). The variant has been interpreted as Pathogenic for Long QT syndrome 3 (RCV000677695.1) but as Uncertain significance for Brugada syndrome (RCV000638649.1). The RCV records allow you to distinguish different interpretations for different disorders.
Figure 2. The conditions interpreted for the variant NM_000335.4:c.1604G>A. The variant has a different interpretation for each of the two arrhythmogenic disorders.
Likewise, starting from the point of view of a condition such as Brugada syndrome, you could quickly find out that the same variant has been interpreted in different ways for other conditions by linking to the variant report.
Retrieving a specific version of a ClinVar record (accession.version)
ClinVar records have versioned accessions (accession.version) that allow you to retrieve a specific version of a record. These work in a similar way to versioned records in other NCBI molecular resources. For example, you can retrieve the most recent version of a record by searching with the accession without the version, VCV000007105, or retrieve a previous version by searching with the full accession.version, VCV000007105.3. (Note: Version-specific searching for ClinVar records works only on the ClinVar resource. An All Databases search only retrieves the most recent version.)
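The accession.version convention is easy to handle in scripts. The sketch below splits a search term into its accession and optional version. The function name and the regular expression are assumptions inferred from the example accessions in this post, not an official ClinVar specification.

```python
import re

# Pattern inferred from the examples in this post (e.g. VCV000007105.3);
# it is an assumption, not an official definition of ClinVar accessions.
_ACCESSION_RE = re.compile(r"^(?P<accession>(?:VCV|RCV)\d+)(?:\.(?P<version>\d+))?$")


def split_accession(term):
    """Return (accession, version); version is None when no version is given."""
    match = _ACCESSION_RE.match(term.strip())
    if match is None:
        raise ValueError(f"not a VCV/RCV accession: {term!r}")
    version = match.group("version")
    return match.group("accession"), int(version) if version else None


print(split_accession("VCV000007105"))    # ('VCV000007105', None)
print(split_accession("VCV000007105.3"))  # ('VCV000007105', 3)
```

A missing version maps to `None`, mirroring the behavior described above where an unversioned accession retrieves the most recent record.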
Changes to E-utilities (esearch, efetch, esummary)
The new web pages use ClinVar’s new variation-centric XML as the source of data and new accession numbers, beginning with VCV. E-utilities for ClinVar also now support VCV accessions and return the new XML format. You can now use E-Fetch to retrieve the latest VCV record using a VCV accession number, an accession.version, or a variation ID.
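As a minimal sketch of what such a request might look like, the code below builds an E-Fetch URL for a VCV accession or accession.version. The `db=clinvar` and `id` parameters follow the general E-utilities pattern; `rettype=vcv` is an assumption based on ClinVar's E-utilities documentation and should be confirmed there. Retrieval by variation ID needs an additional parameter and is not covered here. The helper name is hypothetical.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def clinvar_vcv_url(accession):
    """Build an E-Fetch URL for a VCV accession or accession.version.

    Hypothetical helper: rettype=vcv is assumed from ClinVar's E-utilities
    documentation and is not stated in this post.
    """
    accession = accession.strip()
    if not accession.startswith("VCV"):
        raise ValueError("this sketch only handles VCV accessions")
    params = {"db": "clinvar", "rettype": "vcv", "id": accession}
    return f"{EFETCH_URL}?{urlencode(params)}"


print(clinvar_vcv_url("VCV000007105"))    # latest version
print(clinvar_vcv_url("VCV000007105.3"))  # specific version
```

The URL can be opened in a browser or passed to any HTTP client; the sketch stops short of sending the request so that it runs without network access.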
We are continually working to improve the display and usability of the website. Please use the feedback button on each Variation page, send us your comments, and let us know how ClinVar has helped you at firstname.lastname@example.org.
Read the recent publication (PMID: 31427293) on the AMRFinder, a tool that identifies antimicrobial resistance (AMR) genes in bacterial genome sequences using a high-quality curated AMR gene reference database. We use the AMRFinder to identify AMR genes in the hundreds of bacterial genomes that NCBI receives every day, and the results of AMRFinder are used in NCBI’s Isolates Browser to provide accurate assessments of AMR gene content. You can install AMRFinder locally and run it yourself. Follow the instructions on our GitHub site.
Since the publication we have upgraded AMRFinder to AMRFinderPlus. The enhanced tool now
supports searches based on protein annotations, nucleotide sequences, or both for best results
identifies point mutations in Campylobacter, E. coli, Shigella, and Salmonella
optionally identifies many genes involved in biocide, heat, metal, and stress resistance, as well as many antigenicity and virulence genes
provides information about gene function, including resistance to individual antibiotics and other phenotypes
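The features listed above correspond to command-line options. The sketch below assembles an `amrfinder` command line. It assumes AMRFinderPlus is installed and that the options `--protein`, `--nucleotide`, `--gff`, `--organism`, and `--plus` behave as described in the GitHub documentation. The helper name and file names are hypothetical, so check `amrfinder --help` for the options in your installed version.

```python
def amrfinder_command(protein=None, nucleotide=None, gff=None,
                      organism=None, plus=False):
    """Assemble an amrfinder argument list (hypothetical helper).

    The option names are assumed from the AMRFinderPlus documentation;
    verify them against `amrfinder --help` before relying on this.
    """
    if protein is None and nucleotide is None:
        raise ValueError("provide a protein file, a nucleotide file, or both")
    cmd = ["amrfinder"]
    if protein:
        cmd += ["--protein", protein]        # search protein annotations
    if nucleotide:
        cmd += ["--nucleotide", nucleotide]  # search nucleotide sequence
    if gff:
        cmd += ["--gff", gff]                # annotation linking the two inputs
    if organism:
        cmd += ["--organism", organism]      # enables point-mutation screening
    if plus:
        cmd.append("--plus")                 # stress and virulence genes
    return cmd


# File names are placeholders for this sketch.
print(" ".join(amrfinder_command(protein="proteins.faa",
                                 nucleotide="genome.fna",
                                 gff="annotation.gff",
                                 organism="Salmonella",
                                 plus=True)))
```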