New release of the Prokaryotic Genome Annotation Pipeline now available


We have released a new version of the Prokaryotic Genome Annotation Pipeline (PGAP), available on GitHub. The new release includes the ability to ignore pre-annotation validation errors (--ignore-all-errors). This new feature allows you to produce a preliminary annotation for a draft version of the genome, even one that contains vector and adapter sequences or that is outside of the size range for the species. This draft annotation should be helpful with your ongoing work on the genome assembly. Please keep in mind that these pre-annotations and assemblies with contaminants or other errors are not suitable for submission to GenBank.

Another new feature allows you to provide the name of the consortium that generated the assembly and annotation so that this information appears in the final GenBank records. For more details, consult our guidelines on input files.

See our previous post and our documentation for details on how to obtain and run PGAP yourself.

Next on our to-do list is a module for calculating Average Nucleotide Identity (ANI) to confirm the assembly’s taxonomic assignment. Stay tuned!

Structural Variant Hackathon


NCBI is pleased to announce a Structural Variant Hackathon on October 11-13, 2019, at the Baylor College of Medicine in Houston, Texas, immediately before ASHG.

We’re specifically looking for folks who have experience in working with structural variants, complex disease, precision medicine, and similar genomic analysis. If this describes you, please apply! This event is for researchers, including students and postdocs, who are already engaged in the use of bioinformatics data or in the development of pipelines for large-scale genomic analyses from high-throughput experiments (please note that the event itself will focus on open access public human data).

Potential topics include:

  • Mapping structural variants to public databases
  • Calculating the heritability of different types of structural variants
  • CNV effect on isoform expression
  • Assembly accuracy for metagenomics
  • Quality assessment in large cohorts

The hackathon runs from 9 am to 6 pm each day, with the potential to extend into the evening hours. There will also be optional social events at the end of each day. Participants will be organized into five to eight teams of five to six individuals with varied backgrounds and expertise, each led by an experienced team leader. These teams will build pipelines and tools to analyze large datasets within a cloud infrastructure. Each day, we will come together to discuss progress on each of the topics, bioinformatics best practices, coding styles, etc.

There will be no registration fee associated with attending this event.

Note: Participants will need to bring their own laptop to this program. No financial support for travel, lodging, or meals is available for this event.

ClinVar’s new XML aggregated by Variation ID


Now it’s easier than ever to access all data in ClinVar for a variant or set of variants across all reported diseases.  ClinVar’s new XML is organized by variant only (Variation ID), instead of the variant-disease pair. This reduces redundancy, for example in cases where a variant is related to several disease concepts, and makes the XML consistent with the ClinVar web pages. You can get ClinVarVariationRelease XML from the /xml/clinvar_variation/ directory on the ClinVar FTP site.  New features in ClinVarVariationRelease XML shown in Figure 1 include:

  • Explicit elements to distinguish between variants that were directly interpreted and “included” variants, those that were interpreted only as part of a Haplotype or Genotype. The clinical significance for included variants is indicated as “no interpretation for the single variant”.
  • Explicit elements to distinguish records for simple alleles, haplotypes, and genotypes.
  • The Replaces element that provides a history and indicates accessions that were merged into the current accession.
  • A section that maps the submitted name or identifier for the interpreted condition to the corresponding name used in ClinVar and the MedGen Concept Identifier (CUI).

Figure 1. ClinVar variant-centric XML showing a variant record for a haplotype (VCV000236230) that comprises two included variations (SimpleAlleles) that are marked as “no interpretation for the single variant”. The record includes all the condition records (RCVList) with names and identifiers from MedGen, OMIM and other sources.
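If you want a feel for what variant-centric records look like to a script, here is a minimal Python sketch. It does not use the real schema: the sample XML is hand-written and heavily simplified, the element and attribute names (VariationArchive, InterpretedRecord, IncludedRecord, RCVAccession, Title) are assumptions loosely based on the figure, and the condition values are placeholders. Check the documentation for the actual structure before relying on any of these names.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for one variant-centric record.
# Element/attribute names are assumptions; all values except the VCV
# accession from the figure caption are placeholders.
SAMPLE = """
<ClinVarVariationRelease>
  <VariationArchive Accession="VCV000236230">
    <InterpretedRecord>
      <Haplotype>
        <SimpleAllele AlleleID="PLACEHOLDER_1"/>
        <SimpleAllele AlleleID="PLACEHOLDER_2"/>
      </Haplotype>
      <RCVList>
        <RCVAccession Accession="RCV_PLACEHOLDER_1" Title="condition A"/>
        <RCVAccession Accession="RCV_PLACEHOLDER_2" Title="condition B"/>
      </RCVList>
    </InterpretedRecord>
  </VariationArchive>
</ClinVarVariationRelease>
"""


def summarize(xml_text):
    """Return one summary entry per variant record, keyed by accession."""
    root = ET.fromstring(xml_text)
    summary = {}
    for archive in root.iter("VariationArchive"):
        # A record is either directly interpreted or only "included".
        record = archive.find("InterpretedRecord")
        if record is None:
            record = archive.find("IncludedRecord")
        if record is None:
            continue
        # Identify the top-level variation type of the record.
        kind = next(
            (tag for tag in ("Genotype", "Haplotype", "SimpleAllele")
             if record.find(tag) is not None),
            "unknown",
        )
        # Count alleles nested inside a haplotype or genotype.
        nested = 0 if kind == "SimpleAllele" else len(record.findall(".//SimpleAllele"))
        summary[archive.get("Accession")] = {
            "record": record.tag,
            "type": kind,
            "included_alleles": nested,
            # All conditions reported for this variant, in one place.
            "conditions": [rcv.get("Title") for rcv in record.iter("RCVAccession")],
        }
    return summary


print(summarize(SAMPLE))
```

Because every condition for a variant sits under a single record, one pass over the record is enough to collect them. For the full release file, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading everything at once.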

To learn more about how to use this data, read our documentation.

Tell us how ClinVar has helped you by writing to us at clinvar@ncbi.nlm.nih.gov.

September 11 Webinar: A beginner’s guide to genes and sequences at NCBI


On Wednesday, September 11, 2019 at 12 PM, NCBI staff will present a webinar for people with limited experience working with gene and sequence information. You will learn about the kinds of data available for genes and sequences, how to select the most informative records, and how to find related genes and sequences using pre-computed information and the BLAST sequence search service.

  • Date and time: Wed, Sep 11, 2019 12:00 PM – 12:30 PM EDT
  • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

GenBank release 233


GenBank release 233.0 (8/21/2019) is now available on the NCBI FTP site. This release has 6.26 terabases and 1.65 billion records.

The release has 213,865,349 traditional records containing 366.7 billion base pairs of sequence data. There are also 1.07 billion WGS records containing 5.6 trillion base pairs of sequence data, 331.3 million bulk-oriented TSA records containing 294.7 billion base pairs of sequence data, and 26 million bulk-oriented TLS records containing 10.5 billion base pairs of sequence data.

GRAF, a tool for finding duplicates and closely related samples in large genomic datasets


NCBI’s Genetic Relationship and Fingerprinting (GRAF) tool is a quality assurance tool that can quickly find duplicates and closely related subjects in your data using SNP genotypes.

The population tool GRAF-pop included in GRAF computes subject ancestries using genotypes and normalizes ancestry prediction in large datasets collected across different genotyping platforms, making it possible to generate population frequency based on more than a million dbGaP samples.

Who can use this?

GRAF is a tool for researchers; it is not designed to assess an individual’s ancestry or to find relatives.

You can run this tool on your own large datasets and get results within minutes to hours, even when the genotype missing rate is very high, on the order of 99%. The tool can also check genotype datasets obtained using different chips or platforms and plot them in the same picture for comparison.
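To give a rough idea of how duplicates can be spotted from SNP genotypes even when many calls are missing, here is a toy Python sketch. It is not GRAF's algorithm: it simply compares each pair of samples on the SNPs called in both and flags pairs whose mismatch rate is very low. The genotype coding (0, 1, 2 alternate alleles, None for missing), the function names, the thresholds, and the sample data are all made up for illustration.

```python
from itertools import combinations


def compare(a, b):
    """Return (mismatch rate, number of SNPs called in both samples)."""
    # Only SNPs with a genotype call in both samples are informative.
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None, 0
    mismatches = sum(x != y for x, y in shared)
    return mismatches / len(shared), len(shared)


def likely_duplicates(samples, max_rate=0.05, min_shared=3):
    """List sample pairs whose shared genotypes almost never disagree.

    max_rate and min_shared are arbitrary illustrative thresholds.
    """
    pairs = []
    for (name_a, geno_a), (name_b, geno_b) in combinations(samples.items(), 2):
        rate, n_shared = compare(geno_a, geno_b)
        if rate is not None and n_shared >= min_shared and rate <= max_rate:
            pairs.append((name_a, name_b, rate))
    return pairs


# Made-up genotypes: count of alternate alleles per SNP, None = missing call.
samples = {
    "S1": [0, 1, 2, None, 1, 0],
    "S2": [0, 1, 2, 2, None, 0],
    "S3": [2, 0, 1, 1, 1, 2],
}

print(likely_duplicates(samples))
```

Restricting each comparison to SNPs called in both samples is what lets a pairwise check like this tolerate missing data; a real tool also has to handle genotyping error and distinguish duplicates from close relatives, which this sketch does not attempt.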

Magic-BLAST version 1.5.0 is here!


We’ve just released a new version of Magic-BLAST with several new, user-driven enhancements like:

  • Nanopore sequence alignment
  • Improved multithreading performance
  • Support for the new BLAST database version, BLASTDBv5, that allows you to limit your search by taxonomy
  • More reliable placements of reads

The new executables are available on the NCBI FTP site.

A new paper (PMID: 31345161), published in BMC Bioinformatics in July 2019, describes the use and accuracy of Magic-BLAST.

Magic-BLAST aligns next-generation DNA-Seq and RNA-Seq reads. Read more about the latest version of Magic-BLAST in the release notes.

How well do you know GeneReviews®?


You may know . . .

  • We offer expert-authored, peer-reviewed chapters on more than 750 genetic disorders.
  • Our standardized format enables busy clinicians to readily find the information they need.
  • Molecular genetic testing strategies are presented in the context of clinical care and genetic counseling implications.
  • Tables link specific molecular genetic information to entries in OMIM (Online Mendelian Inheritance in Man), ClinVar, and genomic databases.
  • Resource lists connect families to information and support.
  • Links to actionable information help clinicians find available clinical trials and genetic tests in the NIH’s Genetic Testing Registry (GTR).
  • Chapters are continually updated to reflect changes in clinically relevant information, such as test availability and treatment protocols.

But do you also know . . .

  • You can volunteer to create a GeneReviews® chapter in your area of expertise. Start by reading the information for prospective authors.
  • Our Educational Materials, designed for health care professionals of varying experience with clinical genetics, augment our glossary to clarify genetics concepts.
  • For genetics professionals, we summarize the latest information on:
    • Imprinting errors and uniparental disomy (UPD) not detectable by sequence analysis
    • Disorders caused by nucleotide repeat expansions/contractions
    • Disorders with highly homologous gene family members or pseudogenes
  • Founder variant tables compile, for the first time in one place, data to inform testing recommendations and clinical decision making for disorders more common in Finnish, Ashkenazi Jewish, Inuit, Yup’ik, Cree/Ojibway, and Navajo populations.
  • A succinct, one-stop information page on direct-to-consumer genetic testing gives medical professionals information they need in order to advise patients who have pursued testing on their own.

Check our What’s New page for weekly new and updated postings.