NIH Data Science Collaborative Hackathon April 16 – 18, 2018


The NCBI will assist with a data science hackathon to take place on the NIH Campus in Bethesda, Maryland, from April 16-18, 2018.

The hackathon will focus on tools for advanced analysis of biomedical datasets including text, images, next generation sequencing data, proteomics, and metadata. Many individuals who attend these events have already worked with large datasets or developed informatics tools, code, or pipelines; however, researchers who are in the earlier stages of their data science journey, including students and postdocs, are also encouraged to apply. Some projects are available to other non-scientific developers, mathematicians, or librarians.

The event is open to anyone selected for the hackathon and willing to travel to Bethesda, Maryland.

Participants will be organized into five to eight teams of five to six individuals with various backgrounds and expertise, each with an experienced leader. These teams will build pipelines and tools to analyze large datasets within a cloud infrastructure. The hackathon runs from 9 am – 6 pm each day, with an optional social event on the evening of the second day.

Potential subjects for this iteration include:
* Implementing CWL-based genome annotation pipelines
* Prototyping federated cloud-search for biomedical data
* Machine-learning based metadata harmonization
* Visualization of Single Cell RNA-Seq Data
* Sentiment analysis from a variety of text corpora
* Metadata standardization for EMR analysis
* Building an educational experience for RNA-Seq and epigenomics analysis
* Expanding a versatile antimicrobial resistance pipeline
* Searching for novel virus families

Please see the application for more details and additional projects. Applications are due Monday March 22nd, 2018 by 3 pm ET.


RefSeq release 87 available

RefSeq release 87 is now accessible online, via FTP and through NCBI’s programming utilities. This full release incorporates genomic, transcript and protein data available as of March 5, 2018 and contains 155,118,991 records, including 106,245,682 proteins, 21,923,574 RNAs, and sequences from 77,225 organisms. The release is provided in several directories as a complete dataset and as divided by logical groupings.

Starting in July 2018, SNP variation features will no longer be included in RefSeq genome assembly records, that is, chromosome and contig records with NC_, NT_, NW_ and AC_ accession prefixes. The RefSeq release notes have more information about this change.

See your data in context with NCBI’s updated Genome Data Viewer

We know it’s important to you to be able to browse and visually inspect variants and alignments from your next-gen sequencing experiments, so we’ve added remote streaming of BAM files to the Genome Data Viewer (GDV). All you need are your BAM files and the index files (.bai extension) in a location that allows HTTP access and you can stream BAM files as custom tracks into the GDV.


Figure 1. The GDV add data widget showing the dialog for adding the remote file.

To add your data as tracks, use the “Your Data” widget located on the left-side GDV console. Select “Add Remote File” from the supported files menu, click the plus sign and enter the URL to the file. You can also connect to remote BAM files by using the “Configure tracks” interface available through the “Tracks” button at the upper-right of the sequence viewer display.

After you enter a URL, a progress bar tracks the status of the connection and the validation processes. By default, the file name will be your track’s display name, but you can also enter a custom name for the track. You can easily connect to multiple remote BAM files in this way.


Figure 2. GDV showing the remote BAM file loaded as a track.

Your files appear as tracks in the graphical display and are listed in the select tracks drop-down menu of the “Your Data” widget, designated by “(R)”. In this case, the track for your BAM data appears at the bottom of the sequence viewer (graphical) panel of the GDV. You can easily re-order the tracks by dragging and dropping individual tracks, or through the “Tracks” button.

Remote data streaming is not supported for other file types or BAM files transferred through FTP or HTTPS.
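
As a rough illustration of the requirements above, the following Python sketch pre-checks a URL before you paste it into the GDV. It is not part of the GDV itself: the function name `check_remote_bam` is hypothetical, and the assumption that the index sits next to the data file as `<name>.bam.bai` is ours, not something the post states. It encodes only the constraints mentioned here (plain HTTP access, BAM files only, a .bai index).

```python
from typing import Tuple
from urllib.parse import urlparse


def check_remote_bam(url: str) -> Tuple[bool, str]:
    """Return (ok, message) for a candidate remote BAM URL.

    Hypothetical helper: it only inspects the URL text and never
    contacts the server.
    """
    parsed = urlparse(url)
    # The post says FTP and HTTPS transfers are not supported.
    if parsed.scheme.lower() != "http":
        return False, f"unsupported scheme '{parsed.scheme}': use plain HTTP"
    # Other file types cannot be streamed.
    if not parsed.path.lower().endswith(".bam"):
        return False, "only BAM files can be streamed"
    # Assumption: the index is stored alongside the BAM as <name>.bam.bai.
    index_url = parsed._replace(path=parsed.path + ".bai").geturl()
    return True, f"expected index at {index_url}"


for candidate in (
    "http://example.org/data/sample.bam",
    "https://example.org/data/sample.bam",
    "ftp://example.org/data/sample.bam",
):
    print(candidate, "->", check_remote_bam(candidate))
```

The example URLs are placeholders. A passing check does not guarantee that the GDV can reach or validate the file; it only catches the restrictions listed in this post.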

Try remote streaming today and let us know what you think! If the Genome Data Viewer currently doesn’t support file types you want, use the Support Center to tell us – or send us an email. For more information on how to use this new feature, please see the GDV Help documentation.

GenBank exceeds 3 Terabases in release 224

GenBank release 224.0 (2/13/2018) has 207,040,555 traditional records (including non-bulk-oriented TSA) containing 253,630,708,098 base pairs of sequence data.

In addition, there are 564,286,852 WGS records containing 2,608,532,210,351 base pairs of sequence data, 214,324,264 TSA records containing 193,940,551,226 base pairs of sequence data, and 12,819,978 TLS records containing 4,531,966,831 base pairs of sequence data.
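
The "3 Terabases" in the headline can be checked by adding up the four base-pair counts quoted above. The short Python sketch below does that arithmetic; it assumes the headline figure is the sum of the traditional, WGS, TSA, and TLS divisions, which the excerpt does not state explicitly.

```python
# (records, base pairs) for GenBank release 224.0, copied from the text above.
release_224 = {
    "traditional": (207_040_555, 253_630_708_098),
    "WGS": (564_286_852, 2_608_532_210_351),
    "TSA": (214_324_264, 193_940_551_226),
    "TLS": (12_819_978, 4_531_966_831),
}

total_records = sum(records for records, _ in release_224.values())
total_bases = sum(bases for _, bases in release_224.values())
terabases = total_bases / 1e12  # 1 terabase = 10^12 base pairs

print(f"{total_records:,} records, {total_bases:,} bp ({terabases:.2f} Tbp)")
# prints: 998,471,649 records, 3,060,635,436,506 bp (3.06 Tbp)
```

Traditional and WGS records alone come to about 2.86 terabases, so it is the TSA and TLS divisions that carry the combined total past 3 terabases.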


January and February annotations in RefSeq: orangutan, horse & more

In January and February, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Anoplophora glabripennis (Asian longhorned beetle)
  • Bicyclus anynana (squinting bush brown)
  • Capsella rubella (eudicot)
  • Cavia porcellus (domestic guinea pig)
  • Chrysemys picta (painted turtle)
  • Citrus clementina (clementine)
  • Cryptotermes secundus (termite)
  • Cucurbita pepo pepo (vegetable marrow)
  • Cyanistes caeruleus (blue tit)
  • Dasypus novemcinctus (nine-banded armadillo)
  • Equus caballus (horse, on the EquCab3.0 assembly)
  • Eurytemora affinis (crustacean)
  • Eutrema salsugineum (saltwater cress)
  • Lactuca sativa (eudicot)
  • Loxodonta africana (African savanna elephant)
  • Lucilia cuprina (Australian sheep blowfly)
  • Morus notabilis (eudicot)
  • Myotis lucifugus (little brown bat)
  • Octodon degus (degu)
  • Orussus abietinus (hymenopteran)
  • Oryzias latipes (Japanese medaka)
  • Otolemur garnettii (small-eared galago)
  • Paramormyrops kingsleyae (bony fish)
  • Physeter catodon (sperm whale)
  • Pongo abelii (Sumatran orangutan, on the Susie_PABv2 assembly)
  • Pteropus vampyrus (large flying fox)
  • Quercus suber (cork oak)
  • Salvelinus alpinus (Arctic char)
  • Sarcophilus harrisii (Tasmanian devil)
  • Trichechus manatus latirostris (Florida manatee)
  • Trichogramma pretiosum (wasp)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

March 21 webinar – Introducing the NCBI Pathogen Detection Isolates Browser

In this next NCBI webinar, you will learn how to use the Pathogen Detection Isolate Browser to search for pathogen isolates, identify closely related isolates of interest, and find pathogens encoding particular antimicrobial resistance genes.

Date and time: Wed, Mar 21, 2018 12:00 PM – 12:30 PM EDT

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

The Pathogen Detection Isolate Browser is a web-based portal that integrates the genomic sequences, metadata, antibiotic susceptibility and resistance gene information, and SNP cluster information.

The CDC estimates that each year approximately 48 million Americans (about 1 in 6) are affected by foodborne illnesses, 128,000 are hospitalized, and 3,000 die. The NCBI Pathogen Detection Project was created in collaboration with FDA, CDC, USDA and others to use whole genome sequencing data for foodborne disease surveillance. Pathogens isolated from patients, food and environmental samples, from state, federal, and other labs, are sequenced and the data submitted in real time to NCBI. The Pathogen Detection analysis pipeline assembles the sequences and compares them to other isolates in its database to identify closely related sequences, thereby facilitating identification of cases involved in an outbreak and potential sources of contamination.

Bioinformatics paper uses NCBI open data to analyze drug response

A study (PMID: 28158543) published in the July 2017 issue of Bioinformatics collects, classifies and analyzes single nucleotide variants (SNVs) that may affect response to currently approved drugs. The authors identified 2,640 SNVs of interest, most of which occur rarely in populations (minor allele frequency <0.01).

The researchers used protein sequence alignment tools and mined open data from multiple information resources accessed through E-utilities including PubChem Compound (Kim et al., 2016 PMID: 26400175), NCBI Gene (Maglott D, et al., 2014. PMID: 25355515), NCBI Protein (Sayers, 2013), MMDB (Madej et al., 2012 PMID: 22135289), PDB (Berman et al., 2000 PMID: 10592235), dbSNP (Sherry et al., 2001 PMID: 11125122), and ClinVar (Landrum et al., 2016 PMID: 26582918).
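
For readers who have not used E-utilities, the Python sketch below shows roughly what a query against several of these resources looks like. It is not the authors' code: the helper name `esearch_url` and the search term are illustrative assumptions. The base URL and the `db`, `term`, `retmax`, and `retmode` parameters belong to the public ESearch interface, and the database names are the Entrez identifiers for the resources listed above (PDB is not an Entrez database, so it is omitted).

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Entrez database names for the NCBI resources named in the paragraph above.
ENTREZ_DBS = {
    "PubChem Compound": "pccompound",
    "NCBI Gene": "gene",
    "NCBI Protein": "protein",
    "MMDB": "structure",
    "dbSNP": "snp",
    "ClinVar": "clinvar",
}


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch request URL (hypothetical helper; no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative search term only; the study's actual queries are not given here.
for resource, db in ENTREZ_DBS.items():
    print(f"{resource}: {esearch_url(db, 'CYP2C19')}")
```

Each URL returns a list of record identifiers when fetched. The sketch only builds the URLs, so it runs without network access.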

Questions, comments, and other feedback may be sent to Yanli Wang.

Genome Workbench 2.12.8 now available

The Genome Workbench team is proud to present version 2.12.8, with the latest usability improvements and bug fixes.  See the full list of changes in the Genome Workbench release notes.

Some of the improvements include:

  • Improved FASTA format view (context menu) and the addition of an “Expand All” option
  • Improved rendering of internal unaligned regions
  • Automatically open the target folder to export files quickly
  • Installation of automatic PROXY detection
  • Fixed bug in OS version

Genome Workbench is an integrated application for viewing and analyzing sequences. You can use it to browse data in GenBank and combine that data with your own private data.

Expression teasers and indexing added to Gene

Last February, we added gene expression data to Gene. Now, you can access these data in a few new ways.


Figure 1. The expression teaser text from the human CYP2C19 gene record. CYP2C19 is a phase-one drug-metabolism gene expressed in liver and other organs/tissues involved in metabolizing drugs and other xenobiotics.

Expression pattern “teasers” in Summary

We’ve added a brief sentence describing the expression pattern to the Summary section. This teaser sentence describes tissue-specific expression of the gene, with a link to the complete description that appears in the Expression section.


NCBI-UCSC Genomics Hackathon April 2-4, 2018

From April 2–4, 2018, the NCBI will help with a bioinformatics hackathon in Northern California hosted by the University of California, Santa Cruz (UCSC)! The hackathon will focus on advanced bioinformatics analysis of next generation sequencing data, proteomics, and metadata.

This event is for researchers, including students and postdocs, who have already engaged in the use of bioinformatics data or in the development of pipelines for bioinformatics analyses from high-throughput experiments. Some projects are available to other non-scientific developers, mathematicians, or librarians.

The event is open to anyone selected for the hackathon and willing to travel to UCSC.

Working groups of five to six individuals will be formed into five to eight teams. These teams will build pipelines and tools to analyze large datasets within a cloud infrastructure. Potential subjects for this iteration include:

  • Developing a framework for nesting containerized bioinformatics workflows in cloud infrastructure
  • Extending the GA4GH API to map fastq files
  • Machine learning pipelines for germline rare variants linked to phenotypes
  • A simple, open-source mapper for nanopore data
  • An automated pipeline for named entity recognition from biomedical literature

Please see the application form for more details and additional projects.