Proposed changes to AGP files for genome assemblies


If you are a consumer or producer of AGP (A Golden Path) files for genome assemblies, please read on.  We’d like your feedback on the proposed changes described here.

As you know, AGP files are used to describe the structure of certain genome assemblies. The AGP file format has not kept up with changes in sequencing technology or International Sequence Database Collaboration (INSDC) feature usage. NCBI is therefore proposing to extend the current AGP v2.0 specification to add new linkage evidence types and a gap type of “contamination” as detailed below and described in the AGP v2.1 proposed specification.

Proposed changes from AGP v2.0 to AGP v2.1:

  • Add ‘proximity-ligation’ and ‘pcr’ to the set of accepted linkage evidence values
  • Drop ‘strobe’ from the set of accepted linkage evidence values
  • Expand the definition of ‘paired-end’ linkage evidence to include ‘mate-pairs’ and molecular-barcode techniques
  • Add a gap-type of ‘contamination’
    • definition: a gap inserted in place of foreign sequence to maintain the coordinates
    • usage: treated as linked to preserve the original scaffold but with linkage evidence ‘unspecified’
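As a rough illustration of the proposal, the sketch below applies the listed changes to a set of accepted linkage evidence values and builds a gap line for a masked contaminant span. The starting set is a small illustrative subset, not the full v2.0 list. The value spellings follow this post and may differ from the exact tokens in the specification. The helper names and the nine-column gap-line layout used here (object, start, end, part number, component type, gap length, gap type, linkage, linkage evidence) are assumptions to check against the AGP v2.1 proposed specification.

```python
# Illustrative subset of v2.0 linkage evidence values (NOT the full list;
# spellings follow the announcement and may differ from the spec tokens).
SAMPLE_V2_0_EVIDENCE = {"paired-end", "strobe", "unspecified"}


def propose_v2_1(evidence_v2_0):
    """Apply the proposed changes: drop 'strobe', add two new values."""
    return (set(evidence_v2_0) - {"strobe"}) | {"proximity-ligation", "pcr"}


def contamination_gap_line(obj, start, end, part_number):
    """Build a tab-separated gap line for a span that replaced foreign sequence.

    Hypothetical helper: the gap is marked as linked, with linkage evidence
    'unspecified', as described in the proposal. 'N' marks a gap of known
    length; start and end are 1-based, inclusive coordinates.
    """
    length = end - start + 1
    fields = [obj, start, end, part_number, "N", length,
              "contamination", "yes", "unspecified"]
    return "\t".join(str(f) for f in fields)


if __name__ == "__main__":
    print(sorted(propose_v2_1(SAMPLE_V2_0_EVIDENCE)))
    print(contamination_gap_line("scaffold1", 1001, 1500, 2))
```

Because the gap length equals the length of the removed sequence, every downstream coordinate on the scaffold stays unchanged.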

Timeline

April 16 – May 7: Comment period
May 8 – May 10: AGP v2.1 proposal finalized
May 12 – May 16: AGP v2.1 approved at the annual INSDC meeting
Summer 2019: NCBI begins accepting the new linkage-evidence types and using the contamination gap type

Note: NCBI would continue to accept genome submissions in AGP v2.0 format.

We are seeking your input on these proposed changes. Please comment on this post or write to suggest@ncbi.nlm.nih.gov if you have any comments or suggestions.

Recent enhancements to BLAST+ (2.9.0): built-in taxonomy and access to proteins from the Pathogen Detection Project


We have made some recent improvements to the BLAST+ applications that take full advantage of the version 5 BLAST databases (BLASTDBv5), which include built-in taxonomic information for sequences and no longer rely on the integer sequence identifiers (gi numbers).

With the latest version of BLAST, you can now:

  • Limit your searches by taxonomy using information built into the BLAST databases
  • Limit searches more efficiently when using a list of sequence accessions
  • Retrieve sequences by taxonomy from the BLAST database with blastdbcmd
  • Search PDB proteins with identifiers up to four characters long.  You can read more about the PDB changes in our Structure database documentation.

Only BLASTDBv5 supports these new features. These new BLAST databases also contain accession-based (gi-less) proteins from important high-throughput genome sequencing projects that are not available in the previous version of the BLAST databases. These include proteins from the annotation of assemblies from large-scale pathogen surveillance efforts that are part of the NCBI Pathogen Detection Project, as well as those coming from large-scale metagenomics surveillance. With the v5 databases, you can perform BLAST searches of all proteins from these assemblies to find the proteins of interest.
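The taxonomy features above might be used along the following lines. This is a sketch that only assembles command lines and does not run them. The `-taxids` and `-seqidlist` options and the `%a` (accession) output specifier reflect our understanding of the BLAST+ command line and should be confirmed with each program's `-help` output. The database name `my_v5_db`, the file names, and the helper functions are hypothetical.

```python
def taxid_search_args(program, query, db, taxids, out):
    """Command line for a search limited to the given taxonomy IDs."""
    return [program, "-query", query, "-db", db,
            "-taxids", ",".join(str(t) for t in taxids),
            "-out", out]


def accession_search_args(program, query, db, seqidlist, out):
    """Command line for a search limited to accessions listed in a file."""
    return [program, "-query", query, "-db", db,
            "-seqidlist", seqidlist, "-out", out]


def taxid_fetch_args(db, taxids):
    """blastdbcmd command line that lists accessions for the given taxonomy IDs."""
    return ["blastdbcmd", "-db", db,
            "-taxids", ",".join(str(t) for t in taxids),
            "-outfmt", "%a"]


if __name__ == "__main__":
    # 9606 is the NCBI Taxonomy ID for human; my_v5_db is a placeholder name.
    print(" ".join(taxid_search_args("blastp", "query.fa", "my_v5_db", [9606], "hits.txt")))
    print(" ".join(taxid_fetch_args("my_v5_db", [9606])))
```

Each list can be passed to `subprocess.run` on a machine where BLAST+ and a version 5 database are installed.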

For more information on the new database version, BLASTDBv5 (download), see the previous NCBI Insights article and the recording of our webinar. We will continue to update the BLAST databases in their current version (BLASTDBv4) until September 2019.

Conserved Domain Database (CDD) 3.17 is now available


The latest version of the Conserved Domain Database contains 3,272 new or updated NCBI-curated domains and now mirrors Pfam version 31 as well as models from NCBIfams, a collection of protein family hidden Markov models (HMMs) for improving bacterial genome annotation. A fine-grained classification of the major facilitator superfamily has also been added. You can find this updated content on the CDD FTP site.


Genome Workbench 2.13.0 now available


The Genome Workbench team is proud to present version 2.13.0, with the latest usability improvements and bug fixes.  See the full list of changes in the Genome Workbench release notes.

Some of the improvements include:

  • New SNP tracks using the most recent dbSNP release
  • Improved alignment statistics table to correctly account for introns
  • Alignment tooltips report introns separately from gaps
  • Fixes for several interface issues to make MAFFT and BLAST alignments easier to use.

Genome Workbench is an integrated application for viewing and analyzing sequences. Genome Workbench can be used to browse and import data from NCBI and combine it with your own private data.

New BLAST results page in NCBI LABS


NCBI Labs is showcasing an experiment to improve the BLAST results page. The goal is to provide a more useful BLAST output that better meets your needs and integrates with your workflows. The new results incorporate feedback from surveys and interviews with BLAST users. We think you’ll find the new results are more compact, easier to navigate, and expose useful formatting and other features that you may not have known about.

The results page has organism, percent identity, and E value filters in plain view and easily accessible. The Descriptions and Graphic Summary are on separate tabs, and the popular taxonomy view is on a fourth tab rather than on a separate web page. These changes along with other enhancements make the display more concise and easier to navigate. The figure below shows the new output format.

Figure 1. The new BLAST results, with filters directly on the page and a more concise tabbed output that includes the taxonomy report. The Back to Traditional Results Page link reloads the results in the standard format.


NCBI at Experimental Biology next week (Apr 6-9) in Orlando


We’ll be exhibiting next week at the 2019 Experimental Biology conference in Orlando. Stop by the NCBI booth (#446) (April 7-9, 9 AM – 4 PM) to meet NCBI staff, see live demonstrations of NCBI molecular and literature databases and tools, ask questions, and provide feedback. We’ll also be showcasing important updates to BLAST, PubChem, and PubMed!

Bring your own data to the computational virology workshop at LSU New Orleans


Attention all aspiring computational virologists and cloud-curious bioinformaticians! NCBI is hosting a free workshop in New Orleans, Louisiana, on April 23 and 24.

Choose your own adventure: participants may bring their own data and/or work with public data housed at NCBI.

Day 1 will consist of a short cloud-onboarding session, introduction to Jupyter notebooks, SRA and BLAST intros, and more! On day 2, we’ll roll our sleeves up in a working session around phylogenetic clustering of sequences where we’ll look for unknown viruses.

BYOD (bring your own data) and apply today! Please fill out the application form in its entirety by Tuesday, April 9th.

Attendee insights will be made publicly available on GitHub.

BLAST+ 2.9.0 now available with enhanced support for new database format and improved performance


The BLAST+ 2.9.0 release is now available from our FTP site.  This latest release has enhanced support for the new BLAST database version (BLASTDBv5).

  • The 2.9.0 programs handle the new four-character identifiers for chains of 3D structure records from the RCSB Protein Data Bank (PDB).  The previous versions of the BLAST databases and programs do not support these identifiers. See the MMDB News for additional details about the PDB change and the impact on NCBI Structure resources.
  • Another important improvement in 2.9.0 is the ability to configure the output separator for tabular and CSV output formats. See the BLAST Manual for details.

Additional improvements and a few bug fixes in this release are detailed in the release notes.

For more information on the new database version, BLASTDBv5 (download), see the previous NCBI Insights article and the recording of our webinar. We will continue to update the BLAST databases in their current version (BLASTDBv4) until September 2019.

Women-led Biodata Science Hackathon May 8-10, 2019


NCBI is excited to host our first Women-led Hackathon, a collaborative biodata science event organized by women on the NIH Campus in Bethesda, Maryland!

NIH has a strong interest in enhancing the diversity of the scientific workforce, and women in particular are underrepresented in data science.  This women-led NCBI initiative strongly encourages researchers, especially women, at any stage of their data science journey to apply for this inaugural event. Past hackathon participants have ranged from students and postdocs with a working knowledge of scripting (e.g. Shell, Python, R) to those already engaged in the use of large datasets or in the development of informatics tools, code, or pipelines.

Potential topics include:

  • An open store for variant and gene prioritization tools
  • Variable Tracking and Schema Capturing to make Biomedical Research Data ‘FAIR’
  • Molecular language: discovery of cell-to-cell communication molecules from RNA-Seq data
  • dsVirus variant discovery and annotation pipeline
  • Design of ICD-9 to 10 conversion function for the R package ‘icd’
  • Hiding in plain sight — unannotated structural variants in public genomic data sets


Important improvements on the genome Assembly pages


We’ve been making improvements to the NCBI genome Assembly resource. Highlights include:

  • Links added between members of a pair of genome assemblies derived from the same diploid individual
  • Additional filters now shown on the left-hand side bar
    • Annotation status
    • Assembly type, including the new types “Unresolved diploid” and “Alternate pseudohaplotype”
  • vhost filters on the Advanced page Search Builder that allow selection of virus assemblies with a particular host (e.g. “vhost human”)
  • Searching by assembly names with the version unspecified
  • Total ungapped length reported in the “Global statistics” table, replacing the less useful total gap length
  • Improved N50 & L50 statistics presentation for complex genome assemblies
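For readers less familiar with the N50 and L50 statistics mentioned above, here is a minimal sketch using their common definitions: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length, and L50 is the number of sequences in that set. The function name is hypothetical, and the way the Assembly pages compute these values for complex assemblies may differ.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)   # longest first
    half = sum(ordered) / 2                   # half of the total length
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the shortest sequence needed to reach half the total
            return length, count
    raise ValueError("no sequence lengths given")


if __name__ == "__main__":
    # Total is 290; the two longest sequences (80 + 70 = 150) cover half of it.
    print(n50_l50([80, 70, 50, 40, 30, 20]))
```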
