dbVar clinical and common structural variants track hub now available

dbVar, NCBI’s database of large-scale genetic variants, has a new track hub for viewing and downloading structural variation (SV) data in popular genome browsers. Initial tracks include Clinical and Common SV datasets. You can view the hub in NCBI’s Genome Data Viewer through the “User Data and Track Hubs” feature (Figure 1), or in other genome browsers by selecting “dbVar Hub” from the list of public tracks or by specifying the following URL:

https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/dbvarhub/hub.txt

Figure 1. Loading the dbVar track hub in the Genome Data Viewer. The Track Hubs feature in the left-hand column of the browser allows you to add the hub by searching for it or by entering the direct URL. In the Configure Track Hubs dialog, you can select the specific tracks to load (for example, “NCBI curated common SVs: All populations”).
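Track hubs are usually described by a small plain-text hub.txt file made up of setting/value lines. The sketch below shows one way such a file could be read into a dictionary before handing the hub to a browser. It assumes the common one-setting-per-line layout; the sample text, the function name parse_hub_stanza, and every value in it are hypothetical and are not the contents of the dbVar hub.txt.

```python
def parse_hub_stanza(text):
    """Parse 'setting value' lines of a hub.txt-style stanza into a dict."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # The first whitespace-delimited token is the setting name;
        # the rest of the line is its value.
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings


# Hypothetical sample content, NOT the actual dbVar hub.txt.
SAMPLE_HUB_TXT = """\
# example hub description
hub exampleHub
shortLabel Example Hub
longLabel Example hub for structural variation tracks
genomesFile genomes.txt
email someone@example.org
"""

hub = parse_hub_stanza(SAMPLE_HUB_TXT)
print(hub["shortLabel"])
print(hub["genomesFile"])
```

To inspect the real hub, the same function could be applied to the text downloaded from the URL given above.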

The BLAST Docker and databases are now ready to use on Google and Amazon clouds

As announced in a previous post, we offer a Docker version of NCBI BLAST that you can use locally or on the Google Cloud, where we have pre-loaded BLAST databases. We are happy to announce that the same functionality is also available on the Amazon Cloud. In addition, we now offer 23 different BLAST databases on each cloud platform.

We want to hear from you about changes to NIH’s Sequence Read Archive data format and storage

NIH’s Sequence Read Archive (SRA) is the largest, most diverse collection of next generation sequencing data from human, non-human and microbial sources. Hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

SRA currently contains more than 36 petabytes (PB) of data and is projected to grow to 43 PB by 2023. Though the value of this resource grows with each new sample, the exponential growth experienced over the last decade (Figure 1) threatens SRA sustainability. The storage footprint is growing more costly to maintain and the data more difficult to use at scale. The situation has reached a tipping point. SRA must be refactored to support FAIR data principles into the future.

Figure 1. SRA data has grown exponentially over the last decade.

NIH remains committed to the SRA and hopes to establish a long-range plan for sustained resource growth. One model under consideration would make normalized working files without Base Quality Scores (BQS) readily available through cloud platforms and NCBI FTP sites, while larger source files and normalized files with BQS would be distributed on cloud platforms based on prevalent use cases and usage demands. Further details regarding data formats are available here.

It is critical that you, as an SRA user, participate in the review and testing of proposed data formats and infrastructure by commenting on how these developments affect your data usage. NIH has prepared a Request for Information (RFI) that details planned developments and would greatly appreciate feedback from the scientific community.

Improved access to SARS-CoV-2 data

NCBI Datasets has a simple, new way to get Coronaviridae data, including data from SARS-CoV-2 (Figure 1). The data package includes genomic, protein and CDS sequences, annotation and a comprehensive data report for all complete genomes. You can also target your search to major taxonomic ranks within Coronaviridae.

Figure 1 – SARS-CoV-2 page within NCBI Datasets showing statistics as of June 16, 2020.

Interested in a specific protein? The SARS-CoV-2 protein page allows you to choose a protein and download the corresponding sequences, annotation and representative structures from all annotated genomes (Figure 2).

Figure 2 – SARS-CoV-2 protein page within NCBI Datasets showing annotations on the SARS-CoV-2 reference genome.

Looking for programmatic access? NCBI Datasets offers the same Coronaviridae genomic data and SARS-CoV-2 protein data through a command-line tool and a RESTful API. These tools support additional filtering, including the ability to download only those genomes released after a date you specify.
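To illustrate what filtering by release date means, here is a minimal local sketch in Python. It does not call the NCBI Datasets command-line tool or API; the record layout, the field names accession and release_date, and the sample accessions are all hypothetical stand-ins for whatever a downloaded data report actually contains.

```python
from datetime import date


def released_after(records, cutoff):
    """Return the records whose release date is strictly later than cutoff."""
    return [
        record
        for record in records
        # Dates are assumed to be ISO 8601 strings (YYYY-MM-DD).
        if date.fromisoformat(record["release_date"]) > cutoff
    ]


# Hypothetical records; the accessions and dates are invented for illustration.
records = [
    {"accession": "EXAMPLE_0001", "release_date": "2020-01-15"},
    {"accession": "EXAMPLE_0002", "release_date": "2020-05-31"},
    {"accession": "EXAMPLE_0003", "release_date": "2020-06-10"},
]

recent = released_after(records, date(2020, 5, 31))
for record in recent:
    print(record["accession"], record["release_date"])
```

The comparison is strict, so a genome released on the cutoff date itself is excluded; whether the real tools treat the boundary the same way is not stated here.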

We appreciate your feedback. Try NCBI Datasets and let us know what you think!

New GenBank submission options for SARS-CoV-2 submitters

NCBI is pleased to announce ongoing enhancements to submission of SARS-CoV-2 assembled genomes to GenBank, including a streamlined workflow on the web and a new API option. Both new options mean that you can receive accessions for SARS-CoV-2 data submissions more quickly!

A streamlined workflow with an improved interface and enhanced validation on both web and API saves you time and effort and, most importantly, makes it possible to get SARS-CoV-2 accession numbers and public release of data within hours. In addition, we automatically annotate all SARS-CoV-2 genomes to produce standardized, consistent annotation, which saves you time and benefits researchers who find your data valuable.

New viral protein domain models for annotation of coronaviruses

NLM’s Conserved Domain Database (CDD) has expanded its scope to include 153 new viral protein domain family models for the annotation of coronaviruses. These include models for the S1 subunit of coronavirus Spike proteins (cd21527), the nucleocapsid (N) protein of coronavirus (cd21595), and the coronavirus RNA-dependent RNA polymerase (cd21530).

Each curated domain model consists of a multiple sequence alignment containing conserved sequence features that may have been confirmed experimentally, plus links to relevant publications. When available, the domain models include 3D structures with links to interactive 3D views and interacting partners.

Check out this tabular summary of SARS-CoV-2 gene products for links to matching conserved domain models and representative 3D protein structures.

Want to view these alignments in 3D space? We’ve updated iCn3D, a web-based 3D structure viewer, with new rendering, annotation, and alignment features. Read more about how you can use iCn3D to view and analyze SARS-CoV-2-related structures.

Don’t forget to review our SARS-CoV-2 resources page to keep up to date on other coronavirus data at NCBI!

The New and Improved PubMed® — We Are Listening

Today marks 5 weeks since the new PubMed was made the default version. Throughout this process, we promised to listen, and we heard from you!

This was a huge change

We know change isn’t always easy, especially with major changes to a familiar service or product. We are staunch believers in making incremental changes whenever possible: releasing small improvements, observing the effects, gathering user feedback, and then using that data to make further modifications. This time, an incremental approach to improving PubMed wasn’t feasible. We needed to make major changes under the hood (new databases, cloud delivery, new web architecture, etc.) for PubMed to be sustainable going forward.

User feedback is invaluable: it has played an enormous role in updates over the 24 years PubMed has been in existence, and it continues to do so. To prepare for the new PubMed, we launched the beta version in 2017, then called PubMed Labs, as a way to set up the new framework and solicit feedback from our users. During development and since, we reached out to our stakeholders with presentations, webinars, handouts, FAQs, toolkits, and tutorials, including a series of four 90-minute online classes, How PubMed® Works, many of which continue to be available.

We understand that not everyone had a chance to put the new PubMed through its paces, and we’re grateful to those of you who provided feedback along the way, whether it was by sending questions or comments using the feedback button, by discussing with us how you accomplish your work with PubMed, or by filling out a survey.

For some, when the new version of PubMed became the default last month, it was a huge shift. The ways in which you were accustomed to working with the system changed. We heard from some of you that you were used to a particular feature being available on PubMed and now you don’t know where to find it.

New BLAST default parameters and search limits coming in September

To provide a more efficient BLAST experience for everyone, we’re changing some parameters and limits on the web BLAST service on September 8, 2020. The new settings, listed below, will improve overall performance and make search times more consistent.

  1. The Expect Value Threshold default setting will be reduced to 0.05.
  2. The maximum number of target sequences (Max target sequences) will be limited to 5,000.
  3. The maximum allowed query length will be 1,000,000 for nucleotide queries (blastn, blastx, and tblastx) and 100,000 for protein queries (blastp and tblastn).

These changes will help keep the BLAST service running smoothly as the already very large databases continue to grow rapidly. If you have any questions or concerns, please email us at blast-help@ncbi.nlm.nih.gov.
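If you submit searches from scripts, you may want to check them against the new limits before sending them. The following Python sketch encodes the three settings listed above as constants and flags searches that exceed them. The function name check_search and its arguments are hypothetical, and this is only a local pre-check, not part of the BLAST service or any NCBI library.

```python
# Limits taken from the list above (web BLAST, effective September 8, 2020).
NUCLEOTIDE_QUERY_PROGRAMS = {"blastn", "blastx", "tblastx"}
PROTEIN_QUERY_PROGRAMS = {"blastp", "tblastn"}
MAX_NUCLEOTIDE_QUERY_LENGTH = 1_000_000
MAX_PROTEIN_QUERY_LENGTH = 100_000
MAX_TARGET_SEQUENCES = 5_000
DEFAULT_EXPECT_THRESHOLD = 0.05


def check_search(program, query_length, max_target_seqs):
    """Return a list of human-readable problems; an empty list means OK."""
    if program in NUCLEOTIDE_QUERY_PROGRAMS:
        length_limit = MAX_NUCLEOTIDE_QUERY_LENGTH
    elif program in PROTEIN_QUERY_PROGRAMS:
        length_limit = MAX_PROTEIN_QUERY_LENGTH
    else:
        raise ValueError(f"unknown BLAST program: {program}")

    problems = []
    # A query exactly at the limit is still allowed.
    if query_length > length_limit:
        problems.append(
            f"query length {query_length} exceeds {length_limit} for {program}"
        )
    if max_target_seqs > MAX_TARGET_SEQUENCES:
        problems.append(
            f"max target sequences {max_target_seqs} exceeds {MAX_TARGET_SEQUENCES}"
        )
    return problems


for problem in check_search("blastp", 150_000, 10_000):
    print(problem)
```

The program sets follow the grouping in item 3, which classifies each program by the type of its query sequence.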

dbSNP human build 154 release + ALFA data

dbSNP human build 154, now available, includes new ALFA (Allele Frequency Aggregator) variants and allele frequency data. This build contains over two billion Submitted SNP (ss) records and 730 million Reference SNP (rs) records.

New features include:

See the release notes for more information about what’s new in build 154.

New annotations in RefSeq: budgerigar, bony fish, fly and more

In May, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms:

  • Acipenser ruthenus (sterlet)
  • Arvicanthis niloticus (African grass rat)
  • Cannabis sativa (eudicot)
  • Crassostrea gigas (Pacific oyster)
  • Cyclopterus lumpus (lumpfish)
  • Drosophila albomicans (fly)
  • Drosophila guanche (fly)
  • Drosophila innubila (fly)
  • Esox lucius (northern pike)
  • Gymnodraco acuticeps (bony fish)
  • Hippoglossus hippoglossus (Atlantic halibut)
  • Marmota flaviventris (yellow-bellied marmot)
  • Melopsittacus undulatus (budgerigar)
  • Osmia lignaria (orchard mason bee)
  • Pangasianodon hypophthalmus (striped catfish)
  • Pantherophis guttatus (snake)
  • Periophthalmus magnuspinnatus (bony fish)
  • Prunus dulcis (almond)
  • Pseudochaenichthys georgianus (South Georgia icefish)
  • Setaria viridis (monocot)
  • Thalassophryne amazonica (bony fish)
  • Thrips palmi (thrips)
  • Trematomus bernacchii (emerald rockcod)
  • Zea mays (maize)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.