Variation feature changes in NCBI Reference Sequences coming in 2018

Starting in March 2018, SNP variation features will no longer be in RefSeq genome assembly records – chromosome and contig records with NC_, NT_, NW_ and AC_ accession prefixes. This change affects both the ASN.1 and flatfile records. Because the number of variants is already enormous and still growing, removing SNP features from these large genomic records will significantly reduce the size of RefSeq FTP files and make downloading and processing easier. We will continue to include SNPs on NG_-prefixed genomic records, and transcript (NM_, NR_, XM_, XR_) and protein (NP_, XP_, YP_) sequences.
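If you maintain a pipeline that reads SNP features from RefSeq records, it may help to check up front which accessions are affected. The sketch below is only an illustration built from the prefixes listed above; the function name and the three-way return value are assumptions for this example, not part of any NCBI tool.

```python
from typing import Optional

# Prefixes taken from the announcement above.
GENOME_ASSEMBLY_PREFIXES = ("NC_", "NT_", "NW_", "AC_")  # SNP features removed
RETAINED_PREFIXES = ("NG_", "NM_", "NR_", "XM_", "XR_", "NP_", "XP_", "YP_")


def snp_features_expected(accession: str) -> Optional[bool]:
    """Hypothetical helper: report whether a RefSeq record keeps SNP features.

    Returns False for chromosome/contig records, True for records that
    continue to carry SNPs, and None for prefixes the announcement
    does not mention.
    """
    acc = accession.strip().upper()
    if acc.startswith(GENOME_ASSEMBLY_PREFIXES):
        return False
    if acc.startswith(RETAINED_PREFIXES):
        return True
    return None


if __name__ == "__main__":
    for acc in ("NC_000001.11", "NG_007400.1", "NM_000546.6"):
        print(acc, snp_features_expected(acc))
```

Returning None for unlisted prefixes keeps the helper from guessing about record types the announcement does not cover.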

Reminder: As of September 2017, NCBI has stopped accepting submissions for non-human SNPs in dbSNP and dbVar. RefSeq flatfiles will stop presenting non-human variant data in November 2017.

Subscribe to the refseq-announce listserv for regular updates on RefSeq.

BLAST+ 2.7.1 now available

In the new version (2.7.1) of the BLAST+ executables, blastdbcmd can look up taxonomic names (e.g., scientific or common name) faster. We have also made some low-level improvements that allow BLAST to multithread more efficiently, especially when available memory is not sufficient for the database.

Note: Some Linux and macOS users may find that they need to increase the number of open file descriptors allowed for a process. You can change this limit with “ulimit -n” (under bash). We suggest setting the limit to at least 1024.
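If you launch BLAST+ from a Python script rather than directly from bash, the same limit can be inspected and raised with the standard-library resource module (Unix only). This is a minimal sketch; the function name is an assumption for this example, and the change applies only to the current process and any child processes it starts afterwards.

```python
import resource


def ensure_open_file_limit(minimum: int = 1024) -> int:
    """Raise the soft open-file limit to at least `minimum` where allowed.

    Returns the soft limit in effect afterwards. The soft limit cannot
    exceed the hard limit, so the request is capped at the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < minimum:
        if hard == resource.RLIM_INFINITY:
            target = minimum
        else:
            target = min(minimum, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft


if __name__ == "__main__":
    print("soft open-file limit:", ensure_open_file_limit(1024))
```

Call the function before starting BLAST+ as a subprocess so that the child process inherits the new limit.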

See the BLAST+ release notes for more information.

IgBLAST 1.8.0 release

A new version of IgBLAST is now available on FTP, along with a new manual on GitHub. This release has the following improvements:

  1. The igblastn executable can now multi-thread much more efficiently for large sets of queries. The default number of threads is now four, but can be changed with the -num_threads option.
  2. The igblastn executable can now take an SRA accession as the query input. The search runs on the local machine, but the queries are retrieved from the SRA repository at the NCBI. Use the -sra rather than the -query option to enable.
  3. The default nucleotide mismatch penalty values for finding D and J genes are now lower (changed from -4 to -2 and from -3 to -2, respectively). This improves accuracy in finding the best D and J gene hits for moderately mutated sequences.
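As a rough illustration of how the first two options fit together, the sketch below assembles an igblastn argument list that uses either -query or -sra, plus -num_threads. Only those three option names come from the list above; the helper name and the placeholder accession are hypothetical, and a real run also needs the germline database options described in the manual, which are omitted here.

```python
from typing import List, Optional


def build_igblastn_command(query_fasta: Optional[str] = None,
                           sra_accession: Optional[str] = None,
                           num_threads: int = 4) -> List[str]:
    """Hypothetical helper: build an igblastn argument list.

    Exactly one input source must be given: a local query file (-query)
    or an SRA accession (-sra). Database options are intentionally left out.
    """
    if (query_fasta is None) == (sra_accession is None):
        raise ValueError("give exactly one of query_fasta or sra_accession")
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")

    cmd = ["igblastn", "-num_threads", str(num_threads)]
    if sra_accession is not None:
        cmd += ["-sra", sra_accession]
    else:
        cmd += ["-query", query_fasta]
    return cmd


if __name__ == "__main__":
    # "SRR0000000" is a placeholder, not a real accession.
    print(" ".join(build_igblastn_command(sra_accession="SRR0000000")))
```

The resulting list can be extended with the database options and passed to subprocess.run once igblastn is installed.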

Our web IgBLAST page also uses the new default nucleotide mismatch penalty values (i.e., -2 for finding both D and J genes).

IgBLAST facilitates the analysis of immunoglobulin and T cell receptor variable domain sequences.

New Influenza Virus Submission Wizard Makes Flu Sequence Submissions Easier

NCBI now offers a flu sequence submission wizard that makes submissions easier and will provide you with accession numbers sooner. To get started, sign in to NCBI, go to the Submission Portal and choose the link for “Ribosomal RNA (rRNA), rRNA-ITS or Influenza sequences” from the GenBank section.

Figure: The Submission Portal page with the GenBank link

November 1 webinar: Introducing the Genome Data Viewer (GDV)

On Wednesday, November 1, 2017, we will present a webinar on GDV, NCBI’s full-featured genome browser. In this webinar, you’ll learn how to explore and analyze sequences and annotations for eukaryotic RefSeq genome assemblies. We’ll show you how to:

  • Search across the entire assembly for genes, products and other markers or jump to a specific position or range
  • Display any of seven preselected track sets highlighting various aspects of the assembly or create and load your own custom track sets from your NCBI account
  • Load and display submitted alignment data from NCBI’s GEO or SRA
  • Upload your own annotation and variant data
  • Display BLAST or Primer-BLAST results on the assembly in the browser

Date and time: Wednesday, November 1, 2017 12:00-12:30PM EDT

After registering, you will receive a confirmation email with information about attending the webinar. After the live presentation, the webinar will be uploaded to the NCBI YouTube channel. You can learn about future webinars on the Webinars and Courses page.

NCBI’s Genome Data Viewer (GDV) to replace Map Viewer

The Genome Data Viewer (GDV) is now the main genome browser at NCBI, replacing Map Viewer, our original genome browser. GDV is a modern genome browser with essential improvements over Map Viewer. These include sequence-level details and an automated update process that keeps up with the rapid pace of genome sequencing, assembly and annotation.


The Genome Data Viewer homepage (top panel) and browser view (bottom panel)

GenBank release 222.0 is available via FTP, Entrez and BLAST

GenBank release 222.0 (10/14/2017) has 203,953,682 traditional records (including non-bulk-oriented TSA) containing 244,914,705,468 base pairs of sequence data. In addition, there are 508,825,331 WGS records containing 2,318,156,361,999 base pairs of sequence data, 192,754,804 TSA records containing 172,909,268,535 base pairs of sequence data, and 9,479,460 TLS records containing 2,993,818,315 base pairs of sequence data.
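To see the release as a whole, the four categories can be added together. The short calculation below uses only the figures quoted above; the variable names are illustrative.

```python
# (records, base pairs) per category, copied from the release 222.0 figures above.
release_222 = {
    "traditional": (203_953_682, 244_914_705_468),
    "WGS": (508_825_331, 2_318_156_361_999),
    "TSA": (192_754_804, 172_909_268_535),
    "TLS": (9_479_460, 2_993_818_315),
}

total_records = sum(records for records, _ in release_222.values())
total_bases = sum(bases for _, bases in release_222.values())
wgs_share = release_222["WGS"][1] / total_bases  # fraction of all base pairs

print(f"{total_records:,} records, {total_bases:,} base pairs")
print(f"WGS share of base pairs: {wgs_share:.1%}")
```

By this tally, WGS records account for the large majority of the base pairs in the release.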

Sequence Viewer 3.23 now available

Sequence Viewer 3.23 has several new features, improvements and bug fixes, including performance optimization for alignment renderings and improved tooltips in uploaded VCF files. For a full list of changes, see the Sequence Viewer release notes.

Sequence Viewer is a graphical view of sequences and color-coded annotations on regions of sequences stored in the Nucleotide and Protein databases.

CNVs from Exome Aggregation Consortium (ExAC) added to dbVar in September 2017 data release

Copy number variants (CNVs) from ExAC’s publication are now available at dbVar as nstd151. The data include approximately 50,000 CNV regions identified from 60,000 human exomes, providing a deep survey of common and rare copy number variation affecting protein-coding sequences in the human genome.

dbVar provides FTP files in VCF, GVF, and CSV formats, which include placements on GRCh37 as well as remapped placements on GRCh38. Tutorials for working with the different formats are also available.

Follow the dbVar RSS feed for information on monthly releases.

Updated HIV-1 interaction datasets in Gene

We recently updated the HIV-1 interaction datasets in Gene with data provided by the Southern Research Institute (SRI).

The protein interactions dataset now has:

  • 8,005 interactions,
  • 16,215 interaction descriptions,
  • 3,859 proteins encoded by 3,757 human genes,
  • and 6,822 publications.

The replication interactions dataset now has:

  • 1,595 interactions,
  • 1,854 interaction descriptions,
  • 1,583 proteins encoded by 1,583 human genes,
  • and 229 publications.

Data are also available at the RefSeq HIV-1 website and the GeneRIF FTP site.