September 2017: NCBI to present EDirect workshop at NLM

On September 18, 2017, NCBI staff will offer a workshop on EDirect, NCBI’s suite of programs for easy command line access to literature and biomolecular records. To join the workshop, please register.

NOTE: This is an in-person workshop at the National Library of Medicine on the NIH campus in Bethesda, MD, USA. The course is limited to 22 participants.

Continue reading

Sequence Viewer 3.22 now available

Sequence Viewer 3.22 includes several new features, improvements, and bug fixes, among them improved rendering of BAM and cSRA tracks. For a full list of changes, see the Sequence Viewer release notes.

Sequence Viewer is a graphical view of sequences and color-coded annotations on regions of sequences stored in the Nucleotide and Protein databases.

Identical Protein Groups: Non-redundant access to protein records

Have you ever searched the NCBI Protein database and been overwhelmed with the number of sequences returned? Have you tried searching with a protein name, thinking that would greatly limit the results, only to still be presented with many sequences (all with the same name)? It’s a common problem in this time of greatly expanding sequence databases powered by large-scale genomic sequencing of similar organisms. Redundancy in the sequence databases is high and only getting worse.
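To make the redundancy problem concrete, here is a minimal sketch of collapsing records that share an identical sequence into one group. It is only an illustration of the idea, not how NCBI builds its resources; the accessions and sequences are made-up placeholders, and `group_identical` is a hypothetical helper name.

```python
from collections import defaultdict


def group_identical(records):
    """Group accessions by exact sequence identity (case-insensitive).

    `records` is an iterable of (accession, sequence) pairs.
    Returns a dict mapping each distinct sequence to its accessions.
    """
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence.upper()].append(accession)
    return dict(groups)


# Hypothetical records: three of the four carry the same sequence.
records = [
    ("acc1", "MKTAYIAKQR"),
    ("acc2", "MKTAYIAKQR"),
    ("acc3", "MKTAYIAKQL"),
    ("acc4", "mktayiakqr"),
]

groups = group_identical(records)
for sequence, accessions in groups.items():
    print(sequence, accessions)
```

Four records collapse into two groups here, which is the kind of reduction a non-redundant view of a protein search aims for.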

To address this, in 2013 NCBI released the WP records, which collect identical protein sequences annotated on bacterial genomes. In 2014, NCBI released the Identical Protein Reports on Protein records, which display information about all other proteins identical to the protein in a given record. Now, we are releasing a new resource: Identical Protein Groups (IPG). IPG offers several features:

Continue reading

New releases from NCBI: IgBLAST 1.7.0 and Sequence Viewer 3.21

IgBLAST 1.7.0 release

A new version of IgBLAST is now available on FTP, with the following new features:

  1. Specify whether overlapping nucleotides at VDJ junctions are allowed in matching V, D, and J genes.
  2. Set a custom J gene mismatch penalty.
  3. Report the CDR3 start and stop positions in the sub-region table.
  4. Use alignment length instead of percent identity as the tie-breaker for hits with identical BLAST scores, improving accuracy in the V, D, and J gene assignment.
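
The tie-breaking rule in item 4 can be sketched as a sort key. This is a minimal illustration of the idea only, not IgBLAST's actual code; the `Hit` class, gene names, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    gene: str            # hypothetical gene label
    score: float         # alignment score
    align_len: int       # alignment length
    pct_identity: float  # percent identity (not used for ranking here)


def rank_hits(hits):
    """Sort by score (highest first); break ties by longer alignment."""
    return sorted(hits, key=lambda h: (-h.score, -h.align_len))


# Two hits tie on score; the longer alignment should win the tie
# even though the shorter one has higher percent identity.
hits = [
    Hit("V-gene-A", 250.0, 280, 99.3),
    Hit("V-gene-B", 250.0, 295, 98.6),
    Hit("V-gene-C", 240.0, 296, 97.0),
]

ranked = rank_hits(hits)
print([h.gene for h in ranked])
```

In this made-up data, breaking the tie on percent identity would have placed V-gene-A first; breaking it on alignment length places V-gene-B first.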

IgBLAST was developed at the NCBI to facilitate the analysis of immunoglobulin and T cell receptor variable domain sequences.

Continue reading

New Pandoravirus Sequences are Accessible in GenBank

An interesting article in the July 19, 2013 issue of the journal Science describes the discovery and characterization of two “giant” viruses that are proposed to be the first members of the “Pandoravirus” genus.

Nadege Philippe and co-workers obtained the viruses from sediment samples in Chile and Australia and found that they have no morphological resemblance to any previously defined virus family. The investigators isolated the genomes of these viruses and sequenced them using a variety of next-generation sequencing methods. They then assembled the reads into contigs and characterized them using various sequence similarity algorithms (including NCBI’s BLAST and CD-Search). Interestingly, while related to each other, the genomes were not similar to those of any other organism or virus. Additionally, 93% of protein-coding sequences had no recognizable homologs.

Continue reading

Using Conserved Domains to Find Protein Homologs

If you’re a protein researcher, you may want to find homologs of a protein of interest based on its sequence. Homologs can provide insights into what the protein does and how it does it, and may point to proteins with known three-dimensional structures that can serve as models for the protein of interest. The Conserved Domains Database (CDD) groups proteins that have strong sequence similarity to protein domain fingerprints and allows you to search these groups with any protein sequence. Such searches are often more sensitive than standard BLAST searches because the scoring matrices used are tuned to locate important functional sites and sequence motifs that are highly conserved within the domain. You can then use the results to explore the evolutionary relationships of these proteins or to identify these important sequence and structural features.
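
To illustrate why a matrix tuned to a domain can pick out a conserved motif, here is a toy sketch that slides a small position-specific scoring matrix along a sequence and reports the best-scoring window. The matrix values, the sequence, and the `best_window` helper are all invented for illustration; real CDD searches use far larger models and a different search procedure.

```python
def best_window(sequence, pssm, default=-1.0):
    """Return (start, score) of the best-scoring window.

    `pssm` is a list of dicts, one per motif position, mapping a
    residue to its score; unlisted residues score `default`.
    """
    width = len(pssm)
    best_score, best_start = float("-inf"), -1
    for start in range(len(sequence) - width + 1):
        score = sum(
            pssm[i].get(sequence[start + i], default) for i in range(width)
        )
        if score > best_score:
            best_score, best_start = score, start
    return best_start, best_score


# Toy three-position matrix: G, then K or R, then S or T.
toy_pssm = [
    {"G": 2.0},
    {"K": 1.0, "R": 1.0},
    {"S": 2.0, "T": 2.0},
]

result = best_window("MAGKTLL", toy_pssm)
print(result)
```

Because each position has its own scores, a residue is rewarded only where the motif expects it, which a single general-purpose substitution matrix cannot express.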

Here is a method to find protein sequences from many organisms that contain a particular conserved domain:

Continue reading