Improvements made to genomes FTP site


We’ve been making improvements to the contents of NCBI’s genomes FTP site. Highlights include:

  • addition of new file types, including a feature_count.txt file with counts of gene, RNA, and CDS features of specific types and a translated_cds.faa file with conceptual translations of each CDS feature on the genome
  • improvements to the Sequence Ontology feature types used in GFF3, including identification of pseudogene gene features as “pseudogene” instead of “gene” in column 3
  • improvements to the gene_biotype calculation to categorize transcribed pseudogenes as transcribed_pseudogene instead of misc_RNA
  • addition of the #!annotation-source unofficial pragma to GFF3 files with the annotation name, for assemblies where that information is available
  • addition of an FTP directory for GenBank viral genomes that includes International Committee on Taxonomy of Viruses (ICTV) species exemplar virus genomes and a growing number of NCBI viral neighbor genomes
  • expansion of the UCSC sequence name mapping in the assembly report files, which now maps between GenBank or RefSeq sequence accessions, chromosome or scaffold names, and the UCSC sequence name for most of the recent assemblies in the UCSC Genome Browser
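
To make the GFF3 changes above concrete, here is a minimal Python sketch that reads the #!annotation-source pragma and tallies the feature types in column 3, so that “pseudogene” features are counted separately from “gene” features. The sample lines, the sequence name chrDemo, the annotation name, and the function name summarize_gff3 are all made up for illustration; they are not taken from a real file on the FTP site.

```python
from collections import Counter

# Made-up sample in GFF3 layout (9 tab-separated columns); not real annotation data.
SAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "#!annotation-source Example Annotation Release 1",
    "\t".join(["chrDemo", "RefSeq", "gene", "100", "900", ".", "+", ".",
               "ID=gene0;gene_biotype=protein_coding"]),
    "\t".join(["chrDemo", "RefSeq", "pseudogene", "1500", "2100", ".", "-", ".",
               "ID=gene1;gene_biotype=transcribed_pseudogene"]),
    "\t".join(["chrDemo", "RefSeq", "pseudogene", "3000", "3400", ".", "+", ".",
               "ID=gene2;gene_biotype=pseudogene"]),
])

PRAGMA = "#!annotation-source"


def summarize_gff3(text):
    """Return (annotation_source, Counter of column-3 feature types)."""
    source = None
    types = Counter()
    for line in text.splitlines():
        if line.startswith(PRAGMA):
            # Everything after the pragma keyword is the annotation name.
            source = line[len(PRAGMA):].strip()
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 9:  # skip anything that is not a feature line
                types[cols[2]] += 1
    return source, types


source, types = summarize_gff3(SAMPLE_GFF3)
print(source)
print(dict(types))
```

On a real file, the same function could be given the decompressed text of a GFF3 download; files without the pragma would simply yield None for the annotation source.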

NCBI researchers and collaborators discover novel group of giant viruses


Nearly complete set of translation-related genes lends support to hypothesis that giant viruses evolved from smaller viruses

An international team of researchers, including NCBI’s Eugene Koonin and Natalya Yutin, has discovered a novel group of giant viruses (dubbed “Klosneuviruses”) with a more complete set of translation machinery genes than any virus that has been described to date. “This discovery significantly expands our understanding of viral evolution,” said Koonin. “These are the most ‘cell-like’ viruses ever identified. However, the computational analysis of the virus genomes shows that these viruses have not evolved from cells by reductive evolution but rather have evolved from smaller viruses, gradually acquiring genes from their hosts at different stages of their evolution.”

Sequence updates in human assembly GRCh38: improving gene annotation


In an earlier blog post, we discussed how sequence updates in GRCh38, the most recent version of the human reference genome, filled in a gap in human chromosome 17 near position 21,300K and expanded the region by 500K (500,000 base pairs). In this post, we will again consider this same region, but with an emphasis now on how GRCh38 also improved the gene annotation.

Figure 1. Annotation of a region of chromosome 17 near the KCNJ12 and KCNJ18 genes. Top panel: Annotation release 105 on GRCh37.p13 represented by a configured graphic display of sequence record NC_000017.10. Bottom panel: Annotation release 106 on assembly GRCh38 represented by a configured graphic display of sequence record NC_000017.11. New gene models are circled. 

Sequence updates in human genome assembly GRCh38: filling in the gaps


In a previous blog post, we explained several important concepts about the human reference genome.  We presented a region of human chromosome 17 as an example of a location where the genome sequence was not fully assembled.  In this post, we are going to revisit the same gapped region to see how the Genome Reference Consortium (GRC) changed this part of the genome in GRCh38, the updated human reference assembly released in December 2013.  This region represents just one of the more than 1,000 changes and improvements that the GRC introduced in GRCh38.

New Pandoravirus Sequences are Accessible in GenBank


The July 19, 2013 issue of the journal Science includes an article describing the discovery and characterization of two “giant” viruses that are proposed to be the first members of the “Pandoravirus” genus.

Nadege Philippe and co-workers obtained the viruses from sediment samples in Chile and Australia and found that they have no morphological resemblance to any previously defined virus families. The investigators isolated the genomes of these viruses and sequenced them using a variety of NextGen methodologies. They then assembled the reads into contigs and characterized them using various sequence similarity algorithms (including NCBI’s BLAST and CD-Search). Interestingly, while related to each other, the genomes were not similar to those of any other organism or virus. Additionally, 93% of protein-coding sequences had no recognizable homologs.

The Human Reference Genome – Understanding the New Genome Assemblies


What is a genome assembly?

The haploid human genome consists of 22 autosomal chromosomes plus the X and Y sex chromosomes. Each chromosome is a single DNA molecule, a sequence of millions of nucleotide bases.  These molecules are linear, so one might expect each chromosome to be represented by a single, continuous sequence.

Unfortunately, this is not the case for two main reasons: 1) because of the nature of genomic DNA and the limitations of our sequencing methods, some parts of the genome remain unsequenced, and 2) emerging evidence suggests that some regions of the genome vary so much between individual people that they cannot be represented as a single sequence.

In response to this, modern genomic data sets present a model of the genome known as a genome assembly. This post will introduce the basic concepts of how we produce such assemblies as well as some basic vocabulary.

How to Download Bacterial Genomes Using the Entrez API


Given the size of modern sequence databases, finding the complete genome sequence for a bacterium among the many other partial sequences can be a challenge. In addition, if you want to download sequences for many bacterial species, an automated solution might be preferable.

In this post we’ll discuss how to download bacterial genomes programmatically for a list of species using the E-utilities, the application programming interface (API) to NCBI’s Entrez system of databases.  We’ll also take advantage of NCBI’s redesigned Genome database, which links all genome sequences for a given species to one record, making it easy to obtain the desired sequences once you find the right Genome record. In principle you can apply the procedure below to other simple genomes that are represented by a single sequence. Future posts will address additional considerations that apply to complex, eukaryotic genomes.
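
As a rough illustration of the approach described above, the Python sketch below builds the E-utilities request URLs for three steps: search the Genome database for a species, follow links from the Genome record to its nucleotide sequences, and fetch those sequences as FASTA. It only constructs URLs and makes no network calls. The helper names (eutils_url, genome_search_url, genome_to_nuccore_url, fasta_fetch_url) and the example species list are hypothetical, and it is an assumption that Genome records link to the nuccore database in the way this sketch expects; the full post may use a different sequence of calls.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(tool, **params):
    """Build a URL for one E-utility, e.g. 'esearch', 'elink' or 'efetch'."""
    return f"{EUTILS_BASE}{tool}.fcgi?{urlencode(params)}"


def genome_search_url(species):
    # Step 1: look up the Genome record for a species name.
    return eutils_url("esearch", db="genome", term=f"{species}[orgn]", retmode="xml")


def genome_to_nuccore_url(genome_id):
    # Step 2: follow links from the Genome record to nucleotide sequences.
    # Assumption: the genome -> nuccore link returns the sequences of interest.
    return eutils_url("elink", dbfrom="genome", db="nuccore", id=genome_id)


def fasta_fetch_url(nuccore_ids):
    # Step 3: request the linked sequences as FASTA text.
    return eutils_url("efetch", db="nuccore", id=",".join(nuccore_ids),
                      rettype="fasta", retmode="text")


# Hypothetical species list; each name yields one search URL.
species_list = ["Escherichia coli", "Bacillus subtilis"]
for name in species_list:
    print(genome_search_url(name))
```

A real script would request each URL in turn, parse the record IDs out of the XML responses, and pass them to the next step, pausing between requests to stay within NCBI's usage limits.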
