RefSeq release 82 now public


RefSeq release 82 is accessible online, via FTP, and through NCBI’s programming utilities. This full release incorporates genomic, transcript, and protein data available as of May 8, 2017, and contains 127,098,289 records, including 84,756,971 proteins, 18,901,573 RNAs, and sequences from 69,035 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.


Genome data download made easy!


This blog post is directed toward Assembly users.

A new “Download assemblies” button is now available in the Assembly database. This makes it easy to download data for multiple genomes without having to write scripts.

For example, you can run a search in Assembly and use check boxes (see left side of screenshot below) to refine the set of genome assemblies of interest. Then, just open the “Download assemblies” menu, choose the source database (GenBank or RefSeq), choose the file type, and start the download. An archive file will be saved to your computer that can be expanded into a folder containing your selected genome data files.


Figure 1. The “Download Assemblies” button is at the top right of the Assembly page. When you click on it, you will see options for source database and file type, and a download button. There are several options for file type, including Genomic GFF.


Eleven eukaryotic annotations added to RefSeq in April 2017


Central Bearded Dragon (Pogona vitticeps)
(Credit: Mark Sum, USGS. Public domain.)

In April, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following eleven organisms:

  • Bombus terrestris (buff-tailed bumblebee)
  • Ceratitis capitata (Mediterranean fruit fly)
  • Athalia rosae (coleseed sawfly)
  • Dendrobium catenatum (a monocot)
  • Phalaenopsis equestris (a monocot)
  • Orbicella faveolata (stony coral)
  • Pogona vitticeps (central bearded dragon)
  • Oryzias latipes (Japanese medaka)
  • Sesamum indicum (sesame)
  • Jatropha curcas (a eudicot)
  • Amborella trichopoda (a flowering plant)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.

Eight new eukaryotic genome annotations added to RefSeq


In the past month, the NCBI Eukaryotic Genome Annotation Pipeline has released new annotations in RefSeq for the following organisms:

  • Zea mays (maize)
  • Labrus bergylta (ballan wrasse)
  • Monopterus albus (swamp eel)
  • Corvus cornix cornix (hooded crow)
  • Prunus persica (peach)
  • Rhincodon typus (whale shark)
  • Oncorhynchus kisutch (coho salmon)
  • Pseudomyrmex gracilis (ant)

See more details on the Eukaryotic RefSeq Genome Annotation Status page.


New Genome Data Viewer access page


NCBI is pleased to offer a direct entry point to the NCBI Genome Data Viewer (GDV) that supports the exploration, visualization and analysis of eukaryotic RefSeq genome assemblies.


The new GDV homepage includes an interactive interface for a quick overview of supported organisms, specific genome searches plus inter-connectivity to Assembly and RefSeq annotation resources. About 100 genome assemblies are now ready for GDV exploration with more on the way. Stay tuned!

Complete RefSeq genome annotation results represented in UCSC genome browser


NCBI’s RefSeq project provides comprehensive annotation of the human and other eukaryotic genomes through a combination of curation and an evidence-based eukaryotic genome annotation pipeline. Our curated records, ‘Known RefSeqs’, can be identified by the accession prefix (NM_, NR_, NG_, NP_). Model RefSeq records (XM_, XR_, and XP_ accession prefixes) are predicted based on transcript evidence (RNA-Seq and more) and protein support from Known RefSeqs, Swiss-Prot, and select INSDC records.
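The prefix rule described above can be sketched in a few lines of Python. This is only an illustration, not an NCBI tool: the function name and the returned labels are hypothetical, the accession pattern is a simplifying assumption, and only the prefixes listed in this post are treated as Known or Model (other RefSeq prefixes, such as NC_, are reported as "other").

```python
import re

# Prefix sets taken from the paragraph above.
KNOWN_PREFIXES = {"NM", "NR", "NG", "NP"}   # curated 'Known RefSeqs'
MODEL_PREFIXES = {"XM", "XR", "XP"}         # predicted 'Model RefSeqs'

# Assumed shape: two letters, underscore, identifier, optional ".version".
_ACCESSION = re.compile(r"^([A-Z]{2})_[A-Z0-9]+(\.\d+)?$")


def classify_refseq_accession(accession: str) -> str:
    """Return 'known', 'model', 'other', or 'invalid' (hypothetical labels)."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        return "invalid"
    prefix = match.group(1)
    if prefix in KNOWN_PREFIXES:
        return "known"
    if prefix in MODEL_PREFIXES:
        return "model"
    return "other"
```

For example, a transcript accession beginning with NM_ is classified as "known", while one beginning with XM_ is classified as "model".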

We recognize that many scientists access genome annotation data from one of three sources: NCBI, Ensembl, or UCSC. NCBI provides access to the human (and other) genome annotation results in the Genome Data Viewer, by BLAST and FTP, and per gene in NCBI’s Gene resource. Ensembl provides RefSeq annotation information based directly on the FTP content that NCBI releases. In the past, UCSC has provided a partial dataset of RefSeq human genome annotation content by aligning Known RefSeq transcripts to the genome using BLAT. With this approach, additional model RefSeq transcript variants, non-transcribed pseudogenes, and immunoglobulin and T-cell receptor regions were not available through UCSC services. In rare cases, the independent alignment method resulted in small differences in exon structure compared to NCBI’s placement details, as well as ambiguous placements for some transcripts originating from very similar paralogs that are uniquely placed within the NCBI dataset.


Bottlenose dolphin annotation release 101


Annotation Release 101 for the bottlenose dolphin (Tursiops truncatus) is out in RefSeq! This annotation was based on the NIST Tur_tru v1 assembly, which has a four-fold increase in contiguity over the assembly used in the previous annotation. Over four billion RNA-Seq reads from skin and blood tissue were used for gene prediction. As a result of these improvements, the percentage of partially represented protein-coding genes went down from 24% to 4%. Over 2,500 genes that were fragmented in the previous assembly were merged into complete genes. A total of 24,026 genes were annotated, and 17,096 of them were protein-coding. A full report on the annotation can be found here.


NCBI RefSeq’s Antimicrobial Peptide Indexed Field: Facilitating Novel Antibiotic Discovery


This blog post is aimed toward biomedical researchers.

Antibiotic-resistant bacterial infections account for the deaths of tens of thousands of Americans every year. Over the past twenty years, these difficult-to-treat infections have become more common. Since traditional antibiotics are ineffective in these cases, biomedical researchers are looking for alternatives. NCBI’s RefSeq project has created a new indexed field, “Protein has antimicrobial activity [prop]”, to assist in this search by retrieving useful sequence annotation showing naturally occurring antimicrobial peptides, or AMPs.
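For readers who prefer scripted searches, the indexed field can also be placed in an E-utilities ESearch request. The sketch below only builds the request URL and does not contact NCBI. It rests on several assumptions that are not stated in this post: that the Protein database (`db=protein`) is the right target, that the field is written as a quoted phrase followed by `[prop]`, and that an optional `[Organism]` filter is wanted. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_amp_search_url(organism=None, retmax=20):
    """Build (but do not send) an ESearch URL for the AMP indexed field."""
    # Field text copied from the post; the quoting style is an assumption.
    term = '"Protein has antimicrobial activity"[prop]'
    if organism:
        # Optional organism restriction, added for illustration only.
        term += f' AND "{organism}"[Organism]'
    params = {
        "db": "protein",      # assumed target database
        "term": term,
        "retmax": retmax,     # maximum number of IDs to return
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_amp_search_url("Homo sapiens"))
```

The printed URL can be opened in a browser or passed to any HTTP client. Keeping URL construction separate from the network call makes the query easy to inspect before anything is sent.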

Antimicrobial peptides are naturally occurring peptides from a diverse array of species that are a part of an organism’s innate immune system. The RefSeq team recently gathered a list of over 130 human genes encoding one or more experimentally proven AMPs. These peptides are typically less than 100 amino acids long and can display bactericidal, antiviral, antifungal, and even antitumor activities, with a specific AMP usually having a subset of these activities. AMPs may be a suitable alternative to traditional antibiotics because they work quickly and efficiently and tend to have broad-spectrum activity. Moreover, since they are naturally occurring, AMPs are less likely than other compounds to be toxic to host cells or to give rise to AMP-resistant bacterial strains.

Accessing the Hidden Kingdom: Fungal ITS Reference Sequences


This post is geared toward fungi researchers as well as RefSeq and BLAST users.

Fungi have unique characteristics that can make it difficult to identify and classify species based on morphology. To address these issues, Conrad Schoch, NCBI’s fungi taxonomist, and Barbara Robbertse, NCBI’s fungi RefSeq curator, in collaboration with outside mycology experts, are curating a set of fungal sequences from internal transcribed spacer (ITS) regions of the nuclear ribosomal RNA genes. This set of standard DNA sequences for fungal taxa not only addresses these difficulties in identifying and classifying fungal species by morphology, but is also essential for analyzing environmental (metagenomics) sequencing studies. The curated ITS sequences, described in a recent article in Database (PMC Free Article), all have associated specimen data and, when possible, are taken from sequences from type materials, ensuring correct species identification and tracking of name changes. This article will show you how to access these ITS sequences and search them using the specialized Targeted Loci BLAST service.

The fungal ITS sequences are a RefSeq Targeted Loci BioProject (PRJNA177353). As you may know, a BioProject is a collection of biological data related to a single initiative; in this case, the goal is to collect and curate fungal sequences from targeted loci – specific molecular markers such as protein coding or ribosomal RNA genes used for phylogenetic analysis.


Designing exon-specific primers for the human genome


A common task facing geneticists is to assay for sequence changes at particular locations in genes. These assays often look for changes in the coding exons of genes, and the target sequences are typically amplified from genomic DNA by PCR using a pair of specific primers. In this article, we will show you how to use NCBI Reference Sequences and Primer-BLAST, NCBI’s primer designer and specificity checker, to design a pair of primers that will amplify a single exon (exon 15) of the human breast cancer 1 (BRCA1) gene.

Here are the steps to follow to design primers to amplify exon 15 from human BRCA1:
