The Human Reference Genome – Understanding the New Genome Assemblies

What is a genome assembly?

The haploid human genome consists of 22 autosomal chromosomes and the Y and the X chromosomes. Each of the chromosomes represents a single DNA molecule, a sequence of millions of nucleotide bases.  These molecules are linear, so one might expect that we should represent each chromosome by a single, continuous sequence. Unfortunately, this is not the case for two main reasons: 1) because of the nature of genomic DNA and the limitations of our sequencing methods, some parts of the genome remain unsequenced, and 2) emerging evidence suggests that some regions of the genome vary so much between individual people that they cannot be represented as a single sequence. In response to this, modern genomic data sets present a model of the genome known as a genome assembly. This post will introduce the basic concepts of how we produce such assemblies as well as some basic vocabulary.

One of the major challenges in sequencing eukaryotic chromosomes is their sheer size, which makes it impossible with current technology to obtain their sequences by progressing from one end to the other. Instead, researchers commonly isolate genomic DNA from a biological sample and fragment the DNA into small pieces that can be sequenced individually.  These individual, contiguous sequences are called reads, and they generally range from 100 to 1,000 nucleotide bases in length depending on the technology.  Researchers then assemble the reads, like pieces of a giant puzzle, into progressively larger contiguous pieces. Finally, they assemble these larger pieces into full chromosome sequences.

In the Human Genome Project (HGP), researchers used a variation of the general method above called a clone-based (or hierarchical) approach (see below for links to additional information). This method used a library of fairly large, overlapping segments of genomic DNA. These segments are called clones and generally contained 40,000 – 200,000 nucleotide bases. Researchers then inserted these clones into vectors, such as bacterial artificial chromosomes (BACs), and propagated them in bacteria to generate large amounts of DNA for sequencing. They then sequenced the individual clones by fragmenting them into reads, sequencing the reads and then assembling the reads to produce the full clone sequences, which are called components. They used additional mapping methods to determine which of the components overlapped, and then used these data to organize the components into a linear arrangement known as a tiling path. HGP researchers continued assembling the sequence into larger units as long as there were overlaps between individual components. The resulting “top-level” sequences that have no gaps and represent a continuous region of a chromosome are called contigs or scaffolds. To generate chromosome models, HGP researchers ordered and joined the scaffolds that belonged to individual chromosomes.
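The idea of joining overlapping components into gap-free stretches can be sketched in a few lines of Python. This is only an illustration, not the HGP pipeline: the component names and coordinates are made up, and real assembly relies on sequence overlaps and mapping data rather than on coordinates known in advance.

```python
from typing import List, Tuple

# (name, start, end) on the chromosome; all values here are hypothetical
Component = Tuple[str, int, int]


def tiling_path_to_contigs(components: List[Component]) -> List[Tuple[int, int, List[str]]]:
    """Merge overlapping components into contiguous stretches.

    Returns one (start, end, component_names) tuple per gap-free stretch.
    """
    contigs: List[Tuple[int, int, List[str]]] = []
    for name, start, end in sorted(components, key=lambda c: c[1]):
        if contigs and start <= contigs[-1][1]:
            # Overlaps (or abuts) the current stretch: extend it.
            prev_start, prev_end, names = contigs[-1]
            contigs[-1] = (prev_start, max(prev_end, end), names + [name])
        else:
            # No overlap: a gap precedes this component, so start a new stretch.
            contigs.append((start, end, [name]))
    return contigs


# Made-up components: A, B and C overlap; D starts after a gap.
example = [("A", 0, 150), ("B", 120, 300), ("C", 280, 420), ("D", 500, 650)]
for start, end, names in tiling_path_to_contigs(example):
    print(start, end, names)
```

In this toy input, the first three components merge into one stretch and the fourth stands alone, which mirrors how a break in the tiling path splits the assembly into separate top-level sequences.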

An example: human chromosome 17

Let’s take a look at an example. Figure 1 shows a portion of the tiling path of human chromosome 17 (GenBank record CM000679.1). This region of the chromosome was based on six components, each of which has its own accession number at NCBI (e.g. AC006236.2), and these records contain the complete sequence data for each component. In this region, the tiling path is an unbroken series of overlapping components.

Figure 1: Chromosome 17 tiling path

Figure 1: Graphic representation of a small portion of the tiling path of human chromosome 17. The assembled chromosome is shown in grey, the individual components in blue, and the areas of overlap in yellow.

Let’s look at another portion of chromosome 17 (Figure 2). This region is near the centromere of the chromosome, and you’ll see that the tiling path is broken near position 21,570 K. With the gaps present, you might wonder why the chromosome sequence (the grey bar) is unbroken. In fact, if you look in the actual chromosome record (CM000679.1), you’ll only find a series of “N”s in that gapped segment instead of the normal bases (A, C, T or G). Representing gaps with “N”s allows us to generate a single, continuous record for an incompletely assembled chromosome while also clearly showing where the gaps in the data lie.

Figure 2: Chromosome 17 tiling path with gap

Figure 2: Graphic representation of the tiling path near the centromere of human chromosome 17. The assembled chromosome is shown in grey, the individual components in blue, and the areas of overlap in yellow.
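If you have a chromosome sequence as a plain string, you can locate the gaps by looking for runs of “N”s. The snippet below is a minimal sketch of that idea; the function name and the toy sequence are our own, and a real chromosome record would first need to be read from a FASTA or GenBank file.

```python
import re
from typing import List, Tuple


def find_gaps(sequence: str, min_length: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) for each run of N's, 0-based with end exclusive."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"N+", sequence.upper())
        if m.end() - m.start() >= min_length
    ]


# Toy sequence with a 10-base gap and a 3-base gap
toy_chromosome = "ACGTACGT" + "N" * 10 + "GGCCTTAA" + "N" * 3 + "ACGT"
print(find_gaps(toy_chromosome))                 # all runs of N
print(find_gaps(toy_chromosome, min_length=5))   # only runs of 5 or more
```

The min_length argument is there because an isolated “N” can also mark a single ambiguous base rather than an assembly gap.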

Why are there still gaps in the assembly?

Since the initial release of the human reference genome in 2001, researchers have made great strides in improving the quality of the assembly model, but significant challenges remain. One of these is the simple fact that certain regions of genomic DNA are much more difficult to sequence than others. For example, the telomeres and centromeres of chromosomes contain tightly packed DNA known as heterochromatin, and these regions are difficult to sequence because of the high frequency of G and C bases. Genomic DNA also contains many repetitive sequences, and this makes assembling sequencing reads a daunting task. Imagine taking the text of the beloved children’s book “Green Eggs and Ham” (which has many repeated words), shredding it and then trying to assemble the sentences back in the right order. There are many possible solutions that will generate grammatically correct English sentences, but only one of these will be the correct Dr. Seuss original. Another complicating factor is that the HGP DNA samples came from multiple people, so that the resulting “genome” is really a randomly mixed conglomerate that, in some cases, may be impossible to represent correctly as a single sequence. We are now much more aware that some regions of the genome can vary quite dramatically from individual to individual, and this new awareness is helping to guide new genome assemblies. Since 2007, these improved assemblies have been the responsibility of the Genome Reference Consortium (GRC). In future posts, we will have a look at the GRC’s effort to improve and modernize the assembly, and discuss the relevance of a high-quality reference assembly for next-generation sequencing (NGS) projects.

Citations and literature guide

History of the Human Genome Project: key news releases and publications

Glossary terms

The links below lead to alphabetically indexed pages within the Gene Reviews Illustrated Glossary or the Genetics for Surgeons Glossary.  Please scroll to the term on the page.

A Librarian’s Guide to NCBI: Course Follow-up

NCBI, in collaboration with NLM and the National Network of Libraries of Medicine NLM Training Center (NTC) at the University of Utah, recently presented A Librarian’s Guide to NCBI. This new course was designed to prepare health science librarians for supporting and training patrons about NCBI molecular databases and tools at their own institutions.

Participants, instructors, and organizers in the first offering of “A Librarian’s Guide to NCBI” outside the National Library of Medicine.


We have made all of the course materials available. Feel free to learn from these, adapt them for your own teaching, or share them with others. You can use the links below to access the course materials, which include the slide sets with quizzes, demonstrations and practice problems, or visit the FTP site to get all the materials.

Sample slides from the eight modules of A Librarian’s Guide to NCBI.  Complete PowerPoint files are available from the FTP site.


  • For a review of molecular biology concepts that focuses on biological information flow, with the gene as a central theme and Gene as a central NCBI resource, see the introductory Molecular Biology Basics materials.
  • Get the Advanced Entrez Searching module to learn how to use the Entrez integrated database and search system to find relevant data using basic and advanced interfaces, fielded searches and pre-compiled and pre-computed relationships.
  • You can gain practical experience and a theoretical understanding of NCBI’s sequence similarity search tool BLAST through the Guide to NCBI BLAST, which covers the basics of sequence alignment algorithms, scoring matrices, and local alignment statistics. It also uses practical protein and nucleotide search examples that highlight features of the BLAST web service designed to give the most relevant results.
  • Learn about the essential role of nucleotide and protein sequence data in modern biological research and about NCBI sequence databases through the Sequences & Genomes materials. You can also find out about the scope, purpose and content of the Assembly, BioProject, and Genome databases, how the NCBI manages and processes sequence and other data associated with genomes and their annotation, and how to find the most up-to-date and well-annotated sequences at the NCBI.
  • Survey the many databases and tools at NCBI that provide access to variation data through the Sequence Variation and its Consequences materials. These materials cover the Gene, dbSNP, dbGaP, dbVar, and PheGenI resources with an emphasis on the association between variation and disease risk. You can learn about the different types of genetic variation and the major project types that produce these data, as well as how to navigate the NCBI variation resources to find specific data and important attributes, such as geographic population, allele frequency, and disease association.
  • Get the Gene Expression & Biological Pathways course materials to explore NCBI databases and tools relevant to the study of gene expression, including Gene Expression Omnibus resources (Datasets, Profiles and the GEO2R comparison tool), UniGene, and biological pathways in BioSystems. You can learn about the importance of gene expression in various biological phenomena and about large-scale techniques (microarray, RNAseq) for measuring expression, as well as how to find and compare expression patterns of genes in microarray datasets in GEO and in UniGene and how to map selected genes onto metabolic pathways in BioSystems.
  • Explore the NCBI protein structure databases and tools, including the Entrez Structure and Conserved Domains databases and the structure viewer Cn3D, with the Protein Structures module. You can navigate across these resources and learn basic concepts of structural biology and the importance of 3D structural information in understanding the normal functions of proteins and the abnormal functions that result in disease, all using DNA topoisomerase as an example. You can also learn how to find available 3D structural data for a given protein sequence, how to detect functional domains within the sequence, how to view the 3D structural data in Cn3D, and how to compare a query protein sequence to the structural data.
  • Discover NCBI’s Chemical and Bioactivity Databases, which are part of the PubChem resource, through the Drugs & Other Small Molecules materials. Find out about the PubChem databases (Compound, Substance and BioAssay) and the types of data that are accessible from these resources, and understand how to find and use this information to answer important scientific questions.

The Librarian’s Guide Exercises page lists the practice problems that are part of the modules and links to an interface that serves as a stepwise guide to the practice problems.

We plan to expand the course materials to include a set of videos of the lectures and demonstrations to be produced for the NCBI YouTube channel as well as a set of worked exercises suitable for classroom teaching. Expanded materials will be available on the NCBI Education page in the near future.

We will offer the Librarian’s Guide at least once a year. Check back on NCBI’s Education page for future offerings of this and other NCBI courses.

For more information:

NCBI Education

A Librarian’s Guide to NCBI – A New Education Initiative

Next week NCBI will premiere A Librarian’s Guide to NCBI, a new course aimed at teaching health science librarians about NCBI resources. The course is sponsored by the NCBI, the National Library of Medicine (NLM), and the National Network of Libraries of Medicine’s NLM Training Center (NNLM/NTC) at the University of Utah. The initial offering of this course will be held April 15-19, 2013.


Our goal is to “train the trainers” who will implement new and enhance existing bioinformatics education and support services at their home institutions – 21 universities, medical centers and research institutes across the United States.

The course’s curriculum was designed to provide background knowledge and technical skills for librarians interested in helping patrons use online molecular resources from the NCBI.

Topics covered include:

  • Molecular Biology Basics
  • Advanced Entrez Searching
  • NCBI BLAST
  • Sequences & Genomes
  • Sequence Variation & its Consequences
  • Gene Expression & Biological Pathways
  • Protein Structures
  • Drugs & Other Small Molecules

Following the course, we will make the complete set of course materials freely available for anyone to download to use for the development of training programs or incorporation into existing courses. We’ll post here again after the course with full details on how to access the materials.

In addition to offering this new course, we plan to use the content as the basis for a series of webinars and a set of instructional materials suitable for classroom settings.

We’d love to hear about how you might use these materials and if there are other educational resources or materials that you’d like to see from us!

For more information:

Blastdbinfo: API access to a database of BLAST databases

NCBI offers extensive collections of sequences through its BLAST services (http://blast.ncbi.nlm.nih.gov) for comparing and identifying DNA, RNA and protein sequences. NCBI now deposits descriptions of these sequence collections, known as BLAST databases, in a special database called blastdbinfo that you can access through the Entrez Programming Utilities (E-Utilities). Using blastdbinfo, you can enable a program to find an appropriate database and then send BLAST searches to that database using either the BLAST URL API or standalone BLAST (installed locally).

If you’re unfamiliar with the E-Utilities, please see the E-Utilities documentation for a full description of these tools.

Procedure

1. Use esearch.fcgi to find desired BLAST databases (see Table 1 below for a listing of several useful query fields).

 esearch.fcgi?db=blastdbinfo&term=<database query>

[Parse out database ID from XML output]

2. Use esummary.fcgi to retrieve metadata about the matching databases.

esummary.fcgi?db=blastdbinfo&id=<database ID>

[Parse out database path from XML output]

3. Run a BLAST search with the desired database.

Blast.cgi?CMD=Put&DATABASE=<database path>&PROGRAM=<program>&QUERY=<query>

Example

For this example, we will look for human BLAST databases containing sequences from the NCBI Reference Sequence (RefSeq) Project. Click on the links to view the results of each step.

1. Use esearch with the following query (see Table 1):

refseq[blast database source] AND human[title]

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=blastdbinfo&term=refseq%5bblast+database+source%5d+AND+human%5btitle%5d

The first few lines of the returned XML result appear below.

<eSearchResult>
<Count>13</Count>
<RetMax>13</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>1023214</Id>
<Id>1001294</Id>
<Id>998664</Id>
…

2. Use esummary to retrieve the names and paths of the databases. In this case, we will use ID 1023214.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=blastdbinfo&id=1023214

The first few lines of the esummary XML appear below.

<eSummaryResult>
<DocumentSummarySet status="OK"><DocumentSummary uid="1023214">
<Name>Human build 37 RNA, reference, and alternate assemblies</Name>
<Path>DBINDEX/9606/allcontig_and_rna</Path>
<Title>human build 37 RNA, alternate and reference assemblies.</Title>
<LastUpdated>2010/11/01 00:00</LastUpdated>
<Description/>
<TotalLength>5886906670</TotalLength>
<MaxLength>115591998</MaxLength>
<NumSequences>50354</NumSequences>
…

The BLAST database name and its path prefix are in the <Path> field. We can use the complete string in this field to compose a search request using the BLAST URL API or standalone blast+.

3. Use the BLAST URL API to invoke the database, which is given as the value of the DATABASE parameter:

http://blast.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=put&DATABASE=DBINDEX/9606/allcontig_and_rna&PROGRAM=blastn&QUERY=NM_001126

For standalone BLAST, you can invoke the database on the command line:

blastn -db DBINDEX/9606/allcontig_and_rna  -remote -query <query_file> …
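The three steps above could be scripted along the following lines. This is a sketch under stated assumptions rather than an official client: the function names are our own, the base URLs are copied from this post as written (NCBI may now require https), and the XML samples are closed-up versions of the truncated excerpts shown above, so the real responses contain more elements. The fetch helper is defined but not called, so you can try the parsing functions without network access.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode

# Base URLs as they appear in this post
EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
BLAST = "http://blast.ncbi.nlm.nih.gov/blast/Blast.cgi"


def esearch_url(term: str) -> str:
    """Step 1: search blastdbinfo for databases matching an Entrez query."""
    return EUTILS + "esearch.fcgi?" + urlencode({"db": "blastdbinfo", "term": term})


def esummary_url(db_id: str) -> str:
    """Step 2: request the summary (including <Path>) for one database ID."""
    return EUTILS + "esummary.fcgi?" + urlencode({"db": "blastdbinfo", "id": db_id})


def blast_put_url(db_path: str, program: str, query: str) -> str:
    """Step 3: compose a BLAST URL API request against the chosen database."""
    params = {"CMD": "Put", "DATABASE": db_path, "PROGRAM": program, "QUERY": query}
    return BLAST + "?" + urlencode(params, safe="/")  # keep '/' in the path readable


def parse_ids(esearch_xml: str) -> List[str]:
    """Pull every <Id> value out of an esearch result."""
    return [e.text for e in ET.fromstring(esearch_xml).iter("Id")]


def parse_paths(esummary_xml: str) -> List[str]:
    """Pull every <Path> value out of an esummary result."""
    return [e.text for e in ET.fromstring(esummary_xml).iter("Path")]


def fetch(url: str) -> str:
    """Download a URL and return the body as text (not called in this sketch)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


# Closed-up versions of the truncated excerpts shown in the example above
SAMPLE_ESEARCH = (
    "<eSearchResult><Count>13</Count><RetMax>13</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>1023214</Id><Id>1001294</Id><Id>998664</Id></IdList></eSearchResult>"
)
SAMPLE_ESUMMARY = (
    '<eSummaryResult><DocumentSummarySet status="OK"><DocumentSummary uid="1023214">'
    "<Path>DBINDEX/9606/allcontig_and_rna</Path>"
    "</DocumentSummary></DocumentSummarySet></eSummaryResult>"
)

first_id = parse_ids(SAMPLE_ESEARCH)[0]
db_path = parse_paths(SAMPLE_ESUMMARY)[0]
print(esearch_url("refseq[blast database source] AND human[title]"))
print(esummary_url(first_id))
print(blast_put_url(db_path, "blastn", "NM_001126"))
```

To run against the live service, you would pass the output of esearch_url and esummary_url to fetch and feed the returned text to the two parse functions.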

Table 1 – Some useful query fields in blastdbinfo

Query Field | Sample Values | Example | Function
[blast sequence strategy] (nucleotide databases only) | est, gss, htgs012, htgs0123, wgs | wgs[blast sequence strategy] | Retrieves all databases containing wgs sequences
[blast database source] | genbank, gnomon, pdb, refseq, sra, swissprot | refseq[blast database source] | Retrieves all databases containing RefSeq sequences
[blast sequence type] | cdna, genomic, otherdna, protein | Protein[blast sequence type] | Retrieves all databases containing protein sequences
[title] | Text words within the database title | Non-redundant[title] | Retrieves databases with “non-redundant” in their title

For more information

For a complete list of all available field limits for the blastdbinfo database, visit this link:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=blastdbinfo

For technical assistance on BLAST, write to blast-help@ncbi.nlm.nih.gov.

How To Format Sequence Data For GenBank Submissions

Submitting sequences to GenBank can seem complicated at first, but starting with a solid foundation in the form of a properly formatted file will make the process go smoothly.

Before you submit sequence data to GenBank, you must format the data correctly; the most common file format is FASTA. This post will show you how to create a FASTA file for submitting single or multiple nucleotide sequences.

Submitters can upload FASTA-formatted sequence files using NCBI’s stand-alone software Sequin, the command-line tool tbl2asn or our web-based submission tool BankIt.

The image below depicts a single sequence in FASTA format. For multiple sequences, such as those of population or phylogenetic studies, environmental samples, and batch sequences of the same gene, create the file using the steps below and put the set of sequences together in a single FASTA file.

Image

Here is how to create the FASTA file:

1) We strongly recommend that you use a text editor. If you use a word processing program, you must save the file as plain ASCII text in order to retain the FASTA format.

2) Create a short, unique sequence ID (SeqID) for each sequence. These IDs function as placeholders until GenBank assigns accession numbers to replace them.

The following is an example of a good SeqID: 1234_abc

  • You can also use a unique isolate number, unique clone number, or other simple unique IDs.
  • Please limit the SeqID to 25 characters or fewer. Use of brackets (“[]”) in the SeqID is also prohibited.

3) Type the greater-than symbol (>) and then the SeqID, and then press the SPACE key on your keyboard. Sequin and BankIt require a single space between the SeqID and the [organism=Genus species] information in order to read the FASTA file.

Example:

>Seq_123 [organism=Homo sapiens] [isolate=456]

4) Use square brackets around the formatted organism data like this: [organism=Genus species]

Add other source information like clone, isolate, breed, and cultivar in brackets.  A list of additional source modifiers is found here: http://www.ncbi.nlm.nih.gov/Sequin/modifiers.html

5) Add a brief description of the sequence and then press the return or enter key on your keyboard to create a hard return to the next line.

6) Enter the nucleotide sequence and press the return or enter key on your keyboard to create a hard return to the next line.

7) For multiple sequences, repeat steps 2-6 until all sequences for the set are in the file.

8) Save the file as .txt (plain ASCII text).
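If you generate many sequences, you may prefer to build the file with a short script instead of typing it. The sketch below follows the steps above; the function names are our own, it checks only the rules stated in this post (SeqID length and brackets), and rejecting whitespace inside the SeqID is our assumption, made because the first space ends the SeqID.

```python
from typing import Dict, List, Optional, Tuple

MAX_SEQID_LENGTH = 25  # limit stated in step 2


def make_defline(seq_id: str, organism: str,
                 modifiers: Optional[Dict[str, str]] = None,
                 description: str = "") -> str:
    """Build a FASTA definition line: >SeqID [organism=...] [key=value] description."""
    if not 0 < len(seq_id) <= MAX_SEQID_LENGTH:
        raise ValueError("SeqID must be 1 to 25 characters long")
    if "[" in seq_id or "]" in seq_id:
        raise ValueError("SeqID must not contain brackets")
    if any(ch.isspace() for ch in seq_id):
        # Assumption: the first space ends the SeqID, so none may appear inside it.
        raise ValueError("SeqID must not contain whitespace")
    parts = [">" + seq_id, f"[organism={organism}]"]
    for key, value in (modifiers or {}).items():
        parts.append(f"[{key}={value}]")
    if description:
        parts.append(description)
    return " ".join(parts)  # single spaces between the fields


def make_fasta(records: List[Tuple[str, str]]) -> str:
    """Join (defline, sequence) pairs, one hard return after each line."""
    lines: List[str] = []
    for defline, sequence in records:
        lines.append(defline)
        lines.append(sequence)
    return "\n".join(lines) + "\n"


defline = make_defline("Seq_123", "Homo sapiens", {"isolate": "456"})
print(make_fasta([(defline, "ACGTACGTACGT")]), end="")  # toy sequence
```

To save the result as plain ASCII text, you could write the returned string with open("sequences.txt", "w", encoding="ascii"), where the file name is only an example.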

Look for a future Quick Tips blog post on creating a source modifier file for multiple sequences or sequences that have many source modifiers.

For more information:

The Tasmanian Devil and Cancer as an Infectious Disease: Analysis of transcriptome data

The Tasmanian devil (Sarcophilus harrisii), the last remaining large marsupial carnivore, now faces extinction because of a strange and deadly infection: a transmissible cancer known as Devil Facial Tumor Disease.  These tumor infections are apparently passed to other devils through bites during mating or during squabbles over carrion when devils gather to feed. In this unusual situation, the cancer cells themselves are the infectious agent.

The failure of devil immune systems to recognize and destroy the foreign tumor cells may be related to a decline in genetic diversity and may serve as a warning about the vulnerability of species with reduced gene pools.  The advent of next-generation sequencing has provided an unprecedented opportunity to track the spread and identify the origin of this unusual disease, as well as to examine the population structure of an endangered mammal and generate a complete genome sequence for this unique marsupial.

One way for you to access Tasmanian devil data at NCBI is through the BioProject database, which consolidates links to all of the data related to a study in a single place.  If you search this database with the term “Tasmanian devil”, you will retrieve five BioProject records: three are genome sequencing projects (PRJNA65325, PRJNA51853, and PRJNA167725) that will be the subjects of a future post on the devil, and two are next-generation transcriptome sequencing projects focusing on mRNA (PRJNA79479) and miRNA (PRJNA118101).  We will take a look at these RNA data in the present post.

Elizabeth Murchison and colleagues report on the mRNA and miRNA transcriptomes in a study that shows the remarkable potential of next-generation sequencing data to provide rapid insights into tissue-specific gene expression (PMCID: 2982769).

Let’s first look at the sequence data generated by the mRNA experiment, reported in PRJNA79479. The data are next-generation mRNA and microRNA (miRNA) expression profiles of tumor and normal testicular (testis) tissue and are available in the NCBI Sequence Read Archive (SRA).  The testis transcriptome data are in experiment SRX010967. The facial tumor data are in SRX010966. Each of these experiments comprises about a million reads from three sequencing runs.  These represent a gene expression snapshot from the two tissues. Murchison and colleagues report that the tumor sample is enriched in transcripts typical of nerve tissue and is consistent with a Schwann cell origin. Nerve-specific transcripts present at high levels in the tumor include myelin protein zero (MPZ), myelin basic protein (MBP), and nerve growth factor receptor (NGFR).  The pro-opiomelanocortin (POMC) gene, which is normally expressed in the pituitary gland, also shows elevated expression in the tumor.

Despite the large size of these datasets you can perform some analysis on them using tools on the NCBI website. The SRA transcriptomes have been processed and added to NCBI’s SRA BLAST service. The testis and tumor samples are available as separate databases listed under Sarcophilus harrisii.  You can easily compare the relative level of expression for any of the genes listed above by searching these two databases.

For example, searching each of these databases with the Tasmanian devil POMC-like transcript (XM_003757795) shows that reads matching this gene product are much more abundant in the facial tumor than in the testis database, as shown in the BLAST graphical overview immediately below.

Of course, to make this a useful comparison, you must consider the sizes of the two databases.  In this case, the tumor transcriptome is smaller (888,453 sequences; 152,473,966 bases) than the testis transcriptome (1,357,698 sequences; 237,435,784 bases), which confirms the high level of this transcript in the tumor.

You can run these two BLAST searches yourself by following these links:

  1. BLAST facial tumor transcriptome.
  2. BLAST testis transcriptome.

Now, let’s look at the other transcriptome study in the BioProject database (PRJNA118101) that links to an SRA submission (SRA010797) of Illumina-generated sequence reads of miRNAs from five tumors and ten normal tissue samples. Here, we will look for evidence of the brain-associated microRNA 338 (MIR338) that is highly represented in the tumor samples as compared with the normal tissue samples.

Although the miRNA reads in these samples are too short to search effectively using BLAST, the SRA Run Browser allows you to quickly count the number of reads for a particular sequence using the Filter search. If you retrieve the data for the facial tumor (GSM458090; load run SRR034113), select the “Reads” tab and filter the reads with the sequence of the 22-base 3’ stem portion of MIR338 (TCCAGCATCAGTGATTTTGTTG), you can see that 126,963 out of 2.6 million reads contain this miRNA sequence, as shown in the output below. Repeating the steps with a non-cancerous tissue such as the liver (GSM458084, SRR034107) finds only 38 out of 1.7 million reads.

The much higher level of expression of the brain-associated microRNA 338 (MIR338) in the tumor samples helps support the assertion that the devil facial tumor has a neural, and potentially a Schwann cell, origin.
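Because the two runs differ in size, it helps to put the counts on a common scale before comparing them. Here is a minimal sketch that converts the counts quoted above to reads per million; the function name is our own, and the totals are the rounded figures from this post (2.6 and 1.7 million), so the results are approximate.

```python
def reads_per_million(matching_reads: int, total_reads: float) -> float:
    """Scale a read count to the number expected per million reads in the run."""
    return matching_reads / total_reads * 1_000_000


# Counts quoted in the text; run totals are rounded values
tumor_rpm = reads_per_million(126_963, 2.6e6)  # facial tumor, SRR034113
liver_rpm = reads_per_million(38, 1.7e6)       # liver, SRR034107

print(f"tumor: {tumor_rpm:.0f} reads per million")
print(f"liver: {liver_rpm:.1f} reads per million")
print(f"fold difference: {tumor_rpm / liver_rpm:.0f}")
```

With these figures the normalized difference comes out at roughly 2,000-fold, so the conclusion does not depend on the tumor run being somewhat larger than the liver run.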

In a future post on this topic, we’ll look at nuclear and mitochondrial genomes for the Tasmanian devil. These data have been isolated from normal cells as well as tumor samples. This information provides a way to look at the population structure and diversity of the wild Tasmanian devil population, and also provides insight into the evolution and spread of a cancer that metastasizes to other individuals.

How to Download Bacterial Genomes Using the Entrez API

Given the size of modern sequence databases, finding the complete genome sequence for a bacterium among the many other partial sequences can be a challenge. In addition, if you want to download sequences for many bacterial species, an automated solution might be preferable.

In this post we’ll discuss how to download bacterial genomes programmatically for a list of species using the E-utilities, the application programming interface (API) to NCBI’s Entrez system of databases.  We’ll also take advantage of NCBI’s redesigned Genome database, which links all genome sequences for a given species to one record, making it easy to obtain the desired sequences once you find the right Genome record. In principle you can apply the procedure below to other simple genomes that are represented by a single sequence. Future posts will address additional considerations that apply to complex, eukaryotic genomes.

You’ll find that several types of genome sequences are linked to a Genome record. There may be complete chromosomes and/or plasmids along with whole genome shotgun (WGS) sequences. There may be NCBI Reference Sequences (RefSeqs) and original submissions to GenBank. You can limit your download to any combination of these subsets, as you’ll see below.

(Note: In this post, long e-utilities calls and lines of code are sometimes wrapped onto the next line for readability. This break is indicated by a backslash, ‘\’, at the end of the line. Of course, these wrapped lines should be on one line when you use them.)

Procedure

1. Use esearch.fcgi to find the Genome record, using the bacterial species name as the query.

esearch.fcgi?db=genome&term=<species name>

[Parse out genome ID from XML output]

2. Use elink.fcgi to find the desired Nucleotide records linked to the Genome record.

elink.fcgi?dbfrom=genome&db=nuccore&id=<genome ID>&term=<sequence type>\
&cmd=neighbor_history

[Parse out <query_key> and <WebEnv>]

3. Use efetch.fcgi to download the Nucleotide records in one of several formats.

efetch.fcgi?db=nuccore&query_key=<query_key>&WebEnv=<WebEnv>\
&rettype=<record type>&retmode=<record format>

Alternative for step 2.

If you remove the “&cmd” parameter from step 2, elink will return the nucleotide GI numbers for the linked sequences rather than a query_key and WebEnv. You will then need to parse each of the GI numbers from the XML output and pass them to efetch in step 3 using the “&id” parameter.
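As a sketch of this alternative, the snippet below parses the linked IDs from ELink XML and builds the efetch call with the “&id” parameter. The function names are our own, and the sample XML is a hand-written stand-in that assumes the usual ELink layout (a LinkSetDb element containing Link elements); the second linked ID in it is a placeholder, not a real record.

```python
import xml.etree.ElementTree as ET
from typing import List

EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"  # base URL as used in this post


def linked_ids(elink_xml: str) -> List[str]:
    """Collect the IDs of the linked Nucleotide records.

    Only <Id> elements inside <Link> are linked records; the <Id> inside
    <IdList> is the genome ID that was sent in the request.
    """
    root = ET.fromstring(elink_xml)
    return [link.findtext("Id") for link in root.iter("Link")]


def efetch_url(ids: List[str], rettype: str = "fasta", retmode: str = "text") -> str:
    """Build the step 3 call, passing the IDs as a comma-separated list."""
    return (EUTILS + "efetch.fcgi?db=nuccore&id=" + ",".join(ids)
            + f"&rettype={rettype}&retmode={retmode}")


# Hand-written stand-in for an ELink response (assumed layout; second ID is a placeholder)
SAMPLE_ELINK = """
<eLinkResult>
  <LinkSet>
    <DbFrom>genome</DbFrom>
    <IdList><Id>1076</Id></IdList>
    <LinkSetDb>
      <DbTo>nuccore</DbTo>
      <LinkName>genome_nuccore</LinkName>
      <Link><Id>25026556</Id></Link>
      <Link><Id>12345678</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>
"""

ids = linked_ids(SAMPLE_ELINK)
print(ids)
print(efetch_url(ids))
```

Taking only the IDs inside Link elements matters here: a pattern that grabs every Id in the output would also pick up the genome ID from the request.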

Now let’s look at some tricks. Table 1 lists the Entrez search terms for <sequence type> in step 2:

Table 1 – Entrez search terms for <sequence type>

Desired Sequence | &term value
Completed chromosomes | gene+in+chromosome[prop]
Plasmids | gene+in+plasmid[prop]
RefSeqs | srcdb+refseq[prop]
INSDC (DDBJ, EMBL-Bank, GenBank) | srcdb+ddbj/embl/genbank[prop]
WGS | wgs[prop]
Other genomic sequences | gene+in+genomic[prop]

You can combine these with Boolean operators to retrieve, for example, all RefSeq genomic sequences:

&term=(gene+in+chromosome[prop]+OR+gene+in+genomic[prop])\
+AND+srcdb+refseq[prop]
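If you build these limits in a script, a pair of small helpers keeps the parentheses and operators straight. This is just one way to do it; the dictionary keys and function names below are our own shorthand, while the values are the “&term” strings listed above.

```python
# &term values from the list above, keyed by our own short names
TERMS = {
    "chromosome": "gene+in+chromosome[prop]",
    "plasmid": "gene+in+plasmid[prop]",
    "refseq": "srcdb+refseq[prop]",
    "insdc": "srcdb+ddbj/embl/genbank[prop]",
    "wgs": "wgs[prop]",
    "genomic": "gene+in+genomic[prop]",
}


def any_of(*keys: str) -> str:
    """OR together several limits, wrapped in parentheses."""
    return "(" + "+OR+".join(TERMS[key] for key in keys) + ")"


def all_of(*parts: str) -> str:
    """AND together limits or parenthesized groups."""
    return "+AND+".join(parts)


# All RefSeq genomic sequences, as in the example above
term = all_of(any_of("chromosome", "genomic"), TERMS["refseq"])
print("&term=" + term)
```

The printed value matches the combined “&term” shown above, written on a single line.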

Please see Table 1 in Chapter 4 of the efetch documentation for available values of &rettype and &retmode that will generate the format you want, such as FASTA, GenBank flat file, feature table or XML.

Example

For this example our goal will be to explore the genome data available for Corynebacterium efficiens.

1. esearch.fcgi?db=genome&term=corynebacterium+efficiens

This call returns the genome ID 1076.

2. elink.fcgi?dbfrom=genome&db=nuccore&id=1076

The results of the elink call reveal a total of eight sequences (at the time of writing). By using a series of the “&term” values listed in Table 1, you’ll see that both RefSeq and WGS sequences are available. In this case we are using the alternative approach to step 2 above that does not use the “&cmd” parameter in the elink request. You might decide, for instance, to download the RefSeq sequence for the chromosome in FASTA format.  As long as you have included the appropriate “&term” value in the elink call, the final step below will accomplish this.

3. efetch.fcgi?db=nuccore&id=25026556&rettype=fasta&retmode=text

Next steps

Now that you’ve seen the basic method, it’s a relatively straightforward extension to produce a script that can read a file of species names and make the set of calls for each one. In this way you can download data for the entire set. We’ve included below a sample Perl script that downloads RefSeq chromosome sequences in FASTA format for a list of species provided as an array in the code. You can easily modify this script to, for example, read in species names from a file.

use strict;
use warnings;
use LWP::Simple;    # HTTPS URLs also require LWP::Protocol::https

my ($url, $xml, $out, $count, $query_key, $webenv, $ids);
my @genomeId;
my $base  = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
# Limit the linked nuccore records to RefSeq chromosome sequences
my $limit = 'srcdb+refseq[prop]+AND+gene+in+chromosome[prop]';
my @species = ('Corynebacterium efficiens',
               'Acidimicrobium ferrooxidans', 'Fluviicola taffensis');

foreach my $s (@species) {
    @genomeId = ();
    $query_key = $webenv = '';
    $count = 0;
    $s =~ s/ /+/g;    # encode spaces for the URL (also used in the file name)

    # ESearch: find the Genome UIDs for this species
    $url = $base . "esearch.fcgi?db=genome&term=$s";
    $xml = get($url);
    unless (defined $xml) { warn "ESearch failed for $s\n"; next; }
    $count = $1 if ($xml =~ /<Count>(\d+)<\/Count>/);
    if ($count > 20) {    # the default retmax is 20, so request all UIDs
        $url = $base . "esearch.fcgi?db=genome&term=$s&retmax=$count";
        $xml = get($url) || '';
    }
    while ($xml =~ /<Id>(\d+?)<\/Id>/gs) {
        push(@genomeId, $1);
    }
    $ids = join(',', @genomeId);

    # ELink: post the linked nuccore records to the History server
    $url = $base . "elink.fcgi?dbfrom=genome&db=nuccore"
         . "&cmd=neighbor_history&id=$ids&term=$limit";
    $xml = get($url) || '';
    $query_key = $1 if ($xml =~ /<QueryKey>(\d+)<\/QueryKey>/);
    $webenv    = $1 if ($xml =~ /<WebEnv>(\S+)<\/WebEnv>/);

    # EFetch: download the sequences in FASTA format
    $url = $base . "efetch.fcgi?db=nuccore&query_key=$query_key"
         . "&WebEnv=$webenv&rettype=fasta&retmode=text";
    $out = get($url);
    unless (defined $out) { warn "EFetch failed for $s\n"; next; }
    open(my $fh, '>', "$s.fna") or die "Cannot write $s.fna: $!";
    print $fh $out;
    close $fh;
}
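
As noted above, the species list can come from a file rather than from an array in the code. Here is a minimal sketch of one way to do that, assuming a plain-text file with one species name per line. The subroutine name read_species and the idea of passing the file name as the first command-line argument are our own choices for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read one species name per line from a text file,
# skipping blank lines and lines that start with '#'.
sub read_species {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my @names;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next if $line eq '' or $line =~ /^#/;
        push @names, $line;
    }
    close $fh;
    return @names;
}

# Take the file name from the command line (assumed usage); empty list otherwise
my $file = shift @ARGV;
my @species = defined $file ? read_species($file) : ();
```

The resulting @species array can stand in for the hard-coded list in the script above; the rest of the loop stays the same.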

Using Conserved Domains to Find Protein Homologs

If you’re a protein researcher, one thing you may want to do is to find homologs for a protein of interest on the basis of its sequence. This can provide insights into what the protein does and how it does it, and may identify proteins with known three-dimensional structures that can serve as models for the protein of interest. The Conserved Domains Database (CDD) groups proteins that have strong sequence similarity to protein domain fingerprints and allows you to search these groups with any protein sequence. Such searches are often more sensitive than standard BLAST searches since the scoring matrices used are tuned to locate important functional sites and sequence motifs that are highly conserved within the domain. You can then use the results to explore the evolutionary relationships of these proteins or identify these important sequence and structural features.

Here is a method to find protein sequences from many organisms that contain a particular conserved domain:

1. If you have a Protein sequence record for your gene of interest, click on  “Identify Conserved Domains” on the right-hand side of the page in the “Analyze this sequence” section.

2. This Conserved Domains Summary page shows a brief view summarizing the identity and location of regions matching the amino acid fingerprint (PSSM – Position Specific Scoring Matrix) for particular protein domains and domain families.

Please note that the definition of these domains comes from several sources (NCBI curation efforts, SMART, Pfam and TIGRFAM).  You can look at all of the conserved domains that match this region by clicking on “View full result.” Clicking on any of the bars will take you to a record that describes that particular domain as reported by the submitting organization (NCBI, SMART, Pfam, TIGR).

In either the “Brief view” or “Full result” view, the “Specific hit” shown at the top is the domain that contains the most curated information.  These are often curated by the NCBI Conserved Domain curation staff and have accessions that begin with “cd.” If you mouse-over this top-most bar, you’ll get a preview of the full Conserved Domain record.

3. Click on the top bar to go to the Conserved Domain record page, which describes what is known about the function of your domain.

This page has a Links box with hyperlinks to relevant records in other databases. The “Specific Protein” link retrieves Protein database records that have a high degree of similarity to this conserved domain. The “Related Protein” link retrieves protein sequences that are less similar to the domain than the “Specific Protein” records and that may contain either this domain or a functionally related one.

4. Click on either the “Specific Protein” or “Related Protein” link to retrieve the related records in the Protein database.

You can further filter these records to display only those from the Reference Sequence (RefSeq) project, which contains curated, non-redundant sequences representing the best understood, most representative sequences currently available for each biological molecule. To do this, click on the “RefSeq” link in the upper right-hand corner of the page.

In addition, if you are just interested in finding a putative functional homolog for your protein in a particular organism, you can filter your search result using the “Top Organisms” portlet on the right-hand side of the page.

To download sequences for further analysis, click on the “Send to” link in the upper right-hand corner of the page, select the File radio button, and then choose the record format.

If you want to perform evolutionary analysis or create a phylogenetic tree of the retrieved sequences, we suggest that you download the FASTA-formatted Reference Sequence records. You can use this file as input for most phylogenetic tree analysis programs, or for an alignment program that can display its results as a phylogenetic tree (such as COBALT).
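
Before passing a downloaded FASTA file to another program, it can be useful to confirm how many records it contains and how long each sequence is. The following Perl sketch shows one way to do that. The subroutine name fasta_lengths is hypothetical, and the code assumes standard FASTA text in which each record starts with a “>” header line.

```perl
use strict;
use warnings;

# Hypothetical helper: return (ID => sequence length) pairs for FASTA text.
sub fasta_lengths {
    my ($text) = @_;
    my (%len, $id);
    for my $line (split /\r?\n/, $text) {
        if ($line =~ /^>(\S+)/) {    # header line: use the first word as the ID
            $id = $1;
            $len{$id} = 0;
        } elsif (defined $id) {
            $line =~ s/\s+//g;       # ignore stray whitespace
            $len{$id} += length $line;
        }
    }
    return %len;
}
```

The number of keys in the returned hash is the number of records in the file, which you can compare against the number of records shown in your Protein search results.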

Bonus Tip: You may want to explore the NCBI-curated “cd” records in CDD a bit further. In addition to full descriptions of the function of the domain, they contain links to relevant literature in PubMed and the NCBI Bookshelf, data about the taxonomic distribution of the domain, links to molecular pathways (BioSystems) in which proteins with this domain are known to participate, and solved 3D structure models (Structure) for the domain that often include annotation identifying key functional and regulatory residues.

What does NCBI’s Internet Explorer 7 warning mean?

Over the past several months, you may have noticed a warning message if you’ve accessed the NCBI site using Microsoft’s Internet Explorer web browser:

Internet Explorer Warning

If you have been using Internet Explorer version 7, or version 8 in “compatibility mode”, to surf the web, you may have seen this warning at the top of NCBI webpages.

This message has caused concern among some users about exactly what changed on January 1, 2013 and whether they will still be able to access PubMed and other NCBI resources. We hope that this post will address some of the more common questions.

Why are you no longer supporting the IE7 web browser?

We consider several factors when deciding to stop support for a browser, such as how many of our web requests come from the browser, how much staff time is required to support the browser and the general level of support for the browser among other internet content providers.

What does “stop supporting this browser” mean?

When we stop supporting a given browser, we no longer test new or updated web pages on that browser, and can no longer guarantee that the features on our web pages will work well, or at all, in the browser. Additionally, if we become aware of problems on our web pages that occur only on that browser, we will no longer fix them.

What happened on the NCBI website on January 1, 2013?

Actually, nothing.  We didn’t turn anything on or off. As described above, we simply will no longer test our pages or fix problems in IE7.

What does this really mean about my use of the NCBI website?

In practice, this means that after January 1, IE7 users should still be able to use NCBI resources; however, as we update our pages, we expect that over a period of time (which could be weeks or months), the ability of IE7 to render our pages correctly will degrade.

What should I do?

The best course of action is to upgrade your browser to a more current version. Examples of the most current versions of browsers that we support are as follows:

For more information about Browser Support for the NCBI site, please see this page: http://www.ncbi.nlm.nih.gov/guide/browsers.

New PubReader View For Full-Text Articles

NCBI’s new PubReader display format in PubMed Central (PMC) makes full-text research papers not only more readable but also more portable.

Whether you’re using a desktop, laptop, tablet or smart phone, PubReader adapts to your device, displaying full-text articles in a user-friendly format that minimizes scrolling and maximizes intuitive navigation and portability (see Figure 1).

Figure 1. The PubReader format as seen in three common displays (widescreen desktop, smart phone and tablet).

NCBI developed this new presentation format to address some common obstacles in perusing research articles via the web, as well as to keep pace with the increasing prevalence of mobile devices. Any article that is available in full-text HTML in PubMed Central is viewable in the PubReader format. Furthermore, PubReader works with the latest browsers without the need to download an app or any additional software.

One of the most common issues encountered when reading literature online is that you can lose your place when referring back to an earlier section of a paper, for example to view a figure or table. As with a printed paper, PubReader breaks an article into multiple columns and pages, which improves readability and provides visual cues for navigation. In addition, PubReader makes the article’s figures and tables available as thumbnails at the bottom of the screen (see Figure 2). This allows you to view an earlier figure or table and then close it without losing your place. This feature also works with inline figures, tables and citations.

Figure 2. PubReader display of the first screen of PMC3396517 as seen on a desktop PC display. One of the figures in the image strip (C) is selected, popping up an enlarged version. Clicking the right margin (A) advances to the next screen. Clicking on the icon (B) toggles between the image strip (C) and a linear progress bar (not shown).

Another key aspect of PubReader is its adaptive formatting, which allows you to flip through a paper in the same way you would a novel on an e-reader. PubReader automatically senses whether a tablet is in portrait or landscape orientation, and adjusts the number of columns accordingly. You can also set your preferred font size using the typography configuration dialog in the upper right corner; page boundaries and columns will adjust to match.

PubReader offers a variety of common options for moving between pages. You can use the PageUp, PageDown, RightArrow, or LeftArrow keys on a keyboard, tap or click in the right or left margin, use finger swipes on a touch screen device, or use the progress bar at the bottom of the screen to jump across the page range. The article navigation dialog is another useful feature that allows you to quickly jump to any given section of a paper (see Figure 3).

Figure 3. Article navigation dialog.

From a technical standpoint, the PubReader format is assembled using the XML version of an article. We use XSLT to convert it into an HTML document. CSS and JavaScript are then added to implement the formatting, paging, navigation, text reflowing and other dynamic features. Notably, this is essentially how we have created the traditional full-text article display in PMC for years. The difference now is that we are able to leverage the features of the latest web technologies (HTML5 and CSS3).

The CSS and JavaScript code used to create the PubReader display are freely available from NCBITools on the public code repository GitHub. Anyone can use or adapt this code to display journal articles or other content that is structured as an HTML5 document.

You can read more about the PubReader view on the PubReader about page. You can try it directly with an example record (PMCID: 3396517) or by clicking on the “PubReader” link for an article in a PMC search result list or in the article itself.

References: