Making Custom Databases for Web BLAST


An easy way to speed up your BLAST analysis is to search a smaller database targeted to your sequences of interest. Here we describe a few ways to create such custom databases on the BLAST web pages. For this Quick Tip we’ll use the pages in the Basic BLAST section of the BLAST home page.

BLAST parent databases

Generating a custom database begins with selecting the appropriate parent database. The BLAST Guide provides database descriptions to help you choose one. You select the parent in the Database pull-down menu, shown in Figure 1. Selecting the parent database is your first opportunity to customize.

Figure 1. The database selection pull-down lists: top panel, nucleotide databases; bottom panel, protein databases
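The same idea, a parent database plus a restriction, can also be written down for a script. The following is a minimal sketch and not part of the original Quick Tip: it assumes the BLAST URL API parameters CMD, PROGRAM, DATABASE, QUERY and ENTREZ_QUERY, and the helper name build_blast_put_params is hypothetical. It only builds the request parameters and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint of the BLAST URL API (shown for reference; nothing is sent).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_put_params(query, program="blastn", database="nt", entrez_query=None):
    """Return the form parameters for submitting a search (hypothetical helper).

    database     -- the parent database, e.g. "nt" (nucleotide) or "nr" (protein)
    entrez_query -- optional Entrez query that restricts the parent database
    """
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    }
    if entrez_query:
        # Restricting the parent database is what makes it "custom".
        params["ENTREZ_QUERY"] = entrez_query
    return params


params = build_blast_put_params("ACGT", entrez_query="Bacteria[Organism]")
body = urlencode(params)  # URL-encoded form body for a POST to BLAST_URL
print(body)
```

The restriction shown here, Bacteria[Organism], is only an illustration; any valid Entrez query could be used in its place.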

Continue reading

Setting Up Automatic NCBI Searches and New Record Alerts


Do you regularly perform PubMed searches to find new articles on your topic of interest?

Would you like to know when new sequence records become available for your gene?

Is it important to be alerted when new bioactivity assays are available with inhibitor data for your enzyme?

With a free My NCBI account, you can easily set up a series of e-mail alerts to notify you of such new information. You can read more about the many other functions of My NCBI.

Here’s how to set up these alerts:

Continue reading

Joining PubMed Commons: A Step-by-step Guide


In our previous post we wrote about a new service called PubMed Commons that allows researchers to add comments to individual PubMed records. As we described in that post, PubMed Commons is currently a beta pilot release, and requires interested people to join the system before they can view or add comments. This post will describe how to join PubMed Commons.

Continue reading

Removing Duplicate Citations from My Bibliography


My Bibliography is a component of the My NCBI service that allows authors to create an online collection of their published work. While editing their bibliographies, authors can import citations for their articles directly from PubMed, and the system automatically checks for duplicates and removes citations imported more than once. However, authors may still end up with duplicates in certain situations, and it is not always obvious how to remove them. In this post we describe three situations where duplicates may persist and discuss ways to remove them.

Continue reading

Verifying Article Compliance for NIH Public Access


Are you trying to find out if your article complies with the NIH Public Access policy and/or find a PubMed Central ID (PMCID) for your article? If so, this post describes a simple method for finding the PMCID for an article and thereby verifying Public Access compliance.

First, let’s start with a bit of background. To comply with the NIH Public Access Policy, you need to make sure that your peer-reviewed articles that resulted from NIH funding (full or partial) and that were accepted for publication on or after April 7, 2008 are available in the PubMed Central (PMC) database with a PMCID. Please be aware that PMC is not the same as PubMed. PMC is NCBI’s full-text digital archive, while PubMed contains only citations and abstracts. It is not enough for your citation to be available in PubMed with a PubMed ID (PMID); you must have a PMCID to satisfy NIH Public Access policy.
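The rule in the paragraph above can be summarized as a small check. This is only a sketch of the logic as stated here, not an official compliance tool: the function names are hypothetical, and the test for a PMCID simply assumes the usual form of "PMC" followed by digits.

```python
from datetime import date

# Acceptance date from which the policy applies, as stated above.
POLICY_START = date(2008, 4, 7)


def policy_applies(nih_funded, peer_reviewed, accepted):
    """True if the article falls under the NIH Public Access Policy."""
    return nih_funded and peer_reviewed and accepted >= POLICY_START


def looks_like_pmcid(identifier):
    """True for strings such as 'PMC1234567'; a bare PMID does not qualify."""
    if not identifier:
        return False
    identifier = identifier.strip().upper()
    return identifier.startswith("PMC") and identifier[3:].isdigit()


def is_compliant(nih_funded, peer_reviewed, accepted, pmcid=None):
    """An article under the policy is compliant only if it has a PMCID."""
    if not policy_applies(nih_funded, peer_reviewed, accepted):
        return True  # the policy does not apply, so nothing is required
    return looks_like_pmcid(pmcid)


# A PMID alone is not enough for an article accepted after the cutoff.
print(is_compliant(True, True, date(2012, 5, 1), pmcid="23456789"))
print(is_compliant(True, True, date(2012, 5, 1), pmcid="PMC1234567"))
```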

To check that your article has a PMCID and is compliant, proceed as follows:

Continue reading

Blastdbinfo: API access to a database of BLAST databases


NCBI offers extensive collections of sequences through its BLAST services (http://blast.ncbi.nlm.nih.gov) for comparing and identifying DNA, RNA and protein sequences. NCBI now deposits descriptions of these sequence collections, known as BLAST databases, in a special database called blastdbinfo that you can access through the Entrez Programming Utilities (E-Utilities). Using blastdbinfo, you can enable a program to find an appropriate database and then send BLAST searches to that database using either the BLAST URL API or standalone BLAST (installed locally).
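As a rough illustration of what E-Utilities access to blastdbinfo might look like, the sketch below builds ESearch and ESummary request URLs with db=blastdbinfo. The helper names and the example search term are hypothetical, and the code only constructs URLs; fetching and parsing the responses is left out.

```python
from urllib.parse import urlencode

# Base URL of the Entrez Programming Utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def blastdbinfo_search_url(term, retmax=20):
    """ESearch URL that looks for BLAST database records matching a term."""
    query = urlencode({"db": "blastdbinfo", "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def blastdbinfo_summary_url(ids):
    """ESummary URL that retrieves descriptions for the given record IDs."""
    query = urlencode({"db": "blastdbinfo", "id": ",".join(ids)})
    return f"{EUTILS}/esummary.fcgi?{query}"


# Example with a made-up search term and made-up record IDs.
print(blastdbinfo_search_url("human", retmax=5))
print(blastdbinfo_summary_url(["1", "2"]))
```

A program could read the database name from the summary of a matching record and pass it to the BLAST URL API or to standalone BLAST, as the post describes.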

Continue reading

How To Format Sequence Data For GenBank Submissions


Submitting sequences to GenBank can seem complicated at first, but starting with a solid foundation in the form of a properly formatted file will make the process go smoothly.

Before you submit sequence data to GenBank, you must format the data correctly; the most common file format is FASTA. This post will show you how to create a FASTA file for submitting single or multiple nucleotide sequences.
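For readers who prefer to generate the file with a script, here is a minimal sketch of writing generic FASTA text: a definition line that starts with ">" followed by the sequence on wrapped lines. The function name, the record layout and the 70-character line width are assumptions for illustration; any GenBank-specific requirements for the definition line are not covered here.

```python
def format_fasta(records, width=70):
    """Format (seq_id, description, sequence) tuples as FASTA text.

    Whitespace inside each sequence is removed, the sequence is upper-cased,
    and it is wrapped to `width` characters per line.
    """
    lines = []
    for seq_id, description, sequence in records:
        # Definition line: ">" immediately followed by the sequence identifier.
        lines.append(f">{seq_id} {description}".rstrip())
        cleaned = "".join(sequence.split()).upper()
        for start in range(0, len(cleaned), width):
            lines.append(cleaned[start:start + width])
    return "\n".join(lines) + "\n"


# Two made-up nucleotide sequences in one file.
example = [
    ("Seq1", "example sequence one", "atgc atgc atgc"),
    ("Seq2", "", "ggccttaa"),
]
print(format_fasta(example, width=8), end="")
```

A file with a single sequence is the same thing with one record in the list.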

Continue reading

How to Download Bacterial Genomes Using the Entrez API


Given the size of modern sequence databases, finding the complete genome sequence for a bacterium among the many other partial sequences can be a challenge. In addition, if you want to download sequences for many bacterial species, an automated solution might be preferable.

In this post we’ll discuss how to download bacterial genomes programmatically for a list of species using the E-utilities, the application programming interface (API) to NCBI’s Entrez system of databases. We’ll also take advantage of NCBI’s redesigned Genome database, which links all genome sequences for a given species to one record, so that once you find the right Genome record it is easy to obtain the desired sequences. In principle, you can apply the procedure below to other simple genomes that are represented by a single sequence. Future posts will address additional considerations that apply to complex, eukaryotic genomes.
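The overall flow described above (find the Genome record for a species, follow its links to the sequence records, then download them) might be sketched with three E-utilities calls: ESearch, ELink and EFetch. This is an outline under assumptions, not the procedure from the full post: the helper names are hypothetical, the choice of nuccore as the linked database is assumed, and the code only builds request URLs without calling NCBI.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def genome_search_url(species):
    """ESearch: find the Genome record for a species name."""
    query = urlencode({"db": "genome", "term": f"{species}[Organism]"})
    return f"{EUTILS}/esearch.fcgi?{query}"


def genome_to_nuccore_url(genome_id):
    """ELink: follow links from a Genome record to nucleotide records."""
    query = urlencode({"dbfrom": "genome", "db": "nuccore", "id": genome_id})
    return f"{EUTILS}/elink.fcgi?{query}"


def fasta_fetch_url(nuccore_ids):
    """EFetch: download the linked sequences as FASTA text."""
    query = urlencode({
        "db": "nuccore",
        "id": ",".join(nuccore_ids),
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EUTILS}/efetch.fcgi?{query}"


# Step 1 for each species in a list; steps 2 and 3 need the IDs returned by NCBI.
for name in ["Escherichia coli", "Bacillus subtilis"]:
    print(genome_search_url(name))
```

In a real script each URL would be requested in turn, with the IDs parsed from one response feeding the next call.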

Continue reading

Using Conserved Domains to Find Protein Homologs


If you’re a protein researcher, you may want to find homologs for a protein of interest on the basis of its sequence. Homologs can provide insights into what the protein does and how it does it, and may identify proteins with known three-dimensional structures that can serve as models for the protein of interest. The Conserved Domains Database (CDD) groups proteins that have strong sequence similarity to protein domain fingerprints and allows you to search these groups with any protein sequence. Such searches are often more sensitive than standard BLAST searches because the scoring matrices used are tuned to locate important functional sites and sequence motifs that are highly conserved within the domain. You can then use the results to explore the evolutionary relationships of these proteins or to identify these important sequence and structural features.

Here is a method to find protein sequences from many organisms that contain a particular conserved domain:

Continue reading