As you may have read in previous posts, NCBI is in the process of changing the way we handle GI numbers for sequence records. In short, we are moving to a time when accession.version identifiers, rather than GI numbers, will be the primary identifiers for sequence records.
In a previous post, we outlined a method for converting GI numbers (used to identify sequence records) to accession.version identifiers. That method used the E-utility EFetch and is capable of handling cases where you have no more than a few thousand GI numbers to convert.
What if you have more?
We now have a bulk conversion resource that will allow you to handle very large jobs. The resource consists of a Python script coupled with a database file (about 40 GB uncompressed). You’ll need to download both of these files (gi2accession.py and gi2acc_lmdb.gz) to local disk, and then you can process as needed.
As you may have read in previous posts, NCBI is in the process of changing the way we handle GI numbers for sequence records.
In short, we are moving to a time when accession.version identifiers, rather than GI numbers, will be the primary identifiers for sequence records.
As part of this transition, an obvious question for any of you currently using GI numbers is how to convert a GI number to an accession.version, so that you can make appropriate updates. The good news is that it’s pretty easy if you have no more than a few thousand GIs to convert.
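For a modest list of GIs, the EFetch approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not the exact method from the earlier post. It assumes the records live in the `nuccore` database and that `rettype=acc` returns one accession.version per line. The helper names (`chunked`, `build_query`, `gi_to_accession`) and the batch size of 200 are hypothetical choices made for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunked(items, size):
    """Yield successive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_query(gis, db="nuccore"):
    """Build the form-encoded EFetch parameters for one batch of GI numbers.

    Assumption: rettype=acc asks EFetch for accession.version identifiers.
    """
    return urlencode({
        "db": db,
        "id": ",".join(str(gi) for gi in gis),
        "rettype": "acc",
    })


def gi_to_accession(gis, db="nuccore", batch_size=200):
    """Return the accession.version strings EFetch reports for the given GIs.

    Requests are sent in batches via HTTP POST (hypothetical batch size).
    """
    accessions = []
    for batch in chunked(list(gis), batch_size):
        data = build_query(batch, db).encode("ascii")
        with urlopen(EFETCH_URL, data=data) as response:
            text = response.read().decode("utf-8")
        accessions.extend(line.strip() for line in text.splitlines() if line.strip())
    return accessions
```

Calling `gi_to_accession([...])` with your own list of GI numbers sends the requests over the network. Before pairing results with inputs, check that the number of returned accessions matches the number of GIs you submitted, since the sketch does not verify this.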
This blog post is intended for people who refer to chemical names/symbols and synonyms in databases like PubMed and PubChem, or in their own scientific papers. There is a similar post for gene symbols and names.
During the research and publishing process, scientists need to refer to their chemicals-of-interest. While there are standardized nomenclatures (IUPAC, SMILES, InChI™, etc.), different labs sometimes use different names for the same chemical.
The NCBI PubChem project has set up a system to identify and correlate these various names as well as ‘alias’, ‘synonym’, or ‘also known as’ terms that have been used in the literature.
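If you would rather look up these synonyms from a script than from the web pages, one option is PubChem's PUG REST interface. The sketch below is an assumption on our part and is not the workflow described in the post. It builds a name-based synonyms URL and pulls the synonym lists out of the JSON response. The function names are hypothetical, and the response layout (`InformationList` → `Information` → `CID` / `Synonym`) reflects how the service responds at the time of writing.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL of the PubChem PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def synonyms_url(name):
    """Build the PUG REST URL that lists synonyms for a chemical name."""
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/synonyms/JSON"


def parse_synonyms(payload):
    """Map each compound ID (CID) in a decoded response to its synonym list."""
    info = payload.get("InformationList", {}).get("Information", [])
    return {entry["CID"]: entry.get("Synonym", []) for entry in info if "CID" in entry}


def fetch_synonyms(name):
    """Query PubChem over the network and return {CID: [synonyms, ...]}."""
    with urlopen(synonyms_url(name)) as response:
        return parse_synonyms(json.load(response))
```

For example, `fetch_synonyms("aspirin")` would return a dictionary keyed by CID, with each value holding the names PubChem has on record for that compound.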
This blog post is intended for people who refer to gene symbols or names in databases such as Gene, ClinVar, or PubMed. There is a similar post for chemical names and symbols.
During the research and publishing process, scientists need to refer to their genes-of-interest. However, different labs sometimes use different gene symbols to refer to the same gene. As you can imagine, this leads to confusion.
To standardize the use of terms, the HUGO Gene Nomenclature Committee (HGNC) sets official gene symbols and names. The NCBI Gene resource reports these official gene symbols and names, as well as additional symbols and names that are included on related sequence records for the same gene or from submitted GeneRIFs.
You may have heard that NCBI is changing the way we handle GI numbers for sequence records in September 2016. Well, you heard right! Here’s the announcement, in case you missed it.
There are a number of issues raised by these changes, but we’re going to answer two questions in this post:
- What pieces of your code will break in September?
- Are GI numbers gone for good?
Professors, you’re busy – really busy. You have to develop and teach your courses and laboratory sessions, coordinate your lab’s research efforts, write grants and publications, and stay current on everything related to your teaching and research topics.
NCBI has information that would help most of these efforts – but there are so many interesting records and so little time to organize them for efficient use. Sign up for a free NCBI Account and let us help you organize your important lists!
Figure 1. The My NCBI login page.
Sign up for an NCBI Account – or sign in to your account if you already have one – and:
- Store and automate your searches;
- Save and manage collections of important records for use in coursework, research projects and federal grants;
- Create public lists for students in your courses and your own Faculty Profile;
- And keep track of everything – right on your My NCBI dashboard.
Read on to find out how to do all of these things and more!
The Sequence Read Archive (SRA), NCBI’s largest growing repository of molecular data, archives raw sequencing data and alignment information from high-throughput sequencing platforms, including Roche 454 GS Systems®, Illumina’s Genome Analyzer®, and Complete Genomics® systems.
Researchers commonly use SRA data to make discoveries by comparing data sets. Data sets can be compared through the SRA web interface, but if you want to integrate data downloads and file conversions into an existing pipeline, or you simply prefer a command-line interface, we recommend using the SRA Toolkit.
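As a rough idea of what that integration can look like, here is a small Python wrapper around two SRA Toolkit commands, `prefetch` and `fastq-dump`. It is only a sketch. It assumes the toolkit is installed and on your `PATH`, the wrapper function names are hypothetical, and the `fastq` output directory is an arbitrary choice for illustration.

```python
import subprocess


def prefetch_cmd(accession):
    """Command that downloads the run data for one SRA accession."""
    return ["prefetch", accession]


def fastq_dump_cmd(accession, outdir="fastq"):
    """Command that converts a run to FASTQ, splitting paired reads into separate files."""
    return ["fastq-dump", "--split-files", "--outdir", outdir, accession]


def download_run(accession, outdir="fastq"):
    """Download a run, then convert it to FASTQ; raises CalledProcessError on failure."""
    for cmd in (prefetch_cmd(accession), fastq_dump_cmd(accession, outdir)):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the call to `subprocess.run` makes it easy to log or inspect the exact commands before anything is downloaded.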
Run Selector is a tool available through the Sequence Read Archive (SRA) that allows you to fine-tune your web-based search results. There are over two dozen fields that can be used to filter SRA data in Run Selector. For example, if you need to look at data from a particular sequencing platform and genome assembly, you can use these fields as filters.
After running a web-based search for any keyword in the SRA database, users can dump all the results (up to a maximum of 20,000 experiments) into the Run Selector for fine-tuning. In addition, Run Selector shows you how many runs fall into each of the categories even before a filtering category is selected, allowing you to investigate the data further by noting what is contained within the database.
Figure 1. After searching with SRA, click on “Send to” to open the drop-down menu. Then click on the radio button labeled “Run Selector” to send your search results to Run Selector. Note that you can already see how many runs are in each of the categories to the left.
This article is intended for GenBank data submitters with a basic knowledge of BLAST who submit sequence data from protein-coding genes.
One of the most common problems when submitting DNA or RNA sequence data from protein-coding genes to GenBank is failing to add information about the coding region (often abbreviated as CDS) or incorrectly defining the CDS. Incomplete or incorrect CDS information will prevent you from having accession numbers assigned to your submission data set. Fortunately, there is a procedure that will help you troubleshoot any problems with the CDS feature annotation: doing a BLAST analysis with your sequences before you submit your data.
Here’s how to use nucleotide BLAST (blastn) and the formatting options menu to analyze, interpret and troubleshoot your submissions:
1. To start the BLAST analysis, go to the BLAST homepage and select “nucleotide blast”.
Figure 1. Select “nucleotide blast”.
This blog post is a continuation of last week’s blog on finding biological assay data; it is intended for researchers who use PubChem.
Your research focuses on a protein (receptor or enzyme) for which you’d like to identify a chemical probe or modulator. The probe could help to identify the subcellular location of a protein. A modulator may help to determine the biological effects of a particular protein’s activity. Additionally, finding a novel chemical that binds to your protein might assist you in exploring the use of a new class of therapeutics in drug design.
At NCBI, the PubChem BioAssay database stores biological activity assay information, which makes it possible to find experimentally measured targets for millions of chemicals. This blog post shows a simple workflow to download a table (with raw and kinetic data) of chemicals that have been determined to bind to a particular gene/protein target.