This blog post is intended for people who refer to gene symbols or names in databases such as Gene, ClinVar, or PubMed. There is a similar post for chemical names and symbols.
During the research and publishing process, scientists need to refer to their genes of interest. However, different labs sometimes use different gene symbols to refer to the same gene, which leads to confusion.
To standardize the use of terms, the HUGO Gene Nomenclature Committee (HGNC) sets official gene symbols and names. The NCBI Gene resource reports these official gene symbols and names, as well as additional symbols and names that are included on related sequence records for the same gene or from submitted GeneRIFs.
RefSeq curators also store alternate symbols and names as they review the literature. These are reported in Gene as:
- ‘synonym’, and
- ‘also known as’ terms.
There is a tab-separated file on the NCBI FTP site that contains all of this information for human genes.
All of the information regarding gene symbols and names is stored on the NCBI FTP site in files called “gene_info_”, which are updated daily. You can download a summary file with data for all organisms in the Gene database, or a file with data for a particular organism, such as human: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz.
Selected Columns from the table:
|Column Number|Description of data in the column|
|2|GeneID: the unique identifier for a gene|
|3|*NCBI Symbol: the default symbol for the gene at NCBI|
|11|Official Symbol for this gene designated by the nomenclature authority (HGNC)|
|5|Symbol Synonyms: bar-delimited set of unofficial symbols for the gene|
|9|*NCBI Named Description: the default name for this gene at NCBI|
|12|Official Name for this gene designated by the nomenclature authority (HGNC)|
|14|Other Names & Designations: pipe-delimited set of some alternate descriptions that have been assigned to a GeneID. ‘-’ indicates none is being reported|
*The NCBI default symbol and names displayed for humans are based on official HGNC designations. If there isn’t an officially-designated HGNC symbol or name, then our RefSeq curators create the NCBI designated defaults based on information in the scientific literature and metadata provided by submitters of sequence information.
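To illustrate how the column numbers in the table map onto one row of the file, here is a minimal Python sketch. The field names (`gene_id`, `ncbi_symbol`, and so on) and the function `parse_gene_info_line` are hypothetical labels chosen for this example, not names defined by NCBI. The sketch also assumes that ‘-’ marks an empty value in every selected column, which the table states only for column 14.

```python
# 1-based column numbers from the table, mapped to hypothetical field names.
SELECTED_COLUMNS = {
    "gene_id": 2,
    "ncbi_symbol": 3,
    "synonyms": 5,
    "ncbi_name": 9,
    "official_symbol": 11,
    "official_name": 12,
    "other_names": 14,
}
# Columns that hold several values separated by the pipe (bar) character.
MULTI_VALUE_FIELDS = {"synonyms", "other_names"}


def parse_gene_info_line(line):
    """Return the selected columns of one tab-delimited row as a dict."""
    fields = line.rstrip("\n").split("\t")
    record = {}
    for name, column in SELECTED_COLUMNS.items():
        value = fields[column - 1]  # 1-based column number -> 0-based index
        if name in MULTI_VALUE_FIELDS:
            # Assumption: "-" means nothing is reported for this column.
            record[name] = [] if value == "-" else [v.strip() for v in value.split("|")]
        else:
            record[name] = None if value == "-" else value
    return record


# Illustrative row: "x" fills the columns this sketch ignores, the GeneID is a
# placeholder, and the synonym list is shortened from the BRCA1 example.
row = ["x"] * 16
row[1] = "GENE_ID_HERE"
row[2] = "BRCA1"
row[4] = "BRCAI|BRCC1|BROVCA1"
row[8] = "BRCA1, DNA repair associated"
row[10] = "BRCA1"
row[11] = "BRCA1, DNA repair associated"
row[13] = "-"
record = parse_gene_info_line("\t".join(row) + "\n")
print(record["official_symbol"], record["synonyms"])
# BRCA1 ['BRCAI', 'BRCC1', 'BROVCA1']
```

Splitting the multi-value columns into lists up front makes it easier to search synonyms later without repeatedly handling the delimiter.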
An example entry from the file:
NCBI Symbol: BRCA1
Official Symbol: BRCA1
Symbol Synonyms: BRCAI | BRCC1 | BROVCA1 | FANCS | IRIS | PNCA4 | PPP1R53 | PSCP | RNF53
NCBI Name: BRCA1, DNA repair associated
Official Name: BRCA1, DNA repair associated
Other Names: breast cancer type 1 susceptibility protein | BRCA1/BRCA2-containing complex, subunit 1 | Fanconi anemia, complementation group S | RING finger protein 53 | breast and ovarian cancer susceptibility protein 1 | breast cancer 1, early onset | early onset breast cancer 1 | protein phosphatase 1, regulatory subunit 53 | truncated breast cancer 1
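One practical use of the synonym list is to resolve whatever symbol a paper or database uses back to the official one. The following Python sketch builds such a lookup from the BRCA1 entry above. The function name `build_alias_index` and the dictionary layout are assumptions made for illustration. Each alias maps to a set of official symbols because this sketch does not assume that an unofficial symbol is unique to one gene.

```python
def build_alias_index(records):
    """Map each known symbol (upper-cased) to the official symbols it may refer to."""
    index = {}
    for record in records:
        official = record["official_symbol"]
        # Index the official symbol itself along with every synonym.
        for alias in [official, *record["synonyms"]]:
            index.setdefault(alias.upper(), set()).add(official)
    return index


# Values copied from the BRCA1 example entry.
brca1 = {
    "official_symbol": "BRCA1",
    "synonyms": ["BRCAI", "BRCC1", "BROVCA1", "FANCS", "IRIS",
                 "PNCA4", "PPP1R53", "PSCP", "RNF53"],
}

alias_index = build_alias_index([brca1])
print(alias_index["RNF53"])  # {'BRCA1'}
```

A lookup that returns more than one official symbol for an alias is a signal that the alias is ambiguous and needs a manual check.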
More information about the contents and structure of this file is in the GENE_INFO README file. These files are gzip-compressed (.gz) and can be uncompressed with applications such as WinZip or gunzip. When uncompressed, they are tab-delimited tables. Organism-specific files, such as the human one mentioned above, can be imported into and managed by a spreadsheet application such as Excel.
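If you would rather skip the manual uncompress step, the file can also be read directly with a script. The sketch below uses only Python's standard `gzip` and `csv` modules. The function name `read_gene_info` is hypothetical, the path in the usage comment is simply wherever you saved the download, and the code assumes that a header line, if present, begins with “#”.

```python
import csv
import gzip


def read_gene_info(path):
    """Yield each data row of a gzipped, tab-delimited file as a list of strings."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quote characters inside names as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Assumption: a leading "#" marks a header line, not a gene record.
            if row and not row[0].startswith("#"):
                yield row


# Usage, assuming the human file was saved in the current directory:
# for row in read_gene_info("Homo_sapiens.gene_info.gz"):
#     print(row[1], row[2])  # columns 2 and 3: GeneID and NCBI Symbol
```

Because every value stays a plain string, nothing in this path reinterprets a symbol as a number or a date.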
One thing to keep in mind:
Spreadsheet applications such as Excel often have autocorrect and autoformat functions that may alter the text in the cells. For example, the tumor suppressor DEC1 may be autoconverted into the “date” format DEC-1, sometimes seen as 1-Dec.
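Before opening a gene list in a spreadsheet, it can help to flag the symbols that are at risk. The Python sketch below is a heuristic only: it assumes that a symbol made of a month name or abbreviation followed by one or two digits, like DEC1, is a candidate for date conversion. The function name `looks_like_date` and the exact pattern are illustrative assumptions, not a description of Excel's actual rules.

```python
import re

# Month names and abbreviations that a spreadsheet might read as a date prefix.
MONTH_PREFIXES = (
    "JAN", "FEB", "MAR", "MARCH", "APR", "APRIL", "MAY", "JUN", "JUNE",
    "JUL", "JULY", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC",
)
# Heuristic pattern: month prefix, optional hyphen, then one or two digits.
DATE_LIKE = re.compile(
    r"(?:%s)-?\d{1,2}" % "|".join(MONTH_PREFIXES), re.IGNORECASE
)


def looks_like_date(symbol):
    """Return True if the whole symbol matches the month-plus-digits heuristic."""
    return DATE_LIKE.fullmatch(symbol) is not None


symbols = ["BRCA1", "DEC1", "RNF53"]
at_risk = [s for s in symbols if looks_like_date(s)]
print(at_risk)  # ['DEC1']
```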
This problem, which was highlighted in a 2004 publication, has become widespread in the scientific literature, according to an August 2016 publication in Genome Biology. To assist users, HGNC has put out a video showing users how to import gene symbol data into Excel correctly: