This blog post is intended for people who refer to chemical names/symbols and synonyms in databases like PubMed and PubChem, or in their own scientific papers. There is a similar post for gene symbols and names.
During the research and publishing process, scientists need to refer to their chemicals-of-interest. While there are standardized nomenclatures (IUPAC, SMILES, InChI™, etc.), different labs sometimes use different names for the same chemical.
The NCBI PubChem project has set up a system to identify and correlate these various names as well as ‘alias’, ‘synonym’, or ‘also known as’ terms that have been used in the literature.
All of the information regarding chemical symbols and names is stored in custom prepared files on the NCBI FTP site in a directory called “Extras”. These files are updated daily.
A summary file for information provided by PubChem submitters can be downloaded, as well as a summary file of synonyms provided by the Medical Subject Headings (MeSH) project.
These summary files are formatted as two columns separated by a tab:
Column Number | Description of data in the column |
1 | CID: PubChem Compound identifier |
2 | Name associated with the chemical |
Example:
…. ….
2733526 tamoxifen
2733526 10540-29-1
2733526 Crisafeno
2733526 Citofen
2733526 Oncomox
2733526 Soltamox
2733526 Tamizam
2733526 Tamoxen
2733526 Valodex
…. ….
More information about the contents and structure of this file is in the README-Extras file.
These files are “GZipped” (.gz) and can be uncompressed by applications such as WinZip or gunzip. When uncompressed, they are tab-delimited tables.
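If you would rather not uncompress the file by hand, the two-column layout described above can be read directly from the .gz archive. The following is a minimal Python sketch, not an official PubChem tool: the function name `load_synonyms` and the file name in the usage note are assumptions chosen for illustration, and the only thing taken from this post is the format itself (CID, a tab, then a name).

```python
import gzip
from collections import defaultdict


def load_synonyms(path, wanted_cids=None):
    """Return {cid: [name, ...]} from a gzipped, tab-delimited CID/name file.

    `wanted_cids` is an optional set of integer CIDs; when given, rows for
    all other compounds are skipped so the result stays small.
    """
    synonyms = defaultdict(list)
    # "rt" opens the gzip stream in text mode, so the file is read line by
    # line without being uncompressed on disk first.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            # Column 1 is the CID, column 2 is the name. Split on the first
            # tab only, and skip any line that has no tab at all.
            cid_text, tab, name = line.partition("\t")
            if not tab:
                continue
            cid = int(cid_text)
            if wanted_cids is None or cid in wanted_cids:
                synonyms[cid].append(name)
    return dict(synonyms)
```

For example, `load_synonyms("CID-Synonym-filtered.gz", wanted_cids={2733526})` would return the names recorded for tamoxifen, assuming a file of that name has been downloaded from the “Extras” directory. Passing a set of CIDs is a deliberate choice here: the full files are large, so keeping every compound in memory is usually unnecessary.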
If you are interested in one particular compound’s synonyms, you can use PubChem’s RESTful API (PUG REST) to access this information with a single URL call that includes the PubChem Compound Identifier. For example: https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/2733526/JSON?heading=Synonyms.
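The URL call above can also be made from a script. The sketch below uses only the Python standard library. The URL pattern is the one shown in this post; the function names are hypothetical, and the JSON layout (names stored as "String" values inside "StringWithMarkup" lists) is an assumption about the response rather than something documented here, so the parser simply walks the whole document instead of relying on a fixed path.

```python
import json
import urllib.request

# URL pattern copied from the example above.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound"


def synonyms_url(cid):
    """Build the synonyms URL for one PubChem Compound Identifier."""
    return f"{BASE}/{int(cid)}/JSON?heading=Synonyms"


def collect_strings(node):
    """Recursively gather every "String" found under "StringWithMarkup".

    Assumption: the response stores each name this way. Walking the whole
    tree avoids depending on the exact nesting of sections.
    """
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "StringWithMarkup" and isinstance(value, list):
                for item in value:
                    if isinstance(item, dict) and "String" in item:
                        found.append(item["String"])
            else:
                found.extend(collect_strings(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_strings(item))
    return found


def fetch_synonyms(cid, timeout=30):
    """Download the JSON for one compound and return the names found in it."""
    with urllib.request.urlopen(synonyms_url(cid), timeout=timeout) as response:
        return collect_strings(json.load(response))
```

Calling `fetch_synonyms(2733526)` requires network access. The returned list may contain the same name more than once if several sources report it, so you may want to de-duplicate it before use.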
Other more standardized descriptors such as IUPAC names, InChI™, InChIKey, and Canonical and Isomeric SMILES are computed from the chemical structures and stored in database files on the FTP site. As with the PUG REST call for a particular compound’s synonyms, these descriptors can also be accessed by PUG REST. For example: https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/2733526/JSON?heading=Computed%20Descriptors.
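Note that the space in “Computed Descriptors” has to be percent-encoded as `%20` in the URL. A small Python sketch of a URL builder that handles this for any heading is shown below. The helper name `heading_url` is hypothetical; only the URL pattern and the two heading values come from this post.

```python
from urllib.parse import quote, urlencode

# URL pattern copied from the examples in this post.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound"


def heading_url(cid, heading):
    """Build the URL for one compound and one heading.

    quote_via=quote encodes a space as %20 (urlencode's default would
    produce "+"), which matches the form of the example URL.
    """
    query = urlencode({"heading": heading}, quote_via=quote)
    return f"{BASE}/{int(cid)}/JSON?{query}"


print(heading_url(2733526, "Computed Descriptors"))
# prints https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/2733526/JSON?heading=Computed%20Descriptors
```

The same helper reproduces the synonyms URL when called with the heading `"Synonyms"`.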
This is perhaps the most important thesaurus in the world.