NCBI is now producing a new set of taxonomy files that include the taxonomic lineage of taxa, information on type strains and material, and host information. These files are particularly helpful for people maintaining local installations of NCBI data.
You can download the new archive (new_taxdump.tar.gz) from the taxonomy directory on the FTP site (ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/). The new files are typematerial.dmp, typeoftype.dmp, rankedlineage.dmp, fullnamelineage.dmp, taxidlineage.dmp, and host.dmp. Please see the readme file for details of the file contents.
The original taxonomy file archive without the new content will remain available under its original name, taxdump.tar.gz. The section below shows the entries for the monkey species Cercopithecus lomamiensis from the new ranked lineage and type material files.
1191211 | Cercopithecus lomamiensis | | Cercopithecus | Cercopithecidae | Primates | Mammalia | Chordata | Metazoa | Eukaryota |
1191211 | Cercopithecus lomamiensis | holotype | YPM 14080 |
1191211 | Cercopithecus lomamiensis | holotype | YPM MAM 14080 |
1191211 | Cercopithecus lomamiensis | paratype | YPM 14189 |
1191211 | Cercopithecus lomamiensis | paratype | YPM 14191 |
1191211 | Cercopithecus lomamiensis | paratype | YPM 14192 |
1191211 | Cercopithecus lomamiensis | paratype | YPM MAM 14189 |
1191211 | Cercopithecus lomamiensis | paratype | YPM MAM 14191 |
1191211 | Cercopithecus lomamiensis | paratype | YPM MAM 14192 |
Top panel: Ranked lineage for Cercopithecus lomamiensis from rankedlineage.dmp. The ranks are species, genus, family, order, class, phylum, kingdom, and superkingdom. Bottom panel: Type material information from typematerial.dmp. The columns are taxonomy id, name, type designation, collection/repository details.
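For readers loading these files into their own scripts, here is a minimal Python sketch of how one row of rankedlineage.dmp might be split into named columns. It assumes that fields are separated by a tab, a vertical bar, and a tab, and that each row ends with a tab and a vertical bar, as in the original taxdump files (the rows above are shown with spaces for readability). Please check the readme file for the authoritative layout. The function names and the RANKED_FIELDS list are illustrative only and are not part of any NCBI tool; the column names follow the rank order given in the caption above.

```python
# Assumed column order for rankedlineage.dmp, taken from the caption in this post.
RANKED_FIELDS = [
    "tax_id", "tax_name", "species", "genus", "family",
    "order", "class", "phylum", "kingdom", "superkingdom",
]


def split_dmp_row(line):
    """Split one .dmp row into a list of field strings.

    Assumes the field separator is tab-bar-tab and that the row ends with tab-bar.
    """
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return [field.strip() for field in line.split("\t|\t")]


def parse_ranked_lineage(line):
    """Map one rankedlineage.dmp row onto the assumed column names."""
    return dict(zip(RANKED_FIELDS, split_dmp_row(line)))


# Example row built from the Cercopithecus lomamiensis entry shown above.
sample_values = [
    "1191211", "Cercopithecus lomamiensis", "", "Cercopithecus",
    "Cercopithecidae", "Primates", "Mammalia", "Chordata",
    "Metazoa", "Eukaryota",
]
sample_row = "\t|\t".join(sample_values) + "\t|\n"

record = parse_ranked_lineage(sample_row)
print(record["tax_name"], "belongs to family", record["family"])
```

The species column is empty in this row because the entry is itself a species, so an empty string is kept rather than dropped to preserve the column positions.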
10 thoughts on “New taxonomy files available with lineage, type, and host information”
Is it recommended to switch to this? I am still using the old taxonomy dump, so I am wondering whether it is worth making the switch. (It would require changing a lot of downstream Ruby code that I wrote, which is why I am asking.) Thanks!
Yes, switching is recommended, but we will support both options (old and new dump files) for the foreseeable future.
Dear Dr. Heiler,
The new files contain the same information as the originals, just in a different format. If the original formats work for you, there is no reason to adjust your process. All formats will continue to be updated daily.
Please let me know if you have any other questions.
Taxonomy Data Support Specialist
There seem to be some missing taxa in the rankedlineage file from the new taxdump.
For instance, the only taxa in the cyanobacteria phylum appear to be a couple of Gloeobacter, whereas there are quite a few other diverse cyanobacteria with complete assemblies on NCBI. Is this an oversight? Or are they omitted for a particular reason?
Dear Dr. Cooley,
Thank you for your comment on the rankedlineage file.
I just checked the file, and see that there are 23,031 entries for Cyanobacteria.
If you are not seeing this, can you let me know how you are parsing the file?
Taxonomy Data Support Specialist
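For anyone who wants to check a count like this themselves, the following Python sketch tallies rows by the value in the phylum column. It is only an illustration under stated assumptions: the separator (tab, vertical bar, tab), the position of the phylum column (eighth field, following the rank order described in the post), and the function names are not taken from NCBI documentation beyond what the post describes. Rows with an empty phylum column are skipped, so the result may differ from a count made another way.

```python
from collections import Counter

PHYLUM_COLUMN = 7  # zero-based index; assumed from the rank order in the post


def split_dmp_row(line):
    """Split one .dmp row, assuming tab-bar-tab separators and a tab-bar terminator."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]


def count_by_phylum(lines):
    """Count rows per phylum name from an iterable of rankedlineage.dmp rows."""
    counts = Counter()
    for line in lines:
        fields = split_dmp_row(line)
        if len(fields) > PHYLUM_COLUMN and fields[PHYLUM_COLUMN]:
            counts[fields[PHYLUM_COLUMN]] += 1
    return counts


def count_file(path):
    """Count rows per phylum in a local copy of rankedlineage.dmp (hypothetical path)."""
    with open(path, encoding="utf-8") as handle:
        return count_by_phylum(handle)
```

With a local copy of the file, `count_file("rankedlineage.dmp")["Cyanobacteria"]` would give the number of rows whose phylum column reads exactly "Cyanobacteria". Splitting on a bare "|" or on whitespace instead of the full separator is one way a parser could end up reading the wrong column.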
Is there a document that describes what the different class names are?
(synonym, scientific name, blast name, genbank common name, in-part, authority, equivalent name, includes, common name, genbank synonym, acronym, genbank acronym)
In particular, it is unclear to me what blast name, in-part and includes are.
Thank you very much!
Thank you for making it easier to check the taxonomy of species with these new taxonomy files.
Nonetheless, I find the chosen format of the rankedlineage file fairly strange.
Firstly, I would have chosen the reverse order of ranks, starting with superkingdom and ending with species. This would prevent creating empty space.
Secondly, different separators. The ‘|’ denotation is non-standard and I would have gone with ‘\t’, ‘;’ or even ‘,’.
I hope you will take this criticism into account when creating new data structures in the future.
Thank you for your feedback! We will pass it along to our Taxonomy Team.