We have re-annotated all RefSeq genomes for Escherichia coli, Mycobacterium tuberculosis, Bacillus subtilis, Acinetobacter pittii, and Campylobacter jejuni using the most recent release of PGAP. You will find that more genes now have gene symbols (e.g. recA). Your feedback indicated that the lack of symbols was an impediment to comparative analysis, so we hope that this improvement will help.
We re-annotated 25,619 genomes for E. coli, 470 for B. subtilis, 6,828 for M. tuberculosis, 316 for A. pittii, and 1,829 for C. jejuni. On average, the number of genes with symbols per genome increased by 30% in E. coli, 110% in B. subtilis, 57% in M. tuberculosis, 94% in A. pittii, and 62% in C. jejuni (see Figure 1). After re-annotation, on average, 73% of PGAP-annotated genes in E. coli and 79% in B. subtilis have symbols (35% in M. tuberculosis, 40% in A. pittii, and 46% in C. jejuni). We assigned symbols by identifying orthologs between each genome and the reference assembly for its species, then transferring the symbols from the reference genes to their orthologs in the annotated genome.
Figure 1: Average and standard deviation of the number of genes annotated with symbols per genome, in the previous (blue) and current (orange) annotations.
We selected these five species because each has a reference genome annotated with gene symbols commonly cited in the literature. We believe propagating these symbols across these genomes will have the most benefit. Note that only PGAP-annotated genes with a defined function (i.e., not annotated as "hypothetical protein") receive a gene symbol from their ortholog in the reference genome. The reference assemblies from which the extra symbols originated are: Escherichia coli str. K-12 substr. MG1655, Bacillus subtilis subsp. subtilis str. 168, Campylobacter jejuni subsp. jejuni NCTC 11168, Mycobacterium tuberculosis H37Rv, and Acinetobacter pittii PHEA-2.
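To make the propagation rule concrete, here is a minimal Python sketch of the transfer step described above. It is an illustration only, not PGAP's implementation. It assumes that orthologs have already been computed and are available as a simple mapping from locus tag to reference locus tag. The `Gene` class, the `propagate_symbols` function, the locus tags, and the symbol `yzzA` are all hypothetical names chosen for the example. The sketch also assumes that a gene which already has a symbol keeps it, which the text above does not specify.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional

HYPOTHETICAL = "hypothetical protein"


@dataclass(frozen=True)
class Gene:
    locus_tag: str                 # identifier in the annotated genome (made up here)
    product: str                   # functional annotation of the gene product
    symbol: Optional[str] = None   # gene symbol, if one is assigned


def propagate_symbols(
    genes: List[Gene],
    orthologs: Dict[str, str],          # annotated locus tag -> reference locus tag
    reference_symbols: Dict[str, str],  # reference locus tag -> gene symbol
) -> List[Gene]:
    """Return copies of `genes` with symbols transferred from reference orthologs."""
    updated = []
    for gene in genes:
        ref_tag = orthologs.get(gene.locus_tag)
        ref_symbol = reference_symbols.get(ref_tag) if ref_tag else None
        has_function = gene.product.strip().lower() != HYPOTHETICAL
        # Transfer only to genes with a defined function; keeping an existing
        # symbol is an assumption of this sketch.
        if gene.symbol is None and has_function and ref_symbol:
            gene = replace(gene, symbol=ref_symbol)
        updated.append(gene)
    return updated


# Toy input: three genes, two of which have an ortholog in the reference.
genes = [
    Gene("G_0001", "recombinase RecA"),
    Gene("G_0002", "hypothetical protein"),
    Gene("G_0003", "DNA gyrase subunit A"),
]
orthologs = {"G_0001": "REF_0001", "G_0002": "REF_0002"}
reference_symbols = {"REF_0001": "recA", "REF_0002": "yzzA"}

annotated = propagate_symbols(genes, orthologs, reference_symbols)
for g in annotated:
    print(g.locus_tag, g.symbol)
```

In this toy run only the first gene receives a symbol. The second has an ortholog but is a hypothetical protein, and the third has a defined function but no ortholog in the reference.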