We are starting to add terms from the Gene Ontology (GO) Biological Process, Molecular Function, and Cellular Component ontologies to the genomes and proteins we annotate with the Prokaryotic Genome Annotation Pipeline (PGAP). Because GO is hierarchical, these annotations will make it easier to compare gene content across genomes at varying levels of specificity and will eventually allow GO term enrichment analysis. GO terms are now associated with coding sequence (CDS) features on newly submitted genomes (see Figure 1). They will progressively appear on genomes that are already in RefSeq as these are reannotated (about once a year). We expect all RefSeq genomes to have some GO terms by the spring of 2023.
Figure 1: Excerpt from the RefSeq record for NZ_CP091650.1, a Klebsiella pneumoniae genome, showing the annotation of the katG gene, with Molecular Function GO terms GO:0004096 and GO:0004601 and Biological Process GO term GO:0006979. In total, 41% of CDS features on this genome have at least one GO term.
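For readers who want to tally GO terms on a downloaded record themselves, here is a minimal sketch in Python. It assumes the terms appear on CDS features as /GO_function, /GO_process, and /GO_component qualifiers whose value starts with the GO ID; that layout, the hand-written sample excerpt, and the EXAMPLE_ locus tags are assumptions for illustration, not a copy of the record in Figure 1. Real flat files also wrap long qualifier values across lines, which this sketch tolerates only because it reads the ID from the first line of each qualifier.

```python
import re

# Hand-written excerpt for illustration only (not copied from a real record).
# The qualifier layout shown here is an assumption; check it against an
# actual RefSeq flat file before relying on it.
SAMPLE = '''\
     CDS             1..2181
                     /gene="katG"
                     /locus_tag="EXAMPLE_00001"
                     /GO_function="GO:0004096 - catalase activity [Evidence IEA]"
                     /GO_function="GO:0004601 - peroxidase activity [Evidence IEA]"
                     /GO_process="GO:0006979 - response to oxidative stress [Evidence IEA]"
                     /product="catalase/peroxidase HPI"
     CDS             2300..2900
                     /locus_tag="EXAMPLE_00002"
                     /product="hypothetical protein"
'''

# A feature line has its key starting in column 6; qualifier lines are
# indented further, so they do not match this pattern.
FEATURE_START = re.compile(r"^ {5}(\S+)\s+\S")
GO_QUALIFIER = re.compile(r'/GO_(function|process|component)="(GO:\d{7})')


def go_terms_per_cds(flatfile_text):
    """Return one dict per CDS feature, mapping GO aspect to a list of GO IDs."""
    features = []
    current = None
    for line in flatfile_text.splitlines():
        start = FEATURE_START.match(line)
        if start:
            # Only CDS features are collected; other feature types are skipped.
            current = {} if start.group(1) == "CDS" else None
            if current is not None:
                features.append(current)
            continue
        if current is None:
            continue
        hit = GO_QUALIFIER.search(line)
        if hit:
            current.setdefault(hit.group(1), []).append(hit.group(2))
    return features


def fraction_with_go(features):
    """Fraction of CDS features that carry at least one GO term."""
    if not features:
        return 0.0
    return sum(1 for terms in features if terms) / len(features)


features = go_terms_per_cds(SAMPLE)
print(features[0])
print(f"{fraction_with_go(features):.0%} of CDS features have a GO term")
```

Run on a whole genome record, the same two functions would give the kind of per-genome percentage quoted in the caption above.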
The GO terms are not assigned to each protein individually. Instead, they are propagated to CDS features and proteins from the Protein Family Models that provide the protein function (and sometimes Enzyme Commission (EC) numbers and gene symbols). The GO terms on Pfams (protein families) are derived from the pfam2go mappings, while those on TIGRFAMs were inherited from The Institute for Genomic Research (TIGR) and reviewed in the past six months, and those on NCBIFAMs and BlastRules were added manually by NCBI experts. We are working to add GO terms to more hidden Markov models (HMMs) and BlastRules, as well as to Conserved Domain Database (CDD) architectures, so that a larger proportion of the gene content of each genome can inherit GO terms.
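The propagation idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PGAP implementation: the MODEL_GO table, the propagate_go function, and the WP_EXAMPLE.1 accession are hypothetical names introduced here, and the only real values are the TIGR00198 terms and the WP_004180028.1 accession taken from Figure 2.

```python
# Hypothetical lookup table: family model accession -> GO IDs.
# The TIGR00198 entry repeats the terms shown in Figure 2; a real table
# would be built from the family model data, not written by hand.
MODEL_GO = {
    "TIGR00198": ("GO:0004096", "GO:0004601", "GO:0006979"),
}


def propagate_go(protein_to_model, model_go=MODEL_GO):
    """Give each protein the GO terms of the model it is named after.

    Proteins with no supporting model, or with a model that has no GO
    terms yet, end up with an empty list.
    """
    return {
        protein: list(model_go.get(model, ()))
        for protein, model in protein_to_model.items()
    }


naming = {
    "WP_004180028.1": "TIGR00198",
    "WP_EXAMPLE.1": None,  # made-up accession: protein with no supporting model
}
annotations = propagate_go(naming)
print(annotations)
```

The point of the design is that adding GO terms to one more model immediately covers every protein named after it, which is why extending the models raises the proportion of annotated gene content.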
We also added GO terms to 65 million non-redundant RefSeq proteins (with the WP_ prefix) that are named after a Protein Family Model with one or more GO terms. See the KatG protein found in many Klebsiella genomes in Figure 2.
Figure 2: Excerpt from the RefSeq record for KatG protein WP_004180028.1, with Molecular Function GO terms GO:0004096 and GO:0004601 and Biological Process GO term GO:0006979 on the protein feature. These terms were derived from HMM TIGR00198 after which the protein is named.
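The KatG terms are also a handy way to see why the GO hierarchy matters when comparing genomes. The sketch below, in Python, expands each term to include its ancestors before comparing two sets of terms. The IS_A table is a small hand-written fragment of the Molecular Function ontology and is an assumption for illustration: it is incomplete, and the relationships should be loaded from the current ontology release for any real analysis. The function names are likewise introduced here.

```python
# Hand-written, incomplete fragment of is_a relationships (child -> parents).
# Assumption for illustration; load the current ontology release for real work.
IS_A = {
    "GO:0004096": ["GO:0004601"],  # catalase activity -> peroxidase activity
    "GO:0004601": ["GO:0016209"],  # peroxidase activity -> antioxidant activity
    "GO:0016209": ["GO:0003674"],  # antioxidant activity -> molecular_function
}


def with_ancestors(term, is_a=IS_A):
    """Return the term together with every ancestor reachable via is_a."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, []))
    return seen


def expand(terms, is_a=IS_A):
    """Expand a collection of terms to include all of their ancestors."""
    expanded = set()
    for term in terms:
        expanded |= with_ancestors(term, is_a)
    return expanded


def shared_terms(terms_a, terms_b, is_a=IS_A):
    """Terms two genomes have in common once ancestors are taken into account."""
    return expand(terms_a, is_a) & expand(terms_b, is_a)


genome_a = {"GO:0004096"}  # annotated with the specific catalase term
genome_b = {"GO:0004601"}  # annotated only with the broader peroxidase term

print(genome_a & genome_b)                       # no overlap on exact terms
print(sorted(shared_terms(genome_a, genome_b)))  # overlap at broader levels
```

Comparing exact terms finds nothing in common between the two toy genomes, while the expanded comparison shows they share peroxidase activity and its ancestors.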
We hope these new GO terms are useful to our community. As always, we welcome your feedback!