Evaluation of Different Gene Prediction Tools in Coccidioides immitis

Kirkland TN, Beyhan S, Stajich JE

Gene prediction is required to obtain optimal biologically meaningful information from genomic sequences, but automated gene prediction software is imperfect. In this study, we compare the original annotation of the RS genome (the reference strain of Coccidioides immitis) to annotations produced by the Funannotate and Augustus gene prediction pipelines. A total of 25% of the originally predicted genes (denoted CIMG) were not found in either the Funannotate or Augustus predictions. A comparison of Funannotate and Augustus predictions also found overlapping but not identical sets of genes. The predicted genes found only in the original annotation (referred to as CIMG-unique) were less likely to have a meaningful functional annotation and had fewer orthologs and homologs in other fungi than the full set of CIMG genes predicted by the original annotation. The CIMG-unique genes were also more likely to be lineage-specific and poorly expressed. In addition, the CIMG-unique genes were found in clusters and tended to be more frequently associated with transposable elements than all CIMG-predicted genes. The CIMG-unique genes were more likely to have experimentally determined transcription start sites that were further away from the originally predicted transcription start sites, and experimentally determined initial transcription was less likely to result in stable CIMG-unique transcripts. A sample of CIMG-unique genes that were relatively well expressed and differentially expressed in mycelia and spherules was inspected in a genome browser, and the structure of only about half of them was found to be supported by RNA-seq data. These data suggest that some of the CIMG-unique genes are not authentic gene predictions. Genes that were predicted only by the Funannotate pipeline were also less likely to have a meaningful functional annotation, were shorter, and were less well expressed than all the genes predicted by Funannotate.
C. immitis genes predicted by more than one annotation are more likely to have predicted functions, many orthologs and homologs, and be well expressed. Lineage-specific genes are relatively uncommon in this group. These data emphasize the importance and limitations of gene prediction software and suggest that improvements to the annotation of the C. immitis genome should be considered.
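
The notion of a gene that is "unique" to one annotation can be illustrated with a minimal sketch. The abstract does not state how predictions from different pipelines were matched, so the criterion used here (same contig and strand, with the shared span covering at least half of the shorter gene) is an assumption chosen only for illustration. The `Gene` record, the `overlap_fraction` and `unique_genes` helpers, the `min_overlap` threshold, and all gene identifiers and coordinates below are hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    """A predicted gene locus (hypothetical record; coordinates are 1-based, inclusive)."""
    gene_id: str
    contig: str
    start: int
    end: int
    strand: str


def overlap_fraction(a: Gene, b: Gene) -> float:
    """Fraction of the shorter gene covered by the span shared between a and b."""
    if a.contig != b.contig or a.strand != b.strand:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a.end - a.start + 1, b.end - b.start + 1)
    return shared / shorter


def unique_genes(reference: List[Gene], others: List[Gene],
                 min_overlap: float = 0.5) -> List[Gene]:
    """Genes in `reference` with no sufficiently overlapping gene in `others`.

    The 0.5 default is an assumed threshold, not a value from the study.
    """
    return [
        g for g in reference
        if not any(overlap_fraction(g, o) >= min_overlap for o in others)
    ]


# Toy data: identifiers and coordinates are invented for illustration only.
original = [
    Gene("CIMG_A", "sc1", 100, 900, "+"),
    Gene("CIMG_B", "sc1", 2000, 2600, "-"),
    Gene("CIMG_C", "sc2", 50, 400, "+"),
]
funannotate = [Gene("FUN_1", "sc1", 120, 880, "+")]
augustus = [
    Gene("AUG_1", "sc1", 100, 900, "+"),
    Gene("AUG_2", "sc2", 60, 410, "+"),
]

# Genes present only in the original annotation under the assumed criterion.
unique_ids = [g.gene_id for g in unique_genes(original, funannotate + augustus)]
```

A pairwise scan like this is quadratic in the number of genes; for whole-genome annotations an interval index per contig would be the more practical choice, and stricter criteria (for example, exon-level agreement) would give a smaller set of shared genes.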