JCVI Annotation Service
In an effort to bring the tools of modern genomic science to researchers with prokaryotic genome sequences in need of annotation, JCVI has made available the JCVI Annotation Service. Anyone with a prokaryotic genome sequence may submit it to the JCVI Annotation Service free of charge. The service has two components:
- Production of output from JCVI's automated annotation pipeline - includes search results and automatically generated annotation in a MySQL database and associated files
- The manual annotation tool Manatee - an open-source, web-based interface for interacting with and editing annotation data.
As sequencing costs drop and companies begin offering on-demand sequencing of prokaryotic genomes, it has become increasingly easy for researchers to obtain the genome sequence of their organism of interest. Many of these researchers will wish to manually annotate their genomes. However, manual annotation is a large and challenging task that requires significant infrastructure and tools. Researchers who found it easy to acquire a genome sequence now find themselves in need of annotation infrastructure and expertise. The JCVI Annotation Service began in 2002 and was initially funded by a 3-year grant from the Department of Energy. It gives researchers access to the infrastructure of JCVI's annotation system and saves them the time and expense of reproducing similar systems at their own sites. This is particularly important for groups that have limited resources or modest annotation goals (1 or 2 genomes).
An additional resource offered by JCVI is the Prokaryotic Annotation and Analysis course. Although not required, we highly recommend that researchers who submit genomes to the JCVI Annotation Service attend this course to acquaint themselves with the process of prokaryotic annotation as done at JCVI. This course is offered 4 times a year and gives detailed instruction on JCVI's annotation pipeline, JCVI's manual annotation tool Manatee, and the use of JCVI's Comprehensive Microbial Resource (CMR).
To have JCVI Annotation Service data ready in time for the Prokaryotic Annotation and Analysis course, you must submit the genome to JCVI at least one month before the start of the course. Submitting a JCVI Annotation Service genome is not required for the class. If a genome is submitted too late, or if no genome is submitted, you will use a test database during the class.
How to submit your genome to the JCVI Annotation Service
How to reference the JCVI Annotation Service
Anyone who uses output from the JCVI Annotation Service and/or the Manatee annotation tool for a publication should state so in the materials and methods section of their paper and also acknowledge JCVI for whatever tools/services were used.
A sample paragraph for materials and methods
The DNA sequence was submitted to the JCVI Annotation Service, where it was run through JCVI's prokaryotic annotation pipeline. The pipeline includes gene finding with Glimmer, BLAST-Extend-Repraze (BER) searches, HMM searches, TMHMM searches, SignalP predictions, and automatic annotations from AutoAnnotate. All of this information is stored in a MySQL database and associated files, which were downloaded to our site. The manual annotation tool Manatee was downloaded from SourceForge (manatee.sourceforge.net) and used to manually review the output from the prokaryotic pipeline of the JCVI Annotation Service.
A sample acknowledgment
We'd like to thank JCVI for providing the JCVI Annotation Service which provided us with automatic annotation data and the manual annotation tool Manatee.
Description of the elements of the JCVI Annotation Service
The first major analysis step after a genome is sequenced is to identify the genes. The Glimmer system (Salzberg et al., 1998; Delcher et al., 1999) is used to find genes in bacterial, archaeal, or viral genomes. Glimmer relies on nothing other than the DNA sequence itself since it can be trained from raw sequence alone. In tests on numerous completely sequenced bacterial genomes, the system consistently finds over 99% of the genes in a fully automated fashion. The Glimmer system is freely available to nonprofit research institutions, and has been distributed to hundreds of sites worldwide (http://www.tigr.org/software/genefinding.shtml).
Once the ORFs that are candidate genes have been chosen by Glimmer, several types of searches are performed on the set of predicted proteins they encode. Each protein is searched against an internal non-identical amino acid database (niaa) made up of all proteins available from GenBank (http://www.ncbi.nlm.nih.gov), PIR (http://pir.georgetown.edu), SWISS-PROT (http://www.expasy.ch/sprot) and JCVI's CMR database, the Omniome (http://www.jcvi.org/cms/research/projects/cmr). The search algorithm employed for these searches is BLAST-Extend-Repraze (BER). This program first does a BLAST search (Altschul, et al., 1990) (http://blast.wustl.edu) of each protein against niaa and stores all significant matches in a mini-database. Then a modified Smith-Waterman alignment (Smith and Waterman, 1981) is performed on the protein against the mini-database of BLAST hits. In order to identify potential frameshifts or point mutations in the sequence, the gene is extended 300 nucleotides upstream and downstream of the predicted coding region. If significant homology to a match protein exists and extends into a different frame from that predicted, or extends through a stop codon, the program will continue the alignment past the boundaries of the predicted coding region. The results can be viewed both as pairwise and as multiple alignments of the top scoring matches.
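The 300-nucleotide extension described above can be illustrated with a small coordinate calculation. This is only a sketch and not BER's actual code: the function name `extend_region`, the 1-based inclusive coordinates with start <= end, and the clamping at the ends of the sequence are all assumptions made for illustration (a circular genome might instead wrap around).

```python
def extend_region(start, end, seq_len, flank=300):
    """Extend a predicted coding region by `flank` nucleotides on each side.

    Coordinates are assumed to be 1-based and inclusive with start <= end.
    The extended region is clamped to the sequence boundaries (an assumption;
    the source does not say how sequence ends are handled).
    """
    if not (1 <= start <= end <= seq_len):
        raise ValueError("expected 1 <= start <= end <= seq_len")
    new_start = max(1, start - flank)      # do not run off the 5' end
    new_end = min(seq_len, end + flank)    # do not run off the 3' end
    return new_start, new_end


if __name__ == "__main__":
    # A gene at 1000..1900 on a 5000-nt sequence gains 300 nt on each side.
    print(extend_region(1000, 1900, 5000))
    # Near the sequence ends, the extension is truncated.
    print(extend_region(100, 400, 600))
```

Aligning against the extended region rather than the predicted ORF alone is what lets an alignment continue into another frame or past a stop codon.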
All of the proteins from the genome sequence are also searched against hidden Markov models (HMMs) using the HMMER package (Eddy, 1998). Two sets of HMMs are used: the Pfam HMMs (Bateman, et al., 2000), and TIGRFAMs (Haft, et al., 2001). HMMs are built from highly curated multiple alignments of proteins thought to share the same function or to be members of the same family. They are useful for annotation since they are generally more sensitive and accurate than pairwise alignments. HMM searches result in a score measuring the probability that the query protein belongs to the group of proteins used to build the model. Each HMM has an associated cutoff score above which matches are known to be significant.
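The per-model cutoff can be pictured as a simple filter over search hits. The sketch below is not part of HMMER or the JCVI pipeline: the `HmmHit` record, the `significant_hits` function, and the model names `HMM_A` and `HMM_B` are hypothetical, and treating a score equal to the cutoff as passing is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HmmHit:
    protein_id: str   # query protein
    model_id: str     # HMM that matched (hypothetical identifiers here)
    score: float      # score reported by the HMM search


def significant_hits(hits, cutoffs):
    """Keep hits whose score reaches the cutoff of the model they matched.

    `cutoffs` maps model_id -> cutoff score. Hits to models with no known
    cutoff are dropped rather than guessed at.
    """
    return [
        hit for hit in hits
        if hit.model_id in cutoffs and hit.score >= cutoffs[hit.model_id]
    ]


if __name__ == "__main__":
    cutoffs = {"HMM_A": 100.0, "HMM_B": 55.0}
    hits = [
        HmmHit("p1", "HMM_A", 120.0),  # above the HMM_A cutoff
        HmmHit("p2", "HMM_A", 30.0),   # below the HMM_A cutoff
        HmmHit("p3", "HMM_B", 55.0),   # exactly at the HMM_B cutoff
    ]
    for hit in significant_hits(hits, cutoffs):
        print(hit.protein_id, hit.model_id, hit.score)
```

Because each model carries its own cutoff, the same raw score can be significant for one family and not for another.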
Several additional sequence-based searches are also performed, including PROSITE (Falquet, et al., 2002), TMHMM (Krogh, et al., 2001), SignalP (Bendtsen, et al., 2004), COGs (Tatusov, et al., 2003), and paralogous families.
We have developed a computer program, AutoAnnotate, that analyzes the BER and HMM search results and assigns common name, gene symbol, Enzyme Commission (EC) number (http://www.expasy.ch/enzyme), TIGR role, and Gene Ontology (GO) terms automatically when possible. The program makes decisions based on a ranked list of evidence types, the best being equivalog level HMMs (HMMs which are built for families of orthologs which have conserved function) and high quality (in general, at least 35% identity over at least 80% of the protein) BER matches to experimentally characterized proteins from another species. AutoAnnotate will annotate from the best piece of evidence available to it. The parameters in AutoAnnotate are set with the assumption that manual annotation will follow, and are therefore not meant as an endpoint automatic annotation result.
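The idea of annotating from the best available evidence can be sketched as follows. This is not AutoAnnotate's code: the names `EVIDENCE_RANK`, `Evidence`, `is_high_quality_ber`, and `best_evidence` are hypothetical, the relative order of the two top evidence types and the catch-all "other" tier are assumptions (the text names the strongest types but does not give the full ranked list), and the 35% / 80% thresholds are the general guideline quoted above rather than fixed rules.

```python
from dataclasses import dataclass

# Hypothetical ranking: a lower number means stronger evidence.
EVIDENCE_RANK = {"equivalog_hmm": 0, "characterized_ber": 1, "other": 2}


@dataclass(frozen=True)
class Evidence:
    kind: str          # one of the keys in EVIDENCE_RANK
    common_name: str   # name this evidence would assign to the protein


def is_high_quality_ber(percent_identity, aligned_length, protein_length,
                        min_identity=35.0, min_coverage=0.80):
    """Apply the general guideline: >= 35% identity over >= 80% of the protein."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    coverage = aligned_length / protein_length
    return percent_identity >= min_identity and coverage >= min_coverage


def best_evidence(evidence):
    """Return the strongest piece of evidence, or None if there is none."""
    if not evidence:
        return None
    return min(evidence, key=lambda item: EVIDENCE_RANK[item.kind])


if __name__ == "__main__":
    candidates = [Evidence("other", "hypothetical protein")]
    # A BER match with 40% identity over 270 of 300 residues passes the guideline.
    if is_high_quality_ber(40.0, 270, 300):
        candidates.append(Evidence("characterized_ber", "DNA gyrase subunit A"))
    print(best_evidence(candidates).common_name)
```

Keeping the ranking in one table makes the thresholds and ordering easy to adjust, which fits the expectation that manual annotation will follow the automatic pass.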
Data Supplied to the JCVI Annotation Service User
The automated annotation produced from the sequence sent to JCVI is returned to the JCVI Annotation Service user. This includes: coordinates of ORFs and RNAs; common name, gene symbol, EC numbers, TIGR roles, and GO terms for proteins; and underlying search results including BLAST-Extend-Repraze, HMM, SignalP, TMHMM, COGs, and paralogous families. This data is available as a MySQL database and associated files, which can then be used with JCVI's manual annotation tool Manatee. In addition to the MySQL database, we also provide a GenBank-style file and a tab-delimited file of the JCVI Annotation Service data.
Manatee is a freely available, open-source tool which can be found at http://manatee.sourceforge.net. Manatee allows viewing of search data and alteration of annotation in a user-friendly, browser-based format. Manatee has been in use by JCVI annotators, collaborators, and JCVI Annotation Service users for several years now. It pulls information from an underlying database and associated files, displaying search results and current annotations. Annotators can then view the available evidence supporting the annotations and modify the annotations as needed. Manatee then stores the updated annotations back in the underlying database. Manatee supports the capture and curation of several types of annotation information including gene name, gene symbol, EC number (for enzymes), comments, TIGR role categories, and Gene Ontology (GO) terms. Manatee has a built-in GO ontology and annotation viewer which is based on information downloaded daily from the GO web site. Manatee has several features which promote annotation efficiency, including one-click fill-in of many fields, GO term suggestions, and built-in annotation documentation. In addition, Manatee has an integrated gene context viewer and gene model editing tool. This Genome Viewer allows annotators to view a gene of interest along with its neighbors in a color-coded graphical interface, allowing the easy identification of putative operons and other genomic areas of interest. Annotators can adjust the coordinates of predicted genes and curate start codons. Finally, and very importantly, Manatee allows the easy storage of the underlying evidence used to make annotations.
Altschul S., et al. Basic local alignment search tool. J. Mol. Biol., 215: 403-410 (1990).
Bateman A., et al. The Pfam protein families database. Nucleic Acids Res. 28(1): 263-266 (2000).
Bendtsen J.D., Nielsen H., von Heijne G., Brunak S. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol., 340: 783-795 (2004).
Delcher A.L., et al. Improved Microbial Gene Identification with Glimmer. Nucleic Acids Res., 27(23): 4636-4641 (1999).
Eddy S. Profile hidden Markov models. Bioinformatics, 14(9):755-763 (1998).
Falquet L., Pagni M., Bucher P., Hulo N., Sigrist C.J., Hofmann K., Bairoch A. The PROSITE database, its status in 2002. Nucleic Acids Res., 30(1): 235-238 (2002).
Haft D., et al. TIGRFAMs: A protein family resource for the functional identification of proteins. Nucleic Acids Res. 29(1): 41-3 (2001).
Krogh A., Larsson B., von Heijne G., Sonnhammer E.L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol., 305(3): 567-580 (2001).
Salzberg S., et al. Microbial Gene Identification using Interpolated Markov Models. Nucleic Acids Res., 26(2): 544-548 (1998).
Smith T.F. and M. Waterman. Identification of common molecular subsequences. J. Mol. Biol. 147(1): 195-197 (1981).
Tatusov R.L., Fedorova N.D., Jackson J.D., Jacobs A.R., Kiryutin B., Koonin E.V., Krylov D.M., Mazumder R., Mekhedov S.L., Nikolskaya A.N., Rao B.S., Smirnov S., Sverdlov A.V., Vasudevan S., Wolf Y.I., Yin J.J., Natale D.A. The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4: 41 (2003).
Important Notice to JCVI Annotation Service Users
We are pleased to announce that what was one automated annotation service has now become two. This will allow us to serve twice as many users.
What was previously the TIGR/JCVI Annotation Engine service has now branched into two new services: one offered by the J. Craig Venter Institute (JCVI), called the JCVI Annotation Service, and the other offered by the Institute for Genome Sciences (IGS) at the University of Maryland School of Medicine, called the IGS Annotation Engine.
The managers of both services are in active collaboration with one another. We encourage you to explore the web sites of the two services to learn more about the two great choices now available.