Wheat Genome Annotation
 
 

Annotation Methods

Annotation of Wheat Genomic Sequences

Selected BACs proceed through sequencing and closure, and are then ready for annotation. We believe annotation of individual genes is an essential part of the genome project. This permits gene discovery in a systematic, comprehensive and consistent manner. Gene finding and repeat annotation will be done in parallel to maximize identification of true genes and minimize mis-annotation of transposable elements as genes.

Steps Involved in Annotation

The BAC sequence and results of all analyses are stored in our central relational database (Sybase).

Orientation

All sequences are oriented from the SP6 end (base 1) to the T7 end of the vector if the orientation is provided in the GenBank record.
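The orientation step can be illustrated with a minimal sketch. It assumes the BAC sequence is held as a plain DNA string and that the orientation read from the GenBank record has been reduced to a hypothetical flag, `starts_at_sp6` (`None` when the record gives no orientation). The function name and the flag are illustrative only and are not taken from the project's pipeline.

```python
from typing import Optional

# Complement table for DNA, covering N and lowercase bases as well.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def orient_sp6_to_t7(sequence: str, starts_at_sp6: Optional[bool]) -> str:
    """Return the sequence with base 1 at the SP6 end of the vector.

    starts_at_sp6 is a hypothetical flag: True if the sequence already runs
    SP6 -> T7, False if it runs T7 -> SP6, None if the orientation is unknown.
    """
    if starts_at_sp6 is None or starts_at_sp6:
        # Already oriented, or no orientation available: leave unchanged.
        return sequence
    # Reverse-complement so that the SP6 end becomes base 1.
    return sequence.translate(_COMPLEMENT)[::-1]
```

A sequence with no recorded orientation is returned unchanged, which mirrors the condition that sequences are reoriented only when the record provides the orientation.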

Repetitive sequences

Wheat repetitive elements were identified and masked by RepeatMasker using several libraries, including RepBase, TREP (the Triticeae Repeat Sequence Database), and the TIGR Oryza Repeat Database.
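RepeatMasker produces the masked sequence itself, so the following is only a sketch of what masking means for the downstream gene finders. It assumes repeat hits are available as hypothetical 1-based, inclusive `(start, end)` coordinate pairs and replaces those bases with `N`.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Replace every base inside a repeat interval with mask_char.

    repeats is an iterable of (start, end) pairs, 1-based and inclusive
    (an assumed input format, not RepeatMasker's own output format).
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clamp to the sequence bounds and convert to 0-based indexing.
        for i in range(max(start, 1) - 1, min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)
```

Masked bases cannot contribute to a predicted exon, which is how masking helps keep transposable elements from being annotated as genes.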

Gene prediction programs

  • FGENESH (monocot)
  • Genscan (maize)
  • Genscan+ (Arabidopsis)
  • GlimmerHMM (rice)
  • tRNAscan-SE, to predict tRNA

Loci and gene model nomenclature

Genes, also referred to as loci or transcriptional units (TUs), are named using the BAC name and a gene number assigned in order along the oriented sequence. For example, on BAC clone 27H32, the first gene, located at bases 10 to 1247, is 27H32.t00001; the second gene, located at bases 1568 to 2700, is 27H32.t00002; and so on. Gene models are named with an "m" in place of the "t" to distinguish them from TUs/loci. To provide a stable identifier for future updates of the annotation, a reduced gene/locus/TU identifier can be used (27H32.1, 27H32.2, etc.).

Example:
Stable Identifier: 27H32.1
Locus or TU: 27H32.t00001
Gene model: 27H32.m00001
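The naming scheme can be sketched in a few lines. The five-digit zero padding and the numbering by ascending start coordinate are inferred from the examples above rather than stated as rules, and the function names are hypothetical.

```python
def gene_identifiers(bac_name, gene_number):
    """Build the three identifiers for the Nth gene on a BAC.

    Five-digit zero padding is inferred from the 27H32 examples.
    """
    if gene_number < 1:
        raise ValueError("gene_number must be 1 or greater")
    return {
        "stable_id": f"{bac_name}.{gene_number}",
        "locus": f"{bac_name}.t{gene_number:05d}",
        "model": f"{bac_name}.m{gene_number:05d}",
    }


def number_genes(bac_name, gene_coords):
    """Number genes by ascending start coordinate (assumed ordering rule).

    gene_coords is a list of (start, end) pairs on the oriented sequence.
    Returns a list of (start, end, identifiers) tuples.
    """
    ordered = sorted(gene_coords)
    return [
        (start, end, gene_identifiers(bac_name, n))
        for n, (start, end) in enumerate(ordered, start=1)
    ]
```

For instance, `number_genes("27H32", [(1568, 2700), (10, 1247)])` assigns 27H32.t00001 to the gene at bases 10 to 1247 and 27H32.t00002 to the gene at bases 1568 to 2700.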

Functional assignment

Putative functions have been assigned to the genes via a combination of BLASTP matches to a non-redundant amino acid database, Pfam trusted cutoff scores, and searches of transcript evidence (ESTs and full-length cDNAs). A table summarizing the putative function assignment guidelines is provided below.

| Putative Function | Match in Non-redundant amino acid (nraa) db | Pfam database Trusted Cutoff Score | Wheat ESTs/FL-cDNA alignment | Sample of annotation |
| --- | --- | --- | --- | --- |
| Known | >90-100% ID, >90-100% length | May be above trusted cutoff, not essential | Optional for annotation | Aquaporin |
| Putative | >45% ID, >50% length | May be above trusted cutoff, not essential | Optional for annotation | chitinase, putative |
| XX-domain containing protein | N/A | Above trusted cutoff | Optional for annotation | WD-domain containing protein |
| Expressed | No similarity detected in nraa, or similarity to protein in nraa is <45% ID and/or <50% coverage, or similarity is to 1) an expressed protein, or 2) a protein with no known | Below trusted cutoff | >95% ID, >70% length of EST | Expressed protein |
| Conserved Hypothetical Protein | >45% ID, >50% length to a protein annotated as hypothetical protein | Below trusted cutoff | <95% ID, <70% length of EST | Conserved hypothetical protein |
| Hypothetical Protein | No match to any db entry >45% ID, >50% length | Below trusted cutoff | <95% ID, <70% length of EST | Hypothetical protein |
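The guidelines above can be read as a decision procedure. The sketch below is one possible reading, not the project's actual implementation: the `Evidence` fields are hypothetical names, the order in which the rules are tested is an assumption (the table does not say how to resolve conflicting evidence), and the case where the best hit is itself an "expressed protein" is simplified away.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Evidence for one gene model; all field names are illustrative."""
    nraa_identity: float = 0.0        # % identity of best nraa BLASTP hit
    nraa_coverage: float = 0.0        # % of length covered by that hit
    hit_is_hypothetical: bool = False  # best hit annotated "hypothetical protein"
    pfam_above_trusted_cutoff: bool = False
    est_identity: float = 0.0         # % identity of best EST/FL-cDNA alignment
    est_coverage: float = 0.0         # % of EST length aligned


def assign_category(ev):
    """Map evidence to a putative-function category using the table thresholds."""
    good_hit = ev.nraa_identity > 45 and ev.nraa_coverage > 50
    est_support = ev.est_identity > 95 and ev.est_coverage > 70

    if good_hit and not ev.hit_is_hypothetical:
        if ev.nraa_identity > 90 and ev.nraa_coverage > 90:
            return "Known"
        return "Putative"
    if ev.pfam_above_trusted_cutoff:
        return "Domain-containing protein"
    if est_support:
        return "Expressed"
    if good_hit:  # the hit is to a protein annotated as hypothetical
        return "Conserved hypothetical protein"
    return "Hypothetical protein"
```

For example, `assign_category(Evidence(nraa_identity=60, nraa_coverage=70))` yields "Putative", while a model with no database match, no Pfam domain and no transcript support falls through to "Hypothetical protein".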

Pseudogenes

Pseudogenes were defined as loci that show evidence of transcription yet have no clear ORF.

The sequences of the annotated genes, along with supporting evidence, can also be found on our web site.

Software Links

  • FGENESH
  • Genscan (Chris Burge, Massachusetts Institute of Technology)
  • Genscan+ (Chris Burge, Massachusetts Institute of Technology)
  • GlimmerHMM (Salzberg, Pertea, et al., The Institute for Genomic Research)
  • tRNAscan-SE (Sean Eddy, Dept. of Genetics, Washington U. School of Medicine)
  • dds/gap2, dps/nap (Xiaoqiu Huang, Dept of Computer Science, Michigan Technological University)
  • RepeatMasker2 (A.F.A. Smit & P. Green, University of Washington)