Users' Guide
Bill Majoros (bmajoros@tigr.org)
The Institute for Genomic Research





Contents

Introduction
Installation
Usage
Output
Configuration




Introduction

This manual describes how to install and use the eukaryotic gene finder, GeneZilla.  If you have not yet trained the gene finder for your organism, you can find out how to do so by reading the Training Manual.



Installation

The GeneZilla distribution consists of a single *.tar.gz file, such as genezilla.tar.gz.  To unpack it on Linux, move the file into an empty directory and type:

gunzip genezilla.tar.gz
tar xvf genezilla.tar

Then, to compile the source code:

mkdir obj
make genezilla

If you also intend to train the gene finder (which is necessary unless you also downloaded model files for your organism), then you will have to compile the training programs:

make train-signal-sensor
make train-content-sensor
make mdd

Compilation problems are most commonly caused by an outdated compiler.  Please use GCC version 3.3.3 or greater.  Typing gcc -v should display the version number of your compiler.  If you are using 3.3.3 or greater and you still have difficulty, please contact us.



Usage

GeneZilla can be used to obtain a set of gene predictions via the following command:

GeneZilla  <*.iso>  <*.fasta>

The *.iso file is an isochore definition file, which is described in the Training Manual.  This file contains one entry per isochore, each pointing to a configuration file whose parameters you may wish to modify to obtain better performance from the gene finder.

The *.fasta file should contain a single sequence of manageable size.  If you provide a multi-FASTA file, GeneZilla will process only the first sequence, ignoring the rest.  The actual size limit for the sequence will depend on your system.  Several hundred kb should be fine.
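Because only the first sequence of a multi-FASTA file is processed, it can help to split such a file into single-sequence records before running the gene finder on each one.  The following is a minimal Python sketch of that idea; the function names split_fasta and format_fasta are hypothetical helpers written for this illustration and are not part of the GeneZilla distribution.

```python
def split_fasta(text):
    """Split multi-FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new defline closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Format one record as single-sequence FASTA text."""
    lines = [">" + header]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


example = ">contig1\nACGT\nACGT\n>contig2\nGGCC\n"
for header, seq in split_fasta(example):
    # Each record could be written to its own file and annotated separately.
    print(header, len(seq))
```

Each string returned by format_fasta can be written to its own *.fasta file, which is then passed to GeneZilla on its own.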



Output

The output format of GeneZilla is GFF.  This file format is specified at
<http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml>.

The format consists of one exon per line, with each line containing a fixed number of tab-separated fields.  Here is an example:

617264  GeneZilla  initial-exon   1432  1567  .  +  .  transcript_id=1
617264  GeneZilla  internal-exon  1590  1812  .  +  .  transcript_id=1
617264  GeneZilla  final-exon     2276  2571  .  +  .  transcript_id=1


The fields of the GFF records are:
substrate  program  exon-type  begin  end  score  strand  phase  ID

These are explained in detail below:
In GFF the begin coordinate is always less than the end coordinate, even on the reverse strand.  Also note that the coordinates are 1-based (so that the very first base on the substrate has coordinate 1) and are inclusive (so that the length of a feature is end-begin+1).
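As an illustration of the field layout and the coordinate convention, here is a small Python sketch that parses one output line and computes the exon length as end-begin+1.  The names Exon, parse_gff_line and exon_length are assumptions made for this example; the sketch also assumes exactly nine tab-separated fields per line, as in the example records above.

```python
from collections import namedtuple

# Field names follow the order listed in this manual.
Exon = namedtuple(
    "Exon",
    "substrate program exon_type begin end score strand phase transcript",
)


def parse_gff_line(line):
    """Parse one tab-separated GFF record into an Exon tuple."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    substrate, program, exon_type, begin, end, score, strand, phase, ident = fields
    return Exon(substrate, program, exon_type, int(begin), int(end),
                score, strand, phase, ident)


def exon_length(exon):
    """Coordinates are 1-based and inclusive, so length is end-begin+1."""
    return exon.end - exon.begin + 1


line = "617264\tGeneZilla\tinitial-exon\t1432\t1567\t.\t+\t.\ttranscript_id=1"
exon = parse_gff_line(line)
print(exon.exon_type, exon_length(exon))  # initial-exon 136
```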

Note that although GeneZilla does model UTRs internally, it does not predict them, so the exons listed in the output are meant to be coding exons, or CDSs.



Configuration

GeneZilla is highly reconfigurable.  There are many parameters which you can adjust in order to improve the performance of the gene finder.  These parameters are read from several files when GeneZilla executes:

File and description:

*.iso:  This file contains a table listing the *.cfg file for each isochore (by G+C%).

*.cfg:  This is the primary configuration file.  It contains most of the parameters that you might want to modify.  There is one *.cfg file for each isochore.

transition probabilities file:  The name of this file is specified in the "transition-probabilities" section of the *.cfg file.  It specifies the probabilities of transitioning between particular states in the GHMM.

topology definition file:  This file defines the model topology of the GHMM.  It specifies which states have transitions to each of the other states, which phases each signal can occur in, and several other pieces of information which are generally not very useful to modify.


The figure below illustrates the structure of the *.iso file.  After determining the G+C content of the sequence to be annotated, GeneZilla reads the appropriate configuration file (*.cfg):



Each *.cfg file in turn refers to several other files from which the gene finder loads parameters and statistical models.   Here is an example of a *.cfg file:


##############################################################
#
# basic configuration file for GeneZilla
#
#############################################################

donor-model = donors.model
donor-consensus = GT
acceptor-model = acceptors.model
acceptor-consensus = AG
start-codon-model = start-codons.model
start-codon-consensus = ATG
stop-codon-model = stop-codons.model
stop-codon-consensus = TAG|TGA|TAA
polya-model = polya.model
polya-consensus = AATAAA|ATTAAA
promoter-model = TATA.model
promoter-consensus = TATAAA|TATATA|CATAAA|CATATA
initial-exons = initial-exons.model
internal-exons = internal-exons.model
final-exons = final-exons.model
single-exons = single-exons.model
initial-exon-lengths = initial-exons.distr
internal-exon-lengths = internal-exons.distr
final-exon-lengths = final-exons.distr
single-exon-lengths = single-exons.distr
mean-intron-length = 1294
mean-intergenic-length = 10000
mean-5'-UTR-length = 769
mean-3'-UTR-length = 457
introns = introns.model
intergenic = intergenic.model
3'-UTR = three-prime-utr.model
5'-UTR = five-prime-utr.model
transition-probabilities = trans0-100.txt

Note that anything following a # is ignored, so you can comment out lines while you experiment with alternative settings for the parameters.
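If you want to inspect or compare configuration files from a script, the "key = value" layout with # comments is simple to read.  The Python sketch below is only an illustration under those two assumptions; parse_cfg is a hypothetical helper, not GeneZilla's own parser, and the real parser may treat edge cases differently.

```python
def parse_cfg(text):
    """Read "key = value" lines into a dict, ignoring anything after a #."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a key = value line: %r" % raw)
        params[key.strip()] = value.strip()
    return params


sample = """
# basic configuration file
donor-consensus = GT
stop-codon-consensus = TAG|TGA|TAA
mean-intron-length = 1294   # trailing comment
#mean-intergenic-length = 10000
"""
params = parse_cfg(sample)
print(params["stop-codon-consensus"].split("|"))
print(int(params["mean-intron-length"]))
```

The commented-out mean-intergenic-length line does not appear in the resulting dictionary, which mirrors the way commented lines are ignored.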

The *.model files contain the statistical models which are used in the states of the GHMM to score the various parts of a prospective gene model.  The best way to modify the *.model files is to re-train those models.  However, one simple modification which you can make to the model files directly is to alter the threshold value on the third line of the signal models (i.e., donors, acceptors, start-/stop-codons, promoters, poly-A signals).  Lowering this value generally results in more signals of that type being found by the gene finder, but can increase the false-positive rate.

The consensus tags, such as donor-consensus or acceptor-consensus, specify the sequence tags which are scored as possible signals when they are encountered in the substrate sequence.  If your organism uses non-canonical splice sites, you might try adding these into the list (separating consensus tags with a vertical bar (|)).  However, if these non-canonical consensus sequences occur with low frequency, then adding them may increase the false-positive rate.
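To get a feel for how many extra candidate sites a new consensus tag would introduce, you can count its occurrences in a sequence before adding it.  The Python sketch below only locates candidate positions for a bar-separated consensus list; it does not score them, and find_consensus_sites is a hypothetical helper written for this illustration rather than a description of GeneZilla's internal scanning.

```python
def find_consensus_sites(sequence, consensus_spec):
    """Return (1-based position, tag) for each occurrence of any tag.

    consensus_spec uses the same bar-separated form as the *.cfg file,
    for example "GT|GC".
    """
    tags = consensus_spec.split("|")
    hits = []
    for pos in range(len(sequence)):
        for tag in tags:
            if sequence.startswith(tag, pos):
                hits.append((pos + 1, tag))
    return hits


sequence = "AAGTCCGCAA"
canonical = find_consensus_sites(sequence, "GT")
extended = find_consensus_sites(sequence, "GT|GC")
print(len(canonical), len(extended))
```

A large jump in the number of candidates after adding a rare tag suggests that the change may cost more in false positives than it gains.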

The *.distr files contain exon length distributions.  These files are best changed via re-training.

The mean non-coding length parameters influence the scores of prospective non-coding regions based on their lengths.  These mean values are used to induce a geometric distribution, which is then used to assess the probability of a non-coding segment.
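To show how a mean length can induce a geometric distribution, here is a Python sketch using the textbook form in which lengths start at 1 and the per-base stopping probability is p = 1/mean.  The manual does not state the exact parameterization GeneZilla uses, so treat this form, and the helper name geometric_log_prob, as assumptions made for illustration.

```python
import math


def geometric_log_prob(length, mean_length):
    """Log-probability of a segment length under a geometric distribution.

    Assumes P(L = k) = (1 - p)**(k - 1) * p for k >= 1, with p = 1 / mean.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if mean_length <= 1:
        raise ValueError("mean_length must be greater than 1")
    p = 1.0 / mean_length
    return (length - 1) * math.log(1.0 - p) + math.log(p)


# Compare two candidate intron lengths using the example mean of 1294.
short_score = geometric_log_prob(200, 1294)
long_score = geometric_log_prob(5000, 1294)
print(short_score > long_score)
```

Under this form the probability decreases steadily with length, and a larger mean makes the decrease more gradual, so raising a mean length parameter penalizes long non-coding segments less.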

The transition probabilities file lists the probabilities of transition between given states of the GHMM.  By modifying these parameters you can influence how many exons and/or genes are predicted by the gene finder.