GeneZilla Users' Manual

Users' Guide

Bill Majoros (bmajoros@tigr.org)
The Institute for Genomic Research

Contents

Introduction
Installation
Usage
Output
Configuration

Introduction

This manual describes how to install and use the eukaryotic gene finder, GeneZilla. If you have not yet trained the gene finder for your organism, you can find out how to do so by reading the Training Manual.

Installation

The GeneZilla distribution consists of a single *.tar.gz file, such as genezilla.tar.gz. This can be unzipped and untarred in Linux by moving the file into an empty directory and typing:

gunzip genezilla.tar.gz
tar xvf genezilla.tar

Then, to compile the source code:

mkdir obj
make genezilla

If you also intend to train the gene finder (which is necessary unless you also downloaded model files for your organism), then you will have to compile the training programs:

make train-signal-sensor
make train-content-sensor
make mdd

Problems are most commonly caused by the use of an outdated compiler. Please use GCC version 3.3.3 or greater. Typing gcc -v should display the version number of your compiler. If you are using 3.3.3 and you still have difficulty, please contact us.

Usage

GeneZilla can be used to obtain a set of gene predictions via the following command:

GeneZilla <*.iso> <*.fasta>

The *.iso file is an isochore definition file, which is described in the Training Manual. This file contains one entry per isochore, each pointing to a configuration file whose parameters you may wish to modify to obtain better performance from the gene finder.

The *.fasta file should contain a single sequence of manageable size. If you provide a multi-FASTA file, GeneZilla will process only the first sequence, ignoring the rest. The actual size limit for the sequence will depend on your system. Several hundred kb should be fine.
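
Because only the first sequence of a multi-FASTA file is processed, one workaround is to split such a file into single-sequence files before running the gene finder. The following Python sketch is not part of the GeneZilla distribution; the function names and the seqN.fasta output naming are assumptions made for illustration.

```python
from pathlib import Path


def split_fasta(text):
    """Yield (header, sequence) pairs from multi-FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new defline closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_single_fastas(text, out_dir, width=60):
    """Write each record to its own file (hypothetical naming: seqN.fasta)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (header, seq) in enumerate(split_fasta(text), start=1):
        path = out_dir / f"seq{index}.fasta"
        with open(path, "w") as handle:
            handle.write(f">{header}\n")
            # Wrap the sequence at a fixed line width.
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
        paths.append(path)
    return paths
```

Each resulting file can then be passed to GeneZilla as the <*.fasta> argument in a separate run.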

Output

The output format of GeneZilla is GFF. This file format is specified at

<http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml>.

The format consists of one exon per line, with each line containing a fixed number of tab-separated fields. Here is an example:

617264 GeneZilla initial-exon 1432 1567 . + . transcript_id=1
617264 GeneZilla internal-exon 1590 1812 . + . transcript_id=1
617264 GeneZilla final-exon 2276 2571 . + . transcript_id=1

The fields of the GFF records are:
substrate program exon-type begin end score strand phase ID

These are explained in detail below:

substrate : contig or BAC identifier
program : GeneZilla
exon-type : initial-exon, internal-exon, final-exon, or single-exon
begin : 1-based coordinate of the leftmost base in the feature, so that a feature beginning at the first base would have a begin coordinate of 1
end : 1-based coordinate of the rightmost base in the feature
score : log of the ratio of the coding model score to the non-coding model score (i.e., a score of zero is perfectly neutral; exons with negative scores are suspect, while those with positive scores have support)
strand : + or -
phase : phase of the most 5-prime base of the exon. Phase 0 means that the exon begins with a complete codon, whereas phase 1 means that the first two bases (5-prime) of the exon constitute the second and third bases of an interrupted codon. Phase 2 means that the first base of the exon is the last base of an interrupted codon.
ID : transcript identifier, for grouping exons into a transcript
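
Under the phase definition given above, the phase of each exon follows from the phase and length of the exon before it in the transcript. The Python sketch below illustrates that arithmetic; the function name and the exon lengths are made up for the example and do not come from GeneZilla output.

```python
def exon_phases(exon_lengths, first_phase=0):
    """Return the phase of the most 5-prime base of each exon.

    exon_lengths must be given in 5-prime to 3-prime transcript order,
    which makes the calculation independent of strand.
    """
    phases = []
    phase = first_phase
    for length in exon_lengths:
        phases.append(phase)
        # Codon position of the first base of the next exon.
        phase = (phase + length) % 3
    return phases


# Hypothetical exon lengths: 136 + 224 + 300 = 660 coding bases.
print(exon_phases([136, 224, 300]))  # prints [0, 1, 0]
```

The first exon contributes 45 complete codons plus one extra base, so the second exon starts with the second and third bases of an interrupted codon (phase 1).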

In GFF the begin coordinate is never greater than the end coordinate, even on the reverse strand. Also note that the coordinates are 1-based (so that the very first base on the substrate has coordinate 1) and are inclusive (so that the length of a feature is end-begin+1).
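
The field layout and the length rule can be illustrated with a small parser. This is a minimal Python sketch rather than a tool shipped with GeneZilla: the Exon tuple and function names are assumptions, and a "." in the score or phase field is treated here as a missing value.

```python
from collections import namedtuple

Exon = namedtuple(
    "Exon",
    "substrate program exon_type begin end score strand phase transcript_id",
)


def parse_gff_line(line):
    """Parse one tab-separated GFF record into an Exon tuple."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    substrate, program, exon_type, begin, end, score, strand, phase, ident = fields
    return Exon(
        substrate,
        program,
        exon_type,
        int(begin),
        int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        # "transcript_id=1" -> "1"; fall back to the raw field otherwise.
        ident.partition("=")[2] or ident,
    )


def feature_length(exon):
    """Coordinates are 1-based and inclusive, hence the +1."""
    return exon.end - exon.begin + 1


line = "\t".join(
    ["617264", "GeneZilla", "initial-exon", "1432", "1567", ".", "+", ".",
     "transcript_id=1"]
)
exon = parse_gff_line(line)
print(exon.exon_type, feature_length(exon))  # prints: initial-exon 136
```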

Note that although GeneZilla does model UTRs internally, it does not predict them, so the exons listed in the output are meant to be coding exons, or CDSs.

Configuration

GeneZilla is highly reconfigurable. There are many parameters which you can adjust in order to improve the performance of the gene finder. These parameters are read from several files when GeneZilla executes:

file	description
*.iso	This file contains a table listing the *.cfg file for each isochore (by G+C%).
*.cfg	This is the primary configuration file. It contains most of the parameters that you might want to modify. There is one *.cfg file for each isochore.
transition probabilities file	The name of this file is specified in the "transition-probabilities" section of the *.cfg file. It specifies the probabilities of transitioning between particular states in the GHMM.
topology definition file	This file defines the model topology of the GHMM. It specifies which states have transitions to each of the other states, which phases each signal can occur in, and several other pieces of information which are generally not very useful to modify.
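
To make the role of the *.iso table concrete, the Python sketch below computes the G+C percentage of a sequence and looks up a configuration file in an in-memory table. It does not read the real *.iso format (which is described in the Training Manual); the table layout, the boundary values, the file names, and the choice to ignore non-ACGT characters are all assumptions for illustration.

```python
def gc_percent(sequence):
    """G+C percentage over A, C, G and T only (other symbols are ignored)."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def choose_cfg(sequence, isochores):
    """Return the cfg name of the first range containing the G+C percentage.

    isochores is a hypothetical list of (low, high, cfg_name) tuples.
    """
    gc = gc_percent(sequence)
    for low, high, cfg_name in isochores:
        if low <= gc <= high:
            return cfg_name
    raise LookupError(f"no isochore covers G+C of {gc:.1f}%")


# Hypothetical isochore table; real boundaries come from training.
table = [
    (0, 43, "iso0-43.cfg"),
    (43, 51, "iso43-51.cfg"),
    (51, 100, "iso51-100.cfg"),
]
print(choose_cfg("GGCCAATTAT", table))  # 40% G+C, prints iso0-43.cfg
```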

The figure below illustrates the structure of the *.iso file. After determining the G+C content of the sequence to be annotated, GeneZilla reads the appropriate configuration file (*.cfg):

Each *.cfg file in turn refers to several other files from which the gene finder loads parameters and statistical models. Here is an example of a *.cfg file:

##############################################################
#
# basic configuration file for GeneZilla
#
#############################################################

donor-model 		= donors.model
donor-consensus 	= GT
acceptor-model 		= acceptors.model
acceptor-consensus	= AG
start-codon-model	= start-codons.model
start-codon-consensus	= ATG
stop-codon-model	= stop-codons.model
stop-codon-consensus	= TAG|TGA|TAA
polya-model		= polya.model
polya-consensus		= AATAAA|ATTAAA
promoter-model		= TATA.model
promoter-consensus	= TATAAA|TATATA|CATAAA|CATATA
initial-exons		= initial-exons.model
internal-exons		= internal-exons.model
final-exons		= final-exons.model
single-exons		= single-exons.model
initial-exon-lengths	= initial-exons.distr
internal-exon-lengths	= internal-exons.distr
final-exon-lengths	= final-exons.distr
single-exon-lengths	= single-exons.distr
mean-intron-length	= 1294
mean-intergenic-length	= 10000
mean-5'-UTR-length	= 769
mean-3'-UTR-length      = 457
introns			= introns.model
intergenic		= intergenic.model
3'-UTR			= three-prime-utr.model
5'-UTR			= five-prime-utr.model
transition-probabilities = trans0-100.txt

Note that anything following a # is ignored, so you can comment out lines while you experiment with alternative settings for the parameters.
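
When experimenting with many settings it can help to load a *.cfg file into a script, for example to compare two configurations. The Python sketch below reads the key = value layout shown above and drops everything after a #. It is an illustration only and makes no claim to match GeneZilla's own parser in every detail.

```python
def parse_cfg(text):
    """Parse 'key = value' lines into a dict, ignoring # comments."""
    settings = {}
    for raw in text.splitlines():
        # Anything following a # is ignored.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"no '=' in line: {raw!r}")
        settings[key.strip()] = value.strip()
    return settings


example = """
# basic configuration file for GeneZilla
donor-model \t\t= donors.model
stop-codon-consensus\t= TAG|TGA|TAA
#mean-intron-length\t= 1294
mean-3'-UTR-length      = 457
"""
cfg = parse_cfg(example)
print(cfg["stop-codon-consensus"])  # prints TAG|TGA|TAA
```

The commented-out mean-intron-length line is skipped, so it does not appear in the resulting dictionary.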

The *.model files contain the statistical models which are used in the states of the GHMM to score the various parts of a prospective gene model. The best way to modify the *.model files is to re-train those models. However, one simple modification which you can make to the model files directly is to alter the threshold value on the third line of the signal models (i.e., donors, acceptors, start-/stop-codons, promoters, poly-A signals). Lowering this value generally results in more signals of that type being found by the gene finder, but can increase the false-positive rate.

The consensus tags, such as donor-consensus or acceptor-consensus, specify the sequence tags which are scored as possible signals when they are encountered in the substrate sequence. If your organism uses non-canonical splice sites, you might try adding these into the list (separating consensus tags with a vertical bar (|)). However, if these non-canonical consensus sequences occur with low frequency, then adding them may increase the false-positive rate.
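
To get a feel for how many extra candidate sites an additional consensus tag would introduce, you can count its occurrences in a sequence. The Python sketch below lists forward-strand positions matching any tag in a |-separated consensus string. It only mimics the candidate-finding step in a simplified form; the function name is an assumption, and it does not apply the signal model or its threshold.

```python
def find_consensus_sites(sequence, consensus):
    """Return (1-based position, tag) for every forward-strand match."""
    tags = [tag.strip() for tag in consensus.upper().split("|") if tag.strip()]
    seq = sequence.upper()
    hits = []
    for pos in range(len(seq)):
        for tag in tags:
            if seq.startswith(tag, pos):
                hits.append((pos + 1, tag))
    return hits


# Canonical donor tag only, then with a non-canonical GC tag added.
print(find_consensus_sites("CCGTAAGGCAG", "GT"))     # prints [(3, 'GT')]
print(find_consensus_sites("CCGTAAGGCAG", "GT|GC"))  # prints [(3, 'GT'), (8, 'GC')]
```

Every additional hit is a further candidate that the signal model must reject or accept, which is why rare consensus sequences can raise the false-positive rate.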

The *.distr files contain exon length distributions. These files are best changed via re-training.

The mean non-coding length parameters influence the scores of prospective non-coding regions based on their lengths. These mean values are used to induce a geometric distribution, which is then used to assess the probability of a non-coding segment.
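
The effect of a mean length parameter can be sketched as follows. This Python example assumes the common parameterization in which a geometric distribution over lengths 1, 2, 3, ... with mean m has p = 1/m and P(L) = (1-p)^(L-1) * p; the exact form used inside GeneZilla may differ, and the function name is hypothetical.

```python
import math


def geometric_log_prob(length, mean_length):
    """Log probability of a segment length under a geometric distribution.

    Assumes P(L) = (1 - p)**(L - 1) * p with p = 1 / mean_length.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if mean_length <= 1:
        raise ValueError("mean_length must be greater than 1")
    p = 1.0 / mean_length
    return (length - 1) * math.log1p(-p) + math.log(p)


# mean-intron-length = 1294 is taken from the example *.cfg file.
short_intron = geometric_log_prob(100, 1294)
long_intron = geometric_log_prob(5000, 1294)
print(short_intron > long_intron)  # prints True
```

Under this assumption each additional base lowers the log probability by the same fixed amount, log(1-p), so a larger mean makes long non-coding segments less costly.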

The transition probabilities file lists the probabilities of transition between given states of the GHMM. By modifying these parameters you can influence how many exons and/or genes are predicted by the gene finder.