GeneZilla Software Architecture

Bill Majoros (bmajoros@tigr.org)
The Institute for Genomic Research

	NOTE:

This document is currently undergoing revision. Please check back for updates, or contact us to get the most recent version.

Contents

Introduction
Overview
Detailed Algorithm
Classes
Enums

Introduction

This manual describes the software architecture of the eukaryotic ab initio gene finder, GeneZilla. Because GeneZilla is open source and written in a highly extensible C++ style, we hope that other projects will find it useful for building more sophisticated genome annotation tools.

Overview

GeneZilla models DNA using a Generalized Hidden Markov Model (GHMM). Alternative parses of DNA (into zero or more gene models) are evaluated under this model. A GHMM is an extension of an HMM in which each state can emit a sequence of symbols at each time unit rather than just a single symbol. Whereas an HMM emits a symbol (stochastically) from the current state and then transitions to another state (also stochastically), a GHMM emits a nonempty string of symbols from the current state before transitioning to the next state. This allows the GHMM to explicitly model the length distributions of gene features (rather than always imposing a geometric distribution, as an HMM does), and also permits other forms of dependency modeling not normally feasible with a standard HMM.
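The difference in length modeling can be made concrete with a small sketch. The function names and the table-based representation below are illustrative assumptions and are not taken from the GeneZilla source: the first function gives the geometric duration that an ordinary HMM state implies through its self-transition probability, while the second lets a GHMM state look up an arbitrary length distribution.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <map>

// Log-probability that an ordinary HMM state emits exactly `length` symbols
// when its self-transition probability is `stay`: a geometric distribution.
double hmmLogDuration(int length, double stay) {
    assert(length >= 1 && stay > 0.0 && stay < 1.0);
    return (length - 1) * std::log(stay) + std::log(1.0 - stay);
}

// A GHMM state can instead consult any explicit length table
// (hypothetical representation). Lengths missing from the table
// are treated as impossible.
double ghmmLogDuration(int length, const std::map<int, double>& lengthProb) {
    std::map<int, double>::const_iterator it = lengthProb.find(length);
    if (it == lengthProb.end() || it->second <= 0.0)
        return -std::numeric_limits<double>::infinity();
    return std::log(it->second);
}
```

With the table-based form, a feature such as an exon can be given a length distribution estimated from training data instead of one that always decays geometrically.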

Finding the best parse of an input string under an HMM or GHMM is referred to as decoding. GeneZilla implements a novel decoding algorithm for GHMMs that is both time-efficient and space-efficient, through the use of queues and propagators, which are described below. The procedure is described in detail in the following publication:

Majoros W. et al. (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders, BMC Bioinformatics 6:16.

Here is the top-level decoding algorithm for GeneZilla:

1.  load sequence and model parameters.
2.  instantiate anchor signals at left terminus.
3.  foreach base in the sequence, left-to-right:
4.    foreach signal sensor:
5.      if a signal occurs here:
6.        instantiate a new signal S.
7.        link S back to predecessors in appropriate queues.
8.        enqueue S in all appropriate queues.
9.    if a stop codon occurs here:
10.     terminate this reading frame.
11.   foreach signal queue:
12.     propagate the queue's accumulator up to this point.
13. instantiate anchor signals at right terminus.
14. select highest-scoring right-terminus anchor R.
15. trace back from R to the left terminus to get optimal parse.
16. generate GFF from the optimal parse.

We will go through this line-by-line.

1. Load Sequence And Model Parameters

The model parameters for the GHMM are loaded from a *.cfg file. Precisely which *.cfg to load is determined based on the observed GC-content of the sequence and the look-up table for *.cfg files provided in the isochore definition (*.iso) file.
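A minimal sketch of this selection step follows. The actual layout of a *.iso file is not shown in this manual, so the IsochoreRow structure, the half-open GC ranges, and the function names are assumptions made for illustration only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of a hypothetical isochore table: sequences whose GC fraction
// falls in [minGC, maxGC) use the given configuration file.
struct IsochoreRow {
    double minGC;
    double maxGC;
    std::string configFile;
};

// Fraction of G/C among the unambiguous bases (A, C, G, T) of a sequence.
double gcContent(const std::string& seq) {
    std::size_t gc = 0, acgt = 0;
    for (char c : seq) {
        switch (c) {
            case 'G': case 'C': case 'g': case 'c': ++gc; ++acgt; break;
            case 'A': case 'T': case 'a': case 't': ++acgt; break;
            default: break;  // N and other ambiguity codes are skipped
        }
    }
    return acgt == 0 ? 0.0 : static_cast<double>(gc) / acgt;
}

// Returns the configuration file whose GC range contains `gc`.
std::string chooseConfig(const std::vector<IsochoreRow>& table, double gc) {
    assert(!table.empty());
    for (const IsochoreRow& row : table)
        if (gc >= row.minGC && gc < row.maxGC) return row.configFile;
    return table.back().configFile;  // fallback, e.g. gc == 1.0
}
```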

2. Instantiate Anchor Signals at Left Terminus

Place-holder signals are instantiated at the left end of the substrate sequence, before the leftmost base. These provide an anchor to which other signals can link back. A complete parse can then be defined as a phase-consistent path extending from a Left Terminus signal to a Right Terminus signal.

3. Foreach Base in the Sequence...

We iterate left-to-right through the substrate sequence.

4. Foreach SignalSensor...

We iterate through all the signal sensors, which are the models responsible for deciding whether a given signal (like a donor site or a start codon) is likely to occur at the current position in the substrate sequence.

5. If a Signal Occurs Here...

Each SignalSensor consists of a model for evaluating a prospective signal and a cut-off threshold. If the score of the prospective signal exceeds the threshold then we go on to line 6:

6. Instantiate a New Signal

The SignalSensor believes a signal of a given type probably occurs here, so we instantiate a Signal object of that type and rooted at this position.
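The sensor test in steps 5 and 6 can be sketched with a simple position-specific weight matrix. ToySensor and its members are hypothetical and far simpler than GeneZilla's SignalSensor classes; the sketch only shows the pattern of scoring a fixed-length window and comparing the score against a cut-off threshold.

```cpp
#include <array>
#include <cassert>
#include <limits>
#include <string>
#include <vector>

// Hypothetical weight-matrix sensor: one row of log-odds scores
// (order A, C, G, T) per position of the signal's window.
struct ToySensor {
    std::vector<std::array<double, 4> > logOdds;
    double threshold;

    static int index(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    // Sum of per-position scores for the window starting at `pos`.
    double score(const std::string& seq, std::size_t pos) const {
        assert(pos + logOdds.size() <= seq.size());
        double total = 0.0;
        for (std::size_t i = 0; i < logOdds.size(); ++i) {
            int b = index(seq[pos + i]);
            if (b < 0)  // ambiguous base: treat the window as a non-signal
                return -std::numeric_limits<double>::infinity();
            total += logOdds[i][b];
        }
        return total;
    }

    // A signal is reported only when the score exceeds the threshold.
    bool detects(const std::string& seq, std::size_t pos) const {
        return pos + logOdds.size() <= seq.size() &&
               score(seq, pos) > threshold;
    }
};
```

When detects returns true, the caller would create a Signal object of the sensor's type at that position.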

7. Link the Signal back to Predecessors

In order for a (non-Terminus) Signal object to participate in a parse of the substrate it must have links to both predecessor and successor Signals. Thus, a Signal having no such links can be garbage-collected. In step #7 we consider all the possible predecessor Signals for this Signal. Those potential predecessors are the Signals which currently reside in the appropriate SignalQueues. Each SignalType has a set of appropriate SignalQueues to which it can link back. For example, acceptor Signals can link back only to Signals in the intron queue, whereas a stop-codon Signal can link back to Signals in the single-exon queue and/or the final-exon queue. For coding segments and introns the Signal can link back to at most one predecessor per phase. For all others, the Signal can link back to at most one predecessor. In either case, the predecessor which is selected will be the one which gives the current Signal the highest inductive score.
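Predecessor selection within a single queue can be sketched as a per-phase maximum. The Candidate structure and bestPredecessor function are hypothetical, and the sketch assumes that each queued score has already been propagated up to the current position, so that comparing the scores directly is meaningful.

```cpp
#include <array>
#include <cassert>
#include <limits>
#include <vector>

const double NEG_INF = -std::numeric_limits<double>::infinity();

// Hypothetical stand-in for a queued signal: one propagated (inductive)
// score per phase, already updated to the current position.
struct Candidate {
    int position;
    std::array<double, 3> score;
};

// Index into `queue` of the best predecessor in the given phase,
// or -1 if no candidate has a finite score in that phase.
int bestPredecessor(const std::vector<Candidate>& queue, int phase) {
    assert(phase >= 0 && phase < 3);
    int best = -1;
    double bestScore = NEG_INF;
    for (std::size_t i = 0; i < queue.size(); ++i) {
        if (queue[i].score[phase] > bestScore) {
            bestScore = queue[i].score[phase];
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

A phase-agnostic queue would need only one such comparison rather than one per phase.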

8. Enqueue the New Signal in Appropriate Queues

Before we move on to consider other Signals after this one we need to place this Signal into all appropriate SignalQueues, so that later Signals can potentially link back to this one. A Signal can be enqueued in all of the SignalQueues for those segments beginning with a signal of that type: e.g., a donor Signal can be enqueued in the intron queue, whereas an acceptor can be enqueued in both the internal-exon queue and the final-exon queue. A Signal can belong to multiple queues simultaneously, and can have multiple predecessors and successors, though at most one in each phase.

9. If a Stop Codon Occurs Here...

We look to see whether a valid stop-codon consensus sequence occurs at this position (regardless of its score).

10. Terminate this Reading Frame

If a stop codon occurs at this position, then any reading frame which places this stop codon in phase zero ends here, so we iterate through all of the exon SignalQueues, and for each Signal in those queues we obliterate (set to negative infinity) the score for that Signal in that reading frame. We do this because the scores for all Signals in the SignalQueues are constantly propagated up to the current point, to represent the score of the optimal parse passing through that Signal in the given phase. By obliterating the score in the terminated reading frame, we ensure that no other Signal after this point will be able to link back to that Signal in that phase, because the inductive score would be negative infinity.
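The frame arithmetic can be sketched as follows. This is a forward-strand-only illustration under a stated assumption: each queued signal stores one score per phase, where the index is the codon phase of the base at the signal's own position. The names QueuedSignal, eclipsedPhase, and terminateFrame are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <limits>
#include <vector>

const double NEG_INF = -std::numeric_limits<double>::infinity();

// Hypothetical queued signal. score[f] is the best partial-parse score
// under the assumption that the base at `pos` has codon phase f.
struct QueuedSignal {
    int pos;
    std::array<double, 3> score;
};

// Phase (measured at the signal's position) of the reading frame that
// places the base at `stopPos` in phase 0. Forward strand only.
int eclipsedPhase(int signalPos, int stopPos) {
    assert(stopPos >= signalPos);
    return (3 - (stopPos - signalPos) % 3) % 3;
}

// Obliterate, for every queued signal, the score of the frame that the
// in-frame stop codon at `stopPos` terminates.
void terminateFrame(std::vector<QueuedSignal>& exonQueue, int stopPos) {
    for (QueuedSignal& s : exonQueue)
        s.score[eclipsedPhase(s.pos, stopPos)] = NEG_INF;
}
```

The other two phases of each signal are left untouched, so the signal remains a valid predecessor in those frames.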

11. Foreach SignalQueue:

We iterate through the SignalQueues.

12. Propagate the Queue's Accumulator up to this Point

As mentioned earlier, each Signal's score (in each phase, where applicable) is constantly updated to represent the inductive score at the current position in the substrate sequence. To make this update process faster we actually cache the updates to be added to the individual Signal scores in the queue's accumulator. Then before we allow any Signal to link back to another Signal in a SignalQueue we first flush the queue's accumulator by adding its updates into the Signals' individual propagators and zeroing out the queue's accumulator. Note that each signal actually has a separate propagator for each queue to which it belongs.
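The caching idea can be sketched as a lazily applied update. ToyQueue is a hypothetical simplification: it keeps one three-phase propagator per queued signal and a single accumulator, and it omits details of the real algorithm such as phase bookkeeping as the position advances.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical queue with a lazily applied score update.
struct ToyQueue {
    std::vector<std::array<double, 3> > propagators;  // one per queued signal
    std::array<double, 3> accumulator;

    ToyQueue() { accumulator.fill(0.0); }

    // Called once per base: constant time regardless of queue size.
    void advance(const std::array<double, 3>& baseScore) {
        for (int p = 0; p < 3; ++p) accumulator[p] += baseScore[p];
    }

    // Called before any link-back: push the pending update into every
    // propagator, then zero the accumulator.
    void flush() {
        for (std::size_t i = 0; i < propagators.size(); ++i)
            for (int p = 0; p < 3; ++p) propagators[i][p] += accumulator[p];
        accumulator.fill(0.0);
    }

    // Older signals must absorb pending updates before a new one joins,
    // otherwise the new signal would receive updates that predate it.
    void enqueue(const std::array<double, 3>& initial) {
        flush();
        propagators.push_back(initial);
    }

    // Up-to-date score of signal `i` in `phase`, flushed or not.
    double score(std::size_t i, int phase) const {
        assert(i < propagators.size() && phase >= 0 && phase < 3);
        return propagators[i][phase] + accumulator[phase];
    }
};
```

The saving comes from advance, which touches only the accumulator at each base; the per-signal work in flush is paid only when a link-back or an enqueue actually occurs.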

13. Instantiate Anchor Signals at the Right Terminus

We instantiate anchors at the Right Terminus just as we did with the Left Terminus. They are located just beyond the rightmost base of the substrate sequence.

14. Select Highest-Scoring Right-Terminus Anchor

We consider all the scores of all the Right-Terminus signals (in all phases, where appropriate), and select the highest-scoring Signal.

15. Trace Back from Right Terminus to Get Optimal Parse

We trace backward from the selected Right Terminus signal (in the highest-scoring phase, where appropriate), keeping track of the current phase as we traverse the predecessor-links. The updating of the current phase during this trace-back procedure allows us to identify a unique predecessor for each Signal, resulting in a linear path between two terminus Signals.
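A minimal sketch of a phase-aware trace-back follows. The Node layout is an assumption made for illustration: for each phase it records the chosen predecessor and the phase in which to continue at that predecessor. The sketch also assumes the predecessor links are acyclic, which holds when every link points to an earlier position.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Hypothetical parse-graph node. For each phase, `pred` holds the index of
// the chosen predecessor (-1 at the left terminus) and `predPhase` the
// phase to continue in once we arrive there.
struct Node {
    int position;
    std::array<int, 3> pred;
    std::array<int, 3> predPhase;
};

// Follow predecessor links from `start` in `phase` and return the node
// indices of the resulting path in left-to-right order.
std::vector<int> traceBack(const std::vector<Node>& graph, int start, int phase) {
    std::vector<int> path;
    int current = start;
    while (current != -1) {
        assert(current >= 0 && current < static_cast<int>(graph.size()));
        assert(phase >= 0 && phase < 3);
        path.push_back(current);
        const Node& n = graph[current];
        int next = n.pred[phase];
        phase = n.predPhase[phase];  // update the phase before moving on
        current = next;
    }
    std::reverse(path.begin(), path.end());
    return path;
}
```

Because the phase is updated at every hop, each node contributes exactly one predecessor, which is what makes the resulting path linear.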

16. Generate GFF from Optimal Parse

The path resulting from trace-back denotes a parse of the substrate, so we generate the appropriate GFF from that path.
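As an illustration of this last step, the sketch below formats one feature as a nine-column, tab-separated GFF line with 1-based inclusive coordinates. The Exon structure, the feature label, the "sketch" source column, and the gene_id attribute are all hypothetical and do not reproduce GeneZilla's actual output.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical exon record using 0-based, half-open coordinates.
struct Exon {
    int begin;
    int end;
    char strand;
    int frame;
    double score;
    std::string feature;  // label is illustrative only
};

// Format one exon as a GFF line: seqname, source, feature, start, end,
// score, strand, frame, attributes.
std::string toGff(const std::string& seqId, const Exon& e, int geneId) {
    assert(e.begin >= 0 && e.begin < e.end);
    std::ostringstream out;
    out << seqId << '\t' << "sketch" << '\t' << e.feature << '\t'
        << (e.begin + 1) << '\t' << e.end << '\t' << e.score << '\t'
        << e.strand << '\t' << e.frame << '\t' << "gene_id=" << geneId;
    return out.str();
}
```

Converting from half-open to inclusive coordinates changes only the start column, since a half-open end already equals the 1-based inclusive end.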

Detailed Algorithm

A detailed description of the DSP decoding algorithm, which is utilized by GeneZilla, is given in the paper:

Majoros W. et al. (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders, BMC Bioinformatics 6:16.

A preprint of this paper can be downloaded here.

The operation of the DSP (Dynamic Score Propagation) decoding algorithm is very crudely illustrated in the figure below.

The tiers in the figure, from top to bottom are: (1) the frame, which is defined to begin at 0 at the very beginning of the input contig; (2) the input sequence (omitting all but the putative signals); (3) the phase of a putative gene parse, where the phase is defined to begin at 0 at the beginning of a forward-strand gene, or 2 at the beginning of a reverse-strand gene; (4) the "model phase" -- each dot represents a single-nucleotide score evaluated by one of three Markov chains (or Interpolated Markov Models), one chain per phase; (5) the accumulator -- each signal queue has its own accumulator which simply improves the speed of the gene finder by caching updates to the propagators; (6) the propagators -- each putative signal has one or more propagators (one per signal queue of which it is a member). The propagators are used to propagate the scores of optimal partial parses passing through each putative signal. When a signal is removed from its signal queue (due to eclipsing by in-frame stop codons) its propagator is no longer updated.

For a more detailed description of the algorithm, please refer to the paper cited above.

Classes

The following class list pertains to TIGRscan, the predecessor of GeneZilla. An updated class list for GeneZilla will be substituted here very soon.

Class	Superclass	Description
Accumulator	Propagator	a Propagator which stores an update delta to be added to all the propagators in a queue
Alphabet		a set of Symbols
AminoAlphabet	Alphabet	{*ARNDCQEGHILKMFPSTWYV}
ContentSensor		evaluates variable-length features of a parse (exons, introns, etc.), one base at a time
DiscreteDistribution		maps integers to (log) probabilities
DnaAlphabet	Alphabet	{ACGNT}
EdgeFactory		manufactures Edge objects, or descendants of Edge (useful for adding your own attributes to Edges through subclassing)
Edge		an edge connecting two Signals in a ParseGraph; abstract base class
EmpiricalDistribution	DiscreteDistribution	a DiscreteDistribution represented as a table of observed frequencies
Fast3PMC	ContentSensor	currently not in use
FastMarkovChain	ContentSensor	currently not in use
GarbageCollector		a mark-and-sweep garbage collector for Signals
GarbageIgnorer	GarbageCollector	used to disable garbage collection
GeometricDistribution	DiscreteDistribution	implements a geometric distribution, P[length]=q*(1-q)^(length-1), where q=1/(mean length)
GffReader		builds a ParseGraph from the contents of a GFF file
IntronQueue	SignalQueue	a type of SignalQueue which remembers only the best signal in each phase
IntronQueueIterator	TigrIterator <Signal*>	allows iteration through the elements of an IntronQueue's main queue (but not its holding queue)
IsochoreFile		loads a *.iso file and returns the appropriate TigrConfigFile, given the actual GC content
LowercaseDnaAlphabet	Alphabet	{acgnt}
MarkovChainCompiler		currently not in use
MarkovChain	ContentSensor	evaluates variable-length features of a parse using an Nth-order Markov chain (and enforcing a minimum sample size, as in IMMs)
MddTree	SignalSensor	represents the tree used in MDD
ModelBuilder		builds a signal- or content-sensor from a set of training sequences
NoncodingQueue	SignalQueue	a type of SignalQueue which remembers only the best signal (phase-agnostic)
NonPhasedEdge	Edge	an Edge representing a UTR or intergenic segment, with a single phase-agnostic score
NthOrderStringIterator		generates all strings of length N, for installing pseudocounts when training a Markov chain
ParseGraph		a set of Signals and Edges connecting those Signals into valid parses
Partition		represents a positional test at an interior node in an MDD tree
PhasedEdge	Edge	an Edge representing an exon or intron, with 3 phase-specific scores
Propagator		propagates the score of a path (in 3 phases) through a signal at a specific point in the substrate sequence
ScoreAnalyzer		generates a precision-recall curve from a set of scores for positive and negative examples
Sequence		an array of Symbols drawn from an Alphabet
Signal		represents a fixed-length feature in a parse (splice site, start-/stop-codon, promoter, poly-A signal)
SignalQueue		stores signals of a specific type that are still eligible to participate in a parse at some particular point later in the sequence
SignalSensor		a model that evaluates fixed-length features of a parse (Signals)
SignalTypeProperties		consolidates knowledge about the properties of each signal type, so that additional signal types can be added to the gene-finder in the future
SinglePhaseComparator	TigrComparator <Signal*>	compares the propagated inductive scores of two Signals in a given phase, accounting for accumulated lengths
Symbol		a letter in an Alphabet
TataCapModel	SignalSensor	a SignalSensor for TataCapSignals
TataCapSignal	Signal	a Signal consisting of a TATA-box followed after a short distance by a CAP-site
ThreePeriodicMarkovChain	ContentSensor	a MarkovChain that models the three coding phases separately
TigrChi2IndepTest		implements a chi-squared test for independence
TigrConfigFile		reads and stores a set of named parameters from a file
TigrFastaReader		reads a fasta file
GeneZilla		encapsulates the entire gene-finder, so it can be embedded within other projects
TopologyLoader		loads and parses the topology definition file
TrainingSequence	Sequence	extends class Sequence with a "boost count" for implementing boosting during training
Transitions		stores Transition probabilities
TreeNode		a node in an MddTree
WAM	SignalSensor	a SignalSensor which is like a WMM but has a Markov chain at each position in the matrix
WMM	SignalSensor	a position-specific weight matrix
WWAM	SignalSensor	a WAM in which Markov chain probabilities were estimated from pooled samples during training

Enums

Type	Values
Strand	PLUS_STRAND=FORWARD_STRAND, MINUS_STRAND=REVERSE_STRAND, EITHER_STRAND=NO_STRAND
SignalType	ATG, TAG, GT, AG, PROM, POLYA, NEG_ATG, NEG_TAG, NEG_GT, NEG_AG, NEG_PROM, NEG_POLYA, NO_SIGNAL_TYPE
ModelType	WMM_MODEL, WAM_MODEL, WWAM_MODEL, THREE_PERIODIC, MARKOV_CHAIN, FAST_MC, FAST_3PMC, IMM_MODEL, CODON_BIAS, MDD, SIGNAL_MODEL, HMM_MODEL, NONSTATIONARY_MC
DistributionType	EMPIRICAL_DISTRIBUTION, GEOMETRIC_DISTRIBUTION
Direction	DIR_LEFT, DIR_RIGHT
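For readers who want to see how the aliases in the table might look in code, here is a hedged C++ sketch of two of the enums. The numeric values are arbitrary, and the strandOf helper rests on the assumption (not stated in this manual) that the NEG_ signal types denote reverse-strand signals.

```cpp
#include <cassert>

// Aliases share a value, as in the table above. The values themselves
// are arbitrary in this sketch.
enum Strand {
    PLUS_STRAND,
    FORWARD_STRAND = PLUS_STRAND,
    MINUS_STRAND,
    REVERSE_STRAND = MINUS_STRAND,
    EITHER_STRAND,
    NO_STRAND = EITHER_STRAND
};

enum SignalType {
    ATG, TAG, GT, AG, PROM, POLYA,
    NEG_ATG, NEG_TAG, NEG_GT, NEG_AG, NEG_PROM, NEG_POLYA,
    NO_SIGNAL_TYPE
};

// Hypothetical helper. Assumption: NEG_ variants are reverse-strand signals.
Strand strandOf(SignalType t) {
    assert(t >= ATG && t <= NO_SIGNAL_TYPE);
    if (t == NO_SIGNAL_TYPE) return NO_STRAND;
    return t >= NEG_ATG ? MINUS_STRAND : PLUS_STRAND;
}
```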

GeneZilla is governed by the ARTISTIC LICENSE (see www.opensource.org).