Software Architecture
Bill Majoros (bmajoros@tigr.org)
The Institute for Genomic Research




NOTE:
This document is currently undergoing revision.  Please check back for updates, or contact us to get the most recent version.




Contents

Introduction
Overview
Details
Classes
Enums


Introduction

This manual describes the software architecture of the eukaryotic ab initio gene finder, GeneZilla.  Because GeneZilla is an open source program and is written in a highly extensible C++ style, it is hoped that other projects will find this software useful for building more sophisticated genome annotation tools.



Overview

GeneZilla models DNA using a Generalized Hidden Markov Model (GHMM).  Alternate parses of DNA (into zero or more gene models) are evaluated under this model.  A GHMM is an extension of an HMM in which each state can emit a sequence of symbols at each time unit rather than just a single symbol.  Whereas an HMM emits a symbol (stochastically) from the current state and then transitions to another state (also stochastically), a GHMM emits a nonempty string of symbols from the current state before transitioning to the next state.  This allows the GHMM to explicitly model the length distributions of gene features (rather than always imposing a geometric distribution as does an HMM), and also permits other forms of dependency modeling not normally feasible with a standard HMM.



The evaluation of an input string using an HMM/GHMM is referred to as a decoding algorithm.  GeneZilla implements a novel decoding algorithm for GHMMs that is both time-efficient and space-efficient, through the use of queues and propagators, which will be described below.  The procedure is described in detail in the following publication:

Majoros W. et al. (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders, BMC Bioinformatics 5:616.

Here is the top-level decoding algorithm for GeneZilla:
1.  load sequence and model parameters.
2. instantiate anchor signals at left terminus.
3. foreach base in the sequence, left-to-right:
4. foreach signal sensor:
5. if a signal occurs here:
6. instantiate a new signal S.
7. link S back to predecessors in appropriate queues.
8. enqueue S in all appropriate queues.
9. if a stop codon occurs here:
10. terminate this reading frame.
11. foreach signal queue:
12. propagate the queue's accumulator up to this point.
13. instantiate anchor signals at right terminus.
14. select highest-scoring right-terminus anchor R.
15. trace back from R to the left terminus to get optimal parse.
16. generate GFF from the optimal parse.

We will go through this line-by-line.

1. Load Sequence And Model Parameters

The model parameters for the GHMM are loaded from a *.cfg file.  Precisely which *.cfg to load is determined based on the observed GC-content of the sequence and the look-up table for *.cfg files provided in the isochore definiteion (*.iso) file.

2. Instantiate Anchor Signals at Left Terminus

Place-holder signals are instantiated at the left end of the substrate sequence, before the leftmost base.  These provide an anchor to which other signals can link back.  A complete parse can then be defined as a phase-consistent path extending from a Left Terminus signal to a Right Terminus signal.

3. Foreach Base in the Sequence...

We iterate left-to-right through the substrate sequence.

4. Foreach SignalSensor...

We iterate through all the signal sensors, which are the models responsible for deciding whether a given signal (like a donor site or a start codon) is likely to occur at the current position in the substrate sequence.

5. If a Signal Occurs Here...

Each SignalSensor consists of a model for evaluating a prospective signal and a cut-off threshold.  If the score of the prospective signal exceeds the threshold then we go on to line 6:

6. Instantiate a New Signal

The SignalSensor believes a signal of a given type probably occurs here, so we instantiate a Signal object of that type and rooted at this position. 

7. Link the Signal back to Predecessors

In order for a (non-Terminus) Signal object to participate in a parse of the substrate it must have links to both predecessor and successor Signals.  Thus, a Signal having no such links can be garbage-collected.  In step #7 we consider all the possible predecessor Signals for this Signal.  Those potential predecessors are the Signals which currently reside in the appropriate SignalQueues.  Each SignalType has a set of appropriate SignalQueues to which it can link back.  For example, donor Signals can link back only to signals in the intron queue, whereas a stop-codon Signal can link back to Signals in the single-exon queue and/or the final-exon queue.  For coding segments and introns the Signal can link back to at most one predecessor per phase.  For all others, the Signal can link back to at most one predecessor.  In either case, the predecessor which is selected will be the one which gives the current Signal the highest inductive score.

8. Enqueue the New Signal in Appropriate Queues

Before we move on to consider other Signals after this one we need to place this Signal into all appropriate SignalQueues, so that later Signals can potentially link back to this one.  A Signal can be enqueued in all of the SignalQueues for those segments beginning with a signal of that type: e.g., a donor Signal can be enqueued in the intron queue, whereas an acceptor can be enqueued in both the internal-exon queue and the final-exon queue.  A Signal can belong to multiple queues simultaneously, and can have multiple predecessors and successors, though at most one in each phase.

9. If a Stop Codon Occurs Here...

We look to see whether a valid stop-codon consensus sequence occurs at this position (regardless of its score).

10. Terminate this Reading Frame

If a stop codon occurs at this position, then any reading frame which places this stop codon in phase zero ends here, so we iterate through all of the exon SignalQueues, and for each Signal in those queues we obliterate (set to negative infinity) the score for that Signal in that reading frame.  We do this because the scores for all Signals in the SignalQueues are constantly propagated up to the current point, to represent the score of the optimal parse passing through that Signal in the given phase.  By obliterating the score in the terminted reading frame, we ensure that no other Signal after this point will be able to link back to that Signal in that phase, because the inductive score would be negative infinity.

11. Foreach SignalQueue:

We iterate through the SignalQueues.

12. Propagate the Queue's Accumulator up to this Point

As mentioned earlier, each Signal's score (in each phase, where applicable) is constantly updated to represent the inductive score at the current position in the substrate sequence.  To make this update process faster we actually cache the updates to be added to the individual Signal scores in the queue's accumulator.  Then before we allow any Signal to link back to another Signal in a SignalQueue we first flush the queue's accumulator by adding its updates into the Signals' individual propagators and zeroing out the queue's accumulator.  Note that each signal actually has a separate propagator for each queue to which it belongs.

13. Instantiate Anchor Signals at the Right Terminus

We instantiate anchors at the Right Terminus just as did with the Left Terminus.  They are located just beyond the rightmost base of the substrate sequence.

14. Select Highest-Scoring Right-Terminus Anchor

We consider all the scores of all the Right-Terminus signals (in all phases, where appropriate), and select the highest-scoring Signal.

15. Trace Back from Right Terminus to Get Optimal Parse

We trace backward from the selected Right Terminus signal (in the highest-scoring phase, where appropriate), keeping track of the current phase as we traverse the predecessor-links.  The updating of the current phase during this trace-back procedure allows us to identify a unique predecessor for each Signal, resulting in a linear path between two terminus Signals.

16. Generate GFF from Optimal Parse

The path resulting from trace-back denotes a parse of the substrate, so we generate the appropriate GFF from that path.


Detailed Algorithm

A detailed description of the DSP decoding algorithm, which is utilized by GeneZilla, is given in the paper:

Majoros W. et al. (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders, BMC Bioinformatics 5:616.

A preprint of this paper can be downloaded here.

The operation of the DSP (Dynamic Score Propagation) decoding algorithm is very crudely illustrated in the figure below.




The tiers in the figure, from top to bottom are: (1) the frame, which is defined to begin at 0 at the very beginning of the input contig; (2) the input sequence (omitting all but the putative signals); (3) the phase of a putative gene parse, where the phase is defined to begin at 0 at the beginning of a forward-strand gene, or 2 at the beginning of a reverse-strand gene; (4) the "model phase" -- each dot represents a single-nucleotide score evaluated by one of three Markov chains (or Interpolated Markov Models), one chain per phase; (5) the accumulator -- each signal queue has its own accumulator which simply improves the speed of the gene finder by caching updates to the propagators; (6) the propagators -- each putative signal has one or more propagators (one per signal queue of which it is a member).  The propagators are used to propagate the scores of optimal partial parses passing through each putative signal.  When a signal is removed from its signal queue (due to eclipsing by in-frame stop codons) its propagator is no longer updated.

For a more detailed description of the algorithm, please refer to the paper cited above.

Classes

The following class list pertains to TIGRscan, the predecessor of GeneZilla.  An updated class list for GeneZilla will be substituted here very soon.

Class
Superclass Description
Accumulator Propagator
a Propagator which stores an update delta to be added to all the propagators in a queue
Alphabet

a set of Symbols
AminoAlphabet
Alphabet
{*ARNDCQEGHILKMFPSTWYV}
ContentSensor
evaluates variable-length features of a parse (exons, introns, etc.), one base at a time
DiscreteDistribution
maps integers to (log) probabilities
DnaAlphabet
Alphabet
{ACGNT}
EdgeFactory
manufactures Edge objects, or descendents of Edge (useful for adding your own attributes to Edges through subclassing)
Edge
an edge connecting two Signals in a ParseGraph; abstract base class
EmpiricalDistribution DiscreteDistribution
a DiscreteDistribution represented as a table of observed frequencies
Fast3PMC ContentSensor
currently not in use
FastMarkovChain ContentSensor
currently not in use
GarbageCollector
a mark-and-sweep garbage collector for Signals
GarbageIgnorer
GarbageCollector
used to disable garbage collection
GeometricDistribution DiscreteDistribution
implements a geometric distribution, P[length]=q*(1-q)^(length-1) where q=1/length
GffReader
builds a ParseGraph from the contents of a GFF file
IntronQueue SignalQueue
a type of SignalQueue which remembers only the best signal in each phase
IntronQueueIterator
TigrIterator
<Signal*>
allows iteration through the elements of an IntronQueue's main queue (but not its holding queue)
IsochoreFile
loads a *.iso file and returns the appropriate TigrConfigFile, given the actual GC content
LowercaseDnaAlphabet
Alphabet
{acgnt}
MarkovChainCompiler
currently not in use
MarkovChain ContentSensor
evaluates variable-length features of a parse using an Nth order Markov Chain (and enforcing a min. sample size, like in IMMs)
MddTree SignalSensor
represents the tree used in MDD
ModelBuilder
builds a signal- or content-sensor from a set of training sequences
NoncodingQueue SignalQueue
a type of SignalQueue which remembers only the best signal (phase-agnostic)
NonPhasedEdge
Edge
an Edge representing a UTR or intergenic segment, with a single phase-agnostic score
NthOrderStringIterator
generates all strings of length N, for installing pseudocounts when training a Markov chain
ParseGraph
a set of Signals and Edges connecting those Signals into valid parses
Partition
represents a positional test at an interior node in an MDD tree
PhasedEdge Edge an Edge representing an exon or intron, with 3 phase-specific scores
Propagator
propagates the score of a path (in 3 phases) through a signal at a specific point in the substrate sequence
ScoreAnalyzer
generates a precision-recall curve from a set of scores for positive and negative examples
Sequence

an array of Symbols drawn from an Alphabet
Signal
represents a fixed-length feature in a parse (splice site, start-/stop-codon, promoter, poly-A signal)
SignalQueue
stores signals of a specific type that are still eligible to participate in a parse at some par7ticular point later in the sequence
SignalSensor
a model that evaluates fixed-length features of a parse (Signals)
SignalTypeProperties
consilidates knowledge about the properties of each signal type, so that additional signal types can be added to the gene-finder in the future
SinglePhaseComparator
TigrComparator
<Signal*>
compares the propagated inductive scores of two Signals in a given phase, accounting for accumulated lengths
Symbol

a letter in an Alphabet
TataCapModel SignalSensor
a SignalSensor for TataCapSignals
TataCapSignal Signal
a Signal consisting of a TATA-box followed after a short distance by a CAP-site
ThreePeriodicMarkovChain ContentSensor
a MarkovChain that models the three coding phases separately
TigrChi2IndepTest

implements a chi-squared test for independence
TigrConfigFile

reads and stores a set of named parameters from a file
TigrFastaReader

reads a fasta file
GeneZilla
encapsulates the entire gene-finder, so it can be embedded within other projects
TopologyLoader
loads and parses the topology definition file
TrainingSequence Sequence
extends class Sequence with a "boost count" for implementing boosting during training
Transitions
stores Transition probabilities
TreeNode
a node in an MddTree
WAM SignalSensor
a SignalSensor which is like a WMM but has a Markov chain at each position in the matrix
WMM SignalSensor
a position-specific weight matrix
WWAM
SignalSensor
a WAM in which Markov chain probabilities were estimated from pooled samples during training

Enums


Type Values
Strand PLUS_STRAND=FORWARD_STRAND,
MINUS_STRAND=REVERSE_STRAND,
EITHER_STRAND=NO_STRAND
SignalType ATG, TAG, GT, AG, PROM, POLYA, NEG_ATG, NEG_TAG, NEG_GT, NEG_AG, NEG_PROM, NEG_POLYA, NO_SIGNAL_TYPE
ModelType WMM_MODEL, WAM_MODEL, WWAM_MODEL, THREE_PERIODIC, MARKOV_CHAIN, FAST_MC, FAST_3PMC, IMM_MODEL, CODON_BIAS, MDD, SIGNAL_MODEL, HMM_MODEL, NONSTATIONARY_MC
DistributionType
EMPIRICAL_DISTRIBUTION, GEOMETRIC_DISTRIBUTION
Direction
DIR_LEFT, DIR_RIGHT




OSI certified (open-source)
GeneZilla is governed by the ARTISTIC LICENSE (see www.opensource.org).