SOTA: Self Organizing Tree Algorithm
Parameter Information
General SOTA Terminology
The topology of the resulting tree is a binary tree structure where each terminal
node represents a cluster.
Centroid Vector: a vector that is representative of the membership of a node.
Members: Expression Elements associated with a Node.
Node: a structure which contains a Centroid Vector and a number of associated
expression profiles (members).
Cell: a Node which is the terminal Node in a branch of the tree (a.k.a. leaf node).
The members of the cell are considered members of an expression cluster.
Growth Termination Criteria
Max Cycles
This integer value is the maximum number of iterations (cycles) allowed. SOTA produces
(Max Cycles + 1) clusters unless another termination criterion is satisfied before
the indicated maximum number of cycles is reached.
Max epochs/cycle
This integer value indicates the maximum number of training epochs allowed per cycle.
Max. Cell Diversity
This value is the maximum variability allowed within a cluster.
If diversity is used as the cell division criterion, all resulting clusters will fall
below this level of 'diversity' (mean gene-to-cluster-centroid distance), unless
Max Cycles is reached first, in which case some clusters may still exceed
this value.
Min Epoch Error Improvement
This value is the threshold that signals the start of a new cycle and a
cell division. The tree diversity is monitored during training epochs;
when the diversity fails to improve by more than this value, training is
considered to have stabilized and a new cycle begins.
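
The interaction between Max epochs/cycle and Min Epoch Error Improvement can be
sketched as follows. This is a minimal illustration, not the tool's actual code:
the function name and the list of per-epoch tree diversity values are hypothetical,
and the improvement is assumed to be an absolute difference between consecutive
epochs (an implementation may use a relative change instead).

```python
def epochs_in_cycle(epoch_errors, min_improvement, max_epochs):
    """Count the epochs that run before a cycle ends.

    epoch_errors: hypothetical tree diversity measured after each epoch.
    Assumption: improvement is the absolute drop between consecutive epochs.
    """
    previous = None
    used = 0
    for error in epoch_errors[:max_epochs]:  # never exceed Max epochs/cycle
        used += 1
        # Stop once the improvement no longer exceeds the threshold.
        if previous is not None and (previous - error) <= min_improvement:
            break
        previous = error
    return used


errors = [10.0, 6.0, 5.0, 4.99, 4.98]
print(epochs_in_cycle(errors, min_improvement=0.05, max_epochs=10))
```

In this example the cycle ends at the fourth epoch, the first one whose improvement
does not exceed the threshold. With max_epochs=2 the cycle would end after two
epochs regardless of the improvement.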
Run Maximum Number of Cycles (unrestricted growth)
The algorithm will run until Max Cycles is reached or until the input set is fully
partitioned, such that each cluster contains one gene or several identical gene vectors.
Centroid Migration and Neighborhood Parameters
Migration Weights
These values are used to scale the movement of cluster centroids
(characteristic gene expression patterns) toward a gene vector that has been
associated with a neighborhood. When a gene is associated with a cluster,
the centroid adapts to become more like the newly associated gene vector.
The parent and sister cell migration weights should be smaller than the
weight for the winning cell (the cell with which the gene vector is associated).
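
The centroid adaptation described above can be sketched as a weighted step toward
the gene vector. This is an illustration under assumptions: the function name is
hypothetical, the update rule shown is the usual self-organizing-map style step
(centroid + weight * (gene - centroid)), and the three weights are example values
chosen only so that the winner weight is the largest.

```python
def migrate(centroid, gene, weight):
    """Move a centroid toward a gene vector by the given migration weight.

    weight = 0 leaves the centroid unchanged; weight = 1 moves it onto the gene.
    """
    return [c + weight * (g - c) for c, g in zip(centroid, gene)]


gene = [1.0, 0.0]

# Illustrative weights only: winner > parent > sister.
winner = migrate([0.0, 2.0], gene, 0.01)
parent = migrate([0.0, 2.0], gene, 0.005)
sister = migrate([0.0, 2.0], gene, 0.001)
print(winner, parent, sister)
```

Starting from the same centroid, the winning cell moves furthest toward the gene
vector, the parent less, and the sister cell least.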
Neighborhood Level
This value determines which cells are candidates to accept new expression elements.
When elements are considered for redistribution to new nodes during a cell division,
candidate cells are determined by moving up the tree toward the root by this number of levels.
From that node, all cells (terminal nodes) within the subtree are targets for possibly
accepting expression vectors. (Each vector moves into the cell to which it is most similar.)
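
The candidate search described above can be sketched as a walk up the tree followed
by a collection of terminal nodes. The Node class and the candidate_cells function
are hypothetical names used only for this illustration; stopping at the root when
fewer levels are available is also an assumption.

```python
class Node:
    """Minimal tree node: a name, a parent link, and a list of children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def candidate_cells(cell, level):
    """Return the terminal nodes reachable after moving up `level` levels."""
    top = cell
    for _ in range(level):
        if top.parent is None:  # assumption: stop at the root
            break
        top = top.parent
    # Collect every terminal node (cell) in the subtree rooted at `top`.
    stack, leaves = [top], []
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(reversed(node.children))
        else:
            leaves.append(node)
    return leaves


root = Node("root")
a, b = Node("a", root), Node("b", root)
a1, a2 = Node("a1", a), Node("a2", a)
print([n.name for n in candidate_cells(a1, 1)])
print([n.name for n in candidate_cells(a1, 2)])
```

With a level of 1 only the two cells under "a" are candidates; with a level of 2
the search reaches the root, so cell "b" becomes a candidate as well.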
Cell Division Criteria
Use Cell Diversity
Cell diversity is the mean distance from the cell's members (expression profiles) to the cell's
centroid vector. When considering which cell to divide, the cell with the greatest diversity
is split, provided its diversity exceeds Max. Cell Diversity (see above).
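
The diversity criterion can be sketched as below. This is an illustration only:
Euclidean distance is assumed (the distance metric is not stated here), and the
function names and the dictionary layout for cells are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance (assumed metric) between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cell_diversity(members, centroid):
    """Mean distance from a cell's members to its centroid vector."""
    return sum(distance(m, centroid) for m in members) / len(members)


def cell_to_divide(cells, max_diversity):
    """Pick the most diverse cell, or None if it is within Max. Cell Diversity.

    cells: hypothetical mapping of cell name -> (members, centroid).
    """
    name, value = max(
        ((n, cell_diversity(m, c)) for n, (m, c) in cells.items()),
        key=lambda pair: pair[1],
    )
    return name if value > max_diversity else None


cells = {
    "A": ([[0.0, 0.0], [0.0, 2.0]], [0.0, 1.0]),  # diversity 1.0
    "B": ([[0.0, 0.0], [0.0, 6.0]], [0.0, 3.0]),  # diversity 3.0
}
print(cell_to_divide(cells, max_diversity=2.0))
```

Cell "B" is selected because its diversity exceeds the threshold. Raising the
threshold above 3.0 would return None, meaning no cell qualifies for division.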
Use Cell Variability
Cell variability is the maximum element-to-element distance within a cell.
The cell having the largest internal gene-to-gene distance is selected as the next cell to divide.
In this case the stopping criterion is changed so that growth
continues until the most variable cell falls below a variability criterion generated using
the provided pValue (see below).
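
Cell variability as defined above can be sketched as the largest pairwise distance
among a cell's members. Euclidean distance is assumed for illustration, and the
function name is hypothetical; a cell with a single member is given a variability
of 0.0 by convention in this sketch.

```python
import math
from itertools import combinations


def cell_variability(members):
    """Maximum element-to-element distance within a cell (Euclidean assumed)."""
    return max(
        (math.dist(u, v) for u, v in combinations(members, 2)),
        default=0.0,  # single-member cell: no pairs to compare
    )


members = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
print(cell_variability(members))
```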
pValue
This value is used when variability is the cell division criterion. A distribution of all element-to-element
distances is generated by resampling the data set, with the elements of each expression vector placed in random order.
The resulting distribution represents random gene-to-gene distances. The pValue supplied is applied to this
resampled distribution to generate a variability cutoff.
Clusters falling below this variability cutoff have a probability of having members
that are paired by chance at or below the supplied pValue.
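
The resampling step can be sketched as below. This is a rough illustration under
several assumptions: Euclidean distance, a single shuffled copy of the data set,
and a cutoff taken as the distance at the lower pValue fraction of the sorted
random distances. The function name and the seed parameter are hypothetical.

```python
import math
import random
from itertools import combinations


def variability_cutoff(vectors, p_value, seed=0):
    """Estimate a variability cutoff from randomized expression vectors."""
    rng = random.Random(seed)
    # Randomize the order of elements within each expression vector.
    shuffled = [rng.sample(v, len(v)) for v in vectors]
    # Distribution of all random element-to-element distances.
    distances = sorted(math.dist(u, v) for u, v in combinations(shuffled, 2))
    # Assumed rule: take the distance at the lower p_value fraction.
    index = int(p_value * len(distances)) - 1
    index = max(0, min(len(distances) - 1, index))
    return distances[index]


data = [[0.1, 2.0, -1.3, 0.7], [1.5, -0.2, 0.4, 2.2], [-0.9, 0.3, 1.1, -2.0],
        [0.0, 1.8, -0.6, 0.9], [2.1, -1.4, 0.5, 0.2]]
print(variability_cutoff(data, p_value=0.05))
```

A smaller pValue gives a smaller cutoff, so growth continues longer before every
cell falls below it.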
Hierarchical Clustering
This check box selects whether to perform hierarchical clustering on the elements in each cluster
created.