could be extracted from a sequence of three-word sentences (Kohonen 1990; Ritter and Kohonen 1989). The topology preserving properties enable cooperative learning in order to increase the speed and robustness of learning, studied e.g. in Walter, Martinetz, and Schulten (1991) and compared to the so-called Neural-Gas network in Walter (1991) and Walter and Schulten (1993).

In contrast to the SOM, the Neural-Gas network does not have a fixed grid topology but a "gas-like", dynamic definition of the neighborhood function, which is determined by a (dynamic) ranking of closeness in the input space (Martinetz and Schulten 1991). This results in advantages for applications with inhomogeneous or unknown topology (e.g. the prediction of chaotic time series like the Mackey-Glass series in Walter (1991), later also published in Martinetz et al. (1993)).
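The rank-based adaptation of the Neural-Gas network can be summarized in a few lines. The following Python sketch is illustrative only: the exponentially decaying rank weighting and the fixed parameters `eps` and `lam` are common choices assumed here, not taken from the cited implementations.

```python
import numpy as np

def neural_gas_step(W, x, eps=0.1, lam=2.0):
    """One Neural-Gas adaptation step (illustrative sketch).

    W   : (N, d) array of reference vectors
    x   : (d,) input vector
    eps : learning step size
    lam : range of the rank-based neighborhood coupling
    """
    # Rank all units by their distance to the input ("closeness ranking").
    dist = np.linalg.norm(W - x, axis=1)
    ranks = np.argsort(np.argsort(dist))      # 0 = best-matching unit
    # The rank-based neighborhood replaces the fixed SOM grid topology.
    h = np.exp(-ranks / lam)
    # Move every reference vector towards x, weighted by its rank.
    W += eps * h[:, None] * (x - W)
    return W

# Usage: adapt 20 reference vectors in 2-D to random inputs.
rng = np.random.default_rng(0)
W = rng.random((20, 2))
for _ in range(1000):
    W = neural_gas_step(W, rng.random(2))
```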
The choice of the type of approximation function introduces bias and restricts the variance of the possible solutions. This is a fundamental relation called the bias–variance problem (Geman et al. 1992). As indicated before, this bias and the corresponding variance reduction can be good or bad, depending on the suitability of the choice. The next section discusses the problem of over-using the variance of a chosen approximation ansatz, especially in the presence of noise.
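This bias–variance relation can be written out explicitly. The following decomposition of the expected squared estimation error follows Geman et al. (1992), but is stated here in generic symbols (F(x; D) for the approximation trained on the data set D, E[y | x] for the regression target) that are not taken from the surrounding chapter:

$$
E_D\Bigl[\bigl(F(\mathbf{x};D) - E[y\,|\,\mathbf{x}]\bigr)^2\Bigr]
= \underbrace{\bigl(E_D[F(\mathbf{x};D)] - E[y\,|\,\mathbf{x}]\bigr)^2}_{\text{bias}^2}
+ \underbrace{E_D\Bigl[\bigl(F(\mathbf{x};D) - E_D[F(\mathbf{x};D)]\bigr)^2\Bigr]}_{\text{variance}}
$$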
3.5 Strategies to Avoid Over-Fitting
Over-fitting can occur when the function f is approximated in the domain D using only a too limited number of training data points D_train. If the ratio of free parameters versus training points is too high, the approximation fits the noise, as illustrated by Fig. 3.4. This results in a reduced generalization ability. Beside the proper selection of the appropriate network structure, several strategies can help to avoid the over-fitting effect:
Early stopping: During incremental learning the approximation error is systematically decreased, but at some point the expected error, or lack-of-fit LOF(F, D), starts to increase again. The idea of early stopping is to estimate the LOF on a separate test data set D_test and determine the optimal time to stop learning (a minimal sketch of this procedure is given at the end of this section).
Figure 3.4: (Left) A meaningful fit to the given cross-marked noisy data. (Right) An over-fitted curve through the same data, which generalizes badly on the indicated (cross-marked) position.
More training data: Over-fitting can be avoided when sufficient training points are available, e.g. by learning on-line. Duplicating the available training data set and adding a small amount of noise can help to some extent.
Smoothing and Regularization: Poggio and Girosi (1990) pointed out that learning from a limited set of data is an ill-posed problem and needs further assumptions to achieve meaningful generalization capabilities. The most usual presumption is smoothness, which can be formalized by a stabilizer term in the cost function Eq. 3.1 (regularization theory). The roughness penalty approximation can be written as

$$F(\mathbf{w}, \mathbf{x}) = \mathop{\arg\min}_{F}\bigl( LOF(F, D) + \lambda\, R(F) \bigr) \qquad (3.7)$$

where R(F) is a functional that describes the roughness of the function F(w, x). The parameter λ controls the trade-off between the fidelity to the data and the smoothness of F. A common choice for R is the integrated squared Laplacian of F,

$$R(F) = \sum_{i=1}^{n} \sum_{j=1}^{n} \int_{D} \left( \frac{\partial^2 F}{\partial x_i\, \partial x_j} \right)^{2} d\mathbf{x}$$

which is equivalent to the thin-plate spline (for n ≤ 3; the name is coined by the energy of a bent thin plate of finite extent). The main difficulty is the introduction of the very influential parameter λ and the computational burden of carrying out the integral.
For the topology preserving maps the smoothing is introduced by a parameter σ, which determines the range of the learning coupling between neighboring neurons in the map. This can be interpreted as a regularization for the SOM and the "Neural-Gas" network.
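The early-stopping strategy listed above can be stated very compactly. The following Python sketch is only illustrative: it assumes a generic gradient-descent fit of a polynomial model and a held-out test set D_test, and the helper names (`lof`, `patience`) are hypothetical, not taken from the text.

```python
import numpy as np

def lof(w, X, y):
    """Lack-of-fit: mean squared error of a polynomial model."""
    return np.mean((np.polyval(w, X) - y) ** 2)

def fit_with_early_stopping(X_train, y_train, X_test, y_test,
                            degree=9, lr=0.01, max_steps=20000, patience=50):
    """Gradient descent on the training LOF; stop when the test LOF
    has not improved for `patience` consecutive steps."""
    w = np.zeros(degree + 1)
    best_w, best_test_lof, wait = w.copy(), np.inf, 0
    for step in range(max_steps):
        # Gradient of the training LOF with respect to the coefficients.
        residual = np.polyval(w, X_train) - y_train
        grad = np.array([np.mean(2 * residual * X_train ** k)
                         for k in range(degree, -1, -1)])
        w -= lr * grad
        # Monitor the LOF on the separate test set D_test.
        test_lof = lof(w, X_test, y_test)
        if test_lof < best_test_lof:
            best_w, best_test_lof, wait = w.copy(), test_lof, 0
        else:
            wait += 1
            if wait >= patience:          # optimal stopping point passed
                break
    return best_w, best_test_lof

# Usage: noisy samples of a smooth function, split into train and test sets.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 40)
y = np.sin(3 * X) + 0.1 * rng.normal(size=40)
w, err = fit_with_early_stopping(X[:30], y[:30], X[30:], y[30:])
```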
3.6 Selecting the Right Network Size
Beside the accuracy criterion (LOF, Eq. 3.1), the simplicity of the network is desirable, similar to the idea of Occam's Razor. The formal way is to augment the cost function by a complexity cost term, which is often written as a function of the number of non-constant model parameters (additive or multiplicative penalty, e.g. the Generalized Cross-Validation criterion GCV; Craven and Wahba 1979).
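For linear smoothers, where the fitted values can be written as ŷ = S y with an influence matrix S, the GCV criterion takes the following commonly used form; the notation here is generic and not quoted from the cited chapter:

$$GCV = \frac{\frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\bigl(1 - \operatorname{tr}(S)/N\bigr)^2}$$

Here tr(S) plays the role of the effective number of non-constant model parameters, so the denominator acts as a multiplicative complexity penalty.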
There are several techniques to select the right network size and structure:

Trial-and-Error is probably the most prominent method in practice. A particular network structure is constructed and evaluated, which includes training and testing. The achieved lack-of-fit (LOF) is estimated and minimized.
Genetic Algorithms can automate this optimization, provided a suitable encoding of the construction parameters into a genome can be defined. Initially, a set of individuals (network genomes), the population, is constructed by hand. During each epoch, the individuals of this generation are evaluated (training and testing). Their fitnesses (negative cost function) determine the probability of various ways of replication, including mutations (stochastic genome modifications) and cross-over (sexual replication with stochastic genome exchange). The applicability and success of this method depend strongly on the complexity of the problem, the effective representation, and the computation time required to simulate evolution. The computation time is governed by the product of the (non-parallelized) population size, the fitness evaluation time, and the number of simulated generations. For an introduction see Goldberg (1989) and, e.g., Miller, Todd, and Hegde (1989) for optimizing the coding structure, and Montana and Davis (1989) for weight determination.
Pruning and Weight Decay: By including a suitable non-linear complexity penalty term in the iterative learning cost function, a fraction of the available parameters is forced to decay to small values (weight decay). These redundant terms are afterwards removed. The disadvantage of pruning (Hinton 1986; Hanson and Pratt 1989) or optimal brain damage (Cun, Denker, and Solla 1990) methods is that both start with rather large and therefore slower converging networks (a minimal weight-decay sketch is given at the end of this list).
Growing Network Structures (additive models) follow the opposite direction. Usually, the learning algorithm monitors the network performance and decides when and how to insert further network elements (in the form of data memory, neurons, or entire sub-nets) into the existing structure. This can be combined with outlier removal and pruning techniques, which is particularly useful when the growing step is generous (one-shot learning and forgetting the unimportant things). Various unsupervised algorithms have been proposed: additive models building local regression models (Breimann, Friedman, Olshen, and Stone 1984; Hastie and Tibshirani 1991), dynamic memory based models (Atkeson 1992; Schaal and Atkeson 1994), and RBF nets (Platt 1991); the tiling algorithm (for binary outputs; Mézard and Nadal 1989) has similarities to the recursive partitioning procedure (MARS) but also allows non-orthogonal hyper-planes. The (binary output) upstart algorithm (Frean 1990) shares similarities with the continuous valued cascade correlation algorithm (Fahlman and Lebiere 1990; Littmann 1995). Adaptive topological models are studied in Jockusch (1990) and Fritzke (1991), and in combination with the Neural-Gas in Fritzke (1995).
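The weight-decay-plus-pruning idea referenced above can be sketched in a few lines. The sketch below is a generic illustration with assumed names (`decay`, `prune_threshold`) and a plain quadratic decay penalty, not the specific non-linear penalty term discussed in the cited work.

```python
import numpy as np

def train_with_weight_decay(X, y, steps=5000, lr=0.05,
                            decay=1e-3, prune_threshold=1e-2):
    """Linear model fit by gradient descent with weight decay,
    followed by pruning of the weights that decayed to small values."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the LOF
        grad += 2 * decay * w                    # quadratic decay penalty
        w -= lr * grad
    # Pruning: remove (zero out) the redundant, near-zero parameters.
    return np.where(np.abs(w) < prune_threshold, 0.0, w)

# Usage: only 2 of the 6 input features actually carry information.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 1.5 * X[:, 0] - 0.8 * X[:, 3] + 0.05 * rng.normal(size=200)
print(train_with_weight_decay(X, y))
```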
3.7 Kohonen's Self-Organizing Map
Teuvo Kohonen formulated the Self-Organizing Map (SOM) algorithm as a mathematical model of the self-organization of certain structures in the brain, the topographic maps (e.g. Kohonen 1984).
In the cortex, neurons are often organized in two-dimensional sheets with connections to other areas of the cortex or to sensor or motor neurons somewhere in the body. For example, the somatosensory cortex shows a topographic map of the sensory skin of the body. Topographic map means that neighboring areas on the skin find their neural connection and representation in neighboring neurons in the cortex. Another example is the retinotopic map in the primary visual cortex (e.g. Obermayer et al. 1990).
Fig. 3.5 shows the basic operation of the Kohonen feature map. The map is built by an m-dimensional (usually two-dimensional) lattice A of formal neurons. Each neuron is labeled by an index a ∈ A and has a reference vector w_a attached, projecting into the input space X (for more details, see Kohonen 1984; Kohonen 1990; Ritter et al. 1992).
Figure 3.5: The "Self-Organizing Map" ("SOM") is formed by an array of processing units, called formal neurons. Here the usual case, a two-dimensional array, is illustrated at the right side. Each neuron has a reference vector w_a attached, which is a point in the embedding input space X. A presented input x will select that neuron with w_a closest to it. This competitive mechanism tessellates the input space in discrete patches, the so-called Voronoi cells.
The response of a SOM to an input vector x is determined by the reference vector w_a* of the discrete "best-match" node a*. The "winner" neuron a* is defined as the node which has its reference vector w_a* closest to the given input:

$$a^* = \mathop{\arg\min}_{\forall a \in A} \left\| \mathbf{w}_a - \mathbf{x} \right\| . \qquad (3.9)$$
This competition among neurons can be biologically interpreted as a result of a lateral inhibition in the neural layer. The distribution of the reference vectors, or "weights" w_a, is iteratively developed by a sequence of training vectors. After finding the best-match neuron a*, all reference vectors are updated, w_a(new) := w_a(old) + Δw_a, by the following adaptation rule:

$$\Delta \mathbf{w}_a = \epsilon\, h(a, a^*)\, (\mathbf{x} - \mathbf{w}_a) . \qquad (3.10)$$

Here h(a, a*) is a bell shaped function (Gaussian) centered at the "winner" a* and decaying with increasing distance |a − a*| in the neuron layer. Thus, each node or "neuron" in the neighborhood of the "winner" a* participates in the current learning step (as indicated by the gray shading in Fig. 3.5).

The network starts with a given node grid A and a random initialization of the reference vectors. During the course of learning, the width σ of the neighborhood bell function h(·) and the learning step size parameter ε are continuously decreased in order to allow more and more specialization and fine tuning of the (then increasingly) individual neurons.
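The winner selection (Eq. 3.9) and the adaptation rule (Eq. 3.10), together with the decreasing neighborhood width and step size, translate directly into a short training loop. The following Python sketch is illustrative only; the Gaussian neighborhood and the exponential decay schedules with the shown parameter values are common choices assumed here, not prescriptions of the text.

```python
import numpy as np

def train_som(data, grid_shape=(10, 10), steps=5000,
              eps0=0.5, eps_end=0.05, sigma0=3.0, sigma_end=0.5):
    """Minimal SOM training loop (illustrative sketch).

    data : (N, d) array of training vectors drawn from the input space X
    """
    rng = np.random.default_rng(0)
    rows, cols = grid_shape
    # Lattice coordinates a in A and randomly initialized reference vectors w_a.
    A = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    W = rng.random((rows * cols, data.shape[1]))

    for t in range(steps):
        # Continuously decrease step size eps and neighborhood width sigma.
        frac = t / steps
        eps = eps0 * (eps_end / eps0) ** frac
        sigma = sigma0 * (sigma_end / sigma0) ** frac

        x = data[rng.integers(len(data))]
        # Eq. 3.9: the winner a* has its reference vector closest to x.
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        # Gaussian neighborhood h(a, a*) over distances in the neuron lattice.
        h = np.exp(-np.linalg.norm(A - A[winner], axis=1) ** 2
                   / (2.0 * sigma ** 2))
        # Eq. 3.10: Delta w_a = eps * h(a, a*) * (x - w_a), for all nodes a.
        W += eps * h[:, None] * (x - W)
    return A, W

# Usage: let a 10x10 lattice unfold over a 2-D uniform distribution.
A, W = train_som(np.random.default_rng(1).random((1000, 2)))
```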
This particular cooperative nature of the adaptation algorithm has important advantages: it is able to generate topological order between the w_a; as a result, the convergence of the algorithm can be sped up by involving a whole group of neighboring neurons in each learning step; and this is additionally valuable for the learning of output values with a higher degree of robustness (see Sect. 3.8 below).

By means of the Kohonen learning rule Eq. 3.10, an m-dimensional feature map will select a (possibly locally varying) subset of m independent features that capture as much of the variation of the stimulus distribution as possible. This is an important property that is also shared by the method of principal component analysis ("PCA", e.g. Jolliffe 1986). Here a linear sub-space is oriented along the axis of the maximum data variation, whereas the SOM can optimize its "best" features locally. Therefore, the feature map can be viewed as the non-linear extension of the PCA method.
The emerging tessellation of the input space and the associated encoding in the node location code exhibits an interesting property related to the task of data compression. Assuming a noisy data transmission (or storage) of an encoded data set (e.g. an image), the data reconstruction shows errors depending on the encoding and on the distribution of the noise included. Feature map encoding (i.e. node location in the neural array) is advantageous when the distribution of stochastic transmission errors decreases with the distance to the original data. In case of an error, the reconstruction will restore neighboring features, resulting in a more "faithful" compression.
Ritter showed the strict monotonic relationship between the stimulus density in the m-dimensional input space and the density of the matching weight vectors. Regions with high input stimulus density P(x) will be represented by more specialized neurons than regions with lower stimulus density. For certain conditions the density of weight vectors could be derived to be proportional to P(x)^α, with the exponent α = m/(m + 2) (Ritter 1991).
3.8 Improving the Output of the SOM Schema
As discussed before, many learning applications desire continuous valued outputs. How can the SOM network learn smooth input–output mappings?

Similar to the binning in the hyper-rectangular recursive partitioning algorithm (CART), the original output learning strategy was the supervised teaching of an attached constant y_a (or vector y_a) for every winning neuron a*:

$$F(\mathbf{x}) = \mathbf{y}_{a^*} .$$
The next important step to increase the output precision was the introduction of a locally valid mapping around the reference vector. Cleveland (1979) introduced the idea of locally weighted linear regression for univariate approximation and later for multivariate regression (Cleveland and Devlin 1988). Independently, Ritter and Schulten (1986) developed the similar idea in the context of neural networks, which was later coined the Local Linear Map ("LLM") approach.

Within each subregion, the Voronoi cell (depicted in Fig. 3.5), the output is defined by a tangent hyper-plane described by the additional vector (or matrix) B_a*:

$$F(\mathbf{x}) = \mathbf{y}_{a^*} + \mathbf{B}_{a^*} \left( \mathbf{x} - \mathbf{w}_{a^*} \right) .$$

By this means, a univariate function is approximated by a set of tangents. In general, the output F(x) is discontinuous, since the hyper-planes do not match at the Voronoi cell borders.
The next step is to smooth the LLM outputs of several neurons, instead of considering one single neuron. This can be achieved by replacing the "winner-takes-all" rule (Eq. 3.9) with a "winner-takes-most" or "soft-max" mechanism, for example by employing Eq. 3.6 in the index space of lattice coordinates A. Here the distance to the best-match a* in the neuron index space determines the contribution of each neuron. The relative width controls how strongly the distribution is smeared out, similarly to the neighborhood function h(·), but using a separate bell size.
This form of local linear map proved to be very successful in many applications, e.g. the kinematic mapping for an industrial robot (Ritter, Martinetz, and Schulten 1989; Walter and Schulten 1993). In time-series prediction it was introduced in conjunction with the SOM (Walter, Ritter, and Schulten 1990) and later with the Neural-Gas network (Walter 1991; Martinetz et al. 1993). Wan (1993) won the Santa Fe time-series contest (series X part) with a network built of finite impulse response ("FIR") elements, which have strong similarities to LLMs.

Considering the local mapping as an "expert" for a particular task sub-domain, the LLM-extended SOM can be regarded as the precursor to the architectural idea of the "mixture-of-experts" networks (Jordan and Jacobs 1994). In this idea, the competitive SOM network performs the gating of the parallel operating local experts. We will return to the mixture-of-experts architecture in Chap. 9.
Chapter 4
The PSOM Algorithm
Despite the improvement by the LLMs, the discrete nature of the standard SOM can be a limitation when the construction of smooth, higher-dimensional map manifolds is desired. Here a "blending" concept is required, which is generally applicable, also to higher dimensions.

Since the number of nodes grows exponentially with the number of map dimensions, manageably sized lattices with, say, more than three dimensions admit only very few nodes along each axis direction. Any discrete map can therefore not be sufficiently smooth for many purposes where continuity is very important, e.g. in control tasks and in robotics.
In this chapter we discuss the Parameterized Self-Organizing Map ("PSOM") algorithm. It was originally introduced as a generalization of the SOM algorithm (Ritter 1993). The PSOM parameterizes a set of basis functions and constructs a smooth, higher-dimensional map manifold. By this means a very small number of training points can be sufficient for learning very rapidly and achieving good generalization capabilities.
4.1 The Continuous Map
Starting from the SOM algorithm described in the previous section, the PSOM is also based on a lattice of formal neurons, in the following also called "nodes". Similarly to the SOM, each node carries a reference vector w_a, projecting into the d-dimensional embedding space X ⊆ ℝ^d.

The first step is to generalize the index space A in the Kohonen map to a continuous auxiliary mapping or parameter manifold S ⊂ ℝ^m in the PSOM.
Figure 4.1: The PSOM's starting position is very much the same as for the SOM depicted in Fig. 3.5. The gray shading indicates that the index space A, which is discrete in the SOM, has been generalized to the continuous space S in the PSOM. The space S is referred to as the parameter space S.
This is indicated by the grey shaded area on the right side of Fig. 4.1.
The second important step is to define a continuous mapping w(·): s ↦ w(s) ∈ M ⊂ X, where s varies continuously over S ⊂ ℝ^m.

Fig. 4.2 illustrates on the left the m=2 dimensional "embedded manifold" M in the d=3 dimensional embedding space X. M is spanned by the nine (dot marked) reference vectors w_1, ..., w_9, which are lying in a tilted plane in this didactic example. The cube is drawn for visual guidance only. The dashed grid is the image under the mapping w(·) of the (right) rectangular grid in the parameter manifold S.
How can the smooth manifold w(s) be constructed? We require that the embedded manifold M passes through all supporting reference vectors w_a and write w(·): S → M ⊂ X:

$$\mathbf{w}(\mathbf{s}) = \sum_{a \in A} H(a, \mathbf{s})\, \mathbf{w}_a .$$

This means that we need a "basis function" H(a, s) for each formal node, weighting the contribution of its reference vector (= initial "training point") w_a depending on the location s relative to the node position a and, possibly, also all other nodes A (however, we drop in our notation the dependency of H on the latter).