could be extracted from a sequence of three-word sentences (Kohonen 1990; Ritter and Kohonen 1989). The topology preserving properties enable cooperative learning in order to increase the speed and robustness of learning, studied e.g. in Walter, Martinetz, and Schulten (1991) and compared to the so-called Neural-Gas network in Walter (1991) and Walter and Schulten (1993).

In contrast to the SOM, the Neural-Gas network does not have a fixed grid topology but a "gas-like", dynamic definition of the neighborhood function, which is determined by a (dynamic) ranking of closeness in the input space (Martinetz and Schulten 1991). This results in advantages for applications with inhomogeneous or unknown topology (e.g. the prediction of chaotic time series like the Mackey-Glass series in Walter (1991), later also published in Martinetz et al. (1993)).
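The rank-based adaptation of the Neural-Gas network can be summarized in a few lines. The following Python sketch is illustrative only: the exponentially decaying rank weighting and the fixed parameters `eps` and `lam` are common choices assumed here, not taken from the cited implementations.

```python
import numpy as np

def neural_gas_step(W, x, eps=0.1, lam=2.0):
    """One Neural-Gas adaptation step (illustrative sketch).

    W   : (N, d) array of reference vectors
    x   : (d,) input vector
    eps : learning step size
    lam : range of the rank-based neighborhood coupling
    """
    # Rank all units by their distance to the input ("closeness ranking").
    dist = np.linalg.norm(W - x, axis=1)
    ranks = np.argsort(np.argsort(dist))      # 0 = best-matching unit
    # The rank-based neighborhood replaces the fixed SOM grid topology.
    h = np.exp(-ranks / lam)
    # Move every reference vector towards x, weighted by its rank.
    W += eps * h[:, None] * (x - W)
    return W

# Usage: adapt 20 reference vectors in 2-D to random inputs.
rng = np.random.default_rng(0)
W = rng.random((20, 2))
for _ in range(1000):
    W = neural_gas_step(W, rng.random(2))
```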
The choice of the type of approximation function introduces bias and restricts the variance of the possible solutions. This is a fundamental relation called the bias–variance problem (Geman et al. 1992). As indicated before, this bias and the corresponding variance reduction can be good or bad, depending on the suitability of the choice. The next section discusses the problem of over-using the variance of a chosen approximation ansatz, especially in the presence of noise.
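This bias–variance relation can be written out explicitly. The following decomposition of the expected squared estimation error follows Geman et al. (1992), but is stated here in generic symbols (F(x; D) for the approximation trained on the data set D, E[y | x] for the regression target) that are not taken from the surrounding chapter:

$$
E_D\Bigl[\bigl(F(\mathbf{x};D) - E[y\,|\,\mathbf{x}]\bigr)^2\Bigr]
= \underbrace{\bigl(E_D[F(\mathbf{x};D)] - E[y\,|\,\mathbf{x}]\bigr)^2}_{\text{bias}^2}
+ \underbrace{E_D\Bigl[\bigl(F(\mathbf{x};D) - E_D[F(\mathbf{x};D)]\bigr)^2\Bigr]}_{\text{variance}}
$$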
3.5 Strategies to Avoid Over-Fitting
Over-fitting can occur when the function f is approximated in the domain D using only a too limited number of training data points D_train. If the ratio of free parameters versus training points is too high, the approximation fits the noise, as illustrated by Fig. 3.4. This results in a reduced generalization ability. Beside the proper selection of the appropriate network structure, several strategies can help to avoid the over-fitting effect:
Early stopping: During incremental learning the approximation error is systematically decreased, but at some point the expected error, or lack-of-fit LOF(F, D), starts to increase again. The idea of early stopping is to estimate the LOF on a separate test data set D_test and determine the optimal time to stop learning (a minimal sketch of this procedure is given at the end of this section).
Figure 3.4: (Left) A meaningful fit to the given cross-marked noisy data. (Right) An over-fitted curve through the same data, which generalizes badly on the indicated (cross-marked) position.
More training data: Over-fitting can be avoided when sufficient training points are available, e.g. by learning on-line. Duplicating the available training data set and adding a small amount of noise can help to some extent.
Smoothing and Regularization: Poggio and Girosi (1990) pointed out that learning from a limited set of data is an ill-posed problem and needs further assumptions to achieve meaningful generalization capabilities. The most usual presumption is smoothness, which can be formalized by a stabilizer term in the cost function Eq. 3.1 (regularization theory). The roughness penalty approximation can be written as

$$F(\mathbf{w}, \mathbf{x}) = \mathop{\arg\min}_{F}\bigl( LOF(F, D) + \lambda\, R(F) \bigr) \qquad (3.7)$$

where R(F) is a functional that describes the roughness of the function F(w, x). The parameter λ controls the trade-off between the fidelity to the data and the smoothness of F. A common choice for R is the integrated squared Laplacian of F,

$$R(F) = \sum_{i=1}^{n} \sum_{j=1}^{n} \int_{D} \left( \frac{\partial^2 F}{\partial x_i\, \partial x_j} \right)^{2} d\mathbf{x}$$

which is equivalent to the thin-plate spline (for n ≤ 3; the name is coined by the energy of a bent thin plate of finite extent). The main difficulty is the introduction of the very influential parameter λ and the computational burden of carrying out the integral.
For the topology preserving maps the smoothing is introduced by a parameter σ, which determines the range of the learning coupling between neighboring neurons in the map. This can be interpreted as a regularization for the SOM and the "Neural-Gas" network.
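The early-stopping strategy listed above can be stated very compactly. The following Python sketch is only illustrative: it assumes a generic gradient-descent fit of a polynomial model and a held-out test set D_test, and the helper names (`lof`, `patience`) are hypothetical, not taken from the text.

```python
import numpy as np

def lof(w, X, y):
    """Lack-of-fit: mean squared error of a polynomial model."""
    return np.mean((np.polyval(w, X) - y) ** 2)

def fit_with_early_stopping(X_train, y_train, X_test, y_test,
                            degree=9, lr=0.01, max_steps=20000, patience=50):
    """Gradient descent on the training LOF; stop when the test LOF
    has not improved for `patience` consecutive steps."""
    w = np.zeros(degree + 1)
    best_w, best_test_lof, wait = w.copy(), np.inf, 0
    for step in range(max_steps):
        # Gradient of the training LOF with respect to the coefficients.
        residual = np.polyval(w, X_train) - y_train
        grad = np.array([np.mean(2 * residual * X_train ** k)
                         for k in range(degree, -1, -1)])
        w -= lr * grad
        # Monitor the LOF on the separate test set D_test.
        test_lof = lof(w, X_test, y_test)
        if test_lof < best_test_lof:
            best_w, best_test_lof, wait = w.copy(), test_lof, 0
        else:
            wait += 1
            if wait >= patience:          # optimal stopping point passed
                break
    return best_w, best_test_lof

# Usage: noisy samples of a smooth function, split into train and test sets.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 40)
y = np.sin(3 * X) + 0.1 * rng.normal(size=40)
w, err = fit_with_early_stopping(X[:30], y[:30], X[30:], y[30:])
```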
3.6 Selecting the Right Network Size
Beside the accuracy criterion (LOF, Eq. 3.1), the simplicity of the network is desirable, similar to the idea of Occam's Razor. The formal way is to augment the cost function by a complexity cost term, which is often written as a function of the number of non-constant model parameters (additive or multiplicative penalty, e.g. the Generalized Cross-Validation criterion GCV; Craven and Wahba 1979).
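For linear smoothers, where the fitted values can be written as ŷ = S y with an influence matrix S, the GCV criterion takes the following commonly used form; the notation here is generic and not quoted from the cited chapter:

$$GCV = \frac{\frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\bigl(1 - \operatorname{tr}(S)/N\bigr)^2}$$

Here tr(S) plays the role of the effective number of non-constant model parameters, so the denominator acts as a multiplicative complexity penalty.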
There are several techniques to select the right network size and structure:

Trial-and-Error is probably the most prominent method in practice. A particular network structure is constructed and evaluated, which includes training and testing. The achieved lack-of-fit (LOF) is estimated and minimized.
Genetic Algorithms can automate this optimization, provided a suitable encoding of the construction parameters into a genome can be defined. Initially, a set of individuals (network genomes), the population, is constructed by hand. During each epoch, the individuals of this generation are evaluated (training and testing). Their fitnesses (negative cost function) determine the probability of various ways of replication, including mutations (stochastic genome modifications) and cross-over (sexual replication with stochastic genome exchange). The applicability and success of this method depend strongly on the complexity of the problem, the effective representation, and the computation time required to simulate evolution. The computation time is governed by the product of the (non-parallelized) population size, the fitness evaluation time, and the number of simulated generations. For an introduction see Goldberg (1989) and, e.g., Miller, Todd, and Hegde (1989) for optimizing the coding structure, and Montana and Davis (1989) for weight determination.
Pruning and Weight Decay: By including a suitable non-linear complexity penalty term in the iterative learning cost function, a fraction of the available parameters is forced to decay to small values (weight decay). These redundant terms are afterwards removed. The disadvantage of pruning (Hinton 1986; Hanson and Pratt 1989) or optimal brain damage (Cun, Denker, and Solla 1990) methods is that both start with rather large and therefore slower converging networks (a minimal weight-decay sketch is given at the end of this list).
Growing Network Structures (additive models) follow the opposite direction. Usually, the learning algorithm monitors the network performance and decides when and how to insert further network elements (in the form of data memory, neurons, or entire sub-nets) into the existing structure. This can be combined with outlier removal and pruning techniques, which is particularly useful when the growing step is generous (one-shot learning and forgetting the unimportant things). Various unsupervised algorithms have been proposed: additive models building local regression models (Breimann, Friedman, Olshen, and Stone 1984; Hastie and Tibshirani 1991), dynamic memory based models (Atkeson 1992; Schaal and Atkeson 1994), and RBF nets (Platt 1991); the tiling algorithm (for binary outputs; Mézard and Nadal 1989) has similarities to the recursive partitioning procedure (MARS) but also allows non-orthogonal hyper-planes. The (binary output) upstart algorithm (Frean 1990) shares similarities with the continuous valued cascade correlation algorithm (Fahlman and Lebiere 1990; Littmann 1995). Adaptive topological models are studied in Jockusch (1990) and Fritzke (1991), and in combination with the Neural-Gas in Fritzke (1995).
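The weight-decay-plus-pruning idea referenced above can be sketched in a few lines. The sketch below is a generic illustration with assumed names (`decay`, `prune_threshold`) and a plain quadratic decay penalty, not the specific non-linear penalty term discussed in the cited work.

```python
import numpy as np

def train_with_weight_decay(X, y, steps=5000, lr=0.05,
                            decay=1e-3, prune_threshold=1e-2):
    """Linear model fit by gradient descent with weight decay,
    followed by pruning of the weights that decayed to small values."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the LOF
        grad += 2 * decay * w                    # quadratic decay penalty
        w -= lr * grad
    # Pruning: remove (zero out) the redundant, near-zero parameters.
    return np.where(np.abs(w) < prune_threshold, 0.0, w)

# Usage: only 2 of the 6 input features actually carry information.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 1.5 * X[:, 0] - 0.8 * X[:, 3] + 0.05 * rng.normal(size=200)
print(train_with_weight_decay(X, y))
```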
3.7 Kohonen's Self-Organizing Map
Teuvo Kohonen formulated the Self-Organizing Map (SOM) algorithm as a mathematical model of the self-organization of certain structures in the brain, the topographic maps (e.g. Kohonen 1984).
In the cortex, neurons are often organized in two-dimensional sheets with connections to other areas of the cortex or to sensor or motor neurons somewhere in the body. For example, the somatosensory cortex shows a topographic map of the sensory skin of the body. Topographic map means that neighboring areas on the skin find their neural connection and representation in neighboring neurons in the cortex. Another example is the retinotopic map in the primary visual cortex (e.g. Obermayer et al. 1990).
Fig. 3.5 shows the basic operation of the Kohonen feature map. The map is built by an m-dimensional (usually two-dimensional) lattice A of formal neurons. Each neuron is labeled by an index a ∈ A and has a reference vector w_a attached, projecting into the input space X (for more details, see Kohonen 1984; Kohonen 1990; Ritter et al. 1992).
Figure 3.5: The "Self-Organizing Map" ("SOM") is formed by an array of processing units, called formal neurons. Here the usual case, a two-dimensional array, is illustrated at the right side. Each neuron has a reference vector w_a attached, which is a point in the embedding input space X. A presented input x will select that neuron with w_a closest to it. This competitive mechanism tessellates the input space in discrete patches, the so-called Voronoi cells.
The response of a SOM to an input vector x is determined by the reference vector w_a* of the discrete "best-match" node a*. The "winner" neuron a* is defined as the node which has its reference vector w_a* closest to the given input:

$$a^* = \mathop{\arg\min}_{\forall a \in A} \left\| \mathbf{w}_a - \mathbf{x} \right\| . \qquad (3.9)$$
This competition among neurons can be biologically interpreted as a result of a lateral inhibition in the neural layer. The distribution of the reference vectors, or "weights" w_a, is iteratively developed by a sequence of training vectors. After finding the best-match neuron a*, all reference vectors are updated, w_a(new) := w_a(old) + Δw_a, by the following adaptation rule:

$$\Delta \mathbf{w}_a = \epsilon\, h(a, a^*)\, (\mathbf{x} - \mathbf{w}_a) . \qquad (3.10)$$

Here h(a, a*) is a bell shaped function (Gaussian) centered at the "winner" a* and decaying with increasing distance |a − a*| in the neuron layer. Thus, each node or "neuron" in the neighborhood of the "winner" a* participates in the current learning step (as indicated by the gray shading in Fig. 3.5).

The network starts with a given node grid A and a random initialization of the reference vectors. During the course of learning, the width σ of the neighborhood bell function h(·) and the learning step size parameter ε are continuously decreased in order to allow more and more specialization and fine tuning of the (then increasingly) individual neurons.
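The winner selection (Eq. 3.9) and the adaptation rule (Eq. 3.10), together with the decreasing neighborhood width and step size, translate directly into a short training loop. The following Python sketch is illustrative only; the Gaussian neighborhood and the exponential decay schedules with the shown parameter values are common choices assumed here, not prescriptions of the text.

```python
import numpy as np

def train_som(data, grid_shape=(10, 10), steps=5000,
              eps0=0.5, eps_end=0.05, sigma0=3.0, sigma_end=0.5):
    """Minimal SOM training loop (illustrative sketch).

    data : (N, d) array of training vectors drawn from the input space X
    """
    rng = np.random.default_rng(0)
    rows, cols = grid_shape
    # Lattice coordinates a in A and randomly initialized reference vectors w_a.
    A = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    W = rng.random((rows * cols, data.shape[1]))

    for t in range(steps):
        # Continuously decrease step size eps and neighborhood width sigma.
        frac = t / steps
        eps = eps0 * (eps_end / eps0) ** frac
        sigma = sigma0 * (sigma_end / sigma0) ** frac

        x = data[rng.integers(len(data))]
        # Eq. 3.9: the winner a* has its reference vector closest to x.
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        # Gaussian neighborhood h(a, a*) over distances in the neuron lattice.
        h = np.exp(-np.linalg.norm(A - A[winner], axis=1) ** 2
                   / (2.0 * sigma ** 2))
        # Eq. 3.10: Delta w_a = eps * h(a, a*) * (x - w_a), for all nodes a.
        W += eps * h[:, None] * (x - W)
    return A, W

# Usage: let a 10x10 lattice unfold over a 2-D uniform distribution.
A, W = train_som(np.random.default_rng(1).random((1000, 2)))
```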
This particular cooperative nature of the adaptation algorithm has important advantages: it is able to generate topological order between the w_a; as a result, the convergence of the algorithm can be sped up by involving a whole group of neighboring neurons in each learning step; and this is additionally valuable for the learning of output values with a higher degree of robustness (see Sect. 3.8 below).

By means of the Kohonen learning rule Eq. 3.10, an m-dimensional feature map will select a (possibly locally varying) subset of m independent features that capture as much of the variation of the stimulus distribution as possible. This is an important property that is also shared by the method of principal component analysis ("PCA", e.g. Jolliffe 1986). Here a linear sub-space is oriented along the axis of the maximum data variation, whereas the SOM can optimize its "best" features locally. Therefore, the feature map can be viewed as the non-linear extension of the PCA method.
The emerging tessellation of the input space and the associated encoding in the node location code exhibits an interesting property related to the task of data compression. Assuming a noisy data transmission (or storage) of an encoded data set (e.g. an image), the data reconstruction shows errors depending on the encoding and on the distribution of the noise included. Feature map encoding (i.e. node location in the neural array) is advantageous when the distribution of stochastic transmission errors decreases with the distance to the original data. In case of an error, the reconstruction will restore neighboring features, resulting in a more "faithful" compression.
Ritter showed the strict monotonic relationship between the stimulus density in the m-dimensional input space and the density of the matching weight vectors. Regions with high input stimulus density P(x) will be represented by more specialized neurons than regions with lower stimulus density. For certain conditions the density of weight vectors could be derived to be proportional to P(x)^α, with the exponent α = m/(m + 2) (Ritter 1991).
3.8 Improving the Output of the SOM Schema
As discussed before, many learning applications desire continuous valued outputs. How can the SOM network learn smooth input–output mappings?

Similar to the binning in the hyper-rectangular recursive partitioning algorithm (CART), the original output learning strategy was the supervised teaching of an attached constant y_a (or vector y_a) for every winning neuron a*:

$$F(\mathbf{x}) = \mathbf{y}_{a^*} .$$
The next important step to increase the output precision was the introduction of a locally valid mapping around the reference vector. Cleveland (1979) introduced the idea of locally weighted linear regression for univariate approximation and later for multivariate regression (Cleveland and Devlin 1988). Independently, Ritter and Schulten (1986) developed the similar idea in the context of neural networks, which was later coined the Local Linear Map ("LLM") approach.

Within each subregion, the Voronoi cell (depicted in Fig. 3.5), the output is defined by a tangent hyper-plane described by the additional vector (or matrix) B_a*:

$$F(\mathbf{x}) = \mathbf{y}_{a^*} + \mathbf{B}_{a^*} \left( \mathbf{x} - \mathbf{w}_{a^*} \right) .$$

By this means, a univariate function is approximated by a set of tangents. In general, the output F(x) is discontinuous, since the hyper-planes do not match at the Voronoi cell borders.
The next step is to smooth the LLM outputs of several neurons, instead of considering one single neuron. This can be achieved by replacing the "winner-takes-all" rule (Eq. 3.9) with a "winner-takes-most" or "soft-max" mechanism, for example by employing Eq. 3.6 in the index space of lattice coordinates A. Here the distance to the best-match a* in the neuron index space determines the contribution of each neuron. The relative width controls how strongly the distribution is smeared out, similarly to the neighborhood function h(·), but using a separate bell size.
This form of local linear map proved to be very successful in many applications, e.g. the kinematic mapping for an industrial robot (Ritter, Martinetz, and Schulten 1989; Walter and Schulten 1993). In time-series prediction it was introduced in conjunction with the SOM (Walter, Ritter, and Schulten 1990) and later with the Neural-Gas network (Walter 1991; Martinetz et al. 1993). Wan (1993) won the Santa Fe time-series contest (series X part) with a network built of finite impulse response ("FIR") elements, which have strong similarities to LLMs.

Considering the local mapping as an "expert" for a particular task sub-domain, the LLM-extended SOM can be regarded as the precursor to the architectural idea of the "mixture-of-experts" networks (Jordan and Jacobs 1994). In this idea, the competitive SOM network performs the gating of the parallel operating local experts. We will return to the mixture-of-experts architecture in Chap. 9.
Chapter 4
The PSOM Algorithm
Despite the improvement by the LLMs, the discrete nature of the standard SOM can be a limitation when the construction of smooth, higher-dimensional map manifolds is desired. Here a "blending" concept is required, which is generally applicable, also to higher dimensions.

Since the number of nodes grows exponentially with the number of map dimensions, manageably sized lattices with, say, more than three dimensions admit only very few nodes along each axis direction. Any discrete map can therefore not be sufficiently smooth for many purposes where continuity is very important, e.g. in control tasks and in robotics.
In this chapter we discuss the Parameterized Self-Organizing Map ("PSOM") algorithm. It was originally introduced as a generalization of the SOM algorithm (Ritter 1993). The PSOM parameterizes a set of basis functions and constructs a smooth, higher-dimensional map manifold. By this means a very small number of training points can be sufficient for learning very rapidly and achieving good generalization capabilities.
4.1 The Continuous Map
Starting from the SOM algorithm described in the previous section, the PSOM is also based on a lattice of formal neurons, in the following also called "nodes". Similarly to the SOM, each node carries a reference vector w_a, projecting into the d-dimensional embedding space X ⊆ ℝ^d.

The first step is to generalize the index space A in the Kohonen map to a continuous auxiliary mapping or parameter manifold S ⊂ ℝ^m in the PSOM.
Figure 4.1: The PSOM's starting position is very much the same as for the SOM depicted in Fig. 3.5. The gray shading indicates that the index space A, which is discrete in the SOM, has been generalized to the continuous space S in the PSOM. The space S is referred to as the parameter space S.
This is indicated by the grey shaded area on the right side of Fig. 4.1.
The second important step is to define a continuous mapping w(·): s ↦ w(s) ∈ M ⊂ X, where s varies continuously over S ⊂ ℝ^m.

Fig. 4.2 illustrates on the left the m=2 dimensional "embedded manifold" M in the d=3 dimensional embedding space X. M is spanned by the nine (dot marked) reference vectors w_1, ..., w_9, which are lying in a tilted plane in this didactic example. The cube is drawn for visual guidance only. The dashed grid is the image under the mapping w(·) of the (right) rectangular grid in the parameter manifold S.
How can the smooth manifold w(s) be constructed? We require that the embedded manifold M passes through all supporting reference vectors w_a and write w(·): S → M ⊂ X:

$$\mathbf{w}(\mathbf{s}) = \sum_{a \in A} H(a, \mathbf{s})\, \mathbf{w}_a .$$

This means that we need a "basis function" H(a, s) for each formal node, weighting the contribution of its reference vector (= initial "training point") w_a depending on the location s relative to the node position a and, possibly, also all other nodes A (however, we drop in our notation the dependency of H on the latter).