Definition of the Kernel Unit

Part of the document Artificial Mind System: Kernel Memory Approach (Studies in Computational Intelligence), pages 53-57.

Figure 3.1 depicts the kernel unit¹ used in the kernel memory concept. As in the figure, a single kernel unit is composed of 1) the kernel function, 2) the excitation counter, 3) auxiliary memory to store the class ID (label), and 4) pointers to the other kernel units.

¹ In this book, the term kernel sometimes interchangeably represents "kernel unit".
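As a concrete illustration, the four elements listed above can be sketched as a small data structure. This is only an illustrative sketch; the field names (`kernel`, `excitation_counter`, `class_id`, `pointers`) are our own and do not appear in the book.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class KernelUnit:
    """One possible rendering of the four elements of Fig. 3.1 (names are illustrative)."""
    kernel: Callable[[Sequence[float]], float]      # 1) the kernel function K(x)
    excitation_counter: int = 0                     # 2) counts repeated excitations (epsilon)
    class_id: Optional[str] = None                  # 3) auxiliary memory holding the class ID (eta)
    pointers: List["KernelUnit"] = field(default_factory=list)  # 4) pointers to other kernel units

    def activate(self, x: Sequence[float]) -> float:
        # Evaluate the unit's kernel function on an input vector x.
        return self.kernel(x)
```

A unit is then created by supplying any kernel function of the forms (3.1) or (3.2) discussed below.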

In the figure, the first element, i.e. the kernel function K(x), is formally defined as:

K(x) = f(x) = f(x1, x2, . . . , xN)   (3.1)

where f(·) is a certain function, or, if it is used as a similarity measurement in a specific situation:

K(x) = K(x, t) = D(x, t)   (3.2)

where x = [x1, x2, . . . , xN]^T is the input vector to the new memory element (i.e. a kernel unit), t is the template vector of the kernel unit, with the same dimension as x (i.e. t = [t1, t2, . . . , tN]^T), and the function D(·) gives a certain metric between the vectors x and t.

Then, a number of such kernels as defined by (3.2) can be considered, the simplest of which is the form that utilises the Euclidean distance metric:

K(x, t) = ‖x − t‖₂ⁿ   (n > 0),   (3.3)

or, alternatively, we could exploit a variant of the basic form (3.3) as in the following table (see e.g. Hastie et al., 2001):

Table 3.1. Some of the commonly used kernel functions

Inner product:
K(x) = K(x, t) = x · t   (3.4)

Gaussian:
K(x) = K(x, t) = exp(−‖x − t‖² / σ²)   (3.5)

Epanechnikov quadratic:
K(x) = K(z) = (3/4)(1 − z²) if |z| < 1; 0 otherwise   (3.6)

Tri-cube:
K(x) = K(z) = (1 − |z|³)³ if |z| < 1; 0 otherwise   (3.7)

where z = ‖x − t‖ⁿ (n > 0).
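The kernels of Table 3.1 are simple enough to sketch directly. The following Python functions are an illustrative rendering (the function names are ours, not the book's), with z taken as the scalar distance value defined below the table.

```python
import math


def inner_product(x, t):
    # (3.4): x . t
    return sum(xi * ti for xi, ti in zip(x, t))


def gaussian(x, t, sigma=1.0):
    # (3.5): exp(-||x - t||^2 / sigma^2)
    d2 = sum((xi - ti) ** 2 for xi, ti in zip(x, t))
    return math.exp(-d2 / sigma ** 2)


def epanechnikov(z):
    # (3.6): (3/4)(1 - z^2) inside |z| < 1, zero outside
    return 0.75 * (1.0 - z ** 2) if abs(z) < 1.0 else 0.0


def tricube(z):
    # (3.7): (1 - |z|^3)^3 inside |z| < 1, zero outside
    return (1.0 - abs(z) ** 3) ** 3 if abs(z) < 1.0 else 0.0
```

Note how the Epanechnikov and tri-cube kernels have compact support (they vanish for |z| ≥ 1), whereas the Gaussian only decays asymptotically.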

The Gaussian Kernel

In (3.2), if a Gaussian response function is chosen for a kernel unit, the output of the kernel function K(x) is given as²

K(x) = K(x, c) = exp(−‖x − c‖² / σ²).   (3.8)

In the above, the template vector t is replaced by the centroid vector c, which is specific to a Gaussian response function.

Then, the kernel function represented in terms of the Gaussian response function exhibits the following properties:

1) The distance metric between the two vectors x and c is given as the squared value of the Euclidean distance (i.e. the L2 norm).

2) The spread of the output value (or, the width of the kernel) is determined by the factor (radius) σ.

3) The output value obtained by calculating K(x) is strictly bounded within the range from 0 to 1.

4) In terms of the Taylor series expansion, the exponential part within the Gaussian response function can be approximated by the polynomial

exp(−z) ≈ Σ_{n=0}^{N} (−1)ⁿ zⁿ / n! = 1 − z + (1/2)z² − (1/3!)z³ + · · ·   (3.9)

where N is finite and reasonably large in practice. Exploiting this may facilitate hardware representation³. Along this line, it is reported in (Platt, 1991) that the following approximation is empirically found to be reasonable:

exp(−z / σ²) ≈ (1 − z / (qσ²))² if z < qσ²; 0 otherwise   (3.10)

where q = 2.67.

5) The real-world data can be moderately but reasonably well represented in many situations in terms of the Gaussian response function, i.e. as a consequence of the central limit theorem in the statistical sense (see e.g. Garcia, 1994) (as described in Sect. 2.3). Nevertheless, within the kernel memory context, it is also possible to use a mixture of kernel representations rather than resorting to a single representation, depending upon the situation.

² In some literature, the factor σ² within the denominator of the exponential function in (3.8) is multiplied by 2, due to the derivation of the original form. However, there is essentially no difference in practice, since we may rewrite (3.8) with σ = √2 σ́, where σ́ is then regarded as the radius.

³ For the realisation of the Gaussian response function (or RBF) in terms of hardware, complementary metal-oxide semiconductor (CMOS) inverters have been exploited (for the details, see Anderson et al., 1993; Theogarajan and Akers, 1996, 1997; Yamasaki and Shibata, 2003).
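Property 4) above can be checked numerically. The sketch below implements the truncated series (3.9) and a piecewise-quadratic stand-in for (3.10); note that the exact algebraic form of (3.10), here taken as (1 − z/(qσ²))², is a reconstruction from a damaged formula and should be treated as an assumption.

```python
import math


def exp_taylor(z, N=20):
    # Truncated alternating series (3.9): sum_{n=0}^{N} (-1)^n z^n / n!
    return sum((-1) ** n * z ** n / math.factorial(n) for n in range(N + 1))


def gaussian_platt(z, sigma=1.0, q=2.67):
    # Piecewise-quadratic approximation of exp(-z / sigma^2), in the spirit of (3.10).
    # The form (1 - z/(q*sigma^2))^2 is an assumed reconstruction; it is continuous
    # at the cut-off z = q*sigma^2, where both branches give 0.
    return (1.0 - z / (q * sigma ** 2)) ** 2 if z < q * sigma ** 2 else 0.0
```

For moderate z both stand-ins track exp(−z) closely, which is what makes such forms attractive for hardware realisation.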

In 1) above, a single Gaussian kernel is already a pattern classifier, in the sense that calculating the Euclidean distance between x and c is equivalent to performing pattern matching, and the score indicating how similar the input vector x is to the stored pattern c is given as the value obtained from the exponential function (according to 3) above); if the value becomes asymptotically close to 1 (or, if the value is above a certain threshold), this indicates that the given input vector x matches the template vector c to a great extent and can be classified into the same category as that of c. Otherwise, the pattern x belongs to another category⁴.

Thus, since the value obtained from the similarity measurement in (3.8) is bounded (or, in other words, normalised), due to the existence of the exponential function, the uniformity of the classification score is retained. In practice, this property is quite useful, especially when considering the utility of multiple Gaussian kernels, as used in the family of RBF-NNs. In this context, the Gaussian metric is advantageous in comparison with the original Euclidean metric given by (3.3).
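The classifier behaviour described above can be illustrated in a few lines: a Gaussian kernel scores an input against its centroid, and a threshold (here a hypothetical value of 0.5) decides category membership.

```python
import math


def gaussian_score(x, c, sigma=1.0):
    # Similarity score per (3.8), bounded in (0, 1].
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-d2 / sigma ** 2)


centroid = [1.0, 2.0]   # stored pattern c
theta = 0.5             # hypothetical decision threshold (not from the book)

x_near = [1.1, 2.1]     # close to the centroid: score near 1, accepted
x_far = [4.0, -1.0]     # far from the centroid: score near 0, rejected
```

Because the score is normalised, the same threshold remains meaningful across many Gaussian units, which is the uniformity property noted above.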

Kernel Function Representing a General Symbolic Node

In addition, a single kernel can also be regarded as a new entity in place of the conventional memory element, as well as a symbolic node in general symbolism, by simply assigning the kernel function as

K(x) = θs if the activation from the other kernel unit(s) is transferred to this kernel unit via the link weight(s); 0 otherwise   (3.11)

where θs is a certain constant.
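A minimal sketch of the symbolic-node kernel (3.11), under the assumption that "activation transferred via the link weight(s)" can be summarised as a single non-negative input value:

```python
def symbolic_kernel(incoming_activation, theta_s=1.0):
    # (3.11): fire at the constant level theta_s only when some activation
    # arrives from other kernel unit(s); output 0 otherwise.
    return theta_s if incoming_activation > 0.0 else 0.0
```

The output is all-or-nothing, which is what lets such a unit play the role of a node in conventional symbolism.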

This view then allows us to subsume the concept of symbolic connectionist models such as Minsky's knowledge-line (K-Line) (Minsky, 1985). Moreover, the kernel memory can replace the ordinary symbolism in that each node (i.e. represented by a single kernel unit) can have a generalisation capability which could, to a greater extent, mitigate the "curse-of-dimensionality", in which, practically speaking, the exponentially growing number of data points soon exhausts the entire memory space.

⁴ In fact, the utility of the Gaussian distribution function as a similarity measurement between two vectors is one of the common techniques, e.g. the psychological model of GCM (Nosofsky, 1986), which can be viewed as one of the twins of RBF-NNs, or the application to continuous speech recognition (Lee et al., 1990; Rabiner and Juang, 1993).

The Excitation Counter

Returning to Fig. 3.1, the second element of the kernel unit, ε, is the excitation counter. The excitation counter can be used to count how many times the kernel unit is repeatedly excited (e.g. the value of the kernel function K(x) is above a given threshold) within a certain period of time (if so defined), i.e. when the kernel function satisfies the relation

K(x) ≥ θK   (3.12)

where θK is the given threshold.

Initially, the value ε is set to 0 and incremented whenever the kernel unit is excited, though the value may be reset to 0 where necessary.
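The excitation counter ε can be sketched as a tiny stateful object implementing the relation (3.12); the class and method names are illustrative only.

```python
class ExcitationCounter:
    def __init__(self, theta_K):
        self.theta_K = theta_K   # excitation threshold in (3.12)
        self.epsilon = 0         # initially set to 0

    def observe(self, K_x):
        # Increment epsilon whenever the kernel output satisfies K(x) >= theta_K.
        if K_x >= self.theta_K:
            self.epsilon += 1

    def reset(self):
        # The value may be reset to 0 where necessary.
        self.epsilon = 0
```

Over a sequence of kernel outputs, ε then records how many times the unit was excited within the observed period.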

The Auxiliary Memory

The third element in Fig. 3.1 is the auxiliary memory η to store the class ID (label) indicating that the kernel unit belongs to a particular class (or category). Unlike the conventional pattern classification context, the timing at which the class ID η is fixed is flexibly determined, depending upon the learning algorithm for the kernel memory, as described later.

The Pointers to Other Kernel Units

Finally, the fourth element in Fig. 3.1 is the pointer(s) pi (i = 1, 2, . . . , Np) to the other kernel unit(s). By exploiting these pointers, a link weight is defined between a pair of kernel units, with a weighting factor representing the strength of the connection between them.

Note that this manner of connection then allows us to realise a different form of network configuration from the conventional neural network architectures, since the output of the kernel function K(x) is not always directly transferred to the other nodes via the "weights", e.g. those between the hidden and output layers, as in PNNs/GRNNs.
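One possible (purely illustrative) realisation of pointers carrying link weights between kernel units is a node holding a list of (pointer, weight) pairs; the names here are ours, not the book's.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.links = []          # list of (pointer-to-node, weight) pairs

    def connect(self, other, weight):
        # A directed link whose weight represents the strength of the connection.
        self.links.append((other, weight))


a, b = Node("k1"), Node("k2")
a.connect(b, 0.8)   # a points at b with connection strength 0.8
```

Because activation need not flow along these links on every evaluation, such pointer-based wiring differs from the fixed layer-to-layer weights of PNNs/GRNNs, as noted above.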
