
NEURAL NETWORKS


Ivan F. Wilde, Mathematics Department, King's College London, London WC2R 2LS, UK

ivan.wilde@kcl.ac.uk

Contents

1 Matrix Memory
2 Adaptive Linear Combiner
3 Artificial Neural Networks
4 The Perceptron
5 Multilayer Feedforward Networks
6 Radial Basis Functions
7 Recurrent Neural Networks
8 Singular Value Decomposition
Bibliography

Matrix Memory

an acceptable response. Such a system is also called a content addressable memory.

Figure 1.1: A content addressable memory (stimulus → mapping → response).

The idea is that the association should not be defined so much between the individual stimulus-response pairs, but rather embodied as a whole collection of such input-output patterns: the system is a distributive associative memory (the input-output pairs are "distributed" throughout the system memory rather than being represented individually in various different parts of the system).

To attempt to realize such a system, we shall suppose that the input key (or prototype) patterns are coded as vectors in $\mathbb{R}^n$, say, and that the responses are coded as vectors in $\mathbb{R}^m$. For example, the input might be a digitized photograph comprising a picture with $100 \times 100$ pixels, each of which may assume one of eight levels of greyness, from white ($= 0$) to black

($= 7$)). In this case, by mapping the screen to a vector via raster order, say, the input is a vector in $\mathbb{R}^{10000}$ whose components take values in the set $\{0, \dots, 7\}$. The desired output might correspond to the name of the person in the photograph. If we wish to recognize up to 50 people, say, then we could give each a binary code name of 6 digits, which allows up to $2^6 = 64$ different names. The output can then be considered as an element of $\mathbb{R}^6$.

Now, for any pair of vectors $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$, we can effect the map $x \mapsto y$ via the action of the $m \times n$ matrix
$$ M^{(x,y)} = y\,x^T, $$
where $x$ is considered as an $n \times 1$ (column) matrix and $y$ as an $m \times 1$ matrix. Indeed,
$$ M^{(x,y)} x = y\,x^T x = \alpha\, y, $$
where $\alpha = x^T x = \|x\|^2$, the squared Euclidean norm of $x$. The matrix $y x^T$ is called the outer product of $x$ and $y$. This suggests a model for our pattern association problem: for the pattern pairs $(x^{(i)}, y^{(i)})$, $1 \le i \le p$, take the memory matrix to be
$$ M = \sum_{i=1}^{p} y^{(i)} x^{(i)T}. $$
Writing $X = (x^{(1)} \cdots x^{(p)})$ and $Y = (y^{(1)} \cdots y^{(p)})$ for the matrices whose columns are the patterns, $M = \sum_{i=1}^{p} y^{(i)} x^{(i)T}$ is just $Y X^T$. Indeed, the $jk$-element of $Y X^T$ is
$$ (Y X^T)_{jk} = \sum_{i=1}^{p} Y_{ji}\,(X^T)_{ik} = \sum_{i=1}^{p} y^{(i)}_j x^{(i)}_k, $$
which is precisely the $jk$-element of $M$.

When presented with the input signal $x^{(j)}$, the output is
$$ M x^{(j)} = \sum_{i=1}^{p} (x^{(i)T} x^{(j)})\, y^{(i)} = (x^{(j)T} x^{(j)})\, y^{(j)} + \sum_{i \ne j} (x^{(i)T} x^{(j)})\, y^{(i)}. $$

In particular, if we agree to "normalize" the key input signals so that $x^{(i)T} x^{(i)} = \|x^{(i)}\|^2 = 1$ for all $1 \le i \le p$, then the first term on the right hand side above is just $y^{(j)}$, the desired response signal. The second term on the right hand side is called the "cross-talk", since it involves overlaps (i.e., inner products) of the various input signals.

If the input signals are pairwise orthogonal vectors, as well as being normalized, then $x^{(i)T} x^{(j)} = 0$ for all $i \ne j$. In this case, we get
$$ M x^{(j)} = y^{(j)}, $$
that is, perfect recall. Note that $\mathbb{R}^n$ contains at most $n$ mutually orthogonal vectors.

Operationally, one can imagine the system organized as indicated in the figure:

• finally, present any input signal and observe the response.

Note that additional signal-response patterns can simply be "added in" at any time, or even removed, by adding in $-y^{(j)} x^{(j)T}$. After the second stage above, the system has "learned" the signal-response pattern pairs. The collection of pattern pairs $(x^{(1)}, y^{(1)}), \dots, (x^{(p)}, y^{(p)})$ is called the training set.
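To make the construction concrete, the following minimal NumPy sketch (with arbitrary illustrative dimensions and random patterns) builds the correlation memory matrix $M = YX^T$ from orthonormal keys, shows the perfect recall described above, and illustrates the cross-talk that appears when the keys are merely normalized:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, p = 16, 4, 3                # input dim, output dim, number of pattern pairs (illustrative)

# Orthonormal key patterns: take p columns of a random orthogonal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = Q[:, :p]                      # columns x^(1), ..., x^(p): pairwise orthogonal, unit norm
Y = rng.standard_normal((m, p))   # desired responses y^(1), ..., y^(p)

# Correlation memory matrix M = sum_i y^(i) x^(i)^T = Y X^T.
M = Y @ X.T

# Perfect recall for orthonormal keys: M x^(j) = y^(j).
print(np.allclose(M @ X, Y))      # True

# With normalized but non-orthogonal keys, the cross-talk term spoils exact recall.
X2 = rng.standard_normal((n, p))
X2 /= np.linalg.norm(X2, axis=0)
M2 = Y @ X2.T
print(np.linalg.norm(M2 @ X2 - Y))  # generally non-zero: the cross-talk
```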

Remark 1.1. In general, the system is a heteroassociative memory, $x^{(i)} \mapsto y^{(i)}$, $1 \le i \le p$. If the output is the prototype input itself, then the system is said to be an autoassociative memory.

We wish, now, to consider a quantitative account of the robustness of the autoassociative memory matrix. For this purpose, we shall suppose that the prototype patterns are bipolar vectors in $\mathbb{R}^n$, i.e., the components of each $x^{(i)}$ belong to $\{-1, 1\}$. Then $\|x^{(i)}\|^2 = \sum_{j=1}^{n} x^{(i)2}_j = n$ for each $1 \le i \le p$, so that $(1/\sqrt{n})\, x^{(i)}$ is normalized. Suppose, further, that the prototype vectors are pairwise orthogonal (this requires that $n$ be even). The correlation memory matrix is

$$ M = \frac{1}{n} \sum_{i=1}^{p} x^{(i)} x^{(i)T}, $$
and we have seen that $M$ has perfect recall, $M x^{(j)} = x^{(j)}$ for all $1 \le j \le p$.

We would like to know what happens if $M$ is presented with $x$, a corrupted version of one of the $x^{(j)}$. In order to obtain a bipolar vector as output, we process the output vector $Mx$ as follows:
$$ M x \mapsto \Phi(M x), $$
where $\Phi : \mathbb{R}^n \to \{-1, 1\}^n$ is defined by
$$ \Phi(z)_k = \begin{cases} 1, & \text{if } z_k \ge 0, \\ -1, & \text{if } z_k < 0, \end{cases} $$
for $1 \le k \le n$ and $z \in \mathbb{R}^n$. Thus, the matrix output is passed through a (bipolar) signal quantizer, $\Phi$. To proceed, we introduce the notion of Hamming distance between pairs of bipolar vectors.

Let $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$ be elements of $\{-1, 1\}^n$, i.e., bipolar vectors. The set $\{-1, 1\}^n$ consists of the $2^n$ vertices of a hypercube. We have
$$ a^T b = \sum_{i=1}^{n} a_i b_i = \alpha - \beta, $$
where $\alpha$ is the number of components of $a$ and $b$ which are the same, and $\beta$ is the number of differing components ($a_i b_i = 1$ if and only if $a_i = b_i$, and $a_i b_i = -1$ if and only if $a_i \ne b_i$). Clearly, $\alpha + \beta = n$ and so $a^T b = n - 2\beta$.

Definition 1.2. The Hamming distance between the bipolar vectors $a, b$, denoted $\rho(a, b)$, is defined to be
$$ \rho(a, b) = \tfrac{1}{2}\,\bigl( n - a^T b \bigr). $$

Evidently (thanks to the factor $\tfrac{1}{2}$), $\rho(a, b)$ is just the total number of mismatches between the components of $a$ and $b$, i.e., it is equal to $\beta$, above.

Now let $M$ be the correlation memory matrix built from the pairwise orthogonal bipolar patterns $x^{(1)}, \dots, x^{(p)}$, let $x \in \{-1, 1\}^n$ be a (possibly corrupted) input, and write $\rho_i(x) = \rho(x^{(i)}, x)$. Hence, using $x^{(i)T} x = n - 2\rho(x^{(i)}, x)$, we have
$$ M x = \frac{1}{n} \sum_{i=1}^{p} (x^{(i)T} x)\, x^{(i)} = \Bigl( 1 - \frac{2\rho_m(x)}{n} \Bigr) x^{(m)} + \frac{1}{n} \sum_{\substack{i=1 \\ i \ne m}}^{p} \bigl( n - 2\rho_i(x) \bigr)\, x^{(i)}. $$
The quantized output is $x^{(m)}$ provided that $x^{(m)}_k (Mx)_k > 0$ for every component $k$, and this certainly holds if
$$ n - 2\rho_m(x) > \sum_{\substack{i=1 \\ i \ne m}}^{p} \bigl|\, n - 2\rho_i(x) \,\bigr|. \tag{$*$} $$
We wish to find conditions which ensure that the inequality ($*$) holds.

By the triangle inequality, we get
$$ \Bigl| \sum_{\substack{i=1 \\ i \ne m}}^{p} \bigl( n - 2\rho_i(x) \bigr)\, x^{(i)}_k x^{(m)}_k \Bigr| \le \sum_{\substack{i=1 \\ i \ne m}}^{p} \bigl|\, n - 2\rho_i(x) \,\bigr|, \tag{$**$} $$
since $|x^{(i)}_j x^{(m)}_j| = 1$ for all $1 \le j \le n$. Furthermore, using the orthogonality of $x^{(m)}$ and $x^{(i)}$, for $i \ne m$, so that $\rho(x^{(i)}, x^{(m)}) = n/2$, together with the triangle inequalities
$$ \rho(x^{(i)}, x^{(m)}) \le \rho(x^{(i)}, x) + \rho(x, x^{(m)}) \quad\text{and}\quad \rho(x^{(i)}, x) \le \rho(x^{(i)}, x^{(m)}) + \rho(x^{(m)}, x), $$
we find $n/2 - \rho_m(x) \le \rho_i(x) \le n/2 + \rho_m(x)$, that is,
$$ \bigl|\, n - 2\rho_i(x) \,\bigr| \le 2\rho_m(x), \tag{$*{*}*$} $$
and hence
$$ \sum_{\substack{i=1 \\ i \ne m}}^{p} \bigl|\, n - 2\rho_i(x) \,\bigr| \le 2(p-1)\,\rho_m(x). $$
It follows that whenever $2(p-1)\rho_m(x) < n - 2\rho_m(x)$ then ($*$) holds, which means that $\Phi(M x) = x^{(m)}$. The condition $2(p-1)\rho_m(x) < n - 2\rho_m(x)$ is just $2p\rho_m(x) < n$, i.e., the condition that $\rho_m(x) < n/2p$.

Now, we observe that if $\rho_m(x) < n/2p$, then, for any $i \ne m$,
$$ n - 2\rho_i(x) \le 2\rho_m(x) < \frac{n}{p} $$
by ($*{*}*$), above, and so $n - 2\rho_i(x) < n/p$. Thus
$$ \rho_i(x) > \frac{n}{2} \Bigl( \frac{p-1}{p} \Bigr) \ge \frac{n}{2p}, $$
assuming that $p \ge 2$, so that $p - 1 \ge 1$. In other words, if $x$ is within Hamming distance $n/2p$ of $x^{(m)}$, then its Hamming distance to every other prototype input vector is greater than (or equal to) $n/2p$. We have thus proved the following theorem (L. Personnaz, I. Guyon and G. Dreyfus, Phys. Rev. A 34, 4217–4228 (1986)).

Theorem 1.3. Suppose that $\{x^{(1)}, x^{(2)}, \dots, x^{(p)}\}$ is a given set of mutually orthogonal bipolar patterns in $\{-1, 1\}^n$. If $x \in \{-1, 1\}^n$ lies within Hamming distance $n/2p$ of a particular prototype vector $x^{(m)}$, say, then $x^{(m)}$ is the nearest prototype vector to $x$.

Furthermore, if the autoassociative matrix memory based on the patterns $\{x^{(1)}, x^{(2)}, \dots, x^{(p)}\}$ is augmented by subsequent bipolar quantization, then the input vector $x$ invokes $x^{(m)}$ as the corresponding output.

This means that the combined memory matrix and quantization system can correctly recognize (slightly) corrupted input patterns. The nonlinearity (induced by the bipolar quantizer) has enhanced the system performance: small background "noise" has been removed. Note that it could happen that the output response to $x$ is still $x^{(m)}$ even if $x$ is further than $n/2p$ from $x^{(m)}$. In other words, the theorem only gives sufficient conditions for $x$ to recall $x^{(m)}$.

As an example, suppose that we store 4 patterns built from a grid of $8 \times 8$ pixels, so that $p = 4$, $n = 8^2 = 64$ and $n/2p = 64/8 = 8$. Each of the 4 patterns can then be correctly recalled even when presented with up to 7 incorrect pixels.
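A minimal numerical sketch of this example (illustrative only; rows of a Hadamard matrix are used here as one convenient way to obtain mutually orthogonal bipolar patterns):

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a 64 x 64 Hadamard matrix by repeated Kronecker products; its rows are
# mutually orthogonal bipolar (+-1) vectors.
H2 = np.array([[1, 1], [1, -1]])
H = np.array([[1]])
for _ in range(6):
    H = np.kron(H, H2)

n, p = 64, 4
patterns = H[1:p + 1]               # p mutually orthogonal bipolar patterns (an arbitrary choice of rows)
M = patterns.T @ patterns / n       # M = (1/n) sum_i x^(i) x^(i)^T

def quantize(z):
    """Bipolar quantizer Phi: +1 for z_k >= 0, -1 otherwise."""
    return np.where(z >= 0, 1, -1)

# Flip 7 randomly chosen components of each stored pattern (7 < n/2p = 8)
# and check that the quantized output restores the original.
for x in patterns:
    corrupted = x.copy()
    flip = rng.choice(n, size=7, replace=False)
    corrupted[flip] *= -1
    recalled = quantize(M @ corrupted)
    print(np.array_equal(recalled, x))   # True for every pattern, as Theorem 1.3 guarantees
```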

Remark 1.4. If $x$ is close to $-x^{(m)}$, then the output from the combined autocorrelation matrix memory and bipolar quantizer is $-x^{(m)}$.

A memory matrix, also known as a linear associator, can be pictured as a network as in the figure.

"Weights" are assigned to the connections. Since $y_i = \sum_j M_{ij} x_j$, this suggests that we assign the weight $M_{ij}$ to the connection joining input node $j$ to output node $i$; $M_{ij} = \mathrm{weight}(j \to i)$.

The correlation memory matrix trained on the pattern pairs $(x^{(1)}, y^{(1)}), \dots, (x^{(p)}, y^{(p)})$ is given by $M = \sum_{m=1}^{p} y^{(m)} x^{(m)T}$, which has typical term
$$ M_{ij} = \sum_{m=1}^{p} y^{(m)}_i x^{(m)}_j. $$
This is reminiscent of Hebb's postulate of learning, according to which the repeated or persistent participation of cell $j$ in the firing of cell $i$ causes an increase in its efficiency to excite cell $i$. To encapsulate a crude version of this idea mathematically, we might hypothesise that the weight between the two nodes be proportional to the excitation values of the nodes. Thus, for pattern label $m$, we would postulate that the weight, $\mathrm{weight}(\text{input } j \to \text{output } i)$, be proportional to $x^{(m)}_j y^{(m)}_i$.

We see that $M_{ij}$ is a sum, over all patterns, of such terms. For this reason, the assignment of the correlation memory matrix to a content addressable memory system is sometimes referred to as generalized Hebbian learning, or one says that the memory matrix is given by the generalized Hebbian rule.

Capacity of autoassociative Hebbian learning

We have seen that the correlation memory matrix has perfect recall provided that the input patterns are pairwise orthogonal vectors. Clearly, there can be at most $n$ of these. In practice, this orthogonality requirement may not be satisfied, so it is natural to ask for some kind of guide as to the number of patterns that can be stored and effectively recovered. In other words, how many patterns can there be before the cross-talk term becomes so large that it destroys the recovery of the key patterns? Experiment confirms that, indeed, there is a problem here. To give some indication of what might be reasonable, consider the autoassociative correlation memory matrix based on $p$ bipolar pattern vectors $x^{(1)}, \dots, x^{(p)} \in \{-1, 1\}^n$, followed by bipolar quantization, $\Phi$. On presentation of pattern $x^{(m)}$, the system output is $\Phi(M x^{(m)})$, where $M = \frac{1}{n} \sum_{i=1}^{p} x^{(i)} x^{(i)T}$.

Consider the $k$th bit. Then $\Phi(M x^{(m)})_k = x^{(m)}_k$ whenever $x^{(m)}_k (M x^{(m)})_k > 0$, that is, whenever
$$ 1 + \frac{1}{n} \sum_{\substack{i=1 \\ i \ne m}}^{p} \sum_{j=1}^{n} x^{(m)}_k x^{(i)}_k x^{(i)}_j x^{(m)}_j \;=\; 1 + C_k \;>\; 0. $$
We see that $C_k$ is a sum of many terms,
$$ C_k = \frac{1}{n} \sum_{\substack{i=1 \\ i \ne m}}^{p} \sum_{j=1}^{n} X_{m,k,i,j}, $$
where $X_{m,k,i,j} = x^{(m)}_k x^{(i)}_k x^{(i)}_j x^{(m)}_j$. We note firstly that, with $j = k$,
$$ X_{m,k,i,k} = x^{(m)}_k x^{(i)}_k x^{(i)}_k x^{(m)}_k = 1. $$
Next, we see that, for $j \ne k$, each $X_{m,k,i,j}$ takes the values $\pm 1$ with equal probability, namely $\tfrac{1}{2}$, and that these different $X$s form an independent family. Therefore, we may write $C_k$ as
$$ C_k = \frac{p-1}{n} + \frac{1}{n} \sum_{\substack{i=1 \\ i \ne m}}^{p} \sum_{\substack{j=1 \\ j \ne k}}^{n} X_{m,k,i,j}, $$
a constant plus a sum of $(p-1)(n-1)$ independent $\pm 1$ random variables; suitably normalized, this sum has an approximate standard normal distribution (for large $n$).

Hence, if we denote by $Z$ a standard normal random variable,
$$ \mathrm{Prob}(C_k < -1) \approx \mathrm{Prob}\Bigl( Z < \frac{-n - (p-1)}{\sqrt{(n-1)(p-1)}} \Bigr) \approx \mathrm{Prob}\bigl( Z < -\sqrt{\tfrac{n}{p}} \bigr), $$
where we have ignored terms in $1/n$ and replaced $p - 1$ by $p$. Using the symmetry of the standard normal distribution, we can rewrite this as
$$ \mathrm{Prob}(C_k < -1) = \mathrm{Prob}\bigl( Z > \sqrt{\tfrac{n}{p}} \bigr). $$
Suppose that we require that the probability of an incorrect bit be no greater than 0.01 (or 1%). Then, from statistical tables, we find that $\mathrm{Prob}(Z > \sqrt{n/p}) \le 0.01$ requires that $\sqrt{n/p} \ge 2.326$. That is, we require $n/p \ge (2.326)^2$, or $p/n \le 0.185$. Now, to say that any particular bit is incorrectly recalled with probability 0.01 is to say that the average number of incorrect bits (from a large sample) is 1% of the total. We have therefore shown that if we are prepared to accept up to 1% bad bits in our recalled patterns (on average) then we can expect to be able to store no more than $p = 0.185\,n$ patterns in our autoassociative system. That is, the storage capacity (with a 1% error tolerance) is $0.185\,n$.
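The following Monte Carlo sketch (an illustration under the spirit of the heuristic analysis above, using random rather than orthogonal bipolar patterns) estimates the single-pass bit error rate at $p = 0.185\,n$ and compares it with the normal-tail estimates; note that the cruder $\mathrm{Prob}(Z > \sqrt{n/p})$ figure used to derive $0.185\,n$ is a conservative over-estimate of the sharper normal-tail value:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(2)

n = 500
p = int(0.185 * n)            # storage level at the 1% tolerance figure derived above

trials, errors, bits = 200, 0, 0
for _ in range(trials):
    X = rng.choice([-1, 1], size=(p, n))     # random bipolar patterns (not orthogonal)
    M = X.T @ X / n                          # autoassociative correlation memory matrix
    m = rng.integers(p)
    out = np.where(M @ X[m] >= 0, 1, -1)     # one pass through the bipolar quantizer
    errors += np.sum(out != X[m])
    bits += n

# Normal-tail estimate Prob(Z > (n + p - 1)/sqrt((n-1)(p-1))) from the analysis above;
# the cruder design figure uses Prob(Z > sqrt(n/p)) instead.
z = (n + p - 1) / sqrt((n - 1) * (p - 1))
print("observed bit error rate :", errors / bits)
print("normal-tail estimate    :", 0.5 * erfc(z / sqrt(2)))
print("crude 1% design estimate:", 0.5 * erfc(sqrt(n / p) / sqrt(2)))
```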

Generalized inverse matrix memory

We have seen that the success of the correlation memory matrix, or Hebbian learning, is limited by the appearance of the cross-talk term. We shall derive an alternative system based on the idea of minimization of the output distortion or error.

Let us start again (and with a change of notation). We wish to construct an associative memory system which matches input patterns $a^{(1)}, \dots, a^{(p)}$ (from $\mathbb{R}^n$) with output pattern vectors $b^{(1)}, \dots, b^{(p)}$ (in $\mathbb{R}^m$), respectively. The question is whether or not we can find a matrix $M \in \mathbb{R}^{m \times n}$, the set of $m \times n$ real matrices, such that
$$ M a^{(i)} = b^{(i)} $$
for all $1 \le i \le p$. Let $A \in \mathbb{R}^{n \times p}$ be the matrix whose columns are the vectors $a^{(1)}, \dots, a^{(p)}$, i.e., $A = (a^{(1)} \cdots a^{(p)})$, and let $B \in \mathbb{R}^{m \times p}$ be the matrix with columns given by the $b^{(i)}$s, $B = (b^{(1)} \cdots b^{(p)})$, thus $A_{ij} = a^{(j)}_i$ and $B_{ij} = b^{(j)}_i$. Then it is easy to see that $M a^{(i)} = b^{(i)}$, for all $i$, is equivalent to $MA = B$. The problem, then, is to solve the matrix equation
$$ M A = B $$
for $M \in \mathbb{R}^{m \times n}$, for given matrices $A \in \mathbb{R}^{n \times p}$ and $B \in \mathbb{R}^{m \times p}$.

First, we observe that for a solution to exist, the matrices $A$ and $B$ cannot be arbitrary. Indeed, if $A = 0$, then so is $MA$ no matter what $M$ is, so the equation will not hold unless $B$ also has all zero entries.

Suppose next, slightly more subtly, that there is some non-zero vector $v \in \mathbb{R}^p$ such that $Av = 0$. Then, for any $M$, $MAv = 0$. In general, it need not be true that $Bv = 0$, and if $Bv \ne 0$ then the equation $MA = B$ is impossible.

Suppose then that there is no such non-zero $v \in \mathbb{R}^p$ with $Av = 0$, i.e., we are supposing that $Av = 0$ implies that $v = 0$. What does this mean? We have
$$ Av = v_1 a^{(1)} + \dots + v_p a^{(p)}. $$
Now, the statement that $Av = 0$ if and only if $v = 0$ is equivalent to the statement that $v_1 a^{(1)} + \dots + v_p a^{(p)} = 0$ if and only if $v_1 = v_2 = \dots = v_p = 0$, which, in turn, is equivalent to the statement that $a^{(1)}, \dots, a^{(p)}$ are linearly independent vectors in $\mathbb{R}^n$.

Thus, the statement, $Av = 0$ if and only if $v = 0$, is true if and only if the columns of $A$ are linearly independent vectors in $\mathbb{R}^n$.

Proposition 1.5. For any $A \in \mathbb{R}^{n \times p}$, the $p \times p$ matrix $A^T A$ is invertible if and only if the columns of $A$ are linearly independent in $\mathbb{R}^n$.

Proof. The square matrix $A^T A$ is invertible if and only if the equation $A^T A v = 0$ has the unique solution $v = 0$, $v \in \mathbb{R}^p$. (Certainly the invertibility of $A^T A$ implies the uniqueness of the zero solution to $A^T A v = 0$. For the converse, first note that the uniqueness of this zero solution implies that $A^T A$ is a one-one linear mapping from $\mathbb{R}^p$ to $\mathbb{R}^p$. Moreover, using linearity, one readily checks that the collection $A^T A u_1, \dots, A^T A u_p$ is a linearly independent set for any basis $u_1, \dots, u_p$ of $\mathbb{R}^p$. This means that it is a basis and so $A^T A$ maps $\mathbb{R}^p$ onto itself. Hence $A^T A$ has an inverse. Alternatively, one can argue that since $A^T A$ is symmetric it can be diagonalized via some orthogonal transformation. But a diagonal matrix is invertible if and only if every diagonal entry is non-zero. In this case, these entries are precisely the eigenvalues of $A^T A$. So $A^T A$ is invertible if and only if none of its eigenvalues are zero.)

Suppose that the columns of $A$ are linearly independent and that $A^T A v = 0$. Then it follows that $v^T A^T A v = 0$ and so $Av = 0$, since $v^T A^T A v = \|Av\|^2$. By the linear independence of the columns of $A$, we deduce that $v = 0$. Hence $A^T A$ is invertible.

On the other hand, if $A^T A$ is invertible, then $Av = 0$ implies that $A^T A v = 0$ and so $v = 0$. Hence the columns of $A$ are linearly independent.

We can now derive the result of interest here.

Theorem 1.6. Let $A$ be any $n \times p$ matrix whose columns are linearly independent. Then for any $m \times p$ matrix $B$, there is an $m \times n$ matrix $M$ such that $MA = B$.

Indeed, by Proposition 1.5, $A^T A$ is invertible, and $M = B (A^T A)^{-1} A^T$ satisfies $M A = B (A^T A)^{-1} A^T A = B$.

Can we see what $M$ looks like in terms of the patterns $a^{(i)}$, $b^{(i)}$? The answer is "yes and no". We have $A = (a^{(1)} \cdots a^{(p)})$ and $B = (b^{(1)} \cdots b^{(p)})$. Then
$$ (A^T A)_{ij} = a^{(i)T} a^{(j)}, $$
which gives $A^T A$ directly in terms of the $a^{(i)}$s. Let $Q = A^T A \in \mathbb{R}^{p \times p}$. Then
$$ M = B\, Q^{-1} A^T = \sum_{i,j=1}^{p} (Q^{-1})_{ij}\, b^{(i)} a^{(j)T}. $$
This formula for $M$, valid for linearly independent input patterns, expresses $M$ more or less in terms of the patterns. The appearance of the inverse, $Q^{-1}$, somewhat lessens its appeal, however.

To discuss the case where the columns of $A$ are not necessarily linearly independent, we need to consider the notion of generalized inverse.

Definition 1.7. For any given matrix $A \in \mathbb{R}^{m \times n}$, the matrix $X \in \mathbb{R}^{n \times m}$ is said to be a generalized inverse of $A$ if

(i) $AXA = A$,

(ii) $XAX = X$,

(iii) $(AX)^T = AX$,

(iv) $(XA)^T = XA$.

The terms pseudoinverse or Moore-Penrose inverse are also commonly used for such an $X$.

Examples 1.8.

1. If $A \in \mathbb{R}^{n \times n}$ is invertible, then $A^{-1}$ is the generalized inverse of $A$.

2. If $A = \alpha \in \mathbb{R}^{1 \times 1}$, then $X = 1/\alpha$ is the generalized inverse provided $\alpha \ne 0$. If $\alpha = 0$, then $X = 0$ is the generalized inverse.

3. The generalized inverse of $A = 0 \in \mathbb{R}^{m \times n}$ is $X = 0 \in \mathbb{R}^{n \times m}$.

4. If $A = u \in \mathbb{R}^{m \times 1}$, $u \ne 0$, then one checks that $X = u^T/(u^T u)$ is a generalized inverse of $u$ (a numerical check is sketched below).
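Example 4 can be checked numerically; here is a minimal sketch using NumPy's pseudoinverse routine (the particular vector is an arbitrary choice):

```python
import numpy as np

u = np.array([[1.0], [2.0], [-2.0]])          # a non-zero column vector in R^{3x1}
X = u.T / (u.T @ u)                           # the claimed generalized inverse u^T / (u^T u)

print(np.allclose(np.linalg.pinv(u), X))      # True: NumPy's pseudoinverse agrees

# The four Moore-Penrose conditions can also be checked directly:
print(np.allclose(u @ X @ u, u), np.allclose(X @ u @ X, X))
print(np.allclose((u @ X).T, u @ X), np.allclose((X @ u).T, X @ u))
```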

The following result is pertinent to the theory.

Theorem 1.9. Every matrix possesses a unique generalized inverse.

Proof. We postpone discussion of existence (which can be established via the Singular Value Decomposition) and just show uniqueness. This follows by repeated use of the defining properties (i),...,(iv). Let $A \in \mathbb{R}^{m \times n}$ be given and suppose that $X, Y \in \mathbb{R}^{n \times m}$ are generalized inverses of $A$. Then
$$ X = XAX = X(AX)^T = XX^T A^T = XX^T (AYA)^T = X(AX)^T (AY)^T = XAXAY = XAY, $$
and, similarly,
$$ Y = YAY = (YA)^T Y = A^T Y^T Y = (AXA)^T Y^T Y = (XA)^T (YA)^T Y = XAYAY = XAY. $$
Hence $X = XAY = Y$.

Notation. For given $A \in \mathbb{R}^{m \times n}$, we denote its generalized inverse by $A^{\#}$. It is also often written as $A^g$, $A^+$ or $A^{\dagger}$.

Proposition 1.10. For any $A \in \mathbb{R}^{m \times n}$, $AA^{\#}$ is the orthogonal projection onto $\operatorname{ran} A$, the linear span in $\mathbb{R}^m$ of the columns of $A$, i.e., if $P = AA^{\#} \in \mathbb{R}^{m \times m}$, then $P = P^T = P^2$ and $P$ maps $\mathbb{R}^m$ onto $\operatorname{ran} A$.

Proof. The defining property (iii) of the generalized inverse $A^{\#}$ is precisely the statement that $P = AA^{\#}$ is symmetric. Furthermore,
$$ P^2 = AA^{\#} A A^{\#} = AA^{\#} = P, \quad \text{by condition (i)}, $$
so $P$ is idempotent. Thus $P$ is an orthogonal projection.

For any $x \in \mathbb{R}^m$, we have that $Px = AA^{\#} x \in \operatorname{ran} A$, so that $P : \mathbb{R}^m \to \operatorname{ran} A$. On the other hand, if $x \in \operatorname{ran} A$, there is some $z \in \mathbb{R}^n$ such that $x = Az$. Hence $Px = PAz = AA^{\#} A z = Az = x$, where we have used condition (i) in the penultimate step. Hence $P$ maps $\mathbb{R}^m$ onto $\operatorname{ran} A$.

Proposition 1.11. Let $A \in \mathbb{R}^{m \times n}$.

(i) If $\operatorname{rank} A = n$, then $A^{\#} = (A^T A)^{-1} A^T$.

(ii) If $\operatorname{rank} A = m$, then $A^{\#} = A^T (A A^T)^{-1}$.

Proof. If $\operatorname{rank} A = n$, then $A$ has linearly independent columns and we know that this implies that $A^T A$ is invertible (in $\mathbb{R}^{n \times n}$). It is now a straightforward matter to verify that $(A^T A)^{-1} A^T$ satisfies the four defining properties of the generalized inverse, which completes the proof of (i).

If $\operatorname{rank} A = m$, we simply consider the transpose instead. Let $B = A^T$. Then $\operatorname{rank} B = m$, since $A$ and $A^T$ have the same rank, and so, by the argument above, $B^{\#} = (B^T B)^{-1} B^T$. However, $(A^T)^{\#} = (A^{\#})^T$, as is easily checked (again from the defining conditions). Hence
$$ A^{\#} = \bigl( (A^{\#})^T \bigr)^T = \bigl( (A^T)^{\#} \bigr)^T = (B^{\#})^T = B (B^T B)^{-1} = A^T (A A^T)^{-1}, $$
which establishes (ii).
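A quick numerical check of Proposition 1.11 for a randomly chosen full-rank matrix (a sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(3)

A = rng.standard_normal((6, 3))                 # rank 3 = number of columns (almost surely)

lhs = np.linalg.pinv(A)
rhs = np.linalg.solve(A.T @ A, A.T)             # (A^T A)^{-1} A^T, without forming the inverse explicitly
print(np.allclose(lhs, rhs))                    # True: case (i), rank A = n

B = A.T                                         # now the rank equals the number of rows
print(np.allclose(np.linalg.pinv(B), B.T @ np.linalg.inv(B @ B.T)))   # True: case (ii)
```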

Definition 1.12. The $\|\cdot\|_F$-norm on $\mathbb{R}^{m \times n}$ is defined by
$$ \|A\|_F^2 = \operatorname{Tr}(A^T A) = \sum_{i,j} A_{ij}^2 \quad \text{(sum of squares of all entries of } A\text{)}, \qquad A \in \mathbb{R}^{m \times n}, $$
where $\operatorname{Tr}(B)$ is the trace of the square matrix $B$; $\operatorname{Tr}(B) = \sum_i B_{ii}$. This norm is called the Frobenius norm and is sometimes denoted $\|\cdot\|_2$.

We also note, here, that clearly $\|A\|_F = \|A^T\|_F$.

Suppose that $A = u \in \mathbb{R}^{m \times 1}$, an $m$-component vector. Then $\|A\|_F^2 = \sum_{i=1}^{m} u_i^2$, that is, $\|A\|_F$ is the usual Euclidean norm in this case. Thus the notation $\|\cdot\|_2$ for this norm is consistent. Note that, generally, $\|A\|_F$ is just the Euclidean norm of $A \in \mathbb{R}^{m \times n}$ when $A$ is "taken apart" row by row and considered as a vector in $\mathbb{R}^{mn}$ via the correspondence $A \leftrightarrow (A_{11}, A_{12}, \dots, A_{1n}, A_{21}, \dots, A_{mn})$.

The notation $\|A\|_2$ is sometimes used in numerical analysis texts (and in the computer algebra software package Maple) to mean the norm of $A$ as a linear map from $\mathbb{R}^n$ into $\mathbb{R}^m$, that is, the value $\sup\{\|Ax\| : x \in \mathbb{R}^n \text{ with } \|x\| = 1\}$. One can show that this value is equal to the square root of the largest eigenvalue of $A^T A$, whereas $\|A\|_F$ is equal to the square root of the sum of the eigenvalues of $A^T A$.
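A small numerical illustration of the two norms and their eigenvalue characterizations (the random matrix is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))

eigs = np.linalg.eigvalsh(A.T @ A)              # eigenvalues of A^T A (all non-negative)

fro = np.linalg.norm(A, 'fro')                  # Frobenius norm
op = np.linalg.norm(A, 2)                       # operator (spectral) norm sup{||Ax|| : ||x|| = 1}

print(np.isclose(fro, np.sqrt(eigs.sum())))     # sqrt of the sum of the eigenvalues of A^T A
print(np.isclose(op, np.sqrt(eigs.max())))      # sqrt of the largest eigenvalue of A^T A
```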

Remark 1.13. Let $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times m}$, $C \in \mathbb{R}^{m \times n}$, and $X \in \mathbb{R}^{p \times p}$. Then it is easy to see that

(i) $\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$,

(ii) $\operatorname{Tr}(X) = \operatorname{Tr}(X^T)$,

(iii) $\operatorname{Tr}(AC^T) = \operatorname{Tr}(C^T A) = \operatorname{Tr}(A^T C) = \operatorname{Tr}(C A^T)$.

The equalities in (iii) can each be verified directly, or alternatively, one notices that (iii) is a consequence of (i) and (ii) (replacing $B$ by $C^T$).

Lemma 1.14. For $A \in \mathbb{R}^{m \times n}$, $A^{\#} A A^T = A^T$.

Theorem 1.15. Let $A \in \mathbb{R}^{n \times p}$ and $B \in \mathbb{R}^{m \times p}$ be given. Then $X = BA^{\#}$ is an element of $\mathbb{R}^{m \times n}$ which minimizes the quantity $\|XA - B\|_F$.

Indeed, writing $XA - B = (X - BA^{\#})A + B(A^{\#}A - \mathbb{1}_p)$ and using Lemma 1.14 to see that the cross terms vanish, we get
$$ \|XA - B\|_F^2 = \|(X - BA^{\#})A\|_F^2 + \|B(A^{\#}A - \mathbb{1}_p)\|_F^2, $$
which achieves its minimum, $\|B(A^{\#}A - \mathbb{1}_p)\|_F^2$, when $X = BA^{\#}$.

Note that any $X$ satisfying $XA = BA^{\#}A$ gives a minimum solution. If $A^T$ has full column rank (or, equivalently, $A^T$ has no kernel), then $AA^T$ is invertible. Multiplying on the right by $A^T(AA^T)^{-1}$ gives $X = BA^{\#}$. So under this condition on $A^T$, we see that there is a unique solution $X = BA^{\#}$ minimizing $\|XA - B\|_F$.

In general, one can show that $BA^{\#}$ is that element with minimal $\|\cdot\|_F$-norm which minimizes $\|XA - B\|_F$, i.e., if $Y \ne BA^{\#}$ and $\|YA - B\|_F = \|BA^{\#}A - B\|_F$, then $\|BA^{\#}\|_F < \|Y\|_F$.

Now let us return to our problem of finding a memory matrix which stores the input-output pattern pairs $(a^{(i)}, b^{(i)})$, $1 \le i \le p$, with each $a^{(i)} \in \mathbb{R}^n$ and each $b^{(i)} \in \mathbb{R}^m$. In general, it may not be possible to find a matrix $M \in \mathbb{R}^{m \times n}$ such that $M a^{(i)} = b^{(i)}$ for each $i$. Whatever our choice of $M$, the system output corresponding to the input $a^{(i)}$ is just $M a^{(i)}$. So, failing equality $M a^{(i)} = b^{(i)}$, we would at least like to minimize the error $b^{(i)} - M a^{(i)}$. A measure of such an error is $\|b^{(i)} - M a^{(i)}\|_2^2$, the squared Euclidean norm of the difference. Taking all $p$ patterns into account, the total system recall error is taken to be
$$ \|B - MA\|_F^2. $$
We have seen that this is minimized by the choice $M = BA^{\#}$, where $A^{\#}$ is the generalized inverse of $A$. The memory matrix $M = BA^{\#}$ is called the optimal linear associative memory (OLAM) matrix.
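A minimal sketch of the OLAM construction using NumPy's pseudoinverse (the dimensions and random patterns are illustrative assumptions); it also compares the recall error with that of the correlation memory matrix:

```python
import numpy as np

rng = np.random.default_rng(5)

n, m, p = 20, 5, 8
A = rng.standard_normal((n, p))     # input patterns a^(1),...,a^(p) as columns
B = rng.standard_normal((m, p))     # desired outputs b^(1),...,b^(p) as columns

M_olam = B @ np.linalg.pinv(A)      # optimal linear associative memory M = B A^#
M_corr = B @ A.T                    # correlation (Hebbian) memory matrix, for comparison

err = lambda M: np.linalg.norm(B - M @ A, 'fro')
print(err(M_olam), err(M_corr))     # the OLAM error is never larger

# With linearly independent columns (p <= n, generic A), the OLAM actually interpolates:
print(np.allclose(M_olam @ A, B))   # True here, since rank A = p
```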

Remark 1.16. If the patterns $\{a^{(1)}, \dots, a^{(p)}\}$ constitute an orthonormal family, then $A$ has independent columns and so $A^{\#} = (A^T A)^{-1} A^T = \mathbb{1}_p A^T = A^T$, so that the OLAM matrix is $BA^{\#} = BA^T$, which is exactly the correlation memory matrix.

In the autoassociative case, $b^{(i)} = a^{(i)}$, so that $B = A$ and the OLAM matrix is given as $M = AA^{\#}$, the orthogonal projection onto $\operatorname{ran} A$ (Proposition 1.10); this prescription is sometimes called the projection rule. Any input $x \in \mathbb{R}^n$ can be written as the sum of a component in $\operatorname{ran} A$ and a component orthogonal to it, and the OLAM output $Mx = AA^{\#}x$ is precisely the component lying in $\operatorname{ran} A$.

Pattern classification

We have discussed the distributed associative memory (DAM) matrix as an autoassociative or as a heteroassociative memory model. The first is mathematically just a special case of the second. Another special case is that of so-called classification. The idea is that one simply wants an input signal to elicit a response "tag", typically coded as one of a collection of orthogonal unit vectors, such as given by the standard basis vectors of $\mathbb{R}^m$.

• In operation, the input $x$ induces output $Mx$, which is then associated with that tag vector corresponding to its maximum component. In other words, if $(Mx)_j$ is the maximum component of $Mx$, then the output $Mx$ is associated with the $j$th tag.
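A minimal sketch of this classification scheme (illustrative dimensions, random prototypes, and the OLAM matrix as the memory):

```python
import numpy as np

rng = np.random.default_rng(6)

n, classes = 30, 4
prototypes = rng.standard_normal((n, classes))      # one input prototype per class (columns of A)
tags = np.eye(classes)                              # tag vectors: the standard basis of R^m, m = classes

M = tags @ np.linalg.pinv(prototypes)               # OLAM memory matrix mapping prototype -> tag

# Present a noisy version of prototype 2 and classify by the maximum component of Mx.
x = prototypes[:, 2] + 0.1 * rng.standard_normal(n)
label = np.argmax(M @ x)
print(label)                                        # 2, for small enough noise
```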

Examples of various pattern classification tasks have been given by T. Kohonen, P. Lehtiö, E. Oja, A. Kortekangas and K. Mäkisara, Demonstration of pattern processing properties of the optimal associative mappings, Proceedings of the International Conference on Cybernetics and Society, Washington, D.C., 581–585 (1977). (See also the article "Storage and Processing of Information in Distributed Associative Memory Systems" by T. Kohonen, P. Lehtiö and E. Oja in "Parallel Models of Associative Memory", edited by G. Hinton and J. Anderson, published by Lawrence Erlbaum Associates, (updated edition) 1989.)

In one such experiment, ten people were each photographed from five different angles, ranging from 45° to −45°, with 0° corresponding to a fully frontal face. These were then digitized to produce pattern vectors with eight possible intensity levels for each pixel. A distinct unit vector, a tag, was associated with each person, giving a total of ten tags and fifty patterns. The OLAM matrix was constructed from this data.

The memory matrix was then presented with a digitized photograph of one of the ten people, but taken from a different angle to any of the original five prototypes. The output was then classified according to the tag associated with its largest component. This was found to give correct identification.

The OLAM matrix was also found to perform well with autoassociation. Pattern vectors corresponding to one hundred digitized photographs were used to construct the autoassociative memory via the projection rule. When presented with incomplete or fuzzy versions of the original patterns, the OLAM matrix satisfactorily reconstructed the correct image.

In another autoassociative recall experiment, twenty-one different prototype images were used to construct the OLAM matrix. These were each composed of three similarly placed copies of a subimage. New pattern images, consisting of just one part of the usual triple features, were presented to the OLAM matrix. The output images consisted of slightly fuzzy versions of the single part, but triplicated so as to mimic the subimage positioning learned from the original twenty-one prototypes.

An analysis comparing the performance of the correlation memory matrix with that of the generalized inverse matrix memory has been offered by Cherkassky, Fassett and Vassilas (IEEE Trans. on Computers, 40, 1429 (1991)). Their conclusion is that the generalized inverse memory matrix performs better than the correlation memory matrix for autoassociation, but that the correlation memory matrix is better for classification. This is contrary to the widespread belief that the generalized inverse memory matrix is the superior model.


Adaptive Linear Combiner

We wish to consider a memory matrix for the special case of one-dimensional output vectors. Thus, we consider input pattern vectors $x^{(1)}, \dots, x^{(p)} \in \mathbb{R}^{\ell}$, say, with corresponding desired outputs $y^{(1)}, \dots, y^{(p)} \in \mathbb{R}$, and we seek a memory matrix $M \in \mathbb{R}^{1 \times \ell}$ such that $M x^{(i)} = y^{(i)}$ for each $i$.

Figure 2.1: The Adaptive Linear Combiner.

We have seen that we may not be able to find $M$ which satisfies the exact input-output relationship $M x^{(i)} = y^{(i)}$ for each $i$. The idea is to look for an $M$ which is in a certain sense optimal. To do this, we seek $m_1, \dots, m_{\ell}$

such that (one half) the average mean-squared error
$$ E = \frac{1}{2p} \sum_{i=1}^{p} \Bigl( y^{(i)} - \sum_{j=1}^{\ell} m_j x^{(i)}_j \Bigr)^2 $$
is minimized. This leads to an algorithmic approach to the construction of the appropriate memory matrix. We can write out $E$ in terms of the $m_i$ as follows:
$$ E = \tfrac{1}{2} \sum_{j,k=1}^{\ell} m_j A_{jk} m_k - \sum_{j=1}^{\ell} b_j m_j + \tfrac{1}{2}\, c, $$
where $A_{jk} = \frac{1}{p} \sum_{i=1}^{p} x^{(i)}_j x^{(i)}_k$, $b_j = \frac{1}{p} \sum_{i=1}^{p} y^{(i)} x^{(i)}_j$ and $c = \frac{1}{p} \sum_{i=1}^{p} y^{(i)2}$. Note that $A = (A_{jk}) \in \mathbb{R}^{\ell \times \ell}$ is symmetric. The error $E$ is a non-negative quadratic function of the $m_i$. For a minimum, we investigate the equalities $\partial E / \partial m_i = 0$, that is,
$$ \sum_{k=1}^{\ell} A_{ik} m_k - b_i = 0, \quad 1 \le i \le \ell, \quad\text{i.e.,}\quad A m = b $$
(the Wiener-Hopf equations). If $A$ is invertible, then $m = A^{-1} b$ is the unique solution. In general, there may be many solutions. For example, if $A$ is diagonal with $A_{11} = 0$, then necessarily $b_1 = 0$ (otherwise $E$ could not be non-negative as a function of the $m_i$) and so we see that $m_1$ is arbitrary. To relate this to the OLAM matrix, write $E$ as
$$ E = \frac{1}{2p}\, \|MX - Y\|_F^2, $$
where $X = (x^{(1)} \cdots x^{(p)}) \in \mathbb{R}^{\ell \times p}$ and $Y = (y^{(1)} \cdots y^{(p)}) \in \mathbb{R}^{1 \times p}$. This, we know, is minimized by $M = YX^{\#} \in \mathbb{R}^{1 \times \ell}$. Therefore $m = M^T$ must be a solution to the Wiener-Hopf equations above. We can write $A$, $b$ and $c$ in terms of the matrices $X$ and $Y$. One finds that $A = \frac{1}{p} XX^T$, $b^T = \frac{1}{p} YX^T$ and $c = \frac{1}{p} YY^T$. The equation $Am = b$ then becomes $XX^T m = XY^T$, giving $m = (XX^T)^{-1} XY^T$, provided that $A$ is invertible. This gives $M = m^T = YX^T(XX^T)^{-1} = YX^{\#}$, as above.
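A short numerical check (with arbitrary synthetic data) that solving the Wiener-Hopf equations and forming $M = YX^{\#}$ give the same answer:

```python
import numpy as np

rng = np.random.default_rng(7)

ell, p = 6, 40
X = rng.standard_normal((ell, p))        # input patterns as columns of X
Y = rng.standard_normal((1, p))          # scalar targets as a 1 x p row

A = X @ X.T / p                          # A_{jk} = (1/p) sum_i x_j^(i) x_k^(i)
b = (X @ Y.T / p).ravel()                # b_j = (1/p) sum_i y^(i) x_j^(i)

m = np.linalg.solve(A, b)                # Wiener-Hopf equations A m = b (A invertible here)
M = Y @ np.linalg.pinv(X)                # the OLAM solution M = Y X^#

print(np.allclose(m, M.ravel()))         # True: the two prescriptions agree
```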

One method of attack for finding a vector $m^*$ minimizing $E$ is that of gradient-descent. The idea is to think of $E(m_1, \dots, m_{\ell})$ as a bowl-shaped surface above the $\ell$-dimensional $m_1, \dots, m_{\ell}$-space. Pick any value for $m$. The vector $\operatorname{grad} E$, when evaluated at $m$, points in the direction of maximum increase of $E$ in the neighbourhood of $m$. That is to say, for small $\alpha$ (and a vector $v$ of given length), $E(m + \alpha v) - E(m)$ is maximized when $v$ points in the same direction as $\operatorname{grad} E$ (as is seen by Taylor's theorem). Now, rather than increasing $E$, we wish to minimize it. So the idea is to move a small distance from $m$ to $m - \alpha \operatorname{grad} E$, thus inducing maximal "downhill" movement on the error surface. By repeating this process, we hope to eventually reach a value of $m$ which minimizes $E$.

The strategy, then, is to consider a sequence of vectors $m(n)$ given algorithmically by
$$ m(n+1) = m(n) - \alpha \operatorname{grad} E, \quad \text{for } n = 1, 2, \dots, $$
with $m(1)$ arbitrary and where the parameter $\alpha$ is called the learning rate.

If we substitute for $\operatorname{grad} E = Am - b$, we find
$$ m(n+1) = m(n) + \alpha\bigl( b - A m(n) \bigr). $$
Now, $A$ is symmetric and so can be diagonalized. There is an orthogonal matrix $U \in \mathbb{R}^{\ell \times \ell}$ such that
$$ U A U^T = D = \operatorname{diag}(\lambda_1, \dots, \lambda_{\ell}), $$
and we may assume that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{\ell}$. We have, writing $z = Um$ and $v = Ub$,
$$ 2E = z^T D z - 2 v^T z + c = \sum_{j=1}^{\ell} \bigl( \lambda_j z_j^2 - 2 v_j z_j \bigr) + c. $$

Since $E \ge 0$, it follows that all $\lambda_i \ge 0$; otherwise $E$ would have a negative leading term. The recursion formula for $m(n)$, namely,
$$ m(n+1) = m(n) + \alpha\bigl( b - A m(n) \bigr), $$
gives
$$ z_j(n+1) = z_j(n) + \alpha\bigl( v_j - \lambda_j z_j(n) \bigr) = (1 - \alpha \lambda_j)\, z_j(n) + \alpha v_j. $$
Setting $\mu_j = (1 - \alpha \lambda_j)$, we have
$$ z_j(n+1) = \mu_j^{\,n}\, z_j(1) + \alpha v_j \bigl( 1 + \mu_j + \dots + \mu_j^{\,n-1} \bigr), $$
which converges as $n \to \infty$ if and only if $|\mu_j| < 1$, i.e., $-1 < 1 - \alpha \lambda_j < 1$. Thus, convergence demands the inequalities $0 < \alpha \lambda_j < 2$ for all $1 \le j \le \ell$. We therefore have shown that the algorithm
$$ m(n+1) = m(n) + \alpha\bigl( b - A m(n) \bigr), \quad n = 1, 2, \dots, $$
with $m(1)$ arbitrary, converges provided $0 < \alpha < \dfrac{2}{\lambda_{\max}}$, where $\lambda_{\max}$ is the maximum eigenvalue of $A$.
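A minimal sketch of this batch gradient-descent iteration with a learning rate chosen inside the bound $2/\lambda_{\max}$ (the synthetic data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)

ell, p = 5, 30
X = rng.standard_normal((ell, p))            # input patterns as columns
y = rng.standard_normal(p)                   # scalar desired outputs

A = X @ X.T / p                              # A_{jk} = (1/p) sum_i x_j^(i) x_k^(i)
b = X @ y / p                                # b_j = (1/p) sum_i y^(i) x_j^(i)

lam_max = np.linalg.eigvalsh(A).max()
alpha = 1.5 / lam_max                        # any 0 < alpha < 2/lambda_max guarantees convergence

m = np.zeros(ell)
for _ in range(10000):
    m = m + alpha * (b - A @ m)              # m(n+1) = m(n) + alpha (b - A m(n))

print(np.linalg.norm(A @ m - b))             # essentially zero: m solves the Wiener-Hopf equations
```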

Suppose that $m(1)$ is given and that $\alpha$ does indeed satisfy the inequalities $0 < \alpha < 2/\lambda_{\max}$. Let $m^*$ denote the limit $\lim_{n \to \infty} m(n)$. Then, letting $n \to \infty$ in the recursion formula for $m(n)$, we see that
$$ m^* = m^* + \alpha\bigl( b - A m^* \bigr), $$
that is, $m^*$ satisfies $A m^* = b$, and so $m(n)$ does, indeed, converge to a value minimizing $E$. Indeed, if $m^*$ satisfies $A m^* = b$, then we can complete the square and write $2E$ as
$$ 2E(m) = (m - m^*)^T A\, (m - m^*) + \bigl( c - m^{*T} A m^* \bigr), $$
which is clearly minimized when $m = m^*$ (since $A$ has non-negative eigenvalues).

The above analysis requires a detailed knowledge of the matrix $A$. In particular, its eigenvalues must be determined in order for us to be able to choose a valid value for the learning rate $\alpha$. We would like to avoid having to worry too much about this detailed structure of $A$.

We recall that $A = (A_{jk})$ is given by
$$ A_{jk} = \frac{1}{p} \sum_{i=1}^{p} x^{(i)}_j x^{(i)}_k, $$
an average over the patterns, so that $x^{(i)}_j x^{(i)}_k$ can be regarded as an estimate for $A_{jk}$; similarly, we may regard $b_j = \frac{1}{p} \sum_{i=1}^{p} y^{(i)} x^{(i)}_j$ as an average, and $y^{(i)} x^{(i)}_j$ as an estimate for $b_j$. Accordingly, we change our algorithm for updating the memory matrix to the following.

Select an input-output pattern pair, $(x^{(i)}, y^{(i)})$, say, and use the previous algorithm but with $A_{jk}$ and $b_j$ "estimated" as above. Thus,
$$ m_j(n+1) = m_j(n) + \alpha\, \delta^{(i)} x^{(i)}_j, \quad \text{where } \delta^{(i)} = y^{(i)} - \sum_{k=1}^{\ell} m_k(n)\, x^{(i)}_k = (\text{desired output} - \text{actual output}) $$
is the output error for pattern pair $i$. This is known as the delta-rule, or the Widrow-Hoff learning rule, or the least mean square (LMS) algorithm.

The learning rule is then as follows.

Widrow-Hoff (LMS) algorithm

• First choose a value for $\alpha$, the learning rate (in practice, this might be 0.1 or 0.05, say).

• Start with $m_j(1) = 0$ for all $j$, or perhaps with small random values.

• Keep selecting input-output pattern pairs $x^{(i)}, y^{(i)}$ and update $m(n)$ by the rule
$$ m_j(n+1) = m_j(n) + \alpha\, \delta^{(i)} x^{(i)}_j, \quad 1 \le j \le \ell, $$
where $\delta^{(i)} = y^{(i)} - \sum_{k=1}^{\ell} m_k(n)\, x^{(i)}_k$ is the output error for the pattern pair $(i)$ as determined by the memory matrix in operation at iteration step $n$. Ensure that every pattern pair is regularly presented and continue until the output error has reached and appears to remain at an acceptably small value (a minimal sketch of this loop is given after the list below).

• The actual question of convergence still remains to be discussed!
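Here is the minimal sketch promised above (synthetic, consistent data are chosen for illustration, so the output error can actually reach zero):

```python
import numpy as np

rng = np.random.default_rng(9)

ell, p = 4, 20
X = rng.standard_normal((ell, p))            # input patterns x^(1),...,x^(p) as columns
m_true = rng.standard_normal(ell)            # hidden generating weights (so targets are consistent)
y = m_true @ X

alpha = 0.05                                 # learning rate
m = np.zeros(ell)                            # start with m_j(1) = 0

for cycle in range(500):
    for i in range(p):                       # present every pattern pair regularly
        delta = y[i] - m @ X[:, i]           # output error for pattern pair i
        m = m + alpha * delta * X[:, i]      # Widrow-Hoff update

print(np.max(np.abs(m @ X - y)))             # residual output errors: close to zero
print(np.max(np.abs(m - m_true)))            # m is close to the generating weights
```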

Remark 2.1. If $\alpha$ is too small, we might expect the convergence to be slow: the adjustments to $m$ are small if $\alpha$ is small. Of course, this is assuming that there is convergence. Similarly, if the output error $\delta$ is small then changes to $m$ will also be small, thus slowing convergence. This could happen if $m$ enters an error surface "valley" with an almost flat bottom.

On the other hand, if $\alpha$ is too large, then the $m$s may overshoot and oscillate about an optimal solution. In practice, one might start with a largish value for $\alpha$ but then gradually decrease it as the learning progresses. These comments apply to any kind of gradient-descent algorithm.

Remark 2.2. Suppose that, instead of basing our discussion on the error function $E$, we present the $i$th input vector $x^{(i)}$ and look at the immediate output "error"
$$ E_i(m) = \tfrac{1}{2}\Bigl( y^{(i)} - \sum_{k=1}^{\ell} m_k x^{(i)}_k \Bigr)^2. $$
Gradient descent applied to $E_i$ alone gives the update rule
$$ m(n+1) = m(n) + \alpha\, \delta^{(i)} x^{(i)}, $$
with $m(1)$ arbitrary. This is exactly what we have already arrived at above.

It should be clear from this point of view that there is no reason a priori to suppose that the algorithm converges. Indeed, one might be more inclined to suspect that the $m$-values given by this rule simply "thrash about all over the place" rather than settling down towards a limiting value.

Remark 2.3. We might wish to consider the input-output patterns $x$ and $y$ as random variables taking values in $\mathbb{R}^{\ell}$ and $\mathbb{R}$, respectively. In this context, it would be natural to consider the minimization of $\mathbb{E}\bigl( (y - m^T x)^2 \bigr)$. The analysis proceeds exactly as above, but now with $A_{jk} = \mathbb{E}(x_j x_k)$ and $b_j = \mathbb{E}(y x_j)$. The idea of using the current, i.e., the observed, values of $x$ and $y$ to construct estimates for $A$ and $b$ is a common part of standard statistical theory. The algorithm is then
$$ m(n+1) = m(n) + \alpha\bigl( y(n) - m(n)^T x(n) \bigr)\, x(n), $$
where $x(n)$ and $y(n)$ are the input-output pattern pair presented at step $n$. If we assume that the patterns presented at the various steps are independent, then, from the algorithm, we see that $m_k(n)$ only depends on the patterns presented before step $n$ and so is independent of $x(n)$. Taking expectations, we obtain the vector equation
$$ \mathbb{E}\bigl( m(n+1) \bigr) = \mathbb{E}\bigl( m(n) \bigr) + \alpha\bigl( b - A\, \mathbb{E}(m(n)) \bigr). $$

We now consider the general case of input-output pattern pairs with $a^{(i)} \in \mathbb{R}^{\ell}$ and $b^{(i)} \in \mathbb{R}^m$, $1 \le i \le p$. Taking $m = 1$, we recover the ALC, as above. We seek an algorithmic approach to minimizing the total system error. We have seen that $E(M) = \tfrac{1}{2}\|B - MA\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm, $A = (a^{(1)} \cdots a^{(p)}) \in \mathbb{R}^{\ell \times p}$ and $B = (b^{(1)} \cdots b^{(p)}) \in \mathbb{R}^{m \times p}$, and that a solution to the problem is given by $M = BA^{\#}$.

Write $E(M) = \sum_{i=1}^{p} E^{(i)}(M)$, where $E^{(i)}(M) = \tfrac{1}{2}\|b^{(i)} - M a^{(i)}\|^2$. Each $E^{(i)}$ is a function of the elements $M_{jk}$ of the memory matrix $M$. Calculating the partial derivatives gives
$$ \frac{\partial E^{(i)}}{\partial M_{jk}} = -\bigl( b^{(i)}_j - (M a^{(i)})_j \bigr)\, a^{(i)}_k, $$
which suggests the per-pattern update rule
$$ M(n+1) = M(n) + \alpha\bigl( b^{(i)} - M(n)\, a^{(i)} \bigr)\, a^{(i)T}. $$

• The patterns $(a^{(i)}, b^{(i)})$ are presented one after the other, $(a^{(1)}, b^{(1)}), (a^{(2)}, b^{(2)}), \dots, (a^{(p)}, b^{(p)})$, thus constituting a (pattern) cycle. The cycle is then repeated, so that the presentation following pattern $p$ in cycle $n$ is actually the presentation of pattern 1 in cycle $n + 1$.

Remark 2.4. The gradient of the total error function $E$ is given by the sum over $i$ of the gradients of the terms $E^{(i)}$. A single step of the algorithm uses only one pattern, so that particular $E^{(i)}$ will decrease, but it could happen that $E$ actually increases. The point is that the algorithm is not a standard gradient-descent algorithm and so standard convergence arguments are not applicable. A separate proof of convergence must be given.

Remark 2.5. When $m = 1$, the output vectors are just real numbers and we recover the adaptive linear combiner and the Widrow-Hoff rule as a special case.

Remark 2.6. The algorithm is "local" in the sense that it only involves information available at the time of each presentation, i.e., it does not need to remember any of the previously seen examples.

The following result is due to Kohonen.

Theorem 2.7. Suppose that $\alpha_n = \alpha > 0$ is fixed. Then, for each $i = 1, \dots, p$, the sequence $M^{(i)}(n)$ converges to some matrix $M^{(i)}_{\alpha}$ depending on $\alpha$ and $i$. Moreover,
$$ \lim_{\alpha \downarrow 0} M^{(i)}_{\alpha} = BA^{\#}, $$
for each $i = 1, \dots, p$.

Remark 2.8. In general, the limit matrices $M^{(i)}_{\alpha}$ are different for different $i$.

We shall investigate a simple example to illustrate the theory (following Luo).

Example 2.9. Consider the case when there is a single input node, so that the memory matrix $M \in \mathbb{R}^{1 \times 1}$ is just a real number, $m$, say.

Figure 2.2: The ALC with one input node (in → $M \in \mathbb{R}^{1 \times 1}$ → out).

We shall suppose that the system is to learn the two pattern pairs $(1, c_1)$ and $(-1, c_2)$. Then the total system error function is
$$ E = \tfrac{1}{2}(c_1 - m)^2 + \tfrac{1}{2}(c_2 + m)^2, $$
where $M_{11} = m$, as above. In each cycle, the pair $(1, c_1)$ is presented and then the pair $(-1, c_2)$; let $m^{(1)}(n)$ and $m^{(2)}(n)$ denote the values of $m$ held just before the respective presentations in cycle $n$. The LMS algorithm, in this case, becomes
$$ m^{(2)}(n) = (1 - \alpha)\, m^{(1)}(n) + \alpha c_1, \qquad m^{(1)}(n+1) = (1 - \alpha)\, m^{(2)}(n) - \alpha c_2, $$
so that
$$ m^{(1)}(n+1) = \lambda\, m^{(1)}(n) + \beta, \quad \text{with } \lambda = (1 - \alpha)^2 \text{ and } \beta = \alpha(1 - \alpha)\, c_1 - \alpha c_2. $$
We see that $\lim_{n \to \infty} m^{(1)}(n+1) = \beta/(1 - \lambda)$, provided that $|\lambda| < 1$. This condition is equivalent to $(1 - \alpha)^2 < 1$, or $|1 - \alpha| < 1$, which is the same as $0 < \alpha < 2$. The limits are
$$ m^{(1)}_{\alpha} = \frac{\beta}{1 - \lambda} = \frac{(1 - \alpha)\, c_1 - c_2}{2 - \alpha} \quad\text{and, similarly,}\quad m^{(2)}_{\alpha} = \frac{c_1 - (1 - \alpha)\, c_2}{2 - \alpha}, $$
whereas $m^* = \tfrac{1}{2}(c_1 - c_2)$, which minimizes $E$, is the value for the OLAM "matrix". If $c_1 \ne c_2$ and $\alpha \ne 0$, then $m^{(1)}_{\alpha} \ne m^*$ and $m^{(2)}_{\alpha} \ne m^*$. Notice that both $m^{(1)}_{\alpha}$ and $m^{(2)}_{\alpha}$ converge to the OLAM solution $m^*$ as $\alpha \to 0$, and also the average $\tfrac{1}{2}\bigl( m^{(1)}(n) + m^{(2)}(n) \bigr)$ converges to $m^*$ as $n \to \infty$.

Now suppose that the learning rate is allowed to vary from cycle to cycle, so that both presentations in cycle $n$ use the rate $\alpha_n$. The recursion becomes
$$ m^{(1)}(n+1) = (1 - \alpha_n)\bigl[ (1 - \alpha_n)\, m^{(1)}(n) + \alpha_n c_1 \bigr] - \alpha_n c_2, $$
giving, with $y_n = m^{(1)}(n) - m^*$,
$$ y_{n+1} = (1 - \alpha_n)^2\, y_n - \alpha_n^2 \Bigl( \frac{c_1 + c_2}{2} \Bigr). $$
Next, we impose suitable conditions on the learning rates, $\alpha_n$, which will ensure convergence.

• Suppose that $0 \le \alpha_n < 1$, for all $n$, and that

(i) $\displaystyle\sum_{n=1}^{\infty} \alpha_n = \infty$ and (ii) $\displaystyle\sum_{n=1}^{\infty} \alpha_n^2 < \infty$,

that is, the series $\sum_{n=1}^{\infty} \alpha_n$ is divergent, whilst the series $\sum_{n=1}^{\infty} \alpha_n^2$ is convergent.

An example is provided by the assignment $\alpha_n = 1/n$. The intuition is that condition (i) ensures that the learning rate is always sufficiently large to push the iteration towards the desired limiting value, whereas condition (ii) ensures that its influence is not so strong that it might force the scheme into some kind of endless oscillatory behaviour.

Claim. The sequence $(y_n)$ converges to 0, as $n \to \infty$.

For convenience, set $y_1 = r_0$, $\beta_j = (1 - \alpha_j)^2$ and $r_j = -\alpha_j^2 (c_1 + c_2)/2$ for $j \ge 1$, so that $y_{n+1} = \beta_n y_n + r_n$. Then we can write $y_{n+1}$ as
$$ y_{n+1} = r_0 \beta_1 \beta_2 \cdots \beta_n + r_1 \beta_2 \cdots \beta_n + r_2 \beta_3 \cdots \beta_n + \dots + r_{n-1} \beta_n + r_n. $$
Let $\varepsilon > 0$ be given. We must show that there is some integer $N$ such that $|y_{n+1}| < \varepsilon$ whenever $n > N$. The idea of the proof is to split the sum in the expression for $y_{n+1}$ into two parts, and show that each can be made small for sufficiently large $n$. Thus, we write
$$ y_{n+1} = \bigl( r_0 \beta_1 \beta_2 \cdots \beta_n + \dots + r_m \beta_{m+1} \cdots \beta_n \bigr) + \bigl( r_{m+1} \beta_{m+2} \cdots \beta_n + \dots + r_{n-1} \beta_n + r_n \bigr) $$
and seek $m$ so that each of the two bracketed terms on the right hand side is smaller than $\varepsilon/2$ in absolute value. Since $0 \le \beta_j \le 1$, the second bracketed term is bounded in absolute value by
$$ \sum_{j=m+1}^{n} |r_j| \le \frac{|c_1 + c_2|}{2} \sum_{j > m} \alpha_j^2, $$
which, by condition (ii), can be made smaller than $\varepsilon/2$ by choosing $m$ sufficiently large; this deals with the second bracketed term in the expression for $y_{n+1}$.

To estimate the first term, we rewrite it as
$$ \bigl( r'_0 + r'_1 + \dots + r'_m \bigr)\, \beta_1 \beta_2 \cdots \beta_n, $$
where we have set $r'_0 = r_0$ and $r'_j = \dfrac{r_j}{\beta_1 \cdots \beta_j}$, for $j > 0$.

We claim that $\beta_1 \cdots \beta_n \to 0$ as $n \to \infty$. To see this, we use the inequality
$$ \log(1 - t) \le -t, \quad \text{for } 0 \le t < 1, $$
which can be derived as follows. We have
$$ -\log(1 - t) = \int_{1-t}^{1} \frac{dx}{x} \ge \int_{1-t}^{1} dx = t, $$
since $1/x \ge 1$ in the range of integration, which gives $\log(1 - t) \le -t$, as required. Using this, we may say that $\log(1 - \alpha_j) \le -\alpha_j$, and so $\sum_{j=1}^{n} \log(1 - \alpha_j) \le -\sum_{j=1}^{n} \alpha_j$. Thus
$$ \log(\beta_1 \cdots \beta_n) = 2 \sum_{j=1}^{n} \log(1 - \alpha_j) \le -2 \sum_{j=1}^{n} \alpha_j. $$
By condition (i), $\sum_{j=1}^{n} \alpha_j \to \infty$ as $n \to \infty$, which means that $\log(\beta_1 \cdots \beta_n) \to -\infty$ as $n \to \infty$, which, in turn, implies that $\beta_1 \cdots \beta_n \to 0$ as $n \to \infty$, as claimed.

Finally, we observe that for $m$ as above, the numbers $r'_0, r'_1, \dots, r'_m$ do not depend on $n$. Hence there is $N$, with $N > m$, such that
$$ \bigl| \bigl( r'_0 + r'_1 + \dots + r'_m \bigr)\, \beta_1 \beta_2 \cdots \beta_n \bigr| < \frac{\varepsilon}{2} $$
whenever $n > N$. This completes the proof that $y_n \to 0$ as $n \to \infty$.

It follows, therefore, that $m^{(1)}(n) \to m^* = \Bigl( \dfrac{c_1 - c_2}{2} \Bigr)$, as $n \to \infty$. To investigate $m^{(2)}(n)$, we use the relation $m^{(2)}(n) = (1 - \alpha_n)\, m^{(1)}(n) + \alpha_n c_1$, together with $\alpha_n \to 0$, to see that $m^{(2)}(n) \to m^*$ also.
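A numerical check of this example with the decaying rate $\alpha_n = 1/n$ (the values of $c_1$, $c_2$ are arbitrary choices):

```python
import numpy as np

c1, c2 = 2.0, 0.5
m_star = (c1 - c2) / 2            # the OLAM value m*

m = 0.0                           # m^(1)(1): arbitrary starting value
for n in range(1, 20001):
    alpha = 1.0 / n               # decaying rate: sum alpha_n diverges, sum alpha_n^2 converges
    m = m + alpha * (c1 - m)      # present the pair (1, c1)
    m = m - alpha * (c2 + m)      # present the pair (-1, c2)

print(m, m_star)                  # m has converged towards m* = (c1 - c2)/2
```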

Thus, for this special simple example, we have demonstrated the convergence of the LMS algorithm. The statement of the general case is as follows.

Theorem 2.10 (LMS Convergence Theorem). Suppose that the learning rate $\alpha_n$ in the LMS algorithm satisfies the conditions $0 \le \alpha_n < 1$, $\sum_{n=1}^{\infty} \alpha_n = \infty$ and $\sum_{n=1}^{\infty} \alpha_n^2 < \infty$. Then the sequence $m(n)$ produced by the algorithm converges to a value of $m$ minimizing the total error $E$.

The ALC can be used to "clean up" a noisy signal by arranging it as a transverse filter. Using delay mechanisms, the "noisy" input signal is sampled $n$ times, that is, its values at time steps $\tau, 2\tau, \dots, n\tau$ are collected. These $n$ values form the fan-in values for the ALC. The output error is $\varepsilon = |d - y|$, where $d$ is the desired output, i.e., the pure signal. The network is trained to minimize $\varepsilon$, so that the system output $y$ is as close to $d$ as possible (via the LMS error). Once trained, the network produces a "clean" version of the signal.

The ALC has applications in echo cancellation in long distance telephone calls. The 'phone handset contains a special circuit designed to distinguish between incoming and outgoing signals. However, there is a certain amount of signal leakage from the earpiece to the mouthpiece. When the caller speaks, the message is transmitted to the recipient's earpiece via satellite. Some of this signal then leaks across to the recipient's mouthpiece and is sent back to the caller. The time taken for this is about half a second, so that the caller hears an echo of his own voice. By appropriate use of the ALC in the circuit, this echo effect can be reduced.


Artificial Neural Networks

By way of background information, we consider some basic neurophysiology. It has been estimated that the human brain contains some $10^{11}$ nerve cells, or neurons, each having perhaps as many as $10^4$ interconnections, thus forming a densely packed web of fibres.

The neuron has three major components:

• the dendrites (constituting a vastly multibranching tree-like structurewhich collects inputs from other cells),

• the cell body (the processing part, called the soma),

• the axon (which carries electrical pulses to other cells)

Each neuron has only one axon, but it may branch out and so may be able to reach perhaps thousands of other cells. There are many dendrites (the word dendron is Greek for tree). The diameter of the soma is of the order of 10 microns.

The outgoing signal is in the form of a pulse down the axon. On arrival at a synapse (the junction where the axon meets a dendrite, or, indeed, any other part of another nerve cell), molecules called neurotransmitters are released. These cross the synaptic gap (the axon and receiving neuron do not quite touch) and attach themselves, very selectively, to receptor sites on the receiving neuron. The membrane of the target neuron is chemically affected and its own inclination to fire may be either enhanced or decreased. Thus, the incoming signal can be correspondingly either excitatory or inhibitory. Various drugs work by exploiting this behaviour. For example, curare deposits certain chemicals at particular receptor sites which artificially inhibit motor (muscular) stimulation by the brain cells. This results in the inability to move.

The containing wall of the cell is the cell membrane, a phospholipid bilayer. (A lipid is, by definition, a compound insoluble in water, such as oils and fatty acids.) These bilayers have a phosphoric acid head which is attracted to water and a glyceride tail which is repelled by water. This means that in a water solution they tend to line up in a double layer with the heads pointing outwards.

The membrane keeps most molecules from passing either in or out of the cell, but there are special channels allowing the passage of certain ions such as Na+, K+, Cl− and Ca++. By allowing such ions to pass in and out, a potential difference between the inside and the outside of the cell is maintained. The cell membrane is selectively more favourable to the passage of potassium than to sodium, so that the K+ ions could more easily diffuse out, but negative organic ions inside tend to pull K+ ions into the cell. The net result is that the K+ concentration is higher inside than outside, whereas the reverse is true of Na+ and Cl− ions. This results in a resting potential inside relative to outside of about −70 mV across the cell wall.

When an action potential reaches a synapse it causes a change in the permeability of the membrane of the cell carrying the pulse (the presynaptic membrane), which results in an influx of Ca++ ions. This leads to a release of neurotransmitters into the synaptic cleft, which diffuse across the gap and attach themselves at receptor sites on the membrane of the receiving cell (the postsynaptic membrane). As a consequence, the permeability of the postsynaptic membrane is altered. An influx of positive ions will tend to depolarize the receiving neuron (causing excitation), whereas an influx of negative ions will increase polarization and so inhibit activation.

Each input pulse is of the order of 1 millivolt and these diffuse towards the body of the cell where they are summed at the axon hillock. If there is sufficient depolarization, the membrane permeability changes and allows a large influx of Na+ ions. An action potential is generated and travels down the axon away from the main cell body and off to other neurons. The amplitude of this signal is of the order of tens of millivolts and its presence prevents the axon from transmitting further pulses. The shape and amplitude of this travelling pulse is very stable and is replicated at the branching points of the axon. This would indicate that the pulse seems not to carry any information other than to indicate its presence, i.e., the axon can be thought of as being in an all-or-none state.

Once triggered, the neuron is incapable of re-excitation for about one millisecond, during which time it is restored to its resting potential. This is called the refractory period. The existence of the refractory period limits the frequency of nerve-pulse transmissions to no more than about 1000 per second. In fact, this frequency can vary greatly, being mere tens of pulses per second in some cases. The impulse trains can be in the form of regular spikes, irregular spikes or in bursts.

The big question is how this massively interconnected network constituting the brain can not only control general functional behaviour but also give rise to phenomena such as personality, sleep and consciousness. It is also amazing how the brain can recognize something it has not "seen" before. For example, a piece of badly played music, or writing roughly done, say, can nevertheless be perfectly recognized, in the sense that there is no doubt in one's mind (sic) what the tune or the letters actually "are". (Indeed, surely no real-life experience can ever be an exact replica of a previous experience.) This type of ability seems to be very hard indeed to reproduce by computer. One should take care in discussions of this kind, since we are apparently talking about the functioning of the brain in self-referential terms. After all, perhaps if we knew (whatever "know" means) what a tune "is", i.e., how it relates to the brain via our hearing it (or, indeed, seeing the musical score written down), then we might be able to understand how we can recognize it even in some new distorted form.

In this connection, certain cells do seem to perform as so-called "feature detectors". One example is provided by auditory cells located at either side of the back of the brain near the base, which serve to locate the direction of sounds. These cells have two groups of dendrites receiving inputs originating, respectively, from the left ear and the right ear. For those cells in the left side of the brain, the inputs from the left ear inhibit activation, whereas those from the right are excitatory. The arrangement is reversed for those in the right side of the brain. This means that a sound coming from the right, say, will reach the right ear first and hence initially excite those auditory cells located in the left side of the brain but inhibit those in the right side. When the sound reaches the left ear, the reverse happens: the cells on the left become inhibited and those on the right side of the brain become excited. The change from strong excitation to strong inhibition can take place within a few hundred microseconds.

Another example of feature detection is provided by certain visual cells. Imagine looking at a circular region which is divided into two by a smaller concentric central disc and its surround. Then there are cells in the visual system which become excited when a light appears in the centre but for which activation is inhibited when a light appears in the surround. These are called "on centre–off surround" cells. There are also corresponding "off centre–on surround" cells.

We would like to devise mathematical models of networks inspired by (our understanding of) the workings of the brain. The study of such artificial neural networks may then help us to gain a greater understanding of the workings of the brain. In this connection, one might then strive to make the models more biologically realistic in a continuing endeavour to model the brain. Presumably one might imagine that sooner or later the detailed biochemistry of the neuron will have to be taken into account. Perhaps one

might even have to go right down to a quantum mechanical description. This seems to be a debatable issue, in that there is a school of thought which suggests that the details are not strictly relevant and that it is the overall cooperative behaviour which is important for our understanding of the brain. This situation is analogous to that of the study of thermodynamics and statistical mechanics. The former deals essentially with gross behaviour of physical systems, whilst the latter is concerned with a detailed atomic or molecular description in the hope of explaining the former. It turned out that the detailed (and quantum) description was needed to explain certain phenomena such as superconductivity. Perhaps this will turn out to be the case in neuroscience too.

On the other hand, one could simply develop the networks in any direction whatsoever and just consider them for their own sake (as part of a mathematical structure), or as tools in artificial intelligence and expert systems. Indeed, artificial neural networks have been applied in many areas including medical diagnosis, credit validation, stock market prediction, wine tasting and microwave cookers.

To develop the basic model, we shall think of the nervous system as mediated by the passage of electrical impulses between a vast web of interconnected cells: neurons. This network receives input from receptors, such as the rods and cones of the eye, or the hot and cold touch receptors of the skin. These inputs are then processed in some way by the neural net within the brain, and the result is the emission of impulses that control so-called effectors, such as muscles, glands etc., which result in the response. Thus, we have a three-stage system: input (via receptors), processing (via neural net) and output (via effectors).

To model the excitatory/inhibitory behaviour of the synapse, we shall assign suitable positive weights to excitatory synapses and negative weights to the inhibitory ones. The neuron will then "fire" if its total weighted input exceeds some threshold value. Having fired, there is a small delay, the refractory period, before it is capable of firing again. To take this into account, we consider a discrete time evolution by dividing the time scale into units equal to the refractory period. Our concern is whether any given neuron has "spiked" or not within one such period. This has the effect of "clocking" the evolution. We are thus led to the caricature illustrated in the figure.

The symbols $x_1, \dots, x_n$ denote the input values, $w_1, \dots, w_n$ denote the weights associated with the connections (terminating at the synapses), $u = \sum_{i=1}^{n} w_i x_i$ is the net (weighted) input, $\theta$ is the threshold, $v = u - \theta$ is called the activation potential, and $\varphi(\cdot)$ the activation function. The output $y$ is given by
$$ y = \varphi(\underbrace{u - \theta}_{v}) = \varphi(v). $$
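A minimal sketch of this caricature neuron, taking a Heaviside-type step as one simple choice of activation function $\varphi$ (the particular numbers are arbitrary):

```python
import numpy as np

def neuron_output(x, w, theta, phi=lambda v: np.where(v >= 0, 1, 0)):
    """Model neuron: net input u = w.x, activation potential v = u - theta, output y = phi(v)."""
    u = np.dot(w, x)            # net (weighted) input
    v = u - theta               # activation potential
    return phi(v)

x = np.array([1, 0, 1])         # inputs x_1, ..., x_n (e.g. "spiked or not" in the last time unit)
w = np.array([0.5, -1.0, 0.8])  # synaptic weights (positive = excitatory, negative = inhibitory)
theta = 1.0                     # threshold

print(neuron_output(x, w, theta))   # 1: the weighted input 1.3 exceeds the threshold
```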
