Copyright © 2001 John Wiley & Sons Ltd
ISBNs: 0-471-49517-4 (Hardback); 0-470-84535-X (Electronic)
6 Neural Networks as Nonlinear Adaptive Filters
6.1 Perspective
Neural networks, in particular recurrent neural networks, are cast into the framework of nonlinear adaptive filters. In this context, the relation between recurrent neural networks and polynomial filters is first established. Learning strategies and algorithms are then developed for neural adaptive system identifiers and predictors. Finally, issues concerning the choice of a neural architecture with respect to the bias and variance of the prediction performance are discussed.
6.2 Introduction
Representation of nonlinear systems in terms of NARMA/NARMAX models has been discussed at length in the work of Billings and others (Billings 1980; Chen and Billings 1989; Connor 1994; Nerrand et al. 1994). Some cognitive aspects of neural nonlinear filters are provided in Maass and Sontag (2000). Pearson (1995), in his article on nonlinear input–output modelling, shows that block oriented nonlinear models are a subset of the class of Volterra models. So, for instance, the Hammerstein model, which consists of a static nonlinearity f(·) applied at the output of a linear dynamical system described by its z-domain transfer function H(z), can be represented¹ by the Volterra series.
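To make the Volterra representation concrete, the following is a minimal sketch (in Python) of a second-order truncated Volterra filter; the quadratic truncation, the memory length and the kernel values in the usage example are assumptions made purely for illustration, not quantities taken from the text.

```python
import numpy as np

def volterra2(x, h0, h1, h2):
    """Second-order truncated Volterra filter:
    y(k) = h0 + sum_i h1[i] x(k-i) + sum_{i,j} h2[i,j] x(k-i) x(k-j),
    with i, j = 0, ..., M-1 where M = len(h1); kernels are hypothetical."""
    M = len(h1)
    y = np.zeros(len(x))
    for k in range(len(x)):
        # delay vector [x(k), x(k-1), ..., x(k-M+1)], zero-padded at the start
        u = np.array([x[k - i] if k - i >= 0 else 0.0 for i in range(M)])
        y[k] = h0 + h1 @ u + u @ h2 @ u
    return y

# Usage with arbitrary (hypothetical) kernels and a random input
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
h1 = np.array([0.5, 0.2, 0.1])
h2 = 0.05 * np.outer(h1, h1)          # hypothetical second-order kernel
y = volterra2(x, h0=0.0, h1=h1, h2=h2)
```

The full Volterra series continues this pattern with third- and higher-order kernels, which is precisely what makes it grow so quickly with memory length.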
In the previous chapter, we have shown that neural networks, be they feedforward or recurrent, cannot generate time delays of an order higher than the dimension of the input to the network. Another important feature is the capability to generate subharmonics in the spectrum of the output of a nonlinear neural filter (Pearson 1995). The key property for generating subharmonics in nonlinear systems is recursion; hence, recurrent neural networks are necessary for their generation.
1 Under the condition that the function f is analytic, and that the Volterra series can be thought of as a generalised Taylor series expansion, the coefficients of the model (6.2) that do not vanish are those for which all the indices coincide, i.e. h_{i,j,...,z} ≠ 0 ⇔ i = j = · · · = z.
Notice that, as pointed out in Pearson (1995), block-stochastic models are, generally speaking, not suitable for this application.
In Hakim et al. (1991), by using the Weierstrass polynomial expansion theorem, the relation between neural networks and Volterra series is established, which is then extended to a more general case and to continuous functions that cannot be expanded via a Taylor series expansion.² Both feedforward and recurrent networks are characterised by means of a Volterra series and vice versa.
Neural networks are often referred to as 'adaptive neural networks'. As already shown, adaptive filters and neural networks are formally equivalent, and neural networks, employed as nonlinear adaptive filters, are generalisations of linear adaptive filters. However, in neural network applications, they have mostly been used in such a way that the network is first trained on a particular training set and subsequently used. This approach is not an online adaptive approach, in contrast with linear adaptive filters, which undergo continual adaptation.
Two groups of learning techniques are used for training recurrent neural networks: a direct gradient computation technique (used in nonlinear adaptive filtering) and a recurrent backpropagation technique (commonly used in neural networks for offline applications). The real-time recurrent learning (RTRL) algorithm (Williams and Zipser 1989a) is a technique which uses direct gradient computation, and is used if the network coefficients change slowly with time. This technique is essentially an LMS learning algorithm for a nonlinear IIR filter. It should be noticed that, with the same computation time, it might be possible to unfold the recurrent neural network into its corresponding feedforward counterpart and hence to train it by backpropagation. The backpropagation through time (BPTT) algorithm is such a technique (Werbos 1990).
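As an illustration of direct gradient computation, the sketch below implements an RTRL-style online update for a single recurrent perceptron with one feedback tap; the tanh nonlinearity, the learning rate and the one-step-ahead prediction task in the usage example are assumptions for this sketch rather than choices prescribed in the text.

```python
import numpy as np

def rtrl_recurrent_perceptron(x, d, eta=0.05):
    """One-tap recurrent perceptron y(k) = tanh(w0*x(k) + w1*y(k-1) + w2),
    trained by real-time (direct) gradient descent on e(k) = d(k) - y(k).

    For this single-neuron case the RTRL sensitivity recursion is
        pi(k) = Phi'(v(k)) * (u(k) + w1 * pi(k-1)),
    where u(k) = [x(k), y(k-1), 1] and pi(k) = dy(k)/dw."""
    w = np.zeros(3)            # [input weight, feedback weight, bias]
    pi = np.zeros(3)           # running sensitivities dy/dw
    y_prev = 0.0
    y = np.zeros(len(x))
    for k in range(len(x)):
        u = np.array([x[k], y_prev, 1.0])
        v = w @ u
        y[k] = np.tanh(v)
        e = d[k] - y[k]
        dphi = 1.0 - y[k] ** 2             # derivative of tanh at v(k)
        pi = dphi * (u + w[1] * pi)        # RTRL sensitivity recursion
        w = w + eta * e * pi               # gradient step on 0.5*e^2
        y_prev = y[k]
    return w, y

# Usage: one-step-ahead prediction of a hypothetical noisy sine wave
rng = np.random.default_rng(0)
x = np.sin(0.1 * np.arange(400)) + 0.05 * rng.standard_normal(400)
w, y_hat = rtrl_recurrent_perceptron(x[:-1], x[1:])
```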
Some of the benefits of using neural networks as nonlinear adaptive filters are that no assumptions concerning the Markov property, Gaussian distributions or additive measurement noise are necessary (Lo 1994). A neural filter would be a suitable choice even if mathematical models of the input process and measurement noise are not known (black box modelling).
6.3 Overview
We start with the relationship between Volterra and bilinear filters and neural networks. Recurrent neural networks are then considered as nonlinear adaptive filters and neural architectures for this case are analysed. Learning algorithms for online training of recurrent neural networks are developed inductively, starting from corresponding algorithms for linear adaptive IIR filters. Some issues concerning the problem of vanishing gradient and the bias/variance dilemma are finally addressed.
6.4 Neural Networks and Polynomial Filters
It has been shown in Chapter 5 that a small-scale neural network can represent high-order nonlinear systems, whereas a large number of terms are required for an equivalent Volterra series representation. For instance, as already shown, after performing a Taylor series expansion for the output of the neural network depicted in Figure 5.3, with input signals u(k − 1) and u(k − 2), we obtain

y(k) = c_0 + c_1 u(k − 1) + c_2 u(k − 2) + c_3 u²(k − 1) + c_4 u²(k − 2) + c_5 u(k − 1)u(k − 2) + · · · .

The number of terms in the Volterra series and the complexity of the kernels h(·) increase exponentially with the order of the delay in system (6.2). This problem restricts practical applications of Volterra series to small-scale systems.

2 For instance nonsmooth functions, such as |x|.
Nonlinear system identification, on the other hand, has been traditionally based upon the Kolmogorov approximation theorem (neural network existence theorem), which states that a neural network with a hidden layer can approximate an arbitrary nonlinear system. Kolmogorov's theorem, however, is not that relevant in the context of networks for learning (Girosi and Poggio 1989b). The problem is that the inner functions in Kolmogorov's formula (4.1), although continuous, have to be highly nonsmooth. Following the analysis from Chapter 5, it is straightforward that multilayered and recurrent neural networks have the ability to approximate an arbitrary nonlinear system, whereas Volterra series fail even for simple saturation elements.
Another convenient form of nonlinear system is the bilinear (truncated Volterra) system described by

y(k) = Σ_{j=1}^{N−1} c_j y(k − j) + Σ_{i=0}^{N−1} Σ_{j=1}^{N−1} b_{i,j} y(k − j)x(k − i) + Σ_{i=0}^{N−1} a_i x(k − i).   (6.3)
Despite its simplicity, this is a powerful nonlinear model, and a large class of nonlinear systems (including Volterra systems) can be approximated arbitrarily well using this model. Its functional dependence (6.3) shows that it belongs to a class of general recursive nonlinear models. A recurrent neural network that realises a simple bilinear model is depicted in Figure 6.1. As seen from Figure 6.1, multiplicative input nodes (denoted by '×') have to be introduced to represent the bilinear model. Bias terms are omitted and the chosen neuron is linear.
Example 6.4.1. Show that the recurrent network shown in Figure 6.1 realises a bilinear model. Also show that this network can be described in terms of NARMAX models.
Figure 6.1 Recurrent neural network representation of the bilinear model
Solution. The functional description of the recurrent network depicted in Figure 6.1 is given by

y(k) = c_1 y(k − 1) + b_{0,1} x(k)y(k − 1) + b_{1,1} x(k − 1)y(k − 1) + a_0 x(k) + a_1 x(k − 1),   (6.4)

which belongs to the class of bilinear models (6.3). The functional description of the network from Figure 6.1 can also be expressed as

y(k) = F(y(k − 1), x(k), x(k − 1)),   (6.5)

which is a NARMA representation of model (6.4).
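A minimal sketch of the recurrent network of Figure 6.1, computing one output sample per time step according to (6.4); the coefficient values used below are arbitrary and serve only to make the example runnable.

```python
def bilinear_step(x_k, x_km1, y_km1, c1, b01, b11, a0, a1):
    """One step of the bilinear recurrent model (6.4):
    y(k) = c1*y(k-1) + b01*x(k)*y(k-1) + b11*x(k-1)*y(k-1) + a0*x(k) + a1*x(k-1)."""
    return c1 * y_km1 + b01 * x_k * y_km1 + b11 * x_km1 * y_km1 + a0 * x_k + a1 * x_km1

def bilinear_filter(x, c1=0.3, b01=0.1, b11=0.05, a0=1.0, a1=0.5):
    """Run the model over a sequence with zero initial conditions;
    the coefficient values are arbitrary illustration choices."""
    y_prev, x_prev = 0.0, 0.0
    y = []
    for x_k in x:
        y_k = bilinear_step(x_k, x_prev, y_prev, c1, b01, b11, a0, a1)
        y.append(y_k)
        x_prev, y_prev = x_k, y_k
    return y
```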
Example 6.4.1 confirms the duality between Volterra, bilinear, NARMA/NARMAX and recurrent neural models. To further establish the connection between Volterra series and a neural network, let us express the activation potential of the nodes of the network as

net_i(k) = Σ_j w_{i,j} x(k − j),   (6.6)

where net_i(k) is the activation potential of the ith hidden neuron, w_{i,j} are weights and x(k − j) are inputs to the network. If the nonlinear activation functions of the neurons are expressed via an Lth-order polynomial expansion³ as

Φ(net_i(k)) ≈ Σ_{l=0}^{L} ξ_{il} net_i^l(k),   (6.7)
3 Using the Weierstrass theorem, this expansion can be arbitrarily accurate. However, in practice we resort to a moderate order of this polynomial expansion.
then the neural model described in (6.6) and (6.7) can be related to the Volterra model (6.2). The actual relationship is rather complicated, and the Volterra kernels are expressed as sums of products of the weights from the input to the hidden units, the weights associated with the output neuron, and the coefficients ξ_{il} from (6.7). Chon et al. (1998) have used this kind of relationship to compare the Volterra and neural approaches when applied to the processing of biomedical signals.
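The sketch below makes this relationship explicit for a single-hidden-layer network with linear output, y(k) = Σ_i v_i Φ(net_i(k)), when Φ is replaced by a quadratic expansion Φ(z) ≈ ξ_0 + ξ_1 z + ξ_2 z²: collecting terms gives the zeroth-, first- and second-order Volterra kernels as sums of products of the network weights and the ξ coefficients. The tanh activation, the least-squares polynomial fit and the weight values are assumptions for illustration.

```python
import numpy as np

def volterra_kernels_from_network(W, v, xi):
    """Approximate Volterra kernels of a one-hidden-layer network
    y(k) = sum_i v[i] * Phi(net_i(k)),  net_i(k) = sum_j W[i, j] * x(k - j),
    when Phi is replaced by Phi(z) ~ xi0 + xi1*z + xi2*z**2.

    Collecting terms gives
        h0         = xi0 * sum_i v[i]
        h1[j]      = xi1 * sum_i v[i] * W[i, j]
        h2[j1, j2] = xi2 * sum_i v[i] * W[i, j1] * W[i, j2]."""
    xi0, xi1, xi2 = xi
    h0 = xi0 * np.sum(v)
    h1 = xi1 * (v @ W)
    h2 = xi2 * (W.T * v) @ W       # equals xi2 * W^T diag(v) W
    return h0, h1, h2

# Hypothetical network weights and a least-squares quadratic fit of tanh on [-1, 1]
rng = np.random.default_rng(1)
W = 0.5 * rng.standard_normal((4, 3))     # 4 hidden neurons, memory length 3
v = 0.5 * rng.standard_normal(4)
z = np.linspace(-1.0, 1.0, 201)
xi = np.polyfit(z, np.tanh(z), 2)[::-1]   # coefficients [xi0, xi1, xi2]
h0, h1, h2 = volterra_kernels_from_network(W, v, xi)
```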
Hence, to avoid the difficulty of excessive computation associated with Volterra series, an input–output relationship of a nonlinear predictor that computes the output in terms of past inputs and outputs may be introduced as⁴

ŷ(k) = F(y(k − 1), . . . , y(k − N), u(k − 1), . . . , u(k − M)),   (6.8)
where F(·) is some nonlinear function. The function F may change for different input variables or for different regions of interest. A NARMAX model may therefore be a correct representation only in a region around some operating point. Leontaritis and Billings (1985) rigorously proved that a discrete time nonlinear time invariant system can always be represented by model (6.8) in the vicinity of an equilibrium point provided that
• the response function of the system is finitely realisable, and
• it is possible to linearise the system around the chosen equilibrium point.
As already shown, some of the other frequently used models, such as the bilinear polynomial filter, given by (6.3), are obvious special cases of a simple NARMAX model.
6.5 Neural Networks and Nonlinear Adaptive Filters
To perform nonlinear adaptive filtering, tracking and system identification of nonlinear time-varying systems, there is a need to introduce dynamics in neural networks. These dynamics can be introduced via recurrent neural networks, which are the focus of this book.

The design of linear filters is conveniently specified by a frequency response which we would like to match. In the nonlinear case, however, since a transfer function of a nonlinear filter is not available in the frequency domain, one has to resort to different techniques. For instance, the design of nonlinear filters may be thought of as a nonlinear constrained optimisation problem in Fock space (deFigueiredo 1997).
In a recurrent neural network architecture, the feedback brings the delayed outputs from hidden and output neurons back into the network input vector u(k), as shown in Figure 5.13. Since gradient learning algorithms are sequential, these delayed outputs of neurons represent filtered data from the previous discrete time instant. Due to this 'memory', at each time instant the network is presented with the raw, possibly noisy, external input data s(k), s(k − 1), . . . , s(k − M) from Figure 5.13 and Equation (5.31), and with filtered data y_1(k − 1), . . . , y_N(k − 1) from the network output.
4 As already shown, this model is referred to as the NARMAX model (nonlinear ARMAX), since it resembles the linear ARMAX model.
Figure 6.2 NARMA recurrent perceptron
Intuitively, this filtered input history helps to improve the processing performance of recurrent neural networks, as compared with feedforward networks. Notice that the history of past outputs is never presented to the learning algorithm for feedforward networks. Therefore, a recurrent neural network should be able to process signals corrupted by additive noise even in the case when the noise distribution is varying over time.
On the other hand, a nonlinear dynamical system can be described by

x(k + 1) = f(x(k)),   (6.9)

with an observation process

y(k) = h(x(k)) + ε(k),   (6.10)

where ε(k) is observation noise (Haykin and Principe 1998). Takens' embedding theorem (Takens 1981) states that the geometric structure of system (6.9) can be recovered from the sequence {y(k)} in a D-dimensional space spanned by⁵

[y(k), y(k − 1), . . . , y(k − (D − 1))],   (6.11)

provided that D ≥ 2d + 1, where d is the dimension of the state space of system (6.9).

Figure 6.3 Nonlinear IIR filter structures: (b) a recurrent linear/nonlinear neural filter structure
Therefore, one advantage of NARMA models over FIR models is the parsimony of NARMA models, since an upper bound on the order of a NARMA model is twice the order of the state (phase) space of the system being analysed.
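A short sketch of the delay-embedding construction behind (6.11): given an observed scalar sequence {y(k)}, it forms the D-dimensional delay vectors from which, by Takens' theorem, the geometry of the underlying state space can be recovered when D ≥ 2d + 1. The unit delay between coordinates and the noisy sine wave used as data are assumptions for the example.

```python
import numpy as np

def delay_embed(y, D):
    """Return the matrix whose rows are the delay vectors
    [y(k), y(k-1), ..., y(k-(D-1))] for k = D-1, ..., len(y)-1."""
    y = np.asarray(y)
    return np.column_stack([y[D - 1 - i:len(y) - i] for i in range(D)])

# Example: embed a hypothetical noisy sine wave in D = 3 dimensions
k = np.arange(500)
y = np.sin(0.05 * k) + 0.01 * np.random.default_rng(2).standard_normal(500)
Y = delay_embed(y, D=3)    # shape (498, 3), one delay vector per row
```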
The simplest recurrent neural network architecture is a recurrent perceptron, shown in Figure 6.2. This is a simple, yet effective architecture. The equations which describe the recurrent perceptron shown in Figure 6.2 are
y(k) = Φ(v(k)),   v(k) = u^T(k)w(k),   (6.12)

where u(k) = [x(k − 1), . . . , x(k − M), 1, y(k − 1), . . . , y(k − N)]^T is the input vector, w(k) is the corresponding (M + N + 1) × 1 weight vector, Φ(·) is a nonlinear activation function and (·)^T denotes the vector transpose operator.
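A direct transcription of (6.12) as a forward pass, in which the input vector is assembled exactly as defined above; the logistic activation and zero initial conditions are assumptions made for this sketch.

```python
import numpy as np

def recurrent_perceptron(x, w, M, N, phi=lambda v: 1.0 / (1.0 + np.exp(-v))):
    """NARMA recurrent perceptron (6.12): y(k) = Phi(u(k)^T w) with
    u(k) = [x(k-1), ..., x(k-M), 1, y(k-1), ..., y(k-N)]^T.
    Zero initial conditions; Phi defaults to the logistic function (an assumption)."""
    assert len(w) == M + 1 + N
    x_hist = np.zeros(M)          # [x(k-1), ..., x(k-M)]
    y_hist = np.zeros(N)          # [y(k-1), ..., y(k-N)]
    y = np.zeros(len(x))
    for k in range(len(x)):
        u = np.concatenate([x_hist, [1.0], y_hist])
        y[k] = phi(u @ w)
        x_hist = np.concatenate([[x[k]], x_hist[:-1]])   # shift input history
        y_hist = np.concatenate([[y[k]], y_hist[:-1]])   # feed back the output
    return y
```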
5 Model (6.11) is in fact a NAR/NARMAX model.
Figure 6.5 Fully connected feedforward neural filter
A recurrent perceptron is a recursive adaptive filter with an arbitrary output function, as shown in Figure 6.3. Figure 6.3(a) shows the recurrent perceptron structure as a nonlinear infinite impulse response (IIR) filter. Figure 6.3(b) depicts the parallel linear/nonlinear structure, which is one of the possible architectures. These structures stem directly from IIR filters and are described in McDonnell and Waagen (1994), Connor (1994) and Nerrand et al. (1994). Here, A(z), B(z), C(z) and D(z) denote z-domain linear transfer functions. The general structure of a fully connected, multilayer neural feedforward filter is shown in Figure 6.5 and represents a generalisation of a simple nonlinear feedforward perceptron with dynamic synapses, shown in Figure 6.4. This structure consists of an input layer, a layer of hidden neurons and an output layer. Although the output neuron shown in Figure 6.5 is linear, it could be nonlinear. In that case, attention should be paid that the dynamic ranges of the input signal and the output neuron match.
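A sketch of the feedforward neural filter of Figure 6.5 in its simplest single-hidden-layer form: a tapped delay line of the input feeds a layer of nonlinear neurons followed by a linear output neuron. The tanh activation, the weight shapes and the random weights in the usage example are assumptions for illustration; the defining feature is the absence of any feedback of past outputs.

```python
import numpy as np

def feedforward_neural_filter(x, W_hidden, b_hidden, w_out, b_out):
    """y(k) = w_out^T tanh(W_hidden u(k) + b_hidden) + b_out, where
    u(k) = [x(k), x(k-1), ..., x(k-M+1)] is a tapped delay line of length M."""
    M = W_hidden.shape[1]
    u = np.zeros(M)
    y = np.zeros(len(x))
    for k in range(len(x)):
        u = np.concatenate([[x[k]], u[:-1]])        # shift the input delay line
        hidden = np.tanh(W_hidden @ u + b_hidden)   # nonlinear hidden layer
        y[k] = w_out @ hidden + b_out               # linear output neuron
    return y

# Usage with arbitrary (untrained) weights: 5 hidden neurons, memory length 4
rng = np.random.default_rng(5)
y = feedforward_neural_filter(rng.standard_normal(100),
                              W_hidden=rng.standard_normal((5, 4)),
                              b_hidden=np.zeros(5),
                              w_out=rng.standard_normal(5),
                              b_out=0.0)
```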
Another generalisation of a fully connected recurrent neural filter is shown in Figure 6.6. This network consists of nonlinear neural filters, as depicted in Figure 6.5, applied to both the input and the output signal, the outputs of which are summed together. This is a fairly general structure which resembles the architecture of a linear IIR filter.

Narendra and Parthasarathy (1990) provide deep insight into structures of neural networks for identification of nonlinear dynamical systems. Due to the duality between system identification and prediction, the same architectures are suitable for prediction applications. From Figures 6.3–6.6, we can identify four general architectures of neural networks for prediction and system identification. These architectures come as combinations of linear/nonlinear parts from the architecture shown in Figure 6.6, and for the nonlinear prediction configuration are specified as follows.
(i) The output y(k) is a linear function of previous outputs and a nonlinear function of previous inputs, given by

y(k) = Σ_{j=1}^{N} a_j(k) y(k − j) + F(u(k − 1), u(k − 2), . . . , u(k − M)),   (6.13)

where F(·) is some nonlinear function. This architecture is shown in Figure 6.7(a).
(ii) The output y(k) is a nonlinear function of past outputs and a linear function of past inputs, given by

y(k) = F(y(k − 1), y(k − 2), . . . , y(k − N)) + Σ_{i=1}^{M} b_i(k) u(k − i).   (6.14)

This architecture is depicted in Figure 6.7(b).
(iii) The output y(k) is a nonlinear function of both past inputs and outputs. The functional relationship between the past inputs and outputs can be expressed in a separable manner as

y(k) = F(y(k − 1), . . . , y(k − N)) + G(u(k − 1), . . . , u(k − M)).   (6.15)

This architecture is depicted in Figure 6.7(c).

(iv) The output y(k) is a nonlinear function of past inputs and outputs, as

y(k) = F(y(k − 1), . . . , y(k − N), u(k − 1), . . . , u(k − M)).   (6.16)

This architecture is depicted in Figure 6.7(d) and is the most general.

Figure 6.7 Architectures of recurrent neural networks as nonlinear adaptive filters
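The four configurations can be summarised as four one-step predictors, as in the sketch below; F and G stand for arbitrary nonlinear maps supplied by the caller, and a and b are vectors of linear coefficients, so only the functional forms (6.13)–(6.16) are fixed, not any particular network.

```python
import numpy as np

# One-step predictors for the four architectures; y_past = [y(k-1), ..., y(k-N)],
# u_past = [u(k-1), ..., u(k-M)]; F and G are arbitrary nonlinear callables.

def arch_i(y_past, u_past, a, F):     # (6.13): linear AR part, nonlinear input part
    return np.dot(a, y_past) + F(u_past)

def arch_ii(y_past, u_past, b, F):    # (6.14): nonlinear AR part, linear input part
    return F(y_past) + np.dot(b, u_past)

def arch_iii(y_past, u_past, F, G):   # (6.15): separable nonlinear parts
    return F(y_past) + G(u_past)

def arch_iv(y_past, u_past, F):       # (6.16): fully coupled, the most general
    return F(np.concatenate([y_past, u_past]))
```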
Figure 6.8 NARMA type neural identifier
6.6 Training Algorithms for Recurrent Neural Networks
A natural error criterion, upon which the training of recurrent neural networks is based, is in the form of the accumulated squared prediction error over the whole dataset, given by

E = Σ_{m=1}^{M} λ(m) e²(m),   (6.17)

where e(m) is the prediction error at time instant m and λ(m) is a weighting factor. In the stationary case, all the error terms can be weighted equally, in particular λ(m) = 1, m = 1, 2, . . . , M, whereas in the nonstationary case, since the statistics change over time, it is unreasonable to take into account the whole previous history of the errors. For this case, a forgetting mechanism is usually employed, whereby 0 < λ(m) < 1. Since many real-world signals are nonstationary, online learning algorithms commonly use the squared instantaneous error as an error criterion, i.e.

E(k) = (1/2) e²(k).   (6.18)

Here, the coefficient 1/2 is included for convenience in the derivation of the algorithms.
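The two criteria can be written down directly; the sketch below uses the accumulated cost (6.17) and the instantaneous cost (6.18) as given above, with an exponential forgetting profile λ(m) = λ₀^{M−m} chosen purely as one example of a forgetting mechanism.

```python
import numpy as np

def accumulated_cost(e, lam):
    """Weighted accumulated squared prediction error, E = sum_m lam[m] * e[m]**2."""
    return np.sum(lam * e ** 2)

def instantaneous_cost(e_k):
    """Instantaneous error criterion E(k) = 0.5 * e(k)**2 used in online learning."""
    return 0.5 * e_k ** 2

# Example: exponential forgetting, most recent error weighted most heavily
e = np.random.default_rng(3).standard_normal(200)   # hypothetical prediction errors
M = len(e)
lam0 = 0.99
lam = lam0 ** np.arange(M - 1, -1, -1)               # lambda(m) = lam0**(M - m)
E_total = accumulated_cost(e, lam)
E_now = instantaneous_cost(e[-1])
```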
6.7 Learning Strategies for a Neural Predictor/Identifier
A NARMA/NARMAX type neural identifier is depicted in Figure 6.8. When considering a neural predictor, the only difference is the position of the neural module within the system structure, as shown in Chapter 2. There are two main training strategies to estimate the weights of the neural network shown in Figure 6.8. In the first approach, the links between the real system and the neural identifier are as depicted in Figure 6.9. During training, the configuration shown in Figure 6.9 can be described by

ŷ(k) = f(u(k), . . . , u(k − M), y(k − 1), . . . , y(k − N)),   (6.19)
which is referred to as the nonlinear series–parallel model (Alippi and Piuri 1996; Qin et al. 1992). In this configuration, the desired signal y(k) is presented to the network, which produces biased estimates (Narendra 1996).

Figure 6.9 The nonlinear series–parallel (teacher forcing) learning configuration

Figure 6.10 The nonlinear parallel (supervised) learning configuration
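A sketch contrasting the two learning configurations for a one-step NARMA identifier: in the series–parallel (teacher forcing) configuration of Figure 6.9 the regression vector is filled with delayed desired outputs, as in (6.19), whereas in the parallel configuration of Figure 6.10 the network's own past predictions are fed back. The generic model callable, the linear-in-parameters example system and the buffering of past samples are all assumptions made for illustration.

```python
import numpy as np

def identify(u, y, model, M, N, parallel=False):
    """One-step-ahead NARMA identification.

    series-parallel (parallel=False): y_hat(k) = model(u(k..k-M), y(k-1..k-N))
    parallel        (parallel=True) : y_hat(k) = model(u(k..k-M), y_hat(k-1..k-N))"""
    y_hat = np.zeros(len(y))
    for k in range(len(y)):
        u_reg = np.array([u[k - i] if k - i >= 0 else 0.0 for i in range(M + 1)])
        past = y_hat if parallel else y            # feed back estimates or desired outputs
        y_reg = np.array([past[k - j] if k - j >= 1 else 0.0 for j in range(1, N + 1)])
        y_hat[k] = model(u_reg, y_reg)
    return y_hat

# Hypothetical "identified" model: a simple linear-in-parameters map
model = lambda u_reg, y_reg: 0.5 * y_reg[0] + 0.3 * u_reg[0] + 0.1 * u_reg[1]

# Hypothetical true system generating the desired signal y
rng = np.random.default_rng(4)
u = rng.standard_normal(300)
y = np.zeros(300)
for k in range(300):
    y[k] = 0.5 * (y[k - 1] if k >= 1 else 0.0) + 0.3 * u[k] + 0.1 * (u[k - 1] if k >= 1 else 0.0)

y_sp = identify(u, y, model, M=1, N=1, parallel=False)   # series-parallel (teacher forcing)
y_p  = identify(u, y, model, M=1, N=1, parallel=True)    # parallel (network output fed back)
```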