Authored by Danilo P. Mandic, Jonathon A. Chambers
Copyright © 2001 John Wiley & Sons Ltd
ISBNs: 0-471-49517-4 (Hardback); 0-470-84535-X (Electronic)
5
Recurrent Neural Networks
Architectures
5.1 Perspective
In this chapter, the use of neural networks, in particular recurrent neural networks, in system identification, signal processing and forecasting is considered. The ability of neural networks to model nonlinear dynamical systems is demonstrated, and the correspondence between neural networks and block-stochastic models is established. Finally, further discussion of recurrent neural network architectures is provided.
5.2 Introduction
There are numerous situations in which the use of linear filters and models is limited. For instance, when trying to identify a saturation type nonlinearity, linear models will inevitably fail. This is also the case when separating signals with overlapping spectral components.
Most real-world signals are generated, to a certain extent, by a nonlinear mechanism and therefore in many applications the choice of a nonlinear model may be necessary to achieve an acceptable performance from an adaptive predictor. Communications channels, for instance, often need nonlinear equalisers to achieve acceptable performance. The choice of model is of crucial importance¹ and practical applications have shown that nonlinear models can offer a better prediction performance than their linear counterparts. They also reveal rich dynamical behaviour, such as limit cycles, bifurcations and fixed points, that cannot be captured by linear models (Gershenfeld and Weigend 1993).

By system we consider the actual underlying physics² that generate the data, whereas by model we consider a mathematical description of the system. Many variations of mathematical models can be postulated on the basis of datasets collected from observations of a system, and their suitability assessed by various performance metrics.

1 System identification, for instance, consists of the choice of the model, model parameter estimation and model validation.
2 Technically, the notions of system and process are equivalent (Pearson 1995; Sjöberg et al. 1995).

Figure 5.1 Effects of the y = tanh(v) nonlinearity in a neuron model upon two example inputs

Since it is not possible to characterise nonlinear systems by their impulse response, one has to resort to less general models, such as homomorphic filters, morphological filters and polynomial filters. Some of the most frequently used polynomial filters are based upon Volterra series (Mathews 1991), a nonlinear analogue of the linear impulse response, threshold autoregressive (TAR) models (Priestley 1991) and Hammerstein and Wiener models. The latter two represent structures that consist of a linear dynamical model and a static zero-memory nonlinearity. An overview of these models can be found in Haber and Unbehauen (1990). Notice that for nonlinear systems, the ordering of the modules within a modular structure³ plays an important role.
To illustrate some important features associated with nonlinear neurons, let us consider a squashing nonlinear activation function of a neuron, shown in Figure 5.1. For two identical mixed sinusoidal inputs with different offsets, passed through this nonlinearity, the output behaviour varies from amplifying and slightly distorting the input signal (solid line in Figure 5.1) to attenuating and considerably nonlinearly distorting the input signal (broken line in Figure 5.1). From the viewpoint of system theory, neural networks represent nonlinear maps, mapping one metric space to another.

3 To depict this, for two modules performing nonlinear functions H1 = sin(x) and H2 = e^x, we have H1(H2(x)) ≠ H2(H1(x)), since sin(e^x) ≠ e^sin(x). This is the reason to use the term nesting rather than cascading in modular neural networks.
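Both points above, the non-commutativity of nested nonlinear modules and the offset-dependent distortion introduced by a squashing nonlinearity, can be checked numerically. A minimal sketch follows; the signal amplitude and the offset value are illustrative choices, not taken from the text:

```python
import numpy as np

# Two modules: H1(x) = sin(x), H2(x) = exp(x); nesting order matters.
H1, H2 = np.sin, np.exp
x = 1.0
nested_a = H1(H2(x))   # sin(e^x)
nested_b = H2(H1(x))   # e^sin(x)
print(nested_a, nested_b)   # the two orderings give different values

# Effect of the tanh squashing nonlinearity on an offset input:
t = np.linspace(0, 2 * np.pi, 1000)
s = 0.5 * np.sin(t)
out_centred = np.tanh(s)        # near-linear region: mild distortion
out_offset = np.tanh(s + 2.0)   # saturated region: strong attenuation

# The peak-to-peak swing collapses once the operating point is
# pushed into saturation by the offset.
swing_centred = out_centred.max() - out_centred.min()
swing_offset = out_offset.max() - out_offset.min()
print(swing_centred, swing_offset)
```

The offset input emerges both attenuated and nonlinearly distorted, mirroring the broken-line behaviour described for Figure 5.1.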
Nonlinear system modelling has traditionally focused on Volterra–Wiener analysis. These models are nonparametric and computationally extremely demanding. The Volterra series expansion is given by

y(k) = h0 + Σi h1(i)x(k − i) + Σi Σj h2(i, j)x(k − i)x(k − j) + · · · (5.1)

for the representation of a causal system. A nonlinear system represented by a Volterra series is completely characterised by its Volterra kernels hi, i = 0, 1, 2, . . . . The Volterra modelling of a nonlinear system requires a great deal of computation, and mostly second- or third-order Volterra systems are used in practice.
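A truncated second-order Volterra filter of the kind used in practice can be sketched as follows; the kernel values and input are hypothetical, chosen only to make the structure concrete:

```python
import numpy as np

def volterra2(x, h0, h1, h2):
    """Truncated second-order Volterra filter.
    h1: first-order kernel, shape (M,)
    h2: second-order kernel, shape (M, M)
    """
    M = len(h1)
    y = np.full(len(x), h0, dtype=float)
    for k in range(len(x)):
        # delayed input vector [x(k), x(k-1), ..., x(k-M+1)], zero-padded
        xv = np.array([x[k - i] if k - i >= 0 else 0.0 for i in range(M)])
        y[k] += h1 @ xv + xv @ h2 @ xv
    return y

# hypothetical kernels for illustration
h0 = 0.1
h1 = np.array([0.5, 0.25])
h2 = np.array([[0.1, 0.05],
               [0.05, 0.0]])
x = np.array([1.0, 0.0, -1.0, 0.5])
y = volterra2(x, h0, h1, h2)
print(y)
```

Note how the kernel count grows quadratically with the memory length M, which is why orders above two or three quickly become computationally prohibitive.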
Since the Volterra series expansion is a Taylor series expansion with memory, they both fail when describing a system with discontinuities, such as

y(k) = sgn(x(k)), (5.2)

where sgn(·) is the signum function.
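The failure at a discontinuity can be seen numerically: any finite polynomial (Taylor-type) model is continuous, so a least-squares polynomial fit to sgn(·) leaves a residual error of order one near the jump. The grid and polynomial degree below are arbitrary illustrative choices:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 2001)   # grid includes points close to the jump
y = np.sign(x)

# least-squares polynomial fit of fairly high degree
coeffs = np.polyfit(x, y, deg=11)
p = np.polyval(coeffs, x)

err = np.abs(p - y)
print(err.max())   # remains O(1) near the discontinuity at 0
```

Raising the degree narrows the region of large error but never removes it, which is the essence of why Taylor-type expansions cannot describe such systems.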
To overcome this difficulty, nonlinear parametric models, termed NARMAX, described by nonlinear difference equations, have been introduced (Billings 1980; Chon and Cohen 1997; Chon et al. 1999; Connor 1994). Unlike the Volterra–Wiener representation, the NARMAX representation of nonlinear systems offers a compact representation.
The NARMAX model describes a system by using a nonlinear functional dependence between lagged inputs, outputs and/or prediction errors. A polynomial expansion of the transfer function of a NARMAX neural network does not comprise delayed versions of the input and output of order higher than those presented to the network. Therefore, an input of insufficient order will result in undermodelling, which complies with Takens' embedding theorem (Takens 1981).

Applications of neural networks in forecasting, signal processing and control require treatment of the dynamics associated with the input signal. Feedforward networks for processing of dynamical systems tend to capture the dynamics by including past inputs in the input vector. However, for dynamical modelling of complex systems, there is a need to involve feedback, i.e. to use recurrent neural networks. There are various configurations of recurrent neural networks, which are used by Jordan (1986) for control of robots, by Elman (1990) for problems in linguistics and by Williams and Zipser (1989a) for nonlinear adaptive filtering and pattern recognition. In Jordan's network, past values of network outputs are fed back into hidden units; in Elman's network, past values of the outputs of hidden units are fed back into themselves; whereas in the Williams–Zipser architecture, the network is fully connected, having one hidden layer.
There are numerous modular and hybrid architectures, combining linear adaptive filters and neural networks. These include the pipelined recurrent neural network (PRNN) and networks combining recurrent networks and FIR adaptive filters. The main idea here is that the linear filter captures the linear 'portion' of the input process, whereas a neural network captures the nonlinear dynamics associated with the process.
5.3 Overview
The basic modes of modelling, such as parametric, nonparametric, white box, black box and grey box modelling, are introduced. Afterwards, the dynamical richness of neural models is addressed, and feedforward and recurrent modelling for noisy time series are compared. Block-stochastic models are introduced, and neural networks are shown to be able to represent these models. The chapter concludes with an overview of recurrent neural network architectures and recurrent neural networks for NARMAX modelling.
5.4 Basic Modes of Modelling
The notions of parametric, nonparametric, black box, grey box and white box modelling are explained. These can be used to categorise neural network algorithms, such as the direct gradient computation, a posteriori and normalised algorithms. The basic idea behind these approaches to modelling is not to estimate what is already known. One should, therefore, utilise prior knowledge and knowledge about the physics of the system when selecting the neural network model prior to parameter estimation.
5.4.1 Parametric versus Nonparametric Modelling
A review of nonlinear input–output modelling techniques is given in Pearson (1995). Three classes of input–output models are parametric, nonparametric and semiparametric models. We next briefly address them.
• Parametric modelling assumes a fixed structure for the model. The model identification problem then simplifies to estimating a finite set of parameters of this fixed model. This estimation is based upon the prediction of real input data, so as to best match the input data dynamics. An example of this technique is the broad class of ARIMA/NARMA models. For a given structure of the model (NARMA, for instance), we recursively estimate the parameters of the chosen model.
• Nonparametric modelling seeks a particular model structure from the input data. The actual model is not known beforehand. An example taken from nonparametric regression is that we look for a model in the form y(k) = f(x(k)) without knowing the function f(·) (Pearson 1995).
• Semiparametric modelling is a combination of the above. Part of the model structure is completely specified and known beforehand, whereas the other part of the model is either not known or only loosely specified.
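The parametric case above, recursively estimating the parameters of a fixed model structure, can be sketched with an LMS-style gradient update on a simple ARMA-type structure. The system coefficients, step size and signal length are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed fixed parametric structure: y(k) = a*y(k-1) + b*x(k-1)
a_true, b_true = 0.5, 0.1
x = rng.standard_normal(5000)
y = np.zeros_like(x)
for k in range(1, len(x)):
    y[k] = a_true * y[k - 1] + b_true * x[k - 1]

# Recursive (LMS) estimation of the two parameters of the fixed model
theta = np.zeros(2)   # [a_hat, b_hat]
mu = 0.05             # step size
for k in range(1, len(x)):
    phi = np.array([y[k - 1], x[k - 1]])   # regressor
    e = y[k] - theta @ phi                 # prediction error
    theta += mu * e * phi                  # gradient update

print(theta)   # should approach [0.5, 0.1]
```

The model structure is fixed throughout; only the finite parameter vector is adapted, which is exactly what distinguishes the parametric mode from the nonparametric one.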
Neural networks, especially recurrent neural networks, can be employed within estimators of all of the above classes of models. Closely related to the above concepts are white, grey and black box modelling techniques.

5.4.2 White, Grey and Black Box Modelling
To understand and analyse real-world physical phenomena, various mathematical models have been developed. Depending on the a priori knowledge about the process, data and model, we differentiate between three fairly general modes of modelling. The idea is to distinguish between three levels of prior knowledge, which have been 'colour-coded'. An overview of the white, grey and black box modelling techniques can be found in Aguirre (2000) and Sjöberg et al. (1995).
Given data gathered from planet movements, Kepler's gravitational laws might well provide the initial framework for building a mathematical model of the process. This mode of modelling is referred to as white box modelling (Aguirre 2000), underlining its fairly deterministic nature. Static data are used to calculate the parameters, and to do that the underlying physical process has to be understood. It is therefore possible to build a white box model entirely from physical insight and prior knowledge. However, the underlying physics are generally not completely known, or are too complicated, and often one has to resort to other types of modelling.
The exact form of the input–output relationship that describes a real-world system is most commonly unknown, and therefore modelling is based upon a chosen set of known functions. In addition, if the model is to approximate the system with arbitrary accuracy, the set of chosen nonlinear continuous functions must be dense. This is the case with polynomials. In this light, neural networks can be viewed as another mode of functional representation. Black box modelling therefore assumes no previous knowledge about the system that produces the data. However, the chosen network structure belongs to architectures that are known to be flexible and to have performed satisfactorily on similar problems. The aim hereby is to find a function F that approximates the process y based on the previous observations of the process, yPAST, and the input u, as

y(k) = F(yPAST, u). (5.3)
This 'black box' establishes a functional dependence between the input and output, which can be either linear or nonlinear. The downside is that it is generally not possible to learn about the true physical process that generates the data, especially if a linear model is used. Once the training process is complete, a neural network represents a black box, nonparametric process model. Knowledge about the process is embedded in the values of the network parameters (i.e. synaptic weights).
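A black box neural model of this kind can be sketched with a small feedforward network trained by gradient descent. The toy process, network size, step size and iteration count below are all illustrative assumptions; the point is only that the process knowledge ends up embedded in the weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy process: y(k) depends nonlinearly on y(k-1) and u(k-1)
u = rng.uniform(-1, 1, 2000)
y = np.zeros_like(u)
for k in range(1, len(u)):
    y[k] = 0.3 * y[k - 1] + np.tanh(u[k - 1])

# Black-box regressor: inputs [y(k-1), u(k-1)], target y(k)
X = np.stack([y[:-1], u[:-1]], axis=1)
t = y[1:]

# Minimal one-hidden-layer network, batch gradient descent
nh = 8
W1 = rng.standard_normal((2, nh)) * 0.5
b1 = np.zeros(nh)
W2 = rng.standard_normal(nh) * 0.5
b2 = 0.0
lr = 0.1

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

_, pred0 = forward(X)
loss0 = np.mean((pred0 - t) ** 2)   # error before training

for _ in range(500):
    h, pred = forward(X)
    err = pred - t
    gW2 = h.T @ err / len(t)
    gb2 = err.mean()
    dh = np.outer(err, W2) * (1 - h ** 2)   # backprop through tanh
    gW1 = X.T @ dh / len(t)
    gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred = forward(X)
loss = np.mean((pred - t) ** 2)
print(loss0, loss)   # mean-squared error drops substantially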
A natural compromise between the two previous approaches is so-called grey box modelling. It is obtained from black box modelling if some information about the system is known a priori. This can be a probability density function, general statistics of the process data, an impulse response or attractor geometry. In Sjöberg et al. (1995), two subclasses of grey box models are considered: physical modelling, where a model structure is built upon an understanding of the underlying physics, as for instance the state-space model structure; and semiphysical modelling, where, based upon physical insight, certain nonlinear combinations of data structures are suggested, and then estimated by black box methodology.
Figure 5.2 Nonlinear prediction configuration using a neural network model
5.5 NARMAX Models and Embedding Dimension
For neural networks, the number of input nodes specifies the dimension of the network input. In practice, the true state of the system is not observable and the mathematical model of the system that generates the dynamics is not known. The question arises: is the sequence of measurements {y(k)} sufficient to reconstruct the nonlinear system dynamics? Under some regularity conditions, Takens' (1981) and Mañé's (1981) embedding theorems establish this connection. To ensure that the dynamics of a nonlinear process estimated by a neural network are fully recovered, it is convenient to use Takens' embedding theorem (Takens 1981), which states that to obtain a faithful reconstruction of the system dynamics, the embedding dimension d must satisfy

d ≥ 2D + 1, (5.4)

where D is the dimension of the system attractor. Takens' embedding theorem (Takens 1981; Wan 1993) establishes a diffeomorphism between a finite window of the time series
[y(k − 1), y(k − 2), . . . , y(k − N)] (5.5)

and the underlying state of the dynamic system which generates the time series. This implies that a nonlinear regression

y(k) = g[y(k − 1), y(k − 2), . . . , y(k − N)] (5.6)

can model the nonlinear time series. An important feature of the delay-embedding theorem due to Takens (1981) is that it is physically implemented by delay lines.
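The delay vectors (5.5) and the regression targets of (5.6) can be assembled directly from a measured sequence. A small sketch follows; the helper name and the toy signal are illustrative:

```python
import numpy as np

def delay_embed(y, N):
    """Build delay vectors [y(k-1), ..., y(k-N)] as rows, with
    targets y(k), for k = N, ..., len(y)-1 (cf. (5.5)-(5.6))."""
    Y = np.stack([y[N - 1 - i : len(y) - 1 - i] for i in range(N)], axis=1)
    return Y, y[N:]

y = np.arange(10.0)          # toy 'time series'
Y, t = delay_embed(y, 3)
print(Y[0], t[0])            # [2. 1. 0.] 3.0
```

Each row of Y is exactly the tapped-delay-line content that would feed the input nodes of the network, so the number of input nodes fixes the embedding dimension N.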
Figure 5.3 A NARMAX recurrent perceptron with p = 1 and q = 1
There is a deep connection between time-lagged vectors and the underlying dynamics. Delay vectors are not just a representation of a state of the system; their length is the key to recovering the full dynamical structure of a nonlinear system. A general starting point would be to use a network for which the input vector comprises delayed inputs and outputs, as shown in Figure 5.2. For the network in Figure 5.2, both the input and the output are passed through delay lines, hence indicating the NARMAX character of this network. The switch in this figure indicates two possible modes of learning, which will be explained in Chapter 6.
5.6 How Dynamically Rich are Nonlinear Neural Models?
To make an initial step toward comparing neural and other nonlinear models, we perform a Taylor series expansion of the sigmoidal nonlinear activation function of a single neuron model as (Billings et al. 1992)

Φ(v(k)) = 1/2 + βv(k)/4 − (βv(k))³/48 + · · · . (5.7)

Depending on the steepness β and the activation potential v(k), the polynomial representation (5.7) of the transfer function of a neuron exhibits a complex nonlinear behaviour.
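The first terms of this expansion can be verified numerically. Assuming the logistic sigmoid Φ(v) = 1/(1 + e^(−βv)), the cubic truncation 1/2 + βv/4 − (βv)³/48 agrees closely for small activation potentials:

```python
import math

def sigmoid(v, beta=1.0):
    # logistic sigmoid activation
    return 1.0 / (1.0 + math.exp(-beta * v))

def taylor3(v, beta=1.0):
    # cubic truncation of the Taylor expansion about v = 0
    bv = beta * v
    return 0.5 + bv / 4 - bv ** 3 / 48

for v in (0.05, 0.1, 0.2):
    print(v, abs(sigmoid(v) - taylor3(v)))   # error shrinks like v^5
```

The leading neglected term is of order (βv)⁵, so for |βv| well below one the polynomial view of the neuron is an excellent local description.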
Let us now consider a NARMAX recurrent perceptron with p = 1 and q = 1, as shown in Figure 5.3, which is a simple example of a recurrent neural network. Its mathematical description is given by

y(k) = Φ(w1x(k − 1) + w2y(k − 1) + w0). (5.8)

Expanding (5.8) using (5.7) yields

y(k) = 1/2 + (1/4)[w1x(k − 1) + w2y(k − 1) + w0] − (1/48)[w1x(k − 1) + w2y(k − 1) + w0]³ + · · · , (5.9)

where β = 1. Expression (5.9) illustrates the dynamical richness of squashing activation functions. The associated dynamics, when represented in terms of polynomials, are quite complex. Networks with more neurons and hidden layers will produce more complicated dynamics than those in (5.9). Following the same approach, for a general recurrent neural network, we obtain (Billings et al. 1992) a polynomial representation, denoted (5.10), whose coefficients are determined by the weights of the network. Representation (5.10) also models an offset (mean value) c0 of the input signal.
5.6.1 Feedforward versus Recurrent Networks for Nonlinear Modelling
The choice of which neural network to employ to represent a nonlinear physical process depends on the dynamics and complexity of the network that is best for representing the problem in hand. For instance, due to feedback, recurrent networks may suffer from instability and sensitivity to noise. Feedforward networks, on the other hand, might not be powerful enough to capture the dynamics of the underlying nonlinear dynamical system. To illustrate this problem, we resort to a simple IIR (ARMA) linear system described by the following first-order difference equation:
z(k) = 0.5z(k − 1) + 0.1x(k − 1). (5.11)
The system (5.11) is stable, since the pole of its transfer function is at 0.5, i.e. within the unit circle in the z-plane. However, in a noisy environment, the output z(k) is corrupted by noise e(k), so that the noisy output y(k) of system (5.11) becomes

y(k) = z(k) + e(k), (5.12)

which will affect the quality of estimation based on this model. This happens because the noise terms accumulate during the recursions⁴ of (5.11) as
y(k) = 0.5y(k − 1) + 0.1x(k − 1) + e(k) − 0.5e(k − 1). (5.13)
An equivalent FIR (MA) representation of the same filter (5.11), using the method of long division, gives

z(k) = 0.1x(k − 1) + 0.05x(k − 2) + 0.025x(k − 3) + 0.0125x(k − 4) + · · · (5.14)
and the representation of the noisy system now becomes

y(k) = 0.1x(k − 1) + 0.05x(k − 2) + 0.025x(k − 3) + 0.0125x(k − 4) + · · · + e(k). (5.15)
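The equivalence of the recursion (5.11) and its long-division FIR form (5.14) can be checked directly, along with how quickly the truncation error vanishes. The truncation length and test signal below are illustrative choices:

```python
import numpy as np

def iir(x):
    # z(k) = 0.5 z(k-1) + 0.1 x(k-1), cf. (5.11)
    z = np.zeros(len(x))
    for k in range(1, len(x)):
        z[k] = 0.5 * z[k - 1] + 0.1 * x[k - 1]
    return z

def fir(x, M):
    # truncated long-division form: z(k) = sum_i 0.1 * 0.5^(i-1) x(k-i)
    z = np.zeros(len(x))
    for k in range(len(x)):
        for i in range(1, M + 1):
            if k - i >= 0:
                z[k] += 0.1 * 0.5 ** (i - 1) * x[k - i]
    return z

rng = np.random.default_rng(2)
x = rng.standard_normal(200)
z_iir = iir(x)
z_fir = fir(x, 30)
print(np.max(np.abs(z_iir - z_fir)))   # truncation error ~ 0.5^30
```

The two outputs agree to within the geometrically small truncation tail, illustrating the text's point: the FIR form avoids the noise recursion at the price of an (in principle infinite) expansion length.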
4 Notice that if the noise e(k) is zero mean and white, it appears coloured in (5.13), i.e. correlated with previous outputs, which leads to biased estimates.
Trang 9Clearly, the noise in (5.15) is not correlated with the previous outputs and the mates are unbiased.5 The price to pay, however, is the infinite length of the exactrepresentation of (5.11).
esti-A similar principle applies to neural networks In Chapter 6 we address the modes
of learning in neural networks and discuss the bias/variance dilemma for recurrentneural networks
5.7 Wiener and Hammerstein Models and Dynamical Neural Networks
Under relatively mild conditions,⁶ the output signal of a nonlinear model can be considered as a combination of outputs from some suitable submodels. The structure identification, model validation and parameter estimation based upon these submodels are more convenient than those of the whole model. Block oriented stochastic models consist of static nonlinear and dynamical linear modules. Such models often occur in practice, examples of which are

• the Hammerstein model, where a zero-memory nonlinearity is followed by a linear dynamical system characterised by its transfer function H(z) = N(z)/D(z);

• the Wiener model, where a linear dynamical system is followed by a zero-memory nonlinearity.
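The two block orderings can be sketched as follows; the first-order linear block, the tanh nonlinearity and the input ramp are illustrative assumptions, chosen only to show that the orderings are not interchangeable:

```python
import numpy as np

def linear_block(u, a=0.5, b=1.0):
    # simple first-order linear dynamical block: z(k) = a z(k-1) + b u(k-1)
    z = np.zeros(len(u))
    for k in range(1, len(u)):
        z[k] = a * z[k - 1] + b * u[k - 1]
    return z

f = np.tanh   # zero-memory (static) nonlinearity

u = np.linspace(-2, 2, 50)
y_hammerstein = linear_block(f(u))   # static nonlinearity -> linear dynamics
y_wiener = f(linear_block(u))        # linear dynamics -> static nonlinearity

# The two orderings produce clearly different outputs
print(np.max(np.abs(y_hammerstein - y_wiener)))
```

Note that the Wiener output is confined to the range of the nonlinearity, whereas the Hammerstein output can exceed it through linear accumulation, one concrete way the module ordering shows up in the data.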
5.7.1 Overview of Block-Stochastic Models
The definitions of these stochastic models are given by the corresponding block diagrams, shown in Figure 5.4.
Figure 5.4 Nonlinear stochastic models used in control and signal processing: (a) the Hammerstein stochastic model; (b) the Wiener stochastic model
Theoretically, there are finite size neural systems with dynamic synapses which can represent all of the above. Moreover, some modular neural architectures, such as the PRNN (Haykin and Li 1995), are able to represent the block-cascaded Wiener–Hammerstein systems described in Mandic and Chambers (1999c), where relatively mild conditions suffice for a system to be representable this way.
5.7.2 Connection Between Block-Stochastic Models and Neural Networks
Block diagrams of the Wiener and Hammerstein systems are shown in Figure 5.4. The nonlinear function from Figure 5.4(a) can generally be assumed to be a polynomial.⁷ The representation of the Hammerstein model whose output is corrupted with additive noise ν(k) is

y(k) = Φ[u(k − 1)] + Σ∞i=2 hiΦ[u(k − i)] + ν(k), (5.22)

where Φ is a continuous nonlinear function. A further requirement is that the linear dynamical subsystem be stable. This network is shown in Figure 5.5.