Chapter 4

Optimization for Training Approximators
4.1 Overview
Optimization plays a key role in engineering and many other fields, and the concept is familiar to most of us because of our constant exposure to it. For instance, in business investments we seek to maximize our profits, in recreational games we seek to maximize our score, and in circuit or machine design we may want to design for the highest possible performance (e.g., torque delivery) for a given amount of input power. In each case, we adjust the free variables of a system so that some measure of its performance is made as good as possible.
Here, as in many adaptive control methods, the adaptive schemes are derived using basic optimization ideas, so an understanding of the methods in this chapter will help in understanding the adaptive estimation and control techniques developed later. The reader who is already familiar with least squares and gradient optimization may wish to skip (or skim) this chapter and go to the next one.
4.2 Problem Formulation
Suppose that we are given a cost function J(θ) that we wish to minimize by choosing θ ∈ S ⊂ R^p, where θ is a vector of p adjustable parameters. In other words, we wish to find some θ* ∈ S such that

    \theta^* = \arg\min_{\theta \in S} J(\theta).

This type of optimization problem is referred to as "constrained optimization," since we require that θ ∈ S. When S = R^p, the minimization problem becomes "unconstrained."
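As a small numerical illustration of this distinction (the cost function and constraint set below are assumed purely for illustration), a constrained and an unconstrained minimization could be set up as follows in Python:

    import numpy as np
    from scipy.optimize import minimize

    # Illustrative cost: J(theta) = (theta_1 - 2)^2 + (theta_2 + 1)^2.
    J = lambda theta: (theta[0] - 2.0) ** 2 + (theta[1] + 1.0) ** 2

    # Unconstrained problem: S = R^2.
    unconstrained = minimize(J, x0=np.zeros(2))

    # Constrained problem: S = [0, 1] x [0, 1].
    constrained = minimize(J, x0=np.zeros(2), bounds=[(0.0, 1.0), (0.0, 1.0)])

    print(unconstrained.x)   # approximately [2, -1]
    print(constrained.x)     # approximately [1, 0], the best point within S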
If we wish to find a parameter set θ that shapes a function \hat{F}(x, θ) (representing a neural network or fuzzy system with tunable parameters θ) so that \hat{F}(x, θ) and f(x) match over the input space S_x ⊂ R^n, then one might try to minimize a cost function such as (4.3), which penalizes the mismatch between \hat{F}(x, θ) and f(x) over all of S_x. Minimizing the difference between a tunable function \hat{F}(x, θ) and another function f(x), which is in general only partially known, is referred to as function approximation, and it will be of particular interest to us throughout the study of adaptive systems using fuzzy systems and neural networks.
Practically speaking, however, in our adaptive estimation and control problems we are either given only a finite amount of information about the unknown function f(x), in the form of input-output pairs, or we are given such input-output pairs one at a time in a sequence. Suppose that there are n input variables, so that x = [x_1, \ldots, x_n]^T ∈ R^n. Suppose we present the function with an input denoted by

    x^i = [x^i_1, \ldots, x^i_n]^T,

where x^i ∈ S_x, and denote the corresponding output of the function by

    y^i = f(x^i).

Furthermore, let the "training data set" be denoted by

    G = \{(x^i, y^i) : i = 1, 2, \ldots, M\}.
Given this, a practical cost function to minimize is given by

    J(\theta) = \sum_{i=1}^{M} \bigl|y^i - \hat{F}(x^i, \theta)\bigr|^2.    (4.4)
We will study several ways to minimize this cost function, keeping in mind that since we only have samples within S_x, it will not be possible to do a good job at lowering the approximation error in regions of S_x where we have no data. Generally, we would like good coverage of the input space S_x by the x^i data (e.g., uniform on a grid with a small distance between adjacent grid points). Sometimes, however, you are forced to use a given G directly as it is given to you (and you cannot change how its data were collected), so that making the approximation error

    e(x) = f(x) - \hat{F}(x, \theta)

small for all x ∈ S_x is indeed a difficult problem.
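As a concrete illustration, the following minimal Python sketch evaluates a cost of the form (4.4) over a hypothetical training set G; the sampled function, the affine approximator, and the parameter values are assumptions made only for this example:

    import numpy as np

    def J(theta, G, F_hat):
        """Data-based cost of the form (4.4): sum of squared errors over G."""
        return sum((y - F_hat(x, theta)) ** 2 for x, y in G)

    # Hypothetical data: f(x) = sin(x) sampled at five points in S_x = [0, pi],
    # approximated by the affine F_hat(x, theta) = theta[0] + theta[1] * x.
    G = [(x, np.sin(x)) for x in np.linspace(0.0, np.pi, 5)]
    F_hat = lambda x, theta: theta[0] + theta[1] * x
    print(J(np.array([0.0, 0.5]), G, F_hat))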
Why is function approximation important in adaptive estimation and control? As a simple example, suppose we wish to drive the output of the system defined by

    \dot{x} = f(x) + u    (4.5)

to zero, where x is a scalar state, u is the input, and f(x) is unknown. If f(x) were known, we could pick u = -f(x) - kx with k > 0, and the output would be driven to zero; this happens because the closed-loop dynamics become \dot{x} = -kx, which is an exponentially stable system. When f(x) is unknown, we instead seek an approximation \hat{F}(x, \theta) to use in place of f(x) in the control law.

In general, it will be our task to find θ = θ* so that the approximator satisfies \hat{F}(x, θ) = \hat{F}(x, θ*) ≈ f(x). Notice that even in this simple problem, some key issues with trying to find θ = θ* are present when the cost function (4.4) is used in place of (4.3). For instance, to generate the training data set G we need to assume that we know (can measure) x. However, even if we can measure x, to obtain the values y^i = f(x^i) we must also assume that we know \dot{x} (which can be difficult to measure due to noise). If we do know \dot{x}, then we can let

    f(x) = \dot{x} - u

and use the measured values of x, \dot{x}, and u
to form G for training the approximator. However, we cannot in general pick the x^i as we wish unless we can repeatedly initialize the differential equation in (4.5) with every possible x^i ∈ S_x. Since in many practical situations there is only one initial condition (or a finite number of them) that we can pick, the only data we can gather are constrained by the solution of (4.5). This presents us with the following problem: How can we pick the input u(t) so that the data we can collect to put in G will ensure good function approximation? Note that this choice also impacts our ability to "steer" the state to regions in S_x where we need to improve approximation accuracy. Also, the connection between G and approximation accuracy depends critically on what optimization algorithm is used to construct θ, as well as on the approximator's structural ability to represent f.
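To make the data-collection constraint concrete, the following sketch simulates the scalar system (4.5) from a single initial condition and records (x^i, y^i) pairs using y = ẋ − u; the particular f, u, initial condition, and step size are assumptions for illustration only, and the printout shows that the collected x^i are confined to the region of S_x actually visited by the trajectory:

    import numpy as np

    # Hypothetical unknown function and input signal (illustration only).
    f = lambda x: -x + np.sin(x)
    u = lambda t: 0.5 * np.sin(0.2 * t)

    dt, T = 0.01, 20.0
    x = 1.0                          # the single initial condition available to us
    G = []                           # training data pairs (x^i, y^i)
    for k in range(int(T / dt)):
        t = k * dt
        xdot = f(x) + u(t)           # in practice xdot would be a (noisy) measurement
        G.append((x, xdot - u(t)))   # y = f(x) = xdot - u
        x = x + dt * xdot            # Euler step of the differential equation (4.5)

    xs = np.array([xi for xi, _ in G])
    print("region of S_x actually visited:", xs.min(), xs.max())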
We see that even for our simple scalar problem, a guarantee of approximation accuracy is difficult to provide. Often, the central focus is instead on showing that even if perfect approximation is not achieved, we still get a stable closed-loop system. In fact, for our example, even if \hat{F}(x, θ) does not match f(x) exactly, the closed-loop state can be shown to remain bounded and converge to a ball around zero, assuming that a θ can be found that keeps the approximation error sufficiently small (a requirement closely related to the least squares problem (4.7)).
The main focus of this chapter is to provide optimization algorithms for adjusting the parameters of approximators such as neural networks and fuzzy systems. As the discussion above illustrates, the final approximation accuracy will not be paramount. We simply need to show that if we use the optimization methods presented in this chapter to adjust the approximator, then the resulting closed-loop system will be stable. The size of the approximator error, however, will typically affect the performance of the closed-loop system.
4.3 Linear Least Squares
We will first concentrate on solving the least squares problem for the case where

    J(\theta) = \sum_{i=1}^{M} w_i \bigl|f(x^i) - \hat{F}(x^i, \theta)\bigr|^2,    (4.7)

where w_i > 0 are some scalars, f(x) is an unknown function, and \hat{F}(x, θ) is an approximator that is linear in its adjustable parameters, that is, one that can be written in the form

    \hat{F}(x, \theta) = \theta^T \zeta(x),    (4.8)

where ζ(x) is a known vector of regression functions. Later, in Section 4.4, we will consider some \hat{F}(x, θ) such that θ does not necessarily appear linearly as in the form of (4.8).
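In code, a linear in the parameters approximator of the form (4.8) and the weighted cost (4.7) might be written as follows; the polynomial regressor ζ(x) chosen here is an assumption for illustration, not one prescribed by the text:

    import numpy as np

    def zeta(x):
        """Assumed regression vector: a simple polynomial basis for scalar x."""
        return np.array([1.0, x, x ** 2])

    def F_hat(x, theta):
        """Linear in the parameters approximator, as in (4.8): theta^T zeta(x)."""
        return theta @ zeta(x)

    def J_weighted(theta, data, w):
        """Weighted least squares cost of the form (4.7)."""
        return sum(wi * (yi - F_hat(xi, theta)) ** 2 for (xi, yi), wi in zip(data, w))

    print(F_hat(0.5, np.array([1.0, 2.0, 0.0])))   # 1 + 2*0.5 = 2.0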
4.3.1 Batch Least Squares
As a motivating application, the batch least squares method can be used to identify the parameters of a discrete-time model that is linear in its parameters,

    y(k) = \theta^T \zeta(k),

where u(k) and y(k) are the system input and output at time k and ζ(k) is a regression vector formed from past values of u and y. This form fits our framework, with the training data pairs obtained directly from measured input-output data.
In the batch least squares method we define
    Y = [y^1, y^2, \ldots, y^M]^T

to be the M × 1 vector of output data, where the y^i, i = 1, 2, ..., M, come from G (i.e., the y^i such that (x^i, y^i) ∈ G). We let

    \Phi = \bigl[\zeta^1, \zeta^2, \ldots, \zeta^M\bigr]^T

be the M × p matrix whose i-th row is the regression vector (ζ^i)^T = ζ(x^i)^T corresponding to the pair (x^i, y^i) ∈ G. Let
    E = Y - \Phi\theta

denote the vector of errors between the measured outputs and the approximator outputs, and choose

    J(\theta) = E^T E = \sum_{i=1}^{M} \bigl(y^i - (\zeta^i)^T \theta\bigr)^2

as the measure of how well the data are matched for a given θ, which is (4.7) with w_i = 1 for i = 1, ..., M. We want to pick θ to minimize J(θ); since J is a quadratic function of θ, a local minimum is also a global minimum in this case. If we differentiate J with respect to θ and set the result equal to zero, we get an equation for the best estimate of the parameters in the least squares sense. Another approach to deriving this result is to notice that, letting \hat{\theta} = (\Phi^T\Phi)^{-1}\Phi^T Y (assuming \Phi^T\Phi is invertible) and completing the square, we can write

    J(\theta) = Y^T\bigl(I - \Phi(\Phi^T\Phi)^{-1}\Phi^T\bigr)Y + (\theta - \hat{\theta})^T \Phi^T\Phi\,(\theta - \hat{\theta}).
Since the first term in this equation is independent of θ, we cannot reduce J via this term, so it can be ignored. Thus, to get the smallest value of J, we choose θ so that the second term is equal to zero, since its contribution to J is never negative. This gives the batch least squares estimate

    \hat{\theta} = (\Phi^T\Phi)^{-1}\Phi^T Y,    (4.9)

since the smallest we can make the last term in the above equation is zero. This is the equation for batch least squares; it shows that we can directly compute the least squares estimate from a "batch" of data that has been loaded into Φ and Y. If we pick the inputs to the system so that it is sufficiently excited, then Φ^TΦ will be nonsingular and the estimate in (4.9) is well defined.

A weighted version of the batch estimate can be derived for the cost (4.7) with general weights w_i > 0. For instance, we may choose w_1 < w_2 < ... < w_M when x^2 is collected after x^1, x^3 is collected after x^2, and so on, so that more recent data are weighted more heavily; the estimate then becomes \hat{\theta} = (\Phi^T W \Phi)^{-1}\Phi^T W Y with W = diag(w_1, \ldots, w_M). To show this, simply use (4.10) and proceed with the derivation in the same manner as above.
As an example of how batch least squares may be used, suppose that we would like to identify the coefficients of a system that is linear in its parameters, and that the training data set we would like to fit the system to is given by

    G = \{(x^1, y^1), (x^2, y^2), (x^3, y^3)\},

so that M = 3. We will use (4.9) to compute the parameters of the model that best fits the data (in the sense that it minimizes the sum of the squared distances between the identified system and the data). To do this we form Y and Φ from the three data pairs as defined above and compute \hat{\theta} = (\Phi^T\Phi)^{-1}\Phi^T Y; the resulting system then best fits the data in the least squares sense. The same general approach applies for any number of data pairs M ≥ p.
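The computation in (4.9) is straightforward to carry out numerically. The sketch below uses hypothetical data pairs (not the values of the original example) and an assumed regressor ζ(x) = [1, x]^T, and solves the least squares problem with a solver rather than an explicit matrix inverse, which is numerically preferable:

    import numpy as np

    # Hypothetical training data: M = 3 pairs (x^i, y^i), for illustration only.
    X = np.array([0.0, 1.0, 2.0])
    Y = np.array([1.1, 2.9, 5.2])

    # Regression matrix Phi whose rows are zeta(x^i)^T, with zeta(x) = [1, x]^T.
    Phi = np.column_stack([np.ones_like(X), X])

    # theta_hat = (Phi^T Phi)^{-1} Phi^T Y as in (4.9); lstsq gives the same
    # answer when Phi^T Phi is invertible but is better conditioned numerically.
    theta_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    print(theta_hat)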
4.3.2 Recursive Least Squares
While the batch least squares approach has proven to be very successful for a variety of applications, the fact that by its very nature it is a "batch" method (i.e., all the data are gathered, then processing is done) may present computational problems. For small M we could clearly repeat the batch calculation for increasingly more data as they are gathered, but as M becomes larger the computations become prohibitive because the dimensions of Φ and Y depend on M. Next, we derive a recursive version of the batch least squares method that will allow us to update our estimate of θ* each time we get a new data pair, without using all the old data in the computation and without having to compute the inverse of Φ^TΦ. Since we will be successively increasing the size of G, and since we will assume that its size increases by one at each time step, we let the time index k denote the number of data pairs currently in G, define

    P(k) = \bigl(\Phi^T(k)\,\Phi(k)\bigr)^{-1},

and let θ(k) denote the batch least squares estimate (4.9) computed from the first k data pairs
for all k. We have

    P^{-1}(k) = \Phi^T(k)\,\Phi(k) = \sum_{i=1}^{k} \zeta^i (\zeta^i)^T,

so we can pull the last term out of the sum to obtain

    P^{-1}(k) = \sum_{i=1}^{k-1} \zeta^i (\zeta^i)^T + \zeta^k (\zeta^k)^T.

Using this relation, the batch estimate computed with the first k data pairs can be written recursively as

    \theta(k) = \theta(k-1) + P(k)\,\zeta^k \bigl(y^k - (\zeta^k)^T \theta(k-1)\bigr),    (4.16)

with

    P^{-1}(k) = P^{-1}(k-1) + \zeta^k (\zeta^k)^T.    (4.17)

This provides a method to compute an estimate of the parameters θ(k) at each time step k from the previous estimate θ(k−1) and the most recent data pair that we received, (ζ^k, y^k). Notice that (y^k − (ζ^k)^T θ(k−1)) is the error in predicting y^k using θ(k−1).
But then we will have to compute the inverse of a matrix at each time step (i.e., each time we get another input-output data pair); clearly, this is undesirable in an on-line setting. We will use the matrix inversion lemma to remove the need to compute the inverse of P^{-1}(k) that comes from (4.17) so that it can be used in (4.16) to update θ. Notice that applying the matrix inversion lemma to (4.17) gives

    P(k) = P(k-1) - \frac{P(k-1)\,\zeta^k (\zeta^k)^T P(k-1)}{1 + (\zeta^k)^T P(k-1)\,\zeta^k},

which, together with

    \theta(k) = \theta(k-1) + P(k)\,\zeta^k \bigl(y^k - (\zeta^k)^T \theta(k-1)\bigr)    (4.20)

(which was derived in (4.16)), is called the "recursive least squares (RLS) algorithm." Basically, the matrix inversion lemma turns the matrix inversion into a division by a scalar, since the quantity 1 + (ζ^k)^T P(k−1) ζ^k is a scalar. To initialize the RLS algorithm we must choose θ(0) and P(0); a common choice is P(0) = αI for some large α > 0. This is the choice that is often used in practice, with θ(0) taken to be our best a priori guess of the true parameters.
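A minimal sketch of the resulting RLS recursion, combining the update (4.20) with the P(k) formula obtained from the matrix inversion lemma; the initialization follows the common choice mentioned above, and the function names are illustrative:

    import numpy as np

    def rls_init(p, alpha=1000.0):
        """theta(0) = 0 and P(0) = alpha * I with alpha large (a common default)."""
        return np.zeros(p), alpha * np.eye(p)

    def rls_update(theta, P, zeta, y):
        """One recursive least squares step for the model y ~ zeta^T theta."""
        denom = 1.0 + zeta @ P @ zeta            # scalar, so no matrix inversion
        K = P @ zeta / denom                     # gain vector (equals P(k) zeta^k)
        theta = theta + K * (y - zeta @ theta)   # correct using the prediction error
        P = P - np.outer(K, zeta @ P)            # matrix inversion lemma update of P
        return theta, P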
There is also a "weighted recursive least squares" (WRLS) algorithm. Suppose that the parameters θ of the physical system vary slowly. In this case it may be advantageous to choose

    J(\theta, k) = \sum_{i=1}^{k} \lambda^{k-i} \bigl(y^i - (\zeta^i)^T \theta\bigr)^2,

where 0 < λ ≤ 1 is called a "forgetting factor," since it gives the more recent data a higher weight in the optimization. Proceeding as above, you can show that the equations for WRLS are given by

    \theta(k) = \theta(k-1) + K(k)\bigl(y^k - (\zeta^k)^T \theta(k-1)\bigr),
    K(k) = \frac{P(k-1)\,\zeta^k}{\lambda + (\zeta^k)^T P(k-1)\,\zeta^k},    (4.21)
    P(k) = \frac{1}{\lambda}\bigl(I - K(k)\,(\zeta^k)^T\bigr)P(k-1),

where, when λ = 1, we get standard RLS.
For these equations to correspond to the weighted batch solution, the matrix \sum_{i=1}^{k} \lambda^{k-i} \zeta^i (\zeta^i)^T must be nonsingular for all k. The question is what happens if this term becomes singular. For instance, suppose that ζ^k = 0 for all k (certainly not a sufficiently rich signal), and notice that in this case (4.21) becomes

    \theta(k) = \theta(k-1), \qquad P(k) = \frac{1}{\lambda}P(k-1).    (4.22)

Hence, if θ(0) = 0 and 0 < λ < 1 (so that 1/λ is bigger than one), then θ(k) = 0 for all k ≥ 0, and as k → ∞ the diagonal elements of P(k) grow without bound; the estimator then becomes extremely sensitive to noise and disturbances once informative data finally arrive (a behavior often referred to as covariance wind-up).
Consider, for example, estimating the parameters of the first-order system

    y(k) = a\,y(k-1) + b(k)\,u(k-1),

where a = 0.9 and b(k) = 1 + 0.2\sin(0.02\pi k). Since b(k) is a time-varying coefficient, the recursive least squares routine with a forgetting factor may be used to estimate a and b(k) using ζ(k) = [y(k−1), u(k−1)]^T. As the value of λ is decreased, the RLS routine will tend to "forget" older input-output samples more quickly. Figure 4.1 shows the true value of b(k) along with the RLS estimates when λ = 0.2, 0.8, 0.99, where the input is defined by u(k) = 0.5\sin(0.2\pi k) + \cos(0.1\pi k). As λ → 1, RLS with a forgetting factor approaches the batch least squares solution, in which a single constant that minimizes the sum of the squared errors is estimated for b(k). Even though using λ = 0.2 in this case caused the RLS estimate to accurately track the true value of b(k), using small values of λ will tend to make errors in the parameter estimates more sensitive to noise in the measurements.
Figure 4.1  True value of b(k) and the RLS estimates of b(k) versus sample k for several forgetting factors.
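The experiment in Figure 4.1 can be reproduced approximately with the sketch below. The plant structure y(k) = a y(k−1) + b(k) u(k−1) is taken from the stated regressor ζ(k) = [y(k−1), u(k−1)]^T, while the initialization and simulation horizon are assumptions, so treat the listing as an illustration rather than the original simulation code:

    import numpy as np

    lam = 0.8                                    # forgetting factor, 0 < lam <= 1
    theta = np.zeros(2)                          # estimates of [a, b]
    P = 1000.0 * np.eye(2)

    y_prev, u_prev = 0.0, 0.0
    for k in range(1, 1001):
        a, b = 0.9, 1.0 + 0.2 * np.sin(0.02 * np.pi * k)
        u = 0.5 * np.sin(0.2 * np.pi * k) + np.cos(0.1 * np.pi * k)
        y = a * y_prev + b * u_prev              # simulated plant output

        zeta = np.array([y_prev, u_prev])        # regressor from the example
        denom = lam + zeta @ P @ zeta
        K = P @ zeta / denom
        theta = theta + K * (y - zeta @ theta)   # WRLS parameter update
        P = (P - np.outer(K, zeta @ P)) / lam    # WRLS covariance update

        y_prev, u_prev = y, u

    print("estimated [a, b] at the final step:", theta)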
4.4 Nonlinear Least Squares
While in the last section we studied the use of linear in the parameter approximators \hat{F}(x, θ) = θ^T ζ(x) to minimize (4.7), here we will also seek to minimize (4.7) (with the w_i = 1), but for the remainder of this chapter we will consider the adjustment of θ for the general nonlinear in the parameter problem for \hat{F}(x, θ). First, we explain how to use gradient methods to adjust θ for only a single training data pair. Next, we generalize the development to the case of multiple (and sequential) training data and to the discrete-time case (throughout, we will discuss the important issue of convergence). We then discuss the constrained optimization problem and close the chapter with a brief treatment of line search and higher order techniques for function approximation. This last material will be particularly useful to those who are only concerned with the off-line training of approximators (e.g., for estimators), and in cases where you want to perform off-line tuning of an approximator before adjusting it on-line.
4.4.1 Gradient Optimization: Single Training Data Pair
Suppose that we are given a single training data pair (x^1, y^1), where y^1 = f(x^1). Given an input x^1, one would like to adjust θ so that the difference between \hat{F}(x^1, θ) and y^1 is as small as possible, that is, so that the cost

    J(\theta) = \tfrac{1}{2}\bigl|y^1 - \hat{F}(x^1, \theta)\bigr|^2

is made small. A natural way to do this is to adjust θ continuously along the negative gradient of J,

    \dot{\theta} = -\eta\,\frac{\partial J}{\partial \theta},    (4.24)

where η > 0 is a constant and, if θ = [θ_1, \ldots, θ_p]^T, then ∂J/∂θ = [∂J/∂θ_1, \ldots, ∂J/∂θ_p]^T. To see that J is nonincreasing when θ is adjusted according to (4.24), notice that

    J(\theta) = \tfrac{1}{2}\Bigl((y^1)^T y^1 - 2\hat{F}(x^1, \theta)^T y^1 + \hat{F}(x^1, \theta)^T \hat{F}(x^1, \theta)\Bigr).

Now, taking the partial derivative, we get

    \frac{\partial J}{\partial \theta} = -\Bigl(\frac{\partial \hat{F}(x^1, \theta)}{\partial \theta}\Bigr)^T\bigl(y^1 - \hat{F}(x^1, \theta)\bigr),

so that along (4.24) we have \dot{J} = (\partial J/\partial\theta)^T\dot{\theta} = -\eta\,|\partial J/\partial\theta|^2 \le 0.
Later, the development of adaptive controllers will make direct use of this gradient information. The above results may be summarized as a simple theorem, the proof of which is obvious since J is nonincreasing: if θ is adjusted according to (4.24), then J(θ(t)) is nonincreasing, so the output error y^1 − \hat{F}(x^1, θ(t)) remains bounded for all t ≥ 0.
As an example, consider a simple approximator defined by

    \hat{F}(x^1, \theta) = a_1 + \tanh(a_2 x^1),

where θ = [a_1, a_2]^T.
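For this approximator, ∂\hat{F}/∂a_1 = 1 and ∂\hat{F}/∂a_2 = x^1 sech^2(a_2 x^1), so the gradient law (4.24) is easy to implement. The sketch below approximates the continuous update with small Euler steps; the data pair, gain, and step size are assumed values for illustration:

    import numpy as np

    x1, y1 = 0.5, 1.2              # assumed single training pair (x^1, y^1)
    eta, dt = 2.0, 0.01            # assumed adaptation gain and Euler step size
    theta = np.array([0.0, 0.0])   # theta = [a1, a2]

    def F_hat(x, th):
        return th[0] + np.tanh(th[1] * x)

    for _ in range(2000):
        e = y1 - F_hat(x1, theta)                                   # output error
        grad_F = np.array([1.0, x1 / np.cosh(theta[1] * x1) ** 2])  # [dF/da1, dF/da2]
        theta = theta + dt * eta * e * grad_F                       # Euler step of (4.24)

    print(theta, F_hat(x1, theta))   # F_hat(x^1, theta) approaches y^1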
4.4.2 Gradient Optimization: Multiple Training Data Pairs
In the previous subsection we considered the case when a single data pair (x^1, y^1) is to be matched. Now consider the problem where the data pairs (x^i, y^i) in G are to be matched for i = 1, ..., M. In this case, we let

    e^i = y^i - \hat{F}(x^i, \theta),

and let the cost function be

    J(\theta) = \tfrac{1}{2}\sum_{i=1}^{M} |e^i|^2.

If θ is adjusted according to the gradient update law

    \dot{\theta} = \eta \sum_{i=1}^{M} e^i\,\frac{\partial \hat{F}(x^i, \theta)}{\partial \theta},    (4.31)

with η > 0, then J does not increase over time, as stated in the following theorem: along the trajectories generated by (4.31) we have \dot{J} ≤ 0, so J(θ(t)) ≤ J(θ(0)) for all t ≥ 0, which implies that the output error for each data pair is bounded for all time.
The update law defined by (4.31) is a continuous version of the batch steepest descent (back-propagation type) update commonly used in the neural network literature. Once each ζ^i = ∂\hat{F}(x^i, θ)/∂θ (the gradient of the approximator output with respect to the parameters) is computed, the law is rather easy to apply. To see how each ζ^i can be calculated, consider again the approximator \hat{F}(x, θ) = a_1 + \tanh(a_2 x) used in the example above.
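A minimal sketch of a discrete-time (Euler) approximation of the multiple-pair gradient law (4.31) for that approximator; the data pairs, gain, and step size are assumptions made for illustration:

    import numpy as np

    data = [(-1.0, -0.6), (0.0, 0.1), (1.0, 0.9)]   # assumed pairs (x^i, y^i), M = 3
    eta, dt = 1.0, 0.01
    theta = np.array([0.0, 0.5])                    # theta = [a1, a2]

    def F_hat(x, th):
        return th[0] + np.tanh(th[1] * x)

    def zeta_i(x, th):
        """Gradient of F_hat with respect to theta, evaluated at the input x."""
        return np.array([1.0, x / np.cosh(th[1] * x) ** 2])

    for _ in range(5000):
        step = np.zeros_like(theta)
        for x_i, y_i in data:
            e_i = y_i - F_hat(x_i, theta)           # error for the i-th pair
            step += e_i * zeta_i(x_i, theta)        # accumulate per-pair gradient terms
        theta = theta + dt * eta * step             # Euler step of (4.31)

    print(theta, [F_hat(x, theta) for x, _ in data])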
Turning to the question of approximation accuracy, suppose that the approximator structure is rich enough to represent the unknown function f(x) over x ∈ D ⊂ R^n. In other words, it may be possible to choose some θ such that

    \sup_{x \in D} \bigl|f(x) - \hat{F}(x, \theta)\bigr| < \epsilon,

where ε > 0 is some small constant. It is our hope that if we choose the training data so that it covers D well, then finding some θ such that each |e^i| → 0 will imply that |f(x) − \hat{F}(x, θ)| → 0 on D, provided that M is large enough.
Notice that the theorems for the update laws only guarantee that the magnitude of the error will not increase; they do not ensure that |e^i| → 0 for i = 1, 2, ..., M. The update laws are gradient-based algorithms which modify θ in the direction which decreases J(θ). Depending upon the approximator structure, situations may exist in which J has both local and global minima. A global minimum is a choice of the approximator parameters such that

    \theta^* = \arg\min_{\theta \in S} J(\theta).

That is, θ* is the set of approximator parameters which minimizes J(θ) over all of S. A local minimum is found when an arbitrarily small change in θ ≠ θ* in any direction will not decrease J. Since ∂J/∂θ = 0 at a local minimum, we have \dot{\theta} = 0 there, so that the approximator parameters stop adjusting. Clearly, we would like to find a global minimum, but for multiple data pairs this may be difficult.
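To see why this matters, the short sketch below runs plain gradient descent on a one-parameter cost that has both a local and a global minimum; the cost is purely illustrative and not taken from the text. Started near the local minimum, the parameter converges there and stops, precisely because ∂J/∂θ = 0 at that point:

    # Illustrative cost with a local minimum near theta = 2.1 and a global
    # minimum near theta = -2.3.
    J = lambda th: 0.1 * th ** 4 - th ** 2 + 0.4 * th
    dJ = lambda th: 0.4 * th ** 3 - 2.0 * th + 0.4

    def descend(theta, step=0.01, iters=5000):
        for _ in range(iters):
            theta = theta - step * dJ(theta)   # gradient step
        return theta

    print(descend(3.0))    # converges to the local minimum near 2.1
    print(descend(-3.0))   # converges to the global minimum near -2.3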
As an example, suppose that we use gradient updates to adjust the parameters of an approximator (which may be considered to be either a radial basis function neural network or a type of fuzzy system) defined by (4.39), where p = 20, σ = 0.5, and the centers c_i are evenly distributed in