2.3.2.1 Recursive WLS: Derivation

The WLS parameter estimate can be expressed as


$$\hat{\theta}_k = P_k^{-1} R_k, \quad \text{where} \quad P_k = \frac{1}{k}\,\Phi_k W_k \Phi_k^\top \quad \text{and} \quad R_k = \frac{1}{k}\,\Phi_k W_k Y_k. \qquad (2.17)$$

In the case where $w_k = 1$, $P_k$ is the sample regressor autocorrelation matrix and $R_k$ is the sample cross-correlation matrix between the regressor and the function output. For interpretations of these algorithms in a statistical setting, the interested reader should see, for example, [133, 164].

From the definitions of $\Phi$, $Y$, and $W$, assuming that $W$ is a diagonal matrix, we have that

$$Y_{k+1} = \begin{bmatrix} Y_k \\ y_{k+1} \end{bmatrix}, \quad \Phi_{k+1} = \begin{bmatrix} \Phi_k & \phi_{k+1} \end{bmatrix}, \quad \text{and} \quad W_{k+1} = \begin{bmatrix} W_k & 0 \\ 0 & w_{k+1} \end{bmatrix}. \qquad (2.18)$$

Therefore,

$$\Phi_{k+1} W_{k+1} \Phi_{k+1}^\top = \Phi_k W_k \Phi_k^\top + \phi_{k+1} w_{k+1} \phi_{k+1}^\top. \qquad (2.19)$$

Calculation of the WLS parameter estimate after the $(k+1)$st sample is available will require inversion of $\Phi_{k+1} W_{k+1} \Phi_{k+1}^\top$. The Matrix Inversion Lemma [99] will enable derivation of the desired recursive algorithm based on eqn. (2.19).

The Matrix Inversion Lemma states that if matrices $A$, $C$, and $(A + BCD)$ are invertible (and of appropriate dimension), then

$$(A + BCD)^{-1} = A^{-1} - A^{-1} B \left( D A^{-1} B + C^{-1} \right)^{-1} D A^{-1}.$$

The validity of this expression is demonstrated by multiplying $(A + BCD)$ by the right-hand side expression and showing that the result is the identity matrix.
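As a quick numerical sanity check (an illustration added here, not part of the original text), the lemma can also be verified for randomly generated matrices; the dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned, invertible A
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, m)) + m * np.eye(m)   # invertible C
D = rng.standard_normal((m, n))

A_inv = np.linalg.inv(A)
C_inv = np.linalg.inv(C)

# Left-hand side: direct inversion of (A + B C D).
lhs = np.linalg.inv(A + B @ C @ D)
# Right-hand side: the Matrix Inversion Lemma.
rhs = A_inv - A_inv @ B @ np.linalg.inv(D @ A_inv @ B + C_inv) @ D @ A_inv

print(np.allclose(lhs, rhs))   # True, up to floating-point error
```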

Applying the Matrix Inversion Lemma to the task of inverting $\Phi_{k+1} W_{k+1} \Phi_{k+1}^\top$, with $A_k = \Phi_k W_k \Phi_k^\top$, $B = \phi_{k+1}$, $C = w_{k+1}$, and $D = \phi_{k+1}^\top$, yields

$$A_{k+1}^{-1} = \left( \Phi_k W_k \Phi_k^\top + \phi_{k+1} w_{k+1} \phi_{k+1}^\top \right)^{-1} \qquad (2.20)$$

$$A_{k+1}^{-1} = A_k^{-1} - A_k^{-1} \phi_{k+1} \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1} \phi_{k+1}^\top A_k^{-1}. \qquad (2.21)$$

Note that the WLS estimate after samples $k$ and $(k+1)$ can respectively be expressed as

$$\hat{\theta}_k = A_k^{-1} \Phi_k W_k Y_k \quad \text{and} \quad \hat{\theta}_{k+1} = A_{k+1}^{-1} \Phi_{k+1} W_{k+1} Y_{k+1}.$$

The recursive WLS update is derived, using eqns. (2.20) and (2.21), as follows:

$$\hat{\theta}_{k+1} = \left[ A_k^{-1} - A_k^{-1} \phi_{k+1} \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1} \phi_{k+1}^\top A_k^{-1} \right] \left[ \Phi_k W_k Y_k + \phi_{k+1} w_{k+1} y_{k+1} \right]$$

$$= \hat{\theta}_k - A_k^{-1} \phi_{k+1} \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1} \phi_{k+1}^\top \hat{\theta}_k + A_k^{-1} \phi_{k+1} w_{k+1} y_{k+1} - A_k^{-1} \phi_{k+1} \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1} \phi_{k+1}^\top A_k^{-1} \phi_{k+1} w_{k+1} y_{k+1}$$

$$= \hat{\theta}_k - A_k^{-1} \phi_{k+1} \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1} \phi_{k+1}^\top \hat{\theta}_k + A_k^{-1} \phi_{k+1} \left[ I - \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1} \phi_{k+1}^\top A_k^{-1} \phi_{k+1} \right] w_{k+1} y_{k+1}$$

$$= \hat{\theta}_k - A_k^{-1} \phi_{k+1} \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1} \phi_{k+1}^\top \hat{\theta}_k + A_k^{-1} \phi_{k+1} \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1} \left[ \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} - \phi_{k+1}^\top A_k^{-1} \phi_{k+1} \right] w_{k+1} y_{k+1}$$

$$\hat{\theta}_{k+1} = \hat{\theta}_k + A_k^{-1} \phi_{k+1} \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1} \left( y_{k+1} - \phi_{k+1}^\top \hat{\theta}_k \right)$$

$$\hat{\theta}_{k+1} = \hat{\theta}_k + A_k^{-1} \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1} \phi_{k+1} \left( y_{k+1} - \phi_{k+1}^\top \hat{\theta}_k \right) \qquad (2.22)$$

where we have used the fact that $\left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)$ is a scalar. Shifting indices in eqn. (2.21) yields the recursive equation for $A_k^{-1}$:

$$A_k^{-1} = A_{k-1}^{-1} - A_{k-1}^{-1} \phi_k \left( \phi_k^\top A_{k-1}^{-1} \phi_k + w_k^{-1} \right)^{-1} \phi_k^\top A_{k-1}^{-1}. \qquad (2.23)$$

2.3.2.2 Recursive WLS: Properties

The RWLS algorithm is defined by eqns. (2.22) and (2.23). This algorithm has several features worth noting (a brief code sketch of the recursion follows the list).

1. Eqn. (2.22) has a standard predictor-corrector format

$$\hat{\theta}_{k+1} = \hat{\theta}_k + \Gamma_k \phi_{k+1} \left( y_{k+1} - \hat{y}_{k+1:k} \right) \qquad (2.24)$$

where $\Gamma_k = A_k^{-1} \left( \phi_{k+1}^\top A_k^{-1} \phi_{k+1} + w_{k+1}^{-1} \right)^{-1}$ and $\hat{y}_{k+1:k} = \phi_{k+1}^\top \hat{\theta}_k$ is the estimate of $y_{k+1}$ based on $\hat{\theta}_k$. The majority of computations for the RWLS algorithm are involved in the propagation of $A_k^{-1}$ by eqn. (2.23).

2. The RWLS calculation only uses information from the last iteration (i.e., $A_k^{-1}$ and $\hat{\theta}_k$) and the current sample (i.e., $y_{k+1}$ and $\phi_{k+1}$). The memory requirements of the RWLS algorithm are proportional to $N$, not $k$. Therefore, the memory requirements are fixed at the design stage.

3. The WLS calculation of eqn. (2.13) requires inversion of an $N \times N$ matrix. The RWLS algorithm only requires inversion of an $n \times n$ matrix, where $N$ is the number of basis functions and $n$ is the output dimension of $f$, which we have assumed to be one. Therefore, the matrix inversion simplifies to a scalar division. Note that $A_k$ is never required. Therefore, $A_k^{-1}$ is propagated, but never inverted.

4. All vectors and matrices in eqns. (2.22) and (2.23) have dimensions related to $N$, not $k$. Therefore, the computational requirements of the RWLS algorithm are fixed at the design stage.

5. Since no approximations have been made, the recursive WLS parameter estimate is the same as the solution of eqn. (2.13), if the matrix $A_k^{-1}$ is properly initialized. One approach is to accumulate enough samples that $A_k$ is nonsingular before initializing the RWLS algorithm. An alternative common approach is to initialize $A_0^{-1}$ as a large positive definite matrix. This approximate initialization introduces an error in $\hat{\theta}_k$ that is proportional to $\|A_0\|$. This error is small and decreases as $k$ increases. For additional details see Section 2.2 in [154].

6. Due to the equivalence of the WLS and RWLS solutions, the RWLS estimate will not be the unique solution to the WLS cost function until the matrix $\Phi_k W_k \Phi_k^\top$ is nonsingular. This condition is referred to as $\Phi_k$ being sufficiently exciting.
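The recursion of eqns. (2.22) and (2.23), equivalently the predictor-corrector form of eqn. (2.24), can be implemented in a few lines. The following minimal sketch is added for illustration; the data-generating model and initialization values are hypothetical.

```python
import numpy as np

def rwls_update(theta, A_inv, phi, y, w=1.0):
    """One RWLS step implementing eqns. (2.22)-(2.23) for a scalar output."""
    A_inv_phi = A_inv @ phi
    denom = float(phi @ A_inv_phi) + 1.0 / w       # scalar: no matrix inversion needed
    gain = A_inv_phi / denom                       # Gamma_k * phi_{k+1} in eqn. (2.24)
    theta_new = theta + gain * (y - phi @ theta)   # eqn. (2.22)
    A_inv_new = A_inv - np.outer(A_inv_phi, A_inv_phi) / denom   # eqn. (2.23), index shifted
    return theta_new, A_inv_new

# Hypothetical usage: y = phi(x)^T theta_true + noise, with phi(x) = [1, x, x^2]^T.
rng = np.random.default_rng(1)
theta_true = np.array([0.5, -1.0, 2.0])
theta = np.zeros(3)
A_inv = 1e6 * np.eye(3)          # "large" positive definite initialization (item 5)
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    phi = np.array([1.0, x, x**2])
    y = float(phi @ theta_true) + 0.01 * rng.standard_normal()
    theta, A_inv = rwls_update(theta, A_inv, phi, y)
print(theta)                     # close to theta_true after sufficient excitation
```

Note that only memory proportional to the number of basis functions $N$ is used, and the only "inversion" is the scalar division by `denom`, matching items 2 and 3 above.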

Various alternative parameter estimation algorithms can be derived (see Chapter 4).

These algorithms require substantially less memory and fewer computations since they do not propagate $A_k^{-1}$; the tradeoff is that the alternative algorithms converge asymptotically instead of yielding the optimal parameter estimate as soon as $\Phi_k$ achieves sufficient excitation. In fact, if convergence of the parameter vector is desired for non-WLS algorithms, then the more stringent condition of persistence of excitation will be required.

EXAMPLE 2.4

Example 2.1 presented a control approach requiring the storage of all past data $z(k)$. That approach had the drawback of requiring memory and computational resources that increased with $k$. The present section has shown that use of a function approximation structure of the form

$$\hat{f}(x) = \phi(x)^\top \hat{\theta}$$

and a parameter update law of the form of eqn. (2.24) (e.g., the RWLS algorithm) results in an adaptive function approximation approach with fixed memory and computational requirements.


Figure 2.5: Least squares polynomial approximations to experimental data. The polynomial orders are 1 (top left), 3 (top right), 5 (bottom left), and 7 (bottom right).

This example further considers Example 2.1 to motivate additional issues related to the adaptive function approximation problem.

Let $f$ be a polynomial of order $m$. Then, one possible choice of a basis for this approximator is (see Section 3.2) $\phi(x) = [1, x, \ldots, x^m]^\top$. Figure 2.5 displays the function approximation results for one set of experimental data (600 samples) and four different order polynomials. The x-axis of this figure corresponds to $\mathcal{D} = [-\pi, \pi]$ as specified in Example 2.1. Each of the polynomial approximations fits the data in the weighted least squares sense over the range of the data, which is approximately $\mathcal{B} = (-2, 2)$. Outside of the region $\mathcal{B}$, the behavior of each approximation is distinct.
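The qualitative behavior in Figure 2.5 can be reproduced with a short script. The sketch below is an added illustration; the data-generating function, noise level, and sample locations are hypothetical stand-ins for the experimental data of Example 2.1.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical stand-in for the experimental data: 600 noisy samples of a
# smooth function, collected only on B = (-2, 2) within D = [-pi, pi].
x_train = rng.uniform(-2.0, 2.0, 600)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(600)
x_domain = np.linspace(-np.pi, np.pi, 401)

for m in (1, 3, 5, 7):
    # Regressor matrix built from the basis phi(x) = [1, x, ..., x^m]^T.
    Phi = np.vander(x_train, m + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    f_hat = np.vander(x_domain, m + 1, increasing=True) @ theta
    # Inside B the fits are similar; outside B (|x| > 2) the different orders
    # diverge from one another, as in Figure 2.5.
    print(m, f_hat[0], f_hat[-1])   # values of the fit at x = -pi and x = +pi
```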

The disparity of the behavior of the approximators on $\mathcal{D} - \mathcal{B}$ should motivate questions related to the idea of generalization relative to the training data. First, we dichotomize the problem into local and nonlocal generalization. Local generalization refers to the ability of the approximator to accurately compute $\hat{f}(x) = \hat{f}(x_c + dx)$, where $x_c$ is the nearest training point and $dx$ is small. Local generalization is a necessary and desirable characteristic of parametric approximators. Local generalization allows accurate function approximation with finite memory approximators and finite amounts of training data. The approximation and local generalization characteristics of an approximator will depend on the type and magnitude of the measurement noise and disturbances, the continuity characteristics of $f$ and $\hat{f}$, and the type and number of elements in the regressor vector $\phi$. Nonlocal generalization refers to the ability of an approximator to accurately compute $\hat{f}(x)$ for $x \in \mathcal{D} - \mathcal{B}$. Nonlocal generalization is always a somewhat risky proposition.

Although the designer would like to minimize the norm of the function approximation error, $\int \| f(x) - \hat{f}(x) \|\, dx$, this quantity cannot be evaluated online, since $f(x)$ is not known. The norm of the sample data fit error, $\sum_{i=1}^{k} \| y_i - \hat{f}(x_i) \|$, can be evaluated and minimized. Figure 2.6 compares the minimum of these two quantities for the data of Figure 2.5 as the order $m$ of the polynomial is increased.


Figure 2.6: Data fit (dotted with circles) and function approximation (solid with x’s) error versus polynomial order.

Both graphs decrease for small values of $m$ until some critical regressor dimension $m^*$ is attained.

For $m > m^*$, the data fit error continues to decrease while the function approximation error actually increases. The data fit error decreases with $m$, since increasing the number of degrees of freedom of the approximator allows the measured data to be fit more accurately. The function approximation error increases with $m$ for $m > m^*$, since the ability of the approximator to fit the measurement noise actually increases the error of the approximator relative to the true function. The value $m^*$ is problem, data, and approximator dependent. In adaptive approximation problems where the data distribution and $f$ are unknown, estimation of $m^*$ prior to online operation is a difficult problem.
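The tradeoff around $m^*$ can be seen numerically with a sketch like the one below, added here for illustration. The "true" function, noise level, and sample size are hypothetical; they are chosen small enough that fitting the noise becomes visible at higher orders.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2.0, 2.0, 40)              # deliberately few samples
f_true = np.sin(x)                          # hypothetical true function values
y = f_true + 0.3 * rng.standard_normal(40)  # noisy measurements

for m in range(1, 13):
    Phi = np.vander(x, m + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    f_hat = Phi @ theta
    data_fit_err = np.linalg.norm(y - f_hat)          # nonincreasing in m
    func_approx_err = np.linalg.norm(f_true - f_hat)  # typically rises past m*
    print(m, round(float(data_fit_err), 3), round(float(func_approx_err), 3))
```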

Since this example has used the RWLS method, which propagates $A_k^{-1}$ without data forgetting, the parameter estimate is independent of the order in which the data is presented. Generally, parameter estimation algorithms of the form of eqn. (2.24) (e.g., gradient descent) are trajectory (i.e., order of data presentation) dependent.

Starting in Chapter 4, all derivations will be performed in continuous-time. In continuous-time, the analog of recursive parameter updates will be written as

$$\dot{\hat{\theta}}(t) = \Gamma(t)\, \phi(t) \left( y(t) - \hat{y}(t) \right) \qquad (2.25)$$

where $\Gamma(t)$ is the adaptive gain or learning rate. In discrete-time the corresponding adaptive gain $\Gamma_k$ (sometimes referred to as step size) needs to be sufficiently small in order to guarantee convergence; however, in continuous-time $\Gamma(t)$ simply needs to be positive definite (due to the infinitesimal change of the derivative $\dot{\hat{\theta}}(t)$).

EXAMPLE 2.5

The continuous-time least squares problem estimates the vector $\theta$ such that $\hat{y}(t) = \phi(t)^\top \theta$ minimizes

$$J(\theta) = \int_0^t \left( y(\tau) - \hat{y}(\tau) \right)^2 d\tau = \int_0^t \left( y(\tau) - \phi(\tau)^\top \theta \right)^2 d\tau \qquad (2.26)$$

where $y : \mathbb{R}^+ \mapsto \mathbb{R}^1$, $\theta \in \mathbb{R}^N$, and $\phi : \mathbb{R}^+ \mapsto \mathbb{R}^N$. Setting the gradient of $J(\theta)$ with respect to $\theta$ to zero yields the following:

$$\int_0^t \phi(\tau) \left( y(\tau) - \phi(\tau)^\top \theta \right) d\tau = 0$$

$$\int_0^t \phi(\tau) y(\tau)\, d\tau = \left( \int_0^t \phi(\tau) \phi(\tau)^\top d\tau \right) \theta$$

$$R(t) = P^{-1}(t)\, \theta$$

$$\hat{\theta}(t) = P(t) R(t) \qquad (2.27)$$

where $R(t) = \int_0^t \phi(\tau) y(\tau)\, d\tau$ and $P^{-1}(t) = \int_0^t \phi(\tau) \phi(\tau)^\top d\tau$. Note by the definitions of $P$ and $R$, that $P^{-1}$ is symmetric and that

$$\frac{d}{dt}\left[ R(t) \right] = \phi(t) y(t) \quad \text{and} \quad \frac{d}{dt}\left[ P^{-1}(t) \right] = \phi(t) \phi(t)^\top.$$

Since $P(t) P^{-1}(t) = I$, differentiation and rearrangement shows that in general the time derivative of a matrix and its inverse must satisfy $\dot{P} = -P \frac{d}{dt}\left[ P^{-1}(t) \right] P$; therefore, in least squares estimation

$$\dot{P}(t) = -P(t)\, \phi(t)\, \phi(t)^\top P(t). \qquad (2.28)$$

Finally, to show that the continuous-time least squares estimate of $\theta$ satisfies eqn. (2.25), we differentiate both sides of eqn. (2.27):

$$\dot{\hat{\theta}}(t) = \dot{P}(t) R(t) + P(t) \dot{R}(t)$$

$$= -P(t) \phi(t) \phi(t)^\top P(t) R(t) + P(t) \phi(t) y(t)$$

$$= P(t) \phi(t) \left( -\phi(t)^\top \hat{\theta}(t) + y(t) \right)$$

$$\dot{\hat{\theta}}(t) = P(t) \phi(t) \left( y(t) - \hat{y}(t) \right). \qquad (2.29)$$

Implementation of the continuous-time least squares estimation algorithm uses equations (2.28)-(2.29). Typically, the initial value of the matrix $P$ is selected to be large. The initial matrix must be nonsingular. Often, it is initialized as $P(0) = \gamma I$ where $\gamma$ is a large positive number. The implementation does not invert any matrix.
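As a minimal simulation sketch (added for illustration, not part of the text), eqns. (2.28) and (2.29) can be integrated numerically; the regressor, true parameters, step size, and initial gain below are hypothetical choices.

```python
import numpy as np

# Hypothetical setup: scalar output y(t) = phi(t)^T theta_true, two-element regressor.
theta_true = np.array([1.5, -0.7])
phi_of_t = lambda t: np.array([np.sin(t), np.cos(2.0 * t)])

dt = 1e-3
theta_hat = np.zeros(2)
P = 10.0 * np.eye(2)   # P(0) = gamma * I; a very large gamma would need a smaller step

for k in range(int(20.0 / dt)):
    t = k * dt
    phi = phi_of_t(t)
    y = float(phi @ theta_true)
    y_hat = float(phi @ theta_hat)
    # Forward-Euler integration of eqns. (2.28) and (2.29); no matrix is inverted.
    P_dot = -P @ np.outer(phi, phi) @ P
    theta_dot = P @ phi * (y - y_hat)
    P = P + dt * P_dot
    theta_hat = theta_hat + dt * theta_dot

print(theta_hat)   # approaches theta_true (small bias from the P(0) initialization)
```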

Before concluding this section, we consider the problem of approximating a function over a compact region $\mathcal{D}$. The cost function of interest is

$$J(\theta) = \int_{\mathcal{D}} \left( f(x) - \phi(x)^\top \theta \right)^\top \left( f(x) - \phi(x)^\top \theta \right) dx.$$

Again, we find the gradient of $J$ with respect to $\theta$, set it to zero, and find the resulting parameter estimate. The final result is that $\theta$ must satisfy (see Exercise 2.9)

$$\left( \int_{\mathcal{D}} \phi(x) \phi(x)^\top dx \right) \theta = \int_{\mathcal{D}} \phi(x) f(x)\, dx. \qquad (2.30)$$

Computation of $\theta$ by eqn. (2.30) requires knowledge of the function $f$. For the applications of interest herein, we do not have this luxury. Instead, we will have measurements that are indirectly related to the unknown function. Nonetheless, eqn. (2.30) shows that the condition of the matrix $\int_{\mathcal{D}} \phi(x) \phi(x)^\top dx$ is important. When the elements of $\phi$ are mutually orthonormal over $\mathcal{D}$, then $\int_{\mathcal{D}} \phi(x) \phi(x)^\top dx$ is an identity matrix. This is the optimal situation for solution of eqn. (2.30), but is often not practical in applications.
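To see why the conditioning of $\int_{\mathcal{D}} \phi(x)\phi(x)^\top dx$ matters, the following added sketch compares the (numerically approximated) Gram matrix of the monomial basis with that of normalized Legendre polynomials on the assumed domain $\mathcal{D} = [-1, 1]$.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

N = 8                                  # number of basis functions
x = np.linspace(-1.0, 1.0, 2001)       # grid for a simple Riemann-sum quadrature
dx = x[1] - x[0]

# Monomial basis phi(x) = [1, x, ..., x^{N-1}]^T.
Phi_mono = np.vander(x, N, increasing=True)
G_mono = Phi_mono.T @ Phi_mono * dx    # approximates the Gram matrix integral

# Legendre polynomials scaled so that each has unit L2 norm on [-1, 1].
Phi_leg = np.column_stack(
    [Legendre.basis(k)(x) * np.sqrt((2 * k + 1) / 2.0) for k in range(N)]
)
G_leg = Phi_leg.T @ Phi_leg * dx

print(np.linalg.cond(G_mono))          # large: monomials are nearly linearly dependent
print(np.linalg.cond(G_leg))           # near 1: approximately the identity matrix
```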
