In the specification of the approximators of eqns. (2.31) or (2.32), a major factor in determining the ultimate performance that can be achieved is the selection of the functions $\phi(x)$. An important characteristic in the selection of $\phi$ is the extent of the support of the elements of $\phi$, which is defined to be $S_i = \mathrm{Supp}(\phi_i) = \{x \in \mathcal{D} \mid \phi_i(x) \neq 0\}$. Let $\mu(A)$ be a function that measures the area of the set $A \subset \mathcal{D}$. Then, the functions $\phi_i$ will be referred to as globally supported functions if $\mu(\mathrm{Supp}\,\phi_i) = \mu(\mathcal{D})$. The functions $\phi_i$ will be referred to as locally supported functions if $S_i$ is connected and $\mu(S_i) \ll \mu(\mathcal{D})$.
The solution of the theoretical least squares problem where $f$ is a known function is given in eqn. (2.30). The accuracy of the solution depends on the condition of the matrix $\int_{\mathcal{D}} \phi(x)\phi(x)^T\,dx$. The elements of this matrix are

$$\left[ \int_{\mathcal{D}} \phi(x)\phi(x)^T\,dx \right]_{ij} = \int_{\mathcal{D}} \phi_i(x)\phi_j(x)\,dx.$$

When the basis elements have local support, this matrix will be sparse and have a banded-diagonal structure. With careful design of the regressor vector, the elements of each diagonal will each be of about the same size and the matrix $\int_{\mathcal{D}} \phi(x)\phi(x)^T\,dx$ will be well conditioned.
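As a minimal numerical sketch of this conditioning contrast, the Gram matrix $\int_{\mathcal{D}} \phi(x)\phi(x)^T\,dx$ can be compared for a globally supported monomial basis and a locally supported triangular ("hat") basis on $\mathcal{D} = [0, 1]$; the two basis choices and the grid quadrature are illustrative assumptions, not constructions from the text.

```python
import numpy as np

# Minimal sketch: compare the Gram matrix G_ij = \int_D phi_i(x) phi_j(x) dx
# for a globally supported and a locally supported basis on D = [0, 1].
N = 8
x = np.linspace(0.0, 1.0, 2001)          # quadrature grid on D
dx = x[1] - x[0]

# Globally supported basis: monomials 1, x, ..., x^(N-1).
phi_global = np.stack([x**i for i in range(N)])

# Locally supported basis: triangular "hat" functions on a uniform grid.
centers = np.linspace(0.0, 1.0, N)
width = centers[1] - centers[0]
phi_local = np.stack([np.maximum(1 - np.abs(x - c) / width, 0.0) for c in centers])

for name, phi in [("global (monomials)", phi_global), ("local (hats)", phi_local)]:
    G = (phi * dx) @ phi.T               # quadrature approximation of the Gram matrix
    zero_fraction = np.mean(np.abs(G) < 1e-12)
    print(f"{name}: cond(G) = {np.linalg.cond(G):.2e}, zero fraction = {zero_fraction:.2f}")
# The hat basis yields a banded, well-conditioned G; the monomial basis does not.
```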
The following subsections introduce a general representation for approximators with locally supported basis elements, contrast the advantages of locally and globally supported basis elements, and introduce the concept of a lattice network.
2.4.8.1 Approximators with Local Influence Functions  Several approximators with local influence functions have been proposed in the literature. This section analyzes such approximators in a general framework [73, 83, 85, 173, 175]. Specific approximators are discussed in Chapter 3.
Definition 2.4.10 (Local Approximation Structure) - A function $\hat{f}(x, \theta)$ is a local approximation to $f(x)$ at $x_0$ if for any $\epsilon > 0$ there exist $\theta$ and $\delta > 0$ such that $\| f(x) - \hat{f}(x, \theta) \| < \epsilon$ for all $x \in B(x_0, \delta) = \{x \mid \|x - x_0\| < \delta\}$.
Two common examples of local approximation structures are constant and linear functions.
It is well known that constant, linear, or higher order polynomial functions can be used to accurately approximate an arbitrary continuous function if the region of validity of the approximation is small enough.
Definition 2.4.11 (Global Approximation Structure) - A parametric model $\hat{f}(x, \theta)$ is an $\epsilon$-accurate global approximation to $f(x)$ over domain $\mathcal{D}$ if for the given $\epsilon$ there exists $\theta$ such that $\| f(x) - \hat{f}(x, \theta) \| \le \epsilon$ for all $x \in \mathcal{D}$.
Note the following issues related to the above definitions.
Local and global approximation structures can be distinguished as follows.

• Whether a given approximation structure is local or global is dependent on the system that is being modeled. For example, a linear approximating structure is global for linear plants, but only local for nonlinear plants.

• Models derived from first principles are usually (expected to be) global approximation structures.

• The set of global models is a strict subset of the set of local models. This is obvious, since if there exists a set of parameters $\theta$ satisfying Definition 2.4.11 for a particular $\epsilon$, then this $\theta$ also satisfies Definition 2.4.10 for the same $\epsilon$ at each $x_0 \in \mathcal{D}$.
To maintain accuracy over domain $\mathcal{D}$, a local approximation structure can either adjust its parameter vector through time as the operating point $x_0$ changes, or store its parameter vector as a function of the operating point. The former approach is typical of adaptive control methodologies, while the latter approach is being motivated herein as learning control. The latter approach can effectively construct a global approximation structure by connecting several local approximation structures.
A main objective of this subsection is to appropriately piece together a (large) set of local approximation structures to achieve a global approximation structure. The following definition of the class of Basis-Influence functions [16, 76, 85, 122, 173] presents one means of achieving this objective.
Definition 2.4.12 (Basis-Influence (BI) Functions) - A function approximator is of the BI class if and only if it can be written as

$$\hat{f}(x, \theta) = \sum_i \hat{f}_i(x, \theta)\, \Gamma_i(x), \tag{2.40}$$

where each $\hat{f}_i(x, \theta)$ is a local approximation to $f(x)$ for all $x \in B(x_i, \delta)$, and $\Gamma_i(x)$ has local support $S_i$ which is a subset of $B(x_i, \delta)$ such that $\mathcal{D} \subseteq \bigcup_i S_i$.
Examples of Basis-Influence approximators include: Boxes [231], CMAC [2], Radial Basis Functions [205], splines, and several versions of fuzzy systems [198, 283]. In the traditional implementation of each of these approximators, the basis functions are constant on the support of the influence function. If more capable basis functions (e.g., linear functions) were implemented, then the designer should expect a decrease in the number of required local approximation structures. An alternative definition of local influence, which also provides a measure of the degree of localization based on the learning algorithm, is given in [288].
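As a minimal sketch of the BI class of eqn. (2.40), the following code evaluates a one-input approximator with linear local models and triangular influence functions; the function name and the specific choices of local model and influence function are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of a BI-class approximator per eqn. (2.40):
# f_hat(x) = sum_i f_i(x, theta) * Gamma_i(x), with linear local models
# f_i(x) = a_i + b_i (x - x_i) and triangular influence functions Gamma_i.

def bi_approximator(x, centers, a, b, delta):
    """Evaluate sum_i (a_i + b_i*(x - x_i)) * Gamma_i(x) at scalar x."""
    f_hat = 0.0
    for x_i, a_i, b_i in zip(centers, a, b):
        gamma = max(1.0 - abs(x - x_i) / delta, 0.0)  # local support B(x_i, delta)
        if gamma > 0.0:                               # only local models near x contribute
            f_hat += (a_i + b_i * (x - x_i)) * gamma
    return f_hat

# Example: local first-order Taylor models of f(x) = sin(x) at each center.
centers = np.linspace(-np.pi, np.pi, 9)
a, b = np.sin(centers), np.cos(centers)               # f(x_i) and f'(x_i)
delta = centers[1] - centers[0]                       # adjacent hats form a partition of unity
print(bi_approximator(0.3, centers, a, b, delta), np.sin(0.3))
```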
The partition of unity is defined as follows [253, 293].
Definition 2.4.13 (Partition of Unity) - The set of positive semidefinite influence functions $\{\Gamma_i\}$ form a Partition of Unity on $\mathcal{D}$ if for any $x \in \mathcal{D}$,

$$\sum_{i=1}^{N} \Gamma_i(x) = 1.$$
Influence functions that form a partition of unity have a variety of benefits. First, if $\{\Gamma_i\}$ form a Partition of Unity on $\mathcal{D}$, then there cannot be any $x \in \mathcal{D}$ such that $\sum_{i=1}^{N} \Gamma_i(x) = 0$. Also, when the approximator is defined by eqn. (2.40) with $\{\Gamma_i\}$ forming a Partition of Unity, then at any $x \in \mathcal{D}$, $\hat{f}(x, \theta)$ is a convex combination of the $\hat{f}_i(x, \theta)$.
If a set of positive semidefinite influence functions $\{\Gamma_i\}$ do not form a partition of unity, but have the coverage property (i.e., for any $x \in \mathcal{D}$ there exists at least one $i$ such that $\Gamma_i(x) \neq 0$), then a partition of unity can be formed from $\{\Gamma_i\}$ as

$$\bar{\Gamma}_i(x) = \frac{\Gamma_i(x)}{\sum_{j=1}^{N} \Gamma_j(x)}. \tag{2.41}$$
This normalization operation should, however, be used cautiously [221]. Such normalization can yield $\bar{\Gamma}_i(x)$ that have large flat areas. In addition, even when $\Gamma_i(x)$ is unimodal, $\bar{\Gamma}_i(x)$ may be multimodal. See Exercise 2.10. When the functions $\Gamma_i(x)$ are fixed after the design stage, the designer can ensure that the $\bar{\Gamma}_i(x)$ have desirable properties; however, when the centers and radii of the $\Gamma_i(x)$ are adapted online (i.e., nonlinear in the parameter adaptive approximation), then such anomalous behaviors may occur.
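The following short sketch illustrates both the normalization of eqn. (2.41) and the multimodality anomaly noted above; the particular Gaussian centers and (deliberately unequal) widths are assumptions chosen to expose the effect.

```python
import numpy as np

# Minimal sketch of eqn. (2.41) and one of its anomalies: normalized Gaussians
# with unequal widths can be multimodal even though each unnormalized Gaussian
# is unimodal.
x = np.linspace(0.0, 1.0, 1001)
centers = np.array([0.2, 0.5, 0.8])
widths = np.array([0.05, 0.3, 0.05])     # deliberately unequal widths

Gamma = np.stack([np.exp(-0.5 * ((x - c) / w) ** 2) for c, w in zip(centers, widths)])
Gamma_bar = Gamma / Gamma.sum(axis=0)    # eqn. (2.41): partition of unity by construction

assert np.allclose(Gamma_bar.sum(axis=0), 1.0)
# Count local maxima (including endpoints) of the middle normalized function.
g = np.concatenate(([-np.inf], Gamma_bar[1], [-np.inf]))
n_peaks = int(np.sum((g[1:-1] > g[:-2]) & (g[1:-1] >= g[2:])))
print("local maxima of the normalized middle function:", n_peaks)  # > 1: multimodal
```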
Given Definition 2.4.12 it is possible to constructively prove a sufficient condition for Basis-Influence functions to be global approximators.
Theorem 2.4.9  If $\hat{f}(x, \theta)$ is of Class BI with each $\hat{f}_i(x, \theta)$ satisfying Definition 2.4.10 for a fixed $\epsilon > 0$, then the influence functions $\{\Gamma_i\}$ forming a partition of unity on $\mathcal{D}$ is a sufficient condition for $\hat{f}(x, \theta)$ to be an $\epsilon$-accurate global approximation to $f \in C(\mathcal{D})$ for compact $\mathcal{D}$.
Proof. Fix $x \in \mathcal{D}$. Let $N_x = \{i \in I \mid \Gamma_i(x) \neq 0\}$. Then, since $\{\Gamma_i\}$ form a partition of unity, $\sum_{i \in N_x} \Gamma_i(x) = 1$. For each $i \in N_x$, for $x \in S_i$, by Definitions 2.4.12 and 2.4.10, there exists $|\epsilon_i(x)| \le \epsilon$ such that

$$\hat{f}_i(x, \theta) = f(x) + \epsilon_i(x). \tag{2.42}$$

Therefore,

$$\hat{f}(x, \theta) = \sum_{i \in N_x} \left( f(x) + \epsilon_i(x) \right) \Gamma_i(x) = f(x) + \sum_{i \in N_x} \epsilon_i(x)\, \Gamma_i(x)$$

and

$$\left\| f(x) - \hat{f}(x, \theta) \right\| \le \sum_{i \in N_x} |\epsilon_i(x)|\, \Gamma_i(x) \le \epsilon.$$

Since $x$ is an arbitrary point in $\mathcal{D}$, this completes the proof. ∎
When a multivariable Basis-Influence approximator can be represented by taking the product of the influence functions for each single variable:

$$\hat{f}(x, y, \theta) = \sum_i \sum_j \hat{f}_{ij}(x, y, \theta)\, \Gamma_{x_i}(x)\, \Gamma_{y_j}(y), \tag{2.43}$$

the Basis-Influence approximator fits the definition of a $\Sigma\Pi$-network to which Theorem 2.4.4 applies.
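A minimal sketch of the tensor-product construction in eqn. (2.43) follows, assuming (for illustration only) constant local models and triangular one-dimensional influence functions.

```python
import numpy as np

# Minimal sketch of eqn. (2.43): 2-D influence functions built as products of
# 1-D influence functions, Gamma_ij(x, y) = Gamma_i(x) * Gamma_j(y).

def hat(u, c, w):
    """1-D triangular influence function centered at c with half-width w."""
    return np.maximum(1.0 - np.abs(u - c) / w, 0.0)

cx = np.linspace(0.0, 1.0, 5)            # centers along x
cy = np.linspace(0.0, 1.0, 5)            # centers along y
w = cx[1] - cx[0]

def f_hat(x, y, theta):
    """Eqn. (2.43) with constant local models theta[i, j]."""
    gx = hat(x, cx, w)                    # vector of Gamma_{x_i}(x)
    gy = hat(y, cy, w)                    # vector of Gamma_{y_j}(y)
    return gx @ theta @ gy                # sum_i sum_j theta_ij gx_i gy_j

theta = np.fromfunction(lambda i, j: np.sin(i / 4.0) * np.cos(j / 4.0), (5, 5))
print(f_hat(0.3, 0.7, theta))
```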
EXAMPLE 2.17

A one input approximator that meets all the conditions of Theorem 2.4.9 is

$$\hat{f}(x, \theta) = \sum_i \hat{f}_i(x, \theta)\, \Gamma_i(x - x_i), \tag{2.44}$$

where $x_i$ is the center of the $i$-th influence function, and $\hat{f}_i(x, \theta)$ can be any function capable of providing a local approximation to $f(x)$ at $x_i$. ∎
EXAMPLE 2.18

Figure 2.10 illustrates Basis-Influence function approximation. The routine for constructing this plot used $\Gamma$ as defined in eqn. (2.45) with $\lambda = 0.785$. In the notation of Definition 2.4.12, the influence functions for $i = 1, \ldots, 6$ are centered at $c_i = 0.2(i - 1)$, and $\mathcal{D} = [0, 1]$. For clarity, the influence functions are plotted at a 10% scale and only a portion of each linear approximation is plotted.

Note that the parameters of the approximator have been jointly optimized such that eqn. (2.44) has minimum least squares approximation error over $\mathcal{D}$. This does not imply that each $\hat{f}_i$ is least squares optimal over $S_i$. This is clearly evident from the figure. For example, $\hat{f}_5$ is not least squares optimal over $S_5 = [0.6, 1.0]$. The least squared error of $\hat{f}_5$ over $S_5$ would be decreased by shifting $\hat{f}_5$ down. It is possible to improve the local accuracy of each $\hat{f}_i$ over $S_i$, but this will increase the approximation error of eqn. (2.44) over $\mathcal{D}$. Often, this increase is small, and such receptive field weighted regression methods have other advantages in terms of computation and approximator structure adaptation (i.e., approximator self-organization) [13, 236, 237]. ∎
2.4.8.2 Retention of Training Experience  Based on the discussion of Subsection 2.4.7, the designer should not expect $\hat{f}$ to accurately extrapolate training data from regions of $\mathcal{D}$ containing significant training data into other (unexplored) regions. In addition, it is desirable for training data in new regions to not affect the previously achieved approximation accuracy in distant regions. These two issues are tightly interrelated. The issues of localization and interference in learning algorithms were rigorously examined in [288, 289].

The online parameter estimation algorithms of Chapters 4, 6, and 7 will adapt the parameter vector estimate $\hat{\theta}(t)$ based on the current (possibly filtered) tracking error $e(t)$. The algorithms will have the generic forms of eqns. (2.24) and (2.25). If the regressor (i.e., $\phi(x)$) has global support, then changing the estimated parameter $\hat{\theta}_i$ affects the approximation accuracy throughout $\mathcal{D}$. Alternatively, if $\phi_i$ has local support, then changing the estimated parameter $\hat{\theta}_i$ affects the approximation accuracy only on $\mathrm{Supp}(\phi_i)$, which by assumption is a small region of $\mathcal{D}$ containing the training point.
Figure 2.10: Basis-Influence Function Approximation of Example 2.18. The original function is shown as a dashed line. The local approximations (basis functions) are shown as solid lines. The influence functions (drawn at 10% scale) are shown as solid lines at the bottom of the figure.
EXAMPLE 2.19

Consider the task of estimating a function $f(x)$ by an approximator $\hat{f}(x) = \theta^T \phi(x)$. As in a control application, assume that the samples are obtained incrementally and that $x_{k+1}$ is near $x_k$. This example considers how the support characteristics of the basis elements $\{\phi_i\}_{i=1}^{N}$ affect the convergence of the function approximation.

For computational purposes, assume that $f(x) = \sin(x)$ and the domain of approximation is $\mathcal{D} = [-\pi, \pi]$. Also, let $x_k = -3.6 + 0.1k$ for $k = 0, \ldots, 72$. Consider two possible basis sets. The first set of basis elements is the first eight Legendre polynomials (see Section 3.2) with the input to the polynomial scaled so that $\mathcal{D} \mapsto [-1, 1]$. This basis set has global support over $\mathcal{D}$. The approximator with the first eight Legendre polynomials as basis elements is capable of approximating the sine function with a maximum error over $\mathcal{D}$ of approximately $1.0 \times 10^{-3}$. The second set of basis elements is a set of Gaussian radial basis elements (see Section 3.4) with centers at $c_i = -4 + 0.5i$ for $i = 0, \ldots, 16$ and spread $\sigma = 0.5$. Although each Gaussian basis element is nonzero over all of $\mathcal{D}$, each basis element is effectively locally supported. This 17-element RBF approximator is capable of approximating the sine function with maximum error over $\mathcal{D}$ of approximately $0.5 \times 10^{-3}$. For both approximators, the parameter estimate is initially the zero vector.
Figure 2.11 shows the results of gradient descent based (normalized least mean squares) estimation of the sine function with each of the two approximators. The Legendre polynomial approximation process is illustrated in the top graph. The RBF approximation process is illustrated in the bottom graph. Each of the graphs contains three curves. The solid line indicates the function $f(x)$ that is to be approximated.
Figure 2.11: Incremental Approximations to a sine function. Top: approximation by 8th-order Legendre polynomials. Bottom: approximation by normalized Radial Basis Functions. The asterisks indicate the rightmost training point for the two training periods discussed in the text.
The dotted line is the approximation at $k = 29$. At this time, the approximation process has only incorporated training examples over the region $\mathcal{D}_{29} = [-3.6, -0.7]$. The left asterisk on the x-axis indicates the largest value of $x$ in $\mathcal{D}_{29}$. Note that both approximators have partially converged over $\mathcal{D}_{29}$. The RBF approximation is more accurate over $\mathcal{D}_{29}$. The polynomial approximation has changed on $\mathcal{D} - \mathcal{D}_{29}$. The RBF approximation is largely unchanged on $\mathcal{D} - \mathcal{D}_{29}$. The dashed line is the approximation at $k = 59$. At this time, the approximation process has incorporated training examples over the region $\mathcal{D}_{59} = [-3.6, 2.3]$. The right asterisk on the x-axis indicates the largest value of $x$ in $\mathcal{D}_{59}$. Note that while the polynomial approximation is now accurate near the current training point ($x = 2.3$), the approximation error has increased, relative to the dotted curve, on $\mathcal{D}_{29}$. Alternatively, the RBF approximator is not only accurate in the vicinity of the current training point, but is still accurate on $\mathcal{D}_{29}$, even though no recent training data has been in that set. For both approximators, the norm of the parameter error is decreasing throughout the training.
This example has used polynomials and Gaussian RBFs for computational purposes, but the main idea can be more broadly stated. When the approximator uses locally supported basis elements, there is a close correspondence between parameters of the approximation and regions of $\mathcal{D}$. Therefore, the function can be adapted locally to learn new information, without affecting the function approximation in other regions of the domain of approximation. This fact facilitates the retention of past training data. When the basis elements have global support, retention of past training data is much more complicated. It can be accomplished, for example, using recursive least squares, but only at significant computational expense. ∎
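The following sketch reproduces the spirit of Example 2.19 with normalized LMS updates; monomials stand in for the Legendre polynomials, and the step size, regularization constant, and snapshot logic are illustrative assumptions rather than the settings used to generate Figure 2.11.

```python
import numpy as np

# Minimal sketch: incremental NLMS training of a global (polynomial) and a
# local (Gaussian RBF) approximator on f(x) = sin(x), to compare interference.
grid = np.linspace(-np.pi, np.pi, 400)              # evaluation grid on D
x_train = -3.6 + 0.1 * np.arange(73)                # x_k = -3.6 + 0.1 k

def poly_basis(x):                                   # global support: monomials
    return np.array([(x / np.pi) ** i for i in range(8)])

def rbf_basis(x):                                    # effectively local support
    c = -4.0 + 0.5 * np.arange(17)
    return np.exp(-0.5 * ((x - c) / 0.5) ** 2)

for name, basis in [("polynomial", poly_basis), ("RBF", rbf_basis)]:
    theta = np.zeros(len(basis(0.0)))
    snapshot = None
    for k, xk in enumerate(x_train):
        phi = basis(xk)
        err = np.sin(xk) - theta @ phi
        theta += 0.5 * err * phi / (1e-3 + phi @ phi)   # normalized LMS update
        if k == 29:
            snapshot = theta.copy()                     # trained only on [-3.6, -0.7]
    # Interference: how much did later training change f_hat on the early region?
    early = grid[grid <= -0.7]
    before = np.array([snapshot @ basis(x) for x in early])
    after = np.array([theta @ basis(x) for x in early])
    print(f"{name}: max change on early region = {np.max(np.abs(after - before)):.3f}")
```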
Figure 2.12: Three RBF approximations to a sine function using different values of $\sigma$. The basis elements of the middle and bottom approximations form partitions of unity.
When an approximator uses influence functions that do not form a partition of unity and the influence functions are too narrow relative to their separation, the resulting approximation may be "spiky." Alternatively, when the influence functions do form a partition of unity and the influence functions are too narrow relative to their separation, the approximation may have flat spots.
EXAMPLE 2.20

Figure 2.12 shows three radial basis function approximations to a sine function. The top plot uses an approximation $h_1$ with unnormalized RBF functions every 0.5 units and $\sigma = 0.1$. Since $\sigma$ is much less than the separation between the basis elements, the approximation is spiky. The middle approximation $h_2$ uses normalized RBF functions every 0.5 units with $\sigma = 0.1$. Since $\sigma$ is much less than the separation between the basis elements, the normalization of the regressor vector results in an approximation that has flat regions. The bottom approximation $h_3$ uses normalized RBF functions every 0.5 units with $\sigma = 0.5$. Since $\sigma$ is similar to the separation between the basis elements, the supports of adjacent basis elements overlap. In this case, the approximation has neither spikes nor flat regions. ∎
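A brief sketch of the three cases in Example 2.20 follows; the spacing and $\sigma$ values match the example, but setting each weight to the function value at the corresponding center (rather than least squares fitting) is a simplifying assumption.

```python
import numpy as np

# Minimal sketch of Example 2.20: RBF approximations that are spiky, flat, or
# smooth depending on sigma and on whether the regressor is normalized to form
# a partition of unity.
x = np.linspace(-np.pi, np.pi, 1001)
centers = np.arange(-3.0, 3.01, 0.5)                 # RBF centers every 0.5 units

def rbf_approx(sigma, normalize):
    Phi = np.exp(-0.5 * ((x[:, None] - centers) / sigma) ** 2)
    if normalize:
        Phi = Phi / Phi.sum(axis=1, keepdims=True)   # eqn. (2.41) normalization
    theta = np.sin(centers)                          # weights set to f at the centers
    return Phi @ theta

h1 = rbf_approx(sigma=0.1, normalize=False)          # spiky: sigma << separation
h2 = rbf_approx(sigma=0.1, normalize=True)           # flat regions near each center
h3 = rbf_approx(sigma=0.5, normalize=True)           # smooth: sigma ~ separation
for name, h in [("h1", h1), ("h2", h2), ("h3", h3)]:
    print(name, "max |h - sin(x)| =", np.round(np.max(np.abs(h - np.sin(x))), 3))
```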
The choice of the functions $\hat{f}_i(x, \theta; x_i)$ is important to application success and computational feasibility. Consider the case where the $\hat{f}_i(x, \theta; x_i)$ are either zero or first order local Taylor series approximations:

$$\hat{f}_i(x, \theta) = A_i \tag{2.46}$$

or

$$\hat{f}_i(x, \theta) = A_i + B_i (x - x_i). \tag{2.47}$$

In the first case, the basis functions are constants, as in the case of normalized radial basis functions. For a given desired approximation accuracy $\epsilon$, many more basis-influence pairs may be required if constant basis functions are used instead of linear basis functions.
Estimates of the magnitude of higher order derivatives can be used to estimate the number of Basis Influence (BI) function pairs required in a given application.
The linear basis functions hold two advantages in control applications.
1. Linear approximations are often known a priori (e.g., from previous gain scheduled designs or operating point experiments). It is straightforward to use this prior information to initialize the BI function parameters.
2. Linear approximations are often desired a posteriori, either for analysis or design purposes. These linear approximations are easily derived from the BI model parameters.
See the related discussion in Section 2.4.9 of network transparency.
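Both advantages can be sketched in a few lines, again under illustrative assumptions (triangular influence functions, and local linearizations of a known nonlinearity standing in for prior gain scheduled designs):

```python
import numpy as np

# Minimal sketch of the two advantages above, using the linear local models of
# eqn. (2.47) with triangular influence functions forming a partition of unity.
centers = np.linspace(-np.pi, np.pi, 9)
delta = centers[1] - centers[0]

# Advantage 1: initialize (A_i, B_i) from prior operating-point linearizations,
# here local linearizations of a "plant nonlinearity" f(x) = sin(x).
A = np.sin(centers)            # f(x_i)  from operating point experiments
B = np.cos(centers)            # f'(x_i) from operating point experiments

def influence(x):
    return np.maximum(1.0 - np.abs(x - centers) / delta, 0.0)

def f_hat(x):
    """BI approximator: sum_i (A_i + B_i (x - x_i)) Gamma_i(x)."""
    return influence(x) @ (A + B * (x - centers))

# Advantage 2: a local linear model for analysis or design is read off directly
# as the influence-weighted combination of (A_i, B_i) at the operating point x0.
x0 = 0.4
gamma = influence(x0)
a_local, b_local = gamma @ (A + B * (x0 - centers)), gamma @ B
print("f_hat(x0) =", f_hat(x0), " local slope ~", b_local, " true slope:", np.cos(x0))
```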
2.4.8.3 Curse of Dimensionality  A well-known drawback [20] of function approximators with locally supported regressor elements is the "curse of dimensionality," which refers to the fact that the number of parameters required for localized approximators grows exponentially with the dimension of $\mathcal{D}$.
EXAMPLE 2.21

Let $d = \dim(\mathcal{D})$. If $\mathcal{D}$ is partitioned into $E$ divisions per dimension, then there will be $N = E^d$ total partitions. ∎
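A few lines make the growth of Example 2.21 concrete (the choice $E = 10$ and 8 bytes per parameter are assumptions):

```python
# Minimal sketch of Example 2.21's count N = E^d, with E = 10 divisions per
# dimension and an assumed 8 bytes of storage per parameter.
E = 10
for d in (1, 2, 3, 4, 6, 8):
    N = E ** d
    print(f"d = {d}: N = {N:>12,d} partitions, ~{8 * N / 1e9:.3g} GB of parameters")
```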
This exponential increase in $N$ with $d$ is a problem if either the computation time or memory requirements of the approximator become too large. The embedding approach discussed in Section 3.5 is a method of allowing the number of partitions of $\mathcal{D}$ to increase exponentially without a corresponding increase in the number of approximator parameters.
The lattice networks discussed in Section 2.4.8.4 illustrate a method by which the computational requirements grow much more slowly than the exponential growth in the number of parameters.
2.4.8.4 Lattice-Based Approximators  Specification of locally supported basis functions requires specification of the type and support of each basis element. Typically, the support of a basis element is parameterized by the center and width parameters of each $\phi_i$. This specification includes the choice as to whether the center and width parameters are fixed a priori or estimated based on the acquired data.
Adaptive estimation of the center and width parameters is a nonlinear estimation problem. Therefore, the resulting approximator would not have the best approximator property, but would have the beneficial "order of approximation" behavior as discussed in Section 2.4.1.

Prior specification of the centers on a grid of points results in a lattice-based approximator [32]. Lattice-based approximators result in significant computational simplification over adaptive center-based approximators for two reasons. First, the center adaptation calculations are not required. Second, the non-zero elements of the vector $\phi$ can be determined without direct calculation of $\phi$ (see below). If the width parameters are also fixed a priori, then a linear parameter estimation problem results, with the corresponding benefits.
EXAMPLE 2.22

The purpose of this example [75] is to clarify how lattice-based approximators can reduce the amount of computation required per iteration. For clarity, the example discusses a two-dimensional region of approximation, as shown in Figure 2.13, but the discussion directly extends to $d > 2$ dimensions.

A function $f$ is to be approximated over the region $\mathcal{D} = \{(x, y) \in [0, 1] \times [0, 1]\}$. If the approximator takes the form $\hat{f}(x) = \theta^T \phi(x)$, where $\theta \in \mathbb{R}^N$ and $\phi : \mathbb{R}^2 \to \mathbb{R}^N$, then evaluation of $\hat{f}$ for a general approximator requires calculation of the $N$ elements of $\phi(x)$ followed by an $N$-vector multiply (with the associated memory accesses). Assuming that $\phi(x)$ is maintained in memory between the approximator computation and parameter adaptation, then adaptation of $\theta$ requires (at the minimum) a scalar by $N$-vector multiply.
Alternatively, let the elements of $\phi(x)$ be locally supported with fixed centers defined on a lattice by

$$c_m = c_{i,j} = \left( (i - 1) \cdot dx, \; (j - 1) \cdot dy \right)$$

for $i = 1, \ldots, n_x$ and $j = 1, \ldots, n_y$, where $N = n_x n_y$, $m = i + n_x (j - 1)$, $dx = \frac{1}{n_x - 1}$, and $dy = \frac{1}{n_y - 1}$. Also, let $\phi_{i,j}(x) = g\left( (x, y) - c_{i,j} \right)$ be locally supported such that $g\left( (x, y) - c_{i,j} \right) = 0$ if $\| (x, y) - c_{i,j} \|_\infty > \lambda$. The parameter $\lambda$ is referred to as the generalization parameter. To allow explicit discussion in the following, assume that $\lambda = 1.5\,dx$. Also, as depicted in Figure 2.13, assume that $n_x = n_y = 5$, so that $dx = dy = 0.25$. The figure indicates the nodal centers with x's and indicates the values of $m$ on the lattice diagram. In general, these assumptions imply that although $N$ may be quite large, at most 9 elements of the vector $\phi$ will be non-zero at a given value of $x$.
Figure 2.13: Lattice structure diagram for Example 2.22. The x's indicate locations of nodal centers. The integers near the x's indicate the nodal addresses $m$. The * indicates an evaluation point.
Therefore, calculation of $\hat{f}$ only requires a 9-element vector multiply (with the associated memory accesses). This computational simplification assumes that there is a simple method for determining the appropriate elements of $\phi$ and $\theta$ without search and without directly calculating all of $\phi(x)$.
The indices for the nonzero elements of $\phi$ and corresponding elements of $\theta$ (sometimes called nodal addresses) can be found by an algorithm such as

$$i_c(x) = 1 + \mathrm{round}\!\left( \frac{x}{dx} \right), \qquad j_c(y) = 1 + \mathrm{round}\!\left( \frac{y}{dy} \right),$$

where $\mathrm{round}(z)$ is the function that returns the nearest integer to $z$. The set of indices corresponding to nonzero basis elements (neglecting evaluation points within $\lambda$ of the edges of $\mathcal{D}$) is then

$$\begin{array}{ccc} (i_c - 1, j_c + 1) & (i_c, j_c + 1) & (i_c + 1, j_c + 1) \\ (i_c - 1, j_c) & (i_c, j_c) & (i_c + 1, j_c) \\ (i_c - 1, j_c - 1) & (i_c, j_c - 1) & (i_c + 1, j_c - 1). \end{array}$$
At the evaluation point indicated by the *, $(i_c, j_c) = (3, 2)$, $m = 8$, and the nodal addresses of the nonzero basis elements are $\{2, 3, 4, 7, 8, 9, 12, 13, 14\}$. ∎
To summarize, if an approximator has locally supported basis elements defined on a lattice, then both the approximation at a point and the parameter estimation update can be performed (due to the sparseness of $\phi$ and the regularity of the centers) without calculating all of $\phi$ and without direct sorting of $\phi$ to find its non-zero elements. Even if each element of $\phi$ is locally supported, if the centers are not defined on a lattice, then in general there is no method to find the nonzero elements of $\phi$ without direct calculation of and search over the vector $\phi$.
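The nodal-address algorithm of Example 2.22 can be sketched as follows; the variable names and the particular locally supported $g$ are illustrative assumptions, while the address arithmetic follows the example.

```python
import numpy as np

# Minimal sketch of the nodal-address computation of Example 2.22. Only the
# (at most 9) active lattice cells are touched, so evaluation cost is
# independent of N = nx * ny.
nx = ny = 5
dx = dy = 1.0 / (nx - 1)          # dx = 1/(nx - 1), so dx = dy = 0.25
lam = 1.5 * dx                    # generalization parameter lambda

def nodal_addresses(x, y):
    """Return the addresses m = i + nx*(j - 1) of the active nodes."""
    ic = 1 + round(x / dx)        # i_c(x) = 1 + round(x / dx)
    jc = 1 + round(y / dy)        # j_c(y) = 1 + round(y / dy)
    return [i + nx * (j - 1)
            for j in (jc - 1, jc, jc + 1)
            for i in (ic - 1, ic, ic + 1)
            if 1 <= i <= nx and 1 <= j <= ny]

def f_hat(x, y, theta):
    """Sparse evaluation of theta^T phi(x, y) using only the active nodes."""
    total = 0.0
    for m in nodal_addresses(x, y):
        i, j = (m - 1) % nx + 1, (m - 1) // nx + 1   # invert m = i + nx*(j - 1)
        cx, cy = (i - 1) * dx, (j - 1) * dy
        d = max(abs(x - cx), abs(y - cy))            # infinity-norm distance
        g = max(1.0 - d / lam, 0.0)                  # a stand-in locally supported g
        total += theta[m - 1] * g
    return total

theta = np.arange(1, nx * ny + 1, dtype=float)
# An evaluation point with (i_c, j_c) = (3, 2), as in the figure:
print(sorted(nodal_addresses(0.55, 0.3)))   # {2, 3, 4, 7, 8, 9, 12, 13, 14}
print(f_hat(0.55, 0.3, theta))
```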
A common argument against lattice networks is that fewer basis functions may be required if the centers are allowed to adapt their locations to optimize their distribution relative to the function being approximated. There is a tradeoff involved between the decrease in required memory (due to the potentially decreased number of basis functions) and the increased per-iteration computation (due to all of $\phi$ being calculated). In addition, online adaptation of the center locations optimizes the estimated center locations relative to the training data, which at any given time may not represent optimization relative to the actual function.