$$E\big[(Y_t - X_t'\beta)^2\big] = E\big[\big(Y_t - \mu(X_t) + \mu(X_t) - X_t'\beta\big)^2\big] = E\big[(Y_t - \mu(X_t))^2\big] + E\big[(\mu(X_t) - X_t'\beta)^2\big].$$
The final equality follows from the fact that, for all $\beta$,
$$E\big[(Y_t - \mu(X_t))\big(\mu(X_t) - X_t'\beta\big)\big] = E\Big[E\big[(Y_t - \mu(X_t))\big(\mu(X_t) - X_t'\beta\big) \mid X_t\big]\Big] = E\Big[E\big[(Y_t - \mu(X_t)) \mid X_t\big]\big(\mu(X_t) - X_t'\beta\big)\Big] = 0,$$
because $E[(Y_t - \mu(X_t)) \mid X_t] = 0$. Thus,
$$E\big[(Y_t - X_t'\beta)^2\big] = E\big[(Y_t - \mu(X_t))^2\big] + E\big[(\mu(X_t) - X_t'\beta)^2\big] = \sigma_*^2 + \int \big(\mu(x) - x'\beta\big)^2\, dH(x), \tag{3}$$
where $dH$ denotes the joint density of $X_t$ and $\sigma_*^2$ denotes the "pure PMSE", $\sigma_*^2 \equiv E[(Y_t - \mu(X_t))^2]$.
From (3) we see that the PMSE can be decomposed into two components: the pure PMSE $\sigma_*^2$, associated with the best possible prediction (that based on $\mu$), and the approximation mean squared error (AMSE), $\int (\mu(x) - x'\beta)^2\, dH(x)$, for $x'\beta$ as an approximation to $\mu(x)$. The AMSE is weighted by $dH$, the joint density of $X_t$, so that the squared approximation error is more heavily weighted in regions where $X_t$ is likely to be observed and less heavily weighted in areas where $X_t$ is less likely to be observed. This weighting forces the optimal approximation to be better in more frequently observed regions of the distribution of $X_t$, at the cost of being less accurate in less frequently observed regions of that distribution.
It follows that to minimize PMSE it is necessary and sufficient to minimize AMSE. That is, because $\beta^*$ minimizes PMSE, it also satisfies
$$\beta^* = \arg\min_{\beta \in \mathbb{R}^k} \int \big(\mu(x) - x'\beta\big)^2\, dH(x).$$
This shows that $\beta^*$ is the vector delivering the best possible approximation of the form $x'\beta$ to the PMSE-best predictor $\mu(x)$ of $Y_t$ given $X_t = x$, where the approximation is best in the sense of AMSE. For brevity, we refer to this as the "optimal approximation property".
Note that AMSE is nonnegative. It is minimized at zero if and only if for some $\beta^o$, $\mu(x) = x'\beta^o$ (a.s.-$H$), that is, if and only if $\mathcal{L}$ is correctly specified. In this case, $\beta^* = \beta^o$.
An especially convenient property of $\beta^*$ is that it can be represented in closed form. The first order conditions for $\beta^*$ from problem (2) can be written as
$$E\big(X_t X_t'\big)\beta^* - E(X_t Y_t) = 0.$$
Define $M \equiv E(X_t X_t')$ and $L \equiv E(X_t Y_t)$. If $M$ is nonsingular, then we can solve for $\beta^*$ to obtain the desired closed form expression
$$\beta^* = M^{-1}L.$$
The optimal point forecast based on the linear model $\mathcal{L}$ given predictors $X_t$ is then given simply by
$$Y_t^* = l\big(X_t, \beta^*\big) = X_t'\beta^*.$$
In forecasting applications we typically have a sample of data that we view as representative of the underlying population distribution generating the data (the joint distribution of $Y_t$ and $X_t$), but the population distribution is itself unknown. Typically, we do not even know the expectations $M$ and $L$ required to compute $\beta^*$, so the optimal point forecast $Y_t^*$ is also unknown. Nevertheless, we can obtain a computationally convenient estimator of $\beta^*$ from the sample data using the "plug-in principle". That is, we replace the unknown $M$ and $L$ by sample analogs $\hat{M} \equiv n^{-1}\sum_{t=1}^n X_t X_t' = X'X/n$ and $\hat{L} \equiv n^{-1}\sum_{t=1}^n X_t Y_t = X'Y/n$, where $X$ is the $n \times k$ matrix with rows $X_t'$, $Y$ is the $n \times 1$ vector with elements $Y_t$, and $n$ is the number of sample observations available for estimation. This yields the estimator
$$\hat{\beta} \equiv \hat{M}^{-1}\hat{L},$$
which we immediately recognize to be the ordinary least squares (OLS) estimator.
To keep the scope of our discussion tightly focused on the more practical aspects of the subject at hand, we shall not pay close attention to technical conditions underlying the statistical properties of $\hat{\beta}$ or the other estimators we discuss, and we will not state formal theorems here. Nevertheless, any claimed properties of the methods discussed here can be established under mild regularity conditions relevant for practical applications. In particular, under conditions ensuring that the law of large numbers holds (i.e., $\hat{M} \to M$ a.s., $\hat{L} \to L$ a.s.), it follows that as $n \to \infty$, $\hat{\beta} \to \beta^*$ a.s., that is, $\hat{\beta}$ consistently estimates $\beta^*$. Asymptotic normality can also be straightforwardly established for $\hat{\beta}$ under conditions sufficient to ensure the applicability of a suitable central limit theorem. [See White (2001, Chapters 2–5) for treatment of these issues.]
For clarity and notational simplicity, we operate throughout with the implicit understanding that the underlying regularity conditions ensure that our data are generated by an essentially stationary process that has suitably controlled dependence. For cross-section or panel data, it suffices that the observations are independent and identically distributed (i.i.d.). In time series applications, stationarity is compatible with considerable dependence, so we implicitly permit only as much dependence as is compatible with the availability of suitable asymptotic distribution theory. Our discussion thus applies straightforwardly to unit root time-series processes after first differencing or other suitable transformations, such as those relevant for cointegrated processes. For simplicity, we leave explicit discussion of these cases aside here. Relaxing the implicit stationarity assumption to accommodate heterogeneity in the data generating process is straightforward, but the notation necessary to handle this relaxation is more cumbersome than is justified here.
Returning to our main focus, we can now define the point forecast based on the linear model $\mathcal{L}$ using $\hat{\beta}$ for an out-of-sample predictor vector, say $X_{n+1}$. This is computed simply as
$$\hat{Y}_{n+1} = X_{n+1}'\hat{\beta}.$$
We italicized "out-of-sample" just now to emphasize the fact that in applications, forecasts are usually constructed based on predictors $X_{n+1}$ not in the estimation sample, as the associated target variable ($Y_{n+1}$) is not available until after $X_{n+1}$ is observed, as we discussed at the outset. The point of the forecasting exercise is to reduce our uncertainty about the as yet unavailable $Y_{n+1}$.
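To make the plug-in construction concrete, here is a minimal numpy sketch of the OLS estimator $\hat{\beta} = \hat{M}^{-1}\hat{L}$ and the resulting out-of-sample forecast. The data-generating function `mu` and all sample values are hypothetical stand-ins; in practice $\mu$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: Y_t = mu(X_t) + noise, with the constant in X_t's first column.
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
mu = lambda X: 1.0 + 0.5 * X[:, 1] - 0.25 * X[:, 2] ** 2   # unknown in practice
Y = mu(X) + rng.normal(scale=0.5, size=n)

# Plug-in principle: replace M = E(X_t X_t') and L = E(X_t Y_t) by sample analogs.
M_hat = X.T @ X / n
L_hat = X.T @ Y / n
beta_hat = np.linalg.solve(M_hat, L_hat)    # the OLS estimator M_hat^{-1} L_hat

# Out-of-sample point forecast for a new (hypothetical) predictor vector X_{n+1}.
X_next = np.array([1.0, 0.3, -1.2])
Y_hat_next = X_next @ beta_hat
```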
2.2 Nonlinearity
A nonlinear parametric model is generated from a nonlinear parameterization. For this, let $\ell$ be a finite integer and let the parameter space $\Theta$ be a subset of $\mathbb{R}^\ell$. Let $f$ be a function mapping $\mathbb{R}^k \times \Theta$ into $\mathbb{R}$. This generates the parametric model
$$\mathcal{N} \equiv \big\{ m : \mathbb{R}^k \to \mathbb{R} \mid m(x) = f(x, \theta),\ \theta \in \Theta \big\}.$$
The parameterization $f$ (equivalently, the parametric model $\mathcal{N}$) can be nonlinear in the predictors only, nonlinear in the parameters only, or nonlinear in both. Models that are nonlinear in the predictors are of particular interest here, so for convenience we call the forecasts arising from such models "nonlinear forecasts". For now, we keep our discussion at the general level and later pay more particular attention to the special cases.
Completely parallel to our discussion of linear models, we have that solving problem (1) with $\mathcal{M} = \mathcal{N}$, that is, solving
$$\min_{m \in \mathcal{N}} E\big[(Y_t - m(X_t))^2\big],$$
yields the optimal forecasting function $f(\cdot, \theta^*)$, where
$$\theta^* = \arg\min_{\theta \in \Theta} E\big[(Y_t - f(X_t, \theta))^2\big]. \tag{4}$$
Here $\theta^*$ is the PMSE-optimal coefficient vector. This delivers not only the best forecast for $Y_t$ given $X_t$ based on the nonlinear model $\mathcal{N}$, but also the optimal nonlinear approximation to $\mu$ [see, e.g., White (1981)]. Now we have
$$\theta^* = \arg\min_{\theta \in \Theta} \int \big(\mu(x) - f(x, \theta)\big)^2\, dH(x).$$
The demonstration is completely parallel to that for $\beta^*$, simply replacing $x'\beta$ with $f(x, \theta)$. Now $\theta^*$ is the vector delivering the best possible approximation of the form $f(x, \theta)$ to the PMSE-best predictor $\mu(x)$ of $Y_t$ given $X_t = x$, where, as before, the approximation is best in the sense of AMSE, where the weight is again $dH$, the density of the $X$'s.
The optimal point forecast based on the nonlinear model $\mathcal{N}$ given predictors $X_t$ is thus given explicitly by
$$Y_t^* = f\big(X_t, \theta^*\big).$$
The advantage of using a nonlinear model $\mathcal{N}$ is that nonlinearity in the predictors can afford greater flexibility and thus, in principle, greater forecast accuracy. Provided the nonlinear model nests the linear model (i.e., $\mathcal{L} \subset \mathcal{N}$), it follows that
$$\min_{m \in \mathcal{N}} E\big[(Y_t - m(X_t))^2\big] \le \min_{m \in \mathcal{L}} E\big[(Y_t - m(X_t))^2\big],$$
that is, the best PMSE for the nonlinear model is always at least as good as the best PMSE for the linear model. (The same relation also necessarily holds for AMSE.)
A simple means of ensuring that $\mathcal{N}$ nests $\mathcal{L}$ is to include a linear component in $f$, for example, by specifying
$$f(x, \theta) = x'\alpha + g(x, \beta),$$
where $g$ is some function nonlinear in the predictors.
Against the advantage of theoretically better forecast accuracy, using a nonlinear model has a number of potentially serious disadvantages relative to linear models: (1) the associated estimators can be much more difficult to compute; (2) nonlinear models can easily overfit the sample data, leading to inferior performance in practice; and (3) the resulting forecasts may appear more difficult to interpret. It follows that the more appealing nonlinear methods will be those that retain the advantage of flexibility but that mitigate or eliminate these disadvantages relative to linear models. We now discuss considerations involved in constructing forecasts with these properties.
3 Linear, nonlinear, and highly nonlinear approximation
When a parameterization is nonlinear in the parameters, there generally does not exist a closed form expression for the PMSE-optimal coefficient vector $\theta^*$. One can nevertheless apply the plug-in principle in such cases to construct a potentially useful estimator $\hat{\theta}$ by solving the sample analog of the optimization problem (4) defining $\theta^*$, which yields
$$\hat{\theta} \equiv \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{t=1}^n \big(Y_t - f(X_t, \theta)\big)^2.$$
The point forecast based on the nonlinear model $\mathcal{N}$ using $\hat{\theta}$ for an out-of-sample predictor vector $X_{n+1}$ is computed simply as
$$\hat{Y}_{n+1} = f\big(X_{n+1}, \hat{\theta}\big).$$
The challenge posed by attempting to use $\hat{\theta}$ is that its computation generally requires an iterative algorithm that may require considerable fine-tuning and that may or may not behave well, in that the algorithm may or may not converge, and, even with considerable effort, the algorithm may well converge to a local optimum instead of to the desired global optimum. These are the computational difficulties alluded to above.
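As a concrete illustration of the iterative approach, here is a sketch of computing $\hat{\theta}$ with a general-purpose optimizer. The logistic specification of $f$, the simulated data, and the starting value are all illustrative assumptions; as just noted, convergence to the global optimum is not guaranteed, so trying several starting values is common practice.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Illustrative parameterization nonlinear in theta:
# f(x, theta) = theta0 + theta1 / (1 + exp(-theta2 * x)).
def f(x, theta):
    return theta[0] + theta[1] / (1.0 + np.exp(-theta[2] * x))

x = rng.normal(size=400)
y = f(x, np.array([0.5, 2.0, 1.5])) + rng.normal(scale=0.3, size=400)

# Sample analog of problem (4): average squared forecast error as a function of theta.
def sample_mse(theta):
    return np.mean((y - f(x, theta)) ** 2)

# Iterative minimization from one arbitrary starting point; the result may be a
# local optimum, and different starting points can give different answers.
theta_hat = minimize(sample_mse, x0=np.array([0.0, 1.0, 0.5]), method="BFGS").x
```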
As the advantage of flexibility arises entirely from nonlinearity in the predictors and the computational challenges arise entirely from nonlinearity in the parameters, it makes sense to restrict attention to parameterizations that are “series functions” of the form
$$f(x, \theta) = x'\alpha + \sum_{j=1}^q \psi_j(x)\beta_j, \tag{5}$$
where $q$ is some finite integer and the "basis functions" $\psi_j$ are nonlinear functions of $x$. This provides a parameterization nonlinear in $x$, but linear in the parameters $\theta \equiv (\alpha', \beta')'$, $\beta \equiv (\beta_1, \ldots, \beta_q)'$, thus delivering flexibility while simultaneously eliminating the computational challenges arising from nonlinearity in the parameters. The method of OLS can now deliver the desired sample estimator $\hat{\theta}$ for $\theta^*$.
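Because (5) is linear in $\theta$, estimation reduces to OLS on a regressor matrix augmented with the basis-function terms. A minimal sketch, where the choice $\psi_j(x) = x^{j+1}$ (simple polynomial terms) and the simulated data are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical scalar-predictor sample.
n = 400
x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(2.0 * x) + 0.3 * x + rng.normal(scale=0.2, size=n)

# Series function (5): f(x, theta) = alpha0 + alpha1 * x + sum_j psi_j(x) beta_j,
# here with illustrative polynomial basis functions psi_j(x) = x^(j+1), j = 1..q.
q = 4
Z = np.column_stack([np.ones(n), x] + [x ** (j + 1) for j in range(1, q + 1)])

# Linear in the parameters, so OLS delivers theta_hat = (alpha_hat, beta_hat).
theta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
```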
Restricting attention to parameterizations having the form (5) thus reduces the problem of choosing a forecasting model to the problem of jointly choosing the basis functions $\psi_j$ and their number, $q$. With the problem framed in this way, an important next question is, "What choices of basis functions are available, and when should one prefer one choice to another?"
There is a vast range of possible choices of basis functions; below we mention some of the leading possibilities. Choosing among these depends not only on the properties of the basis functions, but also on one's prior knowledge about $\mu$, and one's empirical knowledge about $\mu$, that is, the data.
Certain broad requirements help narrow the field. First, given that our objective is to obtain as good an approximation to $\mu$ as possible, a necessary property for any choice of basis functions is that this choice should yield an increasingly better approximation to $\mu$ as $q$ increases. Formally, this is the requirement that the span (the set of all linear combinations) of the basis functions $\{\psi_j,\ j = 1, 2, \ldots\}$ should be dense in the function space inhabited by $\mu$. Here, this space is $\mathcal{M} \equiv L_2(\mathbb{R}^{k-1}, dH)$, the separable Hilbert space of functions $m$ on $\mathbb{R}^{k-1}$ for which $\int m(x)^2\, dH(x)$ is finite. (Recall that $x$ contains the constant unity, so there are only $k - 1$ variables.) Second, given that we are fundamentally constrained by the amount of data available, it is also necessary that the basis functions should deliver a good approximation using as small a value for $q$ as possible.
Although the denseness requirement narrows the field somewhat, there is still an overwhelming variety of choices for $\{\psi_j\}$ that have this property. Familiar examples are algebraic polynomials in $x$ of degree dependent on $j$, and in particular the related special polynomials, such as Bernstein, Chebyshev, or Hermite polynomials; and trigonometric polynomials in $x$, that is, sines and cosines of linear combinations of $x$ corresponding to pre-specified (multi-)frequencies, delivering Fourier series. Further, one can combine different families, as in Gallant's (1981) flexible Fourier form, which includes polynomials of first and second order, together with sine and cosine terms for a range of frequencies.
Important and powerful extensions of the algebraic polynomials are the classes of piecewise polynomials and splines [e.g., Wahba and Wold (1975), Wahba (1990)]. Well-known types of splines are linear splines, cubic splines, and B-splines.
The basis functions for the examples given so far are either orthogonal or can be made so with straightforward modifications. Orthogonality is not a necessary requirement, however. A particularly powerful class of basis functions that need not be orthogonal is the class of "wavelets", introduced by Daubechies (1988, 1992). These have the form $\psi_j(x) = \Psi(A_j(x))$, where $\Psi$ is a "mother wavelet", a given function satisfying certain specific conditions, and $A_j(x)$ is an affine function of $x$ that shifts and rescales $x$ according to a specified dyadic schedule analogous to the frequencies of Fourier analysis. For a treatment of wavelets from an economics perspective, see Gencay, Selchuk and Whitcher (2001).
Recall that a vector space is linear if (among other things) for any two elements $f$ and $g$ of the space, all linear combinations $af + bg$ also belong to the space, where $a$ and $b$ are any real numbers. All of the basis functions mentioned so far define spaces of functions $g_q(x, \beta) \equiv \sum_{j=1}^q \psi_j(x)\beta_j$ that are linear in this sense, as taking a linear combination of two elements of this space gives
$$a\left(\sum_{j=1}^q \psi_j(x)\beta_j\right) + b\left(\sum_{j=1}^q \psi_j(x)\gamma_j\right) = \sum_{j=1}^q \psi_j(x)\,[a\beta_j + b\gamma_j],$$
which is again a linear combination of the first $q$ of the $\psi_j$'s.
Significantly, the second requirement mentioned above, namely that the basis should deliver a good approximation using as small a value for $q$ as possible, suggests that we might obtain a better approximation by not restricting ourselves to the functions $g_q(x, \beta)$, which force the inclusion of the $\psi_j$'s in a strict order (e.g., zero order polynomials first, followed by first order polynomials, followed by second order polynomials, and so on), but instead consider functions of the form
$$g_\Lambda(x, \beta) \equiv \sum_{j \in \Lambda} \psi_j(x)\beta_j,$$
where $\Lambda$ is a set of natural numbers ("indexes") containing at most $q$ elements, not necessarily the integers $1, \ldots, q$. The functions $g_\Lambda$ are more flexible than the functions $g_q$, in that $g_\Lambda$ admits $g_q$ as a special case. The key idea is that by suitably choosing which basis functions to use in any given instance, one may obtain a better approximation for a given number of terms $q$.
The functions $g_\Lambda$ define a nonlinear space of functions, in that linear combinations of the form $ag_\Lambda + bg_K$, where $K$ also has $q$ elements, generally have up to $2q$ terms, and are therefore not contained in the space of $q$-term linear combinations of the $\psi_j$'s. Consequently, functions of the form $g_\Lambda$ are called nonlinear approximations in the approximation theory literature. Note that the nonlinearity referred to here is the nonlinearity of the function spaces defined by the functions $g_\Lambda$. For given $\Lambda$, these functions are still linear in the parameters $\beta$, which preserves their appeal for us here.
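The gain from choosing $\Lambda$ rather than taking the first $q$ terms can be seen in a small sketch. Everything here is illustrative: an orthonormal cosine basis evaluated on a grid, a target $\mu$ with a kink, and inner products computed by crude quadrature; $g_\Lambda$ keeps the $q$ terms with the largest coefficients instead of the first $q$.

```python
import numpy as np

# Illustrative orthonormal cosine basis on [0, 1], evaluated on a fine grid.
grid = np.linspace(0.0, 1.0, 2001)
J = 60
Psi = np.array([np.sqrt(2.0) * np.cos(np.pi * j * grid) for j in range(1, J + 1)])

mu = np.abs(grid - 0.3)                       # a target with a kink at 0.3
coef = Psi @ mu / len(grid)                   # crude quadrature for <psi_j, mu>

q = 10
# Linear approximation g_q: the first q basis terms in their natural order.
g_q = coef[:q] @ Psi[:q]
# Nonlinear approximation g_Lambda: the q terms with the largest |coefficients|.
Lam = np.argsort(-np.abs(coef))[:q]
g_Lam = coef[Lam] @ Psi[Lam]

# Under orthonormality the best q-term choice is never worse in squared error.
print(np.mean((mu - g_q) ** 2), np.mean((mu - g_Lam) ** 2))
```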
Recent developments in the approximation theory literature have provided considerable insight into the question of which functions are better approximated using linear approximation (functions of the form $g_q$), and which functions are better approximated using nonlinear approximation (functions of the form $g_\Lambda$). The survey of DeVore (1998) is especially comprehensive and deep, providing a rich catalog of results permitting a comparison of these approaches. Given sufficient a priori knowledge about the function of interest, $\mu$, DeVore's results may help one decide which approach to take.
To gain some of the flavor of the issues and results treated by DeVore (1998) that are relevant in the present context, consider the following approximation root mean squared errors:
$$\sigma_q(\mu, \psi) \equiv \inf_\beta \left[\int \big(\mu(x) - g_q(x, \beta)\big)^2\, dH(x)\right]^{1/2},$$
$$\sigma_\Lambda(\mu, \psi) \equiv \inf_{\Lambda, \beta} \left[\int \big(\mu(x) - g_\Lambda(x, \beta)\big)^2\, dH(x)\right]^{1/2}.$$
These are, for linear and nonlinear approximation respectively, the best possible approximation root mean squared errors (RMSEs) using $q$ $\psi_j$'s. (For simplicity, we are ignoring the linear term $x'\alpha$ previously made explicit; alternatively, imagine we have absorbed it into $\mu$.) DeVore devotes primary attention to one of the central issues of approximation theory, the "degree of approximation" question: "Given a positive real number $a$, for what functions $\mu$ does the degree of approximation (as measured here by the above approximation RMSEs) behave as $O(q^{-a})$?" Clearly, the larger is $a$, the more quickly the approximation improves with $q$.
In general, the answer to the degree of approximation question depends on the smoothness and dimensionality ($k - 1$) of $\mu$, quantified in precisely the right ways. For linear approximation, the smoothness conditions typically involve the existence of a number of derivatives of $\mu$ and the finiteness of their moments (e.g., second moments), such that more smoothness and smaller dimensionality yield quicker approximation. The answer also depends on the particular choice of the $\psi_j$'s; suffice it to say that the details can be quite involved.
In the nonlinear case, familiar notions of smoothness in terms of derivatives generally no longer provide the necessary guidance. To describe the smoothness notion relevant in this context, suppose for simplicity that $\{\psi_j\}$ forms an orthonormal basis for the Hilbert space in which $\mu$ lives. Then the optimal coefficients $\beta_j^*$ are given by
$$\beta_j^* = \int \psi_j(x)\mu(x)\, dH(x).$$
As DeVore (1998, p. 135) states, "smoothness for [nonlinear] approximation should be viewed as decay of the coefficients with respect to the basis [i.e., the $\beta_j^*$'s]" (emphasis added). In particular, let $\tau = 1/(a + 1/2)$. Then according to DeVore (1998, Theorem 4), $\sigma_\Lambda(\mu, \psi) = O(q^{-a})$ if and only if there exists a finite constant $M$ such that $\#\{j: |\beta_j^*| > z\} \le M^\tau z^{-\tau}$. For example, $\sigma_\Lambda(\mu, \psi) = O(q^{-1/2})$ if for some $M$ we have $\#\{j: |\beta_j^*| > z\} \le M z^{-1}$.
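The counting condition can be checked directly for any given coefficient sequence. A small sketch using a hypothetical sequence with $|\beta_j^*| = 1/j$, for which $\#\{j: |\beta_j^*| > z\} \approx z^{-1}$, matching the $\tau = 1$ (i.e., $a = 1/2$) case:

```python
import numpy as np

beta_star = 1.0 / np.arange(1, 100001)        # hypothetical coefficients, |beta*_j| = 1/j

def count_above(z):
    # The counting function #{j : |beta*_j| > z} in DeVore's condition.
    return int(np.sum(np.abs(beta_star) > z))

# With |beta*_j| = 1/j the count grows like 1/z, so tau = 1 and a = 1/2:
# the best q-term approximation error decays like O(q^{-1/2}).
for z in (0.1, 0.01, 0.001):
    print(z, count_above(z))                  # 9, 99, 999
```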
An important and striking aspect of this view of smoothness is that it is relative to the basis. A function that is not at all smooth with respect to one basis may be quite smooth with respect to another. Another striking feature of results of this sort is that the dimensionality of $\mu$ no longer plays an explicit role, seemingly suggesting that nonlinear approximation may somehow hold in abeyance the "curse of dimensionality" (the inability to well approximate functions in high-dimensional spaces without inordinate amounts of data). A more precise interpretation of this situation seems to be that smoothness with respect to the basis also incorporates dimensionality, such that a given decay rate for the optimal coefficients is a stronger condition in higher dimensions.
In some cases, theory alone can inform us about the choice of basis functions. For example, it turns out, as DeVore (1998, p. 106) discusses, that with respect to nonlinear approximation, rational polynomials have approximation properties essentially equivalent to those of piecewise polynomials. In this sense, there is nothing to gain or lose in selecting one of these bases over the other. In other cases, the helpfulness of the theory in choosing a basis depends on having quite specific knowledge about $\mu$, for example, that it is very smooth (in the familiar sense) in some places and very rough in others, or that it has singularities or discontinuities. For example, Dekel and Leviatan (2003) show that in this sense, wavelet approximations do not perform well in capturing singularities along curves, whereas nonlinear piecewise polynomial approximations do.
Usually, however, we economists have little prior knowledge about the familiar smoothness properties of $\mu$, let alone its smoothness with respect to any given basis. As a practical matter, then, it may make sense to consider a collection of different bases, and let the data guide us to the best choice. Such a collection of bases is called a library. An example is the wavelet packet library proposed by Coifman and Wickerhauser (1992).
Alternatively, one can choose the $\psi_j$'s from any suitable subset of the Hilbert space. Such a subset is called a dictionary; the idea is once again to let the data help decide which elements of the dictionary to select. Artificial neural networks (ANNs) are an example of a dictionary, generated by letting $\psi_j(x) = \Phi(x'\gamma_j)$ for a given "activation function" $\Phi$, such as the logistic cdf ($\Phi(z) = 1/(1 + \exp(-z))$), and with $\gamma_j$ any element of $\mathbb{R}^k$. For a discussion of artificial neural networks from an econometric perspective, see Kuan and White (1994). Trippi and Turban (1992) contains a collection of papers applying ANNs to economics and finance.
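A sketch of the ANN dictionary idea: draw a pool of candidate $\gamma_j$'s, form the dictionary terms $\psi_j(x) = \Phi(x'\gamma_j)$ with the logistic cdf, and estimate the $\beta_j$'s by OLS with the $\gamma_j$'s held fixed. Drawing the $\gamma_j$'s at random is an illustrative shortcut; data-driven selection of dictionary elements is taken up below.

```python
import numpy as np

rng = np.random.default_rng(3)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sample with a constant and two predictors.
n, k = 400, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = np.sin(X[:, 1]) * np.cos(X[:, 2]) + rng.normal(scale=0.1, size=n)

# ANN dictionary elements psi_j(x) = logistic(x' gamma_j), gamma_j in R^k,
# here drawn at random rather than selected by a data-driven rule.
q = 25
Gamma = rng.normal(size=(k, q))
Psi = logistic(X @ Gamma)                     # n x q matrix of dictionary terms

# With the gamma_j's fixed, the model is linear in beta, so OLS applies.
Z = np.column_stack([X, Psi])
coef_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
```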
Approximating a function $\mu$ using a library or dictionary is called highly nonlinear approximation, as not only is there the nonlinearity associated with choosing $q$ basis functions, but there is the further choice of the basis itself or of the elements of the dictionary. Section 8 of DeVore's (1998) comprehensive survey is devoted to a discussion of the so far somewhat fragmentary degree of approximation results for approximations of this sort. Nevertheless, some powerful results are available. Specifically, for sufficiently rich dictionaries $\mathcal{D}$ (e.g., artificial neural networks as above), DeVore and Temlyakov (1996) show [see DeVore (1998, Theorem 7)] that for $a \le 1/2$ and sufficiently smooth functions $\mu$,
$$\sigma_q(\mu, \mathcal{D}) \le C_a q^{-a},$$
Trang 9where C a is a constant quantifying the smoothness of μ relative to the dictionary, and,
analogous to the case of nonlinear approximation, we define
σ q (μ, D) ≡ inf
D,β
μ(x) − g D (x, β)2
dH (x)
1/2
,
g D (x, β)≡
ψ j ∈D
ψ j (x)β j ,
where $D$ is a $q$-element subset of $\mathcal{D}$. DeVore and Temlyakov's result generalizes an earlier result for $a = 1/2$ of Maurey [see Pisier (1980)]. Jones (1992) provides a "greedy algorithm" and a "relaxed greedy algorithm" achieving $a = 1/2$ for a specific dictionary and class of functions $\mu$, and DeVore (1998) discusses further related algorithms.
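A schematic version of a greedy step may help fix ideas: at each of $q$ iterations, select the dictionary element most correlated with the current residual, then refit on everything selected so far. This is a generic orthogonal-greedy sketch in the spirit of the algorithms cited, not Jones's exact procedure.

```python
import numpy as np

def greedy_select(D, y, q):
    """Orthogonal greedy selection: D is an n x J matrix whose columns are
    dictionary elements evaluated at the sample points, y is the target;
    returns the indices of the q selected elements."""
    selected = []
    residual = y.astype(float).copy()
    for _ in range(q):
        # Pick the element most correlated (in absolute value) with the residual.
        scores = np.abs(D.T @ residual)
        scores[selected] = -np.inf            # do not reselect an element
        selected.append(int(np.argmax(scores)))
        # Refit y on all selected elements and update the residual.
        coef, *_ = np.linalg.lstsq(D[:, selected], y, rcond=None)
        residual = y - D[:, selected] @ coef
    return selected
```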
The cases discussed so far by no means exhaust the possibilities. Among other notable choices for the $\psi_j$'s relevant in economics are radial basis functions [Powell (1987), Lendasse et al. (2003)] and ridgelets [Candes (1998, 1999a, 1999b, 2003)]. Radial basis functions arise by taking
$$\psi_j(x) = \Phi\big(p_2(x, \gamma_j)\big),$$
where $p_2(x, \gamma_j)$ is a polynomial of (at most) degree 2 in $x$ with coefficients $\gamma_j$, and $\Phi$ is typically taken to be such that, with the indicated choice of $p_2(x, \gamma_j)$, $\Phi(p_2(x, \gamma_j))$ is proportional to a density function. Standard radial basis functions treat the $\gamma_j$'s as free parameters, and restrict $p_2(x, \gamma_j)$ to have the form
$$p_2(x, \gamma_j) = -(x - \gamma_{1j})'\gamma_{2j}(x - \gamma_{1j})/2,$$
where $\gamma_j \equiv (\gamma_{1j}, \gamma_{2j})$, so that $\gamma_{1j}$ acts as a centering vector, and $\gamma_{2j}$ is a $k \times k$ symmetric positive semi-definite matrix acting to scale the departures of $x$ from $\gamma_{1j}$. A common choice for $\Phi$ is $\Phi = \exp$, which delivers $\Phi(p_2(x, \gamma_j))$ proportional to the multivariate normal density with mean $\gamma_{1j}$ and with $\gamma_{2j}$ a suitable generalized inverse of a given covariance matrix. Thus, standard radial basis functions have the form of a linear combination of multivariate densities, accommodating a mixture of densities as a special case. Treating the $\gamma_j$'s as free parameters, we may view the radial basis functions as a dictionary, as defined above.
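A sketch of a standard radial basis function with $\Phi = \exp$, i.e., a Gaussian bump centered at $\gamma_{1j}$ and scaled by $\gamma_{2j}$; the centers and scaling matrix below are hypothetical dictionary elements.

```python
import numpy as np

def gaussian_rbf(x, center, scale):
    """psi_j(x) = exp(p2(x, gamma_j)), with
    p2(x, gamma_j) = -(x - gamma_1j)' gamma_2j (x - gamma_1j) / 2."""
    d = x - center
    return np.exp(-0.5 * d @ scale @ d)

# Hypothetical dictionary: three centers sharing one symmetric psd scaling matrix.
centers = [np.array([0.0, 0.0]), np.array([1.0, -1.0]), np.array([-1.0, 0.5])]
scale = np.diag([2.0, 2.0])                   # gamma_2j

x = np.array([0.25, -0.5])
psi_values = np.array([gaussian_rbf(x, c, scale) for c in centers])
```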
Candes’s ridgelets can be thought of as a very carefully constructed special case of ANNs Ridgelets arise by taking
ψ j (x) = γ 1j −1/2
˜xγ 2j − γ0j/γ 1j
,
where˜x denotes the vector of nonconstant elements of x (i.e., x = (1, ˜x)), γ0jis real,
γ 1j > 0, and γ2j belongs to S k−2, the unit sphere inR k−1 The activation function
is taken to belong to the space of rapidly decreasing functions (Schwartz space, a subset of C∞) and to satisfy a specific admissibility property on its Fourier transform
[seeCandles (1999a, Definition 1)], essentially equivalent to the moment conditions
z j (z) dz = 0, j = 0, , k/2 − 1.
This condition ensures that $\Phi$ oscillates, has zero average value, zero average slope, etc. For example, $\Phi = D^h\phi$, the $h$th derivative of the standard normal density $\phi$, is readily verified to be admissible with $h = k/2$.
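The moment conditions are easy to verify numerically for this example. A sketch for $k = 4$, so $h = k/2 = 2$ and admissibility requires the $j = 0$ and $j = 1$ moments of $\Phi = \phi''$ to vanish; crude Riemann-sum quadrature on a wide grid suffices here.

```python
import numpy as np

# Second derivative of the standard normal density: phi''(z) = (z^2 - 1) phi(z).
def Phi(z):
    return (z ** 2 - 1.0) * np.exp(-z ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

z = np.linspace(-10.0, 10.0, 200001)
dz = z[1] - z[0]
for j in (0, 1):
    moment = np.sum(z ** j * Phi(z)) * dz
    print(j, moment)                          # both approximately zero
```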
The admissibility of the activation function has a number of concrete benefits, but the chief benefit for present purposes is that it leads to the explicit specification of a countable sequence $\{\gamma_j = (\gamma_{0j}, \gamma_{1j}, \gamma_{2j})\}$ such that any function $f$ square integrable on a compact set has an exact representation of the form
$$f(x) \equiv \sum_{j=1}^\infty \psi_j(x)\beta_j^*.$$
The representing coefficients $\beta_j^*$ are such that good approximations can be obtained using $g_q(x, \beta)$ or $g_\Lambda(x, \beta)$ as above. In this sense, the ridgelet dictionary that arises by letting the $\gamma_j$'s be free parameters (as in the usual ANN approach) can be reduced to a countable subset that delivers a basis with appealing properties.
As Candes (1999b) shows, ridgelets turn out to be optimal for representing otherwise smooth multivariate functions that may exhibit linear singularities, achieving a rate of approximation of $O(q^{-a})$ with $a = s/(k - 1)$, provided the $s$th derivatives of $f$ exist and are square integrable. This is in sharp contrast to Fourier series or wavelets, which can be badly behaved in the presence of singularities. Candes (2003) provides an extensive discussion of the properties of ridgelet regression estimators, and, in particular, certain shrinkage estimators based on thresholding coefficients from a ridgelet regression. (By thresholding is meant setting to zero estimated coefficients whose magnitude does not exceed some pre-specified value.) In particular, Candes (2003) discusses the superiority in multivariate contexts of ridgelet methods to kernel smoothing and wavelet thresholding methods.
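Hard thresholding as just described is simple to express; a generic sketch (not Candes's specific estimator), with hypothetical coefficient estimates:

```python
import numpy as np

def hard_threshold(coef, cutoff):
    """Set to zero every coefficient whose magnitude does not exceed the cutoff."""
    coef = np.asarray(coef, dtype=float)
    return np.where(np.abs(coef) > cutoff, coef, 0.0)

beta_hat = np.array([0.9, -0.04, 0.3, 0.01, -0.6])    # hypothetical estimates
beta_shrunk = hard_threshold(beta_hat, cutoff=0.05)    # -> [0.9, 0, 0.3, 0, -0.6]
```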
InDeVore’s (1998)survey, Candes’s papers, and the references cited there, the inter-ested reader can find a wealth of further material describing the approximation
prop-erties of a wide variety of different choices for the ψ j’s From a practical standpoint, however, these results do not yield hard and fast prescriptions about how to choose
the ψ j’s, especially in the circumstances commonly faced by economists, where one may have little prior information about the smoothness of the function of interest Nev-ertheless, certain helpful suggestions emerge Specifically:
(i) nonlinear approximations are an appealing alternative to linear approximations;
(ii) using a library or dictionary of basis functions may prove useful;
(iii) ANNs, and ridgelets in particular, may prove useful.
These suggestions are simply things to try. In any given instance, the data must be the final arbiter of how well any particular approach works. In the next section, we provide a concrete example of how these suggestions may be put into practice and how they interact with other practical concerns.