
4 Artificial neural networks

4.1 General considerations

In the previous section, we introduced artificial neural networks (ANNs) as an example of an approximation dictionary supporting highly nonlinear approximation. In this section, we consider ANNs in greater detail. Our attention is motivated not only by their flexibility and the fact that many powerful approximation methods can be viewed as special cases of ANNs (e.g., Fourier series, wavelets, and ridgelets), but also by two further reasons. First, ANNs have become increasingly popular in economic applications. Second, despite their increasing popularity, the application of ANNs in economics and other fields has often run into serious stumbling blocks, precisely reflecting the three key challenges to the use of nonlinear methods articulated at the outset. In this section we explore some further properties of ANNs that may help in mitigating or eliminating some of these obstacles, permitting both their more successful practical application and a more informed assessment of their relative usefulness.

Artificial neural networks comprise a family of flexible functional forms posited by cognitive scientists attempting to understand the behavior of biological neural systems. Kuan and White (1994) provide a discussion of their origins and an econometric perspective. Our focus here is on the ANNs introduced above, that is, the class of "single hidden layer feedforward networks", which have the functional form

(6)
\[
f(x, \theta) = x'\alpha + \sum_{j=1}^{q} \psi(x'\gamma_j)\,\beta_j,
\]

where ψ is a given activation function, and θ ≡ (α′, β′, γ′)′, β ≡ (β_1, ..., β_q)′, γ ≡ (γ_1′, ..., γ_q′)′. The quantity ψ(x′γ_j) is called the "activation" of "hidden unit" j.
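To fix ideas, here is a minimal sketch (in Python with NumPy; the function and variable names are ours, not the chapter's) of evaluating a single hidden layer feedforward network of the form (6) with a logistic activation.

```python
import numpy as np

def logistic(z):
    """Logistic cdf activation: psi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def ann_forecast(x, alpha, beta, gammas, psi=logistic):
    """Single hidden layer feedforward network, eq. (6):
    f(x, theta) = x'alpha + sum_j psi(x'gamma_j) * beta_j.

    x      : (k,) predictor vector
    alpha  : (k,) linear coefficients
    beta   : (q,) hidden-to-output coefficients
    gammas : (q, k) array whose rows are the gamma_j
    """
    hidden = psi(gammas @ x)          # activations psi(x'gamma_j), j = 1..q
    return x @ alpha + hidden @ beta
```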

Except for the case of ridgelets, ANNs generally take the γ_j's to be free parameters, resulting in a parameterization nonlinear in the parameters, with all the attendant computational challenges that we would like to avoid. Indeed, these difficulties have been formalized by Jones (1997) and Vu (1998), who prove that optimizing such an ANN is an NP-hard problem. It turns out, however, that by suitably choosing the activation function ψ, it is possible to retain the flexibility of ANNs without requiring the γ_j's to be free parameters and without necessarily imposing the ridgelet activation function or schedule of γ_j values, which can be somewhat cumbersome to implement in higher dimensions.

This possibility is a consequence of results of Stinchcombe and White (1998) ("SW"), as foreshadowed in earlier results of Bierens (1990). Taking advantage of these results leads to parametric models that are nonlinear in the predictors, with the attendant advantages of flexibility, and linear in the parameters, with the attendant advantages of computational convenience. These computational advantages create the possibility of mitigating the difficulties formalized by Jones (1997) and Vu (1998). We first take up the results of SW that create these opportunities and then describe a method for exploiting them for forecasting purposes. Subsequently, we perform some numerical experiments that shed light on the extent to which the resulting methods may succeed in avoiding the documented difficulties of nonlinearly parameterized ANNs.

4.2 Generically comprehensively revealing activation functions

In work proposing new specification tests with the property of consistency (that is, the property of having power against model misspecification of any form), Bierens (1990) proved a powerful and remarkable result. This result states essentially that for any random variable ε_t and random vector X_t, under general conditions E(ε_t | X_t) ≠ 0 with nonzero probability implies E(exp(X_t′γ) ε_t) ≠ 0 for almost every γ ∈ Γ, where Γ is any nonempty compact set. Applying this result to the present context with ε_t = Y_t − f(X_t, θ*), Bierens's result implies that if (with nonzero probability)

\[
E\bigl[Y_t - f(X_t, \theta^*) \mid X_t\bigr] = \mu(X_t) - f(X_t, \theta^*) \neq 0,
\]

then for almost every γ ∈ Γ we have

\[
E\bigl[\exp(X_t'\gamma)\bigl(Y_t - f(X_t, \theta^*)\bigr)\bigr] \neq 0.
\]

That is, if the model N is misspecified, then the prediction error ε_t = Y_t − f(X_t, θ*) resulting from the use of model N is correlated with exp(X_t′γ) for essentially any choice of γ. Bierens exploits this fact to construct a specification test based on a choice for γ that maximizes the sample correlation between exp(X_t′γ) and the sample prediction error ε̂_t = Y_t − f(X_t, θ̂).

Stinchcombe and White (1998) show that Bierens's (1990) result holds more generally, with the exponential function replaced by any ψ belonging to the class of generically comprehensively revealing (GCR) functions. These functions are "comprehensively revealing" in the sense that they can reveal arbitrary model misspecifications (μ(X_t) − f(X_t, θ*) ≠ 0 with nonzero probability); they are generic in the sense that almost any choice for γ will reveal the misspecification.

An important class of functions that SW demonstrate to be GCR is the class of non-polynomial real analytic functions (functions that are everywhere locally given by a convergent power series), such as the logistic cumulative distribution function (cdf) or the hyperbolic tangent function, tanh. Among other things, SW show how GCR functions can be used to test for misspecification in ways that parallel Bierens's procedures for the regression context, but that also extend to specification testing beyond the regression context, such as testing for equality of distributions.
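As an informal illustration of the GCR idea (our own sketch, not a procedure from the chapter, and not Bierens's formal test statistic): fit a linear model, then screen its residuals against ψ(X_t′γ) for randomly drawn γ. If the linear model neglects nonlinearity, a sizeable sample correlation should show up for most draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated data with a neglected nonlinearity (illustrative only)
n, k = 500, 3
X = rng.normal(size=(n, k))
Y = X @ np.array([1.0, -0.5, 0.25]) + np.sin(2.0 * X[:, 0]) + rng.normal(scale=0.3, size=n)

# Linear fit and its residuals
Xc = np.column_stack([np.ones(n), X])
alpha_hat, *_ = np.linalg.lstsq(Xc, Y, rcond=None)
resid = Y - Xc @ alpha_hat

# Screen residuals against psi(X'gamma) for randomly drawn gamma
max_abs_corr = 0.0
for _ in range(200):
    gamma = rng.normal(size=k + 1)
    activation = logistic(Xc @ gamma)
    corr = np.corrcoef(activation, resid)[0, 1]
    max_abs_corr = max(max_abs_corr, abs(corr))

print(f"largest |corr(psi(X'gamma), residual)| over draws: {max_abs_corr:.3f}")
```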

Here, we exploit SW's results for a different purpose, namely to obtain flexible parameterizations nonlinear in the predictors and linear in the parameters. To proceed, we represent a q hidden unit ANN more explicitly as

\[
f_q(x, \theta_q) = x'\alpha_q + \sum_{j=1}^{q} \psi(x'\gamma_j)\,\beta_{qj},
\]


where ψ is GCR, and we let

\[
\varepsilon_t = Y_t - f_q\bigl(X_t, \theta_q^*\bigr).
\]

If, with nonzero probability, μ(X_t) − f_q(X_t, θ_q*) ≠ 0, then for almost every γ ∈ Γ we have

\[
E\bigl[\psi(X_t'\gamma)\,\varepsilon_t\bigr] \neq 0.
\]

As Γ is compact, we can pick γ_{q+1} such that

\[
\bigl|\operatorname{corr}\bigl(\psi(X_t'\gamma_{q+1}),\, \varepsilon_t\bigr)\bigr| \;\geq\; \bigl|\operatorname{corr}\bigl(\psi(X_t'\gamma),\, \varepsilon_t\bigr)\bigr|
\]

for all γ ∈ Γ, where corr(·, ·) denotes the correlation of the indicated variables. Let Γ_m be a finite subset of Γ having m elements whose neighborhoods cover Γ. With ψ chosen to be continuous, the continuity of the correlation operator then ensures that, with m sufficiently large, one can achieve correlations nearly as great as by optimizing over Γ by instead optimizing over Γ_m. Thus one can avoid full optimization over Γ at potentially small cost by instead picking γ_{q+1} ∈ Γ_m such that

\[
\bigl|\operatorname{corr}\bigl(\psi(X_t'\gamma_{q+1}),\, \varepsilon_t\bigr)\bigr| \;\geq\; \bigl|\operatorname{corr}\bigl(\psi(X_t'\gamma),\, \varepsilon_t\bigr)\bigr|
\]

for all γ ∈ Γ_m. This suggests a process of adding hidden units in a stepwise manner, stopping when |corr(ψ(X_t′γ_{q+1}), ε_t)| (or some other suitable measure of the predictive value of the marginal hidden unit) is sufficiently small.
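The following sketch (our own, with hypothetical names) implements this selection step: given the current prediction errors and a finite candidate set standing in for Γ_m, it returns the candidate γ whose activation has the largest squared sample correlation with the errors.

```python
import numpy as np

def pick_next_gamma(X, errors, candidate_gammas, psi):
    """Select the candidate gamma whose activation psi(X @ gamma) has the
    largest squared sample correlation with the current prediction errors.
    candidate_gammas is an (m, k) array playing the role of Gamma_m."""
    best_gamma, best_corr = None, 0.0
    for gamma in candidate_gammas:
        a = psi(X @ gamma)
        if a.std() < 1e-12:            # skip nearly constant activations
            continue
        corr = np.corrcoef(a, errors)[0, 1]
        if corr ** 2 > best_corr ** 2:
            best_gamma, best_corr = gamma, corr
    return best_gamma, best_corr
```

In a stepwise scheme, one would stop adding hidden units once |best_corr| (or a similar measure of the marginal unit's predictive value) falls below a chosen threshold.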

5 QuickNet

We now propose a family of algorithms based on these considerations that can work well in practice, called "QuickNet". The algorithm requires specifying a priori a maximum number of hidden units, say q̄, a GCR activation function ψ, an integer m specifying the cardinality of Γ_m, and a method for choosing the subsets Γ_m.

In practice, initially choosing q̄ to be on the order of 10 or 20 seems to work well; if the results indicate there is additional predictability not captured using q̄ hidden units, this limit can always be relaxed. (For concreteness and simplicity, suppose for now that q̄ < ∞. More generally, one may take q̄ = q̄_n, with q̄_n → ∞ as n → ∞.) A common choice for ψ is the logistic cdf, ψ(z) = 1/(1 + exp(−z)). Ridgelet activation functions are also an appealing option.

Choosing m to be 500–1000 often works well, with Γ_m consisting of a range of values (chosen either deterministically or, especially with more than a few predictors, randomly) such that the norm of γ is neither too small nor too large. As we discuss in greater detail below, when the norm of γ is too small, ψ(X_t′γ) is approximately linear in X_t, whereas when the norm of γ is too large, ψ(X_t′γ) can become approximately constant in X_t; both situations are to be avoided. This is true not only for the logistic cdf but also for many other nonlinear choices for ψ. In any given instance, one can experiment with these choices to observe the sensitivity or robustness of the method to these choices.

Our approach also requires a method for selecting the appropriate degree of model complexity, so as to avoid overfitting, the second of the key challenges to the use of nonlinear models identified above. For concreteness, we first specify a prototypical member of the QuickNet family using cross-validated mean squared error (CVMSE) for this purpose. Below, we also briefly discuss possibilities other than CVMSE.

5.1 A prototype QuickNet algorithm

We now specify a prototype QuickNet algorithm. The specification of this section is generic, in that for succinctness we do not provide details on the construction of Γ_m or the computation of CVMSE. We provide further specifics on these aspects of the algorithm in Sections 5.2 and 5.3.

Our prototypical QuickNet algorithm is a form of relaxed greedy algorithm consisting of the following steps:

Step 0: Compute α̂_0 and ε̂_{0t} (t = 1, ..., n) by OLS:

\[
\hat\alpha_0 = (X'X)^{-1}X'Y, \qquad \hat\varepsilon_{0t} = Y_t - X_t'\hat\alpha_0.
\]

Compute CVMSE(0) (the cross-validated mean squared error for Step 0; details are provided below), and set q = 1.

Step 1a: Pick Γ_m, and find γ̂_q such that

\[
\hat\gamma_q = \arg\max_{\gamma \in \Gamma_m} \bigl[\hat r\bigl(\psi(X_t'\gamma),\, \hat\varepsilon_{q-1,t}\bigr)\bigr]^2,
\]

where r̂ denotes the sample correlation between the indicated random variables. To perform this maximization, one simply regresses ε̂_{q−1,t} on a constant and ψ(X_t′γ) for each γ ∈ Γ_m, and picks as γ̂_q the γ that yields the largest R².

Step 1b: Compute α̂_q, β̂_q ≡ (β̂_{q1}, ..., β̂_{qq})′ by OLS, regressing Y_t on X_t and ψ(X_t′γ̂_j), j = 1, ..., q, and compute ε̂_{qt} (t = 1, ..., n) as

\[
\hat\varepsilon_{qt} = Y_t - X_t'\hat\alpha_q - \sum_{j=1}^{q} \psi(X_t'\hat\gamma_j)\,\hat\beta_{qj}.
\]

Compute CVMSE(q) and set q = q + 1. If q > q̄, stop. Otherwise, return to Step 1a.

Step 2: Pick q̂ such that

\[
\hat q = \arg\min_{q \in \{1, \ldots, \bar q\}} \operatorname{CVMSE}(q),
\]

and set the estimated parameters to be those associated with q̂:

\[
\hat\theta_{\hat q} \equiv \bigl(\hat\alpha_{\hat q}',\, \hat\beta_{\hat q}',\, \hat\gamma_1', \ldots, \hat\gamma_{\hat q}'\bigr)'.
\]


Step 3 (optional): Perform nonlinear least squares for Y_t using the functional form

\[
f_{\hat q}(x, \theta_{\hat q}) = x'\alpha + \sum_{j=1}^{\hat q} \psi(x'\gamma_j)\,\beta_j,
\]

starting the nonlinear iterations at θ̂_{q̂}.
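To make the flow of Steps 0–2 concrete, here is a hedged sketch in Python/NumPy. It is our own simplified rendering, not the chapter's code: `draw_gammas` stands in for the construction of Γ_m (Section 5.2), `score` stands in for CVMSE (Section 5.3), and all names are ours.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def ols(Z, y):
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

def quicknet(X, Y, draw_gammas, score, q_max=10, m=500, psi=logistic):
    """Simplified rendering of QuickNet Steps 0-2.

    X : (n, k) predictor matrix (first column may be a constant); Y : (n,).
    draw_gammas(m) : returns an (m, k) array of candidate gammas (a stand-in
                     for Gamma_m, see Section 5.2).
    score(Z, Y)    : model selection criterion for an OLS fit of Y on Z,
                     e.g. a cross-validated MSE as in Section 5.3.
    Returns (q_hat, list of selected gammas, OLS coefficients of that model).
    """
    models = []                        # one record per q = 0, 1, ..., q_max
    Z = X.copy()                       # Step 0: design holds the linear terms
    gammas = []

    for q in range(q_max + 1):
        coef = ols(Z, Y)               # Steps 0 / 1b: OLS on current design
        resid = Y - Z @ coef
        models.append((score(Z, Y), q, list(gammas), coef))
        if q == q_max:
            break
        # Step 1a: screen the candidates for the activation whose squared
        # correlation with the current errors is largest
        best_gamma, best_r2 = None, -1.0
        for gamma in draw_gammas(m):
            a = psi(X @ gamma)
            if a.std() < 1e-12:        # avoid nearly constant activations
                continue
            r2 = np.corrcoef(a, resid)[0, 1] ** 2
            if r2 > best_r2:
                best_gamma, best_r2 = gamma, r2
        if best_gamma is None:
            break
        gammas.append(best_gamma)
        Z = np.column_stack([Z, psi(X @ best_gamma)])

    # Step 2: choose the q minimizing the selection criterion
    best = min(models, key=lambda rec: rec[0])
    return best[1], best[2], best[3]
```

Step 3's optional refinement would then run nonlinear least squares on the selected specification, starting from these estimates.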

For convenience in what follows, we let θ̂ denote the parameter estimates obtained via this QuickNet algorithm (or any other member of the family, discussed below). QuickNet's most obvious virtue is its computational simplicity. Steps 0–2 involve only OLS regression; this is essentially a consequence of exploiting the linearity of f_q in α and β. Although a potentially large number (m) of regressions is involved in Step 1a, these regressions involve only a single regressor plus a constant. They can be computed so quickly that this is not a significant concern. Moreover, the user has full control (through specification of m) over how intense a search is performed in Step 1a.

The only computational headache posed by using OLS in Steps 0–2 results from multicollinearity, but this can easily be avoided by taking proper care to select predictors X_t at the outset that vary sufficiently independently (little, if any, predictive power is lost in so doing), and by avoiding (either ex ante or ex post) any choice of γ in Step 1a that results in too little sample variation in ψ(X_t′γ). (See Section 5.2 below for more on this issue.) Consequently, execution of Steps 0–2 of QuickNet can be fast, justifying our name for the algorithm.

Above, we referred to QuickNet as a form of relaxed greedy algorithm. QuickNet is a greedy algorithm because in Step 1a it searches for a single best additional term. The usual greedy algorithms add one term at a time, but specify full optimization over γ. In contrast, by restricting attention to Γ_m, QuickNet greatly simplifies computation, and by using a GCR activation function ψ, QuickNet ensures that the risk of missing predictively useful nonlinearities is small. QuickNet is a relaxed greedy algorithm because it permits full adjustment of the estimated coefficients of all the previously included terms, permitting it to take full predictive advantage of these terms as the algorithm proceeds. In contrast, typical relaxed greedy algorithms permit only modest adjustment in the relative contributions of the existing and added terms.

The optional Step 3 involves an optimization nonlinear in the parameters, so here one may seem to lose the computational simplicity motivating our algorithm design. In fact, however, Steps 0–2 set the stage for a relatively simple computational exercise in Step 3. A main problem in the brute-force nonlinear optimization of ANN models is, for given q, finding a good (near global optimum) value for θ, as the objective function is typically nonconvex in nasty ways. Further, the larger is q, the more difficult this becomes and the easier it is to get stuck at relatively poor local optima. Typically, the optimization bogs down fairly early on (with the best fits seen for relatively small values of q), preventing the model from taking advantage of its true flexibility. (Our example in Section 7 illustrates these issues.)


In contrast, the θ̂ produced by Steps 0–2 of QuickNet typically delivers a much better fit than estimates produced by brute-force nonlinear optimization, so that local optimization in the neighborhood of θ̂ produces a potentially useful refinement of θ̂. Moreover, the required computations are particularly simple, as optimization is done only with a fixed number q̂ of hidden units, and the iterations of the nonlinear optimization can be computed as a sequence of OLS regressions. Whether or not the refinements of Step 3 are helpful can be assessed using the CVMSE: if CVMSE improves after Step 3, one can use the refined estimate; otherwise one can use the unrefined (Step 2) estimate.

5.2 Constructing Γ_m

The proper choice of Γ_m in Step 1a can make a significant difference in QuickNet's performance. The primary consideration in choosing Γ_m is to avoid choices that will result in candidate hidden unit activations that are collinear with previously included predictors, as such candidate hidden units will tend to be uncorrelated with the prediction errors ε̂_{q−1,t} and therefore have little marginal predictive power. As previously included predictors will typically include the original X_t's, particular care should be taken to avoid choosing Γ_m so that it contains elements ψ(X_t′γ) that are either approximately constant or approximately proportional to X_t′γ.

To see what this entails in a simple setting, consider the case of a logistic cdf activation ψ with a single predictor X_t having mean zero. We denote a candidate nonlinear predictor as ψ(γ_1 X_t + γ_0). If γ_0 is chosen to be large in absolute value relative to γ_1 X_t, then ψ(γ_1 X_t + γ_0) behaves approximately as ψ(γ_0), that is, it is roughly constant. To avoid this, γ_0 can be chosen to be roughly the same order of magnitude as sd(γ_1 X_t), the standard deviation of γ_1 X_t. On the other hand, suppose γ_1 is chosen to be small relative to 1/sd(X_t). Then ψ(γ_1 X_t + γ_0) varies approximately proportionately to γ_1 X_t + γ_0. To avoid this, γ_1 should be chosen to be at least of the order of magnitude of 1/sd(X_t).

A simple way to ensure these properties is to pick γ_0 and γ_1 randomly, independently of each other and of X_t. We can pick γ_1 to be positive, with a range spanning modest multiples of 1/sd(X_t), and pick γ_0 to have mean zero, with a variance that is roughly comparable to that of γ_1 X_t. The lack of negative values for γ_1 is of no consequence here, given that ψ is monotone. Randomly drawing m such choices for (γ_0, γ_1) thus delivers a set Γ_m that will be unlikely to contain elements that are either approximately constant or collinear with the included predictors. With these precautions, the resulting activations will also tend not to be too highly correlated with other functions of X_t, such as previously included linear or nonlinear predictors. Choosing Γ_m in this way thus generates a plausibly useful collection of candidate nonlinear predictors.

In the multivariate case, similar considerations operate. Here, however, we replace γ_1 X_t with γ_1 (X_t′γ_2), where γ_2 is a direction vector, that is, a vector on S^{k−2}, the unit sphere in R^{k−1}, as in Candès's ridgelet parameterization. Now the magnitude of γ_0 should be comparable to sd(γ_1 (X_t′γ_2)), and the magnitude of γ_1 should be chosen to be at least of the order of magnitude of 1/sd(X_t′γ_2). One can proceed by picking a direction γ_2 on the unit sphere (e.g., γ_2 = Z/(Z′Z)^{1/2} is distributed uniformly on the unit sphere, provided Z is (k − 1)-variate unit normal), then choosing γ_1 to be positive, with a range spanning modest multiples of 1/sd(X_t′γ_2), and picking γ_0 to have mean zero, with a variance that is roughly comparable to that of γ_1 (X_t′γ_2). Drawing m such choices for (γ_0, γ_1, γ_2) thus delivers a set Γ_m that will be unlikely to contain elements that are either approximately constant or collinear with the included predictors, just as in the univariate case.

These considerations are not specific to the logistic cdf activation ψ, but operate generally. The key is to avoid choosing a Γ_m that contains elements that are either approximately constant or proportional to the included predictors. The strategies just described are broadly useful for this purpose and can be fine-tuned for any particular choice of activation function.
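A hedged sketch of one way to implement this recipe for the multivariate case (our own illustration; the range for γ_1 and the spread for γ_0 are arbitrary choices in the spirit of the text, and the returned vectors are meant to be applied to predictors augmented with a leading constant):

```python
import numpy as np

def draw_gamma_m(X, m, rng=None):
    """Draw m candidate coefficient vectors for the hidden unit index,
    following the recipe above: a direction gamma_2 uniform on the unit
    sphere, a positive scale gamma_1 of order 1/sd(X @ gamma_2), and an
    intercept gamma_0 with mean zero and comparable spread.

    X : (n, k) predictors WITHOUT a constant column.
    Returns an (m, k+1) array; a row [gamma_0, gamma_1 * gamma_2] is applied
    as psi(gamma_0 + x'(gamma_1 * gamma_2))."""
    rng = rng or np.random.default_rng()
    n, k = X.shape
    out = np.empty((m, k + 1))
    for i in range(m):
        z = rng.normal(size=k)
        gamma2 = z / np.linalg.norm(z)                  # direction on the unit sphere
        s = np.std(X @ gamma2)                          # spread of the index X'gamma_2
        gamma1 = rng.uniform(0.5, 5.0) / max(s, 1e-12)  # positive, of order 1/sd
        gamma0 = rng.normal(scale=gamma1 * s)           # mean zero, comparable spread
        out[i, 0] = gamma0
        out[i, 1:] = gamma1 * gamma2
    return out
```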

5.3 Controlling overfit

The advantageous flexibility of nonlinear modeling is also responsible for the second key challenge noted above to the use of nonlinear forecasting models, namely the danger of overfitting the data. Our prototype QuickNet uses cross-validation to choose the meta-parameter q indexing model complexity, thereby attempting to control the tendency of such flexible models to overfit the sample data. This is a common method, with a long history in statistical and econometric applications. Numerous other members of the QuickNet family can be constructed by replacing CVMSE with alternative measures of model fit, such as AIC [Akaike (1970, 1973)], C_p [Mallows (1973)], BIC [Schwarz (1978), Hannan and Quinn (1979)], Minimum Description Length (MDL) [Rissanen (1978)], Generalized Cross-Validation (GCV) [Craven and Wahba (1979)], and others.

We have specified CVMSE for concreteness and simplicity in our prototype, but, as results of Shao (1993, 1997) establish, the family members formed by using alternative model selection criteria in place of CVMSE have equivalent asymptotic properties under specific conditions, as discussed further below.

The simplest form of cross-validation is "delete-1" cross-validation [Allen (1974), Stone (1974, 1976)], which computes CVMSE as

\[
\operatorname{CVMSE}^{(1)}(q) = \frac{1}{n}\sum_{t=1}^{n} \hat\varepsilon_{qt(-t)}^{2},
\]

where ε̂_{qt(−t)} is the prediction error for observation t computed using estimators α̂_{0(−t)} and β̂_{qj(−t)}, j = 1, ..., q, obtained by omitting observation t from the sample, that is,

\[
\hat\varepsilon_{qt(-t)} = Y_t - X_t'\hat\alpha_{0(-t)} - \sum_{j=1}^{q} \psi(X_t'\hat\gamma_j)\,\hat\beta_{qj(-t)}.
\]
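For models linear in the parameters, CVMSE^{(1)} need not be computed by literally re-estimating the model n times; the sketch below (our own) uses the standard OLS leave-one-out identity ε̂_{t(−t)} = ε̂_t/(1 − h_tt), where h_tt is the t-th leverage, holding the selected γ̂_j fixed as in the formula above.

```python
import numpy as np

def cvmse_delete1(Z, Y):
    """Delete-1 cross-validated MSE for an OLS regression of Y on Z,
    computed via the leave-one-out identity e_t(-t) = e_t / (1 - h_tt),
    where h_tt is the t-th diagonal element of the hat matrix."""
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ coef
    # leverages: diagonal of Z (Z'Z)^{-1} Z'
    G = np.linalg.pinv(Z.T @ Z)
    leverages = np.einsum('ij,jk,ik->i', Z, G, Z)
    loo_errors = resid / (1.0 - leverages)
    return np.mean(loo_errors ** 2)
```

In the QuickNet prototype, this function can serve directly as the `score` criterion for cross-section data.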

Alternatively, one can calculate the "delete-d" cross-validated mean squared error, CVMSE^{(d)} [Geisser (1975)]. For this, let S be a collection of N subsets s of {1, ..., n} containing d elements, and let ε̂_{qt(−s)} be the prediction error for observation t computed using estimators α̂_{0(−s)} and β̂_{qj(−s)}, j = 1, ..., q, obtained by omitting the observations in the set s from the estimation sample. Then CVMSE^{(d)} is computed as

\[
\operatorname{CVMSE}^{(d)}(q) = \frac{1}{dN}\sum_{s \in S}\sum_{t \in s} \hat\varepsilon_{qt(-s)}^{2}.
\]

Shao (1993, 1997) analyzes the model selection performance of these cross-validation measures and relates their performance to other well-known model selection procedures in a context that accommodates cross-section but not time-series data. Shao (1993, 1997) gives general conditions establishing that given model selection procedures are either "consistent" or "asymptotically loss efficient". A consistent procedure is one that selects the best q-term (now q = q_n) approximation with probability approaching one as n increases. An asymptotically loss efficient procedure is one that selects a model such that the ratio of the sample mean squared error of the selected q-term model to that of the truly best q-term model approaches one in probability. Consistency of selection is a stronger property than asymptotic loss efficiency.

The performance of the various procedures depends crucially on whether the model is misspecified (Shao's "Class 1") or correctly specified (Shao's "Class 2"). Given our focus on misspecified models, Class 1 is the one directly relevant here, but the comparison with performance under Class 2 is nevertheless of interest. Put succinctly, Shao (1997) shows that for Class 1, under general conditions, CVMSE^{(1)} is consistent for model selection, as is CVMSE^{(d)}, provided d/n → 0 [Shao (1997, Theorem 4; see also p. 234)]. These methods behave asymptotically equivalently to AIC, GCV, and Mallows's C_p. Further, for Class 1, CVMSE^{(d)} is asymptotically loss efficient given d/n → 1 and q/(n − d) → 0 [Shao (1997, Theorem 5)]. With these weaker conditions on d, CVMSE^{(d)} behaves asymptotically equivalently to BIC.

In contrast, for Class 2 (correctly specified models) in which the correct specification is not unique (e.g., there are terms whose optimal coefficients are zero), under Shao's conditions CVMSE^{(1)} and its equivalents (AIC, GCV, C_p) are asymptotically loss efficient but not consistent, as they tend to select more terms than necessary. In contrast, CVMSE^{(d)} is consistent provided d/n → 1 and q/(n − d) → 0, as is BIC [Shao (1997, Theorem 5)]. The interested reader is referred to Shao (1993, 1997) and to the discussion following Shao (1997) for details and additional guidance and insight.

Given these properties, it may be useful as a practical procedure in cross-section applications to compute CVMSE^{(d)} for a substantial range of values of d, to identify an interval of values of d for which the selected model is relatively stable, and to use that model for forecasting purposes.

In cross-section applications, the subsets of observations s used for cross-validation can be populated by selecting observations at random from the estimation data. In time series applications, however, adjacent observations are typically stochastically dependent, so random selection of observations is no longer appropriate. Instead, cross-validation observations should be obtained by removing blocks of contiguous observations in order to preserve the dependence structure of the data. A straightforward analog of CVMSE^{(d)} is "h-block" cross-validation [Burman, Chow and Nolan (1994)], whose objective function CVMSE_h can be expressed as

\[
\operatorname{CVMSE}_h(q) = \frac{1}{n}\sum_{t=1}^{n} \hat\varepsilon_{qt(-t:h)}^{2},
\]

where ε̂_{qt(−t:h)} is the prediction error for observation t computed using estimators α̂_{0(−t:h)} and β̂_{qj(−t:h)}, j = 1, ..., q, obtained by omitting a block of h observations on either side of observation t from the estimation sample, that is,

\[
\hat\varepsilon_{qt(-t:h)} = Y_t - X_t'\hat\alpha_{0(-t:h)} - \sum_{j=1}^{q} \psi(X_t'\hat\gamma_j)\,\hat\beta_{qj(-t:h)}.
\]

Racine (2000) shows that with the data dependence typical of economic time series, CVMSE_h is inconsistent for model selection in the sense of Shao (1993, 1997). An important contributor to this inconsistency, not present in the framework of Shao (1993, 1997), is the dependence between the observations of the omitted blocks and the remaining observations.

As an alternative, Racine (2000) introduces a provably consistent model selection method for Shao's Class 2 (correctly specified) case that he calls "hv-block" cross-validation. In this method, for given t one removes v "validation" observations on either side of that observation (a block of n_v = 2v + 1 observations) and computes the mean squared error for this validation block using estimates obtained from a sample that omits not only the validation block but also an additional block of h observations on either side of the validation block. Estimation for a given t is thus performed on a set of n_e = n − 2h − 2v − 1 observations. (The size of the estimation set is somewhat different for t near 1 or near n.)

One obtains CVMSE_hv by averaging the CVMSE for each validation block over all n − 2v available validation blocks, indexed by t = v + 1, ..., n − v. With a suitable choice of h [e.g., h = int(n^{1/4}), as suggested by Racine (2000)], this approach can be proven to induce sufficient independence between the validation block and the remaining observations to ensure consistent variable selection. Although Racine (2000) finds that h = int(n^{1/4}) appears to work well in practice, the practical choice of h is still an interesting area warranting further research.

Mathematically, we can represent CVMSE_hv as

\[
\operatorname{CVMSE}_{hv}(q) = \frac{1}{n - 2v}\sum_{t=v+1}^{n-v}\Biggl[\frac{1}{n_v}\sum_{\tau=t-v}^{t+v} \hat\varepsilon_{q\tau(-t:h,v)}^{2}\Biggr].
\]

(Note that a typo appears in Racine's article; the first summation above must begin at v + 1, not v.) Here ε̂_{qτ(−t:h,v)} is the prediction error for observation τ computed using estimators α̂_{0(−t:h,v)} and β̂_{qj(−t:h,v)}, j = 1, ..., q, obtained by omitting a block of h + v observations on either side of observation t from the estimation sample, that is,

\[
\hat\varepsilon_{q\tau(-t:h,v)} = Y_\tau - X_\tau'\hat\alpha_{0(-t:h,v)} - \sum_{j=1}^{q} \psi(X_\tau'\hat\gamma_j)\,\hat\beta_{qj(-t:h,v)}.
\]
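A hedged sketch of hv-block cross-validation for a model linear in the parameters (our own simplified rendering; it re-estimates by OLS for each t, ignores the computational shortcuts discussed below, and uses 0-based indexing, so the validation centres run over t = v, ..., n − v − 1):

```python
import numpy as np

def cvmse_hv(Z, Y, h, v):
    """hv-block cross-validated MSE for an OLS regression of Y on Z.
    For each centre t, the validation block is the 2v+1 observations around t;
    the estimation sample omits that block plus h further observations on
    each side. The validation-block MSEs are averaged over all centres."""
    n = len(Y)
    block_mses = []
    for t in range(v, n - v):
        val = np.arange(t - v, t + v + 1)                    # validation block
        omit_lo, omit_hi = max(0, t - v - h), min(n, t + v + h + 1)
        keep = np.r_[np.arange(0, omit_lo), np.arange(omit_hi, n)]
        coef, *_ = np.linalg.lstsq(Z[keep], Y[keep], rcond=None)
        errs = Y[val] - Z[val] @ coef
        block_mses.append(np.mean(errs ** 2))
    return np.mean(block_mses)
```

With h and v chosen as described next, this criterion can replace the CVMSE used in Step 2 for time series applications.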

Racine shows that CVMSE_hv leads to consistent variable selection for Shao's Class 2 case by taking h to be sufficiently large (controlling dependence) and taking

\[
v = \bigl(n - \operatorname{int}(n^{\delta}) - 2h - 1\bigr)/2,
\]

where int(n^δ) denotes the integer part of n^δ, and δ is chosen such that ln(q̄)/ln(n) < δ < 1. In some simulations, Racine observes good performance taking h = int(n^γ) with γ = 0.25 and δ = 0.5. Observe that, analogous to the requirement d/n → 1 in Shao's Class 2 case, Racine's choice leads to 2v/n → 1.

Although Racine does not provide results for Shao's Class 1 (misspecified) case, it is quite plausible that for Class 1, asymptotic loss efficiency holds with h and v behaving as specified above, and that consistency of selection holds with h as above and with v/n → 0, parallel to Shao's requirements for Class 1. In any case, the performance of Racine's hv-block cross-validation generally, and in QuickNet in particular, is an appealing topic for further investigation. Some evidence on this point emerges in our examples of Section 7.

Although hv-block cross-validation appears conceptually straightforward, one may have concerns about the computational effort involved, in that, as just described, on the order of n² calculations are required. Nevertheless, as Racine (1997) shows, there are computational shortcuts for block cross-validation of linear models that make this exercise quite feasible, reducing the computations to order nh², a very considerable saving. (In fact, this can be further reduced to order n.) For models nonlinear in the parameters the same shortcuts are not available, so not only are the required computations of order n², but the computational challenges posed by nonconvexities and nonconvergence are further exacerbated by a factor of approximately n. This provides another very strong motivation for working with models linear in the parameters. We comment further on the challenges posed by models nonlinear in the parameters when we discuss our empirical examples in Section 7.

The results described in this section are asymptotic results. For example, for Shao's results, q = q_n may depend explicitly on n, with q_n → ∞, provided q_n/(n − d) → 0. In our discussion of previous sections, we have taken q ≤ q̄ < ∞, but this has been simply for convenience. Letting q̄ = q̄_n such that q̄_n → ∞, with suitable restrictions on the rate at which q̄_n diverges, one can obtain formal results describing the asymptotic behavior of the resulting nonparametric estimators via the method of sieves. The interested reader is referred to Chen (2005) for an extensive survey of sieve methods.

Before concluding this section, we briefly discuss some potentially useful variants of the prototype algorithm specified above. One obvious possibility is to use CVMSE_hv to select the linear predictors in Step 0, and then to select more than one hidden unit term
