… using this last step, but also because it facilitates a feasible computation of an approximation to the cross-validated MSE.
Although we touched on this issue only briefly above, it is now necessary to confront head-on the challenges for cross-validation posed by models nonlinear in the parameters. The challenge is that in order to compute exactly the cross-validated MSE associated with any given nonlinear model, one must compute the NLS parameter estimates obtained by holding out each required validation block of observations. There are roughly as many validation blocks as there are observations (thousands here). This multiplies by the number of validation blocks the difficulties presented by the convergence problems encountered in a single NLS optimization over the entire estimation data set. Even if this did not present a logistical quagmire (which it surely does), it also requires a huge increase in the required computations (a factor of approximately 1700 here). Some means of approximating the cross-validated MSE is thus required. Here we adopt the expedient of viewing the hidden unit coefficients obtained by the initial NLS on the estimation set as identifying potentially useful predictive transforms of the underlying variables and holding these fixed in cross-validation. Thus we only need to re-compute the hidden-to-output coefficients by OLS for each validation block. As mentioned above, this can be done in a highly computationally efficient manner using Racine's (1997) feasible block cross-validation method. This might well result in overly optimistic cross-validated estimates of MSE, but without some such approximation, the exercise is not feasible. (The exercise avoiding such approximations might be feasible on a supercomputer, but, as we see shortly, this brute force NLS approach is dominated by QuickNet, so the effort is not likely justified.)
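A minimal sketch of this approximation, in Python, may make the idea concrete. It is not Racine's (1997) algorithm itself; the function name, the simple logistic activation, and the contiguous validation blocks are illustrative assumptions only.

```python
import numpy as np

def approx_block_cv_mse(X, y, hidden_weights, block_size=1):
    """Approximate cross-validated MSE: the input-to-hidden coefficients from the
    full-sample NLS fit are held fixed, and only the hidden-to-output coefficients
    are re-estimated by OLS each time a validation block is held out."""
    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = logistic(X @ hidden_weights)                 # fixed nonlinear transforms of the predictors
    Z = np.column_stack([np.ones(len(y)), X, H])     # constant, linear terms, and hidden-unit terms

    n = len(y)
    sq_err = np.empty(n)
    for start in range(0, n, block_size):
        val = np.arange(start, min(start + block_size, n))      # held-out validation block
        est = np.setdiff1d(np.arange(n), val)                   # remaining observations
        beta, *_ = np.linalg.lstsq(Z[est], y[est], rcond=None)  # OLS refit on the rest
        sq_err[val] = (y[val] - Z[val] @ beta) ** 2
    return sq_err.mean()
```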
Table 1 reports a subset of the results for this first exercise. Here we report two summary measures of goodness of fit: mean squared error (MSE) and R-squared (R²). We report these measures for the estimation sample, the cross-validation sample (CV), and the hold-out sample (Hold-Out). For the estimation sample, R² is the standard multiple correlation coefficient. For the cross-validation sample, R² is computed as one minus the ratio of the cross-validated MSE to the estimation sample variance of the dependent variable. For the hold-out sample, R² is computed as one minus the ratio of the hold-out MSE to the hold-out sample variance of the dependent variable about the estimation sample mean of the dependent variable. Thus, we can observe negative values for the CV and Hold-Out R²'s. A positive value for the Hold-Out R² indicates that the out-of-sample predictive performance of the estimated model is better than that afforded by the simple constant prediction provided by the estimation sample mean of the dependent variable.
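In formulas (the notation here is ours), letting $\bar{y}_{\mathrm{est}}$ and $\hat{\sigma}^{2}_{\mathrm{est}}$ denote the estimation sample mean and variance of the dependent variable, $\mathcal{H}$ the hold-out sample of size $n_{h}$, and $\hat{y}_{t}$ the model's prediction, these measures are
$$R^{2}_{\mathrm{CV}} = 1 - \frac{\mathrm{MSE}_{\mathrm{CV}}}{\hat{\sigma}^{2}_{\mathrm{est}}}, \qquad R^{2}_{\text{Hold-Out}} = 1 - \frac{n_{h}^{-1}\sum_{t \in \mathcal{H}} (y_{t} - \hat{y}_{t})^{2}}{n_{h}^{-1}\sum_{t \in \mathcal{H}} (y_{t} - \bar{y}_{\mathrm{est}})^{2}},$$
so either measure is negative whenever the corresponding MSE exceeds the benchmark variance in its denominator.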
From Table 1 we see that, as expected, the estimation R² is never very large, ranging from a low of about 0.0089 to a high of about 0.0315. For the full experiment, the greatest estimation sample R² is about 0.0647, occurring with 50 hidden units (not shown). This apparently good performance is belied by the uniformly negative CV R²'s. Although the best CV R² or MSE (indicated by "∗") identifies the model with the best Hold-Out R² (indicated by "∧"), that is, the model with only linear predictors (zero hidden units), this model has a negative Hold-Out R², indicating that it does not even perform as well as using the estimation sample mean as a predictor in the hold-out sample.
Table 1 S&P 500: Naive nonlinear least squares – Logistic
Summary goodness of fit
Hidden units    Estimation MSE    CV MSE    Hold-out MSE    Estimation R-squared    CV R-squared    Hold-out R-squared
0 1.67890 1.79932∗ 0.55548 0.00886 −0.06223 −0.03016 ∧,∗
10 1.66970 1.94597 0.56098 0.01429 −0.14880 −0.04037
11 1.64669 1.87287 0.58445 0.02788 −0.10565 −0.08390
12 1.65209 1.85557 0.55982 0.02469 −0.09544 −0.03822
13 1.64594 2.03215 0.56302 0.02832 −0.19968 −0.04415
14 1.64064 1.91624 0.58246 0.03145 −0.13125 −0.08020
15 1.64342 2.00411 0.57788 0.02981 −0.18313 −0.07170
16 1.65963 2.00244 0.57707 0.02024 −0.18214 −0.07021
17 1.65444 2.05466 0.58594 0.02330 −0.21297 −0.08665
18 1.64254 1.98832 0.60214 0.03033 −0.17381 −0.11670
19 1.65228 2.01295 0.59406 0.02458 −0.18835 −0.10172
20 1.64575 2.09084 0.60126 0.02843 −0.23432 −0.11506
This unimpressive prediction performance is entirely expected, given our earlier discussion of the implications of the efficient market hypothesis, but what might not have been expected is the erratic behavior we see in the estimation sample MSEs. We see that as we consider increasingly flexible models, we do not observe increasingly better in-sample fits. Instead, the fit first improves for hidden units one and two, then worsens for hidden unit three, then at hidden units four and five improves dramatically, then worsens for hidden unit six, and so on, bouncing around here and there. Such behavior will not be surprising to those with prior ANN experience, but it can be disconcerting to those not previously inoculated.
The erratic behavior we have just observed is in fact a direct consequence of the challenging nonconvexity of the NLS objective function induced by the nonlinearity in parameters of the ANN model, coupled with our choice of a new set of random starting values for the coefficients at each hidden unit addition. This behavior directly reflects and illustrates the challenges posed by parameter nonlinearity pointed out earlier.
Table 2 S&P 500: Modified nonlinear least squares – Logistic
Summary goodness of fit
Hidden units    Estimation MSE    CV MSE    Hold-out MSE    Estimation R-squared    CV R-squared    Hold-out R-squared
0 1.67890 1.79932∗ 0.55548 0.00886 −0.06223 −0.03016 ∧,∗
This erratic estimation performance opens the possibility that the observed poor predictive performance could be due not to the inherent unpredictability of the target variable, but rather to the poor estimation job done by the brute force NLS approach. We next investigate the consequences of using a modified NLS that is designed to eliminate this erratic behavior. This modified NLS method picks initial values for the coefficients at each stage in a manner designed to yield increasingly better in-sample fits as flexibility increases. We simply use as initial values the final values found for the coefficients in the previous stage and select new initial coefficients at random only for the new hidden unit added at that stage; this implements a simple homotopy method.
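A minimal sketch of this initialization scheme follows; the function name and the scale of the random draws are illustrative assumptions, not part of the original procedure's specification.

```python
import numpy as np

def homotopy_initial_values(prev_hidden_weights, n_inputs, rng):
    """Starting values for the stage with one more hidden unit: carry over the
    converged coefficients from the previous stage and draw random values only
    for the newly added unit."""
    new_unit = rng.normal(scale=1.0, size=(n_inputs, 1))   # random start for the new unit only
    if prev_hidden_weights is None:                        # first hidden unit: nothing to carry over
        return new_unit
    return np.hstack([prev_hidden_weights, new_unit])
```

Each NLS stage then optimizes all coefficients starting from these values, so the fit attained at the start of a stage is never worse than the fit attained at the end of the previous one.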
We present the results of this next exercise in Table 2. Now we see that the in-sample MSEs behave as expected, decreasing nicely as flexibility increases. On the other hand, whereas our naïve brute force approach found a solution with only five hidden units delivering an estimation sample R² of 0.0293, this second approach requires 30 hidden units (not reported here) to achieve a comparable in-sample fit. Once again the best CV performance occurs with zero hidden units, corresponding to the best (but negative) out-of-sample R². Clearly, this modification to naïve brute force NLS does not resolve the question of whether the so far unimpressive results could be due to poor estimation performance, as the estimation performance of the naïve method is better, even if more erratic. Can QuickNet provide a solution?
Table 3 S&P 500: QuickNet – Logistic
Summary goodness of fit
Hidden units    Estimation MSE    CV MSE    Hold-out MSE    Estimation R-squared    CV R-squared    Hold-out R-squared
11 1.57871 1.75054∗ 0.64341 0.06801 −0.03343 −0.19323∗
Table 3 reports the results of applying QuickNet to our S&P 500 data, again with the logistic cdf activation function. At each iteration of Step 1, we selected the best of m = 500 candidate units and applied cross-validation using OLS, taking the hidden unit coefficients as given. Here we see much better performance in the CV and estimation samples than we saw in either of the two NLS approaches. The estimation sample MSEs decrease monotonically, as we should expect. Further, we see the CV MSE first decreasing and then increasing as one would like, identifying an optimal complexity of eleven hidden units for the nonlinear model. The estimation sample R² for this CV-best model is 0.0634, much better than the value of 0.0293 found by the CV-best model in Table 1, and the CV MSE is now 1.751, much better than the corresponding best CV MSE of 1.800 found in Table 1.
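The sketch below conveys the flavor of this candidate-search step. The random-draw scheme for candidate coefficients and the use of squared correlation with the current residuals as the selection criterion are simplifying assumptions on our part, not a full statement of the QuickNet algorithm.

```python
import numpy as np

def select_candidate_unit(X, residuals, m=500, rng=np.random.default_rng(0)):
    """One QuickNet-style search step: draw m random candidate hidden units and
    keep the one whose output best explains the current residuals. The chosen
    unit's input coefficients are then taken as given, and the hidden-to-output
    coefficients are re-estimated by OLS."""
    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
    best_score, best_gamma = -np.inf, None
    for _ in range(m):
        gamma = rng.normal(size=X.shape[1])             # random candidate coefficients
        h = logistic(X @ gamma)                         # candidate hidden-unit output
        score = np.corrcoef(h, residuals)[0, 1] ** 2    # squared correlation with residuals
        if score > best_score:
            best_score, best_gamma = score, gamma
    return best_gamma
```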
Thus QuickNet does a much better job of fitting the data, in terms of both estimation and cross-validation measures. It is also much faster. Apart from the computation time required for cross-validation, which is comparable between the methods, QuickNet required 30.90 seconds to arrive at its solution, whereas naïve NLS and modified NLS required 600.30 and 561.46 seconds, respectively, to obtain inferior solutions in terms of estimation and cross-validated fit.
Another interesting piece of evidence related to the flexibility of ANNs and the relative fitting capabilities of the different methods applied here is that QuickNet delivered a maximum estimation R² of 0.1727, compared to 0.0647 for naïve NLS and 0.0553 for modified NLS, with 50 hidden units (not shown) generating each of these values. Comparing these and other results, it is clear that QuickNet rapidly delivers much better sample fits for given degrees of model complexity, just as it was designed to do.
A serious difficulty remains, however: the CV-best model identified by QuickNet is not at all a good model for the hold-out data, performing quite poorly. It is thus important to warn that even with a principled attempt to avoid overfitting via cross-validation, there is no guarantee that the CV-best model will perform well in real-world hold-out data. One possible explanation is that, even with cross-validation, the sheer flexibility of ANNs somehow makes them prone to over-fitting the data, viewed from the perspective of pure hold-out data.
Another strong possibility is that real world hold-out data can differ from the esti-mation (and thus cross-validation) data in important ways If the relationship between the target variable and its predictors changes between the estimation and hold-out data, then even if we have found a good prediction model using the estimation data, there
is no reason for that model to be useful on the hold-out data, where a different predic-tive relationship may hold A possible response to handling such situations is to proceed recursively for each out-of-sample observation, refitting the model as each new observa-tion becomes available For simplicity, we leave aside an investigaobserva-tion of such methods here
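For completeness, the recursive scheme just mentioned can be sketched as an expanding-window loop; the generic fit and predict callables below are placeholders rather than any specific estimator.

```python
import numpy as np

def recursive_one_step_forecasts(X, y, n_est, fit, predict):
    """Refit the model each time a new observation becomes available and
    produce a one-step-ahead forecast from the refitted model."""
    forecasts = []
    for t in range(n_est, len(y)):
        model = fit(X[:t], y[:t])               # re-estimate on all data observed so far
        forecasts.append(predict(model, X[t]))  # forecast the next observation
    return np.array(forecasts)
```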
This example underscores the usefulness of an out-of-sample evaluation of predictive performance. Our results illustrate that it can be quite dangerous to simply trust that the predictive relationship of interest is sufficiently stable to permit building a model useful for even a modest post-sample time frame.
Below we investigate the behavior of our methods in a less ambiguous environment, using artificial data to ensure (1) that there is in fact a nonlinear relationship to be uncovered, and (2) that the predictive relationship in the hold-out data is identical to that in the estimation data. Before turning to those results, however, we examine two alternatives to the standard logistic ANN applied so far. The first alternative is a ridgelet ANN, and the second is a non-neural-network method that uses familiar algebraic polynomials. The purpose of these experiments is to compare the standard ANN approach with a promising but less familiar ANN method and to contrast the ANN approaches with a more familiar benchmark.
In Table 4, we present an experiment identical to that of Table 3, except that instead of using the standard logistic cdf activation function, we use the ridgelet activation function
$$\Psi(z) = D^{5}\phi(z) = \bigl(-z^{5} + 10z^{3} - 15z\bigr)\phi(z),$$
where $\phi$ denotes the standard normal density.
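This activation is straightforward to evaluate numerically; a minimal sketch in Python follows (the function name is ours).

```python
import numpy as np

def ridgelet(z):
    """Ridgelet activation: the fifth derivative of the standard normal density,
    D^5 phi(z) = (-z**5 + 10*z**3 - 15*z) * phi(z)."""
    phi = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)   # standard normal pdf
    return (-z ** 5 + 10.0 * z ** 3 - 15.0 * z) * phi
```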
Table 4 S&P 500: QuickNet – Ridgelet
Summary goodness of fit
Hidden units    Estimation MSE    CV MSE    Hold-out MSE    Estimation R-squared    CV R-squared    Hold-out R-squared
⋮
39 1.33741 1.64768∗ 0.88580 0.21046 0.02729 −0.64277∗
The choice of h = 5 is dictated by the fact that k = 10 for the present example. As this is a nonpolynomial analytic activation function, it is also GCR, so we may expect QuickNet to perform well in sample. We emphasize that we are simply performing QuickNet with a ridgelet activation function and are not implementing any estimation procedure specified by Candès. The results given here thus do not necessarily put ridgelets in their best light, but they are nevertheless of interest, as they indicate what can be achieved with some fairly simple procedures.
Examining Table 4, we see results qualitatively similar to those for the logistic cdf activation function, but with the features noted there even more pronounced. Specifically, the estimation sample fit improves with additional complexity, but even more quickly, suggesting that the ridgelets are even more successful at fitting the estimation sample data patterns.
Table 5 S&P 500: QuickNet – Polynomial
Summary goodness of fit
Hidden units    Estimation MSE    CV MSE    Hold-out MSE    Estimation R-squared    CV R-squared    Hold-out R-squared
0 1.67890 1.79932∗ 0.55548 0.00886 −0.06223 −0.03016 ∧,∗
The estimation sample R² reaches a maximum of 0.2534 for 50 hidden units, an almost 50% increase over the best value for the logistic. The best CV performance occurs with 39 hidden units, with a CV R² that is actually positive (0.0273). As good as this performance is on the estimation and CV data, however, it is quite bad on the hold-out data. The Hold-Out R² with 39 ridgelet units is −0.643, reinforcing our comments above about the possible mismatch between the estimation and hold-out predictive relationships and the importance of hold-out sample evaluation.
In recent work, Hahn (1998) and Hirano and Imbens (2001) have suggested using algebraic polynomials for nonparametric estimation of certain conditional expectations arising in the estimation of causal effects. These polynomials thus represent a familiar and interesting benchmark against which to contrast our previous ANN results. In Table 5 we report the results of nonlinear approximation using algebraic polynomials, performed in a manner analogous to QuickNet. The estimation algorithm is identical, except that instead of randomly choosing m candidate hidden units as before, we now randomly choose m candidate monomials from which to construct polynomials.
For concreteness, and to control the erratic behavior that can result from the use of polynomials of too high a degree, we restrict ourselves to polynomials of degree less than or equal to 4.
As before, we always include linear terms, so we randomly select candidate monomials of degree between 2 and 4. The candidates were chosen as follows. First, we randomly selected the degree of the candidate monomial such that degrees 2, 3, and 4 had equal (1/3) probabilities of selection. Let the randomly chosen degree be denoted d. We then randomly selected d indexes with replacement from the set {1, ..., 9} and constructed the candidate monomial by multiplying together the variables corresponding to the selected indexes.
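A minimal sketch of this candidate-drawing scheme follows; the function name is ours, and X is assumed to hold the nine predictors as columns.

```python
import numpy as np

def random_monomial(X, rng=np.random.default_rng(0)):
    """Draw one candidate monomial: a degree d in {2, 3, 4} with equal probability,
    then d predictor indexes with replacement, then the product of those columns."""
    d = int(rng.integers(2, 5))                  # degree 2, 3, or 4, each with probability 1/3
    idx = rng.integers(0, X.shape[1], size=d)    # indexes drawn with replacement
    return np.prod(X[:, idx], axis=1)            # candidate regressor for the polynomial model
```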
The results of Table 5 are interesting in several respects. First, we see that although the estimation fits improve as additional terms are added, the improvement is nowhere near as rapid as it is for the ANN approaches. Even with 50 terms, the estimation R² only reaches 0.1422 (not shown). Most striking, however, is the extremely erratic behavior of the CV MSE. This bounces around, but generally trends up, reaching values as high as 41. As a consequence, the CV MSE ends up identifying the simple linear model as best, with its negative Hold-Out R². The erratic behavior of the CV MSE is traceable to extreme variation in the distributions of the included monomials. (Standard deviations can range from 2 to 150; moreover, simple rescaling cannot cure the problem, as the associated regression coefficients essentially undo any rescaling.) This variation causes the OLS estimates, which are highly sensitive to leverage points, to vary wildly in the cross-validation exercise, creating large CV errors and effectively rendering the CV MSE useless as an indicator of which polynomial model to select.
Our experiments so far have revealed some interesting properties of our methods, but because of the extremely challenging real-world forecasting environment to which they have been applied, we have not really been able to observe anything of their relative forecasting ability. To investigate the behavior of our methods in a more controlled environment, we now discuss a second set of experiments using artificial data in which we ensure (1) that there is in fact a nonlinear relationship to be uncovered, and (2) that the predictive relationship in the hold-out data is identical to that in the estimation data.
We achieve these goals by generating artificial estimation data according to the nonlinear relationship
$$Y^{*}_{t} = a\, f_{q}\bigl(X_{t}, \theta^{*}_{q}\bigr) + 0.1\,\varepsilon_{t},$$
with $q = 4$, where $X_{t} = (Y_{t-1}, Y_{t-2}, Y_{t-3}, |Y_{t-1}|, |Y_{t-2}|, |Y_{t-3}|, R_{t-1}, R_{t-2}, R_{t-3})$, as in the original estimation data (note that $X_{t}$ contains lags of the original $Y_{t}$ and not lags of $Y^{*}_{t}$). In particular, we take $\Psi$ to be the logistic cdf and set
$$f_{q}\bigl(x, \theta^{*}_{q}\bigr) = x\alpha^{*} + \sum_{j=1}^{q} \Psi\bigl(x\gamma^{*}_{j}\bigr)\beta^{*}_{qj},$$
where $\varepsilon_{t} = Y_{t} - f_{q}\bigl(X_{t}, \theta^{*}_{q}\bigr)$, and with $\theta^{*}_{q}$ obtained by applying QuickNet (logistic) to the original estimation data with four hidden units. We choose $a$ to ensure that $Y^{*}_{t}$ exhibits the same unconditional standard deviation in the simulated data as it does in the actual data. The result is an artificial series of returns that contains an "amplified" nonlinear signal relative to the noise constituted by $\varepsilon_{t}$.
Table 6 Artificial data: Ideal specification
Summary goodness of fit
Hidden units    Estimation MSE    CV MSE    Hold-out MSE    Estimation R-squared    CV R-squared    Hold-out R-squared
4 0.43081 0.45147∗ 0.45279 0.74567 0.73348 0.57439 ∧,∗
We generate hold-out data according to the same relationship using the actual $X_{t}$'s, but now with $\varepsilon_{t}$ generated as i.i.d. normal with mean zero and standard deviation equal to that of the errors in the estimation sample. The maximum possible hold-out sample R² turns out to be 0.574, which occurs when the model uses precisely the right set of coefficients for each of the four hidden units. The relationship is decidedly nonlinear, as a linear predictor alone delivers a Hold-Out R² of only 0.0667. The results of applying the precisely right hidden units are presented in Table 6.
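A compact sketch of this data-generating process follows; the function and argument names are ours, alpha, gammas, and betas stand for the QuickNet-fitted coefficients $\theta^{*}_{q}$, and X is assumed to include whatever constant term the fitted model uses.

```python
import numpy as np

def simulate_artificial_y(X, alpha, gammas, betas, a, eps, noise_scale=0.1):
    """Generate Y*_t = a * f_q(X_t, theta*_q) + 0.1 * eps_t, where f_q is the
    four-hidden-unit logistic ANN fitted to the original estimation data."""
    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
    f = X @ alpha + sum(logistic(X @ g) * b for g, b in zip(gammas, betas))
    return a * f + noise_scale * eps
```

In the estimation sample, eps is the vector of fitted residuals $Y_{t} - f_{q}(X_{t}, \theta^{*}_{q})$; in the hold-out sample it is drawn i.i.d. normal with matching standard deviation, as described above.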
First we apply naïve NLS to these data, parallel to the results discussed for Table 1. Again we choose initial values for the coefficients at random. Given that the ideal hidden unit coefficients are located in a 40-dimensional space, there is little likelihood of stumbling upon them, so even though the model is in principle correctly specified for specifications with four or more hidden units, whatever results we obtain must be viewed as an approximation to an unknown nonlinear predictive relationship.
We report our naïve NLS results in Table 7. Here we again see the bouncing pattern of in-sample MSEs first seen in Table 1, but now the CV-best model, containing eight hidden units, also exhibits locally superior hold-out sample performance. For the CV-best model, the estimation sample R² is 0.6228, the CV sample R² is 0.5405, and the Hold-Out R² is 0.3914. We also include in Table 7 the model that has the best Hold-Out R², which has 49 hidden units. For this model the Hold-Out R² is 0.4700; however, the CV sample R² is only 0.1750, so this even better model would not have appeared as a viable candidate. Despite this, these results are encouraging, in that now the ANN model identifies and delivers rather good predictive performance, both in and out of sample.
Table 8 displays the results of using the modified NLS procedure, parallel to Table 2. Now the estimation sample MSEs decline monotonically, but the CV MSEs never approach those seen in Table 7. The best CV R² is 0.4072, which corresponds to a Hold-Out R² of 0.286. The best Hold-Out R² of 0.3879 occurs with 41 hidden units, but again this would not have appeared as a viable candidate, as the corresponding CV R² is only 0.3251.
Table 7 Artificial data: Naive nonlinear least squares – Logistic
Summary goodness of fit
Hidden units    Estimation MSE    CV MSE    Hold-out MSE    Estimation R-squared    CV R-squared    Hold-out R-squared
⋮
Next we examine the results obtained by QuickNet, parallel to the results of Table 3. In Table 9 we observe quite encouraging performance. The CV-best configuration has 33 hidden units, with a CV R² of 0.6484 and a corresponding Hold-Out R² of 0.5430. This is quite close to the maximum possible value of 0.574 obtained by using precisely the right hidden units. Further, the true best hold-out performance has a Hold-Out R² of 0.5510, using 49 hidden units, not much different from that of the CV-best model. The corresponding CV R² is 0.6215, also not much different from that observed for the CV-best model.
The required estimation time for QuickNet here is essentially identical to that reported above (about 31 seconds), but now naïve NLS takes 788.27 seconds and modified NLS requires 726.10 seconds.
In Table 10, we report the results of applying QuickNet with a ridgelet activation function. Given that the ridgelet basis is less smooth relative to our target function than the standard logistic ANN, which is ideally smooth in this sense, we should not expect
... coefficients for each of the four hidden units The relationship is decidedly nonlinear, as using a linear predictor alone delivers a Hold-Out R2 of only 0.0667 The results of applying the... equal to that of the errors in the estimation sample The maximum possible hold-out sample R2turns out to be 0.574, which occurs when the model uses precisely the right set of coefficients... ofTable Again we choose initial values for the coefficients at random Given that the ideal hid-den unit coefficients are located in a 40-dimensional space, there is little likelihoodof