Handbook of Economic Forecasting, part 52


in each iteration of Step 1, replacing the search for the maximally correlated hidden unit term with a more extensive variable selection procedure based on CVMSE_hv.

By replacing CVMSE with AIC, Cp, GCV, or other consistent methods for controlling model complexity, one can easily generate other potentially appealing members of the QuickNet family, as noted above. It is also of interest to consider the use of more recently developed methods for automated model building, such as PcGets [Hendry and Krolzig (2001)] and RETINA [Perez-Amaral, Gallo and White (2003, 2005)]. Using either (or both) of these approaches in Step 1 results in methods that can select multiple hidden unit terms at each iteration of Step 1. In these members of the QuickNet family, there is no need for Step 2; one simply iterates Step 1 until no further hidden unit terms are selected.

Related to these QuickNet family members are methods that use multiple hypothesis testing to control the family-wise error rate [FWER, see Westfall and Young (1993)], the false discovery rate [FDR, see Benjamini (1995) and Williams (2003)], or the false discovery proportion [FDP, see Lehmann and Romano (2005)] in selecting linear predictors in Step 0 and multiple hidden unit terms at each iteration of Step 1. [In so doing, care must be taken to use specification-robust standard errors, such as those of Gonçalves and White (2005).] Again, Step 2 is unnecessary; the algorithm stops when no further hidden unit terms are selected.
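As one concrete illustration of FDR control in such a selection step, the Benjamini–Hochberg step-up procedure can be sketched as follows. This is a generic FDR method offered only for illustration; the procedures cited in the text may differ in detail, and in practice the candidate p-values would come from specification-robust t-tests on the candidate terms.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of the
    hypotheses rejected while controlling the FDR at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank  # largest rank satisfying the step-up criterion
    return sorted(order[:k])
```

The selected indices would correspond to the hidden unit terms (or linear predictors) retained at the current iteration; the algorithm stops when this set comes back empty.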

6 Interpretational issues

The third challenge identified above to the use of nonlinear forecasts is the apparent difficulty of interpreting the resulting forecasts. This is perhaps an issue not so much of difficulty, but rather one of familiarity. Linear models are familiar and comfortable to most practitioners, whereas nonlinear models are less so. Practitioners may thus feel comfortable interpreting linear forecasts, but somewhat adrift interpreting nonlinear forecasts.

The comfort many practitioners feel with interpreting linear forecasts is not necessarily well founded, however. Forecasts from a linear model are commonly interpreted on the basis of the estimated coefficients of the model, using a standard interpretation for these estimates, namely that any given coefficient estimate is the estimate of the ceteris paribus effect of that coefficient's associated variable, that is, the effect of that variable holding all other variables constant. The forecast is then the net result of all of the competing effects of the variables in the model.

Unfortunately, this interpretation has validity only in highly specialized circumstances that are far removed from the context of most economic forecasting applications. Specifically, this interpretation can be justified essentially only in ideal circumstances where the predictors are error-free measures of variables causally related to the target variable, the linear model constitutes a correct specification of the causal relationship, the observations used for estimation have been generated in such a way that unobservable causal factors vary independently of the observable causal variables, and the forecaster (or some other agency) has, independently of the unobservable causal factors, set the values of the predictors that form the basis for the current forecast.

The familiar interpretation would fail if even one of these ideal conditions failed; however, in most economic forecasting contexts, none of these conditions hold. In almost all cases, the predictors are error-laden measurements of variables that may or may not be causally related to the target variable, so there is no necessary causal relationship pertinent to the forecasting exercise at hand. At most, there is a predictive relationship, embodied here by the conditional mean μ, and the model for this predictive relationship (either linear or nonlinear) is, as we have acknowledged above, typically misspecified. Moreover, the observations used for estimation have been generated outside the forecaster's (or any other sole agency's) control, as have the values of the predictors for the current forecast.

Faced with this reality, the familiar and comfortable interpretation thought to be available for linear forecasts cannot credibly be maintained. How, then, should one interpret forecasts, whether based on linear or nonlinear models? We proceed to give detailed answers to this question. Ex post, we hope the answers will appear to be obvious. Nevertheless, given the frequent objection to nonlinear models on the grounds that they are difficult to interpret, it appears to be worth some effort to show that there is nothing particularly difficult or mysterious about nonlinear forecasts: the interpretation of both linear and nonlinear forecasts is essentially similar. Further, our discussion highlights some important practical issues and methods that can be critical to the successful use of nonlinear models for forecasting.

6.1 Interpreting approximation-based forecasts

There are several layers available in the interpretation of our forecasts. The first and most direct interpretation is that developed in Sections 1 and 2 above: our forecasts are optimal approximations to the MSE-optimal prediction of the target variable given the predictors, namely the conditional mean. The approximation occurs on two levels. One is a functional approximation arising from the likely misspecification of the parameterized model. The other is a statistical approximation arising from our use of sample distributions instead of population distributions. This interpretation is identical for both linear and nonlinear models.

In the familiar, comfortable, and untenable interpretation for linear forecasts described above, the meaning of the estimated coefficients endows the forecast with its interpretation. Here the situation is precisely the opposite: the interpretation of the forecast gives the estimated coefficients their meaning; the estimated coefficients are simply those that deliver the optimal approximation, whether linear or nonlinear.

6.2 Explaining remarkable forecast outcomes

It is, however, possible to go further and to explain why a forecast takes a particular value, in a manner parallel to the explanation afforded by the familiar linear interpretation when it validly applies. As we shall shortly see, this understanding obtains in a manner that is highly parallel for the linear and nonlinear cases, although the greater flexibility in the nonlinear case does lead to some additional nuances.

To explore this next layer of interpretation, we begin by identifying the circumstance to be explained. We first consider the circumstance that a forecast outcome is in some sense remarkable. For example, we may be interested in answering the question, "Why is our forecast quite different from the simple expectation of our target variable?"

When put this way, the answer quickly becomes obvious. Nevertheless, it is helpful to consider this question in a little detail, from both the population and the sample point of view. This leads not only to useful insights but also to some important practical procedures. We begin with the population view for clarity and simplicity. The understanding obtained here then provides a basis for understanding the sample situation.

6.2.1 Population-based forecast explanation

Because our forecasts are generated by our parameterization, for the population setting we are interested in understanding how the difference

δ(X_t) ≡ f(X_t, θ*) − μ̄

arises, where μ̄ is the unconditional mean of the target variable, μ̄ ≡ E(Y_t). If this difference is large or otherwise unusual, then there is some explaining to do, and otherwise not.

We distinguish between values that, when viewed unconditionally, are unusual and values that are extreme. We provide a formal definition of these concepts below. For now, it suffices to work with the heuristic understanding that extreme values are particularly large magnitude values of either sign, and that unusual values are not necessarily extreme but (unconditionally) have low probability density. (Consider a bimodal density with well separated modes – values lying between the modes may be unusual although not extreme in the usual sense.) Extreme values may well be unusual, but are not necessarily so. For convenience, we call values that are either extreme or unusual "remarkable".

Put this way, the explanation for remarkable forecast outcomes clearly lies in the conditioning. That is, what would otherwise be remarkable is no longer remarkable (indeed, is least remarkable in a precise sense), once one accounts for the conditioning. Two aspects of the conditioning are involved: the behavior of X_t (that is, the conditions underlying the conditioning) and the properties of f*(·) ≡ f(·, θ*) (the conditioning relationship and our approximation to it).

With regard to the properties of f*, for present purposes it is more relevant to distinguish between parameterizations monotone or nonmonotone in the predictors than to distinguish between parameterizations linear or nonlinear in the predictors. We say that f* is monotone if f* is (weakly) monotone in each of its arguments (as is true if f*(X_t) is in fact linear in X_t); we say that f* is nonmonotone if f* is not monotone (either strongly or weakly) in at least one of its arguments.


If f* is monotone, remarkable values of δ(X_t) must arise from remarkable values of X_t. The converse is not true, as remarkable values of different elements of X_t can cancel one another out and yield unremarkable values for δ(X_t).

If f* is not monotone, then extreme values of δ(X_t) may or may not arise from extreme values of X_t. Values for δ(X_t) that are unusual but not extreme must arise from unusual values for X_t, but the converse is not true, as nonmonotonicities permit unusual values for X_t to nevertheless result in common values for δ(X_t).

From these considerations, it follows that insight into the genesis of a particular instance of δ(X_t) can be gained by comparing δ(X_t) to its distribution and X_t to its distribution, and observing whether one, both, or neither of these exhibits unconditionally extreme or unusual values.

There is thus a variety of distinct cases, with differing interpretations. As the monotonicity of f* is either known a priori (as in the linear case) or in principle ascertainable given θ* (or its estimate, as below), it is both practical and convenient to partition the cases according to whether or not f* is monotone. We have the following straightforward taxonomy.

Explanatory taxonomy of prediction

Case I: f* monotone

A. δ(X_t) not remarkable and X_t not remarkable:
   Nothing remarkable to explain.
B. δ(X_t) not remarkable and X_t remarkable:
   Remarkable values for X_t cancel out to produce an unremarkable forecast.
C. δ(X_t) remarkable and X_t not remarkable:
   Ruled out.
D. δ(X_t) remarkable and X_t remarkable:
   Remarkable forecast explained by remarkable values for predictors.

Case II: f* not monotone

A. δ(X_t) not remarkable and X_t not remarkable:
   Nothing remarkable to explain.
B. δ(X_t) not remarkable and X_t remarkable:
   Either remarkable values for X_t cancel out to produce an unremarkable forecast, or (perhaps more likely) nonmonotonicities operate to produce an unremarkable forecast.
C.1. δ(X_t) unusual but not extreme and X_t not remarkable:
   Ruled out.
C.2. δ(X_t) extreme and X_t not remarkable:
   Extreme forecast explained by nonmonotonicities.
D.1. δ(X_t) unusual but not extreme and X_t unusual but not extreme:
   Unusual forecast explained by unusual predictors.
D.2. δ(X_t) unusual but not extreme and X_t extreme:
   Unusual forecast explained by nonmonotonicities.
D.3. δ(X_t) extreme and X_t unusual but not extreme:
   Extreme forecast explained by nonmonotonicities.
D.4. δ(X_t) extreme and X_t extreme:
   Extreme forecast explained by extreme predictors.

In assessing which interpretation applies, one first determines whether or not f* is monotone and then assesses whether δ(X_t) is extreme or unusual relative to its unconditional distribution, and similarly for X_t. In the population setting this can be done using the respective probability density functions. In the sample setting, these densities are not available, so appropriate sample statistics must be brought to bear. We discuss some useful approaches below.

We also remind ourselves that when unusual values for X_t underlie a given forecast, the approximation f*(X_t) to μ(X_t) is necessarily less accurate by construction. (Recall that AMSE weights the approximation squared error by dH, the joint density of X_t.) This affects interpretations I.B, I.D, II.B, and II.D.

6.2.2 Sample-based forecast explanation

In practice, we observe only a sample from the underlying population, not the population itself. Consequently, we replace the unknown population value θ* with an estimator θ̂, and the circumstance to be explained is the difference

δ̂(X_{n+1}) ≡ f(X_{n+1}, θ̂) − Ȳ

between our point forecast f(X_{n+1}, θ̂) and the sample mean Ȳ ≡ (1/n) Σ_{t=1}^{n} Y_t, which provides a consistent estimator of the population mean μ̄. Note that the generic observation index t used for the predictors in our discussion of the population situation has now been replaced with the out-of-sample index n + 1, to emphasize the out-of-sample nature of the forecast.

The taxonomy above remains identical, however, simply replacing population objects with their sample analogs, that is, by replacing f* with f̂(·) = f(·, θ̂), δ with δ̂, and the generic X_t with the out-of-sample X_{n+1}. With these replacements, we have the sample version of the Explanatory Taxonomy of Prediction. There is no need to state this explicitly.

In forecasting applications, one may be interested in explaining the outcomes of one or just a few predictions, or one may have a relatively large number of predictions (a hold-out sample) that one is potentially interested in explaining. In the former situation, the sample relevant for the explanation is the estimation sample; this is the only available basis for comparison in this case. In the latter situation, the hold-out sample is the one relevant for comparison, as it is the behavior of the predictors in the hold-out sample that is responsible for the behavior of the forecast outcomes.

Application of our taxonomy thus requires practical methods for identifying extreme and unusual observations relative either to the estimation or to the hold-out sample. The issues are identical in either case, but for concreteness, it is convenient to think in terms of the hold-out sample in what follows.

One way to proceed is to make use of estimates of the unconditional densities of Y_t and X_t. As Y_t is univariate, there are many methods available to estimate this density effectively, both parametric and nonparametric. Typically X_t is multivariate, and it is more challenging to estimate this multivariate distribution without making strong assumptions. Li and Racine (2003) give a discussion of the issues involved and a particularly appealing practical approach to estimating the density of multivariate X_t.

Given density estimates, one can make the taxonomy operational by defining precisely what is meant by "extreme" and "unusual" in terms of these densities. For example, one may define "α-extreme" values as those lying outside the smallest connected region containing at least probability mass 1 − α. Similarly, one may define "α-unusual" values as those lying in the largest region of the support containing no more than probability mass α.
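To make the α-unusual notion concrete, here is a rough empirical sketch (all function names are ours): each observation carries mass 1/n, and we flag the lowest-density observations whose combined mass does not exceed α. The simple Gaussian kernel density estimate and Silverman bandwidth are illustrative choices, not from the text.

```python
import numpy as np

def gaussian_kde_1d(sample, grid, bandwidth=None):
    """Simple Gaussian kernel density estimate evaluated on a grid."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if bandwidth is None:  # Silverman's rule of thumb (illustrative)
        bandwidth = 1.06 * sample.std() * len(sample) ** (-1 / 5)
    z = (grid[:, None] - sample[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

def alpha_unusual(values, alpha=0.05):
    """Flag observations falling in the lowest-density region that
    carries no more than empirical probability mass alpha."""
    values = np.asarray(values, dtype=float)
    dens = gaussian_kde_1d(values, values)   # density at each observation
    order = np.argsort(dens)                 # lowest density first
    mass = np.arange(1, len(values) + 1) / len(values)
    flags = np.zeros(len(values), dtype=bool)
    flags[order[mass <= alpha]] = True
    return flags
```

On a bimodal sample, this flags points lying between the two modes as unusual even though they are not extreme, matching the heuristic discussion above.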

Methods involving probability density estimates can be computationally intense, so it is also useful to have more "quick and dirty" methods available that identify extreme and unusual values according to specific criteria. For random scalars such as Y_t or f̂, it is often sufficient to rank order the sample values and declare any values in the upper or lower α/2 tails to be α-extreme. A quick and dirty way to identify extreme values of random vectors such as X_t is to construct a sample norm Z_t = ‖X_t‖, such as

‖X_t‖ = [(X_t − X̄)' Σ̂^{-1} (X_t − X̄)]^{1/2},

where X̄ is the sample mean of the X_t's and Σ̂ is the sample covariance of the X_t's. The α-extreme values can be taken to be those that lie in the upper α tail of the sample distribution of the scalar Z_t.
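The sample norm above is a Mahalanobis-type distance and can be computed directly; a minimal sketch (function names ours) follows.

```python
import numpy as np

def mahalanobis_norms(X):
    """Sample norm Z_t = [(X_t - Xbar)' S^{-1} (X_t - Xbar)]^{1/2},
    with Xbar the sample mean and S the sample covariance of the X_t."""
    X = np.asarray(X, dtype=float)
    Xbar = X.mean(axis=0)
    Sinv = np.linalg.inv(np.cov(X, rowvar=False))
    centered = X - Xbar
    return np.sqrt(np.einsum("ti,ij,tj->t", centered, Sinv, centered))

def alpha_extreme(X, alpha=0.05):
    """Flag rows of X whose norm Z_t lies in the upper alpha tail of
    the sample distribution of Z_t."""
    Z = mahalanobis_norms(X)
    return Z > np.quantile(Z, 1 - alpha)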

Even more simply, one can examine the predictors individually, as remarkable values for the predictors individually are sufficient but not necessary for remarkable values for the predictors jointly. Thus, one can examine the standardized values of the individual predictors for extremes. Unusual values of the individual predictors can often be identified on the basis of the spacing between their order statistics or, equivalently, on the average distance to a specified number of neighbors. This latter approach of computing the average distance to a specified number of neighbors may also work well in identifying unusual values of random vectors X_t.
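The average-distance-to-neighbors idea can be sketched as follows: a brute-force O(n²) version adequate for moderate samples, with the function name and the default k = 5 our own illustrative choices.

```python
import numpy as np

def knn_average_distance(X, k=5):
    """For each observation, the average Euclidean distance to its k
    nearest neighbors; large values suggest (unconditionally) unusual
    observations, whether scalar or vector-valued."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    diff = X[:, None, :] - X[None, :, :]   # pairwise differences
    D = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise distance matrix
    np.fill_diagonal(D, np.inf)            # exclude self-distance
    return np.sort(D, axis=1)[:, :k].mean(axis=1)
```

An isolated observation between two well-populated clusters receives by far the largest score, again capturing "unusual but not extreme" values.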

An interesting and important phenomenon that can and does occur in practice is that nonlinear forecasts can be so remarkable as to be crazy. Swanson and White (1995) observed such behavior in their study of forecasts based on ANNs and applied an "insanity filter" to deal with such cases. Swanson and White's insanity filter labels forecasts as "insane" if they are sufficiently extreme and replaces insane forecasts with the unconditional mean. An alternative procedure is to replace insane forecasts with a forecast from a less flexible model, such as a linear forecast.
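A minimal version of such a filter might look like the following sketch. The four-standard-deviation extremeness threshold is an illustrative assumption, not Swanson and White's actual rule, and the fallback argument accommodates the alternative of substituting a linear forecast.

```python
import numpy as np

def insanity_filter(forecasts, train_targets, n_sd=4.0, fallback=None):
    """Replace 'insane' forecasts -- here, those more than n_sd
    training-sample standard deviations from the training-sample mean --
    with a fallback value (default: the unconditional training mean, in
    the spirit of Swanson and White (1995); alternatively a vector of
    forecasts from a less flexible model). Threshold choice is ours."""
    forecasts = np.asarray(forecasts, dtype=float)
    mean, sd = np.mean(train_targets), np.std(train_targets)
    insane = np.abs(forecasts - mean) > n_sd * sd
    if fallback is None:
        fallback = np.full_like(forecasts, mean)
    return np.where(insane, fallback, forecasts), insane
```

The returned mask also serves the detection role emphasized below: flagged forecasts can be logged and inspected rather than silently replaced.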

Our explanatory taxonomy explains insane forecasts as special cases of II.C.2, II.D.3, and II.D.4; nonmonotonicities are involved in the first two cases, and both nonmonotonicities and extreme values of the predictors can be involved in the last case.


Users of nonlinear forecasts should constantly be aware of the possibility of remarkable and, particularly, insane forecasts, and should have methods ready for their detection and replacement, such as the insanity filter of Swanson and White (1995) or some variant.

6.3 Explaining adverse forecast outcomes

A third layer of interpretational issues impacting both linear and nonlinear forecasts concerns "reasons" and "reason codes". The application of sophisticated prediction models is increasing in a variety of consumer-oriented industries, such as consumer credit, mortgage lending, and insurance. In these applications, a broad array of regulations governs the use of such models. In particular, when prediction models are used to approve or deny applicants credit or other services or products, the applicant typically has a legal right to an explanation of the reason for the adverse decision. Usually these explanations take the form of one or more reasons, typically expressed in the form of "reason codes" that provide specific grounds for denial (e.g., "too many credit lines", "too many late payments", etc.).

In this context, concern about the difficulty of interpreting nonlinear forecasts translates into a concern about how to generate reasons and reason codes from such forecasts. Again, these concerns are perhaps due not so much to the difficulty of generating meaningful reason codes from nonlinear forecasts, but rather to a lack of experience with such forecasts. In fact, there are a variety of straightforward methods for generating reasons and reason codes from nonlinear forecasting models. We now briefly discuss a straightforward approach for generating these from either linear or nonlinear forecasts. As the application areas for reasons and reason codes almost always involve cross-section or panel data, it should be understood that the approach described below is targeted specifically to such data. Analogous methods may be applicable to time-series data, but we leave their discussion aside here.

As in the previous section, we specify the circumstance to be explained, which is now an adverse forecast outcome. In our example, this is a rejection or denial of an application for a consumer service or product. For concreteness, consider an application for credit. Commonly in this context, approval or denial may be based on attaining a sufficient "credit score", which is often a prediction from a forecasting model based on admissible applicant characteristics. If the credit score is below a specified cut-off level, the application will be denied. Thus, the circumstance to be explained is a forecast outcome that lies below a given target threshold.

A sound conceptual basis for explaining a denial is to provide a reasonable alternative set of applicant characteristics that would have generated the opposite outcome, an approval. (For example, "had there not been so many late payments in the credit file, the application would have been approved".) The notion of reasonableness can be formally expressed in a satisfactory way in circumstances where the predictors take values in a metric space, so that there is a well-defined notion of distance between predictor values. Given this, reasonableness can be equated to distance in the metric (although some metrics may be more appropriate in a given context than others). The explanation for the adverse outcome can now be formally specified as the fact that the predictors (e.g., applicant attributes) differ from the closest set of predictor values that generates the favorable outcome.

This approach, while conceptually appealing, may present challenges in applications. One set of challenges arises from the fact that predictors are often categorical in practice, and it may or may not be easy to embed categorical predictors in a metric space. Another set of challenges arises from the fact that even when metrics can be applied, they can, if not wisely chosen, generate explanations that invoke differences in every predictor. As the forecast may depend on potentially dozens of variables, the resultant explanation may be unsatisfying in the extreme.

The solution to these challenges is to apply a metric that is closely and carefully tied to the context of interest. When properly done, this makes it possible to generate a prioritized list of reasons for the adverse outcome (which can then be translated into prioritized reason codes) that is based on the univariate distance of specific relevant predictors from alternative values that generate favorable outcomes. To implement this approach, it suffices to suitably perturb each of the relevant predictors in turn and observe the behavior of the forecast outcome.

Clearly, this approach is equally applicable to linear or nonlinear forecasts. For continuous predictors, one increases or decreases each predictor until the outcome reaches the target threshold. For binary predictors, one "flips" the observed predictor to its complementary value and observes whether the forecast outcome exceeds the target threshold. For categorical predictors, one perturbs the observed category to each of its possible values and observes for which (if any) categories the outcome exceeds the target threshold.
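A schematic implementation of this univariate search might look as follows. The `score` function, the predictor names, the step size, and the step cap are all hypothetical illustration choices; in practice the score would be the fitted (linear or nonlinear) forecasting model.

```python
def sufficient_perturbations(score, x, threshold, continuous, binary,
                             step=0.1, max_steps=100):
    """Univariate perturbation search: for each predictor in turn,
    perturb it until the score crosses the approval threshold.
    `score` maps a predictor dict to a credit score; `continuous` and
    `binary` list predictor names. Step size and cap are illustrative."""
    found = {}
    for name in continuous:
        for direction in (+1.0, -1.0):   # try increasing, then decreasing
            trial = dict(x)
            for _ in range(max_steps):
                trial[name] += direction * step
                if score(trial) >= threshold:
                    found[name] = trial[name]   # sufficient perturbation
                    break
            if name in found:
                break
    for name in binary:
        trial = dict(x)
        trial[name] = 1 - trial[name]    # flip to the complementary value
        if score(trial) >= threshold:
            found[name] = trial[name]
    return found
```

Each entry of the returned dictionary is a sufficient perturbation in the sense defined below: had that predictor taken the perturbed value, the score would have cleared the threshold.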

If this process generates one or more perturbations that move the outcome past the target threshold, then these perturbations represent sufficient reasons for denial. We call these "sufficient perturbations" to indicate that if the predictor had been different in the specified way, then the score would have been sufficient for an approval. The sufficient perturbations can then be prioritized, and the corresponding reasons and reason codes prioritized accordingly.

When this univariate perturbation approach fails to generate any sufficient perturbations, one can proceed to identify joint perturbations that can together move the forecast outcome past the target threshold. A variety of approaches can be specified, but we leave these aside so as not to stray too far from our primary focus here.

Whether one uses a univariate or joint perturbation approach, one must next prioritize the perturbations. Here the chosen metric plays a critical role, as it is what measures the closeness of the perturbation to the observed value for the individual. Specifying a metric may be relatively straightforward for continuous predictors, as here one can, for example, measure the number of (unconditional) standard deviations between the observed and sufficient perturbed values. One can then prioritize the perturbations in order of increasing distance in these univariate metrics.

A straightforward way to prioritize binary/categorical variables is in order of the closeness to the threshold delivered by the perturbation. Those perturbations that deliver scores closer to the threshold can then be assigned top priority. This makes sense, however, only as long as perturbations that move the outcome closer to the threshold are in some sense "easier" or more accessible to the applicant. Here again the underlying metric plays a crucial role, and domain expertise must play a central role in specifying it.

Given that domain expertise is inevitably required for achieving sensible prioritizations (especially as between continuous and binary/categorical predictors), we do not delve into further detail here. Instead, we emphasize that this perturbation approach to the explanation of adverse forecast outcomes applies equally well to both linear and nonlinear forecasting models. Moreover, the considerations underlying the prioritization of reasons are identical in either instance. Given these identities, there is no necessary interpretational basis with respect to reasons and reason codes for preferring linear over nonlinear forecasts.

7 Empirical examples

7.1 Estimating nonlinear forecasting models

In order to illustrate some of the ideas and methods discussed in the previous sections, we now present two empirical examples, one using real data and the other using simulated data.

We first discuss a forecasting exercise in which the target variable to be predicted is the one-day percentage return on the S&P 500 index. Thus,

Y_t = 100 (P_t − P_{t−1}) / P_{t−1},

where P_t is the closing index value on day t for the S&P 500. As predictor variables X_t, we choose three lags of Y_t, three lags of |Y_t| (a measure of volatility), and three lags of the daily range expressed in percentage terms,

R_t = 100 (Hi_t − Lo_t) / Lo_t,

where Hi_t is the maximum value of the index on day t and Lo_t is the minimum value of the index on day t. R_t thus provides another measure of market volatility. With these choices we have

X_t = (Y_{t−1}, Y_{t−2}, Y_{t−3}, |Y_{t−1}|, |Y_{t−2}|, |Y_{t−3}|, R_{t−1}, R_{t−2}, R_{t−3})'.
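Constructing this predictor vector from raw index data can be sketched as follows; function and variable names are our own, and the actual exercise uses daily S&P 500 close, high, and low series.

```python
import numpy as np

def build_features(close, high, low, n_lags=3):
    """Build (X_t, Y_t) pairs from daily close/high/low series:
    Y_t is the one-day percentage return, and X_t stacks n_lags lags
    each of Y, |Y|, and the percentage daily range R."""
    close = np.asarray(close, dtype=float)
    y = 100.0 * (close[1:] - close[:-1]) / close[:-1]          # returns
    r = 100.0 * (np.asarray(high, float) - np.asarray(low, float)) \
        / np.asarray(low, float)                               # daily range
    r = r[1:]  # align the range series with the return series
    rows, targets = [], []
    for t in range(n_lags, len(y)):
        lags = y[t - n_lags:t][::-1]     # Y_{t-1}, Y_{t-2}, Y_{t-3}
        rlags = r[t - n_lags:t][::-1]    # R_{t-1}, R_{t-2}, R_{t-3}
        rows.append(np.concatenate([lags, np.abs(lags), rlags]))
        targets.append(y[t])
    return np.array(rows), np.array(targets)
```

Each row of the returned design matrix is one X_t with nine columns, paired with the target Y_t, ready for the estimation methods discussed below.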

We do not expect to be able to predict S&P 500 daily returns well, if at all, as standard theories of market efficiency imply that excess returns in this index should not be predictable using publicly available information, provided that, as is plausible for this index, transactions costs and nonsynchronous trading effects do not induce serial correlation in the log first differences of the price index and that time variations in risk premia are small at the daily horizon [cf. Timmermann and Granger (2004)]. Indeed, concerted attempts to find evidence against this hypothesis have found none [see, e.g., Sullivan, Timmermann and White (1999)]. For simplicity, we do not adjust our daily returns for the risk-free rate of return, so we will not formally address the efficient markets hypothesis here. Rather, our emphasis is on examining the relative behavior of the different nonlinear forecasting methods discussed above in a challenging environment.

Of course, any evidence of predictability found in the raw daily returns would certainly be interesting: even perfect predictions of variation in the risk-free rate would result in extremely low prediction r-squares, as the daily risk-free rate is on the order of 0.015%, with minuscule variation over our sample compared to the variation in daily returns. Even if there is in fact no predictability in the data, examining the performance of various methods reveals their ability to capture patterns in the data. As predictability hinges on whether these patterns persist outside the estimation sample, applying our methods to this challenging example thus reveals the necessary capability of a given method to capture patterns, together with that method's ability to assess whether the patterns captured are "real" (present outside the estimation data) or not.

Our data set consists of daily S&P 500 index values for a period beginning on July 22, 1996 and ending on July 21, 2004. Data were obtained from http://finance.yahoo.com. We reserved the data from July 22, 2003 through July 21, 2004 for out-of-sample evaluation. Dropping the first four observations needed to construct the three required lags leaves 2008 observations in the data set, with n = 1,755 observations in the estimation sample and 253 observations in the evaluation hold-out sample.

For all of our experiments we use hv-block cross-validation, with v = 672 chosen proportional to n^{1/2} and h = 7 = int(n^{1/4}), as recommended by Racine (2000). Our particular choice for v was made after a little experimentation showed stable model selection behavior. The choice for h is certainly adequate, given the lack of appreciable dependence exhibited by the data.
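One plausible sketch of hv-block index construction follows. For tractability it steps over non-overlapping validation blocks rather than forming a block around every observation, so it should be read as an approximation in the spirit of Racine (2000) rather than a faithful implementation of that procedure; all names are ours.

```python
def hv_block_indices(n, v, h):
    """Generate (validation, training) index sets for an hv-block-style
    cross-validation: each validation block has length 2v + 1, and a
    further h observations on each side are removed before forming the
    training set, to limit dependence between the two sets."""
    step = 2 * v + 1
    for center in range(v, n - v, step):
        val = list(range(center - v, center + v + 1))
        lo = max(0, center - v - h)          # left edge of removed span
        hi = min(n, center + v + h + 1)      # right edge of removed span
        train = [t for t in range(n) if t < lo or t >= hi]
        yield val, train
```

The h-observation gap on either side of each validation block is what distinguishes hv-block from ordinary blocked cross-validation for dependent data.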

For our first experiment, we use a version of standard Newton–Raphson-based NLS to estimate the coefficients of ANN models with from zero to q̄ = 50 hidden units, using the logistic cdf activation function. We first fit a linear model (zero hidden units) and then add hidden units one at a time until 50 hidden units have been included. For a given number of hidden units, we select starting values for the hidden unit coefficients at random and from there perform Newton–Raphson iteration.

This first approach represents a naïve brute-force approach to estimating the ANN parameter values, and, as the model is nonlinear in parameters, we experience (as expected) difficulties in obtaining convergence. Moreover, these become more frequent as more complex models are estimated. In fact, the frequency with which convergence problems arise is sufficient to encourage use of the following modest stratagem: for a given number of hidden units, if convergence is not achieved (as measured by a sufficiently small change in the value of the NLS objective function), then the hidden unit coefficients are frozen at the best values found by NLS and OLS is then applied to estimate the corresponding hidden-to-output coefficients (the β's). In fact, we find it helpful to apply this final step regardless of whether convergence is achieved by NLS. This is useful not only because one usually observes improvement in the objective function