
MODEL CHOICE AND SPECIFICATION ANALYSIS

EDWARD E. LEAMER*

University of California, Los Angeles

Contents

3.2 Simplification searches: Model selection with fixed costs

4 Proxy searches: Model selection with measurement errors

Handbook of Econometrics, Volume I, Edited by Z. Griliches and M.D. Intriligator

© North-Holland Publishing Company, 1983

1 Introduction

The data banks of the National Bureau of Economic Research contain time-series data on 2000 macroeconomic variables. Even if observations were available since the birth of Christ, the degrees of freedom in a model explaining gross national product in terms of all these variables would not turn positive for another two decades. If annual observations were restricted to the 30-year period from 1950 to 1979, the degrees-of-freedom deficit would be 1970. A researcher who sought to sublimate the harsh reality of the degrees-of-freedom deficit and who restricted himself to exactly five explanatory variables could select from a menu of

of freedom. Models in Section 3 arise not from the judgment of the investigator, as in Section 2, but from his purpose, which is measured by a formal loss function. Quadratic loss functions are considered, both with and without fixed costs. A brief comment is made about "ridge regression". The main conclusion of Section 3 is that quadratic loss does not imply a model selection problem.

Sections 4, 5, and 6 discuss problems which are not as well known. The problem of selecting the best proxy variable is treated in Section 4. Akaike's Information Criterion is discussed in Section 5, although it is pointed out that, except for subtle conceptual differences, his problem reduces to estimation with quadratic loss. Section 6 deals with methods for discounting evidence when models are discovered after having previewed the data. Finally, Section 7 contains material on "stepwise regression", "cross-validation", and "goodness-of-fit" tests.

A uniform notation is used in this chapter. The T observations of the dependent variable are collected in the T × 1 vector Y, and the T observations of a potential set of k explanatory variables are collected in the T × k matrix X. The hypothesis that the vector Y is normally distributed with mean Xβ and variance matrix σ²I will be indicated by H:

H: Y ~ N(Xβ, σ²I).

A subset of explanatory variables will be denoted by the T × k_J matrix X_J, where J is a subset of the first k integers, each integer selecting an included variable. The excluded variables will be indexed by the set J̄, and the hypothesis that the variables in J̄ have no effect is

H_J: Y ~ N(X_J β_J, σ²I).

The reduction in the error sum-of-squares which results when the variables J̄ are added to the model J is

ESS_J − ESS = b_J̄′ X_J̄′ M_J X_J̄ b_J̄,

where M_J = I − X_J (X_J′ X_J)⁻¹ X_J′.

2 Model selection with prior distributions

One very important source of model selection problems is the existence of a priori opinion that constraints are "likely". Statistical testing is then designed either to determine if a set of constraints is "true" or to determine if a set of constraints is "approximately true". The solutions to these two problems, which might be supposed to be essentially the same, in fact diverge in two important respects: (1) the first problem leads clearly to a significance level which is a decreasing function of sample size, whereas the second problem selects a relatively constant significance level; (2) the first problem has a set of alternative models which is determined entirely from a priori knowledge, whereas the second problem can have a data-dependent set of hypotheses.

The problem of testing to see if constraints are "true" is discussed in Section 2.1 under the heading "hypothesis testing searches", and the problem of testing to see if constraints are "approximately true" is discussed in Section 2.2 under the heading "interpretive searches".

Before proceeding it may be useful to reveal my opinion that "hypothesis testing searches" are very rare, if they exist at all. An hypothesis testing search occurs when the subjective prior probability distribution allocates positive probability to a restriction. For example, when estimating a simple consumption function which relates consumption linearly to income and an interest rate, many economists would treat the interest rate variable as doubtful. The method by which this opinion is injected into a data analysis is usually a formal test of the hypothesis that the interest rate coefficient is zero. If the t-statistic on this coefficient is sufficiently high, the interest rate variable is retained; otherwise it is omitted. The opinion on which this procedure rests could be characterized by a subjective prior probability distribution which allocates positive probability to the hypothesis that the interest-rate coefficient is exactly zero. A Bayesian analysis would then determine if this atom of probability becomes larger or smaller when the data evidence is conditioned upon. But the same statistical procedure could be justified by a continuous prior distribution which, although concentrating mass in the neighborhood of zero, allocates zero probability to the origin. In that event, the posterior as well as the prior probability of the sharp hypothesis is zero. Although the subjective logic of Bayesian inference allows for any kind of prior distribution, I can say that I know of no case in economics when I would assign positive probability to a point (or, more accurately, to a zero-volume subset in the interior of the parameter space). Even cooked-up examples can be questioned. What is your prior probability that the coin in my pocket produces when flipped a binomial sequence with probability precisely equal to 0.5? Even if the binomial assumption is accepted, I doubt that a physical event could lead to a probability precisely equal to 0.5. In the case of the interest-rate coefficient described above, ask yourself what chance a 95 percent confidence interval has of covering zero if the sample size is enormous (and the confidence interval tiny). If you say infinitesimal, then you have assigned at most an infinitesimal prior probability to the sharp hypothesis, and you should be doing data-interpretation, not hypothesis testing.

2.1 Hypothesis testing searches

This subsection deals with the testing of a set of M alternative hypotheses of the form R_i β = 0, i = 1,…,M. It is assumed that the hypotheses have truth value in the sense that the prior probability is non-zero, P(R_i β = 0) > 0. Familiarity with the concepts of significance level and power is assumed, and the discussion focuses first on the issue of how to select the significance level when the hypotheses have a simple structure. The clear conclusion is that the significance level should be a decreasing function of sample size.

Neyman and Pearson (1928) are credited with the notion of a power function of a test and, by implication, the need to consider specific alternative models when testing an hypothesis. The study of power functions is, unfortunately, limited in value, since although it can rule in favor of uniformly most powerful tests with given significance levels, it cannot select between two tests with different significance levels. Neyman's (1958) advice notwithstanding, in practice most researchers set the significance level equal to 0.05 or to 0.01, or they use these numbers to judge the size of a P-value. The step from goodness-of-fit testing, which considers only significance levels, to classical hypothesis testing, which includes in principle a study of power functions, is thereby rendered small. The Bayesian solution to the hypothesis testing problem is provided by Jeffreys (1961). The posterior odds in favor of the "null" hypothesis H₀ versus an "alternative" H₁ given the data Y is

P(H₀|Y) / P(H₁|Y) = [P(H₀) / P(H₁)] · B(Y).

In words, the posterior odds ratio is the prior odds ratio times the "Bayes factor", B(Y) = P(Y|H₀)/P(Y|H₁). This Bayes factor is the usual likelihood ratio for testing if the data were more likely to come from the distribution P(Y|H₀) than from the distribution P(Y|H₁). If there were a loss function, a Bayesian would select the hypothesis that yields lowest expected loss, but without a loss function it is appropriate to ask only if the data favor H₀ relative to H₁. Since a Bayes factor in excess of one favors the null hypothesis, the inequality B(Y) ≤ 1 implicitly defines the region of rejection of the null hypothesis and thereby selects the significance level and the power. Equivalently, the loss function can be taken to penalize an error by an amount independent of what the error is, and the prior probabilities can be assumed to be equal.

In order to contrast Jeffreys' solution with Neyman and Pearson's solution, consider testing the null hypothesis that the mean of a sample of size n, ȳ_n, is distributed normally with mean 0 and variance n⁻¹ versus the alternative that ȳ_n is normal with mean μ_a and variance n⁻¹, where μ_a > 0. Classical hypothesis testing at the 0.05 level of significance rejects the null hypothesis if ȳ_n > 1.645 n⁻¹ᐟ², whereas the Bayes rejection region defined by B(Y) ≤ 1 is ȳ_n ≥ μ_a/2. The classical rejection region does not depend on μ_a, and, somewhat surprisingly, the treatment of the data does not depend on whether μ_a = 1 or μ_a = 10¹⁰. Also, as sample size grows, the classical rejection region gets smaller and smaller, whereas the Bayesian rejection region is constant. Thus, the classical significance level is fixed at 0.05 and the power P(ȳ_n > 1.645 n⁻¹ᐟ² | μ = μ_a) goes to one as the sample size grows. The Bayes rule, in contrast, has the probabilities of type one and type two errors equal for all sample sizes: P(ȳ_n > μ_a/2 | μ = 0) / P(ȳ_n < μ_a/2 | μ = μ_a) = 1. The choice between these two treatments is ultimately a matter of personal preference; as for myself, I much prefer Jeffreys' solution. The only sensible alternative is a minimax rule, which in this case is the same as the Bayes rule. Minimax rules, such as those proposed by Arrow (1960), generally also have the significance level a decreasing function of sample size.
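A small numerical sketch of this contrast may help, assuming the simple-versus-simple normal setup just described; the choice μ_a = 1 and the sample sizes are illustrative, not taken from the text.

```python
import numpy as np
from scipy.stats import norm

mu_a = 1.0  # illustrative alternative mean (assumption, not from the text)

for n in [10, 100, 1000, 10000]:
    se = 1.0 / np.sqrt(n)          # ybar_n ~ N(mu, 1/n) under either hypothesis
    c_classical = 1.645 * se        # fixed 0.05 one-sided significance level
    c_bayes = mu_a / 2.0            # B(Y) <= 1 rejection region, constant in n
    power_classical = 1 - norm.cdf(c_classical, loc=mu_a, scale=se)
    type1_bayes = 1 - norm.cdf(c_bayes, loc=0.0, scale=se)
    type2_bayes = norm.cdf(c_bayes, loc=mu_a, scale=se)
    print(f"n={n:6d}  classical cutoff={c_classical:.4f}  power={power_classical:.3f}  "
          f"Bayes cutoff={c_bayes:.2f}  type I={type1_bayes:.3f}  type II={type2_bayes:.3f}")
```

The classical cutoff shrinks and its power goes to one, while the Bayes cutoff stays at μ_a/2 with its two error probabilities shrinking together and remaining equal.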

Jeffreys' Bayesian logic is not so compelling for the testing of composite hypotheses, because a prior distribution is required to define the "predictive distribution" P(Y|H₁), because prior distributions are usually difficult to select, and because the Bayesian answer in this case is very sensitive to the choice of prior. Suppose, for example, that the null hypothesis H₀ is that the mean of a sample of size n, ȳ_n, is distributed normally with mean 0 and variance n⁻¹. Take the alternative H₁ to be that ȳ_n is normally distributed with unknown mean μ and variance n⁻¹. In order to form the predictive distribution P(ȳ_n|H₁) it is necessary to assume a prior distribution for μ, say normal with mean m* and variance (n*)⁻¹. Then the marginal distribution P(ȳ_n|H₁) is also normal with mean m* and variance (n*)⁻¹ + n⁻¹. The Bayes factor in favor of H₀ relative to H₁ therefore becomes

B(ȳ_n) = [(n + n*)/n*]¹ᐟ² exp{−½[n ȳ_n² − n n*(ȳ_n − m*)²/(n + n*)]},   (2.1)

and the corresponding region of rejection defined by B ≤ 1 is

(n + n*)⁻¹(n² ȳ_n² + 2 n n* ȳ_n m* − n n* m*²) + log[n*(n + n*)⁻¹] ≥ 0.   (2.2)

It should be observed that the region of rejection is not one-sided if m* ≠ 0 and n* < ∞. Furthermore, both the Bayes factor and the region of rejection depend importantly on the prior parameters n* and m*. If, for example, the prior is located at the origin, m* = 0, then given the data ȳ_n, the Bayes factor B(ȳ_n; n*) varies from infinity if n* = 0 to one as n* approaches infinity. The minimum value, B(ȳ_n; n*) = (n ȳ_n²)¹ᐟ² exp{−(n ȳ_n² − 1)/2}, is attained at n* = n/(n ȳ_n² − 1) if n ȳ_n² − 1 is positive. Otherwise B(ȳ_n; n*) is strictly greater than one. The region of rejection varies from the whole line to the region ȳ_n² ≥ n⁻¹. This Bayesian logic for selecting the significance level of a test is therefore hardly useful at all if you have much trouble selecting your prior distribution.
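The sensitivity described here is easy to see numerically. Below is a minimal sketch using the predictive densities defined above; the particular values of n, ȳ_n and the grid of n* values are illustrative only.

```python
import numpy as np
from scipy.stats import norm

n, ybar = 100, 0.25           # illustrative sample size and observed mean (assumptions)
m_star = 0.0                  # prior located at the origin

def bayes_factor(n_star):
    # B = P(ybar | H0) / P(ybar | H1), with H0: N(0, 1/n) and H1 predictive: N(m*, 1/n* + 1/n)
    f0 = norm.pdf(ybar, loc=0.0, scale=np.sqrt(1.0 / n))
    f1 = norm.pdf(ybar, loc=m_star, scale=np.sqrt(1.0 / n_star + 1.0 / n))
    return f0 / f1

for n_star in [0.01, 0.1, 1, 10, 100, 1000, 1e6]:
    print(f"n* = {n_star:>9}, B = {bayes_factor(n_star):8.3f}")

# Minimum over n*, attained at n* = n/(n*ybar**2 - 1) when n*ybar**2 > 1
if n * ybar**2 > 1:
    n_star_min = n / (n * ybar**2 - 1)
    print("argmin n* =", round(n_star_min, 2), " B_min =", round(bayes_factor(n_star_min), 3))
```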

The Bayesian logic is very useful, however, in emphasizing the fact that the significance level should be a decreasing function of sample size. As the sample size grows, the Bayes factor becomes

lim B(ȳ_n) = (n/n*)¹ᐟ² exp{−½ n ȳ_n²},

with corresponding region of rejection n ȳ_n² > log(n/n*). This contrasts with the usual classical region of rejection n ȳ_n² > c, where c is a constant independent of sample size chosen such that P(n ȳ_n² > c | μ = 0) = 0.05. The important point which needs to be made is that two researchers who study the same or similar problems but who use samples with different sample sizes should use the same significance level only if they have different prior distributions. In order to maintain comparability, it is necessary for each to report results based on the same prior. Jeffreys, for example, proposes a particular "diffuse" prior which leads to the critical t-values reported in Table 2.1, together with some of my own built on a somewhat different limiting argument. It seems to me better to use a table such as this, built on a somewhat arbitrary prior distribution, than to use an arbitrarily selected significance level, since in the former case at least you know what you are doing. Incidentally, it is surprising to me that the t-values in this table increment so slowly.

Various Bayes factors for the linear regression model are reported in Zellner (1971), Lempers (1971), Gaver and Geisel (1974), and in Leamer (1978). The Bayes factor in favor of model J relative to model J* is P(Y|H_J)/P(Y|H_{J*}). A marginal likelihood of model J is given by the following result.

Theorem 1 (Marginal likelihood)

Suppose that the observable (T × 1) vector Y has mean vector X_J β_J and variance matrix h_J⁻¹ I_T, where X_J is a T × k_J observable matrix of explanatory variables, β_J

where c_J is a constant depending on the precise choice of prior distribution.¹ Leamer (1978) argues somewhat unconvincingly that the term c_J can be treated as if it were the same for all models. This leads to the model selection criterion (2.4), appropriate for one form of diffuse prior, which is actually the formula used to produce my critical t-values in Table 2.1. Schwarz (1978) also proposes criterion (2.4) and uses the same logic to produce it. This Bayesian treatment of hypothesis testing generalizes straightforwardly to all other settings, at least conceptually. The ith composite hypothesis Y ~ f_i(Y|θ_i), θ_i ∈ Θ_i, is mixed into the point hypothesis Y ~ f_i(Y) = ∫ f_i(Y|θ_i) f_i(θ_i) dθ_i, where f_i(θ_i) is the prior distribution. The Bayes factor in favor of hypothesis i relative to hypothesis j is simply f_i(Y)/f_j(Y). Numerical difficulties can arise if computation of the "marginal" likelihood ∫ f_i(Y|θ_i) f_i(θ_i) dθ_i is unmanageable. Reporting difficulties can arise if the choice of prior, f_i(θ_i), is arguable.
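Because the chapter identifies criterion (2.4) with Schwarz (1978), whose exact display is not reproduced above, the following sketch uses the standard Schwarz form T·log(ESS_J/T) + k_J·log T for comparing regression models; the simulated data and the candidate model list are invented for illustration.

```python
import numpy as np

def schwarz_criterion(y, X):
    """Schwarz (1978)-style criterion: T*log(ESS/T) + k*log(T); smaller is better."""
    T, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ess = np.sum((y - X @ beta) ** 2)
    return T * np.log(ess / T) + k * np.log(T)

rng = np.random.default_rng(0)
T = 200
x1, x2, x3 = rng.normal(size=(3, T))
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=T)   # x3 is irrelevant by construction

const = np.ones(T)
models = {
    "J = {1}":     np.column_stack([const, x1]),
    "J = {1,2}":   np.column_stack([const, x1, x2]),
    "J = {1,2,3}": np.column_stack([const, x1, x2, x3]),
}
for name, X in models.items():
    print(name, round(schwarz_criterion(y, X), 2))
```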

This relatively settled solution contrasts greatly with the classical treatment. The sample space 𝒴 must be partitioned into a set of acceptance regions A_i, 𝒴 = ∪_i A_i, A_i ∩ A_j = ∅ for i ≠ j, such that if the data fall in A_i then hypothesis i is "accepted". From the Bayesian perspective, these regions can be implicitly defined by the process of maximizing the marginal likelihood, A_i = {Y | f_i(Y) > f_j(Y), j ≠ i}. But, classically, one is free to choose any partition. Once a partition is chosen, one is obligated to study its error properties: the type I error P(Y ∉ A_i | H_i, θ_i) and the type II error P(Y ∈ A_i | H_j, θ_j), the first error being the probability that H_i is not accepted when in fact it is the true model, and the second being the probability that H_i is accepted when in fact it is not the true model. A partition, ∪_i A_i = 𝒴, is ruled inadmissible if there exists another partition, ∪_i A_i* = 𝒴, such that

P(Y ∉ A_i | H_i, θ_i) ≥ P(Y ∉ A_i* | H_i, θ_i)   for all i, θ_i,

and

P(Y ∈ A_i | H_j, θ_j) ≥ P(Y ∈ A_i* | H_j, θ_j)   for all i ≠ j, θ_j,

with at least one strict inequality. Otherwise, a partition is ruled admissible. The criterion of admissibility can rule out certain obviously silly procedures; for example, when testing H₁: x ~ N(0,1) versus H₂: x ~ N(2,1), the partition with A₁ = {x | x ≥ 1} is inadmissible. (Let A₁* = {x | x ≤ 1}.) But otherwise the criterion is rarely useful. As a result, there is a never-ending sequence of proposals for alternative ways of partitioning the sample space (just as there is a never-ending sequence of alternative prior distributions!)

The most commonly used test discriminates between hypothesis H_J and hypothesis H and is based on the following result.

where F_{T−k}^{k−k_J}(α) is the upper αth percentile of the F distribution with k − k_J and T − k degrees of freedom. What is remarkable about Theorem 2 is that the random variable F has a distribution independent of (β, σ²); in particular, P(Y ∉ A_J | H_J, β, σ²) = α. Nonetheless, the probability of a type II error, P(Y ∈ A_J | H, β, σ²), does depend on (β, σ²). For that reason, the substantial interpretive clarity associated with tests with uniquely defined error probabilities is not achieved here. The usefulness of the uniquely defined type I error attained by the F-test is limited to settings which emphasize type I error to the neglect of type II error.
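The statement of Theorem 2 is not reproduced above; the sketch below assumes the standard F statistic for testing H_J within H in the normal linear model, F = [(ESS_J − ESS)/(k − k_J)] / [ESS/(T − k)], and all data and dimensions are illustrative.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(1)
T, k, k_J = 100, 5, 3
X = rng.normal(size=(T, k))
beta = np.array([1.0, -0.5, 0.25, 0.0, 0.0])   # last two coefficients truly zero
y = X @ beta + rng.normal(size=T)

def ess(y, X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

ess_full = ess(y, X)            # model H with all k variables
ess_J = ess(y, X[:, :k_J])      # model H_J with the first k_J variables

F = ((ess_J - ess_full) / (k - k_J)) / (ess_full / (T - k))
critical = f_dist.ppf(0.95, dfn=k - k_J, dfd=T - k)   # upper 5th percentile
print(f"F = {F:.3f}, 5% critical value = {critical:.3f}, reject H_J: {F > critical}")
```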

When hypotheses are not nested, it is not sensible to set up the partition such that P(Y ∉ A_J | H_J, β_J, σ_J²) is independent of (β_J, σ_J²). For example, a test of model J against model J′ could use the partition defined by (2.6), in which case P(Y ∉ A_J | H_J, β_J, σ_J²) would be independent of (β_J, σ_J²). But this treats the two hypotheses asymmetrically for no apparent reason. The most common procedure instead is to select the model with the maximum R̄²:

R̄_J² = 1 − s_J²/c,

where c = (T − 1)⁻¹ Σ_t (Y_t − Ȳ)². Equivalently, the model is selected which minimizes the estimated residual variance s_J² = ESS_J/(T − k_J).

Theil (1971) gives this rule some support by showing that if model J is true then E(s_J²) ≤ E(s_{J′}²) for all J′. Error probabilities for this rule have been studied by Schmidt (1973).
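A minimal sketch of the maximum-R̄² rule, using the definitions R̄_J² = 1 − s_J²/c and c = (T − 1)⁻¹Σ(Y_t − Ȳ)² given above; the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 120
x1, x2 = rng.normal(size=(2, T))
y = 0.8 * x1 + rng.normal(size=T)          # the model using x2 is misspecified by construction

def rbar2(y, X):
    T, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ b) ** 2) / (T - k)      # estimated residual variance s_J^2
    c = np.sum((y - y.mean()) ** 2) / (T - 1)
    return 1.0 - s2 / c

X_J = np.column_stack([np.ones(T), x1])
X_Jp = np.column_stack([np.ones(T), x2])
print("Rbar^2(J) =", round(rbar2(y, X_J), 3), " Rbar^2(J') =", round(rbar2(y, X_Jp), 3))
# Maximizing Rbar^2 is the same as minimizing s_J^2, since c does not depend on J.
```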

Other ways to partition the sample space embed non-nested hypotheses into a general model. Models J and J′ can each separately be tested against the composite model. This partitions the sample space into four regions, the extra pair of regions defined when both J and J′ are rejected and when both J and J′ are accepted.² Although it is possible to attach meaning to these outcomes if a more general viewpoint is taken, from the standpoint of the problem at hand these are nonsensical outcomes.

The specification error tests of Wu (1973) and Hausman (1978), interpreted by Hausman and Taylor (1980), can also be discussed in this section. The constrained estimator b_J has bias (X_J′X_J)⁻¹ X_J′ X_J̄ β_J̄ ≡ E_J β_J̄ and is therefore unbiased (and consistent) if E_J β_J̄ = 0. The hypothesis that misspecification is inconsequential, E_J β_J̄ = 0, differs from the more traditional hypothesis, β_J̄ = 0, if E_J has rank less than k_J̄, in particular if k_J < k_J̄. A classical test of E_J β_J̄ = 0 can therefore differ from a classical test of β_J̄ = 0. But if your prior has the feature Pr(β_J̄ = 0 | E_J β_J̄ = 0) = 1, then Bayesian tests of the two hypotheses are identical. Because E_J β_J̄ is not likely to be a set of special linear combinations of the coefficients, it is quite likely that the only mass point of your prior in the subspace E_J β_J̄ = 0 is at β_J̄ = 0. As a consequence, the special specification error hypothesis becomes uninteresting.

The hypotheses so far discussed all involve linear constraints on parameters in univariate normal linear regression models. In more complex settings the most common test statistic is the likelihood ratio

λ = max_{θ ∈ Θ₀} f(Y|θ) / max_{θ ∈ Θ} f(Y|θ),

where θ is a vector of parameters assumed to come from some set Θ, and the null hypothesis is that θ ∈ Θ₀ ⊂ Θ. Recently, two alternatives have become popular in the econometric theory literature: the Wald test and the Lagrange multiplier test. These amount to alternative partitions of the sample space and are fully discussed in Chapter 13 of this Handbook by Engle.

²See the discussion in Gaver and Geisel (1974) and references including the methods of Cox (1961,

2.2 Interpretive searches

Although it is rare to have the prior probability of an exact constraint be non-zero, it is fairly common to assign substantial prior probability to the hypothesis that a constraint is "approximately true". You may think that various incidental variables have coefficients which, though certainly not zero, are quite likely to be close to zero. You may realize that the functional form is certainly not linear, but at the same time you may expect the departures from linearity to be small. Similar variables are very unlikely to have exactly the same coefficients, but nonetheless are quite likely to have coefficients roughly the same size.

In all these cases it is desirable when interpreting a given data set to use all prior information which may be available, especially so when the data are weak where the prior is strong. In practice, most researchers test sequentially the set of a priori likely constraints. If the data do not cast doubt on a constraint, it is retained; otherwise, it is discarded. Section 2.2.1 comments on these ad hoc methods by comparing them with formal Bayesian procedures. Section 2.2.2 deals with sensitivity analyses appropriate when the prior is "ambiguous". Various measures of multi-collinearity are discussed in Section 2.2.3. Finally, the dilemma of the degrees-of-freedom deficit is commented on in Section 2.2.4.

2.2.1 Interpretive searches with complete prior distributions

A Bayesian with a complete prior distribution and a complete sampling distribution straightforwardly computes his posterior distribution. Although no model selection issues seem to arise, the mean of his posterior distribution is a weighted average of least-squares estimates from various constrained models. For that reason, he can get an idea where his posterior distribution is located by looking at particular constrained estimates and, conversely, the choice of which constrained estimates he looks at can be used to infer his prior distribution. Suppose in particular that the prior for β is normal with mean zero and precision matrix D* = diag{d₁, d₂,…,d_k}. Then, the posterior mean is

b** = (H + D*)⁻¹ H b,   (2.9)

where H = σ⁻²X′X and b is the least-squares estimate. The following two theorems from Leamer and Chamberlain (1976) link b** to model selection strategies.

Theorem 3 (The 2^k regressions)

The posterior mean (2.9) can be written as

b** = (H + D*)⁻¹ H b = Σ_J w_J b_J,

where J indexes the 2^k subsets of the first k integers, b_J is the least-squares estimate subject to the constraints β_i = 0 for i ∈ J̄, and

w_J ∝ (Π_{i∈J̄} d_i) · |σ⁻² X_J′ X_J|,   Σ_J w_J = 1.
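A small numerical check of Theorem 3 as reconstructed above (the weight formula shown there is a reconstruction from a garbled source); σ² is treated as known, and the data and prior precisions are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
T, k, sigma2 = 50, 3, 1.0
X = rng.normal(size=(T, k))
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=T) * np.sqrt(sigma2)

H = X.T @ X / sigma2
D = np.diag([2.0, 4.0, 8.0])                       # prior precisions d_i (illustrative)
b = np.linalg.solve(X.T @ X, X.T @ y)              # unconstrained least squares
b_post = np.linalg.solve(H + D, H @ b)             # posterior mean (2.9)

# Weighted average over the 2^k constrained regressions
total, weighted = 0.0, np.zeros(k)
for r in range(k + 1):
    for J in combinations(range(k), r):
        J = list(J)
        b_J = np.zeros(k)
        if J:
            XJ = X[:, J]
            b_J[J] = np.linalg.solve(XJ.T @ XJ, XJ.T @ y)
        excluded = [i for i in range(k) if i not in J]
        det_J = np.linalg.det(X[:, J].T @ X[:, J] / sigma2) if J else 1.0
        w = np.prod([D[i, i] for i in excluded]) * det_J
        total += w
        weighted += w * b_J
print(np.allclose(b_post, weighted / total))        # expect True
```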

Theorem 4 (Principal Component Regression)

If D* = dI, then the posterior mean (2.9) can be written as

b** = (H + dI)⁻¹ H b = Σ_i γ_i(d/σ²) c_i,

of his posterior by computation of the 2^k regressions. Conversely, a researcher who selects a particular coordinate system in which to omit variables (or equivalently selects k linearly independent constraints) and then proceeds to compute regressions on all subsets of variables (or equivalently uses all subsets of the k constraints) thereby reveals a prior located at the origin with coefficients independently distributed in the prior. It is, perhaps, worth emphasizing that this solves the puzzle of how to choose a parameterization. For example, if a variable y depends on x and lagged x, y_t = β₁x_t + β₂x_{t−1}, it is not likely that β₁ and β₂ are independent a priori, since if I tell you something about β₁, it is likely to alter your opinions about β₂. But if the model is written as y_t = (β₁ + β₂)(x_t + x_{t−1})/2 + (β₁ − β₂)(x_t − x_{t−1})/2, it is likely that the long-run effect (β₁ + β₂) is independent of the difference in the effects (β₁ − β₂). In that event, computation of the 2^k regressions should be done in the second parameterization, not the first.

Theorem 4 makes use of the extra information that the prior variances are all the same. If the prior distribution is completely specified, as is assumed in this section, there is always a choice of parameterization such that the parameters are independent and identically distributed. In practice, it may not be easy to select such a parameterization. This is especially so when the explanatory variables are measured in different units, although logarithmic transformations can be useful in that event. When this difficulty is overcome, Theorem 4 links principal component regression selection strategies with a full Bayesian treatment. The usual arbitrariness of the normalization in principal component regression is resolved by using a parameterization such that the prior is spherical. Furthermore, the principal component restrictions should be imposed as ordered by their eigenvalues, not by their t-values as has been suggested by Massy (1965).

2.2.2 Model selection and incomplete priors

The Bayesian logic is to my mind compellingly attractive, and it is something of a paradox that economic data are not routinely analyzed with Bayesian methods. It seems clear to me that the principal resistance to Bayesian methods is expressed in the incredulous grin which greets Bayesians when they make statements like: "We need to begin with a multivariate prior distribution for the parameter vector θ." Because prior distributions are not precisely measurable, or because potential readers may differ in their opinions, an analysis built on a particular distribution is of little interest. Instead, a researcher should report as completely as possible the mapping implied by the given data from priors into posteriors. In a slogan, "the mapping is the message".

In fact, many researchers attempt now to report this mapping. Often different least-squares equations with different lists of variables are included to give the reader a sense of the sensitivity of the inferences to choice of model. Sometimes a researcher will report that the inferences are essentially the same for a certain family of specifications.

The reporting of sensitivity analyses should be greatly encouraged. As readers we have a right to know if an inference "holds up" to minor and/or major changes in the model. Actually, results are often ignored until a thorough sensitivity analysis is completed, usually by other researchers. Sensitivity analyses are not now routinely reported largely because we do not have particularly useful tools for studying sensitivity, nor do we have economical ways of reporting. I believe the Bayesian logic ultimately will have its greatest practical impact in its solutions to this problem.

A Bayesian sensitivity analysis supposes that there is a class Π of prior distributions. This may be a personal class of distributions, containing all possible measurements of my uniquely maintained prior, or it may be a public class of distributions, containing alternative priors which readers are likely to maintain. In either case, it is necessary to characterize completely the family of posterior distributions corresponding to the family of priors. For a finite universe of elemental events e_i with corresponding probabilities P(e_i) = π_i, the set Π may be generated by a set of inequalities, which can involve constraints on the probabilities of sets, Σ_i δ_i π_i, where δ_i is the indicator function, or constraints on the expectations of random variables, Σ_i X(e_i) π_i, where X is the random variable. Given the data Y, a set A of interest with indicator function δ, the likelihood function P(Y|e_i) = f_i, and a particular prior, π ∈ Π, the prior and posterior probabilities are

π(A) = Σ_i δ_i π_i,    π(A|Y) = Σ_i δ_i f_i π_i / Σ_i f_i π_i.

Prior upper and lower probabilities are then

P*(A) = sup_{π∈Π} π(A),    P_*(A) = inf_{π∈Π} π(A),

with posterior bounds defined analogously. The interval P*(A) − P_*(A) is called the "confusion" by Dempster (1967). If Π is a public class of distributions, P* − P_* might better be called the "disagreement".³ Similarly, upper and lower prior expectations of the random variable X are

E*(X) = sup_{π∈Π} Σ_i X(e_i) π_i,    E_*(X) = inf_{π∈Π} Σ_i X(e_i) π_i,

with posterior bounds defined analogously.
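A toy numerical sketch of these bounds, with a finite universe of three elemental events and a small, arbitrarily chosen class Π of priors; every number here is invented for illustration.

```python
import numpy as np

f = np.array([0.9, 0.4, 0.1])        # likelihoods f_i = P(Y | e_i), illustrative
delta = np.array([1, 1, 0])          # indicator of the set A of interest

# A small "public" class of priors over the three elemental events
Pi = [np.array([0.2, 0.5, 0.3]),
      np.array([0.4, 0.4, 0.2]),
      np.array([0.1, 0.3, 0.6])]

def prior_prob(pi):
    return np.sum(delta * pi)

def posterior_prob(pi):
    return np.sum(delta * f * pi) / np.sum(f * pi)

print("prior bounds:     ", min(map(prior_prob, Pi)), max(map(prior_prob, Pi)))
print("posterior bounds: ", min(map(posterior_prob, Pi)), max(map(posterior_prob, Pi)))
# The gap between the upper and lower posterior probabilities is Dempster's "confusion".
```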

Although indeterminate probabilities have been used in many settings, reviewed by Shafer (1978) and DeRobertis (1979), as far as I know their use for analyzing the regression model is confined to Leamer and Chamberlain (1976), Chamberlain and Leamer (1976), and Leamer (1978). In each of these papers the prior location is taken as given and a study is made of the sensitivity of the posterior mean (or modes) to changes in the prior covariance matrix. One important sensitivity result is the following [Chamberlain and Leamer (1976)].

³In Shafer's view (1976, p. 23), which I share, the Bayesian theory is incapable of representing ignorance: "It does not allow one to withhold belief from a proposition without according that belief to the negation of the proposition." Lower probabilities do not necessarily have this restriction, P_*(A) + P_*(−A) ≤ 1, and are accordingly called "non-additive". Shafer's review (1978) includes references to Bernoulli, Good, Huber, Smith and Dempster but excludes Keynes (1921). Keynes elevates indeterminate probabilities to the level of primitive concepts and, in some cases, takes only a partial ordering of probabilities as given. Except for some fairly trivial calculations on some relationships among bounded probabilities, Keynes' Treatise is devoid of practical advice. Braithwaite, in the editorial foreword, reports accurately at the time (but greatly wrong as a prediction) that "this leads to intolerable difficulties without any compensating advantages". Jeffreys, in the preface to his third edition, began the rumor that Keynes had recanted in his (1933) review of Ramsey and had accepted the view that probabilities are both aleatory and additive. Hicks (1979), who clearly prefers Keynes to Jeffreys, finds no recantation in Keynes (1933) and refers to Keynes' 1937 Quarterly Journal of Economics article as evidence that he had not changed his mind.

Theorem 5

The posterior mean b** = (X′X + N*)⁻¹ X′Y, regardless of the choice of the prior precision matrix N*, lies in the ellipsoid

(b** − b/2)′ X′X (b** − b/2) ≤ b′X′X b/4.   (2.10)

Conversely, any point in this ellipsoid is a posterior mean for some N*.

The "skin" of this ellipsoid is the set of all constrained least-squares estimates subject to constraints of the form Rβ = 0 [Leamer (1978, p. 127)]. Leamer (1977) offers a computer program which computes extreme values of ψ′b** for a given ψ over the ellipsoid (2.10), constrained also to a classical confidence ellipsoid of a given confidence level. This amounts to finding upper and lower expectations within a class of priors located at the origin with the further restriction that the prior cannot imply an estimate greatly at odds with the data evidence. Leamer (1982) also generalizes Theorem 5 to the case A ≤ N*⁻¹ ≤ B, where A and B are lower and upper variance matrices and A ≤ N*⁻¹ means that N*⁻¹ − A is positive definite.
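A quick numerical check of Theorem 5; note that the ellipsoid inequality (2.10) shown above is a reconstruction, and the random positive semi-definite choices of N* below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
T, k = 60, 3
X = rng.normal(size=(T, k))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=T)

XtX = X.T @ X
b = np.linalg.solve(XtX, X.T @ y)                 # least-squares estimate
center, radius = b / 2, b @ XtX @ b / 4           # ellipsoid (2.10), as reconstructed

inside = []
for _ in range(1000):
    A = rng.normal(size=(k, k))
    N_star = A @ A.T                              # a random positive semi-definite prior precision
    b_post = np.linalg.solve(XtX + N_star, X.T @ y)
    d = b_post - center
    inside.append(d @ XtX @ d <= radius + 1e-9)
print(all(inside))                                 # expect True
```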

Other results of this form can be obtained by making other assumptions about the class Π of prior distributions. Theorems 3 and 4, which were used above to link model selection methods with Bayesian procedures built on completely specified priors, can also be used to define sets of posterior means for families of prior distributions. Take the class Π to be the family of distributions for β located at the origin with β_i independent of β_j, i ≠ j. Then, Theorem 3 implies that the upper and lower posterior modes of ψ′β occur at one of the 2^k regressions. If the class Π includes all distributions uniform on the spheres β′β = c and located at the origin, the set of posterior modes is a curve called by Dickey (1975) the "curve décolletage", by Hoerl and Kennard (1970a) the "ridge trace", and by Leamer (1973) the "information contract curve". This curve is connected to principal component regression by Theorem 4.

2.2.3 The multi-collinearity problem and model selection

There is no pair of words that is more misused, both in econometrics texts and in the applied literature, than the pair "multi-collinearity problem". That many of our explanatory variables are highly collinear is a fact of life. And it is completely clear that there are experimental designs X′X which would be much preferred to the designs the natural experiment has provided us. But a complaint about the apparent malevolence of nature is not at all constructive, and the ad hoc cures for a bad design, such as stepwise regression or ridge regression, can be disastrously inappropriate. Better that we should rightly accept the fact that our non-experiments are sometimes not very informative about parameters of interest.

Most proposed measures of the collinearity problem have the very serious defect that they depend on the coordinate system of the parameter space. For example, a researcher might use annual data and regress a variable y on the current and lagged values of x. A month later, with a faded memory, he might recompute the regression but use as explanatory variables current x and the difference between current and lagged x. Initially he might report that his estimates suffer from the collinearity problem because x and x₋₁ are highly correlated. Later, he finds x and x − x₋₁ uncorrelated and detects no collinearity problem. Can whimsy alone cure the problem?

To give a more precise example, consider the two-variable linear model y = β₁x₁ + β₂x₂ + u and suppose that the regression of x₂ on x₁ yields the result x₂ = rx₁ + e, where e by construction is orthogonal to x₁. Substitute this auxiliary relationship into the original one to obtain the model

y = θ₁z₁ + θ₂z₂ + u,

where θ₁ = (β₁ + β₂r), θ₂ = β₂, z₁ = x₁, and z₂ = x₂ − rx₁. A researcher who used the variables x₁ and x₂ and the parameters β₁ and β₂ might report that β₂ is estimated inaccurately because of the collinearity problem. But a researcher who happened to stumble on the model with variables z₁ and z₂ and parameters θ₁ and θ₂ would report that there is no collinearity problem because z₁ and z₂ are orthogonal (x₁ and e are orthogonal by construction). This researcher would nonetheless report that θ₂ (= β₂) is estimated inaccurately, not because of collinearity, but because z₂ does not vary adequately.
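A short simulation of this two-variable example; the data-generating numbers are arbitrary, and the point is only that the standard error of the coefficient on the second variable is identical in the two parameterizations.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 200
x1 = rng.normal(size=T)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=T)     # highly collinear with x1 by construction
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=T)

def ols_se(y, X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b, np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

# Original parameterization (x1, x2) and the orthogonalized one (z1, z2)
r = (x1 @ x2) / (x1 @ x1)
z1, z2 = x1, x2 - r * x1
b_x, se_x = ols_se(y, np.column_stack([x1, x2]))
b_z, se_z = ols_se(y, np.column_stack([z1, z2]))
print("se(beta_2) =", round(se_x[1], 4), "  se(theta_2) =", round(se_z[1], 4))   # identical
print("corr(x1, x2) =", round(np.corrcoef(x1, x2)[0, 1], 3), "  corr(z1, z2) ~ 0")
```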

What the foregoing example aptly illustrates is that collinearity as a cause of weak evidence is indistinguishable from inadequate variability as a cause of weak evidence. In light of that fact, it is surprising that all econometrics texts have sections dealing with the "collinearity problem" but none has a section on the "inadequate variability problem". The reason for this is that there is something special about collinearity. It not only causes large standard errors for the coefficients but also causes very difficult interpretation problems when there is prior information about one or more of the parameters. For example, collinear data may imply weak evidence about β₁ and β₂ separately but strong evidence about the linear combination β₁ + β₂. The interpretation problem is how to use the sample information about β₁ + β₂ to draw inferences about β₁ and β₂ in a context where there is prior information about β₁ and/or β₂. Because classical inference is not concerned with pooling samples with prior information, classical econometrics texts ought not to have special sections devoted to collinearity as distinct from inadequate variability. Researchers do routinely pool prior information with sample information and do confront the interpretation problem, usually without the support of formal statistical theory. Because they do, most text writers using a classical framework feel compelled to write a section on the collinearity problem, which usually turns out confused and lame. An exception is Kmenta (1971, p. 391) who accurately writes "that a high degree of multicollinearity is simply a feature of the sample that contributes to the unreliability of the estimated coefficients, but has no relevance for the conclusions drawn as a result of this unreliability".

This view of collinearity, which seems to me to be entirely straightforward, is nonetheless not widely held, and I have one more communication device which I trust will be decisive. Consider again the two-variable model y = β₁x₁ + β₂x₂ + u and suppose that you have been commissioned to estimate β₁. Suppose, further, that the computer program you are using prints out estimates and standard errors, but neither the covariance between β̂₁ and β̂₂ nor the correlation between x₁ and x₂. Another program which does compute covariances is available for $100. Are you willing to buy it? That is, given the estimate and standard error of β₁, are you willing to bear a cost to find out if the standard error of β₁ is big because of the correlation between x₁ and x₂? The correct answer is yes if there is another source of information about β₁ or β₂ which you wish to use to estimate β₁. Otherwise, the answer is no. The best you can do is to use the given estimate of β₁ and the given standard error. Thus, the interesting aspects of the collinearity problem arise in an information pooling problem. Otherwise, collinearity can be ignored.

Measurements of the collinearity problem generally fall into one of four categories, each of which measures different aspects of the problem.⁴

(1) Measures of the "quality of the design", X′X.
(2) Measures of the usefulness of other information.
(3) Measures of the inappropriate use of multivariate priors.
(4) Measures of the sensitivity of inferences to choice of prior distribution.

The last three of these measures deal with information pooling problems. The first makes no reference to pooling and has the defects just described.

The first set of measures indicates the distance between some ideal design matrix, say V, and the actual design, X′X. Apparently thinking that an ideal design is proportional to the identity matrix, V = vI, many theorists, including Raduchel (1971) and Belsley, Kuh and Welsch (1980), have proposed the condition number of X′X as a measure of the collinearity problem. The condition number is the square root of the ratio of the largest to the smallest eigenvalues. But there is always a parameterization in which the X′X matrix is the identity, and all eigenvalues are equal to one. In fact, the condition number can be made to take on any value greater than or equal to one by suitable choice of parameterization. Aside from the fact that the condition number depends on the parameterization, even if it did not it would be nothing more than a complaint and would not point clearly to any specific remedial action. And, if a complaint is absolutely required, it is much more direct merely to report the standard error of the parameter of interest, and to observe that the standard error would have been smaller if the design were different, in particular if there were less collinearity or more variability, the two being indistinguishable.

⁴One other way of measuring collinearity, proposed by Farrar and Glauber (1967), is to test if the explanatory variables are drawn independently. This proposal has not met with much enthusiasm since, if the design is badly collinear, it is quite irrelevant to issues of inference from the given data.
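A small demonstration that the condition number is a property of the parameterization rather than of the evidence; the design and the reparameterizing transformation below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 100
x1 = rng.normal(size=T)
x2 = 0.98 * x1 + 0.05 * rng.normal(size=T)        # badly "collinear" design
X = np.column_stack([x1, x2])

def condition_number(X):
    eig = np.linalg.eigvalsh(X.T @ X)
    return np.sqrt(eig.max() / eig.min())

print("original parameterization:", round(condition_number(X), 1))

# Reparameterize beta -> A^{-1} beta, X -> X A, choosing A so that (XA)'(XA) = I
A = np.linalg.cholesky(np.linalg.inv(X.T @ X))
Z = X @ A
print("after reparameterization: ", round(condition_number(Z), 6))   # exactly 1
```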

Category (2), in contrast, includes measures which do point to specific remedial action because they identify the value of specific additional information. For example, Leamer (1978, p. 197) suggests the ratio of the conditional standard error of β₁ given β₂, divided by the unconditional standard error, as a measure of the incentive to gather information about β₂ if interest centers on β₁. If the data are orthogonal, that is, X′X is diagonal, then this measure is equal to one. Otherwise it is a number less than one. The usefulness of this kind of measure is limited to settings in which it is possible to imagine that additional information (data-based or subjective) can be gathered.

Measures in category (3) have also been proposed by Leamer (1973), who contrasts Bayesian methods with ad hoc methods of pooling prior information and sample information in multivariate settings. When the design is orthogonal and the prior precision matrix is diagonal, the informal use of prior information is not altogether misleading, but when the design is collinear or the prior covariances are not zero, the pooling of prior and sample information can result in surprising estimates. In particular, the estimates of the issue ψ′β may not lie between the prior mean and the least-squares estimate. Thus, collinearity creates an incentive for careful pooling.

A category (4) measure of collinearity has been proposed by Leamer (1973). If there were a one-dimensional experiment to measure the issue ψ′β with the family of priors Π located at the origin, then the posterior confusion, the difference between the upper and lower expectations, is just |ψ′b|, where b is the least-squares estimate. Because the regression experiment in fact is k-dimensional, the confusion is increased to |max ψ′b** − min ψ′b**|, where b** is constrained to the ellipsoid (2.10). The percentage increase in the confusion due to the dimensionality, |max ψ′b** − min ψ′b**| / |ψ′b|, has been proposed as a collinearity measure and is shown by Chamberlain and Leamer (1976) to be equal to (χ²/z²)¹ᐟ², where χ² is the chi-squared statistic for testing β = 0 and z² is the square of the normal statistic for testing ψ′β = 0.


2.2.4 The degrees-of-freedom deficit

If it is admitted that the degree of a polynomial could be as high as k, then it would usually be admitted that it could be k + 1 as well. A theory which allows k lagged explanatory variables would ordinarily allow k + 1. In fact, I know of no setting in economics in which the list of explanatory variables can be said to be finite. Lists of variables in practice are finite, not because of theoretical belief, but only because of the apparent inferential hazards of degrees-of-freedom deficits. Actually, it is standard practice to increase the dimensionality of the parameter space as the number of observations increases, thereby revealing that an analysis at any given sample size is founded on a parameterization which is misleadingly abbreviated.

The method of abbreviation usually is based on prior information: variables which are not likely to be very important are excluded, unless the researcher is "wealthy" enough to be able to "afford" the luxury of "spending" some of the data evidence on these incidental issues. Since prior information is at the foundation of the method, it is unsurprising that a Bayesian has no special problem in dealing with k in excess of T. In particular, the posterior mean given by eq. (2.9) makes no reference to the invertibility of X′X. The usual practice of restricting attention to subsets of variables with k_J ≤ T can be justified by Theorem 1, which assigns zero weights to any subsets such that |X_J′X_J| is zero.⁵
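A brief sketch of this point: with k > T the matrix X′X is singular and least squares is undefined, yet the posterior mean (σ⁻²X′X + D*)⁻¹σ⁻²X′Y is still well defined; all numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
T, k, sigma2 = 20, 50, 1.0                      # k > T: degrees-of-freedom deficit
X = rng.normal(size=(T, k))
beta = np.zeros(k)
beta[:3] = [1.0, -0.5, 0.25]
y = X @ beta + rng.normal(size=T)

print("rank(X'X) =", np.linalg.matrix_rank(X.T @ X), "of", k)   # singular, OLS undefined

D_star = 4.0 * np.eye(k)                        # prior precision matrix (illustrative)
b_post = np.linalg.solve(X.T @ X / sigma2 + D_star, X.T @ y / sigma2)
print("posterior mean of first five coefficients:", np.round(b_post[:5], 2))
```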

3 Model selection with loss functions

The formal model selection theories presented in Section 2 assume a world in which thinking, information gathering, and computer processing are all errorless and costless. In such a world, exogeneity issues aside, you would explain GNP in terms of all 2000 variables on the NBER data files and millions of other variables as well, including the size of the polar ice cap. Mortals would find the collection of the data and the assessment of priors for such a study to be unacceptably burdensome and would simplify the model before and after having collected the data.

Pre-simplification and post-simplification will refer to decisions made respectively before and after data observation. A statistical decision-theoretic solution to the pre-simplification problem would require us to identify the size of the polar ice cap as a possible variable and to reflect upon its probable importance in the equation in terms of both its regression coefficient and its variability. We might decide not to observe this variable, and thereby to save observation and processing costs, but we would have suffered in the process the intolerable costs of thinking consciously about the millions of variables which might influence GNP.

⁵The degrees-of-freedom deficit does cause special problems for estimating the residual variance σ². It is necessary to make inferences about σ² to pick a point on the contract curve, as well as to describe fully the posterior uncertainty. But if σ² is assigned the Jeffreys diffuse prior, and if β is a priori independent of σ², then the posterior distribution for β has a non-integrable singularity on the

I know of no solution to this dilemma. In practice, one selects "intuitively" a "horizon" within which to optimize. There is no formal way to assure that a given pre-simplification is optimal, and a data analysis must therefore remain an art. Useful formal theories of post-simplification can be constructed, however. It is most convenient to do so in the context of a model which has not been pre-simplified. For that reason, in this section we continue to assume that the researcher faces no costs for complexity until after the model has been estimated. Statistical analysis of pre-simplified models is further discussed in Section 6, which deals with data-instigated hypotheses.

In order to simplify a model it is necessary to identify the purposes for which the model is intended. An ideal model for forecasting will differ from an ideal model for policy evaluation or for teaching purposes. For scientific purposes, simplicity is an important objective of a statistical analysis because simple models can be communicated, understood, and remembered easily. Simplicity thereby greatly facilitates the accumulation of knowledge, both publicly and personally. The word simplicity is properly defined by the benefits it conveys: a simple model is one that can be easily communicated, understood, and remembered. Because these concepts do not lend themselves to general mathematical descriptions, statistical theory usually, and statistical practice often, have sought parsimonious models, with parsimony precisely measured by the number of uncertain parameters in the model. (An input-output model is a simple model but not a parsimonious one.)

Actually, most of the statistical theory which deals with parsimonious models has not sought to identify simple models. Instead, the goal has been an "estimable" model. A model is estimable if it leads to accurate estimates of the parameters of interest. For example, variables may be excluded from a regression equation if the constrained estimators are more accurate than the unconstrained estimators. "Overfitting" is the name of the disease which is thought to be remedied by the omission of variables. In fact, statistical decision theory makes clear that inference ought always to be based on the complete model, and the search for "estimable" models has the appeal but also the pointlessness of the search for the fountain of youth. I do not mean that "overfitting" is not an error. But "overfitting" can be completely controlled by using a proper prior distribution. Actually, I would say that overfitting occurs when your prior is more accurately approximated by setting parameters to zero than by assigning them the improper uniform prior.

The framework within which we will be operating in Section 3 is the following. The problem is to estimate β given the data Y with estimator β̂(Y). The loss incurred in selecting an inaccurate estimate is

loss = L(β, β̂).

The risk function is the expected loss conditional on β:

R(β, β̂) = E[L(β, β̂(Y)) | β].

The estimator β̂ is said to be inadmissible if there exists another estimator β̂* with uniformly smaller risk:

R(β, β̂*) ≤ R(β, β̂) for all β,

with strict inequality for at least one β. Otherwise, β̂ is admissible. A Bayes estimator is found by minimizing expected posterior loss:

min over β̂ of E[L(β, β̂) | Y].

Because the Bayes estimator is computed from the complete model, whatever the loss function, there is a prima facie case against model selection procedures for this problem.

3.1 Model selection with quadratic loss

A huge literature has been built on the supposition that quadratic loss implies a model selection problem. The expected squared difference between an estimator θ̂ and the true value θ, the mean-squared-error, can be written as the variance plus the square of the bias:

E[(θ̂ − θ)²] = var(θ̂) + bias²(θ̂).

This seems to suggest that a constrained least-squares estimator might be better than the unconstrained estimator, since although the constrained estimator is biased it also has smaller variance. Of course, the constrained estimator will do
