13.1 Introduction
This chapter contains a general treatment of maximum likelihood estimation (MLE) under random sampling. All the models we considered in Part I could be estimated without making full distributional assumptions about the endogenous variables conditional on the exogenous variables: maximum likelihood methods were not needed. Instead, we focused primarily on zero-covariance and zero-conditional-mean assumptions, and secondarily on assumptions about conditional variances and covariances. These assumptions were sufficient for obtaining consistent, asymptotically normal estimators, some of which were shown to be efficient within certain classes of estimators.
Some texts on advanced econometrics take maximum likelihood estimation as the unifying theme, and then most models are estimated by maximum likelihood. In addition to providing a unified approach to estimation, MLE has some desirable efficiency properties: it is generally the most efficient estimation procedure in the class of estimators that use information on the distribution of the endogenous variables given the exogenous variables. (We formalize the efficiency of MLE in Section 14.5.) So why not always use MLE?
As we saw in Part I, efficiency usually comes at the price of nonrobustness, and this is certainly the case for maximum likelihood. Maximum likelihood estimators are generally inconsistent if some part of the specified distribution is misspecified. As an example, consider from Section 9.5 a simultaneous equations model that is linear in its parameters but nonlinear in some endogenous variables. There, we discussed estimation by instrumental variables methods. We could estimate SEMs nonlinear in endogenous variables by maximum likelihood if we assumed independence between the structural errors and the exogenous variables and if we assumed a particular distribution for the structural errors, say, multivariate normal. The MLE would be asymptotically more efficient than the best GMM estimator, but failure of normality generally results in inconsistent estimators of all parameters.
As a second example, suppose we wish to estimate $E(y \mid x)$, where $y$ is bounded between zero and one. The logistic function, $\exp(x\beta)/[1 + \exp(x\beta)]$, is a reasonable model for $E(y \mid x)$, and, as we discussed in Section 12.2, nonlinear least squares provides consistent, $\sqrt{N}$-asymptotically normal estimators under weak regularity conditions. We can easily make inference robust to arbitrary heteroskedasticity in $\mathrm{Var}(y \mid x)$. An alternative approach is to model the density of $y$ given $x$ (which, of course, implies a particular model for $E(y \mid x)$) and use maximum likelihood estimation. As we will see, the strength of MLE is that, under correct specification of the density, we would have the asymptotically efficient estimators, and we would be able to estimate any feature of the conditional distribution, such as $P(y = 1 \mid x)$. The drawback is that, except in special cases, if we have misspecified the density in any way, we will not be able to consistently estimate the conditional mean.
In most applications, specifying the distribution of the endogenous variables conditional on exogenous variables must have a component of arbitrariness, as economic theory rarely provides guidance. Our perspective is that, for robustness reasons, it is desirable to make as few assumptions as possible, at least until relaxing them becomes practically difficult. There are cases in which MLE turns out to be robust to failure of certain assumptions, but these must be examined on a case-by-case basis, a process that detracts from the unifying theme provided by the MLE approach. (One such example is nonlinear regression under a homoskedastic normal assumption; the MLE of the parameters $\beta_o$ is identical to the NLS estimator, and we know the latter is consistent and asymptotically normal quite generally. We will cover some other leading cases in Chapter 19.)
Maximum likelihood plays an important role in modern econometric analysis, for good reason. There are many problems for which it is indispensable. For example, in Chapters 15 and 16 we study various limited dependent variable models, and MLE plays a central role.
13.2 Preliminaries and Examples
Traditional maximum likelihood theory for independent, identically distributed observations $\{y_i \in \mathbb{R}^G: i = 1, 2, \ldots\}$ starts by specifying a family of densities for $y_i$. This is the framework used in introductory statistics courses, where $y_i$ is a scalar with a normal or Poisson distribution. But in almost all economic applications, we are interested in estimating parameters in conditional distributions. Therefore, we assume that each random draw is partitioned as $(x_i, y_i)$, where $x_i \in \mathbb{R}^K$ and $y_i \in \mathbb{R}^G$, and we are interested in estimating a model for the conditional distribution of $y_i$ given $x_i$. We are not interested in the distribution of $x_i$, so we will not specify a model for it. Consequently, the method of this chapter is properly called conditional maximum likelihood estimation (CMLE). By taking $x_i$ to be null we cover unconditional MLE as a special case.
An alternative to viewing $(x_i, y_i)$ as a random draw from the population is to treat the conditioning variables $x_i$ as nonrandom vectors that are set ahead of time and that appear in the unconditional distribution of $y_i$. (This is analogous to the fixed regressor assumption in classical regression analysis.) Then, the $y_i$ cannot be identically distributed, and this fact complicates the asymptotic analysis. More importantly, treating the $x_i$ as nonrandom is much too restrictive for all uses of maximum likelihood. In fact, later on we will cover methods where $x_i$ contains what are endogenous variables in a structural model, but where it is convenient to obtain the distribution of one set of endogenous variables conditional on another set. Once we know how to analyze the general CMLE case, applications follow fairly directly.

It is important to understand that the subsequent results apply any time we have random sampling in the cross section dimension. Thus, the general theory applies to system estimation, as in Chapters 7 and 9, provided we are willing to assume a distribution for $y_i$ given $x_i$. In addition, panel data settings with large cross sections and relatively small time periods are encompassed, since the appropriate asymptotic analysis is with the time dimension fixed and the cross section dimension tending to infinity.
In order to perform maximum likelihood analysis we need to specify, or derive from an underlying (structural) model, the density of $y_i$ given $x_i$. We assume this density is known up to a finite number of unknown parameters, with the result that we have a parametric model of a conditional density. The vector $y_i$ can be continuous or discrete, or it can have both discrete and continuous characteristics. In many of our applications, $y_i$ is a scalar, but this fact does not simplify the general treatment.
We will carry along two examples in this chapter to illustrate the general theory of conditional maximum likelihood. The first example is a binary response model, specifically the probit model. We postpone the uses and interpretation of binary response models until Chapter 15.
Example 13.1 (Probit): Suppose that the latent variable $y_i^*$ follows

$$y_i^* = x_i\theta + e_i \qquad (13.1)$$

where $e_i$ is independent of $x_i$ (which is a $1 \times K$ vector with first element equal to unity for all $i$), $\theta$ is a $K \times 1$ vector of parameters, and $e_i \sim \mathrm{Normal}(0, 1)$. Instead of observing $y_i^*$ we observe only the binary variable $y_i = 1[y_i^* > 0]$, which indicates whether $y_i^*$ is positive.
We can easily obtain the distribution of $y_i$ given $x_i$:

$$P(y_i = 1 \mid x_i) = P(y_i^* > 0 \mid x_i) = P(x_i\theta + e_i > 0 \mid x_i) = P(e_i > -x_i\theta \mid x_i) = 1 - \Phi(-x_i\theta) = \Phi(x_i\theta) \qquad (13.4)$$

where $\Phi(\cdot)$ denotes the standard normal cumulative distribution function (cdf). We have used Property CD.4 in the chapter appendix along with the symmetry of the normal distribution. Therefore,

$$P(y_i = 0 \mid x_i) = 1 - \Phi(x_i\theta) \qquad (13.5)$$

We can combine equations (13.4) and (13.5) into the density of $y_i$ given $x_i$:

$$f(y \mid x_i) = [\Phi(x_i\theta)]^y[1 - \Phi(x_i\theta)]^{1-y}, \qquad y = 0, 1 \qquad (13.6)$$

The fact that $f(y \mid x_i)$ is zero when $y \notin \{0, 1\}$ is obvious, so we will not be explicit about this in the future.
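As a concrete illustration of equation (13.6), the probit conditional density can be evaluated directly from the standard normal cdf. The following sketch is not from the text; the data row `x`, the parameter vector `theta`, and their values are purely hypothetical, and `scipy.stats.norm` supplies $\Phi(\cdot)$.

```python
import numpy as np
from scipy.stats import norm

def probit_density(y, x, theta):
    """Evaluate f(y | x) = Phi(x*theta)**y * [1 - Phi(x*theta)]**(1 - y), y in {0, 1}.

    x is a 1 x K row (first element unity) and theta is K x 1, matching equation (13.6);
    both are illustrative placeholders, not objects defined in the text.
    """
    p = norm.cdf(x @ theta)            # P(y = 1 | x) = Phi(x*theta), equation (13.4)
    return p**y * (1.0 - p)**(1 - y)

# Example with made-up values: response probabilities for y = 1 and y = 0
x = np.array([1.0, 0.5, -1.2])         # hypothetical 1 x K observation with intercept
theta = np.array([0.2, 0.7, 0.3])      # hypothetical K x 1 parameter vector
print(probit_density(1, x, theta))     # P(y = 1 | x)
print(probit_density(0, x, theta))     # P(y = 0 | x) = 1 - P(y = 1 | x)
```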
Our second example is useful when the variable to be explained takes on nonnegative integer values. Such a variable is called a count variable. We will discuss the use and interpretation of count data models in Chapter 19. For now, it suffices to note that a linear model for $E(y \mid x)$ when $y$ takes on nonnegative integer values is not ideal because it can lead to negative predicted values. Further, since $y$ can take on the value zero with positive probability, the transformation $\log(y)$ cannot be used to obtain a model with constant elasticities or constant semielasticities. A functional form well suited for $E(y \mid x)$ is $\exp(x\theta)$. We could estimate $\theta$ by using nonlinear least squares, but all of the standard distributions for count variables imply heteroskedasticity (see Chapter 19). Thus, we can hope to do better. A traditional approach to regression models with count data is to assume that $y_i$ given $x_i$ has a Poisson distribution.
Example 13.2 (Poisson Regression): Let $y_i$ be a nonnegative count variable; that is, $y_i$ can take on the integer values $0, 1, 2, \ldots$. Denote the conditional mean of $y_i$ given the vector $x_i$ as $E(y_i \mid x_i) = m(x_i)$. A natural distribution for $y_i$ given $x_i$ is the Poisson distribution:

$$f(y \mid x_i) = \exp[-m(x_i)]\{m(x_i)\}^y / y!, \qquad y = 0, 1, 2, \ldots \qquad (13.7)$$

(We use $y$ as the dummy argument in the density, not to be confused with the random variable $y_i$.) Once we choose a form for the conditional mean function, we have completely determined the distribution of $y_i$ given $x_i$. For example, from equation (13.7), $P(y = 0 \mid x) = \exp[-m(x)]$. An important feature of the Poisson distribution is that the variance equals the mean: $\mathrm{Var}(y_i \mid x_i) = E(y_i \mid x_i) = m(x_i)$. The usual choice for $m(\cdot)$ is $m(x) = \exp(x\theta)$, where $\theta$ is $K \times 1$ and $x$ is $1 \times K$ with first element unity.
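To make equation (13.7) concrete, the following sketch (not part of the original text) evaluates the Poisson conditional density with the exponential mean function $m(x) = \exp(x\theta)$; the data and parameter values are hypothetical.

```python
import numpy as np
from math import factorial

def poisson_density(y, x, theta):
    """Evaluate f(y | x) = exp(-m) * m**y / y! with m = exp(x*theta), as in (13.7)."""
    m = np.exp(x @ theta)                     # conditional mean E(y | x) = exp(x*theta)
    return np.exp(-m) * m**y / factorial(y)

x = np.array([1.0, 0.3])        # hypothetical 1 x K row with first element unity
theta = np.array([0.1, 0.5])    # hypothetical K x 1 parameter vector
# P(y = 0 | x) = exp(-m); the implied conditional variance equals the mean m
print(poisson_density(0, x, theta), np.exp(-np.exp(x @ theta)))
```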
13.3 General Framework for Conditional MLE
Let $p_o(y \mid x)$ denote the conditional density of $y_i$ given $x_i = x$, where $y$ and $x$ are dummy arguments. We index this density by "o" to emphasize that it is the true density of $y_i$ given $x_i$, and not just one of many candidates. It will be useful to let $\mathcal{X} \subset \mathbb{R}^K$ denote the possible values for $x_i$ and $\mathcal{Y}$ denote the possible values of $y_i$; $\mathcal{X}$ and $\mathcal{Y}$ are called the supports of the random vectors $x_i$ and $y_i$, respectively.
For a general treatment, we assume that, for all $x \in \mathcal{X}$, $p_o(\cdot \mid x)$ is a density with respect to a $\sigma$-finite measure, denoted $\nu(dy)$. Defining a $\sigma$-finite measure would take us too far afield. We will say little more about the measure $\nu(dy)$ because it does not play a crucial role in applications. It suffices to know that $\nu(dy)$ can be chosen to allow $y_i$ to be discrete, continuous, or some mixture of the two. When $y_i$ is discrete, the measure $\nu(dy)$ simply turns all integrals into sums; when $y_i$ is purely continuous, we obtain the usual Riemann integrals. Even in more complicated cases (where, say, $y_i$ has both discrete and continuous characteristics) we can get by with tools from basic probability without ever explicitly defining $\nu(dy)$. For more on measures and general integrals, you are referred to Billingsley (1979) and Davidson (1994, Chapters 3 and 4).
In Chapter 12 we saw how nonlinear least squares can be motivated by the fact that $m_o(x) \equiv E(y \mid x)$ minimizes $E\{[y - m(x)]^2\}$ over all functions $m(x)$ with $E\{[m(x)]^2\} < \infty$. Conditional maximum likelihood has a similar motivation. The result from probability that is crucial for applying the analogy principle is the conditional Kullback-Leibler information inequality. Although there are more general statements of this inequality, the following suffices for our purpose: for any nonnegative function $f(\cdot \mid x)$ such that

$$\int_{\mathcal{Y}} f(y \mid x)\, \nu(dy) = 1, \qquad x \in \mathcal{X} \qquad (13.8)$$

Property CD.1 in the chapter appendix implies that

$$K(f, x) \equiv \int_{\mathcal{Y}} \log\!\left[\frac{p_o(y \mid x)}{f(y \mid x)}\right] p_o(y \mid x)\, \nu(dy) \geq 0, \qquad x \in \mathcal{X} \qquad (13.9)$$
We can apply inequality (13.9) to a parametric model for $p_o(\cdot \mid x)$,

$$\{f(\cdot \mid x; \theta): \theta \in \Theta\}, \qquad \Theta \subset \mathbb{R}^P, \quad x \in \mathcal{X} \qquad (13.10)$$

which we assume satisfies condition (13.8) for each $x \in \mathcal{X}$ and each $\theta \in \Theta$; if it does not, then $f(\cdot \mid x; \theta)$ does not integrate to unity (with respect to the measure $\nu$), and as a result it is a very poor candidate for $p_o(y \mid x)$. Model (13.10) is a correctly specified model of the conditional density, $p_o(\cdot \mid \cdot)$, if, for some $\theta_o \in \Theta$,

$$f(\cdot \mid x; \theta_o) = p_o(\cdot \mid x), \qquad \text{all } x \in \mathcal{X} \qquad (13.11)$$
As we discussed in Chapter 12, it is useful to use $\theta_o$ to distinguish the true value of the parameter from a generic element of $\Theta$. In particular examples, we will not bother making this distinction unless it is needed to make a point.
For each $x \in \mathcal{X}$, $K(f, x)$ can be written as $E\{\log[p_o(y_i \mid x_i)] \mid x_i = x\} - E\{\log[f(y_i \mid x_i)] \mid x_i = x\}$. Therefore, if the parametric model is correctly specified, then $E\{\log[f(y_i \mid x_i; \theta_o)] \mid x_i\} \geq E\{\log[f(y_i \mid x_i; \theta)] \mid x_i\}$, or

$$E[\ell_i(\theta_o) \mid x_i] \geq E[\ell_i(\theta) \mid x_i], \qquad \theta \in \Theta \qquad (13.12)$$

where

$$\ell_i(\theta) \equiv \ell(y_i, x_i; \theta) \equiv \log f(y_i \mid x_i; \theta) \qquad (13.13)$$

is the conditional log likelihood for observation $i$. Note that $\ell_i(\theta)$ is a random function of $\theta$, since it depends on the random vector $(x_i, y_i)$. By taking the expected value of expression (13.12) and using iterated expectations, we see that $\theta_o$ solves

$$\max_{\theta \in \Theta} E[\ell_i(\theta)] \qquad (13.14)$$

The sample analogue of problem (13.14), obtained by replacing the population expectation with the sample average, is

$$\max_{\theta \in \Theta} N^{-1}\sum_{i=1}^N \ell_i(\theta) \qquad (13.15)$$
A solution to problem (13.15), assuming that one exists, is the conditional maximum likelihood estimator (CMLE) of $\theta_o$, which we denote as $\hat{\theta}$. We will sometimes drop "conditional" when it is not needed for clarity.
The CMLE is clearly an M-estimator, since a maximization problem is easily turned into a minimization problem: in the notation of Chapter 12, take $w_i \equiv (x_i, y_i)$ and $q(w_i, \theta) \equiv -\log f(y_i \mid x_i; \theta)$. As long as we keep track of the minus sign in front of the log likelihood, we can apply the results in Chapter 12 directly.
The motivation for the conditional MLE as a solution to problem (13.15) may appear backward if you learned about maximum likelihood estimation in an introductory statistics course. In a traditional framework, we would treat the $x_i$ as constants appearing in the distribution of $y_i$, and we would define $\hat{\theta}$ as the solution to

$$\max_{\theta \in \Theta} \sum_{i=1}^N \log f(y_i \mid x_i; \theta)$$

Because the $x_i$ are treated as nonrandom, arguments for why this approach delivers a consistent estimator of $\theta_o$ are necessarily heuristic. By contrast, the analogy principle applies directly to problem (13.15), and we need not assume that the $x_i$ are fixed.
In our two examples, the conditional log likelihoods are fairly simple.

Example 13.1 (continued): In the probit example, the log likelihood for observation $i$ is $\ell_i(\theta) = y_i \log \Phi(x_i\theta) + (1 - y_i)\log[1 - \Phi(x_i\theta)]$.

Example 13.2 (continued): In the Poisson example, $\ell_i(\theta) = -\exp(x_i\theta) + y_i x_i\theta - \log(y_i!)$. Normally, we would drop the last term in defining $\ell_i(\theta)$ because it does not affect the maximization problem.
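Because the CMLE is an M-estimator, it can be computed by numerically maximizing the sum of the conditional log likelihoods. The sketch below is only an illustration, not code from the text: it fits the probit model of Example 13.1 by minimizing the negative log likelihood with `scipy.optimize.minimize`, and the simulated data and true parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # first column unity
theta_true = np.array([0.5, -1.0, 0.75])                         # illustrative values
y = (X @ theta_true + rng.normal(size=N) > 0).astype(float)      # simulated probit data

def neg_loglik(theta, y, X):
    """Negative probit log likelihood, -sum_i l_i(theta)."""
    p = np.clip(norm.cdf(X @ theta), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_loglik, x0=np.zeros(K), args=(y, X), method="BFGS")
theta_hat = res.x   # the conditional MLE, a solution to problem (13.15)
print(theta_hat)
```

The Poisson log likelihood of Example 13.2 can be maximized in exactly the same way by swapping in the Poisson expression for `neg_loglik`.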
13.4 Consistency of Conditional MLE
In this section we state a formal consistency result for the CMLE, which is a special case of the M-estimator consistency result, Theorem 12.2.
Theorem 13.1 (Consistency of CMLE): Let $\{(x_i, y_i): i = 1, 2, \ldots\}$ be a random sample with $x_i \in \mathcal{X} \subset \mathbb{R}^K$, $y_i \in \mathcal{Y} \subset \mathbb{R}^G$. Let $\Theta \subset \mathbb{R}^P$ be the parameter set and denote the parametric model of the conditional density as $\{f(\cdot \mid x; \theta): x \in \mathcal{X}, \theta \in \Theta\}$. Assume that (a) $f(\cdot \mid x; \theta)$ is a true density with respect to the measure $\nu(dy)$ for all $x$ and $\theta$, so that condition (13.8) holds; (b) for some $\theta_o \in \Theta$, $p_o(\cdot \mid x) = f(\cdot \mid x; \theta_o)$, all $x \in \mathcal{X}$, and $\theta_o$ is the unique solution to problem (13.14); (c) $\Theta$ is a compact set; (d) for each $\theta \in \Theta$, $\ell(\cdot; \theta)$ is a Borel measurable function on $\mathcal{Y} \times \mathcal{X}$; (e) for each $(y, x) \in \mathcal{Y} \times \mathcal{X}$, $\ell(y, x; \cdot)$ is a continuous function on $\Theta$; and (f) $|\ell(w; \theta)| \leq b(w)$, all $\theta \in \Theta$, and $E[b(w)] < \infty$. Then there exists a solution to problem (13.15), the CMLE $\hat{\theta}$, and $\mathrm{plim}\ \hat{\theta} = \theta_o$.
As we discussed in Chapter 12, the measurability assumption in part d is purely technical and does not need to be checked in practice. Compactness of $\Theta$ can be relaxed, but doing so usually requires considerable work. The continuity assumption holds in most econometric applications, but there are cases where it fails, such as when estimating certain models of auctions; see Donald and Paarsch (1996). The moment assumption in part f typically restricts the distribution of $x_i$ in some way, but such restrictions are rarely a serious concern. For the most part, the key assumptions are that the parametric model is correctly specified, that $\theta_o$ is identified, and that the log-likelihood function is continuous in $\theta$.
For the probit and Poisson examples, the log likelihoods are clearly continuous in $\theta$. We can verify the moment condition (f) if we bound certain moments of $x_i$ and make the parameter space compact. But our primary concern is that densities are correctly specified. For example, in the probit case, the density for $y_i$ given $x_i$ will be incorrect if the latent error $e_i$ is not independent of $x_i$ and normally distributed, or if the latent variable model is not linear to begin with. For identification we must rule out perfect collinearity in $x_i$. The Poisson CMLE turns out to have desirable properties even if the Poisson distributional assumption does not hold, but we postpone a discussion of the robustness of the Poisson CMLE until Chapter 19.
13.5 Asymptotic Normality and Asymptotic Variance Estimation
Under the differentiability and moment assumptions that allow us to apply the theorems in Chapter 12, we can show that the MLE is generally asymptotically normal. Naturally, the computational methods discussed in Section 12.7, including concentrating parameters out of the log likelihood, apply directly.
the-13.5.1 Asymptotic Normality
We can derive the limiting distribution of the MLE by applying Theorem 12.3. We will have to assume the regularity conditions there; in particular, we assume that $\theta_o$ is in the interior of $\Theta$, and $\ell_i(\theta)$ is twice continuously differentiable on the interior of $\Theta$. The score of the log likelihood for observation $i$ is simply the $P \times 1$ vector of partial derivatives of $\ell_i(\theta)$:

$$s_i(\theta) \equiv \nabla_\theta \ell_i(\theta)' = \left[\frac{\partial \ell_i(\theta)}{\partial \theta_1}, \frac{\partial \ell_i(\theta)}{\partial \theta_2}, \ldots, \frac{\partial \ell_i(\theta)}{\partial \theta_P}\right]' \qquad (13.17)$$

Example 13.1 (continued): Differentiating the probit log likelihood with respect to $\theta$ gives the score

$$s_i(\theta) = \frac{\phi(x_i\theta)x_i'[y_i - \Phi(x_i\theta)]}{\Phi(x_i\theta)[1 - \Phi(x_i\theta)]} \qquad (13.18)$$

where $\phi(\cdot)$ denotes the standard normal density.
Example 13.2 (continued): The score for the Poisson case, where $\theta$ is again $K \times 1$, is

$$s_i(\theta) = -\exp(x_i\theta)x_i' + y_ix_i' = x_i'[y_i - \exp(x_i\theta)] \qquad (13.19)$$
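Analytical score expressions such as these can be checked against a numerical gradient of the log likelihood. The sketch below (not from the text, with hypothetical inputs) verifies the probit score using `scipy.optimize.approx_fprime`; the Poisson score can be checked the same way.

```python
import numpy as np
from scipy.optimize import approx_fprime
from scipy.stats import norm

def probit_loglik(theta, y, x):
    p = norm.cdf(x @ theta)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

def probit_score(theta, y, x):
    """Analytical probit score: x' * phi(x theta) * [y - Phi(x theta)] / {Phi * (1 - Phi)}."""
    xb = x @ theta
    return x * norm.pdf(xb) * (y - norm.cdf(xb)) / (norm.cdf(xb) * (1 - norm.cdf(xb)))

x = np.array([1.0, -0.4, 0.9])        # hypothetical 1 x K observation
theta = np.array([0.3, 0.5, -0.2])    # hypothetical parameter value
y = 1.0

numerical = approx_fprime(theta, probit_loglik, 1e-7, y, x)
print(np.allclose(numerical, probit_score(theta, y, x), atol=1e-4))  # expect True
```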
In the vast majority of cases, the score of the log-likelihood function has an important zero conditional mean property:

$$E[s_i(\theta_o) \mid x_i] = 0 \qquad (13.20)$$

In other words, when we evaluate the $P \times 1$ score at $\theta_o$, and take its expectation with respect to $f(\cdot \mid x_i; \theta_o)$, the expectation is zero. Under condition (13.20), $E[s_i(\theta_o)] = 0$, which was a key condition in deriving the asymptotic normality of the M-estimator in Chapter 12. To see why condition (13.20) holds quite generally, write the conditional expectation of the score under the density $f(\cdot \mid x_i; \theta)$ as

$$E_\theta[s_i(\theta) \mid x_i] = \int_{\mathcal{Y}} s(y, x_i; \theta) f(y \mid x_i; \theta)\, \nu(dy)$$

If integration and differentiation can be interchanged on $\mathrm{int}(\Theta)$, that is, if

$$\nabla_\theta \int_{\mathcal{Y}} f(y \mid x_i; \theta)\, \nu(dy) = \int_{\mathcal{Y}} \nabla_\theta f(y \mid x_i; \theta)\, \nu(dy) \qquad (13.21)$$

for all $x_i \in \mathcal{X}$, $\theta \in \mathrm{int}(\Theta)$, then the left-hand side of equation (13.21) is identically zero, because $\int_{\mathcal{Y}} f(y \mid x_i; \theta)\, \nu(dy) = 1$ for all $\theta$. Writing $\nabla_\theta f(y \mid x_i; \theta) = s(y, x_i; \theta)'f(y \mid x_i; \theta)$ shows that the right-hand side of equation (13.21) equals $E_\theta[s_i(\theta) \mid x_i]'$, which must therefore be zero for all $\theta \in \mathrm{int}(\Theta)$; setting $\theta = \theta_o$ gives condition (13.20).
Example 13.2 (continued): Define $u_i \equiv y_i - \exp(x_i\theta_o)$. Then $s_i(\theta_o) = x_i'u_i$ and so $E[s_i(\theta_o) \mid x_i] = 0$.
Assuming that $\ell_i(\theta)$ is twice continuously differentiable on the interior of $\Theta$, let the Hessian for observation $i$ be the $P \times P$ matrix of second partial derivatives of $\ell_i(\theta)$,

$$H_i(\theta) \equiv \nabla^2_\theta \ell_i(\theta) \qquad (13.23)$$

and define

$$A_o \equiv -E[H_i(\theta_o)] \qquad (13.24)$$

which is generally a positive definite matrix when $\theta_o$ is identified. Under standard regularity conditions, the asymptotic normality of the CMLE follows from Theorem 12.3: $\sqrt{N}(\hat{\theta} - \theta_o) \stackrel{a}{\sim} \mathrm{Normal}(0, A_o^{-1}B_oA_o^{-1})$, where $B_o \equiv \mathrm{Var}[s_i(\theta_o)] \equiv E[s_i(\theta_o)s_i(\theta_o)']$. It turns out that this general form of the asymptotic variance matrix is too complicated. We now show that $B_o = A_o$.
compli-We must assume enough smoothness such that the following interchange of gral and derivative is valid (see Newey and McFadden, 1994, Section 5.1, for the case
Y
‘y½siðy Þf ðy j xi;yÞnðdyÞ ð13:25Þ
Then, taking the derivative of the identity
ð
Y
siðy Þf ðy j xi;y ÞnðdyÞ 1 Ey½siðy Þ j xi ¼ 0; y A intðYÞ
and using equation (13.25), gives, for all y A intðYÞ,
Ey½Hiðy Þ j xi ¼ Vary½siðy Þ j xi
where the indexing by y denotes expectation and variance when fð j xi;yÞ is thedensity of yigiven xi When evaluated at y ¼ yowe get a very important equality:
E½HiðyoÞ j xi ¼ E½siðyoÞsiðyoÞ0j xi ð13:26Þwhere the expectation and variance are with respect to the true conditional distri-bution of yi given xi Equation (13.26) is called the conditional information matrixequality (CIME) Taking the expectation of equation (13.26) (with respect to the
Trang 11distribution of xi) and using the law of iterated expectations gives
or Ao¼ Bo This relationship is best thought of as the unconditional informationmatrix equality (UIME)
Theorem 13.2 (Asymptotic Normality of CMLE): Let the conditions of Theorem 13.1 hold. In addition, assume that (a) $\theta_o \in \mathrm{int}(\Theta)$; (b) for each $(y, x) \in \mathcal{Y} \times \mathcal{X}$, $\ell(y, x; \cdot)$ is twice continuously differentiable on $\mathrm{int}(\Theta)$; (c) the interchanges of derivative and integral in equations (13.21) and (13.25) hold for all $\theta \in \mathrm{int}(\Theta)$; (d) the elements of $\nabla^2_\theta \ell(y, x; \theta)$ are bounded in absolute value by a function $b(y, x)$ with finite expectation; and (e) $A_o$ defined by expression (13.24) is positive definite. Then

$$\sqrt{N}(\hat{\theta} - \theta_o) \stackrel{a}{\sim} \mathrm{Normal}(0, A_o^{-1})$$
In most applications these regularity conditions are straightforward to verify. They do rule out some important situations, however, such as when the log likelihood is not differentiable in the parameters or when the estimator does not converge at the rate $\sqrt{N}$, as can happen when the support of the distribution depends on unknown parameters. Some progress has been made for specific models when the support of the distribution depends on unknown parameters; see, for example, Donald and Paarsch (1996).
13.5.2 Estimating the Asymptotic Variance
Estimating $\mathrm{Avar}(\hat{\theta})$ requires estimating $A_o$. From the equalities derived previously, there are at least three possible estimators of $A_o$ in the CMLE context. In fact, under slight extensions of the regularity conditions in Theorem 13.2, each of the matrices

$$N^{-1}\sum_{i=1}^N -H_i(\hat{\theta}), \qquad N^{-1}\sum_{i=1}^N s_i(\hat{\theta})s_i(\hat{\theta})', \qquad N^{-1}\sum_{i=1}^N A(x_i, \hat{\theta})$$

converges in probability to $A_o$, where

$$A(x_i, \theta_o) \equiv -E[H(y_i, x_i; \theta_o) \mid x_i] \qquad (13.31)$$

Thus, $\widehat{\mathrm{Avar}}(\hat{\theta})$ can be taken to be any of the three matrices

$$\left[\sum_{i=1}^N -H_i(\hat{\theta})\right]^{-1}, \qquad \left[\sum_{i=1}^N s_i(\hat{\theta})s_i(\hat{\theta})'\right]^{-1}, \qquad \left[\sum_{i=1}^N A(x_i, \hat{\theta})\right]^{-1} \qquad (13.32)$$

The first matrix, based on the Hessian, need not be positive definite in a particular sample; when it is not, standard errors of some linear combinations of the parameters will not be well defined.
The second estimator in equation (13.32), based on the outer product of the score, is always positive definite (whenever the inverse exists). This simple estimator was proposed by Berndt, Hall, Hall, and Hausman (1974). Its primary drawback is that it can be poorly behaved in even moderate sample sizes, as we discussed in Section 12.6.2.
If the conditional expectation $A(x_i, \theta_o)$ is in closed form (as it is in some leading cases) or can be simulated, as discussed in Porter (1999), then the estimator based on $A(x_i, \hat{\theta})$ has some attractive features. First, it often depends only on first derivatives of a conditional mean or conditional variance function. Second, it is positive definite when it exists because of the conditional information matrix equality (13.26). Third, this estimator has been found to have significantly better finite sample properties than the outer product of the score estimator in some situations where $A(x_i, \theta_o)$ can be obtained in closed form.
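The estimators in equation (13.32) are easy to compare in the Poisson example, where the Hessian, the outer product of the score, and the expected Hessian are all available in closed form. The sketch below is illustrative only; the data arrays and the estimate `theta_hat` are assumed to have been produced beforehand (for example by a numerical optimizer) and are not objects defined in the text.

```python
import numpy as np

def poisson_avar_estimates(theta_hat, y, X):
    """Estimates of Avar(theta_hat) for the Poisson model with exponential mean.

    For this model H_i(theta) = -exp(x_i theta) x_i' x_i, so the Hessian form and the
    expected Hessian form coincide; the outer product form uses s_i s_i'.
    """
    m = np.exp(X @ theta_hat)                            # fitted means exp(x_i theta_hat)
    u = y - m                                            # residuals y_i - exp(x_i theta_hat)
    hessian_sum = (X * m[:, None]).T @ X                 # sum_i exp(x_i theta) x_i' x_i
    score_outer = (X * u[:, None]).T @ (X * u[:, None])  # sum_i s_i s_i'
    avar_hessian = np.linalg.inv(hessian_sum)            # Hessian / expected Hessian form
    avar_outer = np.linalg.inv(score_outer)              # outer-product-of-the-score form
    return avar_hessian, avar_outer

# Usage (hypothetical): standard errors are square roots of the diagonal elements, e.g.
# se = np.sqrt(np.diag(poisson_avar_estimates(theta_hat, y, X)[0]))
```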
deriva-Example 13.1 (continued): The Hessian for the probit log-likelihood is a mess.Fortunately, E½HiðyoÞ j xi has a fairly simple form Taking the derivative of equation(13.18) and using the product rule gives
Trang 13ixi In this example, the Hessian does not depend on yi,
so there is no distinction between HiðyoÞ and E½HiðyoÞ j xi The positive definite timate of Avaˆrð ^yÞ is simply
The three tests covered in Chapter 12 are immediately applicable to the MLE case. Since the information matrix equality holds when the density is correctly specified, we need only consider the simplest forms of the test statistics. The Wald statistic is given in equation (12.63), and the conditions sufficient for it to have a limiting chi-square distribution are discussed in Section 12.6.1.
Define the log-likelihood function for the entire sample by $\mathcal{L}(\theta) \equiv \sum_{i=1}^N \ell_i(\theta)$. Let $\hat{\theta}$ be the unrestricted estimator, and let $\tilde{\theta}$ be the estimator with the $Q$ nonredundant constraints imposed. Then, under the regularity conditions discussed in Section 12.6.3, the likelihood ratio (LR) statistic,

$$LR \equiv 2[\mathcal{L}(\hat{\theta}) - \mathcal{L}(\tilde{\theta})]$$

is distributed asymptotically as $\chi^2_Q$ under $H_0$. As with the Wald statistic, we cannot use $LR$ as approximately $\chi^2_Q$ when $\theta_o$ is on the boundary of the parameter set. The LR statistic is very easy to compute once the restricted and unrestricted models have been estimated, and the LR statistic is invariant to reparameterizing the conditional density.
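Computationally, the LR statistic is just twice the gap between the two maximized log likelihoods. A minimal sketch (not from the text), assuming the maximized unrestricted and restricted sample log likelihoods have already been obtained:

```python
from scipy.stats import chi2

def lr_statistic(loglik_unrestricted, loglik_restricted, Q):
    """LR = 2[L(theta_hat) - L(theta_tilde)] and its asymptotic chi-square(Q) p-value.

    loglik_unrestricted and loglik_restricted are the maximized sample log likelihoods
    from the unrestricted and restricted estimations; Q is the number of nonredundant
    restrictions imposed under the null.
    """
    LR = 2.0 * (loglik_unrestricted - loglik_restricted)
    return LR, chi2.sf(LR, df=Q)
```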
The score or LM test is based on the restricted estimation only. Let $s_i(\tilde{\theta})$ be the $P \times 1$ score of $\ell_i(\theta)$ evaluated at the restricted estimates $\tilde{\theta}$. That is, we compute the partial derivatives of $\ell_i(\theta)$ with respect to each of the $P$ parameters, but then we evaluate this vector of partials at the restricted estimates. Then, from Section 12.6.2 and the information matrix equality, the statistics
$$LM = \left[\sum_{i=1}^N s_i(\tilde{\theta})\right]'\left[\sum_{i=1}^N \tilde{A}_i\right]^{-1}\left[\sum_{i=1}^N s_i(\tilde{\theta})\right]$$

are distributed asymptotically as $\chi^2_Q$ under $H_0$, where $\tilde{A}_i$ can be any of the three estimates of $-E[H_i(\theta_o) \mid x_i]$ from Section 13.5.2, evaluated at the restricted estimates $\tilde{\theta}$, including versions not containing any conditioning variables; see Problem 13.5. We have already used the expected Hessian form of the LM statistic for nonlinear regression in Section 12.6.2.
We will use it in several applications in Part IV, including binary response models and Poisson regression models. In these examples, the statistic can be computed conveniently using auxiliary regressions based on weighted residuals.
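One convenient version is the outer-product-of-the-score form, which can be computed as $N$ times the uncentered R-squared from regressing unity on the restricted scores. The following sketch is illustrative only; it assumes a hypothetical $N \times P$ array `s_tilde` whose rows are the scores evaluated at the restricted estimates.

```python
import numpy as np

def lm_outer_product(s_tilde):
    """Outer-product-of-the-score LM statistic from the restricted scores.

    s_tilde is an N x P array whose ith row is s_i(theta_tilde)'. The statistic equals
    N * (uncentered R-squared) from regressing a vector of ones on the columns of s_tilde,
    and is asymptotically chi-square with Q degrees of freedom (Q = number of restrictions).
    """
    N = s_tilde.shape[0]
    ones = np.ones(N)
    b, *_ = np.linalg.lstsq(s_tilde, ones, rcond=None)   # regression of 1 on s_tilde
    ssr = np.sum((ones - s_tilde @ b) ** 2)
    return N - ssr                                       # = N * uncentered R-squared
```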
Because the unconditional information matrix equality holds, we know from Section 12.6.4 that the three classical statistics have the same limiting distribution under local alternatives. Therefore, either small-sample considerations, invariance, or computational issues must be used to choose among the statistics.
Sec-13.7 Specification Testing
Since MLE generally relies on its distributional assumptions, it is useful to have available a general class of specification tests that are simple to compute. One general approach is to nest the model of interest within a more general model (which may be much harder to estimate) and obtain the score test against the more general alternative. RESET in a linear model and its extension to exponential regression models in Section 12.6.2 are examples of this approach, albeit in a non-maximum-likelihood setting.
In the context of MLE, it makes sense to test moment conditions implied by the conditional density specification. Let $w_i = (x_i, y_i)$ and suppose that, when $f(\cdot \mid x; \theta)$ is correctly specified,

$$H_0: E[g(w_i, \theta_o)] = 0 \qquad (13.37)$$

where $g(w, \theta)$ is a $Q \times 1$ vector. Any application implies innumerable choices for the function $g$. Since the MLE $\hat{\theta}$ sets the sum of the score to zero, $g(w, \theta)$ cannot contain elements of $s(w, \theta)$. Generally, $g$ should be chosen to test features of a model that are of primary interest, such as first and second conditional moments, or various conditional probabilities.
condi-A test of hypothesis (13.37) is based on how far the sample average of gðwi; ^yÞ isfrom zero To derive the asymptotic distribution, note that
Po1fE½siðyoÞsiðyoÞ0g1fE½siðyoÞgiðyoÞ0g
is the P Q matrix of population regression coe‰cients from regressing giðyoÞ0 on
siðyoÞ0 Using a mean-value expansion about yoand algebra similar to that in ter 12, we can write
E½‘ygiðyoÞ j xi ¼ E½giðyoÞsiðyoÞ0j xi: ð13:39Þ
To show equation (13.39), write

$$E_\theta[g_i(\theta) \mid x_i] = \int_{\mathcal{Y}} g(y, x_i; \theta) f(y \mid x_i; \theta)\, \nu(dy) = 0 \qquad (13.40)$$

for all $\theta$. Now, if we take the derivative with respect to $\theta$ and assume that the integrals and derivative can be interchanged, equation (13.40) implies that

$$\int_{\mathcal{Y}} \nabla_\theta g(y, x_i; \theta) f(y \mid x_i; \theta)\, \nu(dy) + \int_{\mathcal{Y}} g(y, x_i; \theta)\nabla_\theta f(y \mid x_i; \theta)\, \nu(dy) = 0$$

or $E_\theta[\nabla_\theta g_i(\theta) \mid x_i] + E_\theta[g_i(\theta)s_i(\theta)' \mid x_i] = 0$, where we use the fact that $\nabla_\theta f(y \mid x; \theta) = s(y, x; \theta)'f(y \mid x; \theta)$. Plugging in $\theta = \theta_o$ and rearranging gives equation (13.39).

What we have shown is that the asymptotic variance of $N^{-1/2}\sum_{i=1}^N g_i(\hat{\theta})$ is $E\{[g_i(\theta_o) - \Pi_o's_i(\theta_o)][g_i(\theta_o) - \Pi_o's_i(\theta_o)]'\}$, which is consistently estimated by $N^{-1}\sum_{i=1}^N (\hat{g}_i - \hat{\Pi}'\hat{s}_i)(\hat{g}_i - \hat{\Pi}'\hat{s}_i)'$, where $\hat{g}_i \equiv g_i(\hat{\theta})$, $\hat{s}_i \equiv s_i(\hat{\theta})$, and $\hat{\Pi}$ contains the OLS coefficients from regressing $\hat{g}_i'$ on $\hat{s}_i'$. When we construct the quadratic form, we get the Newey-Tauchen-White (NTW) statistic,

$$NTW = \left[\sum_{i=1}^N g_i(\hat{\theta})\right]'\left[\sum_{i=1}^N (\hat{g}_i - \hat{\Pi}'\hat{s}_i)(\hat{g}_i - \hat{\Pi}'\hat{s}_i)'\right]^{-1}\left[\sum_{i=1}^N g_i(\hat{\theta})\right] \qquad (13.41)$$
This statistic was proposed independently by Newey (1985) and Tauchen (1985), and is an extension of White's (1982a) information matrix (IM) test statistic.
For computational purposes it is useful to note that equation (13.41) is identical to $N - \mathrm{SSR}_0 = NR^2$ from the regression

$$1 \text{ on } \hat{s}_i',\ \hat{g}_i', \qquad i = 1, 2, \ldots, N \qquad (13.42)$$

where $\mathrm{SSR}_0$ is the usual sum of squared residuals. Under the null that the density is correctly specified, NTW is distributed asymptotically as $\chi^2_Q$, assuming that $g(w, \theta)$ contains $Q$ nonredundant moment conditions. Unfortunately, the outer product form of regression (13.42) means that the statistic can have poor finite sample properties. In particular applications, such as nonlinear least squares, binary response analysis, and Poisson regression, to name a few, it is best to use forms of test statistics based on the expected Hessian. We gave the regression-based test for NLS in equation (12.72), and we will see other examples in later chapters. For the information matrix test statistic, Davidson and MacKinnon (1992) have suggested an alternative form of the IM statistic that appears to have better finite sample properties.
Example 13.2 (continued): To test the specification of the conditional mean for Poisson regression, we might take $g(w, \theta) = \exp(x\theta)x'[y - \exp(x\theta)] = \exp(x\theta)s(w, \theta)$, where the score is given by equation (13.19). If $E(y \mid x) = \exp(x\theta_o)$ then $E[g(w, \theta_o) \mid x] = \exp(x\theta_o)E[s(w, \theta_o) \mid x] = 0$. To test the Poisson variance assumption, $\mathrm{Var}(y \mid x) = E(y \mid x) = \exp(x\theta_o)$, $g$ can be of the form $g(w, \theta) = a(x, \theta)\{[y - \exp(x\theta)]^2 - \exp(x\theta)\}$, where $a(x, \theta)$ is a $Q \times 1$ vector. If the Poisson assumption is true, then $u = y - \exp(x\theta_o)$ has a zero conditional mean and $E(u^2 \mid x) = \mathrm{Var}(y \mid x) = \exp(x\theta_o)$. It follows that $E[g(w, \theta_o) \mid x] = 0$.
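As a sketch of how the NTW statistic might be computed via regression (13.42) for the conditional mean test in this example, the code below builds $\hat{g}_i$ and $\hat{s}_i$ from a fitted Poisson model and forms $N - \mathrm{SSR}_0$. It is purely illustrative: the arrays `y`, `X`, and the estimate `theta_hat` are hypothetical inputs, not objects from the text.

```python
import numpy as np

def ntw_statistic(y, X, theta_hat):
    """NTW statistic for the Poisson conditional mean test, computed as N - SSR_0
    from regressing 1 on (s_i', g_i'), as in regression (13.42).

    Here s_i = x_i'[y_i - exp(x_i theta)] is the Poisson score and
    g_i = exp(x_i theta) * s_i is the moment function from Example 13.2 (Q = K here).
    """
    m = np.exp(X @ theta_hat)
    u = y - m
    s = X * u[:, None]                  # rows are s_i'
    g = s * m[:, None]                  # rows are g_i' = exp(x_i theta) * s_i'
    Z = np.hstack([s, g])
    ones = np.ones(len(y))
    b, *_ = np.linalg.lstsq(Z, ones, rcond=None)
    ssr0 = np.sum((ones - Z @ b) ** 2)
    return len(y) - ssr0                # asymptotically chi-square(Q) under H0
```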
Example 13.2 contains examples of what are known as conditional moment tests. As the name suggests, the idea is to form orthogonality conditions based on some key conditional moments, usually the conditional mean or conditional variance, but sometimes conditional probabilities or higher order moments. The tests for nonlinear regression in Chapter 12 can be viewed as conditional moment tests, and we will see several other examples in Part IV. For reasons discussed earlier, we will avoid computing the tests using regression (13.42) whenever possible. See Newey (1985), Tauchen (1985), and Pagan and Vella (1989) for general treatments and applications of conditional moment tests. White's (1982a) information matrix test can often be viewed as a conditional moment test; see Hall (1987) for the linear regression model and White (1994) for a general treatment.
13.8 Partial Likelihood Methods for Panel Data and Cluster Samples
Up to this point we have assumed that the parametric model for the density of $y$ given $x$ is correctly specified. This assumption is fairly general because $x$ can contain any observable variable. The leading case occurs when $x$ contains variables we view as exogenous in a structural model. In other cases, $x$ will contain variables that are endogenous in a structural model, but putting them in the conditioning set and finding the new conditional density makes estimation of the structural parameters easier.

For studying various panel data models, for estimation using cluster samples, and for various other applications, we need to relax the assumption that the full conditional density of $y$ given $x$ is correctly specified. In some examples, such a model is too complicated. Or, for robustness reasons, we do not wish to fully specify the density of $y$ given $x$.
13.8.1 Setup for Panel Data
For panel data applications we let $y$ denote a $T \times 1$ vector, with generic element $y_t$. Thus, $y_i$ is a $T \times 1$ random draw vector from the cross section, with $t$th element $y_{it}$. As always, we are thinking of $T$ small relative to the cross section sample size. With a slight notational change we can replace $y_{it}$ with, say, a $G$-vector for each $t$, an extension that allows us to cover general systems of equations with panel data.
ex-For some vector xtcontaining any set of observable variables, let Dð ytj xtÞ denotethe distribution of yt given xt The key assumption is that we have a correctly speci-fied model for the density of ytgiven xt; call it ftð ytj xt;yÞ, t ¼ 1; 2; ; T The vector
xtcan contain anything, including conditioning variables zt, lags of these, and lagged
values of y The vector y consists of all parameters appearing in ft for any t; some orall of these may appear in the density for every t, and some may appear only in thedensity for a single time period
What distinguishes partial likelihood from maximum likelihood is that we do not assume that the product

$$\prod_{t=1}^T f_t(y_t \mid x_t; \theta_o) \qquad (13.43)$$

is a correctly specified model for the joint density of $(y_{i1}, \ldots, y_{iT})$ given $(x_{i1}, \ldots, x_{iT})$. We define the partial log likelihood for each observation $i$ as

$$\ell_i(\theta) \equiv \sum_{t=1}^T \log f_t(y_{it} \mid x_{it}; \theta) \qquad (13.44)$$

which is the sum of the log likelihoods across $t$. What makes partial likelihood methods work is that $\theta_o$ maximizes the expected value of equation (13.44) provided we have the densities $f_t(y_t \mid x_t; \theta)$ correctly specified.
By the Kullback-Leibler information inequality, $\theta_o$ maximizes $E[\log f_t(y_{it} \mid x_{it}; \theta)]$ over $\Theta$ for each $t$, so $\theta_o$ also maximizes the sum of these over $t$. As usual, identification requires that $\theta_o$ be the unique maximizer of the expected value of equation (13.44). It is sufficient that $\theta_o$ uniquely maximizes $E[\log f_t(y_{it} \mid x_{it}; \theta)]$ for each $t$, but this assumption is not necessary.
The partial maximum likelihood estimator (PMLE) $\hat{\theta}$ solves

$$\max_{\theta \in \Theta} \sum_{i=1}^N \sum_{t=1}^T \log f_t(y_{it} \mid x_{it}; \theta)$$
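A minimal sketch of the pooled partial MLE under the setup above: with a correctly specified density for $y_{it}$ given $x_{it}$ in each period (a Poisson density with exponential mean is used here purely for illustration), the estimator maximizes the double sum in the last display. The variable names and data layout are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_partial_loglik(theta, y, X):
    """Negative partial log likelihood: -sum_i sum_t log f_t(y_it | x_it; theta).

    y is an N x T array of counts and X is an N x T x K array of covariates, so each
    (i, t) cell contributes a Poisson log likelihood with mean exp(x_it theta).
    """
    xb = X @ theta                                    # N x T array of linear indices
    loglik = -np.exp(xb) + y * xb - gammaln(y + 1)    # log f_t(y_it | x_it; theta)
    return -loglik.sum()

# Usage (hypothetical shapes): the PMLE solves the pooled maximization problem, e.g.
# res = minimize(neg_partial_loglik, np.zeros(K), args=(y, X), method="BFGS")
# theta_hat = res.x
```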