We now apply the general methods of Part III to study specific nonlinear models that often arise in applications. Many nonlinear econometric models are intended to explain limited dependent variables. Roughly, a limited dependent variable is a variable whose range is restricted in some important way. Most variables encountered in economics are limited in range, but not all require special treatment. For example, many variables (wage, population, and food consumption, to name just a few) can only take on positive values. If a strictly positive variable takes on numerous values, special econometric methods are rarely called for. Often, taking the log of the variable and then using a linear model suffices.
When the variable to be explained, y, is discrete and takes on a finite number of values, it makes little sense to treat it as an approximately continuous variable. Discreteness of y does not in itself mean that a linear model for E(y | x) is inappropriate. However, in Chapter 15 we will see that linear models have certain drawbacks for modeling binary responses, and we will treat nonlinear models such as probit and logit. We also cover basic multinomial response models in Chapter 15, including the case when the response has a natural ordering.
Other kinds of limited dependent variables arise in econometric analysis, especially when modeling choices by individuals, families, or firms. Optimizing behavior often leads to corner solutions for some nontrivial fraction of the population. For example, during any given time, a fairly large fraction of the working age population does not work outside the home. Annual hours worked has a population distribution spread out over a range of values, but with a pileup at the value zero. While it could be that a linear model is appropriate for modeling expected hours worked, a linear model will likely lead to negative predicted hours worked for some people. Taking the natural log is not possible because of the corner solution at zero. In Chapter 16 we will discuss econometric models that are better suited for describing these kinds of limited dependent variables.
We treat the problem of sample selection in Chapter 17. In many sample selection contexts the underlying population model is linear, but nonlinear econometric methods are required in order to correct for nonrandom sampling. Chapter 17 also covers testing and correcting for attrition in panel data models, as well as methods for dealing with stratified samples.
In Chapter 18 we provide a modern treatment of switching regression models and, more generally, random coefficient models with endogenous explanatory variables. We focus on estimating average treatment effects.
We treat methods for count dependent variables, which take on nonnegative integer values, in Chapter 19. An introduction to modern duration analysis is given in Chapter 20.

15.1 Introduction
In qualitative response models, the variable to be explained, y, is a random variable taking on a finite number of outcomes; in practice, the number of outcomes is usually small. The leading case occurs where y is a binary response, taking on the values zero and one, which indicate whether or not a certain event has occurred. For example, y = 1 if a person is employed, y = 0 otherwise; y = 1 if a family contributes to charity during a particular year, y = 0 otherwise; y = 1 if a firm has a particular type of pension plan, y = 0 otherwise. Regardless of the definition of y, it is traditional to refer to y = 1 as a success and y = 0 as a failure.
As in the case of linear models, we often call y the explained variable, the response variable, the dependent variable, or the endogenous variable; x ≡ (x1, x2, ..., xK) is the vector of explanatory variables, regressors, independent variables, exogenous variables, or covariates.
In binary response models, interest lies primarily in the response probability,

p(x) ≡ P(y = 1 | x) = P(y = 1 | x1, x2, ..., xK)    (15.1)

for various values of x. For example, when y is an employment indicator, x might contain various individual characteristics such as education, age, marital status, and other factors that affect employment status, such as a binary indicator variable for participation in a recent job training program, or measures of past criminal behavior. For a continuous variable xj, the partial effect of xj on the response probability is

∂P(y = 1 | x)/∂xj = ∂p(x)/∂xj    (15.2)
If xK is a binary variable, interest lies in

p(x1, x2, ..., xK−1, 1) − p(x1, x2, ..., xK−1, 0)    (15.3)

which is the difference in response probabilities when xK = 1 and xK = 0. For most of the models we consider, whether a variable xj is continuous or discrete, the partial effect of xj on p(x) depends on all of x.
In studying binary response models, we need to recall some basic facts about Bernoulli (zero-one) random variables. The only difference between the setup here and that in basic statistics is the conditioning on x. If P(y = 1 | x) = p(x), then P(y = 0 | x) = 1 − p(x), E(y | x) = p(x), and Var(y | x) = p(x)[1 − p(x)].
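These conditional moment formulas are easy to verify by simulation; a minimal sketch, where the value of p(x) at a fixed x is made up for illustration:

```python
# Verifying the Bernoulli conditional moments by simulation.
# p below is a hypothetical value of p(x) at some fixed x.
import random

random.seed(0)
p = 0.3
draws = [1 if random.random() < p else 0 for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(mean, 2))   # close to p(x) = 0.3
print(round(var, 2))    # close to p(x)[1 - p(x)] = 0.21
```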
15.2 The Linear Probability Model for Binary Response
The linear probability model (LPM) for binary response y is specified as

P(y = 1 | x) = b0 + b1x1 + b2x2 + ... + bKxK    (15.4)

As usual, the xj can be functions of underlying explanatory variables, which would simply change the interpretations of the bj. Assuming that x1 is not functionally related to the other explanatory variables, b1 = ∂P(y = 1 | x)/∂x1. Therefore, b1 is the change in the probability of success given a one-unit increase in x1. If x1 is a binary explanatory variable, b1 is just the difference in the probability of success when x1 = 1 and x1 = 0, holding the other xj fixed.
Using functions such as quadratics, logarithms, and so on among the independent variables causes no new difficulties. The important point is that the bj now measure the effects of the explanatory variables xj on a particular probability.
Unless the range of x is severely restricted, the linear probability model cannot be a good description of the population response probability P(y = 1 | x). For given values of the population parameters bj, there would usually be feasible values of x1, ..., xK such that b0 + xb is outside the unit interval. Therefore, the LPM should be seen as a convenient approximation to the underlying response probability. What we hope is that the linear probability approximates the response probability for common values of the covariates. Fortunately, this often turns out to be the case.
In deciding on an appropriate estimation technique, it is useful to derive the conditional mean and variance of y. Since y is a Bernoulli random variable, these are simply

E(y | x) = b0 + b1x1 + b2x2 + ... + bKxK    (15.5)

Var(y | x) = xb(1 − xb)    (15.6)

where xb is shorthand for the right-hand side of equation (15.5).
Equation (15.5) implies that, given a random sample, the OLS regression of y on 1, x1, x2, ..., xK produces consistent and even unbiased estimators of the bj. Equation (15.6) means that heteroskedasticity is present unless all of the slope coefficients b1, ..., bK are zero. A nice way to deal with this issue is to use standard heteroskedasticity-robust standard errors and t statistics. Further, robust tests of multiple restrictions should also be used. There is one case where the usual F statistic can be used, and that is to test for joint significance of all variables (leaving the constant unrestricted). This test is asymptotically valid because Var(y | x) is constant under this particular null hypothesis.

Since the form of the variance is determined by the model for P(y = 1 | x), an asymptotically more efficient method is weighted least squares (WLS). Let b̂ be the OLS estimator, and let ŷi denote the OLS fitted values. Then, provided 0 < ŷi < 1 for all observations i, define the estimated standard deviations ŝi ≡ [ŷi(1 − ŷi)]^(1/2). Then the WLS estimator is obtained from the OLS regression

yi/ŝi on 1/ŝi, xi1/ŝi, ..., xiK/ŝi,  i = 1, 2, ..., N    (15.7)

The usual standard errors from this regression are valid, as follows from the treatment of weighted least squares in Chapter 12. In addition, all other testing can be done using F statistics or LM statistics using weighted regressions.
If some of the OLS fitted values are not between zero and one, WLS analysis is not possible without ad hoc adjustments to bring deviant fitted values into the unit interval. Further, since the OLS fitted value ŷi is an estimate of the conditional probability P(yi = 1 | xi), it is somewhat awkward if the predicted probability is negative or above unity.
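The OLS and WLS steps above can be sketched on simulated data (not the MROZ.RAW example used later); the two-regressor normal equations are solved by hand so the sketch is fully self-contained:

```python
# A minimal LPM sketch: OLS of a 0/1 outcome on one regressor, a check for
# fitted values outside the unit interval, then the WLS reweighting of (15.7).
import math, random

random.seed(1)
N = 500
x = [random.uniform(0.0, 4.0) for _ in range(N)]
# true response probability P(y=1|x) = 0.1 + 0.2x, which stays inside (0,1) here
y = [1 if random.random() < 0.1 + 0.2 * xi else 0 for xi in x]

# OLS slope and intercept from the usual formulas
xbar, ybar = sum(x) / N, sum(y) / N
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
outside = sum(1 for f in fitted if f <= 0.0 or f >= 1.0)

# WLS: weight each observation by 1/s_i^2, s_i = [yhat_i(1 - yhat_i)]^(1/2),
# dropping any observation whose fitted value is outside (0,1)
rows = [(yi, xi, f) for yi, xi, f in zip(y, x, fitted) if 0.0 < f < 1.0]
a11 = a12 = a22 = c1 = c2 = 0.0
for yi, xi, f in rows:
    w = 1.0 / (f * (1.0 - f))
    a11 += w; a12 += w * xi; a22 += w * xi * xi
    c1 += w * yi; c2 += w * yi * xi
det = a11 * a22 - a12 * a12          # solve the 2x2 weighted normal equations
b0_wls = (a22 * c1 - a12 * c2) / det
b1_wls = (a11 * c2 - a12 * c1) / det
print(outside, round(b0_wls, 2), round(b1_wls, 2))  # slope should be near 0.2
```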
Aside from the issue of fitted values being outside the unit interval, the LPM implies that a ceteris paribus unit increase in xj always changes P(y = 1 | x) by the same amount, regardless of the initial value of xj. This implication cannot literally be true because continually increasing one of the xj would eventually drive P(y = 1 | x) to be less than zero or greater than one.
Even with these weaknesses, the LPM often seems to give good estimates of the partial effects on the response probability near the center of the distribution of x. (How good they are can be determined by comparing the coefficients from the LPM with the partial effects estimated from the nonlinear models we cover in Section 15.3.) If the main purpose is to estimate the partial effect of xj on the response probability, averaged across the distribution of x, then the fact that some predicted values are outside the unit interval may not be very important. The LPM need not provide very good estimates of partial effects at extreme values of x.
Example 15.1 (Married Women's Labor Force Participation): We use the data from MROZ.RAW to estimate a linear probability model for labor force participation (inlf) of married women. Of the 753 women in the sample, 428 report working nonzero hours during the year. The variables we use to explain labor force participation are age, education, experience, nonwife income in thousands (nwifeinc), number of children less than six years of age (kidslt6), and number of kids between 6 and 18 inclusive (kidsge6); 606 women report having no young children, while 118 report having exactly one young child. We report the usual OLS standard errors in parentheses and the heteroskedasticity-robust standard errors in brackets.
Of the 753 fitted probabilities, 33 are outside the unit interval. Rather than using some adjustment to those 33 fitted values and applying weighted least squares, we just use OLS and report heteroskedasticity-robust standard errors. Interestingly, these differ in practically unimportant ways from the usual OLS standard errors.
The case for the LPM is even stronger if most of the xj are discrete and take on only a few values. In the previous example, to allow a diminishing effect of young children on the probability of labor force participation, we can break kidslt6 into three binary indicators: no young children, one young child, and two or more young children. The last two indicators can be used in place of kidslt6 to allow the first young child to have a larger effect than subsequent young children. (Interestingly, when this method is used, the marginal effects of the first and second young children are virtually the same. The estimated effect of the first child is about −.263, and the additional reduction in the probability of labor force participation for the next child is about −.274.)
In the extreme case where the model is saturated (that is, x contains dummy variables for mutually exclusive and exhaustive categories), the linear probability model is completely general. The fitted probabilities are simply the average yi within each cell defined by the different values of x; we need not worry about fitted probabilities less than zero or greater than one. See Problem 15.1.
15.3 Index Models for Binary Response: Probit and Logit
We now study binary response models of the form

P(y = 1 | x) = G(xb) ≡ p(x)    (15.8)

where x is 1 × K, b is K × 1, and we take the first element of x to be unity. Examples where x does not contain unity are rare in practice. For the linear probability model, G(z) = z is the identity function, which means that the response probabilities cannot be between 0 and 1 for all x and b. In this section we assume that G(·) takes on values in the open unit interval: 0 < G(z) < 1 for all z ∈ R.

The model in equation (15.8) is generally called an index model because it restricts the way in which the response probability depends on x: p(x) is a function of x only through the index xb = b1 + b2x2 + ... + bKxK. The function G maps the index into the response probability.
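A small sketch of computing p(x) = G(xb) for the two leading choices of G; the coefficient and covariate values below are made up for illustration, and the standard normal cdf is computed from the error function:

```python
# Response probabilities p(x) = G(xb) for probit and logit.
# Coefficients and covariates are hypothetical; x's first element is unity.
import math

def probit_G(z):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_G(z):
    # logistic cdf
    return 1.0 / (1.0 + math.exp(-z))

b = [0.5, -0.8, 0.3]
x = [1.0, 1.2, 2.0]
index = sum(bj * xj for bj, xj in zip(b, x))   # the index xb
print(round(probit_G(index), 3))
print(round(logit_G(index), 3))
```

Both functions stay strictly inside the unit interval for any index value, which is exactly the restriction placed on G(·) in the text.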
In most applications, G is a cumulative distribution function (cdf), whose specific form can sometimes be derived from an underlying economic model. For example, in Problem 15.2 you are asked to derive an index model from a utility-based model of charitable giving. The binary indicator y equals unity if a family contributes to charity and zero otherwise. The vector x contains family characteristics, income, and the price of a charitable contribution (as determined by marginal tax rates). Under a normality assumption on a particular unobservable taste variable, G is the standard normal cdf.

Index models where G is a cdf can be derived more generally from an underlying latent variable model, as in Example 13.1:

y* = xb + e,  y = 1[y* > 0]    (15.9)

where e is a continuously distributed variable independent of x and the distribution of e is symmetric about zero; recall from Chapter 13 that 1[·] is the indicator function. If G is the cdf of e, then, because the pdf of e is symmetric about zero, 1 − G(−z) = G(z) for all real numbers z. Therefore,

P(y = 1 | x) = P(y* > 0 | x) = P(e > −xb | x) = 1 − G(−xb) = G(xb)

which is exactly equation (15.8).
There is no particular reason for requiring e to be symmetrically distributed in the latent variable model, but this happens to be the case for the binary response models applied most often.
In most applications of binary response models, the primary goal is to explain the effects of the xj on the response probability P(y = 1 | x). The latent variable formulation tends to give the impression that we are primarily interested in the effects of each xj on y*. As we will see, the directions of the effects of xj on E(y* | x) = xb and on E(y | x) = P(y = 1 | x) = G(xb) are the same. But the latent variable y* rarely has a well-defined unit of measurement (for example, y* might be measured in utility units). Therefore, the magnitude of bj is not especially meaningful except in special cases.
The probit model is the special case of equation (15.8) with G equal to the standard normal cdf,

G(z) = Φ(z) ≡ ∫ from −∞ to z of φ(v) dv    (15.10)

where φ(z) = (2π)^(−1/2) exp(−z²/2) is the standard normal density. The logit model is the special case with G equal to the logistic cdf,

G(z) = Λ(z) ≡ exp(z)/[1 + exp(z)]    (15.11)
In order to successfully apply probit and logit models, it is important to know how to interpret the bj on both continuous and discrete explanatory variables. First, if xj is continuous, the partial effect of xj on the response probability is

∂p(x)/∂xj = g(xb)bj,  where g(z) ≡ dG(z)/dz    (15.12)

If G(·) is a strictly increasing cdf, as in probit and logit, then g(z) > 0 for all z, and the sign of the effect is given by the sign of bj. Also, the relative effects do not depend on x: for continuous variables xj and xh, the ratio of the partial effects is constant and given by the ratio of the corresponding coefficients, [∂p(x)/∂xj]/[∂p(x)/∂xh] = bj/bh. In the typical case that g is a symmetric density about zero, with unique mode at zero, the largest effect occurs when xb = 0. For example, in the probit case with g(z) = φ(z), g(0) = φ(0) = 1/√(2π) ≈ .399. In the logit case, g(z) = exp(z)/[1 + exp(z)]², and so g(0) = .25.
If xK is a binary explanatory variable, then the partial effect from changing xK from zero to one, holding all other variables fixed, is simply

G(b1 + b2x2 + ... + bK−1xK−1 + bK) − G(b1 + b2x2 + ... + bK−1xK−1)    (15.14)

Again, this expression depends on all other values of the other xj. For example, if y is an employment indicator and xK is a dummy variable indicating participation in a job training program, then expression (15.14) is the change in the probability of employment due to the job training program; this depends on other characteristics that affect employability, such as education and experience. Knowing the sign of bK is enough to determine whether the program had a positive or negative effect. But to find the magnitude of the effect, we have to estimate expression (15.14).
We can also use the difference in expression (15.14) for other kinds of discrete variables (such as number of children). If xK denotes this variable, then the effect on the probability of xK going from cK to cK + 1 is simply

G(b1 + b2x2 + ... + bK−1xK−1 + bK(cK + 1)) − G(b1 + b2x2 + ... + bK−1xK−1 + bKcK)    (15.15)

Standard functional forms can be included among the explanatory variables. For example, if the model contains log(z2) with coefficient b2, then g(xb)(b2/100) is the approximate change in P(y = 1 | z) given a 1 percent increase in z2. Models with interactions among explanatory variables, including interactions between discrete and continuous variables, are handled similarly. When measuring effects of discrete variables, we should use expression (15.15).
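The two kinds of partial effects can be sketched for a probit specification as follows; all coefficient and covariate values are hypothetical:

```python
# Partial effects in a probit: g(xb)*bj for a continuous regressor, and the
# discrete difference (15.14) for a binary xK. All numbers are made up.
import math

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal cdf
phi = lambda z: math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

b1, b2, bK = 0.2, 0.4, -0.6     # hypothetical coefficients
x2, xK = 1.5, 1.0               # xK is the binary regressor, currently on

xb = b1 + b2 * x2 + bK * xK
pe_x2 = phi(xb) * b2            # partial effect of the continuous x2 at this x
print(round(pe_x2, 3))

# effect of switching xK from 0 to 1, holding x2 fixed, as in (15.14)
diff = Phi(b1 + b2 * x2 + bK) - Phi(b1 + b2 * x2)
print(round(diff, 3))           # negative, matching the sign of bK
```

Note that both quantities change if x2 changes, which is the sense in which the partial effect of any one variable depends on all of x.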
15.4 Maximum Likelihood Estimation of Binary Response Index Models
Assume we have N independent, identically distributed observations following the model (15.8). Since we essentially covered the case of probit in Chapter 13, the discussion here will be brief. To estimate the model by (conditional) maximum likelihood, we need the log-likelihood function for each i. The density of yi given xi can be written as

f(y | xi; b) = [G(xib)]^y [1 − G(xib)]^(1−y),  y = 0, 1    (15.16)

The log likelihood for observation i is a function of the K × 1 vector of parameters and the data (xi, yi):

li(b) = yi log[G(xib)] + (1 − yi) log[1 − G(xib)]    (15.17)

(Recall from Chapter 13 that, technically speaking, we should distinguish the "true" value of beta, bo, from a generic value. For conciseness we do not do so here.) Restricting G(·) to be strictly between zero and one ensures that li(b) is well defined for all values of b.
As usual, the log likelihood for a sample of size N is L(b) = Σ_{i=1}^N li(b), and the MLE of b, denoted b̂, maximizes this log likelihood. If G(·) is the standard normal cdf, then b̂ is the probit estimator; if G(·) is the logistic cdf, then b̂ is the logit estimator. From the general maximum likelihood results we know that b̂ is consistent and asymptotically normal. We can also easily estimate the asymptotic variance of b̂.
We assume that G(·) is twice continuously differentiable, an assumption that is usually satisfied in applications (and, in particular, for probit and logit). As before, the function g(z) is the derivative of G(z). For the probit model, g(z) = φ(z), and for the logit model, g(z) = exp(z)/[1 + exp(z)]².
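A bare-bones sketch of what the MLE computes, using logit with a single regressor on simulated data and Newton's method; production software relies on library routines with proper convergence checks, and this only makes equations (15.16) and (15.17) concrete:

```python
# Logit MLE via Newton's method for K = 2 (intercept plus one regressor).
import math, random

random.seed(2)
G = lambda z: 1.0 / (1.0 + math.exp(-z))
data = []
for _ in range(2000):
    xi = random.gauss(0.0, 1.0)
    data.append((xi, 1 if random.random() < G(-0.5 + 1.0 * xi) else 0))

def loglik(b0, b1):
    # equation (15.17) summed over the sample
    return sum(y * math.log(G(b0 + b1 * x)) + (1 - y) * math.log(1.0 - G(b0 + b1 * x))
               for x, y in data)

b0 = b1 = 0.0
for _ in range(25):                      # Newton-Raphson iterations
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in data:
        p = G(b0 + b1 * x)
        u = y - p                        # for logit, the score is (y - G(xb)) * x
        g0 += u; g1 += u * x
        w = p * (1.0 - p)                # minus the Hessian sums w * x * x'
        h00 += w; h01 += w * x; h11 += w * x * x
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

print(round(b0, 2), round(b1, 2))        # should be near the true (-0.5, 1.0)
```

For logit, the inverse of the final weighted cross-product matrix (h00, h01, h11) is exactly the estimated asymptotic variance discussed next, so standard errors come from the square roots of its diagonal.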
Using the same calculations as in the probit example in Chapter 13, the score of the conditional log likelihood for observation i can be shown to be

si(b) = g(xib) xi′ [yi − G(xib)] / {G(xib)[1 − G(xib)]}    (15.18)

and minus the expected Hessian conditional on xi is

−E[Hi(b) | xi] = [g(xib)]² xi′xi / {G(xib)[1 − G(xib)]} ≡ Ai(b)    (15.19)

so the estimated asymptotic variance of b̂ is

V̂ = ( Σ_{i=1}^N ĝi² xi′xi / [Ĝi(1 − Ĝi)] )^(−1)    (15.20)

where Ĝi ≡ G(xib̂) and ĝi ≡ g(xib̂). In most cases the inverse exists, and when it does, V̂ is positive definite. If the matrix in equation (15.20) is not invertible, then perfect collinearity probably exists among the regressors.
As usual, we treat b̂ as being normally distributed with mean b and variance matrix in equation (15.20). The (asymptotic) standard error of b̂j is the square root of the jth diagonal element of V̂. These can be used to construct t statistics, which have a limiting standard normal distribution, and to construct approximate confidence intervals for each population parameter. These are reported with the estimates for packages that perform logit and probit. We discuss multiple hypothesis testing in the next section.
Some packages also compute Huber-White standard errors as an option for probit and logit analysis, using the general M-estimator formulas; see, in particular, equation (12.49). While the robust variance matrix is consistent, using it in place of the usual estimator means we must think that the binary response model is incorrectly specified. Unlike with nonlinear regression, in a binary response model it is not possible to correctly specify E(y | x) but to misspecify Var(y | x). Once we have specified P(y = 1 | x), we have specified all conditional moments of y given x.
In Section 15.8 we will see that, when using binary response models with panel data or cluster samples, it is sometimes important to compute variance matrix estimators that are robust to either serial dependence or within-group correlation. But this need arises as a result of dependence across time or subgroup, and not because the response probability is misspecified.
15.5 Testing in Binary Response Index Models
Any of the three tests from general MLE analysis (the Wald, LR, or LM test) can be used to test hypotheses in binary response contexts. Since the tests are all asymptotically equivalent under local alternatives, the choice of statistic usually depends on computational simplicity (since finite sample comparisons must be limited in scope). In the following subsections we discuss some testing situations that often arise in binary choice analysis, and we recommend particular tests for their computational advantages.

15.5.1 Testing Multiple Exclusion Restrictions
Consider the model

P(y = 1 | x, z) = G(xb + zg)    (15.21)

where x is 1 × K and z is 1 × Q. We wish to test the null hypothesis H0: g = 0, so we are testing Q exclusion restrictions. The elements of z can be functions of x, such as quadratics and interactions, in which case the test is a pure functional form test. Or the z can be additional explanatory variables. For example, z could contain dummy variables for occupation or region. In any case, the form of the test is the same. Some packages, such as Stata, compute the Wald statistic for exclusion restrictions using a simple command following estimation of the general model. This capability makes it very easy to test multiple exclusion restrictions, provided the dimension of (x, z) is not so large as to make probit estimation difficult.
The likelihood ratio statistic is also easy to use. Let Lur denote the value of the log-likelihood function from probit of y on x and z (the unrestricted model), and let Lr denote the value of the log-likelihood function from probit of y on x (the restricted model). Then the likelihood ratio test of H0: g = 0 is simply 2(Lur − Lr), which has an asymptotic χ²(Q) distribution under H0. This is analogous to the usual F statistic in OLS analysis of a linear model.
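The LR statistic can be sketched on simulated data for one exclusion restriction: logit of y on a constant and x against the intercept-only model, whose MLE has the closed form log[ȳ/(1 − ȳ)]. Everything below is illustrative, and 3.84 is the 5 percent χ²(1) critical value:

```python
# The LR statistic 2(Lur - Lr) for one exclusion restriction.
import math, random

random.seed(3)
G = lambda z: 1.0 / (1.0 + math.exp(-z))
data = [(x, 1 if random.random() < G(0.8 * x) else 0)
        for x in [random.gauss(0.0, 1.0) for _ in range(1500)]]

def loglik(b0, b1):
    return sum(y * math.log(G(b0 + b1 * x)) + (1 - y) * math.log(1.0 - G(b0 + b1 * x))
               for x, y in data)

# restricted MLE: intercept only, with closed form log(ybar/(1 - ybar))
ybar = sum(y for _, y in data) / len(data)
L_r = loglik(math.log(ybar / (1.0 - ybar)), 0.0)

# unrestricted MLE by Newton's method
b0 = b1 = 0.0
for _ in range(25):
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in data:
        p = G(b0 + b1 * x); u = y - p; w = p * (1.0 - p)
        g0 += u; g1 += u * x
        h00 += w; h01 += w * x; h11 += w * x * x
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det
L_ur = loglik(b0, b1)

LR = 2.0 * (L_ur - L_r)
print(LR > 3.84)   # data were generated with a nonzero slope, so H0 is rejected
```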
The score or LM test is attractive if the unrestricted model is difficult to estimate. In this section, let b̂ denote the restricted estimator of b, that is, the probit or logit estimator with z excluded from the model. The LM statistic using the estimated expected Hessian, Âi [see equation (15.20) and Section 12.6.2], can be shown to be numerically identical to the following: (1) Define ûi ≡ yi − G(xib̂), Ĝi ≡ G(xib̂), and ĝi ≡ g(xib̂). These are all obtainable after estimating the model without z. (2) Use all N observations to run the auxiliary OLS regression

ûi/[Ĝi(1 − Ĝi)]^(1/2) on ĝixi/[Ĝi(1 − Ĝi)]^(1/2), ĝizi/[Ĝi(1 − Ĝi)]^(1/2)    (15.22)

(3) The LM statistic is N times the uncentered R-squared from this regression, NR²u.

The LM procedure is rather easy to remember. The term ĝixi is the gradient of the mean function G(xib + zig) with respect to b, evaluated at b = b̂ and g = 0. Similarly, ĝizi is the gradient of G(xib + zig) with respect to g, again evaluated at b = b̂ and g = 0. Finally, under H0: g = 0, the conditional variance of ui given (xi, zi) is G(xib)[1 − G(xib)]; therefore, [Ĝi(1 − Ĝi)]^(1/2) is an estimate of the conditional standard deviation of ui. The dependent variable in regression (15.22) is often called a standardized residual because it is an estimate of ui/[Gi(1 − Gi)]^(1/2), which has unit conditional (and unconditional) variance. The regressors are simply the gradient of the conditional mean function with respect to both sets of parameters, evaluated under H0, and weighted by the estimated inverse conditional standard deviation. The first set of regressors in regression (15.22) is 1 × K and the second set is 1 × Q.
Under H0, LM ~ χ²(Q). The LM approach can be an attractive alternative to the LR statistic if z has large dimension, since with many explanatory variables probit can be difficult to estimate.
15.5.2 Testing Nonlinear Hypotheses about b
For testing nonlinear restrictions on b in equation (15.8), the Wald statistic is computationally the easiest because the unrestricted estimator of b, which is just probit or logit, is easy to obtain. Actually imposing nonlinear restrictions in estimation, which is required to apply the score or likelihood ratio methods, can be difficult. However, we must also remember that the Wald statistic for testing nonlinear restrictions is not invariant to reparameterizations, whereas the LM and LR statistics are. (See Sections 12.6 and 13.6; for the LM statistic, we would always use the expected Hessian.)
Let the restrictions on b be given by H0: c(b) = 0, where c(b) is a Q × 1 vector of possibly nonlinear functions satisfying the differentiability and rank requirements from Chapter 13. Then, from the general MLE analysis, the Wald statistic is simply

W = c(b̂)′ [∇b c(b̂) V̂ ∇b c(b̂)′]^(−1) c(b̂)    (15.23)

where V̂ is given in equation (15.20) and ∇b c(b̂) is the Q × K Jacobian of c(b) evaluated at b̂.
15.5.3 Tests against More General Alternatives
In addition to testing for omitted variables, sometimes we wish to test the probit or logit model against a more general functional form. When the alternatives are not standard binary response models, the Wald and LR statistics are cumbersome to apply, whereas the LM approach is convenient because it only requires estimation of the null model.
As an example of a more complicated binary choice model, consider the latent variable model (15.9) but assume that e | x ~ Normal[0, exp(2x1d)], where x1 is a 1 × K1 subset of x that excludes a constant and d is a K1 × 1 vector of additional parameters. (In many cases we would take x1 to be all nonconstant elements of x.) Therefore, there is heteroskedasticity in the latent variable model, so that e is no longer independent of x. The standard deviation of e given x is simply exp(x1d). Define r = e/exp(x1d), so that r is independent of x with a standard normal distribution. Then
P(y = 1 | x) = P(e > −xb | x) = P[exp(−x1d)e > −exp(−x1d)xb]
             = P[r > −exp(−x1d)xb] = Φ[exp(−x1d)xb]    (15.24)

The partial effects of the xj on P(y = 1 | x) are much more complicated in equation (15.24) than in equation (15.8). When d = 0, we obtain the standard probit model. Therefore, a test of the probit functional form for the response probability is a test of H0: d = 0.

More generally, suppose the response probability under the alternative can be written as P(y = 1 | x) = m(xb, x, d), where d is a Q × 1 vector of additional parameters and the null model corresponds to a known value d0, so that the null hypothesis is H0: d = d0.
In the previous example, G(·) = Φ(·), d0 = 0, and m(xb, x, d) = Φ[exp(−x1d)xb]. Let b̂ be the probit or logit estimator of b obtained under d = d0. Define ûi ≡ yi − G(xib̂), Ĝi ≡ G(xib̂), and ĝi ≡ g(xib̂). The gradient of the mean function m(xib, xi, d) with respect to b, evaluated at d0, is simply g(xib)xi. The only other piece we need is the gradient of m(xib, xi, d) with respect to d, evaluated at d0. Denote this 1 × Q vector as ∇d m(xib, xi, d0). Further, set ∇d m̂i ≡ ∇d m(xib̂, xi, d0). The LM statistic can be obtained as the explained sum of squares or as NR²u from the regression

ûi/[Ĝi(1 − Ĝi)]^(1/2) on ĝixi/[Ĝi(1 − Ĝi)]^(1/2), ∇d m̂i/[Ĝi(1 − Ĝi)]^(1/2)    (15.27)

Under H0, LM ~ χ²(Q), where Q is the dimension of d.
When applying this test to the preceding probit example, we have only ∇d m̂i left to compute. But m(xib, xi, d) = Φ[exp(−xi1d)xib], and so

∇d m(xib, xi, d) = −(xib) exp(−xi1d) xi1 φ[exp(−xi1d)xib]

When evaluated at b = b̂ and d = 0 (the null value), we get ∇d m̂i = −(xib̂)φ(xib̂)xi1 ≡ −(xib̂)φ̂ixi1, a 1 × K1 vector. Regression (15.27) becomes

ûi/[Φ̂i(1 − Φ̂i)]^(1/2) on φ̂ixi/[Φ̂i(1 − Φ̂i)]^(1/2), (xib̂)φ̂ixi1/[Φ̂i(1 − Φ̂i)]^(1/2)    (15.28)

(We drop the minus sign because it does not affect the value of the explained sum of squares or R²u.) Under the null hypothesis that the probit model is correctly specified, LM ~ χ²(K1). This statistic is easy to compute after estimation by probit.
For a one-degree-of-freedom test regardless of the dimension of xi, replace the last term in regression (15.28) with (xib̂)²φ̂i/[Φ̂i(1 − Φ̂i)]^(1/2), and then the explained sum of squares is distributed asymptotically as χ²(1). See Davidson and MacKinnon (1984) for further examples.
15.6 Reporting the Results for Probit and Logit
Several statistics should be reported routinely in any probit or logit (or other binary choice) analysis. The b̂j, their standard errors, and the value of the likelihood function are reported by all software packages that do binary response analysis. The b̂j give the signs of the partial effects of each xj on the response probability, and the statistical significance of xj is determined by whether we can reject H0: bj = 0.

One measure of goodness of fit that is usually reported is the percent correctly predicted. For each i, we compute the predicted probability that yi = 1, given the explanatory variables, xi. If G(xib̂) > .5, we predict yi to be unity; if G(xib̂) ≤ .5, yi is predicted to be zero. The percentage of times the predicted yi matches the actual yi is the percent correctly predicted. In many cases it is easy to predict one of the outcomes and much harder to predict the other outcome, in which case the percent correctly predicted can be misleading as a goodness-of-fit statistic. More informative is to compute the percent correctly predicted for each outcome, y = 0 and y = 1. The overall percent correctly predicted is a weighted average of the two, with the weights being the fractions of zero and one outcomes, respectively. Problem 15.7 provides an illustration.
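A small sketch of the percent correctly predicted, overall and by outcome; the fitted probabilities below are hypothetical stand-ins for G(xib̂):

```python
# Percent correctly predicted, overall and by outcome.
phat = [0.8, 0.6, 0.3, 0.9, 0.2, 0.55, 0.4, 0.7]   # hypothetical G(xi bhat)
y    = [1,   0,   0,   1,   0,   1,    1,   1]

pred = [1 if p > 0.5 else 0 for p in phat]
overall = sum(int(pi == yi) for pi, yi in zip(pred, y)) / len(y)
hits1 = [int(pi == yi) for pi, yi in zip(pred, y) if yi == 1]
hits0 = [int(pi == yi) for pi, yi in zip(pred, y) if yi == 0]
pct1 = sum(hits1) / len(hits1)   # percent correct among y = 1
pct0 = sum(hits0) / len(hits0)   # percent correct among y = 0
print(overall, pct1, pct0)
# overall is the weighted average of pct1 and pct0, weighted by outcome shares
```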
Various pseudo R-squared measures have been proposed for binary response. McFadden (1974) suggests the measure 1 − Lur/Lo, where Lur is the log-likelihood function for the estimated model and Lo is the log-likelihood function in the model with only an intercept. Because the log likelihood for a binary response model is always negative, |Lur| ≤ |Lo|, and so the pseudo R-squared is always between zero and one. Alternatively, we can use a sum of squared residuals measure: 1 − SSRur/SSRo, where SSRur is the sum of squared residuals ûi = yi − G(xib̂) and SSRo is the total sum of squares of the yi. Several other measures have been suggested (see, for example, Maddala, 1983, Chapter 2), but goodness of fit is not as important as statistical and economic significance of the explanatory variables. Estrella (1998) contains a recent comparison of goodness-of-fit measures for binary response.
Often we want to estimate the effects of the variables xj on the response probabilities P(y = 1 | x). If xj is (roughly) continuous, then

ΔP̂(y = 1 | x) ≈ [g(xb̂)b̂j]Δxj    (15.29)

for small changes in xj. (As usual when using calculus, the notion of "small" here is somewhat vague.) Since g(xb̂) depends on x, we must compute g(xb̂) at interesting values of x. Often the sample averages of the xj's are plugged in to get g(x̄b̂). This factor can then be used to adjust each of the b̂j (at least those on continuous variables) to obtain the effect of a one-unit increase in xj. If x contains nonlinear functions of some explanatory variables, such as natural logs or quadratics, there is the issue of using the log of the average versus the average of the log (and similarly with quadratics). To get the effect for the "average" person, it makes more sense to plug the averages into the nonlinear functions, rather than average the nonlinear functions. Software packages (such as Stata with the dprobit command) necessarily average the nonlinear functions. Sometimes minimum and maximum values of key variables are used in obtaining g(x̄b̂), so that we can see how the partial effects change as some elements of x get large or small.
Equation (15.29) also suggests how to roughly compare magnitudes of the probit and logit estimates. If x̄b̂ is close to zero for logit and probit, the scale factor we use can be g(0). For probit, g(0) ≈ .4, and for logit, g(0) = .25. Thus the logit estimates can be expected to be larger by a factor of about .4/.25 = 1.6. Alternatively, multiply the logit estimates by .625 to make them comparable to the probit estimates. In the linear probability model, g(0) is unity, and so logit estimates should be divided by four to compare them with LPM estimates, while probit estimates should be divided by 2.5 to make them roughly comparable to LPM estimates. More accurate comparisons are obtained by using the scale factors g(x̄b̂) for probit and logit. Of course, one of the potential advantages of using probit or logit is that the partial effects vary with x, and it is of some interest to compute g(xb̂) at values of x other than the sample averages.
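These rules of thumb come directly from the two densities evaluated at zero; a quick check:

```python
# The densities at zero that drive the rule-of-thumb comparisons.
import math

g_probit = 1.0 / math.sqrt(2.0 * math.pi)   # phi(0), about .399
g_logit = 0.25                              # exp(0)/[1 + exp(0)]^2
print(round(g_probit, 3))                   # 0.399
print(round(g_probit / g_logit, 2))         # 1.6
```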
com-If, say, x2is a binary variable, it perhaps makes more sense to plug in zero or onefor x2, rather than x2 (which is the fraction of ones in the sample) Putting in theaverages for the binary variables means that the e¤ect does not really correspond to aparticular individual But often the results are similar, and the choice is really based
If x_K is a discrete variable, then we can estimate the change in the predicted probability in going from c_K to c_K + 1 as

G(β̂₁ + β̂₂x̄₂ + … + β̂_{K−1}x̄_{K−1} + β̂_K(c_K + 1)) − G(β̂₁ + β̂₂x̄₂ + … + β̂_{K−1}x̄_{K−1} + β̂_K c_K)   (15.31)
In particular, when x_K is a binary variable, set c_K = 0. Of course, the other x_j's can be evaluated anywhere, but the use of sample averages is typical. The delta method can be used to obtain a standard error of equation (15.31). For probit, Stata does this calculation when x_K is a binary variable. Usually the calculations ignore the fact that x̄_j is an estimate of E(x_j) in applying the delta method. If we are truly interested in β_K g(μ_x β), where μ_x = E(x), the estimation error in x̄ can be accounted for, but it makes the calculation more complicated, and it is unlikely to have a large effect.
An alternative way to summarize the estimated marginal effects is to estimate the average value of β_K g(xβ) across the population, or β_K E[g(xβ)]. A consistent estimator is

β̂_K N⁻¹ Σᵢ₌₁ᴺ g(xᵢβ̂)   (15.32)

if x_K is continuous, or

N⁻¹ Σᵢ₌₁ᴺ [G(β̂₁ + β̂₂x_{i2} + … + β̂_{K−1}x_{i,K−1} + β̂_K) − G(β̂₁ + β̂₂x_{i2} + … + β̂_{K−1}x_{i,K−1})]   (15.33)

if x_K is binary. The delta method can be used to obtain an asymptotic standard error of expression (15.32) or (15.33). Costa (1995) is a recent example of average effects obtained from expression (15.33).
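Expression (15.33) is simple to compute once estimates are in hand. The sketch below evaluates it for made-up coefficients and simulated data (none of these numbers come from the text); `Phi` plays the role of G for a probit:

```python
import math
import random

def Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ape_binary(beta, X, k):
    """Expression (15.33): average change in Phi(x*beta) from switching
    regressor k from 0 to 1, holding the other regressors at observed values."""
    total = 0.0
    for x in X:
        hi = list(x); hi[k] = 1.0
        lo = list(x); lo[k] = 0.0
        total += (Phi(sum(b * v for b, v in zip(beta, hi)))
                  - Phi(sum(b * v for b, v in zip(beta, lo))))
    return total / len(X)

# Hypothetical coefficients (constant, continuous x2, binary x3) and fake data:
random.seed(1)
beta = [0.2, 0.5, -0.8]
X = [[1.0, random.gauss(0, 1), float(random.random() < 0.5)] for _ in range(1000)]
print(round(ape_binary(beta, X, k=2), 3))
```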
Example 15.2 (Married Women's Labor Force Participation): We now estimate logit and probit models for women's labor force participation. For comparison we report the linear probability estimates. The results, with standard errors in parentheses, are given in Table 15.1 (for the LPM, these are heteroskedasticity-robust). The estimates from the three models tell a consistent story. The signs of the coefficients are the same across models, and the same variables are statistically significant in each model. The pseudo R-squared for the LPM is just the usual R-squared reported for OLS; for logit and probit the pseudo R-squared is the measure based on the log likelihoods described previously. In terms of overall percent correctly predicted, the models do equally well. For the probit model, it correctly predicts "out of the labor force" about 63.1 percent of the time, and it correctly predicts "in the labor force" about 81.3 percent of the time. The LPM has the same overall percent correctly predicted, but there are slight differences within each outcome.
As we emphasized earlier, the magnitudes of the coefficients are not directly comparable across the models. Using the rough rule of thumb discussed earlier, we can divide the logit estimates by four and the probit estimates by 2.5 to make all estimates comparable to the LPM estimates. For example, for the coefficients on kidslt6, the scaled logit estimate is about −.361, and the scaled probit estimate is about −.347. These are larger in magnitude than the LPM estimate (for reasons we will soon discuss). The scaled coefficient on educ is .055 for logit and .052 for probit.

Table 15.1
LPM, Logit, and Probit Estimates of Labor Force Participation
Dependent Variable: inlf

Independent Variable    LPM (OLS)          Logit (MLE)      Probit (MLE)
nwifeinc                −.0034 (.0015)     −.021 (.008)     −.012 (.005)
educ                     .038  (.007)       .221 (.043)      .131 (.025)
exper                    .039  (.006)       .206 (.032)      .123 (.019)
exper²                  −.00060 (.00019)   −.0032 (.0010)   −.0019 (.0006)
age                     −.016  (.002)      −.088 (.015)     −.053 (.008)
kidslt6                 −.262  (.032)      −1.443 (.204)    −.868 (.119)
kidsge6                  .013  (.013)       .060 (.075)      .036 (.043)
constant                 .586  (.151)       .425 (.860)      .270 (.509)
Percent correctly
predicted                73.4               73.6             73.4
If we evaluate the standard normal probability density function, φ(β̂₀ + β̂₁x₁ + … + β̂_k x_k), at the average values of the independent variables in the sample (including the average of exper²), we obtain about .391; this value is close enough to .4 to make the rough rule of thumb for scaling the probit coefficients useful in obtaining the effects on the response probability. In other words, to estimate the change in the response probability given a one-unit increase in any independent variable, we multiply the corresponding probit coefficient by .4.
The biggest difference between the LPM on one hand, and the logit and probit models on the other, is that the LPM assumes constant marginal effects for educ, kidslt6, and so on, while the logit and probit models imply diminishing magnitudes of the partial effects. In the LPM, one more small child is estimated to reduce the probability of labor force participation by about .262, regardless of how many young children the woman already has (and regardless of the levels of the other explanatory variables). We can contrast this finding with the estimated marginal effect from probit. For concreteness, take a woman with nwifeinc = 20.13, educ = 12.3, exper = 10.6, age = 42.5 (which are roughly the sample averages), and kidsge6 = 1. What is the estimated fall in the probability of working in going from zero to one small child? We evaluate the standard normal cdf, Φ(β̂₀ + β̂₁x₁ + … + β̂_k x_k), with kidslt6 = 1 and kidslt6 = 0, and the other independent variables set at the values given. We get roughly .373 − .707 = −.334, which means that the labor force participation probability is about .334 lower when a woman has one young child. This is not much different from the scaled probit coefficient of −.347. If the woman goes from one to two young children, the probability falls even more, but the marginal effect is not as large: .117 − .373 = −.256. Interestingly, the estimate from the linear probability model, which we think can provide a good estimate near the average values of the covariates, is in fact between the probit partial effects estimated starting from zero children and starting from one child.
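This calculation can be reproduced from the probit column of Table 15.1 (a sketch; the negative signs on the coefficients are those implied by the scaled estimates, and rounding in the reported coefficients means the probabilities only approximate the text's):

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probit coefficients from Table 15.1 (rounded, with signs made explicit).
b = {"const": .270, "nwifeinc": -.012, "educ": .131, "exper": .123,
     "expersq": -.0019, "age": -.053, "kidslt6": -.868, "kidsge6": .036}

# Covariate values used in the text (roughly the sample averages).
x = {"nwifeinc": 20.13, "educ": 12.3, "exper": 10.6, "age": 42.5, "kidsge6": 1.0}

def prob(kidslt6):
    idx = (b["const"] + b["nwifeinc"] * x["nwifeinc"] + b["educ"] * x["educ"]
           + b["exper"] * x["exper"] + b["expersq"] * x["exper"] ** 2
           + b["age"] * x["age"] + b["kidslt6"] * kidslt6
           + b["kidsge6"] * x["kidsge6"])
    return Phi(idx)

print(round(prob(1) - prob(0), 3))   # roughly -.33, near the text's -.334
print(round(prob(2) - prob(1), 3))   # a smaller drop, near the text's -.256
```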
Binary response models apply with little modification to independently pooled cross sections or to other data sets where the observations are independent but not necessarily identically distributed. Often year or other time-period dummy variables are included to account for aggregate time effects. Just as with linear models, probit can be used to evaluate the impact of certain policies in the context of a natural experiment; see Problem 15.13. An application is given in Gruber and Poterba (1994).
15.7 Specification Issues in Binary Response Models
We now turn to several issues that can arise in applying binary response models to economic data. All of these topics are relevant for general index models, but features of the normal distribution allow us to obtain concrete results in the context of probit models. Therefore, our primary focus is on probit models.
15.7.1 Neglected Heterogeneity
We begin by studying the consequences of omitting variables when those omitted variables are independent of the included explanatory variables. This is also called the neglected heterogeneity problem. The (structural) model of interest is

P(y = 1 | x, c) = Φ(xβ + γc)   (15.34)

where x is 1 × K with x₁ ≡ 1 and c is a scalar. We are interested in the partial effects of the x_j on the probability of success, holding c (and the other elements of x) fixed. We can write equation (15.34) in latent variable form as y* = xβ + γc + e, where y = 1[y* > 0] and e | x, c ~ Normal(0, 1). Because x₁ = 1, E(c) = 0 without loss of generality.
Now suppose that c is independent of x and c ~ Normal(0, τ²). [Remember, this assumption is much stronger than Cov(x, c) = 0 or even E(c | x) = 0: under independence, the distribution of c given x does not depend on x.] Given these assumptions, the composite term, γc + e, is independent of x and has a Normal(0, γ²τ² + 1) distribution. Therefore,

P(y = 1 | x) = P(γc + e > −xβ | x) = Φ(xβ/σ)   (15.35)

where σ² ≡ γ²τ² + 1. It follows immediately from equation (15.35) that probit of y on x consistently estimates β/σ. In other words, if β̂ is the estimator from a probit of y on x, then plim β̂_j = β_j/σ. Because σ = (γ²τ² + 1)^(1/2) > 1 (unless γ = 0 or τ² = 0), |β_j/σ| < |β_j|.
The attenuation bias in estimating β_j in the presence of neglected heterogeneity has prompted statements of the following kind: "In probit analysis, neglected heterogeneity is a much more serious problem than in linear models because, even if the omitted heterogeneity is independent of x, the probit coefficients are inconsistent." We just derived that probit of y on x consistently estimates β/σ rather than β, so the statement is technically correct. However, we should remember that, in nonlinear models, we usually want to estimate partial effects and not just parameters. For the purposes of obtaining the directions of the effects or the relative effects of the explanatory variables, estimating β/σ is just as good as estimating β.
For continuous x_j, we would like to estimate

∂P(y = 1 | x, c)/∂x_j = β_j φ(xβ + γc)   (15.36)

for various values of x and c. Because c is not observed, we cannot estimate γ. Even if we could estimate γ, c almost never has meaningful units of measurement (for example, c might be "ability," "health," or "taste for saving"), so it is not obvious what values of c we should plug into equation (15.36). Nevertheless, c is normalized so that E(c) = 0, so we may be interested in equation (15.36) evaluated at c = 0, which is simply β_j φ(xβ). What we consistently estimate from the probit of y on x is

(β_j/σ) φ(xβ/σ)   (15.37)

This expression shows that, if we are interested in the partial effects evaluated at c = 0, then probit of y on x does not do the trick. An interesting fact about expression (15.37) is that, even though β_j/σ is closer to zero than β_j, φ(xβ/σ) is larger than φ(xβ), because φ(z) increases as |z| → 0 and σ > 1. Therefore, for estimating the partial effects in equation (15.36) at c = 0, it is not clear for what values of x an attenuation bias exists.

With c having a normal distribution in the population, the partial effect evaluated at c = 0 describes only a small fraction of the population. [Technically, P(c = 0) = 0.] Instead, we can estimate the average partial effect (APE), which we introduced in Section 2.2.5. The APE is obtained, for given x, by averaging equation (15.36) across the distribution of c in the population. For emphasis, let x° be a given value of the explanatory variables (which could be, but need not be, the mean value). When we plug x° into equation (15.36) and take the expected value with respect to the distribution of c, we get

E_c[β_j φ(x°β + γc)] = (β_j/σ) φ(x°β/σ)   (15.38)

where E_c(·) denotes the expectation with respect to the distribution of c. Equation (15.38) follows because E_c[Φ(x°β + γc)] = Φ(x°β/σ) [see equation (15.35)], and the derivative of Φ(x°β/σ) with respect to x_j is (β_j/σ)φ(x°β/σ), which is what we wanted to show. The bottom line is that, except in cases where the magnitudes of the β_j in equation (15.34) have some meaning, omitted heterogeneity in probit models is not a problem
when it is independent of x: ignoring it consistently estimates the average partial effects. Of course, the previous arguments hinge on the normality of c and the probit structural equation. If the structural model (15.34) were, say, logit and if c were normally distributed, we would not get a probit or logit for the distribution of y given x; the response probability is more complicated. The lesson from Section 2.2.5 is that we might as well work directly with models for P(y = 1 | x), because partial effects on P(y = 1 | x) are always averages of the partial effects on P(y = 1 | x, c) over the distribution of c.
If c is correlated with x or is otherwise dependent on x [for example, if Var(c | x) depends on x], then omission of c is serious. In this case we cannot get consistent estimates of the average partial effects. For example, if c | x ~ Normal(xδ, η²), then probit of y on x gives consistent estimates of (β + γδ)/ρ, where ρ² ≡ γ²η² + 1. Unless γ = 0 or δ = 0, we do not consistently estimate β/σ. This result is not surprising given what we know from the linear case with omitted variables correlated with the x_j. We now study what can be done to account for endogenous variables in probit models.
15.7.2 Continuous Endogenous Explanatory Variables
We now explicitly allow for the case where one of the explanatory variables is correlated with the error term in the latent variable model. One possibility is to estimate a linear probability model by 2SLS. This procedure is relatively easy and might provide a good estimate of the average effect. If we want to estimate a probit model with an endogenous explanatory variable, we must make some fairly strong assumptions. In this section we consider the case of a continuous endogenous explanatory variable.
Write the model as

y₁* = z₁δ₁ + α₁y₂ + u₁   (15.39)

y₂ = z₁δ₂₁ + z₂δ₂₂ + v₂ = zδ₂ + v₂   (15.40)

y₁ = 1[y₁* > 0]   (15.41)

where (u₁, v₂) has a zero mean, bivariate normal distribution and is independent of z. Equation (15.39), along with equation (15.41), is the structural equation; equation (15.40) is a reduced form for y₂, which is endogenous if u₁ and v₂ are correlated. If u₁ and v₂ are independent, there is no endogeneity problem. Because v₂ is normally distributed, we are assuming that y₂ given z is normal; thus y₂ should have features of a normal random variable. (For example, y₂ should not be a discrete variable.)

The model is applicable when y₂ is correlated with u₁ because of omitted variables or measurement error. It can also be applied to the case where y₂ is determined jointly with y₁, but with a caveat. If y₁ appears on the right-hand side in a linear structural equation for y₂, then the reduced form for y₂ cannot be found with v₂ having the stated properties. However, if y₁* appears in a linear structural equation for y₂, then y₂ has the reduced form given by equation (15.40); see Maddala (1983, Chapter 7) for further discussion.
The normalization that gives the parameters in equation (15.39) an average partial effect interpretation, at least in the omitted variable and simultaneity contexts, is Var(u₁) = 1, just as in a probit model with all explanatory variables exogenous. To see this point, consider the outcome on y₁ at two different outcomes of y₂, say y₂ and y₂ + 1. Holding the observed exogenous factors fixed at z₁, and holding u₁ fixed, the difference in responses is

1[z₁δ₁ + α₁(y₂ + 1) + u₁ > 0] − 1[z₁δ₁ + α₁y₂ + u₁ > 0]   (15.42)

Averaging this difference across the Normal(0, 1) distribution of u₁ gives Φ(z₁δ₁ + α₁(y₂ + 1)) − Φ(z₁δ₁ + α₁y₂). The two-step procedures described in the following paragraphs only consistently estimate δ₁ and α₁ up to scale; we have to do a little more work to obtain estimates of the APE. If y₂ is a mismeasured variable, we apparently cannot estimate the APE of interest: we would like to estimate the change in the response probability due to a change in the true variable, but, without further assumptions, we can only estimate the effect of changing the observed measure y₂.
The most useful two-step approach is due to Rivers and Vuong (1988), as it leads to a simple test for endogeneity of y₂. To derive the procedure, first note that, under joint normality of (u₁, v₂), with Var(u₁) = 1, we can write

u₁ = θ₁v₂ + e₁   (15.43)

where θ₁ = η₁/τ₂², η₁ = Cov(v₂, u₁), τ₂² = Var(v₂), and e₁ is independent of z and v₂ (and therefore of y₂). Because of joint normality of (u₁, v₂), e₁ is also normally distributed with E(e₁) = 0 and Var(e₁) = Var(u₁) − η₁²/τ₂² = 1 − ρ₁², where ρ₁ ≡ Corr(v₂, u₁). We can now write

P(y₁ = 1 | z, y₂) = Φ[(z₁δ₁ + α₁y₂ + θ₁v₂)/(1 − ρ₁²)^(1/2)]   (15.44)

A probit of y₁ on z₁, y₂, and v₂ therefore consistently estimates δ_ρ1 ≡ δ₁/(1 − ρ₁²)^(1/2), α_ρ1 ≡ α₁/(1 − ρ₁²)^(1/2), and θ_ρ1 ≡ θ₁/(1 − ρ₁²)^(1/2). Notice that because ρ₁² < 1, each scaled coefficient is greater in magnitude than its unscaled counterpart unless y₂ is exogenous (ρ₁ = 0).

Since we do not know δ₂, we must first estimate it, as in the following procedure:

Procedure 15.1: (a) Run the OLS regression y₂ on z and save the residuals v̂₂. (b) Run the probit y₁ on z₁, y₂, v̂₂ to get consistent estimators of the scaled coefficients δ_ρ1, α_ρ1, and θ_ρ1.
A nice feature of Procedure 15.1 is that the usual probit t statistic on v̂₂ is a valid test of the null hypothesis that y₂ is exogenous, that is, H₀: θ₁ = 0. If θ₁ ≠ 0, the usual probit standard errors and test statistics are not strictly valid, and we have only estimated δ₁ and α₁ up to scale. The asymptotic variance of the two-step estimator can be derived using the M-estimator results in Section 12.5.2; see also Rivers and Vuong (1988).

Under H₀: θ₁ = 0, e₁ = u₁, and so the distribution of v₂ plays no role under the null. Therefore, the test of exogeneity is valid without assuming normality or homoskedasticity of v₂, and it can be applied very broadly, even if y₂ is a binary variable. Unfortunately, if y₂ and u₁ are correlated, normality of v₂ is crucial.
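Procedure 15.1 is easy to sketch on simulated data. The following is an illustrative implementation, not code from the text; it assumes numpy and scipy are available and uses a generic numerical optimizer for the probit step (all parameter values are made up):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 4000

# Exogenous variables: a constant, one included regressor (z1), one instrument.
z = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
delta2 = np.array([0.2, 1.0, -0.5])          # made-up reduced form parameters
rho = 0.5                                    # Corr(u1, v2): y2 is endogenous
u1, v2 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

y2 = z @ delta2 + v2                                            # reduced form
y1 = (0.3 + 0.5 * z[:, 1] + 0.8 * y2 + u1 > 0).astype(float)    # structural eq.

# Step (a): OLS of y2 on z; save the residuals.
d2_hat = np.linalg.lstsq(z, y2, rcond=None)[0]
v2_hat = y2 - z @ d2_hat

# Step (b): probit of y1 on (1, z1, y2, v2_hat) by maximum likelihood.
X = np.column_stack([np.ones(n), z[:, 1], y2, v2_hat])

def negll(b):
    p = norm.cdf(X @ b).clip(1e-10, 1 - 1e-10)
    return -(y1 * np.log(p) + (1 - y1) * np.log(1 - p)).sum()

res = minimize(negll, np.zeros(X.shape[1]), method="BFGS")
theta_hat = res.x[-1]   # coefficient on v2_hat; its t statistic tests exogeneity
print(round(theta_hat, 2))   # population value here is rho/(1 - rho**2)**0.5 ~ .58
```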
Example 15.3 (Testing for Exogeneity of Education in the Women's LFP Model): We test the null hypothesis that educ is exogenous in the married women's labor force participation equation. We first obtain the reduced form residuals, v̂₂, from regressing educ on all exogenous variables, including motheduc, fatheduc, and huseduc. Then, we add v̂₂ to the probit from Example 15.2. The t statistic on v̂₂ is only .867, which is weak evidence against the null hypothesis that educ is exogenous. As always, this conclusion hinges on the assumption that the instruments for educ are themselves exogenous.

Even when θ₁ ≠ 0, it turns out that we can consistently estimate the average partial effects after the two-step estimation. We simply apply the results from Section 2.2.5.
To see how, write y₁ = 1[z₁δ₁ + α₁y₂ + u₁ > 0], where, in the notation of Section 2.2.5, q ≡ u₁, x ≡ (z₁, y₂), and w ≡ v₂ (a scalar in this case). Because y₁ is a deterministic function of (z₁, y₂, u₁), v₂ is trivially redundant in E(y₁ | z₁, y₂, u₁), and so equation (2.34) holds. Further, as we have already used, u₁ given (z₁, y₂, v₂) is independent of (z₁, y₂), and so equation (2.33) holds as well. It follows from Section 2.2.5 that the APEs are obtained by taking derivatives (or differences) of

Φ[(z₁δ_ρ1 + α_ρ1 y₂)/(θ_ρ1²τ₂² + 1)^(1/2)]   (15.45)

with respect to the elements of (z₁, y₂). We simply divide each coefficient by the factor (θ̂_ρ1²τ̂₂² + 1)^(1/2) before computing derivatives or differences with respect to the elements of (z₁, y₂). Unfortunately, because the APEs depend on the parameters in a complicated way, and the asymptotic variance of (δ̂_ρ1′, α̂_ρ1, θ̂_ρ1)′ is already complicated because of the two-step estimation, standard errors for the APEs would be very difficult to come by using the delta method.
An alternative method for estimating the APEs does not exploit the normality assumption for v₂. By the usual uniform weak law of large numbers argument (see Lemma 12.1), a consistent estimator of expression (15.45) for any (z₁, y₂) is obtained by replacing unknown parameters by consistent estimators and averaging across the reduced form residuals:

N⁻¹ Σᵢ₌₁ᴺ Φ(z₁δ̂_ρ1 + α̂_ρ1 y₂ + θ̂_ρ1 v̂ᵢ₂)   (15.46)
Rather than use a two-step procedure, we can estimate equations (15.39)–(15.41) by conditional maximum likelihood. To obtain the joint distribution of (y₁, y₂), conditional on z, recall that

f(y₁, y₂ | z) = f(y₁ | y₂, z) f(y₂ | z)   (15.48)

(see Property CD.2 in Appendix 13A). Since y₂ | z ~ Normal(zδ₂, τ₂²), its density is (1/τ₂)φ[(y₂ − zδ₂)/τ₂]. Further, from equation (15.44),

P(y₁ = 1 | y₂, z) = Φ{[z₁δ₁ + α₁y₂ + (ρ₁/τ₂)(y₂ − zδ₂)]/(1 − ρ₁²)^(1/2)}   (15.49)

where we have used the fact that θ₁ = ρ₁/τ₂. Let wᵢ denote the term inside Φ(·) in equation (15.49) for observation i,

wᵢ ≡ [z_{i1}δ₁ + α₁y_{i2} + (ρ₁/τ₂)(y_{i2} − zᵢδ₂)]/(1 − ρ₁²)^(1/2)

Then we have derived the log likelihood for observation i:

y_{i1} log Φ(wᵢ) + (1 − y_{i1}) log[1 − Φ(wᵢ)] − log(τ₂) + log φ[(y_{i2} − zᵢδ₂)/τ₂]   (15.50)

Summing expression (15.50) across all i and maximizing with respect to all parameters gives the MLEs of δ₁, α₁, ρ₁, δ₂, τ₂². The general theory of conditional MLE applies, and so standard errors can be obtained using the estimated Hessian, the estimated expected Hessian, or the outer product of the score.
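The pieces of expression (15.50) can be assembled directly. The function below is a sketch for a single observation, with hypothetical parameter values and data (none from the text):

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def loglik_i(y1, y2, z1, z, d1, a1, rho1, d2, tau2):
    """Log of f(y1 | y2, z) f(y2 | z) for one observation: the summand in (15.50)."""
    zd2 = sum(b * v for b, v in zip(d2, z))
    w = (sum(b * v for b, v in zip(d1, z1)) + a1 * y2
         + (rho1 / tau2) * (y2 - zd2)) / math.sqrt(1.0 - rho1 ** 2)
    ll_y1 = y1 * math.log(Phi(w)) + (1 - y1) * math.log(1.0 - Phi(w))
    ll_y2 = math.log(phi((y2 - zd2) / tau2) / tau2)   # Normal(z*d2, tau2^2) density
    return ll_y1 + ll_y2

# Hypothetical parameter values and one data point, to show how the pieces fit:
ll = loglik_i(1, 0.4, [1.0, 2.0], [1.0, 2.0, -1.0],
              d1=[0.1, 0.3], a1=0.5, rho1=0.4, d2=[0.0, 0.2, 0.1], tau2=1.0)
print(round(ll, 3))
```

A useful sanity check is that summing the implied joint densities over y₁ = 0, 1 recovers the marginal density of y₂.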
Maximum likelihood estimation has some decided advantages over two-step procedures. First, MLE is more efficient than any two-step procedure. Second, we get direct estimates of δ₁ and α₁, the parameters of interest for computing partial effects. Evans, Oates, and Schwab (1992) study peer effects on teenage behavior using the full MLE.

Testing that y₂ is exogenous is easy once the MLE has been obtained: just test H₀: ρ₁ = 0 using an asymptotic t test. We could also use a likelihood ratio test. The drawback with the MLE is computational. Sometimes it can be difficult to get the iterations to converge, as ρ̂₁ sometimes tends toward 1 or −1.
Comparing the Rivers-Vuong approach to the MLE shows that the former is a limited information procedure. Essentially, Rivers and Vuong focus on f(y₁ | y₂, z), where they replace the unknown δ₂ with the OLS estimator δ̂₂ (and they ignore the rescaling problem by taking e₁ in equation (15.43) to have unit variance). MLE estimates the parameters using the information in f(y₁ | y₂, z) and f(y₂ | z) simultaneously. For the initial test of whether y₂ is exogenous, the Rivers-Vuong approach has significant computational advantages. If exogeneity is rejected, it is probably worth doing MLE.

Another benefit of the maximum likelihood approach for this and related problems is that it forces discipline on us in coming up with consistent estimation procedures and correct standard errors. It is easy to abuse two-step procedures if we are not careful in deriving estimating equations. With MLE, although it can be difficult to derive joint distributions of the endogenous variables given the exogenous variables, we know that, if the underlying distributional assumptions hold, consistent and efficient estimators are obtained.

15.7.3 A Binary Endogenous Explanatory Variable
We now consider the case where the probit model contains a binary explanatory variable that is endogenous. The model is

y₁ = 1[z₁δ₁ + α₁y₂ + u₁ > 0]   (15.51)

y₂ = 1[zδ₂ + v₂ > 0]   (15.52)

where (u₁, v₂) is independent of z and distributed as bivariate normal with mean zero, each has unit variance, and ρ₁ = Corr(u₁, v₂). If ρ₁ ≠ 0, then u₁ and y₂ are correlated, and probit estimation of equation (15.51) is inconsistent for δ₁ and α₁.

As discussed in Section 15.7.2, the normalization Var(u₁) = 1 is the proper one for computing average partial effects. Often, the effect of y₂ is of primary interest, especially when y₂ indicates participation in some sort of program, such as job training, and the binary outcome y₁ might denote employment status. The average treatment effect (for a given value of z₁) is Φ(z₁δ₁ + α₁) − Φ(z₁δ₁).
To derive the likelihood function, we again need the joint distribution of (y₁, y₂) given z, which we obtain from equation (15.48). To obtain P(y₁ = 1 | y₂, z), first note that

P(y₁ = 1 | v₂, z) = Φ[(z₁δ₁ + α₁y₂ + ρ₁v₂)/(1 − ρ₁²)^(1/2)]   (15.53)

Since y₂ = 1 if and only if v₂ > −zδ₂, we need a basic fact about truncated normal distributions: if v₂ has a standard normal distribution and is independent of z, then the density of v₂ given v₂ > −zδ₂ is

φ(v₂)/P(v₂ > −zδ₂) = φ(v₂)/Φ(zδ₂)   (15.54)

Therefore,

P(y₁ = 1 | y₂ = 1, z) = [Φ(zδ₂)]⁻¹ ∫ from −zδ₂ to ∞ of Φ[(z₁δ₁ + α₁ + ρ₁v₂)/(1 − ρ₁²)^(1/2)] φ(v₂) dv₂   (15.55)

and P(y₁ = 0 | y₂ = 1, z) is just one minus equation (15.55). Analogous expressions hold conditional on y₂ = 0. Combining these with the probit density of y₂, and taking the log, gives the log-likelihood function for maximum likelihood analysis. It is messy but certainly doable. Evans and Schwab (1995) use the MLE approach to study the causal effects of attending a Catholic high school on the probability of attending college, allowing the Catholic high school indicator to be correlated with unobserved factors that affect college attendance. As an IV they use a binary variable indicating whether a student is Catholic.
Because the MLE is nontrivial to compute, it is tempting to use some seemingly "obvious" two-step procedures. As an example, we might try to inappropriately mimic 2SLS. Since E(y₂ | z) = Φ(zδ₂) and δ₂ is consistently estimated by probit of y₂ on z, it is tempting to estimate δ₁ and α₁ from the probit of y₁ on z₁, Φ̂₂, where Φ̂₂ ≡ Φ(zδ̂₂). This approach does not produce consistent estimators, for the same reasons the forbidden regression discussed in Section 9.5 for nonlinear simultaneous equations models does not. For this two-step procedure to work, we would have to have P(y₁ = 1 | z) = Φ[z₁δ₁ + α₁Φ(zδ₂)]. But P(y₁ = 1 | z) = E(y₁ | z) = E(1[z₁δ₁ + α₁y₂ + u₁ > 0] | z), and since the indicator function 1[·] is nonlinear, we cannot pass the expected value through. If we were to compute the correct (complicated) formula for P(y₁ = 1 | z), plug in δ̂₂, and then maximize the resulting binary response log likelihood, then the two-step approach would produce consistent estimators. But full maximum likelihood is easier and more efficient.

As mentioned in the previous subsection, we can use the Rivers-Vuong approach to test for exogeneity of y₂. This has the virtue of being simple, and, if the test fails to reject, we may not need to compute the MLE. A more efficient test is the score test of H₀: ρ₁ = 0, which does not require estimation of the full MLE.
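The forbidden-regression point above is easy to see in a small simulation (a sketch with arbitrary parameter values): the plug-in probability Φ[z₁δ₁ + α₁Φ(zδ₂)] does not match the true P(y₁ = 1 | z) when ρ₁ is far from zero.

```python
import math
import random

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(7)
zd1, a1, zd2, rho = 0.0, 1.0, 0.0, 0.8   # z1*delta1 and z*delta2 at one fixed z

# True P(y1 = 1 | z), by simulating the correlated latent errors:
n, hits = 400000, 0
for _ in range(n):
    v2 = random.gauss(0.0, 1.0)
    u1 = rho * v2 + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0)
    y2 = 1.0 if zd2 + v2 > 0 else 0.0
    hits += zd1 + a1 * y2 + u1 > 0
truth = hits / n

forbidden = Phi(zd1 + a1 * Phi(zd2))   # plugging Phi(z*delta2) in for y2
print(round(truth, 3), round(forbidden, 3))   # visibly different
```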
15.7.4 Heteroskedasticity and Nonnormality in the Latent Variable Model
In applying the probit model it is easy to become confused about the problems of heteroskedasticity and nonnormality. The confusion stems from a failure to distinguish between the underlying latent variable formulation, as in the model (15.9), and the response probability in equation (15.8). As we have emphasized throughout this chapter, for most purposes we want to estimate P(y = 1 | x). The latent variable formulation is convenient for certain manipulations, but we are rarely interested in E(y* | x). [One case in which E(y* | x) is of interest is covered in Problem 15.16.] Once we focus on P(y = 1 | x), we can easily see why we should not attempt to compare heteroskedasticity in the latent variable model (15.9) with the consequences
of heteroskedasticity in a standard linear regression model. Heteroskedasticity in Var(e | x) entirely changes the functional form for P(y = 1 | x) = E(y | x). While the statement "probit will be inconsistent for β when e is heteroskedastic" is correct, it largely misses the point. In most probit applications, it makes little sense to care about consistent estimation of β when P(y = 1 | x) ≠ Φ(xβ). (Section 15.7.5 contains a different perspective.)
It is easy to construct examples where the partial effect of a variable on P(y = 1 | x) has the sign opposite to that of its coefficient in the latent variable formulation. For example, let x₁ be a positive, continuous variable, and write the latent variable model as y* = β₀ + β₁x₁ + e, e | x₁ ~ Normal(0, x₁²). The binary response is defined as y = 1[y* > 0]. A simple calculation shows that P(y = 1 | x₁) = Φ(β₀/x₁ + β₁), and so ∂P(y = 1 | x₁)/∂x₁ = −(β₀/x₁²)φ(β₀/x₁ + β₁). If β₀ > 0 and β₁ > 0, then ∂P(y = 1 | x₁)/∂x₁ and β₁ have opposite signs. The problem is fairly clear: while the latent variable model has a conditional mean that is linear in x₁, the response probability depends on 1/x₁. If the latent variable model is correct, we should just do probit of y on 1 and 1/x₁.
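This example is easy to verify numerically (a minimal sketch):

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

b0, b1 = 1.0, 0.5    # both positive in the latent variable model

def response_prob(x1):
    # P(y = 1 | x1) = Phi(b0/x1 + b1) when Var(e | x1) = x1^2
    return Phi(b0 / x1 + b1)

probs = [round(response_prob(x), 3) for x in (0.5, 1.0, 2.0, 4.0)]
print(probs)   # strictly decreasing in x1 even though b1 > 0
```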
Nonnormality in the latent error e means that G(z) ≠ Φ(z), and therefore P(y = 1 | x) ≠ Φ(xβ). Again, this is a functional form problem in the response probability, and it should be treated as such. As an example, suppose that the true model is logit, but we estimate probit. We are not going to consistently estimate β in P(y = 1 | x) = Λ(xβ); in fact, Table 15.1 shows that the logit estimates are generally much larger (roughly 1.6 times as large), because of the different scalings inherent in the probit and logit functions. But inconsistent estimation of β is practically irrelevant: probit might provide very good estimates of the partial effects, ∂P(y = 1 | x)/∂x_j, even though logit is the correct model. In Example 15.2, the estimated partial effects are very similar for logit and probit.
Relaxing distributional assumptions on e in the model (15.9) can be useful for obtaining more flexible functional forms for P(y = 1 | x), as we saw in equation (15.24). Replacing Φ(z) with some function G(z; γ), where γ is a vector of parameters, is a good idea, especially when it nests the standard normal cdf. [Moon (1988) covers some interesting possibilities in the context of logit models, including asymmetric cumulative distribution functions.] But it is important to remember that these are just ways of generalizing functional form, and they may be no better than directly specifying a more flexible functional form for the response probability, as in McDonald (1996). When different functional forms are used, parameter estimates across different models should not be the basis for comparison: in most cases, it makes sense only to compare the estimated response probabilities at various values of x and measures of goodness of fit, such as the values of the log-likelihood function. (For an exception, see Problem 15.16.)

15.7.5 Estimation under Weaker Assumptions
Probit, logit, and the extensions of these mentioned in the previous subsection are all parametric models: P(y = 1 | x) depends on a finite number of parameters. There have been many recent advances in estimation of binary response models that relax parametric assumptions on P(y = 1 | x). We briefly discuss some of those here.

If we are interested in estimating the directions and relative sizes of the partial effects, and not the response probabilities, several approaches are possible. Ruud (1983) obtains conditions under which we can estimate the slope parameters, call these β, up to scale (that is, we can consistently estimate τβ for some unknown constant τ), even though we misspecify the function G(·). Ruud (1986) shows how to exploit these results to consistently estimate the slope parameters up to scale fairly generally.
An alternative approach is to recognize that we do not know the function G(·), but the response probability has the index form in equation (15.8). This arises from the latent variable formulation (15.9) when e is independent of x but the distribution of e is not known. There are several semiparametric estimators of the slope parameters, up to scale, that do not require knowledge of G. Under certain restrictions on the function G and the distribution of x, the semiparametric estimators are consistent and √N-asymptotically normal. See, for example, Stoker (1986); Powell, Stock, and Stoker (1989); Ichimura (1993); Klein and Spady (1993); and Ai (1997). Powell (1994) contains a recent survey of these methods.

Once β̂ is obtained, the function G can be consistently estimated (in a sense we cannot make precise here, as G is part of an infinite dimensional space). Thus, the response probabilities, as well as the partial effects on these probabilities, can be consistently estimated for unknown G. Obtaining Ĝ requires nonparametric regression of yᵢ on xᵢβ̂, where β̂ are the scaled slope estimators. Accessible treatments of the methods used are contained in Stoker (1992), Powell (1994), and Härdle and Linton (1994).
Remarkably, it is possible to estimate β up to scale without assuming that e and x are independent in the model (15.9). In the specification y = 1[xβ + e > 0], Manski (1975, 1988) shows how to consistently estimate β, subject to a scaling, under the assumption that the median of e given x is zero. Some mild restrictions are needed on the distribution of x; the most important of these is that at least one element of x with a nonzero coefficient is essentially continuous. This allows e to have any distribution, and e and x can be dependent; for example, Var(e | x) is unrestricted. Manski's estimator, called the maximum score estimator, is a least absolute deviations estimator. Since the median of y given x is 1[xβ > 0], the maximum score estimator solves

min over b of N⁻¹ Σᵢ₌₁ᴺ |yᵢ − 1[xᵢb > 0]|   (15.56)

over all b with, say, b′b = 1, or with some element of b fixed at unity if the corresponding x_j is known to appear in Med(y | x). {A normalization is needed because if Med(y | x) = 1[xβ > 0], then Med(y | x) = 1[x(τβ) > 0] for any τ > 0.} The resulting estimator is consistent (for a recent proof see Newey and McFadden, 1994), but its limiting distribution is nonnormal. In fact, it converges to its limiting distribution at rate N^(1/3). Horowitz (1992) proposes a smoothed version of the maximum score estimator that converges at a rate close to √N.
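A crude version of the maximum score estimator can be sketched with a grid search over directions satisfying the normalization b′b = 1 (here parameterized by an angle in two dimensions); the simulated error below is heteroskedastic but has conditional median zero, which the estimator tolerates. This is an illustration, not a practical implementation:

```python
import math
import random

random.seed(0)
n = 1000
beta = (math.cos(math.pi / 4), math.sin(math.pi / 4))   # true direction (45 degrees)

data = []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    scale = 0.5 + abs(x1)                     # Var(e | x) depends on x ...
    e = scale * (random.random() - 0.5) * 4   # ... but Med(e | x) = 0
    y = 1.0 if beta[0] * x1 + beta[1] * x2 + e > 0 else 0.0
    data.append((x1, x2, y))

def score_loss(angle):
    b = (math.cos(angle), math.sin(angle))    # enforces the normalization b'b = 1
    # Sum of |y - 1[x*b > 0]|; dropping the 1/N factor does not change the argmin.
    return sum(abs(y - (1.0 if b[0] * x1 + b[1] * x2 > 0 else 0.0))
               for x1, x2, y in data)

grid = [i * math.pi / 180 for i in range(180)]
best = min(grid, key=score_loss)
print(round(math.degrees(best)))   # an angle near 45, if the estimator is working
```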
The maximum score estimator’s strength is that it consistently estimates b up to
scale in cases where the index model (15.8) does not hold In a sense, this is also theestimator’s weakness, because it is not intended to deliver estimates of the responseprobabilities Pð y ¼ 1 j xÞ In some cases we might only want to know the relativee¤ects of each xj on an underlying utility di¤erence or unobserved willingness to pay
ð yÞ, and the maximum score estimator is well suited for that purpose However, formost policy purposes we want to know the magnitude of the change in Pð y ¼ 1 j xÞfor a given change in xj As illustrated by the heteroskedasticity example in the pre-vious subsection, where Varðe j x1Þ ¼ x2
1, it is possible for bjand qPð y ¼ 1 j xÞ=qxj tohave opposite signs More generally, for any variable y, it is possible that xj has apositive e¤ect on Medð y j xÞ but a negative e¤ect on Eð y j xÞ, or vice versa Thispossibility raises the issue of what should be the focus, the median or the mean Forbinary response, the conditional mean is the response probability
It is also possible to estimate the parameters in a binary response model with endogenous explanatory variables without knowledge of G(·). Lewbel (1998) contains some recent results. Apparently, methods for estimating average partial effects with endogenous explanatory variables and unknown G(·) are not yet available.

15.8 Binary Response Models for Panel Data and Cluster Samples
When analyzing binary responses in the context of panel data, it is often useful to begin with a linear model with an additive, unobserved effect, and then, just as in Chapters 10 and 11, use the within transformation or first differencing to remove the unobserved effect. A linear probability model for binary outcomes has the same problems as in the cross section case. In fact, it is probably less appealing for unobserved effects models, as it implies the unnatural restrictions −x_it β ≤ c_i ≤ 1 − x_it β, t = 1, …, T, on the unobserved effects. In this section we discuss probit and logit models that can incorporate unobserved effects.
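The linear probability model's drawbacks are easy to see in a small simulation: a within (fixed effects) linear regression of a binary outcome readily produces fitted "probabilities" outside the unit interval. This is a minimal sketch under a data-generating process of our own choosing, not an example from the text.

```python
import numpy as np

# Simulated panel (our assumption): binary y_it generated from a probit
# with an additive individual effect, then fit by a within-transformed
# linear probability model.
rng = np.random.default_rng(3)
N, T = 300, 4
c = rng.normal(size=(N, 1))
x = rng.normal(size=(N, T))
y = (x + c + rng.normal(size=(N, T)) > 0).astype(float)

# Within transformation removes the additive effect in the linear model.
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
beta_hat = (xd * yd).sum() / (xd * xd).sum()

# Implied "probabilities" x_it*beta_hat + c_i_hat, where c_i_hat is the
# recovered individual intercept ybar_i - beta_hat*xbar_i.
c_hat = y.mean(axis=1, keepdims=True) - beta_hat * x.mean(axis=1, keepdims=True)
p_hat = beta_hat * x + c_hat
share_outside = np.mean((p_hat < 0) | (p_hat > 1))
print(f"share of fitted 'probabilities' outside [0, 1]: {share_outside:.2%}")
```

Individuals whose outcomes are all ones or all zeros are the main culprits: their estimated intercepts push the fitted line past the unit interval for roughly half of their observations.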
15.8.1 Pooled Probit and Logit
In Section 13.8 we used a probit model to illustrate partial likelihood methods with panel data. Naturally, we can use logit or any other binary response function as well. Suppose the model is

P(y_it = 1 | x_it) = G(x_it β),  t = 1, 2, …, T  (15.57)

where G(·) is a known function taking on values in the open unit interval. As we discussed in Chapter 13, x_it can contain a variety of factors, including time dummies, interactions of time dummies with time-constant or time-varying variables, and lagged dependent variables.
In specifying the model (15.57) we have not assumed nearly enough to obtain the distribution of y_i ≡ (y_i1, …, y_iT) given x_i = (x_i1, …, x_iT). Nevertheless, we can obtain a √N-consistent, asymptotically normal estimator of β by the partial likelihood (pooled) method of Section 13.8, and robust Wald and score statistics can be computed as in Chapter 12.
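A sketch of what the robust inference involves: estimate the pooled probit by maximizing the partial log likelihood, then form a "sandwich" variance from individual-level score sums. Everything below is a hand-rolled simulation under an assumed random-effects-style design; in practice one would use a packaged estimator with cluster-robust standard errors, and all names here are our own.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated panel (our assumption): an individual effect c_i induces serial
# dependence, so pooled probit is consistent for a scaled beta but the
# usual MLE standard errors are invalid.
rng = np.random.default_rng(2)
N, T = 400, 5
c = rng.normal(size=(N, 1))
x = rng.normal(size=(N, T))
y = (x + c + rng.normal(size=(N, T)) > 0).astype(float)

X = np.column_stack([np.ones(N * T), x.ravel()])  # intercept + regressor
Y = y.ravel()                                     # individual blocks contiguous

def negloglik(b):
    p = np.clip(norm.cdf(X @ b), 1e-10, 1 - 1e-10)
    return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

b = minimize(negloglik, np.zeros(2), method="BFGS").x

# Per-observation probit scores and the information matrix A.
xb = X @ b
P, phi = norm.cdf(xb), norm.pdf(xb)
s = (phi * (Y - P) / (P * (1 - P)))[:, None] * X
A = (X * (phi**2 / (P * (1 - P)))[:, None]).T @ X

# Sum scores within each individual, then form the sandwich A^{-1} B A^{-1}.
G = s.reshape(N, T, -1).sum(axis=1)
B = G.T @ G
Ainv = np.linalg.inv(A)
se_robust = np.sqrt(np.diag(Ainv @ B @ Ainv))
se_naive = np.sqrt(np.diag(Ainv))

print("slope:", b[1], "naive SE:", se_naive[1], "robust SE:", se_robust[1])
```

In this design the robust standard error exceeds the naive one because the within-individual scores are positively correlated; under dynamic completeness the two would coincide asymptotically.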
In the case that the model (15.57) is dynamically complete, that is,

P(y_it = 1 | x_it, y_{i,t−1}, x_{i,t−1}, …) = P(y_it = 1 | x_it)  (15.58)

inference is considerably easier: all the usual statistics from a probit or logit that pools observations and treats the sample as a long independent cross section of size NT are valid, including likelihood ratio statistics. Remember, we are definitely not assuming independence across t (for example, x_it can contain lagged dependent variables). Dynamic completeness implies that the scores are serially uncorrelated across t, which is the key condition for the standard inference procedures to be valid. (See the general treatment in Section 13.8.)
To test for dynamic completeness, we can always add a lagged dependent variable and possibly lagged explanatory variables. As an alternative, we can derive a simple one-degree-of-freedom test that works regardless of what is in x_it. For concreteness, we focus on the probit case; other index models are handled similarly. Define u_it ≡ y_it − Φ(x_it β), so that, under assumption (15.58), E(u_it | x_it, y_{i,t−1}, x_{i,t−1}, …) = 0 for all t. It follows that u_it is uncorrelated with any function of the variables (x_it, y_{i,t−1}, x_{i,t−1}, …), including u_{i,t−1}. By studying equation (13.53), we can see that it is serial correlation in the u_it that makes the usual inference procedures invalid. Let û_it = y_it − Φ(x_it β̂). Then a simple test is available by using pooled probit to estimate the artificial model
"P(y_it = 1 | x_it, û_{i,t−1}) = Φ(x_it β + γ_1 û_{i,t−1})"  (15.59)

using time periods t = 2, …, T. The null hypothesis is H_0: γ_1 = 0. If H_0 is rejected, then so is assumption (15.58). This is a case where, under the null hypothesis, the estimation of β required to obtain û_{i,t−1} does not affect the limiting distribution of any of the usual test statistics, Wald, LR, or LM, of H_0: γ_1 = 0. The Wald statistic, that is, the t statistic on the estimate of γ_1, is the easiest to obtain. For the LM and LR statistics we must be sure to drop the first time period in estimating the restricted model (γ_1 = 0).

15.8.2 Unobserved Effects Probit Models under Strict Exogeneity
A popular model for binary outcomes with panel data is the unobserved effects probit model. The main assumption of this model is

P(y_it = 1 | x_i, c_i) = P(y_it = 1 | x_it, c_i) = Φ(x_it β + c_i),  t = 1, …, T  (15.60)

where c_i is the unobserved effect and x_i contains x_it for all t. The first equality says that x_it is strictly exogenous conditional on c_i: once c_i is conditioned on, only x_it appears in the response probability at time t. This rules out lagged dependent variables in x_it, as well as certain kinds of explanatory variables whose future movements depend on current and past outcomes on y. (Strict exogeneity also requires that we have enough lags of explanatory variables if there are distributed lag effects.) The