Chapter 15: Discrete Response Models. From Wooldridge, Econometric Analysis of Cross Section and Panel Data (Cambridge, 2010).




We now apply the general methods of Part III to study specific nonlinear models that often arise in applications. Many nonlinear econometric models are intended to explain limited dependent variables. Roughly, a limited dependent variable is a variable whose range is restricted in some important way. Most variables encountered in economics are limited in range, but not all require special treatment. For example, many variables—wage, population, and food consumption, to name just a few—can only take on positive values. If a strictly positive variable takes on numerous values, special econometric methods are rarely called for. Often, taking the log of the variable and then using a linear model suffices.

When the variable to be explained, y, is discrete and takes on a finite number of values, it makes little sense to treat it as an approximately continuous variable. Discreteness of y does not in itself mean that a linear model for E(y | x) is inappropriate. However, in Chapter 15 we will see that linear models have certain drawbacks for modeling binary responses, and we will treat nonlinear models such as probit and logit. We also cover basic multinomial response models in Chapter 15, including the case when the response has a natural ordering.

Other kinds of limited dependent variables arise in econometric analysis, especially when modeling choices by individuals, families, or firms. Optimizing behavior often leads to corner solutions for some nontrivial fraction of the population. For example, during any given time, a fairly large fraction of the working age population does not work outside the home. Annual hours worked has a population distribution spread out over a range of values, but with a pileup at the value zero. While it could be that a linear model is appropriate for modeling expected hours worked, a linear model will likely lead to negative predicted hours worked for some people. Taking the natural log is not possible because of the corner solution at zero. In Chapter 16 we will discuss econometric models that are better suited for describing these kinds of limited dependent variables.

We treat the problem of sample selection in Chapter 17. In many sample selection contexts the underlying population model is linear, but nonlinear econometric methods are required in order to correct for nonrandom sampling. Chapter 17 also covers testing and correcting for attrition in panel data models, as well as methods for dealing with stratified samples.

In Chapter 18 we provide a modern treatment of switching regression models and, more generally, random coefficient models with endogenous explanatory variables. We focus on estimating average treatment effects.

We treat methods for count dependent variables, which take on nonnegative integer values, in Chapter 19. An introduction to modern duration analysis is given in Chapter 20.

15.1 Introduction

In qualitative response models, the variable to be explained, y, is a random variable taking on a finite number of outcomes; in practice, the number of outcomes is usually small. The leading case occurs where y is a binary response, taking on the values zero and one, which indicate whether or not a certain event has occurred. For example, y = 1 if a person is employed, y = 0 otherwise; y = 1 if a family contributes to charity during a particular year, y = 0 otherwise; y = 1 if a firm has a particular type of pension plan, y = 0 otherwise. Regardless of the definition of y, it is traditional to refer to y = 1 as a success and y = 0 as a failure.

As in the case of linear models, we often call y the explained variable, the response variable, the dependent variable, or the endogenous variable; x ≡ (x1, x2, ..., xK) is the vector of explanatory variables, regressors, independent variables, exogenous variables, or covariates.

In binary response models, interest lies primarily in the response probability,

p(x) ≡ P(y = 1 | x) = P(y = 1 | x1, x2, ..., xK)    (15.1)

for various values of x. For example, when y is an employment indicator, x might contain various individual characteristics such as education, age, marital status, and other factors that affect employment status, such as a binary indicator variable for participation in a recent job training program, or measures of past criminal behavior. For a continuous variable, xj, the partial effect of xj on the response probability is

∂P(y = 1 | x)/∂xj = ∂p(x)/∂xj    (15.2)

If xK is a binary variable, interest lies in

p(x1, x2, ..., x_{K-1}, 1) − p(x1, x2, ..., x_{K-1}, 0)    (15.3)

which is the difference in response probabilities when xK = 1 and xK = 0. For most of the models we consider, whether a variable xj is continuous or discrete, the partial effect of xj on p(x) depends on all of x.

In studying binary response models, we need to recall some basic facts about Bernoulli (zero-one) random variables. The only difference between the setup here and that in basic statistics is the conditioning on x. If P(y = 1 | x) = p(x), then P(y = 0 | x) = 1 − p(x), E(y | x) = p(x), and Var(y | x) = p(x)[1 − p(x)].

15.2 The Linear Probability Model for Binary Response

The linear probability model (LPM) for binary response y is specified as

P(y = 1 | x) = b0 + b1x1 + b2x2 + ... + bKxK    (15.4)

As usual, the xj can be functions of underlying explanatory variables, which would simply change the interpretations of the bj. Assuming that x1 is not functionally related to the other explanatory variables, b1 = ∂P(y = 1 | x)/∂x1. Therefore, b1 is the change in the probability of success given a one-unit increase in x1. If x1 is a binary explanatory variable, b1 is just the difference in the probability of success when x1 = 1 and x1 = 0, holding the other xj fixed.

Using functions such as quadratics, logarithms, and so on among the independent variables causes no new difficulties. The important point is that the bj now measure the effects of the explanatory variables xj on a particular probability.

Unless the range of x is severely restricted, the linear probability model cannot be a good description of the population response probability P(y = 1 | x). For given values of the population parameters bj, there would usually be feasible values of x1, ..., xK such that b0 + xb is outside the unit interval. Therefore, the LPM should be seen as a convenient approximation to the underlying response probability. What we hope is that the linear probability approximates the response probability for common values of the covariates. Fortunately, this often turns out to be the case.

In deciding on an appropriate estimation technique, it is useful to derive the conditional mean and variance of y. Since y is a Bernoulli random variable, these are simply

E(y | x) = b0 + b1x1 + b2x2 + ... + bKxK    (15.5)

Var(y | x) = xb(1 − xb)    (15.6)

where xb is shorthand for the right-hand side of equation (15.5).

Equation (15.5) implies that, given a random sample, the OLS regression of y on 1, x1, x2, ..., xK produces consistent and even unbiased estimators of the bj. Equation (15.6) means that heteroskedasticity is present unless all of the slope coefficients b1, ..., bK are zero. A nice way to deal with this issue is to use standard heteroskedasticity-robust standard errors and t statistics. Further, robust tests of multiple restrictions should also be used. There is one case where the usual F statistic can be used, and that is to test for joint significance of all variables (leaving the constant unrestricted). This test is asymptotically valid because Var(y | x) is constant under this particular null hypothesis.

Since the form of the variance is determined by the model for P(y = 1 | x), an asymptotically more efficient method is weighted least squares (WLS). Let b̂ be the OLS estimator, and let ŷi denote the OLS fitted values. Then, provided 0 < ŷi < 1 for all observations i, define the estimated standard deviation as σ̂i ≡ [ŷi(1 − ŷi)]^{1/2}. Then the WLS estimator is obtained from the OLS regression

yi/σ̂i  on  1/σ̂i, xi1/σ̂i, ..., xiK/σ̂i,  i = 1, 2, ..., N    (15.7)

The usual standard errors from this regression are valid, as follows from the treatment of weighted least squares in Chapter 12. In addition, all other testing can be done using F statistics or LM statistics using weighted regressions.
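The OLS-then-WLS procedure in equation (15.7) can be sketched in a few lines. This is a minimal illustration with hypothetical data and a single regressor, not the MROZ.RAW application; the helper weighted_ols simply solves the 2 × 2 normal equations.

```python
import math

# Hypothetical binary outcomes y and a single regressor x (plus an intercept).
y = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
x = [float(v) for v in range(1, 11)]
N = len(y)

def weighted_ols(y, x, w):
    """Solve min_b sum_i w_i (y_i - b0 - b1 x_i)^2 via the 2x2 normal equations."""
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    s_wy = sum(wi * yi for wi, yi in zip(w, y))
    s_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = s_w * s_wxx - s_wx ** 2
    b0 = (s_wxx * s_wy - s_wx * s_wxy) / det
    b1 = (s_w * s_wxy - s_wx * s_wy) / det
    return b0, b1

# Step 1: OLS (unit weights) gives consistent estimates and fitted values.
b0_ols, b1_ols = weighted_ols(y, x, [1.0] * N)
yhat = [b0_ols + b1_ols * xi for xi in x]

# Step 2: provided 0 < yhat_i < 1, reweight each observation by
# 1/sigma_i^2, where sigma_i = [yhat_i (1 - yhat_i)]^(1/2), as in (15.7).
assert all(0.0 < yh < 1.0 for yh in yhat), "WLS needs fitted values inside (0, 1)"
w = [1.0 / (yh * (1.0 - yh)) for yh in yhat]
b0_wls, b1_wls = weighted_ols(y, x, w)
```

Dividing every variable by σ̂i, as regression (15.7) does, is equivalent to weighting each squared residual by 1/σ̂i², which is what the second call implements.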

If some of the OLS fitted values are not between zero and one, WLS analysis is not possible without ad hoc adjustments to bring deviant fitted values into the unit interval. Further, since the OLS fitted value ŷi is an estimate of the conditional probability P(yi = 1 | xi), it is somewhat awkward if the predicted probability is negative or above unity.

Aside from the issue of fitted values being outside the unit interval, the LPM implies that a ceteris paribus unit increase in xj always changes P(y = 1 | x) by the same amount, regardless of the initial value of xj. This implication cannot literally be true, because continually increasing one of the xj would eventually drive P(y = 1 | x) to be less than zero or greater than one.

Even with these weaknesses, the LPM often seems to give good estimates of the partial effects on the response probability near the center of the distribution of x. (How good they are can be determined by comparing the coefficients from the LPM with the partial effects estimated from the nonlinear models we cover in Section 15.3.)

If the main purpose is to estimate the partial effect of xj on the response probability, averaged across the distribution of x, then the fact that some predicted values are outside the unit interval may not be very important. The LPM need not provide very good estimates of partial effects at extreme values of x.

Example 15.1 (Married Women's Labor Force Participation): We use the data from MROZ.RAW to estimate a linear probability model for labor force participation (inlf) of married women. Of the 753 women in the sample, 428 report working nonzero hours during the year. The variables we use to explain labor force participation are age, education, experience, nonwife income in thousands (nwifeinc), number of children less than six years of age (kidslt6), and number of kids between 6 and 18 inclusive (kidsge6); 606 women report having no young children, while 118 report having exactly one young child. The usual OLS standard errors are in parentheses, while the heteroskedasticity-robust standard errors are in brackets:

Of the 753 fitted probabilities, 33 are outside the unit interval. Rather than using some adjustment to those 33 fitted values and applying weighted least squares, we just use OLS and report heteroskedasticity-robust standard errors. Interestingly, these differ in practically unimportant ways from the usual OLS standard errors.

The case for the LPM is even stronger if most of the xj are discrete and take on only a few values. In the previous example, to allow a diminishing effect of young children on the probability of labor force participation, we can break kidslt6 into three binary indicators: no young children, one young child, and two or more young children. The last two indicators can be used in place of kidslt6 to allow the first young child to have a larger effect than subsequent young children. (Interestingly, when this method is used, the marginal effects of the first and second young children are virtually the same. The estimated effect of the first child is about .263, and the additional reduction in the probability of labor force participation for the next child is about .274.)

In the extreme case where the model is saturated—that is, x contains dummy variables for mutually exclusive and exhaustive categories—the linear probability model is completely general. The fitted probabilities are simply the average yi within each cell defined by the different values of x; we need not worry about fitted probabilities less than zero or greater than one. See Problem 15.1.
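The saturated-model claim can be checked numerically: with mutually exclusive, exhaustive dummies (and no separate intercept) the OLS coefficients are exactly the cell means of y. The category labels and outcomes below are made up.

```python
def solve(A, b):
    """Solve the square system A t = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * e for a, e in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

cats = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]   # three exhaustive cells (hypothetical)
y    = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
K = 3
X = [[1.0 if c == k else 0.0 for k in range(K)] for c in cats]

# OLS normal equations X'X t = X'y; the dummy columns are orthogonal,
# so the solution is just the per-cell average of y.
XtX = [[sum(X[i][j] * X[i][k] for i in range(len(y))) for k in range(K)] for j in range(K)]
Xty = [sum(X[i][j] * y[i] for i in range(len(y))) for j in range(K)]
coefs = solve(XtX, Xty)

cell_means = [sum(y[i] for i in range(len(y)) if cats[i] == k) /
              sum(1 for c in cats if c == k) for k in range(K)]
```

Because each fitted probability is an average of zeros and ones, it automatically lies in [0, 1], which is the point of the saturated case.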

15.3 Index Models for Binary Response: Probit and Logit

We now study binary response models of the form

P(y = 1 | x) = G(xb) ≡ p(x)    (15.8)

where x is 1 × K, b is K × 1, and we take the first element of x to be unity. Examples where x does not contain unity are rare in practice. For the linear probability model, G(z) = z is the identity function, which means that the response probabilities cannot be between 0 and 1 for all x and b. In this section we assume that G(·) takes on values in the open unit interval: 0 < G(z) < 1 for all z ∈ R.

The model in equation (15.8) is generally called an index model because it restricts the way in which the response probability depends on x: p(x) is a function of x only through the index xb = b1 + b2x2 + ... + bKxK. The function G maps the index into the response probability.

In most applications, G is a cumulative distribution function (cdf), whose specific form can sometimes be derived from an underlying economic model. For example, in Problem 15.2 you are asked to derive an index model from a utility-based model of charitable giving. The binary indicator y equals unity if a family contributes to charity and zero otherwise. The vector x contains family characteristics, income, and the price of a charitable contribution (as determined by marginal tax rates). Under a normality assumption on a particular unobservable taste variable, G is the standard normal cdf. Index models where G is a cdf can be derived more generally from an underlying latent variable model, as in Example 13.1:

y* = xb + e,  y = 1[y* > 0]    (15.9)

where e is a continuously distributed variable independent of x and the distribution of e is symmetric about zero; recall from Chapter 13 that 1[·] is the indicator function. If G is the cdf of e, then, because the pdf of e is symmetric about zero, 1 − G(−z) = G(z) for all real numbers z. Therefore,

P(y = 1 | x) = P(y* > 0 | x) = P(e > −xb | x) = 1 − G(−xb) = G(xb)

which is exactly equation (15.8).
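The symmetry step can be verified numerically for the two leading choices of G; the standard normal cdf is expressed through math.erf, which is exact enough for this purpose.

```python
import math

def probit_G(z):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_G(z):
    """Logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))

# For a cdf symmetric about zero, 1 - G(-z) = G(z); hence
# P(y = 1 | x) = P(e > -xb) = 1 - G(-xb) = G(xb), equation (15.8).
for G in (probit_G, logit_G):
    for z in (-2.0, -0.5, 0.0, 1.3):
        assert abs(1.0 - G(-z) - G(z)) < 1e-12
```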


There is no particular reason for requiring e to be symmetrically distributed in thelatent variable model, but this happens to be the case for the binary response modelsapplied most often.

In most applications of binary response models, the primary goal is to explain the effects of the xj on the response probability P(y = 1 | x). The latent variable formulation tends to give the impression that we are primarily interested in the effects of each xj on y*. As we will see, the direction of the effects of xj on E(y* | x) = xb and on E(y | x) = P(y = 1 | x) = G(xb) are the same. But the latent variable y* rarely has a well-defined unit of measurement (for example, y* might be measured in utility units). Therefore, the magnitude of bj is not especially meaningful except in special cases.

The probit model is the special case of equation (15.8) with G(z) = Φ(z), the standard normal cdf; the logit model is the special case with G(z) = Λ(z) ≡ exp(z)/[1 + exp(z)], the logistic cdf.

In order to successfully apply probit and logit models, it is important to know how to interpret the bj on both continuous and discrete explanatory variables. First, if xj is a continuous variable, its partial effect on the response probability is

∂p(x)/∂xj = g(xb)bj

where g(z) ≡ dG(z)/dz. Because g(z) > 0 for probit and logit, the sign of the effect is given by the sign of bj. Also, the relative effects do not depend on x: for continuous variables xj and xh, the ratio of the partial effects is constant and given by the ratio of the corresponding coefficients: [∂p(x)/∂xj]/[∂p(x)/∂xh] = bj/bh. In the typical case that g is a symmetric density about zero, with unique mode at zero, the largest effect is when xb = 0. For example, in the probit case with g(z) = φ(z), g(0) = φ(0) = 1/√(2π) ≈ .399. In the logit case, g(z) = exp(z)/[1 + exp(z)]², and so g(0) = .25.
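A quick check of these two density values, and of the constancy of relative effects, using hypothetical coefficients bj and bh:

```python
import math

def probit_g(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def logit_g(z):
    """Logistic pdf: exp(z)/[1 + exp(z)]^2."""
    ez = math.exp(z)
    return ez / (1.0 + ez) ** 2

# Largest partial effect occurs at xb = 0 for these symmetric, unimodal densities.
g0_probit = probit_g(0.0)   # 1/sqrt(2*pi), about .399
g0_logit = logit_g(0.0)     # exactly .25

# Relative effects of two continuous regressors do not depend on x:
bj, bh = 0.8, -0.4          # hypothetical coefficients
for xb in (-1.0, 0.0, 2.0):
    ratio = (probit_g(xb) * bj) / (probit_g(xb) * bh)
    assert abs(ratio - bj / bh) < 1e-12
```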

If xK is a binary explanatory variable, then the partial effect from changing xK from zero to one, holding all other variables fixed, is simply

G(b1 + b2x2 + ... + b_{K-1}x_{K-1} + bK) − G(b1 + b2x2 + ... + b_{K-1}x_{K-1})    (15.14)

Again, this expression depends on all other values of the other xj. For example, if y is an employment indicator and xj is a dummy variable indicating participation in a job training program, then expression (15.14) is the change in the probability of employment due to the job training program; this depends on other characteristics that affect employability, such as education and experience. Knowing the sign of bK is enough to determine whether the program had a positive or negative effect. But to find the magnitude of the effect, we have to estimate expression (15.14).
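Expression (15.14) is easy to evaluate once G and the coefficients are in hand. The probit coefficients below are hypothetical, chosen only to show that the training effect has the sign of bK but varies with the other covariates:

```python
import math

def probit_G(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical probit index: intercept b1, one continuous control x2,
# and a binary job-training indicator xK with coefficient bK.
b1, b2, bK = -1.0, 0.08, 0.4

def training_effect(x2):
    """Expression (15.14): change in P(y = 1 | x) as xK goes from 0 to 1."""
    return probit_G(b1 + b2 * x2 + bK) - probit_G(b1 + b2 * x2)

# The effect depends on the other covariates: it differs across x2 values,
# but its sign is always that of bK.
effects = {x2: training_effect(x2) for x2 in (5.0, 10.0, 20.0)}
```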

We can also use the difference in expression (15.14) for other kinds of discrete variables (such as number of children). If xK denotes this variable, then the effect on the probability of xK going from cK to cK + 1 is simply

G(b1 + b2x2 + ... + b_{K-1}x_{K-1} + bK(cK + 1)) − G(b1 + b2x2 + ... + b_{K-1}x_{K-1} + bKcK)    (15.15)

If, say, the model contains log(z2) with coefficient bj, then g(xb)(bj/100) is the approximate change in P(y = 1 | z) given a 1 percent increase in z2. Models with interactions among explanatory variables, including interactions between discrete and continuous variables, are handled similarly. When measuring effects of discrete variables, we should use expression (15.15).


15.4 Maximum Likelihood Estimation of Binary Response Index Models

Assume we have N independent, identically distributed observations following the model (15.8). Since we essentially covered the case of probit in Chapter 13, the discussion here will be brief. To estimate the model by (conditional) maximum likelihood, we need the log-likelihood function for each i. The density of yi given xi can be written as

f(y | xi; b) = [G(xib)]^y [1 − G(xib)]^{1−y},  y = 0, 1    (15.16)

The log likelihood for observation i is a function of the K × 1 vector of parameters and the data (xi, yi):

li(b) = yi log[G(xib)] + (1 − yi) log[1 − G(xib)]    (15.17)

(Recall from Chapter 13 that, technically speaking, we should distinguish the "true" value of beta, bo, from a generic value. For conciseness we do not do so here.) Restricting G(·) to be strictly between zero and one ensures that li(b) is well defined for all values of b.

As usual, the log likelihood for a sample size of N is L(b) = Σ_{i=1}^{N} li(b), and the MLE of b, denoted b̂, maximizes this log likelihood. If G(·) is the standard normal cdf, then b̂ is the probit estimator; if G(·) is the logistic cdf, then b̂ is the logit estimator. From the general maximum likelihood results we know that b̂ is consistent and asymptotically normal. We can also easily estimate the asymptotic variance of b̂.

We assume that G(·) is twice continuously differentiable, an assumption that is usually satisfied in applications (and, in particular, for probit and logit). As before, the function g(z) is the derivative of G(z). For the probit model, g(z) = φ(z), and for the logit model, g(z) = exp(z)/[1 + exp(z)]².
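For the logit case, where the score simplifies to Σi (yi − G(xib))xi', the MLE can be computed by Newton's method in a few lines. The data are hypothetical (one regressor plus an intercept), and the 2 × 2 system is solved by hand:

```python
import math

def logit_G(z):
    """Logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: binary outcome y, one regressor x, plus an intercept.
y = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
x = [float(v) for v in range(1, 11)]

def loglik(b0, b1):
    # l_i(b) = y_i log G(x_i b) + (1 - y_i) log[1 - G(x_i b)], as in (15.17)
    return sum(yi * math.log(logit_G(b0 + b1 * xi)) +
               (1 - yi) * math.log(1.0 - logit_G(b0 + b1 * xi))
               for yi, xi in zip(y, x))

def score_and_neg_hessian(b0, b1):
    s0 = s1 = h00 = h01 = h11 = 0.0
    for yi, xi in zip(y, x):
        p = logit_G(b0 + b1 * xi)
        u = yi - p            # logit score contribution: (y_i - G_i) x_i'
        w = p * (1.0 - p)     # minus-Hessian weights: G_i (1 - G_i)
        s0 += u
        s1 += u * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return (s0, s1), (h00, h01, h11)

# Newton-Raphson: b_new = b + H^{-1} s, with H the negative Hessian.
b0 = b1 = 0.0
for _ in range(25):
    (s0, s1), (h00, h01, h11) = score_and_neg_hessian(b0, b1)
    det = h00 * h11 - h01 * h01
    b0 += (h11 * s0 - h01 * s1) / det
    b1 += (h00 * s1 - h01 * s0) / det

score, _ = score_and_neg_hessian(b0, b1)  # should be (near) zero at the MLE
```

Because the logit log likelihood is globally concave, these iterations converge from the zero starting value; the same loop with the probit G and its score would need the probit-specific weights.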

Using the same calculations as in the probit example in Chapter 13, the score of the conditional log likelihood for observation i can be shown to be

si(b) = g(xib) xi' [yi − G(xib)] / {G(xib)[1 − G(xib)]}    (15.18)

and the estimated (conditional) expected Hessian of −li(b) is

Âi ≡ ĝi² xi' xi / [Ĝi(1 − Ĝi)]    (15.19)

where Ĝi ≡ G(xi b̂) and ĝi ≡ g(xi b̂), so that the asymptotic variance of b̂ is estimated as

V̂ ≡ ( Σ_{i=1}^{N} Âi )^{−1}    (15.20)

In most cases the inverse exists, and when it does, V̂ is positive definite. If the matrix in equation (15.20) is not invertible, then perfect collinearity probably exists among the regressors.

As usual, we treat b̂ − b as being normally distributed with mean zero and variance matrix in equation (15.20). The (asymptotic) standard error of b̂j is the square root of the jth diagonal element of V̂. These can be used to construct t statistics, which have a limiting standard normal distribution, and to construct approximate confidence intervals for each population parameter. These are reported with the estimates for packages that perform logit and probit. We discuss multiple hypothesis testing in the next section.

Some packages also compute Huber-White standard errors as an option for probit and logit analysis, using the general M-estimator formulas; see, in particular, equation (12.49). While the robust variance matrix is consistent, using it in place of the usual estimator means we must think that the binary response model is incorrectly specified. Unlike with nonlinear regression, in a binary response model it is not possible to correctly specify E(y | x) but to misspecify Var(y | x). Once we have specified P(y = 1 | x), we have specified all conditional moments of y given x.

In Section 15.8 we will see that, when using binary response models with panel data or cluster samples, it is sometimes important to compute variance matrix estimators that are robust to either serial dependence or within-group correlation. But this need arises as a result of dependence across time or subgroup, and not because the response probability is misspecified.

15.5 Testing in Binary Response Index Models

Any of the three tests from general MLE analysis—the Wald, LR, or LM test—can be used to test hypotheses in binary response contexts. Since the tests are all asymptotically equivalent under local alternatives, the choice of statistic usually depends on computational simplicity (since finite sample comparisons must be limited in scope). In the following subsections we discuss some testing situations that often arise in binary choice analysis, and we recommend particular tests for their computational advantages.

15.5.1 Testing Multiple Exclusion Restrictions

Consider the model

P(y = 1 | x, z) = G(xb + zγ)    (15.21)

where x is 1 × K and z is 1 × Q. We wish to test the null hypothesis H0: γ = 0, so we are testing Q exclusion restrictions. The elements of z can be functions of x, such as quadratics and interactions—in which case the test is a pure functional form test. Or, the z can be additional explanatory variables. For example, z could contain dummy variables for occupation or region. In any case, the form of the test is the same. Some packages, such as Stata, compute the Wald statistic for exclusion restrictions using a simple command following estimation of the general model. This capability makes it very easy to test multiple exclusion restrictions, provided the dimension of (x, z) is not so large as to make probit estimation difficult.

The likelihood ratio statistic is also easy to use. Let Lur denote the value of the log-likelihood function from probit of y on x and z (the unrestricted model), and let Lr denote the value of the log-likelihood function from probit of y on x (the restricted model). Then the likelihood ratio test of H0: γ = 0 is simply 2(Lur − Lr), which has an asymptotic χ²_Q distribution under H0. This is analogous to the usual F statistic in OLS analysis of a linear model.
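The LR statistic is simple to illustrate when z is a single binary regressor, because then both the restricted (intercept-only) and unrestricted logit models are saturated: cell frequencies maximize the likelihood, so no numerical optimization is needed. The cell counts below are hypothetical.

```python
import math

def bernoulli_loglik(n1, n0):
    """Maximized Bernoulli log likelihood for a cell with n1 ones and n0 zeros."""
    n = n1 + n0
    p = n1 / n
    out = 0.0
    if n1:
        out += n1 * math.log(p)
    if n0:
        out += n0 * math.log(1.0 - p)
    return out

# Hypothetical counts: z = 0 cell has 30 ones / 70 zeros; z = 1 cell, 60 / 40.
Lr = bernoulli_loglik(30 + 60, 70 + 40)                     # restricted: intercept only
Lur = bernoulli_loglik(30, 70) + bernoulli_loglik(60, 40)   # unrestricted: cell by cell
LR = 2.0 * (Lur - Lr)

# Under H0: gamma = 0, LR ~ chi-square with Q = 1 df; 5% critical value 3.84.
reject = LR > 3.84
```

Since the unrestricted model nests the restricted one, Lur ≥ Lr always, so LR is nonnegative by construction.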

The score or LM test is attractive if the unrestricted model is difficult to estimate. In this section, let b̂ denote the restricted estimator of b, that is, the probit or logit estimator with z excluded from the model. The LM statistic using the estimated expected Hessian, Âi [see equation (15.20) and Section 12.6.2], can be shown to be numerically identical to the following: (1) Define ûi ≡ yi − G(xi b̂), Ĝi ≡ G(xi b̂), and ĝi ≡ g(xi b̂). These are all obtainable after estimating the model without z. (2) Use all N observations to run the auxiliary OLS regression

ûi/[Ĝi(1 − Ĝi)]^{1/2}  on  ĝi xi/[Ĝi(1 − Ĝi)]^{1/2},  ĝi zi/[Ĝi(1 − Ĝi)]^{1/2}    (15.22)

The LM statistic equals N times the uncentered R-squared from this regression.

The LM procedure is rather easy to remember. The term ĝi xi is the gradient of the mean function G(xib + ziγ) with respect to b, evaluated at b = b̂ and γ = 0. Similarly, ĝi zi is the gradient of G(xib + ziγ) with respect to γ, again evaluated at b = b̂ and γ = 0. Finally, under H0: γ = 0, the conditional variance of ui given (xi, zi) is G(xib)[1 − G(xib)]; therefore, [Ĝi(1 − Ĝi)]^{1/2} is an estimate of the conditional standard deviation of ui. The dependent variable in regression (15.22) is often called a standardized residual because it is an estimate of ui/[Gi(1 − Gi)]^{1/2}, which has unit conditional (and unconditional) variance. The regressors are simply the gradient of the conditional mean function with respect to both sets of parameters, evaluated under H0, and weighted by the estimated inverse conditional standard deviation. The first set of regressors in regression (15.22) is 1 × K and the second set is 1 × Q.

Under H0, LM is distributed asymptotically as χ²_Q. The LM approach can be an attractive alternative to the LR statistic if z has large dimension, since with many explanatory variables probit can be difficult to estimate.

15.5.2 Testing Nonlinear Hypotheses about b

For testing nonlinear restrictions on b in equation (15.8), the Wald statistic is computationally the easiest because the unrestricted estimator of b, which is just probit or logit, is easy to obtain. Actually imposing nonlinear restrictions in estimation—which is required to apply the score or likelihood ratio methods—can be difficult. However, we must also remember that the Wald statistic for testing nonlinear restrictions is not invariant to reparameterizations, whereas the LM and LR statistics are. (See Sections 12.6 and 13.6; for the LM statistic, we would always use the expected Hessian.)

Let the restrictions on b be given by H0: c(b) = 0, where c(b) is a Q × 1 vector of possibly nonlinear functions satisfying the differentiability and rank requirements from Chapter 13. Then, from the general MLE analysis, the Wald statistic is simply

W = c(b̂)' [∇b c(b̂) V̂ ∇b c(b̂)']^{−1} c(b̂)    (15.23)

where V̂ is given in equation (15.20) and ∇b c(b̂) is the Q × K Jacobian of c(b) evaluated at b̂.

15.5.3 Tests against More General Alternatives

In addition to testing for omitted variables, sometimes we wish to test the probit or logit model against a more general functional form. When the alternatives are not standard binary response models, the Wald and LR statistics are cumbersome to apply, whereas the LM approach is convenient because it only requires estimation of the null model.

As an example of a more complicated binary choice model, consider the latent variable model (15.9), but assume that e | x ~ Normal[0, exp(2x1δ)], where x1 is a 1 × K1 subset of x that excludes a constant and δ is a K1 × 1 vector of additional parameters. (In many cases we would take x1 to be all nonconstant elements of x.) Therefore, there is heteroskedasticity in the latent variable model, so that e is no longer independent of x. The standard deviation of e given x is simply exp(x1δ). Define r ≡ e/exp(x1δ), so that r is independent of x with a standard normal distribution. Then

Trang 13

P(y = 1 | x) = P(e > −xb | x) = P[exp(−x1δ)e > −exp(−x1δ)xb]
            = P[r > −exp(−x1δ)xb] = Φ[exp(−x1δ)xb]    (15.24)

The partial effects of xj on P(y = 1 | x) are much more complicated in equation (15.24) than in equation (15.8). When δ = 0, we obtain the standard probit model. Therefore, a test of the probit functional form for the response probability is a test of H0: δ = 0.
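The nesting in equation (15.24) can be checked directly; for simplicity the index xb and the product x1δ are scalars here.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def het_probit_prob(xb, x1d):
    """Equation (15.24): P(y = 1 | x) = Phi[exp(-x1*delta) * xb]."""
    return Phi(math.exp(-x1d) * xb)

# delta = 0 (so x1*delta = 0) collapses the model to standard probit,
# which is why the functional form test is a test of H0: delta = 0.
for xb in (-1.5, 0.0, 0.7):
    assert abs(het_probit_prob(xb, 0.0) - Phi(xb)) < 1e-12

# With x1*delta > 0 the index is scaled toward zero, pulling the
# response probability toward one-half.
p = het_probit_prob(0.7, 1.0)
```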

In the previous example, G(·) = Φ(·), δ0 = 0, and m(xb, x, δ) = Φ[exp(−x1δ)xb]. Let b̂ be the probit or logit estimator of b obtained under δ = δ0. Define ûi ≡ yi − G(xi b̂), Ĝi ≡ G(xi b̂), and ĝi ≡ g(xi b̂). The gradient of the mean function m(xib, xi; δ) with respect to b, evaluated at δ0, is simply g(xib)xi. The only other piece we need is the gradient of m(xib, xi; δ) with respect to δ, evaluated at δ0. Denote this 1 × Q vector as ∇δ m(xib, xi; δ0). Further, set ∇δ m̂i ≡ ∇δ m(xi b̂, xi; δ0). The LM statistic can be obtained as the explained sum of squares or NR²u from the regression

ûi/[Ĝi(1 − Ĝi)]^{1/2}  on  ĝi xi/[Ĝi(1 − Ĝi)]^{1/2},  ∇δ m̂i/[Ĝi(1 − Ĝi)]^{1/2}    (15.27)

This statistic is distributed asymptotically as χ²_Q, where Q is the dimension of δ.

When applying this test to the preceding probit example, we have only ∇δ m̂i left to compute. But m(xib, xi; δ) = Φ[exp(−xi1δ)xib], and so

∇δ m(xib, xi; δ) = −(xib) exp(−xi1δ) xi1 φ[exp(−xi1δ)xib]

When evaluated at b = b̂ and δ = 0 (the null value), we get ∇δ m̂i = −(xi b̂) φ(xi b̂) xi1 ≡ −(xi b̂) φ̂i xi1, a 1 × K1 vector. Regression (15.27) becomes

ûi/[Φ̂i(1 − Φ̂i)]^{1/2}  on  φ̂i xi/[Φ̂i(1 − Φ̂i)]^{1/2},  (xi b̂) φ̂i xi1/[Φ̂i(1 − Φ̂i)]^{1/2}    (15.28)


(We drop the minus sign because it does not affect the value of the explained sum of squares or R²u.) Under the null hypothesis that the probit model is correctly specified, LM is distributed asymptotically as χ²_{K1}. This statistic is easy to compute after estimation by probit.

For a one-degree-of-freedom test regardless of the dimension of xi, replace the last term in regression (15.28) with (xi b̂)² φ̂i/[Φ̂i(1 − Φ̂i)]^{1/2}, and then the explained sum of squares is distributed asymptotically as χ²_1. See Davidson and MacKinnon (1984) for further examples.

15.6 Reporting the Results for Probit and Logit

Several statistics should be reported routinely in any probit or logit (or other binary choice) analysis. The b̂j, their standard errors, and the value of the likelihood function are reported by all software packages that do binary response analysis. The b̂j give the signs of the partial effects of each xj on the response probability, and the statistical significance of xj is determined by whether we can reject H0: bj = 0.

One measure of goodness of fit that is usually reported is the percent correctly predicted. For each i, we compute the predicted probability that yi = 1, given the explanatory variables, xi. If G(xi b̂) > .5, we predict yi to be unity; if G(xi b̂) ≤ .5, yi is predicted to be zero. The percentage of times the predicted yi matches the actual yi is the percent correctly predicted. In many cases it is easy to predict one of the outcomes and much harder to predict the other outcome, in which case the percent correctly predicted can be misleading as a goodness-of-fit statistic. More informative is to compute the percent correctly predicted for each outcome, y = 0 and y = 1. The overall percent correctly predicted is a weighted average of the two, with the weights being the fractions of zero and one outcomes, respectively. Problem 15.7 provides an illustration.
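A sketch of the per-outcome calculation, using hypothetical fitted probabilities rather than output from an actual probit fit:

```python
# Per-outcome "percent correctly predicted", using a .5 threshold on
# fitted probabilities. The probabilities and outcomes are hypothetical.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.65, 0.2, 0.55, 0.1]
y     = [1,   1,   0,   1,   1,   0,   1,    0,   0,    0]

pred = [1 if p > 0.5 else 0 for p in probs]
overall = sum(p == a for p, a in zip(pred, y)) / len(y)

def pct_correct(outcome):
    """Fraction of observations with y = outcome that are predicted correctly."""
    idx = [i for i, a in enumerate(y) if a == outcome]
    return sum(pred[i] == outcome for i in idx) / len(idx)

pct0, pct1 = pct_correct(0), pct_correct(1)

# The overall rate is a weighted average of the two, with weights equal
# to the fractions of zero and one outcomes.
w1 = sum(y) / len(y)
assert abs(overall - ((1 - w1) * pct0 + w1 * pct1)) < 1e-12
```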

Various pseudo R-squared measures have been proposed for binary response. McFadden (1974) suggests the measure 1 − Lur/Lo, where Lur is the log-likelihood function for the estimated model and Lo is the log-likelihood function in the model with only an intercept. Because the log likelihood for a binary response model is always negative, |Lur| ≤ |Lo|, and so the pseudo R-squared is always between zero and one. Alternatively, we can use a sum of squared residuals measure: 1 − SSRur/SSRo, where SSRur is the sum of squared residuals ûi = yi − G(xi b̂) and SSRo is the total sum of squares of yi. Several other measures have been suggested (see, for example, Maddala, 1983, Chapter 2), but goodness of fit is not as important as statistical and economic significance of the explanatory variables. Estrella (1998) contains a recent comparison of goodness-of-fit measures for binary response.
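McFadden's measure needs only the two maximized log likelihoods; the intercept-only value Lo has the closed form N[ȳ log ȳ + (1 − ȳ) log(1 − ȳ)]. The unrestricted value below is a hypothetical placeholder, not a fitted result.

```python
import math

# Intercept-only log likelihood: with no regressors the MLE of p is ybar, so
# Lo = N * [ybar*log(ybar) + (1 - ybar)*log(1 - ybar)].
N, ones = 200, 90
ybar = ones / N
Lo = N * (ybar * math.log(ybar) + (1.0 - ybar) * math.log(1.0 - ybar))

# Hypothetical maximized log likelihood for the estimated model; any model
# that nests the intercept must satisfy Lo <= Lur < 0.
Lur = -110.0
pseudo_R2 = 1.0 - Lur / Lo
```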


Often we want to estimate the effects of the variables xj on the response probabilities P(y = 1 | x). If xj is (roughly) continuous, then

ΔP̂(y = 1 | x) ≈ g(x b̂) b̂j Δxj    (15.29)

for small changes in xj. (As usual when using calculus, the notion of "small" here is somewhat vague.) Since g(x b̂) depends on x, we must compute g(x b̂) at interesting values of x. Often the sample averages of the xj's are plugged in to get g(x̄ b̂). This factor can then be used to adjust each of the b̂j (at least those on continuous variables) to obtain the effect of a one-unit increase in xj. If x contains nonlinear functions of some explanatory variables, such as natural logs or quadratics, there is the issue of using the log of the average versus the average of the log (and similarly with quadratics). To get the effect for the "average" person, it makes more sense to plug the averages into the nonlinear functions, rather than average the nonlinear functions. Software packages (such as Stata with the dprobit command) necessarily average the nonlinear functions. Sometimes minimum and maximum values of key variables are used in obtaining g(x b̂), so that we can see how the partial effects change as some elements of x get large or small.

Equation (15.29) also suggests how to roughly compare magnitudes of the probit and logit estimates. If x β̂ is close to zero for logit and probit, the scale factor we use can be g(0). For probit, g(0) ≈ .4, and for logit, g(0) = .25. Thus the logit estimates can be expected to be larger by a factor of about .4/.25 = 1.6. Alternatively, multiply the logit estimates by .625 to make them comparable to the probit estimates. In the linear probability model, g(0) is unity, and so logit estimates should be divided by four to compare them with LPM estimates, while probit estimates should be divided by 2.5 to make them roughly comparable to LPM estimates. More accurate comparisons are obtained by using the scale factors g(x̄ β̂) for probit and logit. Of course, one of the potential advantages of using probit or logit is that the partial effects vary with x, and it is of some interest to compute g(x β̂) at values of x other than the sample averages.

If, say, x_2 is a binary variable, it perhaps makes more sense to plug in zero or one for x_2, rather than x̄_2 (which is the fraction of ones in the sample). Putting in the averages for the binary variables means that the effect does not really correspond to a particular individual. But often the results are similar, and the choice is really based on convenience.
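The scale factors are straightforward to compute directly. A sketch (the index value z_bar and the Table 15.1 coefficient are used only as illustrations; in practice x̄ β̂ comes from the fitted model):

```python
import numpy as np
from scipy.stats import norm, logistic

# Probit density g(z) = phi(z); logit density g(z) = exp(-z)/(1+exp(-z))^2
def probit_scale(z):
    return norm.pdf(z)

def logit_scale(z):
    return logistic.pdf(z)

# At z = 0 the factors reduce to the rule-of-thumb values:
print(probit_scale(0.0))   # about .399
print(logit_scale(0.0))    # exactly .25

# Adjusting a coefficient: approximate partial effect = g(x_bar * b_hat) * b_hat_j
beta_educ_probit = 0.131   # probit coefficient on educ from Table 15.1
z_bar = 0.51               # hypothetical value of x_bar * b_hat
print(probit_scale(z_bar) * beta_educ_probit)
```

The ratio probit_scale(0)/logit_scale(0) ≈ 1.6 reproduces the rule of thumb for comparing logit and probit magnitudes.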


If x_K is a discrete variable, then we can estimate the change in the predicted probability in going from c_K to c_K + 1 as

    G[β̂_1 + β̂_2 x̄_2 + · · · + β̂_{K−1} x̄_{K−1} + β̂_K(c_K + 1)]
      − G(β̂_1 + β̂_2 x̄_2 + · · · + β̂_{K−1} x̄_{K−1} + β̂_K c_K)    (15.31)

In particular, when x_K is a binary variable, set c_K = 0. Of course, the other x_j's can be evaluated anywhere, but the use of sample averages is typical. The delta method can be used to obtain a standard error of equation (15.31). For probit, Stata does this calculation when x_K is a binary variable. Usually the calculations ignore the fact that x̄_j is an estimate of E(x_j) in applying the delta method. If we are truly interested in β_K g(μ_x β), the estimation error in x̄ can be accounted for, but it makes the calculation more complicated, and it is unlikely to have a large effect.

An alternative way to summarize the estimated marginal effects is to estimate the average value of β_K g(xβ) across the population, or β_K E[g(xβ)]. A consistent estimator is

    β̂_K N^{−1} Σ_{i=1}^{N} g(x_i β̂)    (15.32)

if x_K is continuous, or

    N^{−1} Σ_{i=1}^{N} [G(β̂_1 + β̂_2 x_{i2} + · · · + β̂_{K−1} x_{i,K−1} + β̂_K)
      − G(β̂_1 + β̂_2 x_{i2} + · · · + β̂_{K−1} x_{i,K−1})]    (15.33)

if x_K is binary. The delta method can be used to obtain an asymptotic standard error of expression (15.32) or (15.33). Costa (1995) is a recent example of average effects obtained from expression (15.33).
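Expression (15.33) amounts to averaging a difference in fitted probabilities across the sample. A sketch for the probit case (G = Φ), with simulated data and illustrative, made-up coefficient values:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 1000
x2 = rng.normal(size=n)              # a continuous covariate
beta = np.array([0.2, 0.5, -0.7])    # (intercept, x2, binary x_K); illustrative values

# Average partial effect of switching binary x_K from 0 to 1:
# average the difference in fitted probabilities across observations
idx_base = beta[0] + beta[1] * x2
ape = np.mean(norm.cdf(idx_base + beta[2]) - norm.cdf(idx_base))
print(ape)
```

Because the switch coefficient is negative here, the averaged difference in probabilities is negative, and its magnitude is smaller than a naive scale-factor adjustment at extreme index values would suggest.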


Example 15.2 (Married Women's Labor Force Participation): We now estimate logit and probit models for women's labor force participation. For comparison we report the linear probability estimates. The results, with standard errors in parentheses, are given in Table 15.1 (for the LPM, these are heteroskedasticity-robust). The estimates from the three models tell a consistent story. The signs of the coefficients are the same across models, and the same variables are statistically significant in each model. The pseudo R-squared for the LPM is just the usual R-squared reported for OLS; for logit and probit the pseudo R-squared is the measure based on the log likelihoods described previously. In terms of overall percent correctly predicted, the models do equally well. For the probit model, it correctly predicts "out of the labor force" about 63.1 percent of the time, and it correctly predicts "in the labor force" about 81.3 percent of the time. The LPM has the same overall percent correctly predicted, but there are slight differences within each outcome.

Table 15.1
LPM, Logit, and Probit Estimates of Labor Force Participation

Dependent Variable: inlf

Independent Variable    LPM (OLS)           Logit (MLE)        Probit (MLE)
nwifeinc                −.0034 (.0015)      −.021 (.008)       −.012 (.005)
educ                     .038  (.007)        .221 (.043)        .131 (.025)
exper                    .039  (.006)        .206 (.032)        .123 (.019)
exper²                  −.00060 (.00019)    −.0032 (.0010)     −.0019 (.0006)
age                     −.016  (.002)       −.088 (.015)       −.053 (.008)
kidslt6                 −.262  (.032)       −1.443 (.204)      −.868 (.119)
kidsge6                  .013  (.013)        .060 (.075)        .036 (.043)
constant                 .586  (.151)        .425 (.860)        .270 (.509)
Percent correctly
  predicted              73.4                73.6               73.4

As we emphasized earlier, the magnitudes of the coefficients are not directly comparable across the models. Using the rough rule of thumb discussed earlier, we can


divide the logit estimates by four and the probit estimates by 2.5 to make all estimates comparable to the LPM estimates. For example, for the coefficients on kidslt6, the scaled logit estimate is about −.361, and the scaled probit estimate is about −.347. These are larger in magnitude than the LPM estimate (for reasons we will soon discuss). The scaled coefficient on educ is .055 for logit and .052 for probit.

If we evaluate the standard normal probability density function, φ(β̂_0 + β̂_1 x̄_1 + · · · + β̂_k x̄_k), at the average values of the independent variables in the sample (including the average of exper²), we obtain about .391; this value is close enough to .4 to make the rough rule of thumb for scaling the probit coefficients useful in obtaining the effects on the response probability. In other words, to estimate the change in the response probability given a one-unit increase in any independent variable, we multiply the corresponding probit coefficient by .4.

The biggest difference between the LPM on one hand, and the logit and probit models on the other, is that the LPM assumes constant marginal effects for educ, kidslt6, and so on, while the logit and probit models imply diminishing marginal magnitudes of the partial effects. In the LPM, one more small child is estimated to reduce the probability of labor force participation by about .262, regardless of how many young children the woman already has (and regardless of the levels of the other explanatory variables). We can contrast this finding with the estimated marginal effect from probit. For concreteness, take a woman with nwifeinc = 20.13, educ = 12.3, exper = 10.6, age = 42.5 (roughly the sample averages) and kidsge6 = 1. What is the estimated fall in the probability of working in going from zero to one small child? We evaluate the standard normal cdf, Φ(β̂_0 + β̂_1 x_1 + · · · + β̂_k x_k), with kidslt6 = 1 and kidslt6 = 0, and the other independent variables set at the values given. We get roughly .373 − .707 = −.334, which means that the labor force participation probability is about .334 lower when a woman has one young child. This is not much different from the scaled probit coefficient of −.347. If the woman goes from one to two young children, the probability falls even more, but the marginal effect is not as large: .117 − .373 = −.256. Interestingly, the estimate from the linear probability model, which we think can provide a good estimate near the average values of the covariates, is in fact between the probit partial effects estimated starting from zero children and from one child.
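The probabilities in this example can be reproduced (approximately, since the coefficients in Table 15.1 are rounded) directly from the probit column:

```python
from scipy.stats import norm

# Rounded probit coefficients from Table 15.1
b = {"const": 0.270, "nwifeinc": -0.012, "educ": 0.131, "exper": 0.123,
     "expersq": -0.0019, "age": -0.053, "kidslt6": -0.868, "kidsge6": 0.036}

def p_inlf(kidslt6, nwifeinc=20.13, educ=12.3, exper=10.6, age=42.5, kidsge6=1):
    """Fitted probability of labor force participation at given covariate values."""
    z = (b["const"] + b["nwifeinc"] * nwifeinc + b["educ"] * educ
         + b["exper"] * exper + b["expersq"] * exper ** 2
         + b["age"] * age + b["kidslt6"] * kidslt6 + b["kidsge6"] * kidsge6)
    return norm.cdf(z)

p0, p1, p2 = p_inlf(0), p_inlf(1), p_inlf(2)
print(p0, p1, p2)        # roughly .70, .36, .11 with the rounded coefficients
print(p1 - p0, p2 - p1)  # the second child's marginal effect is smaller in magnitude
```

The diminishing marginal effect is visible directly: the drop from one to two children is smaller in magnitude than the drop from zero to one.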

Binary response models apply with little modification to independently pooled cross sections or to other data sets where the observations are independent but not necessarily identically distributed. Often year or other time-period dummy variables are included to account for aggregate time effects. Just as with linear models, probit can be used to evaluate the impact of certain policies in the context of a natural experiment; see Problem 15.13. An application is given in Gruber and Poterba (1994).


15.7 Specification Issues in Binary Response Models

We now turn to several issues that can arise in applying binary response models to economic data. All of these topics are relevant for general index models, but features of the normal distribution allow us to obtain concrete results in the context of probit models. Therefore, our primary focus is on probit models.

15.7.1 Neglected Heterogeneity

We begin by studying the consequences of omitting variables when those omitted variables are independent of the included explanatory variables. This is also called the neglected heterogeneity problem. The (structural) model of interest is

    P(y = 1 | x, c) = Φ(xβ + γc)    (15.34)

where x is 1 × K with x_1 ≡ 1 and c is a scalar. We are interested in the partial effects of the x_j on the probability of success, holding c (and the other elements of x) fixed. We can write equation (15.34) in latent variable form as y* = xβ + γc + e, where y = 1[y* > 0] and e | x, c ~ Normal(0, 1). Because x_1 = 1, E(c) = 0 without loss of generality.

Now suppose that c is independent of x and c ~ Normal(0, τ²). [Remember, this assumption is much stronger than Cov(x, c) = 0 or even E(c | x) = 0: under independence, the distribution of c given x does not depend on x.] Given these assumptions, the composite term, γc + e, is independent of x and has a Normal(0, γ²τ² + 1) distribution. Therefore,

    P(y = 1 | x) = P(γc + e > −xβ | x) = Φ(xβ/σ)    (15.35)

where σ² ≡ γ²τ² + 1. It follows immediately from equation (15.35) that probit of y on x consistently estimates β/σ. In other words, if β̂ is the estimator from a probit of y on x, then plim β̂_j = β_j/σ. Because σ = (γ²τ² + 1)^{1/2} > 1 (unless γ = 0 or τ² = 0), |β_j/σ| < |β_j|.

The attenuation bias in estimating β_j in the presence of neglected heterogeneity has prompted statements of the following kind: "In probit analysis, neglected heterogeneity is a much more serious problem than in linear models because, even if the omitted heterogeneity is independent of x, the probit coefficients are inconsistent." We just derived that probit of y on x consistently estimates β/σ rather than β, so the statement is technically correct. However, we should remember that, in nonlinear models, we usually want to estimate partial effects and not just parameters. For the purposes of obtaining the directions of the effects or the relative effects of the explanatory variables, estimating β/σ is just as good as estimating β.


For continuous x_j, we would like to estimate

    ∂P(y = 1 | x, c)/∂x_j = β_j φ(xβ + γc)    (15.36)

for various values of x and c. Because c is not observed, we cannot estimate γ. Even if we could estimate γ, c almost never has meaningful units of measurement (for example, c might be "ability," "health," or "taste for saving"), so it is not obvious what values of c we should plug into equation (15.36). Nevertheless, c is normalized so that E(c) = 0, so we may be interested in equation (15.36) evaluated at c = 0, which is simply β_j φ(xβ). What we consistently estimate from the probit of y on x is

    (β_j/σ)φ(xβ/σ)    (15.37)

This expression shows that, if we are interested in the partial effects evaluated at c = 0, then probit of y on x does not do the trick. An interesting fact about expression (15.37) is that, even though β_j/σ is closer to zero than β_j, φ(xβ/σ) is larger than φ(xβ) because φ(z) increases as |z| → 0, and σ > 1. Therefore, for estimating the partial effects in equation (15.36) at c = 0, it is not clear for what values of x an attenuation bias exists.

With c having a normal distribution in the population, the partial effect evaluated at c = 0 describes only a small fraction of the population. [Technically, P(c = 0) = 0.] Instead, we can estimate the average partial effect (APE), which we introduced in Section 2.2.5. The APE is obtained, for given x, by averaging equation (15.36) across the distribution of c in the population. For emphasis, let x_o be a given value of the explanatory variables (which could be, but need not be, the mean value). When we plug x_o into equation (15.36) and take the expected value with respect to the distribution of c, we get

    E_c[β_j φ(x_o β + γc)] = (β_j/σ)φ(x_o β/σ)    (15.38)

where E_c(·) denotes the expectation with respect to the distribution of c. The derivative of Φ(xβ/σ) with respect to x_j is (β_j/σ)φ(xβ/σ), which is what we wanted to show. The bottom line is that, except in cases where the magnitudes of the β_j in equation (15.34) have some meaning, omitted heterogeneity in probit models is not a problem
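The averaging identity in equation (15.38) can be checked numerically; the particular values of x_o β, γ, τ, and β_j below are arbitrary choices for the check:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

xb, gamma, tau, beta_j = 0.8, 1.5, 0.9, 0.4   # arbitrary test values
sigma = np.sqrt(gamma ** 2 * tau ** 2 + 1.0)

# Left side: E_c[beta_j * phi(x_o*beta + gamma*c)], c ~ Normal(0, tau^2)
lhs, _ = quad(lambda c: beta_j * norm.pdf(xb + gamma * c) * norm.pdf(c, scale=tau),
              -10 * tau, 10 * tau)

# Right side: (beta_j / sigma) * phi(x_o*beta / sigma)
rhs = (beta_j / sigma) * norm.pdf(xb / sigma)
print(lhs, rhs)   # the two agree
```

The agreement reflects a convolution fact: adding independent Normal(0, γ²τ²) noise inside the standard normal density is equivalent to rescaling the index by σ.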


when it is independent of x: ignoring it consistently estimates the average partial effects. Of course, the previous arguments hinge on the normality of c and the probit structural equation. If the structural model (15.34) were, say, logit and if c were normally distributed, we would not get a probit or logit for the distribution of y given x; the response probability is more complicated. The lesson from Section 2.2.5 is that we might as well work directly with models for P(y = 1 | x), because partial effects of P(y = 1 | x) are always the average of the partial effects of P(y = 1 | x, c) over the distribution of c.

If c is correlated with x or is otherwise dependent on x [for example, if Var(c | x) depends on x], then omission of c is serious. In this case we cannot get consistent estimates of the average partial effects. For example, if c | x ~ Normal(xδ, η²), then probit of y on x gives consistent estimates of (β + γδ)/ρ, where ρ² ≡ γ²η² + 1. Unless γ = 0 or δ = 0, we do not consistently estimate β/σ. This result is not surprising given what we know from the linear case with omitted variables correlated with the x_j. We now study what can be done to account for endogenous variables in probit models.

15.7.2 Continuous Endogenous Explanatory Variables

We now explicitly allow for the case where one of the explanatory variables is correlated with the error term in the latent variable model. One possibility is to estimate a linear probability model by 2SLS. This procedure is relatively easy and might provide a good estimate of the average effect.

If we want to estimate a probit model with an endogenous explanatory variable, we must make some fairly strong assumptions. In this section we consider the case of a continuous endogenous explanatory variable.

Write the model as

    y1* = z1 δ1 + α1 y2 + u1    (15.39)
    y2 = z1 δ21 + z2 δ22 + v2 = z δ2 + v2    (15.40)
    y1 = 1[y1* > 0]    (15.41)

where (u1, v2) has a zero mean, bivariate normal distribution and is independent of z. Equation (15.39), along with equation (15.41), is the structural equation; equation (15.40) is a reduced form for y2, which is endogenous if u1 and v2 are correlated. If u1 and v2 are independent, there is no endogeneity problem. Because v2 is normally distributed, we are assuming that y2 given z is normal; thus y2 should have features of a normal random variable. (For example, y2 should not be a discrete variable.)

The model is applicable when y2 is correlated with u1 because of omitted variables or measurement error. It can also be applied to the case where y2 is determined jointly with y1, but with a caveat. If y1 appears on the right-hand side in a linear structural equation for y2, then the reduced form for y2 cannot be found with v2 having the stated properties. However, if y1* appears in a linear structural equation for y2, then y2 has the reduced form given by equation (15.40); see Maddala (1983, Chapter 7) for further discussion.

The normalization that gives the parameters in equation (15.39) an average partial effect interpretation, at least in the omitted variable and simultaneity contexts, is Var(u1) = 1, just as in a probit model with all explanatory variables exogenous. To see this point, consider the outcome on y1 at two different outcomes of y2, say ȳ2 and ȳ2 + 1. Holding the observed exogenous factors fixed at z1, and holding u1 fixed, the difference in responses is

    1[z1 δ1 + α1(ȳ2 + 1) + u1 > 0] − 1[z1 δ1 + α1 ȳ2 + u1 > 0]

The procedures described in the following paragraphs only consistently estimate δ1 and α1 up to scale; we have to do a little more work to obtain estimates of the APE. If y2 is a mismeasured variable, we apparently cannot estimate the APE of interest: we would like to estimate the change in the response probability due to a change in the true variable, but, without further assumptions, we can only estimate the effect of changing y2.

The most useful two-step approach is due to Rivers and Vuong (1988), as it leads to a simple test for endogeneity of y2. To derive the procedure, first note that, under joint normality of (u1, v2), with Var(u1) = 1, we can write

    u1 = θ1 v2 + e1    (15.42)

where θ1 = η1/τ2², η1 = Cov(v2, u1), τ2² = Var(v2), and e1 is independent of z and v2 (and therefore of y2). Because of joint normality of (u1, v2), e1 is also normally distributed with E(e1) = 0 and Var(e1) = Var(u1) − η1²/τ2² = 1 − ρ1², where ρ1 ≡ Corr(v2, u1). We can now write

    y1 = 1[z1 δ1 + α1 y2 + θ1 v2 + e1 > 0]    (15.43)

so that, because e1 | z, v2 ~ Normal(0, 1 − ρ1²),

    P(y1 = 1 | z, v2) = Φ[(z1 δ1 + α1 y2 + θ1 v2)/(1 − ρ1²)^{1/2}]    (15.44)

It follows that probit of y1 on z1, y2, and v2 consistently estimates δ_ρ1 ≡ δ1/(1 − ρ1²)^{1/2}, α_ρ1 ≡ α1/(1 − ρ1²)^{1/2}, and θ_ρ1 ≡ θ1/(1 − ρ1²)^{1/2}. Notice that because ρ1² < 1, each scaled coefficient is greater in magnitude than its unscaled counterpart unless y2 is exogenous (ρ1 = 0).

Since we do not know δ2, we must first estimate it, as in the following procedure:

Procedure 15.1: (a) Run the OLS regression of y2 on z and save the residuals v̂2. (b) Run the probit of y1 on z1, y2, v̂2 to get consistent estimators of the scaled coefficients δ_ρ1, α_ρ1, and θ_ρ1.

A nice feature of Procedure 15.1 is that the usual probit t statistic on v̂2 is a valid test of the null hypothesis that y2 is exogenous, that is, H0: θ1 = 0. If θ1 ≠ 0, the usual probit standard errors and test statistics are not strictly valid, and we have only estimated δ1 and α1 up to scale. The asymptotic variance of the two-step estimator can be derived using the M-estimator results in Section 12.5.2; see also Rivers and Vuong (1988).

Under H0: θ1 = 0, e1 = u1, and so the distribution of v2 plays no role under the null. Therefore, the test of exogeneity is valid without assuming normality or homoskedasticity of v2, and it can be applied very broadly, even if y2 is a binary variable. Unfortunately, if y2 and u1 are correlated, normality of v2 is crucial.
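Procedure 15.1 is easy to code from scratch. The simulation sketch below (all parameter values are made up) runs the OLS first stage and then a probit MLE that includes the first-stage residuals. With ρ1 = 0.5 and τ2 = 1, the scaled coefficient on v̂2 should be near θ_ρ1 = ρ1/(1 − ρ1²)^{1/2} ≈ 0.577:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n, rho = 5000, 0.5

# Simulate: z2 is the exogenous regressor, z3 the instrument
z2, z3 = rng.normal(size=n), rng.normal(size=n)
v2 = rng.normal(size=n)
u1 = rho * v2 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
y2 = 0.5 + 0.4 * z2 + 0.8 * z3 + v2              # reduced form, tau2 = 1
y1 = (0.2 + 0.7 * z2 + 0.5 * y2 + u1 > 0).astype(float)

# Step (a): OLS of y2 on z, save residuals
Z = np.column_stack([np.ones(n), z2, z3])
d2 = np.linalg.lstsq(Z, y2, rcond=None)[0]
v2_hat = y2 - Z @ d2

# Step (b): probit of y1 on (1, z2, y2, v2_hat) by maximum likelihood
X = np.column_stack([np.ones(n), z2, y2, v2_hat])

def negll(b):
    xb = X @ b
    return -np.sum(y1 * norm.logcdf(xb) + (1 - y1) * norm.logcdf(-xb))

res = minimize(negll, np.zeros(X.shape[1]), method="BFGS")
theta_hat = res.x[3]   # scaled coefficient on v2_hat
print(theta_hat)       # near rho/sqrt(1 - rho^2), about 0.577
```

In practice one would also compute the t statistic on v̂2 for the exogeneity test; the MLE standard errors can be taken from the inverse Hessian, keeping in mind that they are only strictly valid under the null θ1 = 0.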

Example 15.3 (Testing for Exogeneity of Education in the Women's LFP Model): We test the null hypothesis that educ is exogenous in the married women's labor force participation equation. We first obtain the reduced form residuals, v̂2, from regressing educ on all exogenous variables, including motheduc, fatheduc, and huseduc. Then, we add v̂2 to the probit from Example 15.2. The t statistic on v̂2 is only .867, which is weak evidence against the null hypothesis that educ is exogenous. As always, this conclusion hinges on the assumption that the instruments for educ are themselves exogenous.

Even when θ1 ≠ 0, it turns out that we can consistently estimate the average partial effects after the two-step estimation. We simply apply the results from Section 2.2.5.


To see how, write y1 = 1[z1 δ1 + α1 y2 + u1 > 0], where, in the notation of Section 2.2.5, q ≡ u1, x ≡ (z1, y2), and w ≡ v2 (a scalar in this case). Because y1 is a deterministic function of (z1, y2, u1), v2 is trivially redundant in E(y1 | z1, y2, u1), and so equation (2.34) holds. Further, as we have already used, u1 given (z1, y2, v2) is independent of (z1, y2), and so equation (2.33) holds as well. It follows from Section 2.2.5 that the APEs are obtained by taking derivatives (or differences) of

    E_{v2}[Φ(z1 δ_ρ1 + α_ρ1 y2 + θ_ρ1 v2)]    (15.45)

with respect to the elements of (z1, y2). Because v2 ~ Normal(0, τ2²), this expectation equals Φ[(z1 δ_ρ1 + α_ρ1 y2)/(θ_ρ1² τ2² + 1)^{1/2}]. We simply divide each coefficient by the factor (θ̂_ρ1² τ̂2² + 1)^{1/2} before computing derivatives or differences with respect to the elements of (z1, y2). Unfortunately, because the APEs depend on the parameters in a complicated way (and the asymptotic variance of (δ̂_ρ1′, α̂_ρ1, θ̂_ρ1)′ is already complicated because of the two-step estimation), standard errors for the APEs would be very difficult to come by using the delta method.

An alternative method for estimating the APEs does not exploit the normality assumption for v2. By the usual uniform weak law of large numbers argument (see Lemma 12.1), a consistent estimator of expression (15.45) for any (z1, y2) is obtained by replacing unknown parameters by consistent estimators:

    N^{−1} Σ_{i=1}^{N} Φ(z1 δ̂_ρ1 + α̂_ρ1 y2 + θ̂_ρ1 v̂_{i2})    (15.46)
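Given the scaled two-step estimates and the first-stage residuals, the averaging estimator in expression (15.46) is a single line; the arrays and parameter values below are placeholders for illustration:

```python
import numpy as np
from scipy.stats import norm

def asf(z1_val, y2_val, d_rho, a_rho, t_rho, v2_hat):
    """Average structural function, expression (15.46): average the probit
    response over the first-stage residuals at a chosen (z1, y2)."""
    return np.mean(norm.cdf(z1_val @ d_rho + a_rho * y2_val + t_rho * v2_hat))

# Hypothetical scaled estimates and residuals
rng = np.random.default_rng(1)
v2_hat = rng.normal(size=2000)
d_rho = np.array([0.23, 0.81])   # scaled (intercept, z2) coefficients
z1_val = np.array([1.0, 0.0])    # evaluate at z2 = 0

# APE of y2 at y2 = 1, via a central difference quotient on the ASF
h = 1e-5
ape_y2 = (asf(z1_val, 1 + h, d_rho, 0.577, 0.577, v2_hat)
          - asf(z1_val, 1 - h, d_rho, 0.577, 0.577, v2_hat)) / (2 * h)
print(ape_y2)
```

Note this version requires no normality assumption on v2: the residuals themselves stand in for the distribution being averaged over.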

Rather than use a two-step procedure, we can estimate equations (15.39)–(15.41) by conditional maximum likelihood. To obtain the joint distribution of (y1, y2), conditional on z, recall that

    f(y1, y2 | z) = f(y1 | y2, z) f(y2 | z)    (15.48)

(see Property CD.2 in Appendix 13A). Since y2 | z ~ Normal(zδ2, τ2²), the density f(y2 | z) is (1/τ2)φ[(y2 − zδ2)/τ2]. Conditioning on y2 is the same as conditioning on v2 = y2 − zδ2, and so

    P(y1 = 1 | y2, z) = Φ{[z1 δ1 + α1 y2 + (ρ1/τ2)(y2 − zδ2)]/(1 − ρ1²)^{1/2}}    (15.49)

where we have used the fact that θ1 = ρ1/τ2. Let w denote the term inside Φ(·) in equation (15.49). Then we have derived the log likelihood for observation i:

    y_{i1} log Φ(w_i) + (1 − y_{i1}) log[1 − Φ(w_i)] − log τ2 + log φ[(y_{i2} − z_i δ2)/τ2]    (15.50)

where

    w_i ≡ [z_{i1} δ1 + α1 y_{i2} + (ρ1/τ2)(y_{i2} − z_i δ2)]/(1 − ρ1²)^{1/2}

Summing expression (15.50) across all i and maximizing with respect to all parameters gives the MLEs of δ1, α1, ρ1, δ2, τ2². The general theory of conditional MLE applies, and so standard errors can be obtained using the estimated Hessian, the estimated expected Hessian, or the outer product of the score.

Maximum likelihood estimation has some decided advantages over two-step procedures. First, MLE is more efficient than any two-step procedure. Second, we get direct estimates of δ1 and α1, the parameters of interest for computing partial effects. Evans, Oates, and Schwab (1992) study peer effects on teenage behavior using the full MLE.

Testing that y2 is exogenous is easy once the MLE has been obtained: just test H0: ρ1 = 0 using an asymptotic t test. We could also use a likelihood ratio test. The drawback with the MLE is computational. Sometimes it can be difficult to get the iterations to converge, as ρ̂1 sometimes tends toward 1 or −1.

Comparing the Rivers-Vuong approach to the MLE shows that the former is a limited information procedure. Essentially, Rivers and Vuong focus on f(y1 | y2, z), where they replace the unknown δ2 with the OLS estimator δ̂2 (and they ignore the rescaling problem by taking e1 in equation (15.43) to have unit variance). MLE estimates the parameters using the information in f(y1 | y2, z) and f(y2 | z) simultaneously. For the initial test of whether y2 is exogenous, the Rivers-Vuong approach has significant computational advantages. If exogeneity is rejected, it is probably worth doing MLE.

Another benefit of the maximum likelihood approach for this and related problems is that it forces discipline on us in coming up with consistent estimation procedures and correct standard errors. It is easy to abuse two-step procedures if we are not careful in deriving estimating equations. With MLE, although it can be difficult to derive joint distributions of the endogenous variables given the exogenous variables, we know that, if the underlying distributional assumptions hold, consistent and efficient estimators are obtained.

15.7.3 A Binary Endogenous Explanatory Variable

We now consider the case where the probit model contains a binary explanatory variable that is endogenous. The model is

    y1 = 1[z1 δ1 + α1 y2 + u1 > 0]    (15.51)
    y2 = 1[z δ2 + v2 > 0]    (15.52)

where (u1, v2) is independent of z and distributed as bivariate normal with mean zero, each has unit variance, and ρ1 = Corr(u1, v2). If ρ1 ≠ 0, then u1 and y2 are correlated, and probit estimation of equation (15.51) is inconsistent for δ1 and α1.

As discussed in Section 15.7.2, the normalization Var(u1) = 1 is the proper one for computing average partial effects. Often, the effect of y2 is of primary interest, especially when y2 indicates participation in some sort of program, such as job training, and the binary outcome y1 might denote employment status. The average treatment effect (for a given value of z1) is Φ(z1 δ1 + α1) − Φ(z1 δ1).

To derive the likelihood function, we again need the joint distribution of (y1, y2) given z, which we obtain from equation (15.48). To obtain P(y1 = 1 | y2, z), first note that

    P(y1 = 1 | v2, z) = Φ[(z1 δ1 + α1 y2 + ρ1 v2)/(1 − ρ1²)^{1/2}]    (15.53)

Since y2 = 1 if and only if v2 > −z δ2, we need a basic fact about truncated normal distributions: if v2 has a standard normal distribution and is independent of z, then the density of v2 given v2 > −z δ2 is

    φ(v2)/P(v2 > −z δ2) = φ(v2)/Φ(z δ2)    (15.54)

Therefore,

    P(y1 = 1 | y2 = 1, z) = [Φ(z δ2)]^{−1} ∫_{−z δ2}^{∞} Φ[(z1 δ1 + α1 + ρ1 v2)/(1 − ρ1²)^{1/2}] φ(v2) dv2    (15.55)

and P(y1 = 0 | y2 = 1, z) is just one minus equation (15.55).

Similar reasoning gives the probabilities conditional on y2 = 0. Multiplying the conditional probabilities by the probit density of y2 given z, and taking the log, gives the log-likelihood function for maximum likelihood analysis. It is messy but certainly doable. Evans and Schwab (1995) use the MLE approach to study the causal effects of attending a Catholic high school on the probability of attending college, allowing the Catholic high school indicator to be correlated with unobserved factors that affect college attendance. As an IV they use a binary variable indicating whether a student is Catholic.

Because the MLE is nontrivial to compute, it is tempting to use some seemingly "obvious" two-step procedures. As an example, we might try to inappropriately mimic 2SLS. Since E(y2 | z) = Φ(z δ2) and δ2 is consistently estimated by probit of y2 on z, it is tempting to estimate δ1 and α1 from the probit of y1 on z1, Φ̂2, where Φ̂2 ≡ Φ(z δ̂2). This approach does not produce consistent estimators, for the same reasons the forbidden regression discussed in Section 9.5 for nonlinear simultaneous equations models does not. For this two-step procedure to work, we would have to have P(y1 = 1 | z) = Φ[z1 δ1 + α1 Φ(z δ2)]. But P(y1 = 1 | z) = E(y1 | z) = E(1[z1 δ1 + α1 y2 + u1 > 0] | z), and since the indicator function 1[·] is nonlinear, we cannot pass the expected value through. If we were to compute the correct (complicated) formula for P(y1 = 1 | z), plug in δ̂2, and then maximize the resulting binary response log likelihood, then the two-step approach would produce consistent estimators. But full maximum likelihood is easier and more efficient.

As mentioned in the previous subsection, we can use the Rivers-Vuong approach to test for exogeneity of y2. This has the virtue of being simple, and, if the test fails to reject, we may not need to compute the MLE. A more efficient test is the score test of H0: ρ1 = 0, and this does not require estimation of the full MLE.


15.7.4 Heteroskedasticity and Nonnormality in the Latent Variable Model

In applying the probit model it is easy to become confused about the problems of heteroskedasticity and nonnormality. The confusion stems from a failure to distinguish between the underlying latent variable formulation, as in the model (15.9), and the response probability in equation (15.8). As we have emphasized throughout this chapter, for most purposes we want to estimate P(y = 1 | x). The latent variable formulation is convenient for certain manipulations, but we are rarely interested in E(y* | x). [One case in which E(y* | x) is of interest is covered in Problem 15.16.]

Once we focus on P(y = 1 | x), we can easily see why we should not attempt to compare heteroskedasticity in the latent variable model (15.9) with the consequences of heteroskedasticity in a standard linear regression model. Heteroskedasticity in Var(e | x) entirely changes the functional form for P(y = 1 | x) = E(y | x). While the statement "probit will be inconsistent for β when e is heteroskedastic" is correct, it largely misses the point. In most probit applications, it makes little sense to care about consistent estimation of β when P(y = 1 | x) ≠ Φ(xβ). (Section 15.7.5 contains a different perspective.)

It is easy to construct examples where the partial effect of a variable on P(y = 1 | x) has the sign opposite to that of its coefficient in the latent variable formulation. For example, let x1 be a positive, continuous variable, and write the latent variable model as y* = β0 + β1 x1 + e, e | x1 ~ Normal(0, x1²). The binary response is defined as y = 1[y* > 0]. A simple calculation shows that P(y = 1 | x1) = Φ(β0/x1 + β1), and so ∂P(y = 1 | x1)/∂x1 = −(β0/x1²)φ(β0/x1 + β1). If β0 > 0 and β1 > 0, then ∂P(y = 1 | x1)/∂x1 and β1 have opposite signs. The problem is fairly clear: while the latent variable model has a conditional mean that is linear in x1, the response probability depends on 1/x1. If the latent variable model is correct, we should just do probit of y on 1 and 1/x1.
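The sign reversal is easy to confirm numerically for, say, β0 = β1 = 1 (arbitrary positive values):

```python
import numpy as np
from scipy.stats import norm

b0, b1 = 1.0, 1.0

def p(x1):
    # P(y = 1 | x1) = Phi(b0/x1 + b1) under Var(e | x1) = x1^2
    return norm.cdf(b0 / x1 + b1)

# Numerical derivative at x1 = 2: negative even though b1 > 0
h, x1 = 1e-6, 2.0
dp = (p(x1 + h) - p(x1 - h)) / (2 * h)
analytic = -(b0 / x1 ** 2) * norm.pdf(b0 / x1 + b1)
print(dp, analytic)   # both negative
```

The numerical and analytic derivatives agree, and both are negative despite β1 being positive, which is exactly the point of the example.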

Nonnormality in the latent error e means that G(z) ≠ Φ(z), and therefore P(y = 1 | x) ≠ Φ(xβ). Again, this is a functional form problem in the response probability, and it should be treated as such. As an example, suppose that the true model is logit, but we estimate probit. We are not going to consistently estimate β in P(y = 1 | x) = Λ(xβ). In fact, Table 15.1 shows that the logit estimates are generally much larger (roughly 1.6 times as large), because of the different scalings inherent in the probit and logit functions. But inconsistent estimation of β is practically irrelevant: probit might provide very good estimates of the partial effects, ∂P(y = 1 | x)/∂x_j, even though logit is the correct model. In Example 15.2, the estimated partial effects are very similar for logit and probit.


Relaxing distributional assumptions on e in the model (15.9) can be useful for obtaining more flexible functional forms for P(y = 1 | x), as we saw in equation (15.24). Replacing Φ(z) with some function G(z; γ), where γ is a vector of parameters, is a good idea, especially when it nests the standard normal cdf. [Moon (1988) covers some interesting possibilities in the context of logit models, including asymmetric cumulative distribution functions.] But it is important to remember that these are just ways of generalizing functional form, and they may be no better than directly specifying a more flexible functional form for the response probability, as in McDonald (1996). When different functional forms are used, parameter estimates across different models should not be the basis for comparison: in most cases, it makes sense only to compare the estimated response probabilities at various values of x and goodness of fit, such as the values of the log-likelihood function. (For an exception, see Problem 15.16.)

15.7.5 Estimation under Weaker Assumptions

Probit, logit, and the extensions of these mentioned in the previous subsection are all parametric models: P(y = 1 | x) depends on a finite number of parameters. There have been many recent advances in estimation of binary response models that relax parametric assumptions on P(y = 1 | x). We briefly discuss some of those here.

If we are interested in estimating the directions and relative sizes of the partial effects, and not the response probabilities, several approaches are possible. Ruud (1983) obtains conditions under which we can estimate the slope parameters, call these β, up to scale (that is, we can consistently estimate τβ for some unknown constant τ) even though we misspecify the function G(·). Ruud (1986) shows how to exploit these results to consistently estimate the slope parameters up to scale fairly generally.

An alternative approach is to recognize that we do not know the function G(·), but the response probability has the index form in equation (15.8). This arises from the latent variable formulation (15.9) when e is independent of x but the distribution of e is not known. There are several semiparametric estimators of the slope parameters, up to scale, that do not require knowledge of G. Under certain restrictions on the function G and the distribution of x, the semiparametric estimators are consistent and √N-asymptotically normal. See, for example, Stoker (1986); Powell, Stock, and Stoker (1989); Ichimura (1993); Klein and Spady (1993); and Ai (1997). Powell (1994) contains a recent survey of these methods.

Once β̂ is obtained, the function G can be consistently estimated (in a sense we cannot make precise here, as G is part of an infinite dimensional space). Thus, the response probabilities, as well as the partial effects on these probabilities, can be consistently estimated for unknown G. Obtaining Ĝ requires nonparametric regression of y_i on x_i β̂, where β̂ are the scaled slope estimators. Accessible treatments of the methods used are contained in Stoker (1992), Powell (1994), and Härdle and Linton (1994).

Remarkably, it is possible to estimate β up to scale without assuming that e and x are independent in the model (15.9). In the specification y = 1[xβ + e > 0], Manski (1975, 1988) shows how to consistently estimate β, subject to a scaling, under the assumption that the median of e given x is zero. Some mild restrictions are needed on the distribution of x; the most important of these is that at least one element of x with nonzero coefficient is essentially continuous. This allows e to have any distribution, and e and x can be dependent; for example, Var(e | x) is unrestricted. Manski's estimator, called the maximum score estimator, is a least absolute deviations estimator. Since the median of y given x is 1[xβ > 0], the maximum score estimator solves

    min_b Σ_{i=1}^{N} |y_i − 1[x_i b > 0]|    (15.56)

over all b with, say, b′b = 1, or with some element of b fixed at unity if the corresponding x_j is known to appear in Med(y | x). {A normalization is needed because if Med(y | x) = 1[xβ > 0], then Med(y | x) = 1[x(τβ) > 0] for any τ > 0.} The resulting estimator is consistent (for a recent proof see Newey and McFadden, 1994), but its limiting distribution is nonnormal. In fact, it converges to its limiting distribution at rate N^{1/3}. Horowitz (1992) proposes a smoothed version of the maximum score estimator that converges at a rate close to √N.

The maximum score estimator's strength is that it consistently estimates β up to scale in cases where the index model (15.8) does not hold. In a sense, this is also the estimator's weakness, because it is not intended to deliver estimates of the response probabilities P(y = 1 | x). In some cases we might only want to know the relative effects of each x_j on an underlying utility difference or unobserved willingness to pay (y), and the maximum score estimator is well suited for that purpose. However, for most policy purposes we want to know the magnitude of the change in P(y = 1 | x) for a given change in x_j. As illustrated by the heteroskedasticity example in the previous subsection, where Var(e | x_1) = x_1^2, it is possible for β_j and ∂P(y = 1 | x)/∂x_j to have opposite signs. More generally, for any variable y, it is possible that x_j has a positive effect on Med(y | x) but a negative effect on E(y | x), or vice versa. This possibility raises the issue of what should be the focus, the median or the mean. For binary response, the conditional mean is the response probability.
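The sign reversal can be made concrete. With y = 1[β_0 + β_1 x_1 + e > 0], e | x_1 ~ Normal(0, x_1^2), and x_1 > 0, the response probability is Φ((β_0 + β_1 x_1)/x_1), and its derivative with respect to x_1 is −φ(·)β_0/x_1^2: negative whenever β_0 > 0, even though β_1 > 0. A quick numerical check (our own illustration, with hypothetical parameter values):

```python
import numpy as np
from scipy.stats import norm

beta0, beta1 = 1.0, 1.0                 # both coefficients positive
x1 = np.array([0.5, 1.0, 2.0, 4.0])     # evaluation points, x1 > 0

index = beta0 + beta1 * x1              # Med(y|x1) = 1[index > 0]: always 1 here
prob = norm.cdf(index / x1)             # P(y=1|x1) = Phi((b0 + b1*x1)/x1)

print(prob)                             # strictly decreasing in x1
```

So the median of y is (weakly) increasing in x_1 while P(y = 1 | x_1) is strictly decreasing, which is exactly the divergence between median and mean effects discussed above.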

It is also possible to estimate the parameters in a binary response model with endogenous explanatory variables without knowledge of G(·). Lewbel (1998) contains some recent results. Apparently, methods for estimating average partial effects with endogenous explanatory variables and unknown G(·) are not yet available.

15.8 Binary Response Models for Panel Data and Cluster Samples

When analyzing binary responses in the context of panel data, it is often useful to begin with a linear model with an additive, unobserved effect, and then, just as in Chapters 10 and 11, use the within transformation or first differencing to remove the unobserved effect. A linear probability model for binary outcomes has the same problems as in the cross section case. In fact, it is probably less appealing for unobserved effects models, as it implies the unnatural restrictions −x_it β ≤ c_i ≤ 1 − x_it β, t = 1, ..., T, on the unobserved effects. In this section we discuss probit and logit models that can incorporate unobserved effects.

15.8.1 Pooled Probit and Logit

In Section 13.8 we used a probit model to illustrate partial likelihood methods with panel data. Naturally, we can use logit or any other binary response function as well. Suppose the model is

P(y_it = 1 | x_it) = G(x_it β),  t = 1, 2, ..., T    (15.57)

where G(·) is a known function taking on values in the open unit interval. As we discussed in Chapter 13, x_it can contain a variety of factors, including time dummies, interactions of time dummies with time-constant or time-varying variables, and lagged dependent variables.

In specifying the model (15.57) we have not assumed nearly enough to obtain the distribution of y_i ≡ (y_i1, ..., y_iT) given x_i = (x_i1, ..., x_iT). Nevertheless, we can obtain a √N-consistent, asymptotically normal estimator of β by partial maximum likelihood, and robust Wald and score statistics can be computed as in Chapter 12.
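A sketch of this pooled approach in Python (our own simulation; the data-generating process, the variable names, and the use of the expected-Hessian sandwich are illustrative assumptions, not the text's own implementation): we maximize the pooled probit log likelihood, then cluster the scores by individual so that the standard errors remain valid under arbitrary within-individual dependence.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulated panel: latent y*_it = x_it*beta + c_i + e_it, where c_i is an
# individual effect we deliberately ignore, making y_it serially dependent
# within i (so robust inference matters).
N, T, beta_true = 300, 5, 1.0
ids = np.repeat(np.arange(N), T)
x = rng.normal(size=N * T)
c = np.repeat(rng.normal(size=N), T)
e = rng.normal(size=N * T)
y = (x * beta_true + c + e > 0).astype(float)
X = np.column_stack([np.ones(N * T), x])         # constant plus one regressor

def negll(b):
    p = np.clip(norm.cdf(X @ b), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

bhat = minimize(negll, np.zeros(2), method="BFGS").x

# Cluster-robust sandwich A^{-1} B A^{-1}, clustering scores by individual.
xb = X @ bhat
pdf, cdf = norm.pdf(xb), np.clip(norm.cdf(xb), 1e-10, 1 - 1e-10)
lam = pdf * (y - cdf) / (cdf * (1 - cdf))        # generalized residuals
scores = X * lam[:, None]                        # per-observation scores
A = (X * (pdf**2 / (cdf * (1 - cdf)))[:, None]).T @ X   # expected Hessian
B = np.zeros((2, 2))
for i in range(N):                               # sum of per-cluster outer products
    s_i = scores[ids == i].sum(axis=0)
    B += np.outer(s_i, s_i)
Ainv = np.linalg.inv(A)
se_robust = np.sqrt(np.diag(Ainv @ B @ Ainv))
print("beta_hat:", bhat[1], "robust se:", se_robust[1])
```

Note that the pooled estimator here converges to an attenuated coefficient (roughly β/√(1 + Var(c_i)) in this design), which is the scaled index the partial MLE identifies when the individual effect is ignored.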

In the case that the model (15.57) is dynamically complete, that is,

P(y_it = 1 | x_it, y_i,t−1, x_i,t−1, ...) = P(y_it = 1 | x_it)    (15.58)

inference is considerably easier: all the usual statistics from a probit or logit that pools observations and treats the sample as a long independent cross section of size NT are valid, including likelihood ratio statistics. Remember, we are definitely not assuming independence across t (for example, x_it can contain lagged dependent variables). Dynamic completeness implies that the scores are serially uncorrelated across t, which is the key condition for the standard inference procedures to be valid. (See the general treatment in Section 13.8.)
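The serial uncorrelatedness implied by dynamic completeness is easy to see in a simulation (our own illustration; the parameter values are hypothetical). Under assumption (15.58), the generalized residuals u_it = y_it − Φ(x_it β) satisfy E(u_it | x_it, y_i,t−1, ...) = 0, so u_it and u_i,t−1 are exactly uncorrelated when the model is dynamically complete, as in this dynamic probit:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
N, T = 2000, 5
b0, rho = -0.5, 1.0        # hypothetical parameters

# P(y_it = 1 | y_i,t-1) = Phi(b0 + rho * y_i,t-1): x_it = (1, y_i,t-1), so the
# model is dynamically complete by construction.
y = np.zeros((N, T + 1))
y[:, 0] = (rng.normal(size=N) > 0).astype(float)   # arbitrary initial condition
for t in range(1, T + 1):
    p = norm.cdf(b0 + rho * y[:, t - 1])
    y[:, t] = (rng.uniform(size=N) < p).astype(float)

# Generalized residuals evaluated at the true parameters.
u = y[:, 1:] - norm.cdf(b0 + rho * y[:, :-1])

# Their first-order serial correlation should be near zero.
corr = np.corrcoef(u[:, 1:].ravel(), u[:, :-1].ravel())[0, 1]
print(f"Corr(u_it, u_i,t-1) = {corr:.4f}")
```

Dropping the lagged y from the response probability in this design would break dynamic completeness and produce a clearly nonzero correlation, which is what the test described next is designed to detect.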

To test for dynamic completeness, we can always add a lagged dependent variable and possibly lagged explanatory variables. As an alternative, we can derive a simple one-degree-of-freedom test that works regardless of what is in x_it. For concreteness, we focus on the probit case; other index models are handled similarly. Define u_it ≡ y_it − Φ(x_it β), so that, under assumption (15.58), E(u_it | x_it, y_i,t−1, x_i,t−1, ...) = 0, all t. It follows that u_it is uncorrelated with any function of the variables (x_it, y_i,t−1, x_i,t−1, ...), including u_i,t−1. By studying equation (13.53), we can see that it is serial correlation in the u_it that makes the usual inference procedures invalid. Let û_it = y_it − Φ(x_it β̂). Then a simple test is available by using pooled probit to estimate the artificial model

"P(y_it = 1 | x_it, û_i,t−1) = Φ(x_it β + γ_1 û_i,t−1)"    (15.59)

using time periods t = 2, ..., T. The null hypothesis is H_0: γ_1 = 0. If H_0 is rejected, then so is assumption (15.58). This is a case where, under the null hypothesis, the estimation of β required to obtain û_i,t−1 does not affect the limiting distribution of any of the usual test statistics, Wald, LR, or LM, of H_0: γ_1 = 0. The Wald statistic, that is, the t statistic on γ̂_1, is the easiest to obtain. For the LM and LR statistics we must be sure to drop the first time period in estimating the restricted model (γ_1 = 0).

15.8.2 Unobserved Effects Probit Models under Strict Exogeneity

A popular model for binary outcomes with panel data is the unobserved effects probit model. The main assumption of this model is

P(y_it = 1 | x_i, c_i) = P(y_it = 1 | x_it, c_i) = Φ(x_it β + c_i),  t = 1, ..., T    (15.60)

where c_i is the unobserved effect and x_i contains x_it for all t. The first equality says that x_it is strictly exogenous conditional on c_i: once c_i is conditioned on, only x_it appears in the response probability at time t. This rules out lagged dependent variables in x_it, as well as certain kinds of explanatory variables whose future movements depend on current and past outcomes on y. (Strict exogeneity also requires that we have enough lags of explanatory variables if there are distributed lag effects.) The
