Estimating Econometric Models with Fixed Effects
William Greene*
Department of Economics, Stern School of Business,
New York University, April, 2001
Abstract
The application of nonlinear fixed effects models in econometrics has often been avoided for two reasons, one methodological, one practical. The methodological question centers on an incidental parameters problem that raises questions about the statistical properties of the estimator. The practical one relates to the difficulty of estimating nonlinear models with possibly thousands of coefficients. This note will demonstrate that the second is, in fact, a nonissue, and that in a very large number of models of interest to practitioners, estimation of the fixed effects model is quite feasible even in panels with huge numbers of groups. The models are fully parametric, and all parameters of interest are estimable.
Keywords: Panel data, fixed effects, computation.
JEL classification: C1, C4
* 44 West 4th St., New York, NY 10012, USA, Telephone: 001-212-998-0876; fax: 01-212-995-4218; e-mail: wgreene@stern.nyu.edu, URL www.stern.nyu.edu/~wgreene. This paper has benefited from discussions with George Jakubson (who suggested one of the main results in this paper), Martin Spiess, and Scott Thompson and from seminar groups at The University of Texas, University of Illinois, and New York University. Any remaining errors are my own.
1 Introduction
The fixed effects model is a useful specification for accommodating individual heterogeneity in panel data. But it has been problematic for two reasons. In most cases, the estimator is inconsistent owing to the incidental parameters problem. How serious this problem is in practical terms remains to be established (there is only a very small amount of received evidence), but the theoretical result is unambiguous. A second problem is purely practical. With current technology, the computation of the model parameters and appropriate standard errors, with all its nuisance parameters, appears to be impractical. This note focuses on the second of these and shows that in a large number of interesting cases, the difficulty is only apparent. We will focus on a single result, computation of the estimator, and rely on some well known algebraic results to establish it. No formal statistical results are derived here or suggested with Monte Carlo results, as in general, the results are already known. The one statistical question noted above is left for further research.
The paper proceeds as follows. In Section 2, the general modeling framework is presented, departing from the linear model to more complicated specifications. The formal results for the estimator and computational procedures for obtaining appropriate standard errors are presented in Section 3. Section 4 suggests two possible new applications. Conclusions are drawn in Section 5.
2 Models with Fixed Effects
The linear regression model with fixed effects is
$$y_{it} = \beta' x_{it} + \alpha_i + \delta_t + \varepsilon_{it}, \quad t = 1,\ldots,T(i), \quad i = 1,\ldots,N,$$
$$E[\varepsilon_{it} \mid x_{i1}, x_{i2}, \ldots, x_{iT(i)}] = 0,$$
$$\mathrm{Var}[\varepsilon_{it} \mid x_{i1}, x_{i2}, \ldots, x_{iT(i)}] = \sigma^2.$$
We have assumed the strictly exogenous regressors case in the conditional moments [see Wooldridge (1995)]. We have not assumed equal sized groups in the panel. The vector $\beta$ is a set of parameters of primary interest; $\alpha_i$ is the group specific heterogeneity. We have included time specific effects, but they are only tangential in what follows. Since the number of periods is usually fairly small, these can usually be accommodated simply by adding a set of time specific dummy variables to the model. Our interest here is in the case in which $N$ is too large to do likewise for the group effects. For example, in analyzing census based data sets, $N$ might number in the tens of thousands. The analysis of two way models, both fixed and random effects, has been well worked out in the linear case. [See, e.g., Baltagi (1995) and Baltagi et al. (2001).] A full extension to the nonlinear models considered in this paper remains for further research.
The parameters of the linear model with fixed individual effects can be estimated by the 'least squares dummy variable' (LSDV) or 'within groups' estimator, which we denote $b_{LSDV}$. This is computed by least squares regression of $y_{it}^* = (y_{it} - \bar{y}_{i.})$ on the same transformation of $x_{it}$, where the averages are group specific means. The individual specific dummy variable coefficients can be estimated using group specific averages of residuals. [See, e.g., Greene (2000, Chapter 14).] The slope parameters can also be estimated using simple first differences. Under the assumptions, $b_{LSDV}$ is a consistent estimator of $\beta$. However, the individual effects, $\alpha_i$, are each estimated with the $T(i)$ group specific observations. Since $T(i)$ might be small, and is, moreover, fixed, the estimator, $a_{i,LSDV}$, is inconsistent. But the inconsistency of $a_{i,LSDV}$ is not transmitted to $b_{LSDV}$ because $\bar{y}_{i.}$ is a sufficient statistic. The LSDV estimator $b_{LSDV}$ is not a function of $a_{i,LSDV}$. There are a few nonlinear models in which a like result appears.
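The within transformation described above can be illustrated with a minimal sketch. It assumes Python with NumPy; the function name `within_estimator` and the simulated unbalanced panel are hypothetical and are not part of the original paper. The slopes come from a regression on group-mean deviations, and the group effects are recovered as group means of the residuals.

```python
import numpy as np

def within_estimator(y, X, groups):
    """LSDV slopes from group-mean deviations; effects from residual means."""
    _, idx = np.unique(groups, return_inverse=True)
    counts = np.bincount(idx)
    # Group specific means of y and of each column of X.
    y_bar = np.bincount(idx, weights=y) / counts
    X_bar = np.column_stack(
        [np.bincount(idx, weights=X[:, k]) / counts for k in range(X.shape[1])]
    )
    # Least squares on the demeaned data (no constant term).
    b, *_ = np.linalg.lstsq(X - X_bar[idx], y - y_bar[idx], rcond=None)
    # a_i = ybar_i - b' xbar_i, the group mean of the residuals y - X b.
    a = y_bar - X_bar @ b
    return b, a

# Simulated unbalanced panel (illustrative values only).
rng = np.random.default_rng(0)
N, K = 200, 2
T_i = rng.integers(2, 6, size=N)              # T(i) between 2 and 5
groups = np.repeat(np.arange(N), T_i)
alpha = rng.normal(size=N)
X = rng.normal(size=(groups.size, K)) + alpha[groups, None]  # x correlated with alpha_i
beta = np.array([1.0, -0.5])
y = X @ beta + alpha[groups] + rng.normal(size=groups.size)

b_lsdv, a_lsdv = within_estimator(y, X, groups)
```

With N = 200 the explicit dummy variable regression is still feasible, so the two routes can be compared directly; they should agree up to rounding error, while only the within route scales to tens of thousands of groups.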
We will define a nonlinear model by the density for an observed random variable, $y_{it}$,
$$f(y_{it} \mid x_{i1}, x_{i2}, \ldots, x_{iT(i)}) = g(y_{it}, \beta' x_{it} + \alpha_i, \theta),$$
where $\theta$ is a vector of ancillary parameters such as a scale parameter, an overdispersion parameter in the Poisson model, or the threshold parameters in an ordered probit model. We have narrowed our focus to linear index function models. For the present, we also rule out dynamic effects; $y_{i,t-1}$ does not appear on the right hand side of the equation. [See, e.g., Arellano and Bond (1991), Arellano and Bover (1995), Ahn and Schmidt (1995), Orme (1999), Heckman and MaCurdy (1980).] However, it does appear that extension of the fixed effects model to dynamic models may well be practical. This, and multiple equation models such as VARs, are left for later extensions. [See Holtz-Eakin (1988) and Holtz-Eakin, Newey and Rosen (1988, 1989).] Lastly, note that only the current data appear directly in the density for the current $y_{it}$. We will also be limiting attention to parametric approaches to modeling. The density is assumed to be fully defined. This makes maximum likelihood the estimator of choice.
The likelihood function for a sample of N observations is
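$$L = \prod_{i=1}^{N} \prod_{t=1}^{T(i)} g(y_{it}, \beta' x_{it} + \alpha_i, \theta).$$
The likelihood equations,
$$\frac{\partial \log L}{\partial \beta} = 0, \quad \frac{\partial \log L}{\partial \alpha_i} = 0, \; i = 1,\ldots,N, \quad \frac{\partial \log L}{\partial \theta} = 0,$$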
do not have explicit solutions for the parameter estimates in terms of the data and must, therefore, be solved iteratively. In principle, maximization can proceed simply by creating and including a complete set of dummy variables in the model. But at some point, this approach becomes unusable with current technology. We are interested in a method that would accommodate a panel with, say, 50,000 groups, which would mandate estimating a total of $50{,}000 + K_\beta + K_\theta$ parameters. What makes this impractical is a second derivatives matrix (or some approximation to it) with 50,000 rows and columns. But that consideration is misleading, a proposition we will return to presently.
The proliferation of parameters is a practical shortcoming of the fixed effects model. The 'incidental parameters problem' is a methodological issue. If $\beta$ and $\theta$ were known, then the solution for $\alpha_i$ would be based on only the $T(i)$ observations for group $i$ (see below for an application). This implies that the asymptotic variance for $a_i$ is $O[1/T(i)]$ and, since $T(i)$ is fixed, $a_i$ is inconsistent. In fact, $\beta$ is not known; in general in nonlinear settings, the estimator will be a function of the estimator of $\alpha_i$, $a_{i,ML}$. Therefore $b_{ML}$, the MLE of $\beta$, is a function of a random variable which does not converge to a constant as $N \to \infty$, so neither does $b_{ML}$. There is a small sample bias as well. The example is unrealistic, but Hsiao (1993, 1996) shows that in a binary logit model with a single regressor that is a dummy variable and a panel in which $T(i) = 2$ for all groups, the small sample bias is +100%. No general results exist for the small sample bias in more realistic settings. Heckman (1981) found in a Monte Carlo study of a probit model that the bias of the slope estimator in a fixed effects model was toward zero and on the order of 10% when $T(i) = 8$ and $N = 100$. On this basis, it is often noted that in samples at least this large, the small sample bias is probably not too severe. In many microeconometric applications, $T(i)$ is considerably larger than this, so for practical purposes, there is good cause for optimism.
3 Computation of the Fixed Effects Estimator
In the linear case, regression using group mean deviations sweeps out the fixed effects. The slope estimator is not a function of the fixed effects, which implies that it (unlike the estimator of the fixed effect) is consistent. There are a few analogous cases of nonlinear models that have been identified in the literature. Among them is the binomial logit model,
$$g(y_{it}, \beta' x_{it} + \alpha_i) = \Lambda[(2y_{it} - 1)(\beta' x_{it} + \alpha_i)],$$
where $\Lambda(\cdot)$ is the cdf for the logistic distribution [see Chamberlain (1980)]. In this case, $\sum_t y_{it}$ is a sufficient statistic, and estimation in terms of the conditional density provides a consistent estimator of $\beta$. [See Greene (2000) for discussion.] Three other models which have this property are the Poisson and negative binomial regressions [see Hausman, Hall, and Griliches (1984)] and the exponential regression model
$$g(y_{it}, \beta' x_{it} + \alpha_i) = (1/\lambda_{it}) \exp(-y_{it}/\lambda_{it}), \quad \lambda_{it} = \exp(\beta' x_{it} + \alpha_i), \quad y_{it} \ge 0.$$
[See Munkin and Trivedi (2000) and Greene (2001).] In these models, there is a solution to the likelihood equation for $\beta$ that is not a function of $\alpha_i$. Consider the Poisson regression model with fixed effects (the result for the exponential model is essentially the same), for which
$$\log g(y_{it}, \beta' x_{it} + \alpha_i) = -\lambda_{it} + y_{it} \log \lambda_{it} - \log y_{it}!,$$
where $\lambda_{it} = \exp(\beta' x_{it} + \alpha_i) = \exp(\alpha_i)\exp(\beta' x_{it})$. Then,
$$\log L = \sum_{i=1}^{N} \sum_{t=1}^{T(i)} \left[ -\exp(\alpha_i)\exp(\beta' x_{it}) + y_{it}(\beta' x_{it} + \alpha_i) - \log y_{it}! \right].$$
The likelihood equation for $\alpha_i$, $\partial \log L / \partial \alpha_i = 0$, implies a solution
$$\exp(\alpha_i) = \frac{\sum_{t=1}^{T(i)} y_{it}}{\sum_{t=1}^{T(i)} \exp(\beta' x_{it})}.$$
Thus, the maximum likelihood estimator of $\beta$ is not a function of $\alpha_i$. There are other models with loglinear conditional mean functions; however, these are too few and specialized to serve as the benchmark case for a modeling framework. In the vast majority of cases of interest to practitioners, including those based on transformations of normally distributed variables such as the probit and tobit models, this method will be unusable.
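The closed form solution for $\exp(\alpha_i)$ in the Poisson case can be checked numerically. The sketch below assumes Python with NumPy; the function names `poisson_fe_alpha` and `score_alpha` and the simulated data are hypothetical. For a given $\beta$ it computes each $\alpha_i$ from group sums and then evaluates the score $\partial \log L / \partial \alpha_i = \sum_t (y_{it} - \lambda_{it})$, which should vanish.

```python
import numpy as np

def poisson_fe_alpha(beta, y, X, groups):
    """alpha_i from exp(alpha_i) = sum_t y_it / sum_t exp(beta'x_it)."""
    _, idx = np.unique(groups, return_inverse=True)
    sum_y = np.bincount(idx, weights=y)
    sum_mu = np.bincount(idx, weights=np.exp(X @ beta))
    # A group with sum_y == 0 gives log(0) = -inf: its effect is not identified.
    with np.errstate(divide="ignore"):
        return np.log(sum_y / sum_mu), idx

def score_alpha(beta, alpha, y, X, idx):
    """d log L / d alpha_i = sum_t (y_it - lambda_it), one value per group."""
    lam = np.exp(X @ beta + alpha[idx])
    return np.bincount(idx, weights=y - lam)

# Simulated balanced panel (illustrative values only).
rng = np.random.default_rng(1)
N, T = 50, 6
groups = np.repeat(np.arange(N), T)
X = rng.normal(size=(N * T, 2))
true_alpha = 1.5 + rng.normal(scale=0.3, size=N)
beta = np.array([0.3, -0.2])                 # treated as given in this check
y = rng.poisson(np.exp(X @ beta + true_alpha[groups])).astype(float)

alpha_hat, idx = poisson_fe_alpha(beta, y, X, groups)
score = score_alpha(beta, alpha_hat, y, X, idx)
```

Because the solution holds for any $\beta$, it can be substituted back into the log likelihood, leaving a function of $\beta$ alone; this is the sense in which the estimator of $\beta$ does not depend on $\alpha_i$.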
Heckman and MaCurdy (1980) suggested a 'zig-zag' sort of approach to maximization of the log likelihood function, dummy variable coefficients and all. Consider the probit model. For a known set of fixed effect coefficients, $\alpha = (\alpha_1, \ldots, \alpha_N)$, estimation of $\beta$ is straightforward. The log likelihood conditioned on these values (denoted $a_i$) would be
$$\log L \mid a_1, \ldots, a_N = \sum_{i=1}^{N} \sum_{t=1}^{T(i)} \log \Phi[(2y_{it} - 1)(\beta' x_{it} + a_i)].$$
This can be treated as a cross section estimation problem since, with known $\alpha$, there is no connection between observations even within a group. With a given estimate of $\beta$ (denoted $b$), the conditional log likelihood function for each $\alpha_i$ is
$$\log L_i \mid b = \sum_{t=1}^{T(i)} \log \Phi[(2y_{it} - 1)(z_{it} + \alpha_i)],$$
where $z_{it} = b' x_{it}$ is now a known function. Maximizing this function is straightforward (if tedious, since it must be done for each $i$). Heckman and MaCurdy suggested iterating back and forth between these two estimators until convergence is achieved. There is no guarantee that this back and forth procedure will converge to the true maximum of the log likelihood function because the Hessian is not block diagonal. Whether either estimator is even consistent in the dimension of $N$ (that is, of $\beta$) depends on the initial estimator being consistent, and it is unclear how one should obtain that consistent initial estimator. There is no maximum likelihood estimator for $\alpha_i$ for any group in which the dependent variable is all 1s or all 0s; the likelihood equation for $\log L_i$ has no solution if there is no within group variation in $y_{it}$. This feature of the model carries over to the tobit and binomial logit models, as the authors noted. In the Poisson and negative binomial models, any group which has $y_{it} = 0$ for all $t$ contributes a 0 to the log likelihood function, so its group specific effect is not identified either. Finally, irrespective of its probability limit, the estimated covariance matrix for the estimator of $\beta$ will be too small, again because the Hessian is not block diagonal. The estimator at the $\beta$ step does not obtain the correct submatrix of the information matrix.
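Many of the models we have studied involve an ancillary parameter vector, $\theta$. No generality is gained by treating $\theta$ separately from $\beta$, so at this point, we will simply group them in the single parameter vector $\gamma = [\beta', \theta']'$. Denote the gradient of the log likelihood by
$$g_\gamma = \frac{\partial \log L}{\partial \gamma} = \sum_{i=1}^{N} \sum_{t=1}^{T(i)} \frac{\partial \log g(y_{it}, \gamma, x_{it}, \alpha_i)}{\partial \gamma} \quad \text{(a } K_\gamma \times 1 \text{ vector)},$$
$$g_{\alpha i} = \frac{\partial \log L}{\partial \alpha_i} = \sum_{t=1}^{T(i)} \frac{\partial \log g(y_{it}, \gamma, x_{it}, \alpha_i)}{\partial \alpha_i} \quad \text{(a scalar)},$$
$$g_\alpha = [g_{\alpha 1}, \ldots, g_{\alpha N}]' \quad \text{(an } N \times 1 \text{ vector)},$$
$$g = [g_\gamma', g_\alpha']' \quad \text{(a } (K_\gamma + N) \times 1 \text{ vector)}.$$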
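The full $(K_\gamma + N) \times (K_\gamma + N)$ Hessian is
$$H = \begin{bmatrix} H_{\gamma\gamma} & h_{\gamma 1} & h_{\gamma 2} & \cdots & h_{\gamma N} \\ h_{\gamma 1}' & h_{11} & 0 & \cdots & 0 \\ h_{\gamma 2}' & 0 & h_{22} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ h_{\gamma N}' & 0 & 0 & \cdots & h_{NN} \end{bmatrix}$$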
where
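$$H_{\gamma\gamma} = \sum_{i=1}^{N} \sum_{t=1}^{T(i)} \frac{\partial^2 \log g(y_{it}, \gamma, x_{it}, \alpha_i)}{\partial \gamma \, \partial \gamma'} \quad \text{(a } K_\gamma \times K_\gamma \text{ matrix)},$$
$$h_{\gamma i} = \sum_{t=1}^{T(i)} \frac{\partial^2 \log g(y_{it}, \gamma, x_{it}, \alpha_i)}{\partial \gamma \, \partial \alpha_i} \quad \text{(} N \; K_\gamma \times 1 \text{ vectors)},$$
$$h_{ii} = \sum_{t=1}^{T(i)} \frac{\partial^2 \log g(y_{it}, \gamma, x_{it}, \alpha_i)}{\partial \alpha_i^2} \quad \text{(} N \text{ scalars)}.$$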
Newton's method of maximizing the log likelihood produces the iteration
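$$\begin{pmatrix} \gamma \\ \alpha \end{pmatrix}_k = \begin{pmatrix} \gamma \\ \alpha \end{pmatrix}_{k-1} - H_{k-1}^{-1} g_{k-1} = \begin{pmatrix} \gamma \\ \alpha \end{pmatrix}_{k-1} + \begin{pmatrix} \Delta_\gamma \\ \Delta_\alpha \end{pmatrix},$$
where subscript 'k' indicates the updated value and 'k-1' indicates a computation at the current value. Let $H^{\gamma\gamma}$ denote the upper left $K_\gamma \times K_\gamma$ submatrix of $H^{-1}$ and define the $N \times N$ matrix $H^{\alpha\alpha}$ and the $K_\gamma \times N$ matrix $H^{\gamma\alpha}$ likewise. Isolating $\gamma$, then, we have the iteration
$$\gamma_k = \gamma_{k-1} - [H^{\gamma\gamma} g_\gamma + H^{\gamma\alpha} g_\alpha]_{k-1} = \gamma_{k-1} + \Delta_\gamma.$$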
Using the partitioned inverse formula [e.g., Greene (2000, equation 2-74)], we have
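$$H^{\gamma\gamma} = [H_{\gamma\gamma} - H_{\gamma\alpha} H_{\alpha\alpha}^{-1} H_{\alpha\gamma}]^{-1},$$
where $H_{\gamma\alpha} = H_{\alpha\gamma}' = [h_{\gamma 1}, \ldots, h_{\gamma N}]$ and $H_{\alpha\alpha} = \mathrm{diag}(h_{11}, \ldots, h_{NN})$ are the off diagonal and lower right blocks of $H$. The fact that $H_{\alpha\alpha}$ is diagonal makes this computation simple. Collecting the terms,
$$H^{\gamma\gamma} = \left[ H_{\gamma\gamma} - \sum_{i=1}^{N} \frac{1}{h_{ii}} h_{\gamma i} h_{\gamma i}' \right]^{-1}.$$
Thus, the upper left part of the inverse of the Hessian can be computed by summation of vectors and matrices of order $K_\gamma$. We also require $H^{\gamma\alpha}$. Once again using the partitioned inverse formula, this would be
$$H^{\gamma\alpha} = -H^{\gamma\gamma} H_{\gamma\alpha} H_{\alpha\alpha}^{-1}.$$
As before, the diagonality of $H_{\alpha\alpha}$ makes this straightforward. Combining terms, we find
$$\Delta_\gamma = -H^{\gamma\gamma} (g_\gamma - H_{\gamma\alpha} H_{\alpha\alpha}^{-1} g_\alpha) = -\left[ H_{\gamma\gamma} - \sum_{i=1}^{N} \frac{1}{h_{ii}} h_{\gamma i} h_{\gamma i}' \right]_{k-1}^{-1} \left( g_\gamma - \sum_{i=1}^{N} \frac{g_{\alpha i}}{h_{ii}} h_{\gamma i} \right)_{k-1}.$$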
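Turning now to the update for $\alpha$, we use the same results for the partitioned matrices. Thus,
$$\Delta_\alpha = -[H^{\alpha\gamma} g_\gamma + H^{\alpha\alpha} g_\alpha]_{k-1}.$$
Using Greene's (2-74) once again, we have
$$H^{\alpha\alpha} = H_{\alpha\alpha}^{-1} (I + H_{\alpha\gamma} H^{\gamma\gamma} H_{\gamma\alpha} H_{\alpha\alpha}^{-1}),$$
$$H^{\alpha\gamma} = -H^{\alpha\alpha} H_{\alpha\gamma} H_{\gamma\gamma}^{-1} = -H_{\alpha\alpha}^{-1} H_{\alpha\gamma} H^{\gamma\gamma}.$$
Therefore,
$$\Delta_\alpha = -H_{\alpha\alpha}^{-1} (I + H_{\alpha\gamma} H^{\gamma\gamma} H_{\gamma\alpha} H_{\alpha\alpha}^{-1}) g_\alpha + H_{\alpha\alpha}^{-1} (I + H_{\alpha\gamma} H^{\gamma\gamma} H_{\gamma\alpha} H_{\alpha\alpha}^{-1}) H_{\alpha\gamma} H_{\gamma\gamma}^{-1} g_\gamma$$
$$= -H_{\alpha\alpha}^{-1} (g_\alpha + H_{\alpha\gamma} \Delta_\gamma).$$
Since $H_{\alpha\alpha}$ is diagonal,
$$\Delta_{\alpha i} = -\frac{1}{h_{ii}} (g_{\alpha i} + h_{\gamma i}' \Delta_\gamma).$$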
Neither update vector requires storage or inversion of a $(K_\gamma + N) \times (K_\gamma + N)$ matrix; each is a function of sums of scalars and $K_\gamma \times 1$ vectors of first derivatives and mixed second derivatives.¹ The practical implication is that calculation of fixed effects models is a computation only of order $K_\gamma$. Storage requirements for $\alpha$ and $\Delta_\alpha$ are linear in $N$, not quadratic. Even for huge panels of tens of thousands of units, this is well within the capacity of even modest desktop computers. In experiments, we have found this method effective for probit models with 10,000 effects. (An analyst using this procedure for a tobit model reported success with nearly 15,000 coefficients.) (The amount of computation is not particularly large either, though with the current vintage of 2+ GFLOP processors, computation time for econometric estimation problems is usually not an issue.)
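To make the storage argument concrete, the two update vectors can be coded directly from the partitioned formulas and compared with a brute force solve of the full system. This is a minimal sketch assuming Python with NumPy; the function name `fe_newton_step`, the array layout (row $i$ of `h_gi` holds $h_{\gamma i}'$), and the synthetic derivative values are all hypothetical.

```python
import numpy as np

def fe_newton_step(g_gamma, g_alpha, H_gg, h_gi, h_ii):
    """Newton update from the partitioned formulas.

    g_gamma : (K,)   gradient with respect to gamma
    g_alpha : (N,)   gradient with respect to the fixed effects
    H_gg    : (K, K) upper left Hessian block
    h_gi    : (N, K) row i holds h_{gamma i}'
    h_ii    : (N,)   diagonal of H_{alpha alpha}
    """
    x_w = h_gi / h_ii[:, None]            # rows h_{gamma i}' / h_ii
    A = H_gg - x_w.T @ h_gi               # H_gg - sum_i h_gi h_gi' / h_ii
    r = g_gamma - x_w.T @ g_alpha         # g_gamma - sum_i (g_ai / h_ii) h_gi
    d_gamma = -np.linalg.solve(A, r)      # only a K x K system is solved
    d_alpha = -(g_alpha + h_gi @ d_gamma) / h_ii
    return d_gamma, d_alpha

# Synthetic derivatives (illustrative values, not from any fitted model).
rng = np.random.default_rng(2)
N, K = 40, 3
h_ii = -rng.uniform(0.5, 2.0, size=N)
h_gi = 0.1 * rng.normal(size=(N, K))
H_gg = -5.0 * np.eye(K)
g_gamma, g_alpha = rng.normal(size=K), rng.normal(size=N)

d_gamma, d_alpha = fe_newton_step(g_gamma, g_alpha, H_gg, h_gi, h_ii)

# Brute force reference: build and solve the full (K+N) x (K+N) system.
H = np.block([[H_gg, h_gi.T], [h_gi, np.diag(h_ii)]])
dense = -np.linalg.solve(H, np.concatenate([g_gamma, g_alpha]))
```

The dense reference is feasible only because N is tiny here; the function itself never forms anything larger than N by K.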
The estimator of the asymptotic covariance matrix for the MLE of $\gamma$ is $-H^{\gamma\gamma}$, the upper left submatrix of $-H^{-1}$. This is a sum of $K_\gamma \times K_\gamma$ matrices, and will be of the form of a moment matrix which is easily computed (see the application below). Thus, the asymptotic covariance matrix for the estimated coefficient vector is easily obtained in spite of the size of the problem. The asymptotic covariance matrix of $a$ is
$$-(H_{\alpha\alpha} - H_{\alpha\gamma} H_{\gamma\gamma}^{-1} H_{\gamma\alpha})^{-1} = -H_{\alpha\alpha}^{-1} - H_{\alpha\alpha}^{-1} H_{\alpha\gamma} \{H_{\gamma\gamma} - H_{\gamma\alpha} H_{\alpha\alpha}^{-1} H_{\alpha\gamma}\}^{-1} H_{\gamma\alpha} H_{\alpha\alpha}^{-1}.$$
It is (presumably) not possible to store the asymptotic covariance matrix for the fixed effects estimators (unless there are relatively few of them). But by expanding the summations where needed and exploiting the diagonality of $H_{\alpha\alpha}$, we find that the individual terms are
$$\mathrm{Asy.Cov}[a_i, a_j] = -\left[ \frac{\mathbf{1}(i = j)}{h_{ii}} + \frac{1}{h_{ii}} h_{\gamma i}' H^{\gamma\gamma} h_{\gamma j} \frac{1}{h_{jj}} \right].$$
Once again, the only matrix to be inverted is $K_\gamma \times K_\gamma$, not $N \times N$ (and it is already in hand), so this can be computed by summation. It involves only $K_\gamma \times 1$ vectors and repeated use of the same $K_\gamma \times K_\gamma$ inverse matrix. Likewise, the asymptotic covariance matrix of the slopes and the constant terms can be arranged in a computationally feasible format. With $c$ denoting the MLE of $\gamma$,
$$\mathrm{Asy.Cov}[c, a'] = -\mathrm{Asy.Var}[c] \, H_{\gamma\alpha} H_{\alpha\alpha}^{-1}.$$
This involves $N \times N$ and $K_\gamma \times N$ matrices, but it simplifies to
$$\mathrm{Asy.Cov}[c, a_i] = -\mathrm{Asy.Var}[c] \, \frac{h_{\gamma i}}{h_{ii}}.$$
¹ The iteration for the slope estimator is suggested in the context of the binary logit model by Chamberlain (1980, page 227). A formal derivation of $\Delta_\gamma$ and $\Delta_\alpha$ was presented by George Jakubson of Cornell University in an undated memo, "Fixed Effects (Maximum Likelihood) in Nonlinear Models."
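The covariance expressions of this section can be verified against the negative inverse of the full Hessian on a small problem. The sketch below assumes Python with NumPy; the function name `fe_covariances`, the array layout, and the synthetic Hessian blocks are hypothetical. It computes $\mathrm{Asy.Var}[c]$, the columns of $\mathrm{Asy.Cov}[c, a_i]$, and individual elements $\mathrm{Asy.Cov}[a_i, a_j]$ without forming an $N \times N$ matrix.

```python
import numpy as np

def fe_covariances(H_gg, h_gi, h_ii):
    """Covariance pieces from K x K algebra (row i of h_gi is h_{gamma i}')."""
    x_w = h_gi / h_ii[:, None]                   # rows h_{gamma i}' / h_ii
    V = -np.linalg.inv(H_gg - x_w.T @ h_gi)      # Asy.Var[c] = -H^{gamma gamma}
    cov_ca = -(V @ x_w.T)                        # column i is Asy.Cov[c, a_i]

    def cov_aa(i, j):
        # -1(i=j)/h_ii - (1/h_ii) h_gi' H^{gg} h_gj (1/h_jj), with V = -H^{gg}
        return -float(i == j) / h_ii[i] + x_w[i] @ V @ x_w[j]

    return V, cov_ca, cov_aa

# Synthetic Hessian blocks (illustrative values only).
rng = np.random.default_rng(4)
N, K = 30, 2
h_ii = -rng.uniform(0.5, 2.0, size=N)
h_gi = 0.1 * rng.normal(size=(N, K))
H_gg = -4.0 * np.eye(K)

V, cov_ca, cov_aa = fe_covariances(H_gg, h_gi, h_ii)

# Reference: negative inverse of the full (K+N) x (K+N) Hessian.
H = np.block([[H_gg, h_gi.T], [h_gi, np.diag(h_ii)]])
full = -np.linalg.inv(H)
```

Returning `cov_aa` as a function rather than a matrix mirrors the point made in the text: individual elements are cheap to compute on demand, whereas the full matrix for the fixed effects would generally be too large to store.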
4 Applications
To illustrate the preceding, we examine two applications, the binomial probit (and logit) model(s) and a sample selection model. (With trivial modification, the first of these will extend to many other models, as shown below.)²
4.1 Binary Choice and Simple Index Function Models
For a binomial probit model with dependent variable z it,
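$$g(z_{it}, \beta' x_{it} + \alpha_i) = \Phi[(2z_{it} - 1)(\beta' x_{it} + \alpha_i)] = \Phi(q_{it} r_{it}) = \Phi(a_{it})$$
and
$$\log L = \sum_{i=1}^{N} \sum_{t=1}^{T(i)} \log \Phi[q_{it}(\beta' x_{it} + \alpha_i)].$$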
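Define the first and second derivatives of $\log g(z_{it}, \beta' x_{it} + \alpha_i)$ with respect to $(\beta' x_{it} + \alpha_i)$ as $q_{it}\lambda_{it}$ and $\delta_{it}$, where
$$\lambda_{it} = \frac{\phi(a_{it})}{\Phi(a_{it})},$$
$$\delta_{it} = -a_{it}\lambda_{it} - \lambda_{it}^2, \quad -1 < \delta_{it} < 0.$$
The derivatives of the log likelihood for the probit model are
$$g_{\alpha i} = \sum_{t=1}^{T(i)} q_{it}\lambda_{it},$$
$$g_\gamma = \sum_{i=1}^{N} \sum_{t=1}^{T(i)} q_{it}\lambda_{it} x_{it},$$
$$h_{ii} = \sum_{t=1}^{T(i)} \delta_{it},$$
$$h_{\gamma i} = \sum_{t=1}^{T(i)} \delta_{it} x_{it},$$
$$H_{\gamma\gamma} = \sum_{i=1}^{N} \sum_{t=1}^{T(i)} \delta_{it} x_{it} x_{it}'.$$
For convenience, let
$$\delta_{i.} = \sum_{t=1}^{T(i)} \delta_{it}$$
and
$$\tilde{x}_i = h_{\gamma i} / h_{ii} = \sum_{t=1}^{T(i)} \delta_{it} x_{it} \Big/ \sum_{t=1}^{T(i)} \delta_{it}.$$
Note that $\tilde{x}_i$ is a weighted within group mean of the regressor vectors.
² We assume in the following that none of the groups have $y_{it}$ always equal to 1 or 0. In practice, one would have to determine this as part of the estimation process. It should be noted for the practitioner that this condition is not trivially obvious during estimation. The usual criteria for convergence, such as small $\Delta_\gamma$, will appear to be met while the associated $\alpha_i$ is still finite even in the presence of degenerate groups.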
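The update vectors and computation of the slope and group effect estimates follow the template given earlier. After a bit of manipulation, we find the asymptotic covariance matrix for the slope parameters is
$$\mathrm{Asy.Var}[b_{MLE}] = -H^{\gamma\gamma} = \left[ -\sum_{i=1}^{N} \sum_{t=1}^{T(i)} \delta_{it} (x_{it} - \tilde{x}_i)(x_{it} - \tilde{x}_i)' \right]^{-1}.$$
The resemblance to the 'within' moment matrix from the analysis of variance context is notable and convenient. Inserting the parts and collecting terms produces
$$\Delta_\gamma = \left[ -\sum_{i=1}^{N} \sum_{t=1}^{T(i)} \delta_{it} (x_{it} - \tilde{x}_i)(x_{it} - \tilde{x}_i)' \right]^{-1} \left[ \sum_{i=1}^{N} \sum_{t=1}^{T(i)} q_{it}\lambda_{it} (x_{it} - \tilde{x}_i) \right]$$
and
$$\Delta_{\alpha i} = -\left[ \sum_{t=1}^{T(i)} q_{it}\lambda_{it} \Big/ \delta_{i.} + \tilde{x}_i' \Delta_\gamma \right].$$
Denote the matrix in the preceding as
$$V = -H^{\gamma\gamma} = \mathrm{Asy.Var}[b_{MLE}].$$
Then,
$$\mathrm{Asy.Cov}[a_i, a_j] = -\frac{\mathbf{1}(i = j)}{\delta_{i.}} + \tilde{x}_i' V \tilde{x}_j = -\frac{\mathbf{1}(i = j)}{\delta_{i.}} + s_{ij}.$$
Finally,
$$\mathrm{Asy.Cov}[b_{MLE}, a_i] = -V \tilde{x}_i.$$
Each of these involves a moderate amount of computation, but can easily be obtained with existing software and, most important for our purposes, involves computations that are linear in $N$ and $K$. We note as well that the preceding extends directly to any other simple index function model, such as the binomial logit model [change derivatives $\lambda_{it}$ to $(1 - \Lambda_{it})$ and $\delta_{it}$ to $-\Lambda_{it}(1 - \Lambda_{it})$, where $\Lambda_{it}$ is the logit CDF] and the Poisson regression model [taking $q_{it} = 1$, replace $\lambda_{it}$ with $(y_{it} - m_{it})$ and $\delta_{it}$ with $-m_{it}$, where $m_{it} = \exp(\beta' x_{it} + \alpha_i)$]. Extension to models that involve ancillary parameters, such as the tobit model, is a bit more complicated, but not excessively so.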
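The probit formulas above translate almost line for line into code. What follows is a minimal sketch under stated assumptions, not the author's implementation: it assumes Python with NumPy, uses plain Newton steps with no step size control, starts from zero, and drops groups with no within group variation before estimation. The function names `probit_terms`, `group_sum` and `probit_fe` and the simulated panel are hypothetical.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def probit_terms(b, a, z, X, idx):
    """Return q_it * lambda_it and delta_it at the current (b, a)."""
    q = 2.0 * z - 1.0
    a_it = q * (X @ b + a[idx])
    Phi = 0.5 * _erfc(-a_it / math.sqrt(2.0))                  # normal cdf
    lam = np.exp(-0.5 * a_it**2) / math.sqrt(2.0 * math.pi) / Phi
    return q * lam, -a_it * lam - lam**2

def group_sum(v, idx, N):
    return np.bincount(idx, weights=v, minlength=N)

def probit_fe(z, X, groups, max_iter=50, tol=1e-10):
    """Fixed effects probit by Newton's method using only K x K algebra."""
    _, idx = np.unique(groups, return_inverse=True)
    N, K = idx.max() + 1, X.shape[1]
    b, a = np.zeros(K), np.zeros(N)
    for _ in range(max_iter):
        ql, delta = probit_terms(b, a, z, X, idx)
        d_i = group_sum(delta, idx, N)                         # h_ii = delta_i.
        x_t = np.column_stack(
            [group_sum(delta * X[:, k], idx, N) for k in range(K)]
        ) / d_i[:, None]                                       # weighted group means
        Xd = X - x_t[idx]
        V = np.linalg.inv(-(Xd * delta[:, None]).T @ Xd)       # Asy.Var[b]
        d_b = V @ (Xd.T @ ql)
        d_a = -(group_sum(ql, idx, N) / d_i + x_t @ d_b)
        b, a = b + d_b, a + d_a
        if max(np.abs(d_b).max(), np.abs(d_a).max()) < tol:
            break
    # V is evaluated at the values preceding the final (negligible) update.
    return b, a, V, idx

# Simulated panel with T(i) = 8 and N = 100 (illustrative values only).
rng = np.random.default_rng(3)
N, T = 100, 8
groups = np.repeat(np.arange(N), T)
X = rng.normal(size=(N * T, 2))
alpha = rng.normal(scale=0.5, size=N)
z = (X @ np.array([1.0, -1.0]) + alpha[groups]
     + rng.normal(size=N * T) > 0).astype(float)

# Drop groups in which z is all 0s or all 1s (no MLE for alpha_i exists).
ones = np.bincount(groups, weights=z)
keep = np.isin(groups, np.flatnonzero((ones > 0) & (ones < T)))
b_hat, a_hat, V_hat, idx = probit_fe(z[keep], X[keep], groups[keep])
```

Working memory is proportional to the number of observations plus N times K, so the same loop applies unchanged when N is in the tens of thousands. A production version would add a line search and an explicit check for degenerate groups during iteration.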
The preceding provides the estimator and asymptotic variances for all estimated parameters in the model. For inference purposes, note that the unconditional log likelihood function is computed. Thus, a test for homogeneity is straightforward using the likelihood ratio test. Finally, one would normally want to compute marginal effects for the estimated probit model. The conditional mean in the model is
$$E[z_{it} \mid x_{it}] = \Phi(\beta' x_{it} + \alpha_i),$$
so the slopes in the model are
$$\frac{\partial E[z_{it} \mid x_{it}]}{\partial x_{it}} = \phi(\beta' x_{it} + \alpha_i)\,\beta.$$
In many applications, marginal effects are computed at the means of the data. The heterogeneity in the fixed effects presents a complication. Using the sample mean of the fixed effects estimators, the estimator would be
$$\hat{\delta} = \frac{\partial \hat{E}[z_{it} \mid \bar{x}]}{\partial \bar{x}} = \phi(b'\bar{x} + \bar{a})\,b.$$
In order to compute the appropriate asymptotic standard errors for these estimates, we need the asymptotic covariance matrix for the estimated parameters. The asymptotic covariance matrix for the slope estimator is already in hand, so what remains is $\mathrm{Asy.Cov}[b, \bar{a}]$ and $\mathrm{Asy.Var}[\bar{a}]$. For the former,
$$\mathrm{Asy.Cov}[b, \bar{a}] = -\frac{1}{N} V \sum_{i=1}^{N} \tilde{x}_i,$$
while, by summation, we obtain
$$\mathrm{Asy.Var}[\bar{a}] = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ -\frac{\mathbf{1}(i = j)}{\delta_{i.}} + s_{ij} \right].$$
These would be assembled in a $(K+1) \times (K+1)$ matrix, say $V^*$. The asymptotic covariance matrix for the estimated marginal effects would be
$$\mathrm{Asy.Var}[\hat{\delta}] = G V^* G'$$