CHAPTER 13
Estimation II — methods
The purpose of this chapter is to consider various methods for constructing
‘good’ estimators for the unknown parameters θ. The methods to be discussed are the least-squares method, the method of moments and the maximum likelihood method. These three methods played an important role in the development of statistical inference from the early nineteenth century to the present day. The historical background is central to the discussion of these methods because they were developed in response to the particular demands of the day and in the context of different statistical frameworks. If we consider these methods in the context of the present-day framework of a statistical model as developed above, we lose most of the early pioneers' insight and the resulting anachronism can lead to misunderstanding. The method developed in relation to the contemporary statistical model framework is the maximum likelihood method, attributed to Fisher (1922). The other two methods will be considered briefly in relation to their historical context in an attempt to delineate their role in contemporary statistical inference and, in particular, their relation to the method of maximum likelihood.
The method of maximum likelihood will play a very important role in the discussion and analysis of the statistical models considered in Part IV; a sound understanding of this method will be of paramount importance. After the discussion of the concepts of the likelihood function, maximum likelihood estimator (MLE) and score function we go on to discuss the properties of MLE's. The properties of MLE's are divided into finite-sample and asymptotic properties and discussed in the case of a random as well as a non-random sample. The latter case will be used extensively in Part IV. The actual derivation of MLE's and their asymptotic distributions
is emphasised throughout as a prelude to the discussion of estimation in Part IV.
13.1 The method of least-squares
The method of least-squares was first introduced by Legendre in 1805 and Gauss in 1809 in the context of astronomical measurements. The problem as posed at the time was one of approximating a set of noisy observations y_i, i = 1, 2, ..., n, by some known functions g_i(θ_1, θ_2, ..., θ_m), i = 1, ..., n, which depended on the unknown parameters θ = (θ_1, ..., θ_m)', m < n. Legendre argued that in the case of g_i(θ) = θ_1, i = 1, 2, ..., n, minimising

Σ_{i=1}^n (y_i − θ_1)², with respect to θ_1, (13.1)

gives rise to θ̂_1 = (1/n) Σ_{i=1}^n y_i = ȳ_n, the sample mean, which was generally considered to be the most representative value of (y_1, y_2, ..., y_n). On the basis of this result he went on to suggest minimising the squared errors

l(θ) = Σ_{i=1}^n (y_i − g_i(θ))², the least-squares criterion. (13.2)
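The following is a minimal numerical sketch of minimising the criterion (13.2); the particular approximating function g_i(θ) = exp(θx_i), the simulated data and the optimiser are illustrative assumptions, not part of the text.

import numpy as np
from scipy.optimize import minimize

# Hypothetical example: approximate noisy observations y_i by g_i(theta) = exp(theta*x_i)
# by minimising the least-squares criterion l(theta) = sum_i (y_i - g_i(theta))^2.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
theta_true = 1.5
y = np.exp(theta_true * x) + rng.normal(scale=0.1, size=x.size)

def l(theta):
    return np.sum((y - np.exp(theta[0] * x)) ** 2)

result = minimize(l, x0=np.array([1.0]))
print(result.x)   # estimate of theta, close to 1.5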
Gauss, on the other hand, proposed a probabilistic set-up by reversing the Legendre argument about the mean. Crudely, his argument was that if X = (X_1, X_2, ..., X_n)' is a random sample from some density function f(x) and the sample mean X̄_n is the most representative value for all such X_i's, then the density function must be normal, i.e.
(Note: NI(·) ‘reads’ normal independent, justifying the normality assumption on the grounds of being made up of a large number of independent factors cancelling each other out.) In this form the problem can
be viewed as one of estimation in the context of the statistical model:
(i) Φ = { f(y_i; θ) = [1/(σ√(2π))] exp[ −(1/(2σ²))(y_i − g_i(θ))² ], θ ∈ Θ }, (13.6)
by transferring the probabilistic assumption from ε_i to y_i, the observable r.v., and assuming that the approximating function g_i(θ) is linear, i.e.

g_i(θ) = Σ_{k=1}^m θ_k x_{ki}, i = 1, 2, ..., n,
and with the normality assumption replaced by the assumptions
E(ε_i) = 0, Var(ε_i) = σ², E(ε_i ε_j) = 0, i ≠ j, i, j = 1, 2, ..., n. (13.10)
In some of the present-day literature this model is considered as an extension of the Gauss formulation by weakening the normality assumption. For further discussion of the Gauss linear model see Chapter 18.
For simplicity of exposition let us consider the case where m = 1 and the model becomes

y_i = θ_1 x_{1i} + ε_i, i = 1, 2, ..., n.

Minimising Σ_{i=1}^n (y_i − θ_1 x_{1i})² with respect to θ_1 yields

θ̂_1 = (Σ_{i=1}^n x_{1i} y_i)/(Σ_{i=1}^n x_{1i}²),

which is the least-squares estimator of θ_1.
Note that in the case where x_{1i} = 1, i = 1, 2, ..., n, θ̂_1 = (1/n) Σ_{i=1}^n y_i, i.e. the sample mean.
Given that the x_{1i}'s are in general assumed to be known constants, θ̂_1 is a linear function of the r.v.'s y_1, ..., y_n of the form θ̂_1 = Σ_{i=1}^n c_i y_i, where c_i = x_{1i}/(Σ_{j=1}^n x_{1j}²).
It can be shown that, under the above assumptions relating to ε_i, if (Σ_{i=1}^n x_{1i}²) ≠ 0, the least-squares estimator θ̂_1 of θ_1 has the smallest variance,

Var(θ̂_1) = σ² (Σ_{i=1}^n x_{1i}²)^{−1}, (13.17)
within the class of linear and unbiased estimators (Gauss-Markov theorem, see Section 21.2).
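A small simulation can make the Gauss-Markov claim concrete. The sketch below is an illustration rather than part of the text: the x_{1i}'s, the error distribution and the rival estimator θ̃_1 = (Σ y_i)/(Σ x_{1i}) (another linear unbiased estimator) are all assumptions chosen for the example.

import numpy as np

rng = np.random.default_rng(1)
n, theta1, sigma = 20, 2.0, 1.0
x = np.linspace(1.0, 3.0, n)                 # assumed known constants x_1i

ls_estimates, alt_estimates = [], []
for _ in range(10_000):
    eps = rng.normal(0.0, sigma, n)          # E(eps)=0, Var(eps)=sigma^2, uncorrelated
    y = theta1 * x + eps
    ls_estimates.append(np.sum(x * y) / np.sum(x ** 2))   # least-squares estimator
    alt_estimates.append(np.sum(y) / np.sum(x))           # another linear unbiased estimator

# Both are unbiased, but the least-squares estimator has the smaller sampling variance.
print(np.mean(ls_estimates), np.mean(alt_estimates))
print(np.var(ls_estimates), np.var(alt_estimates))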
13.2 The method of moments
From the discussion in the previous section it is clear that the least-squares method is not a general method of estimation because it presupposes the existence of approximating functions g_i(θ), i = 1, 2, ..., n, which play the role of the mean in the context of a probability model. In the context of a probability model Φ, however, unknown parameters of interest are not only associated with the mean but also with the higher moments. This prompted Pearson in 1894 to suggest the method of moments as a general estimation method. The idea underlying the method can be summarised as follows: let us assume that X = (X_1, X_2, ..., X_n)' is a random sample from f(x; θ), θ ∈ R^k. The raw moments of f(x; θ), μ'_r = E(X^r), r ≥ 1, are by definition functions of the unknown parameters, since

μ'_r(θ) = ∫ x^r f(x; θ) dx, r = 1, 2, ....

The corresponding sample raw moments are m_1 =
(1/n) Σ_{i=1}^n X_i, m_2 = (1/n) Σ_{i=1}^n X_i², and so on. The method suggests equating the first k raw moments to the corresponding sample moments and solving for θ, provided that

det[ ∂μ'_i(θ)/∂θ_j, i, j = 1, 2, ..., k ] ≠ 0, for θ ∈ Θ. (13.23)
If the equations μ'_i(θ) = m_i, i = 1, 2, ..., k, have a unique solution θ̂_n = (θ̂_1, ..., θ̂_k)' with probability approaching one as n → ∞, then θ̂_n → θ (i.e. θ̂_n is a consistent estimator of θ).
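A minimal sketch of the method for a two-parameter case; the Gamma(α, β) probability model, whose raw moments are μ'_1 = αβ and μ'_2 = αβ²(α + 1), is an illustrative assumption and not an example from the text.

import numpy as np

rng = np.random.default_rng(2)
alpha_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=beta_true, size=5_000)

# Sample raw moments m_1 = (1/n) sum X_i and m_2 = (1/n) sum X_i^2
m1 = np.mean(x)
m2 = np.mean(x ** 2)

# Solve mu'_1(alpha, beta) = m1 and mu'_2(alpha, beta) = m2:
# mu'_1 = alpha*beta, mu'_2 = alpha*beta^2*(alpha + 1), so m2 - m1^2 = alpha*beta^2.
beta_hat = (m2 - m1 ** 2) / m1
alpha_hat = m1 / beta_hat
print(alpha_hat, beta_hat)   # moment estimators, close to (3, 2)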
Although the method of moments usually yields (strongly) consistent estimators, they are in general inefficient. This was taken up by Fisher in several papers in the 1920s and 30s, arguing in favour of the maximum likelihood method for producing efficient estimators (at least asymptotically). The controversy between Pearson and Fisher about the relative merits of their respective methods of estimation ended in the mid-1930s with Fisher the winner and the absolute dominance since then of the maximum likelihood method.
The basic reason for the inefficiency of the estimators based on the method of moments is not hard to find. It is due to the fact that the method does not use any information relating to the probability model Φ apart from the assumption that raw moments of order k exist. It is important, however, to remember that this method was proposed by Pearson in the late nineteenth century when no such probability model was postulated a priori. The problem of statistical inference at the time was seen as one starting from a sample X = (X_1, ..., X_n)' and estimating f(x; θ) without assuming a priori some form for f(·). This point is commonly missed when comparisons between the various methods are made; it was unfortunately missed even by Pearson himself in his exchanges with Fisher. It is no surprise then to discover that a method developed in the context of an alternative framework, when applied to the present-day set-up, is found wanting.
13.3 The maximum likelihood method
The maximum likelihood method of estimation was formulated by Fisher
in a series of papers in the 1920s and 30s and extended by various authors such as Cramer, Rao and Wald. In the current statistical literature the method of maximum likelihood is by far the most widely used method of estimation and plays a very important role in hypothesis testing.
(1) The likelihood function
Consider the statistical model:
(i) Φ = { f(x; θ), θ ∈ Θ };
(ii) X = (X_1, X_2, ..., X_n)' a sample from f(x; θ),
where X takes values in 𝒳 = R^n, the observation space. The distribution of the sample D(x_1, x_2, ..., x_n; θ) describes how the density changes as X takes different values in 𝒳 for a given θ ∈ Θ. In deriving the likelihood function we reason as follows:
since D(x; θ) incorporates all the information in the statistical model, it makes a lot of intuitive sense to reverse the argument in deriving D(x; θ) and consider the question: which value of θ ∈ Θ is mostly supported by a given sample realisation X = x? The value of θ under which the observed x would have the highest ‘likelihood’ of arising must be intuitively our best choice of θ. Using this intuitive argument the likelihood function is defined by
L(θ; x) = k(x)·D(x_1, x_2, ..., x_n; θ),

where k(x) > 0 is a function of x only (not θ). In particular, when X is a random sample, D(x; θ) = Π_{i=1}^n f(x_i; θ) and hence L(θ; x) ∝ Π_{i=1}^n f(x_i; θ).
Even though the probability remains attached to X and not θ in defining L(θ; x), it is interpreted as if it is reflected inferentially on θ, reflecting the ‘likelihood’ of a given X = x arising for different values of θ in Θ. In order to see this, consider the following example.
The presence of the arbitrary factor k(x) implies that the likelihood function is non-unique; any monotonic transformation of it represents the same information. In particular,

log L(θ; x), the log likelihood function, (13.27)

is the monotonic transformation most commonly used in practice.
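As a quick numerical check (an illustration, not from the text), the following sketch evaluates the likelihood and the log likelihood of an assumed Bernoulli random sample over a grid and confirms that both are maximised at the same value of θ, the sample proportion.

import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])            # assumed Bernoulli(theta) sample
theta_grid = np.linspace(0.001, 0.999, 999)

# Likelihood (up to the constant k(x)) and log likelihood for a random sample
L = theta_grid ** x.sum() * (1 - theta_grid) ** (x.size - x.sum())
logL = x.sum() * np.log(theta_grid) + (x.size - x.sum()) * np.log(1 - theta_grid)

# Both are maximised at the same point, the sample proportion
print(theta_grid[np.argmax(L)], theta_grid[np.argmax(logL)], x.mean())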
(2) The maximum likelihood estimator (MLE)
Given that the likelihood function represents the support given to the various θ ∈ Θ given X = x, it is natural to define the maximum likelihood estimator of θ to be a Borel function θ̂: 𝒳 → Θ such that

L(θ̂; x) = max_{θ ∈ Θ} L(θ; x),

and there may be one, none or many such MLE's.
Note that
log L(θ̂; x) ≥ log L(θ*; x), for all θ* ∈ Θ. (13.30)
In the case where L(θ; x) is differentiable the MLE can be derived as a solution of the likelihood equation

d log L(θ; x)/dθ = 0.
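A hedged numerical sketch (not an example from the text): for an assumed exponential probability model f(x; θ) = θ exp(−θx), the log likelihood of a random sample is maximised numerically and the result is checked against the closed-form solution of the likelihood equation, θ̂ = 1/X̄_n.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.exponential(scale=1 / 2.5, size=1_000)     # true theta = 2.5

def neg_log_likelihood(theta):
    # log L(theta; x) = n*log(theta) - theta*sum(x_i) for the exponential model
    return -(x.size * np.log(theta) - theta * x.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(result.x, 1 / x.mean())    # numerical MLE vs. closed-form solution 1/x-bar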
Before the reader jumps to the erroneous conclusion that deriving the MLE is a matter of a simple differentiation, let us consider some examples where the derivation is not as straightforward.
Example 4
Let Z = (Z_1, Z_2, ..., Z_n)', where Z_i = (X_i, Y_i)', be a random sample from a bivariate normal distribution with zero means, unit variances and correlation coefficient ρ, whose density is

f(x, y; ρ) = [1/(2π√(1 − ρ²))] exp{ −[1/(2(1 − ρ²))] (x² − 2ρxy + y²) },

so that

log L(ρ; x, y) = c − n log 2π − (n/2) log(1 − ρ²) − [1/(2(1 − ρ²))] Σ_{i=1}^n (x_i² − 2ρx_i y_i + y_i²).
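Since the likelihood equation for ρ has no convenient closed-form solution, a numerical maximisation is a natural way to obtain the MLE. The sketch below is an illustration using assumed simulated data, not part of the text.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
rho_true, n = 0.6, 500
cov = np.array([[1.0, rho_true], [rho_true, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
x, y = z[:, 0], z[:, 1]

def neg_log_likelihood(rho):
    # negative of log L(rho; x, y), dropping the constant c - n*log(2*pi)
    quad = np.sum(x ** 2 - 2 * rho * x * y + y ** 2)
    return (n / 2) * np.log(1 - rho ** 2) + quad / (2 * (1 - rho ** 2))

result = minimize_scalar(neg_log_likelihood, bounds=(-0.999, 0.999), method="bounded")
print(result.x)   # numerical MLE of rho, close to 0.6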
Example 5
Let X = (X_1, X_2, ..., X_n)' be a random sample from f(x; θ) = 1/θ, where 0 < x < θ. The likelihood function is

L(θ; x) = θ^{−n} if 0 < x_i < θ, i = 1, 2, ..., n, and zero otherwise.
Using dL(θ; x)/dθ = 0 to derive the MLE is out of the question since L(θ; x) is not continuous at the maximum (see Fig. 13.5). A moment's reflection suggests that the MLE of θ is θ̂ = max(X_1, X_2, ..., X_n).
Example 6
Similarly, if X = (X_1, X_2, ..., X_n)' is a random sample from a density whose range is bounded below by θ, then L(θ; x) > 0 only when θ does not exceed any of the observations; since the X_i's are bounded below by θ, θ̂ = min(X_1, X_2, ..., X_n) represents the MLE of θ.
Looking at examples 5 and 6 we can see that the problem with the derivation of the MLE arose because the range of the X_i's depended on the unknown parameter θ. It turns out that in such cases there are not only problems with deriving the MLE but the estimators derived also do not in general satisfy all the properties MLE's enjoy (see below). For example, θ̂ = max(X_1, ..., X_n) is not asymptotically normal. Such cases are excluded by the assumption CR1 of Chapter 12.
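As an illustration of the last point (not from the text), a small simulation of the uniform model of example 5 shows that the scaled estimation error n(θ − θ̂) behaves like an exponential rather than a normal random variable, so the usual asymptotic normality fails here.

import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 200, 20_000

x = rng.uniform(0.0, theta, size=(reps, n))
theta_hat = x.max(axis=1)                    # MLE from example 5: the sample maximum
scaled_error = n * (theta - theta_hat)

# The scaled error is approximately exponential with mean theta, not normal:
print(scaled_error.mean())                   # close to theta = 2
print((scaled_error > theta).mean())         # close to exp(-1) = 0.37 rather than 0.5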
So far the examples considered refer to the case where θ is a scalar. In econometrics, however, θ is commonly a k × 1 vector, a case which presents certain additional difficulties. For differentiable likelihood functions the MLE of θ = (θ_1, θ_2, ..., θ_k)' is derived by solving the system of equations

∂ log L(θ; x)/∂θ_i = 0, i = 1, 2, ..., k.
(3) Finite sample properties
Let us discuss the finite sample properties of MLE’s in the context of the simple statistical model:
(i) probability model, Φ = { f(x; θ), θ ∈ Θ };
(ii) sampling model, X = (X_1, ..., X_n)' is a random sample from f(x; θ).
Suppose that for the particular probability model considered the likelihood equation yields the MLE

θ̂ = n( Σ_{i=1}^n log X_i )^{−1}.
The invariance property of MLE's enables us to deduce that the MLE of φ = 1/θ is φ̂ = (1/n) Σ_{i=1}^n log X_i.
In relation to invariance it is important to note that in general E[g(θ̂)] ≠ g(E(θ̂)).
For example, if g(θ) = θ² it is well known that E(θ̂²) ≠ (E(θ̂))² in general. This contributes to the fact that MLE's are not in general unbiased estimators. For instance, in example 7 above the MLE of σ², σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄_n)², is a biased estimator since (nσ̂²)/σ² ~ χ²(n − 1) (see Section 11.5) and hence E(σ̂²) = [(n − 1)/n]σ² ≠ σ². Thus, in general, unbiased estimators and MLE's do not coincide. In one particular case, however, when unbiasedness is accompanied by full efficiency, the two coincide.
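A small simulation (an illustration, not from the text) confirms this bias in the normal case: the average of the MLE of σ² over many samples is close to [(n − 1)/n]σ² rather than σ².

import numpy as np

rng = np.random.default_rng(6)
mu, sigma2, n, reps = 0.0, 4.0, 10, 50_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sigma2_mle = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)   # (1/n) sum (X_i - X-bar)^2

print(sigma2_mle.mean())            # close to ((n-1)/n)*sigma2 = 3.6, not 4.0
print((n - 1) / n * sigma2)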
Unbiasedness, full-efficiency
In the case where Φ satisfies the regularity conditions CR1-CR3 and θ̂ is an unbiased estimator of θ whose variance achieves the Cramer-Rao lower bound, the likelihood equation has a unique solution equal to θ̂. This suggests that any unbiased fully efficient estimator θ̂ can be derived as a solution of the likelihood equation (a comforting thought!). In example 7 above the MLE of μ was μ̂_n = X̄_n, which implies that μ̂_n ~ N(μ, σ²/n) since μ̂_n is a linear function of independent r.v.'s. Hence, E(μ̂_n) = μ and μ̂_n is an unbiased estimator. Moreover, given that

Var(μ̂_n) = σ²/n,

we can see that Var(μ̂_n) achieves the Cramer-Rao lower bound. On the other hand, the MLE of σ², σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄_n)², as discussed above, is not an unbiased estimator.
The property mostly emphasised by Fisher in support of the method of maximum likelihood was the property of sufficiency.
Sufficiency
If τ(X) is a sufficient statistic for θ and a unique MLE θ̂ of θ exists, then θ̂ is a function of τ(X). In the case of a non-unique MLE, an MLE θ̂ can be found which is a function of τ(X). It is important to note that this does not say that any MLE is a function of τ(X); in the case of non-uniqueness some MLE's are not functions of τ(X). It was shown in Chapter 12 that τ(X) = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²) are jointly minimal sufficient statistics for θ = (μ, σ²)' in the case where X = (X_1, ..., X_n)' is a random sample from N(μ, σ²). In example 7 above the MLE's of μ and σ² were
μ̂_n = (1/n) Σ_{i=1}^n X_i, σ̂²_n = (1/n) Σ_{i=1}^n (X_i − X̄_n)²,
which are clearly functions of τ(X).
An important implication for ML estimation when a sufficient statistic exists is that the asymptotic covariance of θ̂_n (see below) can be consistently estimated from the Hessian of the log likelihood evaluated at θ = θ̂_n, that is, by the inverse of [−∂² log L(θ; x)/∂θ ∂θ'] evaluated at θ = θ̂_n.
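A hedged sketch of this idea for the exponential model used in earlier illustrations (an assumption, not an example from the text): the second derivative of log L is available analytically, so the asymptotic variance of θ̂_n can be estimated by the inverse of the negative Hessian evaluated at the MLE.

import numpy as np

rng = np.random.default_rng(7)
theta_true = 2.5
x = rng.exponential(scale=1 / theta_true, size=1_000)

theta_hat = 1 / x.mean()                  # MLE for the exponential model
# log L(theta; x) = n*log(theta) - theta*sum(x_i), so the Hessian is -n/theta^2.
neg_hessian = x.size / theta_hat ** 2
var_hat = 1 / neg_hessian                 # estimated asymptotic variance of theta-hat
print(theta_hat, var_hat)                 # var_hat is roughly theta_hat^2 / n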
(4) Asymptotic properties (IID case)
Although MLE's enjoy several optimum finite sample properties, as seen above, their asymptotic properties provide the main justification for the almost universal appeal of the method of maximum likelihood. As argued below, under certain regularity conditions, MLE's can be shown to be consistent, asymptotically normal and asymptotically efficient.
Let us begin the discussion of asymptotic properties enjoyed by MLE’s
by considering the simplest possible case where the statistical model is as follows:
(i) probability model, Φ = { f(x; θ), θ ∈ Θ }, θ being a scalar;
(ii) sampling model, X = (X_1, ..., X_n)' is a random sample from f(x; θ).
Although this case is of little interest in Part IV, a brief discussion of it will help us understand the non-random sample case considered in the sequel. The regularity conditions needed to prove the above-mentioned asymptotic properties for MLE's can take various forms (see Cramer (1946), Wald (1949), Norden (1972-73), Weiss and Wolfowitz (1974), Serfling (1980), inter alia). For our purposes it suffices to supplement the regularity conditions of Chapter 12, CR1-CR3, with the following condition:
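As a numerical illustration of these asymptotic properties (not from the text), the sketch below simulates the MLE of the exponential model repeatedly and checks that the standardised estimator √n(θ̂_n − θ) is approximately normal with mean zero and variance equal to the inverse Fisher information, which for this model is θ².

import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 2.5, 400, 20_000

x = rng.exponential(scale=1 / theta, size=(reps, n))
theta_hat = 1 / x.mean(axis=1)                  # MLE in each replication
z = np.sqrt(n) * (theta_hat - theta)

# Asymptotic theory suggests z ~ N(0, theta^2) approximately, since I(theta) = 1/theta^2.
print(z.mean(), z.var())                        # close to 0 and theta^2 = 6.25
print(np.mean(np.abs(z) < 1.96 * theta))        # close to 0.95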