7.1 Parametric, Semiparametric, and Nonparametric Estimation Problems
7.2 Additional Considerations for the Specification and Estimation of Probability Models
7.3 Estimators and Estimator Properties
7.4 Sufficient Statistics
7.5 Minimum Variance Unbiased Estimation
The problem of point estimation examined in this chapter is concerned with the estimation of the values of unknown parameters, or functions of parameters, that represent characteristics of interest relating to a probability model of some collection of economic, sociological, biological, or physical experiments. The outcomes generated by the collection of experiments are assumed to be outcomes of a random sample with some joint probability density function f(x₁,...,xₙ;Θ). The random sample need not be from a population distribution, so that it is not necessary that X₁,...,Xₙ be iid. The estimation concepts we will examine in this chapter can be applied to the case of general random sampling, as well as simple random sampling and random sampling with replacement, i.e., all of the random sampling types discussed in Chapter 6. The objective of point estimation will be to utilize functions of the random sample outcome to generate good (in some sense) estimates of the unknown characteristics of interest.
7.1 Parametric, Semiparametric, and Nonparametric Estimation Problems
The types of estimation problems that will be examined in this (and the next) chapter are problems of parametric estimation and semiparametric estimation, as opposed to nonparametric estimation problems. Both parametric and semiparametric estimation problems are concerned with the estimates of the values of unknown parameters that characterize parametric probability models or semiparametric probability models of the population, process, or general
examining, and the need to distinguish their appearance and effect in specifying parametric, semiparametric, and nonparametric probability models, we extend the scope of the term probability model to explicitly encompass the definition of parameters and their admissible values. Note, because it is possible that the range of the random variable can change with changing values of the parameter vector for certain specifications of the joint probability density function of a random variable (e.g., a uniform distribution), we emphasize this in the definition below by including the parameter vector in the definition of the range of X.
We note, given our convention that the range and the support of the random variable X are equivalent (recall Definition 2.13), that explicitly listing the range of the random variable as part of the specification of the probability model does not provide new information, per se. That is, knowing the density function and its admissible parameter values implies the range of the random variable as R(X;Θ) ≡ {x: f(x;Θ) > 0} for Θ ∈ Ω. We will see ahead that in point estimation problems an explicit specification of the range of a random sample X is important for a number of reasons, including determining the types of estimation procedures that can be used in a given estimation problem, and for defining
in which case the adequacy of the model may itself be an issue in need of further statistical analysis and testing. However, use of parametric estimation methodology begins with, and indeed requires, such a full specification of a parametric model for X.
7.1.2 Semiparametric Models
A semiparametric model is one in which the functional form of the joint probability density function component of the probability model for the observed sample data, x, is not fully specified and is not known when the value of the parameter vector of the model, Θ, is given a specific numerical value. Instead of defining a collection of explicit parametric functional forms for the joint density of the random sample X, when defining the model, as in the parametric case, the analyst defines a number of properties that the underlying true sampling density f(x₁,...,xₙ;Θ₀) is thought to possess. Such information could include parametric specifications for some of the moments that the random variables are thought to adhere to, or whether the random variables contained in the random sample exhibit independence or not. Given a numerical value for the parameter vector Θ, any parametric structural components of the model are given an explicit fully specified functional form, but other components of the model, most notably the underlying joint density function
7.1.3 Nonparametric Models

In the nonparametric case, no parametric functional forms are specified for any component of the probability model, with the analyst simply acknowledging the existence of some general characteristics and relationships relating to the random variables in the random sample, such as the existence of a general regression relationship, or the existence of a population probability distribution if the sample were generated through simple random sampling.
For example, the analyst may wish to estimate the CDF F(z), where (X₁,...,Xₙ) is an iid random sample from the population distribution F(z), and no mention is made, nor required, regarding parameters of the CDF. We have already examined a method for estimating the CDF in the case where the random sample is from a population distribution, namely, the empirical distribution function, Fₙ, provides an estimate of F. We will leave the general study of nonparametric estimation to a more advanced course of study; interested readers can refer to M. Puri and P. Sen (1985), Nonparametric Methods in General Linear Models, New York: John Wiley; F. Hampel, E. Ronchetti, P. Rousseeuw, and W. Stahel (1986), Robust Statistics, New York: John Wiley; J. Pratt and J. Gibbons (1981), Concepts of Nonparametric Theory, New York: Springer-Verlag; and A. Pagan and A. Ullah (1999), Nonparametric Econometrics, Cambridge: Cambridge University Press.¹

We illustrate the definition of the above three types of models in the following example.
random sampling of X implied by the above fully specified probability model. For a semiparametric model specification of the relationship, the probability model {R(X;Θ), f(x;Θ), Θ ∈ Ω} will not be fully functionally specified. For this type of model, the analyst might specify that X = β₁1ₙ + zβ₂ + ε with E(ε) = 0 and Cov(ε) = σ²I, so that the first and second moments of the relationship have been defined as E(X) = β₁1ₙ + zβ₂ and Cov(X) = σ²I. In this case, knowing the numerical values of β and σ² will fully identify the means of the random variables as well as the variances, but the joint density of the random sample will remain unknown and not fully specified. It would not be possible for the analyst to simulate random sampling of X given this incomplete specification of the probability model.
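The simulation point can be made concrete with a short sketch. The following Python fragment (all parameter values are hypothetical, and the normal error distribution is assumed purely for illustration) draws a random sample from a fully specified version of the linear relationship; under the semiparametric specification, which fixes only E(ε) = 0 and Cov(ε) = σ²I, no such draw is possible because the distribution of ε is left unspecified.

```python
import numpy as np

rng = np.random.default_rng(7)

# Fully specified parametric model (hypothetical values):
# X = beta1*1_n + z*beta2 + eps, with eps ~ N(0, sigma2 * I).
n = 50
beta1, beta2, sigma2 = 2.0, 0.5, 1.5
z = np.linspace(0.0, 10.0, n)              # fixed covariate values

eps = rng.normal(0.0, np.sqrt(sigma2), n)  # full specification => simulable
x = beta1 + z * beta2 + eps
print(x[:5])

# Under the semiparametric specification only E(eps) = 0 and Cov(eps) =
# sigma2*I are asserted; with the distribution of eps unspecified, there is
# no way to draw eps, and hence no way to simulate X.
```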
Finally, consider a nonparametric model of the relationship. In this case, the analyst might specify that X = g(z) + ε with E(X) = g(z), and perhaps that X is a collection of independent random variables, but nothing more. Thus, the mean function, as well as all other aspects of the relationship between X and z, are left completely general, and nothing is explicitly determined given numerical values of parameters. There is clearly insufficient information for the analyst to simulate random sample outcomes from the probability model, not knowing the joint density of the random sample or even any moment aspects of the model, given
7.1.4 Scope of Parameter Estimation Problems
The objective in problems of parameter estimation is to utilize a sample outcome [x₁,...,xₙ]′ of X = [X₁,...,Xₙ]′ to estimate the unknown value Θ₀ or q(Θ₀), where Θ₀ denotes the value of the parameter vector associated with the joint PDF that actually determines the probabilities of events for the random sample outcome. That is, Θ₀ is the value of Θ such that X ~ f(x;Θ₀) is a true statement, and for this reason Θ₀ is oftentimes referred to as the true value of Θ, and we can then also speak of q(Θ₀) as being the true value of q(Θ) and f(x;Θ₀) as being the true PDF of X. Some examples of the many functions of Θ₀ that might be of interest when sampling from a distribution f(z;Θ₀) include

1. $q_1(\Theta_0) = E(Z) = \int_{-\infty}^{\infty} z\, f(z;\Theta_0)\,dz$ (mean),
2. $q_2(\Theta_0) = E(Z - E(Z))^2 = \int_{-\infty}^{\infty} (z - E(Z))^2 f(z;\Theta_0)\,dz$ (variance),
the properties of the procedures, is the parametric model case. In this case, the density function candidates, f(x₁,...,xₙ;Θ), are assumed at the outset to belong to specific parametric families of PDFs (e.g., normal, Gamma, binomial), and application of the celebrated maximum likelihood estimation procedure (presented in Chapter 8) relies on the candidates for the distribution of X being members of a specific collection of density functions that are indexed, and fully algebraically specified, by the values of Θ.
In the semiparametric model and nonparametric model cases, a specific functional definition of the potential PDFs for X is not assumed, although some assumptions about the lower-order moments of f(x;Θ) are often made. In any case, it is often still possible to generate useful point estimates of various characteristics of the probability model of X that are conceptually functions of parameters, such as moments, quantiles, and probabilities, even if the specific parametric family of PDFs for X is not specified. For example, useful point estimates (in a number of respects) of the parameters in the so-called general linear model representation of a random sample based on general random sampling designs can be made with only a few general assumptions regarding the lower-order moments of f(x₁,...,xₙ;Θ), and without any assumptions that the density is of a specific parametric form (see Section 8.2).
Semiparametric and nonparametric methods of estimation have an advantage of being applicable to a wide range of sampling distributions since they are defined in a distribution-nonspecific context that inherently subsumes many different functional forms for f(x;Θ). However, it is usually the case that superior methods of estimating Θ or q(Θ) exist if a parametric family of PDFs for X can be specified, and if the actual sampling distribution of the random sample is subsumed by the probability model. Put another way, the more (correct) information one has about the form of f(x;Θ) at the outset, the more precisely one can estimate Θ₀ or q(Θ₀).

7.2 Additional Considerations for the Specification and Estimation of Probability Models
A problem of point estimation begins with either a fully or partially specified probability model for the random sample X = (X₁,...,Xₙ)′ whose outcome x = [x₁,...,xₙ]′ constitutes the observed data being analyzed in a real-world problem

represented by {f(x;Θ), Θ ∈ Ω}. For example, a fully specified probability model for a random sample of miles per gallon achieved by 25 randomly chosen trucks from the assembly line of a Detroit manufacturer might be defined as

{f(x;μ,σ²), (μ,σ²) ∈ Ω}, where Ω = (0,∞)×(0,∞) and f(x;μ,σ²) is some continuous PDF. In this latter case, the statistical model allows for the possibility that f(x;μ,σ²) is any continuous PDF having a mean of μ and variance of σ², with both μ and σ² positive; e.g., normal, Gamma, or uniform PDFs would be potential candidates.

7.2.1 Specifying a Parametric Functional Form for the Sampling Distribution
In specifying a probability model, the researcher presumably attempts to identify an appropriate parametric family based on a combination of experience, consideration of the real-world characteristics of the experiments involved, theoretical considerations, past analyses of similar problems, an attempt at a reasonably robust approximation to the probability distribution, and/or pragmatism. The degree of detail with which the parametric family of densities is specified can vary from problem to problem.

In some situations there will be great confidence in a detailed choice of parametric family. For example, suppose we are interested in estimating the proportion, p, of defective manufactured items in a shipment of N items. If a random sample with replacement of size n is taken from the shipment (population) of manufactured items, then
$$(X_1,\ldots,X_n) \sim f(x_1,\ldots,x_n;p) = p^{\sum_{i=1}^n x_i}(1-p)^{\,n-\sum_{i=1}^n x_i}\prod_{i=1}^n I_{\{0,1\}}(x_i)$$

represents the parametric family of densities characterizing the joint density of the random sample, and interest centers on estimating the unknown value of the parameter p.
On the other hand, there will be situations in which the specification of the parametric family is quite tentative. For example, suppose one were interested in estimating the average operating life of a certain brand of hard disk based on outcomes of a random sample of hard-disk lifetimes. In order to add some
family of distributions associated with X is of the form ∏ᵢ₌₁ⁿ m(xᵢ;Θ), where the density m(z;Θ) has mean μ and variance σ² (what is the corresponding specification for V?).
Now, what parametric functional specification of m(xᵢ;Θ) can be assumed to contain the specific density that represents the actual probability distribution of Xᵢ or Vᵢ? (Note, of course, that specifying a parametric family for Vᵢ would imply a corresponding parametric family for Xᵢ, and vice versa.) One general specification would be the collection of all continuous joint density functions f(x₁,...,xₙ;Θ) for which f(x₁,...,xₙ;Θ) = ∏ᵢ₌₁ⁿ m(xᵢ;Θ) with E(Xᵢ) = μ and var(Xᵢ) = σ², ∀i. The advantage of such a general specification of density family is that we have great confidence that the actual density function of X is contained within the implied set of potential PDFs, which we will come to see as an important component of the specification of any point estimation problem. In this particular case, the general specification of the statistical model actually provides sufficient structure to the point estimation problem for a useful estimate of mean lifetime to be generated (for example, the least squares estimator can be used to estimate μ – see Chapter 8). We will see that one disadvantage of very general specifications of the probability model is that the interpretation of the properties of point estimates generated in such a general context is also usually not as specific or detailed as when the density family can be defined with greater specificity.
Consider a more detailed specification of the probability model of hard-disk operating lives. If we feel that lifetimes are symmetrically distributed around some point, m, with the likelihoods of lifetimes declining the more distant the measurement is from m, we might consider the normal parametric family for the distribution of Vᵢ. It would, of course, follow that Xᵢ is then also normally distributed, and thus the normal distribution could serve only as an approximation since negative lifetimes are impossible. Alternatively, if we felt that the distribution of lifetimes was skewed to the right, the gamma parametric family provides a rich source of density shapes, and we might specify that the Xᵢ's have some Gamma density, and thus the Vᵢ's would have the density of a Gamma-type random variable that has been shifted to the left by m units. Hopefully, the engineering staff could provide some guidance regarding the most defensible parametric family specification to adopt. In cases where there is considerable doubt concerning the appropriate parametric family of densities,
Ω, must also be identified to complete the probability model. There are often natural choices for the parameter space. For example, if the Bernoulli family were specified, then Ω = {p: p ∈ [0,1]}, or if the normal family were specified, then Ω = {(μ,σ): μ ∈ (−∞,∞), σ > 0}. However, if only a general definition of the parametric family of densities is specified at the outset of the point estimation problem, the specification of the parameter space for the parametric family will then also be general and often incomplete. For example, a parameter space specification for the aforementioned point estimation problem involving hard-disk lifetimes could be Ω = {Θ: μ ≥ 0, σ² ≥ 0}. In this case, since the specific algebraic form of f(x₁,...,xₙ;Θ) is also not specified, we can only state that the mean and variance of hard-disk lifetimes are nonnegative, possibly leaving other unknown parameters in Θ unrestricted depending on the relationship of the mean and variance to the parameters of the distribution, and in any case not fully specifying the functional form of the density function. Regardless of the level of detail with which Ω is specified, there are two important assumptions, presented ahead, regarding the specification of Ω that are made in the context of a point estimation problem.

The Issue of Truth in the Parameter Space. First, it is assumed that Ω contains the true value Θ₀, so that the probability model given by {f(x;Θ), Θ ∈ Ω} can be assumed to contain the true sampling distribution for the random sample under study. Put another way, in the context of a point estimation problem, the set Ω is assumed to represent the entire collection of possible values for Θ₀. The relevance of this assumption in the context of point estimation is perhaps obvious – if the objective in point estimation is to estimate the value of Θ₀ or q(Θ₀), we do not want to preclude Θ₀ or q(Θ₀) from the set of potential estimates. Note that, in practice, this may be a tentative assumption that is subjected to statistical testing for verification or refutation (see Chapter 10).
Identifiability of Parameters. The second assumption on Ω concerns the concept of the identifiability of the parameter vector Θ. As we alluded to in our discussion of parametric families of densities in Chapter 4, parameterization of density families is not unique. Any invertible transformation of Θ, say λ = h(Θ), defines an alternative parameter space Λ = {λ: λ = h(Θ), Θ ∈ Ω} that can be used to specify an alternative probability model for X that contains the same PDF candidates as the statistical model based on Ω, i.e., {f(x;h⁻¹(λ)), λ ∈ Λ} = {f(x;Θ), Θ ∈ Ω}. Defining m(x;λ) ≡ f(x;h⁻¹(λ)), the alternative probability model
The importance of parameter identifiability is related to the ability of random sample outcomes to provide discriminatory information regarding the choice of Θ ∈ Ω to be used in estimating Θ₀. If the parameter vector in a statistical model is not identified, then two or more different values of the parameter vector Θ, say Θ₁ and Θ₂, are associated with precisely the same sampling distribution of X. In this event, random sample outcomes cannot possibly be used to discriminate between the values of Θ₁ and Θ₂ since the probabilistic behavior of X under either possibility is indistinguishable. We thus insist on parameter identifiability in a point estimation problem so that different values of Θ are associated with different probabilistic behavior of the outcomes of the random sample.
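A standard textbook illustration of non-identifiability (a hypothetical construction, not the example used in this chapter) is the model Xᵢ ~ N(β₁ + β₂, 1), in which only the sum β₁ + β₂ enters the sampling distribution. The sketch below shows two distinct parameter vectors generating statistically indistinguishable samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-identified model: X_i ~ N(beta1 + beta2, 1); only the
# sum mu = beta1 + beta2 enters the sampling distribution.
def sample(beta1, beta2, n=100_000):
    return rng.normal(loc=beta1 + beta2, scale=1.0, size=n)

x_a = sample(1.0, 2.0)   # parameter value Theta_1 = (1, 2)
x_b = sample(2.0, 1.0)   # parameter value Theta_2 = (2, 1)

# Both induce exactly the N(3, 1) distribution, so sample outcomes cannot
# discriminate between Theta_1 and Theta_2:
print(x_a.mean(), x_b.mean())   # both approx 3
print(x_a.std(), x_b.std())     # both approx 1
```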
Note that any choice of positive values for β₀, β₁, μ_T, σ²_T, and σ²_V that result in the same given positive values for μ and σ² result in precisely the same sampling distribution for Y (there is an infinite set of such choices for each value of the vector [μ,σ²]′). Thus the original parameter vector is not identified. Note that the parameter vector [μ,σ²]′ in the latter statistical model for Y is identified since the sampling distributions associated with two different
of Θ (or of q(Θ)). There is widespread use of such phrases in the statistics and econometrics literature, and we will make frequent use of such phrases in this book as well. In general, one must rely on the context of the discussion to be sure whether Θ or q(Θ) refers to the quantity being estimated or merely to the indexing parameter of a family of joint density functions.
7.3 Estimators and Estimator Properties
Point estimation is concerned with estimating Θ or q(Θ) from knowledge of the outcome x = [x₁,...,xₙ]′ of a random sample X. It follows from this basic description of point estimation that functions are critical to the estimation problem, where inputs or domain elements are sample outcomes, x, and outputs or range elements are estimates of Θ or q(Θ). More formally, estimates will be generated via some function of the form t: R(X) → R(t), where R(t) is the range of t defined as R(t) = {t: t = t(x), x ∈ R(X)}. Note that R(t) represents the set of all possible estimates of Θ or q(Θ) that can be generated as outcomes of t(X). We will always tacitly assume that t(X) is an observable random variable, and hence a statistic, so that estimates are observable and empirically informative.
Henceforth, when the function t: R(X) → R(t) represented by t = t(x) is being utilized to generate estimates of q(Θ), we will refer to the random variable T = t(X) as an estimator for q(Θ), and q(Θ) will be referred to as the estimand. An outcome, t = t(x), of the estimator will be referred to as an estimate of q(Θ).
We formalize these three terms in the following definition:

Definition 7.4 Point Estimator, Estimate, and Estimand

A statistic or vector of statistics, T = t(X), whose outcomes are used to estimate the value of a scalar or vector function, q(Θ), of the parameter vector, Θ, is called a point estimator, with q(Θ) being called the estimand.² An observed outcome of an estimator is called a point estimate.

² Note, as always, that the function q can be the identity function q(Θ) ≡ Θ, in which case we could be referring to estimating the vector Θ itself. Henceforth, it will be understood that since q(Θ) ≡ Θ is a possible choice of q(Θ), all discussion of estimating q(Θ) could be referring to estimating the vector Θ itself.
Figure 7.1 contains a schematic overview of the general context of the point estimation problem to this point.
7.3.1 Evaluating Performance of Estimators

Since there is literally an uncountably infinite set of possible functions of X that are potential estimators of q(Θ), a fundamental problem in point estimation is the choice of a "good" estimator. In order to rank the efficacy of estimators and/or to choose the optimal estimator of q(Θ), an objective function that establishes an appropriate measure of "goodness" must be defined.

A natural measure to use in ranking estimators would seem to be the distance between outcomes of t(X) and q(Θ), which is a direct measure of how close estimates are to what is being estimated. In the current context, this distance measure is d(t(x), q(Θ)) = ([t(x) − q(Θ)]′[t(x) − q(Θ)])^{1/2}, which specializes to |t(x) − q(Θ)| when k = 1. However, this closeness measure has an obvious practical flaw for comparing alternative functions for estimating q(Θ) – the estimate that would be preferred depends on the true value of q(Θ), which is unknown (or else there would be no point estimation problem in the first place). This problem is clearly not the fault of the particular closeness measure chosen, since any reasonable measure of closeness between the two values t(x) and q(Θ) would depend on where q(Θ) actually is in ℝᵏ vis-a-vis where t(x) is located. Thus, comparing alternative functions for estimating q(Θ) on the basis of the closeness to q(Θ) of an actual estimate t(x) is not tractable; we clearly need additional criteria with which to judge whether t(X) generates "good" estimates of q(Θ).
[Figure 7.1: Schematic of the point estimation problem; first panel: "Observe Outcome of Random Sample".]
A number of measures of estimator performance for estimating q(Θ) have been presented in the literature.³ The measures evaluate and rank estimators in terms of closeness of estimates to q(Θ) in an expected or probabilistic sense. Note that since t(X) is a function of X, and thus a random variable, a sampling distribution (i.e., the probability distribution of t(X)) exists on R(T) that is induced by the probability distribution of the random sample, X = (X₁,...,Xₙ). Roughly speaking, the fact that the distribution of X depends on Θ will generally result in the sampling distribution of t(X) depending on Θ as well, and this latter dependence can lead to changes in location, spread, and/or shape of the distribution of t(X) as Θ changes. If the sampling distribution of t(X) changes with Θ in a way that keeps the spread of potential estimates generated by t(X) narrowly focused on q(Θ), so that outcomes of t(X) occur near q(Θ) with high probability under all contingencies for Θ ∈ Ω (see Figure 7.2), then the function T would be useful for generating estimates of q(Θ).
We now turn our attention to specific estimator properties that have been used in practice to measure whether these objectives have been achieved. In discussing estimator properties, we will sometimes utilize a Θ-subscript, such as E_Θ(·), P_Θ(·), or var_Θ(·), to emphasize that expectations or probabilities are being calculated using a particular value of Θ for the parameter vector of the underlying probability distribution. In cases where the parametric context of expectations and probabilities is clear or does not need to be distinguished, the subscript Θ will not be explicitly displayed.
7.3.2 Finite Sample Properties

The properties examined in this section evaluate the performance of estimators when the random sample is of fixed size, and they are therefore referred to as finite sample properties. This is as opposed to asymptotic properties, examined later in this section, which relate to limiting results that are established as the random sample size increases without bound (increases to infinity). All of the finite sample properties examined here are based on the first two moments of estimators, and thus relate to the central tendency of estimates as well as the spread of estimates around their central tendency. Of course, if these moments do not exist for a given estimator, then these finite sample properties cannot be used to evaluate the performance of the estimator.
³ A concise review and comparison of a number of alternative criteria is given by T. Amemiya (1994), Introduction to Statistics and Econometrics, Cambridge, MA: Harvard University Press, pp. 118–121.
The MSE criterion accounts for both the degree of spread in the sampling distribution of T as well as the degree to which the central tendency of T's distribution deviates from q(Θ). We will make this notion precise upon defining the concept of bias, as follows.
Definition 7.6 Estimator Bias

The bias of a scalar estimator T of q(Θ) is defined as bias_Θ(T) = E_Θ(T − q(Θ)), ∀Θ ∈ Ω. The bias vector of a vector estimator T of q(Θ) is defined as Bias_Θ(T) = E_Θ(T − q(Θ)), ∀Θ ∈ Ω.
In the multivariate case, the MSE criterion is generalized through the use of the mean square error matrix, MSE_Θ(T) = E_Θ([T − q(Θ)][T − q(Θ)]′).
To appreciate the information content of MSE_Θ(T), first note that the diagonal of the MSE matrix contains the MSE of the estimator Tᵢ for qᵢ(Θ), i = 1,...,k, since the ith diagonal entry in MSE_Θ(T) is E_Θ(Tᵢ − qᵢ(Θ))². More generally, let c be any (k × 1) vector of constants, and examine the MSE of the linear combination c′T = Σᵢ₌₁ᵏ cᵢTᵢ as an estimator of c′q(Θ) = Σᵢ₌₁ᵏ cᵢqᵢ(Θ), defined by

$$MSE_\Theta(c'T) = E_\Theta(c'T - c'q(\Theta))^2 = c'\,MSE_\Theta(T)\,c.$$
$$\mathrm{tr}\big(MSE_\Theta(T)\big) = \mathrm{tr}\Big(E_\Theta\big[(T - q(\Theta))(T - q(\Theta))'\big]\Big) = E_\Theta\big[(T - q(\Theta))'(T - q(\Theta))\big] = E_\Theta\big(d^2(T, q(\Theta))\big).$$

This is the direct vector analogue to the measure of closeness of T to q(Θ) that is provided by the MSE criterion in the scalar case.
The MSE matrix can be decomposed into variance and bias components, analogous to the scalar case. Specifically, MSE(T) is equal to the sum of the covariance matrix of T and the outer product of the bias vector of T, as

$$MSE_\Theta(T) = E_\Theta\Big(\big[T - E_\Theta(T) + E_\Theta(T) - q(\Theta)\big]\big[T - E_\Theta(T) + E_\Theta(T) - q(\Theta)\big]'\Big) = \mathrm{Cov}_\Theta(T) + \mathrm{Bias}_\Theta(T)\,\mathrm{Bias}_\Theta(T)'.$$

The outer product of the bias vector forms a (k × k) matrix that is called the bias matrix.
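The decomposition can be spot-checked by simulation. The sketch below uses a hypothetical normal sampling model and a deliberately biased estimator T = X̄ + 0.1, chosen purely for illustration, and verifies that the simulated MSE equals var(T) plus the squared bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# Deliberately biased estimator T = Xbar + 0.1 under X_i ~ N(mu, sigma^2).
mu, sigma, n, reps = 5.0, 2.0, 25, 200_000
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
t = xbar + 0.1

mse = np.mean((t - mu) ** 2)      # E[(T - q(Theta))^2]
var = np.var(t)                   # approx sigma^2 / n = 0.16
bias = np.mean(t) - mu            # approx 0.1

print(mse, var + bias**2)         # both approx 0.16 + 0.01 = 0.17
```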
In the case of a scalar q(Θ), estimators with smaller MSEs are preferred. Note, however, that since the true Θ is unknown (or else there is no point estimation problem to begin with), one must consider the performance of an estimator for all possible contingencies for the true value of Θ, which is to say, for all Θ ∈ Ω. It is quite possible, and often the case, that an estimator will have lower MSEs than another estimator for some values of Θ ∈ Ω but not for others. These considerations lead to the concepts of relative efficiency and relatively more efficient, where the relative efficiency of T compared to T* is measured by RE_Θ(T,T*) = MSE_Θ(T*)/MSE_Θ(T):

T is relatively more efficient than T* if RE_Θ(T,T*) ≥ 1 ∀Θ ∈ Ω and > 1 for some Θ ∈ Ω.

In comparing two estimators of q(Θ), if T is relatively more efficient than T*, then there is no value of Θ for which T* is preferred to T on the basis of MSE, and for one or more values of Θ, T is preferred to T*. In this case, it is evident that T* can be discarded as an estimator of q(Θ), and T* is said to be an inadmissible estimator of q(Θ), as defined below.
Example: Estimators for the Mean of a Bernoulli Distribution

Let Xᵢ indicate whether (Xᵢ = 1) or not (Xᵢ = 0) the ith customer contacted by telephone solicitation purchases a product. Consider two estimators for the unknown proportion, p, of the consumer population who will purchase the product:
$$T = \bar{X} = n^{-1}\sum_{i=1}^n X_i \quad \text{and} \quad T^* = (n+1)^{-1}\sum_{i=1}^n X_i = \frac{n}{n+1}\bar{X}.$$

Which estimator, if either, is the preferred estimator of p on the basis of MSE given that n = 25? Does either estimator render the other inadmissible?

Answer: Note that bias(T) = E(X̄) − p = 0, bias(T*) = E(n/(n+1) X̄) − p = −p/(n+1) = −p/26, var(T) = p(1−p)/n = p(1−p)/25, and var(T*) = np(1−p)/(n+1)² = p(1−p)/27.04. Then the MSEs of the two estimators are given by
$$MSE(T) = \frac{p(1-p)}{25} \quad \text{and} \quad MSE(T^*) = \frac{p(1-p)}{27.04} + \frac{p^2}{676}.$$

Examine the MSE of T* relative to the MSE of T, as
$$RE_p(T, T^*) = \frac{MSE(T^*)}{MSE(T)} = .9246 + .0370\,\frac{p}{1-p}.$$

Since the ratio depends on the value of p, which is unknown, we must consider all of the possible contingencies for p ∈ [0,1]. Note that the ratio is monotonically increasing in p, taking its smallest value of .9246 when p = 0, and diverging to infinity as p → 1. The ratio of MSEs equals 1 when p = .6708. Thus, without constraints on the potential values of p, neither estimator is preferred to the other on the basis of MSE, and thus neither estimator is rendered inadmissible.
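A short computation reproduces the comparison in this example for n = 25; the code below simply restates the MSE and RE expressions derived above.

```python
# RE_p(T, T*) = MSE(T*)/MSE(T) for n = 25, in the closed form derived above.
n = 25
for p in (0.0, 0.25, 0.6708, 0.9):
    re = (n**2 + n * p / (1 - p)) / (n + 1) ** 2
    print(p, round(re, 4))
# 0.9246 at p = 0, ~1.0 at p = .6708, and 1.2574 at p = .9:
# neither estimator dominates over the whole parameter space.
```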
In contrast to the scalar case, a myriad of different MSE comparisons are possible when q(Θ) is a (k × 1) vector. First of all, there are k individual MSE
If T* is SMSE superior to T, it follows directly from Definition 7.10 that MSE_Θ(T*ᵢ) ≤ MSE_Θ(Tᵢ) ∀i and ∀Θ ∈ Ω, because if MSE_Θ(T*) − MSE_Θ(T) is negative semidefinite, the matrix difference necessarily has nonpositive diagonal entries.⁴ It follows that

$$MSE_\Theta(\ell'T^*) = \ell'\,MSE_\Theta(T^*)\,\ell \le \ell'\,MSE_\Theta(T)\,\ell = MSE_\Theta(\ell'T) \quad \forall \Theta \in \Omega \text{ and } \forall \ell.$$

Thus in the sense of all of the MSE comparisons defined previously, T* is at least as good as T.
The fact that MSE_Θ(T*) − MSE_Θ(T) is negative semidefinite and unequal to the zero matrix for some Θ ∈ Ω implies that some of the aforementioned MSE comparisons become strong inequalities (<) for some Θ. To see this, note that a nonzero negative semidefinite symmetric matrix necessarily has one or more negative diagonal entries.⁵ Therefore, MSE_Θ(T*ᵢ) < MSE_Θ(Tᵢ) for some Θ and i, so that E_Θ(d²(T*, q(Θ))) < E_Θ(d²(T, q(Θ))) for some Θ and MSE_Θ(ℓ′T*) < MSE_Θ(ℓ′T) for some Θ and ℓ. Thus, T* is superior to T for at least some MSE comparisons in addition to being no worse for any of the MSE comparisons. We can now define multivariate analogues to the notions of relative efficiency and admissibility.
⁴ By definition, A is negative semidefinite iff ℓ′Aℓ ≤ 0 ∀ℓ. The ith diagonal entry of A is given by ℓ′Aℓ with ℓ being a zero vector except for a 1 in the ith position.
⁵ A nonzero matrix has at least unit rank. The rank of a negative semidefinite symmetric matrix is equal to the number of negatively valued eigenvalues, and all eigenvalues of a negative semidefinite matrix are nonpositive. The trace of a matrix is equal to the sum of its eigenvalues. Since all diagonal entries in a negative semidefinite matrix must be nonpositive, a nonzero negative semidefinite symmetric matrix must have one or more negative diagonal entries.
an optimal estimator exists that has the smallest MSE or MSE matrix among all estimators of q(Θ). We might call such an estimator most efficient, or simply efficient. Unfortunately, no such estimator exists in general. To clarify the issues involved, consider the scalar case and note that the degenerate estimator T* = t*(X) = Θ₀ would certainly have minimum mean-square error for estimating Θ if mean-square error were evaluated at the point Θ = Θ₀, i.e., MSE_Θ(T*) = 0 for Θ = Θ₀. Since a similar degenerate estimator could be defined for each Θ ∈ Ω, then for a given estimator to have minimum mean-square error for every potential value of Θ (i.e., uniformly in Θ) it would be necessary that MSE_Θ(T) = 0 ∀Θ ∈ Ω, which would imply that var_Θ(T) = 0 ∀Θ ∈ Ω, and thus, that P_Θ(t(x) = Θ) = 1 ∀Θ ∈ Ω. In order to construct an estimator T that satisfies the condition P(t(x) = Θ) = 1 ∀Θ ∈ Ω, it would be necessary to be able to identify the true value of Θ directly upon observing the sample outcome, x. This essentially requires that the range of the random sample be dependent on the value of Θ, denoted as R_Θ(X), in such a way that the sets R_Θ(X), Θ ∈ Ω, are all mutually exclusive, i.e., R_Θ′(X) ∩ R_Θ″(X) = ∅ for Θ′ ≠ Θ″. Then, upon observing x, one would only need to identify the set R_Θ(X) to which x belonged, and Θ would be immediately known. This is rarely, if ever, possible in practice, and so adopting a minimum mean-square error criterion for choosing an estimator of q(Θ) is not feasible. A similar argument leads to the conclusion that there is in general no estimator of a (k × 1) vector q(Θ) whose MSE matrix is smallest among the MSE matrices of all estimators of q(Θ).⁷
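A small numerical sketch makes the argument tangible (hypothetical normal sampling with σ² = 1 and n = 20): the degenerate estimator T* ≡ 3 attains zero MSE at Θ = 3 but is badly dominated by the sample mean elsewhere, so no single estimator can minimize MSE uniformly in Θ.

```python
import numpy as np

# MSE of the degenerate estimator T* = 3.0 versus the sample mean Xbar,
# as a function of the true theta, for X_i ~ N(theta, 1) with n = 20.
n = 20
thetas = np.linspace(0.0, 6.0, 7)

mse_degenerate = (3.0 - thetas) ** 2          # zero at theta = 3, large elsewhere
mse_xbar = np.full_like(thetas, 1.0 / n)      # sigma^2 / n = 0.05 for every theta

for th, m_deg, m_bar in zip(thetas, mse_degenerate, mse_xbar):
    print(th, m_deg, m_bar)
```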
While there generally does not exist an estimator that has a uniformly (i.e., for all Θ ∈ Ω) minimum MSE or MSE matrix relative to all other estimators of q(Θ), it is often possible to find an optimal estimator if one restricts the type of estimators under consideration. Two such restrictions that have been widely used in practice are unbiasedness and linearity, which we will examine in the next two subsections.
⁶ Some analysts use a weak mean square error (WMSE) criterion that relates only to expected squared distance considerations: T* is WMSE superior to T iff E_Θ(d²(T*, q(Θ))) ≤ E_Θ(d²(T, q(Θ))) ∀Θ ∈ Ω, and < for some Θ ∈ Ω. Relative efficiency and admissibility can be defined in the context of WMSE superiority and are left to the reader.

⁷ By "smallest MSE matrix," we mean that MSE_Θ(T*) − MSE_Θ(T) is a negative semidefinite matrix for all estimators T of q(Θ) and for all Θ.
Note the significance of the condition ∀Θ ∈ Ω in the above definition. In the context of the point estimation problem, we have assumed that the true value of Θ, say Θ*, is some element of the specified parameter space, Ω, but we do not know which one. Thus, the property of unbiasedness is stated for all possible contingencies regarding the potential values for the true value of Θ. Due to the condition ∀Θ ∈ Ω, the requirement for unbiasedness essentially means that E_Θ(T) = q(Θ) regardless of which value of Θ ∈ Ω is the true value. Thus, for T to be unbiased, its density function must be balanced on the point q(Θ), whatever the true value of Θ. Whether or not T has the unbiasedness property depends on the functional definition of T, and in particular, on how the function translates the density function of X ~ f(x₁,...,xₙ;Θ) into the density function of T ~ f(t;Θ).
An unbiased estimator has the intuitively appealing property of being equal to q(Θ) on average, the phrase having two useful interpretations. First, since the expectation operation is inherently a weighted average of the outcomes of T, the outcomes of T have a weighted average equal to q(Θ). Alternatively, if one were to repeatedly and independently observe outcomes of the random sample X, and thus repeatedly generate estimates of q(Θ) using corresponding outcomes of the vector T, then the simple average of all of the observed estimates would converge in probability (and, in fact, converge almost surely) elementwise to q(Θ) by Khinchin's WLLN (or by Kolmogorov's SLLN in the case of almost-sure convergence), provided only that q(Θ) is finite.
We provide the following example of an unbiased estimator of a parameter.

Example 7.4 Unbiased Estimation of the Mean Operating Life of Hard Disks

Let (X₁,...,Xₙ) be a random sample from an exponential population distribution with mean θ representing the operating lives of hard disks, and let T = X̄ = n⁻¹Σᵢ₌₁ⁿ Xᵢ. Then E(T) = E(n⁻¹Σᵢ₌₁ⁿ Xᵢ) = n⁻¹Σᵢ₌₁ⁿ E(Xᵢ) = θ regardless of the value of θ > 0. Thus, for example, if the true value of θ were 2, then E(T) = 2, or if the true value of θ were 100, then E(T) = 100. □
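A quick Monte Carlo check of Example 7.4 (assuming, as in the example, an exponential population with mean θ; the particular θ, n, and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

theta, n, reps = 100.0, 10, 100_000
# X_i ~ exponential with mean theta; T = Xbar should average to theta.
t = rng.exponential(scale=theta, size=(reps, n)).mean(axis=1)
print(t.mean())   # approx 100, i.e., E(T) = theta
```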
MVUE, MVLUE or BLUE, and Efficiency. The unbiasedness criterion ensures only that an estimator will have a density that has a central tendency or balancing point of q(Θ). However, it is clear that we would also desire that the density not be too spread out around this balancing point, for fear that an estimate could be generated that was a significant distance from q(Θ) with high probability. Graphically, we would prefer the estimator T to the estimator T* in Figure 7.3, where both of these estimators are unbiased estimators of q(Θ).
The foregoing considerations motivate that, if one wishes to use an unbiased estimator of q(Θ), one should use the unbiased estimator that also has minimum variance, or minimum covariance matrix if T is a vector, among all unbiased estimators of q(Θ). Since Bias_Θ(T) = 0 for all estimators in the unbiased class of estimators, MSE_Θ(T) = var_Θ(T) or MSE_Θ(T) = Cov_Θ(T), and we can thus view the objective of minimizing var(T) or Cov(T) equivalently as searching for the estimator with the smallest MSE or smallest MSE matrix within the class of unbiased estimators. In the definition below, we introduce the notation A ≼ B to indicate that matrix A is smaller than matrix B by a negative semidefinite matrix, i.e., A − B = C is a negative semidefinite matrix.
Definition 7.13 implies that an estimator is a MVUE if the estimator is unbiased and if there is no other unbiased estimator that has a smaller variance or covariance matrix for any Θ ∈ Ω. Drawing direct analogies to the discussion of the MSE criteria, a MVUE, T, is such that MSE_Θ(Tᵢ) = var_Θ(Tᵢ) ≤ var_Θ(T*ᵢ) = MSE_Θ(T*ᵢ) ∀Θ ∈ Ω and ∀i, where T* is any estimator in the unbiased class of estimators. Furthermore, E_Θ(d²(T, q(Θ))) ≤ E_Θ(d²(T*, q(Θ))) ∀Θ ∈ Ω, and MSE_Θ(ℓ′T) = var_Θ(ℓ′T) ≤ var_Θ(ℓ′T*) = MSE_Θ(ℓ′T*) ∀Θ ∈ Ω and ∀ℓ. Thus, within the class of unbiased estimators, a MVUE of q(Θ) is at least as good as any other estimator of q(Θ) in terms of all of the types of MSE comparisons that we have discussed previously. If T is a MVUE for q(Θ), then T is said to be efficient within the class of unbiased estimators.
Unfortunately, without the aid of theorems that facilitate the discovery of MVUEs, finding a MVUE of q(Θ) can be quite challenging even when the point estimation problem appears to be quite simple. The following example illustrates the general issues involved.
Example 7.5 MVUE of the Mean of a Bernoulli Distribution

Consider defining an MVUE for the parameter p using a random sample of size 2 from the Bernoulli population distribution f(z;p) = p^z(1−p)^{1−z} I_{{0,1}}(z). First of all, the range of the random sample is {(0,0), (0,1), (1,0), (1,1)}, which represents the domain of the estimator function T = t(X). For t(X) to be in the unbiased class, the following general condition must be met:

$$E(t(X)) = t(0,0)(1-p)^2 + \big[t(0,1) + t(1,0)\big]p(1-p) + t(1,1)p^2 = p \quad \forall p \in (0,1),$$

which requires t(0,0) = 0, t(1,1) = 1, and t(0,1) + t(1,0) = 1. The variance of t(X) can then be written var(t(X)) = E(t(X) − p)², where we have used the facts that E(t(X)) = p, t(0,0) = 0, and t(1,1) = 1 since t(X) must be unbiased. Also because of the unbiasedness condition, we can substitute t(0,1) = 1 − t(1,0) into the variance expression to obtain
$$\mathrm{var}(t(X)) = 2p^2(1-p)^2 + \big(1 - p - t(1,0)\big)^2(1-p)p + \big(t(1,0) - p\big)^2 p(1-p).$$

The first-order condition for a minimum of the variance is given by
$$\frac{d\,\mathrm{var}(t(X))}{d\,t(1,0)} = -2(1-p)^2 p + 2t(1,0)p(1-p) + 2t(1,0)p(1-p) - 2p^2(1-p) = 0,$$

which implies that 4p(1−p)t(1,0) = 2(1−p)²p + 2p²(1−p), so that t(1,0) = 1/2, which then implies t(0,1) = 1/2.
We have thus defined the function T = t(X) that represents a MVUE of p by associating an appropriate outcome of T with each random sample outcome. The preceding results can be represented collectively as t(x₁,x₂) = (1/2)(x₁ + x₂). □
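Because the sample space contains only four points, the calculations in Example 7.5 can be verified exactly by enumeration. The sketch below scans the single free value c = t(1,0) (with t(0,1) = 1 − c forced by unbiasedness) and confirms that the variance is minimized at c = 1/2 for every p tried.

```python
import numpy as np

# Exact enumeration for Example 7.5: X = (X1, X2) iid Bernoulli(p).
# Unbiasedness forces t(0,0) = 0, t(1,1) = 1, and t(0,1) = 1 - t(1,0).
def var_t(c, p):
    probs = {(0, 0): (1 - p)**2, (0, 1): (1 - p) * p,
             (1, 0): p * (1 - p), (1, 1): p**2}
    vals = {(0, 0): 0.0, (0, 1): 1.0 - c, (1, 0): c, (1, 1): 1.0}
    ev = sum(probs[x] * vals[x] for x in probs)            # equals p
    return sum(probs[x] * (vals[x] - ev)**2 for x in probs)

for p in (0.2, 0.5, 0.8):
    cs = np.linspace(0.0, 1.0, 101)
    variances = [var_t(c, p) for c in cs]
    print(p, cs[int(np.argmin(variances))])   # minimized at c = 0.5 for every p
```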
A number of general theorems that can often be used to simplify the search for a MVUE will be presented in Section 7.5.
For purposes of simplicity and tractability, as well as for cases where little can be assumed about the probability model other than conditions on low-order moments, attention is sometimes restricted to estimators that are unbiased and that have minimum variance or covariance matrix among all unbiased estimators that are linear functions of the sample outcome. Such an estimator is called a BLUE or MVLUE, as indicated in the following definition.
Definition 7.14 Best Linear Unbiased Estimator (BLUE) or Minimum Variance Linear Unbiased Estimator (MVLUE)

An estimator T is said to be a BLUE or MVLUE of q(Θ) iff

1. T is a linear function, T = t(X) = AX + b, of the random sample X,
2. T is unbiased, i.e., E_Θ(T) = q(Θ) ∀Θ ∈ Ω, and
3. var_Θ(T) ≤ var_Θ(T*), or Cov_Θ(T) ≼ Cov_Θ(T*) in the vector case, ∀Θ ∈ Ω, for any other linear unbiased estimator T* of q(Θ).
Example 7.6 BLUE of Population Mean for Any Population Distribution

Let (X₁,...,Xₙ) be a random sample from some population distribution f(z;Θ) having a finite mean μ = q₁(Θ) and variance σ² = q₂(Θ). What is the BLUE of the mean of the population distribution?

Answer: We are examining linear estimators, and thus t(X) = Σᵢ₌₁ⁿ aᵢXᵢ + b. For T to be unbiased, we require that Σᵢ₌₁ⁿ aᵢ = 1 and b = 0, since E(T) = E(Σᵢ₌₁ⁿ aᵢXᵢ + b) = Σᵢ₌₁ⁿ aᵢE(Xᵢ) + b = μΣᵢ₌₁ⁿ aᵢ + b, which equals μ for all admissible values of μ iff Σᵢ₌₁ⁿ aᵢ = 1 and b = 0. The variance of T is simply σ²Σᵢ₌₁ⁿ aᵢ² because (X₁,...,Xₙ) is a random sample from f(z;Θ). Thus, to find the BLUE, we must solve the minimization problem: minimize σ²Σᵢ₌₁ⁿ aᵢ² subject to Σᵢ₌₁ⁿ aᵢ = 1, with Lagrangian L = σ²Σᵢ₌₁ⁿ aᵢ² + λ(1 − Σᵢ₌₁ⁿ aᵢ). The first-order conditions are

$$\frac{\partial L}{\partial a_i} = 2\sigma^2 a_i - \lambda = 0, \quad i = 1,\ldots,n, \quad \text{and} \quad \frac{\partial L}{\partial \lambda} = 1 - \sum_{i=1}^n a_i = 0.$$
The first n conditions imply a₁ = a₂ = ⋯ = aₙ, since aᵢ = λ/2σ² ∀i, and then Σᵢ₌₁ⁿ aᵢ = 1 requires that aᵢ = 1/n, i = 1,...,n. Thus, t(X) = Σᵢ₌₁ⁿ aᵢXᵢ = n⁻¹Σᵢ₌₁ⁿ Xᵢ = X̄, so that the sample mean is the BLUE (or MVLUE) of the mean of any population distribution having a finite mean and variance. The reader should check that the second-order conditions for a minimum are in fact met. □
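The constrained minimization in Example 7.6 is easy to spot-check numerically: among weight vectors summing to one, the equal weights aᵢ = 1/n minimize Σaᵢ². The sketch below (hypothetical values σ² = 4 and n = 10) compares the equal-weight variance with a few randomly drawn alternatives.

```python
import numpy as np

rng = np.random.default_rng(3)

# Among linear unbiased estimators sum(a_i X_i) with sum(a_i) = 1, the
# variance sigma^2 * sum(a_i^2) is minimized by the equal weights a_i = 1/n.
n, sigma2 = 10, 4.0

equal = np.full(n, 1.0 / n)
print(sigma2 * np.sum(equal**2))      # sigma^2 / n = 0.4 (the BLUE's variance)

for _ in range(5):
    a = rng.uniform(0.5, 1.5, size=n)
    a /= a.sum()                       # random positive weights with sum = 1
    print(sigma2 * np.sum(a**2))       # always >= 0.4
```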
In addition to estimating the means of population distributions, a prominent BLUE arises in the context of least-squares estimation of the parameters of a general linear model, which we will examine in Chapter 8.
7.3.3 Asymptotic Properties

When finite sample properties are intractable or else inapplicable due to the nonexistence of the appropriate expectations that define means and variances, one generally resorts to asymptotic properties to rank the efficiency of estimators. In addition, asymptotic properties are of fundamental interest if the analyst is interested in assessing the effects on estimator properties of an ever-increasing number of sample observations.
Asymptotic properties of estimators are essentially equivalent in concept to the finite sample properties presented heretofore, except that asymptotic properties are based on the asymptotic distributions of estimators rather than estimators' exact finite sampling distributions. In particular, asymptotic analogues to MSE, relative efficiency, unbiasedness, and minimum-variance unbiasedness can be defined with reference to asymptotic distributions of estimators. However, a problem of nonuniqueness of asymptotic properties arises because of the inherent nonuniqueness of asymptotic distributions.
To clarify the difficulties that can arise when using asymptotic distributions as a basis for defining estimator properties, let Tₙ denote an estimator of the scalar q(Θ) based on n sample observations, and suppose bₙ⁻¹(Tₙ − q(Θ)) →d N(0,1). Then one might consider defining asymptotic properties of Tₙ in terms of the asymptotic distribution N(q(Θ), bₙ²). However, by Slutsky's theorem it follows that (n/(n−k))^{1/2} bₙ⁻¹(Tₙ − q(Θ)) →d N(0,1) for a fixed value of k, since (n/(n−k))^{1/2} → 1, so that an alternative asymptotic distribution could be Tₙ ~a N(q(Θ), ((n−k)/n)bₙ²), producing a different asymptotic variance with implications for estimator performance measures that are functions of the variance of estimators. The difficulty is that the centering and scaling required to achieve a limiting distribution is not unique, leading to both nonunique asymptotic distributions and nonunique asymptotic properties derived from them.
There are two basic ways of addressing the aforementioned nonuniqueness problem when dealing with asymptotic properties. One approach, which we will mention only briefly, is to rank estimators only on the basis of limits of asymptotic property comparisons so as to remove the effects of any arbitrary scaling or centering from the comparison. For example, referring to the previous illustration of nonuniqueness, let the asymptotic distribution of Tₙ be N(q(Θ), bₙ²) and that of Tₙ* be N(q(Θ), ((n−k)/n)bₙ²), so that

$$ARE_\Theta(T_n, T_n^*) = \frac{AMSE_\Theta(T_n^*)}{AMSE_\Theta(T_n)} = \frac{n-k}{n}.$$

Using the ARE in this form, one would be led to the conclusion that Tₙ* is asymptotically relatively more efficient than Tₙ, which in the context of the previous illustration of nonuniqueness would be absurd since Tₙ and Tₙ* are the same estimator. However, limₙ→∞ ARE_Θ(Tₙ, Tₙ*) = 1, so that ranking on the basis of the limit of the comparison removes the effect of the arbitrary scaling.
An alternative approach for avoiding nonuniqueness of asymptotic properties is to restrict the use of asymptotic properties to classes of estimators for which the problem will not occur. For our purposes, it will suffice to examine the consistent asymptotically normal (CAN) class of estimators (for other possibilities, see E. Lehmann, Point Estimation, pp. 347–348).

Prior to identifying the CAN class of estimators, we examine the property of consistency.
Consistency. A consistent estimator is an estimator that converges in probability (element-wise if Tₙ is a vector) to what is being estimated.

Definition 7.15 Consistent Estimator

Tₙ is said to be a consistent estimator of q(Θ) iff plim_Θ(Tₙ) = q(Θ) ∀Θ ∈ Ω.
Thus, for large enough n (i.e., for large enough sample size), there is a high probability that the outcome of a scalar estimator Tₙ will be in the interval (q(Θ) − ε, q(Θ) + ε) for arbitrarily small ε > 0, regardless of the value of Θ. Relatedly, the sampling density of Tₙ concentrates on the true value of q(Θ) as the sample size → ∞ if Tₙ is a consistent estimator of q(Θ). Consistency is clearly a desirable property of an estimator, since it ensures that increasing sample information will ultimately lead to an estimate that is essentially certain to be arbitrarily close to what is being estimated, q(Θ).
Since Tₙ →m q(Θ) implies Tₙ →p q(Θ), we can state sufficient conditions for consistency of Tₙ in terms of unbiasedness and in terms of variance convergence to zero. Specifically, if Tₙ is unbiased, or if the bias vector converges to zero as n → ∞, and if var(Tₙ) → 0 as n → ∞, or Cov(Tₙ) → 0 as n → ∞ if Tₙ is a vector, then Tₙ is a consistent estimator of q(Θ) by mean-square convergence.
Example 7.7 Sample Mean as a Consistent Estimator of the Population Mean

as n → ∞. Note the following counterexample.
Example 7.8 Consistency without E(Tₙ) → q(Θ)

Let the sampling density of Tₙ be defined as f(tₙ;Θ) = (1 − n^{−1/2})I_{{Θ}}(tₙ) + n^{−1/2}I_{{n}}(tₙ). Note that as n → ∞, limₙ→∞ P[|tₙ − Θ| < ε] = 1 for any ε > 0, and Tₙ is consistent for Θ. However, since E(Tₙ) = Θ(1 − n^{−1/2}) + n(n^{−1/2}) = Θ(1 − n^{−1/2}) + n^{1/2}, then as n → ∞, E(Tₙ) → ∞. □

The divergence of the expectation in Example 7.8 is due to the fact that the density function of Tₙ, although collapsing to the point Θ as n → ∞, was not collapsing at a fast enough rate for the expectation to converge to Θ. In particular, the density weighting assigned to the outcome n in defining the expectation went to zero at a rate slower than n went to infinity as n → ∞, causing the divergence.
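Both claims in Example 7.8 can be tabulated exactly from the two-point density: P(Tₙ = Θ) = 1 − n^{−1/2} → 1 while E(Tₙ) = Θ(1 − n^{−1/2}) + n^{1/2} → ∞. A minimal sketch (Θ = 2 is an arbitrary choice):

```python
theta = 2.0
for n in (10, 100, 10_000, 1_000_000):
    p_far = n ** -0.5                        # P(T_n = n)
    prob_at_theta = 1 - p_far                # P(T_n = theta) -> 1 (consistency)
    e_tn = theta * (1 - p_far) + n * p_far   # = theta(1 - n^(-1/2)) + n^(1/2)
    print(n, prob_at_theta, e_tn)            # expectation diverges as n grows
```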
A sufficient condition for Tₙ →p q(Θ) ⇒ limₙ→∞ E(Tₙ) = q(Θ) is provided in the following theorem:

Theorem 7.1 Sufficient Condition for Tₙ →p q(Θ) ⇒ limₙ→∞ E(Tₙ) = q(Θ)

If E(Tₙ²) exists and is bounded ∀n, so that E(Tₙ²) ≤ c < ∞ ∀n for some finite constant c, then convergence in probability implies convergence in mean.
Proof: Rao, Statistical Inference, p. 121. ∎
Note that the sufficient condition given in Theorem 7.1 does not hold in Example 7.8.
Consistent Asymptotically Normal (CAN) Estimators. The class of consistent asymptotically normal (CAN) estimators of q(Θ) is defined in the statistical literature to be the collection of all estimators of q(Θ) for which n^{1/2}(Tₙ − q(Θ)) →d N([0], Σ_T), where Σ_T is a positive definite covariance matrix that may depend on the value of Θ. We will allow this dependence to be implicit rather than utilize notation such as Σ_T(Θ). Note the consistency of Tₙ follows immediately, since by Slutsky's theorem n^{−1/2}[n^{1/2}(Tₙ − q(Θ))] = Tₙ − q(Θ) →d 0·Z = 0, where Z ~ N(0, Σ_T), which implies Tₙ − q(Θ) →d 0 and hence Tₙ →p q(Θ), since convergence in distribution to a constant implies convergence in probability. The CAN class contains a large number of the estimators used in empirical work.
Because all of the estimators in the CAN class utilize precisely the same sequence of centering (i.e., q(Θ) is subtracted from Tₙ) and scaling (i.e., Tₙ − q(Θ) is multiplied by n^{1/2}), the problem of nonuniqueness of asymptotic distributions and properties does not arise. Asymptotic versions of MSEs, MSE matrices, bias vectors, variances, and covariance matrices can be defined via expectations taken with respect to the unique asymptotic distribution of estimators, where Tₙ ~a N(q(Θ), n⁻¹Σ_T). In particular, letting the prefix A denote an asymptotic property, and letting E_A denote an expectation taken with respect to an asymptotic distribution, we have within the CAN class, ∀Θ ∈ Ω, ABias_Θ(Tₙ) = E_A(Tₙ − q(Θ)) = 0 and ACov_Θ(Tₙ) = n⁻¹Σ_T.
The zero value of the asymptotic bias indicates that a CAN estimator of q(Θ) is necessarily asymptotically unbiased. We pause to note that there is a lack of consensus in the literature regarding the definition of asymptotic unbiasedness, and Example 7.8 is useful for illustrating the issues involved. Some statisticians define asymptotic unbiasedness of an estimator sequence in terms of the limit of the expected values of the estimators in the sequence, where limₙ→∞ E(Tₙ) = q(Θ) ∀Θ ∈ Ω characterizes an asymptotically unbiased estimator. Under this definition, the estimator in Example 7.8 would not be asymptotically unbiased, but rather would be asymptotically biased. It is clear that this definition of asymptotic unbiasedness requires that the expectations in the sequence exist, as they do in Example 7.8. Within the CAN class, the two definitions of asymptotic unbiasedness will coincide if the second order moments of the estimators in the sequence {Tₙ} are bounded (recall Theorem 7.1), since then limₙ→∞ E(Tₙ) = q(Θ) = E_A(Tₙ). Otherwise, the definitions may refer to different concepts of unbiasedness, as Example 7.8 demonstrates. Thus, one must discern the definition of asymptotic unbiasedness being used by any analyst from the context of the discussion.
Given the preceding definition of asymptotic properties, we can now define the meaning of asymptotic relative efficiency and asymptotic admissibility uniquely for CAN estimators.

Definition 7.16 Asymptotic Relative Efficiency and Asymptotic Admissibility

Let Tₙ and Tₙ* be CAN estimators of q(Θ) such that n^{1/2}(Tₙ − q(Θ)) →d N(0, Σ_T) and n^{1/2}(Tₙ* − q(Θ)) →d N(0, Σ_{T*}).

b. Tₙ is asymptotically relatively more efficient than Tₙ* iff Σ_T − Σ_{T*} is negative semidefinite ∀Θ ∈ Ω and Σ_T − Σ_{T*} ≠ 0 for some Θ ∈ Ω.

c. If there exists an estimator that is asymptotically relatively more efficient than Tₙ, then Tₙ is asymptotically inadmissible. Otherwise Tₙ is asymptotically admissible.
A discussion of the meaning of ARE and asymptotic admissibility, as well as all of the other asymptotic properties presented to this point, would be completely analogous to the discussion presented in the finite sample case, except now all interpretations would be couched in terms of approximations based on asymptotic distributions. We leave it to the reader to draw the analogies.

Example 7.9 Relative Asymptotic Efficiency of Two Estimators of the Exponential Mean

Recall Example 7.4 regarding the estimation of the expected operating lives of hard disks, θ, using a random sample from an exponential population distribution. As an alternative estimator of θ, consider the following:

$$T_n^* = t_n^*(X) = \left(\tfrac{1}{2}\, n^{-1} \sum_{i=1}^n X_i^2\right)^{1/2}$$
where μ₄′ = [d⁴(1 − θt)⁻¹/dt⁴]_{t=0} = 24θ⁴. Now note that Tₙ* is a continuous function of M₂′, so that plim(Tₙ*) = (plim(M₂′)/2)^{1/2} = (2θ²/2)^{1/2} = θ, and Tₙ* is consistent for θ. Moreover, M₂′ ~a N(2θ², 20θ⁴/n), since var(X²) = μ₄′ − (2θ²)² = 24θ⁴ − 4θ⁴ = 20θ⁴. With G denoting the derivative of (m/2)^{1/2} evaluated at m = 2θ², i.e., G = 1/(4θ), it follows that n^{1/2}(Tₙ* − θ) →d N(0, G[20θ⁴]G′) = N(0, 1.25θ²).

In comparing Tₙ* with X̄ₙ as estimators of θ, it is now clear that although both are consistent and asymptotically normal estimators of θ, X̄ₙ is asymptotically more efficient than Tₙ*, since in comparing the asymptotic variances of the limiting distributions of n^{1/2}(X̄ₙ − θ) and n^{1/2}(Tₙ* − θ), we have that θ² < 1.25θ². □
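A Monte Carlo sketch of Example 7.9 (θ = 2 and the simulation sizes are arbitrary choices): the variances of X̄ₙ and Tₙ*, each scaled by n, should approach θ² = 4 and 1.25θ² = 5, respectively.

```python
import numpy as np

rng = np.random.default_rng(4)

theta, n, reps = 2.0, 200, 20_000
x = rng.exponential(scale=theta, size=(reps, n))

xbar = x.mean(axis=1)                         # T_n  = sample mean
tstar = np.sqrt(0.5 * (x**2).mean(axis=1))    # T*_n = (M2'/2)^(1/2)

print(n * xbar.var())    # approx theta^2      = 4.0
print(n * tstar.var())   # approx 1.25 theta^2 = 5.0
```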
Asymptotic Efficiency. At this point it would seem logical to proceed to a definition of asymptotic efficiency in terms of a choice of estimator in the CAN class that has the smallest asymptotic variance or covariance matrix ∀Θ ∈ Ω (compare to Definition 7.13). Unfortunately, LeCam (1953)⁹ has shown that such an estimator does not exist without further restrictions on the class of estimators. In particular, LeCam (1953) effectively showed that for any CAN estimator one can always define an alternative estimator that has a smaller variance or covariance matrix for at least one Θ ∈ Ω. The implication of this result is that one cannot define an achievable lower bound to the asymptotic variances or covariance matrices of CAN estimators, so that no asymptotically optimal estimator exists.
On the other hand, LeCam (1953) also showed that under mild regularity conditions, there does exist a lower bound to the asymptotic variance or covariance matrix of a CAN estimator that holds for all Θ ∈ Ω except on a set of Θ-values having Lebesgue measure zero, which is the Cramer-Rao Lower Bound that will be discussed in Section 7.5. Note that the Lebesgue measure of a set of Θ-values can be thought of as the volume of the set within the k-dimensional parameter space. A set having Lebesgue measure zero is a set with zero volume in k-space, e.g., a collection of isolated points, or a set of points having dimension less than k (such as a square and its interior in a three-dimensional space, or a line in two-dimensional space). A set of Lebesgue measure zero is a nonstochastic analogue to a set having probability zero, and such a set is thus practically irrelevant relative to its complement. It is thus meaningful to speak
⁹ LeCam, L. (1953), "On Some Asymptotic Properties of Maximum Likelihood Estimates and Related Bayes Estimates," University of California Publications in Statistics, 1:277–330.
of a lower bound on the asymptotic variance or covariance matrix of a CAN estimator of q(Θ) that holds almost everywhere in the parameter space (i.e., except for a set of Lebesgue measure zero), and then a search for an estimator that achieves this bound becomes meaningful as well.
At this point we will state a general definition of asymptotic efficiency for CAN estimators. In Section 7.5, we will be much more precise about the functional form of the asymptotic covariance matrix of an asymptotically efficient estimator.

Definition 7.17 Asymptotic Efficiency

If Tₙ is a CAN estimator of q(Θ) having the smallest asymptotic covariance matrix among all CAN estimators ∀Θ ∈ Ω, except on a set of Lebesgue measure zero, Tₙ is said to be asymptotically efficient.
As a final remark, it is possible to remove the qualifier "except on a set of Lebesgue measure zero" if the CAN class of estimators is further restricted so that only estimators that converge uniformly to the normal distribution are considered. Roughly speaking, uniform convergence of a function sequence Fₙ(x) to F(x) requires that the rate at which convergence occurs is uniform across all x in the domain of F(x), unlike ordinary convergence (recall Definition 5.7), which allows for the possibility that the rate is different for each x. The restricted class of estimators is called the Consistent Uniformly Asymptotically Normal (CUAN) class, and within the CUAN class it is meaningful to speak of an estimator that literally has the smallest asymptotic covariance matrix. The interested reader can consult C.R. Rao (1963), "Criteria of Estimation in Large Samples," Sankhya, Series A, pp. 189–206, for further details.
7.4 Sufficient Statistics
Sufficient statistics for a given estimation problem are a collection of statistics or, equivalently, a collection of functions of the random sample, that summarize or represent all of the information in a random sample that is useful for estimating any q(Θ). Thus, in place of the original random sample outcome, it is sufficient to have observations on the sufficient statistics to estimate any q(Θ). Of course, the random sample itself is a collection of n sufficient statistics, but an objective in defining sufficient statistics is to reduce the number of functions of the random sample needed to represent all of the sample information relevant for estimating q(Θ). If a small collection of sufficient statistics can be found for a given statistical model, then for defining estimators of q(Θ) it is sufficient to consider only functions of the smaller set of sufficient statistic outcomes as opposed to functions of all n outcomes contained in the original random sample. In this way the sufficient statistics allow a data reduction step to occur in a point estimation problem. Relatedly, it will be shown that the search for estimators of q(Θ) having the MVUE property or small MSEs can always be restricted to functions of the smallest collection of sufficient statistics. Finally, if the sufficient statistics have a special property, referred to as completeness, then an explicit procedure utilizing the complete sufficient statistics is available that is often useful in defining MVUEs. We begin by presenting a more rigorous definition of sufficient statistics.
Definition 7.18 Sufficient Statistics

Let X = [X₁,...,Xₙ]′ ~ f(x;Θ) be a random sample, and let s = [s₁(X),...,s_r(X)]′ be r statistics. The r statistics are said to be sufficient statistics for f(x;Θ) iff f(x;Θ|s) = h(x), i.e., the conditional density of X, given s, does not depend on the parameter vector Θ.¹⁰
An intuitive interpretation of Definition 7.18 is that once the outcomes of the r sufficient statistics are observed, there is no additional information on Θ in the sample outcome. The definition also implies that given the function values s(x) = s, no other function of X provides any additional information about Θ than that obtained from the outcomes s. To motivate these interpretations, first note that the conditional density function f(x;Θ|s) can be viewed as representing the probability distribution of all of the various ways in which random sample outcomes, x, occur so as to generate the conditional value of s. This is because the event being conditioned on requires that x satisfy s(x) = s. Definition 7.18 states that if S is a vector of sufficient statistics, then Θ is a ghost in f(x;Θ|s), i.e., the conditional density function really does not depend on the value of Θ since f(x;Θ|s) = h(x). It follows that the probabilistic behavior of the various ways in which x results in s(x) = s has nothing to do with Θ, i.e., it is independent of Θ. Thus, analyzing the various ways in which a given value of s can occur, or examining additional functions of X, cannot possibly provide any additional information about Θ, since the behavior of the outcomes of X, conditioned on the fact that s(x) = s, is totally unrelated to Θ.
Example 7.10 Sufficient Statistic for Bernoulli Population Distribution
Let (X₁,...,Xₙ) be a random sample from a Bernoulli population distribution representing whether phone call solicitations to potential customers result in a sale, so that

f(\mathbf{x};p) = p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}\prod_{i=1}^n I_{\{0,1\}}(x_i),

where p ∈ Ω = (0,1), xᵢ = 1 denotes a sale, and xᵢ = 0 denotes no sale on the ith call. In this case, Σᵢ₌₁ⁿ Xᵢ, representing the total number of sales in the sample, is a sufficient statistic for f(x;p). To see that this is true, first note that the appropriate conditioning event, in the context of Definition 7.18, would be s(x) = Σᵢ₌₁ⁿ xᵢ = s, i.e., the total number of
¹⁰ Note that the conditional density function referred to in this definition is degenerate in the general sense alluded to in Section 3.10, footnote 20. That is, since (x₁,...,xₙ) satisfies the r restrictions sᵢ(x₁,...,xₙ) = sᵢ, for i = 1,...,r, by virtue of the event being conditioned upon, the arguments x₁,...,xₙ of the conditional density are not all free to vary but rather are functionally related. If one wanted to utilize the conditional density for actually calculating conditional probabilities of events for (X₁,...,Xₙ), and if the random variables were continuous, then line integrals would be required, as discussed previously in Chapter 3 concerning the use of degenerate densities. This technical problem is of no concern in our current discussion of sufficient statistics since we will have no need to actually calculate conditional probabilities from the conditional density.
sales equals the value s. It follows from the definition of conditional probability that the conditional density function can be defined as¹¹

f(\mathbf{x};p \mid s) = \frac{P(x_1,\ldots,x_n,\; s(\mathbf{x}) = s)}{P(s(\mathbf{x}) = s)}.

The denominator probability is given directly by

P(s(\mathbf{x}) = s) = \binom{n}{s} p^{s}(1-p)^{n-s} I_{\{0,1,\ldots,n\}}(s)

because s(X) = Σᵢ₌₁ⁿ Xᵢ is the sum of iid Bernoulli random variables, which we know to have a binomial distribution. The numerator probability is defined by an appropriate evaluation of the joint density of the random sample, as

P(x_1,\ldots,x_n,\; s(\mathbf{x}) = s) = p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}\left(\prod_{i=1}^n I_{\{0,1\}}(x_i)\right) I_{\{\sum_{i=1}^n x_i = s\}},

which is the probability of x₁,...,xₙ and s(x) = Σᵢ₌₁ⁿ xᵢ = s. Using the preceding functional representations of the numerator and denominator probabilities in the ratio defining the conditional density function, and using the fact that s = Σᵢ₌₁ⁿ xᵢ, we obtain after appropriate algebraic cancellations that

f(\mathbf{x};p \mid s) = \binom{n}{s}^{-1}\left(\prod_{i=1}^n I_{\{0,1\}}(x_i)\right) I_{\{\sum_{i=1}^n x_i = s\}}

for any choice of s ∈ {0,1,...,n}, which does not depend on the parameter p. Thus, S = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for p.
Note the conditional density states that, given Σᵢ₌₁ⁿ xᵢ = s, all outcomes of (x₁,...,xₙ) are equally likely with probability \binom{n}{s}^{-1}, and thus the probability of a particular pattern of sales and no sales occurring for (x₁,...,xₙ), given that Σᵢ₌₁ⁿ xᵢ = s, has nothing to do with the value of p. It follows that only the fact that Σᵢ₌₁ⁿ xᵢ = s provides any information about p – the particular pattern of 0's and 1's in (X₁,...,Xₙ) is irrelevant. This is consistent with intuition in that it is the total number of sales in n phone calls, and not the particular pattern of sales, that provides information in a relative frequency sense about the probability, p, of obtaining a sale on a phone call solicitation. Furthermore, if Y = g(X) is any other function of the random sample, then it can provide no additional information about p other than that already provided by s(X). This follows from the fact
¹¹ The reader may wonder why we define the conditional density "from the definition of conditional probability," instead of using the rather straightforward methods for defining conditional densities presented in Chapter 2, Section 2.6. The problem is that here we are conditioning on an event that involves all of the random variables X₁,...,Xₙ, whereas in Chapter 2 we were dealing with the usual case where the event being conditioned upon involves only a subset of the random variables X₁,...,Xₙ having fewer than n elements.
that h(y|s(x) = s) will not depend on p, because the conditional density of Y will have been derived from a conditional density of X that is independent of p, i.e., from f(x;p|s) = h(x).

In any problem of estimating q(Θ), once the outcome of a set of sufficient statistics is observed, the random sample outcome (x₁,...,xₙ) can effectively be ignored for the remainder of the point estimation problem since s(x) captures all of the relevant information that the sample has to offer regarding q(Θ). Essentially, it is sufficient that the outcome of S be observed. For example, with reference to Example 7.10, if a colleague were to provide the information that 123 sales were observed in a total of 250 phone calls, i.e., Σᵢ₌₁²⁵⁰ xᵢ = 123, we would have no need to examine any other characteristic of the random sample outcome (x₁,...,x₂₅₀) when estimating p, or q(p).
A significant practical problem in the use of sufficient statistics is knowing how to identify them. A criterion for identifying sufficient statistics which is sometimes useful is given by the Neyman Factorization Theorem:
Theorem 7.2 Neyman Factorization Theorem
Let f(x;Θ) be the density function of the random sample (X₁,...,Xₙ). The statistics S₁,...,S_r are sufficient statistics for f(x;Θ) iff f(x;Θ) can be factored as f(x;Θ) = g(s₁(x),...,s_r(x);Θ)·h(x), where g is a function of only s₁(x),...,s_r(x) and of Θ, and h(x) does not depend on Θ.
Proof The proof of the theorem in the continuous case is quite difficult, and we leave it to a more advanced course of study (see E.L. Lehmann (1986), Testing Statistical Hypotheses, John Wiley, pp. 54–55). We provide a proof for the discrete case.

Sufficiency: Suppose the factorization criterion is met. Let B(a) = {(x₁,...,xₙ): sᵢ(x) = aᵢ, i = 1,...,r; x ∈ R(X)} be such that P(B(a)) > 0, and note that for x ∈ B(a),

f(\mathbf{x};\Theta \mid B(\mathbf{a})) = \frac{f(\mathbf{x};\Theta)}{P(B(\mathbf{a}))} = \frac{g(\mathbf{a};\Theta)\,h(\mathbf{x})}{g(\mathbf{a};\Theta)\sum_{\mathbf{y} \in B(\mathbf{a})} h(\mathbf{y})} = \frac{h(\mathbf{x})}{\sum_{\mathbf{y} \in B(\mathbf{a})} h(\mathbf{y})},

which does not depend on Θ, so that S₁,...,S_r are sufficient statistics by Definition 7.18.

Necessity: Suppose S₁,...,S_r are sufficient statistics. It follows from the definition of the conditional density that f(x;Θ) = f(x|sᵢ(x) = aᵢ, i = 1,...,r)·P(sᵢ(x) = aᵢ, i = 1,...,r), where the conditional density function does not depend on Θ by the sufficiency of s. Then we have factored f(x;Θ) into the product of a function of s₁(x),...,s_r(x) and Θ (i.e., P(sᵢ(x) = aᵢ, i = 1,...,r) will depend on Θ), and a function of x alone, as the factorization criterion requires. □
As we have alluded to previously, a practical advantage of sufficient statistics is that they can often greatly reduce the number of random variables required to represent the sample information relevant for estimating q(Θ), as seen in Example 7.8 and in the following example of the use of the Neyman Factorization Theorem.
Example 7.11 Sufficient Statistics via Neyman Factorization for Exponential
Let X = (X₁,...,Xₙ) be a random sample from the exponential population distribution Θ⁻¹e^{-z/Θ}I_{(0,∞)}(z) representing waiting times between customer arrivals at a retail store. Note that the joint density of the random sample is given by

f(\mathbf{x};\Theta) = \Theta^{-n} \exp\left(-\Theta^{-1}\sum_{i=1}^n x_i\right) \prod_{i=1}^n I_{(0,\infty)}(x_i).

Then from the theorem, we can conclude that S = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for f(x;Θ). It follows that the value of the sum of the random sample outcomes contains all of the information in the sample. □
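To make the factorization explicit, here is one possible choice of the g and h functions of Theorem 7.2 for this example (our elaboration of the step just taken, not additional material from the text):

```latex
% Explicit Neyman factorization for Example 7.11 (one possible choice):
% f(x; \Theta) = g(s(x); \Theta) \, h(x), with s(x) = \sum_{i=1}^n x_i.
\[
g(s(\mathbf{x});\Theta) = \Theta^{-n} e^{-s(\mathbf{x})/\Theta},
\qquad
h(\mathbf{x}) = \prod_{i=1}^{n} I_{(0,\infty)}(x_i).
\]
% g depends on x only through s(x), and h does not involve \Theta,
% so S = \sum_{i=1}^n X_i is sufficient by Theorem 7.2.
```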
Successful use of the Neyman Factorization Theorem for identifying sufficient statistics requires that one be ingenious enough to define the appropriate g(s(x);Θ) and h(x) functions that achieve the required joint probability density factorization. Since the appropriate function definitions will not always be readily apparent, an approach introduced by Lehmann-Scheffé¹² can sometimes be quite useful for providing direction to the search for sufficient statistics. We will discuss this useful result in the context of minimal sufficient statistics.

7.4.1 Minimal Sufficient Statistics
At the beginning of our discussion of sufficient statistics we remarked that an objective of using sufficient statistics is to reduce the number of functions of the random sample required to represent all of the information in the random sample relevant for estimating q(Θ). A natural question to consider is: what is the smallest number of functions of the random sample that can represent all of the relevant sample information in a given point estimation problem? This relates to the concept of a minimal sufficient statistic, which is essentially the sufficient statistic for a given f(x;Θ) that is defined using the fewest number of (functionally independent) coordinate functions of the random sample.
The statement of subsequent definitions and theorems will be facilitated by the concept of the range of X over the parameter space Ω, defined as
¹² Lehmann, E.L. and H. Scheffé (1950), "Completeness, Similar Regions, and Unbiased Estimation," Sankhyā, 10, p. 305.
R_Ω(X) = {x: f(x;Θ) > 0 for some Θ ∈ Ω}. The set R_Ω(X) represents all of the values of x that are assigned a nonzero density weighting by f(x;Θ) for at least one Θ ∈ Ω. In other words, R_Ω(X) is the union of the supports of the densities f(x;Θ) for Θ ∈ Ω, and thus corresponds to the set of relevant x-outcomes for the statistical model {f(x;Θ), Θ ∈ Ω}. If the support of the density f(x;Θ) does not change with Θ (e.g., normal, gamma, binomial), then R_Ω(X) = R(X) = {x: f(x;Θ) > 0}, where Θ ∈ Ω can be chosen arbitrarily, and we henceforth treat the range of X as being synonymous with the support of its density.
Definition 7.19 Minimal Sufficient Statistics
A sufficient statistic S = s(X) for f(x;Θ) is said to be a minimal sufficient statistic if for every other sufficient statistic T = t(X) ∃ a function h_T(·) such that s(x) = h_T(t(x)) ∀x ∈ R_Ω(X).
In order to motivate what is "minimal" about the sufficient statistic S in Definition 7.19, first note that S will have the fewest elements in its range compared to all sufficient statistics for f(x;Θ). This follows from the fact that a function can never have more elements in its range than in its domain (recall the definition of a function, which requires that there is only one range point associated with each domain element, although there can be many domain elements associated with each range element), and thus if S = h_T(T) for any other sufficient statistic T, then the number of elements in R(S) must be no more than the number of elements in R(T), for any sufficient statistic T. So, in this sense, S utilizes the minimal set of points for representing the sample information relevant for estimating q(Θ).
It can also be shown that a minimal sufficient statistic can be chosen to have the fewest number of coordinate functions relative to any other sufficient statistic, i.e., the number of coordinate functions defining the minimal sufficient statistic is minimal. A rigorous proof of this fact is quite difficult and is deferred to a more advanced course of study.¹³ In order to at least motivate the plausibility of this fact, first note that since a minimal sufficient statistic, say S, is a function of all other sufficient statistics, then if T is any other sufficient statistic, t(x) = t(y) ⇒ s(x) = h_T(t(x)) = h_T(t(y)) = s(y). It follows that

A_T = {(x,y): t(x) = t(y)} ⊆ {(x,y): s(x) = s(y)} = B

no matter which sufficient statistic, T, is being referred to. If B is to contain the set A_T, then the constraints on (x,y) representing the set-defining conditions of B cannot be more constraining than the constraints defining A_T, and in particular the number of nonredundant constraints¹⁴ defining B cannot be more than the
¹³ See E.W. Barankin and M. Katz (1959), "Sufficient Statistics of Minimal Dimension," Sankhyā, 21:217–246; R. Shimizu (1966), "Remarks on Sufficient Statistics," Ann. Inst. Statist. Math., 18:49–66; D.A.S. Fraser (1963), "On Sufficiency and the Exponential Family," Jour. Roy. Statist. Soc., Series B, 25:115–123.
number defining A_T. Thus the number of nonredundant coordinate functions defining S must be no larger than the number of nonredundant coordinate functions defining any other sufficient statistic, so that the number of coordinate functions defining S is minimal. Identification of minimal sufficient statistics can often be facilitated by the following approach suggested by Lehmann and Scheffé.
Theorem 7.3 Lehmann-Scheffé Minimal Sufficiency Theorem
Let X ~ f(x;Θ). If the statistic S = s(X) is such that ∀x and y ∈ R_Ω(X), f(x;Θ) = t(x,y)·f(y;Θ) iff (x,y) satisfies s(x) = s(y), then S = s(X) is a minimal sufficient statistic for f(x;Θ).
Proof Define A(s) = {x: s(x) = s} and let x_s ∈ A(s) ∩ R_Ω(X) be chosen as a representative element of A(s), ∀s ∈ R(S). Define Z(x) = x_s ∀x ∈ A(s) and ∀s ∈ R(S). Thus A(s) is the set of x-outcomes whose image s(x) is s, and Z(x) is the representative element of the set A(s) to which x belongs.

Assume that (x,y) ∈ {(x,y): s(x) = s(y)} ⇒ f(x;Θ) = t(x,y)·f(y;Θ) ∀x and y ∈ R_Ω(X). Then for x ∈ A(s) ∩ R_Ω(X), s(x) = s(x_s) implies

f(\mathbf{x};\Theta) = t(\mathbf{x},\mathbf{x}_s)\,f(\mathbf{x}_s;\Theta) = h(\mathbf{x})\,g(s(\mathbf{x});\Theta),

where h(x) = t(x, Z(x)) and g(s(x);Θ) = f(Z(x);Θ). If x ∉ R_Ω(X), then f(x;Θ) = h(x)·g(s(x);Θ) by defining h(x) = 0. Since Neyman factorization holds, s(X) is a sufficient statistic.

Now assume f(x;Θ) = t(x,y)·f(y;Θ) ⇒ (x,y) ∈ {(x,y): s(x) = s(y)} ∀x and y ∈ R_Ω(X). Let s*(X) be any other sufficient statistic for f(x;Θ). Then by Neyman factorization, for some g*(·) and h*(·) functions, f(x;Θ) = g*(s*(x);Θ)·h*(x). If s*(x) = s*(y), then since g*(s*(x);Θ) = g*(s*(y);Θ), it follows that f(x;Θ) = [h*(x)/h*(y)]·f(y;Θ) = t(x,y)·f(y;Θ) whenever h*(y) ≠ 0, so that s*(x) = s*(y) ⇒ s(x) = s(y). Values of y for which h*(y) = 0 are such that f(y;Θ) = 0 ∀Θ ∈ Ω by Neyman factorization, and are thus irrelevant to the minimal sufficiency of S (recall Definition 7.19). Then s is a function of s*, as s(x) = g(s*(x)) ∀x ∈ R_Ω(X), because for a representative y_{s*} ∈ {x: s*(x) = s*}, s*(x) = s*(y_{s*}) = s* ⇒ s(x) = s(y_{s*}) = s, and thus s(x) = s = g(s*) = g(s*(x)). Therefore s(X) is a minimal sufficient statistic. □
¹⁴ By nonredundant, we mean that none of the constraints are implied by the others. Redundant constraints are constraints that are ineffective or unnecessary in defining sets.
Trang 36Before proceeding to applications of the theorem, we present two corollariesthat are informative and useful in practice.
Corollary 7.1 Lehmann-Scheffé Sufficiency
Let X ~ f(x;Θ). If the statistic S = s(X) is such that ∀x and y ∈ R_Ω(X), (x,y) ∈ {(x,y): s(x) = s(y)} ⇒ f(x;Θ) = t(x,y)·f(y;Θ), then S = s(X) is a sufficient statistic for f(x;Θ).
Proof The validity of this corollary is implied by the first part of the proof of Theorem 7.3. □
The corollary indicates that the "only if" part of the condition in Theorem 7.3 is not required for the sufficiency of s(X), but it is the addition of the "only if" part that results in minimality of s(X).

Corollary 7.2 Lehmann-Scheffé Minimal Sufficiency
Let X ~ f(x;Θ), where the support R(X) = {x: f(x;Θ) > 0} does not depend on Θ. If the statistic S = s(X) is such that ∀x and y ∈ R(X), the ratio f(x;Θ)/f(y;Θ) does not depend on Θ iff s(x) = s(y), then S = s(X) is a minimal sufficient statistic.
Proof This follows from Theorem 7.3 by dividing through by f(y;Θ) on the left-hand side of the iff condition, which is admissible for all x and y in R(X) = {x: f(x;Θ) > 0}. Values of x and y ∉ R(X) are irrelevant to sufficiency (recall Definition 7.19). □
Using the preceding results for defining a minimal sufficient statistic of course still requires that one is observant enough to recognize an appropriate (vector) function S. However, in many cases the Lehmann-Scheffé approach transforms the problem into one where a choice of S is readily apparent. The following examples illustrate the use of the procedure for discovering minimal sufficient statistics.
Example 7.12 Lehmann-Scheffé Minimal Sufficiency Approach for Bernoulli
Let X = (X₁,...,Xₙ) be a random sample from a nondegenerate Bernoulli population distribution representing whether or not a customer contact results in a sale, so that

\frac{f(\mathbf{x};p)}{f(\mathbf{y};p)} = \frac{p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}\prod_{i=1}^n I_{\{0,1\}}(x_i)}{p^{\sum_{i=1}^n y_i}(1-p)^{n-\sum_{i=1}^n y_i}\prod_{i=1}^n I_{\{0,1\}}(y_i)}

for all values of x and y ∈ R(X) = ×ᵢ₌₁ⁿ{0,1}. The ratio will be independent of p iff the constraint Σᵢ₌₁ⁿ xᵢ = Σᵢ₌₁ⁿ yᵢ is imposed, so that S = Σᵢ₌₁ⁿ Xᵢ is a minimal sufficient statistic for f(x;p) by Corollary 7.2. □
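Writing out the cancellation in the ratio (our elaboration, not the text's) makes the conclusion immediate:

```latex
% Simplifying the Bernoulli density ratio of Example 7.12 (our elaboration):
\[
\frac{f(\mathbf{x};p)}{f(\mathbf{y};p)}
= \left(\frac{p}{1-p}\right)^{\sum_{i=1}^n x_i - \sum_{i=1}^n y_i}
\cdot \frac{\prod_{i=1}^n I_{\{0,1\}}(x_i)}{\prod_{i=1}^n I_{\{0,1\}}(y_i)}.
\]
% The only factor involving p has exponent \sum x_i - \sum y_i, so the ratio
% is free of p for all p in (0,1) iff \sum_{i=1}^n x_i = \sum_{i=1}^n y_i.
```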
Example 7.13 Lehmann-Scheffé Minimal Sufficiency Approach for Gamma
Let X = (X₁,...,Xₙ) be a random sample from a gamma population distribution representing the operating life until failure of a certain brand and type of personal computer, so that

f(\mathbf{x};\alpha,\beta) = \frac{1}{\beta^{n\alpha}\,\Gamma^n(\alpha)} \left(\prod_{i=1}^n x_i\right)^{\alpha-1} \exp\left(-\beta^{-1}\sum_{i=1}^n x_i\right) \prod_{i=1}^n I_{(0,\infty)}(x_i).

In this case,

\frac{f(\mathbf{x};\alpha,\beta)}{f(\mathbf{y};\alpha,\beta)} = \frac{\left(\prod_{i=1}^n x_i\right)^{\alpha-1} \exp\left(-\beta^{-1}\sum_{i=1}^n x_i\right) \prod_{i=1}^n I_{(0,\infty)}(x_i)}{\left(\prod_{i=1}^n y_i\right)^{\alpha-1} \exp\left(-\beta^{-1}\sum_{i=1}^n y_i\right) \prod_{i=1}^n I_{(0,\infty)}(y_i)}

for all values of x and y ∈ R(X) = ×ᵢ₌₁ⁿ(0,∞). (Note the term β^{nα}Γⁿ(α) has been algebraically canceled in the density ratio.) The ratio will be independent of both α and β iff the constraints ∏ᵢ₌₁ⁿ xᵢ = ∏ᵢ₌₁ⁿ yᵢ and Σᵢ₌₁ⁿ xᵢ = Σᵢ₌₁ⁿ yᵢ are imposed. A minimal sufficient statistic for f(x;α,β) is then bivariate and given by s₁(X) = ∏ᵢ₌₁ⁿ Xᵢ and s₂(X) = Σᵢ₌₁ⁿ Xᵢ. □
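The same cancellation logic can be displayed for the gamma case (again our elaboration, not material from the text):

```latex
% Why both constraints are needed in Example 7.13 (our elaboration):
\[
\frac{f(\mathbf{x};\alpha,\beta)}{f(\mathbf{y};\alpha,\beta)}
= \left(\frac{\prod_{i=1}^n x_i}{\prod_{i=1}^n y_i}\right)^{\alpha-1}
\exp\!\left(-\beta^{-1}\Big(\sum_{i=1}^n x_i - \sum_{i=1}^n y_i\Big)\right)
\]
% for x, y in the positive orthant. The first factor varies with \alpha unless
% \prod x_i = \prod y_i, and the second varies with \beta unless
% \sum x_i = \sum y_i, so freeness of the ratio in (\alpha,\beta) requires both.
```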
Example 7.14 Lehmann-Scheffé Minimal Sufficiency Approach for Uniform
Let X = (X₁,...,Xₙ) be a random sample from a uniform population distribution representing the number of minutes that a shipment is delivered before (x < 0) or after (x > 0) its scheduled arrival time, so that

f(\mathbf{x};a,b) = (b-a)^{-n} \prod_{i=1}^n I_{[a,b]}(x_i).

Unlike the previous examples, here the range of X depends on the parameters a and b.¹⁵ Referring to the Lehmann-Scheffé procedure for defining a sufficient statistic for f(x;a,b) as given by Theorem 7.3, examine

\frac{f(\mathbf{x};a,b)}{f(\mathbf{y};a,b)} = \frac{\prod_{i=1}^n I_{[a,b]}(x_i)}{\prod_{i=1}^n I_{[a,b]}(y_i)},

with x and y being any points in ×ᵢ₌₁ⁿ [a,b].
¹⁵ It may be more appropriate to assume finite lower and upper bounds for a and b, respectively. Doing so will not change the final result of the example.
The ratio will be independent of a and b iff min(x₁,...,xₙ) = min(y₁,...,yₙ) and max(x₁,...,xₙ) = max(y₁,...,yₙ), in which case the ratio will be equal to 1. The preceding conditions also ensure that f(x;a,b) = 0 when f(y;a,b) = 0, so that f(x;a,b) = t(x,y)·f(y;a,b) holds ∀x and y ∈ R_Ω(X). A minimal sufficient statistic for f(x;a,b) is then bivariate and given by the order statistics s₁(X) = min(X₁,...,Xₙ) and s₂(X) = max(X₁,...,Xₙ) by Theorem 7.3. □
7.4.2 Sufficient Statistics in the Exponential Class

The exponential class of densities represents a collection of parametric families of density functions for which sufficient statistics are straightforwardly defined. Furthermore, the sufficient statistics are generally minimal sufficient statistics.

Theorem 7.4 Exponential Class and Sufficient Statistics
Let f(x;Θ) be a member of the exponential class of density functions

f(\mathbf{x};\Theta) = \exp\left(\sum_{i=1}^k c_i(\Theta)\,g_i(\mathbf{x}) + d(\Theta) + z(\mathbf{x})\right) I_A(\mathbf{x}).

Then s(X) = (g₁(X),...,g_k(X)) is a k-variate sufficient statistic, and if the cᵢ(Θ), i = 1,...,k, are linearly independent, the sufficient statistic is a minimal sufficient statistic.
Proof That s(X) is a sufficient statistic follows immediately from the Neyman Factorization Theorem by defining

g(g_1(\mathbf{x}),\ldots,g_k(\mathbf{x});\Theta) = \exp\left(\sum_{i=1}^k c_i(\Theta)\,g_i(\mathbf{x}) + d(\Theta)\right) \quad \text{and} \quad h(\mathbf{x}) = \exp(z(\mathbf{x}))\,I_A(\mathbf{x})

in the theorem.
That s(X) is a minimal sufficient statistic follows from the fact that s(X) can be derived using the Lehmann-Scheffé approach of Corollary 7.2. To see this, note that

\frac{f(\mathbf{x};\Theta)}{f(\mathbf{y};\Theta)} = \exp\left(\sum_{i=1}^k c_i(\Theta)\left[g_i(\mathbf{x}) - g_i(\mathbf{y})\right] + z(\mathbf{x}) - z(\mathbf{y})\right)

for x and y ∈ A, and this ratio is independent of Θ iff gᵢ(x) = gᵢ(y) for i = 1,...,k, given that the cᵢ(Θ), i = 1,...,k, are linearly independent.¹⁶ □

Note that Theorem 7.4 could be used as an alternative approach for discovering minimal sufficient statistics in the problems of random sampling
¹⁶ If one (or more) of the cᵢ(Θ) were linearly dependent on the other c_j(Θ)'s, then "only if" would not apply. To see this, suppose c_k(Θ) = Σᵢ₌₁^{k-1} aᵢcᵢ(Θ); then the exponent in the density ratio can be rewritten as Σᵢ₌₁^{k-1} cᵢ(Θ)[(gᵢ(x) - gᵢ(y)) + aᵢ(g_k(x) - g_k(y))] + z(x) - z(y), which can be free of Θ without gᵢ(x) = gᵢ(y) holding for each i.
examined in Examples 7.12 and 7.13. It could not be used in Example 7.14, since the uniform distribution is not in the exponential class.
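As a further illustration of Theorem 7.4 (ours, not the text's), the normal random sample is in the exponential class, and the theorem reads off its minimal sufficient statistic directly:

```latex
% The N(\mu,\sigma^2) random sample in exponential-class form (our example):
\[
f(\mathbf{x};\mu,\sigma^2)
= \exp\!\Big( \tfrac{\mu}{\sigma^2}\sum_{i=1}^n x_i
\;-\; \tfrac{1}{2\sigma^2}\sum_{i=1}^n x_i^2
\;-\; \tfrac{n\mu^2}{2\sigma^2} \;-\; \tfrac{n}{2}\ln(2\pi\sigma^2) \Big),
\]
% with c_1(\Theta) = \mu/\sigma^2, g_1(x) = \sum x_i,
% c_2(\Theta) = -1/(2\sigma^2), g_2(x) = \sum x_i^2, d(\Theta) collecting the
% last two terms, z(x) = 0, and A = R^n. Since c_1 and c_2 are linearly
% independent, s(X) = (\sum X_i, \sum X_i^2) is minimal sufficient.
```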
7.4.3 Relationship Between Sufficiency and MSE: Rao-Blackwell
In addition to generating a condensed representation of the information in a sample relevant for estimating q(Θ), sufficient statistics can also facilitate the discovery of estimators of q(Θ) that are relatively efficient in terms of MSE. In particular, in the pursuit of estimators with low MSE, only functions of sufficient statistics need to be examined, which is the implication of the Rao-Blackwell theorem.
Theorem 7.5 Rao-Blackwell Theorem - Scalar Case
Let S = (S₁,...,S_r) be an r-variate sufficient statistic for f(x;Θ), and let t*(X) be any estimator of the scalar q(Θ) having finite variance. Define t(X) = E(t*(X)|S₁,...,S_r) = ξ(S₁,...,S_r). Then t(X) is an estimator of q(Θ) for which MSE_Θ(t(X)) ≤ MSE_Θ(t*(X)) ∀Θ ∈ Ω, with the equality being attained only if P_Θ(t(x) = t*(x)) = 1.
Proof First note that since S = (S₁,...,S_r) is an r-variate sufficient statistic, f(x|s) does not depend on Θ, and thus neither does the function t(X) (since it is defined as a conditional expectation using f(x|s)), so t(X) is a statistic that can be used as an estimator of q(Θ). Now by the iterated expectation theorem, E(t(X)) = E[E(t*(X)|S₁,...,S_r)] = E(t*(X)), so that t(X) and t*(X) have precisely the same expectation. Next examine

MSE(t^*(\mathbf{X})) = E(t^*(\mathbf{X}) - q(\Theta))^2 = E(t^*(\mathbf{X}) - t(\mathbf{X}) + t(\mathbf{X}) - q(\Theta))^2
= E(t^*(\mathbf{X}) - t(\mathbf{X}))^2 + 2E\left[(t^*(\mathbf{X}) - t(\mathbf{X}))(t(\mathbf{X}) - q(\Theta))\right] + E(t(\mathbf{X}) - q(\Theta))^2.

The cross-product term is zero. To see this, first note that E[(t*(X) - t(X))(t(X) - q(Θ))] = E[t(X)(t*(X) - t(X))], since E(t*(X) - t(X))·q(Θ) = 0 because E(t*(X)) = E(t(X)). Now note that by definition t(X) is a function of only sufficient statistics, so that t(X) is a constant given s₁,...,s_r. Therefore,

E\left[t(\mathbf{X})(t^*(\mathbf{X}) - t(\mathbf{X})) \mid s_1,\ldots,s_r\right] = t \cdot E\left[(t^*(\mathbf{X}) - t(\mathbf{X})) \mid s_1,\ldots,s_r\right] = 0,

since E(t*(X)|s₁,...,s_r) = t by definition, so that E[t(X)(t*(X) - t(X))] = 0 by the iterated expectation theorem.

Then dropping the nonnegative term E(t*(X) - t(X))² on the right-hand side of the expression defining MSE(t*(X)) above yields MSE_Θ(t*(X)) ≥ E_Θ(t(X) - q(Θ))² = MSE_Θ(t(X)) ∀Θ ∈ Ω. The equality is attained iff E_Θ(t*(X) - t(X))² = 0, which requires that P_Θ[t*(x) = t(x)] = 1. □

The point of the theorem is that for any estimator t*(X) of q(Θ) there always exists an alternative estimator that is at least as good as t*(X) in terms of MSE and that is a function of any set of sufficient statistics. Thus, the Rao-Blackwell theorem suggests that the search for estimators of q(Θ) with low MSEs can
always be restricted to an examination of functions of sufficient statistics, where hopefully the number of sufficient statistics required to fully represent the information about q(Θ) is substantially less than the size of the random sample itself.¹⁷ Note that if attention is restricted to the unbiased class of estimators, so that t*(X) is an unbiased estimator in the statement of the theorem, then the Rao-Blackwell theorem implies that the search for a minimum variance estimator within the class of unbiased estimators can also be restricted to functions of sufficient statistics. As an illustration, in Example 7.11, we know that Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for the exponential population distribution, and so the search for an MVUE of Θ can be confined to functions of Σᵢ₌₁ⁿ Xᵢ.
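Theorem 7.5 is also easy to see numerically. The following simulation sketch is our illustration, not part of the text; the choice of estimators and all parameter values are assumptions for the demonstration. It Rao-Blackwellizes the crude unbiased estimator t*(X) = X₁ of a Bernoulli p by conditioning on S = Σᵢ₌₁ⁿ Xᵢ, which by symmetry yields t(X) = E[X₁|S] = S/n, the sample mean.

```python
# A simulation sketch (ours, not from the text) of Theorem 7.5.
# For a Bernoulli(p) random sample, t*(X) = X_1 is unbiased for p, and
# conditioning on the sufficient statistic S = sum(X_i) gives
# t(X) = E[X_1 | S] = S/n, i.e., the sample mean. Compare the MSEs.
import random

def mse_comparison(p=0.3, n=20, reps=100_000, seed=42):
    rng = random.Random(seed)
    se_star = se_rb = 0.0
    for _ in range(reps):
        x = [1 if rng.random() < p else 0 for _ in range(n)]
        t_star = x[0]              # crude unbiased estimator
        t_rb = sum(x) / n          # its Rao-Blackwellization E[X_1 | S]
        se_star += (t_star - p) ** 2
        se_rb += (t_rb - p) ** 2
    return se_star / reps, se_rb / reps

mse_star, mse_rb = mse_comparison()
print(f"MSE(t*) ~ {mse_star:.4f}  (theory: p(1-p) = 0.21)")
print(f"MSE(t)  ~ {mse_rb:.4f}  (theory: p(1-p)/n = 0.0105)")
```

The simulated MSEs should land near 0.21 and 0.0105, respectively, exhibiting the strict improvement promised by the theorem whenever t*(X) is not already a function of the sufficient statistic.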
Theorem 7.6 Rao-Blackwell Theorem - Vector Case
Let S = (S₁,...,S_r) be an r-variate sufficient statistic for f(x;Θ), and let t*(X) be an estimator of the (k × 1) vector function q(Θ) having a finite covariance matrix. Define t(X) = E(t*(X)|S₁,...,S_r) = h(S₁,...,S_r). Then t(X) is an estimator of q(Θ) for which MSE_Θ(t(X)) ≤ MSE_Θ(t*(X)) ∀Θ ∈ Ω, in the matrix sense that MSE_Θ(t*(X)) - MSE_Θ(t(X)) is positive semidefinite, the equality being attained only if P_Θ(t(x) = t*(x)) = 1.
Proof The proof is analogous to the proof in the scalar case, except that MSE matrices are used in place of scalar MSEs in establishing that MSE(t(X)) is smaller than MSE(t*(X)). The details are left to the reader. □

The implications of Theorem 7.6 are analogous to those for the scalar case. Namely, one need only examine vector functions of sufficient statistics for estimating the vector q(Θ) if the objective is to obtain an estimator with a small MSE matrix. Furthermore, the search for an MVUE of q(Θ) can also be restricted to functions of sufficient statistics. As stated previously, this can decrease substantially the dimensionality of the data used in a point estimation problem if the minimal sufficient statistics for the problem are few in number. Revisiting Example 7.12, we note for future reference that
T = \xi\left(\sum_{i=1}^n X_i\right) = \begin{bmatrix} \bar{X} \\ \frac{n}{n-1}\,\bar{X}(1-\bar{X}) \end{bmatrix}, \qquad \bar{X} = n^{-1}\sum_{i=1}^n X_i,

is the MVUE for (p, p(1-p))′, the mean and variance of the Bernoulli population distribution in the example.
¹⁷ The reader will recall that the random sample, (X₁,...,Xₙ), is by definition a set of sufficient statistics for f(x;Θ). However, it is clear that no improvement (decrease) in the MSE of an unbiased estimator will be achieved by conditioning on (X₁,...,Xₙ), i.e., the reader should verify that this is a case where E(t*(X) - t(X))² = 0 and MSE equality is achieved in the Rao-Blackwell theorem.
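As a closing numerical check (ours, not part of the text), the unbiasedness of the two components of T can be verified exactly by averaging against the binomial distribution of S = Σᵢ₌₁ⁿ Xᵢ; the function name and parameter values below are assumptions for the demonstration.

```python
# A quick exact check (ours) that the vector estimator T above is unbiased
# for (p, p(1-p)): since T is a function of S = sum(X_i) alone, its exact
# expectation is a finite sum against the binomial pmf of S.
from math import comb

def expected_T(p, n):
    e1 = e2 = 0.0
    for s in range(n + 1):
        prob = comb(n, s) * p**s * (1 - p) ** (n - s)
        xbar = s / n
        e1 += prob * xbar                                # E[first component]
        e2 += prob * (n / (n - 1)) * xbar * (1 - xbar)   # E[second component]
    return e1, e2

p, n = 0.3, 10
print(expected_T(p, n))   # -> (0.3, 0.21) = (p, p(1-p)) up to rounding
```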