
Mathematical statistics for economics and business (second edition) part 2


7 Point Estimation Theory

The problem of point estimation examined in this chapter is concerned with the estimation of the values of unknown parameters, or functions of parameters, that represent characteristics of interest relating to a probability model of some collection of economic, sociological, biological, or physical experiments. The outcomes generated by the collection of experiments are assumed to be outcomes of a random sample with some joint probability density function $f(x_1,\ldots,x_n;\Theta)$. The random sample need not be from a population distribution, so it is not necessary that $X_1,\ldots,X_n$ be iid. The estimation concepts we will examine in this chapter can be applied to the case of general random sampling, as well as simple random sampling and random sampling with replacement, i.e., all of the random sampling types discussed in Chapter 6. The objective of point estimation will be to utilize functions of the random sample outcome to generate good (in some sense) estimates of the unknown characteristics of interest.

The types of estimation problems that will be examined in this (and the next) chapter are problems of parametric estimation and semiparametric estimation, as opposed to nonparametric estimation problems. Both parametric and semiparametric estimation problems are concerned with estimates of the values of unknown parameters that characterize parametric probability models or semiparametric probability models of the population, process, or general experiments under study. Both of these models have specific parametric functional structure to them that becomes fixed and known once values of parameters are numerically specified. The difference between the two models lies in whether a particular parametric family or class of probability distributions underlies the probability model and is fully determined by setting the values of parameters (the parametric model) or not (the semiparametric model). A nonparametric probability model is a model that is devoid of any specific parametric functional structure that becomes fixed when parameter values are specified. We discuss these models in more detail below.

Given the prominence of parameters in the estimation problems we will be examining, and the need to distinguish their appearance and effect in specifying parametric, semiparametric, and nonparametric probability models, we extend the scope of the term probability model to explicitly encompass the definition of parameters and their admissible values. Note, because it is possible that the range of the random variable can change with changing values of the parameter vector for certain specifications of the joint probability density function of a random variable (e.g., a uniform distribution), we emphasize this in the definition below by including the parameter vector in the definition of the range of $X$.

We note, given our convention that the range and the support of the random variable $X$ are equivalent (recall Definition 2.13), that explicitly listing the range of the random variable as part of the specification of the probability model does not provide new information, per se. That is, knowing the density function and its admissible parameter values implies the range of the random variable as $R(X;\Theta) \equiv \{x : f(x;\Theta) > 0\}$ for $\Theta \in \Omega$. We will see ahead that in point estimation problems an explicit specification of the range of a random sample $X$ is important for a number of reasons, including determining the types of estimation procedures that can be used in a given estimation problem, and for defining the range of estimates that are possible to generate from a particular point estimator specification. We will therefore continue to explicitly include the range of $X$ in our specification of a probability model, but we will reserve the option to specify the probability model in the abbreviated form $\{f(x;\Theta), \Theta \in \Omega\}$ when emphasizing the range of the random variable is not germane to the discussion.

…in which case the adequacy of the model may itself be an issue in need of further statistical analysis and testing. However, use of parametric estimation methodology begins with, and indeed requires, such a full specification of a parametric model for $X$.

7.1.2 Semiparametric Models

A semiparametric model is one in which the functional form of the joint probability density function component of the probability model for the observed sample data, $x$, is not fully specified and is not known when the value of the parameter vector of the model, $\Theta$, is given a specific numerical value. Instead of defining a collection of explicit parametric functional forms for the joint density of the random sample $X$ when defining the model, as in the parametric case, the analyst defines a number of properties that the underlying true sampling density $f(x_1,\ldots,x_n;\Theta_0)$ is thought to possess. Such information could include parametric specifications for some of the moments that the random variables are thought to adhere to, or whether the random variables contained in the random sample exhibit independence or not. Given a numerical value for the parameter vector $\Theta$, any parametric structural components of the model are given an explicit, fully specified functional form, but other components of the model, most notably the underlying joint density function for the random sample, $f(x_1,\ldots,x_n;\Theta)$, remain unknown and not fully specified.

7.1.3 Nonparametric Models

A nonparametric model is one in which neither the functional form of the joint probability density function component of the probability model for the observed sample data, $x$, nor any other parametric functional component of the probability model is defined and known given numerical values of parameters $\Theta$. These models proceed with minimal assumptions on the structure of the probability model, with the analyst simply acknowledging the existence of some general characteristics and relationships relating to the random variables in the random sample, such as the existence of a general regression relationship, or the existence of a population probability distribution if the sample were generated through simple random sampling.

For example, the analyst may wish to estimate the CDF $F(z)$, where $(X_1,\ldots,X_n)$ is an iid random sample from the population distribution $F(z)$, and no mention is made, nor required, regarding parameters of the CDF. We have already examined a method for estimating the CDF in the case where the random sample is from a population distribution: the empirical distribution function, $F_n$, provides an estimate of $F$. We will leave the general study of nonparametric estimation to a more advanced course of study; interested readers can refer to M. Puri and P. Sen (1985), Nonparametric Methods in General Linear Models, New York: John Wiley; F. Hampel, E. Ronchetti, P. Rousseeuw, and W. Stahel (1986), Robust Statistics, New York: John Wiley; J. Pratt and J. Gibbons (1981), Concepts of Nonparametric Theory, New York: Springer-Verlag; and A. Pagan and A. Ullah (1999), Nonparametric Econometrics, Cambridge: Cambridge University Press.¹
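As a concrete aside (an illustration added here, not part of the text), the empirical distribution function $F_n(z) = n^{-1}\sum_{i=1}^{n} I(x_i \leq z)$ is straightforward to compute; the standard normal population below is an arbitrary assumption chosen for the demonstration.

```python
import numpy as np

def edf(sample, z):
    """Empirical distribution function: F_n(z) = (1/n) * #{i : x_i <= z}."""
    sample = np.asarray(sample)
    return np.mean(sample[:, None] <= np.atleast_1d(z), axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)          # iid sample from an assumed N(0,1) population
print(edf(x, [-1.0, 0.0, 1.0]))    # approximates Phi(-1), Phi(0), Phi(1) = .159, .500, .841
```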

We illustrate the definition of the above three types of models in the following example.

…For the parametric model of the relationship $X_i = \beta_1 + z_i\beta_2 + \varepsilon_i$, the joint density of the random sample is fully specified as

$$f(x;\beta,\sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \beta_1 - z_i\beta_2)^2}{2\sigma^2}\right)$$

for $\beta \in \mathbb{R}^2$ and $\sigma^2 > 0$, where $R(X;\beta,\sigma^2) = R(X) = \mathbb{R}^n$ for all admissible parameter values. In this case, if the parameters $\beta$ and $\sigma^2$ are given specific numerical values, the joint density of the random sample is fully defined and known. Specifically, $X_i \sim N(\beta_{10} + z_i\beta_{20},\, \sigma^2)$.

Finally, consider a nonparametric model of the relationship. In this case, the analyst might specify that $X = g(z) + \varepsilon$ with $E(X) = g(z)$, and perhaps that $X$ is a collection of independent random variables, but nothing more. Thus, the mean function, as well as all other aspects of the relationship between $X$ and $z$, are left completely general, and nothing is explicitly determined given numerical values of parameters. There is clearly insufficient information for the analyst to simulate random sample outcomes from the probability model, not knowing the joint density of the random sample or even any moment aspects of the model, given numerical parameter values.

7.1.4 Scope of Parameter Estimation Problems

The objective in problems of parameter estimation is to utilize a sample outcome $[x_1,\ldots,x_n]'$ of $X = [X_1,\ldots,X_n]'$ to estimate the unknown value $\Theta_0$ or $q(\Theta_0)$, where $\Theta_0$ denotes the value of the parameter vector associated with the joint PDF that actually determines the probabilities of events for the random sample outcome. That is, $\Theta_0$ is the value of $\Theta$ such that $X \sim f(x;\Theta_0)$ is a true statement, and for this reason $\Theta_0$ is oftentimes referred to as the true value of $\Theta$; we can then also speak of $q(\Theta_0)$ as being the true value of $q(\Theta)$ and of $f(x;\Theta_0)$ as being the true PDF of $X$. Some examples of the many functions of $\Theta_0$ that might be of interest when sampling from a distribution $f(z;\Theta_0)$ include:

1. $q_1(\Theta_0) = E(Z) = \int_{-\infty}^{\infty} z\, f(z;\Theta_0)\,dz$ (mean),
2. $q_2(\Theta_0) = E(Z - E(Z))^2 = \int_{-\infty}^{\infty} (z - E(Z))^2 f(z;\Theta_0)\,dz$ (variance),
3. $q_3(\Theta_0)$ defined implicitly by $\int_{-\infty}^{q_3(\Theta_0)} f(z;\Theta_0)\,dz = .5$ (median),
4. $q_4(\Theta_0) = \int_{a}^{b} f(z;\Theta_0)\,dz = P(z \in [a,b])$ (probabilities),
5. $q_5(\Theta_0) = \int_{-\infty}^{\infty} z\, f(z\,|\,x;\Theta_0)\,dz$ (regression function of $z$ on $x$).

The method used to solve a parametric estimation problem will generally depend on the degree of specificity with which one can define the family of candidates for the true PDF of the random sample, $X$. The situation for which the most statistical theory has been developed, both in terms of the actual procedures used to generate point estimates and in terms of the evaluation of the properties of the procedures, is the parametric model case. In this case, the density function candidates, $f(x_1,\ldots,x_n;\Theta)$, are assumed at the outset to belong to specific parametric families of PDFs (e.g., normal, gamma, binomial), and application of the celebrated maximum likelihood estimation procedure (presented in Chapter 8) relies on the candidates for the distribution of $X$ being members of a specific collection of density functions that are indexed, and fully algebraically specified, by the values of $\Theta$.

In the semiparametric model and nonparametric model cases, a specific functional definition of the potential PDFs for $X$ is not assumed, although some assumptions about the lower-order moments of $f(x;\Theta)$ are often made. In any case, it is often still possible to generate useful point estimates of various characteristics of the probability model of $X$ that are conceptually functions of parameters, such as moments, quantiles, and probabilities, even if the specific parametric family of PDFs for $X$ is not specified. For example, useful point estimates (in a number of respects) of the parameters in the so-called general linear model representation of a random sample based on general random sampling designs can be made with only a few general assumptions regarding the lower-order moments of $f(x_1,\ldots,x_n;\Theta)$, and without any assumptions that the density is of a specific parametric form (see Section 8.2).

Semiparametric and nonparametric methods of estimation have an advantage of being applicable to a wide range of sampling distributions, since they are defined in a distribution-nonspecific context that inherently subsumes many different functional forms for $f(x;\Theta)$. However, it is usually the case that superior methods of estimating $\Theta$ or $q(\Theta)$ exist if a parametric family of PDFs for $X$ can be specified, and if the actual sampling distribution of the random sample is subsumed by the probability model. Put another way, the more (correct) information one has about the form of $f(x;\Theta)$ at the outset, the more precisely one can estimate $\Theta_0$ or $q(\Theta_0)$.

A problem of point estimation begins with either a fully or partially specified probability model for the random sample $X = (X_1,\ldots,X_n)'$ whose outcome $x = [x_1,\ldots,x_n]'$ constitutes the observed data being analyzed in a real-world problem of point estimation, or statistical inference. The probability model defines the probabilistic and parametric context in which point estimation proceeds. Once the probability model has been specified, interest centers on estimating the true values of some (or all) of the parameters, or on estimating the true values of some functions of the parameters of the problem. The specific objectives of any point estimation problem depend on the needs of the researcher, who will identify which quantities are to be estimated.

The case of parametric model estimation of $\Theta$ or $q(\Theta)$ is associated with a fully specified probability model in which a specific parametric family of PDFs is represented by $\{f(x;\Theta), \Theta \in \Omega\}$. For example, a fully specified probability model for a random sample of miles per gallon achieved by 25 randomly chosen trucks from the assembly line of a Detroit manufacturer might be defined as …

In specifying a probability model, the researcher presumably attempts to identify an appropriate parametric family based on a combination of experience, consideration of the real-world characteristics of the experiments involved, theoretical considerations, past analyses of similar problems, an attempt at a reasonably robust approximation to the probability distribution, and/or pragmatism. The degree of detail with which the parametric family of densities is specified can vary from problem to problem.

In some situations there will be great confidence in a detailed choice of parametric family. For example, suppose we are interested in estimating the proportion, $p$, of defective manufactured items in a shipment of $N$ items. If a random sample with replacement of size $n$ is taken from the shipment (population) of manufactured items, then …

On the other hand, there will be situations in which the specification of the parametric family is quite tentative. For example, suppose one were interested in estimating the average operating life of a certain brand of hard disk based on outcomes of a random sample of hard-disk lifetimes. In order to add some mathematical structure to the estimation problem, one might represent the ith random variable in the random sample of lifetimes $(X_1,\ldots,X_n)$ as $X_i = \mu + V_i$, where $\mu$ represents the unknown mean of the population distribution of lifetimes, an outcome of $X_i$ represents the actual lifetime observed for the ith hard disk sampled, and the corresponding outcome of $V_i$ represents the deviation of $X_i$ from $\mu$. Since $(X_1,\ldots,X_n)$ is a random sample from the population distribution, it follows that $E(X_i) = \mu$ and $\text{var}(X_i) = \sigma^2\ \forall i$ can be assumed, so that $E(V_i) = 0$ and $\text{var}(V_i) = \sigma^2\ \forall i$ can also be assumed. Moreover, it is then legitimate to assume that $(X_1,\ldots,X_n)$ and $(V_1,\ldots,V_n)$ are each a collection of iid random variables. To this point, then, we have already specified that the parametric family of distributions associated with $X$ is of the form $\prod_{i=1}^{n} m(x_i;\Theta)$, where the density $m(z;\Theta)$ has mean $\mu$ and variance $\sigma^2$ (what is the corresponding specification for $V$?).

Now, what parametric functional specification of $m(x_i;\Theta)$ can be assumed to contain the specific density that represents the actual probability distribution of $X_i$ or $V_i$? (Note, of course, that specifying a parametric family for $V_i$ would imply a corresponding parametric family for $X_i$, and vice versa.) One general specification would be the collection of all continuous joint density functions $f(x_1,\ldots,x_n;\Theta)$ for which $f(x_1,\ldots,x_n;\Theta) = \prod_{i=1}^{n} m(x_i;\Theta)$ with $E(X_i) = \mu$ and $\text{var}(X_i) = \sigma^2, \forall i$. The advantage of such a general specification of density family is that we have great confidence that the actual density function of $X$ is contained within the implied set of potential PDFs, which we will come to see as an important component of the specification of any point estimation problem. In this particular case, the general specification of the statistical model actually provides sufficient structure to the point estimation problem for a useful estimate of mean lifetime to be generated (for example, the least squares estimator can be used to estimate $\mu$; see Chapter 8). We will see that one disadvantage of very general specifications of the probability model is that the interpretation of the properties of point estimates generated in such a general context is also usually not as specific or detailed as when the density family can be defined with greater specificity.

Consider a more detailed specification of the probability model of hard-disk operating lives. If we feel that lifetimes are symmetrically distributed around some point, $\mu$, with the likelihoods of lifetimes declining the more distant the measurement is from $\mu$, we might consider the normal parametric family for the distribution of $V_i$. It would, of course, follow that $X_i$ is then also normally distributed, and thus the normal distribution could serve only as an approximation, since negative lifetimes are impossible. Alternatively, if we felt that the distribution of lifetimes was skewed to the right, the gamma parametric family provides a rich source of density shapes, and we might specify that the $X_i$'s have some gamma density, and thus the $V_i$'s would have the density of a gamma-type random variable that has been shifted to the left by $\mu$ units. Hopefully, the engineering staff could provide some guidance regarding the most defensible parametric family specification to adopt. In cases where there is considerable doubt concerning the appropriate parametric family of densities, tests of hypotheses concerning the adequacy of a given parametric family specification can be performed. Some such tests will be discussed in Chapter 10. In some problem situations, it may not be possible to provide any more than a general specification of the density family, in which case the use of semiparametric methods of parameter estimation will be necessary.

7.2.2 The Parameter Space for the Probability Model

Given that a parametric functional form is specified to characterize the joint density of the random sample, a parameter space, $\Omega$, must also be identified to complete the probability model. There are often natural choices for the parameter space. For example, if the Bernoulli family were specified, then $\Omega = \{p : p \in [0,1]\}$, or if the normal family were specified, then $\Omega = \{(\mu,\sigma) : \mu \in (-\infty,\infty), \sigma > 0\}$. However, if only a general definition of the parametric family of densities is specified at the outset of the point estimation problem, the specification of the parameter space for the parametric family will then also be general and often incomplete. For example, a parameter space specification for the aforementioned point estimation problem involving hard-disk lifetimes could be $\Omega = \{\Theta : \mu \geq 0, \sigma^2 \geq 0\}$. In this case, since the specific algebraic form of $f(x_1,\ldots,x_n;\Theta)$ is also not specified, we can only state that the mean and variance of hard-disk lifetimes are nonnegative, possibly leaving other unknown parameters in $\Theta$ unrestricted depending on the relationship of the mean and variance to the parameters of the distribution, and in any case not fully specifying the functional form of the density function. Regardless of the level of detail with which $\Omega$ is specified, there are two important assumptions, presented ahead, regarding the specification of $\Omega$ that are made in the context of a point estimation problem.

The Issue of Truth in the Parameter Space. First, it is assumed that $\Omega$ contains the true value $\Theta_0$, so that the probability model given by $\{f(x;\Theta), \Theta \in \Omega\}$ can be assumed to contain the true sampling distribution for the random sample under study. Put another way, in the context of a point estimation problem, the set $\Omega$ is assumed to represent the entire collection of possible values for $\Theta_0$. The relevance of this assumption in the context of point estimation is perhaps obvious: if the objective in point estimation is to estimate the value of $\Theta_0$ or $q(\Theta_0)$, we do not want to preclude $\Theta_0$ or $q(\Theta_0)$ from the set of potential estimates. Note that, in practice, this may be a tentative assumption that is subjected to statistical test for verification or refutation (see Chapter 10).

Identifiability of Parameters. The second assumption on $\Omega$ concerns the concept of the identifiability of the parameter vector $\Theta$. As we alluded to in our discussion of parametric families of densities in Chapter 4, the parameterization of density families is not unique. Any invertible transformation of $\Theta$, say $\lambda = h(\Theta)$, defines an alternative parameter space $\Lambda = \{\lambda : \lambda = h(\Theta), \Theta \in \Omega\}$ that can be used to specify an alternative probability model for $X$ that contains the same PDF candidates as the statistical model based on $\Omega$, i.e., $\{f(x; h^{-1}(\lambda)), \lambda \in \Lambda\} = \{f(x;\Theta), \Theta \in \Omega\}$. Defining $m(x;\lambda) \equiv f(x; h^{-1}(\lambda))$, the alternative probability model could be written as $\{m(x;\lambda), \lambda \in \Lambda\}$. The analyst is free to choose whatever parameterization appears to be most natural or useful in the specification of a probability model, so long as the parameters in the chosen parameterization are identified. In stating the definition of parameter identifiability we use the terminology distinct PDFs to refer to PDFs that assign different probabilities to at least one event for $X$.

The importance of parameter identifiability is related to the ability of random sample outcomes to provide discriminatory information regarding the choice of $\Theta \in \Omega$ to be used in estimating $\Theta_0$. If the parameter vector in a statistical model is not identified, then two or more different values of the parameter vector $\Theta$, say $\Theta_1$ and $\Theta_2$, are associated with precisely the same sampling distribution of $X$. In this event, random sample outcomes cannot possibly be used to discriminate between the values of $\Theta_1$ and $\Theta_2$, since the probabilistic behavior of $X$ under either possibility is indistinguishable. We thus insist on parameter identifiability in a point estimation problem so that different values of $\Theta$ are associated with different probabilistic behavior of the outcomes of the random sample.

…Note that any choice of positive values for $b_0$, $b_1$, $\mu_T$, $\sigma_T^2$, and $\sigma_V^2$ that results in the same given positive values for $\mu$ and $\sigma^2$ results in precisely the same sampling distribution for $Y$ (there is an infinite set of such choices for each value of the vector $[\mu,\sigma^2]'$). Thus the original parameter vector is not identified. Note that the parameter vector $[\mu,\sigma^2]'$ in the latter statistical model for $Y$ is identified, since the sampling distributions associated with two different positive values of the vector $[\mu,\sigma^2]'$ are distinct. □


7.2.3 A Word on Estimation Phraseology

We pause here to introduce a convention regarding the interpretation of phrases such as estimating $\Theta$ or estimating $q(\Theta)$, or an estimate of $\Theta$ (or of $q(\Theta)$). Since $\Theta$ is simply a parameter vector that indexes a family of density functions and that can assume a range of alternative values (those specified in $\Omega$), the reader might wonder what such phrases could possibly mean. That is, what are we estimating if we are estimating, say, $\Theta$? The phrases are used as a shorthand or an abbreviated way of stating that one is estimating the true value of $\Theta$, or estimating the true value of $q(\Theta)$, or that one has an estimate of the true value of $\Theta$ (or of $q(\Theta)$). There is widespread use of such phrases in the statistics and econometrics literature, and we will make frequent use of such phrases in this book as well. In general, one must rely on the context of the discussion to be sure whether $\Theta$ or $q(\Theta)$ refers to the quantity being estimated or merely to the indexing parameter of a family of joint density functions.

Point estimation is concerned with estimating $\Theta$ or $q(\Theta)$ from knowledge of the outcome $x = [x_1,\ldots,x_n]'$ of a random sample $X$. It follows from this basic description of point estimation that functions are critical to the estimation problem, where inputs or domain elements are sample outcomes, $x$, and outputs or range elements are estimates of $\Theta$ or $q(\Theta)$. More formally, estimates will be generated via some function of the form $t: R(X) \rightarrow R(t)$, where $R(t)$ is the range of $t$ defined as $R(t) = \{t : t = t(x), x \in R(X)\}$. Note that $R(t)$ represents the set of all possible estimates of $\Theta$ or $q(\Theta)$ that can be generated as outcomes of $t(X)$. We will always tacitly assume that $t(X)$ is an observable random variable, and hence a statistic, so that estimates are observable and empirically informative.

Henceforth, when the function $t: R(X) \rightarrow R(t)$ represented by $t = t(x)$ is being utilized to generate estimates of $q(\Theta)$, we will refer to the random variable $T = t(X)$ as an estimator for $q(\Theta)$, and $q(\Theta)$ will be referred to as the estimand. An outcome, $t = t(x)$, of the estimator will be referred to as an estimate of $q(\Theta)$. We formalize these three terms in the following definition:

Definition 7.4 Point Estimator, Estimate, and Estimand: A statistic or vector of statistics, $T = t(X)$, whose outcomes are used to estimate the value of a scalar or vector function, $q(\Theta)$, of the parameter vector, $\Theta$, is called a point estimator, with $q(\Theta)$ being called the estimand.² An observed outcome of an estimator is called a point estimate.

² Note, as always, that the function $q$ can be the identity function $q(\Theta) \equiv \Theta$, in which case we could be referring to estimating the vector $\Theta$ itself. Henceforth, it will be understood that since $q(\Theta) \equiv \Theta$ is a possible choice of $q(\Theta)$, all discussion of estimating $q(\Theta)$ could be referring to estimating the vector $\Theta$ itself.
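To fix the estimator/estimate/estimand terminology with a minimal added sketch: the estimand is $q(\Theta) = \mu$, the function $t$ defines the estimator $T = t(X)$, and the value of $t$ at an observed outcome is the estimate. The exponential population is an assumed choice for illustration only.

```python
import numpy as np

def t(x):
    """The estimator-defining function t: R(X) -> R(t); here the sample mean."""
    return float(np.mean(x))

rng = np.random.default_rng(1)
mu_true = 5.0                                        # true value of the estimand (unknown in practice)
x_outcome = rng.exponential(scale=mu_true, size=50)  # observed outcome x of the random sample X
print(t(x_outcome))                                  # a point estimate of the estimand mu
```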


Figure 7.1 contains a schematic overview of the general context of the point estimation problem to this point.

7.3.1 Evaluating Performance of Estimators

Since there is literally an uncountably infinite set of possible functions of $X$ that are potential estimators of $q(\Theta)$, a fundamental problem in point estimation is the choice of a "good" estimator. In order to rank the efficacy of estimators and/or to choose the optimal estimator of $q(\Theta)$, an objective function that establishes an appropriate measure of "goodness" must be defined.

A natural measure to use in ranking estimators would seem to be the distance between outcomes of $t(X)$ and $q(\Theta)$, which is a direct measure of how close estimates are to what is being estimated. In the current context, this distance measure is $d(t(x), q(\Theta)) = ([t(x) - q(\Theta)]'[t(x) - q(\Theta)])^{1/2}$, which specializes to $|t(x) - q(\Theta)|$ when $k = 1$. However, this closeness measure has an obvious practical flaw for comparing alternative functions for estimating $q(\Theta)$: the estimate that would be preferred depends on the true value of $q(\Theta)$, which is unknown (or else there would be no point estimation problem in the first place). This problem is clearly not the fault of the particular closeness measure chosen, since any reasonable measure of closeness between the two values $t(x)$ and $q(\Theta)$ would depend on where $q(\Theta)$ actually is in $\mathbb{R}^k$ vis-à-vis where $t(x)$ is located. Thus, comparing alternative functions for estimating $q(\Theta)$ on the basis of the closeness of an actual estimate $t(x)$ to $q(\Theta)$ is not tractable; we clearly need additional criteria with which to judge whether $t(X)$ generates "good" estimates of $q(\Theta)$.

[Figure 7.1 schematic: specify probability model $\{f(x_1,\ldots,x_n;\Theta), \Theta \in \Omega\}$; observe outcome of random sample.]

Various criteria for judging the usefulness of a given estimator $t(X)$ for estimating $q(\Theta)$ have been presented in the literature.³ The measures evaluate and rank estimators in terms of closeness of estimates to $q(\Theta)$ in an expected or probabilistic sense. Note that since $t(X)$ is a function of $X$, and thus a random variable, a sampling distribution (i.e., the probability distribution of $t(X)$) exists on $R(T)$ that is induced by the probability distribution of the random sample, $X = (X_1,\ldots,X_n)$. Roughly speaking, the fact that the distribution of $X$ depends on $\Theta$ will generally result in the sampling distribution of $t(X)$ depending on $\Theta$ as well, and this latter dependence can lead to changes in the location, spread, and/or shape of the distribution of $t(X)$ as $\Theta$ changes. If the sampling distribution of $t(X)$ changes with $\Theta$ in a way that keeps the spread of potential estimates generated by $t(X)$ narrowly focused on $q(\Theta)$, so that outcomes of $t(X)$ occur near $q(\Theta)$ with high probability under all contingencies for $\Theta \in \Omega$ (see Figure 7.2), then the function $T$ would be useful for generating estimates of $q(\Theta)$.

We now turn our attention to specific estimator properties that have been used in practice to measure whether these objectives have been achieved. In discussing estimator properties, we will sometimes utilize a $\Theta$-subscript, as in $E_\Theta(\cdot)$, $P_\Theta(\cdot)$, or $\text{var}_\Theta(\cdot)$, to emphasize that expectations or probabilities are being calculated using a particular value of $\Theta$ for the parameter vector of the underlying probability distribution. In cases where the parametric context of expectations and probabilities is clear or does not need to be distinguished, the subscript $\Theta$ will not be explicitly displayed.

7.3.2 Finite Sample Properties

The properties examined in this section evaluate the performance of estimators when the random sample is of fixed size, and they are therefore referred to as finite sample properties. This is as opposed to asymptotic properties, which we will examine later in this section, and which relate to limiting results established as the random sample size increases without bound (increases to infinity). All of the finite sample properties examined here are based on the first two moments of estimators, and thus relate to the central tendency of estimates as well as the spread of estimates around their central tendency. Of course, if these moments do not exist for a given estimator, then these finite sample properties cannot be used to evaluate the performance of the estimator.

Mean Square Error and Relative Efficiency. The term mean square error (MSE) is an alternative term for the expected squared distance between outcomes of an estimator $T = t(X)$ and what it is estimating, the estimand $q(\Theta)$. When $T$ and $q(\Theta)$ are scalars, the following definition applies: the mean square error of a scalar estimator $T$ of $q(\Theta)$ is $\text{MSE}_\Theta(T) \equiv E_\Theta(T - q(\Theta))^2$, $\forall \Theta \in \Omega$.

Definition 7.6 Estimator Bias: The bias of a scalar estimator $T$ of $q(\Theta)$ is defined as $\text{bias}_\Theta(T) = E_\Theta(T - q(\Theta))$, $\forall \Theta \in \Omega$. The bias vector of a vector estimator $T$ of $q(\Theta)$ is defined as $\text{Bias}_\Theta(T) = E_\Theta(T) - q(\Theta)$, $\forall \Theta \in \Omega$.

In the multivariate case, the MSE criterion is generalized through the use of the mean square error matrix.

To appreciate the information content of $\text{MSE}_\Theta(T)$, first note that the diagonal of the MSE matrix contains the MSE of the estimator $T_i$ for $q_i(\Theta)$, $i = 1,\ldots,k$, since the ith diagonal entry in $\text{MSE}_\Theta(T)$ is $E_\Theta(T_i - q_i(\Theta))^2$. More generally, let $c$ be any $(k \times 1)$ vector of constants, and examine the MSE of the linear combination $c'T = \sum_{i=1}^{k} c_i T_i$ as an estimator of $c'q(\Theta) = \sum_{i=1}^{k} c_i q_i(\Theta)$, defined by

$$\text{MSE}_\Theta(c'T) = E_\Theta(c'T - c'q(\Theta))^2 = E_\Theta\, c'[T - q(\Theta)][T - q(\Theta)]'c = c'\,\text{MSE}_\Theta(T)\,c.$$

Thus, the MSEs of every possible linear combination of the $T_i$'s, used as estimators of the corresponding linear combinations of the $q_i(\Theta)$'s, can be obtained from the MSE matrix. Note further that the trace of the MSE matrix defines the expected squared distance of the vector estimator $T$ from the vector estimand $q(\Theta)$, as

$$\text{tr}(\text{MSE}_\Theta(T)) = \text{tr}\, E_\Theta[T - q(\Theta)][T - q(\Theta)]' = E_\Theta[T - q(\Theta)]'[T - q(\Theta)] = E_\Theta\, d^2(T, q(\Theta)).$$

This is the direct vector analogue to the measure of closeness of $T$ to $q(\Theta)$ that is provided by the MSE criterion in the scalar case.

The MSE matrix can be decomposed into variance and bias components, analogous to the scalar case. Specifically, $\text{MSE}(T)$ is equal to the sum of the covariance matrix of $T$ and the outer product of the bias vector of $T$, as

$$\text{MSE}_\Theta(T) = E_\Theta\big[T - E_\Theta(T) + E_\Theta(T) - q(\Theta)\big]\big[T - E_\Theta(T) + E_\Theta(T) - q(\Theta)\big]' = \text{Cov}_\Theta(T) + \text{Bias}_\Theta(T)\,\text{Bias}_\Theta(T)'.$$

The outer product of the bias vector forms a $(k \times k)$ matrix that is called the bias matrix.
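The decomposition $\text{MSE}_\Theta(T) = \text{Cov}_\Theta(T) + \text{Bias}_\Theta(T)\,\text{Bias}_\Theta(T)'$ is easy to confirm by simulation. The following added sketch does so in the scalar case for the deliberately biased estimator $T = (n/(n+1))\bar{X}$, under an assumed normal population.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 3.0, 2.0, 25, 200_000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
T = (n / (n + 1)) * xbar                 # a biased estimator of mu

mse = np.mean((T - mu) ** 2)             # Monte Carlo estimate of E(T - q)^2
var, bias = np.var(T), np.mean(T) - mu
print(mse, var + bias**2)                # the two quantities agree up to simulation error
```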

In the case of a scalar $q(\Theta)$, estimators with smaller MSEs are preferred. Note, however, that since the true $\Theta$ is unknown (or else there is no point estimation problem to begin with), one must consider the performance of an estimator for all possible contingencies for the true value of $\Theta$, which is to say, for all $\Theta \in \Omega$. It is quite possible, and often the case, that an estimator will have lower MSE than another estimator for some values of $\Theta \in \Omega$ but not for others. These considerations lead to the concepts of relative efficiency and relatively more efficient.

Definition 7.9 Estimator Admissibility: Let $T$ be an estimator of $q(\Theta)$. If there exists another estimator of $q(\Theta)$ that is relatively more efficient than $T$, then $T$ is called inadmissible for estimating $q(\Theta)$. Otherwise, $T$ is called admissible.

It is evident that if one is judging the performance of estimators on the basis of MSE, the analyst need not consider any estimators that are inadmissible.

Example 7.3 Let $T = \bar{X}$ be the sample mean of a random sample of size $n = 25$ from a Bernoulli($p$) population distribution, and consider the alternative estimator $T^* = (n/(n+1))\bar{X} = (25/26)\bar{X}$ of $p$. Which estimator is preferred on the basis of MSE?

Answer: Note that $\text{bias}(T) = E(\bar{X}) - p = 0$, $\text{bias}(T^*) = E\big((n/(n+1))\bar{X}\big) - p = -p/(n+1) = -p/26$, $\text{var}(T) = p(1-p)/n = p(1-p)/25$, and $\text{var}(T^*) = np(1-p)/(n+1)^2 = p(1-p)/27.04$. Then the MSEs of the two estimators are given by

$$\text{MSE}(T) = \frac{p(1-p)}{25} \quad \text{and} \quad \text{MSE}(T^*) = \frac{p(1-p)}{27.04} + \frac{p^2}{676}.$$

Examine the MSE of $T^*$ relative to the MSE of $T$, as

$$\text{RE}_p(T^*, T) = \frac{\text{MSE}(T^*)}{\text{MSE}(T)} = .9246 + .0370\, p/(1-p).$$

Since the ratio depends on the value of $p$, which is unknown, we must consider all of the possible contingencies for $p \in [0,1]$. Note that the ratio is monotonically increasing in $p$, taking its smallest value of .9246 when $p = 0$ and diverging to infinity as $p \rightarrow 1$. The ratio of MSEs equals 1 when $p = .6708$. Thus, without constraints on the potential values of $p$, neither estimator is preferred to the other on the basis of MSE, and thus neither estimator is rendered inadmissible. □
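The relative efficiency function in Example 7.3 can be traced numerically. The short added sketch below evaluates $\text{RE}_p(T^*,T)$ over a grid of $p$ values and recovers the crossover point near $p = .6708$.

```python
import numpy as np

n = 25
p = np.linspace(0.01, 0.95, 2000)
mse_T = p * (1 - p) / n                                      # MSE of the sample mean
mse_Tstar = n * p * (1 - p) / (n + 1)**2 + (p / (n + 1))**2  # variance + squared bias
re = mse_Tstar / mse_T                                       # = .9246 + .0370 * p/(1-p)
print(re[0])                                                 # ~ .925 near p = 0
print(p[np.argmin(np.abs(re - 1.0))])                        # crossover: about p = .671
```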

In contrast to the scalar case, a myriad of different MSE comparisons are possible when $q(\Theta)$ is a $(k \times 1)$ vector. First of all, there are $k$ individual MSE comparisons that can be made between corresponding entries in the two estimators $T^*$ and $T$. One could also compare the expected squared distances of $T^*$ and $T$ from $q(\Theta)$, which is equivalent to comparing the sums of the mean square errors of the entries in $T^*$ and $T$. Furthermore, one could contemplate estimating linear combinations of the entries in $q(\Theta)$ via corresponding linear combinations of the entries in $T^*$ and $T$, so that MSE comparisons between the estimators $\ell'T^*$ and $\ell'T$ for $\ell'q(\Theta)$ are then of interest. All of the preceding MSE comparisons are accounted for simultaneously in the following strong mean square error (SMSE) criterion.

Definition 7.10 Strong Mean Square Error Superiority: Let $T^*$ and $T$ be estimators of the $(k \times 1)$ vector $q(\Theta)$. $T^*$ is SMSE superior to $T$ iff $\text{MSE}_\Theta(T^*) - \text{MSE}_\Theta(T)$ is negative semidefinite $\forall \Theta \in \Omega$, and $\text{MSE}_\Theta(T^*) - \text{MSE}_\Theta(T) \neq 0$ for some $\Theta \in \Omega$.

If $T^*$ is SMSE superior to $T$, it follows directly from Definition 7.10 that $\text{MSE}_\Theta(T_i^*) \leq \text{MSE}_\Theta(T_i)\ \forall i$ and $\forall \Theta \in \Omega$, because if $\text{MSE}_\Theta(T^*) - \text{MSE}_\Theta(T)$ is negative semidefinite, the matrix difference necessarily has nonpositive diagonal entries.⁴ It follows that

$$\text{MSE}_\Theta(\ell'T^*) = \ell'\,\text{MSE}_\Theta(T^*)\,\ell \leq \ell'\,\text{MSE}_\Theta(T)\,\ell = \text{MSE}_\Theta(\ell'T) \quad \forall \Theta \in \Omega \text{ and } \forall \ell.$$

Thus, in the sense of all of the MSE comparisons defined previously, $T^*$ is at least as good as $T$.

The fact that $\text{MSE}_\Theta(T^*) - \text{MSE}_\Theta(T)$ is negative semidefinite and unequal to the zero matrix for some $\Theta \in \Omega$ implies that some of the aforementioned MSE comparisons become strong inequalities ($<$) for some $\Theta$. To see this, note that a nonzero negative semidefinite symmetric matrix necessarily has one or more negative diagonal entries.⁵ Therefore, $\text{MSE}_\Theta(T_i^*) < \text{MSE}_\Theta(T_i)$ for some $\Theta$ and $i$, so that $E_\Theta(d^2(T^*, q(\Theta))) < E_\Theta(d^2(T, q(\Theta)))$ for some $\Theta$, and $\text{MSE}_\Theta(\ell'T^*) < \text{MSE}_\Theta(\ell'T)$ for some $\Theta$ and $\ell$. Thus, $T^*$ is superior to $T$ for at least some MSE comparisons, in addition to being no worse for any of the MSE comparisons. We can now define multivariate analogues to the notions of relative efficiency and admissibility.

⁴ By definition, $A$ is negative semidefinite iff $\ell'A\ell \leq 0\ \forall \ell$. The ith diagonal entry of $A$ is given by $\ell'A\ell$ with $\ell$ being a zero vector except for a 1 in the ith position.

⁵ A nonzero matrix has at least unit rank. The rank of a negative semidefinite symmetric matrix is equal to the number of negatively valued eigenvalues, and all eigenvalues of a negative semidefinite matrix are nonpositive. The trace of a matrix is equal to the sum of its eigenvalues. Since all diagonal entries in a negative semidefinite matrix must be nonpositive, a nonzero negative semidefinite symmetric matrix must have one or more negative diagonal entries.

Definition 7.11 Relative Efficiency and Admissibility with Respect to SMSE: Let $T^*$ and $T$ be estimators of the $(k \times 1)$ vector $q(\Theta)$. If $T^*$ is SMSE superior to $T$, then $T^*$ is said to be relatively more efficient than $T$. If there exists an estimator that is relatively more efficient than $T$, then $T$ is said to be inadmissible. Otherwise, $T$ is said to be admissible.

As in the scalar case, if MSE is being used to measure estimator performance, the analyst need not consider any estimators of $q(\Theta)$ that are inadmissible when searching for good estimators of $q(\Theta)$.⁶

In either the scalar or multivariate case, a natural question to ask is whether an optimal estimator exists that has the smallest MSE or MSE matrix among all estimators of $q(\Theta)$. We might call such an estimator most efficient, or simply efficient. Unfortunately, no such estimator exists in general. To clarify the issues involved, consider the scalar case and note that the degenerate estimator $T^* = t^*(X) = \Theta_o$ would certainly have minimum mean square error for estimating $\Theta$ if mean square error were evaluated at the point $\Theta = \Theta_o$, i.e., $\text{MSE}_\Theta(T^*) = 0$ for $\Theta = \Theta_o$. Since a similar degenerate estimator could be defined for each $\Theta \in \Omega$, then for a given estimator to have minimum mean square error for every potential value of $\Theta$ (i.e., uniformly in $\Theta$) it would be necessary that $\text{MSE}_\Theta(T) = 0\ \forall \Theta \in \Omega$, which would imply that $\text{var}_\Theta(T) = 0\ \forall \Theta \in \Omega$ and, thus, that $P_\Theta(t(x) = \Theta) = 1\ \forall \Theta \in \Omega$. In order to construct an estimator $T$ that satisfies the condition $P(t(x) = \Theta) = 1\ \forall \Theta \in \Omega$, it would be necessary to be able to identify the true value of $\Theta$ directly upon observing the sample outcome, $x$. This essentially requires that the range of the random sample be dependent on the value of $\Theta$, denoted as $R_\Theta(X)$, in such a way that the sets $R_\Theta(X)$, $\Theta \in \Omega$, are all mutually exclusive, i.e., $R_{\Theta'}(X) \cap R_{\Theta''}(X) = \emptyset$ for $\Theta' \neq \Theta''$. Then, upon observing $x$, one would only need to identify the set $R_\Theta(X)$ to which $x$ belonged, and $\Theta$ would be immediately known. This is rarely, if ever, possible in practice, and so adopting a minimum mean square error criterion for choosing an estimator of $q(\Theta)$ is not feasible. A similar argument leads to the conclusion that there is in general no estimator of a $(k \times 1)$ vector $q(\Theta)$ whose MSE matrix is smallest among the MSE matrices of all estimators of $q(\Theta)$.⁷

While there generally does not exist an estimator that has a uniformly (i.e., for all $\Theta \in \Omega$) minimum MSE or MSE matrix relative to all other estimators of $q(\Theta)$, it is often possible to find an optimal estimator if one restricts the type of estimators under consideration. Two such restrictions that have been widely used in practice are unbiasedness and linearity, which we will examine in the next two subsections.

⁶ Some analysts use a weak mean square error (WMSE) criterion that relates only to expected squared distance considerations: $T^*$ is WMSE superior to $T$ iff $E_\Theta(d^2(T^*, q(\Theta))) \leq E_\Theta(d^2(T, q(\Theta)))\ \forall \Theta \in \Omega$, and $<$ for some $\Theta \in \Omega$. Relative efficiency and admissibility can be defined in the context of WMSE superiority and are left to the reader.

⁷ By "smallest MSE matrix," we mean that $\text{MSE}_\Theta(T^*) - \text{MSE}_\Theta(T)$ is a negative semidefinite matrix for all estimators $T$ of $q(\Theta)$ and for all $\Theta$.


Unbiasedness. The property of unbiasedness refers to the balancing point, or expectation, of an estimator's probability distribution being equal to what is being estimated.

Definition 7.12 Unbiased Estimator: An estimator $T$ is said to be an unbiased estimator of $q(\Theta)$ iff $E_\Theta(T) = q(\Theta)\ \forall \Theta \in \Omega$. Otherwise, the estimator is said to be biased.

As in the case of the MSE criteria, it is important to appreciate the significance of the condition $\forall \Theta \in \Omega$ in the above definition. In the context of the point estimation problem, we have assumed that the true value of $\Theta$, say $\Theta^*$, is some element of the specified parameter space, $\Omega$, but we do not know which one. Thus, the property of unbiasedness is stated for all possible contingencies regarding the potential values for the true value of $\Theta$. Due to the condition $\forall \Theta \in \Omega$, the requirement for unbiasedness essentially means that $E_\Theta(T) = q(\Theta)$ regardless of which value of $\Theta \in \Omega$ is the true value. Thus, for $T$ to be unbiased, its density function must be balanced on the point $q(\Theta)$, whatever the true value of $\Theta$. Whether or not $T$ has the unbiasedness property depends on the functional definition of $T$ and, in particular, on how the function translates the density function of $X \sim f(x_1,\ldots,x_n;\Theta)$ into the density function of $T \sim f(t;\Theta)$.

An unbiased estimator has the intuitively appealing property of being equal to $q(\Theta)$ on average, the phrase having two useful interpretations. First, since the expectation operation is inherently a weighted average of the outcomes of $T$, the outcomes of $T$ have a weighted average equal to $q(\Theta)$. Alternatively, if one were to repeatedly and independently observe outcomes of the random sample $X$, and thus repeatedly generate estimates of $q(\Theta)$ using corresponding outcomes of the vector $T$, then the simple average of all of the observed estimates would converge in probability (and, in fact, converge almost surely) elementwise to $q(\Theta)$ by Khinchin's WLLN (or by Kolmogorov's SLLN in the case of almost-sure convergence), provided only that $q(\Theta)$ is finite.
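The second "on average" interpretation is directly observable in simulation: averaging unbiased estimates across repeated, independently observed samples drives the average toward $q(\Theta)$, as the WLLN asserts. A brief added sketch under an assumed uniform population:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, reps, n = 0.7, 50_000, 10      # Uniform(0, 2*theta) has mean theta

# Each row is one observed random sample; each row mean is one unbiased estimate of theta.
estimates = rng.uniform(0.0, 2.0 * theta, size=(reps, n)).mean(axis=1)
print(estimates[:10].mean())          # average of a few estimates: still noisy
print(estimates.mean())               # average of many estimates: close to theta = 0.7
```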

We provide the following example of an unbiased estimator of a parameter.

Example 7.4 Let $(X_1,\ldots,X_n)$ be a random sample from a population distribution with mean $\theta > 0$, and let $T = n^{-1}\sum_{i=1}^{n} X_i$. Then $E(T) = n^{-1}\sum_{i=1}^{n} E(X_i) = \theta$ regardless of the value of $\theta > 0$. Thus, for example, if the true value of $\theta$ were 2, then $E(T) = 2$, or if the true value of $\theta$ were 100, then $E(T) = 100$. □

MVUE, MVLUE or BLUE, and Efficiency. The unbiasedness criterion ensures only that an estimator will have a density that has a central tendency, or balancing point, of $q(\Theta)$. However, it is clear that we would also desire that the density not be too spread out around this balancing point, for fear that an estimate could be generated that is a significant distance from $q(\Theta)$ with high probability. Graphically, we would prefer the estimator $T$ to the estimator $T^*$ in Figure 7.3, where both of these estimators are unbiased estimators of $q(\Theta)$.

The foregoing considerations suggest that, if one wishes to use an unbiased estimator of $q(\Theta)$, one should use the unbiased estimator that also has minimum variance, or minimum covariance matrix if $T$ is a vector, among all unbiased estimators of $q(\Theta)$. Since $\text{Bias}_\Theta(T) = 0$ for all estimators in the unbiased class of estimators, $\text{MSE}_\Theta(T) = \text{var}_\Theta(T)$ or $\text{MSE}_\Theta(T) = \text{Cov}_\Theta(T)$, and we can thus view the objective of minimizing $\text{var}(T)$ or $\text{Cov}(T)$ equivalently as searching for the estimator with the smallest MSE or smallest MSE matrix within the class of unbiased estimators. In the definition below, we introduce the notation $A \preceq B$ to indicate that matrix $A$ is smaller than matrix $B$ by a negative semidefinite matrix, i.e., $A - B = C$ is a negative semidefinite matrix.

Definition 7.13 Minimum Variance Unbiased Estimator (MVUE): An unbiased estimator $T$ of $q(\Theta)$ is a MVUE iff, for any other unbiased estimator $T^*$ of $q(\Theta)$, $\text{var}_\Theta(T) \leq \text{var}_\Theta(T^*)$ in the scalar case, or $\text{Cov}_\Theta(T) \preceq \text{Cov}_\Theta(T^*)$ in the vector case, $\forall \Theta \in \Omega$.

Definition 7.13 implies that an estimator is a MVUE if the estimator is unbiased and if there is no other unbiased estimator that has a smaller variance or covariance matrix for any $\Theta \in \Omega$. Drawing direct analogies to the discussion of the MSE criteria, a MVUE, $T$, is such that $\text{MSE}_\Theta(T_i) = \text{var}_\Theta(T_i) \leq \text{var}_\Theta(T_i^*) = \text{MSE}_\Theta(T_i^*)\ \forall \Theta \in \Omega$ and $\forall i$, where $T^*$ is any estimator in the unbiased class of estimators. Furthermore, $E_\Theta(d^2(T, q(\Theta))) \leq E_\Theta(d^2(T^*, q(\Theta)))\ \forall \Theta \in \Omega$, and $\text{MSE}_\Theta(\ell'T) = \text{var}_\Theta(\ell'T) \leq \text{var}_\Theta(\ell'T^*) = \text{MSE}_\Theta(\ell'T^*)\ \forall \Theta \in \Omega$ and $\forall \ell$. Thus, within the class of unbiased estimators, a MVUE of $q(\Theta)$ is at least as good as any other estimator of $q(\Theta)$ in terms of all of the types of MSE comparisons that we have discussed previously. If $T$ is a MVUE for $q(\Theta)$, then $T$ is said to be efficient within the class of unbiased estimators.

Unfortunately, without the aid of theorems that facilitate the discovery of MVUEs, finding a MVUE of $q(\Theta)$ can be quite challenging even when the point estimation problem appears to be quite simple. The following example illustrates the general issues involved.

Consider finding a minimum variance unbiased estimator of $p$ based on a random sample $(X_1, X_2)$ of size 2 from a Bernoulli($p$) population distribution. For $t(X)$ to be an unbiased estimator of $p$, we require

$$E(t(X)) = t(0,0)(1-p)^2 + t(0,1)(1-p)p + t(1,0)p(1-p) + t(1,1)p^2 = p \quad \forall p \in [0,1].$$

This unbiasedness condition implies the following set of restrictions on the definition of $t(X)$: matching coefficients on the powers of $p$ requires $t(0,0) = 0$, $t(1,1) = 1$, and $t(0,1) + t(1,0) = 1$. The variance of the estimator is then

$$\text{var}(t(X)) = 2p^2(1-p)^2 + (t(0,1) - p)^2(1-p)p + (t(1,0) - p)^2 p(1-p),$$

where we have used the facts that $E(t(X)) = p$, $t(0,0) = 0$, and $t(1,1) = 1$, since $t(X)$ must be unbiased. Also because of the unbiasedness condition, we can substitute $t(0,1) = 1 - t(1,0)$ into the variance expression to obtain

$$\text{var}(t(X)) = 2p^2(1-p)^2 + (1 - p - t(1,0))^2(1-p)p + (t(1,0) - p)^2 p(1-p).$$

The first-order condition for a minimum of the variance is given by

$$\frac{d\,\text{var}(t(X))}{d\,t(1,0)} = 2(1-p)p\,(2t(1,0) - 1) = 0,$$

which is solved by $t(1,0) = 1/2$ for any $p \in (0,1)$, and hence $t(0,1) = 1/2$, so that the minimizing unbiased estimator is $t(X) = (X_1 + X_2)/2 = \bar{X}$.
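The first-order condition in this example can be verified numerically: for any fixed $p \in (0,1)$, the variance expression is minimized at $t(1,0) = 1/2$, confirming that $t(X) = \bar{X}$ is the minimizing unbiased estimator. An added check:

```python
import numpy as np

def var_t(t10, p):
    """Variance of the unbiased t(X), after substituting t(0,1) = 1 - t(1,0)."""
    return (2 * p**2 * (1 - p)**2
            + (1 - p - t10)**2 * (1 - p) * p
            + (t10 - p)**2 * p * (1 - p))

t10 = np.linspace(0.0, 1.0, 1001)
for p in (0.2, 0.5, 0.8):
    print(p, t10[np.argmin(var_t(t10, p))])   # minimizer is 0.5 for every p
```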

A number of general theorems that can often be used to simplify the search for a MVUE will be presented in Section 7.5.

For purposes of simplicity and tractability, as well as for cases where little can be assumed about the probability model other than conditions on low-order moments, attention is sometimes restricted to estimators that are unbiased and that have minimum variance, or minimum covariance matrix, among all unbiased estimators that are linear functions of the sample outcome. Such an estimator is called a BLUE or MVLUE, as indicated in the following definition.

An estimator $T$ is said to be a BLUE (best linear unbiased estimator) or MVLUE (minimum variance linear unbiased estimator) of $q(\Theta)$ iff:

1. $T$ is a linear function, $T = t(X) = AX + b$, of the random sample $X$;
2. $T$ is unbiased for $q(\Theta)$, i.e., $E_\Theta(T) = q(\Theta)\ \forall \Theta \in \Omega$; and
3. $\text{var}_\Theta(T) \leq \text{var}_\Theta(T^*)$ (or $\text{Cov}_\Theta(T) \preceq \text{Cov}_\Theta(T^*)$ in the vector case) $\forall \Theta \in \Omega$, for any other linear unbiased estimator $T^*$ of $q(\Theta)$.

Consider finding the BLUE of the mean $\mu$ of a population distribution $f(z;\Theta)$ having a finite mean $\mu$ and variance $\sigma^2$, based on a random sample $(X_1,\ldots,X_n)$.

Answer: We are examining linear estimators, and thus $t(X) = \sum_{i=1}^{n} a_i X_i + b$. For $T$ to be unbiased, we require that $\sum_{i=1}^{n} a_i = 1$ and $b = 0$, since $E(T) = E\big(\sum_{i=1}^{n} a_i X_i + b\big) = \mu \sum_{i=1}^{n} a_i + b$, which equals $\mu$ for all admissible values of $\mu$ iff $\sum_{i=1}^{n} a_i = 1$ and $b = 0$. The variance of $T$ is simply $\sigma^2 \sum_{i=1}^{n} a_i^2$, because $(X_1,\ldots,X_n)$ is a random sample from $f(z;\Theta)$. Thus, to find the BLUE, we must solve the following minimization problem:

$$\min_{a_1,\ldots,a_n} \sigma^2 \sum_{i=1}^{n} a_i^2 \quad \text{subject to} \quad \sum_{i=1}^{n} a_i = 1.$$

The solution sets $a_i = n^{-1}\ \forall i$, yielding $t(X) = n^{-1}\sum_{i=1}^{n} X_i = \bar{X}$, so that the sample mean is the BLUE (or MVLUE) of the mean of any population distribution having a finite mean and variance. The reader should check that the second-order conditions for a minimum are in fact met. □
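The constrained minimization in this example has the variance formula $\text{var}(\sum_i a_i X_i) = \sigma^2 \sum_i a_i^2$ at its core, so the optimality of equal weights can be checked by comparing arbitrary unbiased weight vectors against $a_i = 1/n$. An added sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 20, 4.0

a_blue = np.full(n, 1.0 / n)              # equal weights: the sample mean
print(sigma2 * np.sum(a_blue**2))         # sigma^2 / n = 0.2, the minimum

for _ in range(3):
    a = rng.normal(size=n)
    a /= a.sum()                          # enforce sum(a_i) = 1, i.e., unbiasedness
    print(sigma2 * np.sum(a**2))          # always at least sigma^2 / n
```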

In addition to estimating the means of population distributions, a prominent BLUE arises in the context of least squares estimation of the parameters of a general linear model, which we will examine in Chapter 8.

7.3.3 Asymptotic Properties

When finite sample properties are intractable, or else inapplicable due to the nonexistence of the appropriate expectations that define means and variances, one generally resorts to asymptotic properties to rank the efficiency of estimators. In addition, asymptotic properties are of fundamental interest if the analyst is interested in assessing the effects on estimator properties of an ever-increasing number of sample observations.

Asymptotic properties of estimators are essentially equivalent in concept to the finite sample properties presented heretofore, except that asymptotic properties are based on the asymptotic distributions of estimators rather than estimators' exact finite sampling distributions. In particular, asymptotic analogues to MSE, relative efficiency, unbiasedness, and minimum variance unbiasedness can be defined with reference to asymptotic distributions of estimators. However, a problem of nonuniqueness of asymptotic properties arises because of the inherent nonuniqueness of asymptotic distributions.

To clarify the difficulties that can arise when using asymptotic distributions as a basis for defining estimator properties, let $T_n$ denote an estimator of the scalar $q(\Theta)$ based on $n$ sample observations, and suppose $b_n^{-1}(T_n - q(\Theta)) \xrightarrow{d} N(0,1)$. Then one might consider defining asymptotic properties of $T_n$ in terms of the asymptotic distribution $N(q(\Theta), b_n^2)$. However, by Slutsky's theorem it follows that $(n/(n-k))^{1/2}\, b_n^{-1}(T_n - q(\Theta)) \xrightarrow{d} N(0,1)$ for a fixed value of $k$, since $(n/(n-k))^{1/2} \rightarrow 1$, so that an alternative asymptotic distribution could be $T_n \overset{a}{\sim} N\big(q(\Theta),\, ((n-k)/n)\,b_n^2\big)$, producing a different asymptotic variance with implications for estimator performance measures that are functions of the variance of estimators. The difficulty is that the centering and scaling required to achieve a limiting distribution is not unique, leading to both nonunique asymptotic distributions and nonunique asymptotic properties derived from them.

There are two basic ways of addressing the aforementioned nonuniqueness problem when dealing with asymptotic properties. One approach, which we will mention only briefly, is to rank estimators only on the basis of limits of asymptotic property comparisons, so as to remove the effects of any arbitrary scaling or centering from the comparison. For example, referring to the previous illustration of nonuniqueness, let the asymptotic distribution of $T_n$ be $N(q(\Theta), b_n^2)$, let $T_n^*$ denote the same estimator assigned the alternative asymptotic distribution $N\big(q(\Theta), ((n-k)/n)\,b_n^2\big)$, and define

$$\text{ARE}_\Theta(T_n, T_n^*) = \frac{\text{AMSE}_\Theta(T_n^*)}{\text{AMSE}_\Theta(T_n)} = \frac{n-k}{n}.$$

Using the ARE in this form, one would be led to the conclusion that $T_n^*$ is asymptotically relatively more efficient than $T_n$, which in the context of the previous illustration of nonuniqueness would be absurd, since $T_n$ and $T_n^*$ are the same estimator. However, $\lim_{n\to\infty} \text{ARE}_\Theta(T_n, T_n^*) = 1$, so that a comparison based on the limit of the ARE removes the artificial distinction between the two.

An alternative approach for avoiding nonuniqueness of asymptotic properties is to restrict the use of asymptotic properties to classes of estimators for which the problem will not occur. For our purposes, it will suffice to examine the consistent asymptotically normal (CAN) class of estimators (for other possibilities, see E. Lehmann, Point Estimation, pp. 347–348).

Prior to identifying the CAN class of estimators, we examine the property of consistency.

Consistency. A consistent estimator is an estimator that converges in probability (elementwise if $T_n$ is a vector) to what is being estimated.

Definition 7.15 Consistent Estimator: $T_n$ is said to be a consistent estimator of $q(\Theta)$ iff $\text{plim}_\Theta(T_n) = q(\Theta)\ \forall \Theta \in \Omega$.

Thus, for large enough $n$ (i.e., for large enough sample size), there is a high probability that the outcome of a scalar estimator $T_n$ will be in the interval $(q(\Theta) - \varepsilon,\, q(\Theta) + \varepsilon)$ for arbitrarily small $\varepsilon > 0$, regardless of the value of $\Theta$. Relatedly, the sampling density of $T_n$ concentrates on the true value of $q(\Theta)$ as the sample size $\rightarrow \infty$ if $T_n$ is a consistent estimator of $q(\Theta)$. Consistency is clearly a desirable property of an estimator, since it ensures that increasing sample information will ultimately lead to an estimate that is essentially certain to be arbitrarily close to what is being estimated, $q(\Theta)$.
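Consistency's concentration property is easy to see by simulation: $P(|\bar{X}_n - \mu| < \varepsilon)$ climbs toward one as $n$ grows. A minimal added sketch with an assumed gamma population (shape 3, scale 2, so $\mu = 6$):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, eps, reps = 6.0, 0.25, 20_000

for n in (10, 100, 1000):
    xbar = rng.gamma(3.0, 2.0, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - mu) < eps))   # P(|T_n - mu| < eps) -> 1
```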

Since $T_n \xrightarrow{m} q(\Theta)$ implies $T_n \xrightarrow{p} q(\Theta)$, we can state sufficient conditions for consistency of $T_n$ in terms of unbiasedness and in terms of variance convergence to zero. Specifically, if $T_n$ is unbiased, or if the bias vector converges to zero as $n \rightarrow \infty$, and if $\text{var}(T_n) \rightarrow 0$ as $n \rightarrow \infty$ (or $\text{Cov}(T_n) \rightarrow 0$ as $n \rightarrow \infty$ if $T_n$ is a vector), then $T_n$ is a consistent estimator of $q(\Theta)$ by mean square convergence. Note, however, that consistency of $T_n$ does not by itself imply that $E(T_n) \rightarrow q(\Theta)$ as $n \rightarrow \infty$. Consider the following counterexample.

Example 7.8 Consistency without $E(T_n) \rightarrow q(\Theta)$

Let the sampling density of $T_n$ be defined as $f(t_n;\Theta) = (1 - n^{-1/2})\, I_{\{\Theta\}}(t_n) + n^{-1/2}\, I_{\{n\}}(t_n)$. Note that as $n \rightarrow \infty$, $\lim_{n\to\infty} P[\,|t_n - \Theta| < \varepsilon\,] = 1$ for any $\varepsilon > 0$, and $T_n$ is consistent for $\Theta$. However, since $E(T_n) = \Theta(1 - n^{-1/2}) + n(n^{-1/2}) = \Theta(1 - n^{-1/2}) + n^{1/2}$, then as $n \rightarrow \infty$, $E(T_n) \rightarrow \infty$. □

The divergence of the expectation in Example 7.8 is due to the fact that the density function of $T_n$, although collapsing to the point $\Theta$ as $n \rightarrow \infty$, was not collapsing at a fast enough rate for the expectation to converge to $\Theta$. In particular, the density weighting assigned to the outcome $n$ in defining the expectation went to zero at a rate slower than the rate at which $n$ went to infinity as $n \rightarrow \infty$, causing the divergence. A sufficient condition for $T_n \xrightarrow{p} q(\Theta) \Rightarrow \lim_{n\to\infty} E(T_n) = q(\Theta)$ is provided in the following theorem:

Theorem 7.1: If $E(T_n^2)$ exists and is bounded $\forall n$, then convergence in probability implies convergence in mean, so that $T_n \xrightarrow{p} q(\Theta) \Rightarrow \lim_{n\to\infty} E(T_n) = q(\Theta)$.

Proof: Rao, Statistical Inference, p. 121. ■

Note that the sufficient condition given in Theorem 7.1 does not hold in Example 7.8.
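Example 7.8's two-point density can be simulated directly, making both conclusions visible at once: the probability mass at $\Theta$ tends to one (consistency), while the average of simulated $T_n$ outcomes grows like $n^{1/2}$ (the divergent expectation, which also shows why Theorem 7.1's bounded-second-moment condition fails there). An added sketch with the assumed value $\Theta = 2$:

```python
import numpy as np

rng = np.random.default_rng(7)
theta, reps = 2.0, 200_000

for n in (100, 10_000, 1_000_000):
    # T_n = theta with probability 1 - n**-0.5, and T_n = n with probability n**-0.5.
    t_n = np.where(rng.random(reps) < 1.0 - n**-0.5, theta, float(n))
    print(n, np.mean(t_n == theta), t_n.mean())   # mass at theta -> 1, mean ~ sqrt(n) + theta
```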

Consistent Asymptotically Normal (CAN) Estimators. The class of consistent asymptotically normal (CAN) estimators of $q(\Theta)$ is defined in the statistical literature to be the collection of all estimators of $q(\Theta)$ for which $n^{1/2}(T_n - q(\Theta)) \xrightarrow{d} N([0], \Sigma_T)$, where $\Sigma_T$ is a positive definite covariance matrix that may depend on the value of $\Theta$. We will allow this dependence to be implicit rather than utilize notation such as $\Sigma_T(\Theta)$. Note that the consistency of $T_n$ follows immediately, since by Slutsky's theorem $n^{-1/2}[n^{1/2}(T_n - q(\Theta))] = T_n - q(\Theta) \xrightarrow{d} 0 \cdot Z = 0$, where $Z \sim N(0, \Sigma_T)$, which implies $T_n - q(\Theta) \xrightarrow{p} 0$, or equivalently $T_n \xrightarrow{p} q(\Theta)$. The CAN class contains a large number of the estimators used in empirical work.

Because all of the estimators in the CAN class utilize precisely the same sequence of centering (i.e., $q(\Theta)$ is subtracted from $T_n$) and scaling (i.e., $T_n - q(\Theta)$ is multiplied by $n^{1/2}$), the problem of nonuniqueness of asymptotic distributions and properties does not arise. Asymptotic versions of MSEs, MSE matrices, bias vectors, variances, and covariance matrices can be defined via expectations taken with respect to the unique asymptotic distribution of estimators, where $T_n \overset{a}{\sim} N(q(\Theta), n^{-1}\Sigma_T)$. In particular, letting the prefix A denote an asymptotic property, and letting $E_A$ denote an expectation taken with respect to an asymptotic distribution, we have within the CAN class, $\forall \Theta \in \Omega$,

$$\text{AMSE}(T_n) = \text{ACov}(T_n) = E_A(T_n - q(\Theta))(T_n - q(\Theta))' \quad \text{(multivariate)}$$
$$= \text{Avar}(T_n) = E_A(T_n - q(\Theta))^2 \quad \text{(scalar)}$$

and $\text{ABias}(T_n) = E_A(T_n - q(\Theta)) = 0$.

The zero value of the asymptotic bias indicates that a CAN estimator of $q(\Theta)$ is necessarily asymptotically unbiased. We pause to note that there is a lack of consensus in the literature regarding the definition of asymptotic unbiasedness, and Example 7.8 is useful for illustrating the issues involved. Some statisticians define asymptotic unbiasedness of an estimator sequence in terms of the limit of the expected values of the estimators in the sequence, where $\lim_{n\to\infty} E(T_n) = q(\Theta)\ \forall \Theta \in \Omega$ characterizes an asymptotically unbiased estimator. Under this definition, the estimator in Example 7.8 would not be asymptotically unbiased, but rather would be asymptotically biased. It is clear that this definition of asymptotic unbiasedness requires that the expectations in the sequence exist, as they do in Example 7.8. Within the CAN class, the two definitions of asymptotic unbiasedness will coincide if the second-order moments of the estimators in the sequence $\{T_n\}$ are bounded (recall Theorem 7.1), since then $\lim_{n\to\infty} E(T_n) = q(\Theta) = E_A(T_n)$. Otherwise, the definitions may refer to different concepts of unbiasedness, as Example 7.8 demonstrates. Thus, one must discern the definition of asymptotic unbiasedness being used by any analyst from the context of the discussion.

Given the preceding definition of asymptotic properties, we can now define the meaning of asymptotic relative efficiency and asymptotic admissibility uniquely for CAN estimators.

Definition 7.16 Asymptotic Relative Efficiency and Asymptotic Admissibility: Let $T_n$ and $T_n^*$ be CAN estimators of $q(\Theta)$ with respective asymptotic covariance matrices $\Sigma_T$ and $\Sigma_{T^*}$.

a. In the scalar case, the asymptotic relative efficiency of $T_n$ and $T_n^*$ is $\text{ARE}_\Theta(T_n, T_n^*) = \Sigma_{T^*}/\Sigma_T\ \forall \Theta \in \Omega$. $T_n$ is asymptotically relatively more efficient than $T_n^*$ if $\text{ARE}_\Theta(T_n, T_n^*) \geq 1\ \forall \Theta \in \Omega$ and $> 1$ for some $\Theta \in \Omega$.

b. In the vector case, $T_n$ is asymptotically relatively more efficient than $T_n^*$ iff $\Sigma_T - \Sigma_{T^*}$ is negative semidefinite $\forall \Theta \in \Omega$ and $\Sigma_T - \Sigma_{T^*} \neq 0$ for some $\Theta \in \Omega$.

c. If there exists an estimator that is asymptotically relatively more efficient than $T_n$, then $T_n$ is asymptotically inadmissible. Otherwise, $T_n$ is asymptotically admissible.

A discussion of the meaning of ARE and asymptotic admissibility, as well as of all of the other asymptotic properties presented to this point, would be completely analogous to the discussion presented in the finite sample case, except that now all interpretations would be couched in terms of approximations based on asymptotic distributions. We leave it to the reader to draw the analogies.

Example 7.9

Let $(X_1,\ldots,X_n)$ be a random sample from an exponential population distribution with mean $\theta > 0$, so that $E(X_i) = \theta$, $E(X_i^2) = 2\theta^2$, and the MGF of each $X_i$ is $(1 - \theta t)^{-1}$. In addition to the sample mean $\bar{X}_n$, consider the estimator

$$T_n^* = t_n^*(X) = \left(\frac{1}{2}\, n^{-1} \sum_{i=1}^{n} X_i^2\right)^{1/2} = (M_2'/2)^{1/2},$$

where $M_2' = n^{-1}\sum_{i=1}^{n} X_i^2$ has $E(M_2') = 2\theta^2$ and $\text{var}(M_2') = (\mu_4' - (2\theta^2)^2)/n = 20\theta^4/n$, where $\mu_4' = \left[\,d^4(1 - \theta t)^{-1}/dt^4\,\right]_{t=0} = 24\theta^4$. Now note that $T_n^*$ is a continuous function of $M_2'$, so that $\text{plim}(T_n^*) = (\text{plim}(M_2')/2)^{1/2} = \theta$ by Theorem 5.5. Therefore, $T_n^*$ is a consistent estimator of $\theta$. Furthermore, $n^{1/2}(M_2' - 2\theta^2)$ has a normal limiting distribution, and $T_n^*$ is thus a CAN estimator. To see this, recall Theorem 5.39 on the asymptotic distribution of functions of asymptotically normal random variables, where in this application $T_n^*$ is a function of the asymptotically normal random variable $M_2' \overset{a}{\sim} N(2\theta^2, 20\theta^4/n)$. Since $G = d(v/2)^{1/2}/dv\,\big|_{v = 2\theta^2} = 1/(4\theta)$, it follows that $n^{1/2}(T_n^* - \theta) \xrightarrow{d} N(0,\, G\,[20\theta^4]\,G') = N(0,\, 1.25\,\theta^2)$.

In comparing $T_n^*$ with $\bar{X}_n$ as estimators of $\theta$, it is now clear that although both are consistent and asymptotically normal estimators of $\theta$, $\bar{X}_n$ is asymptotically more efficient than $T_n^*$, since in comparing the asymptotic variances of the limiting distributions of $n^{1/2}(\bar{X}_n - \theta)$ and $n^{1/2}(T_n^* - \theta)$, we have $\theta^2 < 1.25\,\theta^2\ \forall \theta > 0$. □
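A Monte Carlo check of Example 7.9 (added here): for exponential data with mean $\theta$, the simulated variances of $n^{1/2}(\bar{X}_n - \theta)$ and $n^{1/2}(T_n^* - \theta)$ should approach $\theta^2$ and $1.25\,\theta^2$, respectively.

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 2.0, 1_000, 10_000

x = rng.exponential(scale=theta, size=(reps, n))
xbar = x.mean(axis=1)
t_star = np.sqrt(0.5 * (x**2).mean(axis=1))       # T* = (M2'/2)^(1/2)

print(np.var(np.sqrt(n) * (xbar - theta)))        # close to theta^2        = 4.0
print(np.var(np.sqrt(n) * (t_star - theta)))      # close to 1.25 * theta^2 = 5.0
```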

Asymptotic Efficiency. At this point it would seem logical to proceed to a definition of asymptotic efficiency in terms of a choice of estimator in the CAN class that has the smallest asymptotic variance or covariance matrix $\forall \Theta \in \Omega$ (compare to Definition 7.13). Unfortunately, LeCam (1953)⁹ has shown that such an estimator does not exist without further restrictions on the class of estimators. In particular, LeCam (1953) effectively showed that for any CAN estimator one can always define an alternative estimator that has a smaller variance or covariance matrix for at least one $\Theta \in \Omega$. The implication of this result is that one cannot define an achievable lower bound to the asymptotic variances or covariance matrices of CAN estimators, so that no asymptotically optimal estimator exists.

On the other hand, LeCam (1953) also showed that under mild regularity conditions, there does exist a lower bound to the asymptotic variance or covariance matrix of a CAN estimator that holds for all Θ ∈ Ω except on a set of Θ-values having Lebesgue measure zero, which is the Cramer-Rao Lower Bound that will be discussed in Section 7.5. Note that the Lebesgue measure of a set of Θ-values can be thought of as the volume of the set within the k-dimensional parameter space. A set having Lebesgue measure zero is a set with zero volume in k-space, e.g., a collection of isolated points, or a set of points having dimension less than k (such as a square and its interior in a three-dimensional space, or a line in two-dimensional space). A set of Lebesgue measure zero is a nonstochastic analogue to a set having probability zero, and such a set is thus practically irrelevant relative to its complement. It is thus meaningful to speak

⁹ LeCam, L. (1953), "On Some Asymptotic Properties of Maximum Likelihood Estimates and Related Bayes Estimates," University of California Publications in Statistics, 1:277–330.


of a lower bound on the asymptotic variance or covariance matrix of a CAN estimator of q(Θ) that holds almost everywhere in the parameter space (i.e., except for a set of Lebesgue measure zero), and then a search for an estimator that achieves this bound becomes meaningful as well.

At this point we will state a general definition of asymptotic efficiency for CAN estimators. In Section 7.5, we will be much more precise about the functional form of the asymptotic covariance matrix of an asymptotically efficient estimator.

Definition 7.17
Asymptotic Efficiency

If Tₙ is a CAN estimator of q(Θ) having the smallest asymptotic covariance matrix among all CAN estimators ∀Θ ∈ Ω, except on a set of Lebesgue measure zero, then Tₙ is said to be asymptotically efficient.

As a final remark, it is possible to remove the qualifier "except on a set of Lebesgue measure zero" if the CAN class of estimators is further restricted so that only estimators that converge uniformly to the normal distribution are considered. Roughly speaking, uniform convergence of a function sequence Fₙ(x) to F(x) requires that the rate at which convergence occurs is uniform across all x in the domain of F(x), unlike ordinary convergence (recall Definition 5.7), which allows for the possibility that the rate is different for each x. The restricted class of estimators is called the Consistent Uniformly Asymptotically Normal (CUAN) class, and within the CUAN class it is meaningful to speak of an estimator that literally has the smallest asymptotic covariance matrix. The interested reader can consult C.R. Rao (1963), "Criteria of Estimation in Large Samples," Sankhya, Series A, pp. 189–206, for further details.

7.4 Sufficient Statistics

Sufficient statistics for a given estimation problem are a collection of statistics or, equivalently, a collection of functions of the random sample, that summarize or represent all of the information in a random sample that is useful for estimating any q(Θ). Thus, in place of the original random sample outcome, it is sufficient to have observations on the sufficient statistics to estimate any q(Θ). Of course, the random sample itself is a collection of n sufficient statistics, but an objective in defining sufficient statistics is to reduce the number of functions of the random sample needed to represent all of the sample information relevant for estimating q(Θ). If a small collection of sufficient statistics can be found for a given statistical model, then for defining estimators of q(Θ) it is sufficient to consider only functions of the smaller set of sufficient statistic outcomes, as opposed to functions of all n outcomes contained in the original random sample. In this way the sufficient statistics allow a data reduction step to occur in a point estimation problem. Relatedly, it will be shown that the search for estimators of q(Θ) having the MVUE property or small MSEs can always be restricted to functions of the smallest collection of sufficient statistics. Finally, if the


sufficient statistics have a special property, referred to as completeness, then an explicit procedure utilizing the complete sufficient statistics is available that is often useful in defining MVUEs. We begin by presenting a more rigorous definition of sufficient statistics.

Definition 7.18
Sufficient Statistics

Let X = [X₁,...,Xₙ]′ ~ f(x;Θ) be a random sample, and let S = [s₁(X),...,s_r(X)]′ be r statistics. The r statistics are said to be sufficient statistics for f(x;Θ) iff f(x;Θ|s) = h(x), i.e., the conditional density of X, given s, does not depend on the parameter vector Θ.¹⁰

An intuitive interpretation of Definition 7.18 is that once the outcomes of the r sufficient statistics are observed, there is no additional information on Θ in the sample outcome. The definition also implies that given the function values s(x) = s, no other function of X provides any additional information about Θ than that obtained from the outcomes s. To motivate these interpretations, first note that the conditional density function f(x;Θ|s) can be viewed as representing the probability distribution of all of the various ways in which random sample outcomes, x, occur so as to generate the conditional value of s. This is because the event being conditioned on requires that x satisfy s(x) = s. Definition 7.18 states that if S is a vector of sufficient statistics, then Θ is a ghost in f(x;Θ|s), i.e., the conditional density function really does not depend on the value of Θ since f(x;Θ|s) = h(x). It follows that the probabilistic behavior of the various ways in which x results in s(x) = s has nothing to do with Θ, i.e., it is independent of Θ. Thus, analyzing the various ways in which a given value of s can occur, or examining additional functions of X, cannot possibly provide any additional information about Θ since the behavior of the outcomes of X, conditioned on the fact that s(x) = s, is totally unrelated to Θ.

Example 7.10

Let X = (X₁,...,Xₙ) be a random sample from a Bernoulli population distribution representing whether or not a phone call solicitation results in a sale, so that

f(x;p) = p^{Σᵢ₌₁ⁿ xᵢ} (1 − p)^{n − Σᵢ₌₁ⁿ xᵢ} Πᵢ₌₁ⁿ I_{0,1}(xᵢ),

where p ∈ Ω = (0,1), xᵢ = 1 denotes a sale, and xᵢ = 0 denotes no sale on the ith call. In this case, Σᵢ₌₁ⁿ Xᵢ, representing the total number of sales in the sample, is a sufficient statistic for f(x;p).

To see that this is true, first note that the appropriate conditioning event in the context of Definition 7.18 would be s(x) = Σᵢ₌₁ⁿ xᵢ = s, i.e., the total number of

¹⁰ Note that the conditional density function referred to in this definition is degenerate in the general sense alluded to in Section 3.10, footnote 20. That is, since (x₁,...,xₙ) satisfies the r restrictions sᵢ(x₁,...,xₙ) = sᵢ, for i = 1,...,r, by virtue of the event being conditioned upon, the arguments x₁,...,xₙ of the conditional density are not all free to vary but rather are functionally related. If one wanted to utilize the conditional density for actually calculating conditional probabilities of events for (X₁,...,Xₙ), and if the random variables were continuous, then line integrals would be required, as discussed previously in Chapter 3 concerning the use of degenerate densities. This technical problem is of no concern in our current discussion of sufficient statistics since we will have no need to actually calculate conditional probabilities from the conditional density.


sales equals the value s. It follows from the definition of conditional probability that the conditional density function can be defined as¹¹

f(x;p|s) = P(x₁,...,xₙ, s(x) = s) / P(s(x) = s).

The denominator probability is given directly by

P(s(x) = s) = C(n,s) p^s (1 − p)^{n−s} I_{0,1,...,n}(s),

where C(n,s) = n!/[s!(n − s)!], because s(X) = Σᵢ₌₁ⁿ Xᵢ is the sum of iid Bernoulli random variables, which we know to have a binomial distribution. The numerator probability is defined by an appropriate evaluation of the joint density of the random sample, as

P(x₁,...,xₙ, s(x) = s) = f(x;p) I_{s}(Σᵢ₌₁ⁿ xᵢ) = p^s (1 − p)^{n−s} Πᵢ₌₁ⁿ I_{0,1}(xᵢ) I_{s}(Σᵢ₌₁ⁿ xᵢ),

which is the probability of x₁,...,xₙ and s(x) = Σᵢ₌₁ⁿ xᵢ = s. Using the preceding functional representations of the numerator and denominator probabilities in the ratio defining the conditional density function, and using the fact that Σᵢ₌₁ⁿ xᵢ = s whenever the numerator is nonzero, we obtain

f(x;p|s) = [C(n,s)]⁻¹ Πᵢ₌₁ⁿ I_{0,1}(xᵢ) I_{s}(Σᵢ₌₁ⁿ xᵢ) = h(x),

which does not depend on the value of p, so that by Definition 7.18, Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for p.

Note the conditional density states that, given Σᵢ₌₁ⁿ xᵢ = s, all outcomes of (x₁,...,xₙ) satisfying the condition are equally likely with probability [C(n,s)]⁻¹, and thus the probability of a particular pattern of sales and no sales occurring for (x₁,...,xₙ), given that Σᵢ₌₁ⁿ xᵢ = s, has nothing to do with the value of p. It follows that only the fact that Σᵢ₌₁ⁿ xᵢ = s provides any information about p; the particular pattern of 0's and 1's in (X₁,...,Xₙ) is irrelevant. This is consistent with intuition in that it is the total number of sales in n phone calls, and not the particular pattern of sales, that provides information in a relative frequency sense about the probability, p, of obtaining a sale on a phone call solicitation. Furthermore, if Y = g(X) is any other function of the random sample, then it can provide no additional information about p other than that already provided by s(X). This follows from the fact

¹¹ The reader may wonder why we define the conditional density "from the definition of conditional probability" instead of using the rather straightforward methods for defining conditional densities presented in Chapter 2, Section 2.6. The problem is that here we are conditioning on an event that involves all of the random variables X₁,...,Xₙ, whereas in Chapter 2 we were dealing with the usual case where the event being conditioned upon involves only a subset of the random variables X₁,...,Xₙ having fewer than n elements.


that h(y|s(x) = s) will not depend on p, because the conditional density of Y will have been derived from a conditional density of X that is independent of p, i.e.,

h(y|s(x) = s) = P(y|s(x) = s) = Σ_{x: g(x)=y} f(x;p|s) = Σ_{x: g(x)=y} h(x),

since f(x;p|s) = h(x) if s is a sufficient statistic. □
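The equal-probability property of the conditional sample patterns is easy to verify by direct enumeration for small n. The following sketch (illustrative Python, not part of the text) checks that, for several values of p, every 0/1 pattern with Σᵢ₌₁ⁿ xᵢ = s has conditional probability [C(n,s)]⁻¹.

    from itertools import product
    from math import comb

    n, s = 5, 2
    for p in (0.2, 0.5, 0.9):
        # joint Bernoulli probability of every 0/1 pattern of length n
        joint = {x: p ** sum(x) * (1 - p) ** (n - sum(x))
                 for x in product((0, 1), repeat=n)}
        p_s = sum(v for x, v in joint.items() if sum(x) == s)   # binomial P(S = s)
        cond = [v / p_s for x, v in joint.items() if sum(x) == s]
        # every conditional probability equals 1/C(n,s), whatever p is
        assert all(abs(c - 1 / comb(n, s)) < 1e-12 for c in cond)
        print(f"p = {p}: each pattern has conditional probability {1 / comb(n, s):.3f}")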

In any problem of estimating q(Θ), once the outcome of a set of sufficient statistics is observed, the random sample outcome (x₁,...,xₙ) can effectively be ignored for the remainder of the point estimation problem since s(x) captures all of the relevant information that the sample has to offer regarding q(Θ). Essentially, it is sufficient that the outcome of s be observed. For example, with reference to Example 7.10, if a colleague were to provide the information that 123 sales were observed in a total of 250 phone calls, i.e., Σᵢ₌₁²⁵⁰ xᵢ = 123, we would have no need to examine any other characteristic of the random sample outcome (x₁,...,x₂₅₀) when estimating p, or q(p).

A significant practical problem in the use of sufficient statistics is knowing how to identify them. A criterion for identifying sufficient statistics which is sometimes useful is given by the Neyman Factorization Theorem:

Theorem 7.2
Neyman Factorization Theorem

Let X ~ f(x;Θ). The statistics S₁ = s₁(X),...,S_r = s_r(X) are sufficient statistics for f(x;Θ) iff the joint density can be factored as f(x;Θ) = g(s₁(x),...,s_r(x);Θ) h(x), where g depends on x only through the statistics s₁(x),...,s_r(x), and h(x) does not depend on Θ.

Proof
The proof of the theorem in the continuous case is quite difficult, and we leave it to a more advanced course of study (see Lehmann (1986), Testing Statistical Hypotheses, John Wiley, pp. 54–55). We provide a proof for the discrete case.

Sufficiency: Suppose the factorization criterion is met. Let B(a) = {(x₁,...,xₙ): sᵢ(x) = aᵢ, i = 1,...,r; x ∈ R(X)} be such that P(B(a)) > 0, and note that

P(B(a)) = Σ_{(x₁,...,xₙ)∈B(a)} f(x;Θ) = g(a₁,...,a_r;Θ) Σ_{(x₁,...,xₙ)∈B(a)} h(x₁,...,xₙ).

Therefore, for x ∈ B(a), the conditional density of X given the event B(a) is

f(x;Θ)/P(B(a)) = [g(a₁,...,a_r;Θ) h(x)] / [g(a₁,...,a_r;Θ) Σ_{(y₁,...,yₙ)∈B(a)} h(y₁,...,yₙ)] = h(x) / Σ_{(y₁,...,yₙ)∈B(a)} h(y₁,...,yₙ),

which does not depend on Θ, so that S₁,...,S_r are sufficient statistics by Definition 7.18.

Necessity: Conversely, suppose S₁,...,S_r are sufficient statistics. It follows from the definition of the conditional density that f(x;Θ) = f(x|sᵢ(x) = aᵢ, i = 1,...,r) P(sᵢ(x) = aᵢ, i = 1,...,r), where the conditional density function does not depend on Θ by the sufficiency of s. Then we have factored f(x;Θ) into the product of a function of s₁(x),...,s_r(x) and Θ (i.e., P(sᵢ(x) = aᵢ, i = 1,...,r) will depend on Θ), and a function of x alone, h(x) = f(x|sᵢ(x) = aᵢ, i = 1,...,r), so that the factorization criterion holds. ■

As we have alluded to previously, a practical advantage of sufficient statistics is that they can often greatly reduce the number of random variables required to represent the sample information relevant for estimating q(Θ), as seen in Example 7.8 and in the following example of the use of the Neyman Factorization Theorem.

Example 7.11

Let X = (X₁,...,Xₙ) be a random sample from an exponential population distribution with mean θ, so that

f(x;θ) = θ⁻ⁿ exp(−θ⁻¹ Σᵢ₌₁ⁿ xᵢ) Πᵢ₌₁ⁿ I_{(0,∞)}(xᵢ).

Letting g(s(x);θ) = θ⁻ⁿ exp(−θ⁻¹ s(x)) with s(x) = Σᵢ₌₁ⁿ xᵢ, and h(x) = Πᵢ₌₁ⁿ I_{(0,∞)}(xᵢ), then from the theorem we can conclude that S = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for f(x;θ). It follows that the value of the sum of the random sample outcomes contains all of the information in the sample that is relevant for estimating q(θ). □
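Assuming the exponential setup sketched above, the factorization can be spot-checked numerically; the following brief sketch (illustrative Python; the variable names are ours) confirms that f(x;θ) = g(Σᵢ₌₁ⁿ xᵢ;θ)·h(x) at an arbitrary sample point.

    import numpy as np

    rng = np.random.default_rng(1)
    theta, n = 1.5, 6
    x = rng.exponential(scale=theta, size=n)     # a sample point in (0, inf)^n

    joint = np.prod(np.exp(-x / theta) / theta)  # f(x; theta)
    g = theta ** (-n) * np.exp(-x.sum() / theta) # g(s(x); theta) with s(x) = sum(x)
    h = 1.0                                      # h(x): the indicator product equals 1 here
    print(np.isclose(joint, g * h))              # True: the factorization holds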

Successful use of the Neyman Factorization Theorem for identifying sufficient statistics requires that one be ingenious enough to define the appropriate g(s(x);Θ) and h(x) functions that achieve the required joint probability density factorization. Since the appropriate function definitions will not always be readily apparent, an approach introduced by Lehmann and Scheffé¹² can sometimes be quite useful for providing direction to the search for sufficient statistics. We will discuss this useful result in the context of minimal sufficient statistics.

7.4.1 Minimal Sufficient Statistics

At the beginning of our discussion of sufficient statistics we remarked that an objective of using sufficient statistics is to reduce the number of functions of the random sample required to represent all of the information in the random sample relevant for estimating q(Θ). A natural question to consider is: what is the smallest number of functions of the random sample that can represent all of the relevant sample information in a given point estimation problem? This relates to the concept of a minimal sufficient statistic, which is essentially the sufficient statistic for a given f(x;Θ) that is defined using the fewest number of (functionally independent) coordinate functions of the random sample.

The statement of subsequent definitions and theorems will be facilitated by the concept of the range of X over the parameter space Ω, defined as R_Ω(X) = {x: f(x;Θ) > 0 for some Θ ∈ Ω}. The set R_Ω(X) represents all of the values of x that are assigned a nonzero density weighting by f(x;Θ) for at least one Θ ∈ Ω. In other words, R_Ω(X) is the union of the supports of the densities f(x;Θ) for Θ ∈ Ω, and thus corresponds to the set of relevant x-outcomes for the statistical model {f(x;Θ), Θ ∈ Ω}. If the support of the density f(x;Θ) does not change with Θ (e.g., normal, gamma, binomial), then R_Ω(X) = R(X) = {x: f(x;Θ) > 0}, where Θ ∈ Ω can be chosen arbitrarily, and we henceforth treat the range of X as being synonymous with the support of its density.

¹² Lehmann, E.L. and H. Scheffé (1950), Completeness, Similar Regions, and Unbiased Estimation, Sankhyā, 10, p. 305.

Definition 7.19
Minimal Sufficient Statistics

A sufficient statistic S = s(X) for f(x;Θ) is a minimal sufficient statistic iff S can be expressed as a function of every other sufficient statistic for f(x;Θ), i.e., iff for any other sufficient statistic T = t(X) there exists a function h_T such that S = h_T(T).

In order to motivate what is "minimal" about the sufficient statistic S in Definition 7.19, first note that S will have the fewest elements in its range compared to all sufficient statistics for f(x;Θ). This follows from the fact that a function can never have more elements in its range than in its domain (recall the definition of a function, which requires that there is only one range point associated with each domain element, although there can be many domain elements associated with each range element), and thus if S = h_T(T) for any other sufficient statistic T, then the number of elements in R(S) must be no more than the number of elements in R(T), for any sufficient statistic T. So, in this sense, S utilizes the minimal set of points for representing the sample information relevant for estimating q(Θ).

It can also be shown that a minimal sufficient statistic can be chosen to have the fewest number of coordinate functions relative to any other sufficient statistic, i.e., the number of coordinate functions defining the minimal sufficient statistic is minimal. A rigorous proof of this fact is quite difficult and is deferred to a more advanced course of study.¹³ In order to at least motivate the plausibility of this fact, first note that since a minimal sufficient statistic, say S, is a function of all other sufficient statistics, then if T is any other sufficient statistic, t(x) = t(y) ⇒ s(x) = h_T(t(x)) = h_T(t(y)) = s(y). It follows that

A_T = {(x,y): t(x) = t(y)} ⊆ {(x,y): s(x) = s(y)} = B

no matter which sufficient statistic, T, is being referred to. If B is to contain the set A_T, then the constraints on (x,y) representing the set-defining conditions of B cannot be more constraining than the constraints defining A_T, and in particular the number of nonredundant constraints¹⁴ defining B cannot be more than the

¹³ See E.W. Barankin and M. Katz (1959), Sufficient Statistics of Minimal Dimension, Sankhyā, 21:217–246; R. Shimizu (1966), Remarks on Sufficient Statistics, Ann. Inst. Statist. Math., 18:49–66; D.A.S. Fraser (1963), On Sufficiency and the Exponential Family, Jour. Roy. Statist. Soc., Series B, 25:115–123.


number defining A_T. Thus the number of nonredundant coordinate functions defining S must be no larger than the number of nonredundant coordinate functions defining any other sufficient statistic, so that the number of coordinate functions defining S is minimal. Identification of minimal sufficient statistics can often be facilitated by the following approach suggested by Lehmann and Scheffé.

Theorem 7.3
Lehmann-Scheffé Minimal Sufficiency

Let X ~ f(x;Θ). If the statistic S = s(X) is such that, ∀x and y ∈ R_Ω(X), (x,y) ∈ {(x,y): s(x) = s(y)} iff f(x;Θ) = t(x,y) f(y;Θ) ∀Θ ∈ Ω, for some function t(x,y) that does not depend on Θ, then S = s(X) is a minimal sufficient statistic for f(x;Θ).

Proof
Define A(s) = {x: s(x) = s}, and let x_s ∈ A(s) ∩ R_Ω(X) be chosen as a representative element of A(s), ∀s ∈ R(S). Define z(x) = x_s ∀x ∈ A(s) and ∀s ∈ R(S). Thus A(s) is the set of x-outcomes whose image s(x) is s, and z(x) is the representative element of the set A(s) to which x belongs.

Assume that (x,y) ∈ {(x,y): s(x) = s(y)} ⇒ f(x;Θ) = t(x,y) f(y;Θ) ∀x and y ∈ R_Ω(X). Then for x ∈ A(s) ∩ R_Ω(X), s(x) = s(x_s) implies

f(x;Θ) = t(x, x_s) f(x_s;Θ) = t(x, z(x)) f(z(x);Θ)  (substituting z(x) = x_s) = h(x) g(s(x);Θ),

where h(x) ≡ t(x, z(x)) and g(s(x);Θ) ≡ f(z(x);Θ), and the g-function in the latter identity can be defined from the fact that z(x) = x_s iff s(x) = s, so that z(x) ⇔ s(x). If x ∉ R_Ω(X), then f(x;Θ) = h(x) g(s(x);Θ) holds by defining h(x) = 0. Since Neyman factorization holds, s(X) is a sufficient statistic.

Now assume f(x;Θ) = t(x,y) f(y;Θ) ⇒ (x,y) ∈ {(x,y): s(x) = s(y)} ∀x and y ∈ R_Ω(X). Let s*(X) be any other sufficient statistic for f(x;Θ). Then by Neyman factorization, for some g*(·) and h*(·) functions, f(x;Θ) = g*(s*(x);Θ) h*(x). If s*(x) = s*(y), then since g*(s*(x);Θ) = g*(s*(y);Θ), it follows that f(x;Θ) = [h*(x)/h*(y)] f(y;Θ) = t(x,y) f(y;Θ) whenever h*(y) ≠ 0, so that s*(x) = s*(y) ⇒ s(x) = s(y). Values of y for which h*(y) = 0 are such that f(y;Θ) = 0 ∀Θ ∈ Ω by Neyman factorization, and are thus irrelevant to the minimal sufficiency of S (recall Definition 7.19). Then s is a function of s*, as s(x) = g(s*(x)) ∀x ∈ R_Ω(X), because for a representative y_{s*} ∈ {x: s*(x) = s*}, s*(x) = s*(y_{s*}) = s* ⇒ s(x) = s(y_{s*}) = s, and thus s(x) = s = g(s*) = g(s*(x)). Therefore s(X) is a minimal sufficient statistic. ■

¹⁴ By nonredundant, we mean that none of the constraints are implied by the others. Redundant constraints are constraints that are ineffective or unnecessary in defining sets.


Before proceeding to applications of the theorem, we present two corollaries that are informative and useful in practice.

Corollary 7.1
Lehmann-Scheffé Sufficiency

Let X ~ f(x;Θ). If the statistic S = s(X) is such that, ∀x and y ∈ R_Ω(X), (x,y) ∈ {(x,y): s(x) = s(y)} ⇒ f(x;Θ) = t(x,y) f(y;Θ), then S = s(X) is a sufficient statistic for f(x;Θ).

Proof
The validity of this corollary is implied by the first part of the proof of Theorem 7.3. ■

The corollary indicates that the "only if" part of the condition in Theorem 7.3 is not required for the sufficiency of s(X), but it is the addition of the "only if" part that results in the minimality of s(X).

Corollary 7.2
Lehmann-Scheffé Minimal Sufficiency via Density Ratios

Let X ~ f(x;Θ), and suppose the support R(X) = {x: f(x;Θ) > 0} does not depend on Θ. If, ∀x and y ∈ R(X), f(x;Θ)/f(y;Θ) does not depend on Θ iff s(x) = s(y), then S = s(X) is a minimal sufficient statistic.

Proof
This follows from Theorem 7.3 by dividing through by f(y;Θ) on the left-hand side of the iff condition, which is admissible for all x and y in R(X) = {x: f(x;Θ) > 0}. Values of x and y ∉ R(X) are irrelevant to sufficiency (recall the proof of Theorem 7.3). ■

Using the preceding results for defining a minimal sufficient statistic of course still requires that one be observant enough to recognize an appropriate (vector) function S. However, in many cases the Lehmann-Scheffé approach transforms the problem into one where a choice of S is readily apparent. The following examples illustrate the use of the procedure for discovering minimal sufficient statistics.

Example 7.12
Lehmann-Scheffé Minimal Sufficiency Approach for Bernoulli

Let X = (X₁,...,Xₙ) be a random sample from a nondegenerate Bernoulli population distribution representing whether or not a customer contact results in a sale, so that

f(x;p) = p^{Σᵢ₌₁ⁿ xᵢ} (1 − p)^{n − Σᵢ₌₁ⁿ xᵢ} Πᵢ₌₁ⁿ I_{0,1}(xᵢ), p ∈ Ω = (0,1).

Examine the density ratio

f(x;p)/f(y;p) = (p/(1 − p))^{Σᵢ₌₁ⁿ xᵢ − Σᵢ₌₁ⁿ yᵢ}

for all values of x and y ∈ R(X) = ×ᵢ₌₁ⁿ {0,1}. The ratio will be independent of p iff the constraint Σᵢ₌₁ⁿ xᵢ = Σᵢ₌₁ⁿ yᵢ holds, in which case the ratio equals 1. A minimal sufficient statistic for f(x;p) is then S = Σᵢ₌₁ⁿ Xᵢ by Corollary 7.2. □
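The ratio condition can be illustrated directly; the small sketch below (illustrative Python, not part of the text) shows the ratio collapsing to 1 whenever the sums agree and varying with p otherwise.

    def ratio(x, y, p):
        # f(x;p)/f(y;p) = (p/(1-p))**(sum(x) - sum(y)) for 0/1 vectors x and y
        return (p / (1 - p)) ** (sum(x) - sum(y))

    x, y = (1, 1, 0, 0, 0), (0, 1, 0, 1, 0)      # equal sums: ratio is 1 for every p
    print([ratio(x, y, p) for p in (0.1, 0.5, 0.8)])            # [1.0, 1.0, 1.0]

    x, y = (1, 1, 1, 0, 0), (1, 0, 0, 0, 0)      # unequal sums: ratio depends on p
    print([round(ratio(x, y, p), 4) for p in (0.1, 0.5, 0.8)])  # varies with p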

Example 7.13
Lehmann-Scheffé Minimal Sufficiency Approach for Gamma

Let X = (X₁,...,Xₙ) be a random sample from a gamma population distribution representing the operating life until failure of a certain brand and type of personal computer, so that

f(x;α,β) = (β^{nα} Γⁿ(α))⁻¹ (Πᵢ₌₁ⁿ xᵢ)^{α−1} exp(−β⁻¹ Σᵢ₌₁ⁿ xᵢ) Πᵢ₌₁ⁿ I_{(0,∞)}(xᵢ).

Examine the density ratio

f(x;α,β)/f(y;α,β) = (Πᵢ₌₁ⁿ xᵢ / Πᵢ₌₁ⁿ yᵢ)^{α−1} exp(−β⁻¹(Σᵢ₌₁ⁿ xᵢ − Σᵢ₌₁ⁿ yᵢ))

for all values of x and y ∈ R(X) = ×ᵢ₌₁ⁿ (0,∞). (Note the term (β^{nα} Γⁿ(α)) has been algebraically canceled in the density ratio.) The ratio will be independent of both α and β iff the constraints Πᵢ₌₁ⁿ xᵢ = Πᵢ₌₁ⁿ yᵢ and Σᵢ₌₁ⁿ xᵢ = Σᵢ₌₁ⁿ yᵢ hold, in which case the ratio equals 1. A minimal sufficient statistic for f(x;α,β) is then bivariate and given by s₁(X) = Πᵢ₌₁ⁿ Xᵢ and s₂(X) = Σᵢ₌₁ⁿ Xᵢ by Corollary 7.2. □

Example 7.14
Lehmann-Scheffé Minimal Sufficiency Approach for Uniform

Let X = (X₁,...,Xₙ) be a random sample from a uniform population distribution representing the number of minutes that a shipment is delivered before (x < 0) or after (x > 0) its scheduled arrival time, so that

f(x;a,b) = (b − a)⁻ⁿ Πᵢ₌₁ⁿ I_{[a,b]}(xᵢ).

Unlike the previous examples, here the range of X depends on the parameters a and b. Referring to the Lehmann-Scheffé procedure for defining a sufficient statistic for f(x;a,b) as given by Theorem 7.3, examine the condition f(x;a,b) = t(x,y) f(y;a,b), i.e.,

Πᵢ₌₁ⁿ I_{[a,b]}(xᵢ) = t(x,y) Πᵢ₌₁ⁿ I_{[a,b]}(yᵢ),

for x and y ∈ R_Ω(X) = ×ᵢ₌₁ⁿ (−∞,∞).¹⁵ The ratio of the two indicator products will be independent of a and b iff min(x₁,...,xₙ) = min(y₁,...,yₙ) and max(x₁,...,xₙ) = max(y₁,...,yₙ), in which case the ratio will be equal to 1. The preceding conditions also ensure that f(x;a,b) = 0 when f(y;a,b) = 0, so that f(x;a,b) = t(x,y) f(y;a,b) holds ∀x and y ∈ R_Ω(X). A minimal sufficient statistic for f(x;a,b) is then bivariate and given by the order statistics s₁(X) = min(X₁,...,Xₙ) and s₂(X) = max(X₁,...,Xₙ) by Theorem 7.3. □

¹⁵ It may be more appropriate to assume finite lower and upper bounds for a and b, respectively. Doing so will not change the final result of the example.
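The role of the min/max pair can also be checked numerically; the sketch below (illustrative Python, not part of the text) scans several admissible (a,b) values and confirms that the density ratio is free of (a,b) exactly when the sample minima and maxima agree.

    import numpy as np

    def dens(v, a, b):
        # uniform joint density: (b-a)^(-n) times the product of indicator functions
        v = np.asarray(v, dtype=float)
        return (b - a) ** (-len(v)) * float(np.all((v >= a) & (v <= b)))

    def ratio_constant(x, y, a_grid=(-0.2, 0.0, 0.15, 0.25), b_grid=(0.95, 1.1, 1.3)):
        ratios = set()
        for a in a_grid:
            for b in b_grid:
                fy = dens(y, a, b)
                if fy > 0:                                   # (a,b) admissible for y
                    ratios.add(round(dens(x, a, b) / fy, 10))
        return len(ratios) == 1                              # ratio free of (a,b)?

    print(ratio_constant([0.1, 0.9, 0.5], [0.5, 0.1, 0.9]))  # same min and max: True
    print(ratio_constant([0.1, 0.9, 0.5], [0.2, 0.9, 0.5]))  # different min: False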

7.4.2 Sufficient Statistics in the Exponential Class

The exponential class of densities represents a collection of parametric families of density functions for which sufficient statistics are straightforwardly defined. Furthermore, the sufficient statistics are generally minimal sufficient statistics.

Theorem 7.4
Exponential Class and Minimal Sufficient Statistics

Let X ~ f(x;Θ) belong to the exponential class of densities, so that

f(x;Θ) = exp[Σᵢ₌₁ᵏ cᵢ(Θ) gᵢ(x) + d(Θ) + z(x)] I_A(x),

where the set A does not depend on Θ. Then s(X) = (g₁(X),...,g_k(X)) is a k-variate sufficient statistic, and if cᵢ(Θ), i = 1,...,k, are linearly independent, the sufficient statistic is a minimal sufficient statistic.

Proof
That s(X) is a sufficient statistic follows immediately from the Neyman Factorization theorem by defining

g(g₁(x),...,g_k(x);Θ) = exp[Σᵢ₌₁ᵏ cᵢ(Θ) gᵢ(x) + d(Θ)] and h(x) = exp(z(x)) I_A(x)

in the theorem. That s(X) is a minimal sufficient statistic follows from the fact that s(X) can be derived using the Lehmann-Scheffé approach of Corollary 7.2. To see this, note that

f(x;Θ)/f(y;Θ) = exp[Σᵢ₌₁ᵏ cᵢ(Θ)(gᵢ(x) − gᵢ(y)) + z(x) − z(y)]

for x and y ∈ A, and the ratio is independent of Θ iff gᵢ(x) = gᵢ(y) for i = 1,...,k, when the cᵢ(Θ), i = 1,...,k, are linearly independent.¹⁶ ■

Note that Theorem 7.4 could be used as an alternative approach for discovering minimal sufficient statistics in the problems of random sampling

¹⁶ If one (or more) of the cᵢ(Θ) were linearly dependent on the other cⱼ(Θ)'s, then "only if" would not apply. To see this, suppose c_k(Θ) = Σⱼ₌₁^{k−1} aⱼ cⱼ(Θ). Then Σᵢ₌₁ᵏ cᵢ(Θ)(gᵢ(x) − gᵢ(y)) = Σⱼ₌₁^{k−1} cⱼ(Θ)[(gⱼ(x) − gⱼ(y)) + aⱼ(g_k(x) − g_k(y))], which can equal zero, so that the ratio can be independent of Θ, without gᵢ(x) = gᵢ(y) holding for all i = 1,...,k.


examined in Examples 7.11 through 7.13. It could not be used in Example 7.14 since the uniform distribution is not in the exponential class.
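As a worked illustration (our manipulation, in the notation of Theorem 7.4), the Bernoulli density of Example 7.12 can be put in exponential class form, from which the minimal sufficient statistic can be read off directly:

f(x;p) = p^{Σᵢ₌₁ⁿ xᵢ} (1 − p)^{n − Σᵢ₌₁ⁿ xᵢ} Πᵢ₌₁ⁿ I_{0,1}(xᵢ) = exp[ln(p/(1 − p)) Σᵢ₌₁ⁿ xᵢ + n ln(1 − p)] Πᵢ₌₁ⁿ I_{0,1}(xᵢ),

so that c₁(p) = ln(p/(1 − p)), g₁(x) = Σᵢ₌₁ⁿ xᵢ, d(p) = n ln(1 − p), z(x) = 0, and A = ×ᵢ₌₁ⁿ {0,1}. Since k = 1 and c₁(p) is nonconstant, Theorem 7.4 implies that g₁(X) = Σᵢ₌₁ⁿ Xᵢ is a minimal sufficient statistic, matching the conclusion of Example 7.12.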

7.4.3 Relationship Between Sufficiency and MSE: Rao-Blackwell

In addition to generating a condensed representation of the information in a sample relevant for estimating q(Θ), sufficient statistics can also facilitate the discovery of estimators of q(Θ) that are relatively efficient in terms of MSE. In particular, in the pursuit of estimators with low MSE, only functions of sufficient statistics need to be examined, which is the implication of the Rao-Blackwell theorem.

Theorem 7.5
Rao-Blackwell Theorem - Scalar Case

Let S = (S₁,...,S_r) be an r-variate sufficient statistic for f(x;Θ), and let t*(X) be any estimator of the scalar q(Θ) having finite variance. Define t(X) = E(t*(X)|S₁,...,S_r) = ξ(S₁,...,S_r). Then t(X) is an estimator of q(Θ) for which MSE_Θ(t(X)) ≤ MSE_Θ(t*(X)) ∀Θ ∈ Ω, with the equality being attained only if P_Θ(t(x) = t*(x)) = 1.

Proof
First note that since S = (S₁,...,S_r) is an r-variate sufficient statistic, f(x|s) does not depend on Θ, and thus neither does the function t(X) (since it is defined as a conditional expectation using f(x|s)), so t(X) is a statistic that can be used as an estimator of q(Θ). Now by the iterated expectation theorem, E(t(X)) = E[E(t*(X)|S₁,...,S_r)] = E(t*(X)), so that t(X) and t*(X) have precisely the same expectation. Next examine

MSE(t*(X)) = E(t*(X) − q(Θ))² = E(t*(X) − t(X) + t(X) − q(Θ))²
= E(t*(X) − t(X))² + 2E[(t*(X) − t(X))(t(X) − q(Θ))] + E(t(X) − q(Θ))².

The cross-product term is zero. To see this, first note that E[(t*(X) − t(X))(t(X) − q(Θ))] = E[t(X)(t*(X) − t(X))], since E(t*(X) − t(X)) q(Θ) = 0 because E(t*(X)) = E(t(X)). Now note that by definition t(X) is a function of only sufficient statistics, so that t(X) is a constant, t, given s₁,...,s_r. Therefore,

E[t(X)(t*(X) − t(X)) | s₁,...,s_r] = t E[(t*(X) − t(X)) | s₁,...,s_r] = 0,

since E(t*(X)|s₁,...,s_r) = t by definition, so that E[t(X)(t*(X) − t(X))] = 0 by the iterated expectation theorem.

Then dropping the nonnegative term E(t*(X) − t(X))² on the right-hand side of the expression defining MSE(t*(X)) above yields MSE_Θ(t*(X)) ≥ E_Θ(t(X) − q(Θ))² = MSE_Θ(t(X)) ∀Θ ∈ Ω. The equality is attained iff E_Θ(t*(X) − t(X))² = 0, which requires that P_Θ[t*(x) = t(x)] = 1. ■

The point of the theorem is that for any estimator t*(X) of q(Θ) there always exists an alternative estimator that is at least as good as t*(X) in terms of MSE and that is a function of any set of sufficient statistics. Thus, the Rao-Blackwell theorem suggests that the search for estimators of q(Θ) with low MSEs can


always be restricted to an examination of functions of sufficient statistics, where hopefully the number of sufficient statistics required to fully represent the information about q(Θ) is substantially less than the size of the random sample itself.¹⁷ Note that if attention is restricted to the unbiased class of estimators, so that t*(X) is an unbiased estimator in the statement of the theorem, then the Rao-Blackwell theorem implies that the search for a minimum variance estimator within the class of unbiased estimators can also be restricted to functions of sufficient statistics. As an illustration, in Example 7.11, we know that Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for the exponential density f(x;θ), so the search for an MVUE of θ can be restricted to functions of Σᵢ₌₁ⁿ Xᵢ.
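The MSE improvement from conditioning on a sufficient statistic is easy to see in simulation. The sketch below (illustrative Python, not part of the text) Rao-Blackwellizes the crude unbiased estimator t*(X) = X₁ of a Bernoulli p by conditioning on S = Σᵢ₌₁ⁿ Xᵢ; by symmetry, E(X₁|S) = S/n, the sample mean.

    import numpy as np

    rng = np.random.default_rng(42)
    p, n, reps = 0.3, 20, 100000

    x = rng.binomial(1, p, size=(reps, n))
    t_star = x[:, 0]           # t*(X) = X_1: unbiased for p, but noisy
    t_rb = x.mean(axis=1)      # t(X) = E(X_1 | S) = S/n: the Rao-Blackwellized version

    print("MSE of t*(X):", np.mean((t_star - p) ** 2))  # approx p(1-p)   = 0.21
    print("MSE of t(X) :", np.mean((t_rb - p) ** 2))    # approx p(1-p)/n = 0.0105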

Theorem 7.6
Rao-Blackwell Theorem - Vector Case

Let S = (S₁,...,S_r) be an r-variate sufficient statistic for f(x;Θ), and let t*(X) be an estimator of the (k × 1) vector function q(Θ) having a finite covariance matrix. Define t(X) = E(t*(X)|S₁,...,S_r) = h(S₁,...,S_r). Then t(X) is an estimator of q(Θ) for which MSE_Θ(t(X)) ≤ MSE_Θ(t*(X)) ∀Θ ∈ Ω, in the matrix sense that MSE_Θ(t*(X)) − MSE_Θ(t(X)) is positive semidefinite, with the equality being attained only if P_Θ(t(x) = t*(x)) = 1.

Proof
The proof is analogous to the proof in the scalar case, except that MSE matrices are used in place of scalar MSEs in establishing that MSE(t(X)) is smaller than MSE(t*(X)). The details are left to the reader. ■

The implications of Theorem 7.6 are analogous to those for the scalar case. Namely, one need only examine vector functions of sufficient statistics for estimating the vector q(Θ) if the objective is to obtain an estimator with a small MSE matrix. Furthermore, the search for an MVUE of q(Θ) can also be restricted to functions of sufficient statistics. As stated previously, this can decrease substantially the dimensionality of the data used in a point estimation problem if the minimal sufficient statistics for the problem are few in number. Revisiting Example 7.12, we note for future reference that

t(X) = [X̄ₙ, (n/(n − 1)) X̄ₙ(1 − X̄ₙ)]′

is the MVUE for (p, p(1 − p))′, the mean and variance of the Bernoulli population distribution in the example.
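A quick simulation (illustrative Python; the estimator expression follows the form displayed above) checks the unbiasedness of both coordinates of t(X):

    import numpy as np

    rng = np.random.default_rng(7)
    p, n, reps = 0.3, 10, 200000

    x = rng.binomial(1, p, size=(reps, n))
    xbar = x.mean(axis=1)
    var_hat = n * xbar * (1 - xbar) / (n - 1)   # unbiased for p(1-p) with 0/1 data

    print("E(Xbar)    ~", xbar.mean(), " target:", p)               # approx 0.30
    print("E(var_hat) ~", var_hat.mean(), " target:", p * (1 - p))  # approx 0.21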

¹⁷ The reader will recall that the random sample, (X₁,...,Xₙ), is by definition a set of sufficient statistics for f(x;Θ). However, it is clear that no improvement (decrease) in the MSE of an unbiased estimator will be achieved by conditioning on (X₁,...,Xₙ), i.e., the reader should verify that this is a case where E(t*(X) − t(X))² = 0 and MSE equality is achieved in the Rao-Blackwell theorem.
