Analysis of Survey Data, part 6

Inference Under Informative Probability Sampling



The sample distribution also differs from the familiar pξ distribution, defined as the combined distribution over all possible realizations of the finite population measurements (the population ξ distribution) and all possible sample selections for a given population (the randomization p distribution). The pξ distribution is often used for comparing the performance of design-based estimators in situations where direct comparisons of randomization variances or mean square errors are not feasible. The obvious difference between the sample distribution and the pξ distribution is that the former conditions on the selected sample (and the values of auxiliary variables measured for units in the sample), whereas the latter accounts for all possible sample selections.

Finally, rather than conditioning on the selected sample when constructing the sample distribution (and hence the sample likelihood), one could compute instead the joint distribution of the selected sample and the corresponding sample measurements. Denote by $y_s = \{y_t,\, t \in s\}$ the outcome variable values measured for the sample units, and by $x_s = \{x_t,\, t \in s\}$ and $x_{\bar{s}} = \{x_t,\, t \notin s\}$ the values of the auxiliary variables corresponding to the sampled and nonsampled units. Assuming independence of the population measurements and independent sampling of the population units (Poisson sampling), the joint pdf of $(s, y_s) \mid (x_s, x_{\bar{s}})$ can be written as

$$f(s, y_s \mid x_s, x_{\bar{s}}) = \Big[\prod_{t \in s} \frac{\pi(y_t, x_t)\, f_U(y_t \mid x_t)}{\pi(x_t)}\Big]\Big[\prod_{t \in s} \pi(x_t) \prod_{t \notin s} (1 - \pi(x_t))\Big] \qquad (12.6)$$

where $\pi(y_t, x_t) = E_U(\pi_t \mid y_t, x_t)$ and $\pi(x_t) = E_U(\pi_t \mid x_t)$. Note that the product of the terms in the first set of square brackets on the right hand side of (12.6) is the joint sample pdf, $f_s(y_s \mid x_s, s)$, for units in the sample, as obtained from (12.3). The use of (12.6) for likelihood-based inference has the theoretical advantage of employing the information on the sample selection probabilities for units outside the sample, but it requires knowledge of the expectations $\pi(x_t) = E_U(\pi_t \mid x_t)$ for all $t \in U$, and hence the values $x_{\bar{s}}$. This information is not needed when inference is based on the sample pdf, $f_s(y_s \mid x_s, s)$. When the values $x_{\bar{s}}$ are unknown, it is possible in theory to regard the values $\{x_t,\, t \notin s\}$ as random realizations from some pdf $g_{\bar{s}}(x_t)$ and to replace the expectations $\pi(x_t)$ for units $t \notin s$ by the unconditional expectations $\pi(t) = \int \pi(x_t)\, g_{\bar{s}}(x_t)\, dx_t$. See, for example, Rotnitzky and Robins (1997) for a similar analysis in a different context. However, modeling the distribution of the auxiliary variables might be formidable, and the resulting likelihood $f(s, y_s \mid x_s, x_{\bar{s}})$ could be very cumbersome.

12.3 INFERENCE UNDER INFORMATIVE PROBABILITY SAMPLING

12.3.1 Estimating equations with application to the GLM

In this section we consider four different approaches for defining estimating equations under informative probability sampling. We compare the various approaches empirically in Section 12.5.



Suppose that the population measurements $(y_U, x_U) = \{(y_t, x_t),\ t = 1, \ldots, N\}$ can be regarded as N independent realizations from some pdf $f_{y,x}$. Denote by $f_U(y \mid x; \beta)$ the conditional pdf of $y_t$ given $x_t$. The true value of the vector parameter $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$ is defined as the unique solution of the equations

$$W_U(\beta) = \sum_{t=1}^{N} E_U[d_{Ut} \mid x_t] = 0 \qquad (12.7)$$

where $d_{Ut} = (d_{Ut,0}, d_{Ut,1}, \ldots, d_{Ut,k})' = \partial \log f_U(y_t \mid x_t; \beta)/\partial \beta$ is the t-th score function. We refer to (12.7) as the 'parameter equations', since they define the vector parameter β. For the GLM defined in the introduction, with the distribution of $y_t$ belonging to the exponential family, $f_U(y; \theta, \phi) = \exp\{[y\theta - b(\theta)]/a(\phi) + c(y, \phi)\}$, where $a(\cdot) > 0$, $b(\cdot)$ and $c(\cdot)$ are known functions and φ is known. It follows that $m(\theta) = E(y) = \partial b(\theta)/\partial \theta$, so that if $\theta = h(x\beta)$ for some function $h(\cdot)$ with derivative $g(\cdot)$,

$$d_{Ut,j} = \{y_t - m(h(x_t'\beta))\}\, g(x_t'\beta)\, x_{t,j}. \qquad (12.8)$$

The 'census parameter' (Binder, 1983) corresponding to (12.7) is defined as the solution $B_U$ of the equations

$$\sum_{t=1}^{N} d_{Ut}(B_U) = 0. \qquad (12.9)$$

Under informative probability sampling, the parameter equations corresponding to the sample distribution are

$$W_{1s,e}(\beta) = \sum_{s}\, [d_{Ut} + \partial \log E_s(w_t \mid x_t)/\partial \beta] = 0. \qquad (12.11)$$

Note that (12.11) defines the sample likelihood equations.


The second approach uses the relationship (12.4) in order to convert the population expectations in (12.7) into sample expectations. Assuming a random sample of size n from the sample distribution, the parameter equations then have the form

$$W_{2s}(\beta) = \sum_{s} E_s(q_t d_{Ut} \mid x_t) = 0 \qquad (12.12)$$

where $q_t = w_t / E_s(w_t \mid x_t)$. The vector β is estimated under this approach by solving the equations

$$W_{2s,e}(\beta) = \sum_{s} q_t d_{Ut} = 0. \qquad (12.13)$$

The third approach corresponds to the application of the pseudo-likelihood method, with parameter equations

$$W_{3s}(\beta) = \sum_{s} E_s(w_t d_{Ut})/E_s(w_t) = 0. \qquad (12.14)$$

The corresponding estimating equations are

$$W_{3s,e}(\beta) = \sum_{s} w_t d_{Ut} = 0. \qquad (12.15)$$

Note that the estimating equations (12.13) employ the adjusted weights $q_t = w_t/E_s(w_t \mid x_t)$ instead of the standard weights $w_t$ used in (12.15). As discussed in PS, the weights $q_t$ account for the net sampling effects on the target conditional distribution of $y_t \mid x_t$, whereas the weights $w_t$ account also for the sampling effects on the marginal distribution of $x_t$. In particular, when w is a deterministic function of x, so that the sampling process is noninformative, $q_t \equiv 1$ and Equations (12.13) reduce to the ordinary likelihood equations (see (12.16) below). The use of (12.15), on the other hand, may yield highly variable estimators in such cases, depending on the variability of the $x_t$.

The three separate sets of estimating equations defined by (12.11), (12.13), and (12.15) all account for the sampling effects. On the other hand, ignoring the sampling process results in the use of the ordinary (face value) likelihood equations

$$W_s(\beta) = \sum_{s} d_{Ut} = 0. \qquad (12.16)$$

Comment. The estimating equations proposed in this section employ the scores $d_{Ut} = \partial \log f_U(y_t \mid x_t; \beta)/\partial \beta$. However, similar equations can be obtained for other functions $d_{Ut}$; see Bickel et al. (1993) for examples of alternative definitions.
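As an illustration of the difference between the weighting schemes, the following Python sketch solves weighted score equations of the form $\sum_s \text{weight}_t\, d_{Ut} = 0$ for a logistic model by Newton's method: passing the sampling weights $w_t$ corresponds to (12.15), and passing $q_t = w_t/E_s(w_t \mid x_t)$ corresponds to (12.13). The binary (rather than three-category) model and the cell-mean estimate of $E_s(w_t \mid x_t)$ are simplifying assumptions of the sketch, not the chapter's implementation.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logistic_score_solve(X, y, weights, n_iter=50):
    """Solve sum_t weights_t * d_Ut(beta) = 0 for a binary logistic model,
    where d_Ut = (y_t - expit(x_t' beta)) x_t, by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        score = X.T @ (weights * (y - p))                # sum_t weights_t d_Ut
        hess = X.T @ (X * (weights * p * (1 - p))[:, None])  # minus Jacobian
        beta = beta + np.linalg.solve(hess, score)
    return beta

def q_weights(w, x):
    """q_t = w_t / E_s(w_t | x_t), with the conditional expectation estimated
    by the sample mean of w within each distinct x value (discrete x)."""
    q = np.empty_like(w, dtype=float)
    for val in np.unique(x):
        mask = (x == val)
        q[mask] = w[mask] / w[mask].mean()
    return q
```

Under noninformative sampling the estimated $q_t$ are close to 1 within each x-cell, so the q-weighted equations approach the ordinary likelihood equations (12.16).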

12.3.2 Estimation of $E_s(w_t \mid x_t)$

The estimating equations defined by (12.11) and (12.13) contain the expectations $E_s(w_t \mid x_t)$ that depend on the unknown parameters β. When the $w_t$ are continuous, as in probability proportional to size (PPS) sampling with a continuous size variable, the form of these expectations can be identified from the sample data by the following three-step procedure that utilizes (12.5):

1. Regress $w_t$ against $(y_t, x_t)$ to obtain an estimate of $E_s(w_t \mid y_t, x_t)$.

2. Integrate $\int_y E_U(\pi_t \mid y, x_t)\, f_U(y \mid x_t; \beta)\, dy = \int_y [1/E_s(w_t \mid y, x_t)]\, f_U(y \mid x_t; \beta)\, dy$ to obtain an estimate of $E_U(\pi_t \mid x_t)$ as a function of β.

3. Set $\hat{E}_s(w_t \mid x_t) = 1/\hat{E}_U(\pi_t \mid x_t)$. (The resulting expectation is a function of β and of the coefficients indexing $E_s(w_t \mid y_t, x_t)$, which depend on unknown parameters that are estimated in step 1.)
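A minimal numeric sketch of the three steps for a binary outcome, assuming (purely for illustration) a linear working model $E_s(w_t \mid y_t, x_t) = a_0 + a_1 y_t + a_2 x_t$ and a logistic population model with coefficient vector `beta`; both modeling choices are assumptions of the sketch, not prescriptions of the chapter.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def estimate_Es_w_given_x(w, y, x, beta):
    """Three-step sketch for a binary outcome y in {0, 1}, assuming
    E_s(w | y, x) = a0 + a1*y + a2*x and
    f_U(y = 1 | x; beta) = expit(beta[0] + beta[1]*x)."""
    # Step 1: regress w against (y, x) to estimate E_s(w | y, x).
    Z = np.column_stack([np.ones_like(w), y, x])
    a, *_ = np.linalg.lstsq(Z, w, rcond=None)
    # Step 2: E_U(pi | x) = sum_y [1 / E_s(w | y, x)] f_U(y | x; beta), by (12.5).
    p1 = expit(beta[0] + beta[1] * x)
    EU_pi = (1.0 - p1) / (a[0] + a[2] * x) + p1 / (a[0] + a[1] + a[2] * x)
    # Step 3: E_s(w | x) = 1 / E_U(pi | x).
    return 1.0 / EU_pi
```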

Comment 1. Rather than estimating the coefficients indexing the expectation $E_s(w_t \mid y_t, x_t)$ from the sample (step 1), these coefficients can be considered as additional unknown parameters, with the estimating equations extended accordingly. This, however, may complicate the solution of the estimating equations and may also result in identifiability problems under certain models. See PKR for examples and discussion.

Comment 2. For the estimating equations (12.13) that use the weights $q_t = w_t/E_s(w_t \mid x_t)$, the estimation of the expectation $E_s(w_t \mid x_t)$ can be carried out by simply regressing $w_t$ against $x_t$, thus avoiding steps 2 and 3. This is so because in this case there is no need to express the expectation as a function of the parameters β indexing the population distribution.


The discussion so far focuses on the case where the sample selection probabilities are continuous. The evaluation of the expectation $E_s(w_t \mid x_t)$ in the case of discrete selection probabilities is simpler. For example, in the empirical study of this chapter we consider the case of logistic regression with a discrete independent variable x and three possible values for the dependent variable y. For this case the expectation $E_s(w_t \mid x_t = k)$ is estimated as

$$\hat{E}_s(w_t \mid x_t = k) = \Big[\sum_{a=1}^{3} \widehat{\Pr}_U(y_t = a \mid x_t = k)\big/\hat{E}_s(w_t \mid y_t = a, x_t = k)\Big]^{-1} = \Big[\sum_{a=1}^{3} \widehat{\Pr}_U(y_t = a \mid x_t = k)\big/\bar{w}_{ak}\Big]^{-1} \qquad (12.17)$$

where $\bar{w}_{ak} = [\sum_s w_t I(y_t = a, x_t = k)]/[\sum_s I(y_t = a, x_t = k)]$. Here I(A) is the indicator function for the event A. Substituting the logistic function for $\Pr_U(y_t = a \mid x_t = k)$ in the last expression of (12.17) yields the required specification. The estimators $\bar{w}_{ak}$ are considered as fixed numbers when solving the estimating equations.

For the estimating equations (12.13), the expectations $E_s(w_t \mid x_t = k)$ in this example are estimated by

$$\hat{E}_s(w_t \mid x_t = k) = \Big[\sum_s w_t I(x_t = k)\Big]\Big/\Big[\sum_s I(x_t = k)\Big],$$

the sample mean of the weights in cell $x_t = k$ (cf. Comment 2 above).
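The discrete-case estimator (12.17) reduces to cell means of the weights combined harmonically with the model probabilities; a sketch (the function `pr_U` supplying $\Pr_U(y_t = a \mid x_t = k)$, e.g. the logistic model probabilities, is left abstract):

```python
import numpy as np

def cell_mean_weights(w, y, x):
    """w_bar[(a, k)]: mean of the sampling weights w_t over sample units with
    y_t = a and x_t = k, the estimator of E_s(w | y = a, x = k)."""
    w_bar = {}
    for a in np.unique(y):
        for k in np.unique(x):
            mask = (y == a) & (x == k)
            if mask.any():
                w_bar[(int(a), int(k))] = w[mask].mean()
    return w_bar

def Es_w_given_x(pr_U, w_bar, k, y_values=(1, 2, 3)):
    """Equation (12.17): estimate E_s(w | x = k) as the inverse of
    sum_a Pr_U(y = a | x = k) / w_bar[(a, k)]."""
    return 1.0 / sum(pr_U(a, k) / w_bar[(a, k)] for a in y_values)
```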

12.3.3 Testing the informativeness of the sampling process

The estimating equations developed in Section 12.3.1 for the case of informative sampling involve the use of the sampling weights in various degrees of complexity. It is clear therefore that when the sampling process is in fact noninformative, the use of these equations yields more variable estimators than the use of the ordinary score function defined by (12.16). See Tables 12.2 and 12.4 below for illustrations. For the complex sampling schemes in common use, the sample selection probabilities are often determined by the values of several design variables, in which case the informativeness of the selection process is not always apparent. This raises the need for test procedures as a further indication of whether the sampling process is ignorable or not.

Several tests have been proposed in the past for this problem. The common feature of these tests is that they compare the probability-weighted estimators of the target parameters to the ordinary (unweighted) estimators that ignore the sampling process; see Pfeffermann (1993) for review and discussion. For the classical linear regression model, PS propose a set of test statistics that compare the moments of the sample distribution of the regression residuals to the corresponding moments of the population distribution. The use of these tests is equivalent to testing that the correlations under the sample distribution between powers of the regression residuals and the sampling weights are all zero. In Chapter 11 the tests developed by PS are extended to situations where the moments of the model residuals are functions of the regressor variables, as under many of the GLMs in common use.
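The PS-type moment comparisons thus amount to checking correlations between powers of the residuals and the sampling weights under the sample distribution; a minimal sketch of these diagnostic quantities (the choice of three powers is illustrative):

```python
import numpy as np

def residual_weight_correlations(residuals, w, max_power=3):
    """Sample correlations between residuals**m (m = 1..max_power) and the
    sampling weights w; informative sampling tends to make these nonzero,
    which is what the PS-type tests assess."""
    return np.array([np.corrcoef(residuals**m, w)[0, 1]
                     for m in range(1, max_power + 1)])
```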

A drawback of these test procedures is that they involve the use of a series of tests with dependent test statistics, such that the interpretation of the results of these tests is not always clear-cut. For this reason, we propose below a single alternative test that compares the estimating equations that ignore the sampling process to estimating equations that account for it. As mentioned before, the question arising in practice is whether to use the estimating equations (12.16) that ignore the sample selection or one of the estimating equations (12.11), (12.13), or (12.15) that account for it, so that basing the test on these equations is natural. The test is based on the statistics

$$\hat{R}_n = n^{-1}\sum_{s}\hat{R}(x_t); \quad \hat{R}(x_t) = (d_{Ut} - q_t d_{Ut}) \qquad (12.20)$$

and

$$S_n = n^{-1}\sum_{s}(\hat{R}(x_t) - \hat{R}_n)(\hat{R}(x_t) - \hat{R}_n)'.$$

In practice, β is unknown and the score $d_{Ut}$ in $\hat{R}(x_t)$ has to be evaluated at a sample estimate of β. In principle, any of the estimates defined in Section 12.3.1 could be used for this purpose, since under $H_0$ all the estimators are consistent for β, but we find that the use of the solution of (12.16), which ignores the sampling process, is the simplest and yields the best results.


Let $\hat{d}_{Ut}$ denote the value of $d_{Ut}$ evaluated at $\hat{\beta}$ – the solution of (12.16) – and let $\tilde{R}(x_t)$, $\tilde{R}_n$, and $\tilde{S}_n$ be the corresponding values of $\hat{R}(x_t)$, $\hat{R}_n$, and $S_n$ obtained after substituting $\hat{\beta}$ for β in (12.20). The test statistic is therefore

$$\tilde{H}(R) = \frac{n - (k+1)}{k+1}\,\tilde{R}_n'\,\tilde{S}_n^{-1}\,\tilde{R}_n \overset{H_0}{\sim} F_{k+1,\, n-(k+1)}. \qquad (12.21)$$

Note that $\sum_s \hat{d}_{Ut} = 0$ by virtue of (12.16), so $\tilde{R}_n = -n^{-1}\sum_s q_t\hat{d}_{Ut}$. The random variables $q_t\hat{d}_{Ut}$ are no longer independent, since $\sum_s \hat{d}_{Ut} = 0$, but utilizing the property that $E_s(q_t \mid x_t) = 1$ implies that under the null hypothesis $\mathrm{var}_s[\sum_s q_t\hat{d}_{Ut}] = \mathrm{var}_s[\sum_s (q_t\hat{d}_{Ut} - \hat{d}_{Ut})] = \sum_s \mathrm{var}_s(\hat{d}_{Ut} - q_t\hat{d}_{Ut})$, thus justifying the use of $\tilde{S}_n/(n-1)$ as an estimator of $\mathrm{var}(\tilde{R}_n)$ in the construction of the test statistic in (12.21).

Comment. The Hotelling test statistic uses the estimating equations (12.13) for the comparison with (12.16), and here again one could use instead the equations defined by (12.11) or (12.15): that is, replace $q_t d_{Ut}$ in the definition of $\hat{R}(x_t)$ by $d_{Ut} + \partial \log[E_s(w_t \mid x_t)]/\partial \beta$, or by $w_t d_{Ut}$, respectively. The use of (12.11) is more complicated, since it requires evaluation of the expectation $E_s(w_t \mid x_t)$ as a function of β (see Section 12.3.2). The use of (12.15) is the simplest, but it yields inferior results to the use of (12.13) in our simulation study.

12.4 VARIANCE ESTIMATION

Having estimated the model parameters by any of the solutions of the estimating equations in Section 12.3.1, the question arising is how to estimate the variances of these estimators. Unless stated otherwise, the true (estimated) variances are with respect to the sample distribution for a given sample of units; that is, the variance under the pdf obtained as the product of the sample pdfs (12.3). Note also that since the estimating equations are only for the β-parameters, with the coefficients indexing the expectations $E_s(w_t \mid y_t, x_t)$ held fixed at their estimated values, the first four variance estimators below do not account for the variability of the estimated coefficients.

For the estimator $\hat{\beta}_{1s}$ defined by the solution to the estimating equations (12.11), that is, the maximum likelihood estimator under the sample distribution, a variance estimator can be obtained from the inverse of the information matrix evaluated at this estimator. Thus,

$$\hat{V}(\hat{\beta}_{1s}) = \{-E_s[\partial W_{1s,e}(\beta)/\partial \beta']_{\beta=\hat{\beta}_{1s}}\}^{-1}. \qquad (12.22)$$

For the estimator $\hat{\beta}_{2s}$ solving (12.13), we use a result from Bickel et al. (1993).

By this result, if for the true vector parameter $\beta_0$ the left hand side of an estimating equation $W_n(\beta) = 0$ can be approximated as $W_n(\beta_0) = n^{-1}\sum_s \varphi(y_t, x_t; \beta_0) + O_p(n^{-1/2})$ for some function φ satisfying $E(\varphi) = 0$ and $E(\varphi^2) < \infty$, then under some additional regularity conditions on the order of convergence of certain functions,

$$n^{1/2}(\hat{\beta}_n - \beta_0) = n^{-1/2}\sum_{s}[\dot{W}(\beta_0)]^{-1}\varphi(y_t, x_t; \beta_0) + o_p(1) \qquad (12.23)$$

where $\hat{\beta}_n$ is the solution of $W_n(\beta) = 0$, $\dot{W}(\beta_0) = [\partial W(\beta)/\partial \beta']_{\beta=\beta_0}$, and $W(\beta) = 0$ is the parameter equation, with $\dot{W}(\beta_0)$ assumed to be nonsingular.

For the estimating equations (12.13), $\varphi(y_t, x_t; \beta) = q_t d_{Ut}$, implying that the variance of $\hat{\beta}_{2s}$ solving (12.13) can be estimated as

$$\hat{V}_s(\hat{\beta}_{2s}) = [\dot{W}_{2s,e}(\hat{\beta}_{2s})]^{-1}\Big\{\sum_{s}[q_t d_{Ut}(\hat{\beta}_{2s})][q_t d_{Ut}(\hat{\beta}_{2s})]'\Big\}[\dot{W}_{2s,e}(\hat{\beta}_{2s})]^{-1} \qquad (12.24)$$

where $\dot{W}_{2s,e}(\hat{\beta}_{2s}) = [\partial W_{2s,e}(\beta)/\partial \beta']_{\beta=\hat{\beta}_{2s}}$ and $d_{Ut}(\hat{\beta}_{2s})$ is the value of $d_{Ut}$ evaluated at $\hat{\beta}_{2s}$. Note that since $E_s(q_t d_{Ut} \mid x_t) = 0$ (and also $\sum_s q_t d_{Ut}(\hat{\beta}_{2s}) = 0$), the estimator (12.24) estimates the conditional variance $V_s(\hat{\beta}_{2s} \mid \{x_t, t \in s\})$; that is, the variance with respect to the conditional sample distribution of the outcome y.

The estimating equations (12.15) were derived in Section 12.3.1 by two different approaches, implying therefore two separate variance estimators. Under the first approach, these equations estimate the parameter equations (12.14), which are defined in terms of the unconditional sample expectation (the expectation under the sample distribution of $\{(y_t, x_t), t \in s\}$). Application of the result from Bickel et al. (1993) mentioned before yields the following variance estimator (compare with (12.24)):

$$\hat{V}_s(\hat{\beta}_{3s}) = [\dot{W}_{3s,e}(\hat{\beta}_{3s})]^{-1}\Big\{\sum_{s}[w_t d_{Ut}(\hat{\beta}_{3s})][w_t d_{Ut}(\hat{\beta}_{3s})]'\Big\}[\dot{W}_{3s,e}(\hat{\beta}_{3s})]^{-1}. \qquad (12.25)$$


Suppose the sample is selected by Poisson sampling, so that the units are included independently with probabilities of success $\pi_t = \Pr(t \in s)$. Simple calculations imply that for this case the randomization variance estimator (12.26) has the form

$$\hat{V}_R(\hat{\beta}_{3s}) = [\hat{\dot{W}}_U(\hat{\beta}_{3s})]^{-1}\Big\{\sum_{s}(1 - \pi_t)[w_t d_{Ut}(\hat{\beta}_{3s})][w_t d_{Ut}(\hat{\beta}_{3s})]'\Big\}[\hat{\dot{W}}_U(\hat{\beta}_{3s})]^{-1} \qquad (12.27)$$

where $\hat{\dot{W}}_U(\hat{\beta}_{3s}) = \dot{W}_{3s,e}(\hat{\beta}_{3s})$. Thus, the difference between the estimator defined by (12.25) and the randomization variance estimator (12.26) is in this case in the weighting of the products $w_t d_{Ut}(\hat{\beta}_{3s})$ by the factors $(1 - \pi_t)$ in the latter estimator. Since $0 < (1 - \pi_t) < 1$, the randomization variance estimators are smaller than the variance estimators obtained under the sample distribution. This is expected, since the randomization variances measure the variation around the (fixed) population values, and if some of the selection probabilities are large, a correspondingly large portion of the population is included in the sample (with high probability), thus reducing the variance.
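Estimators (12.24), (12.25), and (12.27) share a sandwich form — an inverse Jacobian "bread" around an outer-product "meat" of weighted score contributions — which a small helper makes explicit; the optional factor $(1 - \pi_t)$ yields the randomization variant (12.27).

```python
import numpy as np

def sandwich_variance(score_contribs, W_dot, extra_factor=None):
    """Sandwich variance estimator in the spirit of (12.24)/(12.25)/(12.27).

    score_contribs: n x p matrix whose t-th row is the weighted score
    contribution (q_t d_Ut or w_t d_Ut) evaluated at the estimate.
    W_dot: p x p Jacobian of the estimating equations at the estimate.
    extra_factor: optional length-n vector reweighting each outer product,
    e.g. (1 - pi_t) for the randomization estimator (12.27)."""
    if extra_factor is None:
        meat = score_contribs.T @ score_contribs
    else:
        meat = (score_contribs * extra_factor[:, None]).T @ score_contribs
    bread = np.linalg.inv(W_dot)
    return bread @ meat @ bread.T
```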

Another plausible variance estimation procedure is the use of bootstrap samples. As mentioned before, under general conditions on the sample selection scheme listed in PKR, the sample measurements are asymptotically independent with respect to the sample distribution, implying that the use of the (classical) bootstrap method for variance estimation is well founded. In contrast, the use of the bootstrap method for variance estimation under the randomization distribution is limited and often requires extra modifications; see Sitter (1992) for an overview of bootstrap methods for sample surveys. Let $\hat{\beta}_s$ stand for any of the preceding estimators, and denote by $\hat{\beta}_s^b$ the estimator computed from bootstrap sample b (b = 1, …, B), drawn by simple random sampling with replacement from the original sample (with the same sample size). The bootstrap variance estimator of $\hat{\beta}_s$ is defined as

$$\hat{V}_{boot}(\hat{\beta}_s) = B^{-1}\sum_{b=1}^{B}(\hat{\beta}_s^b - \bar{\beta}_{boot})(\hat{\beta}_s^b - \bar{\beta}_{boot})' \qquad (12.28)$$

where $\bar{\beta}_{boot} = B^{-1}\sum_{b=1}^{B}\hat{\beta}_s^b$. The computation of the estimators from the bootstrap samples accounts in principle for all the sources of variation. This includes the identification of the form of the expectations $E_s(w_t \mid y_t, x_t)$ when unknown, and the estimation of the vector coefficient λ indexing that expectation, which is carried out for each of the bootstrap samples but is not accounted for by the other variance estimation methods, unless the coefficients λ are considered as part of the unknown model parameters (see Section 12.3.2).


12.5 SIMULATION RESULTS

12.5.1 Generation of population and sample selection

In order to assess and compare the performance of the parameter estimators, variance estimators, and the test statistic proposed in Sections 12.3 and 12.4, we designed a Monte Carlo study that consists of the following stages:

A. Generate a univariate population of x-values of size N = 3000, drawn independently from the discrete U[1, 5] probability function, $\Pr(X = j) = 0.2$, $j = 1, \ldots, 5$.

B. Generate corresponding y-values from the logistic probability function

$$\Pr(y_t = 1 \mid x_t) = \exp(\beta_{10} + \beta_{11} x_t)/C$$
$$\Pr(y_t = 2 \mid x_t) = \exp(\beta_{20} + \beta_{21} x_t)/C \qquad (12.29)$$
$$\Pr(y_t = 3 \mid x_t) = 1 - \Pr(y_t = 1 \mid x_t) - \Pr(y_t = 2 \mid x_t)$$

where $C = 1 + \exp(\beta_{10} + \beta_{11} x_t) + \exp(\beta_{20} + \beta_{21} x_t)$.

Stages A and B were repeated independently R = 1000 times.

C. From every population generated in stages A and B, draw a single sample using the following sampling schemes (one sample under each scheme):

Ca. Poisson sampling: units are selected independently with probabilities $\pi_t = n z_t/\sum_{u=1}^{N} z_u$, where n = 300 is the expected sample size and the values $z_t$ are computed in two separate ways:

$$\text{Ca(1)}: z_t(1) = \mathrm{Int}[(5/9)\, y_t^2 u_t + 2 x_t]; \qquad \text{Ca(2)}: z_t(2) = \mathrm{Int}[5 u_t + 2 x_t]. \qquad (12.30)$$

The notation Int[·] defines the integer value, and $u_t \sim U(0, 1)$.

Cb. Stratified sampling: the population units are stratified based on either the values $z_t(1)$ (scheme Cb(1)) or the values $z_t(2)$ (scheme Cb(2)), yielding a total of 13 strata in each case. Denote by $S_{(h)}(j)$ the strata defined by the values $z_t(j)$, such that for units $t \in S_{(h)}(j)$, $z_t(j) = z_{(h)}(j)$, $j = 1, 2$. Let $N_{(h)}(j)$ represent the corresponding strata sizes. The selection of units within the strata was carried out by simple random sampling without replacement (SRSWOR), with the sample sizes $n_{(h)}(j)$ fixed in advance. The sample sizes were determined so that the selection probabilities are similar to the corresponding selection probabilities under the Poisson sampling scheme, and $\sum_h n_{(h)}(j) = 300$, $j = 1, 2$.

The following points are worth noting:

1. The sampling schemes that use the values $z_t(1)$ are informative, as the selection probabilities depend on the y-values. The sampling schemes that use the values $z_t(2)$ are noninformative, since the selection probabilities depend only on the x-values and the inference is targeted at the population model of the conditional probabilities of $y_t \mid x_t$ defined by (12.29).

2. For the Poisson sampling schemes, $E_U[\pi_t \mid \{(y_u, x_u), u = 1, \ldots, N\}]$ depends only on the values $(y_t, x_t)$ when $z_t = z_t(1)$, and only on the value $x_t$ when $z_t = z_t(2)$. (With large populations, the totals $\sum_{t=1}^{N} z_t(j)$ can be regarded as fixed.) For the stratified sampling schemes, however, the selection probabilities depend on the strata sizes $N_{(h)}(j)$, which are random (they vary between populations), so that they depend in principle on all the population values $\{(y_t, x_t), t = 1, \ldots, N\}$, although with large populations the variation of the strata sizes between populations will generally be minor.

3. The stratified sampling scheme Cb corresponds to a case–control study whereby the strata are defined based on the values of the outcome variable (y) and possibly some of the covariate variables; see Chapter 8. (Such sampling schemes are known as choice-based sampling in the econometric literature.) For the case where the strata are defined based only on y-values generated by a logistic model that contains intercept terms, and the sampling fractions within the strata are fixed in advance, it is shown in PKR that the standard MLE of the slope coefficients (that is, the MLE that ignores the sampling scheme) coincides with the MLE under the sample distribution. As illustrated in the empirical results below, this is no longer true when the stratification depends also on the x-values. (As pointed out above, under the design Cb the sampling fractions within the strata have some small variation. We considered also the case of a stratified sample with fixed sampling fractions within the strata and obtained almost identical results as for the scheme Cb.) In order to assess the performance of the estimators derived in the previous sections in situations where the stratification depends only on the y-values, we consider also a third stratified sampling scheme:

Cc. Stratified sampling: the population units are stratified based on the values $y_t$ (three strata); select $n_h$ units from stratum h by SRSWOR with the sample sizes fixed in advance, $\sum_h n_h = 300$.
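Stages A–Ca of the Monte Carlo design can be sketched as follows; the logistic coefficient values are placeholders, since the chapter's actual values are not reproduced in this extract.

```python
import numpy as np

def simulate_population_and_sample(N=3000, n=300, beta1=(-1.0, 0.5),
                                   beta2=(0.5, -0.3), informative=True, seed=0):
    """Generate x ~ discrete U[1,5], y from the logistic model (12.29), and a
    Poisson sample with pi_t = n z_t / sum(z_u), z_t from (12.30).
    beta1, beta2 are placeholder coefficient values (an assumption here)."""
    rng = np.random.default_rng(seed)
    x = rng.integers(1, 6, size=N).astype(float)            # stage A
    e1 = np.exp(beta1[0] + beta1[1] * x)                    # stage B, (12.29)
    e2 = np.exp(beta2[0] + beta2[1] * x)
    C = 1.0 + e1 + e2
    p1, p2 = e1 / C, e2 / C
    u01 = rng.random(N)
    y = 1 + (u01 > p1).astype(int) + (u01 > p1 + p2).astype(int)
    u = rng.random(N)
    if informative:                                         # Ca(1)
        z = np.floor((5.0 / 9.0) * y**2 * u + 2.0 * x)
    else:                                                   # Ca(2)
        z = np.floor(5.0 * u + 2.0 * x)
    pi = n * z / z.sum()
    sampled = rng.random(N) < pi                            # Poisson sampling
    return x, y, pi, sampled
```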

12.5.2 Computations and results

The estimators of the logistic model coefficients $\beta_k = \{(\beta_{k0}, \beta_{k1}),\ k = 1, 2\}$ in (12.29), obtained by solving the estimating equations (12.11), (12.13), (12.15), and (12.16), have been computed for each of the samples drawn by the sampling methods described in Section 12.5.1, yielding four separate sets of estimators. The expectations $E_s(w_t \mid x_t)$ have been estimated using the procedures described in Section 12.3.2. For each point estimator we computed the corresponding variance estimator as defined by (12.22), (12.24), and (12.25) for the first three estimators, and by use of the inverse information matrix (ignoring the sampling process) for the ordinary MLE. In addition, we computed for each of the point estimators the bootstrap variance estimator defined by (12.28). Due to computation time constraints, the bootstrap variance estimators are based on only 100 samples for each of 100 parent samples. Finally, we computed for each sample the Hotelling test statistic (12.21) for sampling informativeness developed in Section 12.3.3.

The results of the simulation study are summarized in Tables 12.1–12.5. These tables show, for each of the five sampling schemes and each point estimator, the mean estimate, the empirical standard deviation (Std) and the mean of the Std estimates (denoted 'Mean Std est.' in the tables) over the 1000 samples, and the mean of the bootstrap Std estimates (denoted 'Mean Std est. Boot' in the tables) over the 100 samples. Table 12.6 compares the theoretical and empirical distribution of the Hotelling test statistic under $H_0$ for the two noninformative sampling schemes defined by Ca(2) and Cb(2).

The main conclusions from the results set out in Tables 12.1–12.5 are as follows:

Table 12.1 Means, standard deviations (Std) and mean Std estimates of logistic regression coefficients. Poisson sampling, informative scheme Ca(1).


Table 12.2 Means, standard deviations (Std) and mean Std estimates of logistic regression coefficients. Poisson sampling, noninformative scheme Ca(2).

1. The three sets of estimating equations defined by (12.11), (12.13), and (12.15) perform well in eliminating the sampling effects under informative sampling. On the other hand, ignoring the sampling process and using the ordinary MLE (Equation (12.16)) yields highly biased estimators.

2. The use of Equations (12.11) and (12.13) produces very similar results. This outcome, found also in PS for ordinary regression analysis, is important since the use of (12.13) is much simpler and requires fewer assumptions than the use of the full equations defined by (12.11); see also Section 12.3.2. The use of simple weighting (Equation (12.15)), which corresponds to the application of the pseudo-likelihood approach, again performs well in eliminating the bias, but except for Table 12.5 the variances under this approach are consistently larger than under the first two approaches, illustrating the discussion in Section 12.3.1. On the other hand, the ordinary MLE that does not involve any weighting has in most cases the smallest standard deviation. The last outcome is known also from other studies.


Table 12.3 Means, standard deviations (Std) and mean Std estimates of logistic regression coefficients. Stratified sampling, informative scheme Cb(1).

3. The MLE variance estimator (12.22) and the semi-parametric estimators (12.24) and (12.25) underestimate in most cases the true (empirical) variance, with an underestimation of less than 8%. As anticipated in Section 12.4, the use of the bootstrap variance estimators corrects for this underestimation by better accounting for all the sources of variation, but this only occurs with the estimating equations (12.13) and (12.15). We have no clear explanation for why the bootstrap variance estimators perform less satisfactorily for the MLE equations defined by (12.11) and (12.16), but we emphasize again that we have used only 100 bootstrap samples for the variance estimation, which in view of the complexity of the estimating equations is clearly not sufficient. It should be mentioned also in this respect that the standard deviation estimators (12.22), (12.24), and (12.25) are more stable (in terms of their standard deviation) than the bootstrap standard deviation estimators, which again can possibly be attributed to the relatively small number of bootstrap samples. (The standard deviations of the standard deviation estimators are not shown.)


Table 12.4 Means, standard deviations (Std) and mean Std estimates of logistic regression coefficients. Stratified sampling, noninformative scheme Cb(2).

4. Our last comment refers to Table 12.5, which relates to the sampling scheme Cc, by which the stratification is based only on the y-values. For this case the first three sets of parameter estimators and the two semi-parametric variance estimators perform equally well. Perhaps the most notable outcome from this table is that ignoring the sampling process in this case and using Equations (12.16) yields similar mean estimates (with smaller standard deviations) for the two slope coefficients as the use of the other equations. Note, however, the very large biases of the intercept estimators. As mentioned before, PKR show that the use of (12.16) yields the correct MLE for the slope coefficients under Poisson sampling, which is close to the stratified sampling scheme underlying this table.

Table 12.6 compares the empirical distribution of the Hotelling test statistic (12.21) over the 1000 samples with the corresponding nominal levels of the theoretical distribution for the two noninformative sampling schemes Ca(2) and Cb(2). As can be seen, the empirical distribution matches the theoretical distribution almost perfectly. We computed the test statistic also under the informative sampling schemes.


Table 12.5 Means, standard deviations (Std) and mean Std estimates of logistic regression coefficients. Stratified sampling, informative scheme Cc.

Table 12.6 Nominal levels and empirical distribution of the Hotelling test statistic under $H_0$ for the noninformative sampling schemes Ca(2) and Cb(2).

12.6 SUMMARY AND EXTENSIONS

This chapter considers three alternative approaches for the fitting of GLMs under informative sampling. All three approaches utilize the relationships between the population distribution and the distribution of the sample observations as defined by (12.1), (12.3), and (12.4), and they are shown to perform well in eliminating the bias of point estimators that ignore the sampling process. The use of the pseudo-likelihood approach, derived here under the framework of the sample distribution, is the simplest, but it is shown to be somewhat inferior to the other two approaches. These two approaches require the modeling and estimation of the expectation of the sampling weights, either as a function of the outcome and the explanatory variables, or as a function of only the explanatory variables. This additional modeling is not always trivial, but general guidelines are given in Section 12.3.2. It is important to emphasize in this respect that the use of the sample distribution as the basis for inference permits the application of standard model diagnostic tools, so that the goodness of fit of the model to the sample data can be tested.

The estimating equations developed under the three approaches allow the construction of variance estimators based on these equations. These estimators have a small negative bias, since they fail to account for the estimation of the expectations of the sampling weights. The use of the bootstrap method, which is well founded under the sample distribution, overcomes this problem for two of the three approaches, but the bootstrap estimators seem to be less stable. Finally, a new test statistic for the informativeness of the sampling process, which compares the estimating equations that account for the sampling process with estimating equations that ignore it, is developed and shown to perform extremely well in the simulation study.

An important use of the sample distribution not considered in this chapter is for prediction problems. Notice first that if the sampling process is informative, the model holding for the outcome variable for units outside the sample is again different from the population model. This implies that even if the population model is known with all its parameters, it cannot be used directly for the prediction of outcome values corresponding to nonsampled units. We mention in this respect that the familiar 'model-dependent estimators' of finite population totals assume noninformative sampling. On the other hand, it is possible to derive the distribution of the outcome variable for units outside the sample, similarly to the derivation of the sample distribution, and then obtain the optimal predictors under this distribution. See Sverchkov and Pfeffermann (2000) for an application of this approach.

Another important extension is to two-level models, with application to small-area estimation. Here again, if the second-level units (schools, geographic areas) are selected with probabilities that are related to the outcome values, the model holding for the second-level random effects might be different from the model holding in the population. Failure to account for the informativeness of the sampling process may yield biased estimates for the model parameters and biased predictions for the small-area means. Appropriate weighting may eliminate the first bias but not the second. Work on the use of the sample distribution for small-area estimation is in progress.


PART D

Longitudinal Data

ISBN: 0-471-89987-9


In this part of the book, we shall use t to denote time, since this use is so natural and widespread in the literature on longitudinal data, and use i to denote a population unit, usually an individual person. As a result, the basic values of the response variable of interest will now be denoted $y_{it}$, with two subscripts: i for unit and t for time. This represents a departure from the notation used so far, where t has denoted unit.

Models will be considered in which time may be either discrete or continuous. In the former case, t takes a sequence of possible values $t_1, t_2, t_3, \ldots$, usually equally spaced. In the latter case, t may take any value in a given interval. The response variable Y may also be either discrete or continuous, corresponding to the distinction between Parts B and C of this book.

The case of continuous Y will be discussed first, in Section 13.2 and in Chapter 14 (Skinner and Holmes). Only discrete time will be considered in this case, with the basic values to be modelled consisting of $\{y_{it};\ i = 1, \ldots, N;\ t = 1, \ldots, T\}$ for the N units in the population and T equally spaced time points. The case of discrete Y will then be discussed in Section 13.3 and in Chapters 15 (Lawless) and 16 (Mealli and Pudney). The emphasis in these chapters will be on continuous time.

A basic longitudinal survey design involves a series of waves of data collection, often equally spaced, for all units in a fixed sample, s. At a given wave, either the current value of a variable or the value for a recent reference period may be measured. In this case, the sample data are recorded in discrete time and will take the form $\{y_{it};\ i \in s,\ t = 1, \ldots, T\}$ for variables that are measured at all waves, in the absence of nonresponse. In the simplest case, for discrete time

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner.

Copyright © 2003 John Wiley & Sons, Ltd.

