and prior are reasonable. Another useful feature of MI is that the data analysis does not necessarily need to be based on the model used to impute the missing values. In particular, the complete-data analysis might consist in computing estimates and standard errors using more conventional design-based methods, in which case the potential effects of model misspecification are confined to the imputation of the missing values. For more discussion of the properties of MI under model misspecification, see, for example, Rubin (1996), Fay (1996), Rao (1996), and the associated discussions.
An important feature of MI is that draws of the missing values are imputed rather than means. Means would be preferable if the objective were to obtain the best estimates of the missing values, but they have drawbacks when the objective is to make inferences about parameters. Imputation of draws entails some loss of efficiency for point estimation, but the averaging over the K multiply-imputed datasets in (18.25) considerably reduces this loss. The gain from imputing draws is that it yields valid inferences for a wide range of estimands, including nonlinear functions such as percentiles and variances (Little, 1988).
The difficulty in implementing MI is in obtaining draws from the posterior distribution of y_mis given z_U, y_obs and r_s, which typically has an intractable form. Since draws from the posterior distribution of y_mis given z_U, y_obs, r_s, and θ are often easy to implement, a simpler scheme is to draw from the posterior distribution of y_mis given z_U, y_obs, r_s and θ̃, where θ̃ is an easily computed estimate of θ such as that obtained from the complete cases. This approach ignores uncertainty in estimating θ, and is termed improper in Rubin (1987). It yields acceptable approximations when the fraction of missing data is modest, but leads to overstatement of precision with large amounts of missing data. In the latter situations one option is to draw θ^(q) from its asymptotic distribution and then impute y_mis from its posterior distribution given z_U, y_obs, r_s, and θ^(q). A better but more computationally intensive approach is to cycle between draws

y_mis^(i) ∼ p(y_mis | z_U, y_obs, r_s, θ^(i−1))  and  θ^(i) ∼ p(θ | z_U, y_obs, y_mis^(i), r_s),

an application of the Gibbs sampler (Tanner and Wong, 1987; Tanner, 1996).
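This cycling can be illustrated with a deliberately minimal sketch (my own toy example, not the chapter's model): a normal sample with known unit variance, a flat prior on the mean, and values missing completely at random. The I-step draws the missing values given the current parameter; the P-step draws the parameter given the completed data.

```python
import random
import statistics

random.seed(0)

# Simulated data: y_i ~ N(5, 1); 30 of 100 values are missing (MCAR).
y_obs = [random.gauss(5.0, 1.0) for _ in range(70)]
n_mis, n = 30, 100

mu = statistics.fmean(y_obs)  # start from the complete-case estimate
draws = []
for it in range(2000):
    # I-step: draw the missing values given the current parameter value.
    y_mis = [random.gauss(mu, 1.0) for _ in range(n_mis)]
    # P-step: draw the parameter given the completed data.  With a flat
    # prior and known unit variance, mu | y ~ N(ybar, 1/n).
    ybar = (sum(y_obs) + sum(y_mis)) / n
    mu = random.gauss(ybar, (1.0 / n) ** 0.5)
    if it >= 500:  # discard burn-in draws
        draws.append(mu)

print(round(statistics.fmean(draws), 1))  # posterior mean of mu, near 5
```

Retaining draws of y_mis at widely spaced iterations of such a chain yields the multiple imputations.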
The formulation provided above requires specification of the joint distribution of y_s and r_s given z_U. As discussed in Section 18.1, if the missing data are missing at random (MAR) in that the distribution of r_s given y_s and z_U does not depend on y_mis, then inference can be based on a model for y_s alone rather than on a model for the joint distribution of y_s and r_s (Rubin, 1976; Little and Rubin, 2002). Specifically, (18.23)–(18.26) can be replaced by the following:
E(θ | z_U, y_obs) ≈ (1/Q) Σ_{q=1}^{Q} θ̂^(q), where θ̂^(q) = E(θ | z_U, y_s^(q)),

Var(θ | z_U, y_obs) ≈ (1/Q) Σ_{q=1}^{Q} Var(θ | z_U, y_s^(q)) + (1 + 1/Q) B_Q,

where y_s^(q) denotes the qth completed dataset and B_Q is the between-imputation variance of the estimates θ̂^(q).
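The combining step is Rubin's (1987) rule: average the completed-data estimates, and add the within-imputation and (inflated) between-imputation variances. A minimal sketch (function and the five hypothetical completed-data results are mine):

```python
import statistics

def mi_combine(estimates, variances):
    """Combine Q completed-data estimates and variances (Rubin, 1987)."""
    q = len(estimates)
    qbar = statistics.fmean(estimates)       # MI point estimate
    wbar = statistics.fmean(variances)       # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance
    total = wbar + (1.0 + 1.0 / q) * b       # total variance
    return qbar, total

# Five hypothetical completed-data estimates and their variances.
est, var = mi_combine([10.1, 9.8, 10.4, 10.0, 9.9],
                      [0.25, 0.24, 0.26, 0.25, 0.25])
print(round(est, 2), round(var, 3))
```

The (1 + 1/Q) factor accounts for using a finite number of imputations.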
Example 3. MI for the Third National Health and Nutrition Examination Survey
The Third National Health and Nutrition Examination Survey (NHANES-3) was the third in a series of periodic surveys conducted by the National Center for Health Statistics to assess the health and nutritional status of the US population. The NHANES-3 survey began in 1988 and was conducted in two phases, the first in 1988–91 and the second in 1991–4. It involved data collection on a national probability sample of 39 695 individuals in the US population. The medical examination component of the survey dictated that it was carried out in a relatively small number (89) of localities of the country known as stands; stands thus form the primary sampling units. The survey was also stratified, with oversampling of particular population subgroups.
This survey was subject to nonnegligible levels of unit and item nonresponse, in both its interview and its examination components. In previous surveys, nonresponse was handled primarily using weighting adjustments. Increasing levels of nonresponse in NHANES, and inconsistencies in analyses of NHANES data attributable to differing treatments of the missing values, led to the desire to develop imputation methods for NHANES-3 and subsequent NHANES surveys that yield valid inferences.
Variables in NHANES-3 can be usefully classified into three groups:
1. Sample frame/household screening variables.
2. Interview variables (family and health history variables).
3. Mobile Examination Center (MEC) variables.
The sample frame/household screening variables can be treated essentially as fully observed. Of all sampled individuals, 14.6% were unit nonrespondents who had only the sampling frame and household screening variables measured. The interview data consist of family questionnaire variables and health variables obtained for sampled individuals. These variables were subject to unit nonresponse and modest rates of item nonresponse. For example, self-rating of health status (for individuals aged 17 or over) was subject to an overall nonresponse rate (including unit nonresponse) of 18.8%, and family income had an overall nonresponse rate of 21.1%.

Missing data in the MEC variables are referred to here as examination nonresponse. Since about 8% of the sample individuals answered the interview questions but failed to attend the examination, rates of examination nonresponse were generally higher than rates of interview nonresponse. For example, body weight at examination had an overall nonresponse rate of 21.6%, systolic blood pressure an overall nonresponse rate of 28.1%, and serum cholesterol an overall nonresponse rate of 29.4%.
The three blocks of variables (screening, interview, examination) had an approximately monotone structure, with screening variables basically fully observed, questionnaire variables missing when the interview is not conducted, and examination variables missing when either (i) the interview is not conducted or (ii) the interview is conducted but the MEC examination does not take place. However, item nonresponse for interview data, and component and item-within-component nonresponse for MEC data, spoil this monotone structure.
A combined weighting and multiple imputation strategy was adopted to create a public-use dataset consisting of over 70 of the main NHANES-3 variables (Ezzati and Khare, 1992; Ezzati-Rice et al., 1993, 1995; Khare et al., 1993). The dataset included the following information:
Basic demographics and geography: age, race/ethnicity, sex, household size, design stratum, stand, interview weight.
Other interview variables: alcohol consumption, education, poverty index, self-reported health, activity level, arthritis, cataracts, chest pain, heart attack, back pain, height, weight, optical health measures, dental health measures, first-hand and second-hand smoking variables.

Medical examination variables: blood pressure measures, serum cholesterol measures, serum triglycerides, hemoglobin, hematocrit, bone density measures, size measures, skinfold measures, weight, iron, drusen score, maculopathy, diabetic retinopathy, ferritin, mc measures, blood lead, red cell measures.
Many of the NHANES variables not included in the above set are recodes of included variables and hence easily derived.
As in previous NHANES surveys, unit nonrespondents were dropped from the sample and a nonresponse weight was created for respondents to adjust for the fact that they are no longer a random sample of the population. The nonresponse weights were created as inverses of estimated propensity-to-respond scores (Rubin, 1985), as described in Ezzati and Khare (1992). All other missing values were handled by multiple imputation, specifically creating five random draws from the predictive distribution of the missing values, based on a multivariate linear mixed model (Schafer, 1996). The database consists of the following six components:
1. A core dataset containing variables that are not subject to imputation (id, demographics, sampling weights, imputation flags) in fixed-width, space-delimited ASCII.
2. Five versions of a data file containing the observed data and the imputed values created as one draw from the joint predictive distribution of the missing values.
3. SAS code that will merge the core data with each of the imputed datasets, assign variable names, etc., to produce five SAS datasets of identical size, with identical variable names.
4. Sample analyses using SUDAAN and Wesvar-PC to estimate means, proportions, quantiles, and linear and logistic regression coefficients. Each analysis will have to be run five times.
5. SAS code for combining five sets of estimates and standard errors using Rubin's (1987) methods for multiple imputation inference, as outlined above.
6. Documentation written for a general audience that details (a) the history of the imputation project, (b) an overview of multiple imputation, (c) NHANES imputation models and procedures, (d) a summary of the 1994–5 evaluation study, (e) instructions on how to use the multiply-imputed database, and (f) caveats and limitations.
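The inverse-propensity weighting for unit nonrespondents can be sketched in its simplest weighting-class form: estimate the response propensity within each adjustment class as the weighted response rate, then divide each respondent's design weight by that propensity. The classes, weights, and response indicators below are invented for illustration; NHANES used estimated propensity scores rather than simple classes.

```python
from collections import defaultdict

# Hypothetical records: (adjustment class, responded?, design weight).
sample = [
    ("urban", True, 120.0), ("urban", False, 120.0), ("urban", True, 130.0),
    ("urban", True, 110.0), ("rural", True, 200.0), ("rural", False, 210.0),
]

# Weighted response rate within each class estimates the propensity.
wtot, wresp = defaultdict(float), defaultdict(float)
for cls, resp, w in sample:
    wtot[cls] += w
    if resp:
        wresp[cls] += w
phat = {cls: wresp[cls] / wtot[cls] for cls in wtot}

# Respondent weight = design weight / estimated response propensity.
nr_weights = {
    i: w / phat[cls] for i, (cls, resp, w) in enumerate(sample) if resp
}
print({k: round(v, 1) for k, v in sorted(nr_weights.items())})
```

A useful check on this construction: within each class, the adjusted respondent weights sum to the total design weight of the class.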
A separate model was fitted to sample individuals in nine age classes, with sample sizes ranging from 1410 to 8375 individuals. One reason for stratifying in this way is that the set of variables defined for individuals varies somewhat by age, with a restricted set applying to children under 17 years, and some variables restricted to adults aged over 60 years. Also, stratification on age is a simple modeling strategy for reflecting the fact that relationships between NHANES variables are known to vary with age.
I now describe the basic form of the model for a particular age stratum. For individual t in stand c, let y_tc be the (1 × J) vector of the set of items subject to missing data, and let x_tc be a fixed (1 × p) vector of design variables and items fully observed except for unit nonresponse. It is assumed that

y_tc | b_c ∼ N_J(x_tc β + b_c, Σ),
b_c ∼ N_J(0, Ψ), c = 1, ..., 89; t = 1, ..., n_c,   (18.27)

where β is a (p × J) matrix of fixed effects, b_c is a (1 × J) vector of random stand effects with mean zero and covariance matrix Ψ = diag(ψ_1, ..., ψ_J), and Σ is an unstructured covariance matrix; conditioning on (β, Σ, Ψ) in (18.27) is implicit. It is further assumed that the missing components of y_tc are missing at random and the parameters (β, Σ, Ψ) are distinct from parameters defining the mechanism, so that the missing-data mechanism does not have to be modeled for likelihood inference (Rubin, 1976; Little and Rubin, 2002).
In view of the normality assumption in (18.27), most variables were transformed to approximate normality using standard power transformations. A few variables not amenable to this approach were forced into approximate normality by calculating the empirical cdf and mapping to the corresponding quantiles of the standard normal distribution. The model (18.27) is a refinement over earlier imputation models in that stand is treated as a random effect rather than a fixed effect. This reduces the dimensionality of the model and allows for greater pooling of information across stands.
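The empirical-cdf device can be sketched as a normal-scores transform. The plotting positions (rank + 0.5)/n below are one common convention for avoiding cdf values of exactly 0 or 1, not necessarily the one used for NHANES-3.

```python
import statistics

def normal_scores(values):
    """Map each value's empirical cdf position to the corresponding
    standard normal quantile, forcing approximate normality."""
    n = len(values)
    nd = statistics.NormalDist()
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = nd.inv_cdf((rank + 0.5) / n)
    return scores

skewed = [1, 2, 2, 3, 5, 9, 40]  # a right-skewed variable
print([round(z, 2) for z in normal_scores(skewed)])
```

The transformed values are symmetric around zero by construction, regardless of how skewed the input is; only the ranks of the original values are used.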
The Gibbs sampler was used to generate draws from the posterior distribution of the parameters and the missing values for the model in Section 18.2.1, with diffuse conjugate prior distributions on (β, Σ, Ψ). S-Plus and Fortran code is available at Joseph Schafer's web site at http://www.psu.edu/~jls. In summary, given values from the ith iteration, the (i+1)th iteration of Gibbs sampling involves the following five steps:

Draw (b_c^(i+1) | y_obs,tc, y_mis,tc^(i), β^(i), Ψ^(i), Σ^(i)) ∼ Normal, c = 1, ..., 89;
Draw (Ψ^(i+1) | y_obs,tc, y_mis,tc^(i), β^(i), Σ^(i), {b_c^(i+1)}) ∼ Inverse Wishart;
Draw (Σ^(i+1) | y_obs,tc, y_mis,tc^(i), β^(i), {b_c^(i+1)}, Ψ^(i+1)) ∼ Inverse Wishart;
Draw (β^(i+1) | y_obs,tc, y_mis,tc^(i), {b_c^(i+1)}, Ψ^(i+1), Σ^(i+1)) ∼ Normal;
Draw (y_mis,tc^(i+1) | y_obs,tc, {b_c^(i+1)}, Ψ^(i+1), Σ^(i+1), β^(i+1)) ∼ Normal, c = 1, ..., 89.

Here y_obs,tc consists of the set of observed items in the vector y_tc, and y_mis,tc the set of missing items. More details of the forms of these distributions are given in Schafer (1996).
The Gibbs sampler for each age stratum was run as a single chain and converged rapidly, reflecting the fact that the model parameters and random effects were well estimated. After an initial run-in period, draws of the missing values were taken at fixed intervals in the chain, and these were transformed back to their original scales and rounded to produce the five sets of imputations.
18.4 NONIGNORABLE MISSING DATA
The models discussed in the previous two sections assume the missing data are MAR. Nonignorable, non-MAR models are needed when missingness depends on the missing values. For example, suppose a participant in an income survey refused to report an income amount because the amount itself is high (or low). If missingness of the income amount is associated with the amount, after controlling for observed covariates (such as age, education, or occupation), then the mechanism is not MAR, and methods for imputing income based on MAR models are subject to bias. A correct analysis must be based on the full likelihood from a model for the joint distribution of y_s and r_s. The standard likelihood asymptotics apply to nonignorable models provided the parameters are identified, and computational tools such as the Gibbs sampler also apply to this more general class of models.
Suppose the missing-data mechanism is nonignorable, but the selection mechanism is ignorable, so that a model is not required for the inclusion indicators i_U. There are two broad classes of models for the joint distribution of y_s and r_s (Little and Rubin, 2002, Ch. 11; Little, 1993b). Selection models model the joint distribution as

p(y_s, r_s | z_U, θ, ψ) = p(y_s | z_U, θ) p(r_s | z_U, y_inc, ψ),   (18.28)

as in Section 18.1. Pattern-mixture models specify

p(y_s, r_s | z_U, γ, π) = p(y_s | z_U, r_s, γ) p(r_s | z_U, π),   (18.29)

where γ and π are unknown parameters, and the distribution of y_s is conditioned on the missing-data pattern r_s. Equations (18.28) and (18.29) are simply two different ways of factoring the joint distribution of y_s and r_s. When r_s is independent of y_s the two specifications are equivalent with θ = γ and ψ = π. Otherwise (18.28) and (18.29) generally yield different models.
Pattern-mixture models (18.29) seem more natural when missingness defines a distinct stratum of the population of intrinsic interest, such as individuals reporting 'don't know' in an opinion survey. However, pattern-mixture models can also provide inferences for parameters θ of the complete-data distribution, by expressing the parameters of interest as functions of the pattern-mixture model parameters γ and π (Little, 1993b). An advantage of the pattern-mixture modeling approach over selection models is that assumptions about the form of the missing-data mechanism are sometimes less specific in their parametric form, since they are incorporated in the model via parameter restrictions. This point is explained for specific normal pattern-mixture models in Little (1994) and Little and Wang (1996).
Most of the literature on nonignorable missing data has concerned selection models of the form (18.28), for univariate nonresponse. An early example is the probit selection model.

Example 4. Probit selection model
Suppose Y is scalar and incompletely observed, X_1, ..., X_p represent design variables and fully observed survey variables, and interest concerns the parameters β of the regression of Y on X_1, ..., X_p. A normal linear model is assumed for this regression, that is,

y_t = β_0 + β_1 x_1t + ... + β_p x_pt + ε_t,  ε_t ∼ N(0, σ²),   (18.30)

and the probability that y_t is observed is modeled as

Pr(r_t = 1 | x_1t, ..., x_pt, y_t) = Φ(c_0 + c_1 x_1t + ... + c_p x_pt + c_{p+1} y_t),   (18.31)

where Φ denotes the probit function. When c_{p+1} ≠ 0, this probability is a monotonic function of the values of Y, and the missing-data mechanism is nonignorable. If, on the other hand, c_{p+1} = 0 and (c_0, ..., c_p) and (β, σ²) are distinct, then the missing-data mechanism is ignorable, and maximum likelihood estimates of (β, σ²) are obtained by least squares linear regression based on the complete cases.
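A small simulation (my own, with assumed coefficients c_0 = 0 and c_{p+1} ∈ {0, 1} and no covariates) illustrates why c_{p+1} matters: when the response probability depends on y itself, the complete-case mean is biased.

```python
import random
import statistics

random.seed(1)
nd = statistics.NormalDist()

# Y ~ N(0, 1), and Y is observed with probability Phi(c0 + c1 * y):
# c1 != 0 makes missingness depend on the missing value itself.
def complete_case_mean(c1, n=100_000, c0=0.0):
    ys = [random.gauss(0.0, 1.0) for _ in range(n)]
    observed = [y for y in ys if random.random() < nd.cdf(c0 + c1 * y)]
    return statistics.fmean(observed)

print(round(complete_case_mean(0.0), 2))  # ignorable: close to the true mean 0
print(round(complete_case_mean(1.0), 2))  # nonignorable: noticeably above 0
```

With c_1 = 1 the respondents over-represent large values of Y, so the complete-case mean drifts well above the population mean of zero.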
Amemiya (1984) calls (18.31) a Type II Tobit model, and it was first introduced to describe selection of women into the labor force (Heckman, 1976). It is closely related to the logit selection model of Greenlees, Reece and Zieschang (1982), which is extended to repeated-measures data in Diggle and Kenward (1994). This model is substantively appealing, but problematic in practice, since information to simultaneously estimate the parameters of the missing-data mechanism and the parameters of the complete-data model is usually very limited, and estimates are very sensitive to misspecification of the model (Little, 1985; Stolzenberg and Relles, 1990). The following example illustrates the problem.

Example 5. Income nonresponse in the Current Population Survey
Lillard, Smith and Welch (1982, 1986) applied the probit selection model of Example 4 to income nonresponse in four rounds of the Current Population Survey Income Supplement, conducted in 1970, 1975, 1976, and 1980. In 1980 their sample consisted of 32 879 employed white civilian males aged 16–65 who reported receipt (but not necessarily amount) of W, wages and salary earnings, and who were not self-employed. Of these individuals, 27 909 reported the value of W and 4970 did not. In the notation of Example 4, Y is defined to equal (W^δ − 1)/δ, where δ indexes a power transformation of the kind proposed in Box and Cox (1964). The predictors X were chosen as education (five dummy variables), years of market experience (four linear splines), probability of being in first year of market experience, region (south or other), child of household head (yes, no), other relative of household head or member of secondary family (yes, no), personal interview (yes, no), and year in survey (1 or 2). The last four variables were omitted from the earnings equation (18.30); that is, their coefficients in the vector β were set equal to zero. The variables education, years of market experience, and region were omitted from the response equation (18.31); that is, their coefficients in the vector c were set to zero.
Lillard, Smith and Welch (1982) fit the probit selection model (18.30) and (18.31) for a variety of other choices of δ. Their best-fitting model, δ̂ = 0.45, predicted large income amounts for nonrespondents, in fact 73% larger on average than imputations supplied by the Census Bureau, which used a hot deck method that assumes ignorable nonresponse. However, this large adjustment is founded on the normal assumption for the population residuals from the δ = 0.45 model, and on the specific choice of covariates in (18.30) and (18.31). It is quite plausible that nonresponse is ignorable and the unrestricted residuals follow the same (skewed) distribution as that in the respondent sample. Indeed, comparisons of Census Bureau imputations with IRS income amounts from matched CPS/IRS files do not indicate substantial underestimation (David et al., 1986).

Rather than attempting to simultaneously estimate the parameters of the model for Y and the model for the missing-data mechanism, it seems preferable to conduct a sensitivity analysis to see how much the answers change for various assumptions about the missing-data mechanism. Examples of this approach for pattern-mixture models are given in Rubin (1977), Little (1994), Little and Wang (1996), and Scharfstein, Robins and Rotnitzky (1999). An alternative to simply accepting high rates of potentially nonignorable missing data for financial variables such as income is to use special questionnaire formats that are designed to collect a bracketed observation whenever a respondent is unable or unwilling to provide an exact response to a financial amount question. Heeringa, Little and Raghunathan (2002) describe a Bayesian MI method for multivariate bracketed data on household assets in the Health and Retirement Survey. The theoretical underpinning of these methods involves the extension of the formulation of missing-data problems via the joint distribution of y_U, i_U and r_s in Section 18.1 to more general incomplete-data problems involving coarsened data (Heitjan and Rubin, 1991; Heitjan, 1994). Full and ignorable likelihoods can be defined for this more general setting.
This chapter is intended to provide some indication of the generality and flexibility of the Bayesian approach to surveys subject to unit and item nonresponse. The unified conceptual basis of the Bayesian paradigm is very appealing, and computational tools for implementing the approach are becoming increasingly available in the literature. What is needed to convince practitioners are more applications such as that described in Example 3, more understanding of useful baseline 'reference' models for complex multistage survey designs, and more accessible, polished, and well-documented software: for example, SAS (2001) now has procedures (PROC MI and PROC MIANALYZE) for creating and analysing multiply-imputed data. I look forward to further developments in these directions in the future.
ACKNOWLEDGEMENTS

This research is supported by grant DMS-9803720 from the National Science Foundation.
We are interested in constructing an estimator for a large vector of characteristics using data from several sources and/or several phases of sampling. We will concentrate on the use of the information at the estimation stage, omitting discussion of use at the design stage.

Two types of two-phase samples can be identified on the basis of sample selection. In one type, a first-phase sample is selected, some characteristics of the sample elements are identified, and a second-phase sample is selected using the characteristics of the first-phase units as controls in the selection process.
A second type, and the type of considerable interest to us, is one in which a first-phase sample is selected and a rule for selection of second-phase units is specified as part of the field procedure. Very often the selection of second-phase units is not a function of first-phase characteristics. One example of the second type is a survey of soil properties conducted by selecting a large sample of points. At the same time the large sample is selected, a subset of the points, the second-phase sample, is specified. In the field operation, a small set of data is collected from the first-phase sample and a larger set is collected from the second-phase sample. A second example of the second type is that of a population census in which most individuals receive a short form, but a subsample receives a long form with more data elements.

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. Copyright © 2003 John Wiley & Sons, Ltd. ISBN: 0-471-89987-9.

The sample probability-of-selection structure of the two-phase sample is sometimes used for a survey containing item nonresponse. Item nonresponse
is the situation in which respondents provide information for some, but not all, items on the questionnaire. The use of the two-phase model for this situation has been discussed by Särndal and Swensson (1987) and Rao and Sitter (1995).
A very similar situation is a longitudinal survey in which respondents do not respond at every point of the survey. Procedures closely related to multiple phase estimation for these situations have been discussed by Fuller (1990, 1999). See also Little and Rubin (1987).
Our objective is to produce an easy-to-use dataset that meets several criteria. Generally speaking, an easy-to-use dataset is a file of complete records with associated weights such that linear estimators are simple weighted sums. The estimators should incorporate all available information, and should be design consistent for a wide range of population parameters at aggregate levels, such as states. The dataset will be suitable for analytic uses, such as comparison of domains, the computation of regression equations, or the computation of the solutions to estimating equations. We also desire a dataset that produces reasonable estimates for small areas, such as a county. A model for some of the small-area parameters may be required to meet reliability objectives for the small areas.
19.2 REGRESSION ESTIMATION
19.2.1 Introduction
Our discussion proceeds under the model in which the finite population is a sample from an infinite population of (x_t, y_t) vectors. It is assumed that the vectors have finite superpopulation fourth moments. The members of the finite population are indexed with integers U = {1, 2, ..., N}. We let (μ_x, μ_y) denote the mean of the superpopulation vector and let (x̄_U, ȳ_U) denote the mean of the finite population.
The set of integers that identify the sample is the set s. In a two-phase sample, we let s_1 be the set of elements in the first phase and s_2 be the set of elements in the second phase. Let there be n_1 units in the first sample and n_2 units in the second sample. When we discuss consistency, we assume that n_1 and n_2 increase at the same rate.
Assume a first-phase sample is selected with selection probabilities π_1t. A second-phase sample is selected by a procedure such that the total probability of selection is π_2t. Thus, the total probability of being selected for the second-phase sample can be written

π_2t = π_1t π_2t|1,

where π_2t|1 is the conditional probability of selecting unit t at the second phase given that it is selected in the first phase.
Recall the two types of two-phase samples discussed in the introduction. In the first type, where the second-phase sample is selected on the basis of the characteristics of the first-phase sample, it is possible that only the first-phase probabilities and the conditional probabilities of the second-phase units given the first-phase sample are known. In such a situation, the total probability of selection for the second phase is an expectation over the possible first-phase samples; if the conditional probabilities are not known for all possible s_1, then π_2t cannot be calculated. On the other hand, in the second type of two-phase samples, the probabilities π_1t and π_2t are known.

19.2.2 Regression estimation for two-phase samples
The two-phase regression estimator of the mean of y is

ȳ_reg = x̄_1 β̂,   (19.3)

where x̄_1 is the weighted mean of x for the first-phase sample and

β̂ = ( Σ_{t∈s_2} x_t' λ_t x_t )^{−1} Σ_{t∈s_2} x_t' λ_t y_t

for weights λ_t such as λ_t = π_2t^{−1}. If π_2t is not known, λ_t = (π_1t π_2t|s_1)^{−1} is a possibility. In some cases, such as simple random sampling at both phases, the weights are equivalent.

If the regression variable that is identically equal to one is isolated and the x-vector written as x_t = (1, d_t), the regression estimator can be written in the familiar form

ȳ_reg = ȳ_2 + (d̄_1 − d̄_2) γ̂,   (19.7)
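The familiar form in (19.7) can be sketched for a single auxiliary variable d, with equal weights as under simple random sampling at both phases. The data below are invented for illustration.

```python
# Two-phase regression estimator: ybar_reg = ybar2 + (dbar1 - dbar2) * gamma_hat.
# d is observed for the full phase 1 sample; (d, y) only for the phase 2 subsample.
phase1_d = [4.0, 6.0, 5.0, 7.0, 3.0, 5.0, 6.0, 4.0]          # d over s1
phase2 = [(4.0, 9.0), (6.0, 13.0), (5.0, 11.5), (3.0, 7.5)]  # (d, y) over s2

dbar1 = sum(phase1_d) / len(phase1_d)
dbar2 = sum(d for d, _ in phase2) / len(phase2)
ybar2 = sum(y for _, y in phase2) / len(phase2)

# Least squares slope of y on d in the phase 2 sample.
sdd = sum((d - dbar2) ** 2 for d, _ in phase2)
sdy = sum((d - dbar2) * (y - ybar2) for d, y in phase2)
gamma_hat = sdy / sdd

ybar_reg = ybar2 + (dbar1 - dbar2) * gamma_hat
print(gamma_hat, ybar_reg)
```

The phase 2 mean of y is shifted toward what the regression predicts at the more precisely estimated phase 1 mean of d; the gain over ȳ_2 grows with the correlation between d and y.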
Using conditional expectations, the variance of the two-phase estimator (19.3) is

V{ȳ_reg} = V{E[ȳ_reg | s_1]} + E{V[ȳ_reg | s_1]}.   (19.9)

If we approximate the variance by the variance of the O_p(n^{−1/2}) terms in the Taylor expansion, we have

V{E[ȳ_reg | s_1]} ≈ V{ȳ_1},   (19.10)

and

V{ȳ_reg | s_1} ≈ V{ ( Σ_{t∈s_2} π_{2t|s_1}^{−1} )^{−1} Σ_{t∈s_2} π_{2t|s_1}^{−1} e_{1,t} | s_1 },

where the e_{1,t} are residuals from the weighted first-phase regression of y on x with coefficient

( Σ_{t∈s_1} x_t' π_{1t}^{−1} x_t )^{−1} Σ_{t∈s_1} x_t' π_{1t}^{−1} y_t,

ȳ_1 = ( Σ_{t∈s_1} π_{1t}^{−1} )^{−1} Σ_{t∈s_1} π_{1t}^{−1} y_t,

and ȳ_1 is the (unobservable) mean for all units in the first-phase sample.

19.2.3 Three-phase samples
To extend the results to three-phase estimation, assume a third-phase sample of size n_3 is selected from a second-phase sample of size n_2, which is itself a sample of a first-phase sample of size n_1 selected from the finite population. Let s_1, s_2, s_3 be the sets of indices in the phase 1, phase 2, and phase 3 samples, respectively.
A relatively common situation is that in which the first-phase sample is the entire population.
We assume the vector (1, x) is observed in the phase 1 sample, the vector (1, x, y) is observed in the phase 2 sample, and the vector (1, x, y, z) is observed in the phase 3 sample. Let x̄_1 be the weighted mean for the phase 1 sample and let a two-phase estimator of the mean of y be

μ̂_y = ȳ_2 + (x̄_1 − x̄_2) β̂_2,   (19.12)

where

β̂_2 = ( Σ_{t∈s_2} (x_t − x̄_1)' λ_2t (x_t − x̄_1) )^{−1} Σ_{t∈s_2} (x_t − x̄_1)' λ_2t (y_t − ȳ_1),
and the λ_2t are weights. The estimator μ̂_y is an estimator of the finite population mean, as well as of the superpopulation mean. Then a three-phase estimator of the mean of z can be constructed analogously from the regression of z on (x, y), with weights λ_3t chosen so that

Σ_{t∈s_3} λ_3t (x_t, y_t, z_t)

is consistent for (x̄_U, ȳ_U, z̄_U). If all three samples are simple random samples from a normal superpopulation, the estimator (19.13) is the maximum likelihood estimator. See Anderson (1957).
The variance of the three-phase estimator can be obtained by repeated application of conditional expectation arguments:

V{μ̂_z} = V{E[μ̂_z | s_1]} + E{V[E(μ̂_z | s_1, s_2) | s_1]} + E{V[μ̂_z | s_1, s_2]}.   (19.14)

19.2.4 Variance estimation
Several approaches can be considered for the estimation of the variance of a two-phase sample. One approach is to attempt to estimate the two terms of (19.9) using an estimator of the form

V̂{ȳ_reg} = V̂{ȳ_1} + V̂{ȳ_reg | s_1}.   (19.15)

The estimated conditional variance, the second term on the right of (19.15), can be constructed as the standard variance estimator for the regression estimator of ȳ_1 given s_1. If the first-phase sample is a simple random nonreplacement sample and the x-vector defines strata, then a consistent estimator of V{ȳ_1} can be constructed as

V̂{ȳ_1} = N^{−1}(N − n_1) n_1^{−1} Ŝ²,

where

Ŝ² = ( Σ_{t∈s_2} π_{1t}^{−1} π_{2t|s_1}^{−1} )^{−1} Σ_{t∈s_2} π_{1t}^{−1} π_{2t|s_1}^{−1} (y_t − ȳ_π)²

and

ȳ_π = ( Σ_{t∈s_2} π_{1t}^{−1} π_{2t|s_1}^{−1} )^{−1} Σ_{t∈s_2} π_{1t}^{−1} π_{2t|s_1}^{−1} y_t.
A general estimator of V{ȳ_1} can be constructed by recalling that the Horvitz–Thompson estimator of V{ȳ_1} of (19.10) can be written in the form

V̂{ȳ_1} = N^{−2} Σ_{t∈s_1} Σ_{u∈s_1} (π_{1tu} − π_{1t} π_{1u}) π_{1tu}^{−1} π_{1t}^{−1} y_t π_{1u}^{−1} y_u,   (19.16)

where π_{1tu} is the probability that unit t and unit u appear together in the phase 1 sample. Using the observed phase 2 sample, the variance (19.16) can be estimated with

V̂_2{ȳ_1} = N^{−2} Σ_{t∈s_2} Σ_{u∈s_2} π_{2tu|s_1}^{−1} (π_{1tu} − π_{1t} π_{1u}) π_{1tu}^{−1} π_{1t}^{−1} y_t π_{1u}^{−1} y_u,   (19.17)

provided π_{2tu|s_1} is positive for all (t, u). This variance estimator is discussed by Särndal, Swensson and Wretman (1992, Ch. 9). Expression (19.17) is not always easy to implement. See Kott (1990, 1995), Rao and Shao (1992), Breidt and Fuller (1993), Rao and Sitter (1995), Rao (1996), and Binder (1996) for discussions of variance estimation.
We now consider another approach to variance estimation. We can expand the estimator (19.3) in a first-order Taylor expansion to obtain

x̄_1 β̂ = x̄_U β_U + (x̄_1 − x̄_U) β_U + x̄_U (β̂ − β_U) + O_p(n^{−1}),

where x̄_U is the finite population mean and β_U is the finite population regression coefficient. If the covariance between the two leading random terms is negligible, the variance is approximately

V{ȳ_reg} ≈ V{(x̄_1 − x̄_U) β_U} + V{x̄_U (β̂ − β_U)}.   (19.18)

In many designs it is reasonable to assume that C{x̄_1, β̂} = 0, and variance estimation is much simplified. See Fuller (1998).
The variances in (19.18) are unconditional variances. The variance of x̄_1 can be estimated with a variance formula appropriate for the complete first-phase sample. The component V{β̂} is the more difficult component to estimate in this representation. If the second-phase sampling can be approximated by Poisson sampling, then V{β̂} can be estimated with a two-stage variance formula. In the example of simple random sampling at the first phase and stratified sampling for the second phase, the diagonal elements of V{β̂} are the variances of the stratum means. In this case, use of the common estimators produces estimators (19.15) and (19.18) that are nearly identical.
19.2.5 Alternative representations
The estimator (19.3) can be written in a number of different ways. Because it is a linear estimator, we can write it as a weighted sum of the phase 2 observations,

ȳ_reg = Σ_{t∈s_2} w_2t y_t.   (19.19)

Alternatively, letting ŷ_t = x_t β̂ denote the predicted values, the estimator can be computed from the n_1 phase 1 elements as

ȳ_reg = ( Σ_{t∈s_1} π_{1t}^{−1} )^{−1} Σ_{t∈s_1} π_{1t}^{−1} ŷ_t.   (19.22)

The equivalence of these forms rests on

Σ_{t∈s_2} (y_t − ŷ_t) λ_t x_t = 0,

which holds under the conditions on the column space of X.
We have presented estimators for the mean. If total estimation is of interest, all estimators are multiplied by the population size, if it is known, or by Σ_{t∈s_1} π_{1t}^{−1} if the population size is unknown.
The estimator (19.19), based on a dataset of n_2 elements, and the estimator (19.22), based on n_1 elements, give the same estimated value for a mean or total of y. However, the two formulations can be used to create different estimators of other parameters, such as estimators for subpopulations.
To construct estimator (19.3) for a domain D, we would define a new second-phase analysis variable

y_Dt = y_t if t ∈ D, and y_Dt = 0 otherwise.

Then one would regress y_Dt on x_t and compute the regression estimator, where the regression estimator is of the form (19.19) with the weights w_2t applied to the variable y_Dt. The design-consistent estimators for some small subpopulations may have very large variances because some of the samples will contain few or no final-phase sample elements for the subpopulation.
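The domain device can be sketched directly; the records and weights below are invented, with w_2t playing the role of the phase 2 weights. Setting y_Dt to zero outside the domain keeps the full-sample weight system intact.

```python
# Domain estimation via the second-phase analysis variable y_Dt:
# y_Dt = y_t for t in D and 0 otherwise, with phase 2 weights w2.
# Hypothetical records: (unit id, in domain D?, y, w2).
records = [
    (1, True, 12.0, 50.0),
    (2, False, 8.0, 50.0),
    (3, True, 15.0, 40.0),
    (4, False, 9.0, 60.0),
]

# Weighted total of y_Dt estimates the domain total of y.
y_d_total = sum(w * (y if in_d else 0.0) for _, in_d, y, w in records)
n_hat = sum(w for *_rest, w in records)  # estimated population size
print(y_d_total, y_d_total / n_hat)      # domain total; population mean of y_Dt
```

Note that units outside D still contribute their weights to n_hat, which is why the second printed quantity is the population mean of y_Dt rather than the mean of y within the domain.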
The estimator for domain D associated with (19.22) is

Ŷ_D = Σ_{t∈D∩s_1} π_{1t}^{−1} ŷ_t.   (19.23)

The estimator based on (19.19) is a design-consistent estimator, but the estimator (19.22) is not. The fact that estimator (19.22) produces nonzero estimates for any subpopulation with a first-phase x is very appealing.
However, the domain estimator (19.23) is a model-dependent estimator. Under the model in which

y_t = x_t β + e_t,

and the e_t are zero-mean random variables, independent of x_u for all t and u, the estimator (19.23) is unbiased for the subpopulation sum. Under randomization theory, the estimator (19.23) is biased, and it is possible to construct populations for which the estimator is very biased.
The method of producing a dataset with (ŷ_t, x_t) as data elements will be satisfactory for some subpopulation totals. However, the single set of ŷ-values cannot be used to construct estimates for other characteristics of the distribution of y, such as the quantiles or percentiles of y. This is because the estimator of the cumulative distribution function evaluated at B is the mean of an analysis variable that is one if y_t is less than B and is zero otherwise. The creation of a dataset that will produce valid quantile estimates has been discussed in the imputation literature and as a special topic by others. See Rubin (1987), Little and Rubin (1987), Kalton and Kasprzyk (1986), Chambers and Dunstan (1986), Nascimento Silva and Skinner (1995), Brick and Kalton (1996), and Rao (1996). The majority of the imputation procedures require a model specification.
19.3 REGRESSION ESTIMATION WITH IMPUTATION
Our objective is to develop an estimation scheme that uses model estimation to improve estimates for small areas, but that retains many of the desirable properties of design-based two-phase estimation.
Our model for the population is

y_t = x_t β + e_t,   (19.25)

where the e_t are independent zero-mean, finite variance random variables, independent of x_u for all t and u. We subdivide the population into subunits called cells. Assume that cells exist such that

e_t ∼ II(0, σ²_g), t ∈ C_g,   (19.26)

where C_g is the set of indices in the gth cell, and the notation II is read 'distributed independently and identically.' A common situation is that in which the x-vector defines a set of imputation cells. Then the model (19.25)–(19.26) reduces to the assumption that the observations within a cell are independently and identically distributed. Such a model can be motivated through assumptions on the nature of the parent population and the selection process.
Consider a two-phase sample in which the x-vectors are known for the entire first phase. Let the weighted sample regression be that defined in (19.3), and assume the λ_t and the selection probabilities are such that the estimator is design consistent. To construct a design-consistent estimator when units within a cell have been selected with unequal probability, we select the donors within a cell with probability proportional to

B_t = ( Σ_{u∈s_2} π_{1u}^{−1}(π_{2u|s_1}^{−1} − 1) )^{−1} π_{1t}^{−1}(π_{2t|s_1}^{−1} − 1).   (19.31)

Then the conditional expectation, given s_2, of the weighted sum of the imputed deviations for the units not in the phase 2 sample,

E{ Σ_{t∈s_1∩s_2^c} π_{1t}^{−1} e_{rt} | s_2 },   (19.32)

is approximately equal to the corresponding weighted sum of the observed phase 2 deviations. Therefore, the imputed estimator has expectation approximately equal to the weighted sum

Σ_{t∈s_1} π_{1t}^{−1} x_t β̂ + Σ_{t∈s_1} π_{1t}^{−1} ê_rt,   (19.34)

where ê_rt = ê_t if t ∈ s_2. If the first phase is the entire population, the imputed estimator of the total is

Ŷ_Ir = Σ_{t∈U} x_t β̂ + N ē_r,   (19.35)

where ē_r is the mean of the imputed regression deviations.
Under random-with-replacement imputation, assuming the original sample is a single-phase nonreplacement sample with π_t = nN^{−1}, and ignoring the variance of β̂, the variance of Ŷ_Ir in (19.35) includes a term (N − n)σ²_e that is introduced by the random imputation (19.36). With random imputation that restricts the sample sum of the imputed deviations to be (nearly) equal to zero, the imputation variance term is (nearly) removed from (19.36).
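The donor-selection rule in (19.31) can be sketched as a within-cell hot deck; only the selection step is shown, not the full estimator, and the values and probabilities below are invented.

```python
import random

random.seed(3)

# Random hot-deck imputation within one cell.  Donors are drawn with
# probability proportional to pi1^{-1} (pi2|s1^{-1} - 1), the unnormalized
# form of B_t in (19.31).  Records: (y or None if missing, pi1, pi2_given_s1).
cell = [
    (10.0, 0.10, 0.50), (14.0, 0.20, 0.25), (11.0, 0.10, 0.50),
    (None, 0.10, 0.50), (None, 0.20, 0.25),
]

donors = [(y, p1, p2) for y, p1, p2 in cell if y is not None]
b = [(1.0 / p1) * (1.0 / p2 - 1.0) for _, p1, p2 in donors]

imputed = []
for y, p1, p2 in cell:
    if y is None:
        # random.choices normalizes the weights, so the constant in
        # (19.31) need not be computed explicitly.
        y = random.choices([d[0] for d in donors], weights=b, k=1)[0]
    imputed.append(y)
print(imputed)
```

Drawing donors with these probabilities, rather than uniformly, is what preserves design consistency when units within the cell were selected with unequal probability.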
Under the described imputation procedure, the estimator of the total for the small area D_u is

Ŷ_{D_u} = Σ_{t∈D_u∩s_1} π_{1t}^{−1} (x_t β̂ + ê_rt).   (19.38)

Under the regression model with simple random sampling, the variance expression (19.36) holds for the small area with (N_u, n_u) replacing (N, n), where N_u and n_u are the population and sample numbers of elements, respectively, in small area D_u. Note that the variance expression remains valid for n_u = 0.
Expression (19.35) for the grand total defines an estimator that is randomization valid and does not require the model assumptions associated with (19.25) and (19.26). That is, for simple random sampling we can write