Table 7.2 Rankings of test performance, based on Type I error and power. [table values not reproduced in this extract]
(a) For L > 30, to avoid error rate inflation for small L.
and second, Servy, Hachuel and Wojdyla (1998) included the Morel modified statistic, which was not considered by Thomas, Singh and Roberts (1996), but an important point on which both studies agreed is that the log-linear Bonferroni procedure, Bf(LL), is the most powerful procedure overall. Thomas, Singh and Roberts (1996) also noted that Bf(LL) provides the highest power and most consistent control of Type I error over tables of varying size (3 × 3, 3 × 4 and 4 × 4).
7.3.3 Discussion and final recommendations
The benefits of Bonferroni procedures in the analysis of categorical data from complex surveys were noted by Thomas (1989), who showed that Bonferroni simultaneous confidence intervals for population proportions, coupled with log or logit transformations, provide better coverage properties than competing procedures. This is consistent with the preeminence of Bf(LL) for tests of independence. Nevertheless, it is important to bear in mind the caveat that the log-linear Bonferroni procedure is not invariant to the choice of basis for the interaction terms in the log-linear model.

… rated in both studies with respect to power and Type I error control. However, some practitioners might be reluctant to use this procedure because of the difficulty of selecting the value of ε. For example, Thomas, Singh and Roberts (1996) recommend ε = 0.05, while Servy, Hachuel and Wojdyla (1998) recommend ε = 0.1. Similar comments apply to the adjusted eigenvalue procedure.
If the uncertainties of the above procedures are to be avoided, the choice of test comes down to the Rao-Scott family, Fay's jackknifed tests, or the F-based Wald tests, with the preferred choice depending in part on the degree of variation among design effects. There is a choice of Rao-Scott procedures available whatever the variation among design effects. If full survey information is not available, then the first-order Rao-Scott tests might be the only option. Fay's jackknife procedures are viable alternatives when full survey information is available, provided the number of clusters is not small. These jackknifed tests are natural procedures to choose when survey variance estimation is based on a replication strategy. Finally, FX², based on the log-linear representation of the hypothesis, provides adequate control and relatively high power provided that the variation in design effects is not extreme. It should be noted that in both studies, all procedures derived from the F-based Wald test exhibited low power for small numbers of clusters, so some caution is required if these procedures are to be used.
7.4 ANALYSIS OF DOMAIN RESPONSE PROPORTIONS

Logistic regression models are commonly used to analyze the variation of subpopulation (or domain) proportions associated with a binary response variable. Suppose that the population of interest consists of I domains corresponding to the levels of one or more factors. Let $\hat{N}_i$ and $\hat{N}_{1i}$ ($i = 1, \ldots, I$) be survey estimates of the domain size $N_i$ and of the domain total $N_{1i}$ for the response category of interest. The estimator of the domain response proportion $\mu_{1i} = N_{1i}/N_i$ is denoted by $\hat{\mu}_{1i} = \hat{N}_{1i}/\hat{N}_i$. The asymptotic covariance matrix of $\hat{\boldsymbol{\mu}}_1 = (\hat{\mu}_{11}, \ldots, \hat{\mu}_{1I})'$, denoted $\Sigma_1$, is consistently estimated by $\hat{\Sigma}_1$.

The logistic regression model relates the domain proportions to an m-vector of covariates $\mathbf{x}_i$:

$$\log[\mu_{1i}/(1 - \mu_{1i})] = \mathbf{x}_i' \boldsymbol{\theta}. \qquad (7.32)$$

The pseudo-MLE $\hat{\boldsymbol{\theta}}$ is obtained by solving the estimating equations specified by Roberts, Rao and Kumar (1987), namely

$$\sum_{i=1}^{I} \hat{p}_i\, \mathbf{x}_i\, [\hat{\mu}_{1i} - \mu_{1i}(\boldsymbol{\theta})] = \mathbf{0}, \qquad (7.33)$$

where $\hat{p}_i$ is the estimated relative size of the ith domain. Equations (7.33) are obtained from the likelihood equations under independent binomial sampling by replacing $n_i/n$ by $\hat{p}_i$ and $n_{1i}/n_i$ by the ratio estimator $\hat{\mu}_{1i}$. A Pearson statistic for testing the goodness of fit of the model (7.32) is given by
$$X^2_{P1} = n \sum_{i=1}^{I} \hat{p}_i\, [\hat{\mu}_{1i} - \mu_{1i}(\hat{\boldsymbol{\theta}})]^2 / \{\mu_{1i}(\hat{\boldsymbol{\theta}})[1 - \mu_{1i}(\hat{\boldsymbol{\theta}})]\} \qquad (7.34)$$

and a statistic corresponding to the likelihood ratio is given by

$$X^2_{LR1} = 2n \sum_{i=1}^{I} \hat{p}_i \left\{ \hat{\mu}_{1i} \log\frac{\hat{\mu}_{1i}}{\mu_{1i}(\hat{\boldsymbol{\theta}})} + (1 - \hat{\mu}_{1i}) \log\frac{1 - \hat{\mu}_{1i}}{1 - \mu_{1i}(\hat{\boldsymbol{\theta}})} \right\}. \qquad (7.35)$$

Asymptotically, each statistic is distributed as a weighted sum $\sum_i \delta_{1i} W_i$ of independent $\chi^2_1$ variables $W_i$, where the weights $\delta_{1i}$, $i = 1, \ldots, I - m$, are eigenvalues of a generalized design effects matrix

$$D_1 = (\tilde{Z}_1' \Omega_1 \tilde{Z}_1)^{-1} (\tilde{Z}_1' \Sigma_1 \tilde{Z}_1), \qquad (7.36)$$

where $\Omega_1$ is the covariance matrix of $\hat{\boldsymbol{\mu}}_1$ under independent binomial sampling, with diagonal elements $a_i$, $\mathbf{a} = (a_1, \ldots, a_I)'$. First-order Rao-Scott corrected statistics $X^2_{P1}(\hat{\delta}_{1\cdot})$ and $X^2_{LR1}(\hat{\delta}_{1\cdot})$ are obtained by dividing $X^2_{P1}$ and $X^2_{LR1}$ by the mean $\hat{\delta}_{1\cdot}$ of the estimated eigenvalues. Second-order corrections divide further by $(1 + \hat{a}_1^2)$, where $\hat{a}_1$ is the coefficient of variation of the estimated eigenvalues; for example,

$$X^2_{LR1}(\hat{\delta}_{1\cdot}, \hat{a}_1) = \frac{X^2_{LR1}(\hat{\delta}_{1\cdot})}{1 + \hat{a}_1^2}. \qquad (7.40)$$

When only the estimated domain design effects $\hat{D}_{1i}$ are available, $\hat{\delta}^*_{1\cdot} = [I/(I - m)] \hat{D}_{1\cdot}$, where $\hat{D}_{1\cdot} = I^{-1} \sum_i \hat{D}_{1i}$, provides an upper bound on $\hat{\delta}_{1\cdot}$, so the modified first-order corrections $X^2_{P1}(\hat{\delta}^*_{1\cdot})$ and $X^2_{LR1}(\hat{\delta}^*_{1\cdot})$ based on this bound give conservative tests.
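As an illustration of the pseudo-MLE computation, the following sketch solves estimating equations of the form (7.33) by Newton-Raphson and evaluates a Pearson statistic of the form (7.34). This is our own minimal illustration, not the authors' code; the function names are ours, and the survey estimates (the relative domain sizes and domain response proportions) are assumed to be given.

```python
import numpy as np

def domain_logistic_pmle(X, p_hat, mu1_hat, tol=1e-10, max_iter=50):
    """Pseudo-MLE for the domain logistic model: solves
       sum_i p_hat[i] * x_i * (mu1_hat[i] - mu1_i(theta)) = 0
    (the form of equations (7.33)) by Newton-Raphson."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))           # mu1_i(theta)
        score = X.T @ (p_hat * (mu1_hat - mu))          # estimating equations
        info = X.T @ ((p_hat * mu * (1.0 - mu))[:, None] * X)
        step = np.linalg.solve(info, score)
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta

def pearson_x2(X, p_hat, mu1_hat, theta, n):
    """Pearson goodness-of-fit statistic of the form (7.34)."""
    mu = 1.0 / (1.0 + np.exp(-X @ theta))
    return n * np.sum(p_hat * (mu1_hat - mu) ** 2 / (mu * (1.0 - mu)))
```

If the estimated domain proportions satisfy the model exactly, the fitted statistic is zero; otherwise it measures lack of fit, and (before Rao-Scott correction) its null distribution reflects the survey design rather than a simple chi-squared law.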
Roberts, Rao and Kumar (1987) also developed first-order and second-order corrections for nested hypotheses as well as model diagnostics to detect any outlying domain response proportions and influential points in the factor space, taking account of the design features. They also obtained a linearization variance estimator.
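Once estimates of the generalized design effects (the eigenvalues discussed above) are available, the first- and second-order Rao-Scott corrections reduce to simple rescalings. A minimal sketch (function name ours):

```python
import numpy as np

def rao_scott_corrected(x2, deltas):
    """First- and second-order Rao-Scott corrections of a chi-squared
    statistic, given estimated generalized design effects (eigenvalues).
    Returns (first_order, second_order, a_squared): the first-order
    statistic divides by the mean eigenvalue; the second-order statistic
    divides further by (1 + a^2), where a is the coefficient of
    variation of the eigenvalues."""
    deltas = np.asarray(deltas, dtype=float)
    delta_bar = deltas.mean()
    a2 = deltas.var() / delta_bar ** 2     # squared CV of the eigenvalues
    first = x2 / delta_bar
    second = first / (1.0 + a2)
    return first, second, a2
```

When the eigenvalues are all equal (no variation in design effects), the second-order correction coincides with the first-order one.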
Example 2
Roberts, Rao and Kumar (1987) and Rao, Kumar and Roberts (1989) applied the previous methods to data from the October 1980 Canadian Labour Force Survey. The sample consisted of males aged 15-64 who were in the labour force and not full-time students. Two factors, age and education, were chosen to explain the variation in employment rates via logistic regression models. Age group levels were formed by dividing the interval [15, 64] into 10 groups, with the jth age group being the interval [10 + 5j, 14 + 5j] for j = 1, …, 10. Education levels were formed by assigning to each person a value based on the median years of schooling, resulting in the following six levels: 7, 10, 12, 13, 14 and 16. The resulting age by education cross-classification provided a two-way table of I = 60 estimated cell proportions or employment rates, $\hat{\mu}_{1jk}$, j = 1, …, 10; k = 1, …, 6.
A logistic regression model involving linear and quadratic age effects and alinear education effect provided a good fit to the two-way table of estimatedemployment rates, namely:
$$\log[\hat{\mu}_{1jk}/(1 - \hat{\mu}_{1jk})] = -3.10 + 0.211 A_j - 0.00218 A_j^2 + 0.151 E_k.$$

The following values were obtained for testing the goodness of fit of the above model [the numerical values are not reproduced in this extract]. Without a correction, the model with a linear education effect and linear and quadratic age effects could be rejected. On the other hand, the model is not rejected by the second-order corrected statistics $X^2_{P1}(\hat{\delta}_{1\cdot}, \hat{a}_1)$ or $X^2_{LR1}(\hat{\delta}_{1\cdot}, \hat{a}_1)$ when referred to the upper 5% point of the reference chi-squared distribution. The modified first-order correction, $X^2_{P1}(\hat{\delta}^*_{1\cdot})$ or $X^2_{LR1}(\hat{\delta}^*_{1\cdot})$, which requires only the estimated domain design effects, is also not significant, and is close to the first-order correction.
Example 3
Fisher (1994) also applied the previous methods to investigate whether the use of personal computers during interviewing by telephone (treatment) versus in-person on-paper interviewing (control) has an effect on labour force estimates. For this purpose, he used split panel data from the US Current Population Survey (CPS) obtained by randomly splitting the CPS sample into two panels and then administering the 'treatment' to respondents in one of the panels and the 'control' to those in the other panel. Three other binary factors, sex, race and ethnicity, were also included. A logistic regression model containing only the main effects of the four factors fitted the four-way table of estimated unemployment rates well. A test of the nested hypothesis that the
'treatment' main effect was absent given the model containing all four main effects was rejected, suggesting that the use of a computer during interviewing does have an effect on labour force estimates.
Rao, Kumar and Roberts (1989) studied several extensions of logistic regression. They extended the previous results to Box-Cox-type models involving power transformations of domain odds ratios, and illustrated their use on data from the Canadian Labour Force Survey. The Box-Cox approach would be useful in those cases where it could lead to additive models on the transformed scale while the logistic regression model would not provide as good a fit without interaction terms. Methods for testing equality of parameters in two logistic regression models, corresponding to consecutive time periods, were also developed and applied to data from the Canadian Labour Force Survey. Finally, they studied a general class of polytomous response models and developed Rao-Scott adjusted Pearson and likelihood ratio tests, which they applied to data from the Canada Health Survey (1978-9).
In this section we have discussed methods for analyzing domain response proportions. We turn next to logistic regression analysis of unit-specific sample data.
7.5 LOGISTIC REGRESSION WITH A BINARY RESPONSE VARIABLE

Suppose we observe a vector of explanatory variables, $\mathbf{x}_t$, and a binary response variable, $y_t$, associated with the tth population unit, $t = 1, \ldots, N$. We assume that for a given $\mathbf{x}_t$, $y_t$ is generated from a model with mean $E(y_t) = \mu_t(\boldsymbol{\theta}) = g(\mathbf{x}_t, \boldsymbol{\theta})$ and 'working' variance $\mathrm{var}(y_t) = v_{0t} = v_0(\mu_t)$. Specifically, we assume a logistic regression model so that

$$\log[\mu_t(\boldsymbol{\theta})/(1 - \mu_t(\boldsymbol{\theta}))] = \mathbf{x}_t' \boldsymbol{\theta} \qquad (7.41)$$

and $v_0(\mu_t) = \mu_t(1 - \mu_t)$. Our interest is in estimating the parameter vector $\boldsymbol{\theta}$ and in testing hypotheses about its components.
Following Binder (1983), a linearization estimator of the covariance matrix $V(\hat{\boldsymbol{\theta}})$ of $\hat{\boldsymbol{\theta}}$ is given by

$$\hat{V}_L(\hat{\boldsymbol{\theta}}) = J(\hat{\boldsymbol{\theta}})^{-1}\, \hat{V}(\hat{T})\, J(\hat{\boldsymbol{\theta}})^{-1}, \qquad (7.44)$$

where $J(\hat{\boldsymbol{\theta}}) = -\sum_{t \in s} w_t\, \partial u_t(\boldsymbol{\theta})/\partial \boldsymbol{\theta}'$ is the observed information matrix and $\hat{V}(\hat{T})$ is the estimated covariance matrix of the estimated total of the weighted score contributions, which should take account of post-stratification and other adjustments. It is straightforward to obtain a resampling estimator of $V(\hat{\boldsymbol{\theta}})$ that takes account of post-stratification adjustment. For stratified multi-stage sampling, a jackknife estimator of $V(\hat{\boldsymbol{\theta}})$ is given by

$$\hat{V}_J(\hat{\boldsymbol{\theta}}) = \sum_j \frac{m_j - 1}{m_j} \sum_k \left( \hat{\boldsymbol{\theta}}^{(jk)} - \hat{\boldsymbol{\theta}} \right) \left( \hat{\boldsymbol{\theta}}^{(jk)} - \hat{\boldsymbol{\theta}} \right)', \qquad (7.45)$$

where $\hat{\boldsymbol{\theta}}^{(jk)}$ is computed as $\hat{\boldsymbol{\theta}}$ when the data of the (jk)th sample cluster are deleted, but using jackknife survey weights $w_t^{(jk)}$ (see Rao, Scott and Skinner, 1998). Bull and Pederson (1987) extended (7.44) to the case of a polytomous response variable, but without allowing for post-stratification adjustment as in Binder (1983). Again it is straightforward to define a jackknife variance estimator for this case.
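As a rough sketch of the linearization approach, the following computes a sandwich estimator for a survey-weighted logistic fit under simplifying assumptions of our own: a single stratum, clusters treated as sampled with replacement, and no post-stratification adjustment. Function and variable names are ours.

```python
import numpy as np

def survey_logistic_sandwich(X, y, w, cluster, theta_hat):
    """Linearization (sandwich) variance estimator in the spirit of
    Binder (1983): J^{-1} V(T) J^{-1}, where J is the weighted observed
    information and V(T) is estimated from cluster totals of the
    weighted score contributions (single stratum, with-replacement
    approximation)."""
    mu = 1.0 / (1.0 + np.exp(-X @ theta_hat))
    u = w[:, None] * X * (y - mu)[:, None]            # weighted scores
    J = X.T @ ((w * mu * (1.0 - mu))[:, None] * X)    # information matrix
    labels = np.unique(cluster)
    # cluster totals of the weighted score contributions
    Z = np.vstack([u[cluster == g].sum(axis=0) for g in labels])
    m = len(labels)
    Zc = Z - Z.mean(axis=0)
    V_T = (m / (m - 1.0)) * (Zc.T @ Zc)               # V-hat of the total
    Jinv = np.linalg.inv(J)
    return Jinv @ V_T @ Jinv
```

A replication (jackknife) estimator along the lines of (7.45) would instead refit the model with each cluster deleted in turn, using the appropriate jackknife weights.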
Suppose now that $\boldsymbol{\theta}$ is partitioned as $\boldsymbol{\theta} = (\boldsymbol{\theta}_1', \boldsymbol{\theta}_2')'$, where $\boldsymbol{\theta}_1$ is $r_1 \times 1$ and $\boldsymbol{\theta}_2$ is $r_2 \times 1$, and that we wish to test the nested hypothesis $H_0$: $\boldsymbol{\theta}_2 = \boldsymbol{\theta}_{20}$. Then the Wald statistic may be used, but it is not invariant to non-linear transformations of $\boldsymbol{\theta}$. Further, one has to fit the full model (7.41) before testing $H_0$, which is unattractive when the model contains a large number, r, of parameters. This would be the case with a factorial structure of the covariates. Rao, Scott and Skinner (1998) proposed quasi-score tests to circumvent the problems associated with Wald tests. These tests are invariant to non-linear transformations and require fitting only the null model, a considerable advantage if the dimension of $\boldsymbol{\theta}$ is large, as noted above.

Let $\tilde{\boldsymbol{\theta}} = (\tilde{\boldsymbol{\theta}}_1', \boldsymbol{\theta}_{20}')'$ be the solution of $\hat{T}_1(\tilde{\boldsymbol{\theta}}) = \mathbf{0}$, where $\hat{T}(\boldsymbol{\theta}) = [\hat{T}_1(\boldsymbol{\theta})', \hat{T}_2(\boldsymbol{\theta})']'$ is the partition of the estimating function corresponding to that of $\boldsymbol{\theta}$. The quasi-score test is based on the statistic

$$X^2_{QS} = \hat{T}_2(\tilde{\boldsymbol{\theta}})'\, \hat{V}(\tilde{T}_2)^{-1}\, \hat{T}_2(\tilde{\boldsymbol{\theta}}),$$

where $\hat{V}(\tilde{T}_2)$ is a linearization or jackknife estimator of the covariance matrix of $\tilde{T}_2 = \hat{T}_2(\tilde{\boldsymbol{\theta}})$.
The linearization estimator, $\hat{V}_L(\tilde{T}_2)$, is the estimated covariance matrix of the estimated total $\hat{T}_2(\boldsymbol{\theta})$ evaluated at $\boldsymbol{\theta} = (\tilde{\boldsymbol{\theta}}_1', \boldsymbol{\theta}_{20}')'$. Again, $\hat{V}_L(\hat{T}_2)$ should take account of post-stratification and other adjustments. The jackknife estimator $\hat{V}_J(\hat{T}_2)$ is similar to (7.45), with $\hat{\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\theta}}^{(jk)}$ changed to $\hat{T}_2$ and $\hat{T}_2^{(jk)}$, where $\hat{T}_2^{(jk)}$ is computed using the jackknife survey weights. Under $H_0$, $X^2_{QS}$ is asymptotically distributed as $\chi^2_{r_2}$, so that $X^2_{QS}$ provides a valid test of $H_0$.

When the degrees of freedom available for estimating $V(\hat{\boldsymbol{\theta}}_2)$ or $V(\hat{T}_2)$ are not large, the tests become unstable. Rao, Scott and Skinner (1998) proposed alternative tests, including an F-version of the Wald test (see also Morel, 1989), and Rao-Scott corrections to naive Wald or score tests analogous to those of Section 7.2, as well as Bonferroni tests. We refer the reader to Rao, Scott and Skinner (1998) for details.

Note that the methods of this section permit only tests of nested hypotheses on $\boldsymbol{\theta}$ given the model (7.41), unlike the case of domain proportions, which permits testing of model fit as well as nested hypotheses given the model.
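A jackknife version of the quasi-score test might be sketched as follows. This is our own simplified illustration, not the procedure of Rao, Scott and Skinner (1998): it assumes a single stratum with with-replacement clusters, $H_0$: $\theta_2 = 0$, and re-solves the null fit on each delete-one-cluster replicate.

```python
import numpy as np

def fit_null(X1, y, w, n_iter=100, tol=1e-10):
    """Weighted logistic fit of theta_1 with theta_2 held at its null
    value (here 0, so no offset term is needed)."""
    th = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X1 @ th))
        score = X1.T @ (w * (y - mu))
        info = X1.T @ ((w * mu * (1.0 - mu))[:, None] * X1)
        step = np.linalg.solve(info, score)
        th += step
        if np.max(np.abs(step)) < tol:
            break
    return th

def quasi_score_test(X1, X2, y, w, cluster):
    """Jackknife quasi-score test of H0: theta_2 = 0 in a logistic model.
    T2 is the weighted score for theta_2 at the null fit; its covariance
    is estimated by the delete-one-cluster jackknife.  Refer the result
    to chi-squared with r2 = X2.shape[1] degrees of freedom."""
    def T2_at(wts):
        th1 = fit_null(X1, y, wts)
        mu = 1.0 / (1.0 + np.exp(-(X1 @ th1)))
        return X2.T @ (wts * (y - mu))

    labels = np.unique(cluster)
    m = len(labels)
    T2 = T2_at(w)
    reps = np.vstack([T2_at(w * np.where(cluster == g, 0.0, m / (m - 1.0)))
                      for g in labels])
    d = reps - T2
    V = ((m - 1.0) / m) * (d.T @ d)      # jackknife covariance of T2
    return float(T2 @ np.linalg.solve(V, T2))
```

Because each replicate only refits the null model, the full model never needs to be fitted, which is the practical advantage of quasi-score tests noted in the text.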
7.6 SOME EXTENSIONS AND APPLICATIONS

In this section we briefly mention some recent applications and extensions of Rao-Scott and related methodology.
… features were obtained. A model-free approach using two-phase sampling was also developed. In two-phase sampling, error-prone measurements are made on a large first-phase sample selected according to a specified design, and error-free measurements are then made on a smaller subsample selected according to another specified design (typically SRS or stratified SRS). Rao-Scott corrected Pearson statistics were proposed under double sampling for both the goodness-of-fit test and the tests of independence in a two-way table. Rao and Thomas (1991) also extended Assakul and Proctor's (1967) method of testing of independence in a two-way table with known misclassification probabilities to general survey designs. They developed Rao-Scott corrected tests using a general methodology that can also be used for testing the fit of log-linear models on multi-way contingency tables. More recently, Skinner (1998) extended the methods of Section 7.4 to the case of longitudinal survey data subject to classification errors.
7.6.2 Biostatistical applications
Cluster-correlated binary response data often occur in biostatistical applications; for example, toxicological experiments designed to assess the teratogenic effects of chemicals often involve animal litters as experimental units. Several methods that take account of intracluster correlations have been proposed, but most of these methods assume specific models for the intracluster correlation, e.g., the beta-binomial model. Rao and Scott (1992) developed a simple method, based on conceptual design effects and effective sample size, that can be applied to problems involving independent groups of clustered binary data with group-specific covariates. It assumes no specific models for the intracluster correlation, in the spirit of Zeger and Liang (1986). The method can be readily implemented using any standard computer program for the analysis of independent binary data after a small amount of pre-processing.

Let $n_{ij}$ denote the number of units in the jth cluster of the ith group, $i = 1, \ldots, I$, and let $y_{ij}$ denote the number of 'successes' among the $n_{ij}$ units, with $\sum_j y_{ij} = y_i$ and $\sum_j n_{ij} = n_i$. Treating $y_i$ as independent binomial $B(n_i, p_i)$ would ignore the clustering, where $p_i$ is the success probability in the ith group. Denoting the design effect of the ith estimated proportion, $\hat{p}_i = y_i/n_i$, by $D_i$, and the effective sample size by $\tilde{n}_i = n_i/D_i$, the method replaces $y_i$ and $n_i$ by the 'effective' counts $\tilde{y}_i = y_i/D_i$ and $\tilde{n}_i$ and applies standard binomial methods to the transformed data. The method has been applied to a variety of biostatistical problems; in particular, testing homogeneity of proportions, estimating dose response models and testing for trends in proportions, computing the Mantel-Haenszel chi-squared test statistic for independence in a series of 2 × 2 tables, and estimating the common odds ratio and its variance when the independence hypothesis is rejected. Obuchowski (1998) extended the method to comparisons of correlated proportions. Rao and Scott (1999a) proposed a simple method for analyzing grouped count data exhibiting overdispersion relative to a Poisson model; this method is similar to the previous method for clustered binary data.
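The effective sample size device can be sketched as follows for one of the applications listed above, the test of homogeneity of proportions. This is a minimal illustration of our own; the design effects are assumed to have been estimated already.

```python
import numpy as np

def homogeneity_x2_effective(y, n, deff):
    """Homogeneity test for I group proportions from clustered binary
    data via the effective sample size idea: divide the successes and
    totals in each group by the group's estimated design effect and
    apply the ordinary Pearson chi-squared statistic.  Refer the result
    to chi-squared with I - 1 degrees of freedom."""
    y_eff = np.asarray(y, float) / np.asarray(deff, float)
    n_eff = np.asarray(n, float) / np.asarray(deff, float)
    p = y_eff.sum() / n_eff.sum()          # pooled 'effective' proportion
    return np.sum((y_eff - n_eff * p) ** 2 / (n_eff * p * (1.0 - p)))
```

With all design effects equal to 1 this reduces to the classical Pearson statistic; a common design effect of D simply divides the naive statistic by D, which is the pre-processing referred to in the text.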
7.6.3 Multiple response tables
In marketing research surveys, individuals are often presented with questions that allow them to respond to more than one of the items on a list, i.e., multiple responses may be observed from a single respondent. Standard methods cannot be applied to tables of aggregated multiple response data because of the multiple response effect, similar to a clustering effect, in which each independent respondent plays the role of a cluster. Decady and Thomas (1999, 2000) adapted the first-order Rao-Scott procedure to such data and showed that it leads to simple approximate methods based on easily computed, adjusted chi-squared tests of simple goodness of fit and homogeneity of response probabilities. These adjusted tests can be calculated from the table of aggregate multiple response counts alone, i.e., they do not require knowledge of the correlations between the aggregate multiple response counts. This is not true in general; for example, the test of equality of proportions will require either the full dataset or an expanded table of counts in which each of the multiple response items is treated as a binary variable. Nevertheless, the first-order Rao-Scott approach is still considerably easier to apply than the bootstrap approach recently proposed by Loughin and Scherer (1998).
8 FITTING REGRESSION MODELS IN CASE-CONTROL STUDIES

… of the study and all of these are included. In addition, a sample of controls was drawn from the remaining children in the study population by a complex multi-stage design. At the first stage, a sample of 300 census mesh blocks (each containing roughly 50 households) was drawn with probability proportional to the number of houses in the block. Then a systematic sample of 20 households was selected from each chosen mesh block, and children from these households were selected for the study with varying probabilities that depend on age and ethnicity as in Table 8.1. These probabilities were chosen to match the expected frequencies among the cases. Cluster sample sizes varied from one to six and a total of approximately 250 controls was achieved. This corresponds to a sampling fraction of about 1 in 400, so that cases are sampled at a rate that is 400 times that for controls.
Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner.
Copyright © 2003 John Wiley & Sons, Ltd.
ISBN: 0-471-89987-9
Table 8.1 Selection probabilities. [column headings: Maori, Pacific Islander, Other; the probability values are not reproduced in this extract]
Complex sampling may also be used in the selection of cases. For example, we have recently helped with the analysis of a study in which cases (patients who had suffered a mild stroke) were selected through a two-stage design with doctors' practices as primary sampling units. However, this is much less common than with controls.

As we said at the outset, these studies are a very special sort of survey, but they share the two key features that make the analysis of survey data distinctive. The first feature is the lack of independence. In our example, we would expect substantial intracluster correlation because of unmeasured socio-economic variables, factors affecting living conditions (e.g. mould on walls of the house), environmental exposures and so on. Ignoring this will lead to standard errors that are too small and confidence intervals that are too narrow. The other distinctive feature is the use of unequal selection probabilities. In case-control studies the selection probabilities can be extremely unbalanced and they are based directly on the values of the response variable, so that we have informative sampling at its most extreme.
In recent years, efficient semi-parametric procedures have been developed for handling the variable selection probabilities (see Scott and Wild (2001) for a survey of recent work). 'Semi-parametric' here means that a parametric model is specified for the response as a function of potential explanatory variables, but that the joint distribution of the explanatory variables is left completely free. This is important because there are usually many potential explanatory variables (more than 100 in some of the studies in which we are involved) and it would be impossible to model their joint behaviour, which is of little interest in its own right. However, the problem of clustering is almost completely ignored, in spite of the fact that a large number of studies use multi-stage sampling. The paper by Graubard, Fears and Gail (1989) is one of the few that even discuss the problem. Most analyses simply ignore the problem and use a program designed for simple (or perhaps stratified) random sampling of cases and controls. This chapter is an attempt to remedy this neglect.
In the next section we give a summary of standard results for simple case-control studies, where cases and controls are each selected by simple random sampling or by stratified random sampling. In Section 8.3, we extend these results to cover cases when the controls (and possibly the cases) are selected using an arbitrary probability sampling design. We investigate the properties of these methods through simulation studies in Section 8.4. The robustness of the methods in situations when the assumed model does not hold exactly is explored in Section 8.5, and some possible alternative approaches are sketched in the final section.
8.2 SIMPLE CASE-CONTROL STUDIES

We shall take it for granted throughout this chapter that our purpose is to make inferences about the parameters of a superpopulation model, since interest is centred in the underlying process that produces the cases and not on the composition of a particular finite population at a particular point in time. Suppose that the finite population consists of values $\{(\mathbf{x}_t, y_t),\ t = 1, \ldots, N\}$, where $y_t$ is a binary response variable and $\mathbf{x}_t$ a vector of explanatory variables; assumptions of this kind underlie all standard methods for the analysis of case-control data from population-based studies. For simplicity, we will work with the logistic regression model

$$p_1(\mathbf{x}; \boldsymbol{\beta}) = \Pr(Y = 1 \mid \mathbf{x}) = \frac{e^{\mathbf{x}'\boldsymbol{\beta}}}{1 + e^{\mathbf{x}'\boldsymbol{\beta}}},$$

which we often shorten to $p_1(\mathbf{x})$. We also set $p_0(\mathbf{x}) = 1 - p_1(\mathbf{x})$.
If we had data on the whole population, that is what we would analyse, treating it as a random sample from the process that produces cases and controls. The finite population provides score (first derivative of the log-likelihood) equations

$$\sum_{t:\, y_t = 1} \mathbf{x}_t\, p_0(\mathbf{x}_t) - \sum_{t:\, y_t = 0} \mathbf{x}_t\, p_1(\mathbf{x}_t) = \mathbf{0}, \qquad (8.1)$$

where the first sum is over cases and the second over controls. As $N \to \infty$, Equations (8.1) converge to

$$W_1 E_1[\mathbf{X} p_0(\mathbf{X})] - W_0 E_0[\mathbf{X} p_1(\mathbf{X})] = \mathbf{0}, \qquad (8.2)$$

where $W_i = \Pr(Y = i)$ and $E_i$ denotes expectation with respect to the distribution of X conditional on Y = i (i = 0, 1). Under the model, Equations (8.2) have solution β.

Standard logistic regression applies if our data comes from a simple random sample of size n from the finite population. For rare events, as is typical in biostatistics and epidemiology, enormous efficiency gains can be obtained by sampling separately from each stratum defined by Y = i (i = 0, 1), with $n_1 \approx n_0$.
Since the estimating equations (8.2) simply involve population means for the case and control subpopulations, they can be estimated directly from case-control data using the corresponding sample means. This leads to a design-weighted estimator of β, provided that $W_1$ and $W_0$ can be estimated consistently (by the corresponding population proportions, for example). More efficient estimates, however, are obtained using the semi-parametric maximum likelihood estimator, which is computed by ignoring the sampling scheme (see Prentice and Pyke, 1979) and solving the prospective score equations (i.e. those that would be appropriate if we had a simple random sample from the whole population). In this case the solution converges to $\mathbf{b}^* = \boldsymbol{\beta} + k \mathbf{e}_1$, where $k = \log[(n_1 W_0)/(n_0 W_1)]$ and $\mathbf{e}_1 = (1, 0, \ldots, 0)'$. We see that only the intercept term is affected by the case-control sampling. The intercept term can be corrected simply by using k as an offset in the model, but if we are only interested in the coefficients of the risk factors, we do not even need to know the relative stratum weights. More generally, Scott and Wild showed that the equations (8.6), obtained from (8.2) by replacing the population proportions $W_i$ with arbitrary positive weights $\lambda_i$, have the unique solution $\mathbf{b}^* = \boldsymbol{\beta} + k_\lambda \mathbf{e}_1$ with $k_\lambda = \log[\lambda_1 W_0/(\lambda_0 W_1)]$, provided that the model contains a constant term (i.e. the first component of x is 1). This can be seen directly by expanding (8.6). Suppose, for simplicity, that X is continuous with joint density function f(x). Then, noting that the conditional density of X given that Y = i is $f(\mathbf{x} \mid Y = i) = p_i(\mathbf{x}; \boldsymbol{\beta}) f(\mathbf{x})/W_i$, (8.6) is equivalent to

$$\int \mathbf{x}\, \frac{e^{\mathbf{x}'\boldsymbol{\beta} + k_\lambda} f(\mathbf{x})}{(1 + e^{\mathbf{x}'\boldsymbol{\beta}})(1 + e^{\mathbf{x}'\mathbf{b}})}\, d\mathbf{x} = \int \mathbf{x}\, \frac{e^{\mathbf{x}'\mathbf{b}} f(\mathbf{x})}{(1 + e^{\mathbf{x}'\boldsymbol{\beta}})(1 + e^{\mathbf{x}'\mathbf{b}})}\, d\mathbf{x},$$

and $\mathbf{b} = \boldsymbol{\beta} + k_\lambda \mathbf{e}_1$ makes the two integrands identical, since the first component of x is 1. The result then follows immediately.
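The intercept-shift result can be illustrated numerically. The sketch below is our own illustration, not from the chapter: it simulates a rare-outcome population, draws a case-control sample by SRS within cases and within controls, fits ordinary (prospective) logistic regression, and compares the fitted intercept against the predicted shift k.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_fit(X, y, n_iter=100, tol=1e-10):
    """Ordinary (prospective) logistic regression via Newton-Raphson."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ b))
        step = np.linalg.solve(X.T @ ((mu * (1 - mu))[:, None] * X),
                               X.T @ (y - mu))
        b += step
        if np.max(np.abs(step)) < tol:
            break
    return b

# a large population with a rare outcome: true intercept -5, slope 1
N = 200_000
x = rng.normal(size=N)
y = rng.random(N) < 1.0 / (1.0 + np.exp(-(-5.0 + 1.0 * x)))

# case-control sample: all cases plus an SRS of controls of equal size
cases = np.flatnonzero(y)
controls = rng.choice(np.flatnonzero(~y), size=cases.size, replace=False)
s = np.concatenate([cases, controls])
b = logit_fit(np.column_stack([np.ones(s.size), x[s]]), y[s].astype(float))

# predicted intercept shift: k = log[(n1/n0) * (W0/W1)]
W1 = y.mean()
k = np.log((cases.size / controls.size) * ((1 - W1) / W1))
# b[1] should be close to the true slope 1, and b[0] - k close to -5
```

The slope is estimated without bias by the prospective fit, while the raw intercept absorbs the sampling offset k, exactly as the Prentice-Pyke argument predicts.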
The efficiency gain from maximum likelihood tends to be small when there is only one explanatory variable and the ratio of the sampling rates between cases and controls is not too extreme. However, the differences become bigger as cases become rarer in the population and the sampling rates become less balanced, and also when we have more than one covariate. We have seen 50% efficiency gains in some simple case-control problems. In more complex stratified
samples, design weighting can be very inefficient indeed (see Lawless, Kalbfleisch and Wild, 1999).
Most case-control studies incorporate stratification on other variables, such as age and ethnicity as in our motivating example. This is one aspect of complex survey design that is taken into account in standard case-control methodology. A common assumption is that relative risk is constant across these strata but that absolute levels of risk vary from stratum to stratum, leading to a model of the form (8.7) for observations in the hth stratum, where x now contains dummy variables for the strata. There is no difficulty in adapting the design-weighted (pseudo-MLE, Horvitz-Thompson) approach to this problem; we simply add a dummy variable for each stratum to x and replace the sample means in (8.3) by their stratified equivalents. The maximum likelihood approach is equally simple when we have simple random sampling of cases and controls within each stratum; we simply fit the model in (8.7) as if we had a simple random sample from the whole population, ignoring the stratification. As in the unstratified case, each stratum constant is then shifted by an additive constant depending on the relative sampling fractions of cases and controls in that stratum, and if we are only interested in the coefficients of the risk factors we do not even need to know these sampling fractions.

In some studies, we want to model the stratum constants as functions of the variables in x1. For example, if the population is stratified by age, we might still want to model the effect of age by some smooth function. Again, adapting the design-weighted approach is completely straightforward and only requires the specification of the appropriate sampling weights. Extending the maximum likelihood approach to this case is considerably more difficult, however, and a fully efficient solution has only been obtained relatively recently (for details see Scott and Wild, 1997; Breslow and Holubkov, 1997).
8.3 CASE-CONTROL STUDIES WITH COMPLEX SAMPLING

Now consider situations in which controls (and possibly cases) are obtained using a complex sampling plan involving multi-stage sampling. Our only assumption is that the population is stratified into cases and controls and that samples of cases and controls are drawn independently. Note that this assumption does not hold exactly in our motivating example since a control could be drawn from the same cluster as a case. However, cases are sufficiently rare for this possibility to be ignorable in the analysis. (In fact, it occurred once in the study.) One common option is simply to ignore the sample design (see Graubard, Fears and Gail, 1989, for examples) and use a standard logistic regression program, just as if we had a simple (or stratified) case-control study. Underestimates of variance are clearly a worry with this strategy; the coverage frequencies of nominally 95% confidence intervals dropped to about 80% in
some of our simulations. In fact, estimates obtained in this way need not even be consistent. To obtain consistency, we need $(1/n_0)\sum_{\text{controls}} \mathbf{x}_t\, p_1(\mathbf{x}_t)$ to converge to $E_0[\mathbf{X} p_1(\mathbf{X})]$ (and similarly for the case term), which is true for self-weighting designs but not true in general.
The estimating equations (8.2) involve separate means for the case and control subpopulations. Both these terms can be estimated using standard survey sampling techniques for estimating means from complex samples. Variances and covariances of the estimated means, which are required for sandwich estimates of $\mathrm{Cov}(\hat{\mathbf{b}})$, can also be obtained by standard survey sampling techniques. Such analyses can be carried out routinely using specialist sampling packages such as SUDAAN or with general packages such as SAS or Stata that include a survey option. This approach is a direct generalisation of the design-weighted (Horvitz-Thompson) approach to the analysis of simple case-control data which, as we have seen, can be quite inefficient.
An alternative is to apply standard sampling methods to (8.6), with appropriate choices for $\lambda_1$ and $\lambda_0$, rather than to (8.2). This leads to an estimator, $\hat{\mathbf{b}}_\lambda$ say, that satisfies $\hat{S}_\lambda(\hat{\mathbf{b}}_\lambda) = \mathbf{0}$, and its covariance matrix can be estimated by standard sandwich arguments. This leads to an estimated covariance matrix

$$\widehat{\mathrm{Cov}}(\hat{\mathbf{b}}_\lambda) = I_\lambda^{-1}(\hat{\mathbf{b}}_\lambda)\, \widehat{\mathrm{Cov}}(\hat{S}_\lambda)\, I_\lambda^{-1}(\hat{\mathbf{b}}_\lambda),$$

where $I_\lambda(\mathbf{b}) = (\partial \hat{S}_\lambda / \partial \mathbf{b}')$ and $\widehat{\mathrm{Cov}}(\hat{S}_\lambda)$ is the standard survey estimate of $\mathrm{Cov}(\hat{S}_\lambda)$, just as in Equation (7.44). (Note that $\hat{S}_\lambda$ is just a linear combination of two estimated means.) All of this can also be carried out straightforwardly in SUDAAN or similar packages simply by specifying the appropriate weights (i.e. by scaling the case weights and control weights separately so that the sum of the case weights is proportional to $\lambda_1$ and the sum of the control weights is proportional to $\lambda_0$).

With simple random samples of cases and controls, maximum likelihood, in which we set $\lambda_i = n_i/n$, is fully efficient. With more complex schemes, using the sample proportions will no longer be fully efficient and we might expect weights based on some form of equivalent sample sizes to perform better. We have carried out some simulations to investigate this. Our limited experience so far suggests that setting $\lambda_i = n_i/n$ for i = 0, 1 leads to only a moderate loss of efficiency unless design effects are extreme.
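A hedged sketch of the weight-rescaling idea described above (our own construction; the λ values, selection probabilities and function names are assumptions, not taken from the chapter): controls are drawn with unequal probabilities, Horvitz-Thompson weights are rescaled so that the case and control weights sum to λ1 and λ0, and a weighted prospective logistic fit then estimates the slope consistently.

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_logit_fit(X, y, w, n_iter=100, tol=1e-10):
    """Weighted logistic regression (Newton-Raphson on the weighted score)."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ b))
        step = np.linalg.solve(X.T @ ((w * mu * (1 - mu))[:, None] * X),
                               X.T @ (w * (y - mu)))
        b += step
        if np.max(np.abs(step)) < tol:
            break
    return b

# population with a rare outcome and one covariate (true slope 1)
N = 150_000
x = rng.normal(size=N)
y = rng.random(N) < 1.0 / (1.0 + np.exp(-(-5.0 + 1.0 * x)))

# take all cases; sample controls with probability depending on x
cases = np.flatnonzero(y)
pr = np.where(x > 0, 0.02, 0.04)              # unequal selection probabilities
ctrl = np.flatnonzero((~y) & (rng.random(N) < pr))
s = np.concatenate([cases, ctrl])
w = np.concatenate([np.ones(cases.size),      # Horvitz-Thompson weights
                    1.0 / pr[ctrl]])

# rescale: case weights sum to lambda_1, control weights to lambda_0
lam1 = lam0 = 0.5
w[:cases.size] *= lam1 / w[:cases.size].sum()
w[cases.size:] *= lam0 / w[cases.size:].sum()

Xs = np.column_stack([np.ones(s.size), x[s]])
b_lam = weighted_logit_fit(Xs, y[s].astype(float), w)
# b_lam[1] is consistent for the true slope; only the intercept is
# shifted, by k_lambda = log[lambda_1 * W0 / (lambda_0 * W1)]
```

In practice the same computation is what a survey package performs when the case and control weights are scaled separately as described in the text; the sandwich variance step is omitted here for brevity.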
We have not said anything specific about stratification beyond the basic division into cases and controls so far in this discussion. Implicitly, we are assuming that stratification on other variables such as age or ethnicity is handled by specifying the appropriate weights when we form the sample estimates. If we wish to model the stratum constants in terms of other variables, then this is taken care
of automatically. There is another approach to fitting model (8.6), which is to ignore the different selection probabilities of units in different strata. This mimics the maximum likelihood method of the previous section more closely, but further adjustments are necessary if we wish to model the stratum constants. We explore the difference in efficiency between the two approaches in our simulations in the next section. Note that we can again implement the method simply in a package like SUDAAN by choosing the weights appropriately.
In this section, we describe the results of some of the simulations that we have done to test and compare the methods discussed previously. The simulated populations were generated from a model which produces a population proportion of approximately 1 case in every 300 individuals, a little less extreme than the 1 in 400 in the meningitis study. The number of cases in every simulation study is 200. The number of control clusters is always fixed. The number of controls in each study averages 200, but the actual number is random in some studies (when 'ethnicities' are subsampled at different rates; see later). All simulations used 1000 generated datasets.
The first row of Table 8.2 shows results for a population in which 60% of the clusters are of size 2 and the remaining 40% of size 4. We see from Table 8.2 that sample weighting gives a worthwhile increase in efficiency. It is possible that the gain in efficiency is smaller for the

Table 8.2 Relative efficiencies (cf. design weighting). [values not reproduced in this extract]
binary variable because it represents a smaller change in risk. We also investigated the effect of using a standard logistic regression program ignoring clustering. Members of the same cluster are extremely closely related here, so we would expect the coverage frequency of nominally 95% confidence intervals to drop well below 95% if the clustering was ignored. We found that coverage frequencies did indeed drop well below the nominal level when clustering was ignored. When we correct for clustering in both the sample-weighted and population-weighted variance estimates, the coverage frequencies were close to 95%.
In Section 8.3, we discussed the possible benefits of using weights that reflected 'effective sample' size in place of sample weights. We repeated the preceding simulation (using the clusters of size 2 and 4), down-weighting the controls by a factor of 3 (which is roughly equal to the average design effect).
The rows of Table 8.2 headed 'Stratified sampling of "ethnic groups"' show the results of an attempt to mimic (on a smaller scale) the stratified sampling of controls within clusters in the meningitis study. We generated our population so that 60% of the individuals belong to ethnic group 1, and 20% belong to each of groups 2 and 3. This was done in two ways. Where '(random)' has been appended, the ethnic groups have been generated independently of cluster. Where '(whole)' has been appended, all members of a cluster have the same ethnicity. All members of a sampled cluster in groups 2 and 3 were retained, while members of group 1 were retained with probability 0.33. We varied the sizes of the clusters, with either 60% twos and 40% fours, or 60% fours and 40% eights, before subsampling. The subsampling (random removal of group 1 individuals) left cluster sample sizes ranging from one to the maximum cluster size. The models now also include dummy variables for ethnic group, and we can estimate these in two ways. The columns headed 'Ethnic weighted' show results when cases have sample weights, whereas controls have weights which are the sample weights times 1.8 for group 1 or 0.6 for groups 2 or 3. This reconstitutes relative population weightings so that contrasts in ethnic coefficients are correctly estimated.
In this latter set of simulations the increase in efficiency for sample weighting over population weighting is even more striking. Reweighting to account for different sampling rates in the different ethnic groups has led to similar efficiencies. This is important, as the type of 'ethnic weighting' we have done in the control group would allow us to model group differences rather than being restricted to estimating contrasts. Coverage dropped a little, however, in the '4s and 8s' clusters with design weighting. We can interpret this as meaning that, with more clustering and additional parameters to estimate, our effective sample sizes have dropped and, with smaller sample sizes, asymptotic approximations are less accurate.
… equations. It is the remaining regression coefficients that are of primary interest, however, since they determine the relative risk of becoming a case associated with a change in the values of explanatory variables. These remaining coefficients are unaffected by the choice of weightings.
Let B denote the vector of coefficients that we would be estimating if we fitted the model to the whole population. Since all models are almost certainly misspecified to some extent, many survey statisticians (see Kish and Frankel, 1974) would suggest that a reasonable aim is to estimate B. We adopted this perspective uncritically in Scott and Wild (1986, 1989), suggesting as a consequence that, although maximum likelihood estimation was more efficient, design weighting was more robust because it alone led to consistent estimates of B in the presence of model misspecification. This has been quoted widely by both biostatisticians and survey samplers. However, a more detailed appraisal of the nature of B suggests that using sample weights may be more robust as well as being more efficient.
Table 1 of Scott and Wild (1989) showed results from fitting a linear logistic regression model when the true model was quadratic, with logit{Pr(Y = 1 | x)} quadratic in x. Any smooth curve is approximately quadratic on a fine enough scale, and a quadratic approximation should give a reasonable idea of what happens when the working logistic model is not too badly misspecified. Two population curves were investigated, with the extent of the curvature being chosen so that an analyst would fail to detect the curvature about 50% of the time. We used a standard normal (0, 1) distribution for the covariate X; the population curves are shown in Figures 8.1(a) and 8.1(b). We will refer to the former curve as the negative quadratic and the latter as the positive quadratic. Both models result in about 1 case in every 300 individuals; shifted versions that correspond to 1 case in every 400 individuals were also examined. For each population curve,
Figure 8.1 Approximations to population curve. [figure not reproduced in this extract]
we have plotted two linear approximations. The solid line corresponds to B, the 'whole-population approximation'. The dashed-line approximation corresponds to using sample weights (maximum likelihood).

The slope of the logit curve tells us about the relative risk of becoming a case associated with a change in x. Using vertical arrows, we have indicated the position at which the slope of the linear approximation agrees with the slope of the curve. The results are clearest in the more extreme situation of 1 case in 400 (cf. the meningitis study). We see that B (design weights) is telling us about