
Table 7.2 Rankings of test performance, based on Type I error and power.

(a) For L > 30, to avoid error rate inflation for small L

and second, Servy, Hachuel and Wojdyla (1998) included the Morell modified Scott statistic, $X^2$ …

Roberts (1996), but an important point on which both studies agreed is that the log-linear Bonferroni procedure, Bf(LL), is the most powerful procedure overall. Thomas, Singh and Roberts (1996) also noted that Bf(LL) provides the highest power and most consistent control of Type I error over tables of varying size ($3 \times 3$, $3 \times 4$ and $4 \times 4$).

7.3.3 Discussion and final recommendations

The benefits of Bonferroni procedures in the analysis of categorical data from complex surveys were noted by Thomas (1989), who showed that Bonferroni simultaneous confidence intervals for population proportions, coupled with log or logit transformations, provide better coverage properties than competing procedures. This is consistent with the preeminence of Bf(LL) for tests of independence. Nevertheless, it is important to bear in mind the caveat that the log-linear Bonferroni procedure is not invariant to the choice of basis for the interaction terms in the log-linear model.

rated in both studies with respect to power and Type I error control. However, some practitioners might be reluctant to use this procedure because of the difficulty of selecting the value of e. For example, Thomas, Singh and Roberts (1996) recommend e = 0.05, while Servy, Hachuel and Wojdyla (1998) recommend e = 0.1. Similar comments apply to the adjusted eigenvalue procedure,


If the uncertainties of the above procedures are to be avoided, the choice of test comes down to the Rao–Scott family, Fay's jackknifed tests, or the F-based …

on the degree of variation among design effects. There is a choice of Rao–Scott procedures available whatever the variation among design effects. If full survey information is not available, then the first-order Rao–Scott tests might be the only option. Fay's jackknife procedures are viable alternatives when full survey information is available, provided the number of clusters is not small. These jackknifed tests are natural procedures to choose when survey variance estimation is based on a replication strategy. Finally, $F_{X^2}$, based on the log-linear representation of the hypothesis, provides adequate control and relatively high power provided that the variation in design effects is not extreme. It should be noted that in both studies, all procedures derived from the F-based Wald test exhibited low power for small numbers of clusters, so some caution is required if these procedures are to be used.

7.4 ANALYSIS OF DOMAIN RESPONSE PROPORTIONS

Logistic regression models are commonly used to analyze the variation of subpopulation (or domain) proportions associated with a binary response variable. Suppose that the population of interest consists of I domains corresponding to the levels of one or more factors. Let $\hat N_i$ and $\hat N_{1i}$ ($i = 1, \ldots, I$) be

domain response proportions. The estimator of the domain response proportion $\mu_{1i} = N_{1i}/N_i$ is denoted by $\hat\mu_{1i} = \hat N_{1i}/\hat N_i$. The asymptotic covariance matrix of $\hat{\boldsymbol\mu}_1 = (\hat\mu_{11}, \ldots, \hat\mu_{1I})'$, denoted $\boldsymbol\Sigma_1$, is consistently estimated by $\hat{\boldsymbol\Sigma}_1$.

$$\log[\mu_{1i}/(1 - \mu_{1i})] = \mathbf{x}_i' \boldsymbol\theta \qquad (7.32)$$

The pseudo-MLE $\hat{\boldsymbol\theta}$ is obtained by solving the estimating equations specified by Roberts, Rao and Kumar (1987), namely …

Equations (7.33) are obtained from the likelihood equations under independent binomial sampling by replacing $n_i/n$ by $\hat p_i$ and $n_{1i}/n_i$ by the ratio estimator $\hat\mu_{1i}$,

A Pearson-type statistic for testing the goodness of fit of the model (7.32) is given by


$$X^2_{P1} = n \sum_{i=1}^{I} \hat p_i \, [\hat\mu_{1i} - \mu_{1i}(\hat{\boldsymbol\theta})]^2 \big/ \{\mu_{1i}(\hat{\boldsymbol\theta})\,[1 - \mu_{1i}(\hat{\boldsymbol\theta})]\} \qquad (7.34)$$

and a statistic corresponding to the likelihood ratio is given by

…a weighted sum of independent $\chi^2_1$ variables $W_i$, where the weights $\delta_{1i}$, $i = 1, \ldots, I - m$, are eigenvalues of a …

$$\mathbf{D}_1 = (\tilde{\mathbf{Z}}_1' \boldsymbol\Omega_1 \tilde{\mathbf{Z}}_1)^{-1} (\tilde{\mathbf{Z}}_1' \boldsymbol\Sigma_1 \tilde{\mathbf{Z}}_1), \qquad (7.36)$$

where … diagonal elements $a_i$, where $\mathbf{a} = (a_1, \ldots, a_I)'$. Under independent binomial …

$$X^2_{LR1}(\hat\delta_{1\cdot}, \hat a_1) = \frac{X^2_{LR1}(\hat\delta_{1\cdot})}{1 + \hat a_1^2} \qquad (7.40)$$

where $(1 + \hat a_1^2)$ … $\hat\delta^{*}_{1\cdot} = [I/(I - m)]\, \hat D_{1\cdot}$ is an upper bound on $\hat\delta_{1\cdot}$, given by $\hat D_{1\cdot} = I^{-1} \sum_i \hat D_{1i}$. The modified first-order corrections $X^2_{P1}(\hat\delta^{*}_{1\cdot})$ and $X^2_{LR1}(\hat\delta^{*}_{1\cdot})$ …

Roberts, Rao and Kumar (1987) also developed first-order and second-order corrections for nested hypotheses, as well as model diagnostics to detect any outlying domain response proportions and influential points in the factor space, taking account of the design features. They also obtained a linearization …
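The statistic (7.34) is straightforward to compute once the fitted domain proportions are in hand. The sketch below is illustrative only; the domain shares and proportions are made-up numbers, not data from any survey discussed here:

```python
def pearson_domain_stat(n, p_hat, mu_hat, mu_fit):
    """Design-weighted Pearson statistic of equation (7.34):
    X^2_P1 = n * sum_i p_hat_i * (mu_hat_i - mu_fit_i)^2
                 / (mu_fit_i * (1 - mu_fit_i))."""
    return n * sum(
        p * (m - f) ** 2 / (f * (1.0 - f))
        for p, m, f in zip(p_hat, mu_hat, mu_fit)
    )

# Hypothetical data for three domains.
n = 1000
p_hat = [0.5, 0.3, 0.2]        # estimated domain shares N_i / N
mu_hat = [0.62, 0.55, 0.40]    # estimated domain response proportions
mu_fit = [0.60, 0.57, 0.42]    # proportions fitted by the logistic model

x2 = pearson_domain_stat(n, p_hat, mu_hat, mu_fit)
```

The resulting value would then be corrected with the design-effect adjustments described above before being referred to a chi-squared distribution.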

102 CATEGORICAL RESPONSE DATA FROM COMPLEX SURVEYS


Example 2

Roberts, Rao and Kumar (1987) and Rao, Kumar and Roberts (1989) applied the previous methods to data from the October 1980 Canadian Labour Force Survey. The sample consisted of males aged 15–64 who were in the labour force and not full-time students. Two factors, age and education, were chosen to explain the variation in employment rates via logistic regression models. Age group levels were formed by dividing the interval [15, 64] into 10 groups, with the jth age group being the interval [10 + 5j, 14 + 5j] for j = 1, …, 10, and then assigning to each person a value based on the median years of schooling, resulting in the following six levels: 7, 10, 12, 13, 14 and 16. The resulting age by education cross-classification provided a two-way table of I = 60 estimated cell proportions or employment rates, $\hat\mu_{1jk}$, j = 1, …, 10; k = 1, …, 6.

A logistic regression model involving linear and quadratic age effects and a linear education effect provided a good fit to the two-way table of estimated employment rates, namely:

$$\log[\hat\mu_{1jk}/(1 - \hat\mu_{1jk})] = -3.10 + 0.211\,A_j - 0.00218\,A_j^2 + 0.151\,E_k.$$

The following values were obtained for testing the goodness of fit of the above

linear education effect and linear and quadratic effects could be rejected. On the …

$X^2_{P1}(\hat\delta_{1\cdot}, \hat a_1)$ or $X^2_{LR1}(\hat\delta_{1\cdot}, \hat a_1)$ when referred to the upper …

$X^2_{P1}(\hat\delta^{*}_{1\cdot})$ or $X^2_{LR1}(\hat\delta^{*}_{1\cdot})$ …

design effects, is also not significant, and is close to the first-order correction.
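The fitted equation in Example 2 can be turned into predicted employment rates by inverting the logit. The snippet below assumes $A_j$ is age in years and $E_k$ is median years of schooling, which matches the ranges quoted in the text but is otherwise an assumption:

```python
import math

def fitted_employment_rate(age, edu):
    """Fitted logistic model from Example 2:
    logit(mu) = -3.10 + 0.211*A - 0.00218*A^2 + 0.151*E.
    Treating A as age in years and E as years of schooling is assumed."""
    logit = -3.10 + 0.211 * age - 0.00218 * age ** 2 + 0.151 * edu
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical cell: a 35-year-old with 12 median years of schooling.
rate = fitted_employment_rate(35, 12)
```

With these inputs the fitted rate is around 0.97, and it rises with education, consistent with the positive education coefficient.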

Example 3

Fisher (1994) also applied the previous methods to investigate whether the use of personal computers during interviewing by telephone (treatment) versus in-person on-paper interviewing (control) has an effect on labour force estimates. For this purpose, he used split panel data from the US Current Population Survey (CPS), obtained by randomly splitting the CPS sample into two panels and then administering the 'treatment' to respondents in one of the panels and the 'control' to those in the other panel. Three other binary factors, sex, race and ethnicity, were also included. A logistic regression model containing only the main effects of the four factors fitted the four-way table of estimated unemployment rates well. A test of the nested hypothesis that the 'treatment' main effect was absent given the model containing all four main effects was rejected, suggesting that the use of a computer during interviewing does have an effect on labour force estimates.

Rao, Kumar and Roberts (1989) studied several extensions of logistic regression. They extended the previous results to Box–Cox-type models involving power transformations of domain odds ratios, and illustrated their use on data from the Canadian Labour Force Survey. The Box–Cox approach would be useful in those cases where it could lead to additive models on the transformed scale while the logistic regression model would not provide as good a fit without interaction terms. Methods for testing equality of parameters in two logistic regression models, corresponding to consecutive time periods, were also developed and applied to data from the Canadian Labour Force Survey. Finally, they studied a general class of polytomous response models and developed Rao–Scott adjusted Pearson and likelihood tests which they applied to data from the Canada Health Survey (1978–9).

In this section we have discussed methods for analyzing domain response proportions. We turn next to logistic regression analysis of unit-specific sample data.

7.5 LOGISTIC REGRESSION WITH A BINARY RESPONSE VARIABLE

and a binary response variable, $y$, associated with the $t$th population unit, $t = 1, \ldots, N$. We assume that for a given $\mathbf{x}_t$, $y_t$ is generated from a model with mean $E(y_t) = \mu_t(\boldsymbol\theta) = g(\mathbf{x}_t, \boldsymbol\theta)$ and 'working' variance $\mathrm{var}(y_t) = v_{0t} = v_0(\mu_t)$. Specifically, we assume a logistic regression model, so that

$$\log[\mu_t(\boldsymbol\theta)/(1 - \mu_t(\boldsymbol\theta))] = \mathbf{x}_t' \boldsymbol\theta \qquad (7.41)$$

and $v_0(\mu_t) = \mu_t(1 - \mu_t)$. Our interest is in estimating the parameter vector $\boldsymbol\theta$ and …


Following Binder (1983), a linearization estimator of the covariance matrix $V(\hat{\boldsymbol\theta})$ of $\hat{\boldsymbol\theta}$ is given by …

where $J(\hat{\boldsymbol\theta}) = -\sum_{t \in s} w_{ts}\, \partial u_t(\boldsymbol\theta)/\partial\boldsymbol\theta'$ is the observed information matrix and $\hat V(\hat{\mathbf{T}})$ … other adjustments. It is straightforward to obtain a resampling estimator of $V(\hat{\boldsymbol\theta})$ that takes account of post-stratification adjustment. For stratified multi-stage sampling, a jackknife estimator of $V(\hat{\boldsymbol\theta})$ is given by …

the $(jk)$th sample cluster are deleted, but using jackknife survey weights $w_{ts(jk)}$ (see Rao, Scott and Skinner, 1998). Bull and Pederson (1987) extended (7.44) to the case of a polytomous response variable, but without allowing for post-stratification adjustment as in Binder (1983). Again it is straightforward to define a jackknife variance estimator for this case.
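A minimal numerical sketch of the pseudo-MLE and a Binder-style linearization (sandwich) variance estimator. The data, weights and cluster structure below are simulated and hypothetical, and the variance formula assumes with-replacement sampling of clusters; it is a sketch in the spirit of (7.44), not the book's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clustered survey data: 40 clusters of 10 units each.
C, m = 40, 10
n = C * m
cluster = np.repeat(np.arange(C), m)
x = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.normal(scale=0.5, size=C)[cluster]            # shared cluster effect
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + x[:, 1] + u))))
w = rng.uniform(0.5, 1.5, size=n)                     # survey weights

def info(theta):
    mu = 1 / (1 + np.exp(-x @ theta))
    J = (x * (w * mu * (1 - mu))[:, None]).T @ x      # weighted information
    return mu, J

# Pseudo-MLE: Newton-Raphson on the weighted score sum_t w_t x_t (y_t - mu_t).
theta = np.zeros(2)
for _ in range(25):
    mu, J = info(theta)
    theta = theta + np.linalg.solve(J, x.T @ (w * (y - mu)))

mu, J = info(theta)
# Linearization: V(theta) = J^{-1} V(T) J^{-1}, with V(T) estimated from
# the between-cluster variation of cluster totals of score contributions.
z = x * (w * (y - mu))[:, None]
zc = np.vstack([z[cluster == c].sum(axis=0) for c in range(C)])
dev = zc - zc.mean(axis=0)
VT = C / (C - 1) * dev.T @ dev
Jinv = np.linalg.inv(J)
se = np.sqrt(np.diag(Jinv @ VT @ Jinv))
```

The cluster-level aggregation of score contributions is what distinguishes this estimator from the naive model-based one; with positive intracluster correlation the sandwich standard errors are typically larger.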

Write $\boldsymbol\theta = (\boldsymbol\theta_1', \boldsymbol\theta_2')'$, where $\boldsymbol\theta_2$ is $r_2 \times 1$ and $\boldsymbol\theta_1$ is $r_1 \times 1$, and consider the nested hypothesis $H_0: \boldsymbol\theta_2 = \boldsymbol\theta_{20}$. Then the Wald statistic … transformations of $\boldsymbol\theta$. Further, one has to fit the full model (7.41) before … number, $r$, of parameters. This would be the case with a factorial structure of …

Rao, Scott and Skinner (1998) proposed quasi-score tests to circumvent the problems associated with Wald tests. These tests are invariant to non-linear transformations …, a considerable advantage if the dimension of $\boldsymbol\theta$ is large, as noted above.

Let $\tilde{\boldsymbol\theta} = (\tilde{\boldsymbol\theta}_1', \boldsymbol\theta_{20}')'$ be the solution of $\hat{\mathbf{T}}_1(\tilde{\boldsymbol\theta}) = \mathbf{0}$, where $\hat{\mathbf{T}}(\boldsymbol\theta) = [\hat{\mathbf{T}}_1(\boldsymbol\theta)', \hat{\mathbf{T}}_2(\boldsymbol\theta)']'$ is … The quasi-score test is based on the statistic


The linearization estimator, $\hat V_L(\tilde{\mathbf{T}}_2)$, is the estimated covariance matrix of the estimated total … Again, $\hat V_L(\hat{\mathbf{T}}_2)$ should take account of post-stratification and other adjustments. The jackknife estimator $\hat V_J(\hat{\mathbf{T}}_2)$ is similar to (7.45), with $\hat{\boldsymbol\theta}$ and $\hat{\boldsymbol\theta}_{(jk)}$ changed to $\hat{\mathbf{T}}_2$ and $\hat{\mathbf{T}}_{2(jk)}$, where $\hat{\mathbf{T}}_{2(jk)}$ …

Under $H_0$, $X^2_{QS}$ is asymptotically $\chi^2_{r_2}$, so that $X^2_{QS}$ provides a valid test of $H_0$. … When the degrees of freedom for estimating $V(\hat{\boldsymbol\theta}_2)$ or $V(\hat{\mathbf{T}}_2)$ are not large, the tests become unstable. Rao, Scott and Skinner (1998) proposed alternative tests, including an F-version of the Wald test (see also Morel, 1989), and Rao–Scott corrections to naive Wald or score … Section 7.2, as well as Bonferroni tests. We refer the reader to Rao, Scott and Skinner (1998) for details.

… nested hypotheses on $\boldsymbol\theta$ given the model (7.41), unlike the case of domain proportions, which permits testing of model fit as well as nested hypotheses given the model.

7.6 SOME EXTENSIONS AND APPLICATIONS

In this section we briefly mention some recent applications and extensions of Rao–Scott and related methodology.


features were obtained. A model-free approach using two-phase sampling was also developed. In two-phase sampling, error-prone measurements are made on a large first-phase sample selected according to a specified design, and error-free measurements are then made on a smaller subsample selected according to another specified design (typically SRS or stratified SRS). Rao–Scott corrected Pearson statistics were proposed under double sampling for both the goodness-of-fit test and the tests of independence in a two-way table. Rao and Thomas (1991) also extended Assakul and Proctor's (1967) method of testing of independence in a two-way table with known misclassification probabilities to general survey designs. They developed Rao–Scott corrected tests using a general methodology that can also be used for testing the fit of log-linear models on multi-way contingency tables. More recently, Skinner (1998) extended the methods of Section 7.4 to the case of longitudinal survey data subject to classification errors.

7.6.2 Biostatistical applications

Cluster-correlated binary response data often occur in biostatistical applications; for example, toxicological experiments designed to assess the teratogenic effects of chemicals often involve animal litters as experimental units. Several methods that take account of intracluster correlations have been proposed, but most of these methods assume specific models for the intracluster correlation, e.g., the beta-binomial model. Rao and Scott (1992) developed a simple method, based on conceptual design effects and effective sample size, that can be applied to problems involving independent groups of clustered binary data with group-specific covariates. It assumes no specific models for the intracluster correlation, in the spirit of Zeger and Liang (1986). The method can be readily implemented using any standard computer program for the analysis of independent binary data after a small amount of pre-processing.

… $i = 1, \ldots, I$, and let $y_{ij}$ denote the number of 'successes' among the $n_{ij}$ units, with $\sum_j y_{ij} = y_i$ and $\sum_j n_{ij} = n_i$. Treating $y_i$ as independent binomial $B(n_i, p_i)$ … probability in the $i$th group. Denoting the design effect of the $i$th estimated proportion, $\hat p_i = y_i/n_i$, by $D_i$, and the effective sample size by $\tilde n_i = n_i/D_i$, the … applied to a variety of biostatistical problems; in particular, testing homogeneity of proportions, estimating dose response models and testing for trends in proportions, computing the Mantel–Haenszel chi-squared test statistic for independence in a series of $2 \times 2$ tables, and estimating the common odds ratio and its variance when the independence hypothesis is rejected. Obuchowski (1998) extended the method to comparisons of correlated proportions. Rao and Scott (1999a) proposed a simple method for analyzing grouped count data exhibiting overdispersion relative to a Poisson model. This method is similar to the previous method for clustered binary data.
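The effective sample size idea can be sketched numerically. The design effect below is estimated with a standard ratio-estimator variance formula treating litters as primary sampling units; this is in the spirit of Rao and Scott (1992) but not copied from their paper, and the litter data are invented:

```python
def effective_counts(litters):
    """Effective sample size adjustment for one group of clustered binary
    data; `litters` is a list of (successes, size) pairs per cluster.
    The variance formula is a common ratio-estimator form (an assumption;
    the exact expression in Rao and Scott, 1992, may differ in detail)."""
    k = len(litters)
    n = sum(nj for _, nj in litters)
    y = sum(yj for yj, _ in litters)
    p = y / n
    v = k / (k - 1) * sum((yj - p * nj) ** 2 for yj, nj in litters) / n ** 2
    d = max(n * v / (p * (1 - p)), 1e-8)   # estimated design effect
    return n / d, y / d                    # effective size, effective count

# Hypothetical teratology data: litters in two treatment groups.
group1 = [(0, 5), (5, 6), (0, 4), (6, 7), (0, 5), (4, 6)]
group2 = [(5, 5), (1, 6), (4, 4), (2, 7), (5, 5), (1, 6)]

adj = [effective_counts(g) for g in (group1, group2)]
n_t = [a for a, _ in adj]
y_t = [b for _, b in adj]
p_bar = sum(y_t) / sum(n_t)
# Standard Pearson homogeneity test applied to the effective counts.
x2 = sum(nt * (yt / nt - p_bar) ** 2 for nt, yt in zip(n_t, y_t)) \
     / (p_bar * (1 - p_bar))
```

Because the litter outcomes here are strongly clustered, the estimated design effects are around 4, the effective sample sizes shrink accordingly, and the adjusted statistic is well below its naive counterpart.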


7.6.3 Multiple response tables

In marketing research surveys, individuals are often presented with questions that allow them to respond to more than one of the items on a list, i.e., multiple responses may be observed from a single respondent. Standard methods cannot be applied to tables of aggregated multiple response data because of the multiple response effect, similar to a clustering effect in which each independent respondent plays the role of a cluster. Decady and Thomas (1999, 2000) adapted the first-order Rao–Scott procedure to such data and showed that it leads to simple approximate methods based on easily computed, adjusted chi-squared tests of simple goodness of fit and homogeneity of response probabilities. These adjusted tests can be calculated from the table of aggregate multiple response counts alone, i.e., they do not require knowledge of the correlations between the aggregate multiple response counts. This is not true in general; for example, the test of equality of proportions will require either the full dataset or an expanded table of counts in which each of the multiple response items is treated as a binary variable. Nevertheless, the first-order Rao–Scott approach is still considerably easier to apply than the bootstrap approach recently proposed by Loughin and Scherer (1998).


of the study and all of these are included. In addition, a sample of controls was drawn from the remaining children in the study population by a complex multi-stage design. At the first stage, a sample of 300 census mesh blocks (each containing roughly 50 households) was drawn with probability proportional to the number of houses in the block. Then a systematic sample of 20 households was selected from each chosen mesh block, and children from these households were selected for the study with varying probabilities that depend on age and ethnicity as in Table 8.1. These probabilities were chosen to match the expected frequencies among the cases. Cluster sample sizes varied from one to six, and a total of approximately 250 controls was achieved. This corresponds to a sampling fraction of about 1 in 400, so that cases are sampled at a rate that is 400 times that for controls.
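The first-stage selection described above, blocks drawn with probability proportional to size, is commonly implemented by systematic PPS sampling; the sketch below uses that standard algorithm (the frame itself is hypothetical, and the study's actual selection mechanics are not documented here):

```python
import random

def pps_systematic(sizes, n_sample, seed=1):
    """Systematic probability-proportional-to-size sampling: unit i is
    selected with probability n_sample * sizes[i] / sum(sizes), assuming
    every size is smaller than the sampling interval."""
    total = sum(sizes)
    step = total / n_sample
    start = random.Random(seed).uniform(0, step)
    points = [start + k * step for k in range(n_sample)]
    chosen, cum, i = [], 0.0, 0
    for idx, size in enumerate(sizes):
        cum += size
        while i < n_sample and points[i] <= cum:
            chosen.append(idx)
            i += 1
    return chosen

# Hypothetical frame: 3000 mesh blocks of roughly 50 households each.
rng = random.Random(42)
sizes = [rng.randint(35, 65) for _ in range(3000)]
blocks = pps_systematic(sizes, 300)
```

Because every block size is far below the sampling interval, each block can be selected at most once and exactly 300 distinct blocks are returned.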

Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner

Copyright © 2003 John Wiley & Sons, Ltd.

ISBN: 0-471-89987-9


Table 8.1 Selection probabilities (columns: Maori, Pacific Islander, Other).

Complex sampling may also be used in the selection of cases. For example, we have recently helped with the analysis of a study in which cases (patients who had suffered a mild stroke) were selected through a two-stage design with doctors' practices as primary sampling units. However, this is much less common than with controls.

As we said at the outset, these studies are a very special sort of survey, but they share the two key features that make the analysis of survey data distinctive. The first feature is the lack of independence. In our example, we would expect substantial intracluster correlation because of unmeasured socio-economic variables, factors affecting living conditions (e.g. mould on walls of the house), environmental exposures and so on. Ignoring this will lead to standard errors that are too small and confidence intervals that are too narrow. The other distinctive feature is the use of unequal selection probabilities. In case–control studies the selection probabilities can be extremely unbalanced, and they are based directly on the values of the response variable, so that we have informative sampling at its most extreme.
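The cost of ignoring intracluster correlation can be quantified with the usual design-effect approximation deff = 1 + (m − 1)ρ for clusters of size m; this formula is standard survey-sampling material but is not quoted in the text, and the numbers below are hypothetical:

```python
import math

def design_effect(m, rho):
    """Approximate variance inflation for clusters of size m with
    intracluster correlation rho: deff = 1 + (m - 1) * rho.
    (A standard approximation, assumed rather than taken from the text.)"""
    return 1 + (m - 1) * rho

# Hypothetical: clusters of 4 children with intracluster correlation 0.3.
deff = design_effect(4, 0.3)
naive_se = 0.05                          # SE computed as if responses were iid
correct_se = naive_se * math.sqrt(deff)  # roughly 38% larger
```

Even a moderate within-cluster correlation nearly doubles the variance here, which is exactly why nominal confidence intervals become too narrow when clustering is ignored.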

In recent years, efficient semi-parametric procedures have been developed for handling the variable selection probabilities (see Scott and Wild (2001) for a survey of recent work). 'Semi-parametric' here means that a parametric model is specified for the response as a function of potential explanatory variables, but that the joint distribution of the explanatory variables is left completely free. This is important because there are usually many potential explanatory variables (more than 100 in some of the studies in which we are involved) and it would be impossible to model their joint behaviour, which is of little interest in its own right. However, the problem of clustering is almost completely ignored, in spite of the fact that a large number of studies use multi-stage sampling. The paper by Graubard, Fears and Gail (1989) is one of the few that even discuss the problem. Most analyses simply ignore the problem and use a program designed for simple (or perhaps stratified) random sampling of cases and controls. This chapter is an attempt to remedy this neglect.

110 FITTING REGRESSION MODELS IN CASE–CONTROL STUDIES


In the next section we give a summary of standard results for simple case–control studies, where cases and controls are each selected by simple random sampling or by stratified random sampling. In Section 8.3, we extend these results to cover cases where the controls (and possibly the cases) are selected using an arbitrary probability sampling design. We investigate the properties of these methods through simulation studies in Section 8.4. The robustness of the methods in situations when the assumed model does not hold exactly is explored in Section 8.5, and some possible alternative approaches are sketched in the final section.

8.2 SIMPLE CASE–CONTROL STUDIES

We shall take it for granted throughout this chapter that our purpose is to make inferences about the parameters of a superpopulation model, since interest is centred on the underlying process that produces the cases and not on the composition of a particular finite population at a particular point in time. Suppose that the finite population consists of values $\{(x_t, y_t),\ t = 1, \ldots, N\}$, where $y_t$ is a binary … underlie all standard methods for the analysis of case–control data from population-based studies. For simplicity, we will work with the logistic regression model, … and often shorten it to $p_1(x)$. We also set $p_0(x) = 1 - p_1(x)$.

If we had data on the whole population, that is what we would analyse, treating it as a random sample from the process that produces cases and controls. The finite population provides score (first derivative of the log-likelihood) equations

$$\sum_{t:\,y_t = 1} x_t\, p_0(x_t) \;=\; \sum_{t:\,y_t = 0} x_t\, p_1(x_t), \qquad (8.1)$$

where the first sum is over the cases and the second over the controls. As $N \to \infty$, Equations (8.1) converge to

distribution of $X$ conditional on $Y = i$ ($i = 0, 1$). Under the model, Equations (8.2) have solution $\boldsymbol\beta$.

Standard logistic regression applies if our data come from a simple random sample of size n from the finite population. For rare events, as is typical in biostatistics and epidemiology, enormous efficiency gains can be obtained by … stratum defined by $Y = i$ ($i = 0, 1$), with $n_1 \approx n_0$.


Since the estimating equations (8.2) simply involve population means for the case and control subpopulations, they can be estimated directly from case–control data using the corresponding sample means. This leads to a design-weighted … can be estimated consistently (by the corresponding population proportions, for example). More efficient estimates, however, are obtained using the semi-parametric … by ignoring the sampling scheme (see Prentice and Pyke, 1979) and solving the prospective score equations (i.e. those that would be appropriate if we had a simple random sample from the whole population).

… $n_1/n$ … $\mathbf{e}_1 = (1, 0, \ldots, 0)'$. We see that only the intercept term is affected by the case–control sampling. The intercept term can be corrected simply by using $\kappa$ as an offset in the model, but if we are only interested in the coefficients of the risk factors, we do not even need to know the relative stratum weights. More generally, Scott … have the unique solution $\mathbf{b} = \boldsymbol\beta + \kappa_\lambda \mathbf{e}_1$ with $\kappa_\lambda = \log[\lambda_1 W_0/(\lambda_0 W_1)]$, provided that the model contains a constant term (i.e. the first component of $x$ is 1). This can be seen directly by expanding (8.6). Suppose, for simplicity, that $X$ is continuous with joint density function $f(x)$. Then, noting that the conditional density of $X$ given that $Y = i$ is $f(x \mid Y = i) = p_i(x; \boldsymbol\beta) f(x)/W_i$, (8.6) is equivalent to

$$\int \frac{x\, e^{x'\boldsymbol\beta + \kappa_\lambda}\, f(x)}{(1 + e^{x'\boldsymbol\beta})(1 + e^{x'\mathbf{b}})}\, dx \;=\; \int \frac{x\, e^{x'\mathbf{b}}\, f(x)}{(1 + e^{x'\boldsymbol\beta})(1 + e^{x'\mathbf{b}})}\, dx.$$

The result then follows immediately.
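The intercept-offset result can be checked by simulation: an unweighted logistic fit to a case–control sample recovers the population slope, while the intercept shifts by $\kappa_\lambda = \log[\lambda_1 W_0/(\lambda_0 W_1)]$. The population model and sample sizes below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def fit_logistic(X, y, iters=30):
    """Plain unweighted Newton-Raphson logistic regression."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ b))
        H = (X * (mu * (1 - mu))[:, None]).T @ X
        b = b + np.linalg.solve(H, X.T @ (y - mu))
    return b

# Hypothetical population: logit Pr(Y=1|x) = -4 + x, so cases are rare.
N = 400_000
x = rng.normal(size=N)
y = rng.binomial(1, 1 / (1 + np.exp(-(-4 + x))))

# Case-control sample: all cases plus an equal-sized SRS of controls.
cases = np.flatnonzero(y == 1)
controls = rng.choice(np.flatnonzero(y == 0), size=cases.size, replace=False)
s = np.concatenate([cases, controls])
b = fit_logistic(np.column_stack([np.ones(s.size), x[s]]), y[s])

# Predicted shift: kappa = log(lambda_1 * W_0 / (lambda_0 * W_1)),
# with W_i the population proportions and lambda_i = n_i / n.
W1 = y.mean()
kappa = np.log(cases.size * (1 - W1) / (controls.size * W1))
```

Here the fitted slope stays near the true value of 1, while the fitted intercept sits near $-4 + \kappa_\lambda$ rather than $-4$, which is exactly the offset behaviour described above.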

one explanatory variable and the ratio of the sampling rates between cases and controls is not too extreme. However, the differences become bigger as cases become rarer in the population and the sampling rates become less balanced, and also when we have more than one covariate. We have seen 50% efficiency gains in some simple case–control problems. In more complex stratified samples, design weighting can be very inefficient indeed (see Lawless, Kalbfleisch and Wild, 1999).

Most case–control studies incorporate stratification on other variables, such as age and ethnicity, as in our motivating example. This is one aspect of complex survey design that is taken into account in standard case–control methodology. A common assumption is that relative risk is constant across these strata but that absolute levels of risk vary from stratum to stratum, leading to a model of the form

for observations in the hth stratum, where $x$ now contains dummy variables for … adapting the design-weighted (pseudo-MLE, Horvitz–Thompson) approach to this problem; we simply add a dummy variable for each stratum to $x$ and replace the sample means in (8.3) by their stratified equivalents. The maximum … cases and controls within each stratum; we simply fit the model in (8.7) as if we had a simple random sample from the whole population, ignoring the stratification … by an additive constant depending on the relative sampling fractions of cases … even need to know these sampling fractions.

In some studies, we want to model the stratum constants as functions of the variables in $x_1$. For example, if the population is stratified by age, we might still want to model the effect of age by some smooth function. Again, adapting the design-weighted approach is completely straightforward and only requires the specification of the appropriate sampling weights. Extending the maximum likelihood approach to this case is considerably more difficult, however, and a fully efficient solution has only been obtained relatively recently (for details see Scott and Wild, 1997; Breslow and Holubkov, 1997).

8.3 CASE–CONTROL STUDIES WITH COMPLEX SAMPLING

Now consider situations in which controls (and possibly cases) are obtained using a complex sampling plan involving multi-stage sampling. Our only assumption is that the population is stratified into cases and controls and that samples of cases and controls are drawn independently. Note that this assumption does not hold exactly in our motivating example, since a control could be drawn from the same cluster as a case. However, cases are sufficiently rare for this possibility to be ignorable in the analysis. (In fact, it occurred once in the study.) One common option is simply to ignore the sample design (see Graubard, Fears and Gail, 1989, for examples) and use a standard logistic regression program, just as if we had a simple (or stratified) case–control study. Underestimates of variance are clearly a worry with this strategy; the coverage frequencies of nominally 95% confidence intervals dropped to about 80% in some of our simulations. In fact, estimates obtained in this way need not even be consistent. To obtain consistency, we need $(1/n_0)\sum_{\text{controls}} x_t\, p_1(x_t)$ to … self-weighting designs, but not true in general.

The estimating equations (8.2) involve separate means for the case and control subpopulations. Both these terms can be estimated using standard survey sampling techniques for estimating means from complex samples. Variances and covariances of the estimated means, which are required for sandwich estimates of $\mathrm{Cov}(\hat{\mathbf{b}})$, can also be obtained by standard survey sampling techniques. Such analyses can be carried out routinely using specialist sampling packages such as SUDAAN, or with general packages such as SAS or Stata that include a survey option. This approach is a direct generalisation of the design-weighted (Horvitz–Thompson) approach to the analysis of simple case–control data which, as we have seen, can be quite inefficient.

An alternative is to apply standard sampling methods to (8.6), with appropriate choices for $\lambda_1$ and $\lambda_0$, rather than to (8.2). This leads to an estimator, $\hat{\mathbf{b}}_\lambda$ say, that satisfies … arguments. This leads to an estimated covariance matrix

$$\widehat{\mathrm{Cov}}(\hat{\mathbf{b}}_\lambda) \;\approx\; \mathbf{I}_\lambda^{-1}(\hat{\mathbf{b}}_\lambda)\, \widehat{\mathrm{Cov}}(\hat{\mathbf{S}}_\lambda)\, \mathbf{I}_\lambda^{-1}(\hat{\mathbf{b}}_\lambda),$$

where $\mathbf{I}_\lambda(\mathbf{b}) = \partial \hat{\mathbf{S}}_\lambda/\partial \mathbf{b}'$ and $\widehat{\mathrm{Cov}}(\hat{\mathbf{S}}_\lambda)$ is the standard survey estimate of $\mathrm{Cov}(\hat{\mathbf{S}}_\lambda)$, just as in Equation (7.44). (Note that $\hat{\mathbf{S}}_\lambda$ is just a linear combination of two estimated means.) All of this can also be carried out straightforwardly in SUDAAN or similar packages simply by specifying the appropriate weights (i.e. by scaling the case weights and control weights separately so that the sum … proportional to $\lambda_0$).
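The weight-scaling device mentioned above can be sketched generically: rescale the case and control weights separately so that each group's weights sum to the chosen $\lambda_i$. The helper and the numbers below are hypothetical, not SUDAAN syntax:

```python
def scale_weights(case_w, control_w, lam1, lam0):
    """Rescale survey weights within cases and within controls so that
    the case weights sum to lam1 and the control weights sum to lam0,
    preserving relative weights inside each group."""
    c, k = sum(case_w), sum(control_w)
    return ([w * lam1 / c for w in case_w],
            [w * lam0 / k for w in control_w])

# Hypothetical weights; choosing lam_i = n_i / n mimics sample-proportion
# weighting, one of the choices discussed in the text.
case_w = [1.0, 1.0, 1.0, 1.0]
control_w = [2.0, 4.0, 2.0, 4.0]
n1, n0 = len(case_w), len(control_w)
cw, kw = scale_weights(case_w, control_w, n1 / (n1 + n0), n0 / (n1 + n0))
```

Only the two group totals matter for the choice of $\lambda_i$; the relative weights within each group are left untouched.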

… random samples of cases and controls, maximum likelihood, in which we set … complex schemes, using the sample proportions will no longer be fully efficient, and we might expect weights based on some form of equivalent sample sizes to perform better. We have carried out some simulations to investigate this. Our limited experience so far suggests that setting $\lambda_i = n_i/n$ for $i = 0, 1$ leads to only a moderate loss of efficiency unless design effects are extreme.

We have not said anything specific about stratification beyond the basic division into cases and controls so far in this discussion. Implicitly, we are assuming that stratification on other variables such as age or ethnicity is handled by specifying the appropriate weights when we form the sample … model the stratum constants in terms of other variables, then this is taken care of automatically. There is another approach to fitting model (8.6), which is to ignore the different selection probabilities of units in different strata. This mimics the maximum likelihood method of the previous section more closely … further adjustments are necessary if we wish to model the stratum constants.

We explore the difference in efficiency between the two approaches in our simulations in the next section. Note that we can again implement the method simply in a package like SUDAAN by choosing the weights appropriately.

In this section, we describe the results of some of the simulations that we have done to test and compare the methods discussed previously. The simulations … which produces a population proportion of approximately 1 case in every 300 individuals, a little less extreme than the 1 in 400 in the meningitis study. The number of cases in every simulation study is 200. The number of control clusters is always fixed. The number of controls in each study averages 200, but the actual number is random in some studies (when 'ethnicities' are subsampled at different rates; see later). All simulations used 1000 generated datasets.

The first row of Table 8.2 shows results for a population in which 60% of the clusters are of size 2 and the remaining 40% of size 4. We see from Table 8.2 that, … increase in efficiency. It is possible that the gain in efficiency is smaller for the binary variable because it represents a smaller change in risk. We also investigated the effect of using a standard logistic regression program ignoring clustering. Members of the same cluster are extremely closely related here, so we would expect the coverage frequency of nominally 95% confidence intervals to drop well below 95% if the clustering was ignored. We found that coverage … when clustering was ignored. When we correct for clustering in both the sample-weighted and population-weighted variance estimates, the coverage frequencies were close to 95%.

Table 8.2 Relative efficiencies (cf. design weighting).

In Section 8.3, we discussed the possible benefits of using weights that reflected 'effective sample' size in place of sample weights. We repeated the preceding simulation (using the clusters of size 2 and 4), down-weighting the controls by a factor of 3 (which is roughly equal to the average design effect).

The rows of Table 8.2 headed 'Stratified sampling of "ethnic groups"' show the results of an attempt to mimic (on a smaller scale) the stratified sampling of controls within clusters in the meningitis study. We generated our population so that 60% of the individuals belong to ethnic group 1, and 20% belong to each of groups 2 and 3. This was done in two ways. Where '(random)' has been appended, the ethnic groups have been generated independently of cluster. Where '(whole)' has been appended, all members of a cluster have the same ethnicity. All members of a sampled cluster in groups 2 and 3 were retained, while members of group 1 were retained with probability 0.33. We varied the sizes of the clusters, with either 60% twos and 40% fours, or 60% fours and 40% eights, before subsampling. The subsampling (random removal of group 1 individuals) left cluster sample sizes ranging from one to the maximum cluster size. The models now also include dummy variables for ethnic group and we can estimate these in two ways. The columns headed 'Ethnic weighted' show results when cases have sample weights, whereas controls have weights which are the sample weights times 1.8 for group 1 or 0.6 for groups 2 or 3. This reconstitutes relative population weightings so that contrasts in ethnic coefficients are correctly estimated.

In this latter set of simulations the increase in efficiency for sample weighting over population weighting is even more striking. Reweighting to account for different sampling rates in the different ethnic groups has led to similar efficiencies. This is important, as the type of 'ethnic weighting' we have done in the control group would allow us to model group differences rather than being … dropped a little, however, in the '4s and 8s' clusters with design weighting. We can interpret this as meaning that, with more clustering and additional parameters to estimate, our effective sample sizes have dropped and, with smaller sample sizes, asymptotic approximations are less accurate.


equations. It is the remaining regression coefficients that are of primary interest, however, since they determine the relative risk of becoming a case associated with a change in the values of explanatory variables. These remaining coefficients are unaffected by the choice of weightings.

we would be estimating if we fitted the model to the whole population. Since all models are almost certainly misspecified to some extent, many survey statisticians (see Kish and Frankel, 1974) would suggest that a reasonable aim is to estimate B. We adopted this perspective uncritically in Scott and Wild (1986, 1989), suggesting as a consequence that, although maximum likelihood estimation was more efficient, design weighting was more robust because it alone led to consistent estimates of B in the presence of model misspecification. This has been quoted widely by both biostatisticians and survey samplers. However, a more detailed appraisal of the nature of B suggests that using sample weights … robust as well as being more efficient.

Table 1 of Scott and Wild (1989) showed results from fitting a linear logistic regression model when the true model was quadratic, with $\mathrm{logit}\{\Pr(Y = 1 \mid x)\} =$ … scale, and a quadratic approximation should give a reasonable idea of what happens when the working logistic model is not too badly misspecified. Two population curves were investigated, with the extent of the curvature being chosen so that an analyst would fail to detect the curvature about 50% of the … standard normal (0, 1) distribution for the covariate $X$ and the population curves … Figure 8.1(b). We will refer to the former curve as the negative quadratic and the latter as the positive quadratic. Both models result in about 1 case in every … correspond to 1 case in every 400 individuals. For each population curve,

ROBUSTNESS 117


Figure 8.1 Approximations to population curve.

we have plotted two linear approximations. The solid line corresponds to B, the 'whole-population approximation'. The dashed-line approximation corresponds to using sample weights (maximum likelihood).

The slope of the logit curve tells us about the relative risk of becoming a case associated with a change in $x$. Using vertical arrows, we have indicated the position at which the slope of the linear approximation agrees with the slope of the curve. The results are clearest in the more extreme situation of 1 case in 400 (cf. the meningitis study). We see that B (design weights) is telling us about
