Under the law of likelihood, the observation is evidence supporting A over B, and the strength of the evidence is quantified by the ratio of the two probabilities, i.e., the likelihood ratio.

When uncertainty about the hypotheses, before X = x is observed, is measured by probabilities Pr(A) and Pr(B), the law of likelihood can be derived from elementary probability theory. In that case the quantity p_A is the conditional probability that X = x, given that A is true, Pr(X = x | A), and p_B is Pr(X = x | B). The definition of conditional probability implies that

\[
\frac{\Pr(A \mid X = x)}{\Pr(B \mid X = x)} = \frac{p_A \Pr(A)}{p_B \Pr(B)}.
\]

This formula shows that the effect of the statistical evidence (the observation X = x) is to change the probability ratio from Pr(A)/Pr(B) to Pr(A | X = x)/Pr(B | X = x). The likelihood ratio p_A/p_B is the exact factor by which the probability ratio is changed. If the likelihood ratio equals 5, it means that the observation X = x constitutes evidence just strong enough to cause a five-fold increase in the probability ratio. Note that the strength of the evidence is independent of the magnitudes of the probabilities Pr(A) and Pr(B), and of their ratio. The same argument applies when p_A and p_B are not probabilities, but probability densities at the point x.
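This change-of-ratio property is simple to verify numerically. The following minimal sketch (with illustrative probabilities chosen here, not taken from the text) confirms that the posterior ratio equals the likelihood ratio multiplied by the prior ratio:

```python
# Posterior odds = likelihood ratio x prior odds: a numerical check.
pA, pB = 0.10, 0.02    # Pr(X = x | A) and Pr(X = x | B); likelihood ratio = 5
prA, prB = 0.3, 0.7    # illustrative prior probabilities Pr(A), Pr(B)

norm = pA * prA + pB * prB
postA, postB = pA * prA / norm, pB * prB / norm   # Pr(A | X = x), Pr(B | X = x)

print(postA / postB)              # posterior ratio
print((pA / pB) * (prA / prB))    # likelihood ratio x prior ratio: the same
```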
The likelihood ratio is a precise and objective numerical measure of the strength of statistical evidence. Practical use of this measure requires that we learn to relate it to intuitive verbal descriptions such as 'weak', 'fairly strong', 'very strong', etc. For this purpose the values of 8 and 32 have been suggested as benchmarks: observations with a likelihood ratio of 8 (or 1/8) constitute 'moderately strong' evidence, and observations with a likelihood ratio of 32 (or 1/32) are 'strong' evidence (Royall, 1997). These benchmark values are similar to others that have been proposed (Jeffreys, 1961; Edwards, 1972; Kass and Raftery, 1995). They are suggested by consideration of a simple experiment with two urns, one containing only white balls, and the other containing half white balls and half black balls. Suppose a ball is drawn from one of these urns and is seen to be white. While this observation surely represents evidence supporting the hypothesis that the urn is the all-white one (vs. the alternative that it is the half-white one), it is clear that the evidence is 'weak'. The likelihood ratio is 2. If the ball is replaced and a second draw is made, also white, the two observations together represent somewhat stronger evidence supporting the 'all-white' urn hypothesis: the likelihood ratio is 4. A likelihood ratio of 8, which we suggest describing as 'moderately strong' evidence, has the same strength as three consecutive white balls. Observation of five white balls is 'strong' evidence, and gives a likelihood ratio of 32.
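Since each white draw is certain under the all-white urn and has probability 1/2 under the half-white urn, n consecutive white draws give a likelihood ratio of 2^n; a one-line check:

```python
# Urn example: likelihood ratio after n consecutive white draws is 2**n.
for n in range(1, 6):
    print(n, 2.0**n)   # n = 3 gives 8 ('moderately strong'), n = 5 gives 32 ('strong')
```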
A key concept of evidential statistics is that of misleading evidence. Observations with a likelihood ratio of p_A/p_B = 40 (40 times as probable under A as under B) constitute strong evidence supporting A over B. Such observations can occur when B is true, and when that happens they constitute strong misleading evidence. No error has been made: the evidence has been properly interpreted. The evidence itself is misleading. Statistical evidence, properly interpreted, can be misleading. But the nature of statistical evidence is such that we cannot observe strong misleading evidence very often. There is a universal bound for the probability of misleading evidence: if A implies that X has density (or mass) function f_A, while B implies f_B, then for any k > 1,

\[
P_B\bigl(f_A(X)/f_B(X) \ge k\bigr) \le 1/k.
\]

Thus when B is true, the probability of observing data giving a likelihood ratio of 40 or more in favour of A can be no greater than 0.025. Royall (2000) discusses this universal bound as well as much smaller ones that apply within many important parametric models.
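The universal bound is easy to see by simulation. The sketch below uses a normal location model of our own choosing (B: N(0, 1), A: N(1, 1)), generates data under B, and checks that strong evidence favouring A is rare:

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check of the universal bound P_B(f_A(X)/f_B(X) >= k) <= 1/k.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=1_000_000)            # data generated under B
lr = norm.pdf(x, 1.0, 1.0) / norm.pdf(x, 0.0, 1.0)  # likelihood ratio for A vs B

for k in (8, 32, 40):
    print(k, (lr >= k).mean(), '<=', 1 / k)         # observed rate vs the bound
```

Within many parametric models, as noted above, the actual probabilities fall far below 1/k.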
A critical distinction in evidence theory and methods is that between the strength of the evidence represented by a given body of observations, which is measured by likelihood ratios, and the probabilities that a particular procedure for making observations (sampling plan, stopping rule) will produce observations that constitute weak or misleading evidence. The essential flaw in standard frequentist theory and methods such as hypothesis testing and confidence intervals, when they are used for evidential interpretation of data, is the failure to make such a distinction (Hacking, 1965, Ch. 7). Lacking an explicit concept of evidence like that embodied in the likelihood, standard statistical theory tries to use probabilities (Type I and Type II error probabilities, confidence coefficients, etc.) in both roles: (i) to describe the uncertainty in a statistical procedure before observations have been made; and (ii) to interpret the evidence represented by a given body of observations. It is the use of probabilities in the second role, which is incompatible with the law of likelihood, that produces the paradoxes pervading contemporary statistics (Royall, 1997, Ch. 5).

With respect to a model that consists of a collection of probability distributions indexed by a parameter θ, the statistical evidence in observations X = x supporting any value, θ_1, vis-à-vis any other, θ_2, is measured by the likelihood ratio, f(x; θ_1)/f(x; θ_2). Thus the likelihood function, L(θ) ∝ f(x; θ), is the mathematical representation of the statistical evidence under this model: if two instances of statistical evidence generate the same likelihood function, they represent evidence of the same strength with respect to all possible pairs (θ_1, θ_2), so they are equivalent as evidence about θ. This important implication of the law of likelihood is known as the likelihood principle. It in turn has immediate implications for statistical methods in the problem area of informative inference. As Birnbaum (1962) expressed it, 'One basic consequence is that reports of experimental results in scientific journals should in principle be descriptions of likelihood functions.' Note that the likelihood function is, by definition, the function whose ratios measure the strength of the evidence in the observations.
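A classic illustration of the likelihood principle (ours, not the chapter's) contrasts a binomial with a negative binomial design. If 9 successes and 3 failures are observed, the two likelihood functions are proportional, whether n = 12 trials were fixed in advance or sampling continued until the third failure, so every likelihood ratio, and hence the evidence about θ, is identical:

```python
from scipy.stats import binom, nbinom

def L_fixed_n(theta):                 # 9 successes in n = 12 fixed trials
    return binom.pmf(9, 12, theta)

def L_stop_at_3rd_failure(theta):     # sample until the 3rd failure occurs
    # scipy's nbinom counts failures before the r-th 'success'; here the
    # stopping event (a failure) plays that role, with probability 1 - theta
    return nbinom.pmf(9, 3, 1 - theta)

for t1, t2 in [(0.5, 0.75), (0.6, 0.9), (0.25, 0.8)]:
    print(L_fixed_n(t1) / L_fixed_n(t2),
          L_stop_at_3rd_failure(t1) / L_stop_at_3rd_failure(t2))  # equal pairs
```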
When the elements of a finite population are modelled as realisations of random variables, and a sample is drawn from that population, the observations presumably represent evidence about both the probability model and the population. This chapter considers the definition, construction, and use of likelihood functions for representing and interpreting the observations as evidence about (i) parameters in the probability model and (ii) characteristics of the actual population (such as the population mean or total) in problems where the sampling plan is uninformative (selection probabilities depend only on quantities whose values are known at the time of selection; see also the discussion by Little in the previous chapter). Although there is broad agreement about the definition of likelihood functions for (i), consensus has not been reached on (ii). We will use simple transparent examples to examine and compare likelihood functions for (i) and (ii), and to study the probability of misleading evidence in relation to (ii). We suggest that Bjørnstad's (1996) assertion that 'Survey sampling under a population model is a field where the likelihood function has not been defined properly' is mistaken, and we note that his proposed redefinition of the likelihood function is incompatible with the law of likelihood.
5.2 THE EVIDENCE IN A SAMPLE FROM A FINITE POPULATION
We begin with the simplest case. Consider a condition (disease, genotype, behavioural trait, etc.) that is either present or absent in each member of a population of N = 393 individuals. Let x_t be the zero-one indicator of whether the condition is present in the tth individual. We choose a sample, s, consisting of n = 30 individuals and observe their x-values: x_t, t ∈ s.
5.2.1 Evidence about a probability
If we model the x's as realised values of iid Bernoulli(θ) random variables, X_1, X_2, ..., X_393, then our 30 observations constitute evidence about the probability θ. If there are 10 instances of the condition (x = 1) in the sample, then the evidence about θ is represented by the likelihood function L(θ) ∝ θ^10 (1 − θ)^20, shown by the solid curve in Figure 5.1, which we have standardised so that the maximum value is one. The law of likelihood explains how to interpret this function: the sample constitutes evidence about θ, and for any two values θ_1 and θ_2 the ratio L(θ_1)/L(θ_2) measures the strength of the evidence supporting θ_1 over θ_2. For example, our sample constitutes very strong evidence supporting θ = 1/4 over θ = 3/4 (the likelihood ratio is nearly 60 000), weak evidence for θ = 1/4 over θ = 1/2 (likelihood ratio 3.2), and even weaker evidence supporting θ = 1/3 over θ = 1/4 (likelihood ratio 1.68).
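These ratios are immediate to reproduce:

```python
# Likelihood ratios for L(theta) = theta**10 * (1 - theta)**20.
def L(theta):
    return theta**10 * (1 - theta)**20

print(L(1/4) / L(3/4))   # 59 049, 'nearly 60 000'
print(L(1/4) / L(1/2))   # about 3.2
print(L(1/3) / L(1/4))   # about 1.68
```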
5.2.2 Evidence about a population proportion
The parameter θ is the probability in a conceptual model for the process that generated the 393 population values x_1, x_2, ..., x_393. It is not the same as the actual proportion with the condition in this population, which is Σ_{t=1}^{393} x_t / 393. The evidence about the proportion is represented by a second likelihood function (derived below), shown by the solid dots in Figure 5.1. This function is discrete, because the only possible values for the proportion are k/393, for k = 0, 1, 2, ..., 393. Since 10 ones and 20 zeros have been observed, population proportions corresponding to fewer than 10 ones (0/393, 1/393, ..., 9/393) and fewer than 20 zeros (374/393, ..., 392/393, 1) are incompatible with the sample, so the likelihood is zero at all of these points.
To facilitate interpretation of this evidence we have supplied some numerical summaries in Figure 5.1, indicating, for example, where the likelihood is maximised and the range of values where the standardised likelihood is at least 1/8. This range, the '1/8 likelihood interval', consists of the values of the proportion that are consistent with the sample in the sense that there is no alternative value that is better supported by a factor of 8 ('moderately strong evidence') or greater.
To clarify further the distinction between the probability and the proportion, let us suppose that the population consists of only N = 50 individuals, not 393. Now since the sample consists of 30 of the individuals, there are only 20 whose x-values remain unknown, so there are 21 values for the population proportion that are compatible with the observed data: 10/50, 11/50, ..., 30/50. The likelihoods for these 21 values are shown by the open dots in Figure 5.1. The likelihood function that represents the evidence about the probability does not depend on the size of the population that is sampled, but Figure 5.1 makes it clear that the function representing the evidence about the actual proportion in that population depends critically on the population size N.
5.2.3 The likelihood function for a population proportion or total
The discrete likelihoods in Figure 5.1 are obtained as follows. In a population of size N, the likelihood ratio that measures the strength of the evidence supporting one value of the population proportion versus another is the factor by which their probability ratio is changed by the observations. Now before the sample is observed, the population total T = Σ_U x_t is Binomial(N, θ), so the ratio of the probability that the proportion equals k/N to the probability that it equals j/N is

\[
\frac{\Pr(T=k)}{\Pr(T=j)} = \frac{\binom{N}{k}\theta^{k}(1-\theta)^{N-k}}{\binom{N}{j}\theta^{j}(1-\theta)^{N-j}}.
\]

After t_s = Σ_s x_t ones have been observed in the sample, the remaining total T − t_s is Binomial(N − n, θ), so this ratio becomes

\[
\frac{\Pr(T=k \mid t_s)}{\Pr(T=j \mid t_s)} = \frac{\binom{N-n}{k-t_s}\theta^{k-t_s}(1-\theta)^{N-n-k+t_s}}{\binom{N-n}{j-t_s}\theta^{j-t_s}(1-\theta)^{N-n-j+t_s}}.
\]

The factor by which the observations change the probability ratio, the likelihood ratio for k/N versus j/N, is therefore

\[
\frac{\binom{N-n}{k-t_s}\Big/\binom{N}{k}}{\binom{N-n}{j-t_s}\Big/\binom{N}{j}},
\]

which is free of θ. Standardised, this is the function represented by the dots in Figure 5.1.

Figure 5.1 Likelihood functions for the probability of success (curved line) and for the proportion of successes in finite populations under a Bernoulli probability model (population size N = 393, black dots; N = 50, white dots). The sample of n = 30 contains 10 successes.
An alternative derivation is based more directly on the law of likelihood: a hypothesis asserting that the actual population proportion equals t_U/N is easily shown to imply (under the Bernoulli trial model) that the probability of observing a particular sample vector consisting of t_s ones and n − t_s zeros is proportional to the hypergeometric probability

\[
\binom{t_U}{t_s}\binom{N-t_U}{n-t_s}\Big/\binom{N}{n}.
\]
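A short computational sketch evaluates this discrete likelihood for the example above (N = 393, n = 30, t_s = 10) and extracts the 1/8 likelihood interval for the proportion:

```python
from math import comb

# Standardised likelihood L(t_U) proportional to C(t_U, t_s) C(N - t_U, n - t_s).
N, n, ts = 393, 30, 10

def L(tU):
    return comb(tU, ts) * comb(N - tU, n - ts)

support = range(ts, N - (n - ts) + 1)        # totals compatible with the sample
vals = {tU: L(tU) for tU in support}
Lmax = max(vals.values())
li = [tU for tU, v in vals.items() if v >= Lmax / 8]
print(max(vals, key=vals.get) / N)           # best supported proportion
print(min(li) / N, max(li) / N)              # the 1/8 likelihood interval
```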
Note that, for a fixed sample, the likelihood function for the population proportion converges, as N → ∞, to the likelihood function for the probability: for sequences of population totals with t_{1N}/N → θ_1 and t_{2N}/N → θ_2, the likelihood ratio converges to

\[
\frac{\theta_1^{t_s}(1-\theta_1)^{n-t_s}}{\theta_2^{t_s}(1-\theta_2)^{n-t_s}}.
\]

5.2.4 The probability of misleading evidence
According to the law of likelihood, the likelihood ratio measures the strength of the evidence in our sample supporting one value of the population proportion versus another. Next we examine the probability of observing misleading evidence. Suppose that the population total is actually t_1, so the proportion is p_1 = t_1/N. For an alternative value, p_2 = t_2/N, what is the probability of observing a sample that constitutes at least moderately strong evidence supporting p_2 over p_1? That is, what is the probability of observing a sample for which the likelihood ratio L(p_2)/L(p_1) ≥ 8? As we have just seen, the likelihood ratio is determined entirely by the sample total t_s, which has a hypergeometric probability distribution:

\[
\Pr(t_s = a) = \binom{t_1}{a}\binom{N-t_1}{n-a}\Big/\binom{N}{n}.
\]

Thus we can calculate the probability of observing a sample that represents at least moderately strong evidence in favour of p_2 over the true value p_1. This probability is shown, as a function of p_2, by the heavy lines in Figure 5.2 for a population of N = 393 in which the true proportion is p_1 = 100/393 = 0.254, or 25.4 per hundred.
Note how the probability of misleading evidence varies as a function of p_2. For alternatives very close to the true value (for example, p_2 = 101/393), the probability is zero. This is because no possible sample of 30 observations represents evidence supporting p_2 = 101/393 over p_1 = 100/393 by a factor as large as 8. This is true for all values of p_2 from 101/393 through 106/393. For the next larger value, p_2 = 107/393, the likelihood ratio supporting p_2 over p_1 can exceed 8, but this happens only when all 30 sample units have the trait, t_s = 30, and the probability of observing such a sample, when 100 out of the population of 393 actually have the trait, is very small (4 × 10^{-20}).

As the alternative, p_2, continues to increase, the probability of misleading evidence grows, reaching a maximum at values in an interval that includes 170/393 ≈ 0.43. For any alternative p_2 within this interval a sample in which 13 or more have the trait (t_s ≥ 13) gives a likelihood ratio supporting p_2 over the true value, 100/393, by a factor of 8 or more. The probability of observing such a sample is 0.0204. Next, as the alternative moves even farther from the true value, the probability of misleading evidence decreases. For example, at p_2 = 300/393 = 0.76 only samples with t_s ≥ 17 give likelihood ratios as large as 8 in favour of p_2, and the probability of observing such a sample is only 0.00015.
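The probabilities just quoted follow directly from the hypergeometric sampling distribution of t_s; a minimal sketch of the Figure 5.2 computation:

```python
from math import comb
from scipy.stats import hypergeom

# Probability of misleading evidence for alternatives p2 = t2/N when the true
# total is t1 = 100 in N = 393 and n = 30 units are sampled.
N, n, t1 = 393, 30, 100

def L(tU, ts):
    return comb(tU, ts) * comb(N - tU, n - ts)

def prob_misleading(t2, k=8):
    dist = hypergeom(N, t1, n)               # distribution of t_s under t1
    return sum(dist.pmf(ts) for ts in range(n + 1)
               if L(t2, ts) >= k * L(t1, ts))

print(prob_misleading(170))   # the text reports 0.0204
print(prob_misleading(300))   # the text reports 0.00015
```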
Figure 5.2 shows that when the true population proportion is p_1 = 100/393 the probability of misleading evidence (likelihood ratio ≥ 8) does not exceed 0.028 at any alternative, a limit much lower than the universal bound, which is 1/8 = 0.125. For other values of the true proportion the probability of misleading evidence shows the same behaviour: it equals zero near the true value, rises with increasing distance from that value, reaches a maximum, then decreases.

5.2.5 Evidence about the average count in a finite population
We have used the population size N = 393 in the above example for consistency with what follows, where we will examine the evidence in a sample from an actual population consisting of 393 short-stay hospitals (Royall and Cumberland, 1981) under a variety of models. Here the variate x is not a zero-one indicator, but a count: the number of patients discharged from a hospital in one month. We have observed the number of patients discharged from each of the n = 30 hospitals in a sample, and are interested in the total number of patients discharged from all 393 hospitals, or, equivalently, the average, x̄_U = Σ x_t / 393.
First we examine the evidence under a Poisson model: the counts x_1, x_2, ..., x_393 are modelled as realised values of iid Poisson(λ) random variables, X_1, X_2, ..., X_393. Under this model the likelihood function for the actual population mean, x̄_U, is proportional to the probability of the sample, given that X̄ = x̄_U, which for a sample whose mean is x̄_s is easily shown to be

\[
L(\bar{x}_U) \propto \binom{N\bar{x}_U}{n\bar{x}_s}\left(\frac{n}{N}\right)^{n\bar{x}_s}\left(1-\frac{n}{N}\right)^{N\bar{x}_U - n\bar{x}_s},
\qquad \bar{x}_U = \frac{n\bar{x}_s + j}{N},\ j = 0, 1, 2, \ldots
\]

Just as with the proportion and the probability in the first problem that we considered, we must be careful to distinguish between the actual population mean (average), x̄_U, and the mean (expected value) in the underlying probability model, E(X) = λ. And just as in the first problem, (i) the likelihood for the finite population mean, L(x̄_U), is free of the model parameter, λ, and (ii) for any given sample, as N → ∞ the likelihood function for the population mean, L(x̄_U), converges to the likelihood function for the expected value λ, which is proportional to λ^{n x̄_s} e^{−nλ}. That is, for any positive values x̄_1 and x̄_2, L(x̄_1)/L(x̄_2) → (x̄_1/x̄_2)^{n x̄_s} e^{−n(x̄_1 − x̄_2)}.
In our sample of n = 30 from the 393 hospitals the mean number of patients discharged per hospital is x̄_s = 24 103/30 = 803.4 and the likelihood function L(x̄_U) is shown by the dashed line in Figure 5.3.
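The whole function is easy to evaluate, since it is a binomial probability of the sample total given the population total; a sketch:

```python
import numpy as np
from scipy.stats import binom

# Likelihood for the population mean under the Poisson model: a binomial
# probability of the sample total n*xbar_s given the population total N*xbar_U.
N, n, sample_total = 393, 30, 24_103           # sample mean 803.4

totals = np.arange(sample_total, 400_000)      # candidate population totals
logL = binom.logpmf(sample_total, totals, n / N)
logL -= logL.max()                             # standardise the maximum to one
li = totals[logL >= np.log(1 / 8)]             # 1/8 likelihood interval
print(totals[logL.argmax()] / N)               # maximised near 803.4
print(li.min() / N, li.max() / N)
```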
The Poisson model for count data is attractive because it provides a simple explicit likelihood function for the population mean. But the fact that this model has only one parameter makes it too inflexible for most applications, and we will see that it is quite inappropriate for the present one. A more widely useful model is the negative binomial with parameters r and θ, in which the count has the same distribution as the number of failures before the rth success in iid Bernoulli(θ) trials. Within this more general model the Poisson(λ) distribution appears as the limiting case when the expected value r(1 − θ)/θ is fixed at λ, while r → ∞.
Under the negative binomial model the likelihood for the population mean x̄_U is free of θ, but it does involve the nuisance parameter r.

Figure 5.3 Likelihood functions for the expected number and the population mean number of patients discharged in the population of N = 393 hospitals under the negative binomial model. The dashed line shows the likelihood for the population mean under the Poisson model. Sample of n = 30 hospitals with mean 803.4 patients/hospital.
The profile likelihood, L_p(x̄_U) ∝ max_r L(x̄_U, r), is easily calculated numerically, and it too is shown in Figure 5.3 (the solid line). For this sample the more general negative binomial model gives a much more conservative evaluation of the evidence about the population mean, x̄_U, than the Poisson model.

Both of the likelihood functions in Figure 5.3 are correct. Each one properly represents the evidence about the population mean in this sample under the corresponding model. Which is the more appropriate? Recall that the negative binomial distribution approaches the Poisson as the parameter r approaches infinity. The sample itself constitutes extremely strong evidence supporting small values of r (close to one) over the very large values which characterise an approximate Poisson distribution. This is shown by the profile likelihood function for r, or equivalently, what is easier to draw and interpret, the profile likelihood for 1/√r, which places the Poisson distribution (r = ∞) at the origin. This latter function is shown in Figure 5.4. For comparison, Figure 5.4 also shows the profile likelihood function for 1/√r generated by a sample of 30 independent observations from a Poisson probability distribution whose mean equals the hospital sample mean. The evidence for 'extra-Poisson variability' in the sample of hospital discharge counts is very strong indeed, with values of 1/√r near one supported over values near zero (the Poisson model) by enormous factors.
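The profile likelihood for 1/√r is straightforward to compute for an iid sample of counts, since for fixed r the mean parameter is profiled out at the sample mean. The sketch below uses synthetic counts (the hospital data themselves are not reproduced here):

```python
import numpy as np
from scipy.stats import nbinom, poisson

rng = np.random.default_rng(7)
x = rng.negative_binomial(1.2, 0.0015, size=30)   # synthetic overdispersed counts

def profile_loglik(r):
    p = r / (r + x.mean())                        # MLE of p for fixed r
    return nbinom.logpmf(x, r, p).sum()

grid = np.linspace(0.05, 2.0, 200)                # values of 1/sqrt(r)
ll = np.array([profile_loglik(1 / u**2) for u in grid])
ll_poisson = poisson.logpmf(x, x.mean()).sum()    # the limiting case r = infinity
print(grid[ll.argmax()])                          # best supported 1/sqrt(r)
print(np.exp(ll.max() - ll_poisson))              # likelihood ratio vs Poisson
```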
Under the previous models (Bernoulli and Poisson), for a fixed sample the likelihood function for the finite population mean converges, as the population size grows, to the likelihood for the expected value. Whether the same result applies to the profile likelihood function under the negative binomial model is an interesting outstanding question.
Figure 5.4 Profile likelihood functions for 1/√r under the negative binomial model for two samples: (1) numbers of patients discharged from 30 hospitals, and (2) 30 iid Poisson variables.
5.2.6 Evidence about a population mean under a regression model
For each of the 393 hospitals in this population we know the value of a potentially useful covariate: the number of beds, z. For our observed sample of n = 30 counts, let us examine the evidence about the population mean number of patients discharged, x̄_U, under a model that includes this additional information about hospital size. The nature of the variables, as well as inspection of a scatterplot of the sample, suggests that as a first approximation we consider a proportional regression model (expected number of discharges proportional to the number of beds, z), with variance increasing with z: E(X) = βz and var(X) = σ²z. Although a negative binomial model with this mean and variance structure would be more realistic for count data such as these, we will content ourselves for now with the simple and convenient normal regression model.
Under this model the likelihood function for the population mean depends on the nuisance parameter σ, but not the slope β:

\[
L(\bar{x}_U, \sigma^2) \propto \Pr(\text{sample} \mid \bar{X}_U = \bar{x}_U;\, \beta, \sigma^2)
\propto \sigma^{-n} \exp\left\{-\frac{1}{2\sigma^2}\left[\sum_{t\in s}\frac{x_t^2}{z_t} + \frac{(N\bar{x}_U - t_s)^2}{z_U - z_s} - \frac{(N\bar{x}_U)^2}{z_U}\right]\right\},
\]

where t_s = Σ_{t∈s} x_t, z_s = Σ_{t∈s} z_t and z_U = Σ_{t∈U} z_t. The total number of discharges in this population is actually t_U = 320 159, so the true mean is 814.65. Figures 5.3 and 5.5 show excellent results, with the exception of the clearly inappropriate Poisson model in Figure 5.3, whose overstatement of the evidence resulted in its 1/8 likelihood interval excluding the true value.
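For completeness, the following sketch evaluates the profile likelihood (maximised over σ²) implied by the expression above; the bed counts and discharge counts here are synthetic placeholders, not the hospital data, and the quadratic form is the one given in the reconstruction above:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 393, 30
z = rng.uniform(50, 1000, size=N)                  # hypothetical bed counts
z_s = z[:n]
x_s = 1.5 * z_s + rng.normal(0, 3 * np.sqrt(z_s))  # hypothetical sample counts
t_s, zU, zr = x_s.sum(), z.sum(), z[n:].sum()

def Q(xbar_U):                                     # quadratic form in the exponent
    T = N * xbar_U
    return np.sum(x_s**2 / z_s) + (T - t_s)**2 / zr - T**2 / zU

grid = np.linspace(0.7 * x_s.mean(), 1.3 * x_s.mean(), 400)
logLp = -(n / 2) * np.log([Q(v) for v in grid])    # profile over sigma^2
print(grid[np.argmax(logLp)])                      # maximising value of xbar_U
print((t_s / z_s.sum()) * zU / N)                  # the familiar ratio estimator
```

The profile likelihood is maximised at the familiar ratio estimator of x̄_U.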
THE EVIDENCE IN A SAMPLE FROM A FINITE POPULATION 69
Figure 5.5 Likelihood for the population mean number of patients discharged in the population of N = 393 hospitals: normal regression model. Sample of n = 30 hospitals with mean 803.4 patients/hospital.

5.3 DEFINING THE LIKELIHOOD FUNCTION FOR A FINITE POPULATION
Bjørnstad (1996) proposed a general definition of the 'likelihood function' that, in the case of a finite population total or mean, differs from the one derived from the law of likelihood, which we have adopted in this chapter. In our first example (a population of N under the Bernoulli(θ) probability model) a sample of n units in which the number with the trait is t_s is evidence about θ. That evidence is represented by the likelihood function

\[
L(\theta) \propto \theta^{t_s}(1-\theta)^{n-t_s}.
\]

The sample is also evidence about the population total t_U, represented by the likelihood function

\[
L(t_U) \propto \binom{t_U}{t_s}\binom{N-t_U}{n-t_s}.
\]

Bjørnstad's definition gives instead the conditional probability of the total, given the observed part of the sample, namely the model probability that T = t_U:

\[
L_B(t_U; \theta) = \Pr(T = t_U \mid t_s; \theta)
= \binom{N-n}{t_U-t_s}\theta^{t_U-t_s}(1-\theta)^{N-n-t_U+t_s}.
\]

Suppose, for example, that N = 100 and that t_s = 10 units in a sample of n = 50 have the trait, and consider the evidence supporting the hypothesis that 20% of the population have the trait (population total t_U = 20) versus the hypothesis that 50% have it (t_U = 50): L(20)/L(50) = 1.88 × 10^8. This is what the law of likelihood says, because the hypothesis that t_U = 20 implies that the probability of our sample is 1.88 × 10^8 times greater than the probability implied by the hypothesis that t_U = 50, regardless of the value of θ. Compare this likelihood ratio with the values of the Bjørnstad function:
\[
L_B(20; \theta)/L_B(50; \theta) = \bigl[(1-\theta)/\theta\bigr]^{30}.
\]

This is the ratio of the conditional probabilities, Pr(T = 20 | t_s = 10; θ)/Pr(T = 50 | t_s = 10; θ). It represents, not the evidence in the observations, but a synthesis of that evidence and the probabilities Pr(T = 20; θ) and Pr(T = 50; θ), and it is strongly influenced by these 'prior' probabilities. For example, when the parameter θ equals 1/2 the Bjørnstad ratio is L_B(20; 0.5)/L_B(50; 0.5) = 1. If this were a likelihood ratio, its value, 1, would signify that when θ is 1/2 the sample represents evidence of no strength at all in support of t_U = 20 over t_U = 50, which is quite wrong. The Bjørnstad function's ratio of 1 results from the fact that the very strong evidence in favour of t_U = 20 in the sample, L(20)/L(50) = 1.88 × 10^8, is the exact reciprocal of the very large probability ratio in favour of t_U = 50 that is given by the model independently of the empirical evidence:

\[
\frac{\Pr(T = 50;\ \theta = 0.5)}{\Pr(T = 20;\ \theta = 0.5)} = \binom{100}{50}\Big/\binom{100}{20} = 1.88 \times 10^8.
\]

The Bjørnstad function represents something different, a synthesis of the empirical evidence with model probabilities. Whatever its virtues and potential uses in estimation and prediction, the function L_B(t_U; θ) does not represent what we seek to measure and communicate, namely, the evidence about t_U in the sample.
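The numbers in this example can be checked directly; a minimal sketch, assuming the example's population and sample sizes are N = 100 and n = 50 (with t_s = 10), values which reproduce all the quoted figures:

```python
from math import comb

N, n, ts = 100, 50, 10

def L(tU):                     # likelihood from the law of likelihood
    return comb(tU, ts) * comb(N - tU, n - ts)

def LB(tU, theta):             # Bjornstad's function: Pr(T = tU | t_s; theta)
    k = tU - ts
    return comb(N - n, k) * theta**k * (1 - theta)**(N - n - k)

print(L(20) / L(50))                    # about 1.88e8
print(LB(20, 0.5) / LB(50, 0.5))        # exactly 1
print(comb(100, 50) / comb(100, 20))    # prior ratio, about 1.88e8
```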
PART B
Categorical Response Data
(i) analysis of tables;
(ii) analysis of unit-level data.
In the first case, the analysis effectively involves two stages. First, a table is constructed from the unit-level data. This will usually involve the cross-classification of two or more categorical variables. The elements of the table will typically consist of estimated proportions. These estimated proportions, together perhaps with associated estimates of the variances and covariances of these estimated proportions, may be considered as 'sufficient statistics', carrying all the relevant information in the data. This tabular information will then be analysed in some way, as the second stage of the analysis.
In the second case, there is no intermediate step of constructing a table. The basic unit-level data are analysed directly. A simple example arises in fitting a logistic regression model with a binary response variable and a single continuous covariate. Because the covariate is continuous it is not possible to construct sufficient statistics for the unit-level data in the form of a standard table. Instead the data are analysed directly. This approach may be viewed as a generalisation of the first. We deal with these two cases separately in the following two sections.
6.2 ANALYSIS OF TABULAR DATA
The analysis of tabular data builds most straightforwardly on methods of descriptive surveys. Tables are most commonly formed by cross-classifying categorical variables, with the elements of the table consisting of proportions. These may be the (estimated) proportions of the finite population falling into the various cells of the table. Alternatively, the proportions may be domain proportions, either because the table refers only to units falling into a given domain or, for example in the case of a two-way table, because the proportions of interest are row proportions or column proportions. In all these cases, the estimation of either population proportions or domain proportions is a standard topic in descriptive sample surveys (Cochran, 1977, Ch. 3).
6.2.1 One-way classification
Consider first a one-way classification, formed from a single categorical variable, Y, taking possible values i = 1, ..., I. For a finite population of size N, we let N_i denote the number of units in the population with Y = i. Assuming then that the categories of Y are mutually exclusive and exhaustive, we have Σ N_i = N. In categorical data analysis it is more common to analyse the population proportions corresponding to the population counts N_i than the population counts themselves. These (finite) population proportions are given by N_i/N, i = 1, ..., I, and may be interpreted as the probabilities of falling into the different categories of Y for a randomly selected member of the population (Bishop, Fienberg and Holland, 1975, p. 9). Instead of considering finite population proportions, it may be of more scientific interest to consider corresponding model probabilities in a superpopulation model. Suppose, for example, that the vector (N_1, ..., N_I) is assumed to be the realisation of a multinomial random variable with parameters N and (μ_1, ..., μ_I), so that the probability μ_i corresponds to the finite population proportion N_i/N. Suppose also, for example, that it is of interest to test whether two categories are equiprobable. Then it might be argued that it would be more appropriate to test the equality of two model probabilities than the equality of two finite population proportions, because the latter proportions would not be exactly equal in a finite population except by rare chance (Cochran, 1977, p. 39). Rao and Thomas use the same notation μ_i in Chapter 7 to denote either the finite population proportion N_i/N or the corresponding model probability. It is assumed that the analyst will have specified which one of these parameters is of interest, dependent upon the purpose of the analysis. Note that in either case Σ μ_i = 1.
One reason why it is convenient to use the same notation μ_i for both the finite population and model parameters is that it is natural to use the same point estimator for each. A common general approach in descriptive surveys is to estimate the ith proportion (or probability) by the design-weighted sample proportion

\[
\hat{\mu}_i = \sum_{t \in s} w_t I(y_t = i) \Big/ \sum_{t \in s} w_t, \tag{6.1}
\]

where w_t is the survey weight attached to sample unit t and I(·) is the indicator function.
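A two-line sketch of this estimator (the weights and categories are illustrative):

```python
import numpy as np

# Design-weighted category proportions, as in (6.1).
y = np.array([1, 2, 1, 3, 2, 1, 3, 3, 1, 2])       # category of Y per sample unit
w = np.array([10, 5, 8, 12, 5, 9, 7, 11, 10, 6.])  # survey weights w_t

mu_hat = {i: w[y == i].sum() / w.sum() for i in np.unique(y)}
print(mu_hat)    # estimated proportions; they sum to one by construction
```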
6.2.2 Multi-way classifications and log-linear models
The one-way classification above, based upon the single categorical variable Y, may be extended to a multi-way classification, formed by cross-classifying two or more categorical variables. Consider, for example, a 2 × 2 table, formed by cross-classifying two binary variables, A and B, each taking the values 1 or 2. The four cells of the table might then be labelled (a, b), a = 1, 2; b = 1, 2. For the purpose of defining log-linear models, as discussed by Rao and Thomas in Chapter 7, these cells may alternatively be labelled just with a single index i = 1, ..., 4. We shall follow the convention, adopted by Rao and Thomas, of letting the index i correspond to the lexicographic ordering of the cells, as illustrated in Table 6.1.
The same approach may be adopted for tables of three or more dimensions. In general, we suppose the cells are labelled i = 1, ..., I in lexicographic order, where I denotes the number of cells in the table. The cell proportions (or probabilities) μ_i are then gathered together into the I × 1 vector μ = (μ_1, ..., μ_I)′. A model for the table may then be specified by representing μ as a function of an r × 1 vector θ of parameters, that is, writing μ = μ(θ). Because the vector μ is subject to the constraint Σ μ_i = μ′1 = 1, where 1 is the I × 1 vector of ones, the maximum necessary value of r is I − 1. For, in this case, we could define a saturated model, with θ as the first I − 1 elements of μ and the Ith element of μ expressed in terms of θ using μ′1 = 1.

Table 6.1 Alternative labels for cells of 2 × 2 table

Bivariate labels (a, b), ordered lexicographically:  (1, 1)  (1, 2)  (2, 1)  (2, 2)
Single-index labels i:                               1       2       3       4
Consider, for example, the model of independence in the 2 × 2 table discussed above. Denoting the probability of falling into cell (a, b) as μ^{AB}_{(a,b)}, the independence of A and B implies that μ^{AB}_{(a,b)} may be expressed as a product of marginal probabilities, μ^{AB}_{(a,b)} = μ^A_{(a)} μ^B_{(b)}. Then, with the correspondence between i and (a, b) in Table 6.1, we may express the logarithms of the cell probabilities in a matrix equation as
\[
\begin{pmatrix} \log \mu_1 \\ \log \mu_2 \\ \log \mu_3 \\ \log \mu_4 \end{pmatrix}
= u(\theta)\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
+ \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ -1 & 1 \\ -1 & -1 \end{pmatrix}
\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}
= u(\theta)1 + X\theta,
\]

where θ_1 and θ_2 are the main effects of A and B (written here with the usual ±1 coding, one standard choice for the 4 × 2 design matrix X) and u(θ) is a normalising constant determined by the constraint μ′1 = 1. Under multinomial sampling, the maximum likelihood estimator (MLE) of θ is obtained by solving the likelihood equations

\[
X'\mu(\theta) = X'\hat{\mu},
\]
Trang 18where ^m is the (unweighted) vector of sample cell proportions (Agresti, 1990,section 6.4) In the case of complex sampling schemes, the multinomial assump-tion will usually be unreasonable and the MLE may be inconsistent for y.Instead, Rao and Thomas (Chapter 7) discuss how y may be estimated consist-ently by the pseudo-MLE of y, obtained by solving the same equations, but with
^m replaced by (^m1, , ^mI), where ^miis a design-consistent estimator of misuch as
in (6.1) above
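The pseudo-MLE computation is a small Newton iteration. A sketch for the 2 × 2 independence model above, taking some design-consistent cell estimates μ̂ as given:

```python
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1.]])   # independence model design
mu_hat = np.array([0.30, 0.25, 0.25, 0.20])           # design-based estimates

def mu(theta):
    e = np.exp(X @ theta)
    return e / e.sum()            # u(theta) enforces the sum-to-one constraint

theta = np.zeros(2)
for _ in range(25):               # Newton's method for X' mu(theta) = X' mu_hat
    m = mu(theta)
    J = X.T @ (np.diag(m) - np.outer(m, m)) @ X       # Jacobian of X' mu(theta)
    theta += np.linalg.solve(J, X.T @ (mu_hat - m))

print(theta, mu(theta))           # pseudo-MLE and fitted cell probabilities
```

With the unweighted sample proportions in place of μ̂, the same iteration returns the ordinary MLE.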
Let us next consider the estimation of standard errors or, more generally, the variance-covariance matrix of the pseudo-MLE of θ. Under multinomial sampling, the usual estimated covariance matrix of the MLE θ̂ is given by (Agresti, 1990, section 6.4.1; Rao and Thomas, 1988)

\[
n^{-2}\bigl[X'V(\hat{\theta})X\bigr]^{-1},
\]

where n is the sample size upon which the table is based and V(θ) is the covariance matrix of μ̂ under multinomial sampling (with the log-linear model holding):

\[
V(\theta) = n^{-1}\bigl[D(\mu(\theta)) - \mu(\theta)\mu(\theta)'\bigr],
\]

where D(μ) denotes the I × I diagonal matrix with diagonal elements μ_i. Replacing the MLE of θ by its pseudo-MLE in this expression will generally fail to account adequately for the effect of a complex sampling design on standard errors. Instead it is necessary to use an estimator of the variance-covariance matrix of μ̂ which makes appropriate allowance for the actual complex design. Such an estimator is denoted Σ̂ by Rao and Thomas (Chapter 7). There are a variety of standard methods from descriptive surveys which may be used to obtain such an estimator; see for example Lehtonen and Pahkinen (1996, Ch. 6).
Rao and Thomas (Chapter 7) focus on methods for testing hypotheses about θ. Many hypotheses of interest in contingency tables may be expressed as a hypothesis about θ in a log-linear model. Rao and Thomas consider the impact of complex sampling schemes on the distribution of standard Pearson and likelihood ratio test statistics and discuss the construction of Rao-Scott adjustments to these standard procedures. As in the construction of appropriate standard errors, these adjustments involve the use of the Σ̂ matrix. Rao and Thomas also consider alternative test procedures, such as the Wald test, which also make use of Σ̂, and make comparisons between the different methods.
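As a concrete, simplified instance, the sketch below applies a first-order Rao-Scott-type correction to a Pearson goodness-of-fit statistic for a one-way table, using an illustrative Σ̂ with a common design effect of 1.8 (the procedures in Chapter 7 are more general):

```python
import numpy as np

mu0 = np.array([0.25, 0.25, 0.25, 0.25])       # hypothesised proportions
mu_hat = np.array([0.30, 0.26, 0.24, 0.20])    # design-based estimates
n_sample = 500
Sigma_hat = 1.8 * (np.diag(mu0) - np.outer(mu0, mu0)) / n_sample  # deff = 1.8

X2 = n_sample * np.sum((mu_hat - mu0)**2 / mu0)           # Pearson statistic
delta = n_sample * np.trace(Sigma_hat / mu0[:, None]) / (len(mu0) - 1)
print(X2, X2 / delta)     # uncorrected and first-order corrected statistics
```

Here δ is the average generalised design effect; with Σ̂ as above it equals 1.8, so the correction simply deflates X² by that factor.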
Trang 196.2.3 Logistic models for domain proportions
A log-linear model for a cross-classification treats the cross-classifying variables symmetrically. In many applications, however, one of these variables will be considered as the response and others as the explanatory variables (also called factors). The simplest case involves the cross-classification of a binary response variable Y by a single categorical variable X. Let the categories of Y be labelled as 0 and 1, the categories (also called levels) of X as 1, ..., I, and let N_{ji} be the number of population units with X = i and Y = j. In order to study how Y depends upon X, it is natural to compare the proportions N_{1i}/N_i, where N_i = N_{0i} + N_{1i}, for i = 1, ..., I. For example, suppose Y denotes smoking status (1 if smoker, 0 if non-smoker) and X denotes age group. Then N_{1i}/N_i is the proportion of people in age group i who smoke. If these proportions are equal then there is no dependence of smoking status on age group. On the other hand, if the proportions vary then we may wish to model how the N_{1i}/N_i depend upon i.
In contingency table terminology, if Y and X define the rows and columns of a two-way table, then the N_{1i}/N_i are column proportions, as compared with the N_{ji}/N (N = Σ N_i), used to define a log-linear model, which are cell proportions. In survey sampling terminology, the subsets of the population defined by the categories of X are referred to as domains, and the N_{1i}/N_i are referred to as domain proportions. See Section 7.4, where the notation μ_{1i} = N_{1i}/N_i is used.
In a more general setting, there may be several categorical explanatory variables (factors). For example, we may be interested in how smoking status depends upon age group, gender and social class. In this case it is still possible to represent the cells in the cross-classification of these factors by a single index i = 1, ..., I, by using a lexicographic ordering, as in Table 6.1. For example, if age group, gender and social class have five, two and five categories (levels) respectively, then I = 5 × 2 × 5 = 50 domains i are required. In this case, each domain i refers to a cell in the cross-classification of the explanatory factors, i.e. a specific combination of the levels of the factors.
The dependence of a binary response variable on one or more explanatory variables may then be represented by modelling the dependence of the μ_{1i} on the factors defining the domains. Just as μ_i could refer to a finite population proportion or a model probability, depending upon the scientific context, μ_{1i} may refer to either a domain proportion N_{1i}/N_i or a corresponding model probability. For example, it might be assumed that N_{1i} is generated by a binomial superpopulation model (conditional on N_i): N_{1i} | N_i ~ Bin(N_i, μ_{1i}). A common model for the dependence of μ_{1i} on the explanatory factors is a logistic regression model, which is specified by Rao and Thomas in Section 7.4 as

\[
\log\bigl[\mu_{1i}/(1-\mu_{1i})\bigr] = x_i'\theta,
\]

where x_i is an m × 1 vector of known constants, which may depend upon the factor levels defining domain i, and θ is an m × 1 vector of regression parameters. In general m ≤ I, and the model is saturated if m = I. For example, suppose that there are I = 4 domains defined by the cross-classification of
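A sketch of fitting the logistic model just specified from estimated domain proportions, by Newton's method on a weighted binomial score (all values illustrative, with m = 2 and I = 4):

```python
import numpy as np

X = np.array([[1, 0], [1, 1], [1, 2], [1, 3.]])   # x_i for I = 4 domains, m = 2
p_hat = np.array([0.12, 0.20, 0.33, 0.45])        # estimated proportions N_1i/N_i
w = np.array([800, 650, 700, 500.])               # estimated domain sizes N_i

theta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ theta))              # current fitted mu_1i
    grad = X.T @ (w * (p_hat - p))                # weighted score
    H = X.T @ (w * p * (1 - p) * X.T).T           # weighted information matrix
    theta += np.linalg.solve(H, grad)

print(theta)                          # fitted regression parameters
print(1 / (1 + np.exp(-X @ theta)))   # fitted domain proportions mu_1i
```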