CHAPTER 3 Models for Data
3.1 Statistical Models
Many statistical analyses are based on a specific model for a set of data, where this consists of one or more equations that describe the observations in terms of parameters of distributions and random variables. For example, a simple model for the measurement X made by an instrument might be

X = θ + ε,

where θ is the true value of what is being measured, and ε is a measurement error that is equally likely to be anywhere in the range from -0.05 to +0.05.
In situations where a model is used, an important task for the data analyst is to select a plausible model and to check, as far as possible, that the data are in agreement with this model. This includes examining both the form of the equation assumed, and the distribution or distributions that are assumed for the random variables.

To aid in this type of modelling process there are many standard distributions available, the most important of which are considered in the following two sections of this chapter. In addition, there are some standard types of model that are useful for many sets of data. These are considered in the later sections of this chapter.
3.2 Discrete Statistical Distributions
A discrete distribution is one for which the random variable being considered can only take on certain specific values, rather than any value within some range (Appendix Section A2). By far the most common situation in this respect is where the random variable is a count and the possible values are 0, 1, 2, 3, and so on.
It is conventional to denote a random variable by a capital X and a particular observed value by a lower case x. A discrete distribution is then defined by a list of the possible values x1, x2, x3, ..., for X, and the probabilities P(x1), P(x2), P(x3), ... for these values. Of necessity,

P(x1) + P(x2) + P(x3) + ... = 1,

i.e., the probabilities must add to 1. Also of necessity, P(xi) ≥ 0 for all i, with P(xi) = 0 meaning that the value xi can never occur. Often there is a specific equation for the probabilities defined by a probability function

P(x) = Prob(X = x),

where P(x) is some function of x.
The mean of a random variable is sometimes called the expected value, and is usually denoted either by µ or E(X). It is the sample mean that would be obtained for a very large sample from the distribution, and it is possible to show that this is equal to

E(X) = Σ xiP(xi) = x1P(x1) + x2P(x2) + x3P(x3) + ... (3.1)

The variance of a discrete distribution is equal to the sample variance that would be obtained for a very large sample from the distribution. It is often denoted by σ², and it is possible to show that this is equal to

σ² = Σ (xi - µ)²P(xi) = (x1 - µ)²P(x1) + (x2 - µ)²P(x2) + ... (3.2)
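As a small numerical sketch of these definitions, the mean and variance of a discrete distribution can be computed directly from its list of values and probabilities. The distribution below (values 0 to 3 with the stated probabilities) is invented for illustration; the calculation follows equation (3.1) and the corresponding variance formula.

```python
# Mean and variance of a small discrete distribution.
# The values and probabilities here are assumed, for illustration only.

values = [0, 1, 2, 3]
probs = [0.1, 0.3, 0.4, 0.2]

assert abs(sum(probs) - 1.0) < 1e-12  # probabilities must add to 1

mean = sum(x * p for x, p in zip(values, probs))        # E(X), as in eq. (3.1)
variance = sum((x - mean) ** 2 * p
               for x, p in zip(values, probs))          # the variance formula

print(mean, variance)
```

Running this gives a mean of 1.7 and a variance of 0.81 for the assumed distribution; any other list of values and probabilities summing to one can be substituted.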
The Hypergeometric Distribution
The hypergeometric distribution arises when a random sample of size n is taken from a population of N units. If the population contains R units with a certain characteristic, then the probability that the sample will contain exactly x units with the characteristic is
P(x) = RCx N-RCn-x / NCn, for x = 0, 1, ..., Min(n,R), (3.3)
where aCb denotes the number of combinations of a objects taken b at a time. The proof of this result will be found in many elementary statistics texts. A random variable with the probabilities of different values given by equation (3.3) is said to have a hypergeometric distribution. The mean and variance are

µ = nR/N (3.4)

and

σ² = nR(N - R)(N - n)/{N²(N - 1)}. (3.5)
Figure 3.1(a) shows examples of probabilities calculated for some particular hypergeometric distributions.
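The probabilities in equation (3.3) can be evaluated directly from combination counts. The sketch below uses Python's standard-library `math.comb` for the aCb terms; the particular numbers N = 20, R = 8, and n = 5 are assumed for illustration.

```python
from math import comb

def hypergeom_pmf(x, N, R, n):
    """Probability of exactly x units with the characteristic in a
    sample of n drawn without replacement from N units, of which R
    have the characteristic (equation 3.3)."""
    return comb(R, x) * comb(N - R, n - x) / comb(N, n)

# Assumed illustrative numbers: population of 20 units, 8 with the
# characteristic, sample of 5.
N, R, n = 20, 8, 5
probs = [hypergeom_pmf(x, N, R, n) for x in range(min(n, R) + 1)]
print(probs)
```

The probabilities sum to one over x = 0, ..., Min(n, R), and their mean works out to nR/N = 2 for these numbers, consistent with the formula for the mean of the distribution.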
The Binomial Distribution
Suppose that it is possible to carry out a certain type of trial, and that when this is done the probability of observing a positive result is always p for each trial, irrespective of the outcome of any other trial. Then if n trials are carried out, the probability of observing exactly x positive results is given by the binomial distribution

P(x) = nCx p^x (1 - p)^(n-x), for x = 0, 1, 2, ..., n. (3.6)

The mean and variance of this distribution are

µ = np (3.7)

and

σ² = np(1 - p). (3.8)
Figure 3.1 Examples of (a) hypergeometric, (b) binomial, and (c) Poisson discrete probability distributions.
An example of this distribution occurs with the situation described in Example 1.3, which was concerned with the use of mark-recapture methods to estimate survival rates of salmon in the Snake and Columbia Rivers in the Pacific Northwest of the United States. In that setting, if n fish are tagged and released into a river, and each fish has a probability p of being recorded while passing a detection station downstream, then the probability of recording a total of exactly x fish downstream is given by equation (3.6).
Figure 3.1(b) shows some examples of probabilities calculated for some particular binomial distributions.
The Poisson Distribution
One derivation of the Poisson distribution is as the limiting form of the binomial distribution as n tends to infinity and p tends to zero, with the mean µ = np remaining constant. More generally, however, it is possible to derive it as the distribution of the number of events in a given interval of time or a given area of space when the events occur at random, independently of each other, at a constant mean rate. The probability function is

P(x) = exp(-µ)µ^x / x!, for x = 0, 1, 2, ... (3.9)

The mean and variance are both equal to µ.
In terms of events occurring in time, the type of situation where a Poisson distribution might occur is for counts of the number of occurrences of minor oil leakages in a region per month, or the number of cases per year of a rare disease in the same region. For events occurring in space, a Poisson distribution might occur for the number of rare plants found in randomly selected metre square quadrats taken from a large area. In reality, though, counts of these types often display more variation than is expected for the Poisson distribution because of some clustering of the events. Indeed, the ratio of the variance of sample counts to the mean of the same counts, which should be close to one for a Poisson distribution, is sometimes used as an index of the extent to which events do not occur independently of each other.
Figure 3.1(c) shows some examples of probabilities calculated for some particular Poisson distributions.
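Both the Poisson probability function of equation (3.9) and the variance-to-mean ratio just described are straightforward to compute. The counts below are invented purely to illustrate the dispersion index; a ratio well above one would suggest clustering.

```python
import math

def poisson_pmf(x, mu):
    """P(x) = exp(-mu) mu^x / x!, equation (3.9)."""
    return math.exp(-mu) * mu**x / math.factorial(x)

# Variance-to-mean ratio of sample counts as an index of clustering.
# These counts are invented for illustration, not real data.
counts = [0, 2, 1, 0, 5, 1, 0, 3, 0, 4]
m = sum(counts) / len(counts)
v = sum((c - m) ** 2 for c in counts) / (len(counts) - 1)
print(v / m)  # ratios well above 1 suggest clustering of events
```

For these invented counts the ratio is a little over two, so a Poisson model (which would give a ratio near one) would look doubtful.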
3.3 Continuous Statistical Distributions
Continuous distributions are often defined in terms of a probability density function, f(x), which is a function such that the area under the plotted curve between two limits a and b gives the probability of an observation within this range, as shown in Figure 3.2. This area is also the integral between a and b, so that in the usual notation of calculus

Prob(a < X < b) = ∫[a,b] f(x) dx. (3.10)

The total area under the curve must be exactly one, and f(x) must be greater than or equal to zero over the range of possible values of x for the distribution to make sense.
The mean and variance of a continuous distribution are the sample mean and variance that would be obtained for a very large random sample from the distribution. In calculus notation the mean is

µ = ∫ x f(x) dx,

where the range of integration is over the possible values of x. This is also sometimes called the expected value of the random variable X, and denoted by E(X). Similarly, the variance is

σ² = ∫ (x - µ)² f(x) dx,

where again the integration is over the possible values of x.
Figure 3.2 The probability density function f(x) for a continuous distribution. The probability of a value between a and b is the area under the curve between these values, i.e., the area between the two vertical lines at x = a and x = b.
The continuous distributions that are described here are ones that often occur in environmental and other applications of statistics. See Johnson and Kotz (1970a, 1970b) for details about many more continuous distributions.
The Exponential Distribution

The probability density function for the exponential distribution with mean µ is

f(x) = (1/µ)exp(-x/µ), for x ≥ 0.
Figure 3.3 Examples of probability density functions for exponential distributions.
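The defining properties of a density function, that it integrates to one and that µ = ∫ x f(x) dx, can be checked numerically. The sketch below does this for an exponential density with mean µ = 2 (an assumed value), using a simple trapezoidal rule over a grid that runs well past the tail.

```python
import math

def exp_pdf(x, mu):
    """Exponential density with mean mu: f(x) = (1/mu) exp(-x/mu),
    for x >= 0."""
    return math.exp(-x / mu) / mu

# Numerical check of the area and mean integrals for mu = 2 (assumed),
# using the trapezoidal rule on a grid from 0 to 40.
mu = 2.0
xs = [i * 0.001 for i in range(40001)]
area = sum((exp_pdf(a, mu) + exp_pdf(b, mu)) / 2 * (b - a)
           for a, b in zip(xs, xs[1:]))
mean = sum((a * exp_pdf(a, mu) + b * exp_pdf(b, mu)) / 2 * (b - a)
           for a, b in zip(xs, xs[1:]))
print(area, mean)  # close to 1 and to mu respectively
```

The small discrepancies come only from the truncated tail and the step size of the grid, not from the distribution itself.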
The Normal or Gaussian Distribution
The normal or Gaussian distribution with a mean of µ and a standard deviation of σ has the probability density function

f(x) = {1/√(2πσ²)} exp{-(x - µ)²/(2σ²)}, for -∞ < x < +∞. (3.13)
This distribution is discussed in Section A2 of Appendix A, and the form of the probability density function is illustrated in Figure A1. The normal distribution is the 'default' that is often assumed for a distribution that is known to have a symmetric bell-shaped form, at least roughly. It is often observed for biological measurements such as the height of humans, and it can be shown theoretically (through something called the central limit theorem) that the normal distribution will tend to result whenever the variable being considered consists of a sum of contributions from a number of other distributions. In particular, mean values, totals, and proportions from simple random samples will often be approximately normally distributed, which is the basis for the approximate confidence intervals for population parameters that have been described in Chapter 2.
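The central limit effect described above is easy to see by simulation. In the sketch below, each observation is the sum of 12 independent Uniform(0, 1) values (an arbitrary choice), which has mean 6 and variance 1; the simulated sums behave approximately like a normal distribution, with roughly 95% of values within two standard deviations of the mean.

```python
import random
random.seed(1)  # fixed seed so the run is reproducible

# Each simulated observation is a sum of 12 Uniform(0, 1) values,
# so it has mean 6 and variance 1; by the central limit theorem the
# sums should be approximately normally distributed.
sums = [sum(random.random() for _ in range(12)) for _ in range(20000)]
m = sum(sums) / len(sums)
v = sum((s - m) ** 2 for s in sums) / (len(sums) - 1)
within_2sd = sum(abs(s - m) <= 2 * v ** 0.5 for s in sums) / len(sums)
print(m, v, within_2sd)  # roughly 6, 1, and 0.95
```

The fraction within two standard deviations comes out close to the normal-theory value of about 0.954, even though each component distribution is uniform rather than normal.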
The Lognormal Distribution
It is a characteristic of the distribution of many environmental variables that they are not symmetric like the normal distribution. Instead, there are many fairly small values and occasional extremely large values. This can be seen, for example, in the measurements of PCB concentrations that are shown in Table 2.3.

With many measurements only positive values can occur, and it turns out that the logarithm of the measurements has a normal distribution, at least approximately. In that case the distribution of the original measurements can be assumed to be a lognormal distribution, with probability density function
f(x) = [1/{x√(2πσ²)}] exp[-{loge(x) - µ}²/(2σ²)], for x > 0. (3.14)
Here µ and σ are the mean and standard deviation of the natural logarithm of the original measurement. The mean and variance of the original measurement itself are

E(X) = exp(µ + σ²/2) (3.15)

and

Var(X) = exp(2µ + σ²){exp(σ²) - 1}. (3.16)
Figure 3.4 shows some examples of probability density functions for three lognormal distributions.
Figure 3.4 Examples of lognormal distributions with a mean of 1.0. The standard deviations are 0.5, 1.0 and 2.0.
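Converting between the parameters of the logarithm and the moments of the original measurement is a common practical step with lognormal data. The sketch below evaluates the mean and variance formulas (the variance is equation 3.16) for assumed values µ = 0 and σ = 1.

```python
import math

def lognormal_moments(mu, sigma):
    """Mean and variance of X when log_e(X) is normal with mean mu
    and standard deviation sigma: E(X) = exp(mu + sigma^2/2) and
    Var(X) = exp(2*mu + sigma^2) * (exp(sigma^2) - 1), eq. (3.16)."""
    mean = math.exp(mu + sigma**2 / 2)
    var = math.exp(2 * mu + sigma**2) * (math.exp(sigma**2) - 1)
    return mean, var

# Assumed parameter values for illustration.
mean, var = lognormal_moments(0.0, 1.0)
print(mean, var)
```

Note that the mean of X exceeds exp(µ), the median, because of the long upper tail; this asymmetry is exactly the feature of environmental data that motivates the lognormal model.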
3.4 The Linear Regression Model
Linear regression is one of the most frequently used statistical tools. Its purpose is to relate the values of a single variable Y to one or more other variables X1, X2, ..., Xp, in an attempt to account for the variation in Y in terms of variation in the other variables. With only one other variable this is often referred to as simple linear regression.

The usual situation is that the data available consist of n observations y1, y2, ..., yn for the dependent variable Y, with corresponding values for the X variables. The model assumed is
y = β0 + β1x1 + β2x2 + ... + βpxp + ε, (3.17)
where ε is a random error with a mean of zero and a constant standard deviation σ. The model is estimated by finding the coefficients of the X values that make the error sum of squares as small as possible. In other words, if the estimated equation is

ŷ = b0 + b1x1 + b2x2 + ... + bpxp, (3.18)

then the b values are chosen so as to minimise

SSE = Σ(yi - ŷi)², (3.19)

where ŷi is the value given by the fitted equation that corresponds to the data value yi, and the sum is over the n data values. Statistical packages or spreadsheets are readily available to do these calculations.
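For the simple case of a single X variable, the least-squares coefficients of equations (3.17) and (3.18) have closed forms, and the error sum of squares can be evaluated directly. The data below are invented for illustration.

```python
# Simple linear regression (one X variable) by least squares.
# The x and y values are invented for illustration only.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# The slope that minimises the error sum of squares SSE:
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar  # intercept passes through (xbar, ybar)

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(b0, b1, sse)
```

Any other choice of b0 and b1 gives a larger SSE for these data; with several X variables the same minimisation is done by solving a system of linear equations, which is what packages and spreadsheets handle internally.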
There are various ways that the usefulness of a fitted regression equation can be assessed. One involves partitioning the variation observed in the Y values into parts that can be accounted for by the X values, and a part (SSE, above) which cannot be accounted for. To this end, the total variation in the Y values is measured by the total sum of squares

SST = Σ(yi - ȳ)²,

where ȳ is the mean of the y values.
Table 3.1 Analysis of variance table for a multiple regression analysis

Source of variation   Sum of squares   Degrees of freedom   Mean square           F
Regression            SSR              p                    MSR = SSR/p           MSR/MSE
Error                 SSE              n - p - 1            MSE = SSE/(n - p - 1)
Total                 SST              n - 1
There is sometimes value in considering the variation in Y that is accounted for by a variable Xj when this is included in the regression after some of the other variables are already in. Thus if the variables X1 to Xp are in the order of their importance, then it is useful to successively fit regressions relating Y to X1, Y to X1 and X2, and so on, up to Y related to all the X variables. The variation in Y accounted for by Xj after allowing for the effects of the variables X1 to Xj-1 is then given by the extra sum of squares accounted for by adding Xj to the model.
To be more precise, let SSR(X1, X2, ..., Xj) denote the regression sum of squares with variables X1 to Xj in the equation. Then the extra sum of squares accounted for by Xj on top of X1 to Xj-1 is

SSR(Xj | X1, X2, ..., Xj-1) = SSR(X1, X2, ..., Xj) - SSR(X1, X2, ..., Xj-1). (3.23)
On this basis, the sequential sums of squares shown in Table 3.2 can be calculated. In this table the mean squares are the sums of squares divided by their degrees of freedom, and the F-ratios are the mean squares divided by the error mean square. A test for the variable Xj being significantly related to Y, after allowing for the effects of the variables X1 to Xj-1, involves seeing whether the corresponding F-ratio is significantly large in comparison to the F-distribution with 1 and n - p - 1 degrees of freedom.
Table 3.2 Analysis of variance table for the extra sums of squares accounted for by variables as they are added into a multiple regression model one by one

Source of variation   Sum of squares             Degrees of freedom
X1                    SSR(X1)                    1
X2                    SSR(X2 | X1)               1
...                   ...                        ...
Xp                    SSR(Xp | X1, ..., Xp-1)    1
Error                 SSE                        n - p - 1
Total                 SST                        n - 1
If the X variables are uncorrelated, then the F-ratios indicated in Table 3.2 will be the same irrespective of the order in which the variables are entered into the regression. Usually, however, the X variables are correlated, and the order may be of crucial importance. This merely reflects the fact that with correlated X variables it is generally only possible to talk about the relationship between Y and Xj in terms of which of the other X variables are controlled for.
This has been a very brief introduction to the uses of multiple regression. It is a tool that is used for a number of applications later in this book. For a more detailed discussion see Manly (1992, Chapter 4), or one of the many books devoted to this topic (e.g., Neter et al., 1983, or Younger, 1985). Some further aspects of the use of this method are also considered in the following example.
Example 3.1 Chlorophyll-a in Lakes
The data for this example are part of a larger data set originally published by Smith and Shapiro (1981), and also discussed by Dominici et al. (1997). The original data set contains 74 cases, where each case consists of observations on the concentration of chlorophyll-a, phosphorus, and (in most cases) nitrogen at a lake at a certain time. For the present example, 25 of the cases were randomly selected from those where measurements on all three variables are present. This resulted in the values shown in Table 3.3.

Chlorophyll-a is a widely used indicator of lake water quality. It is a measure of the density of algal cells, and reflects the clarity of the water in a lake. High concentrations of chlorophyll-a are associated with high algal densities and poor water quality, a condition known as eutrophication. Phosphorus and nitrogen stimulate algal growth, and high values for these chemicals are therefore expected to be associated with high chlorophyll-a. The purpose of this example is to illustrate the use of multiple regression to obtain an equation relating chlorophyll-a to the other two variables.
The regression equation
CH = β0 + β1PH + β2NT + ε (3.24)
was fitted to the data in Table 3.3, where CH denotes chlorophyll-a, PH denotes phosphorus, and NT denotes nitrogen. This gave the fitted equation (3.25), with an R² value from equation (3.21) of 0.774. The equation was fitted using the regression option in a spreadsheet, which also provided estimated standard errors for the coefficients of SÊ(b1) = 0.046 and SÊ(b2) = 1.172.
Table 3.3 Values of chlorophyll-a, phosphorus and nitrogen taken from various lakes at various times

Case   Chlorophyll-a   Phosphorus   Nitrogen
For the t-test on the nitrogen coefficient, the probability of obtaining a value as far from zero as 1.02 is 0.317, which is quite large. Therefore there seems to be little evidence that chlorophyll-a is related to nitrogen.
This analysis seems straightforward, but there are in fact some problems with it. These problems are indicated by plots of the regression residuals, which are the differences between the observed concentrations of chlorophyll-a and the amounts that are predicted by the fitted equation (3.25). To show this it is convenient to use standardized residuals, which are the differences between the observed CH values and the values predicted from the regression equation, divided by the estimated standard deviation of the regression errors.
For a well-fitting model these standardized residuals will appear to be completely random, and should be mostly within the range from -2 to +2. No patterns should be apparent when they are plotted against the values predicted by the regression equation, or against the variables being used to predict the dependent variable. This is because the standardized residuals should approximately equal the error term ε in the regression model, but scaled to have a standard deviation of one.

The standardized residuals are plotted on the left-hand side of Figure 3.5 for the regression equation (3.25). There is some suggestion that (i) the variation in the residuals increases with the fitted value, or, at any rate, is relatively low for the smallest fitted values, (ii) all the residuals are less than zero for lakes with very low phosphorus concentrations, and (iii) the residuals are low, then tend to be high, and then tend to be low again as the nitrogen concentration increases.
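The calculation behind such plots is simple: divide each residual by the estimated standard deviation of the regression errors. The observed and fitted values below are invented for illustration, and p = 2 is assumed for the number of X variables, matching a two-predictor equation.

```python
# Standardized residuals: residuals divided by the estimated standard
# deviation of the regression errors. Data here are invented.

ys = [3.1, 4.0, 5.2, 6.1, 7.3, 8.0]       # observed values (assumed)
fitted = [3.0, 4.2, 5.0, 6.3, 7.1, 8.1]   # fitted values (assumed)

residuals = [y - f for y, f in zip(ys, fitted)]
p = 2  # assumed number of X variables in the fitted equation
s = (sum(r ** 2 for r in residuals) / (len(ys) - p - 1)) ** 0.5
standardized = [r / s for r in residuals]
print(standardized)  # for a well-fitting model, mostly within -2 to +2
```

In practice these values would then be plotted against the fitted values and against each predictor, and any pattern, rather than any single large value, is what signals a problem with the model.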
The problem here seems to be the particular form assumed for the relationship between chlorophyll-a and the other two variables. It is more usual to assume a linear relationship in terms of logarithms, i.e.,
log(CH) = β0 + β1log(PH) + β2log(NT) + ε, (3.26)
for the variables being considered (Dominici et al., 1997). Using logarithms to base ten, fitting this equation by multiple regression gives

log(CH) = -1.860 + 1.238 log(PH) + 0.907 log(NT). (3.27)
The R² value from equation (3.21) is 0.878, which is substantially higher than the value of 0.774 found from fitting equation (3.25). The estimated standard errors for the estimated coefficients of log(PH) and log(NT) are 0.124 and 0.326, which means that there is strong evidence that log(CH) is related to both of these variables (t = 1.238/0.124 = 9.99 for log(PH), giving p = 0.000 for the t-test with 22 degrees of freedom; t = 0.907/0.326 = 2.78 for log(NT), giving p = 0.011 for the t-test). Finally, the plots of standardized residuals for equation (3.27) that are shown on the right-hand side of Figure 3.5 give little cause for concern.
Figure 3.5 (a) Standardized residuals for chlorophyll-a plotted against the fitted value predicted from the regression equation (3.25) and against the phosphorus and nitrogen concentrations for lakes, and (b) standardized residuals for log(chlorophyll-a) plotted against the fitted value, log(phosphorus), and log(nitrogen) for the regression equation (3.27).
An analysis of variance is provided for equation (3.27) in Table 3.4. This shows that the equation with log(PH) included accounts for a very highly significant part of the variation in log(CH). Adding log(NT) to the equation then gives a highly significant improvement.
Table 3.4 Analysis of variance for equation (3.27), showing the sums of squares accounted for by log(PH), and by log(NT) added into the equation after log(PH)

Source       Sum of Squares   Degrees of Freedom   Mean Square   F        p-value
Phosphorus   5.924            1                    5.924         150.98   0.0000
3.5 Factorial Analysis of Variance
The analysis of variance that can be carried out with linear regression is very often used in other situations as well, particularly with what are called factorial experiments. An important distinction in this connection is between variables and factors. A variable is something like the phosphorus concentration or nitrogen concentration in lakes, as in the example just considered. A factor, on the other hand, has a number of levels, and in terms of a regression model it may be thought plausible that the response variable being considered has a mean level that changes with these levels.

Thus if an experiment is carried out to assess the effect of a toxic chemical on the survival time of fish, then the survival time might be related by a regression model to the dose of the chemical, perhaps at four concentrations, which would then be treated as a variable. If the experiment was carried out on fish from three sources, or on three different species of fish, then the type of fish would be a factor, which could not just be entered as a variable. The fish types would be labelled 1 to 3, and what would be required in the regression equation is that the mean survival time varied with the type of fish.
The type of regression model that could then be considered would be

Y = β1X1 + β2X2 + β3X3 + β4X4 + ε, (3.28)

where Y is the survival time, X1, X2 and X3 are 0-1 indicator variables for the three fish types, and X4 is the concentration of the chemical. Equation (3.28) allows for a factor effect, but only on the expected survival time. If the effect of the concentration of the toxic chemical may also vary with the type of fish, then the model can be extended to allow for this, by adding products of the 0-1 variables for the fish type with the concentration variable to give
Y = β1X1 + β2X2 + β3X3 + β4X1X4 + β5X2X4 + β6X3X4 + ε. (3.29)
For fish of types 1 to 3 the expected survival times are then β1 + β4X4, β2 + β5X4, and β3 + β6X4, respectively. The effect is then a linear relationship between the survival time and the concentration of the chemical which differs for the three types of fish.
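Constructing the 0-1 indicator variables and their products with the concentration is mechanical, as the sketch below shows for a model of the form of equation (3.29). The function name and arguments are illustrative, not from the text.

```python
# Building one row of the design matrix for a factor with three
# levels plus a concentration variable, as in equation (3.29).
# The function and its arguments are hypothetical, for illustration.

def design_row(fish_type, concentration):
    """X1, X2, X3 are 0-1 indicators for fish types 1 to 3; the
    products X1*X4, X2*X4, X3*X4 let the concentration slope
    differ by fish type."""
    x = [1.0 if fish_type == t else 0.0 for t in (1, 2, 3)]
    return x + [xi * concentration for xi in x]

print(design_row(2, 0.5))
```

For a type-2 fish at concentration 0.5 this gives [0, 1, 0, 0, 0.5, 0], so only β2 and β5 contribute to that fish's expected survival time, exactly as described above.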
When there is only one factor to be considered in a model, it can be handled reasonably easily by using dummy indicator variables as just described. However, with more than one factor this gets cumbersome, and it is more usual to approach modelling from the point of view of a factorial analysis of variance. This is based on a number of standard models, and the theory can get quite complicated. Nevertheless, the use of analysis of variance in practice can be quite straightforward if a statistical package is available to do the calculations. An introduction to experimental designs and their corresponding analyses of variance is given by Manly (1992, Chapter 7), and a more detailed account by Mead et al. (1993). Here only three simple situations will be considered.
One-Factor Analysis of Variance
With a single factor, the analysis of variance model is just a model for comparing the means of I samples, where I is two or more. This model can be written as

xij = µ + ai + εij, (3.30)

where xij is the jth observed value of the variable of interest at the ith factor level (i.e., in the ith sample), µ is an overall mean level, ai is the deviation from µ for the ith factor level, with a1 + a2 + ... + aI = 0, and εij is the random component of xij, which is assumed to be independent of all other terms in the model, with a mean of zero and a constant variance.
To test for an effect of the factor, an analysis of variance table is set up, which takes the form shown in Table 3.5. Here the sum of squares for the factor is just the sum of squares accounted for by allowing the mean level to change with the factor level in a regression model, although it is usually computed somewhat differently. The F-test requires the assumption that the random components εij in the model (3.30) have a normal distribution.
Table 3.5 Form of the analysis of variance table for a one-factor model, with I levels of the factor and n observations in total

Source of variation   Sum of squares 1      Degrees of freedom   Mean square 2       F 3
Factor                SSF                   I - 1                MSF = SSF/(I - 1)   MSF/MSE
Error                 SSE                   n - I                MSE = SSE/(n - I)
Total                 SST = ΣΣ(xij - x̄)²    n - 1

1 SSF = sum of squares between factor levels, SSE = sum of squares for error (variation within factor levels), and SST = total sum of squares, for which the summation is over all observations at all factor levels.
2 MSF = mean square between factor levels, and MSE = mean square error.
3 The F value is tested for significance by comparison with critical values for the F-distribution with I - 1 and n - I degrees of freedom.
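The sums of squares in Table 3.5 can be computed directly from the samples at each factor level, as the sketch below does for the model of equation (3.30). The three samples are invented for illustration.

```python
# One-factor analysis of variance for the model of equation (3.30),
# laid out as in Table 3.5. The three samples are invented data.

samples = [
    [5.1, 4.8, 5.5, 5.0],   # factor level 1
    [6.2, 6.0, 5.8, 6.4],   # factor level 2
    [4.0, 4.3, 3.9, 4.2],   # factor level 3
]

all_x = [x for s in samples for x in s]
n, I = len(all_x), len(samples)
grand_mean = sum(all_x) / n

# SSF: variation between factor-level means; SSE: variation within levels.
ssf = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples)
sse = sum((x - sum(s) / len(s)) ** 2 for s in samples for x in s)
sst = sum((x - grand_mean) ** 2 for x in all_x)

msf = ssf / (I - 1)
mse = sse / (n - I)
f_ratio = msf / mse
print(ssf, sse, f_ratio)  # compare f_ratio with F on I-1 and n-I df
```

A check worth noting is that SSF + SSE = SST exactly, which is the partition of the total variation that the analysis of variance table expresses; a large F-ratio relative to the F-distribution with I - 1 and n - I degrees of freedom indicates a factor effect.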