Statistical Tools for Environmental Quality Measurement - Chapter 5


CHAPTER 5

Tools for Dealing with Censored Data

“As trace substances are increasingly investigated in soil, air, and water, observations with concentrations below the analytical detection limits are more frequently encountered. ‘Less-than’ values present a serious interpretation problem for data analysts.” (Helsel, 1990a)

Calibration and Analytical Chemistry

All measurement methods (e.g., mass spectrometry) for determining chemical concentrations have statistically defined errors. Typically, these errors are defined as a part of developing the chemical analysis technique for the compound in question, which is termed “calibration” of the method.

In its simplest form, calibration consists of mixing a series of solutions that contain the compound of interest in varying concentrations. For example, if we were trying to measure compound A at concentrations of between zero and 50 ppm, we might prepare solutions of A at zero, 1, 10, 20, 40, and 80 ppm, and run these solutions through our analytical technique. Ideally we would run 3 or 4 replicate analyses at each concentration to provide us with a good idea of the precision of our measurements at each concentration. At the end of this exercise we would have a set of N measurements (if we ran 5 concentrations and 3 replicates per concentration, N would equal 15) consisting of a set of k analytic outputs, Ai,j, for each known concentration, Ci. Figure 5.1 shows a hypothetical set of calibration measurements, with a single Ai for each Ci, along with the regression line that best describes these data.


Regression (see Chapter 4 for a discussion of regression) is the method used to predict the estimated measured concentration from the known standard concentration (because the standards were prepared to a known concentration). The result is a prediction equation of the form:

Mi = β0 + β1 • Ci + εi   [5.1]

Here Mi is the predicted mean of the measured values (the Ai,j’s) at known concentration Ci, β0 is the estimated concentration at Ci = 0, β1 is the slope coefficient that predicts Mi from Ci, and εi is the error associated with the prediction of Mi. Unfortunately, Equation [5.1] is not quite what we want for our chemical analysis method because it allows us to predict a measurement from a known standard concentration. When analyses are actually being performed, we wish to use the observed measurement to predict the unknown true concentration. To do this, we must rearrange Equation [5.1] to give:

Ci = (Mi − β0)/β1 + ε′i   [5.2]

In Equation [5.2], β0 and β1 are the same as those in [5.1], but Ci is the unknown concentration of the compound of interest, Mi is the measurement from sample i, and ε′i is the error associated with the “inverse” prediction of Ci from Mi. This procedure is termed inverse prediction because the original regression model was fit to predict Mi from Ci, but is then rearranged to predict Ci from Mi. Note also that the error terms in [5.1] and [5.2] are different because inverse prediction has larger errors than simple prediction of y from x in a regular regression model.
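To make the calibration and inverse prediction steps concrete, here is a minimal sketch in Python; the chapter itself prescribes no software, and the standard concentrations, simulated instrument responses, and the use of ordinary least squares are all assumptions made for illustration rather than details taken from the text.

```python
import numpy as np
from scipy import stats

# Hypothetical calibration standards for "compound A" (ppm), three replicates each,
# with simulated instrument responses (the true slope/intercept/noise are invented).
C = np.repeat([0.0, 1.0, 10.0, 20.0, 40.0, 80.0], 3)
rng = np.random.default_rng(1)
M = 0.5 + 1.02 * C + rng.normal(scale=1.5, size=C.size)

# Calibration fit: M_i = b0 + b1 * C_i + e_i  (the form of Equation [5.1])
fit = stats.linregress(C, M)
b0, b1 = fit.intercept, fit.slope

def inverse_predict(m):
    """Estimate the unknown concentration from a new measurement (the form of Equation [5.2])."""
    return (m - b0) / b1

new_measurement = 25.0  # hypothetical field result
print("calibration: intercept=%.3f slope=%.3f" % (b0, b1))
print("estimated concentration: %.2f ppm" % inverse_predict(new_measurement))
```

The error of the inverse prediction is wider than the ordinary regression error, as the text notes; a full treatment would propagate the calibration uncertainty rather than report the point estimate alone.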

Detection Limits

The point of this discussion is that the reported concentration of any chemical in environmental media is an estimate with some degree of uncertainty. In the calibration process, chemists typically define some Cn value that is not significantly different from zero, and term this quantity the “method detection limit.” That is, if we used the ε′ distribution from [5.2] to construct a confidence interval for C, Cn would be the largest concentration whose 95% (or other interval width) confidence interval includes zero. Values below the limit of detection are said to be censored because we cannot measure the actual concentration, and thus all values less than Cn are reported as “less than LOD,” “nondetect,” or simply “ND.” While this seems a rather simple concept, the statistical process of defining exactly what the LOD is for a given analytical procedure is not (Gibbons, 1995).


An individual measurement, Mi, can be written as:

Mi = κi + εi   [5.3]

where κi is the true but unknown concentration and εi is a random error component. When κi is small, it can have a confidence interval that does not include zero (thus it is not an “ND”) but is still quite wide compared to the concentration being reported. For example, one might have a dioxin concentration reported as 500 ppb, but with a 95% confidence interval of 200 to 1,250 ppb. This is quite imprecise and would likely be reported as below the “limit of quantification” or “less than LOQ.” However, the fact remains that a value reported as below the limit of quantification still provides evidence that the substance of interest has been identified.

Moreover, if the measured concentrations are unbiased, it is true that the average error is zero. That is:

E(εi) = 0   [5.4]

Thus if we have many values below the LOQ it is true that:

Σ Mi = Σ κi + Σ εi   [5.5]

and for large samples,

Σ Mi ≈ Σ κi   [5.6]

That is, even if all values are less than LOQ, the sum is still expected to equal the sum of the unknown but true measurements and, by extension, the mean of a group of values below the LOQ, but above the DL, would be expected to equal the true sample mean.

It is worthwhile to consider the LOQ in the context of the calibration process. Sometimes an analytic method is calibrated across a rather narrow range of standard concentrations. If one fits a statistical model to such data, the precision of predictions can decline rapidly as one moves away from the range of the data used to fit the model. In this case, one may have artificially high LOQs (and Detection Limits, or DLs, as well) as a result of the calibration process itself. Moreover, if one moves to concentrations above the range of calibration one can also have unacceptably wide confidence intervals. This leads to the seeming paradox of values that are too large to be acceptably precise. This general problem is an issue of considerable discussion among statisticians engaged in the evaluation of chemical concentration data (see, for example: Gilliom and Helsel, 1986; Helsel and Gilliom, 1986; Helsel, 1990a, 1990b).

The important point to take away from this discussion is that values less than LOQ do contain information and, for most purposes, a good course of action is to simply take the reported values as the actual values (which is our expectation given unbiased measurements). The measurements are not as precise as we would like, but are better than values reported as “<LOQ.”

Another point is that sometimes a high LOQ does not reflect any actual limitation of the analytic method and is in fact due to calibration that was performed over a limited range of standard concentrations. In this case it may be possible to improve our understanding of the true precision of the method being used by doing a new calibration study over a wider range of standard concentrations. This will not make our existing <LOQ observations any more precise, but may give us a better idea of how precise such measurements actually are. That is, if we originally had a calibration data set at 200, 400, and 800 ppm and discovered that many field measurements are less than LOQ at 50 ppm, we could ask the analytical chemist to run a new set of calibration standards at, say, 10, 20, 40, and 80 ppm and see how well the method actually works in the range of concentrations encountered in the environment. If the new calibration exercise suggests that concentrations above 15 ppm are measured with adequate precision and are thus “quantified,” we should have greater faith in the precision of our existing less than LOQ observations.

Censored Data

More often, one encounters data in the form of reports where the original raw analytical results are not available and no further laboratory work is possible. Here the data consist of the quantified observations, reported as actual concentrations; the less than LOQ observations, reported as “<LOQ” together with the concentration defining the LOQ; and values below the limit of detection, reported as ND together with the concentration defining the limit of detection (LOD). It is also common to have data reported as “not quantified” together with a “quantification limit.” Such a limit may reflect the actual LOQ, but may also represent the LOD, or some other cutoff value. In any case, the general result is that we have only some of the data quantified, while the rest are defined only by one or more cutoff values. This situation is termed “left censoring” in statistics because observations below the censoring point are on the left side of the distribution.

The first question that arises is: “How do we want to use the censored data set?”

If our interest is in estimating the mean and standard deviation of the data, and the number of nonquantified observations (NDs and <LOQs) is low (say 10% of the sample or less), the easiest approach is to simply assume that nondetects are worth 1/2 the detection limit (DL), and that <LOQ values (LVs) are defined as:

LV = (DL + LOQ)/2   [5.7]

This convention makes the tacit assumption that nondetects are uniformly distributed between the detection limit and zero, and that <LOQ values are uniformly distributed between the DL and the LOQ. After assigning values to all nonquantified observations, we can simply calculate the mean and standard deviation using the usual formulae. This approach is consistent with EPA guidance regarding censored data (e.g., EPA, 1986).
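As a hedged illustration of this substitution convention (the chapter gives no code), the sketch below assigns DL/2 to nondetects and (DL + LOQ)/2 to <LOQ values before computing the usual mean and standard deviation; the concentrations, limits, and counts are invented.

```python
import numpy as np

DL, LOQ = 0.5, 2.0                                            # hypothetical detection and quantification limits (ppm)
quantified = [3.1, 4.7, 2.6, 8.9, 5.2, 3.8, 6.4, 2.9, 7.1]    # reported concentrations
n_nd, n_loq = 1, 1                                            # one "ND" and one "<LOQ" report (about 10% of the sample)

# Substitute DL/2 for NDs and the DL-LOQ midpoint for <LOQ values, then use the usual formulae.
values = np.array(quantified + [DL / 2.0] * n_nd + [(DL + LOQ) / 2.0] * n_loq)
print("mean = %.3f, sd = %.3f" % (values.mean(), values.std(ddof=1)))
```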

The situation is even easier if we are satisfied with the median and interquartile range as measures of central tendency and dispersion. The median is defined for any data set where more than half of the observations are quantified, while the interquartile range is defined for any data set where at least 75% of the observations are quantified.
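A small sketch with invented numbers shows why these order statistics survive modest censoring: any placeholder placed below the detection limit leaves the censored values in the lowest rank positions, so the median and quartiles computed from the quantified upper portion are unchanged.

```python
import numpy as np

DL = 0.5
# Ten observations, two censored; any placeholder below DL (here 0.0) preserves their rank positions.
data = np.array([0.0, 0.0, 0.7, 1.1, 1.4, 1.9, 2.3, 3.0, 4.2, 6.5])
q1, med, q3 = np.percentile(data, [25, 50, 75])
print("median = %.2f, IQR = %.2f" % (med, q3 - q1))
```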


Estimating the Mean and Standard Deviation Using Linear Regression

As shown in Chapter 2, observations from a normal distribution tend to fall on a straight line when plotted against their expected normal scores. This is true even if some of the data are below the limit of detection (see Example 5.1). If one calculates a linear regression of the form:

C = A + B • Z-Score   [5.8]

where C is the measured concentration, A and B are fitted constants, and Z-Score is the expected normal score based on the rank order of the data, then A is an estimate of the mean, µ, and B is an estimate of the standard deviation, σ (Gilbert, 1987; Helsel, 1990).

Expected Normal Scores

The first problem in obtaining expected normal scores is to convert the ranks of the data into cumulative percentiles. This is done as follows:

1. The largest value in a sample of N receives rank N, the second largest receives rank N − 1, the third largest receives rank N − 2, and so on until all measured values have received a rank. In the event that two or more values are tied (in practice this should happen very rarely; if you have many tied values you need to find out why), simply assign one rank K and one rank K − 1. For example, if the five largest values in a sample are unique, and the next two are tied, assign one rank 6 and one rank 7.

2. Convert each assigned rank, r, to a cumulative percentile, P, using the formula:

[5.9]

We note that other authors (e.g., Gilliom and Helsel, 1986) have used different formulae such as P = r/(N + 1). We have found that P values calculated using [5.9] provide better approximations to tabled Expected Normal Scores (Rohlf and Sokal, 1969) and thus will yield more accurate regression estimates of µ and σ.

3. Once P values have been calculated for all observations, one can obtain expected normal or Z scores using the relationship:

Z(P) = ϕ(P)   [5.10]

Here Z(P) is the z-score associated with the cumulative probability P, and ϕ is the standard normal inverse cumulative distribution function. This function is shown graphically in Figure 5.2.

4. Once we have obtained Z values for each P, we are ready to perform a regression analysis to obtain estimates of µ and σ.
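The four steps above translate into a short script. The sketch below is our own illustration, not code from the source: the plotting position for Equation [5.9] is shown as the common (r − 0.5)/N convention, which may differ from the book's exact formula, and the censored observations are assumed to occupy the lowest ranks of the sample.

```python
import numpy as np
from scipy import stats

def censored_regression_estimates(detects, n_censored):
    """Estimate mu and sigma by regressing detected values on expected normal scores.

    detects    : quantified observations (log-transformed first if the data are log-normal)
    n_censored : number of observations reported below the censoring limit
    Censored observations are assumed to occupy the lowest ranks of the combined sample.
    """
    n_total = len(detects) + n_censored
    ordered = np.sort(np.asarray(detects, dtype=float))   # detected values, smallest to largest
    ranks = np.arange(n_censored + 1, n_total + 1)         # ranks of the detects within the full sample
    p = (ranks - 0.5) / n_total                            # plotting positions (one common convention)
    z = stats.norm.ppf(p)                                  # expected normal scores, as in Equation [5.10]
    fit = stats.linregress(z, ordered)                     # C = A + B * Z-Score, Equation [5.8]
    return fit.intercept, fit.slope                        # A estimates mu, B estimates sigma

# Hypothetical example: 20 standard-normal values with the lowest 5 treated as censored.
rng = np.random.default_rng(7)
sample = np.sort(rng.normal(size=20))
mu_hat, sigma_hat = censored_regression_estimates(sample[5:], n_censored=5)
print("mu_hat = %.3f, sigma_hat = %.3f" % (mu_hat, sigma_hat))
```

Because this is an ordinary regression fit, the usual regression diagnostics (influence, nonlinearity) can be applied to judge the quality of the resulting estimates, as the text emphasizes below.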


Example 5.1 contains a sample data set with 20 random numbers, sorted smallest to largest, generated from a standard normal distribution (µ = 0 and σ = 1), cumulative percentiles calculated from Equation [5.9], and expected normal scores calculated from these P values. When we look at Example 5.1, we see that the estimates for µ and σ look quite close to the usual estimates of µ and σ except for the case where 75% of the data (15 observations) are censored. Note first that even when we have complete data we do not reproduce the parametric values, µ = 0 and σ = 1. This is because we started with a 20-observation random sample. For the case of 75% censoring, the estimated value for µ is quite a bit lower than the sample value of − 0.3029 and the estimated value for σ is also a good bit higher than the sample value of 1.0601. However, it is worthwhile to consider that if we did not use the regression method for censored data, we would have to do something else. Let us assume that our detection limit is really 0.32, and assign half of this value, 0.16, to each of the 15 “nondetects” in this example and use the usual formulae to calculate µ and σ. The resulting estimates are µ = 0.3692 and σ = 0.4582. That is, our estimate for µ is much too large and our estimate for σ is much too small. The moral here is that regression estimates may not do terribly well if a majority of the data is censored, but other methods may do even worse.

The sample regression table in Example 5.1 shows where the statistics presented for the 4 models (20 observations, 15 observations, 10 observations, 5 observations) come from. The CONSTANT term is the intercept for the regression equation and provides our estimate of µ, while the ZSCORE term is the slope of the regression line and provides our estimate of σ. The ANOVA table is included because the regression procedure in many statistical software packages provides this as part of the output. Note that the information required to estimate µ and σ is found in the regression equation itself, not in the ANOVA table.

Figure 5.2 The Inverse Normal Cumulative Distribution Function

The plot of the data with the regression curve includes both the “detects” and the “nondetects.” However, only the former were used to fit the curve. With real data we would have only the detect values, but this plot is meant to show why regression on normal scores works with censored data. That is, if the data are really log-normal, regression on those data points that we can quantify will really describe all of the data. An important point concerning the use of regression to estimate µ and σ is that all of the tools discussed in our general treatment of regression apply. Thus we can see if factors like influential observations or nonlinearity are affecting our regression model, and thus have a better idea of how good our estimates of µ and σ really are.

Maximum Likelihood

There is another way of estimating µ and σ from censored data that also does relatively well when there is considerable left-censoring of the data. This is the method of maximum likelihood. There are some similarities between this method and the regression method just discussed. When using regression, we use the ranks of the detected observations to calculate cumulative percentiles and use the standard normal distribution to calculate expected normal scores for the percentiles. We then use the normal scores together with the observed data in a regression model that provides us with estimates of µ and σ. In the maximum likelihood approach we start by assuming a normal distribution for the log-transformed concentration. We then make a guess as to the correct values for µ and σ. Once we have made this guess we can calculate a likelihood for each observed data point, using the guess about µ and σ and the known percentage, ψ, of the data that is censored. We write this result as L(xi | µ, σ, ψ). Once we have calculated an L for each uncensored observation, we can calculate the overall likelihood of the data, L(X | µ, σ, ψ), as:

L(X | µ, σ, ψ) = ∏i L(xi | µ, σ, ψ)   [5.11]

That is, the overall likelihood of the data given µ, σ, and ψ, L(X | µ, σ, ψ), is the product of the likelihoods of the individual data points. Such calculations are usually carried out under logarithmic transformation. Thus most discussions are in terms of log-likelihood, and the overall log-likelihood is the sum of the log-likelihoods of the individual observations. Once L(X | µ, σ, ψ) is calculated there are methods for generating another guess at the values for µ and σ that yields an even higher log-likelihood. This process continues until we reach values of µ and σ that result in a maximum value for L(X | µ, σ, ψ). Those who want a technical discussion of a representative approach to the likelihood maximization problem in the context of censored data should consult Shumway et al. (1989).
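As an illustration of the idea only (not the Shumway et al. or Millard implementation), the sketch below maximizes the log-likelihood of a left-censored normal sample numerically: each detected value contributes a normal log-density, and each censored value contributes the log-probability of falling below the detection limit. The data values, detection limit, and optimizer choice are all invented for the example.

```python
import numpy as np
from scipy import stats, optimize

def censored_normal_mle(detects, dl, n_censored):
    """Maximum likelihood estimates of mu and sigma with left-censoring at a single limit dl."""
    detects = np.asarray(detects, dtype=float)

    def neg_log_likelihood(theta):
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)                                # keep sigma positive
        ll = stats.norm.logpdf(detects, mu, sigma).sum()         # detected observations
        ll += n_censored * stats.norm.logcdf(dl, mu, sigma)      # probability mass below the detection limit
        return -ll

    start = np.array([detects.mean(), np.log(detects.std(ddof=1))])
    res = optimize.minimize(neg_log_likelihood, start, method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])

# Hypothetical log-scale data: 12 detects plus 8 nondetects below a detection limit of -0.3.
detects = [-0.1, 0.2, 0.3, 0.5, 0.6, 0.8, 0.9, 1.1, 1.3, 1.6, 1.9, 2.4]
print("mu=%.3f sigma=%.3f" % censored_normal_mle(detects, dl=-0.3, n_censored=8))
```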

The first point about this procedure is that it is complex compared to the regression method just discussed, and is not easy to implement without special software (e.g., Millard, 1997). The second point is that if there is only one censoring value (e.g., detection limit), maximum likelihood and regression almost always give essentially identical estimates for µ and σ, and when the answers differ somewhat there is no clear basis for preferring one method over the other. Thus, for reasons of simplicity, we recommend the regression approach.

Multiply Censored Data

There is one situation where maximum likelihood methods offer a distinct advantage over regression. In some situations we may have multiple “batches” of data that each have their own values at which the data are censored. For example, we might have a very large environmental survey where the samples were split among several labs that had somewhat different instrumentation and thus different detection and quantification limits. Alternatively, we might have samples with differing levels of “interference” for the compound of interest by other compounds and thus differing limits for detection and quantification. We might even have replicate analyses over time with declining limits of detection caused by improved analytic techniques. The cause does not really matter, but the result is always a set of measurements consisting of several groups, each of which has its own censoring level.

One simple approach to this problem is to declare all values below the highest censoring point (the largest value reported as not quantified across all groups) as censored and then apply the regression methods discussed earlier. If this results in minimal data loss (say, 5% to 10% of quantified observations), it is arguably the correct course. However, in some cases, especially if one group has a high censoring level, the loss of quantified data points may be much higher (we have seen situations where this can exceed 50%). In such a case, one can use maximum likelihood methods for multiply censored data, such as those contained in Millard (1997), to obtain estimates for µ and σ that utilize all of the available data. However, we caution that estimation in the case of multiple censoring is a complex issue. For example, the pattern of censoring can affect how one decides to deal with the data. When dealing with such complex issues, we strongly recommend that a professional statistician, one who is familiar with this problem area, be consulted.
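For multiply censored data the same likelihood idea generalizes: each censored observation simply contributes the log-probability of lying below its own reporting limit. The sketch below is our own illustration of that generalization, with invented data and limits; it is not Millard's implementation.

```python
import numpy as np
from scipy import stats, optimize

def multiply_censored_mle(detects, censor_limits):
    """MLE of mu and sigma when each censored observation has its own reporting limit."""
    detects = np.asarray(detects, dtype=float)
    censor_limits = np.asarray(censor_limits, dtype=float)

    def neg_log_likelihood(theta):
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)
        ll = stats.norm.logpdf(detects, mu, sigma).sum()
        ll += stats.norm.logcdf(censor_limits, mu, sigma).sum()   # one term per censored value
        return -ll

    start = np.array([detects.mean(), np.log(detects.std(ddof=1))])
    res = optimize.minimize(neg_log_likelihood, start, method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])

# Hypothetical split-lab survey on the log scale: two labs with different detection limits.
detects = [0.4, 0.7, 0.9, 1.2, 1.5, 1.8, 2.1, 2.6]
censor_limits = [0.0, 0.0, 0.0, 0.5, 0.5]          # three NDs from lab A, two from lab B
print("mu=%.3f sigma=%.3f" % multiply_censored_mle(detects, censor_limits))
```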

Example 5.1

The Data for Regression (columns): Y Data (Random Normal), Sorted Smallest to Largest; Cumulative Proportion from Equation [5.9]; Z-Scores from Cumulative Proportions. (Tabled values omitted; partial column of Z-scores: 0.061932, 0.186756, 0.314572, 0.447768, 0.589456, 0.744143, 0.919135, 1.128143, 1.403412, 1.868242.)

Unweighted Least-Squares Linear Regression of Y (columns): Predictor Variables, Coefficient, Std Error, Student’s t, P. (Coefficient table omitted.)

R-SQUARED 0.9641

ANOVA Table (Source, DF, SS, MS, F, P):
Regression: 1, 3.34601, 3.34601, 214.85, 0.0000
Residual: 8, 0.12459, 0.01557
Total: 9, 3.47060

Figure 5.3 A Regression Plot of the Data Used in Example 5.1

Estimating the Arithmetic Mean and Upper Bounds on the Arithmetic Mean

In Chapter 2, we discussed how one can estimate the arithmetic mean concentration of a compound in environmental media, and how one might calculate an upper bound on this arithmetic mean. Our general recommendation was to use the usual statistical estimator for the arithmetic mean and to use bootstrap methodology (Chapter 6) to calculate an upper bound on this mean. The question at hand is how do we develop estimates for the arithmetic mean, and upper bounds for this mean, when the data are censored?

One approach that is appealing in its simplicity is to use the values of µ and σ, estimated by regression on expected normal scores, to assign values to the censored observations. That is, if we have N observations, k of which are censored, we can assume that there are no tied values and that the ranks of the censored observations are 1 through k. We can then use these ranks to calculate P values using Equation [5.9], and use the estimated P values to calculate expected normal scores (Equation [5.10]). We then use the regression estimates of µ and σ to calculate “values” for the censored observations and use an exponential transformation to calculate observations in original units (usually ppm or ppb). Finally, we use the “complete” data, which consist of estimated values for the censored observations and observed values for the uncensored observations, together with the usual formulae to calculate x̄ and s.

Consider Example 5.2. The estimates of µ and σ are essentially identical. What is perhaps more surprising is the fact that the upper percentiles of the bootstrap distribution shown in Example 5.2 are also virtually identical for the complete and partially estimated exponentially transformed data. Replacing the censored data with their exponentially transformed expectations from the regression model and then calculating x̄ and s using the resulting pseudo-complete data is a strategy that has been recommended by other authors (Helsel, 1990b; Gilliom and Helsel, 1986; Helsel and Gilliom, 1986). The use of the same data to estimate an upper bound for x̄ is a relatively new idea, but one that flows logically from previous work. That is, the use of the bootstrap technique to estimate an upper bound on x̄ is well established for the case of uncensored data. As noted earlier (Chapter 2), environmental data are almost always skewed to the right. That is, the distribution has a long “tail” that points to the right. Except for cases of extreme censoring, this long tail always consists of actual observations, and it is this long tail that plays the major role in determining the bootstrap upper bound on x̄. Our work suggests that the bootstrap is a useful tool for determining an upper bound on x̄ whenever at least 50% of the data are uncensored (Ginevan and Splitstone, 2002).
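A hedged sketch of the whole procedure described above: the censored (lowest-ranked) observations are filled in from the regression estimates of µ and σ on the log scale, exponentiated back to concentration units, and the 95th percentile of bootstrap means serves as the upper bound. The simulated data, the censoring fraction, the (r − 0.5)/N plotting position, and the number of bootstrap replicates are our own assumptions, not values from Example 5.2.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical log-normal concentrations: 20 observations, the lowest 5 censored.
log_data = np.sort(rng.normal(loc=1.0, scale=0.8, size=20))
n, k = log_data.size, 5
detect_logs = log_data[k:]

# Regression on expected normal scores using the detected observations (ranks k+1..n).
p_detect = (np.arange(k + 1, n + 1) - 0.5) / n
fit = stats.linregress(stats.norm.ppf(p_detect), detect_logs)
mu_hat, sigma_hat = fit.intercept, fit.slope

# Assign "values" to the censored observations from their expected normal scores (ranks 1..k).
p_censored = (np.arange(1, k + 1) - 0.5) / n
imputed_logs = mu_hat + sigma_hat * stats.norm.ppf(p_censored)

# Pseudo-complete sample in original units, then bootstrap the upper bound on the mean.
complete = np.exp(np.concatenate([imputed_logs, detect_logs]))
boot_means = np.array([rng.choice(complete, size=n, replace=True).mean() for _ in range(5000)])
print("mean = %.3f, 95%% bootstrap upper bound = %.3f" % (complete.mean(), np.percentile(boot_means, 95)))
```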

Example 5.2

Calculating the Arithmetic Mean and its Bootstrap Upper Bound

Columns: Y Data (Random Normal), Sorted Smallest to Largest; Z-Scores from Cumulative Proportions; Data Calculated from Estimates of µ and σ; Exponential Transform (Calculated for Censored and Observed for Uncensored Observations). (Tabled values omitted.)
