Statistical Tools for Environmental Quality Measurement, Chapter 2
(CRC Press LLC, 2003)


Chapter 2

Basic Tools and Concepts

Description of Data

"The goal of statistics is to gain information from data. The first step is to display the data in a graph so that our eyes can take in the overall pattern and spot unusual observations. Next, we often summarize specific aspects of the data, such as the average of a value, by numerical measures. As we study graphs and numerical summaries, we keep firmly in mind where the data come from and what we hope to learn from them. Graphs and numbers are not ends in themselves, but aids to understanding." (Moore and McCabe, 1993)

Every study begins with a sample, or a set of measurements, which is "representative," in some sense, of some population of possible measurements. For example, if we are concerned with PCB contamination of surfaces in a building where a transformer fire has occurred, our sample might be a set of 20 surface wipe samples chosen to represent the population of possible surface contamination measurements. Similarly, if we are interested in the level of pesticide present in individual apples, our sample might be a set of 50 apples chosen to be representative of all apples (or perhaps all apples treated with pesticide). Our focus here is the set of statistical tools one can use to describe a sample, and the use of these sample statistics to infer the characteristics of the underlying population of measurements.

Central Tendency or Location

The Arithmetic Mean

Perhaps the first question one asks about a sample is what is a typical value for the sample. Usually this is answered by calculating a value that is in the middle of the sample measurements. Here we have a number of choices. We can calculate the arithmetic mean, x̄, whose value is given by:

x̄ = (Σ xi) / N   [2.1]

where the xi's are the individual sample measurements and N is the sample size.

The Geometric Mean

Alternatively, we can calculate the geometric mean, GM(x), given by:

GM(x) = exp[(1/N) Σ ln(xi)]   [2.2]


That is, GM(x) is the antilogarithm of the mean of the logarithms of the data values. Note that for the GM to be defined, all x's must be greater than zero.

If we calculate ln(GM(x)), this is called the logarithmic mean, LM(x), and is simply the arithmetic mean of the log-transformed x's.
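As a concrete sketch of Equations [2.1] and [2.2] (the function names and sample values here are ours, for illustration only):

```python
import math

def arithmetic_mean(xs):
    # Equation [2.1]: the sum of the observations divided by the sample size N
    return sum(xs) / len(xs)

def logarithmic_mean(xs):
    # LM(x): the arithmetic mean of the log-transformed observations
    return sum(math.log(x) for x in xs) / len(xs)

def geometric_mean(xs):
    # GM(x): the antilog of LM(x); requires every x > 0
    return math.exp(logarithmic_mean(xs))

sample = [1.0, 10.0, 100.0]
print(arithmetic_mean(sample))  # 37.0
print(geometric_mean(sample))   # ~10.0, the antilog of the mean of the logs
```

Note how far apart the two means fall for this deliberately skewed sample.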

The Median

The median, M, is another estimator of central tendency. It is given by the 50th percentile of the data. If we have a sample of size N, sorted from smallest to largest (e.g., x1 is the smallest observation and xN is the largest) and N is odd, the median is given by xj. Here j is given as:

j = ((N − 1) / 2) + 1   [2.3]

That is, if we have 11 observations the median is equal to the 6th largest, and if we have 7 observations, the median is equal to the 4th largest. When N is an even number, the median is given as:

M = (xj + xk) / 2   [2.4]

In Equation [2.4], j and k are equal to (N/2) and ((N/2) + 1), respectively. For example, if we had 12 observations, the median would equal the average of the 6th and 7th largest observations. If we had 22 observations, the median would equal the average of the 11th and 12th largest values.
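A minimal Python rendering of the rank rules in Equations [2.3] and [2.4] (our own sketch; the ranks are 1-based in the text but 0-based in the list indexing):

```python
def median(xs):
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        j = (n - 1) // 2 + 1           # Equation [2.3], 1-based rank
        return s[j - 1]
    j, k = n // 2, n // 2 + 1          # ranks used in Equation [2.4]
    return (s[j - 1] + s[k - 1]) / 2   # Equation [2.4]

print(median([5, 1, 9, 3, 7]))        # 5 (the 3rd largest of N = 5)
print(median([1, 2, 3, 4, 5, 6]))     # 3.5 (average of the 3rd and 4th)
```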

Discussion

While there are other values, such as the mode of the data (the most frequent value) or the harmonic mean (the reciprocal of the mean of the 1/x values), the arithmetic mean, the geometric mean, and the median are the three measures of central tendency routinely used in environmental quality investigations. The logarithmic mean is not of interest as a measure of central tendency because it is in transformed units (ln(concentration)), but does arise in considerations of hypothesis tests.

Note also that all of these measures of sample central tendency are expected to represent the corresponding quantities in the population (often termed the "parent" population) from which the sample was drawn. That is, as the sample size becomes large, the difference between, for example, x̄ and µ (the parametric or "true" arithmetic mean) becomes smaller and smaller, and in the limit is zero. In statistical terms these "sample statistics" are unbiased estimators of the corresponding population parameters.

Dispersion

By dispersion we mean how spread out the data are. For example, say we have two areas, both with a median concentration of 5 ppm for some compound of interest. However, in the first area the 95th percentile concentration is 25 ppm, while in the second, the 95th percentile concentration is 100 ppm. One might argue that the central tendency or location of the compound of interest is similar in these areas



(or not, depending on the purpose of our investigation; see Chapter 3), but the second area clearly has a much greater spread or dispersion of concentrations than the first. The question is, how can this difference be expressed?

The Sample Range

One possibility is the sample range, W, which is given by:

W = xmax − xmin   [2.5]

that is, W is the difference between the largest and smallest sample values. This is certainly a good measure of the dispersion of the sample, but is less useful in describing the underlying population. The reason that this is not too useful as a description of the population dispersion is that its magnitude is a function of both the actual dispersion of the population and the size of the sample. We can show this as follows:

1. The median percentile, mpmax, of the population that the largest value in a sample of N observations will represent is given by:

mpmax = 0.5^(1/N)

That is, if we have a sample of 10 observations, mpmax equals 0.5^(1/10), or 0.933. If instead we have a sample of 50 observations, mpmax equals 0.5^(1/50), or 0.986. That is, if the sample size is 10, the largest value in the sample will have a 50-50 chance of being above or below the 93.3rd percentile of the population from which the sample was drawn. However, if the sample size is 50, the largest value in the sample will have a 50-50 chance of being above or below the 98.6th percentile of the population from which the sample was drawn.

2. The median percentile, mpmin, of the population that the smallest value in a sample of N observations will represent is given by:

mpmin = 1 − 0.5^(1/N)

For a sample of 10 observations, mpmin equals 0.067, and for a sample of 50 observations, mpmin equals 0.014.

3. Thus for a sample of 10 the range will tend to be the difference between the 6.7th and 93.3rd percentiles of the population from which the sample was drawn, while for a sample of 50, the range will tend to be the difference between the 1.4th and 98.6th percentiles of the population from which the sample was drawn. More generally, as the sample becomes larger and larger, the range represents the difference between more and more extreme high and low percentiles of the population.



This is why the sample range is a function of both the dispersion of the population and the sample size. For equal sample sizes the range will tend to be larger for a population with greater dispersion, but for populations with the same dispersion the sample range will be larger for larger N.
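The percentile argument above is easy to reproduce numerically (a small sketch; the function names are ours):

```python
def mp_max(n):
    # median percentile represented by the sample maximum: 0.5**(1/N)
    return 0.5 ** (1.0 / n)

def mp_min(n):
    # median percentile represented by the sample minimum: 1 - 0.5**(1/N)
    return 1.0 - 0.5 ** (1.0 / n)

for n in (10, 50):
    print(n, round(mp_max(n), 3), round(mp_min(n), 3))
# 10 0.933 0.067
# 50 0.986 0.014
```

Increasing N from 10 to 50 pushes the expected extremes from roughly the 6.7th/93.3rd percentiles out to the 1.4th/98.6th, which is exactly why W tends to grow with sample size.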

The Interquartile Range

One way to fix the problem of the range depending on the sample size is to calculate the difference between fixed percentiles of the data. The first problem encountered is the calculation of percentiles. We will use the following procedure:

1. Sort the N sample observations from smallest to largest.

2. Let the rank of an observation be I, its list index value. That is, the smallest observation has rank 1, the second smallest has rank 2, and so on, up to the largest value, which has rank N.

3. The cumulative probability, PI, of rank I is given by:

PI = (I − 3/8) / (N + 1/4)   [2.6]

This cumulative probability calculation gives excellent agreement with the median probability calculated from the theory of order statistics (Looney and Gulledge, 1995).

To get values for cumulative probabilities not associated with a given rank:

1. Pick the cumulative probability, CP, of interest (e.g., 0.75).

2. Pick the PI value of the rank just less than CP. The next rank has cumulative probability value PI+1 (note that one cannot calculate a value for cumulative probabilities less than P1 or greater than PN).

3. Let the values associated with these ranks be given by VI = VL and VI+1 = VU.

4. Now if we assume probability is uniform between PI = PL and PI+1 = PU, it is true that:

(CP − PL) / (PU − PL) = (VCP − VL) / (VU − VL)   [2.7]

where VCP is the value at the CP (e.g., 0.75) cumulative probability, VL is the value associated with the lower end of the probability interval, PL, and VU is the value associated with the upper end of the probability interval, PU. One can rearrange [2.7] to obtain V0.75 as follows:

V0.75 = ((VU − VL) × (0.75 − PL) / (PU − PL)) + VL   [2.8]

This is general for all cumulative probabilities that we can calculate. Note that one cannot calculate a value for cumulative probabilities less than P1 or greater than PN, because in the first case PL is undefined and in the second PU is undefined. That is, if we wish to calculate the value associated with a cumulative probability of 0.95 in a sample of 10 observations, we find that we cannot, because P10 is only about 0.94.



As one might expect from the title of this section, the interquartile range, IQ, given by:

IQ = V0.75 − V0.25   [2.9]

is a commonly used measure of dispersion. It has the advantage that its expected width does not vary with sample size, and it is defined (calculable) for samples as small as 3.
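The percentile procedure of Equations [2.6] through [2.8], and the interquartile range of Equation [2.9], can be sketched as follows (our implementation, not the book's code):

```python
def cumulative_probability(i, n):
    # Equation [2.6]: P_I = (I - 3/8) / (N + 1/4) for 1-based rank I
    return (i - 0.375) / (n + 0.25)

def percentile(xs, cp):
    # Equations [2.7]-[2.8]: linear interpolation between the bracketing ranks
    s = sorted(xs)
    n = len(s)
    probs = [cumulative_probability(i, n) for i in range(1, n + 1)]
    if cp < probs[0] or cp > probs[-1]:
        # below P1 or above PN, PL or PU is undefined
        raise ValueError("cumulative probability outside [P1, PN]")
    for i in range(n - 1):
        pl, pu = probs[i], probs[i + 1]
        if pl <= cp <= pu:
            vl, vu = s[i], s[i + 1]
            return vl + (vu - vl) * (cp - pl) / (pu - pl)

def interquartile_range(xs):
    # Equation [2.9]: IQ = V0.75 - V0.25
    return percentile(xs, 0.75) - percentile(xs, 0.25)
```

For the sample 1, 2, ..., 11 this interpolation gives V0.75 = 8.8125 and V0.25 = 3.1875.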

The Variance and Standard Deviation

The sample variance, S², is defined as:

S² = Σ (xi − x̄)² / (N − 1)   [2.10]

where the xi's are the individual sample measurements, x̄ is the sample mean, and N is the sample size. Note that one sometimes also sees the formula:

σ² = Σ (xi − x̄)² / N   [2.11]

Here σ² is the population variance. The difference between [2.10] and [2.11] is the denominator. The (N − 1) term is used in [2.10] because using N as in [2.11] with any finite sample will result in an estimate of S² that is too small relative to the true value of σ². Equation [2.11] is offered as an option in some spreadsheet programs, and is sometimes mistakenly used in the calculation of sample statistics. This is always wrong. One should always use [2.10] with sample data because it always gives a more accurate estimate of the true σ² value.

The sample standard deviation, S, is given by:

S = (S²)^(1/2)   [2.12]

that is, the sample standard deviation is the square root of the sample variance.

It is easy to see that S and S² reflect the dispersion of the measurements. The variance is, for large samples, approximately equal to the average squared deviation of the observations from the sample mean, which, as the observations get more and more spread out, will get larger and larger.

If we can assume that the observations follow a normal distribution, we can also use x̄ and S to calculate estimates of extreme percentiles. We will consider this at some length in our discussion of the normal distribution.
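A quick numerical comparison of Equations [2.10] through [2.12] (the sample values are ours, for illustration):

```python
def sample_variance(xs):
    # Equation [2.10]: divide by (N - 1); the correct choice for sample data
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def population_variance(xs):
    # Equation [2.11]: divide by N; too small when applied to a finite sample
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

def sample_sd(xs):
    # Equation [2.12]: the square root of the sample variance
    return sample_variance(xs) ** 0.5

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_variance(data))      # 4.571... (N - 1 in the denominator)
print(population_variance(data))  # 4.0 -- always smaller for the same data
```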

The Logarithmic and Geometric Variance and Standard Deviation

Just as we can calculate the arithmetic mean of the log-transformed observations, LM(x), and its anti-log, GM(x), we can also calculate the variance and standard deviation of these log-transformed measurements, termed the logarithmic variance, LV(x), and logarithmic standard deviation, LSD(x), and their anti-logs, termed the geometric variance, GV(x), and geometric standard deviation, GSD(x), respectively. These measures of dispersion find application when the log-transformed measurements follow a normal distribution, which means that the measurements themselves follow what is termed a log-normal distribution.


The Coefficient of Variation (CV)

The sample CV is defined as:

CV = (S / x̄) × 100   [2.13]

that is, it is the standard deviation expressed as a percentage of the sample mean. Note that S and x̄ have the same units. That is, if our measurements are in units of ppm, then both S and x̄ are in ppm. Thus, the CV is always unitless. The CV is useful because it is a measure of relative variability. For example, if we have a measurement method for a compound, and have done ten replicates each at standard concentrations of 10 and 100 ppm, we might well be interested in relative rather than absolute precision, because a 5% error at 10 ppm is 0.5 ppm, but the same relative error at 100 ppm is 5 ppm. Calculation of the CV would show that while the absolute dispersion at 100 ppm is much larger than that at 10 ppm, the relative dispersion of the two sets of measurements is equivalent.
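A sketch of Equation [2.13] using two hypothetical replicate sets, one at a 10 ppm standard and one at 100 ppm (the data values are invented for illustration):

```python
def coefficient_of_variation(xs):
    # Equation [2.13]: CV = (S / mean) * 100, a unitless relative measure
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / (n - 1)) ** 0.5
    return 100.0 * s / m

low = [9.5, 10.2, 10.5, 9.8, 10.0]        # replicates at the 10 ppm standard
high = [95.0, 102.0, 105.0, 98.0, 100.0]  # the same relative errors at 100 ppm
print(round(coefficient_of_variation(low), 2))
print(round(coefficient_of_variation(high), 2))
# identical CVs: the absolute spread differs tenfold, the relative spread does not
```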

Discussion

The proper measure of the dispersion of one's data depends on the question one wants to ask. The sample range does not estimate any parameter of the parent population, but it does give a very clear idea of the spread of the sample values. The interquartile range does estimate the population interquartile range and clearly shows the spread between the 25th and 75th percentiles. Moreover, this is the only dispersion estimate that we will discuss that accurately reflects the same dispersion measure of the parent population and that does not depend on any specific assumed distribution for its interpretation. The arithmetic variance and standard deviation are primarily important when the population follows a normal distribution, because these statistics can help us estimate error bounds and conduct hypothesis tests. The situation with the logarithmic and geometric variance and standard deviation is similar. These dispersion estimators are primarily important when the population follows a log-normal distribution.

Some Simple Plots

The preceding sections have discussed some basic measures of location (arithmetic mean, geometric mean, median) and dispersion (range, interquartile range, variance, and standard deviation). However, if one wants to get an idea of what the data "look like," perhaps the best approach is to plot the data (Tufte, 1983; Cleveland, 1993; Tukey, 1977). There are many options for plotting data to get an idea of its form, but we will discuss only three here.

Box and Whisker Plots

The first, called a "box and whisker plot" (Tukey, 1977), is shown in Figure 2.1. This plot is constructed using the median and the interquartile range (IQR). The IQR defines the height of the box, while the median is shown as a line within the box. The whiskers are drawn from the upper and lower hinges (UH and LH; the top and bottom of the box; the 75th and 25th percentiles) to the largest and smallest observed values within 1.5 times the IQR of the UH and LH, respectively. Values between 1.5 and 3 times the IQR above or below the UH or LH are plotted as "*" and are termed



"outside points." Values beyond 3 times the IQR above or below the UH and LH values are plotted as "o" and are termed "far outside values." The value of this plot is that it conveys a great amount of information about the form of one's data in a very simple form. It shows central tendency and dispersion, as well as whether there are any extremely large or small values. In addition, one can assess whether the data are symmetric in the sense that values seem to be similarly dispersed above and below the median (see Figure 2.2D) or are "skewed" in the sense that there is a long tail toward high or low values (see Figure 2.4).

Dot Plots and Histograms

Figure 2.1 A Sample Box Plot

A dot plot (Figure 2.2A) is generated by sorting the data into "bins" of specified width (here about 0.2) and plotting the points in a bin as a stack of dots (hence the name dot plot). Such plots can give a general idea of the shape and spread of a set of data, and are very simple to interpret. Note also that the dot plot is similar in concept to a histogram (Figure 2.2B). A key difference is that when data are sparse, a dot plot will still provide useful information on the location and spread of the data, whereas a histogram may be rather difficult to interpret (Figure 2.2B).

When there are a substantial number of data points, histograms can provide a good look at the relative frequency distribution of x. In a histogram the range of the data is divided into a set of intervals of fixed width (e.g., if the data range from 1 to 10, we might pick an interval width of 1, which would yield 10 intervals). The histogram is constructed by counting up the data points whose values lie in a given interval and drawing a bar whose height corresponds to the number of observations in the interval. In practice the scale for the heights of the bars may be in either absolute or relative units. In the first case the scale is simply the number of observations, k, while in the second, the scale is in relative frequency, which is the fraction of the total sample, N, that is represented by a given bar (relative frequency = k/N). Both views are useful. An absolute scale allows one to see how many points a given interval contains, which can be useful for small- to medium-sized data sets, while the relative scale provides information on the frequency distribution of the data, which can be particularly useful for large data sets.

Empirical Cumulative Distribution Plots

If we sort the observations in a sample from smallest to largest, we can calculate the proportion of the sample less than or equal to a given observation by the simple equation I/N, where N is the sample size and I is the rank of the observation in the sorted sample. We could also calculate the expected cumulative proportion of the population associated with the observation using Equation [2.6]. In either case, we can then plot the x's against their calculated cumulative proportions to produce a plot like that shown in Figure 2.2C. These empirical cumulative distribution plots can show how rapidly data values increase with increasing rank, and are also useful in determining what fraction of the observations are above some value of interest.
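The I/N plotting positions can be generated in a couple of lines (a sketch; for the expected population proportions one would substitute Equation [2.6] for the I/N term):

```python
def ecdf_points(xs):
    # pair each sorted value with I/N, the sample proportion <= that value
    s = sorted(xs)
    n = len(s)
    return [(x, (i + 1) / n) for i, x in enumerate(s)]

pts = ecdf_points([3.0, 1.0, 4.0, 1.5, 9.0])
for x, p in pts:
    print(x, p)
# the final point is always (largest value, 1.0)
```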

Figure 2.2A–D Examples of Some Useful Plot Types (A: An Example Dot Plot; B: An Example Histogram; C: An Example Empirical Cumulative Distribution Plot)


Describing the Distribution of Environmental Measurements

Probability distributions are mathematical functions that describe the probability that the value of x will lie in some interval, for continuous distributions, or that x will equal some integer value, for discrete distributions (e.g., integers only). There are two functional forms that are important in describing these distributions, the probability density function (PDF) and the cumulative distribution function (CDF). The PDF, which is written as f(X), can be thought of, in the case of continuous distributions, as providing information on the relative frequency or likelihood of different values of x, while for the case of discrete distributions it gives the probability, P, that x equals X; that is:

f(X) = P(x = X)   [2.14]

The CDF, usually written as F(X), always gives the probability that x is less than or equal to X; that is:

F(X) = P(x ≤ X)   [2.15]

The two functions are related. For discrete distributions:

F(X) = Σ f(x), where the sum runs over all values x ≤ X


upon these measurements. The wise admonition of G. E. P. Box (1979) that "all models are wrong but some are useful" should be kept firmly in mind when assuming the utility of any particular functional form. Techniques useful for judging the lack of utility of a functional form are discussed later in this chapter.

Some of the functional forms that traditionally have been found useful for continuous measurement data are the Gaussian or "normal" model, the "Student's t" distribution, and the log-normal model.

Another continuous model of great utility is the uniform distribution. The uniform model simply indicates that the occurrence of any measurement outcome within a range of possible outcomes is equally likely. Its utility derives from the fact that the CDF of any distribution is distributed as the uniform model. This fact will be exploited in discussing Bootstrap techniques in Chapter 6.

The Normal Distribution

The normal or Gaussian distribution is one of the historical cornerstones of statistical inference, in that many broadly used techniques such as regression and analysis of variance (ANOVA) assume that the variation of measurement errors follows a normal distribution. The PDF for the normal distribution is given as:

f(x) = (1 / (σ(2π)^(1/2))) exp[−(1/2)((x − µ)/σ)²]   [2.18]

Here π is the numerical constant defined by the ratio of the circumference of a circle to its diameter (≈ 3.14), exp is the exponential operator (exp(Z) = e^Z; e is the base of the natural logarithms (≈ 2.72)), and µ and σ are the parametric values for the mean and standard deviation, respectively. The CDF of the normal distribution does not have an explicit algebraic form and thus must be calculated numerically. A graph of the "standard" normal curve (µ = 0 and σ = 1) is shown in Figure 2.3. The standard form of the normal curve is important because if we subtract µ, the population mean, from each observation, and divide the result by σ, the population standard deviation, the resulting transformed values have a mean of zero and a standard deviation of 1. If the parent distribution is normal, the resulting standardized values should approximate a standard normal distribution. The standardization procedure is shown explicitly in Equation [2.19]. In this equation, Z is the standardized variate:

Z = (x − µ) / σ   [2.19]

For samples, σ is replaced by the estimated standard error of the mean, S(x̄), giving the t statistic:

t = (x − µ) / S(x̄)

S(x̄) = S / N^(1/2)   [2.21]

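Equation [2.19] in code form (a trivial sketch, with µ and σ assumed known):

```python
def standardize(xs, mu, sigma):
    # Equation [2.19]: Z = (x - mu) / sigma for each observation
    return [(x - mu) / sigma for x in xs]

z = standardize([8.0, 10.0, 12.0], mu=10.0, sigma=2.0)
print(z)  # [-1.0, 0.0, 1.0]
```

If the parent distribution is normal, such standardized values approximate a standard normal distribution.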


That is, S(x̄) is the sample standard deviation divided by the square root of the sample size. A t distribution for a sample size of N is termed a t distribution on ν degrees of freedom, where ν = N − 1, and is often written tν. Thus, for example, a t value based on 16 samples from a normal distribution would have a t15 distribution. The algebraic form of the t distribution is complex, but tables of the cumulative distribution function of tν are found in many statistics texts and are calculated by most statistical packages and some pocket calculators. Generally, tabled values of tν are presented for ν = 1 to ν = 30 degrees of freedom and for probability values ranging from 0.90 to 0.9995. Many tables equivalently tabulate 0.10 to 0.0005 for 1 − F(tν). See Table 2.2 for some example t values. Note that Table 2.2 includes t∞. This is the distribution of t for an infinite sample size, which is precisely equivalent to a normal distribution. As Table 2.2 suggests, for ν greater than 30, t tends toward a standard normal distribution.

The Log-Normal Distribution

Often chemical measurements exhibit a distribution with a long tail to the right. A frequently useful model for such data is the log-normal distribution. In such a distribution the logarithms of the x's follow a normal distribution. One can do logarithmic transformations in either log base 10 (often referred to as common logarithms, and written log(x)), or in log base e (often referred to as natural logarithms, and written as ln(x)). In our discussions we will always use natural logarithms because these are most commonly used in statistics. However, when confronted with "log-transformed data," the reader should always be careful to determine which logarithms are being used, because log base 10 is also sometimes used. When dealing with log-normal statistical calculations, all statistical tests are done with log-transformed observations, and assume a normal distribution.



A log-normal distribution, which corresponds to the exponential transformation of the standard normal distribution, is shown in Figure 2.4. An important feature of this distribution is that it has a long tail that points to the right, and it is thus termed "right skewed." The median and geometric mean for the example distribution are both 1.0, while the arithmetic mean is 1.65.

Table 2.2 Some Values for the t Distribution (The entries in the body of the table are the t values.)

Figure 2.4 A Log-Normal Distribution from Exponentially Transforming the Z-Scores for a Standard Normal Curve


Some measurements, such as counts of radioactive decay, are usually expressed as events per unit time. The Poisson distribution is often useful in describing discrete measurements of this type. If we consider the number of measurements, x, out of a group of N measurements that have a particular property (e.g., they are above some "bright line" value, such as effluent measurements exceeding a performance limitation), distributional models such as the binomial distribution may prove useful. The functional forms of these are given below:

P(x) = e^(−λ) λ^x / x!   [2.22]

P(x) = [N! / (x!(N − x)!)] p^x (1 − p)^(N−x)   [2.23]

In Equation [2.22], λ is the average number of events per unit time (e.g., counts per minute). In Equation [2.23], p is the probability that a single observation will be "positive" (e.g., exceed the "bright" line).

We may also be interested in the amount of time that will elapse until some event of interest occurs. The distributions describing such times are termed "waiting time" distributions. When time is continuous, the exponential and Weibull distributions are well known. When time is discrete (e.g., a number of measurement periods), waiting time is commonly described by the negative binomial distribution. An important aid in assigning a degree of confidence to the percent compliance is the Incomplete Beta function.

The distributions mentioned above are only a small fraction of the theoretical distributions that are of potential interest. Extensive discussion of statistical distributions can be found in Evans et al. (1993) and Johnson and Kotz (1969, 1970a, 1970b).

Does a Particular Statistical Distribution Provide a Useful Model?

Before discussing techniques for assessing the lack of utility of any particular statistical distribution to serve as a model for the data at hand, we need to point out a major shortcoming of statistics. We can never demonstrate that the data at hand arise as a sample from any particular distribution model. In other words, just because we can't reject a particular model as being useful doesn't mean that it is the only model that is useful. Other models might be as useful. We can, however, determine, within a specified degree of acceptable decision risk, that a particular statistical distribution does not provide a useful model for the data. The following procedures test for the "goodness of fit" of a particular model.

The Kolmogorov-Smirnov (K-S) Test for Goodness of Fit

The K-S test is a general goodness-of-fit test in the sense that it will apply to any hypothetical distribution that has a defined CDF, F(X). To apply this test in the case of the normal distribution:


B. Next we calculate the standardized Z-scores for each data value using Equation [2.19], with x̄ and s substituted for µ and σ.

C. We then calculate the F(X) value for each Z-score, either by using a table of the standard normal distribution or a statistics package or calculator that has built-in normal CDF calculations. If we are using a table, it is likely that F(x) values are presented for Z > 0. That is, we will have Z values ranging from something like zero to 4, together with the cumulative probabilities (F(x)) associated with these Z values. For negative Z values, we use the relationship:

F(−Z) = 1 − F(Z)

that is, the P value associated with a negative Z value is equal to one minus the P value associated with the positive Z value of the same magnitude (e.g., Z = −1.5 and Z = 1.5).

D. Next we calculate two measures of cumulative relative frequency:

C1 = RANK/N and C2 = (RANK − 1)/N

In both cases, N equals the sample size.

E. Now we calculate the absolute value of the difference between C1 and F(Z) and between C2 and F(Z) for each observation. That is:

DIFF1i = | C1i − F(Z)i | and DIFF2i = | C2i − F(Z)i |

F. Finally we select the largest of the DIFF1i and DIFF2i values. This is the value, Dmax, used to test for significance (also called the "test statistic").

This calculation is illustrated in Table 2.3. Here our test statistic is 0.1124. This can be compared to either a standard probability table for the K-S statistic (Table 2.4) or, in our example, Lilliefors' modification of the K-S probability table (Lilliefors, 1967; Dallal and Wilkinson, 1986). The reason that our example uses Lilliefors' modification of the K-S probabilities is that the K-S test compares a sample of measurements to a known CDF. In our example, F(X) was estimated using the sample mean x̄ and standard deviation S. Lilliefors' test corrects for the fact that F(X) is not really known a priori.
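Steps B through F can be collected into a short routine (our sketch, not the book's code; it computes Dmax for a normality check using the sample mean and SD, i.e., the situation where Lilliefors' correction applies):

```python
import math

def normal_cdf(z):
    # F(Z) for the standard normal, computed via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_dmax(xs):
    # steps B-F: standardize with the sample mean and SD, then take the
    # largest gap between the empirical and fitted normal CDFs
    n = len(xs)
    xbar = sum(xs) / n
    s = (sum((x - xbar) ** 2 for x in xs) / (n - 1)) ** 0.5
    dmax = 0.0
    for rank, x in enumerate(sorted(xs), start=1):
        fz = normal_cdf((x - xbar) / s)   # F(Z) from step C
        c1 = rank / n                     # C1 = RANK / N
        c2 = (rank - 1) / n               # C2 = (RANK - 1) / N
        dmax = max(dmax, abs(c1 - fz), abs(c2 - fz))
    return dmax
```

The resulting Dmax would then be referred to the K-S table, or to Lilliefors' modified table since x̄ and S were estimated from the data.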

Dallal and Wilkinson (1986) give an analytic approximation to find probability values for Lilliefors' test. For P < 0.10 and N between 5 and 100, this is given by:
