Geostatistics for Environmental Scientists

Second Edition

Richard Webster Rothamsted Research, UK Margaret A Oliver University of Reading, UK

2.4.1 Logarithmic transformation
2.4.2 Square root transformation
2.5 Exploratory data analysis and display
2.6.1 Target population and units
2.6.7 Increasing precision and efficiency
3 Prediction and Interpolation
4.5 Intrinsic variation and the variogram
4.5.1 Equivalence with covariance
4.9 Estimating semivariances and covariances
4.9.4 The experimental covariance function
5.1 Limitations on variogram functions
7.3.2 Smoothing characteristics of windows
7.4 Spectral analysis of the Caragabal transect
7.4.1 Bandwidths and confidence intervals
7.5 Further reading on spectral analysis
8 Local Estimation or Prediction: Kriging
8.1 General characteristics of kriging
8.7 Case study
8.7.1 Kriging with known measurement error
9 Kriging in the Presence of Trend and Factorial Kriging
11.4 Disjunctive kriging
11.4.1 Assumptions of Gaussian disjunctive kriging
A.7 Spatial analysis: the variogram
A.9 Spatial estimation or prediction: kriging

2 Basic Statistics

Before focusing on the main topic of this book, geostatistics, we want to ensure that readers have a sound understanding of the basic quantitative methods for obtaining and summarizing information on the environment. There are two aspects to consider: one is the choice of variables and how they are measured; the other, and more important, is how to sample the environment. This chapter deals with these. Chapter 3 will then consider how such records can be used for estimation, prediction and mapping in a classical framework.

The environment varies from place to place in almost every aspect. There are infinitely many places at which we might record what it is like, but practically we can measure it at only a finite number by sampling. Equally, there are many properties by which we can describe the environment, and we must choose those that are relevant. Our choice might be based on prior knowledge of the most significant descriptors or from a preliminary analysis of data to hand.

2.1 MEASUREMENT AND SUMMARY

The simplest kind of environmental variable is binary, in which there are only two possible states, such as present or absent, wet or dry, calcareous or non-calcareous (rock or soil). They may be assigned the values 1 and 0, and they can be treated as quantitative or numerical data. Other features, such as classes of soil, soil wetness, stratigraphy, and ecological communities, may be recorded qualitatively. These qualitative characters can be of two types: unordered and ranked. The structure of the soil, for example, is an unordered variable and may be classified into blocky, granular, platy, etc. Soil wetness classes (dry, moist, wet) are ranked in that they can be placed in order of increasing wetness. In both cases the classes may be recorded numerically, but the records should not be treated as if they were measured in any sense. They can be converted to sets of binary variables, called ‘indicators’ in geostatistics (see Chapter 11), and can often be analysed by non-parametric statistical methods.
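The conversion of class records to sets of binary variables can be sketched in a few lines. This is only an illustration: the wetness records below are invented, and `to_indicators` is a hypothetical helper name, not something defined in the text.

```python
# Hypothetical wetness records at five sampling points (illustrative only).
wetness = ["dry", "moist", "wet", "moist", "dry"]
classes = ["dry", "moist", "wet"]  # the ranked classes named in the text


def to_indicators(records, class_names):
    """Return one 0/1 list per class: 1 where the record belongs to that class."""
    return {c: [1 if r == c else 0 for r in records] for c in class_names}


indicators = to_indicators(wetness, classes)
for name, values in indicators.items():
    print(name, values)
```

Each sampling point scores 1 in exactly one indicator, so the class codes themselves never enter any arithmetic.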

The most informative records are those for which the variables are measured fully quantitatively on continuous scales with equal intervals. Examples include the soil’s thickness, its pH, the cadmium content of rock, and the proportion of land covered by vegetation. Some such scales have an absolute zero, whereas for others the zero is arbitrary. Temperature may be recorded in kelvin (absolute zero) or in degrees Celsius (arbitrary zero). Acidity can be measured by hydrogen ion concentration (with an absolute zero) or as its negative logarithm to base 10, pH, for which the zero is arbitrarily taken as $-\log_{10} 1$ (in moles per litre). In most instances we need not distinguish between them. Some properties are recorded as counts, e.g. the number of roots in a given volume of soil, the pollen grains of a given species in a sample from a deposit, the number of plants of a particular type in an area. Such records can be analysed by many of the methods used for continuous variables if treated with care.

Properties measured on continuous scales are amenable to all kinds of mathematical operation and to many kinds of statistical analysis. They are the ones that we concentrate on because they are the most informative, and they provide the most precise estimates and predictions. The same statistical treatment can often be applied to binary data, though because the scale is so coarse the results may be crude and inference from them uncertain. In some instances a continuous variable is deliberately converted to binary, or to an ‘indicator’ variable, by cutting its scale at some specific value, as described in Chapter 11.
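Cutting a continuous scale at a specific value amounts to a simple comparison. The sketch below assumes the convention that the indicator is 1 at or below the cut and 0 above it; the potassium values and the cut at 25 are invented for illustration.

```python
def indicator(values, threshold):
    """Return 1 where a value is at or below the threshold, otherwise 0.

    The 'at or below' convention is an assumption made for this sketch.
    """
    return [1 if v <= threshold else 0 for v in values]


potassium = [12.0, 26.5, 31.0, 18.2, 40.7]  # hypothetical measurements
below_cut = indicator(potassium, 25.0)
proportion_below = sum(below_cut) / len(below_cut)
print(below_cut, proportion_below)
```

The mean of the indicator is the proportion of records at or below the cut, which is one reason the binary form can still be treated numerically.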

Sometimes, environmental variables are recorded on coarse stepped scales in the field because refined measurement is too expensive. Examples include the percentage of stones in the soil, the root density, and the soil’s strength. The steps in their scales are not necessarily equal in terms of measured values, but they are chosen as the best compromise between increments of equal practical significance and those with limits that can be detected consistently. These scales need to be treated with some caution for analysis, but they can often be treated [...] we shall not consider them in this book.

2.1.1 Notation

Another feature of environmental data is that they have spatial and temporal components as well as recorded values, which makes them unique or deterministic (we return to this point in Chapter 4). In representing the data we must distinguish measurement, location and time. For most classical statistical analyses location is irrelevant, but for geostatistics the location must be specified. We shall adhere to the following notation as far as possible throughout this text. Variables are denoted by italics: an upper-case $Z$ for random variables and lower-case $z$ for a realization, i.e. the actuality, and also for sample values of the realization. Spatial position, which may be in one, two or three dimensions, is denoted by bold $\mathbf{x}$. In most instances the space is two-dimensional, and so $\mathbf{x} = \{x_1, x_2\}$, signifying the vector of the two spatial coordinates. Thus $Z(\mathbf{x})$ means a random variable $Z$ at place $\mathbf{x}$, and $z(\mathbf{x})$ is the actual value of $Z$ at $\mathbf{x}$. In general, we shall use bold lower-case letters for vectors and bold capitals for matrices.

We shall use lower-case Greek letters for parameters of populations and either their Latin equivalents or place circumflexes (^), commonly called ‘hats’ by statisticians, over the Greek for their estimates. For example, the standard deviation of a population will be denoted by $\sigma$ and its estimate by $s$ or $\hat{\sigma}$.

2.1.2 Representing variation

The environment varies in almost every aspect, and our first task is to describe that variation.

Frequency distribution: the histogram and box-plot

Any set of measurements may be divided into several classes, and we may count the number of individuals in each class. For a variable measured on a continuous scale we divide the measured range into classes of equal width and count the number of individuals falling into each. The resulting set of frequencies constitutes the frequency distribution, and its graph (with frequency on the ordinate and the variate values on the abscissa) is the histogram. Figures 2.1 and 2.4 are examples. The number of classes chosen depends on the number of individuals and the spread of values. In general, the fewer the individuals the fewer the classes needed or justified for representing them. Having equal class intervals ensures that the area under each bar is proportional to the frequency of the class. If the class intervals are not equal then the heights of the bars should be calculated so that the areas of the bars are proportional to the frequencies.

Figure 2.1 Histograms: (a) exchangeable potassium (K) in mg l$^{-1}$; (b) $\log_{10}$K, for the topsoil at Broom’s Barn Farm. The curves are of the (lognormal) probability density.
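The rule for unequal class intervals can be made concrete with a short sketch: dividing each class count by the class width gives bar heights whose areas are proportional to the frequencies. The data, the class limits and the function name `histogram_heights` are all invented for illustration.

```python
def histogram_heights(values, edges):
    """Count values per class and return (counts, heights).

    height = count / class width, so each bar's area equals its class count.
    Classes are half-open [lower, upper), except the last, which also
    includes its upper limit.
    """
    n_classes = len(edges) - 1
    counts = [0] * n_classes
    for v in values:
        for k in range(n_classes):
            is_last = k == n_classes - 1
            if edges[k] <= v < edges[k + 1] or (is_last and v == edges[k + 1]):
                counts[k] += 1
                break
    heights = [counts[k] / (edges[k + 1] - edges[k]) for k in range(n_classes)]
    return counts, heights


values = [1, 2, 2, 3, 4, 6, 7, 9, 12, 18]  # hypothetical measurements
edges = [0, 2, 4, 8, 20]                   # unequal class limits
counts, heights = histogram_heights(values, edges)
print(counts, heights)
```

With equal class widths the division changes every bar by the same factor, which is why plain counts can be plotted directly in that case.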

Another popular device for representing a frequency distribution is the box-plot. This is due to Tukey (1977). The plain ‘box and whisker’ diagram, like those in Figure 2.2, has a box enclosing the interquartile range, a line showing the median (see below), and ‘whiskers’ (lines) extending from the limits of the interquartile range to the extremes of the data, or to some other values such as the 90th percentiles.

Both the histogram and the box-plot enable us to picture the distribution to see how it lies about the mean or median and to identify extreme values.

Figure 2.2 Box-plots: (a) exchangeable K; (b) $\log_{10}$K showing the ‘box’ and ‘whiskers’, and (c) exchangeable K and (d) $\log_{10}$K showing the fences at the quartiles plus and minus 1.5 times the interquartile range.
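The quantities drawn in a box-plot with fences can be computed as in the sketch below. It assumes one particular quartile convention (the "inclusive" method of Python's `statistics.quantiles`); other conventions give slightly different quartiles. The data are invented.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, fences at 1.5 x IQR beyond the quartiles, and values outside them."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outside = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outside": outside,
    }


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data with one extreme value
summary = box_plot_summary(values)
print(summary)
```

Values beyond the fences are the ones a box-plot of this kind singles out as extreme.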

Cumulative distribution

The cumulative distribution of a set of $N$ observations is formed by ordering the measured values, $z_i$, $i = 1, 2, \ldots, N$, from the smallest to the largest, recording the order, say $k$, accumulating them, and then plotting $k$ against $z$. The resulting graph represents the proportion of values less than $z_k$ for all $k = 1, 2, \ldots, N$. The histogram can also be converted to a cumulative frequency diagram, though such a diagram is less informative because the data are grouped.
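As a minimal sketch of that construction, the function below orders the values and pairs each with its rank expressed as a proportion, $k/N$. Whether the value $z_k$ itself is counted in the proportion is a matter of convention; here it is, which is an assumption of this sketch. The data are invented.

```python
def cumulative_distribution(values):
    """Return (z_k, k / N) pairs with the values in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    return [(z, k / n) for k, z in enumerate(ordered, start=1)]


observations = [3.1, 1.2, 2.5, 4.0]  # hypothetical measurements
for z, proportion in cumulative_distribution(observations):
    print(z, proportion)
```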

The methods of representing frequency distribution are illustrated in Figures 2.1–2.6.

2.1.3 The centre

Three quantities are used to represent the ‘centre’ or ‘average’ of a set of measurements. These are the mean, the median and the mode, and we deal with them in turn.

$$\bar{z} = \frac{1}{N} \sum_{i=1}^{N} z_i.$$

This, the mean, is the usual measure of central tendency.

The mean takes account of all of the observations, it can be treated algebraically, and the sample mean is an unbiased estimate of the population mean. For capacity variables, such as the phosphorus content in the topsoil of fields or daily rainfall at a weather station, means can be multiplied to obtain gross values for larger areas or longer periods. Similarly, the mean concentration of a pollutant metal in the soil can be multiplied by the mass of soil to obtain a total load in a field or catchment. Further, addition or physical mixing should give the same result as averaging.

Intensity variables are somewhat different. These are quantities such as barometric pressure and matric suction of the soil. Adding them or multiplying them does not make sense, but the average is still valuable as a measure of the centre. Physical mixing will in general not produce the arithmetic average. Some properties of the environment are not stable in the sense that bodies of material react with one another if they are mixed. For example, the average pH of a large volume of soil or lake water after mixing will not be the same as the average of the separate bodies of the soil or water that you measured previously. Chemical equilibration takes place. The same can be true for other exchangeable ions. So again, the average of a set of measurements is unlikely to be the same as a single measurement on a mixture.
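A deliberately simplified calculation shows why averaging pH differs from measuring a mixture, even before any chemical reaction is considered. The sketch assumes equal volumes and that hydrogen ions merely dilute, with no buffering or equilibration; real soils and waters will depart from it further. The two pH values are invented.

```python
import math


def ph_after_mixing(ph_values):
    """pH of an equal-volume mixture, assuming hydrogen ions simply dilute.

    Buffering and chemical equilibration are ignored in this sketch.
    """
    concentrations = [10.0 ** (-ph) for ph in ph_values]  # mol per litre
    mean_concentration = sum(concentrations) / len(concentrations)
    return -math.log10(mean_concentration)


samples = [4.0, 6.0]                    # hypothetical pH of two bodies of water
mean_ph = sum(samples) / len(samples)   # arithmetic average of the pH values
mixed_ph = ph_after_mixing(samples)     # pH implied by averaging concentrations
print(mean_ph, round(mixed_ph, 2))
```

Because pH is a logarithm, the more acid body dominates the mixture, so the mixed value lies below the arithmetic average of the two readings.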

Mode

The mode is the most typical value. It implies that the frequency distribution has a single peak. It is often difficult to determine the numerical value. If in a histogram the class interval is small then the mid-value of the most frequent class may be taken as the mode. For a symmetric distribution the mode, the mean and the median are in principle the same. For an asymmetric one

$$(\text{mode} - \text{median}) \approx 2 \times (\text{median} - \text{mean}). \tag{2.2}$$

In asymmetric distributions, e.g. Figures 2.1(a) and 2.4(a), the median and mode lie further from the longer tail of the distribution than the mean, and the median lies between the mode and the mean.
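Since the mode is hard to read off directly, the approximate relation between mode, median and mean can be rearranged to give a rough value for it: mode ≈ median + 2 × (median − mean). The sketch below applies that rearrangement to invented, positively skewed data; it is a rule of thumb, not an exact result.

```python
import statistics

# Hypothetical positively skewed measurements (long tail towards large values).
values = [1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.4, 2.7, 3.0, 3.6]

mean = statistics.mean(values)
median = statistics.median(values)

# Rearranged empirical relation: (mode - median) ~ 2 * (median - mean).
approximate_mode = median + 2.0 * (median - mean)

print(round(mean, 3), round(median, 3), round(approximate_mode, 3))
```

For these data the three quantities fall in the order described in the text: the mean lies nearest the long tail, the estimated mode furthest from it, and the median between them.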

2.1.4 Dispersion

There are several measures for describing the spread of a set of measurements: the range, interquartile range, mean deviation, standard deviation and its square, the variance. These last two are so much easier to treat mathematically, and so much more useful therefore, that we concentrate on them almost to the exclusion of the others.

Variance and standard deviation

The variance of a set of values, which we denote $S^2$, is by definition

$$S^2 = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})^2. \tag{2.3}$$

The variance is the second moment about the mean. Like the mean, it is based on all of the observations, it can be treated algebraically, and it is little affected by sampling fluctuations. It is both additive and positive. Its analysis and use are backed by a huge body of theory. Its square root is the standard deviation, $S$. Below we shall replace the divisor $N$ by $N - 1$ so that we can use the variance of a sample to estimate $\sigma^2$, the population variance, without bias.
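The two divisors can be compared directly. In this sketch the function takes a hypothetical `ddof` argument (an illustrative name): 0 gives the divisor $N$ of the definition, 1 gives $N - 1$ for estimating the population variance from a sample. The data are invented.

```python
import math


def variance(values, ddof=0):
    """Mean squared deviation from the mean, with divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    return sum((z - mean) ** 2 for z in values) / (n - ddof)


values = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical measurements
s2_definition = variance(values)            # divisor N
s2_estimate = variance(values, ddof=1)      # divisor N - 1
standard_deviation = math.sqrt(s2_definition)
print(s2_definition, round(s2_estimate, 3), standard_deviation)
```

The difference between the two shrinks as $N$ grows, so the choice of divisor matters most for small samples.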

The coefficient of variation, the standard deviation expressed relative to the mean, is useful for comparing the variation of different sets of observations of the same property. It has little merit for properties with scales having arbitrary zeros and for comparing different properties except where they can be measured on the same scale.

Skewness

The skewness measures the asymmetry of the observations. It is defined formally from the third moment about the mean:

$$m_3 = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})^3. \tag{2.5}$$

The coefficient of skewness is then

$$g_1 = \frac{m_3}{m_2 \sqrt{m_2}} = \frac{m_3}{S^3},$$

where $m_2$ is the variance. Symmetric distributions have $g_1 = 0$. Skewness is the most common departure from normality (see below) in measured environmental data. If the data are skewed then there is some doubt as to which measure of centre to use. Comparisons between the means of different sets of observations are especially unreliable because the variances can differ substantially from one set to another.
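The coefficient of skewness, the third moment about the mean divided by the variance times its square root, can be computed as in the sketch below. The function names and the two data sets are invented for illustration, and the moments use the divisor $N$ as in the definitions.

```python
import math


def central_moment(values, order):
    """Moment of the given order about the mean, with divisor N."""
    n = len(values)
    mean = sum(values) / n
    return sum((z - mean) ** order for z in values) / n


def skewness(values):
    """Coefficient of skewness g1 = m3 / (m2 * sqrt(m2))."""
    m2 = central_moment(values, 2)
    m3 = central_moment(values, 3)
    return m3 / (m2 * math.sqrt(m2))


symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric data
long_tailed = [1, 1, 2, 2, 3, 9]   # hypothetical data with a long upper tail
print(skewness(symmetric), round(skewness(long_tailed), 3))
```

A long upper tail gives a positive coefficient, and reversing the data about their mean would change only its sign.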

The kurtosis expresses the peakedness of a distribution. It is obtained from the fourth moment about the mean:

$$m_4 = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})^4.$$

[...] $g_2 < 0$.

2.2 THE NORMAL DISTRIBUTION

The normal distribution is central to statistical theory. It has been found to describe remarkably well the errors of observation in physics. Many environmental variables, such as those of the soil, are distributed in a way that approximates the normal distribution. The form of the distribution was discovered independently by De Moivre, Laplace and Gauss, but Gauss seems generally to take the credit for it, and the distribution is often called ‘Gaussian’. It is defined for a continuous random variable $Z$ in terms of the probability density function (pdf), $f(z)$, as

$$f(z) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{(z - \mu)^2}{2\sigma^2} \right\},$$

where $\mu$ is the mean of the distribution and $\sigma^2$ is the variance.

The shape of the normal distribution is a vertical cross-section through a bell. It is continuous and symmetrical, with its peak at the mean of the distribution. It has two points of inflexion, one on each side of the mean at a distance $\sigma$. The ordinate $f(z)$ at any given value of $z$ is the probability density at $z$. The total area under the curve is 1, the total probability of the distribution. The area under any portion of the curve, say between $z_1$ and $z_2$, represents the proportion of the distribution lying in that range. For instance, slightly more than two-thirds of the distribution lies within one standard deviation of the mean, i.e. between $\mu - \sigma$ and $\mu + \sigma$; about 95% lies in the range $\mu - 2\sigma$ to $\mu + 2\sigma$; and 99.73% lies within three standard deviations of the mean.
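These proportions can be checked numerically. For a normal distribution the proportion lying within $k$ standard deviations of the mean equals $\operatorname{erf}(k/\sqrt{2})$, a standard identity; the function name below is an illustrative choice.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(k, round(100.0 * proportion_within(k), 2), "%")
```

The result does not depend on the mean or the variance, which is why the same three proportions apply to any normally distributed variable.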

Just as the frequency distribution can be represented as a cumulative distribution, so too can the pdf. In this representation the normal distribution
