Geostatistics for Environmental Scientists
Second Edition
Richard Webster, Rothamsted Research, UK
Margaret A. Oliver, University of Reading, UK
2.4.1 Logarithmic transformation
2.4.2 Square root transformation
2.5 Exploratory data analysis and display
2.6.1 Target population and units
2.6.7 Increasing precision and efficiency
3 Prediction and Interpolation
4.5 Intrinsic variation and the variogram
4.5.1 Equivalence with covariance
4.9 Estimating semivariances and covariances
4.9.4 The experimental covariance function
5.1 Limitations on variogram functions
7.3.2 Smoothing characteristics of windows
7.4 Spectral analysis of the Caragabal transect
7.4.1 Bandwidths and confidence intervals
7.5 Further reading on spectral analysis
8 Local Estimation or Prediction: Kriging
8.1 General characteristics of kriging
8.7 Case study
8.7.1 Kriging with known measurement error
9 Kriging in the Presence of Trend and Factorial Kriging
11.4 Disjunctive kriging
11.4.1 Assumptions of Gaussian disjunctive kriging
A.7 Spatial analysis: the variogram
A.9 Spatial estimation or prediction: kriging
2 Basic Statistics
Before focusing on the main topic of this book, geostatistics, we want to ensure that readers have a sound understanding of the basic quantitative methods for obtaining and summarizing information on the environment. There are two aspects to consider: one is the choice of variables and how they are measured; the other, and more important, is how to sample the environment. This chapter deals with these. Chapter 3 will then consider how such records can be used for estimation, prediction and mapping in a classical framework.
The environment varies from place to place in almost every aspect. There are infinitely many places at which we might record what it is like, but practically we can measure it at only a finite number by sampling. Equally, there are many properties by which we can describe the environment, and we must choose those that are relevant. Our choice might be based on prior knowledge of the most significant descriptors or on a preliminary analysis of data to hand.
2.1 MEASUREMENT AND SUMMARY
The simplest kind of environmental variable is binary, in which there are only two possible states, such as present or absent, wet or dry, calcareous or non-calcareous (rock or soil). They may be assigned the values 1 and 0, and they can be treated as quantitative or numerical data. Other features, such as classes of soil, soil wetness, stratigraphy, and ecological communities, may be recorded qualitatively. These qualitative characters can be of two types: unordered and ranked. The structure of the soil, for example, is an unordered variable and may be classified into blocky, granular, platy, etc. Soil wetness classes (dry, moist, wet) are ranked in that they can be placed in order of increasing wetness. In both cases the classes may be recorded numerically, but the records should not be treated as if they were measured in any sense. They can be converted to sets of binary variables, called ‘indicators’ in geostatistics (see Chapter 11), and can often be analysed by non-parametric statistical methods.
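The conversion of qualitative classes to indicators can be sketched in a few lines of Python. This is only an illustration: the function name `to_indicators` and the five field records are hypothetical, not data from this book.

```python
def to_indicators(records, classes):
    """Return one list of 0/1 indicators per class, in the order given."""
    return {c: [1 if r == c else 0 for r in records] for c in classes}

# Hypothetical field records of soil structure (an unordered variable).
structure = ["blocky", "granular", "platy", "granular", "blocky"]
indicators = to_indicators(structure, ["blocky", "granular", "platy"])
```

Each record belongs to exactly one class, and so the indicators for any one record sum to 1 across the classes.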
The most informative records are those for which the variables are measured fully quantitatively on continuous scales with equal intervals. Examples include the soil’s thickness, its pH, the cadmium content of rock, and the proportion of land covered by vegetation. Some such scales have an absolute zero, whereas for others the zero is arbitrary. Temperature may be recorded in kelvin (absolute zero) or in degrees Celsius (arbitrary zero). Acidity can be measured by hydrogen ion concentration (with an absolute zero) or as its negative logarithm to base 10, pH, for which the zero is arbitrarily taken as $\log_{10} 1$ (in moles per litre). In most instances we need not distinguish between them. Some properties are recorded as counts, e.g. the number of roots in a given volume of soil, the pollen grains of a given species in a sample from a deposit, the number of plants of a particular type in an area. Such records can be analysed by many of the methods used for continuous variables if treated with care.
Properties measured on continuous scales are amenable to all kinds of mathematical operation and to many kinds of statistical analysis. They are the ones that we concentrate on because they are the most informative, and they provide the most precise estimates and predictions. The same statistical treatment can often be applied to binary data, though because the scale is so coarse the results may be crude and inference from them uncertain. In some instances a continuous variable is deliberately converted to binary, or to an ‘indicator’ variable, by cutting its scale at some specific value, as described in Chapter 11.
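Cutting a continuous scale at a threshold might be sketched as follows. The pH values, the threshold of 6.0 and the choice to code values at or below the threshold as 1 are all assumptions made for illustration; the convention actually used is the one set out in Chapter 11.

```python
def indicator(values, threshold):
    """Code each value as 1 if it is at or below the threshold, else 0.
    The direction of the inequality is an assumption of this sketch."""
    return [1 if v <= threshold else 0 for v in values]

ph = [4.8, 5.6, 6.3, 7.1, 5.9]        # hypothetical soil pH
acid = indicator(ph, 6.0)
proportion = sum(acid) / len(acid)    # mean of the indicator
```

The mean of the indicator is simply the proportion of the sample that does not exceed the threshold, which shows how coarse the binary record is compared with the original measurements.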
Sometimes, environmental variables are recorded on coarse stepped scales in the field because refined measurement is too expensive. Examples include the percentage of stones in the soil, the root density, and the soil’s strength. The steps in their scales are not necessarily equal in terms of measured values, but they are chosen as the best compromise between increments of equal practical significance and those with limits that can be detected consistently. These scales need to be treated with some caution for analysis, but they can often be treated [...] we shall not consider them in this book.
2.1.1 Notation
Another feature of environmental data is that they have spatial and temporal components as well as recorded values, which makes them unique or deterministic (we return to this point in Chapter 4). In representing the data we must distinguish measurement, location and time. For most classical statistical analyses location is irrelevant, but for geostatistics the location must be specified. We shall adhere to the following notation as far as possible throughout this text. Variables are denoted by italics: an upper-case $Z$ for random variables and lower-case $z$ for a realization, i.e. the actuality, and also for sample values of the realization. Spatial position, which may be in one, two or three dimensions, is denoted by bold $\mathbf{x}$. In most instances the space is two-dimensional, and so $\mathbf{x} = \{x_1, x_2\}$, signifying the vector of the two spatial coordinates. Thus $Z(\mathbf{x})$ means a random variable $Z$ at place $\mathbf{x}$, and $z(\mathbf{x})$ is the actual value of $Z$ at $\mathbf{x}$. In general, we shall use bold lower-case letters for vectors and bold capitals for matrices.
We shall use lower-case Greek letters for parameters of populations and, for their estimates, either the Latin equivalents or the Greek letters with circumflexes (^), commonly called ‘hats’ by statisticians, placed over them. For example, the standard deviation of a population will be denoted by $\sigma$ and its estimate by $s$ or $\hat{\sigma}$.
2.1.2 Representing variation
The environment varies in almost every aspect, and our first task is to describe that variation.
Frequency distribution: the histogram and box-plot
Any set of measurements may be divided into several classes, and we may count the number of individuals in each class. For a variable measured on a continuous scale we divide the measured range into classes of equal width and count the number of individuals falling into each. The resulting set of frequencies constitutes the frequency distribution, and its graph (with frequency on the ordinate and the variate values on the abscissa) is the histogram. Figures 2.1 and 2.4 are examples. The number of classes chosen depends on the number of individuals and the spread of values. In general, the fewer the individuals the fewer the classes needed or justified for representing them. Having equal class intervals ensures that the area under each bar is proportional to the frequency of the class. If the class intervals are not equal then the heights of the bars should be calculated so that the areas of the bars are proportional to the frequencies.
Figure 2.1 Histograms: (a) exchangeable potassium (K) in mg l$^{-1}$; (b) $\log_{10}$K, for the topsoil at Broom’s Barn Farm. The curves are of the (lognormal) probability density
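The rule for unequal class intervals can be illustrated with a short sketch. The function `histogram`, the twelve values and the class limits are hypothetical; the bar heights are taken as frequency divided by class width, which is one way of making the areas proportional to the frequencies.

```python
def histogram(values, edges):
    """Count values in the classes defined by consecutive edges and return
    (frequencies, bar heights), where height = frequency / class width."""
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    freq = [0] * len(widths)
    for v in values:
        for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
            # Classes are half-open [lo, hi); the top class also
            # includes its upper limit.
            if lo <= v < hi or (i == len(widths) - 1 and v == hi):
                freq[i] += 1
                break
    heights = [f / w for f, w in zip(freq, widths)]
    return freq, heights

k = [12, 15, 18, 21, 22, 24, 25, 27, 31, 38, 46, 70]    # hypothetical values
freq, heights = histogram(k, [10, 20, 30, 40, 80])      # last class is wider
```

In this example the last class is four times as wide as the others, and so its bar is drawn at a quarter of the height that its frequency alone would suggest.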
Another popular device for representing a frequency distribution is the box-plot. This is due to Tukey (1977). The plain ‘box and whisker’ diagram, like those in Figure 2.2, has a box enclosing the interquartile range, a line showing the median (see below), and ‘whiskers’ (lines) extending from the limits of the interquartile range to the extremes of the data, or to some other values such as the 90th percentiles.
Both the histogram and the box-plot enable us to picture the distribution to see how it lies about the mean or median and to identify extreme values.
Figure 2.2 Box-plots: (a) exchangeable K; (b) $\log_{10}$K showing the ‘box’ and ‘whiskers’, and (c) exchangeable K and (d) $\log_{10}$K showing the fences at the quartiles plus and minus 1.5 times the interquartile range.
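The quantities that a box-plot displays can be computed as in the sketch below, which uses Python's standard `statistics.quantiles` (Python 3.8 or later). The data are hypothetical, and the `"inclusive"` interpolation rule is an assumption: other conventions for quartiles give slightly different values for small samples.

```python
import statistics

def box_summary(values):
    """Quartiles, interquartile range and fences at 1.5 times that range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    beyond = [v for v in values if v < lower or v > upper]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "lower_fence": lower, "upper_fence": upper, "beyond": beyond}

k = [12, 15, 18, 21, 22, 24, 25, 27, 31, 38, 46, 70]    # hypothetical values
summary = box_summary(k)
```

Values listed under `beyond` lie outside the fences and are the ones that a box-plot of this kind would draw attention to as extreme.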
Cumulative distribution
The cumulative distribution of a set of $N$ observations is formed by ordering the measured values, $z_i$, $i = 1, 2, \ldots, N$, from the smallest to the largest, recording the order, say $k$, accumulating them, and then plotting $k$ against $z$. The resulting graph represents the proportion of values less than $z_k$ for all $k = 1, 2, \ldots, N$. The histogram can also be converted to a cumulative frequency diagram, though such a diagram is less informative because the data are grouped.
The methods of representing frequency distributions are illustrated in Figures 2.1–2.6.
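The ordering and accumulation described above might be sketched as follows. The function name and the four values are hypothetical, and the sketch reports $k/N$, the proportion of values less than or equal to $z_k$, which is one of several plotting conventions.

```python
def cumulative(values):
    """Return (z_k, k, k/N) for the ordered values, k = 1, ..., N."""
    ordered = sorted(values)
    n = len(ordered)
    return [(z, k, k / n) for k, z in enumerate(ordered, start=1)]

points = cumulative([3.1, 1.2, 2.5, 4.0])    # hypothetical measurements
```

Plotting the third element of each triple against the first gives the cumulative distribution of the ungrouped data.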
2.1.3 The centre
Three quantities are used to represent the ‘centre’ or ‘average’ of a set of measurements. These are the mean, the median and the mode, and we deal with them in turn.
$$\bar{z} = \frac{1}{N}\sum_{i=1}^{N} z_i.$$
This, the mean, is the usual measure of central tendency.
The mean takes account of all of the observations, it can be treated algebraically, and the sample mean is an unbiased estimate of the population mean. For capacity variables, such as the phosphorus content in the topsoil of fields or daily rainfall at a weather station, means can be multiplied to obtain gross values for larger areas or longer periods. Similarly, the mean concentration of a pollutant metal in the soil can be multiplied by the mass of soil to obtain a total load in a field or catchment. Further, addition or physical mixing should give the same result as averaging.
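As a small worked illustration of multiplying a mean to obtain a total load, consider the sketch below. The five cadmium concentrations and the mass of topsoil are invented for the purpose; only the arithmetic is the point.

```python
def mean(values):
    """Arithmetic average of a set of measurements."""
    return sum(values) / len(values)

cd_mg_per_kg = [0.42, 0.35, 0.51, 0.38, 0.44]   # hypothetical Cd concentrations
soil_mass_kg = 2.5e6                            # hypothetical mass of topsoil

z_bar = mean(cd_mg_per_kg)                      # mean concentration, mg per kg
load_g = z_bar * soil_mass_kg / 1000.0          # total load, converted mg to g
```

The calculation is valid because concentration multiplied by mass is a capacity quantity; the same multiplication would be meaningless for an intensity variable such as pH.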
Intensity variables are somewhat different. These are quantities such as barometric pressure and matric suction of the soil. Adding them or multiplying them does not make sense, but the average is still valuable as a measure of the centre. Physical mixing will in general not produce the arithmetic average. Some properties of the environment are not stable in the sense that bodies of material react with one another if they are mixed. For example, the average pH of a large volume of soil or lake water after mixing will not be the same as the average of the separate bodies of the soil or water that you measured previously. Chemical equilibration takes place. The same can be true for other exchangeable ions. So again, the average of a set of measurements is unlikely to be the same as a single measurement on a mixture.
Mode
The mode is the most typical value. It implies that the frequency distribution has a single peak. It is often difficult to determine the numerical value. If in a histogram the class interval is small then the mid-value of the most frequent class may be taken as the mode. For a symmetric distribution the mode, the mean and the median are in principle the same. For an asymmetric one
$$(\text{mode} - \text{median}) \approx 2 \times (\text{median} - \text{mean}). \qquad (2.2)$$
In asymmetric distributions, e.g. Figures 2.1(a) and 2.4(a), the median and mode lie further from the longer tail of the distribution than the mean, and the median lies between the mode and the mean.
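The approximate relation (2.2) can be checked on a distribution for which all three centres are known exactly. The sketch below uses a lognormal distribution, whose mode, median and mean are $\exp(\mu - \sigma^2)$, $\exp(\mu)$ and $\exp(\mu + \sigma^2/2)$ when $\mu$ and $\sigma^2$ are the mean and variance of the logarithms. The choice of distribution and of the parameter values is ours, made only for illustration.

```python
import math

def lognormal_centres(mu, sigma):
    """Mode, median and mean of a lognormal distribution whose logarithm
    has mean mu and standard deviation sigma."""
    mode = math.exp(mu - sigma ** 2)
    median = math.exp(mu)
    mean = math.exp(mu + sigma ** 2 / 2.0)
    return mode, median, mean

mode, median, mean = lognormal_centres(mu=0.0, sigma=0.3)
ratio = (mode - median) / (median - mean)    # close to, but not exactly, 2
```

For this moderately skewed case the ratio is somewhat less than 2, and the three centres fall in the order mode, median, mean, with the mean nearest the long upper tail.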
2.1.4 Dispersion
There are several measures for describing the spread of a set of measurements: the range, interquartile range, mean deviation, standard deviation and its square, the variance. These last two are so much easier to treat mathematically, and so much more useful therefore, that we concentrate on them almost to the exclusion of the others.
Variance and standard deviation
The variance of a set of values, which we denote $S^2$, is by definition
$$S^2 = \frac{1}{N}\sum_{i=1}^{N}(z_i - \bar{z})^2. \qquad (2.3)$$
The variance is the second moment about the mean. Like the mean, it is based on all of the observations, it can be treated algebraically, and it is little affected by sampling fluctuations. It is both additive and positive. Its analysis and use are backed by a huge body of theory. Its square root is the standard deviation, $S$. Below we shall replace the divisor $N$ by $N - 1$ so that we can use the variance of a sample to estimate $\sigma^2$, the population variance, without bias.
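The effect of the divisor can be seen in a short sketch. The function `variance` and its `ddof` argument (the amount subtracted from $N$ in the divisor) are our own naming, and the eight values are hypothetical.

```python
import math

def variance(values, ddof=0):
    """Mean squared deviation from the mean, with divisor N - ddof.
    ddof=0 gives S^2 as in equation (2.3); ddof=1 gives the sample estimate."""
    n = len(values)
    z_bar = sum(values) / n
    return sum((z - z_bar) ** 2 for z in values) / (n - ddof)

z = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical measurements
S2 = variance(z)                   # divisor N
s2 = variance(z, ddof=1)           # divisor N - 1
S = math.sqrt(S2)                  # standard deviation
```

The two results differ by the factor $N/(N-1)$, which matters for small samples and becomes negligible as $N$ grows.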
The coefficient of variation is useful for comparing the variation of different sets of observations of the same property. It has little merit for properties with scales having arbitrary zeros and for comparing different properties except where they can be measured on the same scale.
Skewness
The skewness measures the asymmetry of the observations. It is defined formally from the third moment about the mean:
$$m_3 = \frac{1}{N}\sum_{i=1}^{N}(z_i - \bar{z})^3. \qquad (2.5)$$
The coefficient of skewness is then
$$g_1 = \frac{m_3}{m_2\sqrt{m_2}} = \frac{m_3}{S^3},$$
where $m_2$ is the variance. Symmetric distributions have $g_1 = 0$. Skewness is the most common departure from normality (see below) in measured environmental data. If the data are skewed then there is some doubt as to which measure of centre to use. Comparisons between the means of different sets of observations are especially unreliable because the variances can differ substantially from one set to another.
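The coefficient of skewness follows directly from the moments about the mean, as the sketch below shows. The helper `moment` and the two small data sets are hypothetical; the divisor is $N$ throughout, as in the definitions above.

```python
import math

def moment(values, order):
    """Moment of the given order about the mean, with divisor N."""
    n = len(values)
    z_bar = sum(values) / n
    return sum((z - z_bar) ** order for z in values) / n

def skewness(values):
    """Coefficient of skewness, g1 = m3 / (m2 * sqrt(m2))."""
    m2 = moment(values, 2)
    m3 = moment(values, 3)
    return m3 / (m2 * math.sqrt(m2))

g1_symmetric = skewness([1, 2, 3, 4, 5])       # symmetric about 3
g1_skewed = skewness([1, 1, 2, 2, 3, 10])      # long upper tail
```

A positive coefficient indicates a long upper tail, as is common in environmental data, and a negative one a long lower tail.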
The kurtosis expresses the peakedness of a distribution. It is obtained from the fourth moment about the mean:
$$m_4 = \frac{1}{N}\sum_{i=1}^{N}(z_i - \bar{z})^4.$$
[...] $g_2 < 0$
2.2 THE NORMAL DISTRIBUTION
The normal distribution is central to statistical theory. It has been found to describe remarkably well the errors of observation in physics. Many environmental variables, such as those of the soil, are distributed in a way that approximates the normal distribution. The form of the distribution was discovered independently by De Moivre, Laplace and Gauss, but Gauss seems generally to take the credit for it, and the distribution is often called ‘Gaussian’. It is defined for a continuous random variable $Z$ in terms of the probability density function (pdf), $f(z)$, as
$$f(z) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(z-\mu)^2}{2\sigma^2}\right\},$$
where $\mu$ is the mean of the distribution and $\sigma^2$ is the variance.
The shape of the normal distribution is that of a vertical cross-section through a bell. It is continuous and symmetrical, with its peak at the mean of the distribution. It has two points of inflexion, one on each side of the mean at a distance $\sigma$. The ordinate $f(z)$ at any given value of $z$ is the probability density at $z$. The total area under the curve is 1, the total probability of the distribution. The area under any portion of the curve, say between $z_1$ and $z_2$, represents the proportion of the distribution lying in that range. For instance, slightly more than two-thirds of the distribution lies within one standard deviation of the mean, i.e. between $\mu - \sigma$ and $\mu + \sigma$; about 95% lies in the range $\mu - 2\sigma$ to $\mu + 2\sigma$; and 99.73% lies within three standard deviations of the mean.
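The proportions quoted above can be verified numerically. The sketch below obtains the area under the normal curve from the error function in Python's standard `math` module; the function names are our own.

```python
import math

def normal_cdf(z, mu=0.0, sigma=1.0):
    """Area under the normal pdf to the left of z."""
    return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))

def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of the distribution between mu - k*sigma and mu + k*sigma."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)

within = {k: proportion_within(k) for k in (1, 2, 3)}
```

The results are approximately 0.683, 0.954 and 0.9973 for one, two and three standard deviations, and they do not depend on the particular values of the mean and standard deviation.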
Just as the frequency distribution can be represented as a cumulative distribution, so too can the pdf. In this representation the normal distribution