© 2002 By CRC Press LLC

The visual impression from the top plot is that the vertical deviations are greater for large values of time, but the residual plot (bottom) shows that the curve does not fit the points at the shorter times; in this region the residuals are large and predominantly positive. Tukey (1977) calls this process of plotting residuals flattening the data. He emphasizes its power to shift our attention from the fitted line to the discrepancies between prediction and observation. It is these discrepancies that contain the information needed to improve the model.
Make it a habit to examine the residuals of a fitted model, including deviations from a simple mean. Check for normality by making a dot diagram or histogram. Plot the residuals against the predicted values, against the predictor variables, and as a function of the time order in which the measurements were made. Residuals that appear to be random and to have uniform variance are persuasive evidence that the model has no serious deficiencies. If the residuals show a trend, it is evidence that the model is inadequate. If the residuals spread out, it suggests that a data transformation is probably needed. Figure 3.12 is a calibration curve for measuring chloride using an ion chromatograph. There are three replicate measures at each concentration level. The hidden variation of the replicates is revealed in Figure 3.13,
FIGURE 3.11 Graphing residuals. The visual impression from the top plot is that the vertical deviations are greater for large values of time, but the residual plot (bottom) shows that the curve does not fit the points at low times.
FIGURE 3.12 Calibration curve for measuring chloride with an ion chromatograph. There are three replicate measurements at each of the 13 levels of chloride.
[Figures 3.11 and 3.12 appear here; axes are time (hours) and standard conc (mg/L).]
L1592_frame_C03 Page 32 Tuesday, December 18, 2001 1:41 PM
which has flattened the data by looking at deviations from the average of the three values at each level. An important fact is revealed: the measurement error (variation) tends to increase as the concentration increases. This must be taken into account when fitting the calibration curve to the data.
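The replicate "flattening" just described takes only a few lines of code. The sketch below uses hypothetical chloride peak values, not the actual data behind Figure 3.13; only the procedure — subtracting each level's replicate mean so the residuals can be plotted against concentration — is the point.

```python
# Sketch of flattening replicate calibration data by subtracting the
# replicate mean at each standard level. The peak values are invented.

def replicate_residuals(groups):
    """Return, for each concentration level, the deviations of the
    replicates from their own mean."""
    residuals = {}
    for conc, reps in groups.items():
        mean = sum(reps) / len(reps)
        residuals[conc] = [r - mean for r in reps]
    return residuals

# Hypothetical replicate peaks at three standard concentrations (mg/L)
data = {10: [102.0, 98.0, 100.0],
        100: [1010.0, 990.0, 1000.0],
        500: [5100.0, 4900.0, 5000.0]}

res = replicate_residuals(data)
for conc in sorted(res):
    print(conc, res[conc])
```

Plotting these residuals against concentration makes the pattern noted above visible: the spread of the deviations grows with the concentration level.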
A Note on Clarity and Style
Here are the words of some people who have devoted their talent and energy to improving the quality of graphical presentations of statistical data.
“Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency.” Edward Tufte (1983)
“The greatest possibilities of visual display lie in vividness and inescapability of the intended message.” John Tukey (1990)
“Graphing data should be an iterative experimental process.” Cleveland (1994)
Tufte (1983) emphasizes clarity and simplicity in graphics. Wainer (1997) uses elegance, grace, and impact to describe good graphics. Cleveland (1994) emphasizes clarity, precision, and efficiency. William Playfair (1786), a pioneer and innovator in the use of statistical graphics, desired to tell a story graphically as well as dramatically.
Vividness, drama, elegance, grace, clarity, and impact are not technical terms, and the ideas they convey are not easy to capture in technical rules, but Cleveland (1994) and Tufte (1983) have suggested basic principles that will produce better graphics. Tufte (1983) says that graphical excellence:
• is the well-designed presentation of interesting data: a matter of substance, of statistics, and of design
• consists of complex ideas communicated with clarity, precision, and efficiency
• is that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space
• is almost always multivariate
• requires telling the truth about the data
These guidelines discourage fancified graphs with multiple fonts, cross hatching, and 3-D effects. They do not say that color is necessary or helpful. A poor graph does not become better because color is added.
Style is to a large extent personal. Let us look at five graphical versions of the same data in Figure 3.14. The graphs show how the downward trend in the average number of bald eagle hatchlings in northwestern Ontario reversed after DDT was banned in 1973. The top graphic (so easily produced by computer graphics) does not facilitate understanding the data. It is loaded with what Tufte (1983) calls chartjunk: three-dimensional boxes and shading. “Every bit of ink on a graphic requires a reason. And nearly always that reason should be that the ink presents new information” (Tufte, 1983). The two bar charts in the
FIGURE 3.13 Residuals of the chloride data with respect to the average peak value at each concentration level.
[Figure 3.13 appears here; the horizontal axis is standard concentration (mg/L).]
middle row are clear. The version on the right is cleaner and clearer (the box frame is not needed). The white lines through the bars serve as the vertical scale. The two graphs in the bottom row are better yet. The bars become dots with a line added to emphasize the trend. The version on the right smooths the trend with a curve and adds a note to show when DDT was banned.

Most data sets, like this simple one, can be plotted in a variety of ways. The viewer will appreciate the effort required to explore variations and present one that is clear, precise, and efficient in the presentation of the essential information.
Should We Always Plot the Data?
According to Farquhar and Farquhar (1891), two 19th-century economists, “Getting information from a table is like extracting sunlight from a cucumber.” A virtually perfect rule of statistics is “Plot the data.” There are times, however, when a plot is unnecessary. Figure 3.15 is an example. This is a simplified reproduction (shading removed) of a published graph that showed five values:

pH = 5, COD = 2300 mg/L, BOD = 1500 mg/L, TSS = 875 mg/L, TDS = 5700 mg/L
FIGURE 3.14 Several versions of plots that show how banning DDT helped the recovery of the bald eagle population in northwestern Ontario.
[Five versions of Figure 3.14 appear here; axes are year (1966–1980) and average number of bald eagle hatchlings per area, with one panel annotated “DDT banned in Ontario in 1973”.]
These five values say it all, and better than the graph. Do not use an axe to hack your way through an open door. Aside from being unnecessary, this chart has three major faults. It confuses units: pH is not measured in mg/L. Three-dimensional effects make it more difficult to read the numerical values. Using a log scale makes the values seem nearly the same when they are much different; the 875 mg/L TSS and the 1500 mg/L BOD have bars that are nearly the same height.
Summary
Graphical methods are obviously useful for both initial and exploratory data analyses, but they also serve us well in the final analysis. “A picture is worth a thousand words” is a cliché, but still powerfully true. The right graph may reveal all that is important. If it only tells part of the story, that is the part that is most likely to be remembered.
Tables of numbers camouflage the interesting features of data. The human mind, which is remarkably well adapted to so many and varied tasks, is simply not capable of extracting useful information from tabulated figures. Putting these same numbers in appropriate graphical form completely changes the situation. The informed human mind can then operate efficiently with these graphs as inputs. In short, suitable graphs of data and the human mind are an effective combination; endless tables of data and the mind are not.
It is extremely important that plots be kept current because the first purpose of keeping these plots is to help monitor and, if necessary, to troubleshoot difficulties as they arise. The plots do not have to be beautiful, or computer drafted, to be useful. Make simple plots by hand as the data become available. If the plots are made at some future date to provide a record of what happened in the distant past, it will be too late to take appropriate action to improve performance. The second purpose is to have an accurate record of what has happened in the past, especially if the salient information is in such a form that it is easily communicated and readily understood. If they are kept up-to-date and used for the first purpose, they can also be used for the second. On the other hand, if they are not kept up-to-date, they may be useful for the second purpose only. In the interest of efficiency, they ought to serve double duty.
Intelligent data analysis begins with plotting the data. Be imaginative. Use a collection of different graphs to see different aspects of the data. Plotting graphs in a notebook is not as useful as making plots large and visible. Plots should be displayed in a prominent place so that those concerned with the environmental system can review them readily.
We close with Tukey’s (1977) declaration: “The greatest value of a picture is when it forces us to notice what we never expected to see.” (Emphasis and italics in the original.)
References
Anscombe, F. J. (1973). “Graphs in Statistical Analysis,” American Statistician, 27, 17–21.
Chatfield, C. (1988). Problem Solving: A Statistician’s Guide, London, Chapman & Hall.
Chatfield, C. (1991). “Avoiding Statistical Pitfalls,” Stat. Sci., 6(3), 240–268.
Cleveland, W. S. (1990). The Elements of Graphing Data, 2nd ed., Summit, NJ, Hobart Press.
Cleveland, W. S. (1994). Visualizing Data, Summit, NJ, Hobart Press.
FIGURE 3.15 This unnecessary graph, which shows just five values, should be replaced by a table.
[Figure 3.15 appears here; log-scale bars (1 to 10,000) for pH, COD, BOD, TSS, and TDS.]
Farquhar, A. B. and H. Farquhar (1891). Economic and Industrial Delusions: A Discourse of the Case for Protection, New York, Putnam.
Gameson, A. L. H., G. A. Truesdale, and M. J. Van Overdijk (1961). “Variation in Performance of Twelve Replicate Small-Scale Percolating Filters,” Water and Waste Treatment J., 9, 342–350.
Hunter, J. S. (1988). “The Digidot Plot,” Am. Statistician, 42, 54.
Tufte, E. R. (1983). The Visual Display of Quantitative Information, Cheshire, CT, Graphics Press.
Tufte, E. R. (1990). Envisioning Information, Cheshire, CT, Graphics Press.
Tufte, E. R. (1997). Visual Explanations, Cheshire, CT, Graphics Press.
Tukey, J. W. (1977). Exploratory Data Analysis, Reading, MA, Addison-Wesley.
Tukey, J. W. (1990). “Data Based Graphics: Visual Display in the Decades to Come,” Stat. Sci., 5, 327–329.
Wainer, H. (1997). Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot, New York, Copernicus, Springer-Verlag.
Exercises
3.1 Box-Whisker Plot. For the 11 ordered observations below, make the box-whisker plot to show the median, the upper and lower quartiles, and the upper and lower cut-off.
3.2 Phosphorus in Sludge. The values below are annual average concentrations of total phosphorus in municipal sewage sludge, measured as percent of dry weight solids. Time runs from right to left. The first value is for 1979. Make several plots of the data to discover any trends or patterns. Try to explain any patterns you discover.

2.7 2.5 2.3 2.4 2.6 2.7 2.6 2.7 2.3 2.9 2.8 2.5 2.6 2.7 2.8 2.6 2.4 2.7 3.0 4.5 4.5 4.3
3.3 Waste Load Survey Data Analysis. The table gives 52 weekly average flow and BOD5 data for wastewater. Plot the data in a variety of ways that might interest an engineer who needs to base a treatment plant design on these data. As a minimum, (a) make the time series plots for BOD concentration, flow, and BOD mass load (lb/day); and (b) determine whether flow and BOD are correlated.
[Table of 52 weekly averages appears here; columns are week, flow (MGD), and BOD (mg/L).]
3.4 Effluent Suspended Solids. The data below are effluent suspended solids data for one year of a wastewater treatment plant operation. Plot the data and discuss any patterns or characteristics of the data that might interest plant management or operators.
3.5 Solid Waste Fuel Value. The table gives fuel values (Btu/lb) for typical solid waste from 35 countries. (The United States is number 35.) Make a matrix scatterplot of the five waste characteristics and any other plots that might help to identify a plausible model to relate fuel value to composition.
3.6 Highway TPH Contamination. Total petroleum hydrocarbons (TPH) in soil specimens collected at 30 locations alongside a 44.8-mile stretch of major highway are given in the table below. The length was divided into 29 segments of 1.5 miles and one segment of 1.3 miles. The sampling order for these segments was randomized, as was the location within each segment. Also, the sample collection was randomized with respect to the eastbound or westbound lane of the roadway. There are duplicate measurements on three specimens. Plot the data in a variety of ways to check for randomness, independence, trend, and other interesting patterns.
Source: Khan et al., J. Envir. Eng., ASCE, 117, 376, 1991.
[Table appears here; columns are distance (mile), location, sample order, and TPH (mg/kg), in two column groups.]
Source: Phillips, I. (1999). Unpublished paper, Tufts University.
3.7 Heavy Metals. Below are 100 daily observations of influent and effluent lead (Pb) concentration in wastewater, measured as µg/L. State your expectation for the relation between influent and effluent and then plot the data to see whether your ideas need modification.
4
Smoothing Data
KEY WORDS moving average, exponentially weighted moving average, weighting factors, smoothing, and median smoothing.
Smoothing is drawing a smooth curve through data in order to eliminate the roughness (scatter) that blurs the fundamental underlying pattern. It sharpens our focus by unhooking our eye from the irregularities. Smoothing can be thought of as a decomposition of the data. In curve fitting, this decomposition has the general relation: data = fit + residuals. In smoothing, the analogous expression is: data = smooth + rough. Because the smooth is intended to be smooth (as the “fit” is smooth in curve fitting), we usually show its points connected. Similarly, we show the rough (or residuals) as separated points, if we show them at all. We may choose to show only those rough (residual) points that stand out markedly from the smooth (Tukey, 1977).
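As a rough sketch of the data = smooth + rough decomposition, here is a running median of 3, one of the simplest of Tukey's median smoothers; the short series is invented purely for illustration.

```python
# Running median of 3: one of Tukey's simplest median smoothers.
# The data decompose exactly as data = smooth + rough.

def median3(y):
    """Running median of 3; the two endpoints are copied unchanged."""
    smooth = list(y)
    for i in range(1, len(y) - 1):
        smooth[i] = sorted(y[i - 1:i + 2])[1]   # middle of each triple
    return smooth

data = [5.0, 6.0, 30.0, 7.0, 8.0, 6.0, 9.0]     # 30.0 is an isolated spike
smooth = median3(data)
rough = [d - s for d, s in zip(data, smooth)]

print(smooth)   # [5.0, 6.0, 7.0, 8.0, 7.0, 8.0, 9.0] -- the spike is gone
print(rough)    # the spike reappears here, standing out from the smooth
```

Note how the isolated spike is removed entirely from the smooth and shows up as a single large value in the rough, which is exactly the point the text makes about displaying only the rough values that stand out.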
We will discuss several methods of smoothing to produce graphs that are especially useful with time series data from treatment plants and complicated environmental systems. The methods are well established and have a long history of successful use in industry and econometrics. The methods are effective and economical in terms of time and money. They are simple; they are useful to everyone, regardless of statistical expertise. Only elementary arithmetic is needed. A computer may be helpful, but is not needed, especially if one keeps the plot up-to-date by adding points daily or weekly as they become available.
In statistics and quality control literature, one finds mathematics and theory that can embellish these graphs. A formal statistical analysis, such as adding control limits, can become quite complex because the assumptions on which such tests are based are often violated rather badly by environmental data. These embellishments are discussed in another chapter.
Smoothing Methods
One method of smoothing would be to fit a straight line or polynomial curve to the data. Aside from the computational bother, this is not a useful general procedure because the very fact that smoothing is needed means that we cannot see the underlying pattern clearly enough to know what particular polynomial would be useful.
The simplest smoothing method is to plot the data on a logarithmic scale (or plot the logarithm of y instead of y itself). Smoothing by plotting the moving averages (MA) or exponentially weighted moving averages (EWMA) requires only arithmetic.

A moving average (MA) gives equal weight to a sequence of past values; the weight depends on how many past values are to be remembered. The EWMA gives more weight to recent events and progressively forgets the past. How quickly the past is forgotten is determined by one parameter. The EWMA will follow the current observations more closely than the MA. Often this is desirable, but this responsiveness is purchased by a loss in smoothing.
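A minimal sketch of the two smoothers may make the trade-off concrete; the series and the smoothing constants below are invented, not taken from the text.

```python
# Moving average (equal weights) versus exponentially weighted moving
# average (geometrically decaying weights). Illustrative series only.

def moving_average(y, window):
    """Equal weight to the last `window` values; the first value appears
    once `window` observations exist."""
    return [sum(y[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(y))]

def ewma(y, lam):
    """EWMA with weight lam on the newest observation (0 < lam <= 1);
    smaller lam forgets the past more slowly and smooths more."""
    z = [y[0]]
    for obs in y[1:]:
        z.append(lam * obs + (1 - lam) * z[-1])
    return z

series = [10.0, 12.0, 11.0, 20.0, 13.0, 12.0, 11.0]
print(moving_average(series, 3))   # window of 3 past values
print(ewma(series, 0.3))           # lam = 0.3 chosen arbitrarily
```

Running both on the same series shows the behavior described above: the EWMA reacts to the jump at the fourth value immediately, while the 3-point MA spreads its effect over three plotted points.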
The choice of a smoothing method might be influenced by the application. Because the EWMA forgets the past, it may give a more realistic representation of the actual threat of the pollutant to the environment.
For example, the BOD discharged into a freely flowing stream is important the day it is discharged. A 2- or 3-day average might also be important because a few days of dissolved oxygen depression could be disastrous, while one day might be tolerable to aquatic organisms. A 30-day average of BOD could be a less informative statistic about the threat to fish than a short-term average, but it may be needed to assess the long-term trend in treatment plant performance.

For suspended solids that settle on a stream bed and form sludge banks, a long-term average might be related to depth of the sludge bed and therefore be an informative statistic. If the solids do not settle, the daily values may be more descriptive of potential damage. For a pollutant that could be ingested by an organism and later excreted or metabolized, the exponentially weighted moving average might be a good statistic.

Conversely, some pollutants may not exhibit their effect for years. Carcinogens are an example where the long-term average could be important. Long-term in this context is years, so the 30-day average would not be a particularly useful statistic. The first ingested (or inhaled) irritants may have more importance than recently ingested material. If so, perhaps past events should be weighted more heavily than recent events if a statistic is to relate source of pollution to present effect. Choosing a statistic with the appropriate weighting could increase the value of the data to biologists, epidemiologists, and others who seek to relate pollutant discharges to effects on organisms.
Plotting on a Logarithmic Scale
The top panel of Figure 4.1 is a plot of influent copper concentration at a wastewater treatment plant. This plot emphasizes the few high values, especially those at days 225, 250, and 340. The bottom panel shows the same data on a logarithmic scale. Now the process behavior appears more consistent. The low values are more evident, and the high values do not seem so extreme. The episode around day 250 still looks unusual, but the day 225 and 340 values are above the average (on the log scale) by about the same amount that the lowest values are below average.
Are the high values so extraordinary as to deserve special attention? Or are they rogue values (outliers) that can be disregarded? This question cannot be answered without knowing the underlying distribution of the data. If the underlying process naturally generates data with a lognormal distribution, the high values fit the general pattern of the data record.
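One informal way to act on this idea is to judge a suspect value on the log scale rather than the arithmetic scale. The sketch below uses hypothetical copper values, not the actual record plotted in Figure 4.1, and simply asks how many log-scale standard deviations a value lies from the log mean.

```python
# Judging a "high" value on the log scale, as is appropriate when the
# process plausibly generates lognormal data. Values are hypothetical.
import math

def log_z_score(x, data):
    """How many log-scale standard deviations x lies from the log mean
    of the data (a rough screen, not a formal outlier test)."""
    logs = [math.log10(v) for v in data]
    mean = sum(logs) / len(logs)
    sd = math.sqrt(sum((v - mean) ** 2 for v in logs) / (len(logs) - 1))
    return (math.log10(x) - mean) / sd

# Hypothetical influent copper record (µg/L)
copper = [40.0, 55.0, 80.0, 100.0, 120.0, 150.0, 220.0, 300.0]

print(round(log_z_score(2000.0, copper), 2))   # a suspect high value
print(round(log_z_score(120.0, copper), 2))    # a typical value
```

On the arithmetic scale a value like 2000 dwarfs everything else; on the log scale the question becomes whether its deviation from the log mean is out of proportion to the deviations of the low values, which is exactly the comparison made in the text for days 225 and 340.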
FIGURE 4.1 Copper data plotted on arithmetic and logarithmic scales give a different impression about the high values.
[Figure 4.1 appears here; the horizontal axis is days (0–350), with copper plotted on an arithmetic scale (top) and a logarithmic scale, 10 to 10,000 (bottom).]