
Statistics for Environmental Engineers, Part 2


© 2002 By CRC Press LLC

… at the shorter times, and in this region the residuals are large and predominantly positive. Tukey (1977) calls this process of plotting residuals flattening the data. He emphasizes its power to shift our attention from the fitted line to the discrepancies between prediction and observation. It is these discrepancies that contain the information needed to improve the model.

FIGURE 3.11 Graphing residuals. The visual impression from the top plot is that the vertical deviations are greater for large values of time, but the residual plot (bottom) shows that the curve does not fit the points at low times. (Horizontal axis: Time (hours).)

FIGURE 3.12 Calibration curve for measuring chloride with an ion chromatograph. There are three replicate measurements at each of the 13 levels of chloride. (Horizontal axis: Standard conc. (mg/L).)

Make it a habit to examine the residuals of a fitted model, including deviations from a simple mean. Check for normality by making a dot diagram or histogram. Plot the residuals against the predicted values, against the predictor variables, and as a function of the time order in which the measurements were made. Residuals that appear to be random and to have uniform variance are persuasive evidence that the model has no serious deficiencies. If the residuals show a trend, it is evidence that the model is inadequate. If the residuals spread out, it suggests that a data transformation is probably needed.

Figure 3.12 is a calibration curve for measuring chloride using an ion chromatograph. There are three replicate measures at each concentration level. The hidden variation of the replicates is revealed in Figure 3.13, which has flattened the data by looking at deviations from the average of the three values at each level. An important fact is revealed: the measurement error (variation) tends to increase as the concentration increases. This must be taken into account when fitting the calibration curve to the data.
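The flattening step described for the chloride replicates can be sketched in a few lines. This is only an illustration under stated assumptions: the readings in `replicates` are hypothetical (the chloride data are not reproduced in the text), and the function name `flatten` is invented for this sketch.

```python
from statistics import mean

# Hypothetical calibration data (NOT the chloride data from the text):
# three replicate peak readings at each standard concentration (mg/L).
replicates = {
    10: [1010.0, 990.0, 1000.0],
    50: [5100.0, 4900.0, 5000.0],
    100: [10400.0, 9500.0, 10100.0],
}

def flatten(groups):
    """Return {level: deviations of each replicate from that level's mean}."""
    return {
        level: [y - mean(values) for y in values]
        for level, values in groups.items()
    }

residuals = flatten(replicates)

# Range of the residuals at each level: a crude measure of measurement error.
spread = {level: max(r) - min(r) for level, r in residuals.items()}

for level in sorted(residuals):
    print(level, [round(r, 1) for r in residuals[level]], "range:", round(spread[level], 1))
```

With these made-up numbers the range of the residuals grows with concentration, which is the kind of pattern a residual plot exposes and a plot of the raw calibration line tends to hide.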

A Note on Clarity and Style

Here are the words of some people who have devoted their talent and energy to improving the quality of graphical presentations of statistical data.

“Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency.” Edward Tufte (1983)

“The greatest possibilities of visual display lie in vividness and inescapability of the intended message.” John Tukey (1990)

“Graphing data should be an iterative experiment process.” Cleveland (1994)

Tufte (1983) emphasizes clarity and simplicity in graphics. Wainer (1997) uses elegance, grace, and impact to describe good graphics. Cleveland (1994) emphasizes clarity, precision, and efficiency. William Playfair (1786), a pioneer and innovator in the use of statistical graphics, desires to tell a story graphically as well as dramatically.

Vividness, drama, elegance, grace, clarity, and impact are not technical terms and the ideas they convey are not easy to capture in technical rules, but Cleveland (1994) and Tufte (1983) have suggested basic principles that will produce better graphics. Tufte (1983) says that graphical excellence:

• is the well-designed presentation of interesting data: a matter of substance, of statistics, and of design
• consists of complex ideas communicated with clarity, precision, and efficiency
• is that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space
• is almost always multivariate
• requires telling the truth about the data

These guidelines discourage fancified graphs with multiple fonts, cross hatching, and 3-D effects. They do not say that color is necessary or helpful. A poor graph does not become better because color is added.

FIGURE 3.13 Residuals of the chloride data with respect to the average peak value at each concentration level. (Horizontal axis: Standard concentration (mg/L).)

Style is to a large extent personal. Let us look at five graphical versions of the same data in Figure 3.14. The graphs show how the downward trend in the average number of bald eagle hatchlings in northwestern Ontario reversed after DDT was banned in 1973. The top graphic (so easily produced by computer graphics) does not facilitate understanding the data. It is loaded with what Tufte (1983) calls chartjunk: three-dimensional boxes and shading. “Every bit of ink on a graphic requires a reason. And nearly always that reason should be that the ink presents new information” (Tufte, 1983). The two bar charts in the middle row are clear. The version on the right is cleaner and clearer (the box frame is not needed). The white lines through the bars serve as the vertical scale. The two graphs in the bottom row are better yet. The bars become dots with a line added to emphasize the trend. The version on the right smooths the trend with a curve and adds a note to show when DDT was banned.

Most data sets, like this simple one, can be plotted in a variety of ways. The viewer will appreciate the effort required to explore variations and present one that is clear, precise, and efficient in the presentation of the essential information.

Should We Always Plot the Data?

According to Farquhar and Farquhar (1891), two 19th century economists, “Getting information from a table is like extracting sunlight from a cucumber.” A virtually perfect rule of statistics is “Plot the data.” There are times, however, when a plot is unnecessary. Figure 3.15 is an example. This is a simplified reproduction (shading removed) of a published graph that showed five values:

pH = 5, COD = 2300 mg/L, BOD = 1500 mg/L, TSS = 875 mg/L, TDS = 5700 mg/L

FIGURE 3.14 Several versions of plots that show how banning DDT helped the recovery of the bald eagle population in northwestern Ontario. (Axes: Year versus Average Number of Bald Eagle Hatchlings per Area. The last version carries the note “DDT banned in Ontario in 1973.”)


These five values say it all, and better than the graph. Do not use an axe to hack your way through an open door. Aside from being unnecessary, this chart has three major faults. It confuses units: pH is not measured in mg/L. Three-dimensional effects make it more difficult to read the numerical values. Using a log scale makes the values seem nearly the same when they are much different. The 875 mg/L TSS and the 1500 mg/L BOD have bars that are nearly the same height.
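A quick calculation makes the log-scale complaint concrete. The sketch below uses the five values quoted in the text; the assumption that each bar starts at 1 on the log axis (so its drawn height is proportional to log10 of the value) is mine, based on the axis running from 1 to 10000.

```python
import math

# The five values reported in the text (pH is dimensionless; the rest are mg/L).
values = {"pH": 5, "COD": 2300, "BOD": 1500, "TSS": 875, "TDS": 5700}

# Assumption: bars start at 1 on the log axis, so the drawn height
# is proportional to log10(value).
log_height = {name: math.log10(v) for name, v in values.items()}

ratio_actual = values["BOD"] / values["TSS"]         # how different the values are
ratio_drawn = log_height["BOD"] / log_height["TSS"]  # how different the bars look

print(f"BOD/TSS ratio of the values:      {ratio_actual:.2f}")
print(f"BOD/TSS ratio of the bar heights: {ratio_drawn:.2f}")
```

Under that assumption the BOD value is about 70% larger than the TSS value, while its bar is less than 10% taller, which is why the two bars look nearly alike.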

Summary

Graphical methods are obviously useful for both initial and exploratory data analyses, but they also serve us well in the final analysis. “A picture is worth a thousand words” is a cliché, but still powerfully true. The right graph may reveal all that is important. If it only tells part of the story, that is the part that is most likely to be remembered.

Tables of numbers camouflage the interesting features of data. The human mind, which is remarkably well adapted to so many and varied tasks, is simply not capable of extracting useful information from tabulated figures. Putting these same numbers in appropriate graphical form completely changes the situation. The informed human mind can then operate efficiently with these graphs as inputs. In short, suitable graphs of data and the human mind are an effective combination; endless tables of data and the mind are not.

It is extremely important that plots be kept current because the first purpose of keeping these plots is to help monitor and, if necessary, to troubleshoot difficulties as they arise. The plots do not have to be beautiful, or computer drafted, to be useful. Make simple plots by hand as the data become available. If the plots are made at some future date to provide a record of what happened in the distant past, it will be too late to take appropriate action to improve performance. The second purpose is to have an accurate record of what has happened in the past, especially if the salient information is in such a form that it is easily communicated and readily understood. If they are kept up-to-date and used for the first purpose, they can also be used for the second. On the other hand, if they are not kept up-to-date, they may be useful for the second purpose only. In the interest of efficiency, they ought to serve double duty.

Intelligent data analysis begins with plotting the data. Be imaginative. Use a collection of different graphs to see different aspects of the data. Plotting graphs in a notebook is not as useful as making plots large and visible. Plots should be displayed in a prominent place so that those concerned with the environmental system can review them readily.

We close with Tukey’s (1977) declaration: “The greatest value of a picture is when it forces us to notice what we never expected to see.” (Emphasis and italics in the original.)

FIGURE 3.15 This unnecessary graph, which shows just five values, should be replaced by a table. (Bars for pH, COD, BOD, TSS, and TDS on a logarithmic scale from 1 to 10000.)

References

Anscombe, F. J. (1973). “Graphs in Statistical Analysis,” American Statistician, 27, 17–21.

Chatfield, C. (1988). Problem Solving: A Statistician’s Guide, London, Chapman & Hall.

Chatfield, C. (1991). “Avoiding Statistical Pitfalls,” Stat. Sci., 6(3), 240–268.

Cleveland, W. S. (1990). The Elements of Graphing Data, 2nd ed., Summit, NJ, Hobart Press.

Cleveland, W. S. (1994). Visualizing Data, Summit, NJ, Hobart Press.

Farquhar, A. B. and H. Farquhar (1891). “Economic and Industrial Delusions: A Discourse of the Case for Protection,” New York, Putnam.

Gameson, A. L. H., G. A. Truesdale, and M. J. Van Overdijk (1961). “Variation in Performance of Twelve Replicate Small-Scale Percolating Filters,” Water and Waste Treatment J., 9, 342–350.

Hunter, J. S. (1988). “The Digidot Plot,” Am. Statistician, 42, 54.

Tufte, E. R. (1983). The Visual Display of Quantitative Information, Cheshire, CT, Graphics Press.

Tufte, E. R. (1990). Envisioning Information, Cheshire, CT, Graphics Press.

Tufte, E. R. (1997). Visual Explanations, Cheshire, CT, Graphics Press.

Tukey, J. W. (1977). Exploratory Data Analysis, Reading, MA, Addison-Wesley.

Tukey, J. W. (1990). “Data Based Graphics: Visual Display in the Decades to Come,” Stat. Sci., 5, 327–329.

Wainer, H. (1997). Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot, New York, Copernicus, Springer-Verlag.

Exercises

3.1 Box-Whisker Plot. For the 11 ordered observations below, make the box-whisker plot to show the median, the upper and lower quartiles, and the upper and lower cut-off.

3.2 Phosphorus in Sludge. The values below are annual average concentrations of total phosphorus in municipal sewage sludge, measured as percent of dry weight solids. Time runs from right to left. The first value is for 1979. Make several plots of the data to discover any trends or patterns. Try to explain any patterns you discover.

2.7 2.5 2.3 2.4 2.6 2.7 2.6 2.7 2.3 2.9 2.8 2.5 2.6 2.7 2.8 2.6 2.4 2.7 3.0 4.5 4.5 4.3

3.3 Waste Load Survey Data Analysis. The table gives 52 weekly average flow and BOD5 data for wastewater. Plot the data in a variety of ways that might interest an engineer who needs to base a treatment plant design on these data. As a minimum, (a) make the time series plots for BOD concentration, flow, and BOD mass load (lb/day); and (b) determine whether flow and BOD are correlated.

Table columns (three side-by-side groups): Week, Flow (MGD), BOD (mg/L).
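For the mass-load and correlation parts of Exercise 3.3, the arithmetic can be sketched as follows. The conversion factor 8.34 lb per (million gallons × mg/L) is the standard one for flow in MGD and concentration in mg/L. The four weekly values are hypothetical placeholders, not the exercise data, and the function names are invented for this sketch.

```python
from math import sqrt

# 1 mg/L carried in 1 million gallons of water is about 8.34 lb.
LB_PER_MGAL_PER_MG_L = 8.34

def mass_load_lb_per_day(flow_mgd, conc_mg_per_l):
    """Mass load (lb/day) from flow (MGD) and concentration (mg/L)."""
    return flow_mgd * conc_mg_per_l * LB_PER_MGAL_PER_MG_L

def pearson_r(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical weekly values, NOT the exercise data.
flow = [5.0, 6.0, 7.0, 8.0]          # MGD
bod = [200.0, 190.0, 170.0, 160.0]   # mg/L

load = [mass_load_lb_per_day(q, c) for q, c in zip(flow, bod)]
r = pearson_r(flow, bod)

print([round(v) for v in load])
print(round(r, 3))
```

A correlation coefficient is only a summary; a scatterplot of BOD against flow is still the first thing to look at.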


3.4 Effluent Suspended Solids. The data below are effluent suspended solids data for one year of a wastewater treatment plant operation. Plot the data and discuss any patterns or characteristics of the data that might interest plant management or operators.

3.5 Solid Waste Fuel Value. The table gives fuel values (Btu/lb) for typical solid waste from 35 countries. (The United States is number 35.) Make a matrix scatterplot of the five waste characteristics and any other plots that might help to identify a plausible model to relate fuel value to composition.

Source: Khan, et al., J. Envir. Eng., ASCE, 117, 376, 1991.

3.6 Highway TPH Contamination. Total petroleum hydrocarbons (TPH) in soil specimens collected at 30 locations alongside a 44.8-mile stretch of major highway are given in the table below. The length was divided into 29 segments of 1.5 miles and one segment of 1.3 miles. The sampling order for these segments was randomized, as was the location within each segment. Also, the sample collection was randomized with respect to the eastbound or westbound lane of the roadway. There are duplicate measurements on three specimens. Plot the data in a variety of ways to check for randomness, independence, trend, and other interesting patterns.

Table columns (two side-by-side groups): Distance (mile), Location, Sample Order, TPH (mg/kg).

Source: Phillips, I. (1999). Unpublished paper, Tufts University.


3.7 Heavy Metals. Below are 100 daily observations of wastewater influent and effluent lead (Pb) concentration, measured as µg/L. State your expectation for the relation between influent and effluent and then plot the data to see whether your ideas need modification.


4

Smoothing Data

KEY WORDS moving average, exponentially weighted moving average, weighting factors, smoothing, and median smoothing.

Smoothing is drawing a smooth curve through data in order to eliminate the roughness (scatter) that blurs the fundamental underlying pattern. It sharpens our focus by unhooking our eye from the irregularities. Smoothing can be thought of as a decomposition of the data. In curve fitting, this decomposition has the general relation: data = fit + residuals. In smoothing, the analogous expression is: data = smooth + rough. Because the smooth is intended to be smooth (as the “fit” is smooth in curve fitting), we usually show its points connected. Similarly, we show the rough (or residuals) as separated points, if we show them at all. We may choose to show only those rough (residual) points that stand out markedly from the smooth (Tukey, 1977).
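The decomposition data = smooth + rough can be illustrated with one of the simplest smoothers, a running median of three. This is a minimal sketch, not the procedure the chapter goes on to develop: the series is hypothetical, and copying the two end values unchanged is an assumption of this sketch (other end-point rules exist).

```python
from statistics import median

def running_median3(y):
    """Smooth with a running median of three.

    Assumption of this sketch: the first and last values are copied unchanged.
    """
    if len(y) < 3:
        return list(y)
    middle = [median(y[i - 1:i + 2]) for i in range(1, len(y) - 1)]
    return [y[0]] + middle + [y[-1]]

# Hypothetical series with one spike (not data from the text).
data = [4, 5, 4, 20, 5, 6, 5]

smooth = running_median3(data)
rough = [d - s for d, s in zip(data, smooth)]  # data = smooth + rough

print("smooth:", smooth)
print("rough: ", rough)
```

The spike ends up almost entirely in the rough, so a plot could connect the smooth with a line and show only that one rough point as a separate symbol.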

We will discuss several methods of smoothing to produce graphs that are especially useful with time series data from treatment plants and complicated environmental systems. The methods are well established and have a long history of successful use in industry and econometrics. The methods are effective and economical in terms of time and money. They are simple; they are useful to everyone, regardless of statistical expertise. Only elementary arithmetic is needed. A computer may be helpful, but is not needed, especially if one keeps the plot up-to-date by adding points daily or weekly as they become available.

In statistics and quality control literature, one finds mathematics and theory that can embellish these graphs. A formal statistical analysis, such as adding control limits, can become quite complex because often the assumptions on which such tests are usually based are violated rather badly by environmental data. These embellishments are discussed in another chapter.

Smoothing Methods

One method of smoothing would be to fit a straight line or polynomial curve to the data. Aside from the computational bother, this is not a useful general procedure because the very fact that smoothing is needed means that we cannot see the underlying pattern clearly enough to know what particular polynomial would be useful.

The simplest smoothing method is to plot the data on a logarithmic scale (or plot the logarithm of y instead of y itself). Smoothing by plotting the moving averages (MA) or exponentially weighted moving averages (EWMA) requires only arithmetic.

A moving average (MA) gives equal weight to a sequence of past values; the weight depends on how many past values are to be remembered. The EWMA gives more weight to recent events and progressively forgets the past. How quickly the past is forgotten is determined by one parameter. The EWMA will follow the current observations more closely than the MA. Often this is desirable, but this responsiveness is purchased by a loss in smoothing.
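The difference between the two averages can be seen in a short sketch. The notation is an assumption made here, not taken from the text: `k` is the number of values remembered by the MA, and `lam` is the weight the EWMA puts on the newest observation, so the recursion is z[t] = lam * y[t] + (1 - lam) * z[t-1]. Starting the EWMA at the first observation is also an assumption, and the series is hypothetical.

```python
def moving_average(y, k):
    """Equal-weight average of each value and the k - 1 values before it."""
    return [sum(y[i - k + 1:i + 1]) / k for i in range(k - 1, len(y))]

def ewma(y, lam):
    """Exponentially weighted moving average; lam is the weight on the newest value."""
    z = [y[0]]  # assumption: start the recursion at the first observation
    for value in y[1:]:
        z.append(lam * value + (1 - lam) * z[-1])
    return z

# Hypothetical series with a step change at the fifth value (not data from the text).
y = [10, 10, 10, 10, 20, 20, 20, 20]

ma4 = moving_average(y, 4)   # first value corresponds to the 4th observation
ew = ewma(y, 0.5)            # one value per observation

print("MA(4):    ", ma4)
print("EWMA(0.5):", ew)
```

At the step, this EWMA moves halfway to the new level immediately, while the four-point MA moves only a quarter of the way. A smaller `lam` or a larger `k` would give a smoother but slower response.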

The choice of a smoothing method might be influenced by the application. Because the EWMA forgets the past, it may give a more realistic representation of the actual threat of the pollutant to the environment. For example, the BOD discharged into a freely flowing stream is important the day it is discharged. A 2- or 3-day average might also be important because a few days of dissolved oxygen depression could be disastrous while one day might be tolerable to aquatic organisms. A 30-day average of BOD could be a less informative statistic about the threat to fish than a short-term average, but it may be needed to assess the long-term trend in treatment plant performance.

For suspended solids that settle on a stream bed and form sludge banks, a long-term average might be related to depth of the sludge bed and therefore be an informative statistic. If the solids do not settle, the daily values may be more descriptive of potential damage. For a pollutant that could be ingested by an organism and later excreted or metabolized, the exponentially weighted moving average might be a good statistic.

Conversely, some pollutants may not exhibit their effect for years. Carcinogens are an example where the long-term average could be important. Long-term in this context is years, so the 30-day average would not be a particularly useful statistic. The first ingested (or inhaled) irritants may have more importance than recently ingested material. If so, perhaps past events should be weighted more heavily than recent events if a statistic is to relate source of pollution to present effect. Choosing a statistic with the appropriate weighting could increase the value of the data to biologists, epidemiologists, and others who seek to relate pollutant discharges to effects on organisms.

Plotting on a Logarithmic Scale

The top panel of Figure 4.1 is a plot of influent copper concentration at a wastewater treatment plant. This plot emphasizes the few high values, especially those at days 225, 250, and 340. The bottom panel shows the same data on a logarithmic scale. Now the process behavior appears more consistent. The low values are more evident, and the high values do not seem so extreme. The episode around day 250 still looks unusual, but the day 225 and 340 values are above the average (on the log scale) by about the same amount that the lowest values are below average.

Are the high values so extraordinary as to deserve special attention? Or are they rogue values (outliers) that can be disregarded? This question cannot be answered without knowing the underlying distribution of the data. If the underlying process naturally generates data with a lognormal distribution, the high values fit the general pattern of the data record.
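One rough way to see the effect of the scale is to express each value as a number of standard deviations from the mean, first on the arithmetic scale and then on the log scale. The sketch below does this for a hypothetical set of concentrations (not the copper data of Figure 4.1); it illustrates the change of impression and is not a formal outlier test.

```python
import math
from statistics import mean, stdev

# Hypothetical concentrations (NOT the copper data from Figure 4.1).
conc = [50, 80, 100, 120, 200, 1000]

def z_scores(values):
    """Distance of each value from the mean, in sample standard deviations."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

z_arith = z_scores(conc)
z_log = z_scores([math.log10(v) for v in conc])

print("largest value, arithmetic scale:", round(z_arith[-1], 2))
print("largest value, log scale:       ", round(z_log[-1], 2))
print("smallest value, arithmetic scale:", round(z_arith[0], 2))
print("smallest value, log scale:       ", round(z_log[0], 2))
```

For these made-up numbers the largest value looks less extreme on the log scale and the smallest value stands out more, which mirrors the change of impression between the two panels described above.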

FIGURE 4.1 Copper data plotted on arithmetic and logarithmic scales give a different impression about the high values. (Horizontal axis: Days, 0 to 350.)
