Statistics for Environmental Engineers - Part 6



50

Using Simulation to Study Statistical Problems

KEY WORDS bootstrap, lognormal distribution, Monte Carlo simulation, percentile estimation, random normal variate, random uniform variate, resampling, simulation, synthetic sampling, t-test.

Sometimes it is difficult to analytically determine the properties of a statistic. This might happen because an unfamiliar statistic has been created by a regulatory agency. One might demonstrate the properties or sensitivity of a statistical procedure by carrying through the proposed procedure on a large number of data sets that have known properties. Such an experiment is a simulation, sometimes called Monte Carlo or synthetic sampling: use the computer to simulate repeated sampling. The statistic of interest is calculated for each resample to construct a simulated distribution that approximates the true sampling distribution of the statistic. The approximation improves as the number of simulated estimates increases.

Monte Carlo Simulation

Monte Carlo simulation is a way of experimenting with a computer to study complex situations. The method consists of sampling to create many data sets that are analyzed to learn how a statistical method performs.

It is easy to compute uniform and normal random variates directly. The values generated from good programs are not truly random, but they have statistical properties that cannot be distinguished from those of true random numbers. We will assume such a random number generating program is available.

Lognormally distributed random variates can be simulated from random normal variates using:

y = η + σz

and

x = exp(y)

where z is a standard normal random variate.
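As a concrete illustration, here is a minimal sketch in Python with NumPy (the language and library are our assumption; the book does not prescribe one), using η = 2 and σ = 1 as in the case study that follows:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # reproducible random number stream

eta, sigma = 2.0, 1.0              # parameters of the underlying normal
z = rng.normal(0.0, 1.0, 10_000)   # standard normal variates
y = eta + sigma * z                # normal variates with eta = 2, sigma = 1
x = np.exp(y)                      # lognormal variates

print(f"median {np.median(x):.1f}, mean {x.mean():.1f}, sd {x.std(ddof=1):.1f}")
# Expect the median near exp(2) = 7.4 and the mean near exp(2.5) = 12.2,
# consistent with the values quoted in the case study below.
```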

You may not need to make the manipulations described above. Most statistics software programs (e.g., Minitab) will generate random values from the uniform, normal, Bernoulli, binomial, Poisson, logistic, Weibull, and other distributions. Microsoft EXCEL will generate random numbers from uniform, normal, Bernoulli, binomial, and Poisson distributions. Equations for generating random values for the exponential, Gamma, Chi-square, lognormal, Beta, Weibull, Poisson, and binomial distributions from the standard uniform and normal variates are given in Hahn and Shapiro (1967). Another useful source is Press et al. (1992).

Case Study: Properties of a Computed Statistic

A new regulation on chronic toxicity requires enforcement decisions to be made on the basis of 4-day averages. The distribution of 4-day averages from this system should be more nearly normal than the parent lognormal population, but we want to check on how closely normality is approached. We do this empirically by constructing a distribution of simulated averages. The steps are:

1. Generate four random, independent, normally distributed values y having η = 2 and σ = 1.
2. Transform each value into a lognormal value, x = exp(y).
3. Average the four values to estimate the 4-day average (x̄).
4. Repeat steps 1 through 3 one thousand times to compute the 1000 simulated 4-day averages represented by the frequency distribution of Figure 50.1(b).

Although 1000 observations sounds like a large number, the frequency distributions are still not smooth, but the essential information has emerged from the simulation. The distribution of 4-day averages is skewed, although not as strongly as the parent lognormal distribution. The median, average, and standard deviation of the 4000 lognormal values are 7.5, 12.3, and 16.1. The average of the 1000 4-day averages is 12.3; the standard deviation of the 4-day averages is 11.0; 90% of the 4-day averages are in the range of 5.0 to 26.5; and 50% are in the range of 7.2 to 15.4.
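These four steps translate directly into a few lines of code; a minimal sketch in Python/NumPy (our assumed language, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Steps 1-3, repeated 1000 times: four lognormal values, then their average
y = rng.normal(2.0, 1.0, size=(1000, 4))  # step 1: normal(eta=2, sigma=1)
x = np.exp(y)                             # step 2: transform to lognormal
avg4 = x.mean(axis=1)                     # step 3: the 4-day averages

print(f"mean of 4-day averages: {avg4.mean():.1f}")
print(f"sd of 4-day averages:   {avg4.std(ddof=1):.1f}")
lo, hi = np.percentile(avg4, [5, 95])
print(f"90% of the averages fall in [{lo:.1f}, {hi:.1f}]")
# Compare with the values quoted above (mean 12.3, sd 11.0, range 5.0-26.5).
```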

Case Study: Percentile Estimation

A state regulation requires the 99th percentile of measurements on a particular chemical to be less than a specified limit; suppose the true 99th percentile has a value of 18.0. If we make 100 random observations of the concentration, how often will the 99th percentile be judged in violation when it is estimated parametrically or nonparametrically? (These two estimation methods are explained in Chapter 8.)

These questions can be answered by simulation, as follows. One hundred estimates of the 99th percentile were made with the nonparametric method, each estimate being obtained from 100 values drawn at random from the lognormal distribution, and one hundred more were made

FIGURE 50.1 Left-hand panel: frequency distribution of 4000 daily observations that are random, independent, and have a lognormal distribution x = exp(y), where y is normally distributed with η = 2 and σ = 1. Right-hand panel: frequency distribution of 1000 4-day averages, each computed from four random values sampled from the lognormal distribution.

FIGURE 50.2 Distribution of 100 nonparametric estimates and 100 parametric estimates of the 99th percentile, each computed using a sample of n = 100 from the lognormal distribution shown in the top panel.



with the parametric method.

One hundred estimates gives a rough, but informative, empirical distribution. Simulating one thousand estimates would give a smoother distribution, but it would still show that the parametric estimates are less variable than the nonparametric estimates and that they are distributed more symmetrically about the true value. The parametric method assumes that the data are from a lognormal distribution, whereas the nonparametric method assumes no prior knowledge of the distribution (Berthouex and Hau, 1991). Even a compliant process will show at least 5% violations due merely to random errors in sampling the distribution, and this is with a sample as large as n = 100; with fewer observations the violations increase. The nonparametric estimation gives more and larger violations.
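A sketch of this experiment in Python/NumPy, using an illustrative lognormal parent (η = 2, σ = 1 is our assumption; this excerpt does not give the actual parent parameters behind the 18.0 value):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n, trials = 100, 100
p99_par, p99_npar = [], []

for _ in range(trials):
    x = np.exp(rng.normal(2.0, 1.0, n))       # one sample of n = 100
    # Parametric: fit a normal on the log scale, then transform back
    m, s = np.log(x).mean(), np.log(x).std(ddof=1)
    p99_par.append(np.exp(m + 2.326 * s))     # z(0.99) = 2.326
    # Nonparametric: order-statistic estimate of the 99th percentile
    p99_npar.append(np.percentile(x, 99))

for name, est in [("parametric", p99_par), ("nonparametric", p99_npar)]:
    est = np.array(est)
    print(f"{name:>13}: mean {est.mean():6.1f}, sd {est.std(ddof=1):6.1f}")
# The parametric estimates should be less variable and more symmetric,
# as the text describes.
```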

Bootstrap Sampling

The bootstrap method is random resampling, with replacement, to create new sets of data (Metcalf, 1997; Draper and Smith, 1998). Suppose that we wish to determine confidence intervals for the parameters of a model that has been fitted to data, leaving n residuals. Assuming the model is an adequate description of the data, the residuals are random errors. We can imagine that in a repeat experiment the residual of the original eighth observation might happen to become the residual for the third new observation, the original third residual might become the new sixth residual, etc. This suggests how n residuals drawn at random from the original set can be assigned to the original observations to create a set of new data. Obviously this requires that the original data be a random sample so that residuals are independent of each other.

The resampling is done with replacement, which means that the original eighth residual can be used more than once in the bootstrap sample of new data.

The bootstrap resampling is done many times, the statistics of interest are estimated from each set of new data, and the empirical reference distributions of the statistics are compiled. The number of resamples might depend on the number of observations in the pool that will be sampled. One recommendation scales the number of resamples to the size of the pool, but it is common to round this up to 100, 500, or 1000 (Piegorsch and Bailer, 1997).

The resampling is accomplished by randomly selecting the mth observation using a uniformly distributed random number between 1 and n:

m = 1 + INT(n × u)

where u is a uniform random number on (0, 1) and n is the number of observations in the pool.

The bootstrap method will be applied to estimating confidence intervals for the parameters of a fitted straight line. There is no real need to bootstrap this problem, because the confidence intervals are known exactly, but using a familiar example makes it easy to follow and check the calculations. Table 50.4 shows the parameter estimates obtained when new data sets were created by resampling the residuals from the original fitting. If this process were repeated a large number of times (i.e., 100 or more), the distribution of the intercept and slope would become apparent and the confidence intervals could be determined empirically. Even with only five resampled data sets, the joint confidence region is starting to emerge.

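A compact sketch of the whole procedure for a straight line, in Python/NumPy, with made-up data standing in for Table 50.1 (which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Illustrative data (an assumption; not the book's Table 50.1 values)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([3.1, 4.8, 7.2, 8.9, 11.3, 12.8])

b1, b0 = np.polyfit(x, y, 1)          # original fit: slope, intercept
resid = y - (b0 + b1 * x)             # the n residuals

boot = []
for _ in range(1000):
    # Resample residuals with replacement and attach them to fitted values
    e = rng.choice(resid, size=len(x), replace=True)
    y_new = (b0 + b1 * x) + e
    s1, s0 = np.polyfit(x, y_new, 1)
    boot.append((s0, s1))

boot = np.array(boot)
for j, name in [(0, "intercept"), (1, "slope")]:
    lo, hi = np.percentile(boot[:, j], [2.5, 97.5])
    print(f"{name}: 95% bootstrap interval [{lo:.2f}, {hi:.2f}]")
```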


Comments

Another use of simulation is to test the consequences of violating the assumptions on which a statistical procedure rests. A good example is provided by Box et al. (1978), who used simulation to study how nonnormality and serial correlation affect the performance of the t-test. The effect of nonnormality was not very serious. In a case where 5% of tests should have been significant, 4.3% were significant for normally distributed data, 6.0% for a rectangular parent distribution, and 5.9% for a skewed parent distribution. The effect of modest serial correlation in the data was much greater than these differences: from the correct level of 5% to 10.5% for the normal distribution, 12.5% for a rectangular distribution, and 11.4% for a skewed distribution. They also showed that randomization would negate the autocorrelation and give percentages of significant results at the expected level of about 5%. Normality, which often causes concern, turns out to be relatively unimportant, while serial correlation, which is too seldom considered, can be ruinous.

TABLE 50.3 New Residuals and Data Generated by Resampling, with Replacement, Using the Random Numbers in Table 50.2 and the Residuals in Table 50.1
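The serial-correlation result is easy to reproduce in spirit. The sketch below is our own construction, not Box et al.'s exact experiment: it applies a one-sample t-test to AR(1) data with a true mean of zero and counts false positives at the nominal 5% level.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
n, trials, phi = 25, 10_000, 0.4     # phi = 0 gives independent data
t_crit = 2.064                       # two-sided 5% point of t with 24 df
false_pos = 0

for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)
    z = np.empty(n)
    z[0] = a[0]
    for t in range(1, n):            # AR(1) series with true mean zero
        z[t] = phi * z[t - 1] + a[t]
    tstat = z.mean() / (z.std(ddof=1) / np.sqrt(n))
    if abs(tstat) > t_crit:
        false_pos += 1

print(f"false positive rate: {false_pos / trials:.3f} (nominal 0.05)")
# With phi = 0.4 the rate is well above 5%; rerunning with phi = 0
# brings it back near the nominal level.
```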

The bootstrap method is a special form of simulation that is based on resampling with replacement. It can be used to investigate the properties of any statistic that may have unusual properties or one for which a convenient analytical solution does not exist.

Simulation is familiar to most engineers as a design tool. Use it to explore and discover unknown properties of unfamiliar statistics and to check the performance of statistical methods that might be applied to data with nonideal properties. Sometimes we find that our worries are misplaced or unfounded.

TABLE 50.4 Parameter Estimates for the Original Data and for Five Sets of New Data Generated by Resampling the Residuals in Table 50.1

References

Berthouex, P. M. and I. Hau (1991). "Difficulties in Using Water Quality Standards Based on Extreme Percentiles," Res. J. Water Pollution Control Fed., 63(5), 873–879.

Box, G. E. P., W. G. Hunter, and J. S. Hunter (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, New York, Wiley Interscience.

Draper, N. R. and H. Smith (1998). Applied Regression Analysis, 3rd ed., New York, John Wiley.

Hahn, G. J. and S. S. Shapiro (1967). Statistical Methods for Engineers, New York, John Wiley.

Metcalf, A. V. (1997). Statistics in Civil Engineering, London, Arnold.



Piegorsch, W. W. and A. J. Bailer (1997). Statistics for Environmental Biology and Toxicology, New York, Chapman & Hall.

Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling (1992). Numerical Recipes in FORTRAN: The Art of Scientific Computing, 2nd ed., Cambridge, England, Cambridge University Press.

Exercises

50.1 Limit of Detection. The Method Limit of Detection is calculated using MDL = 3.143s, where s is the standard deviation of measurements on seven identical aliquots. Use simulation to study how much the MDL can vary due to random variation in the replicate measurements.

50.2 Nonconstant Variance. Chapter 37 on weighted least squares discussed a calibration problem where there were three replicate observations at several concentration levels. By how much can the variance of triplicate observations vary before one would decide that there is nonconstant variance? Answer this by simulating 500 sets of random triplicate observations, calculating the variance of each set, and plotting the histogram of estimated variances.

50.3 Uniform Distribution. Data from a process is discovered to have a uniform distribution. Use simulation to determine the reference distribution for the standard deviation and the standard error of the mean.

50.4 Regression. Extend the example in Table 50.3 and add five to ten more points to Figure 50.3.

50.5 Bootstrap Confidence Intervals. Fit the exponential model y = θ1 exp(−θ2 x) to the data below and use the bootstrap method to determine the approximate joint confidence region of the parameter estimates. Then repeat the bootstrap experiment to see how the shape of the confidence region is changed by having data at larger values of x.

50.6 Legal Statistics. Find an unfamiliar or unusual statistic in a state or U.S. environmental regulation and discover its properties by simulation.

50.7 99th Percentile Distribution. A quality measure for an industrial discharge (kg/day of TSS) has a lognormal distribution with mean 3000 and standard deviation 2000. Use simulation to construct a reference distribution of the 99th percentile value of the TSS load. From this distribution, estimate an upper 90% confidence limit for the 99th percentile.


51

Introduction to Time Series Modeling

KEY WORDS ARIMA model, ARMA model, AR model, autocorrelation, autocorrelation function, autoregressive model, cross-correlation, integrated model, IMA model, intervention analysis, lag, linear trend, MA model, moving average model, nonstationary, parsimony, seasonality, stationary, time series, transfer function.

Our discussion will be limited to discrete (sampled-data) systems where observations occur at equally spaced intervals. The observations may be known exactly, as are prices of IBM stock on the stock market, or they may be measured imperfectly, as is the biochemical oxygen demand (BOD) of a treatment plant effluent. The BOD data will contain a component of measurement error; the IBM stock prices will not. In both cases there are forces, some unknown, that nudge the series this way and that. The effect of these forces on the system can be "remembered" to some extent by the process. This memory makes adjacent observations dependent on the recent past. Time series analysis provides tools for analyzing and describing this dependence. The goal usually is to obtain a model that can be used to forecast future values of the same series, or to obtain a transfer function to predict the value of an output from knowledge of a related input.

Time series data are common in environmental work. Data may be monitored frequently (pH every second) or at long intervals, and at regular or irregular intervals. The records may be complete, or have missing data. They may be homogeneous over time, or measurement methods may have changed, or some intervention has shifted the system to a new level (new treatment plant or new flood control dam). The data may show a trend or cycle, or they may vary about a fixed mean value. All of these are possible complications in time series data. The common features are that time runs with the data and we do not expect neighboring observations to be independent. Otherwise, each time series is unique, and its interpretation and modeling are not straightforward. It is a specialty. Most of us will want a specialist's help even for simple time series analysis. The authors have always teamed with an experienced statistician on these jobs.

Some Examples of Time Series Analysis

We have urged plotting the data before doing analysis because this usually provides some hints about the model that should be fitted to the data. It is a good idea to plot time series data as well, but the plots usually do not reveal the form of the model, except perhaps that there is a trend or seasonality. The details need to be dug out using the tools of time series analysis.

Figure 51.1 shows influent BOD at the Nine Springs Wastewater Treatment Plant in Madison, Wisconsin. Daytime to nighttime variation is clear, but this cycle is not a smooth harmonic (sine or cosine). One day looks pretty much like another, although a long record of daily average data shows that Sundays are different than the other days of the week.

Figure 51.2 shows influent temperature at the Deer Island Wastewater Treatment Plant in Boston, which follows an annual cycle. The positive correlation between successive observations is very strong; high temperatures tend to follow high temperatures.


Figure 51.4 shows the Deer Island effluent suspended solids. The long-term drift over the year is roughly the inverse of the temperature cycle. There is also "spiky" variation. The spikes occurred because of an intermittent physical condition in the final clarifiers (the problem has been corrected). An ordinary time series model would not be able to capture the spikes because they erupt at random.

FIGURE 51.1 Influent BOD at the Nine Springs Wastewater Treatment Plant, Madison, WI.

FIGURE 51.2 Influent temperature at Deer Island Wastewater Treatment Plant, Boston, for the year 2000.

FIGURE 51.3 Effluent pH at the Deer Island Wastewater Treatment Plant for the year 2000.

FIGURE 51.4 Effluent suspended solids at the Deer Island Wastewater Treatment Plant for the year 2000.



Figure 51.5 shows phosphorus data for a Canadian river; the values are monthly averages that run from January 1972 to December 1977. In February 1974, an important wastewater treatment plant initiated phosphorus removal and the nature of the time series changed abruptly. A time series analysis of this data needs to account for this intervention; Chapter 54 discusses intervention analysis.

Each of these time series has correlation of adjacent or nearby values within the time series. This correlation is what time series models describe. These few graphs show that a time series may have a trend, a cycle, an intervention shift, and a strong random component. Our eye can see the difference but not quantify it. We need some special tools to do that. One is the autocorrelation function (ACF). Another is the ARIMA class of time series models.

The Autocorrelation Function

The autocorrelation function is the fundamental tool for diagnosing the structure of a time series.

FIGURE 51.5 Phosphorus data for a Canadian river showing an intervention that reduced the P concentration after February 1974.



In Figure 51.6, the autocorrelations of the series on the left are positive and decay with lag; the positive-negative-positive pattern on the right indicates a negative correlation, and this alternation is reflected in the sample autocorrelation function. The large correlations carry information about the structure of the time series process; the small ones will be ignored. We expect that beyond some value of k the correlation will "die out" and can be ignored.

The lag-k sample autocorrelation coefficient is:

r_k = [ Σ (z_t − z̄)(z_{t+k} − z̄) ] / [ Σ (z_t − z̄)² ]

where the numerator is summed over t = 1, …, n − k, the denominator over t = 1, …, n, and n is the length of the series. The corresponding standard error is approximately 1/√n, and the approximate 95% confidence limits on r_k are ±2/√n.

The autocorrelation function produced by commercial software for time series analysis (e.g., Minitab) will give confidence intervals that are better than this approximation, but the equation above provides sufficient information for our introduction.

The ACF can also be computed for the residuals of any fitted model, and one check on the adequacy of the model is independent residuals (i.e., no correlation). The ACF of the Deer Island pH series (Figure 51.7) has positive correlations and they do not die out within a reasonable number of lags. This indicates that the time series has a linear trend (upward because the correlations are positive). A later section (nonstationary processes) explains how this trend is handled.
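The sample ACF is easy to compute directly from the equation above; a sketch in Python/NumPy:

```python
import numpy as np

def acf(z, max_lag):
    """Sample autocorrelation r_k for k = 1..max_lag."""
    z = np.asarray(z, dtype=float)
    zbar = z.mean()
    denom = np.sum((z - zbar) ** 2)
    return np.array([np.sum((z[:-k] - zbar) * (z[k:] - zbar)) / denom
                     for k in range(1, max_lag + 1)])

rng = np.random.default_rng(seed=6)
z = rng.normal(size=200).cumsum()     # a drifting (nonstationary) series
r = acf(z, 10)
print("r_1..r_10:", np.round(r, 2))
print("approx. 95% limits: +/-", round(2 / np.sqrt(len(z)), 2))
# For this drifting series the correlations stay large and positive
# instead of dying out; that is the signature of nonstationarity.
```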

FIGURE 51.6 Two first-order autoregressive time series, (a) z_t = 0.7z_{t−1} + a_t and (b) z_t = −0.7z_{t−1} + a_t, with their theoretical (c) and sample (d) autocorrelation functions.


The ARIMA Family of Time Series Models

We will examine the family of autoregressive integrated moving average (ARIMA) models. ARIMA describes a collection of models that can be very simple or complicated. The simple models only use the autoregressive (AR) part of the ARIMA structure, and some only use the moving average (MA) part. AR models are used to describe stationary time series — those that fluctuate about a fixed level. MA models are used to describe nonstationary processes — those that drift and do not have a fixed mean level, except as an approximation over a short time period. More complicated models integrate the AR and MA features. They also include features that deal with drift, trends, and seasonality. Thus, the ARIMA structure provides a powerful and flexible collection of models that can be adapted to all kinds of time series.

The ARIMA models are parsimonious; they can be written with a small number of parameters. The models are useful for forecasting as well as interpreting time series. Parameter estimation is done by an iterative minimization of the sum of squares, as in nonlinear regression.

Time Series Models for a Stationary Process

A time series that fluctuates about a fixed mean level is said to be stationary. The data describe, or come from, a stationary process. The data properties are unaffected by a change of the time origin. The sample mean and sample variance of a stationary process are:

z̄ = (1/n) Σ z_t  and  s² = (1/n) Σ (z_t − z̄)²

with the sums taken over t = 1, …, n.

The correlation may be positive or negative, and it may change from positive to negative as the lag distance increases. The autocorrelation at lag k can be visualized by plotting each observation z_t against the observation z_{t+k} that follows it k sampling intervals later. That is, the nature of the relationship between observations separated by a constant time can be inferred from such a plot. Direction in time is unimportant in determining autocorrelation in a stationary process, so we could just as well plot (z_t, z_{t−k}).

Stationary processes are described by autoregressive models. For the case where p recent observations are relevant, the model is:

z_t = φ1 z_{t−1} + φ2 z_{t−2} + ⋯ + φp z_{t−p} + a_t

where the φ_i are autoregressive parameters and a_t is random noise.

FIGURE 51.7 Autocorrelation function for the Deer Island pH series of Figure 51.3. The dotted lines are the approximate 95% confidence intervals.



A first-order autoregressive process, abbreviated AR(1), is:

z_t = φ1 z_{t−1} + a_t

A second-order autoregressive process, AR(2), has the form:

z_t = φ1 z_{t−1} + φ2 z_{t−2} + a_t

For an AR(1) process the theoretical autocorrelations decay exponentially with lag, r_k = φ1^k. When φ1 is negative the correlations alternate in sign (e.g., with φ1 = −0.4, r_4 = 0.0256) and the correlation at odd lags has become negative, with their magnitudes decreasing exponentially.
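A sketch in Python/NumPy that simulates the two AR(1) series of Figure 51.6 and checks the exponential decay r_k = φ1^k:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def simulate_ar1(phi, n=2000):
    """Generate z_t = phi * z_{t-1} + a_t with standard normal shocks."""
    a = rng.normal(size=n)
    z = np.empty(n)
    z[0] = a[0]
    for t in range(1, n):
        z[t] = phi * z[t - 1] + a[t]
    return z

for phi in (0.7, -0.7):
    z = simulate_ar1(phi)
    zc = z - z.mean()
    # lag-k sample autocorrelations for k = 1, 2, 3
    r = [np.dot(zc[:-k], zc[k:]) / np.dot(zc, zc) for k in (1, 2, 3)]
    print(f"phi = {phi:+.1f}: r_1..r_3 =", np.round(r, 2),
          " theory:", np.round([phi, phi**2, phi**3], 2))
```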

Time Series Models for a Nonstationary Process

Nonstationary processes have no fixed mean. They "drift," and this drift is described by the nonstationary moving average models. A moving average model expresses the current value of the time series as a weighted sum of a finite number of previous random inputs:

z_t = a_t − θ1 a_{t−1} − θ2 a_{t−2} − ⋯ − θq a_{t−q}

This is abbreviated as MA(q). Moving average models have been used for more than 60 years (e.g., Wold, 1938).

A first-order moving average model, MA(1), is:

z_t = a_t − θ1 a_{t−1}

The autocorrelation function for an MA(1) model is a single spike at lag 1, and zero otherwise. That is:

r_1 = −θ1 / (1 + θ1²)  and  r_k = 0 for k ≥ 2
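A quick simulation check of this result (a sketch in Python/NumPy; θ1 = 0.6 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(seed=8)
theta, n = 0.6, 20_000

a = rng.normal(size=n + 1)
z = a[1:] - theta * a[:-1]            # MA(1): z_t = a_t - theta * a_{t-1}

zc = z - z.mean()
r = [np.dot(zc[:-k], zc[k:]) / np.dot(zc, zc) for k in (1, 2, 3)]
print("r_1, r_2, r_3 =", np.round(r, 3))
print("theory: r_1 =", round(-theta / (1 + theta**2), 3), ", r_2 = r_3 = 0")
```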


The second-order model, MA(2), is:

z_t = a_t − θ1 a_{t−1} − θ2 a_{t−2}

We have been explaining these models with a minimal amount of notation, but the "backshift operator" B, defined by Bz_t = z_{t−1}, is commonly used in the time series literature to write these models compactly.

The Principle of Parsimony

If there are two equivalent ways to express a model, we should choose the most parsimonious expression, that is, the form that uses the fewest parameters. An AR(1) model, for example, can be written as a weighted sum of present and past terms of a decaying white noise series:

z_t = a_t + φ1 a_{t−1} + φ1² a_{t−2} + ⋯

This is an infinite moving average process. This important result allows us to express an infinite moving average process in the parsimonious form of an AR(1) model that has only one parameter.

In a similar manner, the finite moving average process described by an MA(1) model can be written as an infinite autoregressive series:

z_t = −θ1 z_{t−1} − θ1² z_{t−2} − ⋯ + a_t

Mixed Autoregressive–Moving Average Processes

We saw that the finite first-order autoregressive model could be expressed as an infinite moving average series, and vice versa. As the model becomes more complicated, the parsimonious parameterization may call for a combination of moving average and autoregressive terms. This is the so-called mixed autoregressive–moving average process. The abbreviated name for this process is ARMA(p,q), where p and q refer to the number of terms in each part of the model:

z_t = φ1 z_{t−1} + ⋯ + φp z_{t−p} + a_t − θ1 a_{t−1} − ⋯ − θq a_{t−q}

The ARMA(1,1) process is:

z_t = φ1 z_{t−1} + a_t − θ1 a_{t−1}


Integrated Autoregressive–Moving Average Processes

A nonstationary process has no fixed central level, except perhaps as an approximation over a short period of time. This drifting or trending behavior is common in business, economic, and environmental data, and also in controlled manufacturing systems. If one had to make an a priori prediction of the character of a time series, the best bet would be nonstationary. There is seldom a hard and fast reason to declare that a time series will continue a deterministic trend forever.

A nonstationary series can be converted into a stationary series by differencing. An easily understood case of nonstationarity is an upward trend. We can flatten the nonstationary series by taking the difference of successive values. This removes the trend and leaves a series of stationary, but still autocorrelated, values. Occasionally, a series has to be differenced twice (take the difference of the differences) to produce a stationary series. The differenced stationary series is then analyzed and modeled as a stationary series.

A time series is said to follow an integrated autoregressive–moving average process, abbreviated as ARIMA(p,d,q), if the dth difference is a stationary ARMA(p,q) process. In practice, d is usually 1 or 2.
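Differencing is a one-line operation; the sketch below (Python/NumPy) flattens a simulated upward trend:

```python
import numpy as np

rng = np.random.default_rng(seed=9)
t = np.arange(200)
z = 0.05 * t + rng.normal(size=200)   # upward trend plus noise

w = np.diff(z)                        # first difference: w_t = z_t - z_{t-1}
print(f"original series:    mean of first half {z[:100].mean():.2f}, "
      f"second half {z[100:].mean():.2f}")
print(f"differenced series: mean {w.mean():.2f} (roughly constant level)")
# The differenced series fluctuates about a fixed level and can be
# modeled as a stationary ARMA process.
```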

We may also have an ARIMA(0,d,q), or IMA(d,q), integrated moving average model. The IMA(1,1) model:

z_t = z_{t−1} + a_t − θ1 a_{t−1}

satisfactorily represents many time series, especially in business and economics. Its first difference, z_t − z_{t−1}, is a stationary MA(1) process. Another note on notation that is used in the standard texts on time series: a first difference is often denoted ∇z_t = z_t − z_{t−1}.
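A sketch (Python/NumPy) that simulates an IMA(1,1) series with θ1 = 0.9 (the value fitted to the Deer Island pH series later in this chapter) and confirms that its first difference behaves like an MA(1) process:

```python
import numpy as np

rng = np.random.default_rng(seed=12)
n, theta = 2000, 0.9

a = rng.normal(size=n + 1)
z = np.zeros(n)
z[0] = a[1] - theta * a[0]
for t in range(1, n):                 # z_t = z_{t-1} + a_t - theta*a_{t-1}
    z[t] = z[t - 1] + a[t + 1] - theta * a[t]

w = np.diff(z)                        # should behave like an MA(1) series
wc = w - w.mean()
r = [np.dot(wc[:-k], wc[k:]) / np.dot(wc, wc) for k in (1, 2, 3)]
print("ACF of differenced series:", np.round(r, 2))
print("theory: r_1 =", round(-theta / (1 + theta**2), 2), ", then ~0")
```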

Seasonality

Seasonal models are important in environmental data analysis, especially when dealing with natural systems. They are less important when dealing with treatment process and manufacturing system data. What appears to be seasonality in these data can often be modeled as drift using one of the simple ARIMA models. Therefore, our discussion of seasonal models will be suggestive, and the interested reader is directed to the literature for details.

Some seasonal patterns can be modeled by introducing a few lag terms; if there is a 7-day cycle, for example, a term at lag 7 can be added to the model.

In some cases, seasonal patterns can be modeled with cosine curves that incorporate the smooth change expected from one time period to the next. Consider the model:

z_t = β cos(2πft + Φ) + a_t

where f is the frequency of the cycle and Φ is the phase. The phase does not enter the expression linearly. A trigonometric identity that is more convenient is:

β cos(2πft + Φ) = β1 cos(2πft) + β2 sin(2πft)


With β1 = β cos Φ and β2 = −β sin Φ, the model becomes:

z_t = β1 cos(2πft) + β2 sin(2πft) + a_t

which is linear in the parameters β1 and β2. Cryer (1986) shows how this model is used.
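Because β1 and β2 enter linearly, the seasonal model can be fitted by ordinary least squares. A sketch (Python/NumPy) for a 12-month cycle with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(seed=10)
t = np.arange(120)                    # ten years of monthly data
f = 1.0 / 12.0                        # one cycle per 12 months
# Simulated data: beta = 3, phase = 0.8 (illustrative values)
z = 3.0 * np.cos(2 * np.pi * f * t + 0.8) + rng.normal(0, 0.5, t.size)

# Linear model: z_t = b0 + b1*cos(2*pi*f*t) + b2*sin(2*pi*f*t) + a_t
X = np.column_stack([np.ones_like(t, dtype=float),
                     np.cos(2 * np.pi * f * t),
                     np.sin(2 * np.pi * f * t)])
b, *_ = np.linalg.lstsq(X, z, rcond=None)
print("b0, b1, b2 =", np.round(b, 2))
# Expect b1 near 3*cos(0.8) = 2.09 and b2 near -3*sin(0.8) = -2.15,
# matching the identity beta1 = beta*cos(phase), beta2 = -beta*sin(phase).
```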

To go beyond this simple description requires more than the available space, so we leave the interested reader to consult additional references. The classic text is Box, Jenkins, and Reinsel (1994). Esterby (1993) discusses several ways of dealing with seasonal environmental data. Pandit and Wu (1983) discuss applications in engineering.

Fitting Time Series Models

Fitting time series models is much like fitting nonlinear mechanistic models by nonlinear least squares. The form of the model is specified, an initial estimate is made for each parameter value, a series of residuals is calculated, and the minimum sum of squares is located by an iterative search procedure. When convergence is obtained, the residuals are examined for normality, independence, and constant variance. If these conditions are satisfied, the final parameter values are unbiased least squares estimates.

Deer Island Effluent pH Series

The autocorrelation function (ACF) of the pH series does not die out and it has mostly positive correlations, which indicates a nonstationary, drifting series. The ACF of the differenced series essentially dies out after one lag (Figure 51.8); this indicates that the model should include a first difference and a first-order dependence. The IMA(1,1) model is suggested.

The fitted IMA(1,1) model is:

y_t = y_{t−1} + a_t − 0.9a_{t−1}

The residuals (not shown) are random and independent. By comparison, an ARIMA(1,1,1) model has an extra parameter that brings no meaningful improvement in fit.

We can use this example to see why we should often prefer time series models to deterministic trend models. A straight line fitted to the early part of the record appears to describe an upward trend, and judging only by the usual regression statistics you might wrongly conclude that the linear model is adequate.

The flaw is that there is no reason to believe that a linear upward trend will continue indefinitely. The extended record (Figure 51.9) shows that pH drifts between pH 6 and 7, depending on rainfall, snowmelt, and other factors, and that around day 200 it starts to drift downward.

FIGURE 51.8 The autocorrelation function of the differenced pH series decays to negligible values after a few lags.



A linear model fitted to the first part of the record passes close to those points. But the linear model is not sufficient. It is not adaptable, and the pH is not going to keep going up or down. It is going to drift.

Madison Flow Data

Figure 51.11 shows a 20-yr record of average monthly wastewater flows for the Madison Wastewater Treatment Plant and the fitted time series model. Because April in one year tends to be like April in another year, and one September tends to be like another, there is a seasonality in the data. There is also an upward trend due to growth in the service area. The series is nonstationary and seasonal. Both characteristics can be handled by differencing: a one-lag difference to remove the trend and a 12-month lag difference for the seasonal pattern.

The residuals shown in Figure 51.11 are random. Also, they have constant variance, except for a few large positive values near the end of the series. These correspond to unusual events that the time series model cannot capture. The sewer system is not intended to collect stormwater and under normal conditions it does not, except for small quantities of infiltration. However, stormwater inflow and infiltration increase noticeably if high groundwater levels are combined with a heavy rainfall. This is what caused the exceptionally high flows that produced the large residuals. If one wanted to try and model these extremes, a dummy variable could be used to indicate special conditions. It is true of most models, but especially of regression-type models, that they tend to overestimate extreme low values and underestimate extreme high values.

FIGURE 51.9 IMA(1,1) model y_t = y_{t−1} + a_t − 0.9a_{t−1} fitted to the Deer Island pH series. Notice that the data recording format changed from two decimal places prior to day 365 to one place thereafter.

FIGURE 51.10 Comparison of deterministic and IMA(1,1) models fitted to the first 200 pH observations. The deterministic model appears to fit the data, but it is inadequate in several ways, mainly by not being able to follow the downward drift that occurs within a short time.


This chapter is an introduction to a difficult subject. Only the simplest models and concepts have been mentioned. Many practical problems contain at least one feature that is "nonstandard." The spikes in the effluent suspended solids and Madison flow data are examples. Nonconstant variance or missing data are two others. Use common sense to recognize the special features of a problem. Then find expert help when you need it.

Some statistical methods can be learned quickly by practicing a few example calculations. Regression and t-tests are like this. Time series analysis is not. It takes a good deal of practice; experience is priceless. If you want to learn, start by generating time series and then progress to fitting time series that have been previously studied by specialists. Minitab is a good package to use.

When you need to do time series analysis, do not be shy about consulting a professional statistician who is experienced in doing time series modeling. You will save time and learn at the side of a master. That is a bargain not to be missed.

References

Box, G. E. P., G. M. Jenkins, and G. C. Reinsel (1994). Time Series Analysis, Forecasting and Control, 3rd ed., Englewood Cliffs, NJ, Prentice-Hall.

Cryer, J. D. (1986). Time Series Analysis, Boston, Duxbury Press.

Esterby, S. R. (1993). "Trend Analysis Methods for Environmental Data," Environmetrics, 4(4), 459–481.

Hipel, K. W., A. I. McLeod, and P. K. Fosu (1986). "Empirical Power Comparison of Some Tests for Trend," pp. 347–362, in Statistical Aspects of Water Quality Monitoring, A. H. El-Shaarawi and R. E. Kwiatkowski, Eds., Amsterdam, Elsevier.

Pandit, S. M. and S. M. Wu (1983). Time Series and Systems Analysis with Applications, New York, John Wiley.

Wold, H. O. A. (1938). A Study in the Analysis of Stationary Time Series, 2nd ed. (1954), Uppsala, Sweden, Almqvist and Wiksell.

FIGURE 51.11 A seasonal nonstationary model fitted to a 20-yr record of average monthly wastewater flows for Madison, WI. The residual plot highlights a few high-flow events that the model cannot capture.




Exercises

51.1 BOD Trend Analysis. The table below gives 16 years of BOD loading (lb/day) data for a municipal wastewater treatment plant. Plot the data. Difference the data to remove the upward trend. Examine the differenced data for a seasonal cycle. Calculate the autocorrelation function of the differenced data. Fit some simple time series models to the data.

51.2 Phosphorus Loading. The table below gives 16 years of phosphorus loading (lb/day) data for a municipal wastewater treatment plant. Interpret the data using time series analysis.


52

Transfer Function Models

KEY WORDS chemical reaction, CSTR, difference equations, discrete model, dynamic system, empirical model, impulse, material balance, mechanistic model, lag, step change, stochastic model, time constant, transfer function, time series.

The transfer function models in this chapter are simple dynamic models that relate the time series output of a process to the time series input. Our examples have only one input series and one output series, but the basic concept can be extended to deal with multiple inputs.

Models can be classified in several ways: at one level as dynamic or static, and at another level we might try to classify models as mechanistic vs. empirical, or as deterministic vs. stochastic. The categories are not entirely exclusive. A simple power function y = θ1x^θ2 looks like an empirical model (a mathematical French curve) until we see it in the form E = mc². We will show next how a mechanistic model could masquerade as an empirical stochastic model and explain why empirical stochastic models can be a good way to describe real processes.

Case Study: A Continuous Stirred Tank Reactor Model

Consider a well-mixed tank (a CSTR) with constant volume V and constant flow Q, in which no reaction occurs, so the average input concentration and the average effluent concentration are equal over a long period of time. A material balance gives V dy/dt = Q(x − y), where x is the input concentration and y is the effluent concentration. Defining a time constant τ = V/Q, this becomes:

τ dy/dt = x − y

We know that the response of a mixed reactor to an impulse (spike) in the input is exponential decay from a peak that occurs at the time of the impulse. Also, the response to a step change in the input is an exponential approach to an asymptotic value. These ideal responses are shown in the left-hand panels of Figure 52.1; transfer function modeling deals with the kind of data shown in the right-hand panels.



A Discrete Time Series Approximation

Suppose now that this process has been observed over time at a number of equally spaced intervals. The differential equation can then be approximated by a difference equation of the form:

y_t = θ1 x_t + θ2 y_{t−1}

where θ2 carries the memory of the process and θ1 scales the input. Tabulating the response of this model to a unit impulse in the input shows that the effluent concentration jumps to a peak and then decreases in an exponential decay. This is exactly the expected response of a CSTR process to a pulse input.

The table below shows how this model responds to a step change of one unit in the input.

FIGURE 52.1 The left-hand panels show the ideal input and output of a CSTR. The right-hand panels show "data." The input is the ideal input plus a random variable with mean zero and standard deviation 2.0. The output was calculated from the discrete CSTR model with added random noise.



The step change in the input forces the effluent to rise exponentially and approach the new steady-state level of 1.0. Once again, this meets our expectation for the process.

Fitting the CSTR Model

This example has been constructed so the discrete form of the transfer function for the CSTR:

y_t = θ1 x_t + θ2 y_{t−1} + a_t

can be fitted by linear regression. A new predictor variable is created by lagging the output one time step: y(t) is the dependent variable, and x(t) and y(t − 1) are the independent (predictor) variables. If the model contained a moving average term, the parameter estimation would be done by nonlinear regression. The linear regression approach can be used for autoregressive models and for transfer function models that do not need a moving average term.
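A sketch of the whole exercise in Python/NumPy, using the parameter values θ1 = 0.2 and θ2 = 0.8 and the noise levels quoted in the caption of Figure 52.2 (the shape of the step-input pattern is our assumption):

```python
import numpy as np

rng = np.random.default_rng(seed=11)
n = 100

# Idealized step input plus N(0,2) noise (pattern assumed; see Figure 52.1)
x_ideal = np.where((np.arange(n) >= 20) & (np.arange(n) < 60), 100.0, 0.0)
x = x_ideal + rng.normal(0, 2, n)

# Simulate the output: y_t = 0.2*x_t + 0.8*y_{t-1} + a_t, a_t ~ N(0,4)
y = np.zeros(n)
a = rng.normal(0, 4, n)
for t in range(1, n):
    y[t] = 0.2 * x[t] + 0.8 * y[t - 1] + a[t]

# Regress y_t on x_t and y_{t-1} (no intercept)
X = np.column_stack([x[1:], y[:-1]])
theta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print(f"theta1 = {theta[0]:.2f}, theta2 = {theta[1]:.2f}  (true 0.2, 0.8)")
# The recovered values should be close to the 0.19 and 0.81 reported
# for the fitted model in Figure 52.2.
```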

FIGURE 52.2 The fitted transfer function model y_t = 0.19x_t + 0.81y_{t−1} and the effluent values, which were simulated using y_t = 0.2x_t + 0.8y_{t−1} + a_t. The input x_t was the idealized pattern shown in Figure 52.1 plus random noise from a normal distribution with mean zero and s = 2 (i.e., N(0,2)). The output was calculated from y_t = 0.2x_t + 0.8y_{t−1} + a_t, where a_t was N(0,4).

