In practice, probability distributions used in simulation models are usually determined from historical data. The procedure involves two steps:
1. Summarizing historic or observed data in the form of an appropriate histogram.
2. Hypothesizing a theoretical distribution on the basis of the shape of the histogram and testing its goodness-of-fit using an appropriate statistical model.
2.2.2.1 Building Histograms
Histograms are a pictorial reference for “guessing” the shape of the population from which observed data are drawn. Histogram data may be from historical records or collected observations. In either case, a histogram is constructed by dividing the range of the data into nonoverlapping intervals or cells. Each data point is then assigned to one of the defined cells. The result is a tally of the number of points within each cell.
The choice of cell width is crucial to producing a histogram that is descriptive of the shape of the population from which the data are obtained. To appreciate this point, imagine the extreme case in which the cell width is defined such that at most one data point or none would fall within a cell. The other extreme would be to represent the entire range of the data by one cell. In both cases, the result would not be indicative of the shape of the population. Because there are no fixed rules governing the selection of the cell width, the user must exercise judgment in making a reasonable choice.
We could use Excel’s Histogram tool to tally frequencies, calculate percentages, and graph the relative frequencies and cumulative relative frequencies.
Examples of constructing a histogram using Excel are shown in Appendix A.
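Outside Excel, the same tallying step can be sketched in a few lines of Python. The sample below is synthetic (drawn from an assumed exponential service time with mean 3.15 minutes), and the 3-minute cells are illustrative, not the book's raw data:

```python
# Sketch of histogram construction: divide the data range into
# nonoverlapping cells and tally the points falling in each cell.
# The sample is synthetic; the cell edges are an assumption.
import numpy as np

rng = np.random.default_rng(seed=1)
service_times = rng.exponential(scale=3.15, size=100)  # hypothetical data

edges = [0, 3, 6, 9, 12, np.inf]  # cells [0,3), [3,6), ..., [12, inf)
tally, _ = np.histogram(service_times, bins=edges)

for lo, hi, count in zip(edges[:-1], edges[1:], tally):
    print(f"[{lo}, {hi}): {count}")
```

Changing `edges` and rerunning is an easy way to exercise the cell-width judgment discussed above.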
Example 2.1
Figure 2.10 provides two examples of histograms. In part (a) the data represent the service time (in minutes) in a facility, and in part (b) the data represent arrivals per hour at a service facility.
Histogram (a), which represents a continuous random variable, has a cell width of three minutes (cells need not be of equal width). Notice that the cells in histogram (a) do not overlap. Specifically, the cell intervals are defined as [0, 3), [3, 6), [6, 9), [9, 12), and [12, ∞). The tally for each cell is the number of data points in the cell. For example, in histogram (a) there are eight service times (out of the total of 100) that are greater than or equal to six minutes and less than nine minutes. In histogram (b) there is no cell width because the random variable is discrete; instead, each cell is taken to represent a single value of the discrete variable.
The histograms in Figure 2.10 are converted to probability distributions by computing the relative frequency for each cell. Figure 2.11 shows the resulting probability functions for the histograms in Figure 2.10. The cell boundaries from histogram (a) lose their identity; instead, each cell is represented by its midpoint. Probability function (a) represents a continuous random variable, and the resulting function is used as an approximation of the continuous distribution. Clearly, for continuous functions the best histogram is the one whose cells are as small as possible.
Although the piecewise linear approximation and the discrete density functions of Figure 2.11 can be used as input data to the simulation model, we may prefer (perhaps because it is less cumbersome) to represent these distributions in closed form by treating the functions in Figure 2.11 as samples from unknown populations.
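The conversion from tallies to relative frequencies is a one-line computation; the counts below are the ones quoted for histogram (a):

```python
# Convert the histogram tally for Figure 2.10(a) into relative
# frequencies (the empirical probability function of Figure 2.11(a)).
tally = [65, 21, 8, 6]             # observed counts per cell
midpoints = [1.5, 4.5, 7.5, 10.5]  # cell midpoints for 3-minute cells

n = sum(tally)
rel_freq = [count / n for count in tally]
print(list(zip(midpoints, rel_freq)))  # 0.65, 0.21, 0.08, 0.06
```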
Figure 2.10 Frequency histograms for service time and arrival rates. [Figure: panel (a) plots frequency against service time (minutes), with tallies 65, 21, 8, 4, and 2 over the cells [0, 3), [3, 6), [6, 9), [9, 12), and [12, ∞); panel (b) plots frequency against arrivals per hour, with tallies 10, 25, 25, 26, and 14 for 0, 1, 2, 3, and 4 arrivals.]
Figure 2.11 Relative frequencies for service time and for arrivals per hour. [Figure: panel (a) plots relative frequency against service time, with values 0.65, 0.21, 0.08, and 0.06 at the cell midpoints; panel (b) plots relative frequency against arrivals per hour, with values 0.10, 0.25, 0.25, 0.26, and 0.14 for 0, 1, 2, 3, and 4 arrivals.]
We hypothesize theoretical density functions (e.g., exponential and Poisson) with shapes that “resemble” those of the samples. The next section details the statistical procedures used to test the “goodness-of-fit” of proposed theoretical density functions.
2.2.2.2 Goodness-of-Fit Tests
In Section 2.2.2.1, we stated that histogram-based empirical density functions may be used in simulation models. Although the use of empirical distributions is straightforward, they may be cumbersome when performing sensitivity analysis on the model. Imagine, for example, that an empirical distribution represents the service time in a facility. If we want to determine the impact of increasing or decreasing the service rate, it would be necessary to change the specification of the empirical distribution for each sensitivity-analysis study. Alternatively, if the empirical distribution is replaced by a closed-form function, the task of carrying out sensitivity analysis becomes much simpler: service-rate changes are achieved by simply changing the distribution parameter(s).
In this section, we present two statistical models for testing the goodness-of-fit of a theoretical distribution used to represent empirical data: the chi-square (χ²) and Kolmogorov–Smirnov (K-S) tests. The χ² test has more general applicability than the K-S test. Specifically, the χ² test applies to both discrete and continuous random variables, whereas the K-S test is applicable to continuous random variables only.
Additionally, the K-S test, in its original form, assumes that the parameters of the theoretical distribution are known. The χ² test, by contrast, is based on estimating the distribution parameters from the raw data. More recently, however, the K-S test has been modified to account for estimating the parameters of the exponential, normal, and Weibull distributions from raw data.
2.2.2.2.1 Chi-Square Test
The χ² test starts by representing the raw data as a histogram and probability function, as demonstrated in Section 2.2.2.1. The resulting histogram provides a visual clue to the possible shape of the theoretical distribution. After selecting a theoretical distribution to be considered, we estimate the distribution parameters from the sample data. The theoretical frequency associated with each cell of the empirical histogram is estimated in the following manner: Let [a_{i−1}, a_i) represent the boundaries of cell i and assume that f(x) represents the hypothesized theoretical pdf. If the available sample includes n data points, the theoretical frequency in cell i on the basis of f(x) is given as
n_i = n p_i = n ∫_{a_{i−1}}^{a_i} f(x) dx,  i = 1, 2, …, N
where p_i is the probability that x falls within cell i. (For a discrete random variable, p_i equals the probability that the ith value of the variable is realized.)
Given O_i, the observed frequency in cell i, the test statistic

χ² = Σ_{i=1}^{N} (O_i − n_i)²/n_i

becomes asymptotically chi-square distributed with N − k − 1 degrees of freedom as the number of cells N → ∞. The scalar k represents the number of distribution parameters estimated from the observed data. Given this information, the null hypothesis stating that the observed data are drawn from the pdf f(x) is accepted at significance level α if the test statistic does not exceed χ²_{N−k−1, 1−α}; otherwise, the null hypothesis is rejected.
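A minimal sketch of the statistic and the degrees-of-freedom bookkeeping, assuming the observed and theoretical frequencies are already available:

```python
# Chi-square goodness-of-fit statistic over N cells.
# O: observed frequencies O_i; T: theoretical frequencies n_i.
def chi_square_statistic(O, T):
    return sum((o - t) ** 2 / t for o, t in zip(O, T))

# Degrees of freedom: N cells, k parameters estimated from the data.
def degrees_of_freedom(N, k):
    return N - k - 1
```

With the frequencies of Table 2.1 (four cells after combining, one estimated parameter), these reproduce the statistic of about 2.39 with 2 degrees of freedom.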
Example 2.2
To illustrate the application of the χ² test, consider histogram (a) in Figure 2.10 (Example 2.1). The histogram appears to have a shape similar to an exponential distribution. Before computing the theoretical frequencies n_i (i = 1, 2, …, 5), we must estimate the mean service time.
In general, the desired mean value can be estimated directly from the raw data as

x̄ = (1/n) Σ_{i=1}^{n} x_i

where n = 100 is the total number of data points. However, because we have not tabulated the original raw data, we can estimate x̄ from the histogram as
x̄ = (1/n) Σ_{i=1}^{N} O_i x̄_i

where N is the number of histogram cells, O_i the observed frequency in cell i, and x̄_i the midpoint of cell i.
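As a quick numeric check of this grouped-mean formula, using the histogram (a) tallies:

```python
# Grouped-mean estimate x_bar = (1/n) * sum(O_i * midpoint_i)
# using the Figure 2.10(a) frequencies.
O = [65, 21, 8, 6]            # observed frequency per cell
mid = [1.5, 4.5, 7.5, 10.5]   # cell midpoints
n = sum(O)

x_bar = sum(o * m for o, m in zip(O, mid)) / n
print(x_bar)  # 3.15
```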
This formula yields the average service time for histogram (a) in Figure 2.10 as

x̄ = (65 × 1.5 + 21 × 4.5 + 8 × 7.5 + 6 × 10.5)/100 = 3.15 minutes

Using the estimate x̄ = 3.15 minutes, the hypothesized exponential density function is written as

f(x) = (1/3.15) e^{−x/3.15}, x > 0

The associated CDF is

F(x) = 1 − e^{−x/3.15}, x > 0

We then compute the theoretical frequency in cell [a_{i−1}, a_i) as

n_i = n p_i = n[F(a_i) − F(a_{i−1})]
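Evaluating this expression cell by cell reproduces the theoretical-frequency column of Table 2.1; a sketch:

```python
# Theoretical frequencies n_i = n * [F(a_i) - F(a_{i-1})] for the
# fitted exponential with mean 3.15 minutes (n = 100 data points).
import math

n, mean = 100, 3.15
F = lambda x: 1.0 - math.exp(-x / mean)  # exponential CDF

edges = [0, 3, 6, 9, 12]
n_i = [n * (F(b) - F(a)) for a, b in zip(edges[:-1], edges[1:])]
tail = n * (1.0 - F(edges[-1]))          # area beyond 12 minutes

print([f"{v:.2f}" for v in n_i], f"tail={tail:.2f}")
# n_i ≈ 61.42, 23.70, 9.14, 3.53; tail ≈ 2.22
```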
Table 2.1 summarizes the computations of the χ² test given the hypothesized exponential density function with mean 3.15 minutes.
Notice that cells 4 and 5 are combined because, as a general rule, it is recommended that the expected frequency n_i in any cell be no less than five points.
Also, because of the tail of the exponential density function, 2.22 percent of the cumulative area lies above the upper interval limit of 12.0; this 2.22 percent is included in cell 4. Because we estimated the mean of the exponential from the observations, the χ² statistic has 4 − 1 − 1 = 2 degrees of freedom. Assuming a significance level α = .05, we obtain the critical value from the chi-square tables as χ²_{2,.95} = 5.99.
Because the χ² value of 2.39 in Table 2.1 is less than the critical value of 5.99, we accept the hypothesis that the observed data are drawn from an exponential distribution with mean 3.15 minutes.
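The acceptance decision can be checked with SciPy, whose `chi2.ppf` returns the tabulated critical value (for 2 degrees of freedom it equals −2 ln α, about 5.99):

```python
# Verify the decision rule for Example 2.2: compare the Table 2.1
# statistic with the chi-square critical value at alpha = 0.05 and
# N - k - 1 = 4 - 1 - 1 = 2 degrees of freedom.
from scipy.stats import chi2

statistic = 2.39                  # chi-square statistic from Table 2.1
critical = chi2.ppf(0.95, df=2)   # ≈ 5.99
print(statistic <= critical)      # True: accept the exponential fit
```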
A problem with the χ² test for continuous random variables is the lack of definite rules to select the cell width. In particular, the “grouping” of data and their representation by cell midpoints can lead to a loss of information. In general, the objective is to select a cell width “sufficiently” small to minimize the adverse effect of the midpoint approximation, but large enough to capture information regarding the shape of the distribution.
In Table 2.1, the user must calculate the theoretical frequencies n_i and the χ² statistic in the last column and then decide whether the hypothesis about the distribution should be accepted. As shown in the Appendix A examples, these calculations can also be facilitated using Excel's CHITEST function.
Table 2.1 Chi-Square Goodness-of-Fit Test

Cell   Cell       Cell       Observed        Combined
No.    Boundary   Midpoint   Frequency O_i   Frequency O_i   O_i x̄_i/n   nF(a_i)   n_i      (O_i − n_i)²/n_i
1      [0, 3)     1.5        65              65              0.975       61.42     61.42    0.21
2      [3, 6)     4.5        21              21              0.945       85.11     23.70    0.31
3      [6, 9)     7.5        8               8               0.600       94.26     9.14     0.14
4      [9, 12)    10.5       4               6               0.630       97.78     3.53     1.73
5      [12, ∞)               2
Total                        100                             3.15        100.00             2.39
2.2.2.2.2 Kolmogorov–Smirnov Test
The K-S model is another goodness-of-fit test frequently referenced in the literature. The test is applicable to continuous distributions only. It differs from the χ² test in that it does not require a histogram. Instead, the model utilizes the raw data x_1, x_2, …, x_n to define an empirical CDF of the form

F_n(x) = (number of x_i ≤ x)/n
The idea is to determine a corresponding CDF, G(x), from the assumed fitted distribution and then define the maximum deviation D_n between F_n(x) and G(x) as

D_n = max_x |F_n(x) − G(x)|

Intuitively, large values of D_n indicate a poor fit. Statistical tables that provide critical values of D_n at different significance levels have been compiled for testing the null hypothesis that G(x) is the CDF of the population from which the data are drawn.
The K-S test has limitations. In addition to being applicable to continuous distributions only, the test in its original form assumes that none of the fitted distribution's parameters are estimated from the raw data. Exceptions to this case include the normal, exponential, and Weibull random variables. In these situations, it may be advantageous to use the K-S test because its implementation does not require a histogram, with its attendant possible loss of information (see Ref. [3], pp. 201–203).
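A sketch of the K-S computation with SciPy. The sample here is synthetic, and G(x) uses a known (true) mean, consistent with the classical assumption that the parameters are not fitted from the same data:

```python
# K-S test of a synthetic sample against the exponential CDF G(x)
# with known mean 3.15; scipy.stats.kstest returns D_n and a p-value.
import numpy as np
from scipy.stats import expon, kstest

rng = np.random.default_rng(seed=7)
sample = rng.exponential(scale=3.15, size=100)

D_n, p_value = kstest(sample, expon(scale=3.15).cdf)
print(D_n, p_value)
```

A small p-value (large D_n) would lead to rejecting the hypothesis that G(x) generated the data.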
2.2.2.3 Maximum Likelihood Estimates of Distribution Parameters
To apply the χ² goodness-of-fit test, it is necessary to estimate the parameters of the theoretical distribution from sample data. For example, for the uniform distribution we must estimate the limits of the interval (a, b) covered by the distribution.
For the normal distribution, estimates of the mean μ and standard deviation σ are needed. Once these parameters are estimated, we can then propose a theoretical distribution from which the expected frequencies are estimated.
Parameter estimation from sample data is based on important statistical theory that requires the estimates to satisfy certain statistical properties (i.e., uniqueness, unbiasedness, invariance, and consistency). These properties are usually satisfied when the parameters are based on maximum likelihood estimates. For example, in the case of the normal distribution, this procedure yields the sample mean and variance as the maximum likelihood estimates of the population mean and variance. Also, for the uniform distribution, the sample minimum and maximum estimate the limits of the interval on which the distribution is defined. Similarly, for the exponential distribution, the sample mean provides the maximum likelihood estimate of the distribution's single parameter.
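The closed-form estimates named above can be sketched directly; the sample is synthetic:

```python
# Closed-form maximum likelihood estimates for the distributions
# mentioned above, computed from a synthetic exponential sample.
import random

random.seed(3)
sample = [random.expovariate(1 / 3.15) for _ in range(1000)]
n = len(sample)

mean_hat = sum(sample) / n               # exponential: MLE is the sample mean
a_hat, b_hat = min(sample), max(sample)  # uniform(a, b): sample min and max
var_hat = sum((x - mean_hat) ** 2 for x in sample) / n  # normal: 1/n variance

print(mean_hat, a_hat, b_hat, var_hat)
```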
Unfortunately, determining maximum likelihood estimates is not always convenient. For example, the gamma, beta, and Weibull distributions require the solution of two nonlinear equations.
A full presentation of the maximum likelihood procedure is beyond the scope of this chapter. Most specialized books on statistics treat this topic (see, for example, Refs. [1] and [2]). A concise summary of the results of applying the maximum likelihood procedure to important distributions can be found in Ref. [3] (pp. 157–175).