In practice, probability distributions used in simulation models are usually determined from historical data. The procedure involves two steps:
1. Summarizing historic or observed data in the form of an appropriate histogram.
2. Hypothesizing a theoretical distribution on the basis of the shape of the histogram and testing its goodness-of-fit using an appropriate statistical model.
2.2.2.1 Building Histograms
Histograms are a pictorial reference for “guessing” the shape of the population from which observed data are drawn. Histogram data may be from historical records or collected observations. In either case, a histogram is constructed by dividing the range of the data into nonoverlapping intervals or cells. Each data point is then assigned to one of the defined cells. The result is a tally of the number of points within each cell.
The choice of cell width is crucial to producing a histogram that is descriptive of the shape of the population from which the data are obtained. To appreciate this point, imagine the extreme case in which the cell width is defined such that at most one data point or none would fall within a cell. The other extreme would be to represent the entire range of the data by one cell. In both cases, the result would not be indicative of the shape of the population. Because there are no fixed rules governing the selection of the cell width, the user must exercise judgment in making a reasonable choice.
We could use Excel’s Histogram tool to tally frequencies, calculate percentages, and graph the relative frequencies and cumulative relative frequencies.
Examples of constructing a histogram using Excel are shown in Appendix A.
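Outside Excel, the same tallying step can be sketched in a few lines of Python. The sample below is synthetic (drawn from an assumed exponential service time with mean 3.15 minutes), and the 3-minute cells are illustrative, not the book's raw data:

```python
# Sketch of histogram construction: divide the data range into
# nonoverlapping cells and tally the points falling in each cell.
# The sample is synthetic; the cell edges are an assumption.
import numpy as np

rng = np.random.default_rng(seed=1)
service_times = rng.exponential(scale=3.15, size=100)  # hypothetical data

edges = [0, 3, 6, 9, 12, np.inf]  # cells [0,3), [3,6), ..., [12, inf)
tally, _ = np.histogram(service_times, bins=edges)

for lo, hi, count in zip(edges[:-1], edges[1:], tally):
    print(f"[{lo}, {hi}): {count}")
```

Changing `edges` and rerunning is an easy way to exercise the cell-width judgment discussed above.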
Example 2.1
Figure 2.10 provides two examples of histograms. In part (a) the data represent the service time (in minutes) in a facility, and in part (b) the data represent arrivals per hour at a service facility.
Histogram (a), which represents a continuous random variable, has a cell width of three minutes (cells need not be of equal width). Notice that the cells in histogram (a) do not overlap. Specifically, the cell intervals are defined as [0, 3), [3, 6), [6, 9), [9, 12), and [12, ∞). The tally for each cell is the number of data points in the cell. For example, in histogram (a) there are eight service times (out of the total of 100) that are greater than or equal to six minutes and less than nine minutes. In histogram (b) there is no cell width because the random variable is discrete; instead, each cell is taken to represent a single value of the discrete variable.
The histograms in Figure 2.10 are converted to probability distributions by computing the relative frequency for each cell. Figure 2.11 shows the resulting probability functions for the histograms in Figure 2.10. The cell boundaries from histogram (a) lose their identity; instead, each cell is represented by its midpoint. Probability function (a) represents a continuous random variable, and the resulting function is used as an approximation of the continuous distribution. Clearly, for continuous functions the best histogram is the one whose cells are as small as possible.
Although the piecewise linear approximation and the discrete density functions of Figure 2.11 can be used as input data to the simulation model, we may prefer (perhaps because it is less cumbersome) to represent these distributions in closed form by treating the functions in Figure 2.11 as samples from unknown populations.
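The conversion from tallies to relative frequencies is a one-line computation; the counts below are the ones quoted for histogram (a):

```python
# Convert the histogram tally for Figure 2.10(a) into relative
# frequencies (the empirical probability function of Figure 2.11(a)).
tally = [65, 21, 8, 6]             # observed counts per cell
midpoints = [1.5, 4.5, 7.5, 10.5]  # cell midpoints for 3-minute cells

n = sum(tally)
rel_freq = [count / n for count in tally]
print(list(zip(midpoints, rel_freq)))  # 0.65, 0.21, 0.08, 0.06
```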
Figure 2.10 Frequency histograms for service time and arrival rates. [Figure: panel (a) plots frequency against service time (minutes), with tallies 65, 21, 8, 4, and 2 over the cells [0, 3), [3, 6), [6, 9), [9, 12), and [12, ∞); panel (b) plots frequency against arrivals per hour, with tallies 10, 25, 25, 26, and 14 for 0, 1, 2, 3, and 4 arrivals.]
Figure 2.11 Relative frequencies for service time and for arrivals per hour. [Figure: panel (a) plots relative frequency against service time, with values 0.65, 0.21, 0.08, and 0.06 at the cell midpoints; panel (b) plots relative frequency against arrivals per hour, with values 0.10, 0.25, 0.25, 0.26, and 0.14 for 0, 1, 2, 3, and 4 arrivals.]
We hypothesize theoretical density functions (e.g., exponential and Poisson) with shapes that “resemble” those of the samples. The next section details the statistical procedures used to test the “goodness-of-fit” of proposed theoretical density functions.
2.2.2.2 Goodness-of-Fit Tests
In Section 2.2.2.1, we stated that histogram-based empirical density functions may be used in simulation models. Although the use of empirical distributions is straightforward, they may be cumbersome when performing sensitivity analysis on the model. Imagine, for example, that an empirical distribution represents the service time in a facility. If we want to determine the impact of increasing or decreasing the service rate, it would be necessary to change the specification of the empirical distribution for each sensitivity-analysis study. Alternatively, if the empirical distribution is replaced by a closed-form function, the task of carrying out sensitivity analysis becomes much simpler: service-rate changes are achieved by simply changing the distribution parameter(s).
In this section, we present two statistical models for testing the goodness-of-fit of a theoretical distribution used to represent empirical data: the chi-square (χ²) and Kolmogorov–Smirnov (K-S) tests. The χ² test has more general applicability than the K-S test. Specifically, the χ² test applies to both discrete and continuous random variables, whereas the K-S test is applicable to continuous random variables only.
Additionally, the K-S test, in its original form, assumes that the parameters of the theoretical distribution are known. The χ² test, by contrast, is based on estimating the distribution parameters from the raw data. More recently, however, the K-S test has been modified to account for estimating the parameters of the exponential, normal, and Weibull distributions from raw data.
2.2.2.2.1 Chi-Square Test
The χ² test starts by representing the raw data as a histogram and probability function, as demonstrated in Section 2.2.2.1. The resulting histogram provides a visual clue to the possible shape of the theoretical distribution. After selecting a theoretical distribution to be considered, we estimate the distribution parameters from the sample data. The theoretical frequency associated with each cell of the empirical histogram is estimated in the following manner: Let [a_{i−1}, a_i) represent the boundaries of cell i and assume that f(x) represents the hypothesized theoretical pdf. If the available sample includes n data points, the theoretical frequency in cell i on the basis of f(x) is given as
n_i = n p_i = n ∫_{a_{i−1}}^{a_i} f(x) dx,  i = 1, 2, …, N
where p_i is the probability that x falls within cell i. (For a discrete random variable, p_i equals the probability that the ith value of the variable is realized.)
Given O_i, the observed frequency in cell i, the test statistic

χ² = Σ_{i=1}^{N} (O_i − n_i)²/n_i

becomes asymptotically chi-square distributed with N − k − 1 degrees of freedom as the number of cells N → ∞. The scalar k represents the number of distribution parameters estimated from the observed data. Given this information, the null hypothesis stating that the observed data are drawn from the pdf f(x) is accepted at significance level α if the test statistic does not exceed χ²_{N−k−1, 1−α}; otherwise, the null hypothesis is rejected.
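A minimal sketch of the statistic and the degrees-of-freedom bookkeeping, assuming the observed and theoretical frequencies are already available:

```python
# Chi-square goodness-of-fit statistic over N cells.
# O: observed frequencies O_i; T: theoretical frequencies n_i.
def chi_square_statistic(O, T):
    return sum((o - t) ** 2 / t for o, t in zip(O, T))

# Degrees of freedom: N cells, k parameters estimated from the data.
def degrees_of_freedom(N, k):
    return N - k - 1
```

With the frequencies of Table 2.1 (four cells after combining, one estimated parameter), these reproduce the statistic of about 2.39 with 2 degrees of freedom.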
Example 2.2
To illustrate the application of the χ² test, consider histogram (a) in Figure 2.10 (Example 2.1). The histogram appears to have a shape similar to an exponential distribution. Before computing the theoretical frequencies n_i (i = 1, 2, …, 5), we must estimate the mean service time.
In general, the desired mean value can be estimated directly from the raw data as

x̄ = (1/n) Σ_{i=1}^{n} x_i

where n = 100 is the total number of data points. However, because we have not tabulated the original raw data, we can estimate x̄ from the histogram as
x̄ = (1/n) Σ_{i=1}^{N} O_i x̄_i

where N is the number of histogram cells, O_i the observed frequency in cell i, and x̄_i the midpoint of cell i.
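As a quick numeric check of this grouped-mean formula, using the histogram (a) tallies:

```python
# Grouped-mean estimate x_bar = (1/n) * sum(O_i * midpoint_i)
# using the Figure 2.10(a) frequencies.
O = [65, 21, 8, 6]            # observed frequency per cell
mid = [1.5, 4.5, 7.5, 10.5]   # cell midpoints
n = sum(O)

x_bar = sum(o * m for o, m in zip(O, mid)) / n
print(x_bar)  # 3.15
```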
This formula yields the average service time for histogram (a) in Figure 2.10 as

x̄ = (65 × 1.5 + 21 × 4.5 + 8 × 7.5 + 6 × 10.5)/100 = 3.15 minutes

Using the estimate x̄ = 3.15 minutes, the hypothesized exponential density function is written as

f(x) = (1/3.15) e^{−x/3.15}, x > 0

The associated CDF is

F(x) = 1 − e^{−x/3.15}, x > 0

We then compute the theoretical frequency in cell [a_{i−1}, a_i) as

n_i = n p_i = n[F(a_i) − F(a_{i−1})]
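Evaluating this expression cell by cell reproduces the theoretical-frequency column of Table 2.1; a sketch:

```python
# Theoretical frequencies n_i = n * [F(a_i) - F(a_{i-1})] for the
# fitted exponential with mean 3.15 minutes (n = 100 data points).
import math

n, mean = 100, 3.15
F = lambda x: 1.0 - math.exp(-x / mean)  # exponential CDF

edges = [0, 3, 6, 9, 12]
n_i = [n * (F(b) - F(a)) for a, b in zip(edges[:-1], edges[1:])]
tail = n * (1.0 - F(edges[-1]))          # area beyond 12 minutes

print([f"{v:.2f}" for v in n_i], f"tail={tail:.2f}")
# n_i ≈ 61.42, 23.70, 9.14, 3.53; tail ≈ 2.22
```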
Table 2.1 summarizes the computations of the χ² test given the hypothesized exponential density function with mean 3.15 minutes.
Notice that cells 4 and 5 are combined because, as a general rule, it is recommended that the expected frequency n_i in any cell be no less than five points.
Also, because of the tail of the exponential density function, 2.22 percent of the cumulative area lies above the upper interval limit of 12.0; this 2.22 percent is included in cell 4. Because we estimated the mean of the exponential from the observations, the χ² statistic has 4 − 1 − 1 = 2 degrees of freedom. Assuming a significance level α = .05, we obtain the critical value from the chi-square tables as χ²_{2,.95} = 5.99.
Because the χ² value of 2.39 in Table 2.1 is less than the critical value of 5.99, we accept the hypothesis that the observed data are drawn from an exponential distribution with mean 3.15 minutes.
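The acceptance decision can be checked with SciPy, whose `chi2.ppf` returns the tabulated critical value (for 2 degrees of freedom it equals −2 ln α, about 5.99):

```python
# Verify the decision rule for Example 2.2: compare the Table 2.1
# statistic with the chi-square critical value at alpha = 0.05 and
# N - k - 1 = 4 - 1 - 1 = 2 degrees of freedom.
from scipy.stats import chi2

statistic = 2.39                  # chi-square statistic from Table 2.1
critical = chi2.ppf(0.95, df=2)   # ≈ 5.99
print(statistic <= critical)      # True: accept the exponential fit
```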
A problem with the χ² test for continuous random variables is the lack of definite rules to select the cell width. In particular, the “grouping” of data and their representation by cell midpoints can lead to a loss of information. In general, the objective is to select a cell width “sufficiently” small to minimize the adverse effect of the midpoint approximation, but large enough to capture information regarding the shape of the distribution.
In Table 2.1, the user must calculate the theoretical frequencies n_i and the χ² statistic in the last column and then decide whether the hypothesis about the distribution should be accepted. As shown in the Appendix A examples, these calculations can also be facilitated using Excel's CHITEST function.
Table 2.1 Chi-Square Goodness-of-Fit Test

Cell   Cell       Cell       Observed        Combined
No.    Boundary   Midpoint   Frequency O_i   Frequency O_i   O_i x̄_i/n   nF(a_i)   n_i      (O_i − n_i)²/n_i
1      [0, 3)     1.5        65              65              0.975       61.42     61.42    0.21
2      [3, 6)     4.5        21              21              0.945       85.11     23.70    0.31
3      [6, 9)     7.5        8               8               0.600       94.26     9.14     0.14
4      [9, 12)    10.5       4               6               0.630       97.78     3.53     1.73
5      [12, ∞)               2
Total                        100                             3.15        100.00             2.39
2.2.2.2.2 Kolmogorov–Smirnov Test
The K-S model is another goodness-of-fit test frequently referenced in the literature. The test is applicable to continuous distributions only. It differs from the χ² test in that it does not require a histogram. Instead, the model utilizes the raw data x_1, x_2, …, x_n to define an empirical CDF of the form

F_n(x) = (number of x_i ≤ x)/n
The idea is to determine a corresponding CDF, G(x), from the assumed fitted distribution and then define the maximum deviation D_n between F_n(x) and G(x) as

D_n = max_x |F_n(x) − G(x)|

Intuitively, large values of D_n indicate a poor fit. Statistical tables that provide critical values of D_n at different significance levels have been compiled for testing the null hypothesis that G(x) is the CDF of the population from which the data are drawn.
The K-S test has limitations. In addition to being applicable to continuous distributions only, the test in its original form assumes that none of the fitted distribution's parameters are estimated from the raw data. Exceptions to this case include the normal, exponential, and Weibull random variables. In these situations, it may be advantageous to use the K-S test because its implementation does not require a histogram, with its attendant possible loss of information (see Ref. [3], pp. 201–203).
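A sketch of the K-S computation with SciPy. The sample here is synthetic, and G(x) uses a known (true) mean, consistent with the classical assumption that the parameters are not fitted from the same data:

```python
# K-S test of a synthetic sample against the exponential CDF G(x)
# with known mean 3.15; scipy.stats.kstest returns D_n and a p-value.
import numpy as np
from scipy.stats import expon, kstest

rng = np.random.default_rng(seed=7)
sample = rng.exponential(scale=3.15, size=100)

D_n, p_value = kstest(sample, expon(scale=3.15).cdf)
print(D_n, p_value)
```

A small p-value (large D_n) would lead to rejecting the hypothesis that G(x) generated the data.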
2.2.2.3 Maximum Likelihood Estimates of Distribution Parameters
To apply the χ² goodness-of-fit test, it is necessary to estimate the parameters of the theoretical distribution from sample data. For example, for the uniform distribution we must estimate the limits of the interval (a, b) covered by the distribution.
For the normal distribution, estimates of the mean μ and standard deviation σ are needed. Once these parameters are estimated, we can then propose a theoretical distribution from which the expected frequencies are estimated.
Parameter estimation from sample data is based on important statistical theory that requires the estimates to satisfy certain statistical properties (i.e., uniqueness, unbiasedness, invariance, and consistency). These properties are usually satisfied when the parameters are based on maximum likelihood estimates. For example, in the case of the normal distribution, this procedure yields the sample mean and variance as the maximum likelihood estimates of the population mean and variance. Also, for the uniform distribution, the sample minimum and maximum estimate the limits of the interval on which the distribution is defined. Similarly, for the exponential distribution, the sample mean provides the maximum likelihood estimate of the distribution's single parameter.
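The closed-form estimates named above can be sketched directly; the sample is synthetic:

```python
# Closed-form maximum likelihood estimates for the distributions
# mentioned above, computed from a synthetic exponential sample.
import random

random.seed(3)
sample = [random.expovariate(1 / 3.15) for _ in range(1000)]
n = len(sample)

mean_hat = sum(sample) / n               # exponential: MLE is the sample mean
a_hat, b_hat = min(sample), max(sample)  # uniform(a, b): sample min and max
var_hat = sum((x - mean_hat) ** 2 for x in sample) / n  # normal: 1/n variance

print(mean_hat, a_hat, b_hat, var_hat)
```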
Unfortunately, determining maximum likelihood estimates is not always convenient. For example, the gamma, beta, and Weibull distributions require the solution of two nonlinear equations.
A full presentation of the maximum likelihood procedure is beyond the scope of this chapter. Most specialized books on statistics treat this topic (see, for example, Refs. [1] and [2]). A concise summary of the results of applying the maximum likelihood procedure to important distributions can be found in Ref. [3] (pp. 157–175).