Input Modeling for Simulation
IEEM 313, Spring 2005 L. Jeff Hong Dept of IEEM, HKUST
Input Models
Input models represent the uncertainty in a stochastic simulation.
Random variates are generated based on the input models.
Simulation outputs are determined by the input models.
The quality of the output is no better than the quality of the inputs.
Garbage In, Garbage Out!
The quality of input models depends on the raw data, which is often collected by people who don't know simulation.
Input Models
The fundamental requirements for an input model are:
  It must be capable of representing the physical realities of the process.
  It must be easily tuned to the situation at hand.
  It must be amenable to random-variate generation.
There is no true model for any stochastic input. The best that we can hope for is an approximation that yields useful results.
Input Models
A key distinction in input modeling problems is the presence or absence of data:
  When we have data, we fit a model to the data. Good software is available for this.
  When no data are available, we have to creatively use what we can get to construct an input model.
Outline
Input modeling with data
  Data collection
  Select candidate distributions
  Fitting and checking
  Arena Input Analyzer
Input modeling without data
  Sources of information
  Incorporating expert opinion
Input Modeling with Data
1. Collect data.
2. Select one or more candidate distributions, based on physical characteristics of the process and graphical examination of the data.
3. Fit the distribution to the data (determine values for its unknown parameters).
4. Check the fit to the data via tests and graphical analysis.
5. If the distribution does not fit, select another candidate and go to 3, or use an empirical distribution.
Suggestions on Data Collection
Plan ahead: begin with a practice or pre-observation session, and watch for unusual circumstances.
Analyze the data as they are being collected: check adequacy.
Combine homogeneous data sets, e.g. successive time periods, or the same time period on successive days.
Be aware of data censoring: the quantity is not observed in its entirety, so there is a danger of leaving out extreme values.
Collect input data, not performance data.
Selecting Distributions: Histogram
A frequency distribution or histogram is useful in determining the shape of a distribution.
The number of class intervals depends on:
  The number of observations
  The dispersion of the data
  Suggested: the square root of the sample size
[Figure: the same data plotted with different interval sizes]
If few data points are available: combine adjacent cells to eliminate the ragged appearance of the histogram.
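The square-root suggestion can be tried out with a minimal sketch like the one below. The helper `histogram_counts` and the exponential test data are hypothetical and only serve to show how the same data look with too few cells, about sqrt(n) cells, and too many cells.

```python
import math
import random

def histogram_counts(data, num_cells=None):
    """Count observations in equal-width cells; the default cell count is about sqrt(n)."""
    n = len(data)
    if num_cells is None:
        num_cells = max(1, round(math.sqrt(n)))
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_cells or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * num_cells
    for x in data:
        # The maximum value would index one past the end, so clamp it into the last cell.
        idx = min(int((x - lo) / width), num_cells - 1)
        counts[idx] += 1
    return lo, width, counts

random.seed(1)
sample = [random.expovariate(1 / 10) for _ in range(100)]  # hypothetical data, n = 100

for cells in (5, None, 40):  # too few, sqrt(n) = 10, too many
    lo, width, counts = histogram_counts(sample, cells)
    print(f"{len(counts)} cells of width {width:.2f}")
    for i, c in enumerate(counts):
        print(f"  {lo + i * width:7.2f} | {'*' * c}")
```

With 40 cells the text histogram typically looks ragged, which is the situation where combining adjacent cells helps.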
Selecting Distributions: Histogram
Vehicle Arrival Example: the number of vehicles arriving at an intersection between 7:00 and 7:05 am was monitored for 100 random workdays.
There are ample data, so the histogram may have a cell for each possible value in the data range.

  Arrivals per Period   Frequency
   0                    12
   1                    10
   2                    19
   3                    17
   4                    10
   5                     8
   6                     7
   7                     5
   8                     5
   9                     3
  10                     3
  11                     1

[Figure: Histogram of # of arrivals during 7-7:05 am; x-axis: # of arrivals (0 to 11), y-axis: frequency (0 to 20)]
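As a quick check on the table, the sample mean and standard deviation can be computed directly from the frequencies. This is only a sketch; the variable names are illustrative.

```python
import math

# Frequency table from the vehicle arrival example: arrivals per period -> number of days.
freq = {0: 12, 1: 10, 2: 19, 3: 17, 4: 10, 5: 8, 6: 7, 7: 5, 8: 5, 9: 3, 10: 3, 11: 1}

n = sum(freq.values())
mean = sum(x * f for x, f in freq.items()) / n
# Sample variance with the usual n - 1 divisor.
var = sum(f * (x - mean) ** 2 for x, f in freq.items()) / (n - 1)
std = math.sqrt(var)

print(f"n = {n}, mean = {mean:.2f}, std dev = {std:.2f}")
```

The results agree with the sample mean (3.64) and sample standard deviation (2.76) that Arena reports for these data later in the slides.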
Selecting Distributions: Physical Basis
Most probability distributions were invented to represent a particular physical situation.
If we know the physical basis for a distribution, then we can match it to the situation we have to model.
A number of examples follow.
binomial: Models the number of successes in n trials, when the trials are independent with common success probability, p. Example: the number of defective components found in a lot of n components.
negative binomial: Models the number of trials required to achieve k successes. Example: the number of components that we must inspect to find 4 defective components.
Poisson: Models the number of independent events that occur in a fixed amount of time or space. Example: the number of customers that arrive at a store during 1 hour, or the number of defects found in 30 square meters of sheet metal.
normal: Models the distribution of a process that can be thought of as the sum of a number of component processes. Example: the time to assemble a product, which is the sum of the times required for each assembly operation.
lognormal: Models the distribution of a process that can be thought of as the product of a number of component processes. Example: the rate of return on an investment, when interest is compounded, is the product of the returns for a number of periods. Also widely used to model stock prices.
exponential: Models the time between independent events, or a process time which is memoryless. Example: the time to failure for a system that has a constant failure rate over time. Note: if the time between events is exponential, then the number of events in a fixed time interval is Poisson.
Erlang: The sum of k independent, identically distributed exponential random variables. A special case of the gamma...
gamma: An extremely flexible distribution used to model nonnegative random variables.
beta: An extremely flexible distribution used to model bounded (fixed upper and lower limits) random variables.
Weibull: Models the time to failure (minimum of a number of possible causes); can model an increasing or decreasing failure rate (hazard). Example: the time to failure for a disk drive.
discrete or continuous uniform: Models complete uncertainty, since all outcomes are equally likely.
triangular: Models a process when only the minimum, most likely and maximum values of the distribution are known. Example: the minimum, most likely and maximum inflation rate we will have this year.
empirical: Reuses the data themselves by making each observed value equally likely. Can be interpolated to obtain a continuous distribution.
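The note linking exponential times between events to Poisson counts can be checked empirically. The sketch below assumes a hypothetical rate of 3 events per hour and counts events in one-hour periods; `count_events` is an illustrative helper, not part of any simulation package.

```python
import math
import random

def count_events(rate, horizon, rng):
    """Count events in [0, horizon] when times between events are exponential(rate)."""
    t, count = 0.0, 0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return count
        count += 1

rng = random.Random(42)
rate, horizon, reps = 3.0, 1.0, 20000  # hypothetical: 3 events per hour, 1-hour periods
counts = [count_events(rate, horizon, rng) for _ in range(reps)]

mean = sum(counts) / reps
var = sum((c - mean) ** 2 for c in counts) / (reps - 1)
print(f"mean = {mean:.3f}, variance = {var:.3f} (Poisson predicts both = {rate * horizon})")

# Compare the observed fraction of periods with k events against the Poisson pmf.
for k in range(6):
    observed = counts.count(k) / reps
    poisson = math.exp(-rate * horizon) * (rate * horizon) ** k / math.factorial(k)
    print(f"k = {k}: observed {observed:.3f}, Poisson {poisson:.3f}")
```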
Fitting
Determine the unknown parameters for the distribution. For example, if you believe the data are from a normal distribution, then you need to determine the mean and variance of the distribution.
Common methods for fitting distributions are maximum likelihood, method of moments, and least squares.
While the method matters, the variability in the data often overwhelms the differences in the estimators (see Section 9.3).
Remember: There is no true distribution just waiting to be found!
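As an illustration of fitting, the sketch below computes maximum likelihood estimates for two candidate distributions by hand. The four data points are the small sample used later in these slides; treating them as normal or exponential is purely an assumption for the example.

```python
import math

data = [2.1, 5.7, 3.4, 8.1]  # minutes; the small sample used later in these slides

n = len(data)
mean = sum(data) / n

# Normal: the maximum likelihood estimates are the sample mean and the
# variance with divisor n (not n - 1).
var_mle = sum((x - mean) ** 2 for x in data) / n
print(f"normal fit: mean = {mean:.3f}, std dev = {math.sqrt(var_mle):.3f}")

# Exponential: the maximum likelihood estimate of the rate is 1 / sample mean,
# which is also the method-of-moments estimate.
rate = 1 / mean
print(f"exponential fit: rate = {rate:.4f} per minute (mean {mean:.3f})")

def exp_log_likelihood(r, xs):
    """Log-likelihood of i.i.d. exponential data at rate r."""
    return len(xs) * math.log(r) - r * sum(xs)

# The fitted rate should beat nearby rates on the log-likelihood.
for r in (0.9 * rate, rate, 1.1 * rate):
    print(f"rate = {r:.4f}: log-likelihood = {exp_log_likelihood(r, data):.4f}")
```

With only four observations the two estimators for the normal variance (divisor n versus n - 1) differ noticeably, which illustrates how data variability can matter more than the choice of method.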
Check goodness of fit
Goodness of fit can be checked by two approaches: goodness-of-fit tests and graphic analysis.
Goodness-of-fit tests (statistical hypothesis tests)
  Chi-squared test
  Kolmogorov-Smirnov test
Graphic analysis
  Histogram with the fitted line
  q-q plot
Goodness-of-fit Tests
In the test...
  H0: the chosen distribution fits the data
  H1: the chosen distribution does not fit the data
The p-value of a test is the Type I error level (significance level) at which we would just reject H0 for the given data.
If the significance level is less than the p-value, we do not reject H0; otherwise, we reject H0.
Thus, a large (> 0.10) p-value supports H0 that the distribution fits.
Goodness-of-fit Tests
If there is little data, the goodness-of-fit test is likely to accept any distribution. Why?
If there is a lot of data, the goodness-of-fit test is likely to reject any distribution. Why?
Chi-squared Test
A histogram-based test
[Figure: histogram of the observed frequencies with the fitted density overlaid; x-axis from 0 to 40]
Sums the squared differences between observed and expected frequencies:
$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Observed frequency: $O_i$ is the number of observations in the $i$th interval.
Expected frequency: $E_i = n p_i$, where $p_i$ is the theoretical probability of the $i$th interval.
Chi-square Test
Arrange the n observations into k cells; the test statistic is:
$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
which approximately follows the chi-square distribution with k-s-1 degrees of freedom, where s = # of parameters of the hypothesized distribution estimated by the sample statistics.
  Valid only for large sample sizes.
  Each cell should have at least 5 observations for both Oi and Ei.
  The result of the test depends on the grouping of the data.
Chi-squared Test
Vehicle arrival example (the intersection data tabulated earlier), sample mean 3.64
  H0: Data are Poisson distributed with mean 3.64
  H1: Data are not Poisson distributed with mean 3.64

  xi       Observed Oi   Expected Ei   (Oi - Ei)^2 / Ei
   0       12             2.6           7.87   (cells 0 and 1 combined: O = 22, E = 12.2)
   1       10             9.6
   2       19            17.4           0.15
   3       17            21.1           0.80
   4       10            19.2           4.41
   5        8            14.0           2.57
   6        7             8.5           0.26
   7        5             4.4          11.62   (cells 7 and above combined: O = 17, E = 7.6)
   8        5             2.0
   9        3             0.8
  10        3             0.3
  >= 11     1             0.1
  Total   100           100.0          27.68

Cells are combined because of the minimum Ei requirement. The expected frequencies are
$$E_i = n\,p(x) = n\,\frac{e^{-\alpha}\alpha^{x}}{x!}, \qquad \alpha = 3.64$$
The degrees of freedom are k-s-1 = 7-1-1 = 5, and so the p-value is 0.00004. What is your conclusion?
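The chi-squared calculation for the vehicle arrival data can be reproduced with a short sketch. It uses the observed frequencies from the vehicle arrival histogram and the cell grouping from the slide. `chi2_sf` is a hand-written helper (not a library function) that evaluates the chi-square upper tail for integer degrees of freedom. Because the expected frequencies are not rounded here, the statistic comes out close to, but not exactly, the slide's 27.68.

```python
import math

def poisson_pmf(x, mean):
    return math.exp(-mean) * mean ** x / math.factorial(x)

def chi2_sf(x, df):
    """P(chi-square with integer df > x), using the closed-form finite series."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: e^{-x/2} * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) e^{-x/2} * sum_j x^j / (1*3*...*(2j+1))
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * x / math.pi) * math.exp(-half) * total

observed = [12, 10, 19, 17, 10, 8, 7, 5, 5, 3, 3, 1]  # days with 0, 1, ..., 10, >= 11 arrivals
n = sum(observed)
mean = sum(x * o for x, o in enumerate(observed)) / n  # 3.64, estimated from the data

# Cells as grouped on the slide: {0, 1}, {2}, {3}, {4}, {5}, {6}, {7 and above}.
cells = [(0, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, None)]

chi2 = 0.0
for lo, hi in cells:
    if hi is None:  # open-ended last cell: everything from lo upward
        o = sum(observed[lo:])
        p = 1 - sum(poisson_pmf(x, mean) for x in range(lo))
    else:
        o = sum(observed[lo:hi + 1])
        p = sum(poisson_pmf(x, mean) for x in range(lo, hi + 1))
    e = n * p
    chi2 += (o - e) ** 2 / e

df = len(cells) - 1 - 1  # k - s - 1 with s = 1 estimated parameter
p_value = chi2_sf(chi2, df)
print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.5f}")
```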
Kolmogorov-Smirnov Test
[Figure: cdf of the hypothesized distribution plotted against the cdf of the empirical distribution constructed from the data; x-axis from 0 to 40, y-axis from 0 to 1]
The K-S test is useful when the sample size is small.
Test statistic:
$$D = \max_x |F(x) - S_n(x)|$$
where $F$ is the cdf of the hypothesized distribution and $S_n$ is the cdf of the empirical distribution constructed from the data.
The K-S test looks at the maximum difference between the two.
Kolmogorov-Smirnov Test
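Empirical Distribution
If we have n observations $x_1, x_2, \ldots, x_n$, then
$$S_n(x) = \frac{\text{number of } x_1, x_2, \ldots, x_n \text{ that are} \le x}{n}$$
$$D = \max_x |F(x) - S_n(x)|$$
[Figure: step-function empirical cdf rising to 1, with jumps at x1, x2, x3, x4]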
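The statistic D can be computed directly from the definition of S_n. The sketch below is a minimal illustration: the data are the four-point sample used later in these slides, and the exponential candidate with the fitted mean is an assumption made only for the example.

```python
import math

def ks_statistic(data, cdf):
    """D = max over x of |F(x) - S_n(x)| for a continuous hypothesized cdf F."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # S_n jumps from (i - 1)/n to i/n at x, so check both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

data = [2.1, 5.7, 3.4, 8.1]  # minutes; small sample from later in the slides
mean = sum(data) / len(data)

def exp_cdf(x):
    """Exponential cdf with the mean fitted from the data (a hypothetical candidate)."""
    return 1 - math.exp(-x / mean) if x > 0 else 0.0

print(f"D = {ks_statistic(data, exp_cdf):.4f}")
```

Because S_n is a step function, the largest gap always occurs at a data point, just before or just after the jump, which is why the loop only visits the sorted observations.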
Graphic Analysis
Graphic analysis includes: histogram with fitted distribution and q-q plot. Goodness-of-fit tests represent lack of fit by a summary statistic, while plots show where the lack of fit occurs and whether it is important. Goodness-of-fit tests may accept the fit, but the plots may suggest the opposite, especially when the number of observations is small.
Graphic Analysis
A data set of 30 observations is believed to be from a normal distribution. The following are the p-values from the chi-square test and the K-S test:
  Chi-square test: 0.166
  K-S test: >0.15
What is your conclusion?
Arena Input Analyzer
A tool in Arena for fitting distributions to data.
Fits beta, empirical, Erlang, exponential, gamma, Johnson, lognormal, normal, Poisson, triangular, uniform and Weibull.
Reports p-values for the chi-square and K-S tests and provides a histogram plot.
Histogram Plot
Text Report
Distribution Summary
  Distribution:            Lognormal
  Expression:              3 + LOGN(9.22, 7.13)
  Square Error:            0.004997
Chi Square Test
  Number of intervals      = 8
  Degrees of freedom       = 5
  Test Statistic           = 9.13
  Corresponding p-value    = 0.106
Kolmogorov-Smirnov Test
  Test Statistic           = 0.0551
  Corresponding p-value    > 0.15
Data Summary
  Number of Data Points    = 200
  Min Data Value           = 3.38
  Max Data Value           = 36.1
  Sample Mean              = 12
  Sample Std Dev           = 5.91
Histogram Summary
  Histogram Range          = 3 to 37
  Number of Intervals      = 14
Usage Notes
The Fit All option tries all relevant distributions and picks the one with the smallest squared error. It can easily be fooled!
Be sure to try different numbers of histogram cells; this affects the p-value of the chi-square test and your perception of the fit.
Since EXPO is a special case of ERLA, which is a special case of GAMMA, Fit All rarely selects EXPO or ERLA. Similarly, EXPO is a special case of WEIB.
Raw data can be read in from text files (looks for .dst), one value per line.
Vehicle Arrival Data (Fit All)
Distribution Summary
  Distribution:            Weibull
  Expression:              -0.5 + WEIB(4.59, 1.51)
  Square Error:            0.006067
Chi Square Test
  Number of intervals      = 6
  Degrees of freedom       = 3
  Test Statistic           = 2.97
  Corresponding p-value    = 0.414
Data Summary
  Number of Data Points    = 100
  Min Data Value           = 0
  Max Data Value           = 11
  Sample Mean              = 3.64
  Sample Std Dev           = 2.76
Histogram Summary
  Histogram Range          = -0.5 to 11.5
  Number of Intervals      = 12
Vehicle Arrival Data (Fit Poisson)
Distribution Summary
  Distribution:            Poisson
  Expression:              POIS(3.64)
  Square Error:            0.025236
Chi Square Test
  Number of intervals      = 6
  Degrees of freedom       = 4
  Test Statistic           = 19.8
  Corresponding p-value    < 0.005
Data Summary
  Number of Data Points    = 100
  Min Data Value           = 0
  Max Data Value           = 11
  Sample Mean              = 3.64
  Sample Std Dev           = 2.76
Histogram Summary
  Histogram Range          = -0.001 to 11
  Number of Intervals      = 12
q-q Plot
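Recall that one way to generate data from cdf F is via
$$Y = F^{-1}(R)$$
where R is a uniform(0,1) random number.
The q-q plot displays the sorted data
$$Y_{(1)} \le Y_{(2)} \le \cdots \le Y_{(n)}$$
vs.
$$F^{-1}\left(\frac{1/2}{n}\right),\; F^{-1}\left(\frac{3/2}{n}\right),\; \ldots,\; F^{-1}\left(\frac{n - 1/2}{n}\right)$$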
q-q Plot Intuition
We have a sample Y1, Y2, ..., Yn and we fit a distribution F that we hope is good.
If we now generate a random sample of size n from F, it should look about like Y1, Y2, ..., Yn.
The q-q plot generates a "perfect" random sample from F for comparison.
Features of the q-q Plot
It does not depend on how the data are grouped.
It is much better than a histogram when the number of data points is small.
Deviations from a straight line show where the distribution does not match.
A straight line implies the family of distributions is correct; a 45° line implies correct parameters.
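The points of a q-q plot are easy to compute once the inverse cdf of the fitted distribution is available. The sketch below assumes hypothetical normal data and a normal fit; `qq_points` is an illustrative helper, and the printed pairs would be passed to any plotting tool.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(data, inverse_cdf):
    """Pair the j-th smallest observation with F^{-1}((j - 1/2) / n)."""
    ys = sorted(data)
    n = len(ys)
    return [(inverse_cdf((j - 0.5) / n), y) for j, y in enumerate(ys, start=1)]

random.seed(7)
sample = [random.gauss(20, 4) for _ in range(30)]  # hypothetical data

fitted = NormalDist(mean(sample), stdev(sample))   # fitted normal distribution
for q, y in qq_points(sample, fitted.inv_cdf):
    print(f"fitted quantile {q:6.2f}   data {y:6.2f}")
```

If the fit is good, the two columns are close to each other, so the plotted points fall near the 45° line.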
Examples of q-q plot
[Figure: two q-q plots]
First plot: The fitting is pretty good!
Second plot: The distribution family seems OK but the parameters are not.
Example of q-q plot
A data set of 30 observations is believed to be from a normal distribution. The following are the p-values from the chi-square test and the K-S test:
  Chi-square test: 0.166
  K-S test: >0.15
[Figure: q-q plot of the data against the fitted normal distribution]
Poor fit! Misses badly in the left tail.
Using the Data Itself
When might we want to use the data itself?
  When no standard distribution fits well
  When we have no justification for a standard distribution
  When there is too little data to distinguish between standard distributions
Empirical distribution
  Each data point is equally likely to be resampled.
  In Arena: DISCRETE(1/n, X1, 2/n, X2, ..., 1, Xn)
Example: Fit data 2.1, 5.7, 3.4, 8.1 (min)
  DISCRETE(.25, 2.1, .5, 3.4, .75, 5.7, 1, 8.1)
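Resampling from the empirical distribution can be sketched outside Arena as follows. The function `discrete` is a hypothetical Python stand-in that takes (cumulative probability, value) pairs in the same order as the DISCRETE arguments above; it is not Arena's implementation.

```python
import random
from bisect import bisect_left

def discrete(pairs, rng=random):
    """Sample from (cumulative probability, value) pairs, in DISCRETE argument order."""
    cum_probs = [p for p, _ in pairs]
    values = [v for _, v in pairs]
    u = rng.random()
    # Pick the first value whose cumulative probability is >= u.
    return values[min(bisect_left(cum_probs, u), len(values) - 1)]

data = [2.1, 5.7, 3.4, 8.1]
n = len(data)
pairs = [((i + 1) / n, x) for i, x in enumerate(sorted(data))]
print(pairs)

rng = random.Random(0)
draws = [discrete(pairs, rng) for _ in range(10000)]
for x in sorted(data):
    print(f"{x}: relative frequency {draws.count(x) / len(draws):.3f}")
```

Each observed value should appear in roughly a quarter of the draws, and no other value can ever appear.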
Pros and Cons
Pros:
  As the sample size n goes to infinity, the empirical distribution converges to the truth.
  No assumed distribution need be selected.
Cons:
  Only the values we saw can appear again, and nothing in the gaps.
  The empirical distribution has no tail. It is easy to miss extreme values.
Interpolated Empirical
To fill in gaps, we interpolate between the sorted data points. In Arena:
  CONT(0, X1, 1/(n-1), X2, 2/(n-1), X3, ..., 1, Xn)
  CONT(0, 2.1, .33, 3.4, .67, 5.7, 1, 8.1)
[Figure: Interpolated empirical cdf; x-axis: X (0 to 10), y-axis: cumulative probability, with breakpoints at 0, 0.33, 0.67 and 1]
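The interpolated empirical distribution can be sampled by inverting the piecewise-linear cdf. The sketch below uses a hypothetical function `cont` that takes (cumulative probability, value) points in the same order as the CONT arguments; it illustrates the idea and is not Arena's implementation.

```python
import random
from bisect import bisect_right

def cont(points, u):
    """Invert a piecewise-linear cdf given as (cumulative probability, value) points."""
    probs = [p for p, _ in points]
    values = [v for _, v in points]
    # Find the segment [probs[i], probs[i + 1]] that contains u.
    i = min(max(bisect_right(probs, u) - 1, 0), len(points) - 2)
    frac = (u - probs[i]) / (probs[i + 1] - probs[i])
    return values[i] + frac * (values[i + 1] - values[i])

data = sorted([2.1, 5.7, 3.4, 8.1])
n = len(data)
points = [(i / (n - 1), x) for i, x in enumerate(data)]  # (0, X1), (1/(n-1), X2), ..., (1, Xn)

rng = random.Random(0)
sample = [cont(points, rng.random()) for _ in range(5)]
print([round(x, 2) for x in sample])
```

Unlike the plain empirical distribution, the draws can take any value between the smallest and largest observations, but still nothing beyond them.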
Why not always use empirical?
Tail (extreme) behavior is often not well represented (especially when we have a small sample).
It is difficult to change the data to represent a shifted mean or reduced variance.
A fitted distribution smooths out random sample oddities.
Input Modeling without Data
We have to use anything we can find...
  Engineering data, standards and ratings can provide central values
  Expert opinion
  Physical or conventional limitations can provide bounds
  Physical basis of the process can suggest appropriate distribution families
Breakpoints Method
Useful for modeling quantities with a large number of possible outcomes, like quarterly sales volume or aggregate number of overtime hours.
Minimum data needed: smallest and largest possible values.
Ex: sales of XYZ-123 will be no less than 1000 units, but no more than 5000 units.
  UNIF(1000,5000)
Comments: This is typically a poor model since the probability is spread out evenly from low to high. Thus, the extremes are just as likely as the middle values. However, if you want to be conservative (maximum uncertainty) or you cannot justify any additional information, then this model may be reasonable.
Breakpoints Method
Better data: smallest and largest possible values, and most likely value.
Ex: sales of XYZ-123 will be no less than 1000 units, no more than 5000 units, and are most likely to be 3500 units.
  Triangular distribution, TRIA(1000,3500,5000)
[Figure: density plots of Uniform(1000, 5000) and Triang(1000, 3500, 5000); x-axis: values in thousands]
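The difference between the two models can be seen by sampling from both. This is a rough sketch using Python's standard `random` module in place of Arena's UNIF and TRIA; the probability ranges compared are chosen only for illustration.

```python
import random

rng = random.Random(123)
reps = 100000

# Note the argument order: random.triangular(low, high, mode).
uniform_sales = [rng.uniform(1000, 5000) for _ in range(reps)]
triangular_sales = [rng.triangular(1000, 5000, 3500) for _ in range(reps)]

def share_between(xs, lo, hi):
    """Fraction of draws that fall in [lo, hi]."""
    return sum(lo <= x <= hi for x in xs) / len(xs)

for name, xs in (("UNIF(1000,5000)", uniform_sales),
                 ("TRIA(1000,3500,5000)", triangular_sales)):
    print(f"{name}: mean = {sum(xs) / reps:.0f}, "
          f"P(sales <= 1500) = {share_between(xs, 1000, 1500):.3f}, "
          f"P(3000 <= sales <= 4000) = {share_between(xs, 3000, 4000):.3f}")
```

The uniform model puts as much probability on the lowest 500 units as on any other 500-unit range, while the triangular model concentrates probability around the most likely value.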
Breakpoints Method
Best data: smallest and largest possible values plus 1-3 breakpoints (values and a percentage chance of being less than that value).
Ex: sales of XYZ-123 will be between 1000 and 5000, and

  Sales    Chance of being less than this value
  2000     25%
  3500     75%
  4500     99%

In Arena: CONT(0, 1000, .25, 2000, .75, 3500, .99, 4500, 1, 5000)
Comments:
  Use only as many breakpoints as you can confidently get. Three is usually the maximum if no data are available.
  Try to get breakpoints near the extremes if possible, since the extremes are often not realistic.
  Sometimes it is easier to get people to give the chance of exceeding a value, rather than being less than a value.
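When an expert gives chances of exceeding a value, they have to be converted to cumulative probabilities before they can be used as CONT breakpoints. The sketch below shows one way to do this; `exceedance_to_cont` is a hypothetical helper, and the exceedance chances are simply the complements of the percentages in the example.

```python
def exceedance_to_cont(minimum, maximum, exceed):
    """Turn (value, chance of exceeding value) pairs into (cumulative prob, value) points."""
    points = [(0.0, minimum)]
    for value, p_exceed in sorted(exceed):
        points.append((round(1 - p_exceed, 10), value))
    points.append((1.0, maximum))
    # Both columns must be nondecreasing for the points to define a valid cdf.
    for (p0, v0), (p1, v1) in zip(points, points[1:]):
        if p1 < p0 or v1 < v0:
            raise ValueError("breakpoints are inconsistent")
    return points

# Hypothetical expert input: 75% chance of exceeding 2000, 25% of exceeding 3500, 1% of exceeding 4500.
points = exceedance_to_cont(1000, 5000, [(2000, 0.75), (3500, 0.25), (4500, 0.01)])
print(points)
```

The consistency check is a design choice for the sketch: breakpoints elicited from people can contradict each other, and it is better to catch that before the model is run.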
Mean & Variability Method
Useful for modeling quantities with a large number of possible outcomes, like quarterly sales volume or aggregate number of overtime hours. Also useful for modeling the variability in percentage changes.
Mean & Variability Method
Minimum data: mean value and an average percentage variation around that mean.
Example: Last year we sold 10,000 units of ABC-000. This year we expect a 15% increase, with a typical swing of 5% above or below that value.
  NORM(11500, 0.05*11500)
Comments:
  May need to round this value.
  If the % swing > 33%, then negative values are possible.
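The comment about negative values can be quantified. In the sketch below, `prob_negative` is an illustrative helper that treats the swing as the standard deviation expressed as a fraction of the mean, matching NORM(11500, 0.05*11500). A swing of about 33% puts zero roughly three standard deviations below the mean.

```python
from statistics import NormalDist

def prob_negative(mean, swing):
    """P(X < 0) when X ~ NORM(mean, swing * mean), with swing as a fraction of the mean."""
    return NormalDist(mean, swing * mean).cdf(0.0)

for swing in (0.05, 0.20, 0.33, 0.50):
    print(f"swing = {swing:.0%}: P(sales < 0) = {prob_negative(11500, swing):.2e}")
```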
Mean & Variability Method
Better data: mean value, an average percentage variation around that mean, and upper and lower limits.
Ex: Last year we sold 10,000 units of ABC-000. This year we expect a 15% increase, with a typical swing of 5% above or below that value. But we won't sell less than 8000 units, or more than 15,000, under any conditions.
  MN(15000, MX(NORM(11500, 0.05*11500), 8000))
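The MN/MX expression clamps the normal draw to the stated limits. A rough Python analogue, with a hypothetical function name and default arguments taken from the example, looks like this:

```python
import random

def clamped_sales(rng, mean=11500, swing=0.05, lower=8000, upper=15000):
    """Python analogue of MN(upper, MX(NORM(mean, swing * mean), lower))."""
    return min(upper, max(rng.gauss(mean, swing * mean), lower))

rng = random.Random(2005)
draws = [clamped_sales(rng) for _ in range(10000)]
print(f"smallest draw = {min(draws):.0f}, largest draw = {max(draws):.0f}")

# With a much larger swing the limits bind often and probability piles up on them.
wide = [clamped_sales(rng, swing=0.5) for _ in range(10000)]
print(f"share at lower limit = {wide.count(8000) / len(wide):.3f}, "
      f"share at upper limit = {wide.count(15000) / len(wide):.3f}")
```

Clamping moves any draw outside the limits onto the nearest limit, so it is only a mild adjustment when the limits are far out in the tails, as they are with a 5% swing.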
Discrete Outcomes
Used to model discrete events, like making or not making a sale, or whether a new product is ready to ship in the first, second or third quarter.
Data: A percentage chance of each possible outcome so that the total is 100%.
Discrete Outcomes
Example: We have an 80% chance of landing the contract with Big Corp. If we land the contract, then there is only a 10% chance that the initial orders will arrive first quarter; a 50% chance they will start the second quarter; and a 40% chance they will start in quarter 3. Sales will be $250K in each quarter we produce. How to model the sales for the year related to Big Corp?
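One possible way to model this is sketched below. It rests on an assumption that the slide does not state: once orders start, we produce in every remaining quarter of the year, so a first-quarter start means four quarters of sales, a second-quarter start three, and a third-quarter start two. The function name and the seed are illustrative.

```python
import random

def big_corp_sales(rng):
    """One simulated year of Big Corp sales in $K.

    Assumes production runs from the start quarter through the fourth quarter.
    """
    if rng.random() >= 0.80:  # 20% chance we do not land the contract
        return 0
    u = rng.random()
    if u < 0.10:
        start_quarter = 1
    elif u < 0.60:
        start_quarter = 2
    else:
        start_quarter = 3
    return 250 * (4 - start_quarter + 1)

rng = random.Random(313)
reps = 100000
results = [big_corp_sales(rng) for _ in range(reps)]
print(f"simulated mean = {sum(results) / reps:.1f} $K")

# The same mean by hand, under the same assumption.
expected = 0.80 * (0.10 * 1000 + 0.50 * 750 + 0.40 * 500)
print(f"expected value by hand = {expected:.1f} $K")
```

The two-stage structure mirrors the description: first a discrete outcome for landing the contract, then a second discrete outcome for the starting quarter, conditional on the first.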