 I    STAT    EXP(STAT)   SD(STAT)     Z
STATISTIC = NUMBER OF RUNS TOTAL
OF LENGTH I OR MORE
 I    STAT    EXP(STAT)   SD(STAT)     Z
LENGTH OF THE LONGEST RUN UP         = 10
LENGTH OF THE LONGEST RUN DOWN       = 7
LENGTH OF THE LONGEST RUN UP OR DOWN = 10
NUMBER OF POSITIVE DIFFERENCES = 258
NUMBER OF NEGATIVE DIFFERENCES = 241
NUMBER OF ZERO DIFFERENCES     = 0
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant at the 5% level. Numerous values in this column are much larger than +/-1.96, so we conclude that the data are not random.
Distributional
Assumptions
Since the quantitative tests show that the assumptions of randomness and constant location and scale are not met, the distributional measures will not be meaningful.
Therefore these quantitative tests are omitted.
1.4.2.3.2 Test Underlying Assumptions
1 Exploratory Data Analysis
1.4 EDA Case Studies
Since the underlying assumptions did not hold, we need to develop a better model.
The lag plot showed a distinct linear pattern. Given the definition of the lag plot, Yi versus Yi-1, a good candidate model is a model of the form
Yi = A0 + A1*Yi-1 + Ei
Fit
Output
A linear fit of this model generated the following output.
LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N       = 499
NUMBER OF VARIABLES = 1
NO REPLICATION CASE

   PARAMETER ESTIMATES         (APPROX ST DEV.)   TVALUE
1  A0          0.501650E-01    (0.2417E-01)       2.075
2  A1   YIM1   0.987087        (0.6313E-02)       156.4

RESIDUAL STANDARD DEVIATION = 0.2931194
RESIDUAL DEGREES OF FREEDOM = 497
The slope parameter, A1, has a t value of 156.4, which is statistically significant. Also, the residual standard deviation is 0.29. This can be compared to the standard deviation shown in the summary table, which is 2.08. That is, the fit to the autoregressive model has reduced the variability by a factor of 7.
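The fit described above can be sketched in a few lines of Python. This is a minimal illustration, not Dataplot's implementation: the function name `fit_lag1` is hypothetical, and the data are a synthetic stand-in (a running sum of uniform steps on (-0.5, 0.5), which is an assumption and not the case-study file).

```python
import math
import random

def fit_lag1(y):
    """Least-squares fit of y[i] = a0 + a1*y[i-1] + e[i]."""
    x = y[:-1]          # lagged values, Y(i-1)
    t = y[1:]           # current values, Y(i)
    n = len(x)
    mx = sum(x) / n
    mt = sum(t) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (ti - mt) for xi, ti in zip(x, t))
    a1 = sxy / sxx
    a0 = mt - a1 * mx
    resid = [ti - (a0 + a1 * xi) for xi, ti in zip(x, t)]
    # Two fitted parameters, so the residual SD uses n - 2 degrees of freedom.
    res_sd = math.sqrt(sum(r * r for r in resid) / (n - 2))
    return a0, a1, res_sd

# Synthetic stand-in data (assumption): 500 points, giving 499 (Yi-1, Yi) pairs.
rng = random.Random(0)
y = [0.0]
for _ in range(499):
    y.append(y[-1] + rng.uniform(-0.5, 0.5))

a0, a1, res_sd = fit_lag1(y)
print(f"A0 = {a0:.4f}, A1 = {a1:.4f}, residual SD = {res_sd:.4f}")
```

With 500 points there are 499 usable pairs and 497 residual degrees of freedom, which matches the sample size and degrees of freedom in the output above. The numerical estimates from the synthetic series will differ from the handbook's values.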
1.4.2.3.3 Develop A Better Model
1.4.2.3.4 Validate New Model
http://www.itl.nist.gov/div898/handbook/eda/section4/eda4234.htm (1 of 4) [5/1/2006 9:58:40 AM]
4-Plot of Residuals
Interpretation The assumptions are addressed by the graphics shown above:
The run sequence plot (upper left) indicates no significant shifts
in location or scale over time.
... in that we knew how the data were constructed, it is common and desirable to use scientific and engineering knowledge of the process that generated the data in formulating and testing models for the data. Quite often, several competing models will produce nearly equivalent mathematical results. In this case, selecting the model that best approximates the scientific understanding of the process is a reasonable choice.
Time Series Model
This model is an example of a time series model. More extensive discussion of time series is given in the Process Monitoring chapter.
Click on the links below to start Dataplot and run this case
study yourself. Each step may use results from previous steps,
so please be patient. Wait until the software verifies that the
current step is complete before clicking on the next step.
The links in this column will connect you with more detailed information about each analysis step from the case study description.
1 Invoke Dataplot and read data
1 Read in the data
4 Detect drift in variation by
dividing the data into quarters and
computing Levene's test for equal
1 Based on the 4-plot, there are shifts
in location and scale and the data are not random.
2 The summary statistics table displays 25+ statistics.
3 The linear fit indicates drift in location since the slope parameter
3 Generate the randomness plots.
1 Generate an autocorrelation plot
2 Generate a spectral plot
1 The autocorrelation plot shows significant autocorrelation at lag 1
2 The spectral plot shows a single dominant low frequency peak
4 Fit Yi = A0 + A1*Yi-1 + Ei
and validate
1 Generate the fit
2 Plot fitted line with original data
3 Generate a 4-plot of the residuals
from the fit
4 Generate a uniform probability plot
of the residuals
1 The residual standard deviation from the fit is 0.29 (compared to the standard deviation of 2.08 from the original data).
2 The plot of the predicted values with the original data indicates a good fit.
3 The 4-plot indicates that the assumptions of constant location and scale are valid. The lag plot indicates that the data are random. However, the histogram and normal probability plot indicate that the uniform distribution might be a better model for the residuals than the normal distribution.
4 The uniform probability plot verifies that the residuals can be fit by a uniform distribution.
1.4.2.3.5 Work This Example Yourself
1.4.2 Case Studies
1.4.2.4 Josephson Junction Cryothermometry
1.4.2.4.1 Background and Data
Generation
This data set was collected by Bob Soulen of NIST in October, 1971 as a sequence of observations collected equi-spaced in time from a volt meter to ascertain the process temperature in a Josephson junction cryothermometry (low temperature) experiment. The response variable is voltage counts.
Motivation
The motivation for studying this data set is to illustrate the case where there is discreteness in the measurements, but the underlying assumptions hold. In this case, the discreteness is due to the data being integers.
This file can be read by Dataplot with the following commands:
SKIP 25
SET READ FORMAT 5F5.0
SERIAL READ SOULEN.DAT Y
SET READ FORMAT
1.4.2.4.2 Graphical Output and Interpretation
Goal
The goal of this analysis is threefold:
Determine if the univariate model:
Yi = C + Ei
is appropriate and valid.
Determine if the confidence interval
is appropriate and valid where s is the standard deviation of the
original data.
4-Plot of Data
Interpretation The assumptions are addressed by the graphics shown above:
The run sequence plot (upper left) indicates that the data do not have any significant shifts in location or scale over time.
The normal probability plot (lower right) is difficult to interpret due to the fact that there are only a few distinct values with many repeats.
The integer data with only a few distinct values and many repeats account for the discrete appearance of several of the plots (e.g., the lag plot and the normal probability plot). In this case, the nature of the data makes the normal probability plot difficult to interpret, especially since each number is repeated many times. However, the histogram indicates that a normal distribution should provide an adequate model for the data.
From the above plots, we conclude that the underlying assumptions are valid and the data can be reasonably approximated with a normal distribution. Therefore, the commonly used uncertainty standard is valid and appropriate. The numerical values for this model are given in the Quantitative Output and Interpretation section.
http://www.itl.nist.gov/div898/handbook/eda/section4/eda4242.htm (2 of 4) [5/1/2006 9:58:49 AM]
1.4.2.4.3 Quantitative Output and Interpretation
Summary Statistics
As a first step in the analysis, a table of summary statistics is computed from the data. The following table, generated by Dataplot, shows a typical set of statistics.
SUMMARY

NUMBER OF OBSERVATIONS = 700

***********************************************************************
*        LOCATION MEASURES         *       DISPERSION MEASURES        *
***********************************************************************
* MIDRANGE     =   0.2898500E+04   * RANGE        =   0.7000000E+01   *
***********************************************************************
*       RANDOMNESS MEASURES        *     DISTRIBUTIONAL MEASURES      *
***********************************************************************
* AUTOCO COEF  =   0.3148023E+00   * ST 3RD MOM   =   0.6580265E-02   *
*              =                   * TUK -.5 PPCC =   0.7935873E+00   *
*              =                   * CAUCHY PPCC  =   0.4231319E+00   *
***********************************************************************
Location
One way to quantify a change in location over time is to fit a straight line to the data set using the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no significant drift in the location, the slope parameter should be zero. For this data set, Dataplot generates the following output:
LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N       = 700
NUMBER OF VARIABLES = 1
NO REPLICATION CASE

   PARAMETER ESTIMATES         (APPROX ST DEV.)   TVALUE
1  A0          2898.19         (0.9745E-01)       0.2974E+05
2  A1   X      0.107075E-02    (0.2409E-03)       4.445

RESIDUAL STANDARD DEVIATION = 1.287802
RESIDUAL DEGREES OF FREEDOM = 698
The slope parameter, A1, has a t value of 4.445, which is statistically significant (the critical value is 1.96). However, the value of the slope is 0.0011. Given that the slope is nearly zero, the assumption of constant location is not seriously violated even though the slope is statistically significant.
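As a rough arithmetic check on the fit output above, each T VALUE is the parameter estimate divided by its approximate standard deviation. The sketch below recomputes both ratios from the printed numbers and also shows how much the fitted line moves over the whole series. The variable names are illustrative only.

```python
# Estimates and approximate standard deviations copied from the fit output.
params = {
    "A0": (2898.19, 0.9745e-01),
    "A1": (0.107075e-02, 0.2409e-03),
}

# t value = estimate / approximate standard deviation
t_values = {name: est / sd for name, (est, sd) in params.items()}
for name, t in t_values.items():
    print(f"{name}: t = {t:.4g}")

# Change in the fitted location from X = 1 to X = N = 700.
drift = params["A1"][0] * (700 - 1)
print(f"fitted drift across the series = {drift:.2f} counts")
```

The recomputed ratios agree with the TVALUE column (about 0.2974E+05 and 4.445). The fitted drift over all 700 observations is roughly 0.75 counts, which is small next to the residual standard deviation of about 1.29 and helps explain why the slope is treated as practically negligible.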
Variation
One simple way to detect a change in variation is with a Bartlett test after dividing the data set into several equal-sized intervals. However, the Bartlett test is not robust for non-normality. Since the nature of the data (a few distinct points repeated many times) makes the normality assumption questionable, we use the alternative Levene test. In particular, we use the Levene test based on the median rather than the mean. The choice of the number of intervals is somewhat arbitrary, although values of 4 or 8 are reasonable.
Dataplot generated the following output for the Levene test.
LEVENE F-TEST FOR SHIFT IN VARIATION
(ASSUMPTION: NORMALITY)

1 STATISTICS
     NUMBER OF OBSERVATIONS  = 700
     NUMBER OF GROUPS        = 4
     LEVENE F TEST STATISTIC = 1.432365

FOR LEVENE TEST STATISTIC
     99.9 % POINT     = 5.482234
     76.79006 % Point: 1.432365

3 CONCLUSION (AT THE 5% LEVEL):
     THERE IS NO SHIFT IN VARIATION
     THUS THE GROUPS ARE HOMOGENEOUS WITH RESPECT TO VARIATION
Since the Levene test statistic value of 1.43 is less than the 95% critical value of 2.67, we conclude that the standard deviations are not significantly different in the 4 intervals.
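The median-based Levene statistic used above can be sketched as follows. This is an illustrative implementation of the standard formula (absolute deviations from each group median, followed by a one-way analysis-of-variance F ratio), not Dataplot's code. The function name `levene_median` is hypothetical, and the data are synthetic rounded normal values (an assumption), not the contents of SOULEN.DAT.

```python
import random
import statistics

def levene_median(groups):
    """Levene F statistic based on absolute deviations from group medians."""
    z = [[abs(v - statistics.median(g)) for v in g] for g in groups]
    k = len(z)                                  # number of groups
    n = sum(len(zi) for zi in z)                # total number of observations
    group_means = [sum(zi) / len(zi) for zi in z]
    grand_mean = sum(sum(zi) for zi in z) / n
    between = sum(len(zi) * (m - grand_mean) ** 2
                  for zi, m in zip(z, group_means))
    within = sum((v - m) ** 2
                 for zi, m in zip(z, group_means) for v in zi)
    return (n - k) / (k - 1) * between / within

# Synthetic integer-valued data (assumption), split into 4 equal intervals.
rng = random.Random(1)
y = [round(rng.gauss(2898.5, 1.3)) for _ in range(700)]
quarters = [y[i:i + 175] for i in range(0, 700, 175)]

w = levene_median(quarters)
print(f"Levene statistic = {w:.3f}")
```

The resulting statistic is compared with an upper critical value of the F distribution with k - 1 and N - k degrees of freedom, in the same way the text compares 1.43 with the 95% critical value.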
Randomness
There are many ways in which data can be non-random. However, most common forms of non-randomness can be detected with a few simple tests. The lag plot in the previous section is a simple graphical technique.
Another check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside this band indicate statistically significant values (lag 0 is always 1). Dataplot generated the following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of most interest, is 0.31. The critical values at the 5% level of significance are -0.087 and 0.087. This indicates that the lag 1 autocorrelation is statistically significant, so there is some evidence for non-randomness.
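A lag-k sample autocorrelation of the kind discussed above can be computed directly. The sketch below uses the usual definition (lagged cross-products divided by the total sum of squares about the mean). The function name `autocorr` is illustrative and the data are synthetic (an assumption). The band uses the common large-sample approximation of plus or minus 1.96/sqrt(N), which is also an assumption and need not match the critical values Dataplot reports.

```python
import math
import random

def autocorr(y, lag=1):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    num = sum((y[i] - mean) * (y[i + lag] - mean) for i in range(n - lag))
    return num / denom

# Synthetic integer-valued data (assumption), same length as the case study.
rng = random.Random(2)
y = [round(rng.gauss(2898.5, 1.3)) for _ in range(700)]

r1 = autocorr(y, lag=1)
band = 1.96 / math.sqrt(len(y))   # approximate 95% band for random data
print(f"lag 1 autocorrelation = {r1:.3f}, approximate band = +/-{band:.3f}")
```

Because the synthetic values are generated independently, their lag 1 autocorrelation should usually fall inside the band, in contrast to the value of 0.31 reported for the case-study data.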
A common test for randomness is the runs test.

RUNS UP

STATISTIC = NUMBER OF RUNS UP
OF LENGTH EXACTLY I

 I    STAT    EXP(STAT)   SD(STAT)     Z
 1   102.0    145.8750    12.1665    -3.61

STATISTIC = NUMBER OF RUNS UP
OF LENGTH I OR MORE

 I    STAT    EXP(STAT)   SD(STAT)     Z

RUNS DOWN

STATISTIC = NUMBER OF RUNS DOWN
OF LENGTH EXACTLY I

 I    STAT    EXP(STAT)   SD(STAT)     Z

STATISTIC = NUMBER OF RUNS DOWN
OF LENGTH I OR MORE

 9     0.0      0.0002     0.0132    -0.01
10     0.0      0.0000     0.0040     0.00

RUNS TOTAL = RUNS UP + RUNS DOWN

STATISTIC = NUMBER OF RUNS TOTAL
OF LENGTH EXACTLY I

 I    STAT    EXP(STAT)   SD(STAT)     Z

STATISTIC = NUMBER OF RUNS TOTAL
OF LENGTH I OR MORE

 I    STAT    EXP(STAT)   SD(STAT)     Z

LENGTH OF THE LONGEST RUN UP         = 7
LENGTH OF THE LONGEST RUN DOWN       = 6
LENGTH OF THE LONGEST RUN UP OR DOWN = 7

NUMBER OF POSITIVE DIFFERENCES = 262
NUMBER OF NEGATIVE DIFFERENCES = 258
NUMBER OF ZERO DIFFERENCES     = 179
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant at the 5% level. The runs test indicates some mild non-randomness.
Although the runs test and lag 1 autocorrelation indicate some mild non-randomness, it is not sufficient to reject the Yi = C + Ei model. At least part of the non-randomness can be explained by the discrete nature of the data.
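The bookkeeping behind the runs output can be illustrated with a short sketch. It counts positive, negative, and zero successive differences, finds the longest run up or down, and forms a Z value as (STAT - EXP(STAT)) / SD(STAT). The helper names are hypothetical, and treating a tie as ending a run is an assumption about the convention, not something the text states.

```python
def difference_counts(y):
    """Count positive, negative, and zero successive differences."""
    diffs = [b - a for a, b in zip(y, y[1:])]
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    zero = sum(d == 0 for d in diffs)
    return pos, neg, zero

def longest_run(y, up=True):
    """Longest stretch of consecutive increases (up) or decreases (down)."""
    best = current = 0
    for a, b in zip(y, y[1:]):
        moving = b > a if up else b < a
        current = current + 1 if moving else 0   # a tie or reversal resets the run
        best = max(best, current)
    return best

def z_score(stat, expected, sd):
    """Standardized statistic, as in the Z column of the runs output."""
    return (stat - expected) / sd

toy = [1, 2, 3, 3, 2, 1, 0, 5]
print(difference_counts(toy))                               # (3, 3, 1)
print(longest_run(toy, up=True), longest_run(toy, up=False))  # 2 3

# Row I = 1 of the "runs up of length exactly I" table above.
z = z_score(102.0, 145.8750, 12.1665)
print(f"Z = {z:.2f}")                                       # Z = -3.61
```

Note that the three difference counts reported for the case-study data (262, 258, and 179) sum to 699, which is N - 1 for N = 700 observations. The large number of zero differences reflects the discreteness of the integer data.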
Chi-square and Kolmogorov-Smirnov goodness-of-fit tests are alternative methods for assessing distributional adequacy. The Wilk-Shapiro and Anderson-Darling tests can be used to test for normality. Dataplot generates the following output for the Anderson-Darling normality test.
ANDERSON-DARLING 1-SAMPLE TEST
THAT THE DATA CAME FROM A NORMAL DISTRIBUTION

1 STATISTICS:
     NUMBER OF OBSERVATIONS                = 700
     MEAN                                  = 2898.562
     STANDARD DEVIATION                    = 1.304969

     ANDERSON-DARLING TEST STATISTIC VALUE = 16.76349
     ADJUSTED TEST STATISTIC VALUE         = 16.85843

2 CRITICAL VALUES:
     90   % POINT = 0.6560000
     95   % POINT = 0.7870000
     97.5 % POINT = 0.9180000
     99   % POINT = 1.092000

3 CONCLUSION (AT THE 5% LEVEL):
     THE DATA DO NOT COME FROM A NORMAL DISTRIBUTION
The Anderson-Darling test rejects the normality assumption because the test statistic, 16.76, is greater than the 99% critical value 1.092.
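The relationship between the two reported statistics can be checked with simple arithmetic. The sketch below assumes the small-sample factor (1 + 4/N - 25/N^2), a commonly cited adjustment when the mean and standard deviation are estimated from the data. The text does not say which adjustment Dataplot applies, so this is only an observation that the printed numbers are consistent with that factor.

```python
n = 700
a2 = 16.76349                      # reported Anderson-Darling statistic

# Assumed adjustment factor; not confirmed by the text.
adjusted = a2 * (1 + 4 / n - 25 / n**2)
print(f"adjusted statistic = {adjusted:.5f}")

# Critical values copied from the output above.
critical = {"90%": 0.656, "95%": 0.787, "97.5%": 0.918, "99%": 1.092}
rejected = {level: adjusted > cv for level, cv in critical.items()}
print(rejected)
```

Under this assumption the result rounds to the reported adjusted value of 16.85843, and the statistic exceeds every listed critical value, so normality is rejected at each of those levels.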
Although the data are not strictly normal, the violation of the normality assumption is not severe enough to conclude that the Yi = C + Ei model is unreasonable. At least part of the non-normality can be explained by the discrete nature of the data.
1 STATISTICS:
     NUMBER OF OBSERVATIONS = 700
     MINIMUM                = 2895.000
     MEAN                   = 2898.562
     MAXIMUM                = 2902.000
     STANDARD DEVIATION     = 1.304969

     GRUBBS TEST STATISTIC  = 2.729201
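The Grubbs statistic in this output can be approximately reproduced from the printed summary values. The sketch assumes the usual two-sided definition, the largest absolute deviation from the mean divided by the standard deviation. Because the printed mean is rounded to three decimals, the recomputed value agrees with the reported one only to about three significant figures.

```python
# Summary values copied from the output above.
minimum, mean, maximum, sd = 2895.0, 2898.562, 2902.0, 1.304969

# Two-sided Grubbs statistic (assumed definition): max |Yi - mean| / s
g = max(mean - minimum, maximum - mean) / sd
print(f"Grubbs statistic = {g:.4f}")
```

Here the minimum lies slightly farther from the mean than the maximum, so the minimum (2895) determines the statistic.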
http://www.itl.nist.gov/div898/handbook/eda/section4/eda4243.htm (6 of 8) [5/1/2006 9:58:49 AM]