1 Exploratory Data Analysis
promotes insight into important aspects of the process that may not have surfaced otherwise.
no more powerful catalyst for discovery than the bringing together of an experienced/expert scientist/engineer and a data set ripe with intriguing "anomalies" and characteristics.
1.2.5.1 Consequences of Non-Randomness

If the randomness assumption does not hold, then all of the usual statistical tests are invalid.

One specific and common type of non-randomness is autocorrelation. Autocorrelation is the correlation between Y_t and Y_{t-k}, where k is an integer that defines the lag for the autocorrelation. That is, autocorrelation is a time-dependent non-randomness. This means that the value of the current point is highly dependent on the previous point if k = 1 (or on the point k steps earlier if k is not 1). Autocorrelation is typically detected via an autocorrelation plot or a lag plot.

If the data are not random due to autocorrelation, then:
1. Adjacent data values may be related.
2. There may not be n independent snapshots of the phenomenon under study.
3. There may be undetected "junk"-outliers.
The usual formula for the uncertainty of the mean:
$$s(\bar{Y}) = \sqrt{\frac{1}{N(N-1)} \sum_{i=1}^{N} (Y_i - \bar{Y})^2}$$
may be invalid and the numerical value optimistically small.
Case Studies: The airplane glass failure case study gives an example of determining an appropriate distribution and estimating the parameters of that distribution. The uniform random numbers case study gives an example of determining a more appropriate centrality parameter for a non-normal distribution.
Other consequences that flow from problems with distributional assumptions are:

Distribution:
1. The distribution may be changing.
2. The single distribution estimate may be meaningless (if the process distribution is changing).

Model:
1. The model may be changing.
2. The single model estimate may be meaningless.

Process:
1. The process may be out-of-control.
2. The process may be unpredictable.
1.3 EDA Techniques
Summary: After you have collected a set of data, how do you do an exploratory data analysis? What techniques do you employ? What do the various techniques focus on? What conclusions can you expect to reach? This section provides answers to these kinds of questions via a gallery of EDA techniques and a detailed description of each technique. The techniques are divided into graphical and quantitative techniques. For exploratory data analysis, the emphasis is primarily on the graphical techniques.
1.3.1 Introduction

we have divided the descriptions into graphical and quantitative techniques. This is for organizational clarity and is not meant to discourage the use of both graphical and quantitative techniques when analyzing data.

Availability in Software: The sample plots and output in this section were generated with the Dataplot software program. Other general purpose statistical data analysis programs can generate most of the plots, intervals, and tests discussed here, or macros can be written to achieve the same result.
http://www.itl.nist.gov/div898/handbook/eda/section3/eda31.htm [5/1/2006 9:56:27 AM]
1.3.2 Analysis Questions

question-specific EDA techniques
to address these problems.
1.3.3 Graphical Techniques: Alphabetic
This section provides a gallery of some useful graphical techniques. The techniques are ordered alphabetically, so this section is not intended to be read in a sequential fashion. The use of most of these graphical techniques is demonstrated in the case studies in this chapter. A few of these graphical techniques are demonstrated in later chapters.
[Gallery of graphical techniques; surviving entries]
Box Plot: 1.3.3.7
Complex Demodulation Amplitude Plot: 1.3.3.8
Quantile-Quantile Plot: 1.3.3.24
Star Plot: 1.3.3.29
Weibull Plot
1.3.3.1 Autocorrelation Plot

be near zero for any and all time-lag separations. If non-random, then one or more of the autocorrelations will be significantly non-zero.

In addition, autocorrelation plots are used in the model identification stage for Box-Jenkins autoregressive, moving average time series models.
Definition: r(h) versus h
Autocorrelation plots are formed by:
Vertical axis: Autocorrelation coefficient
$$R_h = \frac{C_h}{C_0}$$
where C_h is the autocovariance function
$$C_h = \frac{1}{N} \sum_{t=1}^{N-h} (Y_t - \bar{Y})(Y_{t+h} - \bar{Y})$$
and C_0 is the variance function
$$C_0 = \frac{1}{N} \sum_{t=1}^{N} (Y_t - \bar{Y})^2$$
Note: R_h is between -1 and +1.
Note: Some sources may use the following formula for the autocovariance function:
$$C_h = \frac{1}{N-h} \sum_{t=1}^{N-h} (Y_t - \bar{Y})(Y_{t+h} - \bar{Y})$$
Although this definition has less bias, the (1/N) formulation has some desirable statistical properties and is the form most commonly used in the statistics literature. See pages 20 and 49-50 in Chatfield for details.
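The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the (1/N) form of the autocovariance; the function name `autocorrelation` and the sample series are hypothetical and not part of the handbook.

```python
from typing import Sequence


def autocorrelation(y: Sequence[float], h: int) -> float:
    """Lag-h autocorrelation R_h = C_h / C_0 using the (1/N) autocovariance."""
    n = len(y)
    if not 0 <= h < n:
        raise ValueError("lag h must satisfy 0 <= h < len(y)")
    mean = sum(y) / n
    # C_h: (1/N) * sum over the N - h overlapping pairs of deviations.
    c_h = sum((y[t] - mean) * (y[t + h] - mean) for t in range(n - h)) / n
    # C_0: (1/N) * sum of squared deviations (the variance function).
    c_0 = sum((v - mean) ** 2 for v in y) / n
    if c_0 == 0:
        raise ValueError("series is constant, autocorrelation is undefined")
    return c_h / c_0


# Made-up, steadily increasing series (illustrative data only).
data = [2.0, 4.0, 6.0, 8.0, 10.0]
r = [autocorrelation(data, h) for h in range(3)]
print(r)  # lag 0 is 1 by definition; lags 1 and 2 are about 0.4 and -0.1
```

Dividing by N rather than N - h shrinks the coefficients at large lags, which is the bias the note above refers to.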
If the autocorrelation plot is being used to test for randomness (i.e., there is no time dependence in the data), the following formula is recommended:
$$\pm \frac{z_{1-\alpha/2}}{\sqrt{N}}$$
where N is the sample size, z is the percent point function of the standard normal distribution and α is the significance level. In this case, the confidence bands have fixed width that depends on the sample size. This is the formula that was used to generate the confidence bands in the above plot.
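As a rough numerical illustration of the fixed-width band for the randomness test, the sketch below evaluates z divided by the square root of N with the Python standard library. The function name `randomness_band` and the sample size of 200 are assumptions chosen for the example.

```python
from math import sqrt
from statistics import NormalDist


def randomness_band(n: int, alpha: float = 0.05) -> float:
    """Half-width of the band: z_{1 - alpha/2} / sqrt(N)."""
    # inv_cdf is the percent point function of the standard normal.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / sqrt(n)


band_95 = randomness_band(200)        # assumed sample size N = 200
band_99 = randomness_band(200, 0.01)  # a smaller alpha gives a wider band
print(f"95% band: +/-{band_95:.4f}, 99% band: +/-{band_99:.4f}")
```

The band depends only on N and the significance level, so it plots as a pair of horizontal lines across all lags.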
Autocorrelation plots are also used in the model identification stage for fitting ARIMA models. In this case, a moving average model is assumed for the data and the following confidence bands should be generated:
$$\pm z_{1-\alpha/2} \sqrt{\frac{1}{N}\left(1 + 2\sum_{i=1}^{k} R_i^2\right)}$$
where k is the lag, N is the sample size, z is the percent point function of the standard normal distribution and α is the significance level. In this case, the confidence bands increase as the lag increases.
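A hedged sketch of lag-dependent bands follows. It assumes the usual Bartlett-style form, in which the band at lag k grows with the running sum of squared sample autocorrelations up to lag k; the function name `moving_average_bands` and the input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import List, Sequence


def moving_average_bands(r: Sequence[float], n: int, alpha: float = 0.05) -> List[float]:
    """Half-widths for lags 1..K, where r[0] is the lag-1 autocorrelation.

    Assumed form: z * sqrt((1 + 2 * sum_{i=1..k} r_i^2) / N).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    bands = []
    running_sum = 0.0
    for r_i in r:
        running_sum += r_i ** 2  # accumulate squared autocorrelations
        bands.append(z * sqrt((1 + 2 * running_sum) / n))
    return bands


# Made-up autocorrelations for lags 1..4 and an assumed sample size.
bands = moving_average_bands([0.5, 0.25, 0.1, 0.0], n=100)
print([round(b, 4) for b in bands])
```

Because the running sum can only grow, the half-widths never decrease with lag, which matches the statement that these bands widen as the lag increases.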
1. Most standard statistical tests depend on randomness. The validity of the test conclusions is directly linked to the validity of the randomness assumption.
2. Many commonly used statistical formulae depend on the randomness assumption, the most common formula being the formula for determining the standard deviation of the sample mean:
$$s_{\bar{Y}} = \frac{s}{\sqrt{N}}$$
where s is the standard deviation of the data. Although heavily used, the results from using this formula are of no value unless the randomness assumption holds.

In short, if the analyst does not check for randomness, then the validity of many of the statistical conclusions becomes suspect. The autocorrelation plot is an excellent way of checking for such randomness.
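To make the warning about the standard deviation of the sample mean concrete, the following simulation compares the naive s divided by the square root of N with the actual spread of sample means across many replicate series. It is a sketch under stated assumptions: the data are generated from a first-order autoregressive process with coefficient 0.9, and the helper name `ar1_series`, the seed, and all settings are illustrative choices, not handbook values.

```python
import random
from math import sqrt
from statistics import mean, stdev


def ar1_series(n, phi, rng, burn_in=200):
    """Simulate Y_t = phi * Y_{t-1} + e_t with standard normal noise e_t."""
    y, prev = [], 0.0
    for t in range(n + burn_in):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard the start-up transient
            y.append(prev)
    return y


rng = random.Random(12345)
n, phi, replicates = 100, 0.9, 500
sample_means, naive_errors = [], []
for _ in range(replicates):
    y = ar1_series(n, phi, rng)
    sample_means.append(mean(y))
    naive_errors.append(stdev(y) / sqrt(n))  # the usual s / sqrt(N)

actual_sd_of_mean = stdev(sample_means)    # what the formula tries to estimate
typical_naive_error = mean(naive_errors)   # what the formula reports
print(f"actual: {actual_sd_of_mean:.3f}, naive: {typical_naive_error:.3f}")
```

With strong positive autocorrelation the naive figure comes out several times too small, because the N observations carry far fewer than N independent pieces of information.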
Examples: Examples of the autocorrelation plot for several common situations are given in the following pages.
1. Random (= White Noise)

Case Study: The autocorrelation plot is demonstrated in the beam deflection data case study.

Software: Autocorrelation plots are available in most general purpose statistical software programs including Dataplot.
1.3.3.1.1 Autocorrelation Plot: Random Data

The following is a sample autocorrelation plot.

Conclusions: We can make the following conclusions from this plot.
1. There are no significant autocorrelations.

Discussion: Note that with the exception of lag 0, which is always 1 by definition, almost all of the autocorrelations fall within the 95% confidence limits. In addition, there is no apparent pattern (such as the first twenty-five being positive and the second twenty-five being negative). This is the absence of a pattern we expect to see if the data are in fact random.

A few lags slightly outside the 95% and 99% confidence limits do not necessarily indicate non-randomness. For a 95% confidence interval, we might expect about one out of twenty lags to be statistically significant due to random fluctuations.

There is no associative ability to infer from a current value Y_i as to what the next value Y_{i+1} will be. Such non-association is the essence of randomness. In short, adjacent observations do not "co-relate", so we call this the "no autocorrelation" case.
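The "about one in twenty" expectation for random data can be checked with a small simulation. This sketch generates white noise, computes the autocorrelations for lags 1 to 50 with the (1/N) formulation, and counts how many fall outside the 95% randomness band. The helper name `autocorrelations`, the seed, and the sample size are assumptions made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist


def autocorrelations(y, max_lag):
    """Return R_1..R_max_lag using the (1/N) autocovariance."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    c_0 = sum(d * d for d in dev) / n
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / n / c_0
        for h in range(1, max_lag + 1)
    ]


rng = random.Random(2006)
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]  # simulated random data
r = autocorrelations(noise, max_lag=50)
band = NormalDist().inv_cdf(0.975) / sqrt(len(noise))
outside = [h for h, r_h in enumerate(r, start=1) if abs(r_h) > band]
print(f"{len(outside)} of {len(r)} lags fall outside +/-{band:.3f}")
```

A handful of lags outside the band is ordinary for random data; a long run of same-signed or slowly decaying values would be the more telling sign of non-randomness.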
1.3.3.1.2 Autocorrelation Plot: Moderate Autocorrelation

The following is a sample autocorrelation plot.

Conclusions: We can make the following conclusions from this plot.
1. The data come from an underlying autoregressive model with moderate positive autocorrelation.

Discussion: The plot starts with a moderately high autocorrelation at lag 1 (approximately 0.75) that gradually decreases. The decreasing autocorrelation is generally linear, but with significant noise. Such a pattern is the autocorrelation plot signature of "moderate autocorrelation", which in turn provides moderate predictability if modeled properly.

randomness, the residuals after fitting Y_i against Y_{i-1} should result in random residuals. Assessing whether or not the proposed model in fact sufficiently removed the non-randomness is discussed in detail in the Process Modeling chapter.

The residual standard deviation for this autoregressive model will be much smaller than the residual standard deviation for the default model.
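The claim about the smaller residual standard deviation can be illustrated by fitting Y_i against Y_{i-1} with ordinary least squares. The sketch below assumes a straight-line model with intercept `a0` and slope `a1`, and simulated data with a lag-1 coefficient of 0.75 to echo the plot described above; the variable names, seed, and series length are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

# Simulate an autoregressive series (assumed coefficient 0.75).
rng = random.Random(331)
phi_true = 0.75
y, prev = [], 0.0
for t in range(700):
    prev = phi_true * prev + rng.gauss(0.0, 1.0)
    if t >= 200:  # discard the start-up transient
        y.append(prev)

# Least squares fit of Y_i on Y_{i-1}.
x, target = y[:-1], y[1:]
x_bar, t_bar = mean(x), mean(target)
sxx = sum((a - x_bar) ** 2 for a in x)
sxy = sum((a - x_bar) * (b - t_bar) for a, b in zip(x, target))
a1 = sxy / sxx            # slope
a0 = t_bar - a1 * x_bar   # intercept

residuals = [b - (a0 + a1 * a) for a, b in zip(x, target)]
resid_sd = sqrt(sum(e * e for e in residuals) / (len(residuals) - 2))
default_sd = stdev(y)     # residual SD of the default model Y = constant + error
print(f"slope {a1:.2f}, residual SD {resid_sd:.2f}, default SD {default_sd:.2f}")
```

The lag-1 fit leaves residuals with a clearly smaller standard deviation than the constant-only model, and those residuals are what should then be checked for randomness.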
1.3.3.1.3 Autocorrelation Plot: Strong Autocorrelation and Autoregressive Model

Autocorrelation Plot for Strong Autocorrelation: The following is a sample autocorrelation plot.

Conclusions: We can make the following conclusions from the above plot.
1. The data come from an underlying autoregressive model with strong positive autocorrelation.

Discussion: The plot starts with a high autocorrelation at lag 1 (only slightly less than 1) that slowly declines. It continues decreasing until it becomes negative and starts showing an increasing negative autocorrelation. The decreasing autocorrelation is generally linear with little noise. Such a pattern is the autocorrelation plot signature of "strong autocorrelation", which in turn provides high predictability if modeled properly.

randomness, the residuals after fitting Y_i against Y_{i-1} should result in random residuals. Assessing whether or not the proposed model in fact sufficiently removed the non-randomness is discussed in detail in the Process Modeling chapter.

The residual standard deviation for this autoregressive model will be much smaller than the residual standard deviation for the default model.
1.3.3.1.4 Autocorrelation Plot: Sinusoidal Model

The following is a sample autocorrelation plot.

Conclusions: We can make the following conclusions from the above plot.
1. The data come from an underlying sinusoidal model.

Discussion: The plot exhibits an alternating sequence of positive and negative spikes. These spikes are not decaying to zero. Such a pattern is the autocorrelation plot signature of a sinusoidal model.
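The sinusoidal signature can be reproduced on noise-free synthetic data. This sketch assumes a pure sine wave with a period of 10 observations and reuses the (1/N) autocorrelation formula; the helper name `autocorrelations` and the series length are illustrative.

```python
from math import pi, sin


def autocorrelations(y, max_lag):
    """Return R_0..R_max_lag using the (1/N) autocovariance."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    c_0 = sum(d * d for d in dev) / n
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / n / c_0
        for h in range(max_lag + 1)
    ]


period = 10  # assumed period, in observations
wave = [sin(2 * pi * t / period) for t in range(200)]
r = autocorrelations(wave, max_lag=20)

# Half a period apart the series is anti-correlated; a full period apart
# it lines up with itself again.
for h in (5, 10, 15, 20):
    print(f"lag {h:2d}: {r[h]:+.3f}")
```

The coefficients swing between strongly negative and strongly positive values at half-period spacing. With the (1/N) formulation the peaks taper only slightly over these lags, in proportion to the shrinking overlap, rather than dying out.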
1.3.3.2 Bihistogram

It is also based on the common and well-understood histogram.

factor has a significant effect on the location (typical value) for strength and hence batch is said to be "significant" or to "have an effect". We thus see graphically and convincingly what a t-test or analysis of variance would indicate quantitatively.

With respect to variation, note that the spread (variation) of the above-axis batch 1 histogram does not appear to be that much different from the below-axis batch 2 histogram. With respect to distributional shape, note that the batch 1 histogram is skewed left while the batch 2 histogram is more symmetric with even a hint of a slight skewness to the right.

Thus the bihistogram reveals that there is a clear difference between the batches with respect to location and distribution, but not in regard to variation. Comparing batch 1 and batch 2, we also note that batch 1 is the "better batch" due to its 100-unit higher average strength (around 725).

Definition: Two adjoined histograms
Bihistograms are formed by vertically juxtaposing two histograms:
● Above the axis: Histogram of the response variable for condition 1
● Below the axis: Histogram of the response variable for condition 2

Questions: The bihistogram can provide answers to the following questions:
1. Is a (2-level) factor significant?

The bihistogram is an important EDA tool for determining if a factor "has an effect". Since the bihistogram provides insight into the validity of three (location, variation, and distribution) out of the four (missing only randomness) underlying assumptions in a measurement process, it is an especially valuable tool. Because of the dual (above/below) nature of the plot, the bihistogram is restricted to assessing factors that have only two levels. However, this is very common in the before-versus-after character of many scientific and engineering experiments.
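The construction of a bihistogram (two histograms sharing one set of bins, one drawn above the axis and one below) can be sketched without a plotting library. The function names `bihistogram_counts` and `render`, the bin count, and the strength values below are all made up for illustration; they are not the ceramic strength data.

```python
def bihistogram_counts(cond1, cond2, n_bins=3):
    """Bin both conditions on a shared set of equal-width bins."""
    lo = min(min(cond1), min(cond2))
    hi = max(max(cond1), max(cond2))
    if hi == lo:
        raise ValueError("all values are identical, cannot form bins")
    width = (hi - lo) / n_bins

    def counts(values):
        c = [0] * n_bins
        for v in values:
            # The maximum value is placed in the last bin.
            c[min(int((v - lo) / width), n_bins - 1)] += 1
        return c

    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts(cond1), counts(cond2)


def render(counts1, counts2):
    """Text picture: condition 1 bars rise above the axis, condition 2 hang below."""
    rows = []
    for level in range(max(counts1), 0, -1):
        rows.append("".join(" # " if c >= level else "   " for c in counts1))
    rows.append("---" * len(counts1))  # the shared horizontal axis
    for level in range(1, max(counts2) + 1):
        rows.append("".join(" # " if c >= level else "   " for c in counts2))
    return "\n".join(rows)


batch1 = [650, 700, 720, 750, 800]  # made-up values for condition 1
batch2 = [500, 550, 620, 640, 700]  # made-up values for condition 2
edges, counts1, counts2 = bihistogram_counts(batch1, batch2)
picture = render(counts1, counts2)
print(picture)
```

Sharing the bin edges is the design choice that matters: it lets shifts in location, spread, and shape between the two conditions be read directly by comparing the bars above and below the axis.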
Techniques:
● t test (for shift in location)
● F test (for shift in variation)
● Kolmogorov-Smirnov test (for shift in distribution)
● Quantile-quantile plot (for shift in location and distribution)

Case Study: The bihistogram is demonstrated in the ceramic strength data case study.

Software: The bihistogram is not widely available in general purpose statistical software programs. Bihistograms can be generated using Dataplot.
1.3.3.3 Block Plot

It replaces the analysis of variance test with a less assumption-dependent binomial test and should be routinely used whenever we are trying to robustly decide whether a primary factor has

This block plot reveals that in 10 of the 12 cases (bars), weld method 2 is lower (better) than weld method 1. From a binomial point of view, weld method is statistically significant.
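The binomial argument can be made explicit with a short calculation. The sketch assumes the natural null hypothesis that each of the 12 bars is equally likely to favor either weld method, and computes the one-sided probability of 10 or more bars favoring the same method. The function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 10 of 12 bars favor weld method 2; under "no effect" each bar is a coin flip.
p_value = binomial_upper_tail(10, 12)
print(f"P(10 or more of 12) = {p_value:.4f}")  # 79/4096, about 0.019
```

A probability of roughly 2% is below the conventional 5% level, which is the sense in which the weld method is called statistically significant here. A two-sided version would double this figure.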
Definition: Block Plots are formed as follows:
● Vertical axis: Response variable Y

called "general conclusions". If we find that one weld method setting does better (smaller average defects per hour) than the other weld method setting for all or most of these 12 nuisance factor combinations, then the conclusion is in fact general and robust.

Ordering along the horizontal axis: In the above chart, the ordering along the horizontal axis is as follows:
● The left 6 bars are from plant 1 and the right 6 bars are from plant 2.
● The first 3 bars are from speed 1, the next 3 bars are from speed 2, the next 3 bars are from speed 1, and the last 3 bars are from speed 2.
● Bars 1, 4, 7, and 10 are from the first shift, bars 2, 5, 8, and 11 are from the second shift, and bars 3, 6, 9, and 12 are from the third shift.