1 Exploratory Data Analysis
This chapter presents the assumptions, principles, and techniques necessary to gain insight into data via EDA (exploratory data analysis).
Dataplot Commands for EDA Techniques
http://www.itl.nist.gov/div898/handbook/eda/eda.htm [5/1/2006 9:56:13 AM]
Exploratory Data Analysis - Detailed Table of Contents [1.]
Consequences of Non-Fixed Variation Parameter [1.2.5.3.]
Histogram Interpretation: Skewed (Non-Normal) Right [1.3.3.14.6.]
Normal Probability Plot [1.3.3.21.]
Normal Probability Plot: Normally Distributed Data [1.3.3.21.1.]
Scatter Plot: Variation of Y Does Not Depend on X (homoscedastic) [1.3.3.26.8.]
Two-Sample t-Test for Equal Means [1.3.5.3.]
Data Used for Two-Sample t-Test [1.3.5.3.1.]
Chi-Square Test for the Standard Deviation [1.3.5.8.]
Data Used for Chi-Square Test for the Standard Deviation [1.3.5.8.1.]
Power Lognormal Distribution [1.3.6.6.14.]
Tables for Probability Distributions [1.3.6.7.]
Cumulative Distribution Function of the Standard Normal Distribution [1.3.6.7.1.]
EDA Case Studies [1.4.]
Case Studies Introduction [1.4.1.]
Case Studies [1.4.2.]
Normal Random Numbers [1.4.2.1.]
Background and Data [1.4.2.1.1.]
Uniform Random Numbers [1.4.2.2.]
Background and Data [1.4.2.2.1.]
Josephson Junction Cryothermometry [1.4.2.4.]
Background and Data [1.4.2.4.1.]
Heat Flow Meter 1 [1.4.2.8.]
Background and Data [1.4.2.8.1.]
Airplane Glass Failure Time [1.4.2.9.]
Background and Data [1.4.2.9.1.]
Power Lognormal Analysis [1.4.2.9.7.]
1.1 EDA Introduction
Summary: What is exploratory data analysis? How did it begin, and where did it originate? How is it differentiated from other data analysis approaches, such as classical and Bayesian? Is EDA the same as statistical graphics? What role does statistical graphics play in EDA?
This section answers these and related questions and provides the necessary frame of reference for EDA assumptions, principles, and techniques.
1.1.1 What is EDA?
Approach: Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set.
Focus: The EDA approach is precisely that: an approach. It is not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out.
Philosophy: EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue; EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favor of the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques; EDA is a philosophy as to how we dissect a data set, what we look for, how we look, and how we interpret. It is true that EDA heavily uses the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics per se.
History: The seminal work in EDA is Exploratory Data Analysis, Tukey (1977). Over the years it has benefitted from other noteworthy publications such as Data Analysis and Regression, Mosteller and Tukey (1977); Interactive Data Analysis, Hoaglin (1977); and The ABC's of EDA, Velleman and Hoaglin (1981), and it has gained a large following as "the" way to analyze a data set.
Techniques: Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the heavy reliance on graphics is that by its very nature the main role of EDA is to open-mindedly explore, and graphics gives the analyst unparalleled power to do so, enticing the data to reveal its structural secrets and being always ready to gain some new, often unsuspected, insight into the data. In combination with the natural pattern-recognition capabilities that we all possess, graphics provides the power to carry this out.
The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:
Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).
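As a rough illustration of what "plotting the raw data" involves, the sketch below builds a text histogram and the lag-1 pairs that a lag plot would display, using only the Python standard library. The readings, the bin width, and the function names are hypothetical choices made for this example; they are not taken from the handbook.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count observations per bin; each bin covers [left, left + bin_width)."""
    counts = Counter(int(v // bin_width) for v in values)
    return {k * bin_width: counts[k] for k in sorted(counts)}

def lag_pairs(values, lag=1):
    """Return (Y[i], Y[i + lag]) pairs, the coordinates of a lag plot."""
    return list(zip(values[:-lag], values[lag:]))

# Hypothetical measurements, made up for illustration only.
readings = [9.8, 10.1, 10.4, 9.9, 10.0, 10.6, 10.2, 9.7, 10.3, 10.1]

hist = histogram_counts(readings, bin_width=0.25)
for left_edge, count in hist.items():
    # One '*' per observation gives a crude text histogram.
    print(f"{left_edge:6.2f} | {'*' * count}")

pairs = lag_pairs(readings)
print(pairs[:3])
```

The same two views are normally drawn with a plotting library; the text form is used here only so that the sketch has no dependencies.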
1.1.2 How Does Exploratory Data Analysis Differ from Classical Data Analysis?
For classical analysis, the sequence is
Problem => Data => Model => Analysis => Conclusions
For EDA, the sequence is
Problem => Data => Analysis => Model => Conclusions
For Bayesian, the sequence is
Problem => Data => Model => Prior Distribution => Analysis => Conclusions
In the real world, data analysts freely mix elements of all of the above three approaches (and other approaches). The above distinctions were made to emphasize the major differences among the three approaches.
1.1.2.1 Model
Classical: The classical approach imposes models (both deterministic and probabilistic) on the data. Deterministic models include, for example, regression models and analysis of variance (ANOVA) models. The most common probabilistic model assumes that the errors about the deterministic model are normally distributed; this assumption affects the validity of the ANOVA F tests.
Exploratory: The Exploratory Data Analysis approach does not impose deterministic or probabilistic models on the data. On the contrary, the EDA approach allows the data to suggest admissible models that best fit the data.
1.1.2.2 Focus
Classical: The two approaches differ substantially in focus. For classical analysis, the focus is on the model: estimating parameters of the model and generating predicted values from the model.
Exploratory: For exploratory data analysis, the focus is on the data: its structure, outliers, and models suggested by the data.
1.1.2.3 Techniques
Classical: Classical techniques are generally quantitative in nature. They include ANOVA, t tests, chi-squared tests, and F tests.
Exploratory: EDA techniques are generally graphical. They include scatter plots, character plots, box plots, histograms, bihistograms, probability plots, residual plots, and mean plots.
1.1.2.4 Rigor
Classical: Classical techniques serve as the probabilistic foundation of science and engineering; the most important characteristic of classical techniques is that they are rigorous, formal, and "objective".
Exploratory: EDA techniques do not share in that rigor or formality. EDA techniques make up for that lack of rigor by being very suggestive, indicative, and insightful about what the appropriate model should be.
EDA techniques are subjective and depend on interpretation, which may differ from analyst to analyst, although experienced analysts commonly arrive at identical conclusions.
1.1.2.5 Data Treatment
Classical: Classical estimation techniques have the characteristic of taking all of the data and mapping the data into a few numbers ("estimates"). This is both a virtue and a vice. The virtue is that these few numbers focus on important characteristics (location, variation, etc.) of the population. The vice is that concentrating on these few characteristics can filter out other characteristics (skewness, tail length, autocorrelation, etc.) of the same population. In this sense there is a loss of information due to this "filtering" process.
Exploratory: The EDA approach, on the other hand, often makes use of (and shows) all of the available data. In this sense there is no corresponding loss of information.
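The "filtering" effect can be sketched with a small, hedged example: two made-up samples are rescaled to share the same mean and standard deviation, so a location/variation summary cannot tell them apart, yet their skewness differs clearly. The sample values, the target mean and standard deviation, and the helper names are assumptions for illustration only.

```python
import math

def moments(values):
    """Return (mean, population standard deviation, skewness)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n
    return mean, sd, skew

def rescale(values, target_mean, target_sd):
    """Shift and scale values to a chosen mean and standard deviation."""
    mean, sd, _ = moments(values)
    return [target_mean + target_sd * (v - mean) / sd for v in values]

# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = rescale([-3, -2, -1, 0, 0, 1, 2, 3], 10.0, 2.0)
skewed = rescale([0, 0, 0, 1, 1, 2, 4, 8], 10.0, 2.0)

for name, sample in (("symmetric", symmetric), ("skewed", skewed)):
    mean, sd, skew = moments(sample)
    print(f"{name:9s} mean={mean:.3f} sd={sd:.3f} skewness={skew:.3f}")
```

Both samples report the same mean and standard deviation, so only a display of all the data (or an additional statistic such as skewness) reveals the difference in shape.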
1.1.2.6 Assumptions
Classical: The "good news" of the classical approach is that tests based on classical techniques are usually very sensitive; that is, if a true shift in location, say, has occurred, such tests frequently have the power to detect such a shift and to conclude that such a shift is "statistically significant". The "bad news" is that classical tests depend on underlying assumptions (e.g., normality), and hence the validity of the test conclusions becomes dependent on the validity of the underlying assumptions. Worse yet, the exact underlying assumptions may be unknown to the analyst, or if known, untested. Thus the validity of the scientific conclusions becomes intrinsically linked to the validity of the underlying assumptions. In practice, if such assumptions are unknown or untested, the validity of the scientific conclusions becomes suspect.
Exploratory: Many EDA techniques make few or no assumptions; they present and show the data, all of the data, as is, with fewer encumbering assumptions.
1.1.3 How Does Exploratory Data Analysis Differ from Summary Analysis?
Summary: A summary analysis is simply a numeric reduction of a historical data set. It is quite passive. Its focus is in the past. Quite commonly, its purpose is to simply arrive at a few key statistics (for example, mean and standard deviation) which may then either replace the data set or be added to the data set in the form of a summary table.
Exploratory: In contrast, EDA has as its broadest goal the desire to gain insight into the engineering/scientific process behind the data. Whereas summary statistics are passive and historical, EDA is active and futuristic. In an attempt to "understand" the process and improve it in the future, EDA uses the data as a "window" to peer into the heart of the process that generated the data. There is an archival role in the research and manufacturing world for summary statistics, but there is an enormously larger role for the EDA approach.
1.1.4 What are the EDA Goals?
a good-fitting, parsimonious model
... essence of the data. Graphics are irreplaceable; there are no quantitative analogues that will give the same insight as well-chosen graphics.
To get a "feel" for the data, it is not enough for the analyst to know what is in the data; the analyst also must know what is not in the data, and the only way to do that is to draw on our own human pattern-recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data.
1.1.5 The Role of Graphics
Quantitative: Quantitative techniques are the set of statistical procedures that yield numeric or tabular output. Examples of quantitative techniques include:
● hypothesis testing
● analysis of variance
● point estimates and confidence intervals
● least squares regression
These and similar techniques are all valuable and are mainstream in terms of classical analysis.
Graphical: On the other hand, there is a large collection of statistical tools that we generally refer to as graphical techniques. These include:
● scatter plots
● histograms
● probability plots
● residual plots
● box plots
● block plots
1.1.6 An EDA/Graphics Example
Summary Statistics: If the goal of the analysis is to compute summary statistics plus determine the best linear fit for Y as a function of X, the results might be given as:
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
The above quantitative analysis, although valuable, gives us only limited insight into the data.
Scatter Plot: In contrast, the following simple scatter plot of the data suggests the following:
The data set "behaves like" a linear curve with some scatter;
the vertical spread of the data appears to be of equal height irrespective of the X-value; this indicates that the data are equally precise throughout and so a "regular" (that is, equi-weighted) fit is appropriate.
[Anscombe data sets 2, 3, and 4]:
   X2     Y2     X3     Y3     X4     Y4
10.00   9.14  10.00   7.46   8.00   6.58
 8.00   8.14   8.00   6.77   8.00   5.76
13.00   8.74  13.00  12.74   8.00   7.71
 9.00   8.77   9.00   7.11   8.00   8.84
11.00   9.26  11.00   7.81   8.00   8.47
14.00   8.10  14.00   8.84   8.00   7.04
 6.00   6.13   6.00   6.08   8.00   5.25
 4.00   3.10   4.00   5.39  19.00  12.50
12.00   9.13  12.00   8.15   8.00   5.56
 7.00   7.26   7.00   6.42   8.00   7.91
 5.00   4.74   5.00   5.73   8.00   6.89
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
which is identical to the analysis for data set 1. One might naively assume that the two data sets are "equivalent" since that is what the statistics tell us; but what do the statistics not tell us?
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.236
Correlation = 0.816 (0.817 for data set 4)
which implies that in some quantitative sense, all four of the data sets are "equivalent". In fact, the four data sets are far from "equivalent", and a scatter plot of each data set, which would be step 1 of any EDA approach, would tell us that immediately.
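The near-identical summaries can be checked directly from the tabulated values. The following is a minimal sketch using only the Python standard library; the function name `fit_summary` and the dictionary layout are choices made for this example, and the residual standard deviation is assumed to use N - 2 degrees of freedom, as is usual for a straight-line least squares fit.

```python
import math

# Anscombe data sets 2, 3, and 4, transcribed from the table above.
X2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
X3 = X2  # the table lists the same X values for data sets 2 and 3
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def fit_summary(x, y):
    """Summary statistics and least squares line for Y as a function of X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "resid_sd": math.sqrt(sse / (n - 2)),  # assumes N - 2 degrees of freedom
        "corr": sxy / math.sqrt(sxx * syy),
    }

summaries = {
    "data set 2": fit_summary(X2, Y2),
    "data set 3": fit_summary(X3, Y3),
    "data set 4": fit_summary(X4, Y4),
}
for name, s in summaries.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

The numbers agree to about two decimal places across the three data sets, which is exactly why the summary alone is misleading: the shape of each data set only becomes visible once the (X, Y) pairs are plotted.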