
DOCUMENT INFORMATION

Title: Exploratory Data Analysis
Institution: National Institute of Standards and Technology
Subject: Data Analysis
Year of publication: 2006


1 Exploratory Data Analysis

This chapter presents the assumptions, principles, and techniques necessary to gain insight into data via EDA (exploratory data analysis).

Dataplot Commands for EDA Techniques


http://www.itl.nist.gov/div898/handbook/eda/eda.htm [5/1/2006 9:56:13 AM]


1 Exploratory Data Analysis - Detailed Table of Contents [1.]


Consequences of Non-Fixed Variation Parameter [1.2.5.3.]


Histogram Interpretation: Skewed (Non-Normal) Right [1.3.3.14.6.]

Normal Probability Plot [1.3.3.21.]

Normal Probability Plot: Normally Distributed Data [1.3.3.21.1.]


Scatter Plot: Variation of Y Does Not Depend on X (homoscedastic) [1.3.3.26.8.]

Two-Sample t-Test for Equal Means [1.3.5.3.]

Data Used for Two-Sample t-Test [1.3.5.3.1.]

Chi-Square Test for the Standard Deviation [1.3.5.8.]

Data Used for Chi-Square Test for the Standard Deviation [1.3.5.8.1.]


Power Lognormal Distribution [1.3.6.6.14.]

Tables for Probability Distributions [1.3.6.7.]

Cumulative Distribution Function of the Standard Normal Distribution [1.3.6.7.1.]

EDA Case Studies [1.4.]

Case Studies Introduction [1.4.1.]


Case Studies [1.4.2.]

Normal Random Numbers [1.4.2.1.]

Background and Data [1.4.2.1.1.]

Uniform Random Numbers [1.4.2.2.]

Background and Data [1.4.2.2.1.]


Josephson Junction Cryothermometry [1.4.2.4.]

Background and Data [1.4.2.4.1.]

Heat Flow Meter 1 [1.4.2.8.]

Background and Data [1.4.2.8.1.]

Airplane Glass Failure Time [1.4.2.9.]

Background and Data [1.4.2.9.1.]


Power Lognormal Analysis [1.4.2.9.7.]


1.1 EDA Introduction

Summary: What is exploratory data analysis? How did it begin? How and where did it originate? How is it differentiated from other data analysis approaches, such as classical and Bayesian? Is EDA the same as statistical graphics? What role does statistical graphics play in EDA? Is statistical graphics identical to EDA?

This section answers these and related questions and provides the necessary frame of reference for EDA assumptions, principles, and techniques.


1.1.1 What is EDA?

Approach: Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set.

Focus: The EDA approach is precisely that: an approach. It is not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out.

Philosophy: EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue; EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favor of the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques; EDA is a philosophy as to how we dissect a data set, what we look for, how we look, and how we interpret. It is true that EDA heavily uses the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics per se.


History: The seminal work in EDA is Exploratory Data Analysis, Tukey (1977). Over the years it has benefited from other noteworthy publications such as Data Analysis and Regression, Mosteller and Tukey (1977); Interactive Data Analysis, Hoaglin (1977); and The ABC's of EDA, Velleman and Hoaglin (1981), and has gained a large following as "the" way to analyze a data set.

Techniques: Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the heavy reliance on graphics is that by its very nature the main role of EDA is to open-mindedly explore, and graphics gives the analyst unparalleled power to do so, enticing the data to reveal its structural secrets and always being ready to yield some new, often unsuspected, insight into the data. In combination with the natural pattern-recognition capabilities that we all possess, graphics provides unparalleled power to carry this out.

The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).


1.1.2 How Does Exploratory Data Analysis differ from Classical Data Analysis?

For classical analysis, the sequence is

Problem => Data => Model => Analysis => Conclusions

For EDA, the sequence is

Problem => Data => Analysis => Model => Conclusions

For Bayesian, the sequence is

Problem => Data => Model => Prior Distribution => Analysis => Conclusions


In the real world, data analysts freely mix elements of all of the above three approaches (and other approaches). The above distinctions were made to emphasize the major differences among the three approaches.


1.1.2.1 Model

Classical: The classical approach imposes models (both deterministic and probabilistic) on the data. Deterministic models include, for example, regression models and analysis of variance (ANOVA) models. The most common probabilistic model assumes that the errors about the deterministic model are normally distributed; this assumption affects the validity of the ANOVA F tests.

Exploratory: The Exploratory Data Analysis approach does not impose deterministic or probabilistic models on the data. On the contrary, the EDA approach allows the data to suggest admissible models that best fit the data.


1.1.2.2 Focus

Classical: The two approaches differ substantially in focus. For classical analysis, the focus is on the model: estimating parameters of the model and generating predicted values from the model.

Exploratory: For exploratory data analysis, the focus is on the data: its structure, outliers, and models suggested by the data.


1.1.2.3 Techniques

Classical: Classical techniques are generally quantitative in nature. They include ANOVA, t tests, chi-squared tests, and F tests.

Exploratory: EDA techniques are generally graphical. They include scatter plots, character plots, box plots, histograms, bihistograms, probability plots, residual plots, and mean plots.


1.1.2.4 Rigor

Classical: Classical techniques serve as the probabilistic foundation of science and engineering; the most important characteristic of classical techniques is that they are rigorous, formal, and "objective".

Exploratory: EDA techniques do not share in that rigor or formality. EDA techniques make up for that lack of rigor by being very suggestive, indicative, and insightful about what the appropriate model should be.

EDA techniques are subjective and depend on interpretation, which may differ from analyst to analyst, although experienced analysts commonly arrive at identical conclusions.


1.1.2.5 Data Treatment

Classical: Classical estimation techniques have the characteristic of taking all of the data and mapping the data into a few numbers ("estimates"). This is both a virtue and a vice. The virtue is that these few numbers focus on important characteristics (location, variation, etc.) of the population. The vice is that concentrating on these few characteristics can filter out other characteristics (skewness, tail length, autocorrelation, etc.) of the same population. In this sense there is a loss of information due to this "filtering" process.

Exploratory: The EDA approach, on the other hand, often makes use of (and shows) all of the available data. In this sense there is no corresponding loss of information.
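The "filtering" effect described above can be illustrated with a small sketch. The sample values and the helper name `skewness` below are hypothetical, chosen only for illustration and not taken from the handbook. Reflecting a sample about its own mean leaves the location and variation estimates unchanged while reversing the direction of the tail, so the two estimates alone cannot tell the samples apart.

```python
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: mean cubed deviation divided by sigma cubed."""
    m = mean(values)
    s = pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


# Hypothetical right-tailed sample (illustrative values, not from the handbook).
right_tailed = [1.0, 1.0, 2.0, 2.0, 3.0, 9.0]

# Reflect every value about the mean: same location and variation, opposite tail.
center = mean(right_tailed)
left_tailed = [2 * center - v for v in right_tailed]

for name, data in [("right-tailed", right_tailed), ("left-tailed", left_tailed)]:
    print(f"{name}: mean={mean(data):.3f} sd={pstdev(data):.3f} "
          f"skewness={skewness(data):+.3f}")
```

The mean and standard deviation agree for both samples; only the skewness, a characteristic the two estimates filter out, separates them.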


1.1.2.6 Assumptions

Classical: The "good news" of the classical approach is that tests based on classical techniques are usually very sensitive; that is, if a true shift in location, say, has occurred, such tests frequently have the power to detect such a shift and to conclude that such a shift is "statistically significant". The "bad news" is that classical tests depend on underlying assumptions (e.g., normality), and hence the validity of the test conclusions becomes dependent on the validity of the underlying assumptions. Worse yet, the exact underlying assumptions may be unknown to the analyst, or if known, untested. Thus the validity of the scientific conclusions becomes intrinsically linked to the validity of the underlying assumptions. In practice, if such assumptions are unknown or untested, the validity of the scientific conclusions becomes suspect.

Exploratory: Many EDA techniques make few or no assumptions: they present and show the data, all of the data, as is, with fewer encumbering assumptions.


1.1.3 How Does Exploratory Data Analysis Differ from Summary Analysis?

Summary: A summary analysis is simply a numeric reduction of a historical data set. It is quite passive. Its focus is in the past. Quite commonly, its purpose is simply to arrive at a few key statistics (for example, mean and standard deviation) which may then either replace the data set or be added to the data set in the form of a summary table.

Exploratory: In contrast, EDA has as its broadest goal the desire to gain insight into the engineering/scientific process behind the data. Whereas summary statistics are passive and historical, EDA is active and futuristic. In an attempt to "understand" the process and improve it in the future, EDA uses the data as a "window" to peer into the heart of the process that generated the data. There is an archival role in the research and manufacturing world for summary statistics, but there is an enormously larger role for the EDA approach.


1.1.4 What are the EDA Goals?

a good-fitting, parsimonious model

essence of the data. Graphics are irreplaceable: there are no quantitative analogues that will give the same insight as well-chosen graphics.

To get a "feel" for the data, it is not enough for the analyst to know what is in the data; the analyst also must know what is not in the data, and the only way to do that is to draw on our own human pattern-recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data.


1.1.5 The Role of Graphics

Quantitative: Quantitative techniques are the set of statistical procedures that yield numeric or tabular output. Examples of quantitative techniques include:

● hypothesis testing

● analysis of variance

● point estimates and confidence intervals

● least squares regression

These and similar techniques are all valuable and are mainstream in terms of classical analysis.

Graphical: On the other hand, there is a large collection of statistical tools that we generally refer to as graphical techniques. These include:

● scatter plots

● histograms

● probability plots

● residual plots

● box plots

● block plots
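The contrast between numeric output and graphical output can be sketched in a few lines. The following is a minimal illustration, not a handbook procedure: the values are the Y3 column of the Anscombe table in section 1.1.6, and `text_histogram` is a hypothetical helper that stands in for a real plotting tool by drawing a crude histogram in plain text.

```python
from statistics import mean, stdev

# Y3 column from the Anscombe table in section 1.1.6 of this chapter.
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def text_histogram(values, bin_width=1.0):
    """Count values per equal-width bin; keys are bin lower edges (hypothetical helper)."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


# Quantitative output: two numbers summarize the whole column.
print(f"mean = {mean(y3):.2f}, standard deviation = {stdev(y3):.2f}")

# Graphical output: one row of marks per bin, empty bins included,
# so an isolated value shows up as a mark separated by a gap.
bin_width = 1.0
hist = text_histogram(y3, bin_width)
edge = min(hist)
while edge <= max(hist):
    print(f"{edge:5.1f} | {'*' * hist.get(edge, 0)}")
    edge += bin_width
```

The two summary numbers give no hint that one observation sits far from the rest; the row of marks makes the gap visible at a glance.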


1.1.6 An EDA/Graphics Example

Summary Statistics: If the goal of the analysis is to compute summary statistics plus determine the best linear fit for Y as a function of X, the results might be given as:

N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816

The above quantitative analysis, although valuable, gives us only limited insight into the data.


Scatter Plot: In contrast, the following simple scatter plot of the data suggests the following:

The data set "behaves like" a linear curve with some scatter;

the vertical spread of the data appears to be of equal height irrespective of the X-value; this indicates that the data are equally precise throughout and so a "regular" (that is, equi-weighted) fit is appropriate.

[Anscombe data sets 2, 3, and 4]:

   X2     Y2      X3     Y3      X4     Y4
10.00   9.14   10.00   7.46    8.00   6.58
 8.00   8.14    8.00   6.77    8.00   5.76
13.00   8.74   13.00  12.74    8.00   7.71
 9.00   8.77    9.00   7.11    8.00   8.84
11.00   9.26   11.00   7.81    8.00   8.47
14.00   8.10   14.00   8.84    8.00   7.04
 6.00   6.13    6.00   6.08    8.00   5.25
 4.00   3.10    4.00   5.39   19.00  12.50
12.00   9.13   12.00   8.15    8.00   5.56
 7.00   7.26    7.00   6.42    8.00   7.91
 5.00   4.74    5.00   5.73    8.00   6.89

Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816

which is identical to the analysis for data set 1. One might naively assume that the two data sets are "equivalent" since that is what the statistics tell us; but what do the statistics not tell us?

Intercept = 3
Slope = 0.5
Residual standard deviation = 1.236
Correlation = 0.816 (0.817 for data set 4)

which implies that in some quantitative sense, all four of the data sets are "equivalent". In fact, the four data sets are far from "equivalent", and a scatter plot of each data set, which would be step 1 of any EDA approach, would tell us that immediately.
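The near-identical summary statistics can be checked directly from the tabulated values. The sketch below is one way to do so, assuming an ordinary least squares fit of Y on X with the residual standard deviation computed on N - 2 degrees of freedom; the function name `summarize` and the dictionary layout are illustrative choices, not part of the handbook. Data set 1 is not tabulated in this excerpt and is therefore omitted.

```python
from math import sqrt

# Columns transcribed from the table of Anscombe data sets 2, 3, and 4.
x23 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # X2 and X3 are the same column
data = {
    2: (x23, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x23, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Summary statistics plus the least squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "resid_sd": sqrt(sse / (n - 2)),  # assumes n - 2 degrees of freedom
        "corr": sxy / sqrt(sxx * syy),
    }


results = {k: summarize(x, y) for k, (x, y) in data.items()}
for k, r in results.items():
    print(f"data set {k}: " + ", ".join(f"{name}={value:.3f}" for name, value in r.items()))
```

The printed rows agree to about two or three decimal places even though the three columns of Y values look nothing alike, which is the point the text makes about relying on the numbers alone.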


Date posted: 21/06/2014