
Exploratory Data Analysis_2 pdf


DOCUMENT INFORMATION

Basic information

Title: Exploratory Data Analysis
Institution: National Institute of Standards and Technology
Subject: Data Analysis
Type: Essay
Year of publication: 2006
City: Gaithersburg
Number of pages: 42
File size: 2.91 MB


Content


1 Exploratory Data Analysis

promotes insight into important aspects of the process that may not have surfaced otherwise

no more powerful catalyst for discovery than the bringing together of an experienced/expert scientist/engineer and a data set ripe with intriguing "anomalies" and characteristics


If the randomness assumption does not hold, then

All of the usual statistical tests are invalid.

One specific and common type of non-randomness is autocorrelation. Autocorrelation is the correlation between Y_t and Y_{t-k}, where k is an integer that defines the lag for the autocorrelation. That is, autocorrelation is a time-dependent non-randomness. This means that the value of the current point is highly dependent on the previous point if k = 1 (or on the point k points ago if k is not 1). Autocorrelation is typically detected via an autocorrelation plot or a lag plot.
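The pairing of each value with the value k points earlier can be sketched in a few lines. The following is a minimal illustration, not part of the handbook: the helper names `lag_pairs` and `pearson` and the two toy series are assumptions chosen for the example, and k is assumed to be at least 1.

```python
from math import sqrt

def lag_pairs(y, k=1):
    """Return the (Y[t-k], Y[t]) pairs that a lag-k plot would display (k >= 1)."""
    return list(zip(y[:-k], y[k:]))

def pearson(pairs):
    """Plain Pearson correlation of a list of (x, v) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    mv = sum(v for _, v in pairs) / n
    sxv = sum((x - mx) * (v - mv) for x, v in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    svv = sum((v - mv) ** 2 for _, v in pairs)
    return sxv / sqrt(sxx * svv)

# Two hypothetical series: a steady trend and a strictly alternating signal.
trend = [float(t) for t in range(20)]
alternating = [(-1.0) ** t for t in range(20)]

r_trend = pearson(lag_pairs(trend, k=1))        # each point tracks the previous one
r_alt = pearson(lag_pairs(alternating, k=1))    # each point is the negative of the previous one
```

Both toy series are completely non-random, which is why the lag-1 correlations sit at the extremes; random data would give a value near zero.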

If the data are not random due to autocorrelation, then

1. Adjacent data values may be related.
2. There may not be n independent snapshots of the phenomenon under study.

1.2.5.1 Consequences of Non-Randomness


There may be undetected "junk"-outliers.


The usual formula for the uncertainty of the mean:

$$s_{\bar{Y}} = \frac{s}{\sqrt{N}}$$

may be invalid and the numerical value optimistically small.


Case Studies: The airplane glass failure case study gives an example of determining an appropriate distribution and estimating the parameters of that distribution. The uniform random numbers case study gives an example of determining a more appropriate centrality parameter for a non-normal distribution.

Other consequences that flow from problems with distributional assumptions are:

Distribution
1. The distribution may be changing.
2. The single distribution estimate may be meaningless (if the process distribution is changing).

Model
1. The model may be changing.
2. The single model estimate may be meaningless.

Process
1. The process may be out-of-control.
2. The process may be unpredictable.


1.3 EDA Techniques

Summary: After you have collected a set of data, how do you do an exploratory data analysis? What techniques do you employ? What do the various techniques focus on? What conclusions can you expect to reach?

This section provides answers to these kinds of questions via a gallery of EDA techniques and a detailed description of each technique. The techniques are divided into graphical and quantitative techniques. For exploratory data analysis, the emphasis is primarily on the graphical techniques.


we have divided the descriptions into graphical and quantitative techniques. This is for organizational clarity and is not meant to discourage the use of both graphical and quantitative techniques when analyzing data.

Availability in Software: The sample plots and output in this section were generated with the Dataplot software program. Other general purpose statistical data analysis programs can generate most of the plots, intervals, and tests discussed here, or macros can be written to achieve the same result.

1.3.1 Introduction

http://www.itl.nist.gov/div898/handbook/eda/section3/eda31.htm [5/1/2006 9:56:27 AM]


question-specific EDA techniques

1.3.2 Analysis Questions


to address these problems.


http://www.itl.nist.gov/div898/handbook/eda/section3/eda32.htm (2 of 2) [5/1/2006 9:56:27 AM]


1.3.3 Graphical Techniques: Alphabetic

This section provides a gallery of some useful graphical techniques. The techniques are ordered alphabetically, so this section is not intended to be read in a sequential fashion. The use of most of these graphical techniques is demonstrated in the case studies in this chapter. A few of these graphical techniques are demonstrated in later chapters.

1.3.3.6
Box Plot: 1.3.3.7
Complex Demodulation Amplitude Plot: 1.3.3.8
Quantile-Quantile Plot: 1.3.3.24
Star Plot: 1.3.3.29
Weibull Plot:


be near zero for any and all time-lag separations. If non-random, then one or more of the autocorrelations will be significantly non-zero.

In addition, autocorrelation plots are used in the model identification stage for Box-Jenkins autoregressive, moving average time series models.

1.3.3.1 Autocorrelation Plot

http://www.itl.nist.gov/div898/handbook/eda/section3/eda331.htm (1 of 5) [5/1/2006 9:56:30 AM]


r(h) versus h

Autocorrelation plots are formed by

Vertical axis: Autocorrelation coefficient

$$R_h = \frac{C_h}{C_0}$$

where C_h is the autocovariance function

$$C_h = \frac{1}{N} \sum_{t=1}^{N-h} (Y_t - \bar{Y})(Y_{t+h} - \bar{Y})$$

and C_0 is the variance function

$$C_0 = \frac{1}{N} \sum_{t=1}^{N} (Y_t - \bar{Y})^2$$

Note: R_h is between -1 and +1.

Note: Some sources may use the following formula for the autocovariance function

$$C_h = \frac{1}{N-h} \sum_{t=1}^{N-h} (Y_t - \bar{Y})(Y_{t+h} - \bar{Y})$$

Although this definition has less bias, the (1/N) formulation has some desirable statistical properties and is the form most commonly used in the statistics literature. See pages 20 and 49-50 in Chatfield for details.
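The vertical-axis quantity can be computed directly from its definition. The sketch below assumes the (1/N) formulation of the autocovariance, that is, the sum of (Y_t - mean)(Y_{t+h} - mean) over t = 1 to N-h divided by N, and takes the autocorrelation coefficient as C_h / C_0. The function names and the five-point series are illustrative only.

```python
def autocovariance(y, h):
    """C_h using the (1/N) normalisation, for a lag h with 0 <= h < len(y)."""
    n = len(y)
    mean = sum(y) / n
    return sum((y[t] - mean) * (y[t + h] - mean) for t in range(n - h)) / n

def autocorrelation(y, h):
    """R_h = C_h / C_0, which always lies between -1 and +1."""
    return autocovariance(y, h) / autocovariance(y, 0)

# A tiny hypothetical series; an autocorrelation plot would draw r[h] against h.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
r = [autocorrelation(y, h) for h in range(3)]
```

For this series the coefficients at lags 0, 1 and 2 work out to 1, 0.4 and -0.1. Dividing by N rather than N-h is what keeps every coefficient inside the -1 to +1 range.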

If the autocorrelation plot is being used to test for randomness (i.e., there is no time dependence in the data), the following formula is recommended:

$$\pm \frac{z_{1-\alpha/2}}{\sqrt{N}}$$

where N is the sample size, z is the percent point function of the standard normal distribution and α is the significance level. In this case, the confidence bands have fixed width that depends on the sample size. This is the formula that was used to generate the confidence bands in the above plot.
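As a small worked example, the fixed-width band can be evaluated with the standard library. This sketch assumes the band is plus or minus z(1 - α/2) divided by the square root of N, with z taken from the standard normal percent point function. The function name `randomness_band` and the sample size of 100 are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def randomness_band(n, alpha=0.05):
    """Half-width of the confidence band used when testing for randomness."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # percent point function of N(0, 1)
    return z / sqrt(n)

band_95 = randomness_band(100, alpha=0.05)  # roughly 0.196
band_99 = randomness_band(100, alpha=0.01)  # roughly 0.258
```

Because the half-width depends only on N and α, the same two horizontal lines apply at every lag.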


Autocorrelation plots are also used in the model identification stage for fitting ARIMA models. In this case, a moving average model is assumed for the data and the following confidence bands should be generated:

$$\pm z_{1-\alpha/2} \sqrt{\frac{1}{N}\left(1 + 2\sum_{i=1}^{k} R_i^2\right)}$$

where k is the lag, R_i is the autocorrelation coefficient at lag i, N is the sample size, z is the percent point function of the standard normal distribution and α is the significance level. In this case, the confidence bands increase as the lag increases.
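The widening of these bands with lag can be sketched as follows. The code assumes the large-lag form in which the half-width at lag k is z(1 - α/2) times the square root of (1 + 2 times the sum of squared autocorrelations at lags 1 through k) divided by N. That form, the function name `ma_bands`, and the example coefficients are assumptions made for illustration rather than values taken from the handbook.

```python
from math import sqrt
from statistics import NormalDist

def ma_bands(r, n, alpha=0.05):
    """Half-widths for lags 1..len(r), where r[i-1] is the autocorrelation at lag i."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    bands = []
    running = 0.0  # cumulative sum of squared autocorrelations up to the current lag
    for r_i in r:
        running += r_i ** 2
        bands.append(z * sqrt((1 + 2 * running) / n))
    return bands

# Hypothetical autocorrelations at lags 1..3 for a series of 100 points.
bands = ma_bands([0.5, 0.2, 0.0], n=100)
```

Each additional non-zero autocorrelation adds to the running sum, so the half-widths can only stay level or grow as the lag increases.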


1. Most standard statistical tests depend on randomness. The validity of the test conclusions is directly linked to the validity of the randomness assumption.

2. Many commonly-used statistical formulae depend on the randomness assumption, the most common formula being the formula for determining the standard deviation of the sample mean:

$$s_{\bar{Y}} = \frac{s}{\sqrt{N}}$$

where s is the standard deviation of the data. Although heavily used, the results from using this formula are of no value unless the randomness assumption holds.
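A short simulation suggests how badly the formula can mislead when the data are autocorrelated. The sketch below is an illustration under stated assumptions: it takes the usual formula to be s divided by the square root of N, generates data from a hypothetical first-order autoregressive model with coefficient 0.9, and uses an arbitrary seed, series length and replication count.

```python
import random
from math import sqrt
from statistics import mean, stdev

def ar1_series(n, phi, rng):
    """Simulate Y_t = phi * Y_{t-1} + e_t with standard normal e_t, starting from 0."""
    y, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        y.append(prev)
    return y

rng = random.Random(0)  # fixed seed so the run is repeatable
n, reps = 100, 300
sample_means, nominal = [], []
for _ in range(reps):
    y = ar1_series(n, phi=0.9, rng=rng)
    sample_means.append(mean(y))
    nominal.append(stdev(y) / sqrt(n))  # what s / sqrt(N) reports for this series

actual_sd_of_mean = stdev(sample_means)  # how much the mean really varies across series
typical_nominal = mean(nominal)          # what the formula claims on average
```

Under these assumptions the spread of the sample means is several times larger than the value reported by the formula, which is the sense in which the formula is of no value without randomness.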

3

In short, if the analyst does not check for randomness, then the validity of many of the statistical conclusions becomes suspect. The autocorrelation plot is an excellent way of checking for such randomness.

Examples: Examples of the autocorrelation plot for several common situations are given in the following pages.

Random (= White Noise)

Case Study: The autocorrelation plot is demonstrated in the beam deflection data case study.

Software: Autocorrelation plots are available in most general purpose statistical software programs including Dataplot.

http://www.itl.nist.gov/div898/handbook/eda/section3/eda331.htm (5 of 5) [5/1/2006 9:56:30 AM]


The following is a sample autocorrelation plot.

Conclusions: We can make the following conclusions from this plot.

There are no significant autocorrelations.


Discussion: Note that with the exception of lag 0, which is always 1 by definition, almost all of the autocorrelations fall within the 95% confidence limits. In addition, there is no apparent pattern (such as the first twenty-five being positive and the second twenty-five being negative). This is the absence of a pattern we expect to see if the data are in fact random.

A few lags slightly outside the 95% and 99% confidence limits do not necessarily indicate non-randomness. For a 95% confidence interval, we might expect about one out of twenty lags to be statistically significant due to random fluctuations.
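The "one out of twenty" remark is simple arithmetic, sketched below. The calculation treats the lags as independent, which is only an approximation, and the choice of 50 lags is a hypothetical plot length rather than a figure stated in the text.

```python
def expected_exceedances(n_lags, confidence=0.95):
    """Expected number of lags outside the limits when the data are random."""
    return n_lags * (1 - confidence)

def prob_at_least_one(n_lags, confidence=0.95):
    """Chance that at least one lag falls outside, treating lags as independent."""
    return 1 - confidence ** n_lags

expected = expected_exceedances(50)  # about 2.5 lags out of 50
p_any = prob_at_least_one(50)        # about 0.92
```

So with a plot of this length, seeing two or three lags just outside the 95% limits is the expected outcome for random data.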

There is no associative ability to infer from a current value Y_i as to what the next value Y_{i+1} will be. Such non-association is the essence of randomness. In short, adjacent observations do not "co-relate", so we call this the "no autocorrelation" case.

1.3.3.1.1 Autocorrelation Plot: Random Data

http://www.itl.nist.gov/div898/handbook/eda/section3/eda3311.htm (2 of 2) [5/1/2006 9:56:30 AM]


The following is a sample autocorrelation plot.

Conclusions: We can make the following conclusions from this plot.

1. The data come from an underlying autoregressive model with moderate positive autocorrelation.

Discussion: The plot starts with a moderately high autocorrelation at lag 1 (approximately 0.75) that gradually decreases. The decreasing autocorrelation is generally linear, but with significant noise. Such a pattern is the autocorrelation plot signature of "moderate autocorrelation", which in turn provides moderate predictability if modeled properly.

1.3.3.1.2 Autocorrelation Plot: Moderate Autocorrelation


randomness, the residuals after fitting Y_i against Y_{i-1} should result in random residuals. Assessing whether or not the proposed model in fact sufficiently removed the non-randomness is discussed in detail in the Process Modeling chapter.

The residual standard deviation for this autoregressive model will be much smaller than the residual standard deviation for the default model.
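The comparison of residual standard deviations can be sketched with ordinary least squares. Several things here are assumptions for illustration: the data are simulated from a hypothetical autoregressive process with coefficient 0.75, the "default model" is taken to be a constant plus error, and the helper names are invented for the example.

```python
import random
from math import sqrt

def fit_line(x, y):
    """Ordinary least squares for y = a0 + a1 * x; returns (a0, a1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    return my - a1 * mx, a1

def residual_sd(residuals, n_params):
    """Residual standard deviation after fitting n_params parameters."""
    return sqrt(sum(e * e for e in residuals) / (len(residuals) - n_params))

# Simulated autocorrelated data (illustrative only, fixed seed for repeatability).
rng = random.Random(1)
y, prev = [], 0.0
for _ in range(200):
    prev = 0.75 * prev + rng.gauss(0.0, 1.0)
    y.append(prev)

# Assumed default model: Y_i = A0 + E_i, so the residuals are deviations from the mean.
mean_y = sum(y) / len(y)
sd_default = residual_sd([yi - mean_y for yi in y], n_params=1)

# Autoregressive model: fit Y_i against Y_{i-1}.
x_lag, y_now = y[:-1], y[1:]
a0, a1 = fit_line(x_lag, y_now)
sd_ar = residual_sd([yi - (a0 + a1 * xi) for xi, yi in zip(x_lag, y_now)], n_params=2)
```

A natural follow-up is to compute the lag-1 autocorrelation of the autoregressive residuals, which should be close to zero if the model has captured the dependence.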


http://www.itl.nist.gov/div898/handbook/eda/section3/eda3312.htm (2 of 2) [5/1/2006 9:56:30 AM]


Autocorrelation Plot for Strong Autocorrelation: The following is a sample autocorrelation plot.

Conclusions: We can make the following conclusions from the above plot.

1. The data come from an underlying autoregressive model with strong positive autocorrelation.

1.3.3.1.3 Autocorrelation Plot: Strong Autocorrelation and Autoregressive Model


Discussion: The plot starts with a high autocorrelation at lag 1 (only slightly less than 1) that slowly declines. It continues decreasing until it becomes negative and starts showing an increasing negative autocorrelation. The decreasing autocorrelation is generally linear with little noise. Such a pattern is the autocorrelation plot signature of "strong autocorrelation", which in turn provides high predictability if modeled properly.

randomness, the residuals after fitting Y_i against Y_{i-1} should result in random residuals. Assessing whether or not the proposed model in fact sufficiently removed the non-randomness is discussed in detail in the Process Modeling chapter.

The residual standard deviation for this autoregressive model will be much smaller than the residual standard deviation for the default model.


http://www.itl.nist.gov/div898/handbook/eda/section3/eda3313.htm (2 of 2) [5/1/2006 9:56:31 AM]


The following is a sample autocorrelation plot.

Conclusions: We can make the following conclusions from the above plot.

1. The data come from an underlying sinusoidal model.

Discussion: The plot exhibits an alternating sequence of positive and negative spikes. These spikes are not decaying to zero. Such a pattern is the autocorrelation plot signature of a sinusoidal model.


1.3.3.1.4 Autocorrelation Plot: Sinusoidal Model

http://www.itl.nist.gov/div898/handbook/eda/section3/eda3314.htm (2 of 2) [5/1/2006 9:56:31 AM]


It is also based on the common and well-understood histogram


factor has a significant effect on the location (typical value) for strength and hence batch is said to be "significant" or to "have an effect". We thus see graphically and convincingly what a t-test or analysis of variance would indicate quantitatively.

With respect to variation, note that the spread (variation) of the above-axis batch 1 histogram does not appear to be that much different from the below-axis batch 2 histogram. With respect to distributional shape, note that the batch 1 histogram is skewed left while the batch 2 histogram is more symmetric with even a hint of a slight skewness to the right.

Thus the bihistogram reveals that there is a clear difference between the batches with respect to location and distribution, but not in regard to variation. Comparing batch 1 and batch 2, we also note that batch 1 is the "better batch" due to its 100-unit higher average strength (around 725).

Definition: Two adjoined histograms

Bihistograms are formed by vertically juxtaposing two histograms:

Above the axis: Histogram of the response variable for condition 1

Below the axis: Histogram of the response variable for condition 2
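The construction above can be sketched as two sets of bin counts over shared bin edges, with the second set negated so that it would be drawn below the axis. The function name `bihistogram`, the equal-width binning rule, and the two small data sets are assumptions for illustration. The sketch also assumes the pooled data are not all identical, so that the bin width is non-zero.

```python
def bihistogram(cond1, cond2, n_bins=5):
    """Return (edges, above, below) with shared equal-width bins for both conditions."""
    lo = min(min(cond1), min(cond2))
    hi = max(max(cond1), max(cond2))
    width = (hi - lo) / n_bins

    def counts(values):
        out = [0] * n_bins
        for v in values:
            # The overall maximum lands exactly on the last edge, so clamp it into the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            out[idx] += 1
        return out

    edges = [lo + i * width for i in range(n_bins + 1)]
    above = counts(cond1)               # drawn upward from the axis
    below = [-c for c in counts(cond2)]  # drawn downward from the axis
    return edges, above, below

# Hypothetical measurements for two conditions.
batch1 = [1.0, 2.0, 2.5, 3.0, 3.5]
batch2 = [3.0, 4.0, 4.5, 5.0]
edges, above, below = bihistogram(batch1, batch2, n_bins=4)
```

Sharing the bin edges between the two conditions is the design choice that makes shifts in location, spread and shape directly comparable across the axis.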

Questions: The bihistogram can provide answers to the following questions:

Is a (2-level) factor significant?

The bihistogram is an important EDA tool for determining if a factor "has an effect". Since the bihistogram provides insight into the validity of three (location, variation, and distribution) out of the four (missing only randomness) underlying assumptions in a measurement process, it is an especially valuable tool. Because of the dual (above/below) nature of the plot, the bihistogram is restricted to assessing factors that have only two levels. However, this is very common in the before-versus-after character of many scientific and engineering experiments.

1.3.3.2 Bihistogram

http://www.itl.nist.gov/div898/handbook/eda/section3/eda332.htm (2 of 3) [5/1/2006 9:56:31 AM]


Techniques

t test (for shift in location)

F test (for shift in variation)

Kolmogorov-Smirnov test (for shift in distribution)

Quantile-quantile plot (for shift in location and distribution)

Case Study: The bihistogram is demonstrated in the ceramic strength data case study.

Software: The bihistogram is not widely available in general purpose statistical software programs. Bihistograms can be generated using Dataplot.


It replaces the analysis of variance test with a less assumption-dependent binomial test and should be routinely used whenever we are trying to robustly decide whether a primary factor has

This block plot reveals that in 10 of the 12 cases (bars), weld method 2 is lower (better) than weld method 1. From a binomial point of view, weld method is statistically significant.
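The binomial reasoning behind the 10-of-12 result can be worked out directly. The sketch below assumes the null hypothesis that either weld method is equally likely to come out lower in any one case (probability 0.5) and that the 12 cases are independent. The function name is illustrative.

```python
from math import comb

def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

p_one_sided = binomial_upper_tail(10, 12)  # 79 / 4096, about 0.019
p_two_sided = 2 * p_one_sided              # about 0.039
```

Under these assumptions both the one-sided and two-sided probabilities fall below 0.05, which is consistent with calling the weld method significant at the usual 5% level.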

1.3.3.3 Block Plot

http://www.itl.nist.gov/div898/handbook/eda/section3/eda333.htm (1 of 4) [5/1/2006 9:56:32 AM]


Definition: Block Plots are formed as follows:

Vertical axis: Response variable Y

called "general conclusions". If we find that one weld method setting does better (smaller average defects per hour) than the other weld method setting for all or most of these 12 nuisance factor combinations, then the conclusion is in fact general and robust.

Ordering along the horizontal axis: In the above chart, the ordering along the horizontal axis is as follows:

The left 6 bars are from plant 1 and the right 6 bars are from plant 2.

The first 3 bars are from speed 1, the next 3 bars are from speed 2, the next 3 bars are from speed 1, and the last 3 bars are from speed 2.

Bars 1, 4, 7, and 10 are from the first shift, bars 2, 5, 8, and 11 are from the second shift, and bars 3, 6, 9, and 12 are from the third shift.
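The ordering described above amounts to a nesting of shift within speed within plant, which can be written as integer arithmetic on the bar position. The function name `bar_factors` and the 1-based numbering of levels are choices made for this sketch.

```python
def bar_factors(bar):
    """Map a 1-based bar position (1..12) to its (plant, speed, shift) levels."""
    i = bar - 1
    plant = i // 6 + 1        # bars 1-6 are plant 1, bars 7-12 are plant 2
    speed = (i // 3) % 2 + 1  # groups of 3 bars alternate speed 1, speed 2
    shift = i % 3 + 1         # shift cycles 1, 2, 3 within each group of 3
    return plant, speed, shift

layout = {bar: bar_factors(bar) for bar in range(1, 13)}
first_shift_bars = [bar for bar, (_, _, shift) in layout.items() if shift == 1]
```

For example, bar 8 maps to plant 2, speed 1, second shift, in agreement with the ordering given in the text.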

Date posted: 21/06/2014, 21:20