
Exploratory Data Analysis_17 pptx


DOCUMENT INFORMATION

Basic information

Title: Exploratory Data Analysis
Institution: National Institute of Standards and Technology
Subject: Statistics / Data Analysis
Category: Case Studies
Format:
Pages: 42
File size: 2.89 MB


Content


STATISTIC = NUMBER OF RUNS DOWN

NUMBER OF NEGATIVE DIFFERENCES = 241
NUMBER OF ZERO DIFFERENCES = 0

1.4.2.5.2 Test Underlying Assumptions

http://www.itl.nist.gov/div898/handbook/eda/section4/eda4252.htm (8 of 9) [5/1/2006 9:58:51 AM]


Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant at the 5% level. Numerous values in this column are much larger than +/-1.96, so we conclude that the data are not random.

Distributional Assumptions

Since the quantitative tests show that the assumptions of constant scale and randomness are not met, the distributional measures will not be meaningful. Therefore these quantitative tests are omitted.


1 Exploratory Data Analysis

1.4 EDA Case Studies

To obtain a good fit, sinusoidal models require good starting values for C, the amplitude, and the frequency.

Good Starting Value for C

A good starting value for C can be obtained by calculating the mean of the data. If the data show a trend, i.e., the assumption of constant location is violated, we can replace C with a linear or quadratic least squares fit. That is, the model becomes

Y = (B0 + B1*T) + AMP*SIN(2*3.14159*FREQ*T + PHASE)

or

Y = (B0 + B1*T + B2*T**2) + AMP*SIN(2*3.14159*FREQ*T + PHASE)

Since our data did not have any meaningful change of location, we can fit the simpler model with C equal to the mean. From the summary output on the previous page, the mean is -177.44.


We could generate the demodulation phase plot for 0.3 and then use trial and error to obtain a better estimate for the frequency. To simplify this, we generate 16 of these plots on a single page, starting with a frequency of 0.28, increasing in increments of 0.0025, and stopping at 0.3175.

Interpretation

The plots start with lines sloping from left to right but gradually change to a right to left slope. The relatively flat slope occurs for frequency 0.3025 (third row, second column). The complex demodulation phase plot restricts the phase to a fixed range, which is why the plot appears to show some breaks.
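The frequency scan described above can be sketched in a few lines. This is only an illustration under stated assumptions: the series is synthetic (a pure sinusoid with frequency 0.3025, standing in for the case-study data, which is not listed here), the low-pass filter is a plain moving average, and `demod_phase_drift` is a hypothetical helper name. Instead of plotting, it reports the average phase change per sample, which plays the role of the slope in each phase plot.

```python
import cmath
import math

def demod_phase_drift(y, freq, window=21):
    """Average phase change per sample after demodulating y at `freq`."""
    # Shift the component at `freq` down to zero frequency.
    shifted = [v * cmath.exp(-2j * math.pi * freq * t) for t, v in enumerate(y)]
    # Moving average as a simple low-pass filter (an assumed choice).
    smooth = [sum(shifted[i:i + window]) / window
              for i in range(len(shifted) - window + 1)]
    # Phase increment between neighbouring points; no explicit unwrapping needed.
    steps = [cmath.phase(b * a.conjugate()) for a, b in zip(smooth, smooth[1:])]
    return sum(steps) / len(steps)

# Synthetic stand-in for the data (assumption): true frequency 0.3025.
TRUE_FREQ = 0.3025
y = [-177.44 + 390 * math.sin(2 * math.pi * TRUE_FREQ * t + 1.5)
     for t in range(200)]

# The 16 trial frequencies: 0.28, 0.2825, ..., 0.3175.
trial_freqs = [round(0.28 + 0.0025 * k, 4) for k in range(16)]
drifts = {f: demod_phase_drift(y, f) for f in trial_freqs}

# The flattest phase line marks the best starting frequency.
best = min(trial_freqs, key=lambda f: abs(drifts[f]))
print(best)
```

In this sketch the drift is positive below the true frequency and negative above it, which mirrors the change in slope direction across the 16 plots.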

That is, we replace the amplitude with a function of time. A linear fit is specified in the model above, but this can be replaced with a more elaborate function if needed.

1.4.2.5.3 Develop a Better Model


Demodulation Amplitude Plot

The complex demodulation amplitude plot for this data shows that:

The amplitude is fixed at approximately 390.

Fit Output

Using starting estimates of 0.3025 for the frequency, 390 for the amplitude, and -177.44 for C, Dataplot generated the following output for the fit.

LEAST SQUARES NON-LINEAR FIT

SAMPLE SIZE N = 200

MODEL Y =C + AMP*SIN(2*3.14159*FREQ*T + PHASE)

NO REPLICATION CASE

ITERATION  CONVERGENCE  RESIDUAL     *  PARAMETER
NUMBER     MEASURE      STANDARD     *  ESTIMATES
4          0.96108E-01  0.15585E+03  *  -0.17879E+03  -0.36177E+03


http://www.itl.nist.gov/div898/handbook/eda/section4/eda4253.htm (3 of 4) [5/1/2006 9:58:52 AM]


2  AMP    -361.766   ( 26.19 )      -13.81
3  FREQ   0.302596   (0.1510E-03)   2005
4  PHASE  1.46536    (0.4909E-01)   29.85

RESIDUAL STANDARD DEVIATION = 155.8484

RESIDUAL DEGREES OF FREEDOM = 196

Model

From the fit output, our proposed model is:

Y = -178.79 - 361.766*SIN(2*3.14159*0.302596*T + 1.46536)

We will evaluate the adequacy of this model in the next section.
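As a rough illustration, the fitted model can be evaluated directly from the reported estimates. The sketch below assumes C = -178.79 (read from the first parameter estimate on the iteration line), uses `math.pi` in place of 3.14159, and assumes T runs from 1 to 200; `predict` is a hypothetical helper name. It also recomputes the amplitude t-value as the estimate divided by its approximate standard deviation.

```python
import math

# Estimates taken from the fit output; C is read from the iteration line.
C, AMP, FREQ, PHASE = -178.79, -361.766, 0.302596, 1.46536

def predict(t):
    """Fitted value of the sinusoidal model at time index t."""
    return C + AMP * math.sin(2 * math.pi * FREQ * t + PHASE)

# Assumption: the time index runs 1..200, matching SAMPLE SIZE N = 200.
fitted = [predict(t) for t in range(1, 201)]
print(min(fitted), max(fitted))

# t-value = estimate / approximate standard deviation (amplitude row).
t_amp = AMP / 26.19
print(round(t_amp, 2))
```

The fitted values necessarily stay within C plus or minus the absolute amplitude, which gives a quick sanity check on any transcription of the parameters.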


Interpretation

The assumptions are addressed by the graphics shown above:

The run sequence plot (upper left) indicates that the data do not have any significant shifts in location. There do seem to be some shifts in scale. A start-up effect was detected previously by the complex demodulation amplitude plot. There does appear to

The histogram (lower left) and the normal probability plot (lower right) do not show any serious non-normality in the residuals. However, the bend in the left portion of the normal probability plot shows some cause for concern.

Dataplot generated the following fit output after removing 3 outliers.

LEAST SQUARES NON-LINEAR FIT

SAMPLE SIZE N = 197

MODEL Y =C + AMP*SIN(2*3.14159*FREQ*T + PHASE)

NO REPLICATION CASE

ITERATION CONVERGENCE RESIDUAL * PARAMETER

NUMBER MEASURE STANDARD * ESTIMATES

2  AMP    -361.759   ( 25.45 )      -14.22
3  FREQ   0.302597   (0.1457E-03)   2077.
4  PHASE  1.46533    (0.4715E-01)   31.08

RESIDUAL STANDARD DEVIATION = 148.3398
RESIDUAL DEGREES OF FREEDOM = 193

1.4.2.5.4 Validate New Model

New Fit to Edited Data

The original fit had a residual standard deviation of 155.85. The new fit, after removing the outliers, has a residual standard deviation of 148.34.

There is minimal change in the parameter estimates and about a 5% reduction in the residual standard deviation. In this case, removing the outliers has a modest benefit in terms of reducing the variability of the model.
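The "about 5%" figure can be checked from the two residual standard deviations reported in the fit outputs. A minimal sketch (the variable names are illustrative):

```python
# Residual standard deviations from the two Dataplot fit outputs.
rsd_original = 155.8484   # all 200 points
rsd_edited = 148.3398     # after removing 3 outliers

reduction_pct = 100 * (rsd_original - rsd_edited) / rsd_original
print(f"{reduction_pct:.1f}%")  # prints 4.8%
```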


1.4.2.5.5 Work This Example Yourself

Click on the links below to start Dataplot and run this case study yourself. Each step may use results from previous steps, so please be patient. Wait until the software verifies that the current step is complete before clicking on the next step.

The links in this column will connect you with more detailed information about each analysis step from the case study description.

1. Invoke Dataplot and read data.
   Step 1: Read in the data.

   Result 1: Based on the 4-plot, there are no obvious shifts in location and scale, but the data are not random.
   Step 2: Generate a run sequence plot.
   Result 2: Based on the run sequence plot, there are no obvious shifts in location and scale.
   Step 3: Generate a lag plot.
   Result 3: Based on the lag plot, the data are not random.
   Step 4: Generate an autocorrelation plot.
   Result 4: The autocorrelation plot shows significant autocorrelation at lag 1.
   Step 5: Generate a spectral plot.
   Result 5: The spectral plot shows a single dominant low frequency peak.
   Step 6: Generate a table of summary statistics.
   Result 6: The summary statistics table displays 25+ statistics.
   Step 7: Generate a linear fit to detect drift in location.
   Result 7: The linear fit indicates no drift in location since the slope parameter is not statistically significant.
   Step 8: Detect drift in variation by dividing the data into quarters and computing Levene's test statistic for equal standard deviations.
   Result 8: Levene's test indicates no significant drift in variation.
   Step 9: Check for randomness by generating a runs test.
   Result 9: The runs test indicates significant non-randomness.

3. Fit the non-linear model.
   Result 1: Complex demodulation phase plot indicates a starting frequency

4. Validate fit.
   Step 1: Generate a 4-plot of the residuals from the fit.
   Result 1: The 4-plot indicates that the assumptions of constant location and scale are valid. The lag plot indicates that the data are random. The histogram and normal probability plot indicate that the normality assumption for the residuals is not seriously violated, although there is a bend on the probability plot that warrants attention.
   Step 2: Generate a nonlinear fit with outliers removed.
   Result 2: The fit after removing 3 outliers shows some marginal improvement in the model (a 5% reduction in the residual standard deviation).
   Step 3: Generate a 4-plot of the residuals from the fit with the outliers removed.
   Result 3: The 4-plot of the model fit after the 3 outliers are removed shows marginal improvement in satisfying model assumptions.

http://www.itl.nist.gov/div898/handbook/eda/section4/eda4255.htm (2 of 3) [5/1/2006 9:58:53 AM]


1.4.2 Case Studies

1.4.2.6 Filter Transmittance

1.4.2.6.1 Background and Data

Generation

This data set was collected by NIST chemist Radu Mavrodineanu in the 1970's from an automatic data acquisition system for a filter transmittance experiment. The response variable is transmittance.

The motivation for studying this data set is to show how the underlying autocorrelation structure in a relatively small data set helped the scientist detect problems with his automatic data acquisition system.

This file can be read by Dataplot with the following commands:

SKIP 25
READ MAVRO.DAT Y

Resulting Data

The following are the data used for this case study.

2.00180 2.00170 2.00180 2.00190 2.00180 2.00170 2.00150 2.00140 2.00150 2.00150
2.00170 2.00180 2.00180 2.00190 2.00190 2.00210 2.00200 2.00160 2.00140 2.00130
2.00130 2.00150 2.00150 2.00160 2.00150 2.00140 2.00130 2.00140 2.00150 2.00140
2.00150 2.00160 2.00150 2.00160 2.00190 2.00200 2.00200 2.00210 2.00220 2.00230
2.00240 2.00250 2.00270 2.00260 2.00260 2.00260 2.00270 2.00260 2.00250 2.00240

http://www.itl.nist.gov/div898/handbook/eda/section4/eda4261.htm (2 of 2) [5/1/2006 9:58:53 AM]


1.4.2.6.2 Graphical Output and Interpretation

Determine if the univariate model:

Y(i) = C + E(i)

is appropriate and valid.

Determine if the confidence interval

YBAR +/- 2*s/SQRT(N)

is appropriate and valid, where s is the standard deviation of the original data.

4-Plot of Data

Interpretation

The assumptions are addressed by the graphics shown above:

The run sequence plot (upper left) indicates a significant shift in location around x=35.

measurement. The solution was to rerun the experiment allowing more time between samples.


http://www.itl.nist.gov/div898/handbook/eda/section4/eda4262.htm (2 of 4) [5/1/2006 9:58:53 AM]


Simple graphical techniques can be quite effective in revealing unexpected results in the data. When this occurs, it is important to investigate whether the unexpected result is due to problems in the experiment and data collection or is indicative of unexpected underlying structure in the data. This determination cannot be made on the basis of statistics alone. The role of the graphical and statistical analysis is to detect problems or unexpected results in the data. Resolving the issues requires the knowledge of the scientist or engineer.

Individual Plots

Although it is generally unnecessary, the plots can be generated individually to give more detail. Since the lag plot indicates significant non-randomness, we omit the distributional plots.


Location

One way to quantify a change in location over time is to fit a straight line to the data set using the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no significant drift in the location, the slope parameter should be zero. For this data set, Dataplot generates the following output:

LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N = 50
NUMBER OF VARIABLES = 1
NO REPLICATION CASE

   PARAMETER ESTIMATES   (APPROX ST DEV.)   TVALUE
1  A0      2.00138        (0.9695E-04)      0.2064E+05
2  A1  X   0.184685E-04   (0.3309E-05)      5.582

RESIDUAL STANDARD DEVIATION = 0.3376404E-03
RESIDUAL DEGREES OF FREEDOM = 48

The slope parameter, A1, has a t value of 5.6, which is statistically significant. The value of the slope parameter is 0.0000185. Although this number is nearly zero, we need to take into account that the original scale of the data is from about 2.0012 to 2.0028. In this case, we conclude that there is a drift in location, although by a relatively minor amount.
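The straight-line fit can be reproduced approximately with ordinary least squares on the 50 transmittance values listed in the Background and Data section. The sketch below uses only the standard library; the variable names are illustrative, and small differences from the Dataplot output in the last printed digits are to be expected.

```python
import math

# The 50 transmittance values from the Background and Data section.
Y = [
    2.00180, 2.00170, 2.00180, 2.00190, 2.00180, 2.00170, 2.00150, 2.00140, 2.00150, 2.00150,
    2.00170, 2.00180, 2.00180, 2.00190, 2.00190, 2.00210, 2.00200, 2.00160, 2.00140, 2.00130,
    2.00130, 2.00150, 2.00150, 2.00160, 2.00150, 2.00140, 2.00130, 2.00140, 2.00150, 2.00140,
    2.00150, 2.00160, 2.00150, 2.00160, 2.00190, 2.00200, 2.00200, 2.00210, 2.00220, 2.00230,
    2.00240, 2.00250, 2.00270, 2.00260, 2.00260, 2.00260, 2.00270, 2.00260, 2.00250, 2.00240,
]

n = len(Y)
X = list(range(1, n + 1))          # index variable X = 1, 2, ..., N
x_mean = sum(X) / n
y_mean = sum(Y) / n

sxx = sum((x - x_mean) ** 2 for x in X)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(X, Y))

slope = sxy / sxx                  # A1
intercept = y_mean - slope * x_mean  # A0

# Residual standard deviation with N - 2 degrees of freedom.
residuals = [y - (intercept + slope * x) for x, y in zip(X, Y)]
resid_sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))

# t-value of the slope: estimate divided by its standard error.
t_slope = slope / (resid_sd / math.sqrt(sxx))

print(f"A0={intercept:.5f}  A1={slope:.4e}  t={t_slope:.2f}")
```

A t-value well above 2 in absolute terms is what drives the conclusion that the drift, although small in magnitude, is statistically significant.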

Variation

One simple way to detect a change in variation is with a Bartlett test after dividing the data set into several equal sized intervals. However, the Bartlett test is not robust for non-normality. Since the normality assumption is questionable for these data, we use the alternative Levene test. In particular, we use the Levene test based on the median rather than the mean. The choice of the number of intervals is somewhat arbitrary, although values of 4 or 8 are reasonable. Dataplot generated the following output for the Levene test.

LEVENE F-TEST FOR SHIFT IN VARIATION
(ASSUMPTION: NORMALITY)

1. STATISTICS
   NUMBER OF OBSERVATIONS = 50
   NUMBER OF GROUPS = 4
   LEVENE F TEST STATISTIC = 0.9714893

FOR LEVENE TEST STATISTIC
   99 % POINT = 4.238307
   99.9 % POINT = 6.424733

   58.56597 % Point: 0.9714893

3. CONCLUSION (AT THE 5% LEVEL):
   THERE IS NO SHIFT IN VARIATION
   THUS: HOMOGENEOUS WITH RESPECT TO VARIATION

In this case, since the Levene test statistic value of 0.971 is less than the critical value of 2.806 at the 5% level, we conclude that there is no evidence of a change in variation.
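A median-based Levene statistic can be sketched by hand for these data. The exact way Dataplot splits 50 observations into 4 groups is not stated, so the split below (group sizes 12, 13, 12, 13) is an assumption, and `levene_median` is a hypothetical helper; the resulting statistic should be close to, but need not equal, the reported 0.9714893.

```python
from statistics import mean, median

# The 50 transmittance values from the Background and Data section.
Y = [
    2.00180, 2.00170, 2.00180, 2.00190, 2.00180, 2.00170, 2.00150, 2.00140, 2.00150, 2.00150,
    2.00170, 2.00180, 2.00180, 2.00190, 2.00190, 2.00210, 2.00200, 2.00160, 2.00140, 2.00130,
    2.00130, 2.00150, 2.00150, 2.00160, 2.00150, 2.00140, 2.00130, 2.00140, 2.00150, 2.00140,
    2.00150, 2.00160, 2.00150, 2.00160, 2.00190, 2.00200, 2.00200, 2.00210, 2.00220, 2.00230,
    2.00240, 2.00250, 2.00270, 2.00260, 2.00260, 2.00260, 2.00270, 2.00260, 2.00250, 2.00240,
]

def levene_median(groups):
    """Levene statistic using absolute deviations from each group median."""
    z = []
    for g in groups:
        med = median(g)
        z.append([abs(v - med) for v in g])
    n_total = sum(len(g) for g in z)
    k = len(z)
    grand_mean = sum(sum(g) for g in z) / n_total
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in z)
    within = sum((v - mean(g)) ** 2 for g in z for v in g)
    return (n_total - k) / (k - 1) * between / within

# Assumed split into 4 consecutive groups of near-equal size.
k = 4
bounds = [i * len(Y) // k for i in range(k + 1)]
groups = [Y[bounds[i]:bounds[i + 1]] for i in range(k)]

w = levene_median(groups)
critical_5pct = 2.806   # critical value quoted in the text
print(f"W={w:.3f}  shift in variation: {w > critical_5pct}")
```

Under this split the statistic stays far below the 5% critical value, in line with the conclusion drawn from the Dataplot output.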

Randomness

There are many ways in which data can be non-random. However, most common forms of non-randomness can be detected with a few simple tests. The lag plot in the 4-plot in the previous section is a simple graphical technique.

One check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside this band indicate statistically significant values (lag 0 is always 1). Dataplot generated the following autocorrelation plot.

The lag 1 autocorrelation, which is generally the one of most interest, is 0.93. The critical values at the 5% level are -0.277 and 0.277. This indicates that the lag 1 autocorrelation is statistically significant, so there is strong evidence of non-randomness.
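The lag 1 autocorrelation and its 5% critical values can be recomputed from the listed data. This sketch uses the usual sample autocorrelation (lagged cross-products over the total sum of squares) and the approximation 1.96/sqrt(N) for the band; both are standard choices assumed here rather than taken from the Dataplot documentation.

```python
import math

# The 50 transmittance values from the Background and Data section.
Y = [
    2.00180, 2.00170, 2.00180, 2.00190, 2.00180, 2.00170, 2.00150, 2.00140, 2.00150, 2.00150,
    2.00170, 2.00180, 2.00180, 2.00190, 2.00190, 2.00210, 2.00200, 2.00160, 2.00140, 2.00130,
    2.00130, 2.00150, 2.00150, 2.00160, 2.00150, 2.00140, 2.00130, 2.00140, 2.00150, 2.00140,
    2.00150, 2.00160, 2.00150, 2.00160, 2.00190, 2.00200, 2.00200, 2.00210, 2.00220, 2.00230,
    2.00240, 2.00250, 2.00270, 2.00260, 2.00260, 2.00260, 2.00270, 2.00260, 2.00250, 2.00240,
]

def autocorrelation(y, lag):
    """Sample autocorrelation of y at the given lag."""
    m = sum(y) / len(y)
    dev = [v - m for v in y]
    num = sum(a * b for a, b in zip(dev, dev[lag:]))
    den = sum(d * d for d in dev)
    return num / den

r1 = autocorrelation(Y, 1)
critical = 1.96 / math.sqrt(len(Y))   # approximate 95% band

print(f"r1={r1:.3f}  band=+/-{critical:.3f}  significant: {abs(r1) > critical}")
```

The computed value lands just under 0.94, so the quoted 0.93 appears to be truncated rather than rounded; either way it sits far outside the band.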

A common test for randomness is the runs test.

RUNS UP

STATISTIC = NUMBER OF RUNS UP OF LENGTH EXACTLY I

I  STAT  EXP(STAT)  SD(STAT)  Z

1.4.2.6.3 Quantitative Output and Interpretation

NUMBER OF NEGATIVE DIFFERENCES = 18
NUMBER OF ZERO DIFFERENCES = 8

Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant at the 5% level. Due to the number of values that are much larger than the 1.96 cut-off, we conclude that the data are not random.
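The difference counts reported in the runs test output can be checked against the listed data. This short sketch only tallies the signs of successive differences; it does not reproduce the full runs table or its Z values.

```python
# The 50 transmittance values from the Background and Data section.
Y = [
    2.00180, 2.00170, 2.00180, 2.00190, 2.00180, 2.00170, 2.00150, 2.00140, 2.00150, 2.00150,
    2.00170, 2.00180, 2.00180, 2.00190, 2.00190, 2.00210, 2.00200, 2.00160, 2.00140, 2.00130,
    2.00130, 2.00150, 2.00150, 2.00160, 2.00150, 2.00140, 2.00130, 2.00140, 2.00150, 2.00140,
    2.00150, 2.00160, 2.00150, 2.00160, 2.00190, 2.00200, 2.00200, 2.00210, 2.00220, 2.00230,
    2.00240, 2.00250, 2.00270, 2.00260, 2.00260, 2.00260, 2.00270, 2.00260, 2.00250, 2.00240,
]

# Compare each value with its successor instead of subtracting floats.
pairs = list(zip(Y, Y[1:]))
n_pos = sum(b > a for a, b in pairs)
n_neg = sum(b < a for a, b in pairs)
n_zero = sum(b == a for a, b in pairs)

print(n_pos, n_neg, n_zero)  # prints 23 18 8
```

The 8 ties reflect the coarse resolution of the recorded values, which is worth keeping in mind when reading a runs test on data like these.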

Distributional Analysis

Since we rejected the randomness assumption, the distributional tests are not meaningful. Therefore, these quantitative tests are omitted. We also omit Grubbs' outlier test since it also assumes the data are approximately normally distributed.


Date posted: 21/06/2014, 21:20
