
Document information

Title: Exploratory Data Analysis
Author: Ron Dziuba
Institution: National Institute of Standards and Technology
Field: Data Analysis
Type: Essay
Year of publication: 2006
City: Gaithersburg
Pages: 42
Size: 2.87 MB

28.0486 28.0427 28.0548 28.0616 28.0298 28.0726 28.0695 28.0629 28.0503 28.0493 28.0537 28.0613 28.0643 28.0678 28.0564 28.0703 28.0647 28.0579 28.0630 28.0716 28.0586 28.0607 28.0601 28.0611 28.0606 28.0611 28.0066 28.0412 28.0558 28.0590 28.0750 28.0483 28.0599 28.0490 28.0499 28.0565 28.0612 28.0634 28.0627 28.0519 28.0551 28.0696 28.0581 28.0568 28.0572 28.0529

1.4.2.7.1 Background and Data

http://www.itl.nist.gov/div898/handbook/eda/section4/eda4271.htm (15 of 23) [5/1/2006 9:58:55 AM]

28.0421 28.0432 28.0211 28.0363 28.0436 28.0619 28.0573 28.0499 28.0340 28.0474 28.0534 28.0589 28.0466 28.0448 28.0576 28.0558 28.0522 28.0480 28.0444 28.0429 28.0624 28.0610 28.0461 28.0564 28.0734 28.0565 28.0503 28.0581 28.0519 28.0625 28.0583 28.0645 28.0642 28.0535 28.0510 28.0542 28.0677 28.0416 28.0676 28.0596 28.0635 28.0558 28.0623 28.0718 28.0585 28.0552

28.0684 28.0646 28.0590 28.0465 28.0594 28.0303 28.0533 28.0561 28.0585 28.0497 28.0582 28.0507 28.0562 28.0715 28.0468 28.0411 28.0587 28.0456 28.0705 28.0534 28.0558 28.0536 28.0552 28.0461 28.0598 28.0598 28.0650 28.0423 28.0442 28.0449 28.0660 28.0506 28.0655 28.0512 28.0407 28.0475 28.0411 28.0512 28.1036 28.0641 28.0572 28.0700 28.0577 28.0637 28.0534 28.0461

28.0701 28.0631 28.0575 28.0444 28.0592 28.0684 28.0593 28.0677 28.0512 28.0644 28.0660 28.0542 28.0768 28.0515 28.0579 28.0538 28.0526 28.0833 28.0637 28.0529 28.0535 28.0561 28.0736 28.0635 28.0600 28.0520 28.0695 28.0608 28.0608 28.0590 28.0290 28.0939 28.0618 28.0551 28.0757 28.0698 28.0717 28.0529 28.0644 28.0613 28.0759 28.0745 28.0736 28.0611 28.0732 28.0782

28.0682 28.0756 28.0857 28.0739 28.0840 28.0862 28.0724 28.0727 28.0752 28.0732 28.0703 28.0849 28.0795 28.0902 28.0874 28.0971 28.0638 28.0877 28.0751 28.0904 28.0971 28.0661 28.0711 28.0754 28.0516 28.0961 28.0689 28.1110 28.1062 28.0726 28.1141 28.0913 28.0982 28.0703 28.0654 28.0760 28.0727 28.0850 28.0877 28.0967 28.1185 28.0945 28.0834 28.0764 28.1129 28.0797

28.0707 28.1008 28.0971 28.0826 28.0857 28.0984 28.0869 28.0795 28.0875 28.1184 28.0746 28.0816 28.0879 28.0888 28.0924 28.0979 28.0702 28.0847 28.0917 28.0834 28.0823 28.0917 28.0779 28.0852 28.0863 28.0942 28.0801 28.0817 28.0922 28.0914 28.0868 28.0832 28.0881 28.0910 28.0886 28.0961 28.0857 28.0859 28.1086 28.0838 28.0921 28.0945 28.0839 28.0877 28.0803 28.0928

28.0885 28.0940 28.0856 28.0849 28.0955 28.0955 28.0846 28.0871 28.0872 28.0917 28.0931 28.0865 28.0900 28.0915 28.0963 28.0917 28.0950 28.0898 28.0902 28.0867 28.0843 28.0939 28.0902 28.0911 28.0909 28.0949 28.0867 28.0932 28.0891 28.0932 28.0887 28.0925 28.0928 28.0883 28.0946 28.0977 28.0914 28.0959 28.0926 28.0923 28.0950 28.1006 28.0924 28.0963 28.0893 28.0956

28.0980 28.0928 28.0951 28.0958 28.0912 28.0990 28.0915 28.0957 28.0976 28.0888 28.0928 28.0910 28.0902 28.0950 28.0995 28.0965 28.0972 28.0963 28.0946 28.0942 28.0998 28.0911 28.1043 28.1002 28.0991 28.0959 28.0996 28.0926 28.1002 28.0961 28.0983 28.0997 28.0959 28.0988 28.1029 28.0989 28.1000 28.0944 28.0979 28.1005 28.1012 28.1013 28.0999 28.0991 28.1059 28.0961

28.0981 28.1045 28.1047 28.1042 28.1146 28.1113 28.1051 28.1065 28.1065 28.0985 28.1000 28.1066 28.1041 28.0954 28.1090

1 Exploratory Data Analysis

1.4 EDA Case Studies

1.4.2 Case Studies

1.4.2.7 Standard Resistor

1.4.2.7.2 Graphical Output and Interpretation

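Goal

The goal of this analysis is threefold:

1. Determine if the univariate model

   $Y_i = C + E_i$

   is appropriate and valid.

2. Determine if the typical underlying assumptions for an "in control" measurement process are valid. These assumptions are random drawings, from a fixed distribution, with the distribution having a fixed location and a fixed variation.

3. Determine if the confidence interval

   $\bar{Y} \pm 2s/\sqrt{N}$

   is appropriate and valid, where $s$ is the standard deviation of the original data.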
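The confidence interval named in the goal can be evaluated directly from a list of measurements. The following is a minimal sketch, not part of the handbook; the function name `mean_interval` and the multiplier argument `k` are illustrative assumptions, and the sample is simply the first six values of the data listing.

```python
import math
import statistics


def mean_interval(values, k=2.0):
    """Return (mean, lower, upper) for the interval mean +/- k * s / sqrt(N)."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (N - 1 divisor)
    half_width = k * s / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# First six values of the data listing, used only to exercise the function.
sample = [28.0486, 28.0427, 28.0548, 28.0616, 28.0298, 28.0726]
mean, lower, upper = mean_interval(sample)
print(f"mean = {mean:.4f}, interval = ({lower:.4f}, {upper:.4f})")
```

The interval is only meaningful when the assumptions in item 2 hold, which is exactly what the rest of the case study checks.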

http://www.itl.nist.gov/div898/handbook/eda/section4/eda4272.htm (1 of 4) [5/1/2006 9:58:56 AM]

4-Plot of Data

[Figure: 4-plot of the data]

Interpretation

The assumptions are addressed by the graphics shown above:

- The run sequence plot (upper left) indicates significant shifts in both location and variation. Specifically, the location is increasing with time. The variability seems greater in the first and last thirds of the data than it does in the middle third.
- The distributional plots, the histogram (lower left) and the normal probability plot (lower right), are not interpreted since the randomness assumption is so clearly violated.

However, discussions with the scientist revealed the following:

- The drift with respect to location was expected.

The data in the first and last thirds were collected in winter, while the more stable middle third was collected in the summer. The seasonal effect was determined to be caused by the amount of humidity affecting the measurement equipment. In this case, the solution was to modify the test equipment to be less sensitive to environmental factors.

Simple graphical techniques can be quite effective in revealing unexpected results in the data. When this occurs, it is important to investigate whether the unexpected result is due to problems in the experiment and data collection, or is in fact indicative of an unexpected underlying structure in the data. This determination cannot be made on the basis of statistics alone. The role of the graphical and statistical analysis is to detect problems or unexpected results in the data. Resolving the issues requires the knowledge of the scientist or engineer.

Individual Plots

Although it is generally unnecessary, the plots can be generated individually to give more detail. Since the lag plot indicates significant non-randomness, we omit the distributional plots.

Lag Plot

http://www.itl.nist.gov/div898/handbook/eda/section4/eda4273.htm (1 of 7) [5/1/2006 9:58:57 AM]

The autocorrelation coefficient of 0.972 is evidence of significant non-randomness.

Location

One way to quantify a change in location over time is to fit a straight line to the data set using the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no significant drift in the location, the slope parameter estimate should be approximately zero. For this data set, Dataplot generates the following output:

    LEAST SQUARES MULTILINEAR FIT
    SAMPLE SIZE N       =     1000
    NUMBER OF VARIABLES =        1
    NO REPLICATION CASE

        PARAMETER ESTIMATES       (APPROX. ST. DEV.)   T VALUE
    1   A0         27.9114        (0.1209E-02)         0.2309E+05
    2   A1   X     0.209670E-03   (0.2092E-05)         100.2

    RESIDUAL STANDARD DEVIATION = 0.1909796E-01
    RESIDUAL DEGREES OF FREEDOM = 998

    COEF AND SD(COEF) WRITTEN OUT TO FILE DPST1F.DAT
    SD(PRED),95LOWER,95UPPER,99LOWER,99UPPER WRITTEN OUT TO FILE DPST2F.DAT
    REGRESSION DIAGNOSTICS WRITTEN OUT TO FILE DPST3F.DAT
    PARAMETER VARIANCE-COVARIANCE MATRIX AND INVERSE OF X-TRANSPOSE X MATRIX WRITTEN OUT TO FILE DPST4F.DAT

The slope parameter, A1, has a t value of 100, which is statistically significant. The value of the slope parameter estimate is 0.00021. Although this number is nearly zero, we need to take into account that the original scale of the data is from about 27.8 to 28.2. In this case, we conclude that there is a drift in location.
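The straight-line fit against the index variable can be sketched in a few lines of plain Python. This is an illustrative sketch rather than Dataplot's own procedure; the function name `fit_line` and the short `demo` list are assumptions introduced for the example.

```python
import math


def fit_line(y):
    """Least-squares fit of y against X = 1..N.

    Returns (a0, a1, t), where t is the slope estimate divided by
    its approximate standard deviation.
    """
    n = len(y)
    x = range(1, n + 1)
    x_mean = (n + 1) / 2
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = y_mean - a1 * x_mean
    # Residual standard deviation uses N - 2 degrees of freedom.
    sse = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(sse / (n - 2))
    se_a1 = resid_sd / math.sqrt(sxx)
    t = a1 / se_a1 if se_a1 > 0 else math.inf
    return a0, a1, t


demo = [1.1, 1.9, 3.2, 3.8, 5.1]  # hypothetical values, not from the case study
a0, a1, t = fit_line(demo)
print(f"A0 = {a0:.4f}, A1 = {a1:.4f}, t = {t:.1f}")
```

The same ratio reproduces the t value in the output above: 0.209670E-03 / 0.2092E-05 is about 100.2. Multiplying the slope by the 999 steps between the first and last observation gives a total drift of about 0.21, roughly half of the 27.8 to 28.2 scale of the data, which is why a slope that looks nearly zero still matters here.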

1.4.2.7.3 Quantitative Output and Interpretation

Variation

One simple way to detect a change in variation is with a Bartlett test after dividing the data set into several equal-sized intervals. However, the Bartlett test is not robust for non-normality. Since the normality assumption is questionable for these data, we use the alternative Levene test. In particular, we use the Levene test based on the median rather than the mean. The choice of the number of intervals is somewhat arbitrary, although values of 4 or 8 are reasonable. Dataplot generated the following output for the Levene test.

    LEVENE F-TEST FOR SHIFT IN VARIATION
    (ASSUMPTION: NORMALITY)

    1 STATISTICS
      NUMBER OF OBSERVATIONS   = 1000
      NUMBER OF GROUPS         = 4
      LEVENE F TEST STATISTIC  = 140.8509

    FOR LEVENE TEST STATISTIC
      100.0000 % Point: 140.8509

    3 CONCLUSION (AT THE 5% LEVEL):
      THERE IS A SHIFT IN VARIATION
      THUS: NOT HOMOGENEOUS WITH RESPECT TO VARIATION

In this case, since the Levene test statistic value of 140.9 is greater than the 5% significance level critical value of 2.6, we conclude that there is significant evidence of nonconstant variation.
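A median-based Levene statistic can be sketched as follows. It follows the usual textbook definition, where each value is replaced by its absolute deviation from its group median and a one-way F ratio is computed on those deviations. Whether Dataplot computes the statistic in exactly this way is an assumption, and the names `levene_median` and `split_equal` are illustrative.

```python
import statistics


def split_equal(values, k):
    """Split values into k consecutive intervals of equal size (extras dropped)."""
    size = len(values) // k
    return [values[i * size:(i + 1) * size] for i in range(k)]


def levene_median(groups):
    """Levene statistic W based on absolute deviations from each group median."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    z = []
    for g in groups:
        med = statistics.median(g)
        z.append([abs(v - med) for v in g])
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    # Raises ZeroDivisionError if every deviation equals its group mean.
    return (n_total - k) / (k - 1) * between / within


# Hypothetical groups: same spread, then clearly different spread.
same_spread = levene_median([[1, 2, 3, 4], [11, 12, 13, 14]])
diff_spread = levene_median([[1, 2, 3, 4], [10, 20, 30, 40]])
print(same_spread, diff_spread)
```

For the case study, the 1000 observations would be passed through `split_equal(data, 4)` first. The resulting statistic is compared with an F distribution on k - 1 and N - k degrees of freedom, which is 3 and 996 here and is consistent with the critical value of 2.6 quoted above.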

Randomness

There are many ways in which data can be non-random. However, most common forms of non-randomness can be detected with a few simple tests. The lag plot in the 4-plot in the previous section is a simple graphical technique.

One check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside this band indicate statistically significant values (lag 0 is always 1). Dataplot generated the following autocorrelation plot.

The lag 1 autocorrelation, which is generally the one of greatest interest, is 0.97. The critical values at the 5% significance level are -0.062 and 0.062. This indicates that the lag 1 autocorrelation is statistically significant, so there is strong evidence of non-randomness.
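The lag 1 autocorrelation and its approximate critical values can be computed with a short sketch like the one below. The function names are illustrative assumptions, and the band z / sqrt(N) is the standard large-sample approximation; the 1.96 multiplier for the 5% level is assumed rather than taken from the Dataplot output.

```python
import math


def autocorrelation(y, lag=1):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    numer = sum((y[i] - mean) * (y[i + lag] - mean) for i in range(n - lag))
    return numer / denom


def critical_value(n, z=1.96):
    """Approximate two-sided critical value for an autocorrelation, z / sqrt(N)."""
    return z / math.sqrt(n)


trend = list(range(100))  # hypothetical steadily drifting series
r1 = autocorrelation(trend, lag=1)
print(f"r1 = {r1:.3f}, band = +/-{critical_value(len(trend)):.3f}")
print(f"band for N = 1000: +/-{critical_value(1000):.3f}")
```

With N = 1000 the band evaluates to about 0.062, matching the critical values quoted above. A series with a steady drift, like the hypothetical `trend`, produces a lag 1 autocorrelation far outside that band.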

A common test for randomness is the runs test.

RUNS UP

STATISTIC = NUMBER OF RUNS UP

RUNS TOTAL = RUNS UP + RUNS DOWN

STATISTIC = NUMBER OF RUNS TOTAL

STATISTIC = NUMBER OF RUNS TOTAL

    NUMBER OF POSITIVE DIFFERENCES = 505
    NUMBER OF NEGATIVE DIFFERENCES = 469
    NUMBER OF ZERO DIFFERENCES     = 25

Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant at the 5% level. Due to the number of values that are larger than the 1.96 cut-off, we conclude that the data are not random. However, in this case the evidence from the runs test is not nearly as strong as it is from the autocorrelation plot.
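The counts of positive, negative, and zero differences, and the runs they form, can be tallied with a sketch like the following. The function names are illustrative, and the treatment of ties (a zero difference neither extends nor breaks a run) is an assumption; Dataplot's convention may differ.

```python
def difference_signs(y):
    """Count positive, negative, and zero successive differences."""
    pos = neg = zero = 0
    for prev, cur in zip(y, y[1:]):
        if cur > prev:
            pos += 1
        elif cur < prev:
            neg += 1
        else:
            zero += 1
    return pos, neg, zero


def runs_up_down(y):
    """Return (lengths of runs up, lengths of runs down)."""
    up, down = [], []
    current_sign, length = 0, 0
    for prev, cur in zip(y, y[1:]):
        sign = (cur > prev) - (cur < prev)
        if sign == 0:
            continue  # assumption: ties neither extend nor break a run
        if sign == current_sign:
            length += 1
        else:
            if current_sign == 1:
                up.append(length)
            elif current_sign == -1:
                down.append(length)
            current_sign, length = sign, 1
    if current_sign == 1:
        up.append(length)
    elif current_sign == -1:
        down.append(length)
    return up, down


demo = [1, 2, 3, 2, 1, 2, 2, 3]  # hypothetical values, not from the case study
print(difference_signs(demo))
print(runs_up_down(demo))
```

As a consistency check on the output above, 505 + 469 + 25 = 999, which is N - 1 successive differences for N = 1000 observations.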

Distributional Analysis

Since we rejected the randomness assumption, the distributional tests are not meaningful. Therefore, these quantitative tests are omitted. Since the Grubbs' test for outliers also assumes the approximate normality of the data, we omit Grubbs' test as well.

Univariate Report

It is sometimes useful and convenient to summarize the above results in a report.

    Analysis for resistor case study

    1: Sample Size                           = 1000

    2: Location
       Mean                                  = 28.01635
       Standard Deviation of Mean            = 0.002008
       95% Confidence Interval for Mean      = (28.0124,28.02029)
       Drift with respect to location?       = NO

    3: Variation
       Standard Deviation                    = 0.063495
       95% Confidence Interval for SD        = (0.060829,0.066407)
       Change in variation?
       (based on Levene's test on quarters
       of the data)                          = YES

    4: Randomness
       Autocorrelation                       = 0.972158
       Data Are Random?
       (as measured by autocorrelation)      = NO

    5: Distribution
       Distributional test omitted due to
       non-randomness of the data

    6: Statistical Control
       (i.e., no drift in location or scale,
       data are random, distribution is
       fixed)
       Data Set is in Statistical Control?   = NO

    7: Outliers?
       (Grubbs' test omitted due to
       non-randomness of the data)

1.4.2.7.4 Work This Example Yourself

http://www.itl.nist.gov/div898/handbook/eda/section4/eda4274.htm (2 of 3) [5/1/2006 9:58:57 AM]

Click on the links below to start Dataplot and run this case study yourself. Each step may use results from previous steps, so please be patient. Wait until the software verifies that the current step is complete before clicking on the next step.

NOTE: This case study has 1,000 points. For better performance, it is highly recommended that you check the "No Update" box on the Spreadsheet window for this case study. This will suppress subsequent updating of the Spreadsheet window as the data are created or modified.

1. Invoke Dataplot and read data.
   1. Read in the data.
      Result: You have read 1 column of numbers into Dataplot, variable Y.

2. 4-plot of the data.
   Result: There are shifts in location and variation, and the data are not random.

3. Generate the individual plots.
   1. Generate a run sequence plot.
      Result: The run sequence plot indicates that there are shifts of location and variation.
   2. Generate a lag plot.
      Result: The lag plot shows a strong linear pattern, which indicates significant non-randomness.

4. Generate summary statistics, quantitative analysis, and print a univariate report.
   1. Generate a table of summary statistics.
      Result: The summary statistics table displays 25+ statistics.
   2. Generate the sample mean, a confidence interval for the population mean, and compute a linear fit to detect drift in location.
      Result: The mean is 28.0163 and a 95% confidence interval is (28.0124,28.02029). The linear fit indicates drift in location since the slope parameter estimate is statistically significant.
   3. Generate the sample standard deviation, a confidence interval for the population standard deviation, and detect drift in variation by dividing the data into quarters and computing Levene's test for equal standard deviations.
      Result: The standard deviation is 0.0635 with a 95% confidence interval of (0.060829,0.066407). Levene's test indicates significant change in variation.
   4. Check for randomness by generating an autocorrelation plot and a runs test.
      Result: The lag 1 autocorrelation is 0.97. From the autocorrelation plot, this is outside the 95% confidence interval bands, indicating significant non-randomness.
   5. Print a univariate report (this assumes steps 2 through 5 have already been run).
      Result: The results are summarized in a convenient report.

Date posted: 21/06/2014, 21:20