
Exploratory Data Analysis_9 docx


DOCUMENT INFORMATION

Basic information

Title: Yates Analysis and Related Techniques
Institution: National Institute of Standards and Technology
Field: Statistics
Document type: Case Study
Year of publication: 2006
City: Gaithersburg
Number of pages: 42
File size: 2.89 MB


Contents


Case Study: The Yates analysis is demonstrated in the Eddy current case study.

Software: Many general purpose statistical software programs, including Dataplot, can perform a Yates analysis.


This is not the case when the design is orthogonal, as is a 2^3 full factorial design. For orthogonal designs, the estimates for the previously included terms do not change as additional terms are added. This means the ranked list of effect estimates simultaneously serves as the least squares coefficient estimates for progressively more complicated models.
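To make this orthogonality property concrete, here is a minimal sketch in Python with NumPy (an assumption of convenience; the handbook itself uses Dataplot). The eight responses are the Eddy current sensitivities (ohms) in standard order, assumed from the case-study data set referenced above; they reproduce the grand mean of 2.65875 quoted in the Yates output below. The X1 coefficient is the same whether the model contains only the mean and X1 or additional orthogonal terms.

    import numpy as np

    # 2^3 full factorial in standard order: X1 varies fastest.
    X1 = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
    X2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
    X3 = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
    intercept = np.ones(8)

    # Assumed Eddy current sensitivities (ohms) in standard order.
    y = np.array([1.70, 4.57, 0.55, 3.39, 1.51, 4.59, 0.67, 4.29])

    def ls_coefs(*columns):
        """Least squares coefficients for the model built from the given columns."""
        X = np.column_stack(columns)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta

    small = ls_coefs(intercept, X1)                 # mean + X1
    large = ls_coefs(intercept, X1, X2, X2 * X3)    # mean + X1 + X2 + X2*X3

    # Because the columns are orthogonal, the X1 coefficient does not change.
    print(small[1], large[1])    # both 1.55125 (half of the 3.1025 effect)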

Yates Table: For convenience, we list the sample Yates output for the Eddy current data set here.

(NOTE: DATA MUST BE IN STANDARD ORDER)
NUMBER OF OBSERVATIONS          =  8
NUMBER OF FACTORS               =  3
NO REPLICATION CASE

PSEUDO-REPLICATION STAND. DEV.  =  0.20152531564E+00
PSEUDO-DEGREES OF FREEDOM       =  1
(THE PSEUDO-REP. STAND. DEV. ASSUMES ALL
 3, 4, 5, ...-TERM INTERACTIONS ARE NOT REAL,
 BUT MANIFESTATIONS OF RANDOM ERROR)

STANDARD DEVIATION OF A COEF.   =  0.14249992371E+00
(BASED ON PSEUDO-REP. ST. DEV.)

GRAND MEAN                      =  0.26587500572E+01
GRAND STANDARD DEVIATION        =  0.17410624027E+01

99% CONFIDENCE LIMITS (+-)      =  0.90710897446E+01
95% CONFIDENCE LIMITS (+-)      =  0.18106349707E+01
99.5% POINT OF T DISTRIBUTION   =  0.63656803131E+02
97.5% POINT OF T DISTRIBUTION   =  0.12706216812E+02

IDENTIFIER   EFFECT     T VALUE    RESSD:       RESSD:
                                   MEAN +       MEAN +
                                   TERM         CUM TERMS
-----------------------------------------------------------
  MEAN       2.65875               1.74106      1.74106
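The EFFECT column of such a table can be reproduced with the classical Yates algorithm: repeated passes of pairwise sums and differences over the responses in standard order, followed by scaling. Below is a minimal Python sketch (an assumption of convenience, since the output above comes from Dataplot); the eight responses are the assumed Eddy current sensitivities, and the term rows beyond MEAN are not reproduced in this extract, so the sketch regenerates the effects from the responses.

    def yates_effects(y):
        """Yates algorithm for a 2^k full factorial with responses in standard order.

        Returns (grand_mean, effects) where effects[i] corresponds to the i-th
        Yates row (1, 2, 12, 3, 13, 23, 123 for k = 3).
        """
        n = len(y)
        k = n.bit_length() - 1          # n = 2^k
        col = list(y)
        for _ in range(k):
            sums  = [col[i] + col[i + 1] for i in range(0, n, 2)]
            diffs = [col[i + 1] - col[i] for i in range(0, n, 2)]
            col = sums + diffs
        grand_mean = col[0] / n
        effects = [c / (n / 2) for c in col[1:]]
        return grand_mean, effects

    # Assumed Eddy current sensitivities (ohms) in standard order.
    y = [1.70, 4.57, 0.55, 3.39, 1.51, 4.59, 0.67, 4.29]
    mean, effects = yates_effects(y)
    print(round(mean, 5))        # 2.65875, matching the GRAND MEAN line above
    print(round(effects[0], 5))  # 3.1025, the X1 effect quoted later in the text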

The last column of the Yates table gives the residual standard deviation for 8 possible models, each with one more term than the previous model.


The next-to-last of these cumulative models has a residual standard deviation of 0.18031 ohms, and the model with all possible terms included has a residual standard deviation of 0.0 ohms. Note that the model with all possible terms included will have a zero residual standard deviation; this will always occur with an unreplicated two-level factorial design. The question then becomes how many terms (main effects and interactions) to include in the model.
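The "RESSD: MEAN + CUM TERMS" column can be reproduced by fitting the progressively larger least squares models and computing each fit's residual standard deviation as sqrt(SSE / (n - p)), which reproduces the 1.74106 value shown for the mean-only model. A sketch in Python/NumPy (an assumption of convenience); the responses are again the assumed Eddy current values, and the cumulative order is obtained by ranking the absolute effects as described in the text.

    import numpy as np

    # Assumed Eddy current responses (ohms) and 2^3 design columns, standard order.
    y = np.array([1.70, 4.57, 0.55, 3.39, 1.51, 4.59, 0.67, 4.29])
    X1 = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
    X2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
    X3 = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)

    terms = {"X1": X1, "X2": X2, "X3": X3, "X1*X2": X1 * X2,
             "X1*X3": X1 * X3, "X2*X3": X2 * X3, "X1*X2*X3": X1 * X2 * X3}

    def effect(col):
        # Effect = mean response at +1 minus mean response at -1.
        return y[col > 0].mean() - y[col < 0].mean()

    ranked = sorted(terms, key=lambda name: abs(effect(terms[name])), reverse=True)

    cols = [np.ones_like(y)]            # start with the mean-only model
    for name in ["(mean)"] + ranked:
        if name != "(mean)":
            cols.append(terms[name])
        X = np.column_stack(cols)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        df = len(y) - X.shape[1]
        ressd = float(np.sqrt(resid @ resid / df)) if df > 0 else 0.0
        print(f"{name:>10s}  cumulative residual sd = {ressd:.5f}")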


These criteria will be examined in the context of the Eddy current data set. The Yates Analysis page gave the sample Yates output for these data, and the Defining Models and Prediction Equations page listed the potential models from the Yates analysis.

In practice, not all of these criteria will be used with every analysis (and some analysts may have additional criteria). These criteria are given as useful guidelines. Most analysts will focus on those criteria that they find most useful.


Engineering Significance: The minimum engineering significant difference criterion is defined as |β̂| > Δ, where |β̂| is the absolute value of the parameter estimate (i.e., the effect) and Δ is the minimum engineering significant difference.

That is, declare a factor as "important" if its effect is greater than some a priori declared engineering difference. This implies that the engineering staff have in fact stated what a minimum effect will be. Oftentimes this is not the case. In the absence of an a priori difference, a good rough rule for the minimum engineering significant difference is to keep only those factors whose effect is greater than, say, 10% of the current production average. In this case, let's say that the average detector has a sensitivity of 2.5 ohms. This would suggest that we declare all factors whose effect is greater than 10% of 2.5 ohms = 0.25 ohm to be significant (from an engineering point of view).

Based on this minimum engineering significant difference criterion, we conclude that we should keep two terms: X1 and X2.

Effects: Order of Magnitude: The order of magnitude criterion is defined as |β̂| < 0.10 · max|β̂|.

That is, exclude any factor whose effect is less than 10% of the maximum effect size. We may or may not keep the other factors. This criterion is neither engineering nor statistical, but it does offer some additional numerical insight. For the current example, the largest effect is from X1 (3.10250 ohms), and so 10% of that is 0.31 ohms, which suggests keeping all factors whose effects exceed 0.31 ohms.

Based on the order-of-magnitude criterion, we thus conclude that we should keep two terms: X1 and X2. A third term, X2*X3 (0.29750), is just slightly under the cutoff level, so we may consider keeping it based on the other criteria.
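A small sketch of this screen in Python (an assumption of convenience), applied to the three effects quoted in this extract; the remaining effects from the full Yates table are not reproduced here.

    # Effects (ohms) quoted in the text; the other effects are omitted here.
    effects = {"X1": 3.10250, "X2": -0.86750, "X2*X3": 0.29750}

    cutoff = 0.10 * max(abs(e) for e in effects.values())   # 10% of the largest effect
    print(f"cutoff = {cutoff:.5f} ohms")                     # 0.31025

    for name, e in effects.items():
        verdict = "keep" if abs(e) > cutoff else "below cutoff"
        print(f"{name:>6s}  {e:8.5f}  {verdict}")            # X2*X3 falls just under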

Effects: Statistical Significance: Statistical significance is defined as |β̂| > 2·s(β̂), where s(β̂) is the standard deviation of an effect estimate.

That is, declare a factor as "important" if its effect is more than 2 standard deviations away from 0 (0, by definition, meaning "no effect").

The "2" comes from normal theory (more specifically, a value of 1.96 yields a 95% confidence interval). More precise values would come from t-distribution theory.

The difficulty with this is that in order to invoke this criterion we need the standard deviation, σ, of an observation. This is problematic because the engineer may not know σ.

For the Eddy current example, the engineer did not know σ; however, ignoring 3-term interactions and higher-order interactions leads to an estimate of σ based on omitting only a single term: the X1*X2*X3 interaction.
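Treating the STANDARD DEVIATION OF A COEF line in the Yates output above as the standard deviation of an effect estimate (an assumption about how that output is scaled), the 2-standard-deviation rule can be checked directly for the effects quoted in this extract. A short Python sketch:

    # Standard deviation of an effect, from the Yates output above (pseudo-replication).
    sd_effect = 0.14249992371

    # Effects (ohms) quoted in the text; the rest of the Yates table is not reproduced here.
    effects = {"X1": 3.10250, "X2": -0.86750, "X2*X3": 0.29750}

    for name, e in effects.items():
        t = e / sd_effect
        verdict = "significant" if abs(t) > 2 else "not significant"
        print(f"{name:>6s}  t = {t:6.1f}  {verdict}")
        # X2*X3 is only marginally above the 2 standard deviation cutoff.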

Effects: Probability Plots: Probability plots can be used in the following manner.

Normal Probability Plot: Keep a factor as "important" if it is well off the line through zero on a normal probability plot of the effect estimates.

Half-Normal Probability Plot: Keep a factor as "important" if it is well off the line on a half-normal probability plot of the absolute values of the effect estimates.

Since the half-normal probability plot is only concerned with effect magnitudes as opposed to signed effects (which are subject to the vagaries of how the initial factor codings +1 and -1 were assigned), the half-normal probability plot is preferred by some over the normal probability plot.


For the example at hand, both probability plots clearly show two factors displaced off the line, and from the third plot (with factor tags included), we see that those two factors are factor 1 and factor 2. All of the remaining five effects behave like random drawings from a normal distribution centered at zero, and so are deemed to be statistically non-significant. In conclusion, this rule keeps two factors: X1 (3.10250) and X2 (-0.86750).
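A half-normal probability plot of the effect magnitudes can be sketched as follows in Python with NumPy, SciPy, and Matplotlib (all assumptions of convenience); the effects are computed from the assumed Eddy current responses, so X1 and X2 should stand well off the line formed by the other five points.

    import numpy as np
    from scipy.stats import halfnorm
    import matplotlib.pyplot as plt

    # Assumed Eddy current responses (ohms) in standard order.
    y = np.array([1.70, 4.57, 0.55, 3.39, 1.51, 4.59, 0.67, 4.29])
    X1 = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
    X2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
    X3 = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)

    columns = {"X1": X1, "X2": X2, "X3": X3,
               "X1*X2": X1 * X2, "X1*X3": X1 * X3,
               "X2*X3": X2 * X3, "X1*X2*X3": X1 * X2 * X3}

    # Effect = mean response at +1 minus mean response at -1.
    effects = {name: y[c > 0].mean() - y[c < 0].mean() for name, c in columns.items()}

    names = sorted(effects, key=lambda k: abs(effects[k]))
    abs_effects = np.array([abs(effects[k]) for k in names])
    n = len(abs_effects)
    quantiles = halfnorm.ppf((np.arange(1, n + 1) - 0.5) / n)   # plotting positions

    plt.scatter(quantiles, abs_effects)
    for q, a, name in zip(quantiles, abs_effects, names):
        plt.annotate(name, (q, a))
    plt.xlabel("half-normal quantile")
    plt.ylabel("|effect| (ohms)")
    plt.title("Half-normal probability plot of effect estimates")
    plt.show()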

The use of the Youden plot for comparing pairs of items leads to the technique of generating a Youden plot of the low and high averages.


Youden Plot of Effect Estimates: The following is the Youden plot of the effect estimates for the Eddy current data.

For the example at hand, the Youden plot clearly shows a cluster of points near the grand average (2.65875) with two displaced points, one above (factor 1) and one below (factor 2). Based on the Youden plot, we conclude that we should keep two factors: X1 (3.10250) and X2 (-0.86750).
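Such a Youden plot places, for each factor or interaction, the average response at the -1 setting on one axis and the average at the +1 setting on the other; unimportant terms cluster near the grand average. A sketch in Python/Matplotlib (an assumption of convenience), again using the assumed Eddy current responses:

    import numpy as np
    import matplotlib.pyplot as plt

    # Assumed Eddy current responses (ohms) in standard order.
    y = np.array([1.70, 4.57, 0.55, 3.39, 1.51, 4.59, 0.67, 4.29])
    X1 = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
    X2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
    X3 = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
    columns = {"1": X1, "2": X2, "3": X3, "12": X1 * X2, "13": X1 * X3,
               "23": X2 * X3, "123": X1 * X2 * X3}

    grand = y.mean()                      # 2.65875
    for name, c in columns.items():
        low, high = y[c < 0].mean(), y[c > 0].mean()
        plt.scatter(low, high)
        plt.annotate(name, (low, high))

    plt.axhline(grand, linestyle=":")     # reference lines at the grand average
    plt.axvline(grand, linestyle=":")
    plt.xlabel("average response at the -1 setting (ohms)")
    plt.ylabel("average response at the +1 setting (ohms)")
    plt.title("Youden plot of effect estimates")
    plt.show()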

This criterion is different from the others in that it is model-focused. In practice, this criterion states that, starting with the largest effect, we cumulatively keep adding terms to the model and monitor how the residual standard deviation for each progressively more complicated model becomes smaller. At some point, the cumulative model will become complicated enough and comprehensive enough that the resulting residual standard deviation will drop below the pre-specified engineering cutoff for the residual standard deviation. At that point, we stop adding terms and declare all of the model-included terms to be "important" and everything not in the model to be "unimportant".

This approach implies that the engineer has considered what a minimum residual standard deviation should be. In effect, this relates to what the engineer can tolerate for the magnitude of the typical residual (= the difference between the raw data and the predicted value from the model).


In other words, how good does the engineer want the prediction equation to be? Unfortunately, this engineering specification has not always been formulated, and so this criterion can become moot.

In the absence of an a priori specified cutoff, a good rough rule for the minimum engineering residual standard deviation is to keep adding terms until the residual standard deviation just dips below, say, 5% of the current production average. For the Eddy current data, let's say that the average detector has a sensitivity of 2.5 ohms. This would suggest that we keep adding terms to the model until the residual standard deviation falls below 5% of 2.5 ohms = 0.125 ohms. Based on this minimum residual standard deviation criterion, and by scanning the far-right column of the Yates table, we would conclude that we should keep all seven terms, the last of which (term 7) has a cumulative residual standard deviation of 0.00000.

Note that we must include all terms in order to drive the residual standard deviation below 0.125.

Again, the 5% rule is a rough-and-ready rule that has no basis in engineering or statistics; it is simply a "numerics". Ideally, the engineer has a better cutoff for the residual standard deviation that is based on how well he/she wants the equation to perform in practice. If such a number were available, then for this criterion and data set we would select something less than the entire collection of terms.
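The stopping rule itself is easy to sketch: rank the terms by absolute effect, add them one at a time, and stop as soon as the cumulative residual standard deviation falls below the cutoff. A Python/NumPy sketch (an assumption of convenience), again using the assumed Eddy current responses and the 0.125 ohm cutoff; it reports that all seven terms are needed, as stated above.

    import numpy as np

    # Assumed Eddy current responses (ohms) and 2^3 design columns, standard order.
    y = np.array([1.70, 4.57, 0.55, 3.39, 1.51, 4.59, 0.67, 4.29])
    X1 = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
    X2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
    X3 = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
    terms = {"X1": X1, "X2": X2, "X3": X3, "X1*X2": X1 * X2,
             "X1*X3": X1 * X3, "X2*X3": X2 * X3, "X1*X2*X3": X1 * X2 * X3}

    def effect(col):
        return y[col > 0].mean() - y[col < 0].mean()

    ranked = sorted(terms, key=lambda name: abs(effect(terms[name])), reverse=True)

    cutoff = 0.05 * 2.5                 # 5% of the assumed 2.5 ohm production average
    cols, kept = [np.ones_like(y)], []
    for name in ranked:
        cols.append(terms[name])
        kept.append(name)
        X = np.column_stack(cols)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        df = len(y) - X.shape[1]
        ressd = float(np.sqrt(resid @ resid / df)) if df > 0 else 0.0
        if ressd < cutoff:
            break

    print(f"{len(kept)} terms needed to reach residual sd < {cutoff} ohms: {kept}")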

A statistical variant of this criterion is to keep adding terms until the residual standard deviation falls below σ, where σ is the standard deviation of an observation under replicated conditions.

That is, declare a term as "important" until the cumulative model that includes the term has a residual standard deviation smaller than σ. In essence, we are allowing that we cannot demand a model fit any better than what we would obtain if we had replicated data; that is, we cannot demand that the residual standard deviation from any fitted model be any smaller than the (theoretical or actual) replication standard deviation. We can drive the fitted standard deviation down (by adding terms) until it achieves a value close to σ, but to attempt to drive it down further means that we are, in effect, trying to fit noise.

In practice, this criterion may be difficult to apply because:

1. the engineer may not know σ;
2. the experiment might not have replication, and so a model-free estimate of σ is not obtainable.

For the current case study:

1. the engineer did not know σ;
2. the design (a 2^3 full factorial) did not have replication. The most common way of having replication in such designs is to have replicated center points at the center of the cube ((X1,X2,X3) = (0,0,0)).

Thus, for this current case, this criterion could not be used to yield a subset of "important" factors.


Conclusions: In summary, the seven criteria for specifying "important" factors yielded conflicting results for the Eddy current data: the effects-based criteria examined above (engineering significance, order of magnitude, probability plots, and the Youden plot) each keep X1 and X2, the minimum residual standard deviation criterion keeps all seven terms, and the seventh criterion (residual standard deviation compared with the replication standard deviation) is not applicable. Such conflicting results are common. Arguably, the three most important criteria (listed in order of most important) begin with Effects: Probability Plots.

Parsimonious Prediction Equation: The resulting parsimonious prediction equation has a residual standard deviation of 0.30429 ohms.


Some practical uses of probability distributions are:

To calculate confidence intervals for parameters and to calculate critical regions for hypothesis tests.


3. The sum of p(x) over all possible values of x is 1; that is, Σ_j p_j = 1, where j represents all possible values that x can have and p_j is the probability at x_j.

One consequence of properties 2 and 3 is that 0 <= p(x) <= 1.

What does this actually mean? A discrete probability function is a function that can take a discrete number of values (not necessarily finite). This is most often the non-negative integers or some subset of the non-negative integers. There is no mathematical restriction that discrete probability functions only be defined at integers, but in practice this is usually what makes sense. For example, if you toss a coin 6 times, you can get 2 heads or 3 heads but not 2 1/2 heads. Each of the discrete values has a certain probability of occurrence that is between zero and one. That is, a discrete function that allows negative values or values greater than one is not a probability function. The condition that the probabilities sum to one means that at least one of the values has to occur.
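As a quick check of these properties, the following Python sketch (an assumption of convenience) builds the distribution of the number of heads in 6 fair coin tosses and verifies that every probability lies between zero and one and that the probabilities sum to one.

    from math import comb

    # Number of heads in 6 fair coin tosses: a discrete probability function
    # defined on the integers 0..6.
    n, p = 6, 0.5
    pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

    print(all(0 <= px <= 1 for px in pmf.values()))   # True: each probability is in [0, 1]
    print(sum(pmf.values()))                          # 1.0: the probabilities sum to one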


Continuous Distributions: The mathematical definition of a continuous probability function, f(x), is a function that satisfies the following properties.

The probability that x is between two points a and b is P(a <= x <= b) = ∫ f(x) dx, with the integral taken from a to b.

Probabilities are measured over intervals, not single points. That is, the area under the curve between two distinct points defines the probability for that interval. This means that the height of the probability function can in fact be greater than one. The property that the integral must equal one is equivalent to the property for discrete distributions that the sum of all the probabilities must equal one.
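A short SciPy sketch (Python and SciPy are assumptions of convenience) illustrating these points with a narrow normal density: the density height exceeds one, yet interval probabilities and the total area behave as required.

    from scipy.stats import norm
    from scipy.integrate import quad

    # A normal density with a small standard deviation: the density value at the
    # mean exceeds one, but probabilities over intervals still behave properly.
    dist = norm(loc=0.0, scale=0.1)
    print(dist.pdf(0.0))                         # ~3.99: the height of f(x) can exceed 1

    # P(a <= X <= b) is the area under f(x) between a and b.
    area, _ = quad(dist.pdf, -0.1, 0.1)
    print(area, dist.cdf(0.1) - dist.cdf(-0.1))  # both ~0.6827

    # The total area under the density is one.
    total, _ = quad(dist.pdf, -1.0, 1.0)         # essentially the full support at this scale
    print(total)                                 # ~1.0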


For a continuous distribution, the cumulative distribution function can be expressed mathematically as F(x) = ∫ f(t) dt, with the integral taken from -∞ to x.

For a discrete distribution, the cdf can be expressed as F(x) = Σ p(x_i), summed over all x_i <= x.

The following is the plot of the normal cumulative distribution function.


The horizontal axis is the allowable domain for the given probability function. Since the vertical axis is a probability, it must fall between zero and one. It increases from zero to one as we go from left to right on the horizontal axis.
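Both forms of the cdf can be evaluated directly with SciPy (an assumption of convenience): norm.cdf integrates the continuous normal density, while binom.cdf sums a discrete probability function.

    from scipy.stats import norm, binom

    # Continuous case: F(x) is the integral of the density up to x.
    print(norm.cdf(0.0))          # 0.5 for the standard normal
    print(norm.cdf(1.96))         # ~0.975

    # Discrete case: F(x) is the sum of p(x_i) over all x_i <= x.
    heads = binom(n=6, p=0.5)     # number of heads in 6 coin tosses
    print(heads.cdf(3))           # P(X <= 3) = 42/64 = 0.65625
    print(sum(heads.pmf(k) for k in range(4)))   # same value, summed directly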

The percent point function is the inverse of the cumulative distribution function: starting from a probability, it returns the corresponding value of the variate, or alternatively x = G(α) = F^(-1)(α).

The following is the plot of the normal percent point function.


Since the horizontal axis is a probability, it goes from zero to one. The vertical axis goes from the smallest to the largest value of the cumulative distribution function.
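In SciPy (an assumption of convenience), the percent point function is exposed directly as ppf, and applying the cdf to its output recovers the original probability.

    from scipy.stats import norm

    # The percent point function (ppf) is the inverse of the cdf: it maps a
    # probability back to the corresponding value of the variate.
    p = 0.975
    x = norm.ppf(p)
    print(x)               # ~1.96
    print(norm.cdf(x))     # recovers 0.975: cdf and ppf are inverses

    # The ppf's horizontal axis is a probability (0 to 1); its values range over
    # the domain of the cumulative distribution function.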


Hazard plots are most commonly used in reliability applications. Note that Johnson, Kotz, and Balakrishnan refer to this as the conditional failure density function rather than the hazard function.

The cumulative hazard function can alternatively be expressed as H(x) = -ln(1 - F(x)).

The following is the plot of the normal cumulative hazard function.


Cumulative hazard plots are most commonly used in reliability applications. Note that Johnson, Kotz, and Balakrishnan refer to this as the hazard function rather than the cumulative hazard function.

Survival Function: Survival functions are most often used in reliability and related fields. The survival function is the probability that the variate takes a value greater than x, S(x) = P(X > x) = 1 - F(x).

The following is the plot of the normal distribution survival function.


For a survival function, the y value on the graph starts at 1 and monotonically decreases to zero. The survival function should be compared to the cumulative distribution function.

The following is the plot of the normal distribution inverse survival function.
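The survival, hazard, cumulative hazard, and inverse survival functions of the normal distribution are all available (or easily derived) in SciPy; Python and SciPy are assumptions of convenience here, and the relationships used in the comments are the standard ones.

    import numpy as np
    from scipy.stats import norm

    x = 1.0

    sf = norm.sf(x)                    # survival function S(x) = 1 - F(x)
    print(sf, 1 - norm.cdf(x))         # ~0.1587 both ways

    hazard = norm.pdf(x) / norm.sf(x)  # hazard function h(x) = f(x) / S(x)
    cum_hazard = -np.log(norm.sf(x))   # cumulative hazard H(x) = -ln(1 - F(x))
    print(hazard, cum_hazard)

    # Inverse survival function: the value exceeded with probability p.
    p = 0.10
    print(norm.isf(p), norm.ppf(1 - p))   # ~1.2816 both ways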
