g-Correlation – A Six Sigma tool to discover the correlation between variables



 Abstract Number : 003-0287

 Title: g-Correlation – A Six Sigma tool to discover the correlation between variables

 Name of the Conference: Sixteenth Annual Conference of POMS, Chicago, IL, April 29 - May 2, 2005

 Authors:

Author 1

 Name : Sarika Tyagi

 Institution : Laboratory for Responsible Manufacturing

 Address : 334 SN, Department of MIE

Northeastern University

360 Huntington Ave Boston, MA 02115 USA

 E-mail : sarikatyagi@gmail.com

 Phone : 617-373-7635

 Fax : 617-373-2921

Author 2

 Name : Prof Sagar Kamarthi

 E-mail : sagar@coe.neu.edu

 Phone : 617-373-3070

 Fax : 617-373-2921

Author 3

 Name : Srikanth Vadde

 E-mail : srikant@coe.neu.edu

 Phone : 617-373-7635

 Fax : 617-373-2921


Abstract

In Six Sigma, it is important to discover the interconnections between the variables involved in the problem being examined at the Analyze phase of the DMAIC methodology. Understanding the strength of the correlation between pairs of variables helps quality personnel decide how much attention each factor needs in order to improve the sigma level of the process.

Normally, the Pearson correlation coefficient (which assumes the variables to be normally distributed or to have a linear/parametric relationship) and the Spearman correlation coefficient (which deals with variables that are measured on an ordinal scale or that have a non-parametric relationship) are employed to determine the correlation between a given pair of variables. Similarly, the Fechner correlation coefficient is used, to a lesser extent, to calculate the correlation between two monotonically related variables. This paper presents a novel correlation called g-Correlation, which is more general and less restrictive. The effectiveness of g-Correlation is demonstrated on two representative examples.


What is Six Sigma

The foundation of Six Sigma is the concept of the normal curve, introduced by Carl Friedrich Gauss in the 1800s [1]. Six Sigma is a well-structured, data-driven methodology for process improvement. It aims to eliminate a broad range of defects, quality problems, and anything that doesn’t add value [2]. It also aims to improve customer satisfaction by considering customers’ needs and requirements while maintaining profitability and competitiveness. It is widely used in a variety of fields: manufacturing, service delivery, management, and other business operations.

Six Sigma methodologies integrate principles of business, statistics, and engineering to achieve tangible results. The methodology combines well-established statistical quality control techniques, simple and advanced data analysis methods, and the systematic training of all personnel at every level in the organization. The Six Sigma methodology and management strategies together provide an overall framework for organizing company-wide quality control efforts. There are numerous Six Sigma success stories in US-based as well as international companies.


Six Sigma is not defined with a single view in mind [3]. It is perceived as a (1) vision to improve existing processes or to design completely new processes keeping customers’ requirements in mind; (2) philosophy that emphasizes reduction in variation, customer focus, and data-driven decisions; (3) symbol for quality improvement; (4) metric that is derived from data collection and data analysis; (5) goal of achieving a Six Sigma process that will have fewer than 4 defects per million opportunities; and (6) methodology for structured problem solving.

Statistical Explanation: Six Sigma

Six Sigma is a statistical measure [4] used to evaluate processes objectively. At many organizations, Six Sigma simply means a measure of quality that strives for near perfection, but the statistical implications of a Six Sigma program go well beyond the qualitative eradication of customer-perceptible defects. It is a methodology that is well rooted in mathematics and statistics. The graph in Figure 1 shows the percentage split of variation expected for different sigma processes. For example, a 1σ process will have (34.13 % + 34.13 % = 68.26 %) of events within the allowable specifications.
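As a quick numerical check of these percentages, the following minimal sketch computes the share of a normal population lying within ±k standard deviations of the mean. It assumes an exactly normal, centered process; the function name `coverage_within` is illustrative and does not come from the paper.

```python
import math


def coverage_within(k: float) -> float:
    """Fraction of a normal population within +/- k standard deviations of the mean."""
    # For a standard normal variable Z, P(|Z| <= k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2.0))


if __name__ == "__main__":
    for k in range(1, 7):
        print(f"{k} sigma: {100.0 * coverage_within(k):.5f} % within limits")
```

For k = 1 the result is about 68.27 %, which agrees with the 68.26 % quoted above up to rounding of the two 34.13 % halves.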


Figure 1: Statistical explanation of Six Sigma [4]

Processes are prone to being influenced by special and/or assignable causes that impact the overall performance of the process relative to the customers’ specifications. This means that the long-term performance of a Six Sigma process will always be less than 6σ. The difference between the short-term process capability and the long-term process capability is known as a “shift” and is denoted Zshift. For a typical process, the value of the shift is 1.5. Therefore, the statistical term “Six Sigma” indicates the short-term capability of the process to be 6σ and the long-term capability to be 4.5σ. As the process sigma value increases from zero to six, the variation of the process around the mean value decreases.
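The effect of the 1.5 shift can be illustrated with a small sketch that converts a short-term sigma level into long-term defects per million opportunities (DPMO). It assumes a normal process and counts only the tail on the side toward which the mean has drifted, which is the conventional one-sided calculation; the function name `long_term_dpmo` and the default `z_shift=1.5` are illustrative choices, not code from the paper.

```python
import math


def long_term_dpmo(short_term_sigma: float, z_shift: float = 1.5) -> float:
    """Defects per million opportunities after the mean drifts by z_shift sigma."""
    z_long = short_term_sigma - z_shift
    # One-sided upper tail of the standard normal: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
    tail = 0.5 * math.erfc(z_long / math.sqrt(2.0))
    return 1_000_000.0 * tail


if __name__ == "__main__":
    for sigma in (3.0, 4.0, 5.0, 6.0):
        print(f"{sigma:.0f} sigma short term -> {long_term_dpmo(sigma):.1f} DPMO long term")
```

Under these assumptions a 6σ short-term process corresponds to roughly 3.4 DPMO in the long term, in line with the goal of fewer than 4 defects per million opportunities.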

DMAIC Process

Six Sigma implementation follows a well-known methodology called DMAIC (Define, Measure, Analyze, Improve, and Control). It is an improvement system for existing processes that fall below specification and need incremental improvement. This methodology should be used when a product or process already exists but either has not been meeting customer specifications or has not been performing adequately. Each step in the cyclical DMAIC process is required to ensure the best possible results [6].

The Define phase is concerned with the definition of project goals, the involvement of core business processes, the identification of the customers’ expectations, and the identification of issues that need to be addressed to achieve a higher sigma level. The tools used in this step include input-process-output diagrams, process flow maps, and quality function deployment.

The Measure phase measures the current process performance by gathering information (baseline data) about the current process, and identifies the problem areas by calculating the difference between the current and the expected process performance. Data collection plans/forms and sampling techniques are usually employed at this stage.

The Analyze phase identifies the root causes of quality problems and confirms those causes using appropriate data analysis tools. Cause and effect diagrams and failure mode and effect analysis are used for this purpose.


The Improve phase implements solutions that address the root causes identified during the Analyze phase in order to improve the target processes. Brainstorming, mistake proofing, and design of experiments are some of the techniques used in this phase.

The Control phase evaluates and monitors the results at defined intervals of time. Control charts and time series methods are generally used in this phase.

Correlation Methods

After identifying the root causes (the variables that came into play), the Analyze phase of Six Sigma attempts to discover the correlations that measure the strength or closeness of the relationship between pairs of variables. These correlations can then be used for prioritizing and eliminating the causes of an effect.

There are several correlation measures [7] used in practice: parametric correlation analysis such as the Pearson correlation coefficient, and nonparametric correlation analysis such as the Spearman correlation coefficient, Kendall’s tau correlation, and the Fechner correlation coefficient.

The Pearson correlation coefficient ($r_p$) is a parametric statistic with the following underlying assumptions:

 • The data set is a randomly collected sample.

 • Both variables are measured on either an interval or a ratio scale.

 • Both variables are more or less normally distributed.

 • The relationship between the two variables is always linear.

The Pearson correlation coefficient is given by

$$ r_p = \frac{\mathrm{Cov}(x_1, x_2)}{S_{x_1} S_{x_2}} $$

where the numerator expresses the extent to which $x_1$ and $x_2$ covary, and the denominator $S_{x_1} S_{x_2}$ is the product of their standard deviations.
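The formula can be sketched directly in code. The version below is a minimal illustration that uses the sample covariance and sample standard deviations (both with n - 1 in the denominator, which cancels in the ratio); the function name `pearson_r` is illustrative, and the sketch does not guard against a variable with zero variance.

```python
import math
from typing import Sequence


def pearson_r(x1: Sequence[float], x2: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x1)
    if n != len(x2) or n < 2:
        raise ValueError("need two samples of equal length with at least two points")
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    # Sample covariance of x1 and x2.
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2)) / (n - 1)
    # Sample standard deviations of x1 and x2.
    s1 = math.sqrt(sum((a - mean1) ** 2 for a in x1) / (n - 1))
    s2 = math.sqrt(sum((b - mean2) ** 2 for b in x2) / (n - 1))
    return cov / (s1 * s2)


if __name__ == "__main__":
    print(pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

A perfectly linear increasing relationship, as in the usage line, yields a coefficient of 1 up to floating-point error.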

The Spearman correlation coefficient ($r_s$) is a nonparametric statistic, also known as rank correlation. Its underlying assumption is that the variables under consideration are measured on an ordinal (rank order) scale, or that the individual observations can be ranked into two ordered series. In other words, the variables should have a monotonic relationship.

The Spearman correlation coefficient is given by:

$$ r_s = 1 - \frac{6 \sum D^2}{N (N^2 - 1)} $$

where $r_s$ is the Spearman correlation coefficient, N is the number of pairs, and D is the difference between the ranks of each pair.
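A minimal sketch of this rank-based computation follows. It assumes that neither variable contains tied values, since the closed-form expression holds exactly only in that case; the helper names `ranks` and `spearman_rs` are illustrative.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank 1 for the smallest value; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rs(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation from the sum of squared rank differences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length with at least two points")
    rank_x, rank_y = ranks(x), ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * sum_d2 / (n * (n * n - 1))


if __name__ == "__main__":
    # A nonlinear but strictly increasing relationship still gives r_s = 1.
    print(spearman_rs([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
```

Because only the ranks enter the formula, any strictly increasing transformation of either variable leaves the coefficient unchanged.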

Kendall’s tau correlation is equivalent to the Spearman correlation coefficient with regard to the underlying assumptions. However, the two give different values because their underlying logic and their computational formulas differ. Kendall’s tau represents a probability, whereas the Spearman correlation does not.
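The probabilistic reading can be made concrete: without ties, Kendall’s tau is the proportion of concordant pairs of observations minus the proportion of discordant pairs. The sketch below implements this simple form (often called tau-a) by direct pair counting; the function name `kendall_tau` is illustrative, and the quadratic-time loop is chosen for clarity rather than speed.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length with at least two points")
    concordant = 0
    discordant = 0
    for (xa, ya), (xb, yb) in combinations(zip(x, y), 2):
        sign = (xa - xb) * (ya - yb)
        if sign > 0:
            concordant += 1   # both variables move in the same direction
        elif sign < 0:
            discordant += 1   # the variables move in opposite directions
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


if __name__ == "__main__":
    print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))
```

For the four points in the usage line, five of the six pairs are concordant and one is discordant, giving tau = 4/6, whereas the Spearman formula gives 0.8 for the same data, illustrating that the two coefficients differ in value.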

The Fechner correlation coefficient is used, though to a lesser extent, to calculate the correlation between two monotonically related variables.


The correlation coefficients reviewed above impose certain assumptions on the variables in question. They work well if all their assumptions are met. However, if the data violate any of those assumptions, these coefficients do not give the correct interpretation of the relationship between the variables in question. Therefore, this paper presents a more general nonparametric correlation coefficient called g-Correlation, which places no restrictions on the variables being analyzed.

g-Correlation can also detect correlations that are neither functional nor monotonic. It can work better than the existing correlation coefficients for weaker correlations. It allows one to compute a coarse correlation between pairs of variables in situations where accurate correlations are not possible.

g-Correlation

Definition: For two random variables X and Y, first consider the line $y = \mu_y$, where $\mu_y$ is the median of the random variable Y, which divides the space of measurements into two classes [7]:

$$ C_1 := \{ (x, y) \in \mathbb{R}^2 : y \ge \mu_y \} \quad \text{and} \quad C_2 := \{ (x, y) \in \mathbb{R}^2 : y < \mu_y \} $$

A random variable X is called g-Correlated with another random variable Y if there exists a value $c \in \mathbb{R}$ such that the criterion $x_0 \ge c$ allows one to assign the coordinate point $(x_0, y_0)$ of the random vector (X, Y) to the classes $C_1$ and $C_2$ with an accuracy of more than 50%, i.e. if

$$ P(X \ge c, (X, Y) \in C_1) + P(X < c, (X, Y) \in C_2) \ne \frac{1}{2} $$

(If this sum is below 1/2, the reversed criterion $x_0 < c$ assigns points with an accuracy above 50%.) Otherwise, X is not g-Correlated with Y [7].
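The definition can be illustrated on a finite sample by replacing the probabilities with relative frequencies. The following sketch is only an empirical analogue under that assumption: it takes a candidate threshold c, uses the sample median as the estimate of the median of Y, and reports how often the criterion x ≥ c assigns a point to its correct class. The function name `g_accuracy` and the sample data are hypothetical.

```python
from statistics import median
from typing import Sequence


def g_accuracy(xs: Sequence[float], ys: Sequence[float], c: float, mu_y: float) -> float:
    """Share of points for which 'x >= c' predicts membership in C1 (y >= mu_y) correctly."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal length")
    hits = 0
    for x, y in zip(xs, ys):
        in_c1 = y >= mu_y          # true class of the point
        predicted_c1 = x >= c      # class suggested by the threshold on x
        if in_c1 == predicted_c1:
            hits += 1
    return hits / len(xs)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    ys = [2.0, 1.0, 3.0, 6.0, 5.0, 4.0]
    accuracy = g_accuracy(xs, ys, c=3.5, mu_y=median(ys))
    # An accuracy below 0.5 is as informative as one above it: reverse the criterion.
    print(max(accuracy, 1.0 - accuracy))
```

In this toy sample every point with x ≥ 3.5 lies on or above the median of y and every other point lies below it, so the threshold classifies all six points correctly.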

Properties of g-Correlation

1. If a random variable X is 100% g-Correlated with a random variable Y, then Y is 100% g-Correlated with X.

2. If Y = f(X) for a strictly monotonic function f, and if Y has a unique median $\mu_y$, then X is 100% g-Correlated with Y.
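The second property can be checked empirically on a small sample. The sketch below is a finite-sample illustration, not a proof: it tries the midpoints between successive x-values as thresholds, scores each one by the share of correctly classified points (or its complement, if the reversed criterion does better), and confirms that a strictly increasing and a strictly decreasing function both reach 100%. The function name `best_accuracy` and the test functions are illustrative.

```python
from statistics import median
from typing import Sequence


def best_accuracy(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Best classification accuracy over all midpoint thresholds on x."""
    mu_y = median(ys)
    sorted_x = sorted(xs)
    candidates = [(a + b) / 2.0 for a, b in zip(sorted_x, sorted_x[1:])]
    best = 0.0
    for c in candidates:
        hits = sum((x >= c) == (y >= mu_y) for x, y in zip(xs, ys))
        accuracy = hits / len(xs)
        # An accuracy below 50% means the reversed criterion x < c works instead.
        best = max(best, accuracy, 1.0 - accuracy)
    return best


if __name__ == "__main__":
    xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
    increasing = [x ** 3 for x in xs]          # strictly increasing f
    decreasing = [-2.0 * x + 1.0 for x in xs]  # strictly decreasing f
    print(best_accuracy(xs, increasing), best_accuracy(xs, decreasing))
```

Both cases reach 1.0 because a strictly monotonic f maps the x-values on one side of a suitable threshold exactly onto the y-values on one side of the median.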

Estimating g-Correlation

To estimate the g-Correlation coefficient, divide the given data set into two subsets: a training set T of size q and an evaluation set E of size (n - q). First, estimates for the separating lines $y = \mu_y$ and $x = c$, with an appropriate value of c, have to be computed using the training data set.

A value c that gives an optimal classification for the given set T of measurements with respect to the classes $C_1$ and $C_2$ can be computed using the following algorithm.

Step 1: Sort all pairs in the sequence $s = [(x_1, y_1), (x_2, y_2), \ldots, (x_q, y_q)]$ in ascending order with regard to the x-values.

Step 2: Examine the arithmetic means of the x-values of all successive pairs in s as candidates for c. Start this examination with the smallest candidate value and proceed successively to the largest.

Step 3: For the first candidate for c, count the number $p_1$ of pairs $(x_i, y_i)$ of s with $x_i < c$ and $y_i < \mu_y$, as well as the number $p_2$ of pairs with $x_i \ge c$ and $y_i \ge \mu_y$. For all other candidates, update the numbers $p_1$ and $p_2$ depending on whether the pairs belong to $C_1$ or $C_2$.

Step 4: Store the maximum classification percentage $\max\{p_1 + p_2,\; q - p_1 - p_2\}/q$ achieved for the training set T, along with the corresponding candidate for c.

Finally, the g-Correlation coefficient is approximated from the calculated values $\mu_y$ and c by applying the g-Correlation definition to the evaluation set E.
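The estimation procedure can be sketched as follows. This is a minimal illustration under several assumptions that the text does not spell out: the first q pairs of an already shuffled data set form the training set, the sample median of the training y-values estimates the median of Y, candidates between tied x-values are skipped, and the direction of the criterion (x ≥ c or its reverse) found on the training set is carried over to the evaluation set. The names `fit_threshold` and `g_correlation` are illustrative.

```python
from statistics import median
from typing import Optional, Sequence, Tuple

Pair = Tuple[float, float]


def fit_threshold(train: Sequence[Pair]) -> Tuple[float, float, bool, float]:
    """Return (mu_y, c, reversed_criterion, training accuracy) following Steps 1 to 4."""
    q = len(train)
    if q < 2:
        raise ValueError("need at least two training pairs")
    mu_y = median([y for _, y in train])
    s = sorted(train, key=lambda pair: pair[0])            # Step 1

    # Step 3, first candidate (between s[0] and s[1]): count directly.
    p1 = 1 if s[0][1] < mu_y else 0                         # x < c and y < mu_y
    p2 = sum(1 for _, y in s[1:] if y >= mu_y)              # x >= c and y >= mu_y

    best_acc, best_c, best_rev = -1.0, None, False          # type: float, Optional[float], bool
    for k in range(1, q):                                   # Step 2: candidates in ascending order
        if k > 1:
            # Pair s[k-1] has moved from the right of c to the left of c.
            if s[k - 1][1] < mu_y:
                p1 += 1
            else:
                p2 -= 1
        low, high = s[k - 1][0], s[k][0]
        if low == high:
            continue                                        # tied x-values give no usable split
        correct = p1 + p2
        accuracy = max(correct, q - correct) / q            # Step 4
        if accuracy > best_acc:
            best_acc, best_c, best_rev = accuracy, (low + high) / 2.0, correct < q - correct
    if best_c is None:
        raise ValueError("all x-values are identical")
    return mu_y, best_c, best_rev, best_acc


def g_correlation(pairs: Sequence[Pair], q: int) -> float:
    """Estimate the g-Correlation coefficient on the evaluation set pairs[q:]."""
    train, evaluation = pairs[:q], pairs[q:]
    if not evaluation:
        raise ValueError("evaluation set is empty")
    mu_y, c, reversed_criterion, _ = fit_threshold(train)
    hits = 0
    for x, y in evaluation:
        predicted_c1 = (x >= c) != reversed_criterion
        if predicted_c1 == (y >= mu_y):
            hits += 1
    return hits / len(evaluation)


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (4.0, 6.0), (5.0, 5.0), (6.0, 4.0),
            (1.5, 1.5), (5.5, 5.5), (2.5, 2.5), (4.5, 4.5)]
    print(g_correlation(data, q=6))
```

Updating $p_1$ and $p_2$ incrementally, instead of recounting for every candidate, keeps the search linear in q after the initial sort.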

Examples


Figure 2: Scatter plot to demonstrate correlation between crime rate and low status [11]

Example 1: This example describes the relationship between the per capita crime rate by town and the percentage of the population with lower status.


Figure 3: Scatter plot to demonstrate correlation between price and quantity bought

Example 2: This example shows the relationship between the price of a commodity and the quantity bought. Here, as the price of the commodity increases over the course of a year, its sales increase as well, despite the higher price. This happens due to seasonal requirements: the commodity represents goods required during a particular period of the year. For example, on the east coast of the USA, sales of winter clothes are higher during the second half of the year, and hence the price of winter clothes is high during that time of the year.


In Example 1 (see Figure 2), the correlations between crime rate and low status are:

Pearson correlation coefficient = 0.40

Spearman correlation coefficient = 0.44

g-Correlation coefficient = 75%

In Example 2 (see Figure 3), the correlations between price and quantity bought (a hypothetical situation) are:

Pearson correlation coefficient = 0.45

Spearman correlation coefficient = 0.26

g-Correlation coefficient = 79.5%

The results above show that for some data sets where the Pearson and Spearman correlation coefficients fail to indicate the considerable correlation that exists between a pair of variables, g-Correlation can give a clearer and better understanding of the relationship. Correlations like Pearson and Spearman can fail to give accurate or near-accurate results for data sets representing seasonal variation, whereas g-Correlation can give a good estimate of the relationship between such variables, as seen in Example 2.
