Abstract Number: 003-0287
Title: g-Correlation – A Six Sigma tool to discover the correlation between variables
Name of the Conference: Sixteenth Annual Conference of POMS, Chicago, IL, April 29 - May 2, 2005
Authors:
Author 1
Name : Sarika Tyagi
Institution : Laboratory for Responsible Manufacturing
Address : 334 SN, Department of MIE
Northeastern University
360 Huntington Ave, Boston, MA 02115, USA
E-mail : sarikatyagi@gmail.com
Phone : 617-373-7635
Fax : 617-373-2921
Author 2
Name : Prof. Sagar Kamarthi
Institution : Laboratory for Responsible Manufacturing
Address : 334 SN, Department of MIE
Northeastern University
360 Huntington Ave, Boston, MA 02115, USA
E-mail : sagar@coe.neu.edu
Phone : 617-373-3070
Fax : 617-373-2921
Author 3
Name : Srikanth Vadde
Institution : Laboratory for Responsible Manufacturing
Address : 334 SN, Department of MIE
Northeastern University
360 Huntington Ave, Boston, MA 02115, USA
E-mail : srikant@coe.neu.edu
Phone : 617-373-7635
Fax : 617-373-2921
Abstract
In Six Sigma it is important to discover the interconnections between the variables involved in the problem being examined in the Analyze phase of the DMAIC methodology. Understanding the strength of the correlation between pairs of variables helps quality personnel decide how much attention each factor needs in order to improve the sigma level of the process.
Normally, the Pearson correlation coefficient (which assumes the variables to be normally distributed or to have a linear/parametric relationship) and the Spearman correlation coefficient (which deals with variables that are measured on an ordinal scale or that have a non-parametric relationship) are employed to determine the correlation between a given pair of variables. Similarly, the Fechner correlation coefficient is also used, to a lesser extent, to calculate the correlation between two monotonically related variables. This paper presents a novel correlation called g-Correlation, which is more general and less restrictive. The effectiveness of g-Correlation is demonstrated on two representative examples.
What is Six Sigma?
The basic foundation of Six Sigma rests on the concept of the normal curve, introduced by Carl Friedrich Gauss in the 1800s [1]. Six Sigma is a thoroughly well-structured, data-driven methodology for process improvement. It aims to eliminate a broad range of defects, quality problems, and anything that does not add value [2]. It also aims at improving customer satisfaction by considering customers' needs and requirements while maintaining profitability and competitiveness. It is widely used in a variety of fields: manufacturing, service delivery, management, and other business operations.
Six Sigma methodologies integrate principles of business, statistics, and engineering to achieve tangible results. The methodology combines well-established statistical quality control techniques, simple and advanced data analysis methods, and the systematic training of all personnel at every level in the organization. The Six Sigma methodology and management strategies together provide an overall framework for organizing company-wide quality control efforts. There are numerous Six Sigma success stories in US-based as well as international companies.
Six Sigma is not defined with a single view in mind [3]. It is perceived as a (1) vision to improve existing processes or to design completely new processes with customers' requirements in mind; (2) philosophy that emphasizes reduction in variation, customer focus, and data-driven decisions; (3) symbol for quality improvement; (4) metric that is derived from data collection and data analysis; (5) goal of achieving a Six Sigma process that will have fewer than 4 defects per million opportunities; and (6) methodology for structured problem solving.
Statistical Explanation: Six Sigma
Six Sigma is a statistical measure [4] used to objectively evaluate processes. At many organizations Six Sigma simply means a measure of quality that strives for near perfection, but the statistical implications of a Six Sigma program go well beyond the qualitative eradication of customer-perceptible defects. It is a methodology that is well rooted in mathematics and statistics. The graph in Figure 1 shows the percentage of variation expected to fall within the allowable specifications for processes at different sigma levels. For example, a 1σ process will have (34.13% + 34.13% = 68.26%) of events within the allowable specifications.
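For reference (not part of the original paper), the percentages in Figure 1 can be reproduced from the standard normal CDF; the helper function below is a hypothetical name chosen for illustration.

```python
from scipy.stats import norm

def fraction_within_k_sigma(k: float) -> float:
    """Fraction of a normally distributed output lying within +/- k sigma of the mean."""
    return norm.cdf(k) - norm.cdf(-k)

for k in range(1, 7):
    print(f"{k} sigma: {fraction_within_k_sigma(k) * 100:.4f}% within specification")
# 1 sigma -> ~68.27%, matching the 34.13% + 34.13% split shown in Figure 1.
```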
Figure 1: Statistical explanation of Six Sigma [4]
Processes are prone to being influenced by special and/or assignable causes that impact the overall performance of the process relative to the customers' specifications. This means the long-term performance of a Six Sigma process will always be less than 6σ. The difference between the short-term process capability and the long-term process capability is known as the "shift" and is denoted Zshift. For a typical process, the value of the shift is 1.5. Therefore, the statistical term "Six Sigma" indicates the short-term capability of the process to be
6σ and the long-term capability to be 4.5σ. As the process sigma value increases from zero to six, the variation of the process around the mean value decreases.
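To make the 1.5σ shift concrete, the sketch below (an illustrative calculation, not from the paper) estimates defects per million opportunities (DPMO) using a one-sided tail of the standard normal distribution after subtracting the shift from the short-term sigma level.

```python
from scipy.stats import norm

def dpmo(short_term_sigma: float, shift: float = 1.5) -> float:
    """Defects per million opportunities after applying the long-term shift
    (one-sided tail beyond the shifted sigma level)."""
    long_term_z = short_term_sigma - shift
    return (1.0 - norm.cdf(long_term_z)) * 1_000_000

print(f"6 sigma short-term: {dpmo(6.0):.1f} DPMO")   # roughly 3.4 defects per million
print(f"3 sigma short-term: {dpmo(3.0):.0f} DPMO")   # roughly 66,800 defects per million
```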
DMAIC Process
Six Sigma implementation follows a well-known methodology known as DMAIC (Define, Measure, Analyze, Improve, and Control). It is an improvement system for existing processes falling below specification limits that need incremental improvement. This methodology should be used when a product or process already exists but either has not been meeting customer specifications or has not been performing adequately. Each step in the cyclical DMAIC process is required to ensure the best possible results [6].
The Define phase is concerned with the definition of project goals, the involvement of core business processes, the identification of customers' expectations, and the identification of issues that need to be addressed to achieve a higher sigma level. The tools used in this step include input-process-output diagrams, process flow maps, and quality function deployment.
The Measure phase measures the current process performance by gathering information (baseline data) about the current process and identifies the problem areas by calculating the difference between the current and the expected process performance. Usually data collection plans/forms and sampling techniques are employed at this stage.
The Analyze phase identifies the root causes of quality problems and confirms those causes using appropriate data analysis tools. Cause-and-effect diagrams and failure mode and effects analysis are used for this purpose.
The Improve phase implements solutions that address the root causes identified during the Analyze phase to improve the target processes. Brainstorming, mistake proofing, and design of experiments are some of the techniques used in this phase.
The Control phase evaluates and monitors the results at defined intervals of time. Control charts and time series methods are generally used in this phase.
Correlation Methods
After identifying the root causes (the variables that come into play), the Analyze phase of Six Sigma attempts to discover correlations that measure the strength or closeness of the relationship between pairs of variables. These correlations can then be used for prioritizing and eliminating the causes of an effect.
There are several correlation measures [7] used in practice: parametric correlation analysis, such as the Pearson correlation coefficient, and non-parametric correlation analysis, such as the Spearman correlation coefficient, Kendall's tau correlation, and the Fechner correlation coefficient.
The Pearson correlation coefficient (r_p) is a parametric statistic which has the following underlying assumptions:
- The data set is a randomly collected sample.
- Both variables are either interval or ratio scaled.
- Both variables are more or less normally distributed.
- The relationship between the two variables is linear.
The Pearson correlation coefficient is given by

$$r_p = \frac{\mathrm{Cov}(x_1, x_2)}{S_{x_1} S_{x_2}}$$

where the numerator expresses the extent to which x_1 and x_2 covary, and the denominator S_{x_1} S_{x_2} is the product of their standard deviations.
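A minimal sketch (not taken from the paper) of computing r_p directly from this formula with NumPy's sample covariance and standard deviations; the data values are arbitrary illustrations.

```python
import numpy as np

def pearson_r(x1: np.ndarray, x2: np.ndarray) -> float:
    """Pearson correlation: sample covariance divided by the product of sample standard deviations."""
    cov = np.cov(x1, x2, ddof=1)[0, 1]
    return cov / (np.std(x1, ddof=1) * np.std(x2, ddof=1))

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(pearson_r(x1, x2))   # close to +1 for this nearly linear pair
```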
The Spearman correlation coefficient (r_s) is a nonparametric statistic and is also known as a rank correlation. Its underlying assumption is that the variables under consideration should be measured on an ordinal (rank-order) scale or that the individual observations can be ranked into two ordered series. In other words, the variables should have a monotonic relationship.
The Spearman correlation coefficient is given by

$$r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$$

where r_s is the Spearman correlation coefficient, N is the number of pairs, and D is the difference between the ranks of each pair.
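A small sketch (again illustrative, not from the paper) of the rank-difference formula above; scipy.stats.rankdata ranks the observations, and scipy.stats.spearmanr provides a reference value.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_r(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman correlation via the rank-difference formula (assumes no ties)."""
    d = rankdata(x) - rankdata(y)    # D: rank difference for each pair
    n = len(x)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.2, 0.9, 3.5, 2.8, 7.1])
print(spearman_r(x, y), spearmanr(x, y)[0])   # both report the rank correlation
```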
The Kendall tau correlation is equivalent to the Spearman correlation coefficient with regard to the underlying assumptions. However, the two will give different values because their underlying logic as well as their computational formulas are different. Kendall's correlation can be interpreted as a probability (of concordance between pairs), whereas the Spearman correlation cannot.
The Fechner correlation coefficient is used, though to a lesser extent, to calculate the correlation between two monotonically related variables.
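As a point of comparison, one common formulation of the Fechner (sign) correlation counts how often the two variables deviate from their means in the same direction; this particular formulation is an assumption on our part, not a definition given in the paper.

```python
import numpy as np

def fechner_r(x: np.ndarray, y: np.ndarray) -> float:
    """Fechner (sign) correlation: (matching signs - mismatching signs) / n,
    where signs are taken for deviations from each variable's mean."""
    sx = np.sign(x - x.mean())
    sy = np.sign(y - y.mean())
    matches = np.sum(sx == sy)
    return (2.0 * matches - len(x)) / len(x)

x = np.array([1.0, 2.0, 3.5, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 5.0, 7.0])
print(fechner_r(x, y))   # +1 here, since every pair of deviations has matching signs
```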
The correlation coefficients reviewed above impose certain assumptions on the variables in question. These existing correlations work well if all their assumptions are met. However, if the data does not meet those assumptions, the coefficients do not give the correct relational interpretation between the variables under question. Therefore this paper presents a more general non-parametric correlation coefficient, called g-Correlation, which imposes no restrictions on the variables being analyzed.
g-Correlation can also detect correlations which are neither functional nor monotonic. It can work better than the existing correlation coefficients when it comes to weaker correlations. It allows one to compute a coarse correlation between pairs of variables in situations where accurate correlations are not possible.
g-Correlation
Definition: For two random variables X and Y, first consider the line y = μ_y, where μ_y is the median of the random variable Y, to divide the space of measurements into the two classes [7]:

$$C_1 := \{(x, y) \in \mathbb{R}^2 : y \geq \mu_y\} \quad \text{and} \quad C_2 := \{(x, y) \in \mathbb{R}^2 : y < \mu_y\}$$
A random variable X is called g-Correlated with another random variable Y if there exists a value c ∈ ℝ such that the criterion x_0 ≥ c allows one to assign a coordinate point (x_0, y_0) of the random vector (X, Y) to the classes C_1 and C_2 with an accuracy of more than 50%, i.e., if

$$P(X \geq c, (X, Y) \in C_1) + P(X < c, (X, Y) \in C_2) \neq \tfrac{1}{2}$$

Otherwise, X is not g-Correlated with Y [7].
Properties of g-Correlation
1. If a random variable X is 100% g-Correlated with a random variable Y, then Y is 100% g-Correlated with X.
2. If Y = f(X) for a strictly monotonic function f and if Y has a unique median μ_y, then X is 100% g-Correlated with Y.
Estimating g-Correlation
To estimate the g-Correlation coefficient, divide the given data set into two subsets: a training set T of size q and an evaluation set E of size (n − q). First, estimates for the separating lines y = μ_y and x = c, with an appropriate value of c, have to be computed using the training data set.
A value c which gives an optimal classification of the given set T of measurements with respect to the classes C_1 and C_2 can be computed using the following algorithm.
Step 1: Sort all pairs in the sequence s = [(x_1, y_1), (x_2, y_2), …, (x_q, y_q)] in ascending order with regard to the x-values.
Step 2: Examine the arithmetic means of the x-values of all successive pairs in s as candidates for c. Start this examination with the smallest candidate value of c and proceed successively to the highest value.
Step 3: For the first candidate for c, count the number p_1 of pairs (x_i, y_i) of s with x_i < c and y_i < μ_y, as well as the number p_2 of pairs with x_i ≥ c and y_i ≥ μ_y. For all the other candidates, update the numbers p_1 and p_2 depending upon whether the pairs belong to C_1 or C_2.
Step 4: Store the maximum classification percentage $\max\{p_1 + p_2,\, q - p_1 - p_2\}/q$ achieved for the training set T along with the corresponding candidate for c.
Finally, the g-Correlation coefficient is approximated for the evaluation set E, based on the calculated values μ_y and c, using the g-Correlation definition.
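The following Python sketch is one way to implement this estimation procedure (an illustrative implementation under our own assumptions, not the authors' code). It follows the four steps literally instead of using the incremental p_1/p_2 updates, and it reports the classification accuracy of the chosen split on the evaluation set E as the g-Correlation estimate.

```python
import numpy as np

def g_correlation(x: np.ndarray, y: np.ndarray, train_fraction: float = 0.5) -> float:
    """Estimate the g-Correlation of X with Y: the best classification accuracy
    achievable with a single split x = c against C1 = {y >= median(y)} and
    C2 = {y < median(y)}, measured on a held-out evaluation set."""
    n = len(x)
    q = int(train_fraction * n)
    x_train, y_train = x[:q], y[:q]      # training set T of size q
    x_eval, y_eval = x[q:], y[q:]        # evaluation set E of size n - q

    mu_y = np.median(y_train)            # separating line y = mu_y

    # Step 1: sort the training pairs in ascending order of x.
    order = np.argsort(x_train)
    xs, ys = x_train[order], y_train[order]

    # Step 2: candidate values of c are midpoints of successive x-values.
    candidates = (xs[:-1] + xs[1:]) / 2.0

    best_acc, best_c = 0.0, candidates[0]
    for c in candidates:
        # Step 3: count pairs classified consistently by the rule "x >= c -> C1".
        p1 = np.sum((xs < c) & (ys < mu_y))
        p2 = np.sum((xs >= c) & (ys >= mu_y))
        # Step 4: keep the better of the rule and its reversal.
        acc = max(p1 + p2, q - p1 - p2) / q
        if acc > best_acc:
            best_acc, best_c = acc, c

    # Apply the chosen split to the evaluation set E to approximate the g-Correlation.
    hits = (np.sum((x_eval >= best_c) & (y_eval >= mu_y))
            + np.sum((x_eval < best_c) & (y_eval < mu_y)))
    return max(hits, len(x_eval) - hits) / len(x_eval)

# Usage: a non-monotonic (quadratic) relationship that Pearson r barely detects.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 400)
y = x ** 2 + rng.normal(0.0, 0.5, 400)
print(g_correlation(x, y))   # noticeably above 0.5, unlike the near-zero Pearson r
```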
Examples
Figure 2: Scatter plot to demonstrate the correlation between crime rate and lower status [11]
Example 1: This example describes the relationship between the per capita crime rate by town and the percentage of the population of lower status.
Figure 3: Scatter plot to demonstrate the correlation between price and quantity bought
Example 2: This example shows the relationship between the price of a commodity and the quantity bought. Here, even as the price of the commodity increases over the course of a year, the quantity sold also increases. This happens due to seasonal requirements: the commodity represents goods required in a particular period of the year. For example, on the east coast of the USA, sales of winter clothes are higher during the second half of the year, and hence the price of winter clothes is high during that time of the year.
In Example 1 [see Figure 2], the various correlations between crime rate and lower status are:
Pearson correlation coefficient = 0.40
Spearman correlation coefficient = 0.44
g-Correlation coefficient = 75%
In Example 2 [see Figure 3], the various correlations between price and quantity bought (a hypothetical situation) are:
Pearson correlation coefficient = 0.45
Spearman correlation coefficient = 0.26
g-Correlation coefficient = 79.5%
The results above show that for data sets where the Pearson and Spearman correlation coefficients fail to indicate the considerable correlation that exists between a pair of variables, g-Correlation can give a clearer and better understanding of the relationship. Correlations like Pearson and Spearman can fail to give accurate or nearly accurate results for data sets representing seasonal variation, whereas g-Correlation can give a good estimate of the relationship between such variables, as seen in Example 2.