
Data Analysis Machine Learning and Applications Episode 1 Part 9 doc


Factorial Analysis of a Set of Contingency Tables 221

As a result, SA proceeds by performing a principal component analysis (PCA) of the matrix X,

$$X = \left[\sqrt{\alpha_1}\,X_1 \;\ldots\; \sqrt{\alpha_t}\,X_t \;\ldots\; \sqrt{\alpha_T}\,X_T\right].$$

The PCA results are also obtained using the SVD of X, giving singular values $\sqrt{\lambda_s}$ on the s-th dimension and corresponding left and right singular vectors $u_s$ and $v_s$.

We calculate projections on the s-th axis of the columns as principal coordinates $g_s$,

$$g_s = \sqrt{\lambda_s}\, D_c^{-1/2} v_s,$$

where $D_c$ (J × J) is a diagonal matrix of all the column masses, that is, all the $D_c^t$.
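The SVD step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' software: each $X_t$ is taken to be the usual CA matrix of standardized residuals computed from the within-table relative frequencies, and the table weight is assumed to be $\alpha_t = 1/\lambda_1^t$ (the inverse of the first eigenvalue of the separate CA). The function names `ca_matrix` and `simultaneous_analysis` are hypothetical.

```python
import numpy as np

def ca_matrix(table):
    """CA matrix of one contingency table: standardized residuals (assumed form of X_t)."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()                       # relative frequencies within the table
    r = P.sum(axis=1)                     # row masses p_i.
    c = P.sum(axis=0)                     # column masses p_.j
    X = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return X, r, c

def simultaneous_analysis(tables):
    """SVD of the row-wise concatenation of the weighted CA matrices."""
    blocks, col_masses = [], []
    for table in tables:
        X_t, _, c_t = ca_matrix(table)
        # First eigenvalue of the separate CA of table t
        lam1_t = np.linalg.svd(X_t, compute_uv=False)[0] ** 2
        alpha_t = 1.0 / lam1_t            # assumed table weight
        blocks.append(np.sqrt(alpha_t) * X_t)
        col_masses.append(c_t)
    X = np.hstack(blocks)                 # tables share the same rows
    U, sv, Vt = np.linalg.svd(X, full_matrices=False)
    lam = sv ** 2                         # eigenvalues lambda_s
    d_c = np.concatenate(col_masses)      # diagonal of D_c
    # Column principal coordinates: g_s = sqrt(lambda_s) * D_c^{-1/2} v_s
    G = (Vt.T * sv) / np.sqrt(d_c)[:, None]
    return lam, U, G
```

With this weighting every table has a first eigenvalue of 1 after rescaling, so the first eigenvalue of the joint analysis lies between 1 and the number of tables T.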

One of the aims of the joint analysis of several data tables is to compare them through the points corresponding to the same row in the different tables. These points will be called partial rows and denoted by $i^t$.

The projection on the s-th axis of each partial row is denoted by $f^t_{is}$ and the vector of projections of all the partial rows for table t is denoted by $f^t_s$. The choice of this matrix $D_w$ allows us to expand the projections of the (overall) rows to keep them inside the corresponding set of projections of partial rows, and is appropriate when the partial rows have different weights in the tables. With this weighting the projections of the overall and partial rows are related as follows:

$$f_{is} = \sum_{t \in T} \frac{\sqrt{p^t_{i.}}}{\sum_{t' \in T} \sqrt{p^{t'}_{i.}}}\, f^t_{is}.$$

So the projection of a row is a weighted average of the projections of partial rows. It is closer to those partial rows that are more similar to the overall row in terms of the relation expressed by the axis and have a greater weight than the rest of the partial rows. The dispersal of the projections of the partial rows with regard to the projection of their (overall) row indicates discrepancies between the same row in the different tables.

Notice that if $p^t_{i.}$ is equal in all the tables then $f_s = (1/T) \sum_{t \in T} f^t_s$, that is, the overall row is projected as the average of the projections of the partial rows.
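The weighted average relating overall and partial rows can be checked numerically. The following is a small sketch; the function name `overall_row_projection` and the array layout (tables along the first axis, rows along the second) are assumptions made for the illustration.

```python
import numpy as np

def overall_row_projection(partial_proj, row_masses):
    """Overall row projections f_is on one axis s.

    partial_proj[t, i] holds f^t_is and row_masses[t, i] holds p^t_i. .
    Each partial row is weighted by sqrt(p^t_i.), normalized over the tables.
    """
    f_t = np.asarray(partial_proj, dtype=float)
    w = np.sqrt(np.asarray(row_masses, dtype=float))
    w = w / w.sum(axis=0, keepdims=True)   # weights sum to 1 over the tables
    return (w * f_t).sum(axis=0)
```

When the row masses are equal in all the tables the weights reduce to 1/T, so the function returns the plain average of the partial projections, as stated above.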

Interpretation rules for simultaneous analysis

In SA the transition relations between projections of different points create a simultaneous representation that provides more detailed knowledge of the matter being studied.

Relation between $f^t_{is}$ and $g_{js}$: The projection of a partial row on axis s depends on the projections of the columns:

$$f^t_{is} = \frac{\sqrt{\alpha_t}}{\sqrt{\lambda_s}} \sum_{j \in J_t} \frac{p^t_{ij}}{p^t_{i.}}\, g_{js}.$$


222 Amaya Zárraga and Beatriz Goitisolo

Except for the factor $\sqrt{\alpha_t}/\sqrt{\lambda_s}$, the projection of a partial row on axis s is, as in CA, the centroid of the projections of the columns of table t.

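Relation between $f_{is}$ and $g_{js}$: The projection of an overall row on axis s may be expressed in terms of the projections of the columns as follows:

$$f_{is} = \sum_{t \in T} \frac{\sqrt{\alpha_t}}{\sqrt{\lambda_s}}\, \frac{\sqrt{p^t_{i.}}}{\sum_{t' \in T} \sqrt{p^{t'}_{i.}}} \sum_{j \in J_t} \frac{p^t_{ij}}{p^t_{i.}}\, g_{js}.$$

The projection of the row is therefore, except for the coefficients $\sqrt{\alpha_t}/\sqrt{\lambda_s}$, the weighted average of the centroids of the projections of the columns for each table.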

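Relation between $g_{js}$ and $f_{is}$ or $f^t_{is}$: The projection on the axis s of the column j for table t can be expressed in the following way:

$$g_{js} = \frac{\sqrt{\alpha_t}}{\sqrt{\lambda_s}} \sum_{i \in I} \left(\sum_{t' \in T} \sqrt{p^{t'}_{i.}}\right) \sqrt{p^t_{i.}}\; \frac{p^t_{ij} - p^t_{i.}\,p^t_{.j}}{p^t_{i.}\,p^t_{.j}}\; f_{is}
= \frac{\sqrt{\alpha_t}}{\sqrt{\lambda_s}} \sum_{i \in I} \sqrt{p^t_{i.}}\; \frac{p^t_{ij} - p^t_{i.}\,p^t_{.j}}{p^t_{i.}\,p^t_{.j}} \sum_{t' \in T} \sqrt{p^{t'}_{i.}}\, f^{t'}_{is}.$$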



The same aids to interpretation are available in SA as in standard factorial analysis as regards the contribution of points to principal axes and the quality of display of a point on axis s.

2.3 Stage three: comparison of the tables: interstructure

In order to compare the different tables, SA allows us to represent each of them by means of a point and to project them on the axes.

The coordinate of table t on axis s, $f_{ts}$, represents the projected inertia of the table on the axis and, therefore, indicates the importance of the table in the determination of the axis. Thus,

$$f_{ts} = \sum_{j \in J_t} p^t_{.j}\, g_{js}^2 = \mathrm{Inertia}_s(t),$$

where $\mathrm{Inertia}_s(t)$ represents the projected inertia of the columns of the table t on the axis s.

Due to the weighting of the tables chosen by SA, the maximum value of this inertia on the first axis is 1. A value of $f_{ts}$ close to 0 would indicate orthogonality between the first axes of the separate analyses with regard to the Simultaneous Analysis. A value of $f_{ts}$ close to 1 would indicate that the axis of the joint analysis is approximately the same as in the separate analysis of each table. So, if all the tables present a coordinate close to the maximum value, 1, on the first factorial axis of the SA, the projected inertia onto it is approximately T, the number of tables, and this confirms that this first direction is accurately depicting the relevant associations of each table.
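As a small worked illustration of the table coordinate, the projected inertia of one table on one axis is a mass-weighted sum of squared column coordinates. The function name `table_coordinate` and the numbers used to exercise it are hypothetical.

```python
import numpy as np

def table_coordinate(col_masses_t, g_t):
    """f_ts = sum over the columns j of table t of p^t_.j * g_js**2, for one axis s."""
    p = np.asarray(col_masses_t, dtype=float)   # column masses p^t_.j of table t
    g = np.asarray(g_t, dtype=float)            # column coordinates g_js on axis s
    return float(np.sum(p * g ** 2))
```

Summing this quantity over all the tables gives the total inertia projected on the axis, which is why values close to 1 for every table lead to a first-axis inertia close to T.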


2.4 Relations between factors of the analyses

In SA it is also possible to calculate the following measurements of the relation between the factors of the different analyses.

Relation between factors of the individual analyses: The correlation coefficient can be used to measure the degree of similarity between the factors of the separate CA of different tables. This is possible when the marginals $p^t$

The relation between the factors s and s' of the tables t and t' respectively would be calculated as:

$$r(f_{st}, f_{s't'}) = \sum_{i \in I} \frac{f_{ist}}{\sqrt{\lambda^t_s}}\, \sqrt{p^t_{i.}}\, \sqrt{p^{t'}_{i.}}\; \frac{f_{is't'}}{\sqrt{\lambda^{t'}_{s'}}},$$

where $f_{ist}$ and $f_{is't'}$ are the projections on the axes s and s' of the separate CA of the tables t and t' respectively and where $\lambda^t_s$ and $\lambda^{t'}_{s'}$ are the inertias associated with these axes. This measurement allows us to verify whether the factors of the separate analyses are similar and check the possible rotations that occur.

Relation between factors of the SA and factors of the separate analyses: Likewise, it is possible to calculate, for each factor s of the SA, the relation with each of the factors s' of the separate analyses of the different tables:

3 Application

In this section we apply SA to the data taken from an on-line survey drawn up by the Spanish Ministry of Education and Science, from January to March 2006, to Spanish students who participate in the Erasmus program in European universities.

This application presents a comparative study for Spanish students, according to gender, of the relationships between the countries that they choose as destination to carry out the university interchange in the Erasmus program and the scientific fields in which they are studying.


The 15 countries that they choose as destination are Austria, Belgium, Czech Republic, Denmark, Finland, France, Germany, Ireland, Italy, Netherlands, Norway, Poland, Portugal, Sweden and United Kingdom. The scientific fields in which they are studying are: Social and Legal Sciences, Engineering and Technology, Humanities, Health Science and Experimental Science.

Therefore, we have two data tables whose rows (countries) and columns (scientific fields) correspond to the same modalities but refer to two different sets of individuals, depending on their gender. In these tables both the marginals and the grand-totals are different. This fact suggests analyzing the tables by SA since the results of applying other methods can be affected by the above mentioned differences (Zárraga and Goitisolo (2002)).

The first factorial plane of SA (figure 1) explains nearly 60% of total inertia. In the plane we observe that male and female students of Humanities, Health Science and especially Engineering and Technology behave similarly in the choice of the destination country for their studies, whereas students of Social and Legal Sciences and of Experimental Science choose different destination countries depending on their gender.

The plane shows that students of Humanities, both male and female, choose the United Kingdom as destination country, followed by Ireland. The countries chosen as destination by students of both genders in Engineering and Technology are mainly Austria, Sweden and Denmark. Finally, male and female students of Health Science prefer Portugal and Finland.

The students of Experimental Science select different countries for the interchange depending on their gender. While male students go mainly to Portugal and the Netherlands, females go to Norway.

Students of Social and Legal Sciences also behave differently. The Netherlands and Ireland are selected as destination country by both males and females, but males also go to Belgium, the United Kingdom and Italy while females go to Norway and Sweden.

The projection of the partial rows of each table, joined by segments, allows us to appreciate the differences between males and females in each destination country. We will only remark on some of them.

For example, the United Kingdom is a country to which male and female students go in a greater proportion among the students of Humanities. Nevertheless, males also choose the United Kingdom to carry out Social and Legal studies whereas females do not.

Male and female students who go to Portugal agree in selecting this country above the average for Health degrees. But males also go to Portugal to study Experimental Science while females prefer this country for studies of Engineering and Technology.

Spanish students who go to Finland share the selection of this country over the rest of the countries to study in the areas of Health and Engineering, but there are more females in the former area and more males in the latter.


Fig. 1. Projection of columns, overall rows and partial rows

On the other hand, no big differences between males and females are found in Germany, France, Belgium and Norway, as indicated by the close projections of overall and partial rows.

As a conclusion of this application we can say that Simultaneous Analysis allows us to show the common structure inside each table as well as the differences in the structure of both tables. A more extensive application to the joint study of the inter- and intra-structure of a larger number of contingency tables can be found in Zárraga and Goitisolo (2006).

4 Discussion

The joint study of several data tables has given rise to an extensive list of factorial methods, some of which have been gathered by Cazes (2004), for both quantitative and categorical data tables. In the correspondence analysis (CA) approach Cazes shows the similarity between some methods in the case of proportional row margins and shows the problem that arises in a joint analysis when the row margins are different or not proportional.

Comments on the appropriateness of SA and a comparison with different methods, especially with Multiple Factor Analysis for Contingency Tables (Pagès and Bécue-Bertaut (2006)), in the cases where row margins are equal, proportional and not proportional between the tables can be found in Zárraga and Goitisolo (2006).

5 Software notes

Software for performing Simultaneous Analysis, written in S-Plus 2000, can be found in Goitisolo (2002). The AnSimult package for R can be obtained from the authors.

References

CAZES, P. (1980): L'analyse de certains tableaux rectangulaires décomposés en blocs: généralisation des propriétés rencontrées dans l'étude des correspondances multiples. I. Définitions et applications à l'analyse canonique des variables qualitatives. Les Cahiers de l'Analyse des Données, V, 2, 145–161.

CAZES, P. (1981): L'analyse de certains tableaux rectangulaires décomposés en blocs: généralisation des propriétés rencontrées dans l'étude des correspondances multiples. IV. Cas modèles. Les Cahiers de l'Analyse des Données, VI, 2, 135–143.

CAZES, P. (1982): Note sur les éléments supplémentaires en analyse des correspondances. II. Tableaux multiples. Les Cahiers de l'Analyse des Données, VII, 133–154.

CAZES, P. (2004): Quelques méthodes d'analyse factorielle d'une série de tableaux de données. La Revue de Modulad, 31, 1–31.

D'AMBRA, L. and LAURO, N. (1989): Non symmetrical analysis of three-way contingency tables. Multiway Data Analysis, 301–315.

ESCOFIER, B. (1983): Généralisation de l'analyse des correspondances à la comparaison de tableaux de fréquence. INRIA, Mai, 207, 1–33.

ESCOFIER, B. and PAGÈS, J. (1988 (1998, 3e édition)): Analyses Factorielles Simples et Multiples. Objectifs, méthodes et interprétation. Dunod, Paris.

GOITISOLO, B. (2002): El análisis simultáneo. Propuesta y aplicación de un nuevo método de análisis factorial de tablas de contingencia. PhD Thesis. Basque Country University Press, Bilbao, Spain.

LAURO, N. and D'AMBRA, L. (1984): L'Analyse non symétrique des correspondances. Data Analysis and Informatics, III, 433–446.

MÉOT, A. and LECLERC, B. (1997): Voisinages a priori et analyses factorielles: Illustration dans le cas de proximités géographiques. Revue de Statistique Appliquée, XLV, 25–44.

PAGÈS, J. and BÉCUE-BERTAUT, M. (2006): Multiple Factor Analysis for Contingency Tables. In: M. Greenacre and J. Blasius (Eds.): Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC, 299–326.

ZÁRRAGA, A. and GOITISOLO, B. (2002): Méthode factorielle pour l'analyse simultanée de tableaux de contingence. Revue de Statistique Appliquée, L(2), 47–70.

ZÁRRAGA, A. and GOITISOLO, B. (2003): Étude de la structure Inter-tableaux à travers l'Analyse Simultanée. Revue de Statistique Appliquée, LI(3), 39–60.

ZÁRRAGA, A. and GOITISOLO, B. (2006): Simultaneous Analysis: A Joint Study of Several Contingency Tables with Different Margins. In: M. Greenacre and J. Blasius (Eds.): Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC, 327–350.


Non Parametric Control Chart by Multivariate Additive Partial Least Squares via Spline

Rosaria Lombardo1, Amalia Vanacore2 and Jean-François Durand3

1 Faculty of Economics, Second University of Naples, Italy

Abstract. A statistical process control (SPC) chart is aimed at monitoring a process over time in order to detect any special event that may occur and find assignable causes for it. Controlling both product quality variables and process variables is a complex problem. Multivariate methods make it possible to treat all the data simultaneously, extracting information on the "directionality" of the process variation. Highlighting the dependence relationships between process variables and product quality variables, we propose the construction of a non-parametric chart, based on Multivariate Additive Partial Least Squares Splines; proper control limits are built by applying the Bootstrap approach.

1 Introduction

The multivariate nature of product quality (response or output variables) and process characteristics (predictors or input variables) highlights the limits of any analysis based exclusively on descriptive and univariate statistics. On the other hand, the possibility for process managers of extracting knowledge from large databases opens the way to analyzing the multivariate dependence relationships between product quality and process variables via predictive and regressive techniques like PLS (Tenenhaus, 1998; Wold, 1966) and its generalizations (Durand, 2001; Lombardo et al., 2007). In this paper, the application of a multivariate control chart based on a generalization of the PLS-T² chart (Kourti and MacGregor, 1996) is proposed in order to analyze the in-control process and monitor it over time. Furthermore, in order to face the problem of the unknown distribution of the statistic to be charted, a non-parametric approach is applied for the selection of the control limits. Distribution-free or non-parametric control charts have been proposed in the literature to overcome the problems related to the lack of normality in process data. An overview of the literature on univariate non-parametric control charts is given by Chakraborti et al. (2001). The principles on which non-parametric control charts rest can be generalized to multivariate settings. In particular, the bootstrap approach to estimate control limits (Wu and Wang, 1997; Jones and Woodall, 1998; Liu and Tang, 1996) has been followed.

2 Multivariate control charts based on projection methods

A standard multivariate quality control problem occurs when an observed vector of measurements on quality characteristics exhibits a significant shift from a set of target (or standard) values. The first attempt to face the problem of multivariate process control is due to Hotelling (1947), who introduced the well-known T² chart based on the variance-covariance matrix. Subsequently, different approaches that take into account the multivariate nature of the problem were proposed (Woodall and Ncube, 1985; Lowry et al., 1992; Jackson, 1991; Liu, 1995; Kourti and MacGregor, 1996; MacGregor, 1997). In particular, we focus on the approach based on PLS components proposed by Kourti and MacGregor (1996), in order to monitor over time the dependence structure between a set of process variables and one or more product quality variables (Hawkins, 1991). The PLS approach proves to be effective in the presence of a low ratio of observations to variables and in case of multicollinearity among the predictors, but a major limit of this approach is that it assumes a linear dependence structure. Generally, the linearity assumption in a model is reasonable as a first research step, but in practice relationships between the process variables and the product quality variables are often non-linear, and in order to study the dependence structure the use of non-linear models (PLS via Spline, i.e. PLSS; Durand, 2001), as proposed by Vanacore and Lombardo (2005), could be much more appropriate. The PLSS-T² chart makes it possible to handle non-linear dependence relationships in data structure, missing values and outliers, but it presents two major drawbacks: 1) it does not take into account the possible effect of interactions between process variables; 2) it requires testing the normality assumption on the component scores, even when the original data are multinormal (in fact, in the case of spline, i.e. non-linear, transformations of the original process variables, the multinormality assumption can no longer be guaranteed).

To overcome these drawbacks we present the non-parametric Multivariate Additive PLS Spline-T² chart based on Multivariate Additive PLSS (MAPLSS, Lombardo et al., 2007), briefly described in sub-section 2.1.

2.1 Review of MAPLSS

MAPLSS is just the application of linear PLS regression of the response (matrix Y of dimension n × q) on linear combinations of the transformed predictors (matrix X of dimension n × p) and their interactions. The predictors and bivariate interactions are transformed via a set of K = d + 1 + m (d is the spline degree and m is the knot number) basis functions, called B-splines $B_l(\cdot)$, so as to represent any spline as a linear combination

$$s(x, \beta) = \sum_{l=1}^{K} \beta_l B_l(x),$$

where $\beta = (\beta_1, \ldots, \beta_K)$ is the vector of spline coefficients computed via regression of $y \in \mathbb{R}$ on the $B_l(\cdot)$. The centered coding matrix or design matrix including interactions

where $K_1$ and $K_2$ are index sets for single variables and bivariate interactions, respectively. In a generic form, the MAPLSS model, for the response j, can be written

where A is the space dimension parameter and L is the index set pointing out the predictors as well as the bivariate interactions retained by MAPLSS. It is thus a purely additive model that depends on A which in turn depends on the spline parameters (i.e. degree, number and location of knots).
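To make the count K = d + 1 + m concrete, the B-spline coding of a single predictor can be sketched with the Cox-de Boor recursion. This is an illustrative sketch, not the MAPLSS implementation: the function name `bspline_basis`, the clamped knot vector (boundary knots repeated d + 1 times) and the handling of the right end point are assumptions, and the columns are left uncentered.

```python
import numpy as np

def bspline_basis(x, degree, interior_knots, lower, upper):
    """Evaluate the K = degree + 1 + m B-spline basis functions at the points x."""
    x = np.asarray(x, dtype=float)
    # Clamped knot vector: each boundary knot is repeated degree + 1 times
    knots = np.concatenate([[lower] * (degree + 1),
                            np.sort(interior_knots),
                            [upper] * (degree + 1)])
    n_spans = len(knots) - 1
    # Degree 0: indicator functions of the half-open knot spans
    B = np.zeros((len(x), n_spans))
    for l in range(n_spans):
        B[:, l] = (knots[l] <= x) & (x < knots[l + 1])
    # Close the last non-empty span on the right so that x == upper is covered
    B[x == upper, len(knots) - degree - 2] = 1.0
    # Cox-de Boor recursion, raising the degree one step at a time
    for k in range(1, degree + 1):
        B_new = np.zeros((len(x), n_spans - k))
        for l in range(n_spans - k):
            left = knots[l + k] - knots[l]
            right = knots[l + k + 1] - knots[l + 1]
            if left > 0:
                B_new[:, l] += (x - knots[l]) / left * B[:, l]
            if right > 0:
                B_new[:, l] += (knots[l + k + 1] - x) / right * B[:, l + 1]
        B = B_new
    return B
```

With degree 1 and one interior knot, as used later in the application, each predictor is coded by K = 3 piecewise-linear "hat" functions whose values sum to 1 at every point.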

Increasing the order of interaction in MAPLSS implies expanding the dimension of the design matrix B. MAPLSS constructs a sequence of centered and uncorrelated predictors, i.e. the MAPLSS (latent) components $(t_1, \ldots, t_A)$. We now briefly describe the MAPLSS model-building stage. In the first phase we do not consider interactions in the design matrix. This phase consists of the following steps.

step 1. Denote $B_0 = B$ and $Y_0 = Y$ the design and response data matrices, respectively. Define $t_1 = B_0 w_1$ and $u_1 = Y_0 c_1$ as the first MAPLSS components, where the weighting unit vectors $w_1$ and $c_1$ are computed by maximizing the covariance between linear compromises of the transformed predictors and response

the end, in the final phase we include in the design matrix B the selected interactions and repeat the algorithm from step 1 to the final step.
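Step 1 above can be sketched with a standard result for PLS: for column-centered matrices, the unit vectors $w_1$ and $c_1$ that maximize the covariance between $B_0 w$ and $Y_0 c$ are the leading left and right singular vectors of $B_0' Y_0$. The function name `first_pls_components` is hypothetical, and the sketch covers only this first step, not the deflation or the selection of interactions.

```python
import numpy as np

def first_pls_components(B0, Y0):
    """First pair of PLS components from column-centered B0 (n x K) and Y0 (n x q)."""
    B0 = np.asarray(B0, dtype=float)
    Y0 = np.asarray(Y0, dtype=float)
    # Leading singular vectors of the cross-product matrix B0' Y0
    U, s, Vt = np.linalg.svd(B0.T @ Y0, full_matrices=False)
    w1 = U[:, 0]            # unit weight vector for the transformed predictors
    c1 = Vt[0, :]           # unit weight vector for the responses
    t1 = B0 @ w1            # first component on the predictor side
    u1 = Y0 @ c1            # first component on the response side
    return w1, c1, t1, u1
```

The inner product of `t1` and `u1` equals the largest singular value of $B_0' Y_0$, so no other pair of unit weight vectors gives a larger covariance.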

A simple way to illustrate the contribution of predictors to response variables consists of ordering the predictors with respect to their decreasing influence on the response $\hat{y}^j(A)$, using as a criterion the range of the $s_i(x_i, \hat{\beta}^j_i(A))$ values of the transformed sample $x_i$ (see figure 3). One can also use the same criterion to prune the model, by eliminating the predictors and/or the interactions of low influence so as to obtain a more parsimonious model.

T² chart (Jackson, 1991), PLS-T² chart (Kourti and MacGregor, 1996) and PLSS-T² chart (Vanacore and Lombardo, 2005), the MAPLSS-T² chart is based on the first A components. The MAPLSS-T² chart is an effective monitoring tool: it incorporates the variability structure underlying process data and product quality data, extracting information on the directionality of the process variation. The scores of each new observation are monitored by the MAPLSS-T² control chart based on the following statistic

$$T^2 = \sum_{a=1}^{A} \frac{t_a^2}{\lambda_a},$$

where $\lambda_a$ and $t_a$ for a = 1, ..., A are the eigenvalues and the component scores, respectively, of the previously defined covariance matrix. The control limits of the MAPLSS-T² chart are based on the percentiles $q_\alpha$ (for α ≤ 10%) of the empirical distributions, $F_N$, of MAPLSS component scores, computed on a large number N of bootstrap samples.
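The charted statistic and the percentile-based limits can be sketched as follows. This is a simplified illustration, not the procedure of the paper: it assumes the statistic is the sum of squared component scores divided by their eigenvalues, and it resamples an existing matrix of component scores instead of refitting the MAPLSS model on every bootstrap sample. The function names and default values are hypothetical.

```python
import numpy as np

def t2_statistic(scores, eigenvalues):
    """T^2 for each observation: sum over the A components of t_a**2 / lambda_a."""
    T = np.asarray(scores, dtype=float)        # shape (n, A)
    lam = np.asarray(eigenvalues, dtype=float) # shape (A,)
    return (T ** 2 / lam).sum(axis=1)

def bootstrap_control_limits(scores, alpha=0.01, n_boot=500, seed=0):
    """Percentile control limits from bootstrap resamples of the score matrix.

    Simplification: the rows of `scores` are resampled with replacement and the
    component variances are recomputed in each resample; no model is refitted.
    """
    rng = np.random.default_rng(seed)
    T = np.asarray(scores, dtype=float)
    n = T.shape[0]
    pooled = []
    for _ in range(n_boot):
        sample = T[rng.integers(0, n, size=n)]
        lam = sample.var(axis=0, ddof=1)
        pooled.append(t2_statistic(sample, lam))
    pooled = np.concatenate(pooled)
    lcl, ucl = np.percentile(pooled, [100 * alpha, 100 * (1 - alpha)])
    return lcl, ucl
```

An observation would then be flagged when its T² value falls outside the interval given by the two limits.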

Multivariate control charts can detect an unusual event but do not provide a reason for it. Following the diagnostic approach proposed by Kourti and MacGregor (1996) and using some new tools, we can investigate observations falling out of the limits through

(1) bar plots of standardized out-of-control scores ($t_a/\sqrt{\lambda_a}$ for a = 1, ..., A), to focus on the most important dimensions;

(2) bar plot of the contributions of the process variables on the dimensions identified as the most important ones, to evaluate how each process variable involved in the calculation of that score contributes to it;

(3) bar plot of the contributions of the process variables on product variables (measured by the spline range) to evaluate the importance of process variables.

3 Application: monitoring the painting process of hot-rolled aluminium foils

In this section we illustrate the usefulness of the MAPLSS-T² chart and the related diagnosis tools by applying them to monitor a real manufacturing process. We focus on the modeling phase of statistical process control. The data refer to a manufacturing firm of Naples, specialized in hot-rolling of aluminium foils. The manufacturing process consists in simultaneously painting the lower and upper surfaces of an aluminium foil. The process starts by setting the aluminium roll on the unwinding swift. The aluminium foil, pulled by the draught rein that manages the crossing speed, reaches the painting station where it is uniformly painted on both surfaces by deflector rolls. The paint drying and polymerization is realized in a flotation oven consisting of 6 distinct modules (each module is characterized by a specific temperature and can be gradually boosted and independently tuned up).

The process stops by rewinding the aluminium roll. The key product quality characteristics are the uniformity and stability of the aluminium painting. Both of them depend on the Peak Metal Temperature, PMT, reached during the polymerization. By managing the temperatures of the stay of the aluminium foil in the oven, one can influence the PMT. Thus PMT has been selected as the only product quality variable, whereas the temperatures characterizing the six modules (T1, T2, T3, T4, T5, T6) and the post-combustion temperature (Tpost) have been selected as process variables. The MAPLSS-T² control chart is built on a historical data set of n = 100 independent unit samples. The computational strategy consists in first performing the MAPLSS regression (see Table 1) using low degree and knot number (degree=1, knots=1), deciding the dimension space A by Cross Validation (we get A = 3 with PRESS = 0.15). Using the balance between the goodness of fit (R²) and parsimony (PRESS), we select only one interaction among the candidates; the resulting best one is T4*T5. Afterwards we extract N = 500 Bootstrap samples and perform the MAPLSS regression procedure on each of them, having properly fixed the model parameters (degree=1, knots=1, A=3). The computation of the T² scores for all Bootstrap samples allows us to estimate the empirical distribution function of T². We fix the control chart lower and upper limits at the percentiles with α = 1% and α = 99% (LCL=2.81, UCL=393.03).

Fig. 1. MAPLSS-T² control chart

Fig. 2. Bar plot of contributions of process variables to the second dimension

Fig. 3. Bar plot of contributions of process variables to PMT

Looking at the resulting control chart (see figure 1) we note two points out of control at the beginning of the sequence (points 5 and 13). They must be investigated. Using bar plots (1) for points 5 and 13, dimension 2 results as the most important one for both out-of-control points. The bar plot (2) of process variables which contribute

Date posted: 05/08/2014, 21:21
