Advances in Data Analysis
Studies in Classification, Data Analysis, and Knowledge Organization

Managing Editors: H.-H. Bock, Aachen; W. Gaul, Karlsruhe; M. Vichi, Rome
Editorial Board: Ph. Arabie, Newark; D. Baier, Cottbus; F. Critchley, Milton Keynes
Titles in the Series

W. Gaul and D. Pfeifer (Eds.): From Data to Knowledge. 1995
H.-H. Bock and W. Polasek (Eds.): Data Analysis and Information Systems. 1996
E. Diday, Y. Lechevallier, and O. Opitz (Eds.): Ordinal and Symbolic Data Analysis. 1996
R. Klar and O. Opitz (Eds.): Classification and Knowledge Organization. 1997
C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H.-H. Bock, and Y. Baba (Eds.): Data Science, Classification, and Related Methods. 1998
I. Balderjahn, R. Mathar, and
M. Vichi and O. Opitz (Eds.): Classification and Data Analysis. 1999
W. Gaul and H. Locarek-Junge (Eds.): Classification in the Information Age. 1999
H.-H. Bock and E. Diday (Eds.): Analysis of Symbolic Data. 2000
H.A.L. Kiers, J.-P. Rasson, P.J.F. Groenen, and M. Schader (Eds.): Data Analysis, Classification, and Related Methods. 2000
W. Gaul, O. Opitz, M. Schader (Eds.): Data Analysis. 2000
R. Decker and W. Gaul (Eds.): Classification and Information Processing at the Turn of the Millennium. 2000
S. Borra, R. Rocci, M. Vichi, and M. Schader (Eds.): Advances in Classification and Data Analysis. 2000
W. Gaul and G. Ritter (Eds.): Classification, Automation, and New Media. 2002
K. Jajuga, A. Sokołowski, and H.-H. Bock (Eds.): Classification, Clustering and Data Analysis. 2002
M. Schwaiger and O. Opitz (Eds.): Exploratory Data Analysis in Empirical Research. 2003
M. Schader, W. Gaul, and M. Vichi (Eds.): Between Data Science and Applied Data Analysis. 2003
H.-H. Bock, M. Chiodi, and A. Mineo (Eds.): Advances in Multivariate Data Analysis. 2004
D. Banks, L. House, F.R. McMorris, P. Arabie, and W. Gaul (Eds.): Classification, Clustering, and Data Mining Applications. 2004
D. Baier and K.-D. Wernecke (Eds.): Innovations in Classification, Data Science, and Information Systems. 2005
M. Vichi, P. Monari, S. Mignani, and A. Montanari (Eds.): New Developments in Classification and Data Analysis. 2005
D. Baier, R. Decker, and L. Schmidt-Thieme (Eds.): Data Analysis and Decision Support. 2005
C. Weihs and W. Gaul (Eds.): Classification – the Ubiquitous Challenge. 2005
M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, and W. Gaul (Eds.): From Data and Information Analysis to Knowledge Engineering. 2006
V. Batagelj, H.-H. Bock, A. Ferligoj, and A. Žiberna (Eds.): Data Science and Classification. 2006
S. Zani, A. Cerioli, M. Riani, M. Vichi (Eds.): Data Analysis, Classification and the Forward Search. 2006
Proceedings of the 30th Annual Conference of the Gesellschaft für Klassifikation e.V., Freie Universität Berlin, March 8-10, 2006
With 202 Figures and 92 Tables
Professor Dr. Reinhold Decker
Department of Business Administration and Economics
ISBN 978-3-540-70980-0 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2007
The use of general descriptive names, registered names, trademarks, etc in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Production: LE-TEX Jelonek, Schmidt & Vockler GbR, Leipzig
Cover-design: WMX Design GmbH, Heidelberg
Preface

This volume contains the revised versions of selected papers presented during the 30th Annual Conference of the German Classification Society (Gesellschaft für Klassifikation – GfKl) on "Advances in Data Analysis". The conference was held at the Freie Universität Berlin, Germany, in March 2006. The scientific program featured 7 parallel tracks with more than 200 contributed talks in 63 sessions. Additionally, thanks to the support of the DFG (German Research Foundation), 18 plenary and semi-plenary speakers from Europe and overseas could be invited to talk about their current research in classification and data analysis. With 325 participants from 24 countries in Europe and overseas this GfKl Conference, once again, provided an international forum for discussions and mutual exchange of knowledge with colleagues from different fields of interest. From altogether 115 full papers that had been submitted for this volume 77 were finally accepted.
The scientific program included a broad range of topics from classification and data analysis. Interdisciplinary research and the interaction between theory and practice were particularly emphasized. The following sections (with chairs in alphabetical order) were established:
I Theory and Methods
Clustering and Classification (H.-H. Bock and T. Imaizumi); Exploratory Data Analysis and Data Mining (M. Meyer and M. Schwaiger); Pattern Recognition and Discrimination (G. Ritter); Visualization and Scaling Methods (P. Groenen and A. Okada); Bayesian, Neural, and Fuzzy Clustering (R. Kruse and A. Ultsch); Graphs, Trees, and Hierarchies (E. Godehardt and J. Hansohm); Evaluation of Clustering Algorithms and Data Structures (C. Hennig); Data Analysis and Time Series Analysis (S. Lang); Data Cleaning and Pre-Processing (H.-J. Lenz); Text and Web Mining (A. Nürnberger and M. Spiliopoulou); Personalization and Intelligent Agents (A. Geyer-Schulz); Tools for Intelligent Data Analysis (M. Hahsler and K. Hornik)
II Applications
Subject Indexing and Library Science (H.-J. Hermes and B. Lorenz); Marketing, Management Science, and OR (D. Baier and O. Opitz); E-commerce, Recommender Systems, and Business Intelligence (L. Schmidt-Thieme); Banking and Finance (K. Jajuga and H. Locarek-Junge); Economics (G. Kauermann and W. Polasek); Biostatistics and Bioinformatics (B. Lausen and U. Mansmann); Genome and DNA Analysis (A. Schliep); Medical and Health Sciences (K.-D. Wernecke and S. Willich); Archaeology (I. Herzog, T. Kerig, and A. Posluschny); Statistical Musicology (C. Weihs); Image and Signal Processing (J. Buhmann); Linguistics (H. Goebl and P. Grzybek); Psychology (S. Krolak-Schwerdt); Technology and Production (M. Feldmann)
Additionally, the following invited sessions were organized by colleagues from associated societies: Classification with Complex Data Structures (A. Cerioli); Machine Learning (D.A. Zighed); Classification and Dimensionality Reduction (M. Vichi)
The editors would like to emphatically thank the section chairs for doing such a great job regarding the organization of their sections and the associated paper reviews. The same applies to W. Esswein for organizing the Doctoral Workshop and to H.-J. Hermes and B. Lorenz for organizing the Librarians Workshop. Cordial thanks also go to the members of the scientific program committee for their conceptual and practical support (in alphabetical order): D. Baier (Cottbus), H.-H. Bock (Aachen), H.W. Brachinger (Fribourg), R. Decker (Bielefeld, Chair), D. Dubois (Toulouse), A. Gammerman (London), W. Gaul (Karlsruhe), A. Geyer-Schulz (Karlsruhe), B. Goldfarb (Paris), P. Groenen (Rotterdam), D. Hand (London), T. Imaizumi (Tokyo), K. Jajuga (Wroclaw), G. Kauermann (Bielefeld), R. Kruse (Magdeburg), S. Lang (Innsbruck), B. Lausen (Erlangen-Nürnberg), H.-J. Lenz (Berlin), F. Murtagh (London), A. Okada (Tokyo), L. Schmidt-Thieme (Hildesheim), M. Spiliopoulou (Magdeburg), W. Stützle (Washington), and C. Weihs (Dortmund). The review process was additionally supported by the following colleagues: A. Cerioli, E. Gatnar, T. Kneib, V. Köppen, M. Meißner, I. Michalarias, F. Mörchen, W. Steiner, and M. Walesiak.
The great success of this conference would not have been possible without the support of many people mainly working in the backstage. Representative for the whole team we would like to particularly thank M. Darkow (Bielefeld) and A. Wnuk (Berlin) for their exceptional efforts and great commitment with respect to the preparation, organization and post-processing of the conference. We thank very much our web masters I. Michalarias (Berlin) and A. Omelchenko (Berlin). Furthermore, we would cordially thank V. Köppen (Berlin) and M. Meißner (Bielefeld) for providing an excellent support regarding the management of the reviewing process and the final editing of the papers printed in this volume.
The GfKl Conference 2006 would not have been possible in the way it took place without the financial and/or material support of the following institutions and companies (in alphabetical order): Deutsche Forschungsgemeinschaft, Freie Universität Berlin, Gesellschaft für Klassifikation e.V., Land Software-Entwicklung, Microsoft München, SAS Deutschland, Springer-Verlag, SPSS München, Universität Bielefeld, and Westfälisch-Lippische Universitätsgesellschaft. We express our gratitude to all of them.
Finally, we would like to thank Dr. Martina Bihn of Springer-Verlag, Heidelberg, for her support and dedication to the production of this volume.
Reinhold Decker
Contents

Model Selection Criteria for Model-Based Clustering of
Categorical Time Series Data: A Monte Carlo Study
Mário A.T. Figueiredo 39
A Method for Analyzing the Asymptotic Behavior of the
Walk Process in Restricted Random Walk Cluster Algorithm
Markus Franke, Andreas Geyer-Schulz 51
Cluster and Select Approach to Classifier Fusion
Eugeniusz Gatnar 59
Random Intersection Graphs and Classification
Erhard Godehardt, Jerzy Jaworski, Katarzyna Rybarczyk 67
Optimized Alignment and Visualization of Clustering Results
Martin Hoffmann, Dörte Radke, Ulrich Möller 75
Finding Cliques in Directed Weighted Graphs Using Complex Hermitian Adjacency Matrices
Bettina Hoser, Thomas Bierhance 83
Text Clustering with String Kernels in R
Alexandros Karatzoglou, Ingo Feinerer 91
Automatic Classification of Functional Data with Extremal
Information
Fabrizio Laurini, Andrea Cerioli 99
Typicality Degrees and Fuzzy Prototypes for Clustering
Marie-Jeanne Lesot, Rudolf Kruse 107
On Validation of Hierarchical Clustering
Hans-Joachim Mucha 115
Part II Classification
Rearranging Classified Items in Hierarchies Using
Categorization Uncertainty
Korinna Bade, Andreas Nürnberger 125
Localized Linear Discriminant Analysis
Irina Czogiel, Karsten Luebke, Marc Zentgraf, Claus Weihs 133
Calibrating Classifier Scores into Probabilities
Martin Gebel, Claus Weihs 141
Nonlinear Support Vector Machines Through Iterative
Majorization and I-Splines
Patrick J.F. Groenen, Georgi Nalbantov, J. Cor Bioch 149
Deriving Consensus Rankings from Benchmarking
Experiments
Kurt Hornik, David Meyer 163
Classification of Contradiction Patterns
Heiko Müller, Ulf Leser, Johann-Christoph Freytag 171
Selecting SVM Kernels and Input Variable Subsets in Credit Scoring Models
Klaus B Schebesch, Ralf Stecking 179
Part III Data and Time Series Analysis
Simultaneous Selection of Variables and Smoothing
Parameters in Geoadditive Regression Models
Christiane Belitz, Stefan Lang 189
Modelling and Analysing Interval Data
Paula Brito 197
Testing for Genuine Multimodality in Finite Mixture Models: Application to Linear Regression Models
Bettina Grün, Friedrich Leisch 209
Happy Birthday to You, Mr Wilcoxon!
Invariance, Semiparametric Efficiency, and Ranks
Marc Hallin 217
Equivalent Number of Degrees of Freedom for Neural
Networks
Salvatore Ingrassia, Isabella Morlini 229
Model Choice for Panel Spatial Models: Crime Modeling in Japan
Kazuhiko Kakamu, Wolfgang Polasek, Hajime Wago 237
A Boosting Approach to Generalized Monotonic Regression
Florian Leitenstorfer, Gerhard Tutz 245
From Eigenspots to Fisherspots – Latent Spaces in the
Nonlinear Detection of Spot Patterns in a Highly Varying
Background
Bjoern H Menze, B Michael Kelm, Fred A Hamprecht 255
Identifying and Exploiting Ultrametricity
Fionn Murtagh 263
Factor Analysis for Extraction of Structural Components and Prediction in Time Series
Carsten Schneider, Gerhard Arminger 273
Classification of the U.S. Business Cycle by Dynamic Linear Discriminant Analysis
Roland Schuhr 281
Examination of Several Results of Different Cluster Analyses with a Separate View to Balancing the Economic and
Ecological Performance Potential of Towns and Cities
Nguyen Xuan Thinh, Martin Behnisch, Alfred Ultsch 289
Part IV Visualization and Scaling Methods
VOS: A New Method for Visualizing Similarities Between
Objects
Nees Jan van Eck, Ludo Waltman 299
Multidimensional Scaling of Asymmetric Proximities with a Dominance Point
Akinori Okada, Tadashi Imaizumi 307
Single Cluster Visualization to Optimize Air Traffic
Management
Frank Rehm, Frank Klawonn, Rudolf Kruse 319
Rescaling Proximity Matrix Using Entropy Analyzed by
INDSCAL
Satoru Yokoyama, Akinori Okada 327
Part V Information Retrieval, Data and Web Mining
Canonical Forms for Frequent Graph Mining
Christian Borgelt 337
Applying Clickstream Data Mining to Real-Time Web
Crawler Detection and Containment Using ClickTips
Platform
Anália Lourenço, Orlando Belo 351
Plagiarism Detection Without Reference Collections
Sven Meyer zu Eissen, Benno Stein, Marion Kulig 359
Putting Successor Variety Stemming to Work
Benno Stein, Martin Potthast 367
Collaborative Filtering Based on User Trends
Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos
Papadopoulos, Yannis Manolopoulos 375
Investigating Unstructured Texts with Latent Semantic
Analysis
Fridolin Wild, Christina Stahl 383
Part VI Marketing, Management Science and Economics
Heterogeneity in Preferences for Odd Prices
Bernhard Baumgartner, Winfried J Steiner 393
Classification of Reference Models
Robert Braun, Werner Esswein 401
Adaptive Conjoint Analysis for Pricing Music Downloads
Christoph Breidert, Michael Hahsler 409
Improving the Probabilistic Modeling of Market Basket Data
Christian Buchta 417
Classification in Marketing Research by Means of
LEM2-generated Rules
Reinhold Decker, Frank Kroll 425
Pricing Energy in a Multi-Utility Market
Markus Franke, Andreas Kamper, Anke Eßer 433
Disproportionate Samples in Hierarchical Bayes CBC
Analysis
Sebastian Fuchs, Manfred Schwaiger 441
Building on the Arules Infrastructure for Analyzing
Transaction Data with R
Michael Hahsler, Kurt Hornik 449
Balanced Scorecard Simulator – A Tool for Stochastic
Business Figures
Veit Köppen, Marina Allgeier, Hans-J. Lenz 457
Integration of Customer Value into Revenue Management
Tobias von Martens, Andreas Hilbert 465
Women’s Occupational Mobility and Segregation in the
Labour Market: Asymmetric Multidimensional Scaling
Miki Nakai 473
Multilevel Dimensions of Consumer Relationships in the
Healthcare Service Market: M-L IRT vs. M-L SEM Approach
Iga Rudawska, Adam Sagan 481
Data Mining in Higher Education
Karoline Schönbrunn, Andreas Hilbert 489
Attribute Aware Anonymous Recommender Systems
Manuel Stritt, Karen H.L Tso, Lars Schmidt-Thieme 497
Part VII Banking and Finance
On the Notions and Properties of Risk and Risk Aversion in the Time Optimal Approach to Decision Making
Martin Bouzaima, Thomas Burkhardt 507
A Model of Rational Choice Among Distributions of Goal
Reaching Times
Thomas Burkhardt 515
On Goal Reaching Time Distributions Estimated from DAX Stock Index Investments
Thomas Burkhardt, Michael Haasis 523
Credit Risk of Collaterals: Examining the Systematic Linkage between Insolvencies and Physical Assets in Germany
Marc Gürtler, Dirk Heithecker, Sven Olboeter 531
Foreign Exchange Trading with Support Vector Machines
Christian Ullrich, Detlef Seese, Stephan Chalup 539
The Influence of Specific Information on the Credit Risk
Level
Miroslaw Wójciak, Aleksandra Wójcicka-Krenz 547
Part VIII Bio- and Health Sciences
Enhancing Bluejay with Scalability, Genome Comparison and Microarray Visualization
Anguo Dong, Andrei L. Turinsky, Andrew C. Ah-Seng, Morgan Taschuk,
Paul M.K. Gordon, Katharina Hochauer, Sabrina Fröls, Jung Soh, Christoph W. Sensen 557
Discovering Biomarkers for Myocardial Infarction from
SELDI-TOF Spectra
Christian Höner zu Siederdissen, Susanne Ragg, Sven Rahmann 569
Joint Analysis of In-situ Hybridization and Gene Expression Data
Lennart Opitz, Alexander Schliep, Stefan Posch 577
Unsupervised Decision Trees Structured by Gene Ontology
(GO-UDTs) for the Interpretation of Microarray Data
Henning Redestig, Florian Sohler, Ralf Zimmer, Joachim Selbig 585
Part IX Linguistics and Text Analysis
Clustering of Polysemic Words
Laurent Cicurel, Stephan Bloehdorn, Philipp Cimiano 595
Classifying German Questions According to Ontology-Based Answer Types
Adriana Davidescu, Andrea Heyl, Stefan Kazalski, Irene Cramer,
Dietrich Klakow 603
The Relationship of Word Length and Sentence Length: The Inter-Textual Perspective
Peter Grzybek, Ernst Stadlober, Emmerich Kelih 611
Comparing the Stability of Different Clustering Results of
Dialect Data
Edgar Haimerl, Hans-Joachim Mucha 619
Part-of-Speech Discovery by Clustering Contextual Features
Reinhard Rapp 627
Part X Statistical Musicology and Sound Classification
A Probabilistic Framework for Audio-Based Tonal Key and
Chord Recognition
Benoit Catteau, Jean-Pierre Martens, Marc Leman 637
Using MCMC as a Stochastic Optimization Procedure for
Monophonic and Polyphonic Sound
Katrin Sommer, Claus Weihs 645
Vowel Classification by a Neurophysiologically Parameterized Auditory Model
Gero Szepannek, Tamás Harczos, Frank Klefenz, András Katai, Patrick Schikowski, Claus Weihs 653
Part XI Archaeology
Uncovering the Internal Structure of the Roman Brick and
Tile Making in Frankfurt-Nied by Cluster Validation
Jens Dolata, Hans-Joachim Mucha, Hans-Georg Bartel 663
Where Did I See You Before?
A Holistic Method to Compare and Find Archaeological
Artifacts
Vincent Mom 671
Keywords 681
Author Index 685
Part I
Clustering
Mixture Models for Classification

Gilles Celeux
Inria Futurs, Orsay, France; Gilles.Celeux@inria.fr
Abstract. Finite mixture distributions provide efficient approaches of model-based clustering and classification. The advantages of mixture models for unsupervised classification are reviewed. Then, the article focuses on the model selection problem. The usefulness of taking into account the modeling purpose when selecting a model is advocated in the unsupervised and supervised classification contexts. This point of view has led to the definition of two penalized likelihood criteria, ICL and BEC, which are presented and discussed. Criterion ICL is the approximation of the integrated completed likelihood and is concerned with model-based cluster analysis. Criterion BEC is the approximation of the integrated conditional likelihood and is concerned with generative models of classification. The behavior of ICL for choosing the number of components in a mixture model and of BEC to choose a model minimizing the expected error rate are analyzed in contrast with standard model selection criteria.
1 Introduction
Finite mixture models have been extensively studied for decades and provide a fruitful framework for classification (McLachlan and Peel (2000)). In this article some of the main features and advantages of finite mixture analysis for model-based clustering are reviewed in Section 2. An important interest of finite mixture models is to provide a rigorous setting to assess the number of clusters in an unsupervised classification context or to assess the stability of a classification function. Section 3 is focused on those two questions.
Model-based clustering (MBC) consists of assuming that the data come from a source with several subpopulations. Each subpopulation is modeled separately and the overall population is a mixture of these subpopulations.
The resulting model is a finite mixture model. Observations x = (x_1, ..., x_n) in R^{nd} are assumed to be a sample from a probability distribution with density
$$p(\mathbf{x}_i \mid \theta_K) = \sum_{k=1}^{K} p_k\, \phi(\mathbf{x}_i \mid a_k), \qquad (1)$$
where the p_k's are the mixing proportions (0 < p_k < 1 for all k = 1, ..., K and Σ_k p_k = 1), φ(· | a_k) denotes a parameterized density and θ_K = (p_1, ..., p_{K-1}, a_1, ..., a_K). When data are multivariate continuous observations, the component parameterized density is usually the d-dimensional Gaussian density and parameter a_k = (µ_k, Σ_k), µ_k being the mean and Σ_k the variance matrix of component k. When data are discrete, the component parameterized density is usually the multivariate multinomial density, which assumes conditional independence of the observations knowing their mixture component, and the a_k = (a_k^j, j = 1, ..., d)'s are the multinomial probabilities for variable j and mixture component k. The resulting model is the so-called Latent Class Model (see for instance Goodman (1974)).
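As a brief illustration of the density just defined, the following sketch evaluates a two-component Gaussian mixture at a new point. It is not taken from the paper; the parameter values are made up and NumPy/SciPy are assumed to be available.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-component bivariate Gaussian mixture (illustrative values only).
proportions = np.array([0.4, 0.6])                    # p_1, p_2
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # mu_k
covs = [np.eye(2), np.diag([2.0, 0.5])]               # Sigma_k

def mixture_density(x, p, mus, sigmas):
    """Evaluate sum_k p_k * phi(x | mu_k, Sigma_k) for one observation x."""
    return sum(pk * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pk, mu, S in zip(p, mus, sigmas))

x_new = np.array([1.0, 1.5])
print(mixture_density(x_new, proportions, means, covs))
```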
The mixture model is an incomplete data structure model: The complete data are
$$\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_n) = ((\mathbf{x}_1, \mathbf{z}_1), \ldots, (\mathbf{x}_n, \mathbf{z}_n)),$$
where the missing data are z = (z_1, ..., z_n), the z_i = (z_i1, ..., z_iK) being binary vectors such that z_ik = 1 iff x_i arises from group k. The z's define a partition P = (P_1, ..., P_K) of the observed data x with P_k = {x_i | z_ik = 1}.
In this article, it is considered that the mixture models at hand are estimated through maximum likelihood (ml) or related methods. Although it has received a lot of attention since the seminal article of Diebolt and Robert (1994), Bayesian inference is not considered here. Bayesian analysis of univariate mixtures has become the standard Bayesian tool for density estimation. But, especially in the multivariate setting, a lot of problems (possible slow convergence of MCMC algorithms, definition of subjective weakly informative priors, identifiability, ...) remain and it cannot be regarded as a standard tool for Bayesian clustering of multivariate data (see Aitkin (2001)). The reader is referred to the survey article of Marin et al (2005) for a readable state of the art of Bayesian inference for mixture models.

2 Some advantages of model-based clustering
In this section, some important and nice features of finite mixture analysis are sketched. The advantages of finite mixture analysis in a clustering context, highlighted here, are: Many versatile or parsimonious models are available, many algorithms to estimate the mixture parameters are available, special questions can be tackled in a proper way in the MBC context, and, last but not least, finite mixture models can be compared and assessed in an objective way. It allows in particular to assess the number of clusters properly. The discussion on this important point is postponed to Section 3.
Many versatile or parsimonious models are available.
In the multivariate Gaussian mixture context, the variance matrix eigenvalue decomposition
$$\Sigma_k = V_k\, D_k^{t} A_k D_k,$$
where V_k = |Σ_k|^{1/d} defines the component volume, D_k, the matrix of eigenvectors of Σ_k, defines the component orientation, and A_k, the diagonal matrix of normalized eigenvalues, defines the component shape, leads to get different and easily interpreted models by allowing some of these quantities to vary between components. Following Banfield and Raftery (1993) or Celeux and Govaert (1995), a large range of fourteen versatile (from the most complex to the simplest one) models derived from this eigenvalue decomposition can be considered. Assuming equal or free volumes, orientations and shapes leads to eight different models. Assuming in addition that the component variance matrices are diagonal leads to four models. And, finally, assuming in addition that the component variance matrices are proportional to the identity matrix leads to two other models.
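To make the decomposition concrete, here is a small sketch, written by us in NumPy and not part of the paper, that extracts the volume, orientation and shape of a given component covariance matrix.

```python
import numpy as np

def volume_orientation_shape(sigma):
    """Decompose a covariance matrix as Sigma = V * D^t A D (up to eigenvalue ordering)."""
    d = sigma.shape[0]
    eigvals, eigvecs = np.linalg.eigh(sigma)     # Sigma = eigvecs @ diag(eigvals) @ eigvecs.T
    volume = np.linalg.det(sigma) ** (1.0 / d)   # V_k = |Sigma_k|^{1/d}
    shape = np.diag(eigvals / volume)            # normalized eigenvalues, det(A_k) = 1
    orientation = eigvecs.T                      # rows are eigenvectors of Sigma
    return volume, orientation, shape

sigma_k = np.array([[4.0, 1.0], [1.0, 2.0]])     # illustrative component covariance
V, D, A = volume_orientation_shape(sigma_k)
print(V, np.prod(np.diag(A)))                    # the shape matrix has unit determinant
```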
In the Latent Class Model, a re-parameterization is possible to lead to various models taking account of the scattering around centers of the clusters in different ways (Celeux and Govaert (1991)). This re-parameterization is as follows. The multinomial probabilities a_k are decomposed in (m_k, ε_k), where the binary vector m_k = (m_k^1, ..., m_k^d) provides the mode levels in cluster k for variable j. For instance, if a_k^j = (0.7, 0.2, 0.1), the new parameters are m_k^j = (1, 0, 0) and ε_k^j = (0.3, 0.2, 0.1). This parameterization can lead to five latent class models; a small numerical illustration of the decomposition is given after the list below. Denoting h(jk) the mode level for variable j and cluster k and h(ij) the level of object i for the variable j, the model can be written
$$\phi(\mathbf{x}_i \mid a_k) = \prod_{j=1}^{d} \left(1 - \varepsilon_k^{jh(jk)}\right)^{x_i^{jh(jk)}} \left(\varepsilon_k^{jh(ij)}\right)^{x_i^{jh(ij)} - x_i^{jh(jk)}}.$$
Using this form, it is possible to impose various constraints to the scattering parameters ε_k^{jh}. The models of interest are the following:
• The standard latent class model [ε_k^{jh}]: The scattering is depending upon clusters, variables and levels.
• [ε_k^j]: The scattering is depending upon clusters and variables but not upon levels.
• [ε_k]: The scattering is depending upon clusters, but not upon variables.
• [ε^j]: The scattering is depending upon variables, but not upon clusters.
• [ε]: The scattering is constant over variables and clusters.
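The following tiny sketch, our own illustrative Python and not from the paper, reproduces the numerical example above by splitting a_k^j = (0.7, 0.2, 0.1) into the mode indicator m_k^j and the scattering parameters ε_k^j.

```python
import numpy as np

a_kj = np.array([0.7, 0.2, 0.1])        # multinomial probabilities for variable j in cluster k

mode = np.zeros_like(a_kj)
mode[np.argmax(a_kj)] = 1.0             # m_k^j = (1, 0, 0): indicator of the mode level

eps = np.where(mode == 1.0, 1.0 - a_kj, a_kj)  # eps at the mode is 1 - 0.7, elsewhere a_kj itself
print(mode, eps)                        # [1. 0. 0.] [0.3 0.2 0.1]
```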
Many algorithms available from different points of view
The EM algorithm of Dempster et al (1977) is the reference tool to derive the ml estimates in a mixture model. An iteration of EM is as follows (a short illustrative sketch is given after the two steps):
• E step: Compute the conditional probabilities t_ik, i = 1, ..., n, k = 1, ..., K, that x_i arises from the kth component for the current value of the mixture parameters.
• M step: Update the mixture parameter estimates maximizing the expected value of the completed likelihood. It leads to use standard formulas where the observation i for group k is weighted with the conditional probability t_ik.
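A minimal sketch of one such iteration for a univariate Gaussian mixture is given below. This is illustrative NumPy/SciPy code written by us, not the paper's implementation; the data and starting values are made up.

```python
import numpy as np
from scipy.stats import norm

def em_iteration(x, p, mu, sigma):
    """One EM iteration for a univariate Gaussian mixture with K components."""
    # E step: conditional probabilities t_ik that x_i arises from component k.
    dens = np.stack([pk * norm.pdf(x, m, s) for pk, m, s in zip(p, mu, sigma)], axis=1)
    t = dens / dens.sum(axis=1, keepdims=True)
    # M step: weighted updates of proportions, means and standard deviations.
    nk = t.sum(axis=0)
    p_new = nk / len(x)
    mu_new = (t * x[:, None]).sum(axis=0) / nk
    var_new = (t * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
    return p_new, mu_new, np.sqrt(var_new), t

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
p, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    p, mu, sigma, t = em_iteration(x, p, mu, sigma)
print(p.round(2), mu.round(2))
```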
Other algorithms take profit of the missing data structure of the mixture model. For instance, the classification EM (CEM), see Celeux and Govaert (1992), is directly concerned with the estimation of the missing labels z. An iteration of CEM is as follows:
• E step: As in EM.
• C step: Assign each point x_i to the component maximizing the conditional probability t_ik using a maximum a posteriori (MAP) principle.
• M step: Update the mixture parameter estimates maximizing the completed likelihood.
CEM aims to maximize the completed likelihood where the component label of each sample point is included in the data set. CEM is a K-means-like algorithm and, contrary to EM, it converges in a finite number of iterations. But CEM provides biased estimates of the mixture parameters. This algorithm is interesting in a clustering context when the clusters are well separated (see Celeux and Govaert (1993)).
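The only change with respect to the EM sketch above is the hard assignment in the C step; an illustrative fragment (our own code, not the paper's) could read:

```python
import numpy as np

def c_step(t):
    """C step: MAP assignment of each point to the component maximizing t_ik."""
    z = np.argmax(t, axis=1)              # hard labels
    z_onehot = np.eye(t.shape[1])[z]      # 0/1 weights then used in the M step
    return z_onehot

# Example: reuse conditional probabilities t as produced by the EM sketch above.
t = np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])
print(c_step(t))
```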
From another point of view, the Stochastic EM (SEM) algorithm can be useful. It is as follows:
• E step: As in EM.
• S step: Assign each point x_i at random to one of the components according to the distribution defined by the (t_ik, k = 1, ..., K).
• M step: Update the mixture parameter estimates maximizing the completed likelihood.
SEM generates a Markov chain whose stationary distribution is (more or less) concentrated around the ML parameter estimator. Thus a natural parameter estimate from a SEM sequence is the mean of the iterate values obtained after a burn-in period. An alternative estimate is to consider the parameter value leading to the largest likelihood in a SEM sequence. In any case, SEM is expected to avoid insensible maxima of the likelihood that EM cannot avoid, but SEM can be jeopardized by spurious maxima (see Celeux et al (1996) or McLachlan and Peel (2000) for details). Note that different variants (Monte Carlo EM, Simulated Annealing EM) are possible (see, for instance, Celeux et al (1996)). Note also that Biernacki et al (2003) proposed simple strategies for getting sensible ml estimates. Those strategies act in two ways to deal with this problem. They choose particular starting values from CEM or SEM and they run several times EM or algorithms combining CEM and EM.
Special questions can be tackled in a proper way in the MBC context
Robust Cluster Analysis can be obtained by making use of multivariate Student distributions instead of Multivariate Gaussian distributions. It leads to attenuate the influence of outliers (McLachlan and Peel (2000)). On the other hand, including in the mixture a group from a uniform distribution allows to take into account noisy data (DasGupta and Raftery (1998)).
To avoid spurious maxima of likelihood, shrinking the group variance matrix toward a matrix proportional to the identity matrix can be quite efficient. One of the most achieved works in this domain is Ciuperca et al (2003). Taking profit of the probabilistic framework, it is possible to deal with missing data at random in a proper way with mixture models (Hunt and Basford (2001)). Also, simple, natural and efficient methods of semi-supervised classification can be derived in the mixture framework (an example of a pioneer article on this subject, recently followed by many others, is Ganesalingam and McLachlan (1978)). Finally, it can be noted that promising variable selection procedures for Model-Based Clustering begin to appear (Raftery and Dean (2006)).
3 Choosing a model in a classification purpose
In statistical inference from data, selecting a parsimonious model among a collection of models is an important but difficult task. This general problem has received much attention since the seminal articles of Akaike (1974) and Schwarz (1978). A model selection problem consists essentially of solving the bias-variance dilemma. A classical approach to the model assessing problem consists of penalizing the fit of a model by a measure of its complexity. Criterion AIC of Akaike (1974) is an asymptotic approximation of the expectation of the deviance. It is
$$\mathrm{AIC}(m) = 2 \log p(\mathbf{x} \mid m, \hat{\theta}_m) - 2\nu_m, \qquad (2)$$
where θ̂_m is the ml estimate of parameter θ_m and ν_m is the number of free parameters of model m.
Another point of view consists of basing the model selection on the integrated likelihood of the data in a Bayesian perspective (see Kass and Raftery (1995)). This integrated likelihood is
$$p(\mathbf{x} \mid m) = \int_{\Theta_m} p(\mathbf{x} \mid m, \theta_m)\, \pi(\theta_m)\, d\theta_m, \qquad (3)$$
π(θ_m) being a prior distribution for parameter θ_m. The essential technical problem is to approximate this integrated likelihood in a right way. A classical asymptotic approximation of the logarithm of the integrated likelihood is the BIC criterion of Schwarz (1978). It is
$$\mathrm{BIC}(m) = \log p(\mathbf{x} \mid m, \hat{\theta}_m) - \frac{\nu_m}{2} \log n. \qquad (4)$$
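For completeness, both criteria can be computed directly from a fitted model's maximized log-likelihood. The sketch below uses our own notation and hypothetical numbers; it is not taken from the paper.

```python
import numpy as np

def aic(loglik, n_free_params):
    """AIC(m) = 2 log p(x | m, theta_hat) - 2 nu_m (larger is better in this convention)."""
    return 2.0 * loglik - 2.0 * n_free_params

def bic(loglik, n_free_params, n_obs):
    """BIC(m) = log p(x | m, theta_hat) - (nu_m / 2) log n."""
    return loglik - 0.5 * n_free_params * np.log(n_obs)

# Hypothetical maximized log-likelihoods for K = 1..4 bivariate Gaussian mixtures.
logliks = [-1250.3, -1181.7, -1175.2, -1172.9]
nu = [5, 11, 17, 23]   # K-1 proportions + K means + K full covariances in dimension 2
n = 300
for K, (ll, v) in enumerate(zip(logliks, nu), start=1):
    print(K, round(aic(ll, v), 1), round(bic(ll, v, n), 1))
```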
Beyond technical difficulties, the scope of this section is to show how it can
be fruitful to take into account the purpose of the model user to get reliableand useful models for statistical description or decision tasks Two situationsare considered to support this idea: Choosing the number of components in
a mixture model in a cluster analysis perspective, and choosing a generativeprobabilistic model in a supervised classification context
3.1 Choosing the number of clusters
Assessing the number K of components in a mixture model is a difficult
ques-tion, from both theoretical and practical points of view, which had receivedmuch attention in the past two decades This section does not propose a state
of the art of this problem which has not been completely solved The reader
is referred to the chapter 6 of the book of McLachlan and Peel (2000) for anexcellent overview on this subject This section is essentially aiming to discusselements of practical interest regarding the problem of choosing the number
of mixture components when concerned with cluster analysis
From the theoretical point of view, even when K ∗ the right number ofcomponent is assumed to exist, if K ∗ < K0 then K ∗ is not identifiable in theparameter space Θ K0 (see for instance McLachlan and Peel (2000), chapter6)
But, here, we want to stress the importance of taking into account themodeling context to select a reasonable number of mixture components Ouropinion is that, behind the theoretical difficulties, assessing the number ofcomponents in a mixture model from data is a weakly identifiable statisticalproblem Mixture densities with different number of components can lead
to quite similar resulting probability distributions For instance, the galaxyvelocities data of Roeder (1990) has became a benchmark data set and is used
by many authors to illustrate procedures for choosing the number of mixture
components Now, according to those authors the answer lies from K = 2 to
K = 10, and it is not exaggerating a lot to say that all the answers between
2 and 10 have been proposed as a good answer, at least one time, in thearticles considering this particular data set (An interesting and illuminatingcomparative study on this data set can be found in Aitkin (2001).) Thus, we
consider that it is highly desirable to choose K by keeping in mind what is
expected from the mixture modeling to get a relevant answer to this question.Actually, mixture modeling can be used in quite different purposes It can be
Trang 24Mixture Models for Classification 9regarded as a semi parametric tool for density estimation purpose or as a toolfor cluster analysis.
In the first perspective, much considered by Bayesian statisticians, merical experiments (see Fraley and Raftery (1998)) show that the BIC ap-proximation of the integrated likelihood works well at a practical level And,under regularity conditions including the fact that the component densitiesare finite, Keribin (2000) proved that BIC provides a consistent estimator of
nu-K.
But, in the second perspective, the integrated likelihood does not takeinto account the clustering purpose for selecting a mixture model in a model-based clustering setting As a consequence, in the most current situationswhere the distribution from which the data arose is not in the collection ofconsidered mixture models, BIC criterion will tend to overestimate the correctsize regardless of the separation of the clusters (see Biernacki et al (2000))
To overcome this limitation, it can be advantageous to choose K in order
to get the mixture giving rise to partitioning data with the greatest evidence.With that purpose in mind, Biernacki et al (2000) considered the integrated
likelihood of the complete data (x, z) (or integrated completed likelihood).
(Recall that z = (z1, , z n) is denoting the missing data such that zi =
(zi1 , , z iK) are binary K-dimensional vectors with zik= 1 if and only if xi
arises from component k.) Then, the integrated complete likelihood is
$$p(\mathbf{x}, \mathbf{z} \mid K) = \int_{\Theta_K} p(\mathbf{x}, \mathbf{z} \mid K, \theta)\, \pi(\theta \mid K)\, d\theta, \qquad (5)$$
where
As a consequence, ICL favors K values giving rise to partitioning the data with the greatest evidence, as highlighted in the numerical experiments in Biernacki et al (2000), because of this additional entropy term. More generally, ICL appears to provide a stable and reliable estimate of K for real data sets and also for simulated data sets from mixtures when the components are not too much overlapping (see McLachlan and Peel (2000)). But ICL, which is not aiming to discover the true number of mixture components, can underestimate the number of components for simulated data arising from mixtures with poorly separated components, as illustrated in Figueiredo and Jain (2002).
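In practice ICL is often evaluated from the same quantities as BIC plus the entropy of the fuzzy classification. The sketch below is our own code following the usual BIC-like approximation, with hypothetical conditional probabilities; it is not copied from the paper.

```python
import numpy as np

def icl_bic(loglik, n_free_params, n_obs, t):
    """ICL approximated as BIC minus the entropy of the conditional probabilities t_ik."""
    bic = loglik - 0.5 * n_free_params * np.log(n_obs)
    entropy = -np.sum(t * np.log(np.clip(t, 1e-12, None)))  # -sum_i sum_k t_ik log t_ik
    return bic - entropy

# Hypothetical conditional probabilities for 4 observations and K = 2 components.
t = np.array([[0.99, 0.01], [0.95, 0.05], [0.10, 0.90], [0.50, 0.50]])
print(icl_bic(loglik=-120.4, n_free_params=11, n_obs=4, t=t))
```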
On the contrary, BIC performs remarkably well to assess the true number of components from simulated data (see Biernacki et al (2000), Fraley and Raftery (1998) for instance). But, for real world data sets, BIC has a marked tendency to overestimate the numbers of components. The reason is that real data sets do not arise from the mixture densities at hand, and the penalty term of BIC is not strong enough to balance the tendency of the loglikelihood to increase with K in order to improve the fit of the mixture model.
3.2 Model selection in classification
Supervised classification is about guessing the unknown group among K groups from the knowledge of d variables entering in a vector x_i for unit i. This group for unit i is defined by z_i = (z_i1, ..., z_iK), a binary K-dimensional vector with z_ik = 1 if and only if x_i arises from group k. For that purpose, a decision function, called a classifier, δ(x) : R^d → {1, ..., K} is designed from a learning sample (x_i, z_i), i = 1, ..., n. A classical approach to design a classifier is to represent the group conditional densities with a parametric model p(x | m, z_k = 1, θ_m) for k = 1, ..., K. Then the classifier is
assigning an observation x to the group k maximizing the conditional probability p(z_k = 1 | m, x, θ_m). Using the Bayes rule, it leads to set δ(x) = j if and only if j = arg max_k p_k p(x | m, z_k = 1, θ̂_m), θ̂_m being the ml estimate of the group conditional parameters θ_m and p_k being the prior probability of group k. This approach is known as generative discriminant analysis in the Machine Learning community.
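A literal reading of this rule gives the following sketch, illustrative Python with Gaussian group-conditional densities and made-up parameter values rather than the paper's own code.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical group-conditional Gaussian densities for K = 2 groups.
priors = np.array([0.3, 0.7])                           # p_k
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]

def classify(x):
    """delta(x) = argmax_k p_k * p(x | z_k = 1, theta_hat)."""
    scores = [pk * multivariate_normal.pdf(x, mean=mu, cov=S)
              for pk, mu, S in zip(priors, means, covs)]
    return int(np.argmax(scores)) + 1   # group labels 1..K

print(classify(np.array([1.8, 1.5])))
```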
In this context, it could be expected to improve the actual error rate by selecting a generative model m among a large collection of models M (see for instance Friedman (1989) or Bensmail and Celeux (1996)). For instance Hastie and Tibshirani (1996) proposed to model each group density with a mixture of Gaussian distributions. In this approach the numbers of mixture components per group are sensitive tuning parameters. They can be supplied by the user, as in Hastie and Tibshirani (1996), but it is clearly a sub-optimal solution. They can be chosen to minimize the v-fold cross-validated error rate, as done in Friedman (1989) or Bensmail and Celeux (1996) for other tuning parameters. Despite the fact that the choice of v can be sensitive, it can be regarded as a nearly optimal solution. But it is highly CPU time consuming, and choosing tuning parameters with a penalized loglikelihood criterion, as BIC, can be expected
to be much more efficient in many situations. But BIC measures the fit of the model m to the data (x, z) rather than its ability to produce a reliable classifier. Thus, in many situations, BIC can have a tendency to overestimate the complexity of the generative classification model to be chosen. In order to counter this tendency, a penalized likelihood criterion taking into account the classification task when evaluating the performance of a model has been proposed by Bouchard and Celeux (2006). It is the so-called Bayesian Entropy Criterion (BEC) that is now presented.
As stated above, a classifier deduced from model m is assigning an observation x to the group k maximizing p(z_k = 1 | m, x, θ̂_m). Thus, from the classification point of view, the conditional likelihood p(z | m, x, θ_m) has a central position. For this very reason, Bouchard and Celeux (2006) proposed to make use of the integrated conditional likelihood
$$p(\mathbf{z} \mid m, \mathbf{x}) = \int p(\mathbf{z} \mid m, \mathbf{x}, \theta_m)\, \pi(\theta_m \mid \mathbf{x})\, d\theta_m, \qquad (7)$$
where
$$\pi(\theta_m \mid \mathbf{x}) \propto \pi(\theta_m)\, p(\mathbf{x} \mid m, \theta_m)$$
is the posterior distribution of θ_m knowing x, to select a relevant model m. As for the integrated likelihood, this integral is generally difficult to calculate and has to be approximated. We have
$$p(\mathbf{x} \mid m) = \int p(\mathbf{x} \mid m, \theta_m)\, \pi(\theta_m)\, d\theta_m. \qquad (10)$$
Denoting θ̃_m the parameter value maximizing p(x | m, θ_m), BIC-like approximations of both integrated likelihoods lead to
$$\log p(\mathbf{z} \mid m, \mathbf{x}) = \log p(\mathbf{x}, \mathbf{z} \mid m, \hat{\theta}_m) - \log p(\mathbf{x} \mid m, \tilde{\theta}_m) + O(1). \qquad (11)$$
Thus the approximation of log p(z|m, x) that Bouchard and Celeux (2006)
proposed is
$$\mathrm{BEC}(m) = \log p(\mathbf{x}, \mathbf{z} \mid m, \hat{\theta}_m) - \log p(\mathbf{x} \mid m, \tilde{\theta}_m).$$
Here θ̃ is the ml estimate of a finite mixture distribution. It can be derived from the
EM algorithm. And the EM algorithm can be initiated in a quite natural and unique way with θ̂. Thus the calculation of θ̃ avoids all the possible difficulties which can be encountered with the EM algorithm. Despite the need to use the EM algorithm to estimate this parameter, it would be estimated in a stable and reliable way. It can also be noted that when the learning data set has been obtained through the diagnosis paradigm, the proportions in the mixture distribution are fixed: p_k = card{i such that z_ik = 1}/n for k = 1, ..., K.
Numerical experiments reported in Bouchard and Celeux (2006) show that BEC and cross-validated error rate criteria select most of the times the same models, contrary to BIC which often selects suboptimal models.
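Given the two log-likelihoods discussed above, BEC itself is a one-line computation; the fragment below is our own hedged sketch with placeholder values, only meant to show the bookkeeping.

```python
def bec(complete_loglik_hat, mixture_loglik_tilde):
    """BEC(m) = log p(x, z | m, theta_hat) - log p(x | m, theta_tilde)."""
    return complete_loglik_hat - mixture_loglik_tilde

# Placeholder values: in practice log p(x, z | m, theta_hat) comes from the fitted
# generative classifier and log p(x | m, theta_tilde) from the corresponding mixture fit.
print(bec(complete_loglik_hat=-842.6, mixture_loglik_tilde=-815.1))
```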
4 Discussion
As sketched in Section 2 of this article, finite mixture analysis is definitively a powerful framework for model-based cluster analysis. Many free and valuable softwares for mixture analysis are available: C.A.Man, Emmix, Flemix, MClust, mixmod, Multimix, Sob. We want to insist on the software mixmod on which we have been working for years (Biernacki et al (2006)). It is a mixture software for cluster analysis and classification which contains most of the features described here and whose last version is quite rapid. It is available at url http://www-math.univ-fcomte.fr/mixmod
In the second part of this article, we highlighted how it could be useful to take into account the model purpose to select a relevant and useful model. This point of view can lead to define different selection criteria than the classical BIC criterion. It has been illustrated in two situations: modeling in a clustering purpose and modeling in a supervised classification purpose. This leads to two penalized likelihood criteria, ICL and BEC, for which the penalty is data driven and which are expected to choose a useful, if not true, model.
Now, it can be noticed that we did not consider the modeling purpose when estimating the model parameters. In both situations, we simply considered the maximum likelihood estimator. Taking into account the modeling purpose in the estimation process could be regarded as an interesting point of view. However we do not think that this point of view is fruitful and, moreover, we think it can jeopardize the statistical analysis. For instance, in the cluster analysis context, it could be thought of as more natural to compute the parameter value maximizing the complete loglikelihood log p(x, z | θ)
rather than the observed loglikelihood log p(x | θ). But as proved in Bryant and Williamson (1978), this strategy leads to asymptotically biased estimates of the mixture parameters. In the same manner, in the supervised classification context, considering the parameter value θ* maximizing directly the conditional likelihood log p(z | x, θ) could be regarded as an alternative to the classical maximum likelihood estimation. But this would lead to difficult optimization problems and would provide unstable estimated values. Finally, we do not recommend taking into account the modeling purpose when estimating the model parameters because it could lead to cumbersome algorithms or provoke undesirable biases in the estimation. On the contrary, we think that taking into account the model purpose when assessing a model could lead to choose reliable and stable models, especially in the unsupervised and supervised classification contexts.

References
AITKIN, M (2001): Likelihood and Bayesian Analysis of Mixtures Statistical
Mod-eling, 1, 287–304.
AKAIKE, H (1974): A New Look at Statistical Model Identification IEEE
Trans-actions on Automatic Control, 19, 716–723.
BANFIELD and RAFTERY, A.E (1993): Model-based Gaussian and Non-Gaussian
Clustering Biometrics, 49, 803–821.
BENSMAIL, H and CELEUX, G (1996): Regularized Gaussian Discriminant
Anal-ysis Through Eigenvalue Decomposition Journal of the American Statistical
Association, 91, 1743–48.
BIERNACKI, C., CELEUX., G and GOVAERT, G (2000): Assessing a Mixture
Model for Clustering with the Integrated Completed Likelihood IEEE Trans.
Software, Computational Statistics and Data Analysis (to appear).
BOUCHARD, G and CELEUX, G (2006): Selection of Generative Models in
Clas-sification IEEE Trans on PAMI, 28, 544–554.
BRYANT, P and WILLIAMSON, J (1978): Asymptotic Behavior of Classification
Maximum Likelihood Estimates Biometrika, 65, 273–281.
CELEUX, G., CHAUVEAU, D and DIEBOLT, J (1996): Some Stochastic Versions
of the EM Algorithm Journal of Statistical Computation and Simulation, 55,
287–314.
CELEUX, G and GOVAERT, G (1991): Clustering Criteria for Discrete Data and
Latent Class Model Journal of Classification, 8, 157–176.
CELEUX, G and GOVAERT, G (1992): A Classification EM Algorithm for
Cluster-ing and Two Stochastic Versions Computational Statistics and Data Analysis,
14, 315–332.
Trang 2914 Gilles Celeux
CELEUX, G and GOVAERT, G (1993): Comparison of the Mixture and the
Clas-sification Maximum Likelihood in Cluster Analysis Journal of Computational
and Simulated Statistics, 14, 315–332.
CIUPERCA, G., IDIER, J and RIDOLFI, A.(2003): Penalized Maximum
Likeli-hood Estimator for Normal Mixtures Scandinavian Journal of Statistics, 30,
45–59.
DEMPSTER, A.P., LAIRD, N.M and RUBIN, D.B (1977): Maximum Likelihood
From Incomplete Data Via the EM Algorithm (With Discussion) Journal of
the Royal Statistical Society, Series B, 39, 1–38.
DIEBOLT, J and ROBERT, C P (1994): Estimation of Finite Mixture
Distribu-tions by Bayesian Sampling Journal of the Royal Statistical Society, Series B,
56, 363–375.
FIGUEIREDO, M and JAIN, A.K (2002): Unsupervised Learning of Finite Mixture
Models IEEE Trans on PAMI, 24, 381–396.
FRALEY, C and RAFTERY, A.E (1998): How Many Clusters? Answers via
Model-based Cluster Analysis The Computer Journal, 41, 578–588.
FRIEDMAN, J (1989): Regularized Discriminant Analysis Journal of the American
Statistical Association, 84, 165–175.
GANESALINGAM, S and MCLACHLAN, G J (1978): The Efficiency of a Linear
Discriminant Function Based on Unclassified Initial Samples Biometrika, 65,
658–662.
GOODMAN, L.A (1974): Exploratory Latent Structure Analysis Using Both
Iden-tifiable and UnidenIden-tifiable Models Biometrika, 61, 215–231.
HASTIE, T and TIBSHIRANI, R (1996): Discriminant Analysis By Gaussian
Mix-tures Journal of the Royal Statistical Society, Series B, 58, 158–176.
HUNT, L.A and BASFORD K.E (2001): Fitting a Mixture Model to Three-mode
Three-way Data With Missing Information Journal of Classification, 18, 209–
MARIN, J.-M., MENGERSEN, K. and ROBERT, C.P. (2005): Bayesian Analysis of Finite Mixtures. Handbook of Statistics, Vol. 25, Chapter 16. Elsevier B.V.
MCLACHLAN, G.J. and PEEL, D. (2000): Finite Mixture Models. Wiley, New York.
RAFTERY, A.E (1995): Bayesian Model Selection in Social Research (With
Dis-cussion) In: P.V Marsden (Ed.): Sociological Methodology 1995, Oxford, U.K.:
Blackwells, 111–196
RAFTERY, A.E and DEAN, N (2006): Journal of the American Statistical
Asso-ciation, 101, 168–78.
ROEDER, K (1990): Density Estimation with Confidence Sets Exemplified by
Su-perclusters and Voids in Galaxies Journal of the American Statistical
Associa-tion, 85, 617–624.
SCHWARZ, G (1978): Estimating the Dimension of a Model The Annals of
Statis-tics, 6, 461–464.
How to Choose the Number of Clusters: The Cramer Multiplicity Solution

Adriana Climescu-Haulica
Institut d'Informatique et Mathématiques Appliquées, 51 rue des Mathématiques, Laboratoire Biologie, Informatique, Mathématiques, CEA 17 rue des Martyrs, Grenoble, France; adriana.climescu@imag.fr
Abstract. We propose a method for estimating the number of clusters in data sets based only on the knowledge of the similarity measure between the data points to be clustered. The technique is inspired from spectral learning algorithms and Cramer multiplicity and is tested on synthetic and real data sets. This approach has the advantage to be computationally inexpensive while being an a priori method, independent from the clustering technique.
1 Introduction
Clustering is a technique used in the analysis of microarray gene expression data as a preprocessing step, in functional genomics for example, or as the main discriminating tool in the tumor classification study (Dudoit et al (2002)). While in recent years many clustering methods were developed, it is acknowledged that the reliability of allocation of units to a cluster and the computation of the number of clusters are questions waiting for a joint theoretical and practical validation (Dudoit et al (2002)).
Two main approaches are used in data analysis practice to determine the number of clusters. The most common procedure is to use the number of clusters as a parameter of the clustering method and to select it from a maximum reliability criteria. This approach is strongly dependent on the clustering method. The second approach uses statistical procedures (for example the sampling with respect to a reference distribution) and is less dependent on the clustering method. Examples of methods in this category are the Gap statistic (Tibshirani et al (2001)) and the Clest procedure (Fridlyand and Dudoit (2002)). All of the methods reviewed and compared by Fridlyand and Dudoit (2002) are a posteriori methods, in the sense that they include clustering algorithms as a preprocessing step. In this paper we propose a method for choosing the number of clusters based only on the knowledge of the similarity measure between the data points to be clustered. This criterion is inspired from spectral learning algorithms and Cramer multiplicity. The novelty of the method is given by the direct extraction of the number of clusters from data, with no assumption about effective clusters. The procedure is the following: the clustering problem is mapped to the framework of spectral graph theory by means of the "min-cut" problem. This induces the passage from the discrete domain to the continuous one, by the definition of the time-continuous Markov process associated with the graph. It is the analysis on the continuous domain which allows the screening of the Cramer multiplicity, otherwise set to 1 for the discrete case. We evaluate the method on artificial data obtained as samples from different Gaussian distributions and on yeast cell data for which the similarity metric is well established in the literature. Compared to methods evaluated in Fridlyand and Dudoit (2002) our algorithm is computationally less expensive.
2 Clustering by min-cut
The clustering framework is given by the min-cut problem on graphs. Let G = (V, E) be a weighted graph where to each vertex in the set V we assign one of the m points to be clustered. The weights on the edges E of the graph represent how "similar" one data point is to another and are assigned by a similarity measure S : V × V → IR_+. In the min-cut problem a partition in subsets {V_1, V_2, ..., V_k} of V is searched such that the sum of the weights corresponding to the edges going from one subset to another is minimized. This is an NP-hard optimization problem. The Laplacian operator associated with the graph G is defined on the space of functions f : V(G) → IR by
measure between the vertices i and j is interpreted as the probability that a random walk moves from vertex i to vertex j in one step. We associate with the graph G a time-continuous Markov process with the state space given by the vertex set V and the transition matrix given by the normalized similarity matrix. Assuming the graph is without loops, the paths of the time-continuous Markov process are Hilbert space valuated with respect to the norm given by the quadratic mean.
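A minimal sketch of this construction is shown below. It is our own NumPy code with a toy similarity matrix, matching step 1 of the algorithm in Section 4, and not taken from the paper.

```python
import numpy as np

S = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.2],
              [0.1, 0.2, 0.0]])   # toy symmetric similarity matrix on 3 vertices

D = np.diag(S.sum(axis=1))        # row sums on the diagonal
W = np.linalg.inv(D) @ S          # normalized similarity = one-step transition matrix
print(W.sum(axis=1))              # each row sums to 1, i.e. a random-walk transition matrix
```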
3 Cramer multiplicity
As a stochastic process, the time-continuous Markov process defined by means of the similarity matrix is associated with a unique, up to an equivalence class, sequence of spectral measures, by means of its Cramer decomposition (Cramer (1964)). The length of the sequence of spectral measures is named the Cramer multiplicity of the stochastic process. More precisely, let X : Ω × [0, ∞) → V be the time-continuous Markov process associated with the graph G and let N be its Cramer multiplicity. Then, by the Cramer decomposition theorem (Cramer (1964)) there exist N mutually orthogonal stochastic processes with orthogonal increments Z_n : Ω × [0, ∞) → V, 1 ≤ n ≤ N, such that
a decreasing sequence of measures with respect to the absolute continuity relationship
$$F_{Z_1} \gg F_{Z_2} \gg \cdots \gg F_{Z_N}. \qquad (4)$$
No representation of the form (3) with these properties exists for any smaller value of N. If the time index set of a process is discrete then its Cramer multiplicity is 1. It is easily seen that Cramer decomposition is a generalization of the Fourier representation, applying to stochastic processes and allowing a more general class of orthogonal bases. The processes Z_n, 1 ≤ n ≤ N, are interpreted as innovation processes associated with X. The idea of considering the Cramer multiplicity as the number of clusters resides in this interpretation.
4 Envelope intensity algorithm
We present here an algorithm derived heuristically from the observations above. Nevertheless the relationship between the Laplacian of the graph G and the Kolmogorov equations associated with the time-continuous Markov process X, and hence its Cramer representation, is an open problem to be addressed in the future. The input of the algorithm is the similarity matrix and the output is a function we called the envelope intensity associated with the similarity matrix. This is a piecewise "continuous" increasing function whose number of jumps contributes to the approximation of the Cramer multiplicity.
1. Construct the normalized similarity matrix W = D^{-1} S, where D is the diagonal matrix with elements the sum of the corresponding rows from the matrix S.
2. Compute the matrix L = I − W corresponding to the Laplacian operator.
3. Find y_1, y_2, ..., y_m, the eigenvectors of L, chosen to be orthogonal to each other in the case of repeated eigenvalues, and form the matrix Y = [y_1 y_2 ... y_m] ∈ IR^{m×m} by stacking the eigenvectors in columns.
4. Compute the Fourier transform of Y column by column and construct W, the matrix corresponding to the absolute values of the matrix elements.
5. Assign, in increasing order, the maximal value of each column from W to the vector U ∈ IR_+, called the envelope intensity.
Steps 1 to 3 are derived from the spectral clustering and steps 3 to 5 are parts of the heuristic program to approximate the Cramer multiplicity. Step 3, corresponding to the spectral decomposition of the graph's Laplacian, is their junction part. Two spectral decompositions are applied: the Laplacian decomposition on eigenvectors and the eigenvectors decomposition on their Hilbert space, exemplified by the Fourier transform.
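A direct transcription of these five steps into code might look as follows. This is an illustrative NumPy implementation written by us from the description above, not the authors' original program; in particular it does not re-orthogonalize eigenvectors for repeated eigenvalues.

```python
import numpy as np

def envelope_intensity(S):
    """Steps 1-5: similarity matrix S -> sorted envelope intensity vector U."""
    D_inv = np.diag(1.0 / S.sum(axis=1))     # step 1: D^{-1}
    W = D_inv @ S                             # normalized similarity matrix
    L = np.eye(S.shape[0]) - W                # step 2: Laplacian L = I - W
    _, eigvecs = np.linalg.eig(L)             # step 3: eigenvectors stacked in columns
    Y = np.real(eigvecs)
    F = np.abs(np.fft.fft(Y, axis=0))         # step 4: column-wise Fourier transform, absolute values
    U = np.sort(F.max(axis=0))                # step 5: column maxima in increasing order
    return U

# Toy similarity matrix with two obvious blocks (clusters).
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]])
print(envelope_intensity(S))
```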
5 Data analysis
In order to check the potential of our approach on obtaining a priori information about the number of clusters, based only on a similarity matrix, primarily we have to apply the envelope intensity algorithm to classes of sets whose number of clusters is well established. We choose two categories of sets.
Fig. 1. Example of data sampled from three different Gaussian distributions
The first category is classical for the clustering analysis; we give two sets as examples. The first set is represented in Figure 1 and is given by 200 points from a mixture of three Gaussian distributions (from http://iew3.technion.ac.il/CE/data/3gauss.mat). The plot in Figure 3 corresponds to a mixture of five Gaussian distributions generated by x = m_x + R cos U and y = m_y + R sin U, where (m_x, m_y) is the local mean point chosen from the set {(3, 18), (3, 9), (9, 3), (18, 9), (18, 18)}; R and U are random variables distributed Normal(0, 1) and Uniform(0, π) respectively. The
data from http://faculty.washington.edu/kayee/model/ already analyzed by a model-based approach and by spectral clustering in Meila and Verma (2001). This data set has the advantage to come from real experiments and meanwhile, the number of clusters to be intrinsically determined, given by the five phases of the cell cycle. We applied the envelope intensity algorithm for two sets. The first yeast cell set contains 384 genes selected from general data bases such that each gene has one and only one phase associated to it. To each gene corresponds a vector of intensity points measured at 17 distinct time points. The raw data is normalized by a Z score transformation. The result of the envelope intensity computation, with respect to the similarity measure given by the correlation plus 1, as in Meila and Verma (2001), is shown in Figure 6. The five regions appear distinctly separated by jumps. The second yeast cell set is selected from the first one, corresponding to some functional categories and only four phases. It contains 237 genes, it is log normalized and the simi-
6 Conclusions
We propose an algorithm which is able to indicate the number of clusters based only on the data similarity matrix. This algorithm is inspired from ideas on spectral clustering, stochastic processes on graphs and Cramer decomposition theory. It combines two types of spectral decomposition: the matrix spectral decomposition and the spectral decomposition on Hilbert spaces. The algorithm is easy to implement as it is resumed to the computation of the envelope intensity of the Fourier transformed eigenvectors of the Laplacian associated with the similarity matrix. The data analysis we performed shows that the envelope intensity computed by the algorithm is separated by jumps in connected or single point regions, whose number coincides with the number of clusters. Still more theoretical results have to be developed; this algorithm is an exploratory tool on clustering analysis.
References

DUDOIT, S., FRIDLYAND, J. and SPEED, T.P. (2002): Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97, 77–87.
FRIDLYAND, J. and DUDOIT, S. (2002): A Prediction-based Resampling Method to Estimate the Number of Clusters in a Dataset. Genome Biology, 3, 7.
MEILA, M. and SHI, J. (2001): A Random Walks View of Spectral Segmentation. Proceedings of the International Workshop on Artificial Intelligence and Statistics.
MEILA, M. and VERMA, D. (2001): A Comparison of Spectral Clustering Algorithms. UW CSE Technical Report.
NG, A., JORDAN, M. and WEISS, Y. (2002): Spectral Clustering: Analysis and an Algorithm. In: T. Dietterich, S. Becker, and Z. Ghahramani (Eds.): Advances in Neural Information Processing Systems (NIPS).
TIBSHIRANI, R., WALTHER, G. and HASTIE, T. (2001): Estimating the Number of Clusters in a Dataset Via the Gap Statistic. Journal of the Royal Statistical Society, B, 63, 411–423.
Model Selection Criteria for Model-Based Clustering of Categorical Time Series Data:
A Monte Carlo Study
José G. Dias
Department of Quantitative Methods – GIESTA/UNIDE, ISCTE – Higher Institute of Social Sciences and Business Studies, Av. das Forças Armadas, 1649–026 Lisboa, Portugal; jose.dias@iscte.pt
Abstract. An open issue in the statistical literature is the selection of the number of components for model-based clustering of time series data with a finite number of states (categories) that are observed several times. We specify a finite mixture of Markov chains for which the performance of selection methods that use different information criteria is compared across a large experimental design. The results show that the performance of the information criteria varies across the design. Overall, AIC3 outperforms more widespread information criteria such as AIC and BIC for these finite mixture models.
1 Introduction
Time series or longitudinal data have played an important role in understanding the dynamics of human behavior in most of the social sciences. Despite extensive analyses of continuous time series data, little research has been conducted on unobserved heterogeneity for categorical time series data. Exceptions are applications of finite mixtures of Markov chains, e.g., in marketing (Poulsen (1990)), machine learning (Cadez et al. (2003)) or demography (Dias and Willekens (2005)). Despite the increasing use of these mixtures, little is known about selecting the number of components. Information criteria have become popular as a useful approach to model selection. Some of them, such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), have been widely used. The performance of information criteria has been studied extensively in the finite mixture literature, mostly focused on finite mixtures of Gaussian distributions (McLachlan and Peel (2000)). Therefore, in this article a Monte Carlo experiment is designed to assess the ability of different information criteria to retrieve the true model and to measure the effect of the design factors for
finite mixtures of Markov chains. The results reported in this paper extend the conclusions in Dias (2006) from the zero-order Markov model (latent class model) to the first-order Markov model (finite mixture of Markov chains).
This paper is organized as follows. Section 2 describes the finite mixture of Markov chains. In Section 3, we review the literature on model selection criteria. In Section 4, we describe the design of the Monte Carlo study. In Section 5, we present and discuss the results. The paper concludes with a summary of the main findings, implications, and suggestions for further research.
2 Finite mixture of Markov chains
Let X_it be the random variable denoting the category (state) of individual i at time t, and x_it a particular realization. We will assume discrete time from 0 to T (t = 0, 1, ..., T). Thus, the vectors X_i and x_i denote the consecutive observations (time series) X_it and x_it, respectively, with t = 0, ..., T. The probability density P(X_i = x_i) = P(X_i0 = x_i0, X_i1 = x_i1, ..., X_iT = x_iT) can be extremely difficult to characterize, due to its possibly huge dimension (T + 1). A common procedure to simplify P(X_i = x_i) is to assume the Markov property, stating that the occurrence of the event X_t = x_t depends only on the previous state X_t−1 = x_t−1; that is, conditional on X_t−1, X_t is independent of the states at the other time points. From the Markov property, it follows that
P(X_i = x_i) = P(X_i0 = x_i0) ∏_{t=1}^{T} P(X_it = x_it | X_i,t−1 = x_i,t−1),   (1)

where P(X_i0 = x_i0) is the initial distribution and P(X_it = x_it | X_i,t−1 = x_i,t−1)
is the probability that individual i is in state x_it at time t, given that he is in state x_i,t−1 at time t − 1. A first-order Markov chain is specified by its transition probabilities and its initial distribution. Hereafter, we denote the initial and transition probabilities as λ_j = P(X_i0 = j) and a_jk = P(X_t = k | X_t−1 = j), respectively. Note that we assume that the transition probabilities are time homogeneous, which means that our model is a stationary first-order Markov model.
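For illustration, equation (1) can be evaluated directly. The following sketch is not part of the original paper; the toy values for λ and the transition matrix are arbitrary.

```python
import numpy as np

def chain_probability(x, lam, A):
    """P(X_i = x_i) for a stationary first-order Markov chain, eq. (1).

    x   -- observed sequence x_i0, ..., x_iT coded as integers 0..K-1
    lam -- initial distribution, lam[j] = P(X_i0 = j)
    A   -- transition matrix, A[j, k] = P(X_t = k | X_{t-1} = j)
    """
    p = lam[x[0]]
    for t in range(1, len(x)):
        p *= A[x[t - 1], x[t]]
    return p

# Toy example with K = 2 states (values are illustrative only).
lam = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(chain_probability([0, 0, 1, 1], lam, A))  # 0.6 * 0.9 * 0.1 * 0.8 = 0.0432
```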
The finite mixture of Markov chains assumes discrete heterogeneity. Individuals are clustered into S segments, each denoted by s (s = 1, ..., S). The clusters, including their number, are not known a priori. Thus, in advance one does not know how the sample will be partitioned into clusters. The component that individual i belongs to is denoted by the latent discrete variable Z_i ∈ {1, 2, ..., S}. Let z = (z_1, ..., z_n). Because z is not observed, the inference problem is to estimate the parameters of the model, say ϕ, using only information on x = (x_1, ..., x_n). More precisely, the estimation procedure has to be based on the marginal distribution of x_i, which is obtained as follows:
P(X_i = x_i; ϕ) = ∑_{s=1}^{S} π_s P(X_i = x_i | Z_i = s).   (2)
This equation defines a finite mixture model with S components. The component proportions π_s = P(Z_i = s; ϕ) correspond to the a priori probability that individual i belongs to segment s, and give the relative segment sizes. Moreover, the π_s satisfy π_s > 0 and ∑_{s=1}^{S} π_s = 1.
Within each latent segment s, the observation x_i is characterized by P(X_i = x_i | Z_i = s) = P(X_i = x_i | Z_i = s; θ_s), which implies that all individuals in segment s have the same probability distribution, defined by the segment-specific parameters θ_s. The parameters of the model are ϕ = (π_1, ..., π_S−1, θ_1, ..., θ_S). The θ_s include the transition and initial probabilities a_sjk = P(X_it = k | X_i,t−1 = j, Z_i = s) and λ_sk = P(X_i0 = k | Z_i = s), respectively. A finite mixture of Markov chains is not itself a Markov chain, which enables the modeling of very complex patterns (see Cadez et al. (2003), Dias and Willekens (2005)). The independent parameters of the model are S − 1 prior probabilities, S(K − 1) initial probabilities, and SK(K − 1) transition probabilities, where K is the number of categories or states. Thus, the total number of independent parameters is SK² − 1. The log-likelihood function for ϕ, given that the x_i are independent, is ℓ_S(ϕ; x) = ∑_{i=1}^{n} log P(X_i = x_i; ϕ), and the maximum likelihood estimator (MLE) is ϕ̂ = arg max_ϕ ℓ_S(ϕ; x). For estimating this model by the EM algorithm, we refer to Dias and Willekens (2005).
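A sketch of the resulting log-likelihood, combining equations (1) and (2), together with the parameter count SK² − 1, is given below; the function and argument names are illustrative only and do not reproduce the estimation code of the paper.

```python
import numpy as np

def mixture_loglik(X, pi, lams, As):
    """Log-likelihood of a finite mixture of Markov chains, eqs. (1)-(2).

    X    -- list of n observed state sequences (integers 0..K-1)
    pi   -- component proportions, pi[s] = P(Z_i = s)
    lams -- lams[s][k]  = P(X_i0 = k | Z_i = s)
    As   -- As[s][j, k] = P(X_it = k | X_i,t-1 = j, Z_i = s)
    """
    loglik = 0.0
    for x in X:
        marginal = 0.0
        for s in range(len(pi)):
            # Component-conditional chain probability, as in eq. (1).
            p = lams[s][x[0]]
            for t in range(1, len(x)):
                p *= As[s][x[t - 1], x[t]]
            marginal += pi[s] * p  # mixture marginal, eq. (2)
        loglik += np.log(marginal)
    return loglik

def n_free_parameters(S, K):
    # (S - 1) + S*(K - 1) + S*K*(K - 1) simplifies to S*K**2 - 1.
    return S * K ** 2 - 1
```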
3 Information criteria for model selection
The traditional approach to selecting the best among different models is to use a likelihood ratio test, which under regularity conditions has a simple asymptotic theory (Wilks (1938)). However, in the context of finite mixture models this approach is problematic. The null hypothesis under test is defined on the boundary of the parameter space, and consequently the Cramer regularity condition on the asymptotic properties of the MLE does not hold. Some recent results have been achieved (see, e.g., Lo et al. (2001)); however, most of these results are difficult to implement and are usually derived for finite mixtures of Gaussian distributions.
As an alternative, information statistics have received much attention recently in finite mixture modeling. These statistics are based on the value of −2ℓ_S(ϕ̂; x) of the model, adjusted for the number of free parameters in the model (and other factors such as the sample size). The basic principle underlying these information criteria is parsimony: all other things being the same (log-likelihood), we choose the simplest model (with fewer parameters). Thus, we select the number S which minimizes the criterion C_S = −2ℓ_S(ϕ̂; x) + dN_S, where N_S is the number of free parameters of the model. For different values of d, we have the Akaike Information Criterion (AIC: Akaike (1974)) (d = 2), the Bayesian Information Criterion (BIC: Schwarz (1978)) (d = log n), and the Consistent Akaike Information Criterion (CAIC: Bozdogan (1987)) (d = log n + 1).
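As a sketch, the criteria reduce to a single penalized-fit computation over the fitted models; AIC3, mentioned in the abstract, corresponds to d = 3. The helper names below are illustrative and not part of the original text.

```python
import numpy as np

def information_criterion(loglik, n_params, d):
    """C_S = -2 * loglik + d * N_S for a fitted S-component model."""
    return -2.0 * loglik + d * n_params

def aic(loglik, n_params):
    return information_criterion(loglik, n_params, 2.0)

def aic3(loglik, n_params):
    # AIC3 penalizes each free parameter by 3 instead of 2.
    return information_criterion(loglik, n_params, 3.0)

def bic(loglik, n_params, n):
    return information_criterion(loglik, n_params, np.log(n))

def caic(loglik, n_params, n):
    return information_criterion(loglik, n_params, np.log(n) + 1.0)

# The selected number of components S is the one whose fitted model
# minimizes the chosen criterion over the candidate values of S.
```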