Advances in Data Analysis
Studies in Classification, Data Analysis, and Knowledge Organization

Managing Editors: H.-H. Bock, Aachen; W. Gaul, Karlsruhe; M. Vichi, Rome
Editorial Board: Ph. Arabie, Newark; D. Baier, Cottbus; F. Critchley, Milton Keynes
Titles in the Series

W. Gaul and D. Pfeifer (Eds.): From Data to Knowledge. 1995
H.-H. Bock and W. Polasek (Eds.): Data Analysis and Information Systems. 1996
E. Diday, Y. Lechevallier, and O. Opitz (Eds.): Ordinal and Symbolic Data Analysis. 1996
R. Klar and O. Opitz (Eds.): Classification and Knowledge Organization. 1997
C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H.-H. Bock, and Y. Baba (Eds.): Data Science, Classification, and Related Methods. 1998
I. Balderjahn, R. Mathar, and
M. Vichi and O. Opitz (Eds.): Classification and Data Analysis. 1999
W. Gaul and H. Locarek-Junge (Eds.): Classification in the Information Age. 1999
H.-H. Bock and E. Diday (Eds.): Analysis of Symbolic Data. 2000
H.A.L. Kiers, J.-P. Rasson, P.J.F. Groenen, and M. Schader (Eds.): Data Analysis, Classification, and Related Methods. 2000
W. Gaul, O. Opitz, M. Schader (Eds.): Data Analysis. 2000
R. Decker and W. Gaul (Eds.): Classification and Information Processing at the Turn of the Millennium. 2000
S. Borra, R. Rocci, M. Vichi, and M. Schader (Eds.): Advances in Classification and Data Analysis. 2000
W. Gaul and G. Ritter (Eds.): Classification, Automation, and New Media. 2002
K. Jajuga, A. Sokołowski, and H.-H. Bock (Eds.): Classification, Clustering and Data Analysis. 2002
M. Schwaiger and O. Opitz (Eds.): Exploratory Data Analysis in Empirical Research. 2003
M. Schader, W. Gaul, and M. Vichi (Eds.): Between Data Science and Applied Data Analysis. 2003
H.-H. Bock, M. Chiodi, and A. Mineo (Eds.): Advances in Multivariate Data Analysis. 2004
D. Banks, L. House, F.R. McMorris, P. Arabie, and W. Gaul (Eds.): Classification, Clustering, and Data Mining Applications. 2004
D. Baier and K.-D. Wernecke (Eds.): Innovations in Classification, Data Science, and Information Systems. 2005
M. Vichi, P. Monari, S. Mignani, and A. Montanari (Eds.): New Developments in Classification and Data Analysis. 2005
D. Baier, R. Decker, and L. Schmidt-Thieme (Eds.): Data Analysis and Decision Support. 2005
C. Weihs and W. Gaul (Eds.): Classification – the Ubiquitous Challenge. 2005
M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, and W. Gaul (Eds.): From Data and Information Analysis to Knowledge Engineering. 2006
V. Batagelj, H.-H. Bock, A. Ferligoj, and A. Žiberna (Eds.): Data Science and Classification. 2006
S. Zani, A. Cerioli, M. Riani, M. Vichi (Eds.): Data Analysis, Classification and the Forward Search. 2006
Proceedings of the 30th Annual Conference of the Gesellschaft für Klassifikation e.V., Freie Universität Berlin, March 8-10, 2006
With 202 Figures and 92 Tables
Professor Dr. Reinhold Decker
Department of Business Administration and Economics
ISBN 978-3-540-70980-0 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2007
The use of general descriptive names, registered names, trademarks, etc in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Production: LE-TEX Jelonek, Schmidt & Vockler GbR, Leipzig
Cover-design: WMX Design GmbH, Heidelberg
Preface

This volume contains the revised versions of selected papers presented during the 30th Annual Conference of the German Classification Society (Gesellschaft für Klassifikation – GfKl) on "Advances in Data Analysis". The conference was held at the Freie Universität Berlin, Germany, in March 2006. The scientific program featured 7 parallel tracks with more than 200 contributed talks in 63 sessions. Additionally, thanks to the support of the DFG (German Research Foundation), 18 plenary and semi-plenary speakers from Europe and overseas could be invited to talk about their current research in classification and data analysis. With 325 participants from 24 countries in Europe and overseas this GfKl Conference, once again, provided an international forum for discussions and mutual exchange of knowledge with colleagues from different fields of interest. From altogether 115 full papers that had been submitted for this volume 77 were finally accepted.
The scientific program included a broad range of topics from classification and data analysis. Interdisciplinary research and the interaction between theory and practice were particularly emphasized. The following sections (with chairs in alphabetical order) were established:
I Theory and Methods
Clustering and Classification (H.-H. Bock and T. Imaizumi); Exploratory Data Analysis and Data Mining (M. Meyer and M. Schwaiger); Pattern Recognition and Discrimination (G. Ritter); Visualization and Scaling Methods (P. Groenen and A. Okada); Bayesian, Neural, and Fuzzy Clustering (R. Kruse and A. Ultsch); Graphs, Trees, and Hierarchies (E. Godehardt and J. Hansohm); Evaluation of Clustering Algorithms and Data Structures (C. Hennig); Data Analysis and Time Series Analysis (S. Lang); Data Cleaning and Pre-Processing (H.-J. Lenz); Text and Web Mining (A. Nürnberger and M. Spiliopoulou); Personalization and Intelligent Agents (A. Geyer-Schulz); Tools for Intelligent Data Analysis (M. Hahsler and K. Hornik)
II Applications
Subject Indexing and Library Science (H.-J. Hermes and B. Lorenz); Marketing, Management Science, and OR (D. Baier and O. Opitz); E-commerce, Recommender Systems, and Business Intelligence (L. Schmidt-Thieme); Banking and Finance (K. Jajuga and H. Locarek-Junge); Economics (G. Kauermann and W. Polasek); Biostatistics and Bioinformatics (B. Lausen and U. Mansmann); Genome and DNA Analysis (A. Schliep); Medical and Health Sciences (K.-D. Wernecke and S. Willich); Archaeology (I. Herzog, T. Kerig, and A. Posluschny); Statistical Musicology (C. Weihs); Image and Signal Processing (J. Buhmann); Linguistics (H. Goebl and P. Grzybek); Psychology (S. Krolak-Schwerdt); Technology and Production (M. Feldmann)
Additionally, the following invited sessions were organized by colleagues from associated societies: Classification with Complex Data Structures (A. Cerioli); Machine Learning (D.A. Zighed); Classification and Dimensionality Reduction (M. Vichi)
The editors would like to emphatically thank the section chairs for doing such a great job regarding the organization of their sections and the associated paper reviews. The same applies to W. Esswein for organizing the Doctoral Workshop and to H.-J. Hermes and B. Lorenz for organizing the Librarians Workshop. Cordial thanks also go to the members of the scientific program committee for their conceptual and practical support (in alphabetical order): D. Baier (Cottbus), H.-H. Bock (Aachen), H.W. Brachinger (Fribourg), R. Decker (Bielefeld, Chair), D. Dubois (Toulouse), A. Gammerman (London), W. Gaul (Karlsruhe), A. Geyer-Schulz (Karlsruhe), B. Goldfarb (Paris), P. Groenen (Rotterdam), D. Hand (London), T. Imaizumi (Tokyo), K. Jajuga (Wroclaw), G. Kauermann (Bielefeld), R. Kruse (Magdeburg), S. Lang (Innsbruck), B. Lausen (Erlangen-Nürnberg), H.-J. Lenz (Berlin), F. Murtagh (London), A. Okada (Tokyo), L. Schmidt-Thieme (Hildesheim), M. Spiliopoulou (Magdeburg), W. Stützle (Washington), and C. Weihs (Dortmund). The review process was additionally supported by the following colleagues: A. Cerioli, E. Gatnar, T. Kneib, V. Köppen, M. Meißner, I. Michalarias, F. Mörchen, W. Steiner, and M. Walesiak.
The great success of this conference would not have been possible without the support of many people mainly working in the backstage. Representative for the whole team we would like to particularly thank M. Darkow (Bielefeld) and A. Wnuk (Berlin) for their exceptional efforts and great commitment with respect to the preparation, organization and post-processing of the conference. We thank very much our web masters I. Michalarias (Berlin) and A. Omelchenko (Berlin). Furthermore, we would cordially thank V. Köppen (Berlin) and M. Meißner (Bielefeld) for providing an excellent support regarding the management of the reviewing process and the final editing of the papers printed in this volume.
The GfKl Conference 2006 would not have been possible in the way it took place without the financial and/or material support of the following institutions and companies (in alphabetical order): Deutsche Forschungsgemeinschaft, Freie Universität Berlin, Gesellschaft für Klassifikation e.V., Land Software-Entwicklung, Microsoft München, SAS Deutschland, Springer-Verlag, SPSS München, Universität Bielefeld, and Westfälisch-Lippische Universitätsgesellschaft. We express our gratitude to all of them.
Finally, we would like to thank Dr. Martina Bihn of Springer-Verlag, Heidelberg, for her support and dedication to the production of this volume.
Reinhold Decker
Contents

Model Selection Criteria for Model-Based Clustering of
Categorical Time Series Data: A Monte Carlo Study
Mário A.T. Figueiredo 39
A Method for Analyzing the Asymptotic Behavior of the
Walk Process in Restricted Random Walk Cluster Algorithm
Markus Franke, Andreas Geyer-Schulz 51
Cluster and Select Approach to Classifier Fusion
Eugeniusz Gatnar 59
Random Intersection Graphs and Classification
Erhard Godehardt, Jerzy Jaworski, Katarzyna Rybarczyk 67
Optimized Alignment and Visualization of Clustering Results
Martin Hoffmann, Dörte Radke, Ulrich Möller 75
Finding Cliques in Directed Weighted Graphs Using Complex Hermitian Adjacency Matrices
Bettina Hoser, Thomas Bierhance 83
Text Clustering with String Kernels in R
Alexandros Karatzoglou, Ingo Feinerer 91
Automatic Classification of Functional Data with Extremal
Information
Fabrizio Laurini, Andrea Cerioli 99
Typicality Degrees and Fuzzy Prototypes for Clustering
Marie-Jeanne Lesot, Rudolf Kruse 107
On Validation of Hierarchical Clustering
Hans-Joachim Mucha 115
Part II Classification
Rearranging Classified Items in Hierarchies Using
Categorization Uncertainty
Korinna Bade, Andreas Nürnberger 125
Localized Linear Discriminant Analysis
Irina Czogiel, Karsten Luebke, Marc Zentgraf, Claus Weihs 133
Calibrating Classifier Scores into Probabilities
Martin Gebel, Claus Weihs 141
Nonlinear Support Vector Machines Through Iterative
Majorization and I-Splines
Patrick J.F. Groenen, Georgi Nalbantov, J. Cor Bioch 149
Deriving Consensus Rankings from Benchmarking
Experiments
Kurt Hornik, David Meyer 163
Classification of Contradiction Patterns
Heiko Müller, Ulf Leser, Johann-Christoph Freytag 171
Selecting SVM Kernels and Input Variable Subsets in Credit Scoring Models
Klaus B Schebesch, Ralf Stecking 179
Part III Data and Time Series Analysis
Simultaneous Selection of Variables and Smoothing
Parameters in Geoadditive Regression Models
Christiane Belitz, Stefan Lang 189
Modelling and Analysing Interval Data
Paula Brito 197
Testing for Genuine Multimodality in Finite Mixture Models: Application to Linear Regression Models
Bettina Grün, Friedrich Leisch 209
Happy Birthday to You, Mr Wilcoxon!
Invariance, Semiparametric Efficiency, and Ranks
Marc Hallin 217
Equivalent Number of Degrees of Freedom for Neural
Networks
Salvatore Ingrassia, Isabella Morlini 229
Model Choice for Panel Spatial Models: Crime Modeling in Japan
Kazuhiko Kakamu, Wolfgang Polasek, Hajime Wago 237
A Boosting Approach to Generalized Monotonic Regression
Florian Leitenstorfer, Gerhard Tutz 245
From Eigenspots to Fisherspots – Latent Spaces in the
Nonlinear Detection of Spot Patterns in a Highly Varying
Background
Bjoern H Menze, B Michael Kelm, Fred A Hamprecht 255
Identifying and Exploiting Ultrametricity
Fionn Murtagh 263
Factor Analysis for Extraction of Structural Components and Prediction in Time Series
Carsten Schneider, Gerhard Arminger 273
Classification of the U.S. Business Cycle by Dynamic Linear Discriminant Analysis
Roland Schuhr 281
Examination of Several Results of Different Cluster Analyses with a Separate View to Balancing the Economic and
Ecological Performance Potential of Towns and Cities
Nguyen Xuan Thinh, Martin Behnisch, Alfred Ultsch 289
Part IV Visualization and Scaling Methods
VOS: A New Method for Visualizing Similarities Between
Objects
Nees Jan van Eck, Ludo Waltman 299
Multidimensional Scaling of Asymmetric Proximities with a Dominance Point
Akinori Okada, Tadashi Imaizumi 307
Single Cluster Visualization to Optimize Air Traffic
Management
Frank Rehm, Frank Klawonn, Rudolf Kruse 319
Rescaling Proximity Matrix Using Entropy Analyzed by
INDSCAL
Satoru Yokoyama, Akinori Okada 327
Part V Information Retrieval, Data and Web Mining
Canonical Forms for Frequent Graph Mining
Christian Borgelt 337
Applying Clickstream Data Mining to Real-Time Web
Crawler Detection and Containment Using ClickTips
Platform
Anália Lourenço, Orlando Belo 351
Plagiarism Detection Without Reference Collections
Sven Meyer zu Eissen, Benno Stein, Marion Kulig 359
Putting Successor Variety Stemming to Work
Benno Stein, Martin Potthast 367
Collaborative Filtering Based on User Trends
Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos
Papadopoulos, Yannis Manolopoulos 375
Investigating Unstructured Texts with Latent Semantic
Analysis
Fridolin Wild, Christina Stahl 383
Part VI Marketing, Management Science and Economics
Heterogeneity in Preferences for Odd Prices
Bernhard Baumgartner, Winfried J Steiner 393
Classification of Reference Models
Robert Braun, Werner Esswein 401
Adaptive Conjoint Analysis for Pricing Music Downloads
Christoph Breidert, Michael Hahsler 409
Improving the Probabilistic Modeling of Market Basket Data
Christian Buchta 417
Classification in Marketing Research by Means of
LEM2-generated Rules
Reinhold Decker, Frank Kroll 425
Pricing Energy in a Multi-Utility Market
Markus Franke, Andreas Kamper, Anke Eßer 433
Disproportionate Samples in Hierarchical Bayes CBC
Analysis
Sebastian Fuchs, Manfred Schwaiger 441
Building on the Arules Infrastructure for Analyzing
Transaction Data with R
Michael Hahsler, Kurt Hornik 449
Balanced Scorecard Simulator – A Tool for Stochastic
Business Figures
Veit Köppen, Marina Allgeier, Hans-J. Lenz 457
Integration of Customer Value into Revenue Management
Tobias von Martens, Andreas Hilbert 465
Women’s Occupational Mobility and Segregation in the
Labour Market: Asymmetric Multidimensional Scaling
Miki Nakai 473
Multilevel Dimensions of Consumer Relationships in the
Healthcare Service Market: M-L IRT vs. M-L SEM Approach
Iga Rudawska, Adam Sagan 481
Data Mining in Higher Education
Karoline Schönbrunn, Andreas Hilbert 489
Attribute Aware Anonymous Recommender Systems
Manuel Stritt, Karen H.L Tso, Lars Schmidt-Thieme 497
Part VII Banking and Finance
On the Notions and Properties of Risk and Risk Aversion in the Time Optimal Approach to Decision Making
Martin Bouzaima, Thomas Burkhardt 507
A Model of Rational Choice Among Distributions of Goal
Reaching Times
Thomas Burkhardt 515
On Goal Reaching Time Distributions Estimated from DAX Stock Index Investments
Thomas Burkhardt, Michael Haasis 523
Credit Risk of Collaterals: Examining the Systematic Linkage between Insolvencies and Physical Assets in Germany
Marc Gürtler, Dirk Heithecker, Sven Olboeter 531
Foreign Exchange Trading with Support Vector Machines
Christian Ullrich, Detlef Seese, Stephan Chalup 539
The Influence of Specific Information on the Credit Risk
Level
Miroslaw Wójciak, Aleksandra Wójcicka-Krenz 547
Part VIII Bio- and Health Sciences
Enhancing Bluejay with Scalability, Genome Comparison and Microarray Visualization
Anguo Dong, Andrei L. Turinsky, Andrew C. Ah-Seng, Morgan Taschuk,
Paul M.K. Gordon, Katharina Hochauer, Sabrina Fröls, Jung Soh, Christoph W. Sensen 557
Discovering Biomarkers for Myocardial Infarction from
SELDI-TOF Spectra
Christian Höner zu Siederdissen, Susanne Ragg, Sven Rahmann 569
Joint Analysis of In-situ Hybridization and Gene Expression Data
Lennart Opitz, Alexander Schliep, Stefan Posch 577
Unsupervised Decision Trees Structured by Gene Ontology
(GO-UDTs) for the Interpretation of Microarray Data
Henning Redestig, Florian Sohler, Ralf Zimmer, Joachim Selbig 585
Part IX Linguistics and Text Analysis
Clustering of Polysemic Words
Laurent Cicurel, Stephan Bloehdorn, Philipp Cimiano 595
Classifying German Questions According to Ontology-Based Answer Types
Adriana Davidescu, Andrea Heyl, Stefan Kazalski, Irene Cramer,
Dietrich Klakow 603
The Relationship of Word Length and Sentence Length: The Inter-Textual Perspective
Peter Grzybek, Ernst Stadlober, Emmerich Kelih 611
Comparing the Stability of Different Clustering Results of
Dialect Data
Edgar Haimerl, Hans-Joachim Mucha 619
Part-of-Speech Discovery by Clustering Contextual Features
Reinhard Rapp 627
Part X Statistical Musicology and Sound Classification
A Probabilistic Framework for Audio-Based Tonal Key and
Chord Recognition
Benoit Catteau, Jean-Pierre Martens, Marc Leman 637
Using MCMC as a Stochastic Optimization Procedure for
Monophonic and Polyphonic Sound
Katrin Sommer, Claus Weihs 645
Vowel Classification by a Neurophysiologically Parameterized Auditory Model
Gero Szepannek, Tamás Harczos, Frank Klefenz, András Katai, Patrick Schikowski, Claus Weihs 653
Part XI Archaeology
Uncovering the Internal Structure of the Roman Brick and
Tile Making in Frankfurt-Nied by Cluster Validation
Jens Dolata, Hans-Joachim Mucha, Hans-Georg Bartel 663
Where Did I See You Before?
A Holistic Method to Compare and Find Archaeological
Artifacts
Vincent Mom 671
Keywords 681
Author Index 685
Part I
Clustering
Mixture Models for Classification

Gilles Celeux
Inria Futurs, Orsay, France; Gilles.Celeux@inria.fr
Abstract. Finite mixture distributions provide efficient approaches of model-based clustering and classification. The advantages of mixture models for unsupervised classification are reviewed. Then, the article focuses on the model selection problem. The usefulness of taking into account the modeling purpose when selecting a model is advocated in the unsupervised and supervised classification contexts. This point of view has led to the definition of two penalized likelihood criteria, ICL and BEC, which are presented and discussed. Criterion ICL is the approximation of the integrated completed likelihood and is concerned with model-based cluster analysis. Criterion BEC is the approximation of the integrated conditional likelihood and is concerned with generative models of classification. The behavior of ICL for choosing the number of components in a mixture model and of BEC to choose a model minimizing the expected error rate are analyzed in contrast with standard model selection criteria.
1 Introduction
Finite mixture models have been extensively studied for decades and provide a fruitful framework for classification (McLachlan and Peel (2000)). In this article some of the main features and advantages of finite mixture analysis for model-based clustering are reviewed in Section 2. An important interest of finite mixture models is to provide a rigorous setting to assess the number of clusters in an unsupervised classification context or to assess the stability of a classification function. Section 3 is focused on those two questions.
Model-based clustering (MBC) consists of assuming that the data come from a source with several subpopulations. Each subpopulation is modeled separately and the overall population is a mixture of these subpopulations.
The resulting model is a finite mixture model. Observations x = (x_1, ..., x_n) in R^{nd} are assumed to be a sample from a probability distribution with density
$$p(\mathbf{x}_i \mid \theta_K) = \sum_{k=1}^{K} p_k\, \phi(\mathbf{x}_i \mid a_k), \qquad (1)$$
where the p_k's are the mixing proportions (0 < p_k < 1 for all k = 1, ..., K and Σ_k p_k = 1), φ(· | a_k) denotes a parameterized density and θ_K = (p_1, ..., p_{K-1}, a_1, ..., a_K). When data are multivariate continuous observations, the component parameterized density is usually the d-dimensional Gaussian density and parameter a_k = (µ_k, Σ_k), µ_k being the mean and Σ_k the variance matrix of component k. When data are discrete, the component parameterized density is usually the multivariate multinomial density, which assumes conditional independence of the observations knowing their mixture component, and the a_k = (a_k^j, j = 1, ..., d)'s are the multinomial probabilities for variable j and mixture component k. The resulting model is the so-called Latent Class Model (see for instance Goodman (1974)).
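As a brief illustration of the density just defined, the following sketch evaluates a two-component Gaussian mixture at a new point. It is not taken from the paper; the parameter values are made up and NumPy/SciPy are assumed to be available.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-component bivariate Gaussian mixture (illustrative values only).
proportions = np.array([0.4, 0.6])                    # p_1, p_2
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # mu_k
covs = [np.eye(2), np.diag([2.0, 0.5])]               # Sigma_k

def mixture_density(x, p, mus, sigmas):
    """Evaluate sum_k p_k * phi(x | mu_k, Sigma_k) for one observation x."""
    return sum(pk * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pk, mu, S in zip(p, mus, sigmas))

x_new = np.array([1.0, 1.5])
print(mixture_density(x_new, proportions, means, covs))
```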
The mixture model is an incomplete data structure model: The complete data are
$$\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_n) = ((\mathbf{x}_1, \mathbf{z}_1), \ldots, (\mathbf{x}_n, \mathbf{z}_n)),$$
where the missing data are z = (z_1, ..., z_n), the z_i = (z_i1, ..., z_iK) being binary vectors such that z_ik = 1 iff x_i arises from group k. The z's define a partition P = (P_1, ..., P_K) of the observed data x with P_k = {x_i | z_ik = 1}.
In this article, it is considered that the mixture models at hand are estimated through maximum likelihood (ml) or related methods. Although it has received a lot of attention since the seminal article of Diebolt and Robert (1994), Bayesian inference is not considered here. Bayesian analysis of univariate mixtures has become the standard Bayesian tool for density estimation. But, especially in the multivariate setting, a lot of problems (possible slow convergence of MCMC algorithms, definition of subjective weakly informative priors, identifiability, ...) remain and it cannot be regarded as a standard tool for Bayesian clustering of multivariate data (see Aitkin (2001)). The reader is referred to the survey article of Marin et al (2005) for a readable state of the art of Bayesian inference for mixture models.

2 Some advantages of model-based clustering
In this section, some important and nice features of finite mixture analysis are sketched. The advantages of finite mixture analysis in a clustering context, highlighted here, are: Many versatile or parsimonious models are available, many algorithms to estimate the mixture parameters are available, special questions can be tackled in a proper way in the MBC context, and, last but not least, finite mixture models can be compared and assessed in an objective way. It allows in particular to assess the number of clusters properly. The discussion on this important point is postponed to Section 3.
Many versatile or parsimonious models are available.
In the multivariate Gaussian mixture context, the variance matrix eigenvalue decomposition
$$\Sigma_k = V_k\, D_k^{t} A_k D_k,$$
where V_k = |Σ_k|^{1/d} defines the component volume, D_k, the matrix of eigenvectors of Σ_k, defines the component orientation, and A_k, the diagonal matrix of normalized eigenvalues, defines the component shape, leads to get different and easily interpreted models by allowing some of these quantities to vary between components. Following Banfield and Raftery (1993) or Celeux and Govaert (1995), a large range of fourteen versatile (from the most complex to the simplest one) models derived from this eigenvalue decomposition can be considered. Assuming equal or free volumes, orientations and shapes leads to eight different models. Assuming in addition that the component variance matrices are diagonal leads to four models. And, finally, assuming in addition that the component variance matrices are proportional to the identity matrix leads to two other models.
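To make the decomposition concrete, here is a small sketch, written by us in NumPy and not part of the paper, that extracts the volume, orientation and shape of a given component covariance matrix.

```python
import numpy as np

def volume_orientation_shape(sigma):
    """Decompose a covariance matrix as Sigma = V * D^t A D (up to eigenvalue ordering)."""
    d = sigma.shape[0]
    eigvals, eigvecs = np.linalg.eigh(sigma)     # Sigma = eigvecs @ diag(eigvals) @ eigvecs.T
    volume = np.linalg.det(sigma) ** (1.0 / d)   # V_k = |Sigma_k|^{1/d}
    shape = np.diag(eigvals / volume)            # normalized eigenvalues, det(A_k) = 1
    orientation = eigvecs.T                      # rows are eigenvectors of Sigma
    return volume, orientation, shape

sigma_k = np.array([[4.0, 1.0], [1.0, 2.0]])     # illustrative component covariance
V, D, A = volume_orientation_shape(sigma_k)
print(V, np.prod(np.diag(A)))                    # the shape matrix has unit determinant
```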
In the Latent Class Model, a re-parameterization is possible to lead to various models taking account of the scattering around centers of the clusters in different ways (Celeux and Govaert (1991)). This re-parameterization is as follows. The multinomial probabilities a_k are decomposed in (m_k, ε_k), where the binary vector m_k = (m_k^1, ..., m_k^d) provides the mode levels in cluster k for variable j. For instance, if a_k^j = (0.7, 0.2, 0.1), the new parameters are m_k^j = (1, 0, 0) and ε_k^j = (0.3, 0.2, 0.1). This parameterization can lead to five latent class models; a small numerical illustration of the decomposition is given after the list below. Denoting h(jk) the mode level for variable j and cluster k and h(ij) the level of object i for the variable j, the model can be written
$$\phi(\mathbf{x}_i \mid a_k) = \prod_{j=1}^{d} \left(1 - \varepsilon_k^{jh(jk)}\right)^{x_i^{jh(jk)}} \left(\varepsilon_k^{jh(ij)}\right)^{x_i^{jh(ij)} - x_i^{jh(jk)}}.$$
Using this form, it is possible to impose various constraints to the scattering parameters ε_k^{jh}. The models of interest are the following:
• The standard latent class model [ε_k^{jh}]: The scattering is depending upon clusters, variables and levels.
• [ε_k^j]: The scattering is depending upon clusters and variables but not upon levels.
• [ε_k]: The scattering is depending upon clusters, but not upon variables.
• [ε^j]: The scattering is depending upon variables, but not upon clusters.
• [ε]: The scattering is constant over variables and clusters.
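The following tiny sketch, our own illustrative Python and not from the paper, reproduces the numerical example above by splitting a_k^j = (0.7, 0.2, 0.1) into the mode indicator m_k^j and the scattering parameters ε_k^j.

```python
import numpy as np

a_kj = np.array([0.7, 0.2, 0.1])        # multinomial probabilities for variable j in cluster k

mode = np.zeros_like(a_kj)
mode[np.argmax(a_kj)] = 1.0             # m_k^j = (1, 0, 0): indicator of the mode level

eps = np.where(mode == 1.0, 1.0 - a_kj, a_kj)  # eps at the mode is 1 - 0.7, elsewhere a_kj itself
print(mode, eps)                        # [1. 0. 0.] [0.3 0.2 0.1]
```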
Many algorithms available from different points of view
The EM algorithm of Dempster et al (1977) is the reference tool to derive the ml estimates in a mixture model. An iteration of EM is as follows (a short illustrative sketch is given after the two steps):
• E step: Compute the conditional probabilities t_ik, i = 1, ..., n, k = 1, ..., K, that x_i arises from the kth component for the current value of the mixture parameters.
• M step: Update the mixture parameter estimates maximizing the expected value of the completed likelihood. It leads to use standard formulas where the observation i for group k is weighted with the conditional probability t_ik.
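A minimal sketch of one such iteration for a univariate Gaussian mixture is given below. This is illustrative NumPy/SciPy code written by us, not the paper's implementation; the data and starting values are made up.

```python
import numpy as np
from scipy.stats import norm

def em_iteration(x, p, mu, sigma):
    """One EM iteration for a univariate Gaussian mixture with K components."""
    # E step: conditional probabilities t_ik that x_i arises from component k.
    dens = np.stack([pk * norm.pdf(x, m, s) for pk, m, s in zip(p, mu, sigma)], axis=1)
    t = dens / dens.sum(axis=1, keepdims=True)
    # M step: weighted updates of proportions, means and standard deviations.
    nk = t.sum(axis=0)
    p_new = nk / len(x)
    mu_new = (t * x[:, None]).sum(axis=0) / nk
    var_new = (t * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
    return p_new, mu_new, np.sqrt(var_new), t

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
p, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    p, mu, sigma, t = em_iteration(x, p, mu, sigma)
print(p.round(2), mu.round(2))
```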
Other algorithms take profit of the missing data structure of the mixture model. For instance, the classification EM (CEM), see Celeux and Govaert (1992), is directly concerned with the estimation of the missing labels z. An iteration of CEM is as follows:
• E step: As in EM.
• C step: Assign each point x_i to the component maximizing the conditional probability t_ik using a maximum a posteriori (MAP) principle.
• M step: Update the mixture parameter estimates maximizing the completed likelihood.
CEM aims to maximize the completed likelihood where the component label of each sample point is included in the data set. CEM is a K-means-like algorithm and, contrary to EM, it converges in a finite number of iterations. But CEM provides biased estimates of the mixture parameters. This algorithm is interesting in a clustering context when the clusters are well separated (see Celeux and Govaert (1993)).
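The only change with respect to the EM sketch above is the hard assignment in the C step; an illustrative fragment (our own code, not the paper's) could read:

```python
import numpy as np

def c_step(t):
    """C step: MAP assignment of each point to the component maximizing t_ik."""
    z = np.argmax(t, axis=1)              # hard labels
    z_onehot = np.eye(t.shape[1])[z]      # 0/1 weights then used in the M step
    return z_onehot

# Example: reuse conditional probabilities t as produced by the EM sketch above.
t = np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])
print(c_step(t))
```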
From another point of view, the Stochastic EM (SEM) algorithm can be useful. It is as follows:
• E step: As in EM.
• S step: Assign each point x_i at random to one of the components according to the distribution defined by the (t_ik, k = 1, ..., K).
• M step: Update the mixture parameter estimates maximizing the completed likelihood.
SEM generates a Markov chain whose stationary distribution is (more or less) concentrated around the ML parameter estimator. Thus a natural parameter estimate from a SEM sequence is the mean of the iterate values obtained after a burn-in period. An alternative estimate is to consider the parameter value leading to the largest likelihood in a SEM sequence. In any case, SEM is expected to avoid insensible maxima of the likelihood that EM cannot avoid, but SEM can be jeopardized by spurious maxima (see Celeux et al (1996) or McLachlan and Peel (2000) for details). Note that different variants (Monte Carlo EM, Simulated Annealing EM) are possible (see, for instance, Celeux et al (1996)). Note also that Biernacki et al (2003) proposed simple strategies for getting sensible ml estimates. Those strategies act in two ways to deal with this problem. They choose particular starting values from CEM or SEM and they run several times EM or algorithms combining CEM and EM.
Special questions can be tackled in a proper way in the MBC context
Robust Cluster Analysis can be obtained by making use of multivariate Student distributions instead of Multivariate Gaussian distributions. It leads to attenuate the influence of outliers (McLachlan and Peel (2000)). On the other hand, including in the mixture a group from a uniform distribution allows to take into account noisy data (DasGupta and Raftery (1998)).
To avoid spurious maxima of likelihood, shrinking the group variance matrix toward a matrix proportional to the identity matrix can be quite efficient. One of the most achieved works in this domain is Ciuperca et al (2003). Taking profit of the probabilistic framework, it is possible to deal with missing data at random in a proper way with mixture models (Hunt and Basford (2001)). Also, simple, natural and efficient methods of semi-supervised classification can be derived in the mixture framework (an example of a pioneer article on this subject, recently followed by many others, is Ganesalingam and McLachlan (1978)). Finally, it can be noted that promising variable selection procedures for Model-Based Clustering begin to appear (Raftery and Dean (2006)).
3 Choosing a model in a classification purpose
In statistical inference from data, selecting a parsimonious model among a collection of models is an important but difficult task. This general problem has received much attention since the seminal articles of Akaike (1974) and Schwarz (1978). A model selection problem consists essentially of solving the bias-variance dilemma. A classical approach to the model assessing problem consists of penalizing the fit of a model by a measure of its complexity. Criterion AIC of Akaike (1974) is an asymptotic approximation of the expectation of the deviance. It is
$$\mathrm{AIC}(m) = 2 \log p(\mathbf{x} \mid m, \hat{\theta}_m) - 2\nu_m, \qquad (2)$$
where θ̂_m is the ml estimate of parameter θ_m and ν_m is the number of free parameters of model m.
Another point of view consists of basing the model selection on the integrated likelihood of the data in a Bayesian perspective (see Kass and Raftery (1995)). This integrated likelihood is
$$p(\mathbf{x} \mid m) = \int_{\Theta_m} p(\mathbf{x} \mid m, \theta_m)\, \pi(\theta_m)\, d\theta_m, \qquad (3)$$
π(θ_m) being a prior distribution for parameter θ_m. The essential technical problem is to approximate this integrated likelihood in a right way. A classical asymptotic approximation of the logarithm of the integrated likelihood is the BIC criterion of Schwarz (1978). It is
$$\mathrm{BIC}(m) = \log p(\mathbf{x} \mid m, \hat{\theta}_m) - \frac{\nu_m}{2} \log n. \qquad (4)$$
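For completeness, both criteria can be computed directly from a fitted model's maximized log-likelihood. The sketch below uses our own notation and hypothetical numbers; it is not taken from the paper.

```python
import numpy as np

def aic(loglik, n_free_params):
    """AIC(m) = 2 log p(x | m, theta_hat) - 2 nu_m (larger is better in this convention)."""
    return 2.0 * loglik - 2.0 * n_free_params

def bic(loglik, n_free_params, n_obs):
    """BIC(m) = log p(x | m, theta_hat) - (nu_m / 2) log n."""
    return loglik - 0.5 * n_free_params * np.log(n_obs)

# Hypothetical maximized log-likelihoods for K = 1..4 bivariate Gaussian mixtures.
logliks = [-1250.3, -1181.7, -1175.2, -1172.9]
nu = [5, 11, 17, 23]   # K-1 proportions + K means + K full covariances in dimension 2
n = 300
for K, (ll, v) in enumerate(zip(logliks, nu), start=1):
    print(K, round(aic(ll, v), 1), round(bic(ll, v, n), 1))
```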
Beyond technical difficulties, the scope of this section is to show how it can
be fruitful to take into account the purpose of the model user to get reliableand useful models for statistical description or decision tasks Two situationsare considered to support this idea: Choosing the number of components in
a mixture model in a cluster analysis perspective, and choosing a generativeprobabilistic model in a supervised classification context
3.1 Choosing the number of clusters
Assessing the number K of components in a mixture model is a difficult
ques-tion, from both theoretical and practical points of view, which had receivedmuch attention in the past two decades This section does not propose a state
of the art of this problem which has not been completely solved The reader
is referred to the chapter 6 of the book of McLachlan and Peel (2000) for anexcellent overview on this subject This section is essentially aiming to discusselements of practical interest regarding the problem of choosing the number
of mixture components when concerned with cluster analysis
From the theoretical point of view, even when K ∗ the right number ofcomponent is assumed to exist, if K ∗ < K0 then K ∗ is not identifiable in theparameter space Θ K0 (see for instance McLachlan and Peel (2000), chapter6)
But, here, we want to stress the importance of taking into account themodeling context to select a reasonable number of mixture components Ouropinion is that, behind the theoretical difficulties, assessing the number ofcomponents in a mixture model from data is a weakly identifiable statisticalproblem Mixture densities with different number of components can lead
to quite similar resulting probability distributions For instance, the galaxyvelocities data of Roeder (1990) has became a benchmark data set and is used
by many authors to illustrate procedures for choosing the number of mixture
components Now, according to those authors the answer lies from K = 2 to
K = 10, and it is not exaggerating a lot to say that all the answers between
2 and 10 have been proposed as a good answer, at least one time, in thearticles considering this particular data set (An interesting and illuminatingcomparative study on this data set can be found in Aitkin (2001).) Thus, we
consider that it is highly desirable to choose K by keeping in mind what is
expected from the mixture modeling to get a relevant answer to this question.Actually, mixture modeling can be used in quite different purposes It can be
Trang 24Mixture Models for Classification 9regarded as a semi parametric tool for density estimation purpose or as a toolfor cluster analysis.
In the first perspective, much considered by Bayesian statisticians, merical experiments (see Fraley and Raftery (1998)) show that the BIC ap-proximation of the integrated likelihood works well at a practical level And,under regularity conditions including the fact that the component densitiesare finite, Keribin (2000) proved that BIC provides a consistent estimator of
nu-K.
But, in the second perspective, the integrated likelihood does not takeinto account the clustering purpose for selecting a mixture model in a model-based clustering setting As a consequence, in the most current situationswhere the distribution from which the data arose is not in the collection ofconsidered mixture models, BIC criterion will tend to overestimate the correctsize regardless of the separation of the clusters (see Biernacki et al (2000))
To overcome this limitation, it can be advantageous to choose K in order
to get the mixture giving rise to partitioning data with the greatest evidence.With that purpose in mind, Biernacki et al (2000) considered the integrated
likelihood of the complete data (x, z) (or integrated completed likelihood).
(Recall that z = (z1, , z n) is denoting the missing data such that zi =
(zi1 , , z iK) are binary K-dimensional vectors with zik= 1 if and only if xi
arises from component k.) Then, the integrated complete likelihood is
$$p(\mathbf{x}, \mathbf{z} \mid K) = \int_{\Theta_K} p(\mathbf{x}, \mathbf{z} \mid K, \theta)\, \pi(\theta \mid K)\, d\theta, \qquad (5)$$
where
As a consequence, ICL favors K values giving rise to partitioning the data with the greatest evidence, as highlighted in the numerical experiments in Biernacki et al (2000), because of this additional entropy term. More generally, ICL appears to provide a stable and reliable estimate of K for real data sets and also for simulated data sets from mixtures when the components are not too much overlapping (see McLachlan and Peel (2000)). But ICL, which is not aiming to discover the true number of mixture components, can underestimate the number of components for simulated data arising from mixtures with poorly separated components, as illustrated in Figueiredo and Jain (2002).
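In practice ICL is often evaluated from the same quantities as BIC plus the entropy of the fuzzy classification. The sketch below is our own code following the usual BIC-like approximation, with hypothetical conditional probabilities; it is not copied from the paper.

```python
import numpy as np

def icl_bic(loglik, n_free_params, n_obs, t):
    """ICL approximated as BIC minus the entropy of the conditional probabilities t_ik."""
    bic = loglik - 0.5 * n_free_params * np.log(n_obs)
    entropy = -np.sum(t * np.log(np.clip(t, 1e-12, None)))  # -sum_i sum_k t_ik log t_ik
    return bic - entropy

# Hypothetical conditional probabilities for 4 observations and K = 2 components.
t = np.array([[0.99, 0.01], [0.95, 0.05], [0.10, 0.90], [0.50, 0.50]])
print(icl_bic(loglik=-120.4, n_free_params=11, n_obs=4, t=t))
```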
On the contrary, BIC performs remarkably well to assess the true number of components from simulated data (see Biernacki et al (2000), Fraley and Raftery (1998) for instance). But, for real world data sets, BIC has a marked tendency to overestimate the numbers of components. The reason is that real data sets do not arise from the mixture densities at hand, and the penalty term of BIC is not strong enough to balance the tendency of the loglikelihood to increase with K in order to improve the fit of the mixture model.
3.2 Model selection in classification
Supervised classification is about guessing the unknown group among K groups from the knowledge of d variables entering in a vector x_i for unit i. This group for unit i is defined by z_i = (z_i1, ..., z_iK), a binary K-dimensional vector with z_ik = 1 if and only if x_i arises from group k. For that purpose, a decision function, called a classifier, δ(x) : R^d → {1, ..., K} is designed from a learning sample (x_i, z_i), i = 1, ..., n. A classical approach to design a classifier is to represent the group conditional densities with a parametric model p(x | m, z_k = 1, θ_m) for k = 1, ..., K. Then the classifier is
assigning an observation x to the group k maximizing the conditional probability p(z_k = 1 | m, x, θ_m). Using the Bayes rule, it leads to set δ(x) = j if and only if j = arg max_k p_k p(x | m, z_k = 1, θ̂_m), θ̂_m being the ml estimate of the group conditional parameters θ_m and p_k being the prior probability of group k. This approach is known as generative discriminant analysis in the Machine Learning community.
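A literal reading of this rule gives the following sketch, illustrative Python with Gaussian group-conditional densities and made-up parameter values rather than the paper's own code.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical group-conditional Gaussian densities for K = 2 groups.
priors = np.array([0.3, 0.7])                           # p_k
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]

def classify(x):
    """delta(x) = argmax_k p_k * p(x | z_k = 1, theta_hat)."""
    scores = [pk * multivariate_normal.pdf(x, mean=mu, cov=S)
              for pk, mu, S in zip(priors, means, covs)]
    return int(np.argmax(scores)) + 1   # group labels 1..K

print(classify(np.array([1.8, 1.5])))
```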
In this context, it could be expected to improve the actual error rate by selecting a generative model m among a large collection of models M (see for instance Friedman (1989) or Bensmail and Celeux (1996)). For instance Hastie and Tibshirani (1996) proposed to model each group density with a mixture of Gaussian distributions. In this approach the numbers of mixture components per group are sensitive tuning parameters. They can be supplied by the user, as in Hastie and Tibshirani (1996), but it is clearly a sub-optimal solution. They can be chosen to minimize the v-fold cross-validated error rate, as done in Friedman (1989) or Bensmail and Celeux (1996) for other tuning parameters. Despite the fact that the choice of v can be sensitive, it can be regarded as a nearly optimal solution. But it is highly CPU time consuming, and choosing tuning parameters with a penalized loglikelihood criterion, as BIC, can be expected
to be much more efficient in many situations. But BIC measures the fit of the model m to the data (x, z) rather than its ability to produce a reliable classifier. Thus, in many situations, BIC can have a tendency to overestimate the complexity of the generative classification model to be chosen. In order to counter this tendency, a penalized likelihood criterion taking into account the classification task when evaluating the performance of a model has been proposed by Bouchard and Celeux (2006). It is the so-called Bayesian Entropy Criterion (BEC) that is now presented.
As stated above, a classifier deduced from model m is assigning an observation x to the group k maximizing p(z_k = 1 | m, x, θ̂_m). Thus, from the classification point of view, the conditional likelihood p(z | m, x, θ_m) has a central position. For this very reason, Bouchard and Celeux (2006) proposed to make use of the integrated conditional likelihood
$$p(\mathbf{z} \mid m, \mathbf{x}) = \int p(\mathbf{z} \mid m, \mathbf{x}, \theta_m)\, \pi(\theta_m \mid \mathbf{x})\, d\theta_m, \qquad (7)$$
where
$$\pi(\theta_m \mid \mathbf{x}) \propto \pi(\theta_m)\, p(\mathbf{x} \mid m, \theta_m)$$
is the posterior distribution of θ_m knowing x, to select a relevant model m. As for the integrated likelihood, this integral is generally difficult to calculate and has to be approximated. We have
$$p(\mathbf{x} \mid m) = \int p(\mathbf{x} \mid m, \theta_m)\, \pi(\theta_m)\, d\theta_m. \qquad (10)$$
Denoting θ̃_m the parameter value maximizing p(x | m, θ_m), BIC-like approximations of both integrated likelihoods lead to
$$\log p(\mathbf{z} \mid m, \mathbf{x}) = \log p(\mathbf{x}, \mathbf{z} \mid m, \hat{\theta}_m) - \log p(\mathbf{x} \mid m, \tilde{\theta}_m) + O(1). \qquad (11)$$
Thus the approximation of log p(z|m, x) that Bouchard and Celeux (2006)
proposed is
$$\mathrm{BEC}(m) = \log p(\mathbf{x}, \mathbf{z} \mid m, \hat{\theta}_m) - \log p(\mathbf{x} \mid m, \tilde{\theta}_m).$$
Here θ̃ is the ml estimate of a finite mixture distribution. It can be derived from the
EM algorithm. And the EM algorithm can be initiated in a quite natural and unique way with θ̂. Thus the calculation of θ̃ avoids all the possible difficulties which can be encountered with the EM algorithm. Despite the need to use the EM algorithm to estimate this parameter, it would be estimated in a stable and reliable way. It can also be noted that when the learning data set has been obtained through the diagnosis paradigm, the proportions in the mixture distribution are fixed: p_k = card{i such that z_ik = 1}/n for k = 1, ..., K.
Numerical experiments reported in Bouchard and Celeux (2006) show that BEC and cross-validated error rate criteria select most of the times the same models, contrary to BIC which often selects suboptimal models.
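Given the two log-likelihoods discussed above, BEC itself is a one-line computation; the fragment below is our own hedged sketch with placeholder values, only meant to show the bookkeeping.

```python
def bec(complete_loglik_hat, mixture_loglik_tilde):
    """BEC(m) = log p(x, z | m, theta_hat) - log p(x | m, theta_tilde)."""
    return complete_loglik_hat - mixture_loglik_tilde

# Placeholder values: in practice log p(x, z | m, theta_hat) comes from the fitted
# generative classifier and log p(x | m, theta_tilde) from the corresponding mixture fit.
print(bec(complete_loglik_hat=-842.6, mixture_loglik_tilde=-815.1))
```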
4 Discussion
As sketched in Section 2 of this article, finite mixture analysis is definitively a powerful framework for model-based cluster analysis. Many free and valuable softwares for mixture analysis are available: C.A.Man, Emmix, Flemix, MClust, mixmod, Multimix, Sob. We want to insist on the software mixmod on which we have been working for years (Biernacki et al (2006)). It is a mixture software for cluster analysis and classification which contains most of the features described here and whose last version is quite rapid. It is available at url http://www-math.univ-fcomte.fr/mixmod
In the second part of this article, we highlighted how it could be useful to take into account the model purpose to select a relevant and useful model. This point of view can lead to define different selection criteria than the classical BIC criterion. It has been illustrated in two situations: modeling in a clustering purpose and modeling in a supervised classification purpose. This leads to two penalized likelihood criteria, ICL and BEC, for which the penalty is data driven and which are expected to choose a useful, if not true, model.
Now, it can be noticed that we did not consider the modeling purpose when estimating the model parameters. In both situations, we simply considered the maximum likelihood estimator. Taking into account the modeling purpose in the estimation process could be regarded as an interesting point of view. However we do not think that this point of view is fruitful and, moreover, we think it can jeopardize the statistical analysis. For instance, in the cluster analysis context, it could be thought of as more natural to compute the parameter value maximizing the complete loglikelihood log p(x, z | θ)
rather than the observed loglikelihood log p(x | θ). But as proved in Bryant and Williamson (1978), this strategy leads to asymptotically biased estimates of the mixture parameters. In the same manner, in the supervised classification context, considering the parameter value θ* maximizing directly the conditional likelihood log p(z | x, θ) could be regarded as an alternative to the classical maximum likelihood estimation. But this would lead to difficult optimization problems and would provide unstable estimated values. Finally, we do not recommend taking into account the modeling purpose when estimating the model parameters because it could lead to cumbersome algorithms or provoke undesirable biases in the estimation. On the contrary, we think that taking into account the model purpose when assessing a model could lead to choose reliable and stable models, especially in the unsupervised and supervised classification contexts.

References
AITKIN, M (2001): Likelihood and Bayesian Analysis of Mixtures Statistical
Mod-eling, 1, 287–304.
AKAIKE, H (1974): A New Look at Statistical Model Identification IEEE
Trans-actions on Automatic Control, 19, 716–723.
BANFIELD and RAFTERY, A.E (1993): Model-based Gaussian and Non-Gaussian
Clustering Biometrics, 49, 803–821.
BENSMAIL, H and CELEUX, G (1996): Regularized Gaussian Discriminant
Anal-ysis Through Eigenvalue Decomposition Journal of the American Statistical
Association, 91, 1743–48.
BIERNACKI, C., CELEUX., G and GOVAERT, G (2000): Assessing a Mixture
Model for Clustering with the Integrated Completed Likelihood IEEE Trans.
Software, Computational Statistics and Data Analysis (to appear).
BOUCHARD, G and CELEUX, G (2006): Selection of Generative Models in
Clas-sification IEEE Trans on PAMI, 28, 544–554.
BRYANT, P and WILLIAMSON, J (1978): Asymptotic Behavior of Classification
Maximum Likelihood Estimates Biometrika, 65, 273–281.
CELEUX, G., CHAUVEAU, D and DIEBOLT, J (1996): Some Stochastic Versions
of the EM Algorithm Journal of Statistical Computation and Simulation, 55,
287–314.
CELEUX, G and GOVAERT, G (1991): Clustering Criteria for Discrete Data and
Latent Class Model Journal of Classification, 8, 157–176.
CELEUX, G and GOVAERT, G (1992): A Classification EM Algorithm for
Cluster-ing and Two Stochastic Versions Computational Statistics and Data Analysis,
14, 315–332.
Trang 2914 Gilles Celeux
CELEUX, G and GOVAERT, G (1993): Comparison of the Mixture and the
Clas-sification Maximum Likelihood in Cluster Analysis Journal of Computational
and Simulated Statistics, 14, 315–332.
CIUPERCA, G., IDIER, J and RIDOLFI, A.(2003): Penalized Maximum
Likeli-hood Estimator for Normal Mixtures Scandinavian Journal of Statistics, 30,
45–59.
DEMPSTER, A.P., LAIRD, N.M and RUBIN, D.B (1977): Maximum Likelihood
From Incomplete Data Via the EM Algorithm (With Discussion) Journal of
the Royal Statistical Society, Series B, 39, 1–38.
DIEBOLT, J and ROBERT, C P (1994): Estimation of Finite Mixture
Distribu-tions by Bayesian Sampling Journal of the Royal Statistical Society, Series B,
56, 363–375.
FIGUEIREDO, M and JAIN, A.K (2002): Unsupervised Learning of Finite Mixture
Models IEEE Trans on PAMI, 24, 381–396.
FRALEY, C and RAFTERY, A.E (1998): How Many Clusters? Answers via
Model-based Cluster Analysis The Computer Journal, 41, 578–588.
FRIEDMAN, J (1989): Regularized Discriminant Analysis Journal of the American
Statistical Association, 84, 165–175.
GANESALINGAM, S and MCLACHLAN, G J (1978): The Efficiency of a Linear
Discriminant Function Based on Unclassified Initial Samples Biometrika, 65,
658–662.
GOODMAN, L.A (1974): Exploratory Latent Structure Analysis Using Both
Iden-tifiable and UnidenIden-tifiable Models Biometrika, 61, 215–231.
HASTIE, T and TIBSHIRANI, R (1996): Discriminant Analysis By Gaussian
Mix-tures Journal of the Royal Statistical Society, Series B, 58, 158–176.
HUNT, L.A and BASFORD K.E (2001): Fitting a Mixture Model to Three-mode
Three-way Data With Missing Information Journal of Classification, 18, 209–
MARIN, J.-M., MENGERSEN, K. and ROBERT, C.P. (2005): Bayesian Analysis of Finite Mixtures. Handbook of Statistics, Vol. 25, Chapter 16. Elsevier B.V.
MCLACHLAN, G.J. and PEEL, D. (2000): Finite Mixture Models. Wiley, New York.
RAFTERY, A.E (1995): Bayesian Model Selection in Social Research (With
Dis-cussion) In: P.V Marsden (Ed.): Sociological Methodology 1995, Oxford, U.K.:
Blackwells, 111–196
RAFTERY, A.E and DEAN, N (2006): Journal of the American Statistical
Asso-ciation, 101, 168–78.
ROEDER, K (1990): Density Estimation with Confidence Sets Exemplified by
Su-perclusters and Voids in Galaxies Journal of the American Statistical
Associa-tion, 85, 617–624.
SCHWARZ, G (1978): Estimating the Dimension of a Model The Annals of
Statis-tics, 6, 461–464.
How to Choose the Number of Clusters: The Cramer Multiplicity Solution

Adriana Climescu-Haulica
Institut d'Informatique et Mathématiques Appliquées, 51 rue des Mathématiques, Laboratoire Biologie, Informatique, Mathématiques, CEA 17 rue des Martyrs, Grenoble, France; adriana.climescu@imag.fr
Abstract. We propose a method for estimating the number of clusters in data sets based only on the knowledge of the similarity measure between the data points to be clustered. The technique is inspired from spectral learning algorithms and Cramer multiplicity and is tested on synthetic and real data sets. This approach has the advantage to be computationally inexpensive while being an a priori method, independent from the clustering technique.
1 Introduction
Clustering is a technique used in the analysis of microarray gene expression data as a preprocessing step, in functional genomics for example, or as the main discriminating tool in the tumor classification study (Dudoit et al (2002)). While in recent years many clustering methods were developed, it is acknowledged that the reliability of allocation of units to a cluster and the computation of the number of clusters are questions waiting for a joint theoretical and practical validation (Dudoit et al (2002)).
Two main approaches are used in data analysis practice to determine the number of clusters. The most common procedure is to use the number of clusters as a parameter of the clustering method and to select it from a maximum reliability criteria. This approach is strongly dependent on the clustering method. The second approach uses statistical procedures (for example the sampling with respect to a reference distribution) and is less dependent on the clustering method. Examples of methods in this category are the Gap statistic (Tibshirani et al (2001)) and the Clest procedure (Fridlyand and Dudoit (2002)). All of the methods reviewed and compared by Fridlyand and Dudoit (2002) are a posteriori methods, in the sense that they include clustering algorithms as a preprocessing step. In this paper we propose a method for choosing the number of clusters based only on the knowledge of the similarity measure between the data points to be clustered. This criterion is inspired from spectral learning algorithms and Cramer multiplicity. The novelty of the method is given by the direct extraction of the number of clusters from data, with no assumption about effective clusters. The procedure is the following: the clustering problem is mapped to the framework of spectral graph theory by means of the "min-cut" problem. This induces the passage from the discrete domain to the continuous one, by the definition of the time-continuous Markov process associated with the graph. It is the analysis on the continuous domain which allows the screening of the Cramer multiplicity, otherwise set to 1 for the discrete case. We evaluate the method on artificial data obtained as samples from different Gaussian distributions and on yeast cell data for which the similarity metric is well established in the literature. Compared to methods evaluated in Fridlyand and Dudoit (2002) our algorithm is computationally less expensive.
2 Clustering by min-cut
The clustering framework is given by the min-cut problem on graphs. Let G = (V, E) be a weighted graph where to each vertex in the set V we assign one of the m points to be clustered. The weights on the edges E of the graph represent how "similar" one data point is to another and are assigned by a similarity measure S : V × V → IR_+. In the min-cut problem a partition in subsets {V_1, V_2, ..., V_k} of V is searched such that the sum of the weights corresponding to the edges going from one subset to another is minimized. This is an NP-hard optimization problem. The Laplacian operator associated with the graph G is defined on the space of functions f : V(G) → IR by
measure between the vertices i and j is interpreted as the probability that a random walk moves from vertex i to vertex j in one step. We associate with the graph G a time-continuous Markov process with the state space given by the vertex set V and the transition matrix given by the normalized similarity matrix. Assuming the graph is without loops, the paths of the time-continuous Markov process are Hilbert space valuated with respect to the norm given by the quadratic mean.
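A minimal sketch of this construction is shown below. It is our own NumPy code with a toy similarity matrix, matching step 1 of the algorithm in Section 4, and not taken from the paper.

```python
import numpy as np

S = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.2],
              [0.1, 0.2, 0.0]])   # toy symmetric similarity matrix on 3 vertices

D = np.diag(S.sum(axis=1))        # row sums on the diagonal
W = np.linalg.inv(D) @ S          # normalized similarity = one-step transition matrix
print(W.sum(axis=1))              # each row sums to 1, i.e. a random-walk transition matrix
```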
3 Cramer multiplicity
As a stochastic process, the time-continuous Markov process defined by means of the similarity matrix is associated with a unique, up to an equivalence class, sequence of spectral measures, by means of its Cramer decomposition (Cramer (1964)). The length of the sequence of spectral measures is named the Cramer multiplicity of the stochastic process. More precisely, let X : Ω × [0, ∞) → V be the time-continuous Markov process associated with the graph G and let N be its Cramer multiplicity. Then, by the Cramer decomposition theorem (Cramer (1964)) there exist N mutually orthogonal stochastic processes with orthogonal increments Z_n : Ω × [0, ∞) → V, 1 ≤ n ≤ N, such that
a decreasing sequence of measures with respect to the absolute continuity relationship
$$F_{Z_1} \gg F_{Z_2} \gg \cdots \gg F_{Z_N}. \qquad (4)$$
No representation of the form (3) with these properties exists for any smaller value of N. If the time index set of a process is discrete then its Cramer multiplicity is 1. It is easily seen that Cramer decomposition is a generalization of the Fourier representation, applying to stochastic processes and allowing a more general class of orthogonal bases. The processes Z_n, 1 ≤ n ≤ N, are interpreted as innovation processes associated with X. The idea of considering the Cramer multiplicity as the number of clusters resides in this interpretation.
4 Envelope intensity algorithm
We present here an algorithm derived heuristically from the observations above. Nevertheless the relationship between the Laplacian of the graph G and the Kolmogorov equations associated with the time-continuous Markov process X, and hence its Cramer representation, is an open problem to be addressed in the future. The input of the algorithm is the similarity matrix and the output is a function we called the envelope intensity associated with the similarity matrix. This is a piecewise "continuous" increasing function whose number of jumps contributes to the approximation of the Cramer multiplicity.
1. Construct the normalized similarity matrix W = D^{-1} S, where D is the diagonal matrix with elements the sum of the corresponding rows from the matrix S.
2. Compute the matrix L = I − W corresponding to the Laplacian operator.
3. Find y_1, y_2, ..., y_m, the eigenvectors of L, chosen to be orthogonal to each other in the case of repeated eigenvalues, and form the matrix Y = [y_1 y_2 ... y_m] ∈ IR^{m×m} by stacking the eigenvectors in columns.
4. Compute the Fourier transform of Y column by column and construct W, the matrix corresponding to the absolute values of the matrix elements.
5. Assign, in increasing order, the maximal value of each column from W to the vector U ∈ IR_+, called the envelope intensity.
Steps 1 to 3 are derived from the spectral clustering and steps 3 to 5 are parts of the heuristic program to approximate the Cramer multiplicity. Step 3, corresponding to the spectral decomposition of the graph's Laplacian, is their junction part. Two spectral decompositions are applied: the Laplacian decomposition on eigenvectors and the eigenvectors decomposition on their Hilbert space, exemplified by the Fourier transform.
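A direct transcription of these five steps into code might look as follows. This is an illustrative NumPy implementation written by us from the description above, not the authors' original program; in particular it does not re-orthogonalize eigenvectors for repeated eigenvalues.

```python
import numpy as np

def envelope_intensity(S):
    """Steps 1-5: similarity matrix S -> sorted envelope intensity vector U."""
    D_inv = np.diag(1.0 / S.sum(axis=1))     # step 1: D^{-1}
    W = D_inv @ S                             # normalized similarity matrix
    L = np.eye(S.shape[0]) - W                # step 2: Laplacian L = I - W
    _, eigvecs = np.linalg.eig(L)             # step 3: eigenvectors stacked in columns
    Y = np.real(eigvecs)
    F = np.abs(np.fft.fft(Y, axis=0))         # step 4: column-wise Fourier transform, absolute values
    U = np.sort(F.max(axis=0))                # step 5: column maxima in increasing order
    return U

# Toy similarity matrix with two obvious blocks (clusters).
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]])
print(envelope_intensity(S))
```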
5 Data analysis
In order to check the potential of our approach on obtaining a priori information about the number of clusters, based only on a similarity matrix, primarily we have to apply the envelope intensity algorithm to classes of sets whose number of clusters is well established. We choose two categories of sets.
Fig. 1. Example of data sampled from three different Gaussian distributions
The first category is classical for the clustering analysis; we give two sets as examples. The first set is represented in Figure 1 and is given by 200 points from a mixture of three Gaussian distributions (from http://iew3.technion.ac.il/CE/data/3gauss.mat). The plot in Figure 3 corresponds to a mixture of five Gaussian distributions generated by x = m_x + R cos U and y = m_y + R sin U, where (m_x, m_y) is the local mean point chosen from the set {(3, 18), (3, 9), (9, 3), (18, 9), (18, 18)}; R and U are random variables distributed Normal(0, 1) and Uniform(0, π) respectively. The
data from http://faculty.washington.edu/kayee/model/ already analyzed by a model-based approach and by spectral clustering in Meila and Verma (2001). This data set has the advantage to come from real experiments and meanwhile, the number of clusters to be intrinsically determined, given by the five phases of the cell cycle. We applied the envelope intensity algorithm for two sets. The first yeast cell set contains 384 genes selected from general data bases such that each gene has one and only one phase associated to it. To each gene corresponds a vector of intensity points measured at 17 distinct time points. The raw data is normalized by a Z score transformation. The result of the envelope intensity computation, with respect to the similarity measure given by the correlation plus 1, as in Meila and Verma (2001), is shown in Figure 6. The five regions appear distinctly separated by jumps. The second yeast cell set is selected from the first one, corresponding to some functional categories and only four phases. It contains 237 genes, it is log normalized and the simi-
6 Conclusions
We propose an algorithm which is able to indicate the number of clusters based only on the data similarity matrix. This algorithm is inspired from ideas on spectral clustering, stochastic processes on graphs and Cramer decomposition theory. It combines two types of spectral decomposition: the matrix spectral decomposition and the spectral decomposition on Hilbert spaces. The algorithm is easy to implement as it is resumed to the computation of the envelope intensity of the Fourier transformed eigenvectors of the Laplacian associated with the similarity matrix. The data analysis we performed shows that the envelope intensity computed by the algorithm is separated by jumps in connected or single point regions, whose number coincides with the number of clusters. Still more theoretical results have to be developed; this algorithm is an exploratory tool on clustering analysis.
References

DUDOIT, S., FRIDLYAND, J. and SPEED, T.P. (2002): Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97, 77–87.
FRIDLYAND, J. and DUDOIT, S. (2002): A Prediction-based Resampling Method to Estimate the Number of Clusters in a Dataset. Genome Biology, 3, 7.
MEILA, M. and SHI, J. (2001): A Random Walks View of Spectral Segmentation. Proceedings of the International Workshop on Artificial Intelligence and Statistics.
MEILA, M. and VERMA, D. (2001): A Comparison of Spectral Clustering Algorithms. UW CSE Technical Report.
NG, A., JORDAN, M. and WEISS, Y. (2002): Spectral Clustering: Analysis and an Algorithm. In: T. Dietterich, S. Becker, and Z. Ghahramani (Eds.): Advances in Neural Information Processing Systems (NIPS).
TIBSHIRANI, R., WALTHER, G. and HASTIE, T. (2001): Estimating the Number of Clusters in a Dataset Via the Gap Statistic. Journal of the Royal Statistical Society, B, 63, 411–423.
Model Selection Criteria for Model-Based Clustering of Categorical Time Series Data:
A Monte Carlo Study
José G. Dias
Department of Quantitative Methods – GIESTA/UNIDE, ISCTE – Higher Institute of Social Sciences and Business Studies, Av. das Forças Armadas, 1649–026 Lisboa, Portugal; jose.dias@iscte.pt
Abstract. An open issue in the statistical literature is the selection of the number of components for model-based clustering of time series data with a finite number of states (categories) that are observed several times. We specify a finite mixture of Markov chains for which the performance of selection methods that use different information criteria is compared across a large experimental design. The results show that the performance of the information criteria varies across the design. Overall, AIC3 outperforms more widespread information criteria such as AIC and BIC for these finite mixture models.
1 Introduction
Time series or longitudinal data have played an important role in understanding the dynamics of human behavior in most of the social sciences. Despite extensive analyses of continuous time series data, little research has been conducted on unobserved heterogeneity for categorical time series data. Exceptions are applications of finite mixtures of Markov chains, e.g., in marketing (Poulsen (1990)), machine learning (Cadez et al. (2003)) or demography (Dias and Willekens (2005)). Despite the increasing use of these mixtures, little is known about selecting the number of components. Information criteria have become popular as a useful approach to model selection. Some of them, such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), have been widely used. The performance of information criteria has been studied extensively in the finite mixture literature, mostly focused on finite mixtures of Gaussian distributions (McLachlan and Peel (2000)). Therefore, in this article a Monte Carlo experiment is designed to assess the ability of different information criteria to retrieve the true model and to measure the effect of the design factors for
finite mixtures of Markov chains. The results reported in this paper extend the conclusions in Dias (2006) from the zero-order Markov model (latent class model) to the first-order Markov model (finite mixture of Markov chains).
This paper is organized as follows. Section 2 describes the finite mixture of Markov chains. In Section 3, we review the literature on model selection criteria. In Section 4, we describe the design of the Monte Carlo study. In Section 5, we present and discuss the results. The paper concludes with a summary of the main findings, implications, and suggestions for further research.
2 Finite mixture of Markov chains
Let X_it be the random variable denoting the category (state) of individual i at time t, and x_it a particular realization. We will assume discrete time from 0 to T (t = 0, 1, ..., T). Thus, the vectors X_i and x_i denote the consecutive observations (time series) X_it and x_it, respectively, with t = 0, ..., T. The probability density P(X_i = x_i) = P(X_i0 = x_i0, X_i1 = x_i1, ..., X_iT = x_iT) can be extremely difficult to characterize, due to its possibly huge dimension (T + 1). A common procedure to simplify P(X_i = x_i) is to assume the Markov property, stating that the occurrence of the event X_t = x_t depends only on the previous state X_t−1 = x_t−1; that is, conditional on X_t−1, X_t is independent of the states at the other time points. From the Markov property, it follows that
P(X_i = x_i) = P(X_i0 = x_i0) ∏_{t=1}^{T} P(X_it = x_it | X_i,t−1 = x_i,t−1),   (1)

where P(X_i0 = x_i0) is the initial distribution and P(X_it = x_it | X_i,t−1 = x_i,t−1)
is the probability that individual i is in state x_it at time t, given that he is in state x_i,t−1 at time t − 1. A first-order Markov chain is specified by its transition probabilities and its initial distribution. Hereafter, we denote the initial and transition probabilities as λ_j = P(X_i0 = j) and a_jk = P(X_t = k | X_t−1 = j), respectively. Note that we assume that the transition probabilities are time homogeneous, which means that our model is a stationary first-order Markov model.
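For illustration, equation (1) can be evaluated directly. The following sketch is not part of the original paper; the toy values for λ and the transition matrix are arbitrary.

```python
import numpy as np

def chain_probability(x, lam, A):
    """P(X_i = x_i) for a stationary first-order Markov chain, eq. (1).

    x   -- observed sequence x_i0, ..., x_iT coded as integers 0..K-1
    lam -- initial distribution, lam[j] = P(X_i0 = j)
    A   -- transition matrix, A[j, k] = P(X_t = k | X_{t-1} = j)
    """
    p = lam[x[0]]
    for t in range(1, len(x)):
        p *= A[x[t - 1], x[t]]
    return p

# Toy example with K = 2 states (values are illustrative only).
lam = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(chain_probability([0, 0, 1, 1], lam, A))  # 0.6 * 0.9 * 0.1 * 0.8 = 0.0432
```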
The finite mixture of Markov chains assumes discrete heterogeneity. Individuals are clustered into S segments, each denoted by s (s = 1, ..., S). The clusters, including their number, are not known a priori. Thus, in advance one does not know how the sample will be partitioned into clusters. The component that individual i belongs to is denoted by the latent discrete variable Z_i ∈ {1, 2, ..., S}. Let z = (z_1, ..., z_n). Because z is not observed, the inference problem is to estimate the parameters of the model, say ϕ, using only information on x = (x_1, ..., x_n). More precisely, the estimation procedure has to be based on the marginal distribution of x_i, which is obtained as follows:
P(X_i = x_i; ϕ) = ∑_{s=1}^{S} π_s P(X_i = x_i | Z_i = s).   (2)
This equation defines a finite mixture model with S components. The component proportions π_s = P(Z_i = s; ϕ) correspond to the a priori probability that individual i belongs to segment s, and give the relative segment sizes. Moreover, the π_s satisfy π_s > 0 and ∑_{s=1}^{S} π_s = 1.
Within each latent segment s, the observation x_i is characterized by P(X_i = x_i | Z_i = s) = P(X_i = x_i | Z_i = s; θ_s), which implies that all individuals in segment s have the same probability distribution, defined by the segment-specific parameters θ_s. The parameters of the model are ϕ = (π_1, ..., π_S−1, θ_1, ..., θ_S). The θ_s include the transition and initial probabilities a_sjk = P(X_it = k | X_i,t−1 = j, Z_i = s) and λ_sk = P(X_i0 = k | Z_i = s), respectively. A finite mixture of Markov chains is not itself a Markov chain, which enables the modeling of very complex patterns (see Cadez et al. (2003), Dias and Willekens (2005)). The independent parameters of the model are S − 1 prior probabilities, S(K − 1) initial probabilities, and SK(K − 1) transition probabilities, where K is the number of categories or states. Thus, the total number of independent parameters is SK² − 1. The log-likelihood function for ϕ, given that the x_i are independent, is ℓ_S(ϕ; x) = ∑_{i=1}^{n} log P(X_i = x_i; ϕ), and the maximum likelihood estimator (MLE) is ϕ̂ = arg max_ϕ ℓ_S(ϕ; x). For estimating this model by the EM algorithm, we refer to Dias and Willekens (2005).
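A sketch of the resulting log-likelihood, combining equations (1) and (2), together with the parameter count SK² − 1, is given below; the function and argument names are illustrative only and do not reproduce the estimation code of the paper.

```python
import numpy as np

def mixture_loglik(X, pi, lams, As):
    """Log-likelihood of a finite mixture of Markov chains, eqs. (1)-(2).

    X    -- list of n observed state sequences (integers 0..K-1)
    pi   -- component proportions, pi[s] = P(Z_i = s)
    lams -- lams[s][k]  = P(X_i0 = k | Z_i = s)
    As   -- As[s][j, k] = P(X_it = k | X_i,t-1 = j, Z_i = s)
    """
    loglik = 0.0
    for x in X:
        marginal = 0.0
        for s in range(len(pi)):
            # Component-conditional chain probability, as in eq. (1).
            p = lams[s][x[0]]
            for t in range(1, len(x)):
                p *= As[s][x[t - 1], x[t]]
            marginal += pi[s] * p  # mixture marginal, eq. (2)
        loglik += np.log(marginal)
    return loglik

def n_free_parameters(S, K):
    # (S - 1) + S*(K - 1) + S*K*(K - 1) simplifies to S*K**2 - 1.
    return S * K ** 2 - 1
```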
3 Information criteria for model selection
The traditional approach to selecting the best among different models is to use a likelihood ratio test, which under regularity conditions has a simple asymptotic theory (Wilks (1938)). However, in the context of finite mixture models this approach is problematic. The null hypothesis under test is defined on the boundary of the parameter space, and consequently the Cramer regularity condition on the asymptotic properties of the MLE does not hold. Some recent results have been achieved (see, e.g., Lo et al. (2001)); however, most of these results are difficult to implement and are usually derived for finite mixtures of Gaussian distributions.
As an alternative, information statistics have received much attention recently in finite mixture modeling. These statistics are based on the value of −2ℓ_S(ϕ̂; x) of the model, adjusted for the number of free parameters in the model (and other factors such as the sample size). The basic principle underlying these information criteria is parsimony: all other things being the same (log-likelihood), we choose the simplest model (with fewer parameters). Thus, we select the number S which minimizes the criterion C_S = −2ℓ_S(ϕ̂; x) + dN_S, where N_S is the number of free parameters of the model. For different values of d, we have the Akaike Information Criterion (AIC: Akaike (1974)) (d = 2), the Bayesian Information Criterion (BIC: Schwarz (1978)) (d = log n), and the Consistent Akaike Information Criterion (CAIC: Bozdogan (1987)) (d = log n + 1).
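As a sketch, the criteria reduce to a single penalized-fit computation over the fitted models; AIC3, mentioned in the abstract, corresponds to d = 3. The helper names below are illustrative and not part of the original text.

```python
import numpy as np

def information_criterion(loglik, n_params, d):
    """C_S = -2 * loglik + d * N_S for a fitted S-component model."""
    return -2.0 * loglik + d * n_params

def aic(loglik, n_params):
    return information_criterion(loglik, n_params, 2.0)

def aic3(loglik, n_params):
    # AIC3 penalizes each free parameter by 3 instead of 2.
    return information_criterion(loglik, n_params, 3.0)

def bic(loglik, n_params, n):
    return information_criterion(loglik, n_params, np.log(n))

def caic(loglik, n_params, n):
    return information_criterion(loglik, n_params, np.log(n) + 1.0)

# The selected number of components S is the one whose fitted model
# minimizes the chosen criterion over the candidate values of S.
```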