Picture fuzzy clustering for complex data
VNU University of Science, Vietnam National University, 334 Nguyen Trai, Thanh Xuan, Hanoi, Viet Nam
Article info
Article history:
Received 24 April 2016
Received in revised form 9 August 2016
Accepted 9 August 2016
Keywords:
Complex data
Distinct structured data
Fuzzy clustering
Mix data type
Picture fuzzy clustering
Abstract: Fuzzy clustering is a useful segmentation tool that has been widely applied to real-life problems such as pattern recognition, recommender systems, forecasting, etc. Fuzzy clustering on picture fuzzy sets (FC-PFS) is an advanced fuzzy clustering algorithm constructed on the basis of picture fuzzy sets, in which three membership degrees, namely the positive, the neutral and the refusal degrees, are combined with an entropy component in the objective function to handle the problem of incomplete modeling in fuzzy clustering. A disadvantage of FC-PFS is its limited capability to handle complex data, which include mixed data types (categorical and numerical data) and distinct structured data. In this paper, we propose a novel picture fuzzy clustering algorithm for complex data called PFCA-CD that deals with both mixed data types and distinct data structures. The idea of this method is the modification of FC-PFS using a new measurement for categorical attributes, multiple centers of one cluster and an evolutionary strategy, particle swarm optimization. Experiments indicate that the proposed algorithm results in better clustering quality than others through clustering validity indices.
© 2016 Elsevier Ltd. All rights reserved.
1. Introduction
Fuzzy clustering is used for partitioning a dataset into clusters where each element in the dataset can belong to all clusters with different membership values (Bezdek et al., 1984). The most popular fuzzy clustering algorithm is Fuzzy C-Means (FCM). This algorithm is based on the idea of K-Means clustering, with membership values attached to the objective function for partitioning all data elements in the dataset. This makes FCM perform better than the K-Means algorithm, especially on overlapping and uncertain datasets (Bezdek et al., 1984). Moreover, FCM has many applications to real-life problems such as pattern recognition, recommender systems, forecasting, etc. (Son et al., 2012a, 2012b; Son et al., 2013, 2014; Thong and Son, 2014; Son, 2014a, 2014b; Son and Thong, 2015; Thong and Son, 2015; Son, 2015b, 2015c, 2016; Son and Tuan, 2016; Son and Hai, 2016; Wijayanto et al., 2016; Tuan et al., 2016; Thong et al., 2016).
However, FCM still has some limitations regarding clustering quality, hesitation, noise and outliers (Ferreira and de Carvalho, 2012; De Carvalho et al., 2013; Thong and Son, 2016b). There have been many researches proposed to overcome these limitations; one of them is innovating FCM on advanced fuzzy sets such as type-2 fuzzy sets (Mendel and John, 2002), intuitionistic fuzzy sets (Atanassov, 1986) and picture fuzzy sets (Cuong, 2014). Fuzzy clustering on picture fuzzy sets (FC-PFS) (Thong and Son, 2016a) is an extension of FCM with the appearance of three membership degrees of picture fuzzy sets, namely the positive, the neutral and the refusal degrees, combined with an entropy component in the objective function to handle the problem of incomplete modeling in fuzzy clustering. FC-PFS has been shown to have better accuracy than other fuzzy clustering algorithms (Thong and Son, 2016a).
However, a disadvantage of the FC-PFS algorithm, extracted from our experiments through various scenarios, is its limited capability to handle complex data, which include mixed data types and distinct structured data. Mixed data are known as categorical and numerical data, which occur together in many real datasets (Ferreira and de Carvalho, 2012). Distinct structured data contain non-sphere structured data, such as data scattered along a linear line or a ring, that prevent clustering algorithms from partitioning data elements into exact clusters. Most fuzzy clustering methods cannot handle such data well. There have been many researches on developing new fuzzy clustering algorithms that employ dissimilarity distances and kernel functions (Hwang, 1998; Ji et al., 2013, 2012). However, they solved either mixed data types or distinct structured data but not both, and this
http://dx.doi.org/10.1016/j.engappai.2016.08.009
Corresponding author. E-mail addresses: thongph@vnu.edu.vn (P.H. Thong), sonlh@vnu.edu.vn, chinhson2002@gmail.com (L.H. Son).
leaves the motivation for this paper to work on.
In this paper, we propose a novel picture fuzzy clustering algorithm for complex data called PFCA-CD that deals with both mixed data types and distinct data structures. The idea of this method is the modification of FC-PFS, using a new measurement for categorical attributes, multiple centers of one cluster and an evolutionary strategy, particle swarm optimization. Experiments indicate that the proposed algorithm results in better clustering quality than others through clustering validity indices.
The rest of the paper is organized as follows. Section 2 describes the background with a literature review and some particular methods for clustering complex data. Section 3 presents the proposed method. Section 4 validates the algorithms on benchmark UCI datasets. Finally, conclusions and further works are covered in the last section.
2. Background
This section presents related methods for clustering complex data in Section 2.1. Sections 2.2–2.3 review two typical methods of this approach.
2.1. Literature review
The related works for clustering complex data are divided into two groups: mixed types of data, including categorical and numerical data, and distinct structures of data (Fig. 1). In the first group, Hwang (1998) extended the k-means algorithm for clustering large datasets including categorical values. Yang et al. (2004) used fuzzy clustering algorithms to partition mixed feature variables. Ji et al. (2012, 2013) proposed fuzzy k-prototype clustering algorithms combining the mean and fuzzy centroid to represent the prototype of a cluster and employing a new measure, based on co-occurrence of values, to evaluate the dissimilarity between data objects and prototypes of clusters. Chen et al. (2016) presented a soft subspace clustering of categorical data using a novel soft feature-selection scheme that automatically assigns each categorical attribute a weight correlated with the smoothed dispersion of the categories in a cluster. A series of methods based on multiple dissimilarity matrices to handle mixed data was introduced by De Carvalho et al. (2013). The main idea of these methods was to obtain a collaborative role of the different dissimilarity matrices to handle complex distinct structures of data.
In the second group, many researchers tried to partition complex structures of data which have an intrinsic non-sphere geometry. Cominetti et al. (2010) proposed a method called DifFuzzy combining ideas from FCM and diffusion on graphs to handle the problem of clusters with a complex nonlinear geometric structure. This method is applicable to a larger class of clustering problems which do not require any prior knowledge of the cluster structure. Ferreira and de Carvalho (2012) presented kernel fuzzy clustering methods based on local adaptive distances to partition complex data. The main idea of these methods is a local adaptive distance where dissimilarity measures are obtained as sums of the Euclidean distances between patterns and centroids, computed individually for each variable by means of kernel functions. The dissimilarity measure is utilized to learn the weights of the variables during the clustering process, which improves the performance of the algorithms. However, this method can deal with numerical data only.
It has been shown that the DifFuzzy algorithm (Cominetti et al., 2010) and the fuzzy clustering based on multiple dissimilarity matrices (Dissimilarity) (De Carvalho et al., 2013) are two typical clustering methods in each group. Therefore, we will analyze these methods in more detail in the next sections.

2.2. DifFuzzy

The DifFuzzy clustering algorithm (Cominetti et al., 2010) is based on FCM and diffusion on graphs to partition the dataset into clusters with a complex nonlinear geometric structure. Firstly, the auxiliary function F(σ) is defined:
F(σ) = the number of connected components of the σ-neighborhood graph that contain at least M vertices, (1)

where σ ∈ (0, ∞) is a positive number, the i-th and j-th nodes are connected by an edge if ‖X_i − X_j‖ < σ, and M is the mandatory parameter of the method. As σ grows, F(σ) rises from 0 to its maximum value, before settling back down to a value of 1; the number of clusters is taken as this maximum:

C = max_{σ ∈ (0, ∞)} F(σ). (2)

The points that belong to one of the C components at this maximal value are called hard points, grouped in C core clusters; the remaining points are soft points. Next, the weights ŵ_{i,j}(β) are defined:

ŵ_{i,j}(β) = 1, if i and j are hard points in the same core cluster; ŵ_{i,j}(β) = exp(−‖X_i − X_j‖² / β), otherwise, (3)

and the auxiliary function L(β): (0, ∞) → (0, ∞) is:

L(β) = Σ_{i=1}^{N} Σ_{j=1}^{N} ŵ_{i,j}(β). (4)

It has two well-defined limits:

lim_{β→0} L(β) = Σ_{i=1}^{C} N_i², lim_{β→∞} L(β) = N², (5)

where N_i is the number of hard points in the i-th core cluster. The parameter β* is chosen from the relation:

L(β*) = (1 − γ₁) Σ_{i=1}^{C} N_i² + γ₁ N², (6)

where γ₁ ∈ (0, 1) is an internal parameter of the method; its default value is 0.3. Then the auxiliary matrices are defined as follows. The matrix W has entries

ω_{i,j} = ŵ_{i,j}(β*). (7)

Fig. 1. Classification of methods dealing with complex data: mixed data types (categorical and numerical data) and distinct structures of data (different distributions of data).
The matrix D is defined as a diagonal matrix with diagonal elements

d_{i,i} = Σ_{j=1}^{N} ω_{i,j}, (8)

where ω_{i,j} are the entries of matrix W. Finally, the matrix P is defined as

p_{i,j} = (ω_{i,j} + γ₂ I_{i,j}) / (d_{i,i} + γ₂), (9)

where I ∈ R^{N×N} is the identity matrix and γ₂ is an internal parameter of DifFuzzy; its default value is 0.1. DifFuzzy also computes a number of diffusion steps from the spectrum of P using a third internal parameter γ₃ (Eqs. (10)–(11) of Cominetti et al., 2010). To measure the diffusion distance dist(X_s, c) between a soft point X_s and each cluster, the indicator vector e with e(j) = 1 if j = s, and e(j) = 0 otherwise, is propagated by powers of P. Finally, the membership value of the soft point X_s in the c-th cluster, u_c(X_s), is determined as

u_c(X_s) = dist(X_s, c)^{−1} / Σ_{l=1}^{C} dist(X_s, l)^{−1}. (12)

This procedure is applied to every soft data point X_s and every cluster c ∈ {1, 2, …, C}. The output of DifFuzzy is a set of values that represent the degree of membership in each cluster. The membership value of X_i, i = 1, 2, …, N, in the c-th cluster, c = 1, 2, …, C, is denoted by u_c(X_i). The degree of membership is a number between 0 and 1, where values close to 1 correspond to points that are very likely to belong to that cluster. The sum of the membership values of a data point in all clusters is always one.
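As an illustration, the σ-neighborhood graph and the component-counting function F(σ) described above can be sketched as follows. This is an illustrative sketch of Eq. (1), not the authors' code; the function name and the array layout are assumptions.

```python
import numpy as np

def F(points, sigma, M):
    """Number of connected components of the sigma-neighborhood graph
    that contain at least M vertices (in the spirit of Eq. (1))."""
    n = len(points)
    # Nodes i and j are connected by an edge iff ||X_i - X_j|| < sigma.
    adj = [[j for j in range(n)
            if j != i and np.linalg.norm(points[i] - points[j]) < sigma]
           for i in range(n)]
    seen, count = set(), 0
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal of one connected component.
        comp_size, stack = 0, [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            comp_size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if comp_size >= M:
            count += 1
    return count
```

Sweeping σ and recording the maximum of F(σ) then yields the number of clusters C of Eq. (2).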
2.3. Dissimilarity

The Dissimilarity algorithm (De Carvalho et al., 2013) is a fuzzy K-medoids with a relevance weight for each dissimilarity matrix, consisting of the 5 steps below.

2.3.1. Initialization
Fix C (the number of clusters), 2 ≤ C ≪ n; fix m, 1 < m < +∞; fix s, 1 ≤ s < +∞; fix T (an iteration limit); fix ε > 0 and ε ≪ 1. Fix the cardinality 1 ≤ q ≪ n of the prototypes G_k (k = 1, …, C). Set t = 0. Set λ_k^(0) = (γ_{k1}^(0), …, γ_{kr}^(0)) = (1, …, 1) or set λ_k^(0) = (γ_{k1}^(0), …, γ_{kr}^(0)) = (1/r, …, 1/r), k = 1, …, C. Randomly select C distinct prototypes G_k^(0) ∈ E^(q), k = 1, …, C.
For each object e_i (i = 1, …, n), compute its membership degree u_{ik}^(0) (k = 1, …, C) in fuzzy cluster C_k:
u_{ik}^(0) = [ Σ_{h=1}^{C} ( Σ_{j=1}^{r} (γ_{kj}^(0))^s d_j(e_i, G_k^(0)) / Σ_{j=1}^{r} (γ_{hj}^(0))^s d_j(e_i, G_h^(0)) )^{1/(m−1)} ]^{−1}, (13)

where d_j(e_i, G_k^(0)) = Σ_{e ∈ G_k^(0)} d_j(e_i, e) is the dissimilarity between object e_i and prototype G_k^(0) according to the j-th dissimilarity matrix. The initial value of the adequacy criterion is

J^(0) = Σ_{k=1}^{C} Σ_{i=1}^{n} (u_{ik}^(0))^m Σ_{j=1}^{r} (γ_{kj}^(0))^s d_j(e_i, G_k^(0)). (14)
2.3.2. Compute the best prototypes
The vector of relevance weights Λ^(t−1) = (λ_1^(t−1), …, λ_C^(t−1)) and the fuzzy partition represented by u^(t−1) are fixed. The prototype G_k^(t) = G* ∈ E^(q) of fuzzy cluster C_k (k = 1, …, C) is chosen to minimize the clustering criterion J: Σ_{i=1}^{n} (u_{ik})^m Σ_{j=1}^{r} (γ_{kj})^s d_j(e_i, G*) → Min.
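Because the criterion is additive over the objects in a prototype set, the minimizing set of q medoids can be picked greedily per object. The following is an illustrative sketch under that observation, not the code of De Carvalho et al.; the array layout is an assumption.

```python
import numpy as np

def best_prototypes(D, u, weights, m, s, q):
    """For each cluster, pick the q objects minimizing the criterion
    sum_i (u_ik)^m * sum_j (gamma_kj)^s * d_j(e_i, G) of Section 2.3.2.
    D: (r, n, n) dissimilarity matrices; u: (n, C) memberships;
    weights: (C, r) relevance weights; m, s: exponents; q: cardinality."""
    r, n, _ = D.shape
    C = u.shape[1]
    G = []
    for k in range(C):
        # cost[g] = contribution to the criterion if object g is a prototype
        cost = np.zeros(n)
        for j in range(r):
            cost += (weights[k, j] ** s) * ((u[:, k] ** m) @ D[j])
        # The q cheapest objects form the prototype set G_k.
        G.append([int(g) for g in np.argsort(cost)[:q]])
    return G
```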
2.3.3. Compute the best relevance weights
When the vector of prototypes G^(t) = (G_1^(t), …, G_C^(t)) and the fuzzy partition u^(t−1) are fixed, the components γ_{kj}^(t) (j = 1, …, r) of the relevance weight vector λ_k^(t) (k = 1, …, C) are computed as in Eq. (15) or Eq. (17), according to whether the weights are constrained by Eq. (16) or Eq. (18), respectively:

γ_{kj}^(t) = [ Π_{h=1}^{r} ( Σ_{i=1}^{n} (u_{ik})^m d_h(e_i, G_k^(t)) ) ]^{1/r} / Σ_{i=1}^{n} (u_{ik})^m d_j(e_i, G_k^(t)), (15)

under the product constraint (with s = 1)

Π_{j=1}^{r} γ_{kj}^(t) = 1, k = 1, …, C; (16)

γ_{kj}^(t) = [ Σ_{h=1}^{r} ( Σ_{i=1}^{n} (u_{ik})^m d_j(e_i, G_k^(t)) / Σ_{i=1}^{n} (u_{ik})^m d_h(e_i, G_k^(t)) )^{1/(s−1)} ]^{−1}, (17)

under the sum constraint (with s > 1)

Σ_{j=1}^{r} γ_{kj}^(t) = 1, k = 1, …, C. (18)
2.3.4. Define the best fuzzy partition
The vector of prototypes G^(t) = (G_1^(t), …, G_C^(t)) and the vector of relevance weights Λ^(t) = (λ_1^(t), …, λ_C^(t)) are fixed. The membership degree u_{ik}^(t) of object e_i (i = 1, …, n) in fuzzy cluster C_k (k = 1, …, C) is calculated as in Eq. (19):

u_{ik}^(t) = [ Σ_{h=1}^{C} ( Σ_{j=1}^{r} (γ_{kj}^(t))^s d_j(e_i, G_k^(t)) / Σ_{j=1}^{r} (γ_{hj}^(t))^s d_j(e_i, G_h^(t)) )^{1/(m−1)} ]^{−1}. (19)
2.3.5. Stopping criterion
Compute

J^(t) = Σ_{k=1}^{C} Σ_{i=1}^{n} (u_{ik}^(t))^m Σ_{j=1}^{r} (γ_{kj}^(t))^s d_j(e_i, G_k^(t)). (20)

If |J^(t) − J^(t−1)| ≤ ε or t ≥ T: STOP; otherwise go to step 2.3.2.
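The membership update of Eq. (19) can be sketched as follows, shown with exponent s = 1 for the weights. This is an illustrative sketch, not the authors' code; the array layout and the small epsilon guarding zero dissimilarities are assumptions.

```python
import numpy as np

def update_memberships(D, prototypes, weights, m):
    """Fuzzy memberships of n objects in C clusters from r dissimilarity
    matrices, in the spirit of Eq. (19).
    D: array (r, n, n); prototypes: list of C medoid index-lists G_k;
    weights: array (C, r) of relevance weights gamma_kj; m: fuzzifier."""
    r, n, _ = D.shape
    C = len(prototypes)
    # d[i, k] = sum_j gamma_kj * d_j(e_i, G_k), where d_j(e_i, G_k) is the
    # sum of j-th dissimilarities from e_i to the medoids of cluster k.
    d = np.zeros((n, C))
    for k, Gk in enumerate(prototypes):
        for j in range(r):
            d[:, k] += weights[k, j] * D[j][:, Gk].sum(axis=1)
    d += 1e-12  # guard against zero dissimilarity (an assumption)
    u = np.empty((n, C))
    for i in range(n):
        for k in range(C):
            u[i, k] = 1.0 / sum((d[i, k] / d[i, h]) ** (1.0 / (m - 1.0))
                                for h in range(C))
    return u
```

Each row of the returned matrix sums to one, matching the fuzzy-partition constraint of the method.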
2.4. Particle swarm optimization
Particle swarm optimization (PSO) (Eberhart and Kennedy, 1995) is an evolutionary strategy that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. It simulates a swarm of particles, each of which represents one solution of the problem and is encoded with its location (loc) and velocity (vec). The PSO procedure consists of these steps: initializing the swarm, calculating fitness values, and updating locations and velocities. Firstly, the location and velocity of each particle are initiated randomly. The fitness value is designed to assess the quality of a solution. Finally, the update process is demonstrated in Eqs. (21) and (22):

vec_i = vec_i + C1 · rand · (loc_{Pbest} − loc_i) + C2 · rand · (loc_{Gbest} − loc_i), (21)

loc_i = loc_i + vec_i, (22)

where C1, C2 ≥ 0 are PSO's parameters; generally, C1 and C2 are often set to 1. loc_{Pbest} is the location at which particle i has its best current solution, and loc_{Gbest} is the location of the best solution of the whole swarm. The whole process is repeated until the number of iterations has been reached or the best solution has not changed in two consecutive steps. Details of this method are described in Fig. 2.
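A minimal sketch of the update rules (21)–(22) on a toy objective follows; the function and parameter names are illustrative assumptions, and the velocity clamp is a common practical addition that is not part of Eq. (21).

```python
import random

def pso_minimize(f, dim, n_particles=20, n_iters=200, c1=1.0, c2=1.0, seed=1):
    """Minimal PSO sketch following Eqs. (21)-(22): each velocity is pulled
    toward the particle's best location (Pbest) and the swarm's best (Gbest)."""
    rng = random.Random(seed)
    loc = [[rng.uniform(-5.0, 5.0) for _ in range(dim)]
           for _ in range(n_particles)]
    vec = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in loc]
    pbest_val = [f(p) for p in loc]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                # Eq. (21): velocity update with random attraction terms.
                vec[i][d] += (c1 * rng.random() * (pbest[i][d] - loc[i][d])
                              + c2 * rng.random() * (gbest[d] - loc[i][d]))
                vec[i][d] = max(-2.0, min(2.0, vec[i][d]))  # velocity clamp
                loc[i][d] += vec[i][d]                      # Eq. (22)
            val = f(loc[i])
            if val < pbest_val[i]:      # record the personal best (Pbest)
                pbest[i], pbest_val[i] = loc[i][:], val
                if val < gbest_val:     # record the global best (Gbest)
                    gbest, gbest_val = loc[i][:], val
    return gbest, gbest_val
```

For instance, minimizing the sphere function f(x) = Σ_d x_d² drives the swarm toward the origin.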
3. The proposed method

In this section, the novel picture fuzzy clustering method for complex data (PFCA-CD) is presented. The idea of this method is to overcome complex structures and mixed data by enhancing FC-PFS with multiple centers, a measurement for categorical attributes and an evolutionary strategy with particle swarm optimization. Specifically, the multiple centers are used to deal with complex structures of data, because data with complex structures have many different shapes that cannot be represented by one center. The centers are alternatively selected from the data elements because of the categorical data. Moreover, categorical attributes cannot be measured in the same way as numerical attributes; therefore, new measurements are used to cope with mixed data. The subsections are as follows: Section 3.1 describes details of FC-PFS, Section 3.2 proposes a new measurement for categorical attributes, and Section 3.3 presents the PFCA-CD algorithm.
3.1. Fuzzy clustering on picture fuzzy sets
A picture fuzzy set (PFS) Ȧ on a universe X is defined as (Cuong, 2014):

Ȧ = {⟨x, μ_Ȧ(x), η_Ȧ(x), γ_Ȧ(x)⟩ | x ∈ X}, (23)

where μ_Ȧ(x) is the positive degree of each element x ∈ X, η_Ȧ(x) is the neutral degree and γ_Ȧ(x) is the negative degree, satisfying the constraints

μ_Ȧ(x), η_Ȧ(x), γ_Ȧ(x) ∈ [0, 1], ∀x ∈ X, (24)

μ_Ȧ(x) + η_Ȧ(x) + γ_Ȧ(x) ≤ 1, ∀x ∈ X. (25)

The refusal degree of an element is:

ξ_Ȧ(x) = 1 − (μ_Ȧ(x) + η_Ȧ(x) + γ_Ȧ(x)), ∀x ∈ X. (26)

Thong and Son (2016a) proposed a picture fuzzy model for the clustering problem called FC-PFS, which was proven to achieve better clustering quality than other relevant algorithms. Suppose X = {X_1, …, X_N} is a dataset of N data points in r dimensions. The objective function for dividing the dataset into C groups is:

J = Σ_{k=1}^{N} Σ_{j=1}^{C} (μ_kj (2 − ξ_kj))^m ‖X_k − V_j‖² + Σ_{k=1}^{N} Σ_{j=1}^{C} η_kj (log η_kj + ξ_kj) → min, (27)

subject to

μ_kj, η_kj, ξ_kj ∈ [0, 1], (28)

μ_kj + η_kj + ξ_kj ≤ 1, (29)

Σ_{j=1}^{C} μ_kj (2 − ξ_kj) = 1, ∀k, (30)

Σ_{j=1}^{C} (η_kj + ξ_kj / C) = 1, ∀k. (31)

By using the Lagrangian method, the authors determined the optimal solutions of the model in Eqs. (32)–(35):

ξ_kj = 1 − (μ_kj + η_kj) − (1 − (μ_kj + η_kj)^α)^{1/α}, (32)

where α ∈ (0, 1] is an exponent coefficient used to control the refusal degree in PFS sets;

μ_kj = 1 / ( (2 − ξ_kj) Σ_{i=1}^{C} ( ‖X_k − V_j‖ / ‖X_k − V_i‖ )^{2/(m−1)} ), (33)

η_kj = ( e^{−ξ_kj} / Σ_{i=1}^{C} e^{−ξ_ki} ) ( 1 − (1/C) Σ_{i=1}^{C} ξ_ki ), (34)

V_j = Σ_{k=1}^{N} (μ_kj (2 − ξ_kj))^m X_k / Σ_{k=1}^{N} (μ_kj (2 − ξ_kj))^m. (35)
Details of FC-PFS are described in Table 1.

3.2. A new measurement for categorical attributes
Suppose that d_h(x_i, x_j) is the distance between elements x_i and x_j on attribute h (i = 1, …, N; j = 1, …, N; h = 1, …, R). If the h-th attribute is numerical, d_h(x_i, x_j) is calculated based on the Euclidean distance. Otherwise, if the h-th attribute is categorical, d_h(x_ih, x_jh) is calculated by Eq. (36):

d_h(x_ih, x_jh) = 0, if x_ih = x_jh; 1, otherwise. (36)

This means that the data input has to be normalized into the range [0, 1], with 0 corresponding to the minimum distance between two objects and 1 corresponding to the maximum one. In Eq. (36), if two categorical values are not equal, the distance between them is the maximum one.
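Eq. (36), combined with the squared Euclidean difference for numerical attributes, can be sketched as follows. This is an illustrative sketch; the function signature and the convention of passing categorical attribute indices as a set are assumptions.

```python
def mixed_distance(x, y, categorical):
    """Distance between two objects with mixed attributes: squared
    Euclidean difference for numerical attributes (assumed normalized to
    [0, 1]) and the 0/1 matching distance of Eq. (36) for categorical
    ones. `categorical` is the set of categorical attribute indices."""
    total = 0.0
    for h, (a, b) in enumerate(zip(x, y)):
        if h in categorical:
            total += 0.0 if a == b else 1.0   # Eq. (36)
        else:
            total += (a - b) ** 2             # numerical component
    return total
```

On normalized inputs, both attribute kinds then contribute on the same [0, 1] scale per attribute, which is exactly why the normalization requirement above matters.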
3.3. The PFCA-CD algorithm
In order to partition datasets with mixed data types and distinct data structures, we combine FC-PFS with PSO as follows.

Table 1. FC-PFS.
Fuzzy clustering method on picture fuzzy sets.
I: Dataset X of N elements in r dimensions; number of clusters C; fuzzifier m; threshold ε; maximum number of iterations maxSteps > 0.
O: Matrices μ, η, ξ and centers V.
FC-PFS:
1: t = 0
2: μ_kj^(t) ← random; η_kj^(t) ← random; ξ_kj^(t) ← random (k = 1, …, N; j = 1, …, C) satisfying Eqs. (28)–(29)
3: Repeat
4: t = t + 1
5: Calculate V_j^(t) (j = 1, …, C) by Eq. (35)
6: Calculate μ_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (33)
7: Calculate η_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (34)
8: Calculate ξ_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (32)
9: Until ‖μ^(t) − μ^(t−1)‖ + ‖η^(t) − η^(t−1)‖ + ‖ξ^(t) − ξ^(t−1)‖ ≤ ε or maxSteps has been reached
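One sweep of steps 5–8 of Table 1 can be sketched with NumPy as follows. This is a sketch of Eqs. (32)–(35), not the authors' implementation; the small constant guarding zero distances is an assumption.

```python
import numpy as np

def fcpfs_iteration(X, mu, eta, xi, m=2.0, alpha=0.6):
    """One FC-PFS sweep (steps 5-8 of Table 1): centers by Eq. (35),
    then mu, eta and xi by Eqs. (33), (34) and (32)."""
    # Eq. (35): weighted centers with weights (mu_kj * (2 - xi_kj))^m.
    w = (mu * (2.0 - xi)) ** m
    V = (w.T @ X) / w.sum(axis=0)[:, None]
    # Pairwise point-to-center distances; the epsilon guards against a
    # data point coinciding exactly with a center.
    dist = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
    # Eq. (33): mu_kj = 1 / ((2 - xi_kj) * sum_i (d_kj / d_ki)^(2/(m-1))).
    ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
    mu_new = 1.0 / ((2.0 - xi) * ratio.sum(axis=2))
    # Eq. (34): normalized exp(-xi) scaled by (1 - mean_i xi_ki).
    e = np.exp(-xi)
    eta_new = (e / e.sum(axis=1, keepdims=True)
               * (1.0 - xi.mean(axis=1, keepdims=True)))
    # Eq. (32): refusal degree from the updated mu and eta.
    s = np.clip(mu_new + eta_new, 0.0, 1.0)
    xi_new = 1.0 - s - (1.0 - s ** alpha) ** (1.0 / alpha)
    return V, mu_new, eta_new, xi_new
```

Iterating this sweep until the change in (μ, η, ξ) falls below ε reproduces the loop of Table 1.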
Suppose that dataset X contains mixed numerical and categorical data with N elements in r dimensions. Instead of using the iteration of FC-PFS, a PSO iteration is employed. The initial population of PSO is encoded as P = {p_1, p_2, …, p_popsize}, where each particle consists of the following components:

– (μ_kj, η_kj, ξ_kj): the positive, neutral and refusal degrees of elements in X, respectively;
– (μ_kj^Pbest, η_kj^Pbest, ξ_kj^Pbest): the positive, neutral and refusal degrees of the particle's best solution;
– V_j and V_j^Pbest: the sets of cluster centers corresponding to (μ_kj, η_kj, ξ_kj) and (μ_kj^Pbest, η_kj^Pbest, ξ_kj^Pbest), respectively;
– Pbest_i: the best fitness value that the particle achieves.

A particle starts from given values of (μ_kj, η_kj, ξ_kj) and tries to improve its solution iteratively. The centers are chosen from the data elements (Table 2) and used to calculate the optimization function (27) as in Eq. (37):
J = Σ_{k=1}^{N} Σ_{j=1}^{C} (μ_kj (2 − ξ_kj))^m d(X_k, V_j) + Σ_{k=1}^{N} Σ_{j=1}^{C} η_kj (log η_kj + ξ_kj), (37)

where d(X_k, V_j) is the distance of Section 3.2 from element X_k to the (multiple) centers of cluster j. The fitness value is compared with the current status of the particle. If the achieved solutions are better than the previous ones, the local optimal solutions Pbest (μ_kj^Pbest, η_kj^Pbest, ξ_kj^Pbest, V_j^Pbest) of the particle are recorded. Then, evolution of the particle is made by changing the values of (μ_kj, η_kj, ξ_kj and V_j). In the evolution, (μ_kj, η_kj) are calculated by Eqs. (33)–(34) and then moved toward the personal and global best solutions in the PSO manner of Eqs. (21)–(22), giving Eqs. (38)–(39), where (μ^Gbest, η^Gbest, ξ^Gbest and V^Gbest) are the best values of the whole population. Details are given in the procedure in Table 2.
The evolution of all particles is continued until a number of iterations has been reached or the global best is unchanged; each particle then holds suitable values of its clustering centers and membership matrices.

3.4. Remarks
The proposed method has some advantages:

– The proposed method uses multiple centers for each cluster, so that a cluster whose data elements scatter in an un-sphered, distinct structure can be easily represented by these centers.
– The proposed method, employing FC-PFS with the PSO strategy, can enhance the convergence process.
– The proposed method employs a new measurement for categorical attribute values that is appropriate for calculating the distance between two objects.

However, this method still has some limitations:

– The use of the PSO algorithm may result in good solutions, but they are not guaranteed to be the best ones.
– The computational time for the PSO strategy is quite high. The complexity is about O(N² × C) for one particle and one loop, where N is the number of elements in the dataset; numSteps and popsize are the number of iterations and the number of particles, respectively. Because popsize and C are always small, the complexity of the proposed algorithm is about O(N² × numSteps). If numSteps = 1, the complexity of the algorithm is only O(N²). In large datasets, however, the computational time of the proposed algorithm may be the highest.
4. Experiments

4.1. Materials and system configuration
The following benchmark datasets of the UCI Machine Learning Repository (University of California, 2007) are used for the validation of the performance of the algorithms (Table 4). They consist of seven datasets with different sizes, numbers of attributes and numbers of classes. The largest dataset is ABALONE, including 4177 elements and 8 numerical attributes. The dataset containing the most attributes is AUTOMOBILE, with 15 numerical and 10 categorical attributes. In the experiments, we do not normalize the datasets. The aim is to verify the quality of the clustering algorithms from small to large sizes and mixed data types (numerical and categorical attributes). In order to assess the quality, the number of classes in each dataset is used as the 'correct' number of clusters. In addition to the DifFuzzy algorithm (Cominetti et al., 2010) and the Dissimilarity algorithm (De Carvalho et al., 2013), we implemented the methods in the same programming language and executed them on a Linux Cluster 1350 with eight computing nodes of 51.2 GFlops. Each node contains two Intel Xeon dual-core 3.2 GHz CPUs and 2 GB RAM. The experimental results are taken as the average values after 50 runs.
Cluster validity measurement: Mean Accuracy (MA), the Davies-Bouldin (DB) index (Davies and Bouldin, 1979), the Rand index (RI) and the Alternative Silhouette Width Criterion (ASWC) (Vendramin et al., 2010) are used to evaluate the quality of the solutions of the clustering algorithms. The DB index is shown below:
DB = (1/C) Σ_{i=1}^{C} max_{j: j≠i} { (S_i + S_j) / M_ij }, (40)

S_i = ( (1/T_i) Σ_{j=1}^{T_i} ‖x_j − V_i‖² )^{1/2}, (41)
Table 2. Choosing centers for clusters.
Determining the centers for cluster j (V_j):
1: V_j = ∅
2: Repeat
3: Find x_h ∉ V_j, x_h ∈ X (h = 1, …, N), that minimizes the weighted cost Σ_{k=1}^{N} (μ_kj (2 − ξ_kj))^m ‖X_k − x_h‖²
4: V_j = V_j ∪ {x_h}
5: Until every remaining x_h ∉ V_j has a weighted cost Σ_{k=1}^{N} (μ_kj (2 − ξ_kj))^m ‖X_k − x_h‖² greater than those of the centers already selected
M_ij = ‖V_i − V_j‖ (i, j = 1, …, C; i ≠ j), (42)

where T_i is the size of the i-th cluster, S_i is a measure of scatter within the cluster, and M_ij is a measure of separation between clusters i and j. A minimum value indicates better performance for the DB index. The Rand index is

RI = (a + d) / (a + b + c + d), (43)

where a (b) is the number of pairs of data points belonging to the same class in R and to the same (different) cluster in Q, with R and Q being the reference partition and the computed clustering, and c (d) is the number of pairs belonging to different classes in R and to the same (different) cluster. The larger the Rand index, the better. The Alternative Silhouette Width Criterion (ASWC) is invoked to measure the clustering quality:
ASWC = (1/N) Σ_{i=1}^{N} s(x_i), (44)

s(x_i) = b_{p,i} / (a_{p,i} + ε), (45)

where a_{p,i} is the average distance of element i to all other elements in its cluster p, b_{p,i} is the average distance of element i to all other elements in the nearest neighboring cluster, and ε (a small constant for normalized data) is used to avoid division by zero when a_{p,i} = 0. A maximum value indicates better performance for the ASWC index.

Fig. 3. Schema of PFCA-CD.
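The DB index computation can be sketched as follows. This is a sketch using the average distance to the center as the within-cluster scatter (the q = 1 variant of Eq. (41)), not the authors' code.

```python
import numpy as np

def davies_bouldin(X, labels, centers):
    """Davies-Bouldin index in the spirit of Eqs. (40)-(42): the average,
    over clusters, of the worst ratio of within-cluster scatter to
    between-center separation. Lower values indicate better clustering."""
    C = len(centers)
    # S_i: average distance of the members of cluster i to its center.
    S = np.array([np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
                  for i in range(C)])
    db = 0.0
    for i in range(C):
        # M_ij: distance between centers i and j (Eq. (42)).
        ratios = [(S[i] + S[j]) / np.linalg.norm(centers[i] - centers[j])
                  for j in range(C) if j != i]
        db += max(ratios)
    return db / C
```

Two tight, well-separated clusters therefore score a small DB value, consistent with "the minimum value indicates better performance".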
Parameters: Particularly for PFCA-CD, we set C1 = C2 = 1, α = 0.6 and ε = 10^{−3} (Thong and Son, 2016a).
Objectives: We aim to evaluate the clustering qualities of the algorithms through validity indices. Some experiments with various cases of parameters are also considered.
4.2. Results and discussion

Table 5 indicates the average validity index values of the algorithms. It can be seen that the proposed PFCA-CD algorithm has better clustering quality, based on validity indices, than the others. In most cases, the proposed algorithm has at least one best validity index value. For instance, in the AUTOMOBILE and SERVO datasets, the values for PFCA-CD are better in the MA, DB and RI indices than those of DifFuzzy and Dissimilarity. In AUTOMOBILE they are 93.157 (MA), 5.319 (DB) and 69.458 (RI), compared to 33.333 (MA), 6.437 (DB) and 63.912 (RI) for DifFuzzy and 82.146 (MA), 8.721 (DB) and 64.371 (RI) for Dissimilarity. Fig. 4 shows the MA and RI values of the algorithms over the different datasets. For most datasets, the proposed method has better values than those of DifFuzzy and Dissimilarity.
Fig. 5 shows the ASWC and DB values of all algorithms on the different datasets. It can be seen that the proposed algorithm results in smaller DB values than the others on the STATLOG, SERVO and AUTOMOBILE datasets. In ASWC, the proposed algorithm is better on the IRIS and GLASS datasets, which are purely numerical. This suggests that ASWC may not be suitable for complex data.
Table 6 shows the number of times each algorithm reached the best value. The proposed method ranks first with 12 values, Dissimilarity ranks second with 9 values, and the remaining algorithm, DifFuzzy, has 3. The standard deviation (std) of the validity index changes is presented in Table 7.
In Table 7, the std values of the validity indices for the DifFuzzy algorithm do not change from run to run because this algorithm does not employ a heuristic strategy. The std values of PFCA-CD change less than those of Dissimilarity in general. This means that the proposed method results in more stable solutions than the Dissimilarity method.
Table 8 shows the computational time of all algorithms,
Table 3. Picture fuzzy clustering algorithm for complex data.
I: Dataset X of N elements in r dimensions; number of clusters C; threshold ε; fuzzifier m and the maximal number of iterations maxSteps > 0.
O: Matrices μ, η, ξ and centers V.
PFCA-CD:
1: t = 0
2: μ_kj^(t) ← random; η_kj^(t) ← random; ξ_kj^(t) ← random (k = 1, …, N; j = 1, …, C) satisfying Eqs. (28)–(29)
3: Repeat
4: t = t + 1
5: For each particle i
6: Choose centers V_j^(t) (j = 1, …, C) as in Table 2
7: Calculate μ_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (38)
8: Calculate η_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (39)
9: Calculate ξ_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (32)
10: Calculate the fitness value by Eq. (37)
11: Update the Pbest value
12: Update the Gbest value
13: End
14: Until Gbest is unchanged or maxSteps has been reached
15: Output (μ, η, ξ, V) = (μ^Gbest, η^Gbest, ξ^Gbest, V^Gbest)
Table 4. Descriptions of experimental datasets.
Dataset | No. elements | No. numerical attributes | No. categorical attributes | No. classes

Table 5. The average validity index values of the algorithms (bold values mean the best one in each dataset and validity index).
Dataset | Algorithm | MA | ASWC | DB | RI
IRIS | DifFuzzy | 66.667 | 1.569 | 2.707 | 81.960
IRIS | Dissimilarity | 92.639 | 1.946 | 9.915 | 79.092
IRIS | PFCA-CD | 92.667 | 1.971 | 11.217 | 76.599
GLASS | DifFuzzy | 66.667 | 0.621 | 4.654 | 67.557
GLASS | Dissimilarity | 86.888 | 1.082 | 4.239 | 57.303
GLASS | PFCA-CD | 88.785 | 1.142 | 11.808 | 66.994
| Dissimilarity | 96.404 | 1.715 | 3.812 | 62.56
| PFCA-CD | 96.464 | 1.147 | 4.936 | 61.346
AUTOMOBILE | DifFuzzy | 33.333 | 0.745 | 6.437 | 63.912
AUTOMOBILE | Dissimilarity | 82.146 | 1.411 | 8.721 | 64.371
AUTOMOBILE | PFCA-CD | 93.157 | 0.937 | 5.319 | 69.458
| Dissimilarity | 77.35 | 1.21 | 8.279 | 63.862
| PFCA-CD | 94.538 | 1.035 | 4.667 | 66.607
| Dissimilarity | 100 | 1.076 | 2.868 | 54.006
accompanied by std values. The computational time of the proposed method is less than those of the other algorithms on the GLASS, AUTOMOBILE, SERVO and STATLOG datasets. Only on the ABALONE and IRIS datasets is the proposed algorithm slower. On the IRIS dataset, the proposed algorithm needs 4.743 s compared to 4.165 s for Dissimilarity; the discrepancy is less than one second, and the std value of the proposed algorithm is much smaller than those of the others, so the proposed algorithm is about as good as the others in runtime on IRIS. Only for the ABALONE dataset does the proposed algorithm take considerably more time (7643.86 s compared to 5214.669 s for Dissimilarity). This indicates that the proposed algorithm is not effective on large, purely numerical datasets.
5. Conclusions

In this paper, we presented a novel picture fuzzy clustering algorithm for complex data (PFCA-CD) that enables clustering of mixed numerical and categorical data with distinct structures. PFCA-CD makes use of a hybridization of the particle swarm optimization strategy with picture fuzzy clustering, in which combined solutions consisting of the equivalent clustering centers and membership matrices are packed into PSO. The idea can be shortly captured as follows: more than one center per cluster can deal with complex structures of data whose shape is not spherical, and the use of a novel measurement for categorical attributes can cope with mixed data as well. Thus, this process creates the most suitable solutions for the problem. The experimental results on the benchmark datasets of the UCI Machine Learning Repository indicated that in most cases the PFCA-CD algorithm not only produced solutions with better clustering quality but also was faster than the other algorithms. Further research directions of this paper could follow these ways: (i) investigate a distributed version of PFCA-CD; (ii) consider semi-supervised situations for PFCA-CD; (iii) apply the algorithm to recommender systems and other problems.

Appendix
Source codes and the experimental datasets of this paper can be found at: code/ci/master/tree/
References
Atanassov, K.T., 1986 Intuitionistic fuzzy sets Fuzzy Sets Syst 20 (1), 87–96 Bezdek, J.C., Ehrlich, R., Full, W., 1984 FCM: the fuzzy c-means clustering algorithm Comput Geosci 10 (2), 191–203.
Chen, L., Wang, S., Wang, K., Zhu, J., 2016 Soft subspace clustering of categorical data with probabilistic distance Pattern Recognit 51, 322–332.
Cominetti, O., Matzavinos, A., Samarasinghe, S., Kulasiri, D., Liu, S., Maini, P., Erban, R., 2010 DifFUZZY: a fuzzy clustering algorithm for complex datasets Int J Comput Intell Bioinform Syst Biol 1 (4), 402–417.
Cuong, B.C., 2014 Picture fuzzy sets J Comput Sci Cybern 30 (4), 409–416 Davies, D.L., Bouldin, D.W., 1979 A cluster separation measure IEEE Trans Pattern Anal Mach Intell 2, 224–227.
De Carvalho, F.D.A., Lechevallier, Y., De Melo, F.M., 2013 Relational partitioning fuzzy clustering algorithms based on multiple dissimilarity matrices Fuzzy Sets Syst 215, 1–28.
Eberhart, R.C., Kennedy, J., 1995 A new optimizer using particle swarm theory, In: Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1, pp 39–43.
Ferreira, M.R., de Carvalho, F.D., 2012 Kernel fuzzy clustering methods based on local adaptive distances, In: Proceedings of 2012 IEEE International Conference
on In Fuzzy Systems (FUZZ-IEEE), pp 1–8.
Hwang, Z., 1998 Extensions to the k-means algorithm for clustering large data sets with categorical values Data Min Knowl Discov 2 (3), 283–304.
Ji, J., Pang, W., Zhou, C., Han, X., Wang, Z., 2012 A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data Knowl – Based Syst 30, 129–135.
Ji, J., Bai, T., Zhou, C., Ma, C., Wang, Z., 2013 An improved k-prototypes clustering algorithm for mixed numeric and categorical data Neurocomputing 120, 590–596.
Mendel, J.M., John, R.I.B., 2002 Type-2 fuzzy sets made simple IEEE Trans Fuzzy Syst 10 (2), 117–127.
Fig 5 The chart of ASWC and DB values of all algorithms with different datasets.
Table 6
Times to achieve best values of algorithms.
(Bold values mean the best one).
Algorithms Times to achieve best value
DifFuzzy 3
Dissimilarity 9
PFCA-CD 12
Table 7
The STD values for validity indices of algorithms.
Dissimilarity 7.144 0.447 6.512 11.375
PFCA-CD 8.55 0.267 3.663 6.886
Dissimilarity 38.586 1.014 5.352 49.76
PFCA-CD 3.184 0.094 3.354 0.936
Dissimilarity 2.63 0.039 4.113 0.247
Dissimilarity 9.694 1.433 12.587 4.23
PFCA-CD 1.938 0.132 5.938 0.806
Dissimilarity 5.23 0.088 13.358 1.121
PFCA-CD 3.435 0.053 5.089 0.864
Dissimilarity 0 2.96E4 0.433 0.075
PFCA-CD 4.612 0.011 9.891 0.688
Table 8
The computational time (with STD values) for algorithms in seconds.
DifFuzzy Dissimilarity PFCA-CD
IRIS 31.048 (1.369) 4.165 (3.626) 4.743 (0.919)
GLASS 522.184 (35.528) 122.44 (121.87) 17.577 (1.39)
ABALONE – 5214.669 (4457.619) 7643.86 (844.934)
AUTOMOBILE 149.553 (0.058) 318.622 (71.871) 22.9 (9.017)
SERVO 16.975 (2.9E−3) 19.124 (3.439) 19.064 (6.279)
STATLOG – 3082.443 (253.439) 108.688 (5.991)
Trang 10context fuzzy clustering type-2 and particle swarm optimization Appl Soft
Comput 22, 566–584.
Son, L.H., 2014b HU-FCF: a hybrid user-based fuzzy collaborative filtering method
in recommender systems Expert Syst Appl 41 (15), 6861–6870.
Son, L.H., 2015a DPFCM: a novel distributed picture fuzzy clustering method on
picture fuzzy sets Expert Syst Appl 42 (1), 51–66.
Son, L.H., 2015b A novel kernel fuzzy clustering algorithm for geo-demographic
analysis Inf Sci 317, 202–223.
Son, L.H., 2015c HU-FCF++: a novel hybrid method for the new user cold-start problem in recommender systems Eng Appl Artif Intell 41, 207–222.
Son, L.H., 2016 Dealing with the new user cold-start problem in recommender
systems: a comparative review Inf Syst 58, 87–104.
Son, L.H., Thong, N.T., 2015 Intuitionistic fuzzy recommender systems: an effective
tool for medical diagnosis Knowl – Based Syst 74, 133–150.
Son, L.H., Tuan, T.M., 2016 A cooperative semi-supervised fuzzy clustering
frame-work for dental X-ray image segmentation Expert Syst Appl 46, 380–393.
Son, L.H., Hai, P.V., 2016 A novel multiple fuzzy clustering method based on
in-ternal clustering validation measures with gradient descent Int J Fuzzy Syst.
http://dx.doi.org/10.1007/s40815-015-0117-1
Son, L.H., Cuong, B.C., Long, H.V., 2013 Spatial interaction – modification model and
applications to geo-demographic analysis Knowl – Based Syst 49, 152–170.
Son, L.H., Linh, N.D., Long, H.V., 2014 A lossless DEM compression for fast retrieval
method using fuzzy clustering and MANFIS neural network Eng Appl Artif.
Intell 29, 33–42.
Son, L.H., Cuong, B.C., Lanzi, P.L., Thong, N.T., 2012a A novel intuitionistic fuzzy
clustering method for geo-demographic analysis Expert Syst Appl 39 (10),
9848–9859.
Son, L.H., Lanzi, P.L., Cuong, B.C., Hung, H.A., 2012b Data mining in GIS: a novel
context-based fuzzy geographically weighted clustering algorithm Int J Mach.
Learn Comput 2 (3), 235–238.
Thong, P.H., Son, L.H., 2014 A new approach to multi-variables fuzzy forecasting using picture fuzzy clustering and picture fuzzy rules interpolation method, In: Proceeding of 6th International Conference on Knowledge and Systems En-gineering, pp 679–690.
Thong, N.T., Son, L.H., 2015 HIFCF: an effective hybrid model between picture fuzzy clustering and intuitionistic fuzzy recommender systems for medical diagnosis Expert Syst Appl 42 (7), 3682–3701.
Thong, P.H., Son, L.H., Fujita, H., 2016 Interpolative Picture Fuzzy Rules: A Novel Forecast Method for Weather Nowcasting, In: Proceeding of the 2016 IEEE In-ternational Conference on Fuzzy Systems (FUZZ-IEEE 2016), pp 86–93 Thong, P.H., Son, L.H., 2016a Picture fuzzy clustering: a new computational in-telligence method Soft Comput 20 (9), 3549–3562.
Thong, P.H., Son, L.H., 2016b An overview of semi-supervised fuzzy clustering al-gorithms Int J Eng Technol 8 (4), 301–306.
Tuan, T.M., Ngan, T.T., Son, L.H., 2016 A novel semi-supervised fuzzy clustering method based on interactive fuzzy satisficing for dental X-ray image segmen-tation Appl Intell 45 (2), 402–428.
Tuan, T.M., Duc, N.T., Hai, P.V., Son, L.H., 2016 Dental diagnosis from X-Ray images using fuzzy rule-based systems Int J Fuzzy Syst Appl (in press).
University of California, 2007 UCI Repository of Machine Learning Databases
〈http://archive.ics.uci.edu/ml/〉.
Vendramin, L., Campello, R.J., Hruschka, E.R., 2010 Relative clustering validity cri-teria: a comparative overview Stat Anal Data Min 3 (4), 209–235.
Wijayanto, A.W., Purwarianti, A., Son, L.H., 2016 Fuzzy geographically weighted clustering using artificial bee colony: an efficient geo-demographic analysis algorithm and applications to the analysis of crime behavior in population Appl Intell 44 (2), 377–398.
Yang, M.S., Hwang, P.Y., Chen, D.H., 2004 Fuzzy clustering algorithms for mixed feature variables Fuzzy Sets Syst 141 (2), 301–317.