Picture fuzzy clustering for complex data
VNU University of Science, Vietnam National University, 334 Nguyen Trai, Thanh Xuan, Hanoi, Viet Nam
Article info
Article history:
Received 24 April 2016
Received in revised form 9 August 2016
Accepted 9 August 2016
Keywords:
Complex data
Distinct structured data
Fuzzy clustering
Mix data type
Picture fuzzy clustering
Abstract: Fuzzy clustering is a useful segmentation tool that has been widely applied to real-life problems such as pattern recognition, recommender systems, forecasting, etc. Fuzzy clustering on picture fuzzy sets (FC-PFS) is an advanced fuzzy clustering algorithm constructed on the basis of picture fuzzy sets, in which three membership degrees, namely the positive, the neutral and the refusal degrees, are combined with an entropy component in the objective function to handle the problem of incomplete modeling in fuzzy clustering. A disadvantage of FC-PFS is its limited capability to handle complex data, which include mixed data types (categorical and numerical data) and distinct structured data. In this paper, we propose a novel picture fuzzy clustering algorithm for complex data called PFCA-CD that deals with both mixed data types and distinct data structures. The idea of this method is the modification of FC-PFS using a new measurement for categorical attributes, multiple centers of one cluster and an evolutionary strategy, particle swarm optimization. Experiments indicate that the proposed algorithm results in better clustering quality than others through clustering validity indices.
© 2016 Elsevier Ltd. All rights reserved.
1. Introduction
Fuzzy clustering is used for partitioning a dataset into clusters where each element in the dataset can belong to all clusters with different membership values (Bezdek et al., 1984). The most popular fuzzy clustering algorithm is Fuzzy C-Means (FCM). This algorithm is based on the idea of K-Means clustering, with membership values attached to the objective function for partitioning all data elements in the dataset. This makes FCM perform better than the K-Means algorithm, especially on overlapping and uncertain datasets (Bezdek et al., 1984). Moreover, FCM has many applications to real-life problems such as pattern recognition, recommender systems, forecasting, etc. (Son et al., 2012a, 2012b; Son et al., 2013, 2014; Thong and Son, 2014; Son, 2014a, 2014b; Son and Thong, 2015; Thong and Son, 2015; Son, 2015b, 2015c, 2016; Son and Tuan, 2016; Son and Hai, 2016; Wijayanto et al., 2016; Tuan et al., 2016; Thong et al., 2016).
However, FCM still has some limitations regarding clustering quality, hesitation, noise and outliers (Ferreira and de Carvalho, 2012; De Carvalho et al., 2013; Thong and Son, 2016b). There have been many researches proposed to overcome these limitations; one of them is innovating FCM on advanced fuzzy sets such as type-2 fuzzy sets (Mendel and John, 2002), intuitionistic fuzzy sets (Atanassov, 1986) and picture fuzzy sets (Cuong, 2014). Fuzzy clustering on picture fuzzy sets (FC-PFS) (Thong and Son, 2016a) is an extension of FCM with the appearance of three membership degrees of picture fuzzy sets, namely the positive, the neutral and the refusal degrees, combined with an entropy component in the objective function to handle the problem of incomplete modeling in fuzzy clustering. FC-PFS has been shown to have better accuracy than other fuzzy clustering algorithms (Thong and Son, 2016a).
However, a disadvantage of the FC-PFS algorithm, extracted from our experiments through various scenarios, is its limited capability to handle complex data, which include mixed data types and distinct structured data. Mixed data are known as categorical and numerical data, which occur together in many real datasets (Ferreira and de Carvalho, 2012). Distinct structured data contain non-sphere structured data, such as data scattered along a linear line or a ring, that prevent clustering algorithms from partitioning data elements into exact clusters. Most fuzzy clustering methods cannot handle such data well. There have been many researches on developing new fuzzy clustering algorithms that employ dissimilarity distances and kernel functions (Hwang, 1998; Ji et al., 2013, 2012). However, they solved either mixed data types or distinct structured data but not both, and this
http://dx.doi.org/10.1016/j.engappai.2016.08.009
Corresponding author. E-mail addresses: thongph@vnu.edu.vn (P.H. Thong), sonlh@vnu.edu.vn, chinhson2002@gmail.com (L.H. Son).
leaves the motivation for this paper to work on.
In this paper, we propose a novel picture fuzzy clustering algorithm for complex data called PFCA-CD that deals with both mixed data types and distinct data structures. The idea of this method is the modification of FC-PFS, using a new measurement for categorical attributes, multiple centers of one cluster and an evolutionary strategy, particle swarm optimization. Experiments indicate that the proposed algorithm results in better clustering quality than others through clustering validity indices.
The rest of the paper is organized as follows. Section 2 describes the background with a literature review and some particular methods for clustering complex data. Section 3 presents the proposed method. Section 4 validates the algorithms on benchmark UCI datasets. Finally, conclusions and further works are covered in the last section.
2. Background
This section presents related methods for clustering complex data in Section 2.1. Sections 2.2–2.3 review two typical methods of this approach.
2.1. Literature review
The related works for clustering complex data are divided into two groups: mixed types of data, including categorical and numerical data, and distinct structures of data (Fig. 1). In the first group, Hwang (1998) extended the k-means algorithm for clustering large datasets including categorical values. Yang et al. (2004) used fuzzy clustering algorithms to partition mixed feature variables. Ji et al. (2012, 2013) proposed fuzzy k-prototype clustering algorithms combining the mean and fuzzy centroid to represent the prototype of a cluster and employing a new measure, based on co-occurrence of values, to evaluate the dissimilarity between data objects and prototypes of clusters. Chen et al. (2016) presented a soft subspace clustering of categorical data using a novel soft feature-selection scheme that automatically assigns each categorical attribute a weight correlated with the smoothed dispersion of the categories in a cluster. A series of methods based on multiple dissimilarity matrices to handle mixed data was introduced by De Carvalho et al. (2013). The main idea of these methods was to obtain a collaborative role of the different dissimilarity matrices to handle complex distinct structures of data.
In the second group, many researchers tried to partition complex structures of data which have an intrinsic non-sphere geometry. Cominetti et al. (2010) proposed a method called DifFuzzy combining ideas from FCM and diffusion on graphs to handle the problem of clusters with a complex nonlinear geometric structure. This method is applicable to a larger class of clustering problems which do not require any prior knowledge of the cluster structure. Ferreira and de Carvalho (2012) presented kernel fuzzy clustering methods based on local adaptive distances to partition complex data. The main idea of these methods is a local adaptive distance where dissimilarity measures are obtained as sums of the Euclidean distances between patterns and centroids, computed individually for each variable by means of kernel functions. The dissimilarity measure is utilized to learn the weights of the variables during the clustering process, which improves the performance of the algorithms. However, this method can deal with numerical data only.
It has been shown that the DifFuzzy algorithm (Cominetti et al., 2010) and the fuzzy clustering based on multiple dissimilarity matrices (Dissimilarity) (De Carvalho et al., 2013) are two typical clustering methods in each group. Therefore, we will analyze these methods in more detail in the next sections.

2.2. DifFuzzy

The DifFuzzy clustering algorithm (Cominetti et al., 2010) is based on FCM and diffusion on graphs to partition the dataset into clusters with a complex nonlinear geometric structure. Firstly, the auxiliary function F(σ) is defined:
F(σ) = the number of connected components of the σ-neighborhood graph that contain at least M vertices, (1)

where σ ∈ (0, ∞) is a positive number, the i-th and j-th nodes are connected by an edge if ‖X_i − X_j‖ < σ, and M is the mandatory parameter of the method. As σ grows, F(σ) rises from 0 to its maximum value, before settling back down to a value of 1; the number of clusters is taken as this maximum:

C = max_{σ ∈ (0, ∞)} F(σ). (2)

The points that belong to one of the C components at this maximal value are called hard points, grouped in C core clusters; the remaining points are soft points. Next, the weights ŵ_{i,j}(β) are defined:

ŵ_{i,j}(β) = 1, if i and j are hard points in the same core cluster; ŵ_{i,j}(β) = exp(−‖X_i − X_j‖² / β), otherwise, (3)

and the auxiliary function L(β): (0, ∞) → (0, ∞) is:

L(β) = Σ_{i=1}^{N} Σ_{j=1}^{N} ŵ_{i,j}(β). (4)

It has two well-defined limits:

lim_{β→0} L(β) = Σ_{i=1}^{C} N_i², lim_{β→∞} L(β) = N², (5)

where N_i is the number of hard points in the i-th core cluster. The parameter β* is chosen from the relation:

L(β*) = (1 − γ₁) Σ_{i=1}^{C} N_i² + γ₁ N², (6)

where γ₁ ∈ (0, 1) is an internal parameter of the method; its default value is 0.3. Then the auxiliary matrices are defined as follows. The matrix W has entries

ω_{i,j} = ŵ_{i,j}(β*). (7)

Fig. 1. Classification of methods dealing with complex data: mixed data types (categorical and numerical data) and distinct structures of data (different distributions of data).
The matrix D is defined as a diagonal matrix with diagonal elements

d_{i,i} = Σ_{j=1}^{N} ω_{i,j}, (8)

where ω_{i,j} are the entries of matrix W. Finally, the matrix P is defined as

p_{i,j} = (ω_{i,j} + γ₂ I_{i,j}) / (d_{i,i} + γ₂), (9)

where I ∈ R^{N×N} is the identity matrix and γ₂ is an internal parameter of DifFuzzy; its default value is 0.1. DifFuzzy also computes a number of diffusion steps from the spectrum of P using a third internal parameter γ₃ (Eqs. (10)–(11) of Cominetti et al., 2010). To measure the diffusion distance dist(X_s, c) between a soft point X_s and each cluster, the indicator vector e with e(j) = 1 if j = s, and e(j) = 0 otherwise, is propagated by powers of P. Finally, the membership value of the soft point X_s in the c-th cluster, u_c(X_s), is determined as

u_c(X_s) = dist(X_s, c)^{−1} / Σ_{l=1}^{C} dist(X_s, l)^{−1}. (12)

This procedure is applied to every soft data point X_s and every cluster c ∈ {1, 2, …, C}. The output of DifFuzzy is a set of values that represent the degree of membership in each cluster. The membership value of X_i, i = 1, 2, …, N, in the c-th cluster, c = 1, 2, …, C, is denoted by u_c(X_i). The degree of membership is a number between 0 and 1, where values close to 1 correspond to points that are very likely to belong to that cluster. The sum of the membership values of a data point in all clusters is always one.
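As an illustration, the σ-neighborhood graph and the component-counting function F(σ) described above can be sketched as follows. This is an illustrative sketch of Eq. (1), not the authors' code; the function name and the array layout are assumptions.

```python
import numpy as np

def F(points, sigma, M):
    """Number of connected components of the sigma-neighborhood graph
    that contain at least M vertices (in the spirit of Eq. (1))."""
    n = len(points)
    # Nodes i and j are connected by an edge iff ||X_i - X_j|| < sigma.
    adj = [[j for j in range(n)
            if j != i and np.linalg.norm(points[i] - points[j]) < sigma]
           for i in range(n)]
    seen, count = set(), 0
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal of one connected component.
        comp_size, stack = 0, [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            comp_size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if comp_size >= M:
            count += 1
    return count
```

Sweeping σ and recording the maximum of F(σ) then yields the number of clusters C of Eq. (2).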
2.3. Dissimilarity

The Dissimilarity algorithm (De Carvalho et al., 2013) is a fuzzy K-medoids with a relevance weight for each dissimilarity matrix, consisting of the 5 steps below.

2.3.1. Initialization
Fix C (the number of clusters), 2 ≤ C ≪ n; fix m, 1 < m < +∞; fix s, 1 ≤ s < +∞; fix T (an iteration limit); fix ε > 0 and ε ≪ 1. Fix the cardinality 1 ≤ q ≪ n of the prototypes G_k (k = 1, …, C). Set t = 0. Set λ_k^(0) = (γ_{k1}^(0), …, γ_{kr}^(0)) = (1, …, 1) or set λ_k^(0) = (γ_{k1}^(0), …, γ_{kr}^(0)) = (1/r, …, 1/r), k = 1, …, C. Randomly select C distinct prototypes G_k^(0) ∈ E^(q), k = 1, …, C.
For each object e_i (i = 1, …, n), compute its membership degree u_{ik}^(0) (k = 1, …, C) in fuzzy cluster C_k:
u_{ik}^(0) = [ Σ_{h=1}^{C} ( Σ_{j=1}^{r} (γ_{kj}^(0))^s d_j(e_i, G_k^(0)) / Σ_{j=1}^{r} (γ_{hj}^(0))^s d_j(e_i, G_h^(0)) )^{1/(m−1)} ]^{−1}, (13)

where d_j(e_i, G_k^(0)) = Σ_{e ∈ G_k^(0)} d_j(e_i, e) is the dissimilarity between object e_i and prototype G_k^(0) according to the j-th dissimilarity matrix. The initial value of the adequacy criterion is

J^(0) = Σ_{k=1}^{C} Σ_{i=1}^{n} (u_{ik}^(0))^m Σ_{j=1}^{r} (γ_{kj}^(0))^s d_j(e_i, G_k^(0)). (14)
2.3.2. Compute the best prototypes
The vector of relevance weights Λ^(t−1) = (λ_1^(t−1), …, λ_C^(t−1)) and the fuzzy partition represented by u^(t−1) are fixed. The prototype G_k^(t) = G* ∈ E^(q) of fuzzy cluster C_k (k = 1, …, C) is chosen to minimize the clustering criterion J: Σ_{i=1}^{n} (u_{ik})^m Σ_{j=1}^{r} (γ_{kj})^s d_j(e_i, G*) → Min.
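Because the criterion is additive over the objects in a prototype set, the minimizing set of q medoids can be picked greedily per object. The following is an illustrative sketch under that observation, not the code of De Carvalho et al.; the array layout is an assumption.

```python
import numpy as np

def best_prototypes(D, u, weights, m, s, q):
    """For each cluster, pick the q objects minimizing the criterion
    sum_i (u_ik)^m * sum_j (gamma_kj)^s * d_j(e_i, G) of Section 2.3.2.
    D: (r, n, n) dissimilarity matrices; u: (n, C) memberships;
    weights: (C, r) relevance weights; m, s: exponents; q: cardinality."""
    r, n, _ = D.shape
    C = u.shape[1]
    G = []
    for k in range(C):
        # cost[g] = contribution to the criterion if object g is a prototype
        cost = np.zeros(n)
        for j in range(r):
            cost += (weights[k, j] ** s) * ((u[:, k] ** m) @ D[j])
        # The q cheapest objects form the prototype set G_k.
        G.append([int(g) for g in np.argsort(cost)[:q]])
    return G
```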
2.3.3. Compute the best relevance weights
When the vector of prototypes G^(t) = (G_1^(t), …, G_C^(t)) and the fuzzy partition u^(t−1) are fixed, the components γ_{kj}^(t) (j = 1, …, r) of the relevance weight vector λ_k^(t) (k = 1, …, C) are computed as in Eq. (15) or Eq. (17), according to whether the weights are constrained by Eq. (16) or Eq. (18), respectively:

γ_{kj}^(t) = [ Π_{h=1}^{r} ( Σ_{i=1}^{n} (u_{ik})^m d_h(e_i, G_k^(t)) ) ]^{1/r} / Σ_{i=1}^{n} (u_{ik})^m d_j(e_i, G_k^(t)), (15)

under the product constraint (with s = 1)

Π_{j=1}^{r} γ_{kj}^(t) = 1, k = 1, …, C; (16)

γ_{kj}^(t) = [ Σ_{h=1}^{r} ( Σ_{i=1}^{n} (u_{ik})^m d_j(e_i, G_k^(t)) / Σ_{i=1}^{n} (u_{ik})^m d_h(e_i, G_k^(t)) )^{1/(s−1)} ]^{−1}, (17)

under the sum constraint (with s > 1)

Σ_{j=1}^{r} γ_{kj}^(t) = 1, k = 1, …, C. (18)
2.3.4. Define the best fuzzy partition
The vector of prototypes G^(t) = (G_1^(t), …, G_C^(t)) and the vector of relevance weights Λ^(t) = (λ_1^(t), …, λ_C^(t)) are fixed. The membership degree u_{ik}^(t) of object e_i (i = 1, …, n) in fuzzy cluster C_k (k = 1, …, C) is calculated as in Eq. (19):

u_{ik}^(t) = [ Σ_{h=1}^{C} ( Σ_{j=1}^{r} (γ_{kj}^(t))^s d_j(e_i, G_k^(t)) / Σ_{j=1}^{r} (γ_{hj}^(t))^s d_j(e_i, G_h^(t)) )^{1/(m−1)} ]^{−1}. (19)
2.3.5. Stopping criterion
Compute

J^(t) = Σ_{k=1}^{C} Σ_{i=1}^{n} (u_{ik}^(t))^m Σ_{j=1}^{r} (γ_{kj}^(t))^s d_j(e_i, G_k^(t)). (20)

If |J^(t) − J^(t−1)| ≤ ε or t ≥ T: STOP; otherwise go to step 2.3.2.
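The membership update of Eq. (19) can be sketched as follows, shown with exponent s = 1 for the weights. This is an illustrative sketch, not the authors' code; the array layout and the small epsilon guarding zero dissimilarities are assumptions.

```python
import numpy as np

def update_memberships(D, prototypes, weights, m):
    """Fuzzy memberships of n objects in C clusters from r dissimilarity
    matrices, in the spirit of Eq. (19).
    D: array (r, n, n); prototypes: list of C medoid index-lists G_k;
    weights: array (C, r) of relevance weights gamma_kj; m: fuzzifier."""
    r, n, _ = D.shape
    C = len(prototypes)
    # d[i, k] = sum_j gamma_kj * d_j(e_i, G_k), where d_j(e_i, G_k) is the
    # sum of j-th dissimilarities from e_i to the medoids of cluster k.
    d = np.zeros((n, C))
    for k, Gk in enumerate(prototypes):
        for j in range(r):
            d[:, k] += weights[k, j] * D[j][:, Gk].sum(axis=1)
    d += 1e-12  # guard against zero dissimilarity (an assumption)
    u = np.empty((n, C))
    for i in range(n):
        for k in range(C):
            u[i, k] = 1.0 / sum((d[i, k] / d[i, h]) ** (1.0 / (m - 1.0))
                                for h in range(C))
    return u
```

Each row of the returned matrix sums to one, matching the fuzzy-partition constraint of the method.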
2.4. Particle swarm optimization
Particle swarm optimization (PSO) (Eberhart and Kennedy, 1995) is an evolutionary strategy that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. It simulates a swarm of particles, each of which represents one solution of the problem and is encoded with its location (loc) and velocity (vec). The PSO procedure consists of these steps: initializing the swarm, calculating fitness values, and updating locations and velocities. Firstly, the location and velocity of each particle are initiated randomly. The fitness value is designed to assess the quality of a solution. Finally, the update process is demonstrated in Eqs. (21) and (22):

vec_i = vec_i + C1 · rand · (loc_{Pbest} − loc_i) + C2 · rand · (loc_{Gbest} − loc_i), (21)

loc_i = loc_i + vec_i, (22)

where C1, C2 ≥ 0 are PSO's parameters; generally, C1 and C2 are often set to 1. loc_{Pbest} is the location at which particle i has its best current solution, and loc_{Gbest} is the location of the best solution of the whole swarm. The whole process is repeated until the number of iterations has been reached or the best solution has not changed in two consecutive steps. Details of this method are described in Fig. 2.
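A minimal sketch of the update rules (21)–(22) on a toy objective follows; the function and parameter names are illustrative assumptions, and the velocity clamp is a common practical addition that is not part of Eq. (21).

```python
import random

def pso_minimize(f, dim, n_particles=20, n_iters=200, c1=1.0, c2=1.0, seed=1):
    """Minimal PSO sketch following Eqs. (21)-(22): each velocity is pulled
    toward the particle's best location (Pbest) and the swarm's best (Gbest)."""
    rng = random.Random(seed)
    loc = [[rng.uniform(-5.0, 5.0) for _ in range(dim)]
           for _ in range(n_particles)]
    vec = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in loc]
    pbest_val = [f(p) for p in loc]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                # Eq. (21): velocity update with random attraction terms.
                vec[i][d] += (c1 * rng.random() * (pbest[i][d] - loc[i][d])
                              + c2 * rng.random() * (gbest[d] - loc[i][d]))
                vec[i][d] = max(-2.0, min(2.0, vec[i][d]))  # velocity clamp
                loc[i][d] += vec[i][d]                      # Eq. (22)
            val = f(loc[i])
            if val < pbest_val[i]:      # record the personal best (Pbest)
                pbest[i], pbest_val[i] = loc[i][:], val
                if val < gbest_val:     # record the global best (Gbest)
                    gbest, gbest_val = loc[i][:], val
    return gbest, gbest_val
```

For instance, minimizing the sphere function f(x) = Σ_d x_d² drives the swarm toward the origin.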
3. The proposed method

In this section, the novel picture fuzzy clustering method for complex data (PFCA-CD) is presented. The idea of this method is to overcome complex structures and mixed data by enhancing FC-PFS with multiple centers, a measurement for categorical attributes and an evolutionary strategy with particle swarm optimization. Specifically, the multiple centers are used to deal with complex structures of data, because data with complex structures have many different shapes that cannot be represented by one center. The centers are alternatively selected from the data elements because of the categorical data. Moreover, categorical attributes cannot be measured in the same way as numerical attributes; therefore, new measurements are used to cope with mixed data. The subsections are as follows: Section 3.1 describes details of FC-PFS, Section 3.2 proposes a new measurement for categorical attributes, and Section 3.3 presents the PFCA-CD algorithm.
3.1. Fuzzy clustering on picture fuzzy sets
A picture fuzzy set (PFS) Ȧ on a universe X is defined as (Cuong, 2014):

Ȧ = {⟨x, μ_Ȧ(x), η_Ȧ(x), γ_Ȧ(x)⟩ | x ∈ X}, (23)

where μ_Ȧ(x) is the positive degree of each element x ∈ X, η_Ȧ(x) is the neutral degree and γ_Ȧ(x) is the negative degree, satisfying the constraints

μ_Ȧ(x), η_Ȧ(x), γ_Ȧ(x) ∈ [0, 1], ∀x ∈ X, (24)

μ_Ȧ(x) + η_Ȧ(x) + γ_Ȧ(x) ≤ 1, ∀x ∈ X. (25)

The refusal degree of an element is:

ξ_Ȧ(x) = 1 − (μ_Ȧ(x) + η_Ȧ(x) + γ_Ȧ(x)), ∀x ∈ X. (26)

Thong and Son (2016a) proposed a picture fuzzy model for the clustering problem called FC-PFS, which was proven to achieve better clustering quality than other relevant algorithms. Suppose X = {X_1, …, X_N} is a dataset of N data points in r dimensions. The objective function for dividing the dataset into C groups is:

J = Σ_{k=1}^{N} Σ_{j=1}^{C} (μ_kj (2 − ξ_kj))^m ‖X_k − V_j‖² + Σ_{k=1}^{N} Σ_{j=1}^{C} η_kj (log η_kj + ξ_kj) → min, (27)

subject to

μ_kj, η_kj, ξ_kj ∈ [0, 1], (28)

μ_kj + η_kj + ξ_kj ≤ 1, (29)

Σ_{j=1}^{C} μ_kj (2 − ξ_kj) = 1, ∀k, (30)

Σ_{j=1}^{C} (η_kj + ξ_kj / C) = 1, ∀k. (31)

By using the Lagrangian method, the authors determined the optimal solutions of the model in Eqs. (32)–(35):

ξ_kj = 1 − (μ_kj + η_kj) − (1 − (μ_kj + η_kj)^α)^{1/α}, (32)

where α ∈ (0, 1] is an exponent coefficient used to control the refusal degree in PFS sets;

μ_kj = 1 / ( (2 − ξ_kj) Σ_{i=1}^{C} ( ‖X_k − V_j‖ / ‖X_k − V_i‖ )^{2/(m−1)} ), (33)

η_kj = ( e^{−ξ_kj} / Σ_{i=1}^{C} e^{−ξ_ki} ) ( 1 − (1/C) Σ_{i=1}^{C} ξ_ki ), (34)

V_j = Σ_{k=1}^{N} (μ_kj (2 − ξ_kj))^m X_k / Σ_{k=1}^{N} (μ_kj (2 − ξ_kj))^m. (35)
Details of FC-PFS are described in Table 1.

3.2. A new measurement for categorical attributes
Suppose that d_h(x_i, x_j) is the distance between elements x_i and x_j on attribute h (i = 1, …, N; j = 1, …, N; h = 1, …, R). If the h-th attribute is numerical, d_h(x_i, x_j) is calculated based on the Euclidean distance. Otherwise, if the h-th attribute is categorical, d_h(x_ih, x_jh) is calculated by Eq. (36):

d_h(x_ih, x_jh) = 0, if x_ih = x_jh; 1, otherwise. (36)

This means that the data input has to be normalized into the range [0, 1], with 0 corresponding to the minimum distance between two objects and 1 corresponding to the maximum one. In Eq. (36), if two categorical values are not equal, the distance between them is the maximum one.
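Eq. (36), combined with the squared Euclidean difference for numerical attributes, can be sketched as follows. This is an illustrative sketch; the function signature and the convention of passing categorical attribute indices as a set are assumptions.

```python
def mixed_distance(x, y, categorical):
    """Distance between two objects with mixed attributes: squared
    Euclidean difference for numerical attributes (assumed normalized to
    [0, 1]) and the 0/1 matching distance of Eq. (36) for categorical
    ones. `categorical` is the set of categorical attribute indices."""
    total = 0.0
    for h, (a, b) in enumerate(zip(x, y)):
        if h in categorical:
            total += 0.0 if a == b else 1.0   # Eq. (36)
        else:
            total += (a - b) ** 2             # numerical component
    return total
```

On normalized inputs, both attribute kinds then contribute on the same [0, 1] scale per attribute, which is exactly why the normalization requirement above matters.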
3.3. The PFCA-CD algorithm
In order to partition datasets with mixed data types and distinct data structures, we combine FC-PFS with PSO as follows.

Table 1. FC-PFS.
Fuzzy clustering method on picture fuzzy sets.
I: Dataset X of N elements in r dimensions; number of clusters C; fuzzifier m; threshold ε; maximum number of iterations maxSteps > 0.
O: Matrices μ, η, ξ and centers V.
FC-PFS:
1: t = 0
2: μ_kj^(t) ← random; η_kj^(t) ← random; ξ_kj^(t) ← random (k = 1, …, N; j = 1, …, C) satisfying Eqs. (28)–(29)
3: Repeat
4: t = t + 1
5: Calculate V_j^(t) (j = 1, …, C) by Eq. (35)
6: Calculate μ_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (33)
7: Calculate η_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (34)
8: Calculate ξ_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (32)
9: Until ‖μ^(t) − μ^(t−1)‖ + ‖η^(t) − η^(t−1)‖ + ‖ξ^(t) − ξ^(t−1)‖ ≤ ε or maxSteps has been reached
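One sweep of steps 5–8 of Table 1 can be sketched with NumPy as follows. This is a sketch of Eqs. (32)–(35), not the authors' implementation; the small constant guarding zero distances is an assumption.

```python
import numpy as np

def fcpfs_iteration(X, mu, eta, xi, m=2.0, alpha=0.6):
    """One FC-PFS sweep (steps 5-8 of Table 1): centers by Eq. (35),
    then mu, eta and xi by Eqs. (33), (34) and (32)."""
    # Eq. (35): weighted centers with weights (mu_kj * (2 - xi_kj))^m.
    w = (mu * (2.0 - xi)) ** m
    V = (w.T @ X) / w.sum(axis=0)[:, None]
    # Pairwise point-to-center distances; the epsilon guards against a
    # data point coinciding exactly with a center.
    dist = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
    # Eq. (33): mu_kj = 1 / ((2 - xi_kj) * sum_i (d_kj / d_ki)^(2/(m-1))).
    ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
    mu_new = 1.0 / ((2.0 - xi) * ratio.sum(axis=2))
    # Eq. (34): normalized exp(-xi) scaled by (1 - mean_i xi_ki).
    e = np.exp(-xi)
    eta_new = (e / e.sum(axis=1, keepdims=True)
               * (1.0 - xi.mean(axis=1, keepdims=True)))
    # Eq. (32): refusal degree from the updated mu and eta.
    s = np.clip(mu_new + eta_new, 0.0, 1.0)
    xi_new = 1.0 - s - (1.0 - s ** alpha) ** (1.0 / alpha)
    return V, mu_new, eta_new, xi_new
```

Iterating this sweep until the change in (μ, η, ξ) falls below ε reproduces the loop of Table 1.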
Suppose that dataset X contains mixed numerical and categorical data with N elements in r dimensions. Instead of using the iteration of FC-PFS, a PSO iteration is employed. The initial population of PSO is encoded as P = {p_1, p_2, …, p_popsize}, where each particle consists of the following components:

– (μ_kj, η_kj, ξ_kj): the positive, neutral and refusal degrees of elements in X, respectively;
– (μ_kj^Pbest, η_kj^Pbest, ξ_kj^Pbest): the positive, neutral and refusal degrees of the particle's best solution;
– V_j and V_j^Pbest: the sets of cluster centers corresponding to (μ_kj, η_kj, ξ_kj) and (μ_kj^Pbest, η_kj^Pbest, ξ_kj^Pbest), respectively;
– Pbest_i: the best fitness value that the particle achieves.

A particle starts from given values of (μ_kj, η_kj, ξ_kj) and tries to improve its solution iteratively. The centers are chosen from the data elements (Table 2) and used to calculate the optimization function (27) as in Eq. (37):
J = Σ_{k=1}^{N} Σ_{j=1}^{C} (μ_kj (2 − ξ_kj))^m d(X_k, V_j) + Σ_{k=1}^{N} Σ_{j=1}^{C} η_kj (log η_kj + ξ_kj), (37)

where d(X_k, V_j) is the distance of Section 3.2 from element X_k to the (multiple) centers of cluster j. The fitness value is compared with the current status of the particle. If the achieved solutions are better than the previous ones, the local optimal solutions Pbest (μ_kj^Pbest, η_kj^Pbest, ξ_kj^Pbest, V_j^Pbest) of the particle are recorded. Then, evolution of the particle is made by changing the values of (μ_kj, η_kj, ξ_kj and V_j). In the evolution, (μ_kj, η_kj) are calculated by Eqs. (33)–(34) and then moved toward the personal and global best solutions in the PSO manner of Eqs. (21)–(22), giving Eqs. (38)–(39), where (μ^Gbest, η^Gbest, ξ^Gbest and V^Gbest) are the best values of the whole population. Details are given in the procedure in Table 2.
The evolution of all particles is continued until a number of iterations has been reached or the global best is unchanged; each particle then holds suitable values of its clustering centers and membership matrices.

3.4. Remarks
The proposed method has some advantages:

– The proposed method uses multiple centers for each cluster, so that a cluster whose data elements scatter in an un-sphered, distinct structure can be easily represented by these centers.
– The proposed method, employing FC-PFS with the PSO strategy, can enhance the convergence process.
– The proposed method employs a new measurement for categorical attribute values that is appropriate for calculating the distance between two objects.

However, this method still has some limitations:

– The use of the PSO algorithm may result in good solutions, but they are not guaranteed to be the best ones.
– The computational time for the PSO strategy is quite high. The complexity is about O(N² × C) for one particle and one loop, where N is the number of elements in the dataset; numSteps and popsize are the number of iterations and the number of particles, respectively. Because popsize and C are always small, the complexity of the proposed algorithm is about O(N² × numSteps). If numSteps = 1, the complexity of the algorithm is only O(N²). In large datasets, however, the computational time of the proposed algorithm may be the highest.
4. Experiments

4.1. Materials and system configuration
The following benchmark datasets of the UCI Machine Learning Repository (University of California, 2007) are used for the validation of the performance of the algorithms (Table 4). They consist of seven datasets with different sizes, numbers of attributes and numbers of classes. The largest dataset is ABALONE, including 4177 elements and 8 numerical attributes. The dataset containing the most attributes is AUTOMOBILE, with 15 numerical and 10 categorical attributes. In the experiments, we do not normalize the datasets. The aim is to verify the quality of the clustering algorithms from small to large sizes and mixed data types (numerical and categorical attributes). In order to assess the quality, the number of classes in each dataset is used as the 'correct' number of clusters. In addition to the DifFuzzy algorithm (Cominetti et al., 2010) and the Dissimilarity algorithm (De Carvalho et al., 2013), we implemented the methods in the same programming language and executed them on a Linux Cluster 1350 with eight computing nodes of 51.2 GFlops. Each node contains two Intel Xeon dual-core 3.2 GHz CPUs and 2 GB RAM. The experimental results are taken as the average values after 50 runs.
Cluster validity measurement: Mean Accuracy (MA), the Davies-Bouldin (DB) index (Davies and Bouldin, 1979), the Rand index (RI) and the Alternative Silhouette Width Criterion (ASWC) (Vendramin et al., 2010) are used to evaluate the quality of the solutions of the clustering algorithms. The DB index is shown below:
DB = (1/C) Σ_{i=1}^{C} max_{j: j≠i} { (S_i + S_j) / M_ij }, (40)

S_i = ( (1/T_i) Σ_{j=1}^{T_i} ‖x_j − V_i‖² )^{1/2}, (41)
Table 2. Choosing centers for clusters.
Determining the centers for cluster j (V_j):
1: V_j = ∅
2: Repeat
3: Find x_h ∉ V_j, x_h ∈ X (h = 1, …, N), that minimizes the weighted cost Σ_{k=1}^{N} (μ_kj (2 − ξ_kj))^m ‖X_k − x_h‖²
4: V_j = V_j ∪ {x_h}
5: Until every remaining x_h ∉ V_j has a weighted cost Σ_{k=1}^{N} (μ_kj (2 − ξ_kj))^m ‖X_k − x_h‖² greater than those of the centers already selected
M_ij = ‖V_i − V_j‖ (i, j = 1, …, C; i ≠ j), (42)

where T_i is the size of the i-th cluster, S_i is a measure of scatter within the cluster, and M_ij is a measure of separation between clusters i and j. A minimum value indicates better performance for the DB index. The Rand index is

RI = (a + d) / (a + b + c + d), (43)

where a (b) is the number of pairs of data points belonging to the same class in R and to the same (different) cluster in Q, with R and Q being the reference partition and the computed clustering, and c (d) is the number of pairs belonging to different classes in R and to the same (different) cluster. The larger the Rand index, the better. The Alternative Silhouette Width Criterion (ASWC) is invoked to measure the clustering quality:
ASWC = (1/N) Σ_{i=1}^{N} s(x_i), (44)

s(x_i) = b_{p,i} / (a_{p,i} + ε), (45)

where a_{p,i} is the average distance of element i to all other elements in its cluster p, b_{p,i} is the average distance of element i to all other elements in the nearest neighboring cluster, and ε (a small constant for normalized data) is used to avoid division by zero when a_{p,i} = 0. A maximum value indicates better performance for the ASWC index.

Fig. 3. Schema of PFCA-CD.
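The DB index computation can be sketched as follows. This is a sketch using the average distance to the center as the within-cluster scatter (the q = 1 variant of Eq. (41)), not the authors' code.

```python
import numpy as np

def davies_bouldin(X, labels, centers):
    """Davies-Bouldin index in the spirit of Eqs. (40)-(42): the average,
    over clusters, of the worst ratio of within-cluster scatter to
    between-center separation. Lower values indicate better clustering."""
    C = len(centers)
    # S_i: average distance of the members of cluster i to its center.
    S = np.array([np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
                  for i in range(C)])
    db = 0.0
    for i in range(C):
        # M_ij: distance between centers i and j (Eq. (42)).
        ratios = [(S[i] + S[j]) / np.linalg.norm(centers[i] - centers[j])
                  for j in range(C) if j != i]
        db += max(ratios)
    return db / C
```

Two tight, well-separated clusters therefore score a small DB value, consistent with "the minimum value indicates better performance".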
Parameters: Particularly for PFCA-CD, we set C1 = C2 = 1, α = 0.6 and ε = 10^{−3} (Thong and Son, 2016a).
Objectives: We aim to evaluate the clustering qualities of the algorithms through validity indices. Some experiments with various cases of parameters are also considered.
4.2. Results and discussion

Table 5 indicates the average validity index values of the algorithms. It can be seen that the proposed PFCA-CD algorithm has better clustering quality, based on validity indices, than the others. In most cases, the proposed algorithm has at least one best validity index value. For instance, in the AUTOMOBILE and SERVO datasets, the values for PFCA-CD are better in the MA, DB and RI indices than those of DifFuzzy and Dissimilarity. In AUTOMOBILE they are 93.157 (MA), 5.319 (DB) and 69.458 (RI), compared to 33.333 (MA), 6.437 (DB) and 63.912 (RI) for DifFuzzy and 82.146 (MA), 8.721 (DB) and 64.371 (RI) for Dissimilarity. Fig. 4 shows the MA and RI values of the algorithms over the different datasets. For most datasets, the proposed method has better values than those of DifFuzzy and Dissimilarity.
Fig. 5 shows the ASWC and DB values of all algorithms on the different datasets. It can be seen that the proposed algorithm results in smaller DB values than the others on the STATLOG, SERVO and AUTOMOBILE datasets. In ASWC, the proposed algorithm is better on the IRIS and GLASS datasets, which are purely numerical. This suggests that ASWC may not be suitable for complex data.
Table 6 shows the number of times each algorithm reached the best value. The proposed method ranks first with 12 values, Dissimilarity ranks second with 9 values, and the remaining algorithm, DifFuzzy, has 3. The standard deviation (std) of the validity index changes is presented in Table 7.
In Table 7, the std values of the validity indices for the DifFuzzy algorithm do not change from run to run because this algorithm does not employ a heuristic strategy. The std values of PFCA-CD change less than those of Dissimilarity in general. This means that the proposed method results in more stable solutions than the Dissimilarity method.
Table 8 shows the computational time of all algorithms,
Table 3. Picture fuzzy clustering algorithm for complex data.
I: Dataset X of N elements in r dimensions; number of clusters C; threshold ε; fuzzifier m and the maximal number of iterations maxSteps > 0.
O: Matrices μ, η, ξ and centers V.
PFCA-CD:
1: t = 0
2: μ_kj^(t) ← random; η_kj^(t) ← random; ξ_kj^(t) ← random (k = 1, …, N; j = 1, …, C) satisfying Eqs. (28)–(29)
3: Repeat
4: t = t + 1
5: For each particle i
6: Choose centers V_j^(t) (j = 1, …, C) as in Table 2
7: Calculate μ_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (38)
8: Calculate η_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (39)
9: Calculate ξ_kj^(t) (k = 1, …, N; j = 1, …, C) by Eq. (32)
10: Calculate the fitness value by Eq. (37)
11: Update the Pbest value
12: Update the Gbest value
13: End
14: Until Gbest is unchanged or maxSteps has been reached
15: Output (μ, η, ξ, V) = (μ^Gbest, η^Gbest, ξ^Gbest, V^Gbest)
Table 4. Descriptions of experimental datasets.
Dataset | No. elements | No. numerical attributes | No. categorical attributes | No. classes

Table 5. The average validity index values of the algorithms (bold values mean the best one in each dataset and validity index).
Dataset | Algorithm | MA | ASWC | DB | RI
IRIS | DifFuzzy | 66.667 | 1.569 | 2.707 | 81.960
IRIS | Dissimilarity | 92.639 | 1.946 | 9.915 | 79.092
IRIS | PFCA-CD | 92.667 | 1.971 | 11.217 | 76.599
GLASS | DifFuzzy | 66.667 | 0.621 | 4.654 | 67.557
GLASS | Dissimilarity | 86.888 | 1.082 | 4.239 | 57.303
GLASS | PFCA-CD | 88.785 | 1.142 | 11.808 | 66.994
| Dissimilarity | 96.404 | 1.715 | 3.812 | 62.56
| PFCA-CD | 96.464 | 1.147 | 4.936 | 61.346
AUTOMOBILE | DifFuzzy | 33.333 | 0.745 | 6.437 | 63.912
AUTOMOBILE | Dissimilarity | 82.146 | 1.411 | 8.721 | 64.371
AUTOMOBILE | PFCA-CD | 93.157 | 0.937 | 5.319 | 69.458
| Dissimilarity | 77.35 | 1.21 | 8.279 | 63.862
| PFCA-CD | 94.538 | 1.035 | 4.667 | 66.607
| Dissimilarity | 100 | 1.076 | 2.868 | 54.006
accompanied by std values. The computational time of the proposed method is less than those of the other algorithms on the GLASS, AUTOMOBILE, SERVO and STATLOG datasets. Only on the ABALONE and IRIS datasets is the proposed algorithm slower. On the IRIS dataset, the proposed algorithm needs 4.743 s compared to 4.165 s for Dissimilarity; the discrepancy is less than one second, and the std value of the proposed algorithm is much smaller than those of the others, so the proposed algorithm is about as good as the others in runtime on IRIS. Only for the ABALONE dataset does the proposed algorithm take considerably more time (7643.86 s compared to 5214.669 s for Dissimilarity). This indicates that the proposed algorithm is not effective on large, purely numerical datasets.
5. Conclusions

In this paper, we presented a novel picture fuzzy clustering algorithm for complex data (PFCA-CD) that enables clustering of mixed numerical and categorical data with distinct structures. PFCA-CD makes use of a hybridization of the particle swarm optimization strategy with picture fuzzy clustering, in which combined solutions consisting of the equivalent clustering centers and membership matrices are packed into PSO. The idea can be shortly captured as follows: more than one center per cluster can deal with complex structures of data whose shape is not spherical, and the use of a novel measurement for categorical attributes can cope with mixed data as well. Thus, this process creates the most suitable solutions for the problem. The experimental results on the benchmark datasets of the UCI Machine Learning Repository indicated that in most cases the PFCA-CD algorithm not only produced solutions with better clustering quality but also was faster than the other algorithms. Further research directions of this paper could follow these ways: (i) investigate a distributed version of PFCA-CD; (ii) consider semi-supervised situations for PFCA-CD; (iii) apply the algorithm to recommender systems and other problems.

Appendix
Source codes and the experimental datasets of this paper can be found at: code/ci/master/tree/
References
Atanassov, K.T., 1986 Intuitionistic fuzzy sets Fuzzy Sets Syst 20 (1), 87–96 Bezdek, J.C., Ehrlich, R., Full, W., 1984 FCM: the fuzzy c-means clustering algorithm Comput Geosci 10 (2), 191–203.
Chen, L., Wang, S., Wang, K., Zhu, J., 2016 Soft subspace clustering of categorical data with probabilistic distance Pattern Recognit 51, 322–332.
Cominetti, O., Matzavinos, A., Samarasinghe, S., Kulasiri, D., Liu, S., Maini, P., Erban, R., 2010 DifFUZZY: a fuzzy clustering algorithm for complex datasets Int J Comput Intell Bioinform Syst Biol 1 (4), 402–417.
Cuong, B.C., 2014 Picture fuzzy sets J Comput Sci Cybern 30 (4), 409–416 Davies, D.L., Bouldin, D.W., 1979 A cluster separation measure IEEE Trans Pattern Anal Mach Intell 2, 224–227.
De Carvalho, F.D.A., Lechevallier, Y., De Melo, F.M., 2013 Relational partitioning fuzzy clustering algorithms based on multiple dissimilarity matrices Fuzzy Sets Syst 215, 1–28.
Eberhart, R.C., Kennedy, J., 1995 A new optimizer using particle swarm theory, In: Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1, pp 39–43.
Ferreira, M.R., de Carvalho, F.D., 2012 Kernel fuzzy clustering methods based on local adaptive distances, In: Proceedings of 2012 IEEE International Conference
on In Fuzzy Systems (FUZZ-IEEE), pp 1–8.
Hwang, Z., 1998 Extensions to the k-means algorithm for clustering large data sets with categorical values Data Min Knowl Discov 2 (3), 283–304.
Ji, J., Pang, W., Zhou, C., Han, X., Wang, Z., 2012 A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data Knowl – Based Syst 30, 129–135.
Ji, J., Bai, T., Zhou, C., Ma, C., Wang, Z., 2013 An improved k-prototypes clustering algorithm for mixed numeric and categorical data Neurocomputing 120, 590–596.
Mendel, J.M., John, R.I.B., 2002 Type-2 fuzzy sets made simple IEEE Trans Fuzzy Syst 10 (2), 117–127.
Fig 5 The chart of ASWC and DB values of all algorithms with different datasets.
Table 6
Times to achieve best values of algorithms.
(Bold values mean the best one).
Algorithms Times to achieve best value
DifFuzzy 3
Dissimilarity 9
PFCA-CD 12
Table 7
The STD values for validity indices of algorithms.
Dissimilarity 7.144 0.447 6.512 11.375
PFCA-CD 8.55 0.267 3.663 6.886
Dissimilarity 38.586 1.014 5.352 49.76
PFCA-CD 3.184 0.094 3.354 0.936
Dissimilarity 2.63 0.039 4.113 0.247
Dissimilarity 9.694 1.433 12.587 4.23
PFCA-CD 1.938 0.132 5.938 0.806
Dissimilarity 5.23 0.088 13.358 1.121
PFCA-CD 3.435 0.053 5.089 0.864
Dissimilarity 0 2.96E4 0.433 0.075
PFCA-CD 4.612 0.011 9.891 0.688
Table 8
The computational time (with STD values) for algorithms in seconds.
DifFuzzy Dissimilarity PFCA-CD
IRIS 31.048 (1.369) 4.165 (3.626) 4.743 (0.919)
GLASS 522.184 (35.528) 122.44 (121.87) 17.577 (1.39)
ABALONE – 5214.669 (4457.619) 7643.86 (844.934)
AUTOMOBILE 149.553 (0.058) 318.622 (71.871) 22.9 (9.017)
SERVO 16.975 (2.9E−3) 19.124 (3.439) 19.064 (6.279)
STATLOG – 3082.443 (253.439) 108.688 (5.991)
Trang 10context fuzzy clustering type-2 and particle swarm optimization Appl Soft
Comput 22, 566–584.
Son, L.H., 2014b HU-FCF: a hybrid user-based fuzzy collaborative filtering method
in recommender systems Expert Syst Appl 41 (15), 6861–6870.
Son, L.H., 2015a DPFCM: a novel distributed picture fuzzy clustering method on
picture fuzzy sets Expert Syst Appl 42 (1), 51–66.
Son, L.H., 2015b A novel kernel fuzzy clustering algorithm for geo-demographic
analysis Inf Sci 317, 202–223.
Son, L.H., 2015c HU-FCF++: a novel hybrid method for the new user cold-start problem in recommender systems Eng Appl Artif Intell 41, 207–222.
Son, L.H., 2016 Dealing with the new user cold-start problem in recommender
systems: a comparative review Inf Syst 58, 87–104.
Son, L.H., Thong, N.T., 2015 Intuitionistic fuzzy recommender systems: an effective
tool for medical diagnosis Knowl – Based Syst 74, 133–150.
Son, L.H., Tuan, T.M., 2016 A cooperative semi-supervised fuzzy clustering
frame-work for dental X-ray image segmentation Expert Syst Appl 46, 380–393.
Son, L.H., Hai, P.V., 2016 A novel multiple fuzzy clustering method based on
in-ternal clustering validation measures with gradient descent Int J Fuzzy Syst.
http://dx.doi.org/10.1007/s40815-015-0117-1
Son, L.H., Cuong, B.C., Long, H.V., 2013 Spatial interaction – modification model and
applications to geo-demographic analysis Knowl – Based Syst 49, 152–170.
Son, L.H., Linh, N.D., Long, H.V., 2014 A lossless DEM compression for fast retrieval
method using fuzzy clustering and MANFIS neural network Eng Appl Artif.
Intell 29, 33–42.
Son, L.H., Cuong, B.C., Lanzi, P.L., Thong, N.T., 2012a A novel intuitionistic fuzzy
clustering method for geo-demographic analysis Expert Syst Appl 39 (10),
9848–9859.
Son, L.H., Lanzi, P.L., Cuong, B.C., Hung, H.A., 2012b Data mining in GIS: a novel
context-based fuzzy geographically weighted clustering algorithm Int J Mach.
Learn Comput 2 (3), 235–238.
Thong, P.H., Son, L.H., 2014 A new approach to multi-variables fuzzy forecasting using picture fuzzy clustering and picture fuzzy rules interpolation method, In: Proceeding of 6th International Conference on Knowledge and Systems En-gineering, pp 679–690.
Thong, N.T., Son, L.H., 2015 HIFCF: an effective hybrid model between picture fuzzy clustering and intuitionistic fuzzy recommender systems for medical diagnosis Expert Syst Appl 42 (7), 3682–3701.
Thong, P.H., Son, L.H., Fujita, H., 2016 Interpolative Picture Fuzzy Rules: A Novel Forecast Method for Weather Nowcasting, In: Proceeding of the 2016 IEEE In-ternational Conference on Fuzzy Systems (FUZZ-IEEE 2016), pp 86–93 Thong, P.H., Son, L.H., 2016a Picture fuzzy clustering: a new computational in-telligence method Soft Comput 20 (9), 3549–3562.
Thong, P.H., Son, L.H., 2016b An overview of semi-supervised fuzzy clustering al-gorithms Int J Eng Technol 8 (4), 301–306.
Tuan, T.M., Ngan, T.T., Son, L.H., 2016 A novel semi-supervised fuzzy clustering method based on interactive fuzzy satisficing for dental X-ray image segmen-tation Appl Intell 45 (2), 402–428.
Tuan, T.M., Duc, N.T., Hai, P.V., Son, L.H., 2016 Dental diagnosis from X-Ray images using fuzzy rule-based systems Int J Fuzzy Syst Appl (in press).
University of California, 2007 UCI Repository of Machine Learning Databases
〈http://archive.ics.uci.edu/ml/〉.
Vendramin, L., Campello, R.J., Hruschka, E.R., 2010 Relative clustering validity cri-teria: a comparative overview Stat Anal Data Min 3 (4), 209–235.
Wijayanto, A.W., Purwarianti, A., Son, L.H., 2016 Fuzzy geographically weighted clustering using artificial bee colony: an efficient geo-demographic analysis algorithm and applications to the analysis of crime behavior in population Appl Intell 44 (2), 377–398.
Yang, M.S., Hwang, P.Y., Chen, D.H., 2004 Fuzzy clustering algorithms for mixed feature variables Fuzzy Sets Syst 141 (2), 301–317.