DSpace at VNU: A novel intuitionistic fuzzy clustering method for geo-demographic analysis


A novel intuitionistic fuzzy clustering method for geo-demographic analysis

Le Hoang Son a,⇑, Bui Cong Cuong b, Pier Luca Lanzi c, Nguyen Tho Thong a

a Hanoi University of Science, Vietnam National University, 334 Nguyen Trai, Thanh Xuan, Ha Noi, Viet Nam

b Institute of Mathematics, Vietnamese Academy of Science and Technology, 18 Hoang Quoc Viet, Cau Giay, Ha Noi, Viet Nam

c Department of Electronics and Information, Politecnico di Milano, Piazza Leonardo da Vinci 32, I-20133, Italy

Keywords:

Geo-demographic analysis

Geographic information systems

Intuitionistic fuzzy sets

Policies making

Possibilistic fuzzy C-means

Abstract

Geo-Demographic Analysis (GDA) is an important tool to explore the underlying rules that regulate our world, and therefore it has been widely applied to the development of effective socio-economic policies through the analysis of data generated from Geographic Information Systems (GIS). In GDA applications, clustering plays a major role; however, the current state-of-the-art algorithm, namely Fuzzy Geographically Weighted Clustering (FGWC), has demonstrated several limitations both in terms of speed and in terms of the quality of the achieved results. Accordingly, in this paper, we propose a novel clustering algorithm for GDA applications, based on recent results regarding intuitionistic fuzzy sets and the possibilistic fuzzy C-means, that aims at overcoming some of the limitations of the existing methods.

1 Introduction

Recently, there has been great interest in Geo-Demographic Analysis (GDA) and its application to many branches of society; for instance, GDA is usually applied to study the geographical distribution of specific spending groups so as to predict future spending trends. Accordingly, GDA plays an essential role in planning distribution and in managing products, services, capital, etc.

Geo-Demographic Analysis (GDA) combines Geographical Information Systems (GIS) and Data Mining algorithms. GDA is usually defined as "the analysis of spatially referenced geo-demographic and lifestyle data" (Sleight, 1993) and is widely used in the public and private sectors for the planning and provision of products and services. There are two main underlying assumptions in GDA (Palmer, 2008): firstly, it assumes that two people who live in the same area are more likely to have similar characteristics than two people selected at random; secondly, it assumes that two areas can be characterized in terms of their population, using demographics and other measures. Based on these two principles, clustering is applied to group geo-demographic data into meaningful clusters that capture existing regularities (relevant geo-demographic profiles), thus making the data more manageable for the target analysis (Mason & Jacobson, 2007). Notwithstanding the several clustering methods currently available, in this specific area the existing state-of-the-art, namely Fuzzy Geographically Weighted Clustering (FGWC) (Mason & Jacobson, 2007), is limited both in terms of speed and in terms of achieved clustering quality (Son, Lanzi, Cuong, & Hung, 2011). Accordingly, in (Son et al., 2011) we presented an approach to improve the speed of FGWC and provided experimental results showing that our approach could perform better than FGWC.

In this paper, we extend our previous work (Son et al., 2011) and present a novel clustering algorithm for GDA, called Intuitionistic Possibilistic Fuzzy Geographically Weighted Clustering (IPFGWC), that tries to cope with some of the limitations of the FGWC algorithm by integrating elements of Intuitionistic Fuzzy Sets (IFS) (Atanassov, 1986), Possibilistic Fuzzy C-Means (PFCM) (Pal, Pal, Keller, & Bezdek, 2005) and Generalized IFS (GIFS) (Liu, 2010). The rest of the paper is structured as follows. Section 2 introduces some related works for the GDA problem. Section 3 describes our novel algorithm, while Section 4 validates the proposed approach through a set of experiments involving real-world data. Finally, Section 5 draws the conclusions and delineates future research directions.

2 Related works

In this section, we provide a brief overview of the works in GDA that are most relevant to this work. Walford (2011) described a method using Principal Component Analysis (PCA) to study the spatial distribution of the 1991 census data scores. The distribution of these scores helps to determine whether the underlying demographic and socio-economic phenomena are geographically concentrated or dispersed. Loureiro et al. (2006) applied self-organizing maps (SOM) (Kohonen, 2001) to GDA. Self-organizing maps are unsupervised neural networks that have been used as a visualization tool for high-dimensional data. When a SOM is trained with a given dataset, its units tend to spread themselves in the input space following some density function of the data patterns. Based on the variations in edge length along a path between two units on the SOM, the authors presented a new way of calculating the fuzzy memberships of a fuzzy clustering method. The results of the proposed method are adequate for the identification of the main clusters. In general, this method gives better results than other ones due to the use of SOM in the initialization phase. However, it requires a lot of memory to store all neurons and weights. Furthermore, the training phase is quite slow.

Clustering algorithms have been widely used in GDA (Tryon & Bailey, 1970). Day, Pearce, and Dorling (2008) used agglomerative hierarchical clustering with the Euclidean distance measure as a means to classify 190 countries into groups that are homogeneous in terms of mortality rates. The aim of this study was to identify clusters of nations grouped by health outcomes in order to provide sensible groupings for international comparisons. Twelve clusters of countries were identified, with the average life expectancy of each one ranging from 81.5 years (cluster 1) to 37.7 years (cluster 12). However, the interpretation of the hierarchy is complex and often confusing. Besides, the deterministic nature of the technique prevents re-evaluation after points are grouped into a node. Hence, k-means clustering is often preferred to this algorithm.

Lee et al. (1999) applied k-means clustering to classify the data of the population and housing census in Korea into 26 clusters with 3752 basic administrative units, separately for 1990 and 1995. Then, by the hierarchical method, they made six "Grand Clusters" from the profiles of these 26 clusters. The result of this research is the distribution of the six "Grand Clusters" during the period 1990–1995. Another example of using k-means for GDA is from Gibbs et al. (2010). In that work, the authors divided all the primary schools in London in 2007 into groups or clusters that have similar ethnic and socio-economic characteristics. This process helps parents find schools for their children easily by marking the suitable variables on the map, e.g. "at least 80% pupils are White British" and "Eligible for Free School Meals". A recent result of Petersen et al. (2011) has confirmed again the use of k-means clustering for this problem. The authors presented the London Output Area Classification (LOAC) which, in essence, is a geo-demographic system. The core of this system is k-means clustering, used for mapping health care needs and other health indicators that are useful for targeting neighborhoods in public health campaigns. LOAC was also compared to six other geo-demographic systems from both governmental and commercial sources.

⇑ Corresponding author. Tel.: +84 904171284; fax: +84 0438623938. E-mail address: sonlh@vnu.edu.vn (L.H. Son).

Expert Systems with Applications. Journal homepage: www.elsevier.com/locate/eswa

Basically, the k-means algorithm can be used for GDA with acceptable results. However, there are some limitations of this algorithm that we should consider carefully. Firstly, it is very sensitive to outliers. Thus, in some works such as (Lee et al., 1999), the authors had to try k-means clustering manually for a rather large number of clusters and then identify some outliers, which takes much time. Secondly, the algorithm starts from random positions; these sometimes fall into bad cases, and the results produced are then not optimal. Moreover, assigning a geographical area to a single group is not always the best choice, since this assumption leads to the issue of ecological fallacy. An ecological fallacy is a logical fallacy in the interpretation of statistical data in an ecological study, whereby inferences about the nature of specific individuals are based solely upon aggregate statistics collected for the group to which those individuals belong. In other words, statistics that accurately describe group characteristics do not necessarily apply to individuals within that group. For example, if a group of people is measured to have a lower average IQ than the general population, it is an error to assume that all members of that group have a lower IQ than the general population.

Accordingly, fuzzy clustering is often preferred. Fuzzy clustering assigns a membership value to each area instead of assigning a geographical area to a single group, thus helping to overcome the issue of ecological fallacy. The fuzzy clustering algorithm typically used in GDA is the fuzzy C-means clustering algorithm of Bezdek, known as FCM (Bezdek & Ehrlich, 1984; Kannan, Ramathilagam, & Chung, 2012; Küçükdeniz, Baray, Esnaf, & Ecerkale, 2012). However, FCM omits the geographical factor in its design. For example, it considers that residential area type 27 is the same and behaves in the same way wherever it happens to be located. But what happens if type 27 areas respond differently depending on their map locations? To overcome this shortcoming, Feng and Flowerdew (1998) proposed an extension to the fuzzy clustering technique, which provides for the ex post facto adjustment of the cluster membership values based on "Neighbourhood Effects" (NE). The neighbourhood effects incorporate geography into the model. The neighbourhood effects formula adjusts the cluster membership as shown in the following equation,

\[ u'_k = \alpha\, u_k + \beta\, \frac{1}{A} \sum_{j=1}^{c} w_{kj}\, u_j, \tag{1} \]

where $u'_k$ is the new cluster membership of area $k$ and $u_k$ is the old cluster membership of this area. The two parameters $\alpha$ and $\beta$ are scaling variables which affect the proportion of the original membership and satisfy the condition

\[ \alpha + \beta = 1. \tag{2} \]

$A$ is a factor that scales the "sum" term to the range [0, 1]. The weighted membership is calculated as follows,

\[ w_{kj} = \frac{p^{b}}{d_{kj}^{a}}, \tag{3} \]

where $p$ is the length of the common boundary between $k$ and $j$; $d_{kj}$ is the distance between $k$ and $j$; $a$ and $b$ are user-defined parameters. However, Feng and Flowerdew's neighbourhood effects have some limitations: firstly, they ignore the effects of areas that have no common boundaries; secondly, they exclude the effects of population, a key geo-demographic consideration. To overcome these limitations, Mason and Jacobson (2007) introduced a modified cluster membership adjustment which incorporates a spatial interaction effect model. The new membership function is inspired by the principles of geographical spatial interaction (Birkin & Clarke, 1991) and incorporates a basic spatial interaction model into the weighting of the memberships as well as the typical neighbourhood effects of fuzzy clustering algorithms. This makes cluster centers "geographically aware". In (Mason & Jacobson, 2007), the authors introduced Fuzzy Geographically Weighted Clustering (FGWC), which calculates the influence of one area upon another as the product of the populations of the two areas. A distance decay effect is implemented in the denominator. This effect is implemented through the weighting factor described in the equation below,

\[ w_{kj} = \frac{(pop_k \cdot pop_j)^{b}}{d_{kj}^{a}}, \tag{4} \]

where $pop_k$, $pop_j$ are the populations of areas $k$ and $j$ respectively; $d_{kj}$ is the distance between $k$ and $j$; $a$ and $b$ are user-defined parameters; and $A$ is the factor that scales the "sum" term. It is calculated across all clusters, ensuring that the sum of the memberships of a given area over all clusters is equal to one.
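The membership adjustment with the population-based weights of Eq. (4) can be illustrated with a short sketch. This is not the authors' implementation: the function name `fgwc_adjust`, the choice to sum over all other areas $j$, and the use of the row sum of weights as the scaling factor $A$ are assumptions made for illustration only.

```python
def fgwc_adjust(U, pop, dist, alpha=0.5, beta=0.5, a=1.0, b=1.0):
    """Population-weighted neighbourhood adjustment of fuzzy memberships.

    U: list of N rows, each with C memberships; pop: N populations;
    dist: N x N distances. Assumes alpha + beta = 1, N >= 2 and positive
    distances between distinct areas (illustrative assumptions).
    """
    N, C = len(U), len(U[0])
    adjusted = []
    for k in range(N):
        # w_kj = (pop_k * pop_j)^b / d_kj^a for every other area j
        w = [0.0 if j == k else (pop[k] * pop[j]) ** b / dist[k][j] ** a
             for j in range(N)]
        A = sum(w)  # assumed scaling factor: keeps the weighted sum in [0, 1]
        row = [alpha * U[k][c] + beta * sum(w[j] * U[j][c] for j in range(N)) / A
               for c in range(C)]
        total = sum(row)
        adjusted.append([v / total for v in row])  # memberships sum to one
    return adjusted
```

With `alpha=1.0` and `beta=0.0` the memberships are returned unchanged, which is a quick way to check the geographic term in isolation.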

FGWC is considered the state-of-the-art for GDA problems. However, its speed and the quality of the clustering achieved can be improved (Son et al., 2011). While in (Son et al., 2011) we showed an approach to significantly speed up FGWC, in this paper we instead focus on the quality of the clusters produced. In particular, based on the results of Intuitionistic Fuzzy Sets (IFS) (Atanassov, 1986), Possibilistic Fuzzy C-Means (PFCM) (Pal et al., 2005) and Generalized IFS (GIFS) (Liu, 2010), we propose an integrated approach to produce clusters of higher quality. The approach has been validated empirically.

3 The proposed algorithm

3.1 Intuitionistic fuzzy sets

Fuzzy sets were introduced by Zadeh (1965), and since then this concept has been applied to various algebraic structures. Basically, a fuzzy set is defined as follows.

Definition 1. A Fuzzy Set (FS) $\mu$ (Zadeh, 1965) in a non-empty set $X$ is a function

\[ \mu : X \to [0, 1], \qquad x \mapsto \mu(x), \tag{5} \]

where $\mu(x)$ is the membership degree of each element $x \in X$. A fuzzy set can alternatively be defined as

\[ A = \{ \langle x, \mu(x) \rangle \mid x \in X \}. \tag{6} \]

As practical problems have developed, this notion has proved too restrictive. Evidence for this can be drawn from the electoral problem (Atanassov, 2003). Suppose that we have a list of all countries ($n$) with elective governments. Normally, if we assume $p$ is the number of countries whose electors vote for the corresponding government, then the number of countries whose electors do not vote is $n - p$. However, it is possible that some votes are given to parties or persons outside the government. This situation cannot be modelled by traditional fuzzy sets.

Motivated by the concept of "intuitionism" of L. Brouwer, who invited mathematicians to remove Aristotle's law of the excluded middle, Atanassov (1986) presented the idea of intuitionistic fuzzy sets. That is to say, if we have a proposition A, we can state that A is true, that A is false, or that we do not know whether A is true or false. This idea is expressed by the definition below.

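Definition 2. An Intuitionistic Fuzzy Set (IFS) (Atanassov, 1986) in a non-empty set $X$ is

\[ \tilde{A} = \{ \langle x, \mu_{\tilde{A}}(x), \gamma_{\tilde{A}}(x) \rangle \mid x \in X \}, \tag{7} \]

where $\mu_{\tilde{A}}(x)$ is the membership degree of each element $x \in X$ and $\gamma_{\tilde{A}}(x)$ is the non-membership degree,

\[ \mu_{\tilde{A}}(x), \gamma_{\tilde{A}}(x) \in [0, 1], \quad \forall x \in X, \tag{8} \]

\[ 0 \le \mu_{\tilde{A}}(x) + \gamma_{\tilde{A}}(x) \le 1, \quad \forall x \in X. \tag{9} \]

The intuitionistic fuzzy index of an element, showing its non-determinacy, is denoted as

\[ \pi_{\tilde{A}}(x) = 1 - \mu_{\tilde{A}}(x) - \gamma_{\tilde{A}}(x), \quad \forall x \in X. \tag{10} \]

When $\pi_{\tilde{A}}(x) = 0$, the IFS becomes the FS of Zadeh.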
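As a small illustration of Definition 2, the sketch below checks conditions (8) and (9) and returns the intuitionistic fuzzy index of Eq. (10). The function name `hesitation` is hypothetical and chosen only for this example.

```python
def hesitation(mu, gamma):
    """Return pi = 1 - mu - gamma after checking conditions (8) and (9)."""
    if not (0.0 <= mu <= 1.0 and 0.0 <= gamma <= 1.0):
        raise ValueError("degrees must lie in [0, 1]")
    if mu + gamma > 1.0:
        raise ValueError("mu + gamma must not exceed 1")
    return 1.0 - mu - gamma
```

For a pair such as `hesitation(0.5, 0.5)` the index is zero, which is the case in which the IFS reduces to an ordinary fuzzy set.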

Since this notion was introduced, several applications of IFS have been presented. For instance, Sen and Saha (1986) defined the concept of a Γ-semigroup and established a relation between regular Γ-semigroups and Γ-groups. Kim and Jun (2001) investigated some properties of intuitionistic fuzzy sets in a semigroup. Uckun, Ozturk, and Jun (2007) introduced the notion of an intuitionistic fuzzy Γ-ideal of a Γ-semigroup and investigated some properties connected with intuitionistic fuzzy Γ-ideals in a Γ-semigroup. Similarly, Gau and Buehrer (1993) introduced Vague Sets. In that work, the authors noted that the drawback of using a single membership value in fuzzy set theory is that the evidence for $x \in X$ and the evidence against $x \in X$ are in fact mixed together. Therefore, instead of the point-based membership of FS, interval-based membership is used (Lu & Ng, 2005).

Definition 3. A Vague Set (VS) (Gau & Buehrer, 1993) in a universe of discourse $X$ is

\[ V = \sum_{i=1}^{n} [\alpha_V(x_i),\, 1 - \beta_V(x_i)] / x_i, \quad x_i \in X, \tag{11} \]

\[ 0 \le \alpha_V(x_i) \le 1 - \beta_V(x_i) \le 1, \quad 1 \le i \le n, \tag{12} \]

where $\alpha_V(x_i)$ is a lower bound on the grade of membership of $x_i$ derived from the evidence for $x_i$, and $\beta_V(x_i)$ is a lower bound on the grade of membership of the negation of $x_i$ derived from the evidence against $x_i$. The grade of membership of $x_i$ is bounded to a sub-interval $[\alpha_V(x_i), 1 - \beta_V(x_i)] \subseteq [0, 1]$. When $\alpha_V(x_i) = 1 - \beta_V(x_i)$, the VS becomes the FS of Zadeh. Moreover, VS has been shown to be equivalent to IFS with the intuitionistic fuzzy index $1 - \alpha_V(x_i) - \beta_V(x_i)$.

However, sometimes we know neither the membership degree nor the non-membership degree of an element, but only their value ranges. To cope with this issue, Atanassov and Gargov (1989) extended IFS and introduced the concept of the Interval-Valued Intuitionistic Fuzzy Set (IVIFS).

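Definition 4. An Interval-Valued Intuitionistic Fuzzy Set (IVIFS) (Atanassov & Gargov, 1989) in a non-empty set $X$ is

\[ \tilde{A}_{IV} = \{ \langle x, \mu_{\tilde{A}_{IV}}(x), \gamma_{\tilde{A}_{IV}}(x) \rangle \mid x \in X \}, \tag{13} \]

where

\[ \mu_{\tilde{A}_{IV}}(x) = \left[ \mu^{L}_{\tilde{A}_{IV}}(x),\, \mu^{U}_{\tilde{A}_{IV}}(x) \right] \subseteq [0, 1], \tag{14} \]

\[ \mu^{L}_{\tilde{A}_{IV}}(x) = \inf \mu_{\tilde{A}_{IV}}(x), \quad \mu^{U}_{\tilde{A}_{IV}}(x) = \sup \mu_{\tilde{A}_{IV}}(x), \quad \forall x \in X, \tag{15} \]

\[ \gamma_{\tilde{A}_{IV}}(x) = \left[ \gamma^{L}_{\tilde{A}_{IV}}(x),\, \gamma^{U}_{\tilde{A}_{IV}}(x) \right] \subseteq [0, 1], \tag{16} \]

\[ \gamma^{L}_{\tilde{A}_{IV}}(x) = \inf \gamma_{\tilde{A}_{IV}}(x), \quad \gamma^{U}_{\tilde{A}_{IV}}(x) = \sup \gamma_{\tilde{A}_{IV}}(x), \quad \forall x \in X. \tag{17} \]

A constraint is set up for (13):

\[ 0 \le \mu_{\tilde{A}_{IV}}(x) + \gamma_{\tilde{A}_{IV}}(x) \le 1. \tag{18} \]

The intuitionistic fuzzy index is

\[ \pi_{\tilde{A}_{IV}}(x) = \left[ \pi^{L}_{\tilde{A}_{IV}}(x),\, \pi^{U}_{\tilde{A}_{IV}}(x) \right], \tag{19} \]

\[ \pi^{L}_{\tilde{A}_{IV}}(x) = 1 - \mu^{U}_{\tilde{A}_{IV}}(x) - \gamma^{U}_{\tilde{A}_{IV}}(x), \quad \pi^{U}_{\tilde{A}_{IV}}(x) = 1 - \mu^{L}_{\tilde{A}_{IV}}(x) - \gamma^{L}_{\tilde{A}_{IV}}(x). \tag{20} \]

Note that, when $\mu_{\tilde{A}_{IV}}(x) = \mu^{L}_{\tilde{A}_{IV}}(x) = \mu^{U}_{\tilde{A}_{IV}}(x)$ and $\gamma_{\tilde{A}_{IV}}(x) = \gamma^{L}_{\tilde{A}_{IV}}(x) = \gamma^{U}_{\tilde{A}_{IV}}(x)$, the IVIFS reduces to IFS.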

In 2002, Mondal and Samanta (2002) presented a generalized version of IFS in which condition (9) is extended by a fuzzy operator.

Definition 5. A Mondal and Samanta Generalized Intuitionistic Fuzzy Set (MS-GIFS) (2002) in a non-empty set $X$ is

\[ \tilde{A}_{MS} = \{ \langle x, \mu_{\tilde{A}_{MS}}(x), \gamma_{\tilde{A}_{MS}}(x) \rangle \mid x \in X \}. \tag{21} \]

The meanings of the variables are similar to those in Def. 2. Nevertheless, condition (9) is changed slightly, as shown below:

\[ 0 \le \mu_{\tilde{A}}(x) \wedge \gamma_{\tilde{A}}(x) \le 0.5, \quad \forall x \in X. \tag{22} \]

Recently, Liu (2010) extended the Mondal and Samanta model by attaching an extensional index, the so-called L-value. This index was incorporated into condition (22) above.

Definition 6. A Liu Generalized Intuitionistic Fuzzy Set (L-GIFS) (Liu, 2010) in a non-empty set $X$ is

\[ \tilde{A}_{L} = \{ \langle x, \mu_{\tilde{A}_{L}}(x), \gamma_{\tilde{A}_{L}}(x) \rangle \mid x \in X \}. \tag{23} \]

Condition (22) is replaced by

\[ 0 \le \mu_{\tilde{A}}(x) + \gamma_{\tilde{A}}(x) \le 1 + L, \quad \forall x \in X. \tag{24} \]

When $L = 0$, the Liu fuzzy sets reduce to IFS.
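The conditions that separate the set families of Definitions 1, 2, 5 and 6 can be compared side by side in a short sketch. The function name `admissible_families` is hypothetical, the operator $\wedge$ in (22) is read here as the minimum (an assumption), and both degrees are assumed to lie in [0, 1].

```python
def admissible_families(mu, gamma, L=1.0, tol=1e-12):
    """List the set families whose condition the pair (mu, gamma) satisfies."""
    families = []
    if abs(mu + gamma - 1.0) <= tol:
        families.append("FS")        # no hesitation: gamma = 1 - mu
    if mu + gamma <= 1.0 + tol:
        families.append("IFS")       # condition (9)
    if min(mu, gamma) <= 0.5 + tol:
        families.append("MS-GIFS")   # condition (22), with ^ read as minimum
    if mu + gamma <= 1.0 + L + tol:
        families.append("L-GIFS")    # condition (24)
    return families
```

For instance, the pair (0.9, 0.4) violates condition (9) but satisfies (22) and (24), so it is admissible only in the two generalized families.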

Liu has also proved that for any universe of discourse $X$,

\[ FS(X) \subset IFS(X) \subset MS\text{-}GIFS(X) \subset L\text{-}GIFS(X), \tag{25} \]

\[ L\text{-}GIFS(X) \not\subset MS\text{-}GIFS(X) \not\subset IFS(X) \not\subset FS(X). \tag{26} \]

Notice that the L-value used in (25) and (26) is one. Besides, from Def. 3, we also confirm that

\[ VS(X) \equiv IFS(X). \tag{27} \]

Some algebraic operations, fuzzy relations and fuzzy topology for these sets have been discussed intensively in the corresponding literature.

As mentioned in the previous section, fuzzy clustering is the method of choice to tackle GDA problems. However, the usual Fuzzy C-Means (FCM) and its variant FGWC rely on the basic principles of the fuzzy sets of Zadeh (1965). This kind of fuzzy set has been shown to have some limitations for current practical problems. To fulfil the objective of this paper, we therefore need to develop a clustering algorithm on some extension of the fuzzy sets of Zadeh. Thus, in the next parts, we pay particular attention to the Intuitionistic Fuzzy Set (IFS) and its latest version, L-GIFS. The proposed clustering algorithm will first be constructed on IFS and then extended to L-GIFS.

Example 1. By Matlab simulation, we illustrate these fuzzy sets in Figs. 1–4.

3.2 Status quo of clustering algorithms for IFS

Clustering algorithms for IFS have been studied intensively in recent times. By providing a useful way to describe fuzzy sets, these kinds of algorithms often enhance the understanding of the internal structure of data.

Pelekis, Iakovidis, Kotsifakos, and Kopanakis (2008) argued that the distances determining the membership of a feature vector to a cluster are also subject to uncertainty. Current fuzzy clustering approaches do not utilize any information about uncertainty at the constitutional feature level. The most typical one, the FCM algorithm, tries to partition the dataset by just looking at the feature vectors, and as such it ignores the fact that these vectors may be accompanied by qualitative information which may be given per feature. For example, following the idea of intuitionistic fuzzy set theory, a data point $X_k$ is not just a $p$-dimensional vector $(X_{k1}, X_{k2}, \ldots, X_{kp})$ of quantitative information, but instead a $p$-dimensional vector of triplets $[(X_{k1}, \mu_{k1}, \gamma_{k1}), (X_{k2}, \mu_{k2}, \gamma_{k2}), \ldots, (X_{kp}, \mu_{kp}, \gamma_{kp})]$, where for each $X_{ki}$ there exists qualitative information provided via the intuitionistic membership $\mu_{ki}$ and non-membership $\gamma_{ki}$ of the current data point to the feature $i$.

In this situation, the traditional distance function of the FCM algorithm is not suitable because it operates only on the feature vectors and not on the qualitative information which may be given per feature. For this reason, Pelekis et al. (2008) presented a modified version of the FCM algorithm for IFS with a new intuitionistic fuzzy set distance metric,

\[ D(A, B) = \frac{d(\mu_A, \mu_B) + d(\gamma_A, \gamma_B)}{2}, \tag{28} \]

where

\[ d(A', B') = \begin{cases} \dfrac{\sum_{i=1}^{N} \min\left(A'(x_i), B'(x_i)\right)}{\sum_{i=1}^{N} \max\left(A'(x_i), B'(x_i)\right)} & \text{if } A' \cup B' \ne \emptyset, \\[2ex] 1 & \text{if } A' \cup B' = \emptyset. \end{cases} \tag{29} \]
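The measure of Eqs. (28) and (29) can be sketched in a few lines. The function names `ratio` and `pelekis_measure` are hypothetical, and each set is assumed to be given as a list of degrees over the same $N$ elements.

```python
def ratio(a, b):
    """Sum of pointwise minima over sum of pointwise maxima, Eq. (29)."""
    denom = sum(max(x, y) for x, y in zip(a, b))
    if denom == 0:  # both fuzzy sets are empty
        return 1.0
    return sum(min(x, y) for x, y in zip(a, b)) / denom


def pelekis_measure(mu_a, gamma_a, mu_b, gamma_b):
    """Average of the membership and non-membership ratios, Eq. (28)."""
    return (ratio(mu_a, mu_b) + ratio(gamma_a, gamma_b)) / 2.0
```

Note that the value equals one for identical sets and decreases as the sets diverge.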

Zhang, Xu, and Chen (2007) defined the concept of the intuitionistic fuzzy similarity matrix (IFSM) and constructed an intuitionistic fuzzy equivalence matrix (IFEM). They then transformed the IFSM into the IFEM and clustered the dataset based on the λ-cutting matrix of the interval-valued matrix. However, the whole process is performed on IVIFS (Def. 4) converted from the original IFS. As a result, it requires a lot of computation and loses too much information in the process of calculating the intuitionistic fuzzy similarity degree (IFSD).

Fig. 2. The IFS sets.

To overcome this limitation, Xu, Chen, and Wu (2008) proposed an algorithm that uses the derived association coefficients (AC) of IFS to construct an association matrix (AM), and utilizes the transitivity principle of the equivalent matrix (Zhang, 1983) to transform it into an equivalent association matrix (EAM). The EAM is then used to construct the λ-cutting matrix, on which the classification is performed.

\[ C(A, B) = \frac{\sum_{i=1}^{N} w_i \left[ \mu_A(x_i)\mu_B(x_i) + \gamma_A(x_i)\gamma_B(x_i) + \pi_A(x_i)\pi_B(x_i) \right]}{\max\left\{ \sum_{i=1}^{N} w_i \left[ \mu_A^2(x_i) + \gamma_A^2(x_i) + \pi_A^2(x_i) \right],\ \sum_{i=1}^{N} w_i \left[ \mu_B^2(x_i) + \gamma_B^2(x_i) + \pi_B^2(x_i) \right] \right\}}. \tag{30} \]

The formula for the AC, Eq. (30), is somewhat similar to the similarity degree of (Pelekis et al., 2008).
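A minimal sketch of the association coefficient of Eq. (30) follows. The function name `association` is hypothetical; each IFS is assumed to be a list of (membership, non-membership) pairs, and equal weights are assumed when none are given.

```python
def association(A, B, w=None):
    """Association coefficient between two IFS given as lists of (mu, gamma)."""
    n = len(A)
    w = w or [1.0 / n] * n  # assumed default: equal weights

    def triple(p):
        mu, gamma = p
        return (mu, gamma, 1.0 - mu - gamma)  # append the hesitation index

    ta, tb = [triple(p) for p in A], [triple(p) for p in B]
    num = sum(wi * sum(x * y for x, y in zip(a, b)) for wi, a, b in zip(w, ta, tb))
    na = sum(wi * sum(x * x for x in a) for wi, a in zip(w, ta))
    nb = sum(wi * sum(y * y for y in b) for wi, b in zip(w, tb))
    return num / max(na, nb)
```

The coefficient is symmetric in its arguments and equals one when both sets coincide.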

In 2009, Xu (2009) also introduced an intuitionistic fuzzy hierarchical algorithm for clustering IFS, which is based on the traditional hierarchical clustering procedure and the intuitionistic fuzzy aggregation operator.

However, there are some limitations in this series of works by Xu et al. Firstly, the number of clusters is not set beforehand, so these algorithms cannot partition the dataset into a pre-defined number of clusters. Secondly, we cannot determine a suitable value of the parameter λ, on which the resulting number of clusters depends. Thirdly, these algorithms cannot provide information about the membership degrees of the objects to each cluster.

To overcome these obstacles, Xu and Wu (2010) developed an intuitionistic fuzzy C-means algorithm to cluster IFS, which is based on the well-known fuzzy C-means clustering method and the basic distance measures between IFS (Szmidt & Kacprzyk, 2000; Xu, 2007):

\[ d_1(A, B) = \sqrt{ \frac{1}{2} \sum_{i=1}^{N} w_i \left[ \left(\mu_A(x_i) - \mu_B(x_i)\right)^2 + \left(\gamma_A(x_i) - \gamma_B(x_i)\right)^2 + \left(\pi_A(x_i) - \pi_B(x_i)\right)^2 \right] }. \tag{31} \]
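The weighted distance of Eq. (31) can be sketched as follows. The function name `ifs_distance` is hypothetical; each IFS is assumed to be a list of (membership, non-membership) pairs, with equal weights when none are supplied.

```python
import math


def ifs_distance(A, B, w=None):
    """Weighted Euclidean distance between two IFS, following Eq. (31)."""
    n = len(A)
    w = w or [1.0 / n] * n  # assumed default: equal weights
    total = 0.0
    for wi, (ma, ga), (mb, gb) in zip(w, A, B):
        pa, pb = 1.0 - ma - ga, 1.0 - mb - gb  # hesitation indices
        total += wi * ((ma - mb) ** 2 + (ga - gb) ** 2 + (pa - pb) ** 2)
    return math.sqrt(total / 2.0)
```

Two single-element sets with opposite degrees, (1, 0) and (0, 1), are at distance one, the largest value this measure can take with weights summing to one.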

The two modified FCM algorithms for IFS in (Pelekis et al., 2008; Xu & Wu, 2010) differ not only in the distance metric but also in the way the cluster centers are defined. In (Pelekis et al., 2008), the membership (non-membership) degree of a center is calculated as the average of those of its data points, as specified through the membership matrix. In (Xu & Wu, 2010), it is calculated through a combination of membership values and data points, thus giving a better representation of the cluster centers. In general, Xu and Wu's algorithm performs more stably and effectively than the other clustering methods on IFS.

3.3 A new fuzzy clustering method for intuitionistic fuzzy sets

From the previous section, a natural question arises: "Is the Xu and Wu method (2010) enough for the GDA problem?" To a certain extent, it is acceptable. However, due to the limitations below, we cannot rely totally on this method and have to think about a new one.

Firstly, the Xu and Wu method uses FCM as a skeleton on which to deploy the main algorithm. As pointed out by Pal et al. (2005), FCM is very sensitive to outliers. Consider the case of two clusters. If an outlier is equidistant from the two prototypes, then its membership in each cluster will be the same regardless of the absolute value of the distance of this point from the two centers as well as from the other points in the data. In such cases, the membership values should differ according to the distance criteria.

Secondly, the membership values in both methods (Pelekis et al., 2008; Xu & Wu, 2010) are still crisp values. In other words, they belong to the traditional fuzzy set (FS) of Zadeh. In some specific contexts, we should extend this set into an intuitionistic one.

Thirdly, most geo-demographic data treat their features equally. In other words, the membership values of an element to its features are often one. Hence, we should reduce the pattern and center sets to FS instead of IFS.

From those observations, we should investigate a new fuzzy clustering algorithm for IFS. Pal et al. (2005) presented an effective fuzzy clustering method called Possibilistic Fuzzy C-Means (PFCM). PFCM was proven to solve the noise sensitivity defect of FCM, overcome the coincident clusters problem of Possibilistic C-Means (PCM) and eliminate the row sum constraints of Fuzzy Possibilistic C-Means (FPCM). Therefore, it is suitable for the proposed algorithm.

Fig. 3. The MS-GIFS sets.

Fig. 4. The L-GIFS sets.

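Now, we state the objective function of the considered problem (Intuitionistic Possibilistic Fuzzy C-Means, IPFCM):

\[ J_{m,\eta,\tau}(U, T, H, V; X) = \sum_{k=1}^{N} \sum_{j=1}^{C} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right) \|X_k - V_j\|^2 + \sum_{j=1}^{C} \gamma_j \sum_{k=1}^{N} (1 - t_{kj})^{\eta} \to \min. \tag{32} \]

In (32), we have to partition the pattern set $X$ into $C$ clusters. Each pattern has a membership value $U$, a hesitation level $H$ and a typicality value $T$. The constraints for this problem are

\[ \begin{aligned} &\sum_{j=1}^{C} u_{kj} = 1, \qquad \sum_{j=1}^{C} h_{kj} = 1, \\ &a_i > 0, \quad i = \overline{1, 3}, \\ &\gamma_j,\ j = \overline{1, C}, \text{ are constants}, \\ &u_{kj}, t_{kj}, h_{kj} \in [0, 1]. \end{aligned} \tag{33–35} \]

Some special cases can be seen as follows,

\[ \begin{aligned} &a_3 = 0: && \text{IPFCM} \to \text{PFCM}, \\ &a_1 = 0 \wedge a_3 = 0: && \text{IPFCM} \to \text{PCM}, \\ &a_3 = 0 \wedge a_2 = 0 \wedge \gamma_j = 0,\ \forall j = \overline{1, C}: && \text{IPFCM} \to \text{FCM}, \\ &a_3 = 0 \wedge a_1 = a_2 = 1 \wedge \gamma_j = 0,\ \forall j = \overline{1, C}: && \text{IPFCM} \to \text{FPCM}. \end{aligned} \tag{36} \]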
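To make the structure of the objective function concrete, the sketch below evaluates Eq. (32) for given matrices. The function name `ipfcm_objective`, the argument layout (lists of lists indexed by pattern $k$ and cluster $j$) and the default parameter values are assumptions for illustration.

```python
def ipfcm_objective(X, V, U, T, H, gamma, a=(1.0, 1.0, 1.0),
                    m=2.0, eta=2.0, tau=2.0):
    """Evaluate the IPFCM objective for patterns X and centers V."""
    a1, a2, a3 = a
    fit = 0.0
    for k, x in enumerate(X):
        for j, v in enumerate(V):
            d2 = sum((xi - vi) ** 2 for xi, vi in zip(x, v))  # ||X_k - V_j||^2
            fit += (a1 * U[k][j] ** m + a2 * T[k][j] ** eta
                    + a3 * H[k][j] ** tau) * d2
    # penalty term that discourages vanishing typicality values
    penalty = sum(gamma[j] * sum((1.0 - T[k][j]) ** eta for k in range(len(X)))
                  for j in range(len(V)))
    return fit + penalty
```

Setting `a=(1.0, 0.0, 0.0)` together with zero `gamma` leaves only the membership term, which corresponds to the FCM special case listed above.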

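Theorem 1. The optimal solutions of the system (32)–(35) are

\[ u_{kj} = \frac{1}{\sum_{i=1}^{C} \left( \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{m-1}}}, \quad k = \overline{1, N},\ j = \overline{1, C}, \tag{37} \]

\[ h_{kj} = \frac{1}{\sum_{i=1}^{C} \left( \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{\tau-1}}}, \quad k = \overline{1, N},\ j = \overline{1, C}, \tag{38} \]

\[ t_{kj} = \frac{1}{1 + \left( \frac{a_2 \|X_k - V_j\|^2}{\gamma_j} \right)^{\frac{1}{\eta-1}}}, \quad k = \overline{1, N},\ j = \overline{1, C}, \tag{39} \]

\[ V_j = \frac{\sum_{k=1}^{N} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right) X_k}{\sum_{k=1}^{N} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right)}, \quad j = \overline{1, C}. \tag{40} \]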
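The update formulas of Theorem 1 can be combined into one iteration step, sketched below in plain Python. This is an illustrative reading of Eqs. (37)–(40), not the authors' code: the function name `ipfcm_step`, the default parameter values and the small floor `eps` on squared distances (added to avoid division by zero when a pattern coincides with a center) are assumptions.

```python
def ipfcm_step(X, V, gamma, a=(1.0, 1.0, 1.0), m=2.0, eta=2.0, tau=2.0,
               eps=1e-12):
    """One pass of the update formulas: returns (U, T, H, V_new)."""
    a1, a2, a3 = a
    # squared distances, floored to avoid division by zero when X_k == V_j
    D2 = [[max(sum((xi - vi) ** 2 for xi, vi in zip(x, v)), eps) for v in V]
          for x in X]

    def normalised(d2_row, expo):
        # 1 / sum_i (d_j / d_i)^(2/(expo-1)), written with squared distances
        p = 1.0 / (expo - 1.0)
        return [1.0 / sum((dj / di) ** p for di in d2_row) for dj in d2_row]

    U = [normalised(row, m) for row in D2]    # memberships
    H = [normalised(row, tau) for row in D2]  # hesitation levels
    T = [[1.0 / (1.0 + (a2 * d / gamma[j]) ** (1.0 / (eta - 1.0)))
          for j, d in enumerate(row)] for row in D2]  # typicality values

    V_new = []
    for j in range(len(V)):
        wts = [a1 * U[k][j] ** m + a2 * T[k][j] ** eta + a3 * H[k][j] ** tau
               for k in range(len(X))]
        total = sum(wts)
        V_new.append([sum(w * x[d] for w, x in zip(wts, X)) / total
                      for d in range(len(X[0]))])
    return U, T, H, V_new
```

In practice such a step would be repeated until the centers stop moving; the stopping rule is left out of this sketch.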

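Proof. Fixing $T$, $H$, $V$ for the $k$th column $U_k$ of $U$, we get the reduced problem

\[ J_{m,\eta,\tau}(U_k) = \sum_{j=1}^{C} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right) \|X_k - V_j\|^2 \to \min. \tag{41} \]

Using a Lagrange multiplier for (41), we obtain

\[ L(U_k, \lambda_k) = \sum_{j=1}^{C} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right) \|X_k - V_j\|^2 - \lambda_k \left( \sum_{j=1}^{C} u_{kj} - 1 \right), \tag{42} \]

\[ \frac{\partial L(U_k, \lambda_k)}{\partial u_{kj}} = a_1 m\, u_{kj}^{m-1} \|X_k - V_j\|^2 - \lambda_k, \tag{43} \]

\[ \frac{\partial L(U_k, \lambda_k)}{\partial u_{kj}} = 0 \iff u_{kj} = \left( \frac{\lambda_k}{a_1 m \|X_k - V_j\|^2} \right)^{\frac{1}{m-1}}, \quad j = \overline{1, C},\ k = \overline{1, N}. \tag{44} \]

Since $\sum_{j=1}^{C} u_{kj} = 1$, we easily have

\[ \sum_{j=1}^{C} \left( \frac{\lambda_k}{a_1 m \|X_k - V_j\|^2} \right)^{\frac{1}{m-1}} = 1 \tag{45} \]

\[ \iff \lambda_k = a_1 m \left[ \sum_{j=1}^{C} \left( \frac{1}{\|X_k - V_j\|^2} \right)^{\frac{1}{m-1}} \right]^{-(m-1)}. \tag{46} \]

From (44) and (46), we get

\[ u_{kj} = \frac{1}{\sum_{i=1}^{C} \left( \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{m-1}}}, \quad k = \overline{1, N},\ j = \overline{1, C}. \tag{47} \]

Similarly, fixing $U$, $T$, $V$ for the $k$th column $H_k$ of $H$ and using a Lagrange multiplier, we obtain the solution

\[ h_{kj} = \frac{1}{\sum_{i=1}^{C} \left( \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{\tau-1}}}, \quad k = \overline{1, N},\ j = \overline{1, C}. \tag{48} \]

Fixing $U$, $H$, $V$ for the typicality value $t_{kj}$, we obtain the reduced problem

\[ J^{kj}_{m,\eta,\tau}(t_{kj}) = \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right) \|X_k - V_j\|^2 + \gamma_j (1 - t_{kj})^{\eta} \to \min. \tag{49} \]

Using the gradient method for (49), the solution is found as

\[ \frac{\partial J^{kj}_{m,\eta,\tau}(t_{kj})}{\partial t_{kj}} = a_2 \eta\, t_{kj}^{\eta-1} \|X_k - V_j\|^2 - \gamma_j \eta\, (1 - t_{kj})^{\eta-1}, \tag{50} \]

\[ \frac{\partial J^{kj}_{m,\eta,\tau}(t_{kj})}{\partial t_{kj}} = 0 \iff a_2 \eta\, t_{kj}^{\eta-1} \|X_k - V_j\|^2 = \gamma_j \eta\, (1 - t_{kj})^{\eta-1}, \tag{51} \]

\[ \iff \frac{a_2 \|X_k - V_j\|^2}{\gamma_j} = \left( \frac{1}{t_{kj}} - 1 \right)^{\eta-1}, \tag{52} \]

\[ \iff 1 + \left( \frac{a_2 \|X_k - V_j\|^2}{\gamma_j} \right)^{\frac{1}{\eta-1}} = \frac{1}{t_{kj}}, \tag{53} \]

\[ \iff t_{kj} = \frac{1}{1 + \left( \frac{a_2 \|X_k - V_j\|^2}{\gamma_j} \right)^{\frac{1}{\eta-1}}}, \quad k = \overline{1, N},\ j = \overline{1, C}. \tag{54} \]

Fixing $U$, $H$, $T$, we calculate the gradient of $J_{m,\eta,\tau}(V)$ with respect to each $V_j$,

\[ J_{m,\eta,\tau}(V) = \sum_{k=1}^{N} \sum_{j=1}^{C} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right) \|X_k - V_j\|^2 \to \min, \tag{55} \]

\[ \frac{\partial J_{m,\eta,\tau}(V)}{\partial V_j} = \sum_{k=1}^{N} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right) (-2 X_k + 2 V_j), \tag{56} \]

\[ \frac{\partial J_{m,\eta,\tau}(V)}{\partial V_j} = 0 \iff V_j = \frac{\sum_{k=1}^{N} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right) X_k}{\sum_{k=1}^{N} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right)}. \tag{57} \]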

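Similar to (Pal et al., 2005), in this paper we investigate some interesting properties of the solutions above.

Property 1. Since the exponential factor belongs to the interval [0, 1], we get

\[ \lim_{m \to 1^+} \{u_{kj}\} = \frac{1}{\lim_{m \to 1^+} \sum_{i=1}^{C} \left( \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{m-1}}} = 1, \tag{58} \]

\[ \lim_{m \to 1^-} \{u_{kj}\} = 0. \tag{59} \]

Property 2. The series expansion of the term $a^{2/(m-1)}$ at $m = \infty$ is

\[ \begin{aligned} 1 &+ \frac{2 \log a}{m} + \frac{2 (\log^2 a + \log a)}{m^2} + \frac{\frac{4}{3} \log^3 a + 4 \log^2 a + 2 \log a}{m^3} \\ &+ \frac{\frac{2}{3} \log^4 a + 4 \log^3 a + 6 \log^2 a + 2 \log a}{m^4} \\ &+ \frac{\frac{4}{15} \log^5 a + \frac{8}{3} \log^4 a + 8 \log^3 a + 8 \log^2 a + 2 \log a}{m^5} + O\!\left( \left( \frac{1}{m} \right)^{6} \right), \end{aligned} \tag{60} \]

where

\[ a = \sum_{i=1}^{C} \frac{\|X_k - V_j\|}{\|X_k - V_i\|}. \tag{61} \]

Thus,

\[ \lim_{m \to \infty} \{u_{kj}\} = 1. \tag{62} \]

Obviously, when the parameter $m$ is getting larger, the value of $u_{kj}$ tends to converge to one.

Property 3. Similarly, some limits of the hesitation level are

\[ \lim_{\tau \to 1^+} \{h_{kj}\} = 1, \qquad \lim_{\tau \to 1^-} \{h_{kj}\} = 0, \qquad \lim_{\tau \to \infty} \{h_{kj}\} = 1. \tag{63} \]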

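Property 4. Limits of the typicality value are

\[ \lim_{\eta \to 1^+} \{t_{kj}\} = \frac{1}{1 + \lim_{\eta \to 1^+} \left( \frac{a_2 \|X_k - V_j\|^2}{\gamma_j} \right)^{\frac{1}{\eta-1}}} = \begin{cases} 1 & \text{if } \|X_k - V_j\|^2 < \gamma_j / a_2, \\ 1/2 & \text{if } \|X_k - V_j\|^2 = \gamma_j / a_2, \\ 0 & \text{otherwise}, \end{cases} \tag{64} \]

\[ \lim_{\eta \to 1^-} \{t_{kj}\} = \begin{cases} 0 & \text{if } \|X_k - V_j\|^2 < \gamma_j / a_2, \\ 1/2 & \text{if } \|X_k - V_j\|^2 = \gamma_j / a_2, \\ 1 & \text{otherwise}. \end{cases} \tag{65} \]

Property 5.

\[ \lim_{\eta \to \infty} \{t_{kj}\} = \frac{1}{2}. \tag{66} \]

Property 6. If $a_2 = 0$ then $t_{kj} = 1$, $\forall k = \overline{1, N}$, $j = \overline{1, C}$.

This property shows that the typicality values are independent of the distances between $X_k$ and $V_j$ when the parameter $a_2 = 0$. In this case, the objective function approximates that of the FCM algorithm with the addition of the hesitation level.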
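The limiting behaviour of the typicality value can be checked numerically from its closed form. The sketch below uses a hypothetical helper `typicality` that takes the squared distance directly; values of $\eta$ close to one and very large values of $\eta$ stand in for the limits.

```python
def typicality(d2, gamma_j=1.0, a2=1.0, eta=2.0):
    """t_kj computed from the squared distance d2 = ||X_k - V_j||^2."""
    return 1.0 / (1.0 + (a2 * d2 / gamma_j) ** (1.0 / (eta - 1.0)))
```

With `eta=1.001`, a squared distance below `gamma_j / a2` gives a value close to one, a squared distance above it gives a value close to zero, and equality gives exactly one half; with `a2=0.0` the value is one regardless of the distance.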

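Property 7. The limit of the sequence $\{V_j\}$ is

\[ \begin{aligned} \lim_{(m,\eta,\tau) \to (\infty,\infty,\infty)} \{V_j\} &= \lim_{(m,\eta,\tau) \to (\infty,\infty,\infty)} \sum_{k=1}^{N} \frac{X_k}{1 + \sum_{l=1, l \ne k}^{N} \frac{a_1 u_{lj}^{m} + a_2 t_{lj}^{\eta} + a_3 h_{lj}^{\tau}}{a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}}} \\ &\cong \sum_{k=1}^{N} \frac{X_k}{1 + (N - 1) \lim_{(m,\eta,\tau) \to (\infty,\infty,\infty)} \frac{a_1 u_{lj}^{m} + a_2 t_{lj}^{\eta} + a_3 h_{lj}^{\tau}}{a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}}} = \frac{\sum_{k=1}^{N} X_k}{N} = \bar{V}, \quad l \in [1, N]. \end{aligned} \tag{67} \]

When the parameters are quite large, all the cluster centers tend to move to the central point of the dataset.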

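Property 8. If $a_2 = \gamma_j = 1$, $j = \overline{1, C}$, and $\eta = 2$, then

\[ \int_{2}^{N} \int_{2}^{C} t_{kj}\, dk\, dj = -(N - C + 1) \log(N - C + 1) + (N - 1) \log(N - 1) - (C - 3) i\pi - (C - 3) \log(C - 3). \tag{68} \]

If $a_2 = \gamma_j = 1$, $j = \overline{1, C}$, and $\eta = 0$, then

\[ \int_{2}^{N} \int_{2}^{C} t_{kj}\, dk\, dj = (N - 2)(C - 2) + (N - C + 1) \log(N - C + 1) - (N - 1) \log(N - 1) + (C - 3) i\pi + (C - 3) \log(C - 3). \tag{69} \]

From this property, we see that there is a gap of the quantity $(N - 2)(C - 2)$ between the areas of the typicality values in the two cases $\eta = 0$ and $\eta = 2$.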

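Property 9.

\[ \lim_{\substack{m \to 1^+ \\ \tau \to 1^+}} \frac{u_{kj}}{h_{kj}} = \lim_{\substack{m \to 1^+ \\ \tau \to 1^+}} \left( \sum_{i=1}^{C} \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2(m - \tau)}{(m-1)(\tau-1)}} = \tilde{\infty}, \tag{70} \]

where $\tilde{\infty}$ is complex infinity.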

lim m!1þ

s!1

ukj

hkj

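\[ \lim_{\substack{m \to \infty \\ \tau \to \infty}} \frac{u_{kj}}{h_{kj}} = \left\{ \left( \sum_{i=1}^{C} \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{\tau-1}},\ \left( \sum_{i=1}^{C} \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{-\frac{2}{m-1}} \right\}. \tag{72} \]

From these formulas, we may see that $u_{kj}$ and $h_{kj}$ have similar roles.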

Theorem 2. The relation between the two parameters m and τ that assures constraint (33) is the IPFCM condition below.

Proof. From constraint (33), we get

1

PC i¼1

kX k V j k

m1

PC i¼1

kX k V j k

s 1

From the Cauchy inequality, we obtain

1

PC i¼1

kX k V j k

kXkVjk

m1

PC i¼1

kX k V j k

kXkVjk

s 1

PC i¼1

kX k V j k

mþ s 2

ðm1Þð s 1Þ

; ð75Þ

) m þs 2

where

A ¼ XC i¼1

kXk Vjk

!

Since $A < 2$,

$$\Rightarrow \frac{m + \tau - 2}{(m - 1)(\tau - 1)} \geq \log_A 2 > 1. \quad (78)$$

Condition (73) allows us to choose correct values of m and τ for the IPFCM algorithm. In the proof above, the strict upper bound for A is one. However, due to the results below,

lim

We, therefore, should choose the loose upper bound for this parameter. In this case, we set it to two.
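To see how such a condition constrains the choice of m and τ, the following sketch evaluates the left-hand side $(m + \tau - 2) / ((m - 1)(\tau - 1))$ appearing in (78) and compares it with $\log_A 2$ for a chosen upper bound A. Reading the bound in this way is an assumption based on the surviving part of the proof, and the helper names `ipfcm_lhs` and `satisfies_condition` are hypothetical. The check requires A > 1 so that the logarithm is defined.

```python
import math


def ipfcm_lhs(m, tau):
    """Left-hand side (m + tau - 2) / ((m - 1)(tau - 1)) of the condition."""
    if m <= 1 or tau <= 1:
        raise ValueError("m and tau must both be greater than 1")
    return (m + tau - 2.0) / ((m - 1.0) * (tau - 1.0))


def satisfies_condition(m, tau, upper_bound_a=2.0):
    """Check lhs >= log_A(2) for the chosen upper bound A (assumed reading)."""
    if upper_bound_a <= 1:
        raise ValueError("the upper bound A must be greater than 1")
    bound = math.log(2.0) / math.log(upper_bound_a)
    return ipfcm_lhs(m, tau) >= bound


# Values used later in the experiments: m = 3, tau = 2, loose bound A = 2.
print(ipfcm_lhs(3, 2))
print(satisfies_condition(3, 2))
```

Note that the left-hand side equals $1/(m - 1) + 1/(\tau - 1)$, so large values of both parameters make the check fail; for instance, m = τ = 4 gives 2/3, which is below the bound of one obtained with A = 2.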

3.4 Fuzzy clustering method for Liu's generalized intuitionistic fuzzy sets

In this section, we extend IPFCM to L-GIFS sets. Notice that the only difference between L-GIFS and IFS sets is the replacement of constraint (9) in IFS by constraint (24) in L-GIFS with the support of an extensional index. In other words, constraint (24) is a generalization of (9) because when this index is set to zero, L-GIFS reduces to IFS. Therefore, the IPFCM algorithm itself remains unchanged. However, to accommodate constraint (24), we now state another condition on the parameters.

Theorem 3. The relation between the two parameters m and τ that assures constraint (24) in the IPFCM algorithm for L-GIFS sets is

m þs 2

ðm 1Þðs 1Þ P log2

2

Proof. Similarly to the proof above, and by the Cauchy inequality, we obtain the inequality below,

2

PC

i¼1

kXkVjk

kX k V j k

mþ s 2

ðm1Þð s 1Þ

Denoting A as in (77), we get the following facts,

m þs 2

ðm 1Þðs 1Þ P logA

2

1 þ L P log2

2

Especially when L = 1,

Because

2 s P 3 2s

The values of the parameters m and τ in (85) are definitely stricter than the ones in (73). Indeed, we can use these constraints to find suitable values of these parameters for the IPFCM algorithm. □

3.5 The intuitionistic possibilistic fuzzy geographically weighted clustering algorithm

In the previous parts of this section, we have presented the fuzzy clustering algorithm for IFS and L-GIFS sets (IPFCM). Now, we present the main algorithm of this paper for the GDA problem. This algorithm is named Intuitionistic Possibilistic Fuzzy Geographically Weighted Clustering (IPFGWC). It is an extension of IPFCM for the considered problem. We have a relationship here,

The algorithm is stated below.

Step 1: Set the number of patterns N, the number of clusters C, the threshold ε > 0 and other parameters such as $m, \eta, \tau > 1$; $a_i > 0$, $i = \overline{1, 3}$; $\gamma_j$, $j = \overline{1, C}$ satisfying the IPFCM condition for the L-GIFS sets (82), and the geographic parameters α, β, a, b.

Step 2: Initialize the centers of clusters $V_j$, $j = \overline{1, C}$ at t = 0.

Step 3: Use (37)–(39) to calculate the membership values, the hesitation level and the typicality values, respectively. The distance used in these formulas is the squared Euclidean distance metric.

Step 4: Perform geographic modifications through equations (1), (2) and (4). Notice that the population of an area is decided through the current membership values.

Step 5: Use (40) to calculate the centers of clusters at t + 1.

Step 6: If the error $\|V^{(t+1)} - V^{(t)}\| \leq \varepsilon$ then stop the algorithm. Otherwise, assign $V^{(t)} = V^{(t+1)}$ and return to Step 3.

The outputs of this algorithm are the clusters' centers, the final membership values and the hesitation levels.
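The control flow of Steps 2 to 6 can be sketched as a short loop. The code below is only a skeleton under strong assumptions: it substitutes the plain FCM membership and center updates for equations (37)–(40), omits the typicality values and hesitation levels, and leaves the geographic modification of Step 4 as an optional, hypothetical callback `geo_step`. All function names are illustrative.

```python
import math


def sq_dist(x, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def fcm_memberships(points, centers, m):
    """Stand-in for Step 3: plain FCM memberships (NOT equations (37)-(39))."""
    memberships = []
    for x in points:
        # Clamp to avoid division by zero when a point sits on a center.
        d = [max(sq_dist(x, v), 1e-12) for v in centers]
        row = [1.0 / sum((dj / di) ** (1.0 / (m - 1.0)) for di in d) for dj in d]
        memberships.append(row)
    return memberships


def update_centers(points, memberships, m):
    """Stand-in for Step 5: FCM center update (NOT equation (40))."""
    n_clusters = len(memberships[0])
    dim = len(points[0])
    centers = []
    for j in range(n_clusters):
        weights = [row[j] ** m for row in memberships]
        total = sum(weights)
        centers.append([
            sum(w * x[d] for w, x in zip(weights, points)) / total
            for d in range(dim)
        ])
    return centers


def cluster(points, centers, m=2.0, eps=1e-3, max_iter=100, geo_step=None):
    """Iterate Steps 3-6 until the centers move by at most eps."""
    for _ in range(max_iter):
        u = fcm_memberships(points, centers, m)      # Step 3 (stand-in)
        if geo_step is not None:
            u = geo_step(u)                          # Step 4 (hypothetical hook)
        new_centers = update_centers(points, u, m)   # Step 5 (stand-in)
        shift = math.sqrt(sum(sq_dist(a, b) for a, b in zip(new_centers, centers)))
        centers = new_centers
        if shift <= eps:                             # Step 6: stopping test
            break
    # u holds the memberships computed in the last executed iteration.
    return centers, u


# Step 2: two initial centers for a toy dataset with two tight groups.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centers, u = cluster(points, centers=[[0.0, 1.0], [4.0, 4.0]])
print(centers)
```

Keeping Step 4 as a callback mirrors the way the geographic modification acts on the current membership values between the membership update and the center update.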

4 Results and discussions

4.1 Experiment Environment

In this part, we describe the experimental tools, the experimental datasets and the cluster validity measurement.

Experimental tools: We have implemented the proposed algorithm (IPFGWC) in addition to three other algorithms, FCM (Bezdek & Ehrlich, 1984), FGWC (Mason & Jacobson, 2007) and CFGWC (Son et al., 2011), in the C programming language and executed them on a PC with an Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00 GHz (2 CPUs) and 2048 MB RAM, running Windows 7 Professional 32-bit.

Some parameters of these algorithms are set up as below.

The proposed algorithm: Threshold $\varepsilon = 10^{-3}$, parameters m = 3, τ = 2, η = 2, $a_1 = a_2 = a_3 = 1$, $\gamma_j = 1$ ($j = \overline{1, C}$) and geographic parameters a = b = 1, α = 0.7, β = 0.3.

FCM, FGWC, CFGWC: These algorithms use the same values of the geographic parameters, m and the threshold as the algorithm above.

Experimental datasets: We use a real dataset of socio-economic demographic variables from the United Nations Organization – UNO (UNSD Statistical Databases, 2011). These data have been collected from national statistical authorities since 1948 through a set of questionnaires dispatched annually by the United Nations Statistics Division to over 230 national statistical offices (Fig. 5). They contain statistics on population size and composition, births, deaths, marriage and divorce on an annual basis, economic activity, educational attainment, household characteristics, housing, ethnicity and language, etc.

Cluster Validity Measurement: We use the validity function of fuzzy clustering for spatial data, namely IFV (Chunchun, Lingkui, & Wenzhong, 2008). This index was shown to be robust and stable when clustering spatial data. The definition of this index is given below,

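$$IFV = \frac{1}{C} \sum_{j=1}^{C} \left\{ \frac{1}{N} \sum_{k=1}^{N} u_{kj}^{2} \left[ \log_2 C - \frac{1}{N} \sum_{k=1}^{N} \log_2 u_{kj} \right]^{2} \right\} \times \frac{SD_{\max}}{\sigma_D}. \quad (88)$$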

The maximal distance between centers is

SDmax¼ max

The even deviation between each object and the cluster center is

$$\sigma_D = \frac{1}{C} \sum_{j=1}^{C} \left( \frac{1}{N} \sum_{k=1}^{N} \|X_k - V_j\|^2 \right).$$

When IFV → max, the value of IFV is said to yield the most optimal partition of the dataset.
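The index can be computed directly from the membership matrix, the data points and the cluster centers. The sketch below makes two assumptions: the bracketed entropy term is squared, as in the original definition of the index (Chunchun, Lingkui, & Wenzhong, 2008), and $SD_{\max}$ is taken as the largest squared Euclidean distance between two centers. The function name `ifv` and the argument layout (`memberships[k][j]` for $u_{kj}$) are illustrative.

```python
import math


def sq_dist(x, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def ifv(points, centers, memberships):
    """IFV validity index; memberships[k][j] is u_kj and must be > 0."""
    n, c = len(points), len(centers)
    # Assumed: SD_max is the largest squared distance between two centers.
    sd_max = max(sq_dist(centers[i], centers[j])
                 for i in range(c) for j in range(c) if i != j)
    # sigma_D: mean over clusters of the mean squared distance of all points.
    sigma_d = sum(sum(sq_dist(x, v) for x in points) / n for v in centers) / c
    total = 0.0
    for j in range(c):
        mean_u2 = sum(row[j] ** 2 for row in memberships) / n
        mean_log = sum(math.log2(row[j]) for row in memberships) / n
        total += mean_u2 * (math.log2(c) - mean_log) ** 2
    return (total / c) * sd_max / sigma_d


# Toy case: two 1-D points, two centers, completely fuzzy memberships.
toy = ifv([[0.0], [2.0]], [[0.0], [2.0]], [[0.5, 0.5], [0.5, 0.5]])
print(toy)
```

Because the index takes $\log_2 u_{kj}$, memberships that are exactly zero must be clipped to a small positive value before calling this function.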

4.2 IFV comparison

Firstly, we calculate the values of the IFV index with respect to m and C. The results are shown in Table 1.

From these results, we can see that the IFV values of the proposed algorithm (IPFGWC) are larger than those of the CFGWC, FGWC and FCM algorithms. The descending order of these algorithms by IFV index is IPFGWC, CFGWC, FGWC and FCM, so IPFGWC obtains the best clustering quality of the four. Moreover, this test also reconfirms the results shown in the literature (Son et al., 2011), where the authors compared the three algorithms FCM, FGWC and CFGWC. FCM has the smallest values of the IFV index due to the lack of geographic effects in the algorithm itself. Overcoming this deficiency, FGWC obtains higher values of IFV than FCM does. However, by supporting a context term, CFGWC outperforms FGWC and, of course, FCM. Finally, the IFV values of IPFGWC are shown to be larger than those of CFGWC (Figs. 6 and 7). Thus, IPFGWC is considered the most effective algorithm in terms of clustering quality among all.

When the parameter m = 1.5, the optimal number of clusters of the IPFGWC, CFGWC, FGWC and FCM algorithms is 3, 7, 4 and 6, respectively. For m = 2.5, these numbers are 3, 2, 4 and 7. Generally, the number of clusters produced by IPFGWC is the smallest among all results. Indeed, the IPFGWC method can generate results which are nearer to the optimal ones than those of the other methods.

In the IPFGWC algorithm, the average IFV value over the numbers of clusters C is 2452.1 and 2391.8 when m = 1.5 and m = 2.5, respectively. Thus, another conclusion can be drawn from this test: when the parameter m increases, the IFV value of IPFGWC tends to decrease.

Secondly, we measure the values of the IFV index with respect to α and C. The results in Table 2 clearly show that the IFV value of IPFGWC is still the largest among all. Furthermore, IPFGWC's result is much better than CFGWC's and FGWC's: the IFV value of IPFGWC is on average 1.5 times larger than CFGWC's and 37.1 times larger than FGWC's. In particular, when C = 2, these factors are 33 and 1079, respectively. Obviously, adding the typicality values and the hesitation level to the objective function has greatly improved the IFV values of the IPFGWC algorithm.

When α = 0.4 (Fig. 8), the optimal number of clusters of the three algorithms is 3, 7 and 5, respectively. For α = 0.6 (Fig. 9) and α = 0.8 (Fig. 10), these numbers are 3, 2, 6 and 3, 2, 2, respectively. The results of IPFGWC seem to be stable across different values of α. On the contrary, the results of CFGWC and FGWC tend to become small as the value of α increases. In the case of α < 0.5, the number of clusters of the IPFGWC algorithm is said to be the most optimal result among all.

Fig 5 Populations of 233 countries in 2001.

Table 1

IFV values by m and C.

Similar to the previous test, when the parameter α increases, the IFV value of IPFGWC tends to decrease.

The final conclusion of this part is: the clustering quality of the IPFGWC algorithm outperforms that of CFGWC, FGWC and FCM.

4.3 The running time comparison

In this part, we measure the speed of these algorithms with respect to C (Table 3) and α (Table 4).

Obviously, the IPFGWC method runs slower than the other ones. Table 3 shows that the running time of IPFGWC is on average 4.33 times larger than that of CFGWC. The corresponding factors for FGWC and FCM are 4.05 and 3.12, respectively. Because of the modifications in the objective function, IPFGWC has to undertake some extra calculations, which cause the slow running time shown in this table. Table 3 also reconfirms the experimental results in the literature (Son et al., 2011), where the descending order of running time is FCM, FGWC and CFGWC.

Another finding extracted from Table 3 is the average increment of the running time of these algorithms per added cluster. When a cluster is added, the running time of IPFGWC increases by 0.132 s. The corresponding increments for CFGWC, FGWC and FCM are 0.03, 0.032 and 0.043, respectively. This finding helps us to estimate the running time of an algorithm when more clusters are requested.
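As a rough illustration of such an estimate, the sketch below extrapolates linearly from a measured baseline using the per-cluster increments quoted above. The linear model, the function name `estimate_time` and the baseline of 0.5 s at C = 3 are assumptions made for the example and are not measurements from the paper.

```python
# Average increase in running time (seconds) per added cluster, as quoted above.
INCREMENT_PER_CLUSTER = {"IPFGWC": 0.132, "CFGWC": 0.03, "FGWC": 0.032, "FCM": 0.043}


def estimate_time(algorithm, base_time, base_clusters, target_clusters):
    """Linearly extrapolate running time from a measured baseline."""
    step = INCREMENT_PER_CLUSTER[algorithm]
    return base_time + step * (target_clusters - base_clusters)


# Hypothetical baseline: 0.5 s measured at C = 3, extrapolated to C = 8.
estimate = estimate_time("IPFGWC", 0.5, 3, 8)
print(round(estimate, 3))
```

Such an extrapolation is only as good as the assumption that the average increment stays constant outside the range of C covered by the experiments.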

In Table 4, the running time of IPFGWC is 3.42 and 3.06 times larger than that of CFGWC and FGWC, respectively. These numbers are smaller than the ones in Table 3. Indeed, the number of clusters C contributes to the increase in running time of these algorithms more than the parameter α does.

Similar to Table 3, the average increment of IPFGWC, CFGWC and FGWC per α step of 0.1 is 0.22, 0.06 and 0.08, respectively. These increments are roughly double the ones in Table 3. Thus, an increment of the parameter α often leads to higher values of the average increment of these algorithms than an increment of the number of clusters C does.

Fig. 11 shows the number of iteration steps of the four algorithms corresponding to the results in Table 3.

The final conclusion of this part is: the IPFGWC algorithm runs slower than CFGWC, FGWC and FCM due to extra calculations. However, at roughly three to four times the running time of the other algorithms (Tables 3 and 4), it remains acceptable.

4.4 The changes of objective function of IPFGWC

Finally, we investigate the changes in the objective function's values of the proposed method with respect to its parameters (Table 5).

Fig 6 IFV values when m = 1.5.

Fig 7 IFV values when m = 2.5.

Table 2

IFV values by α and C.

Fig 8 IFV values when α = 0.4.
