A novel intuitionistic fuzzy clustering method for geo-demographic analysis
Le Hoang Son a,⇑, Bui Cong Cuong b, Pier Luca Lanzi c, Nguyen Tho Thong a
a Hanoi University of Science, Vietnam National University, 334 Nguyen Trai, Thanh Xuan, Ha Noi, Viet Nam
b Institute of Mathematics, Vietnamese Academy of Science and Technology, 18 Hoang Quoc Viet, Cau Giay, Ha Noi, Viet Nam
c Department of Electronics and Information, Politecnico di Milano, Piazza Leonardo da Vinci 32, I-20133, Italy
Keywords:
Geo-demographic analysis
Geographic information systems
Intuitionistic fuzzy sets
Policies making
Possibilistic fuzzy C-means
Abstract
Geo-Demographic Analysis (GDA) is an important tool to explore the underlying rules that regulate our world and, therefore, it has been widely applied to the development of effective socio-economic policies through the analysis of data generated from Geographic Information Systems (GIS). In GDA applications, clustering plays a major role; however, the current state-of-the-art algorithms, namely the Fuzzy Geographically Weighted Clustering (FGWC), have demonstrated several limitations both in terms of speed and in terms of quality of the achieved results. Accordingly, in this paper, we propose a novel clustering algorithm for GDA applications, based on recent results regarding intuitionistic fuzzy sets and the possibilistic fuzzy C-means, that aims at overcoming some of the limitations of the existing methods.
© 2012 Elsevier Ltd. All rights reserved.
1. Introduction
Recently, there has been great interest in Geo-Demographic Analysis (GDA) and its application to many branches of society; for instance, GDA is usually applied to study the geographical distribution of specific spending groups so as to predict future spending trends. Accordingly, GDA plays an essential role in planning distribution and in managing products, services, capital, etc.
GDA combines Geographical Information Systems (GIS) and Data Mining algorithms. It is usually defined as "the analysis of spatially referenced geo-demographic and lifestyle data" (Sleight, 1993) and is widely used in the public and private sectors for the planning and provision of products and services. There are two main underlying assumptions in GDA (Palmer, 2008): firstly, that two people who live in the same area are more likely to have similar characteristics than two people selected at random; secondly, that two areas can be characterized in terms of their population, using demographics and other measures. Based on these two principles, clustering is applied to group geo-demographic data into meaningful clusters that capture existing regularities (relevant geo-demographic profiles), thus making the data more manageable for the target analysis (Mason & Jacobson, 2007). Notwithstanding the several clustering methods currently available, the existing state of the art in this specific area, namely the Fuzzy Geographically Weighted Clustering (FGWC) (Mason & Jacobson, 2007), is limited both in terms of speed and in terms of achieved clustering quality (Son, Lanzi, Cuong, & Hung, 2011). Accordingly, in (Son et al., 2011) we presented an approach to improve the speed of FGWC and provided experimental results showing that our approach could perform better than FGWC.
In this paper, we extend our previous work (Son et al., 2011) and present a novel clustering algorithm for GDA, called Intuitionistic Possibilistic Fuzzy Geographically Weighted Clustering (IPFGWC), that tries to cope with some of the limitations of the FGWC algorithm by integrating elements of Intuitionistic Fuzzy Sets (IFS) (Atanassov, 1986), Possibilistic Fuzzy C-Means (PFCM) (Pal, Pal, Keller, & Bezdek, 2005) and Generalized IFS (GIFS) (Liu, 2010). The rest of the paper is structured as follows. Section 2 introduces some related works for the GDA problem. Section 3 describes our novel algorithm, while Section 4 validates the proposed approach through a set of experiments involving real-world data. Finally, Section 5 draws the conclusions and delineates future research directions.
2. Related works
In this section, we provide a brief overview of the works in GDA that are most relevant to this work. Walford (2011) described a method using Principal Component Analysis (PCA) to study the spatial distribution of the 1991 census data scores. The distribution of these scores helps to determine whether the underlying demographic and socio-economic phenomena are geographically concentrated or dispersed. Loureiro et al. (2006) applied self-organizing maps (SOM) (Kohonen, 2001) to GDA. Self-organizing maps are unsupervised neural networks that have been used as a visualization tool for high-dimensional data. When a SOM is trained with a given dataset, its units tend to spread themselves in the input space following some density function of the data patterns. Based on the variations in edge length along a path between two units on the SOM, the authors presented a new way of calculating the fuzzy memberships of a fuzzy clustering method. The results of the proposed method are adequate for identifying the main clusters. In general, this method gives better results than other ones due to the use of SOM in the initialization phase. However, it requires a lot of memory to store all neurons and weights. Furthermore, the training phase is quite slow.
Expert Systems with Applications. Journal homepage: www.elsevier.com/locate/eswa
⇑ Corresponding author. Tel.: +84 904171284; fax: +84 0438623938. E-mail address: sonlh@vnu.edu.vn (L.H. Son).
Clustering algorithms have been widely used in GDA (Tryon & Bailey, 1970). Day, Pearce, and Dorling (2008) used agglomerative hierarchical clustering with the Euclidean distance measure as a means to classify 190 countries into groups that are homogeneous in terms of mortality rates. The aim of this study was to identify clusters of nations grouped by health outcomes in order to provide sensible groupings for international comparisons. Twelve clusters of countries were identified, with the average life expectancy of each ranging from 81.5 years (cluster 1) to 37.7 years (cluster 12). However, the interpretation of the hierarchy is complex and often confusing. Besides, the deterministic nature of the technique prevents re-evaluation after points are grouped into a node. Hence, k-means clustering is often preferred to this algorithm. Lee et al. (1999) applied k-means clustering to classify the population and housing census data of Korea into 26 clusters with 3752 basic administrative units, separately for 1990 and 1995. Then, using the hierarchical method, they derived six "Grand Clusters" from the profiles of these 26 clusters. The result of this research is the distribution of the six "Grand Clusters" during the period 1990–1995. Another example of using k-means for GDA is from Gibbs et al. (2010). In this work, the authors divided all the primary schools in London in 2007 into groups or clusters that have similar ethnic and socio-economic characteristics. This process helps parents find schools for their children easily by marking the suitable variables on the map, e.g. "at least 80% pupils are White British" and "Eligible for Free School Meals". A recent result of Petersen et al. (2011) has confirmed again the use of k-means clustering for this problem. The authors presented the London Output Area Classification (LOAC) which, in essence, is a geo-demographic system. The core of this system is k-means clustering, used for mapping health care needs and other health indicators that are useful for targeting neighborhoods in public health campaigns. LOAC was also compared to six other geo-demographic systems from both governmental and commercial sources.
Basically, the k-means algorithm can be used for GDA with acceptable results. However, it has some limitations that should be considered carefully. Firstly, it is very sensitive to outliers. Thus, in some studies such as (Lee et al., 1999), the authors had to run k-means clustering manually for a rather large number of clusters and then identify the outliers, which is time-consuming. Secondly, the algorithm starts from random initial positions; a poor initialization can lead to results that are not optimal. In fact, assigning a geographical area to a single group is not always the best choice, because this assumption leads to the ecological fallacy. An ecological fallacy is a logical fallacy in the interpretation of statistical data in an ecological study, whereby inferences about the nature of specific individuals are based solely upon aggregate statistics collected for the group to which those individuals belong. In other words, statistics that accurately describe group characteristics do not necessarily apply to individuals within that group. For example, if a group of people is measured to have a lower average IQ than the general population, it is an error to assume that all members of that group have a lower IQ than the general population. Accordingly, fuzzy clustering is often preferred. Fuzzy clustering assigns membership values to each area instead of assigning the area to a single group, thus helping to overcome the ecological fallacy. The fuzzy clustering algorithm typically used in GDA is the fuzzy C-means clustering algorithm of Bezdek, known as FCM (Bezdek & Ehrlich, 1984; Kannan, Ramathilagam, & Chung, 2012; Küçükdeniz, Baray, Esnaf, & Ecerkale, 2012). However, FCM does not take geographical factors into account. For example, it assumes that residential area type 27 is the same and behaves in the same way wherever it happens to be located. But what happens if type 27 areas respond differently depending on their map locations? To overcome this shortcoming, Feng and Flowerdew (1998) proposed an extension to the fuzzy clustering technique, which provides for the ex post facto adjustment of the cluster membership values based on "Neighbourhood Effects" (NE). The neighbourhood effects incorporate geography into the model by adjusting the cluster membership as shown in the following equation,
$$u'_k = \alpha\, u_k + \beta\, \frac{1}{A} \sum_{j=1}^{c} w_{kj}\, u_j, \tag{1}$$
where $u'_k$ is the new cluster membership of area k and $u_k$ is the old cluster membership of this area. The two parameters $\alpha$ and $\beta$ are scaling variables which affect the proportion of the original membership and satisfy the condition
$$\alpha + \beta = 1. \tag{2}$$
A is a factor to scale the "sum" term to the range [0, 1]. The weighted membership is calculated as follows,
$$w_{kj} = \frac{p_{kj}^{\,b}}{d_{kj}^{\,a}}, \tag{3}$$
where $p_{kj}$ is the length of the common boundary between k and j; $d_{kj}$ is the distance between k and j; a and b are user-defined parameters. However, Feng and Flowerdew's neighbourhood effects have some limitations: firstly, they ignore the effects of areas that have no common boundaries; secondly, they exclude the effects of population, a key geo-demographic consideration. To overcome these limitations, Mason and Jacobson (2007) introduced a modified cluster membership adjustment which incorporates a spatial interaction effect model. The new membership function is inspired by the principles of geographical spatial interaction (Birkin & Clarke, 1991) and incorporates a basic spatial interaction model into the weighting of the memberships as well as the typical neighbourhood effects of fuzzy clustering algorithms. This makes cluster centers "geographically aware". In (Mason & Jacobson, 2007), the authors introduced the Fuzzy Geographically Weighted Clustering (FGWC), which calculates the influence of one area upon another as the product of the populations of the two areas. A distance decay effect is implemented in the denominator. This effect enters through the weighting factor described in the equation below,
$$w_{kj} = \frac{(pop_k \cdot pop_j)^{b}}{d_{kj}^{\,a}}, \tag{4}$$
where $pop_k$ and $pop_j$ are the populations of areas k and j respectively; $d_{kj}$ is the distance between k and j; a and b are user-defined parameters, while A is a factor to scale the "sum" term. A is calculated across all clusters, ensuring that the sum of the memberships of a given area over all clusters is equal to one.
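The geographic adjustment of Eqs. (1), (2) and (4) can be illustrated with a small sketch. It is not the authors' implementation: the function names `fgwc_weights` and `adjust_memberships` are hypothetical, the sum in Eq. (1) is taken over the other areas j, the self-weight is assumed to be zero, and A is assumed to be the per-area total of the "sum" term over all clusters so that memberships keep summing to one.

```python
def fgwc_weights(pop, dist, a=1.0, b=1.0):
    """Eq. (4): w_kj = (pop_k * pop_j)**b / d_kj**a, with w_kk = 0 (assumption)."""
    n = len(pop)
    w = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            if k != j:
                w[k][j] = (pop[k] * pop[j]) ** b / dist[k][j] ** a
    return w


def adjust_memberships(u, w, alpha=0.5, beta=0.5):
    """Eq. (1): u'_k = alpha * u_k + beta * (1/A) * sum_j w_kj * u_j.

    u is a list of rows, one row of cluster memberships per area.
    """
    if abs(alpha + beta - 1.0) > 1e-12:  # Eq. (2)
        raise ValueError("alpha + beta must equal 1")
    n, c = len(u), len(u[0])
    adjusted = []
    for k in range(n):
        # Neighbourhood "sum" term of area k for every cluster i.
        sums = [sum(w[k][j] * u[j][i] for j in range(n)) for i in range(c)]
        # A rescales the sum term so that it adds up to one over the clusters;
        # an isolated area (all weights zero) keeps A = 1 (assumption).
        A = sum(sums) or 1.0
        adjusted.append([alpha * u[k][i] + beta * sums[i] / A for i in range(c)])
    return adjusted
```

With three areas of populations 100, 200 and 50, the weight between the first two areas at unit distance is 20000, and every adjusted membership row still sums to one as long as the area has at least one neighbour with non-zero weight.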
FGWC is considered the state of the art for GDA problems. However, its speed and the quality of the clustering achieved can be improved (Son et al., 2011). In particular, while in (Son et al., 2011) we showed an approach to significantly speed up FGWC, in this paper we instead focus on the quality of the clusters produced. Based on the results of Intuitionistic Fuzzy Sets (IFS) (Atanassov, 1986), Possibilistic Fuzzy C-Means (PFCM) (Pal et al., 2005) and Generalized IFS (GIFS) (Liu, 2010), we propose an integrated approach to produce clusters of higher quality. The approach has been validated empirically.
3. The proposed algorithm
3.1. Intuitionistic fuzzy sets
Fuzzy sets were introduced by Zadeh (1965), and since then this concept has been applied to various algebraic structures. Basically, a fuzzy set is defined as follows.
Definition 1. A Fuzzy Set (FS) $\mu$ (Zadeh, 1965) in a non-empty set X is a function
$$\mu : X \to [0, 1], \qquad x \mapsto \mu(x), \tag{5}$$
where $\mu(x)$ is the membership degree of each element $x \in X$. A fuzzy set can alternatively be defined as
$$A = \{\langle x, \mu_A(x) \rangle \mid x \in X\}. \tag{6}$$
As practical problems have developed, this notion has proved too restrictive. Clear evidence of this can be drawn from the electoral problem (Atanassov, 2003). Suppose that we have a list of all countries (n) with elective governments. Normally, if we assume p is the number of countries whose electors vote for the corresponding government, then the number of countries whose electors do not vote for it is n − p. However, it is possible that some votes are given to parties or persons outside the government. This situation cannot be modelled by traditional fuzzy sets.
Motivated by the concept of "intuitionism" of L. Brouwer, who invited mathematicians to abandon Aristotle's law of the excluded middle, Atanassov (1986) presented the idea of intuitionistic fuzzy sets. That is to say, if we have a proposition A, we can state that A is true, that A is false, or that we do not know whether A is true or false. This idea is expressed by the definition below.
Definition 2. An Intuitionistic Fuzzy Set (IFS) (Atanassov, 1986) in a non-empty set X is
$$\tilde{A} = \{\langle x, \mu_{\tilde{A}}(x), \gamma_{\tilde{A}}(x) \rangle \mid x \in X\}, \tag{7}$$
where $\mu_{\tilde{A}}(x)$ is the membership degree of each element $x \in X$ and $\gamma_{\tilde{A}}(x)$ is the non-membership degree,
$$\mu_{\tilde{A}}(x),\ \gamma_{\tilde{A}}(x) \in [0, 1], \quad \forall x \in X, \tag{8}$$
$$0 \le \mu_{\tilde{A}}(x) + \gamma_{\tilde{A}}(x) \le 1, \quad \forall x \in X. \tag{9}$$
The intuitionistic fuzzy index of an element, which expresses the non-determinacy, is denoted as
$$\Pi_{\tilde{A}}(x) = 1 - \left(\mu_{\tilde{A}}(x) + \gamma_{\tilde{A}}(x)\right), \quad \forall x \in X. \tag{10}$$
When $\Pi_{\tilde{A}}(x) = 0$, the IFS reduces to the FS of Zadeh.
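As a small illustration of Definition 2, the sketch below checks conditions (8) and (9) for one element and returns its intuitionistic fuzzy index. The function name `hesitation` and the numerical tolerance are assumptions made for the example only.

```python
def hesitation(mu, gamma, tol=1e-12):
    """Return the intuitionistic fuzzy index 1 - mu - gamma of one element.

    Raises ValueError when (mu, gamma) violates conditions (8) or (9).
    """
    if not (0.0 <= mu <= 1.0 and 0.0 <= gamma <= 1.0):
        raise ValueError("mu and gamma must lie in [0, 1]")      # condition (8)
    if mu + gamma > 1.0 + tol:
        raise ValueError("mu + gamma must not exceed 1")          # condition (9)
    return 1.0 - mu - gamma


# An element that is 60% "in", 30% "out" and 10% undetermined.
index = hesitation(0.6, 0.3)
```

When the index is zero, as for the pair (0.7, 0.3), the element behaves like a member of an ordinary fuzzy set.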
Since this notion was introduced, several applications of IFS have been presented. For instance, Sen and Saha (1986) defined the concept of a Γ-semigroup and established a relation between regular Γ-semigroups and Γ-groups. Kim and Jun (2001) investigated some properties of intuitionistic fuzzy sets in a semigroup. Uckun, Ozturk, and Jun (2007) introduced the notion of an intuitionistic fuzzy Γ-ideal of a Γ-semigroup and investigated some properties connected with intuitionistic fuzzy Γ-ideals in a Γ-semigroup. Similarly, Gau and Buehrer (1993) introduced Vague Sets. The authors noted that the drawback of using a single membership value in fuzzy set theory is that the evidence for $x \in X$ and the evidence against $x \in X$ are in fact mixed together. Therefore, instead of the point-based membership of FS, an interval-based membership is used (Lu & Ng, 2005).
Definition 3. A Vague Set (VS) (Gau & Buehrer, 1993) in a universe of discourse X is
$$V = \sum_{i=1}^{n} \left[\alpha_V(x_i),\ 1 - \beta_V(x_i)\right] / x_i, \quad x_i \in X, \tag{11}$$
$$0 \le \alpha_V(x_i) \le 1 - \beta_V(x_i) \le 1, \quad 1 \le i \le n, \tag{12}$$
where $\alpha_V(x_i)$ is a lower bound on the grade of membership of $x_i$ derived from the evidence for $x_i$, and $\beta_V(x_i)$ is a lower bound on the grade of membership of the negation of $x_i$ derived from the evidence against $x_i$. The grade of membership of $x_i$ is bounded to a subinterval $[\alpha_V(x_i), 1 - \beta_V(x_i)] \subseteq [0, 1]$. When $\alpha_V(x_i) = 1 - \beta_V(x_i)$, the interval collapses to a point and the VS becomes the FS of Zadeh. VS has been shown to be equivalent to IFS with the intuitionistic fuzzy index $1 - \alpha_V(x_i) - \beta_V(x_i)$.
However, sometimes we do not know the exact membership degree and non-membership degree of an element but only their value ranges. To cope with this issue, Atanassov and Gargov (1989) extended IFS and introduced the concept of the Interval-Valued Intuitionistic Fuzzy Set (IVIFS).
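Definition 4. An Interval-Valued Intuitionistic Fuzzy Set (IVIFS) (Atanassov & Gargov, 1989) in a non-empty set X is
$$\tilde{A}_{IV} = \{\langle x, \mu_{\tilde{A}_{IV}}(x), \gamma_{\tilde{A}_{IV}}(x) \rangle \mid x \in X\}, \tag{13}$$
where
$$\mu_{\tilde{A}_{IV}}(x) = \left[\mu^{L}_{\tilde{A}_{IV}}(x),\ \mu^{U}_{\tilde{A}_{IV}}(x)\right] \subseteq [0, 1], \tag{14}$$
$$\mu^{L}_{\tilde{A}_{IV}}(x) = \inf \mu_{\tilde{A}_{IV}}(x), \quad \mu^{U}_{\tilde{A}_{IV}}(x) = \sup \mu_{\tilde{A}_{IV}}(x), \quad \forall x \in X, \tag{15}$$
$$\gamma_{\tilde{A}_{IV}}(x) = \left[\gamma^{L}_{\tilde{A}_{IV}}(x),\ \gamma^{U}_{\tilde{A}_{IV}}(x)\right] \subseteq [0, 1], \tag{16}$$
$$\gamma^{L}_{\tilde{A}_{IV}}(x) = \inf \gamma_{\tilde{A}_{IV}}(x), \quad \gamma^{U}_{\tilde{A}_{IV}}(x) = \sup \gamma_{\tilde{A}_{IV}}(x), \quad \forall x \in X. \tag{17}$$
A constraint is set up for (13),
$$0 \le \mu_{\tilde{A}_{IV}}(x) + \gamma_{\tilde{A}_{IV}}(x) \le 1. \tag{18}$$
The intuitionistic fuzzy index is
$$\Pi_{\tilde{A}_{IV}}(x) = \left[\Pi^{L}_{\tilde{A}_{IV}}(x),\ \Pi^{U}_{\tilde{A}_{IV}}(x)\right], \tag{19}$$
$$\Pi^{L}_{\tilde{A}_{IV}}(x) = 1 - \mu^{U}_{\tilde{A}_{IV}}(x) - \gamma^{U}_{\tilde{A}_{IV}}(x), \quad \Pi^{U}_{\tilde{A}_{IV}}(x) = 1 - \mu^{L}_{\tilde{A}_{IV}}(x) - \gamma^{L}_{\tilde{A}_{IV}}(x). \tag{20}$$
Note that, when $\mu_{\tilde{A}_{IV}}(x) = \mu^{L}_{\tilde{A}_{IV}}(x) = \mu^{U}_{\tilde{A}_{IV}}(x)$ and $\gamma_{\tilde{A}_{IV}}(x) = \gamma^{L}_{\tilde{A}_{IV}}(x) = \gamma^{U}_{\tilde{A}_{IV}}(x)$, the IVIFS reduces to IFS.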
Mondal and Samanta (2002) presented a generalized version of IFS in which the condition (9) is extended by a fuzzy operator.
Definition 5. A Mondal and Samanta's Generalized Intuitionistic Fuzzy Set (MS-GIFS) (2002) in a non-empty set X is
$$\tilde{A}_{MS} = \{\langle x, \mu_{\tilde{A}_{MS}}(x), \gamma_{\tilde{A}_{MS}}(x) \rangle \mid x \in X\}. \tag{21}$$
The meanings of the variables are similar to those in Def. 2. Nevertheless, the condition (9) is slightly changed, as shown below,
$$0 \le \mu_{\tilde{A}}(x) \wedge \gamma_{\tilde{A}}(x) \le 0.5, \quad \forall x \in X. \tag{22}$$
Recently, Liu (2010) has extended the Mondal and Samanta model by attaching an extensional index, the so-called L-value. This index is incorporated into the condition (22) above.
Definition 6. A Liu's Generalized Intuitionistic Fuzzy Set (L-GIFS) (Liu, 2010) in a non-empty set X is
$$\tilde{A}_{L} = \{\langle x, \mu_{\tilde{A}_{L}}(x), \gamma_{\tilde{A}_{L}}(x) \rangle \mid x \in X\}. \tag{23}$$
The condition (22) is replaced by
$$0 \le \mu_{\tilde{A}}(x) + \gamma_{\tilde{A}}(x) \le 1 + L, \quad \forall x \in X. \tag{24}$$
When L = 0, the Liu fuzzy sets reduce to IFS.
Liu has also proved that for any universe of discourse X
$$FS(X) \subset IFS(X) \subset MS\text{-}GIFS(X) \subset L\text{-}GIFS(X), \tag{25}$$
$$L\text{-}GIFS(X) \not\subset MS\text{-}GIFS(X) \not\subset IFS(X) \not\subset FS(X). \tag{26}$$
Notice that the L-value used in (25) and (26) is one. Besides, from Def. 3, we also confirm that VS is equivalent to IFS.
Some algebraic operations, fuzzy relations and fuzzy topology for these sets have been discussed intensively in the corresponding literature.
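The conditions (9), (22) and (24) can be compared side by side on a single pair of degrees. The sketch below returns the narrowest class whose condition holds; the function name `narrowest_class` is hypothetical, the operator ∧ in (22) is read as the minimum, and a pair is treated as an ordinary fuzzy set when the two degrees sum to one (zero hesitation).

```python
def narrowest_class(mu, gamma, L=1.0, tol=1e-12):
    """Classify one (membership, non-membership) pair.

    Returns "FS", "IFS", "MS-GIFS", "L-GIFS", or None when no condition holds.
    """
    if not (0.0 <= mu <= 1.0 and 0.0 <= gamma <= 1.0):
        raise ValueError("mu and gamma must lie in [0, 1]")
    total = mu + gamma
    if abs(total - 1.0) <= tol:          # zero hesitation: ordinary fuzzy set
        return "FS"
    if total <= 1.0 + tol:               # condition (9)
        return "IFS"
    if min(mu, gamma) <= 0.5 + tol:      # condition (22), with the wedge read as min
        return "MS-GIFS"
    if total <= 1.0 + L + tol:           # condition (24)
        return "L-GIFS"
    return None
```

With L = 1 every pair in the unit square satisfies (24), which matches the remark that the inclusion chain (25) is stated for an L-value of one; with a smaller L, a pair such as (0.8, 0.8) falls outside all four classes.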
As we mentioned in the previous section, fuzzy clustering is the method of choice to tackle GDA problems. However, the usual Fuzzy C-Means (FCM) and its variant FGWC rely on the basic principles of the fuzzy sets of Zadeh (1965). This kind of fuzzy set has been shown to have some limitations for current practical problems. To fulfil the objective of this paper, we therefore need to develop a clustering algorithm on some extension of Zadeh's fuzzy sets. Thus, in the next parts, we pay particular attention to the Intuitionistic Fuzzy Set (IFS) and its latest version, L-GIFS. The proposed clustering algorithm will first be constructed on IFS and then extended to L-GIFS.
Example 1. By Matlab simulation, we illustrate these fuzzy sets in Figs. 1–4.
3.2. Status quo of clustering algorithms for IFS
Clustering algorithms for IFS have been studied intensively in recent times. By providing a useful way to describe fuzzy sets, these kinds of algorithms often enhance the understanding of the internal structure of data.
Pelekis, Iakovidis, Kotsifakos, and Kopanakis (2008) argued that the distances determining the membership of a feature vector to a cluster are also subject to uncertainty. Current fuzzy clustering approaches do not utilize any information about uncertainty at the constitutional feature level. The most typical one, the FCM algorithm, tries to partition the dataset by just looking at the feature vectors, and as such it ignores the fact that these vectors may be accompanied by qualitative information which may be given per feature. For example, following the idea of intuitionistic fuzzy set theory, a data point $X_k$ is not just a p-dimensional vector $(X_{k1}, X_{k2}, \ldots, X_{kp})$ of quantitative information; instead, it is a p-dimensional vector of triplets $[(X_{k1}, \mu_{k1}, \gamma_{k1}), (X_{k2}, \mu_{k2}, \gamma_{k2}), \ldots, (X_{kp}, \mu_{kp}, \gamma_{kp})]$ where, for each $X_{ki}$, there exists qualitative information provided via the intuitionistic membership $\mu_{ki}$ and non-membership $\gamma_{ki}$ of the current data point to the feature i.
In this situation, the traditional distance function of the FCM algorithm is not suitable because it operates only on the feature vectors and not on the qualitative information which may be given per feature. For this reason, Pelekis et al. (2008) presented a modified version of the FCM algorithm for IFS with a new intuitionistic fuzzy set distance metric,
$$D(A, B) = \frac{d(\mu_A, \mu_B) + d(\gamma_A, \gamma_B)}{2}, \tag{28}$$
where
$$d(A', B') = \begin{cases} \dfrac{\sum_{i=1}^{N} \min\left(A'(x_i), B'(x_i)\right)}{\sum_{i=1}^{N} \max\left(A'(x_i), B'(x_i)\right)}, & A' \cup B' \ne \emptyset, \\[2ex] 1, & A' \cup B' = \emptyset. \end{cases} \tag{29}$$
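Equations (28) and (29) translate directly into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, each fuzzy set is passed as a list of degrees over the same N elements, and the case of an empty union in (29) is detected as "all degrees are zero".

```python
def fuzzy_similarity(a, b):
    """Eq. (29): ratio of summed minima to summed maxima; 1 if both sets are empty."""
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        return 1.0
    return sum(min(x, y) for x, y in zip(a, b)) / denominator


def ifs_similarity(mu_a, gamma_a, mu_b, gamma_b):
    """Eq. (28): average of the membership and non-membership terms."""
    return (fuzzy_similarity(mu_a, mu_b) + fuzzy_similarity(gamma_a, gamma_b)) / 2.0
```

Two identical intuitionistic fuzzy sets give the value 1, so the quantity in (28) grows with resemblance; this is consistent with the later remark that the association coefficient of Eq. (30) is similar to the similarity degree of Pelekis et al.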
Zhang, Xu, and Chen (2007) defined the concept of the intuitionistic fuzzy similarity matrix (IFSM) and constructed an intuitionistic fuzzy equivalence matrix (IFEM). They then transformed the IFSM into the IFEM and clustered the dataset based on the λ-cutting matrix of the interval-valued matrix. However, the whole process is performed on the IVIFS (Def. 4) converted from the original IFS. Consequently, it requires a lot of computation and loses too much information in the process of calculating the intuitionistic fuzzy similarity degree (IFSD).
Fig. 2. The IFS sets.
To overcome this limitation, Xu, Chen, and Wu (2008) proposed an algorithm that uses the derived association coefficients (AC) of IFS to construct an association matrix (AM), and utilizes the transitivity principle of the equivalent matrix (Zhang, 1983) to transform it into an equivalent association matrix (EAM). Then, the EAM is used to construct the λ-cutting matrix, on which the classification is based.
$$C(A, B) = \frac{\sum_{i=1}^{N} w_i \left[\mu_A(x_i)\,\mu_B(x_i) + \gamma_A(x_i)\,\gamma_B(x_i) + \Pi_A(x_i)\,\Pi_B(x_i)\right]}{\max\left\{\sum_{i=1}^{N} w_i \left[\mu_A^2(x_i) + \gamma_A^2(x_i) + \Pi_A^2(x_i)\right],\ \sum_{i=1}^{N} w_i \left[\mu_B^2(x_i) + \gamma_B^2(x_i) + \Pi_B^2(x_i)\right]\right\}}. \tag{30}$$
The formula for the AC in Eq. (30) is somewhat similar to the similarity degree of (Pelekis et al., 2008).
Xu (2009) also introduced an intuitionistic fuzzy hierarchical algorithm for clustering IFS, which is based on the traditional hierarchical clustering procedure and the intuitionistic fuzzy aggregation operator.
However, there are some limitations in this series of works by Xu et al. Firstly, the number of clusters is not set beforehand, so these algorithms cannot partition the dataset into a pre-defined number of clusters. Secondly, there is no way to determine a suitable value of the parameter λ, although the number of clusters depends on the specific value of this parameter. Thirdly, these algorithms cannot provide information about the membership degrees of the objects to each cluster.
To remove these obstacles, Xu and Wu (2010) developed another intuitionistic fuzzy C-means algorithm to cluster IFS, which is based on the well-known fuzzy C-means clustering method and the basic distance measures between IFS (Szmidt & Kacprzyk, 2000; Xu, 2007),
$$d_1(A, B) = \sqrt{\frac{1}{2} \sum_{i=1}^{N} w_i \left[\left(\mu_A(x_i) - \mu_B(x_i)\right)^2 + \left(\gamma_A(x_i) - \gamma_B(x_i)\right)^2 + \left(\Pi_A(x_i) - \Pi_B(x_i)\right)^2\right]}. \tag{31}$$
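The weighted distance of Eq. (31) can be sketched as follows. The function name `ifs_distance` is hypothetical, each IFS is passed as a list of (membership, non-membership) pairs, the index Π is computed as 1 − μ − γ as in Eq. (10), and equal weights 1/N are assumed when none are given.

```python
import math


def ifs_distance(A, B, weights=None):
    """Eq. (31): weighted Euclidean-type distance between two IFS."""
    if len(A) != len(B):
        raise ValueError("A and B must describe the same elements")
    n = len(A)
    if weights is None:
        weights = [1.0 / n] * n          # assumption: equal weights summing to one
    total = 0.0
    for (mu_a, gamma_a), (mu_b, gamma_b), w in zip(A, B, weights):
        pi_a = 1.0 - mu_a - gamma_a      # intuitionistic fuzzy index of A
        pi_b = 1.0 - mu_b - gamma_b      # intuitionistic fuzzy index of B
        total += w * ((mu_a - mu_b) ** 2
                      + (gamma_a - gamma_b) ** 2
                      + (pi_a - pi_b) ** 2)
    return math.sqrt(0.5 * total)
```

The factor 1/2 keeps the distance in [0, 1]: the two crisp opposites (1, 0) and (0, 1) are at distance one, and identical sets are at distance zero.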
The two modified FCM algorithms for IFS in (Pelekis et al., 2008; Xu & Wu, 2010) differ not only in the distance metric but also in the way the cluster centers are defined. In (Pelekis et al., 2008), the membership (non-membership) degree of a center is calculated as the average of those of its data points, specified through the membership matrix, whereas in (Xu & Wu, 2010) it is calculated through a combination of membership values and data points, thus giving a better representation of the cluster centers. In general, Xu and Wu's algorithm performs more stably and effectively than the other clustering methods on IFS.
3.3. A new fuzzy clustering method for intuitionistic fuzzy sets
From the previous section, a natural question arises: "Is the Xu and Wu method (2010) enough for the GDA problem?" To a certain extent, it is acceptable. However, due to the limitations below, we cannot rely on this method entirely and have to devise a new one.
• The Xu and Wu method uses FCM as the skeleton of the main algorithm. As pointed out by Pal et al. (2005), FCM is very sensitive to outliers. Suppose the number of clusters is two. If an outlier is equidistant from the two prototypes, then its membership in each cluster will be the same regardless of the absolute distance of this point from the two centers and from the other points in the data. In such cases, the membership values should differ according to the distance criteria.
• The membership values in both methods (Pelekis et al., 2008; Xu & Wu, 2010) are still crisp values. In other words, they correspond to the traditional fuzzy set (FS) of Zadeh. In some specific contexts, we should extend this set to an intuitionistic one.
• Most geo-demographic data treat their features equally. In other words, the membership values of an element to its features are often one. Hence, we should reduce the pattern and center sets to FS instead of IFS.
From those observations, we should investigate a new fuzzy clustering algorithm for IFS. Pal et al. (2005) presented an effective fuzzy clustering method called Possibilistic Fuzzy C-Means (PFCM). PFCM was shown to solve the noise sensitivity defect of FCM, overcome the coincident clusters problem of Possibilistic C-Means (PCM) and eliminate the row sum constraints of Fuzzy Possibilistic C-Means (FPCM). Therefore, it is suitable for the proposed algorithm.
Fig. 3. The MS-GIFS sets.
Fig. 4. The L-GIFS sets.
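Now, we state the objective function of the considered problem (Intuitionistic Possibilistic Fuzzy C-Means, IPFCM),
$$J_{m,\eta,\tau}(U, T, H, V; X) = \sum_{k=1}^{N} \sum_{j=1}^{C} \left(a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right) \|X_k - V_j\|^2 + \sum_{j=1}^{C} \gamma_j \sum_{k=1}^{N} (1 - t_{kj})^{\eta} \to \min. \tag{32}$$
In (32), we have to partition the pattern set X into C clusters. Each pattern has a membership value U, a hesitation level H and a typicality value T. The constraints for this problem are
$$\sum_{j=1}^{C} u_{kj} = 1, \qquad \sum_{j=1}^{C} h_{kj} = 1, \qquad a_i > 0,\ i = 1, \ldots, 3, \qquad \gamma_j,\ j = 1, \ldots, C \text{ are constants}, \qquad u_{kj}, t_{kj}, h_{kj} \in [0, 1]. \tag{33–35}$$
Some special cases can be seen as follows,
$$\begin{aligned} & a_3 = 0: && \text{IPFCM} \to \text{PFCM}, \\ & a_1 = 0 \wedge a_3 = 0: && \text{IPFCM} \to \text{PCM}, \\ & a_3 = 0 \wedge a_2 = 0 \wedge \gamma_j = 0,\ \forall j = 1, \ldots, C: && \text{IPFCM} \to \text{FCM}, \\ & a_3 = 0 \wedge a_1 = a_2 = 1 \wedge \gamma_j = 0,\ \forall j = 1, \ldots, C: && \text{IPFCM} \to \text{FPCM}. \end{aligned} \tag{36}$$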
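Theorem 1. The optimal solutions of the system (32)–(35) are
$$u_{kj} = \frac{1}{\sum_{i=1}^{C} \left(\dfrac{\|X_k - V_j\|}{\|X_k - V_i\|}\right)^{\frac{2}{m-1}}}, \quad k = 1, \ldots, N,\ j = 1, \ldots, C, \tag{37}$$
$$h_{kj} = \frac{1}{\sum_{i=1}^{C} \left(\dfrac{\|X_k - V_j\|}{\|X_k - V_i\|}\right)^{\frac{2}{\tau-1}}}, \quad k = 1, \ldots, N,\ j = 1, \ldots, C, \tag{38}$$
$$t_{kj} = \frac{1}{1 + \left(\dfrac{a_2 \|X_k - V_j\|^2}{\gamma_j}\right)^{\frac{1}{\eta-1}}}, \quad k = 1, \ldots, N,\ j = 1, \ldots, C, \tag{39}$$
$$V_j = \frac{\sum_{k=1}^{N} \left(a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right) X_k}{\sum_{k=1}^{N} \left(a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right)}, \quad j = 1, \ldots, C. \tag{40}$$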
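The four closed-form solutions of Theorem 1 can be read as one alternating update pass: memberships, hesitation levels and typicalities are computed from the current centers, then the centers are recomputed. The sketch below is a minimal pure-Python illustration under stated assumptions, not the authors' code: `ipfcm_step` and its helpers are hypothetical names, the exponents in (37) and (38) are read as 2/(m−1) and 2/(τ−1), all γ_j default to one, and a small `eps` is added to the squared distances to avoid division by zero when a pattern coincides with a center.

```python
def sq_dist(x, v):
    """Squared Euclidean distance between two points given as lists."""
    return sum((xi - vi) ** 2 for xi, vi in zip(x, v))


def fcm_like(d2_row, p):
    """Eqs. (37)/(38): 1 / sum_i (d_kj / d_ki)**(2/(p-1)), from squared distances."""
    e = 1.0 / (p - 1.0)                  # applied to squared distances
    return [1.0 / sum((dj / di) ** e for di in d2_row) for dj in d2_row]


def ipfcm_step(X, V, m=2.0, eta=2.0, tau=2.0, a=(1.0, 1.0, 1.0), gamma=None, eps=1e-12):
    """One pass of the updates (37)-(40); returns U, H, T and the new centers."""
    C = len(V)
    gamma = gamma or [1.0] * C           # assumption: unit gamma_j by default
    U, H, T = [], [], []
    for x in X:
        d2 = [sq_dist(x, v) + eps for v in V]
        U.append(fcm_like(d2, m))        # memberships, Eq. (37)
        H.append(fcm_like(d2, tau))      # hesitation levels, Eq. (38)
        T.append([1.0 / (1.0 + (a[1] * d2[j] / gamma[j]) ** (1.0 / (eta - 1.0)))
                  for j in range(C)])    # typicalities, Eq. (39)
    new_V = []
    for j in range(C):
        wts = [a[0] * U[k][j] ** m + a[1] * T[k][j] ** eta + a[2] * H[k][j] ** tau
               for k in range(len(X))]
        total = sum(wts)
        new_V.append([sum(w * x[dim] for w, x in zip(wts, X)) / total
                      for dim in range(len(X[0]))])   # centers, Eq. (40)
    return U, H, T, new_V
```

In a full algorithm this step would be repeated until the centers move by less than a threshold; that loop and the choice of threshold are left out of the sketch.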
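Proof. Fixing T, H and V, for the kth column $U_k$ of U we get the reduced problem,
$$J_{m,\eta,\tau}(U_k) = \sum_{j=1}^{C} \left(a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right) \|X_k - V_j\|^2 \to \min. \tag{41}$$
Using the Lagrange multiplier method for (41), we obtain
$$L(U_k, \lambda_k) = \sum_{j=1}^{C} \left(a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right) \|X_k - V_j\|^2 - \lambda_k \left(\sum_{j=1}^{C} u_{kj} - 1\right), \tag{42}$$
$$\frac{\partial L(U_k, \lambda_k)}{\partial u_{kj}} = a_1 m\, u_{kj}^{m-1} \|X_k - V_j\|^2 - \lambda_k, \tag{43}$$
$$\frac{\partial L(U_k, \lambda_k)}{\partial u_{kj}} = 0 \iff u_{kj} = \left(\frac{\lambda_k}{a_1 m \|X_k - V_j\|^2}\right)^{\frac{1}{m-1}}, \quad j = 1, \ldots, C,\ k = 1, \ldots, N. \tag{44}$$
Since $\sum_{j=1}^{C} u_{kj} = 1$, we easily have
$$\sum_{j=1}^{C} \left(\frac{\lambda_k}{a_1 m \|X_k - V_j\|^2}\right)^{\frac{1}{m-1}} = 1 \tag{45}$$
$$\iff \lambda_k = a_1 m \left(\sum_{j=1}^{C} \|X_k - V_j\|^{-\frac{2}{m-1}}\right)^{-(m-1)}. \tag{46}$$
From (44) and (46), we get
$$u_{kj} = \frac{1}{\sum_{i=1}^{C} \left(\dfrac{\|X_k - V_j\|}{\|X_k - V_i\|}\right)^{\frac{2}{m-1}}}, \quad k = 1, \ldots, N,\ j = 1, \ldots, C. \tag{47}$$
Similarly, fixing U, T and V, for the kth column $H_k$ of H the Lagrange multiplier method gives the solution
$$h_{kj} = \frac{1}{\sum_{i=1}^{C} \left(\dfrac{\|X_k - V_j\|}{\|X_k - V_i\|}\right)^{\frac{2}{\tau-1}}}, \quad k = 1, \ldots, N,\ j = 1, \ldots, C. \tag{48}$$
Fixing U, H and V, for the typicality value $t_{kj}$ we obtain the reduced problem,
$$J^{kj}_{m,\eta,\tau}(t_{kj}) = \left(a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right) \|X_k - V_j\|^2 + \gamma_j (1 - t_{kj})^{\eta} \to \min. \tag{49}$$
Using the gradient method for (49), the solution is found as
$$\frac{\partial J^{kj}_{m,\eta,\tau}(t_{kj})}{\partial t_{kj}} = a_2 \eta\, t_{kj}^{\eta-1} \|X_k - V_j\|^2 - \gamma_j \eta\, (1 - t_{kj})^{\eta-1}, \tag{50}$$
$$\frac{\partial J^{kj}_{m,\eta,\tau}(t_{kj})}{\partial t_{kj}} = 0 \iff a_2 \eta\, t_{kj}^{\eta-1} \|X_k - V_j\|^2 = \gamma_j \eta\, (1 - t_{kj})^{\eta-1} \tag{51}$$
$$\iff \frac{a_2 \|X_k - V_j\|^2}{\gamma_j} = \left(\frac{1}{t_{kj}} - 1\right)^{\eta-1} \tag{52}$$
$$\iff 1 + \left(\frac{a_2 \|X_k - V_j\|^2}{\gamma_j}\right)^{\frac{1}{\eta-1}} = \frac{1}{t_{kj}} \tag{53}$$
$$\iff t_{kj} = \frac{1}{1 + \left(\dfrac{a_2 \|X_k - V_j\|^2}{\gamma_j}\right)^{\frac{1}{\eta-1}}}, \quad k = 1, \ldots, N,\ j = 1, \ldots, C. \tag{54}$$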
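The derivation (49)–(54) can be checked numerically: for fixed distances, the closed form (54) should coincide with the minimizer of the reduced objective found by a brute-force search over t in [0, 1]. The sketch below is only such a sanity check; the function names are hypothetical, and the terms of (49) that do not depend on t are dropped because they do not move the minimizer.

```python
def typicality(d2, a2=1.0, gamma=1.0, eta=2.0):
    """Closed form (54) for a squared distance d2 = ||X_k - V_j||^2."""
    return 1.0 / (1.0 + (a2 * d2 / gamma) ** (1.0 / (eta - 1.0)))


def reduced_objective(t, d2, a2=1.0, gamma=1.0, eta=2.0):
    """The t-dependent part of (49): a2 * t**eta * d2 + gamma * (1 - t)**eta."""
    return a2 * t ** eta * d2 + gamma * (1.0 - t) ** eta


def grid_argmin(d2, a2=1.0, gamma=1.0, eta=2.0, steps=20000):
    """Brute-force minimizer of the reduced objective on a uniform grid of [0, 1]."""
    best = min(range(steps + 1),
               key=lambda i: reduced_objective(i / steps, d2, a2, gamma, eta))
    return best / steps


closed_form = typicality(3.0, a2=2.0, gamma=1.5, eta=2.5)
numerical = grid_argmin(3.0, a2=2.0, gamma=1.5, eta=2.5)
```

For η > 1 the reduced objective is convex in t, so the grid minimizer lies within one grid step of the closed-form value.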
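Fixing U, H and T, we calculate the gradient of $J_{m,\eta,\tau}(V)$ with respect to each $V_j$,
$$J_{m,\eta,\tau}(V) = \sum_{k=1}^{N} \sum_{j=1}^{C} \left(a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right) \|X_k - V_j\|^2 \to \min, \tag{55}$$
$$\frac{\partial J_{m,\eta,\tau}(V)}{\partial V_j} = \sum_{k=1}^{N} \left(a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right) (-2 X_k + 2 V_j), \tag{56}$$
$$\frac{\partial J_{m,\eta,\tau}}{\partial V_j} = 0 \iff V_j = \frac{\sum_{k=1}^{N} \left(a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right) X_k}{\sum_{k=1}^{N} \left(a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right)}, \quad j = 1, \ldots, C. \tag{57}$$
Similar to (Pal et al., 2005), in this paper we investigate some interesting properties of the solutions above.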
Property 1 Since the exponential factor belongs to the interval [0, 1],
we get
m!1þfukjg ¼ 1
lim
m!1þ
PC i¼1
kX k V j k
kXkV i k
m1
lim
Property 2 The series expansion at m = 1 is
1 þ2 logðaÞ
2ðlog2ðaÞ þ logðaÞÞ
4log 3 ðaÞ
3 þ 4log2ðaÞ þ 2 logðaÞ
þ
2log4ðaÞ
3 þ 4log3ðaÞ þ 6log2ðaÞ þ 2logðaÞ
m4 þ
4log 5 ðaÞ
15 þ8log34ðaÞþ 8log3ðaÞ þ 8log2ðaÞ þ 2logðaÞ
m
6!
;ð60Þ
where
a ¼XC
i¼1
kXk Vjk
Thus,
lim
Obviously, when the parameter m is getting larger, the value of $u_{kj}$ tends to converge to one.
Property 3. Similarly, some limits of the hesitation level are
$$\lim_{\tau \to 1^{+}} h_{kj} = 1, \qquad \lim_{\tau \to 1^{-}} h_{kj} = 0, \qquad \lim_{\tau \to \infty} h_{kj} = 1. \tag{63}$$
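Property 4. Limits of the typicality value are
$$\lim_{\eta \to 1^{+}} t_{kj} = \frac{1}{1 + \lim_{\eta \to 1^{+}} \left(\dfrac{a_2 \|X_k - V_j\|^2}{\gamma_j}\right)^{\frac{1}{\eta-1}}} = \begin{cases} 1 & \text{if } \|X_k - V_j\|^2 < \gamma_j / a_2, \\ 1/2 & \text{if } \|X_k - V_j\|^2 = \gamma_j / a_2, \\ 0 & \text{otherwise}, \end{cases} \tag{64}$$
$$\lim_{\eta \to 1^{-}} t_{kj} = \begin{cases} 0 & \text{if } \|X_k - V_j\|^2 < \gamma_j / a_2, \\ 1/2 & \text{if } \|X_k - V_j\|^2 = \gamma_j / a_2, \\ 1 & \text{otherwise}. \end{cases} \tag{65}$$
Property 5.
$$\lim_{\eta \to \infty} t_{kj} = \frac{1}{2}. \tag{66}$$
Property 6. If $a_2 = 0$ then
$$t_{kj} = 1, \quad \forall k = 1, \ldots, N,\ j = 1, \ldots, C. \tag{67}$$
This property shows that the typicality values are independent of the distances between $X_k$ and $V_j$ when the parameter $a_2 = 0$. In this case, the objective function approximates that of the FCM algorithm, supplemented with the hesitation level.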
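The limiting behaviour of the typicality value can be observed numerically by evaluating Eq. (39) for η close to one and for very large η. The sketch below is illustrative; `typicality` is a hypothetical helper taking the squared distance, and the specific values of η stand in for the limits.

```python
def typicality(d2, a2=1.0, gamma=1.0, eta=2.0):
    """Eq. (39) for a squared distance d2 = ||X_k - V_j||^2."""
    return 1.0 / (1.0 + (a2 * d2 / gamma) ** (1.0 / (eta - 1.0)))


# With a2 = gamma = 1 the threshold gamma / a2 equals 1.
near = typicality(0.5, eta=1.01)        # inside the threshold, eta -> 1+
far = typicality(2.0, eta=1.01)         # outside the threshold, eta -> 1+
on_threshold = typicality(1.0, eta=1.01)
large_eta = typicality(0.5, eta=1e6)    # eta -> infinity
from_below = typicality(0.5, eta=0.99)  # inside the threshold, eta -> 1-
no_a2 = typicality(3.0, a2=0.0)         # Property 6: a2 = 0
```

Patterns inside the threshold receive typicality close to one and patterns outside receive values close to zero as η approaches one from above, the roles swap when η approaches one from below, and for very large η every typicality approaches one half. A pattern exactly on the threshold has typicality one half for every η.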
Property 7 The double limit of the sequence {uj} is
lim
m!1
g !1
s!1
fvjg ¼ lim
ðm; g ;sÞ!ð1;1;1Þ
PN k¼1 Xk
1þ
PN l¼1;l–k a1u m
lj þa2tgljþa3hslj
a1u m þa2tgkjþa3hskj
ffi
PN
k¼1 Xk
1þðN1Þ lim
ðm; g ; s Þ!ð1;1;1Þ
a1u m
lj þa2tgljþa3hslj a1umþa2tgkjþa3hskj
¼
PN
k¼1 X k
¼ V l 2 ½1; N
When the parameters are quite large, all the clusters’ centers tend to move to the central point of the dataset
Property 8 If a2=cj= 1; j ¼ 1; C;g= 2 then
ZN 2
Z C 2
tkjdkdj ¼ ðN C þ 1Þ logðN C þ 1Þ þ ðN 1Þ logðN 1Þ
ðC 3Þip ðC 3Þ logðC 3Þ: ð68Þ
If a2=cj= 1; j ¼ 1; C;g= 0 then
ZN 2
Z C 2
tkjdkdj ¼ ðN 2ÞðC 2Þ þ ðN C þ 1Þ logðN C þ 1Þ
ðN 1Þ logðN 1Þ þ ðC 3Þip
From this property, we see that there is a gap of the quantity (N − 2)(C − 2) between the areas of typicality values in the two cases η = 0 and η = 2.
Property 9
lim m!1þ
s!1þ
ukj
hkj
¼ lim m!1þ
s!1þ
XC i¼1
kXk Vjk
kXk Vjk
! 2ðm s Þ ðm1Þð s 1Þ
where ~1 is complex infinity
lim m!1þ
s!1
ukj
hkj
lim m!1
s!1
ukj
hkj
¼ XC i¼1
kXk Vjk
kXk Vjk
!2
s 1
; XC i¼1
kXk Vjk
kXk Vjk
!2 m1
8
<
:
9
=
;: ð72Þ
From these formulas, we may see that $u_{kj}$ and $h_{kj}$ have similar roles.
Theorem 2. The relation between the two parameters m and τ that assures the constraint (33) is the IPFCM condition below
Proof. From the constraint (33), we get
1
PC i¼1
kX k V j k
kX k V j k
m1
PC i¼1
kX k V j k
kX k V j k
s 1
From the Cauchy inequality, we obtain
1
PC i¼1
kX k V j k
kXkVjk
m1
PC i¼1
kX k V j k
kXkVjk
s 1
PC i¼1
kX k V j k
kX k V j k
mþ s 2
ðm1Þð s 1Þ
; ð75Þ
) m þs 2
where
A ¼ XC i¼1
kXk Vjk
kXk Vjk
!
Since A < 2
) m þs 2
ðm 1Þðs 1Þ P logA2 > 1; ð78Þ
The condition (73) allows us to choose suitable values of $m$ and $\tau$ for the IPFCM algorithm. In the proof above, the strict upper bound for $A$ is one. However, in view of the limit behaviour of this quantity, we should choose a loose upper bound for this parameter. In this case, we set it to two.
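The inequality (78) is easy to check numerically for a candidate pair of exponents. The sketch below is a minimal illustration in Python; the function names are hypothetical, and the default bound `A = 2.0` is the loose upper bound mentioned above, for which $\log_A 2 = 1$.

```python
import math

def exponent_ratio(m, tau):
    """Left-hand side of (78): (m + tau - 2) / ((m - 1)(tau - 1))."""
    if m <= 1 or tau <= 1:
        raise ValueError("m and tau must both be greater than 1")
    return (m + tau - 2) / ((m - 1) * (tau - 1))

def satisfies_bound(m, tau, A=2.0):
    """True when the ratio is at least log_A(2).

    A = 2 is the loose upper bound (an assumption taken from the text);
    any other value of A must lie in (1, 2].
    """
    return exponent_ratio(m, tau) >= math.log(2, A)

# Exponent pairs with tau = 2 and m = 3, 1.5 or 2.5.
for m in (3.0, 1.5, 2.5):
    print(m, exponent_ratio(m, 2.0), satisfies_bound(m, 2.0))
```

Large exponents shrink the ratio, so a pair such as m = 4 and tau = 4 (ratio 2/3) fails the check under this bound.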
3.4 Fuzzy clustering method for Liu's generalized intuitionistic fuzzy sets
In this section, we extend the IPFCM to the L-GIFS sets. Notice that the only difference between the L-GIFS and IFS sets is the replacement of constraint (9) in IFS by constraint (24) in L-GIFS with the support of an extensional index. In other words, the constraint (24) is a generalization of (9) because, when this index is set to zero, the L-GIFS reduces to IFS. Therefore, the IPFCM algorithm itself remains unchanged. However, to accommodate constraint (24), we now state another condition on the parameters.
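Theorem 3. The relation between the two parameters $m$ and $\tau$ that assures the constraint (24) in the IPFCM algorithm for the L-GIFS sets is
$$
\frac{m + \tau - 2}{(m-1)(\tau-1)} \ge \log_2 \frac{2}{1 + L}.
$$
Proof. Similar to the above proof and following the Cauchy inequality, we obtain the inequality below,
$$
1 + L \ge \frac{2}{\sum_{i=1}^{C} \left( \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{m+\tau-2}{(m-1)(\tau-1)}}}.
$$
By denoting $A$ as in (77), we get the following facts,
$$
\frac{m + \tau - 2}{(m-1)(\tau-1)} \ge \log_A \frac{2}{1 + L} \ge \log_2 \frac{2}{1 + L}.
$$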
Especially when L = 1,
Because
2 s P 3 2s
The values of the parameters $m$ and $\tau$ in (85) are definitely stricter than the ones in (73). Indeed, we can use these constraints to find suitable values of these parameters for the IPFCM algorithm. $\square$
3.5 The intuitionistic possibilistic fuzzy geographically weighted clustering algorithm
In the previous parts of this section, we have presented the fuzzy clustering algorithm for IFS and L-GIFS sets (IPFCM). Now, we present the main algorithm of this paper for the GDA problem. This algorithm is named Intuitionistic Possibilistic Fuzzy Geographically Weighted Clustering (IPFGWC). It is an extension of IPFCM for the considered problem.
considered problem We have a relationship here,
The algorithm is stated below.
Step 1: Set the number of patterns $N$, the number of clusters $C$, the threshold $\varepsilon > 0$ and other parameters such as $m, \eta, \tau > 1$; $a_i > 0$, $i = \overline{1,3}$; $\gamma_j$, $j = \overline{1,C}$, satisfying the IPFCM condition for the L-GIFS sets (82), and the geographic parameters $\alpha$, $\beta$, $a$, $b$.
Step 2: Initialize the centers of clusters $V_j$, $j = \overline{1,C}$, at $t = 0$.
Step 3: Use (37)–(39) to calculate the membership values, the hesitation level and the typicality values, respectively. The distance used in these formulas is the squared Euclidean distance.
Step 4: Perform geographic modifications through equations (1), (2) and (4). Notice that the population of an area is decided through the current membership values.
Step 5: Use (40) to calculate the centers of clusters at $t + 1$.
Step 6: If the error $\|V^{(t+1)} - V^{(t)}\| \le \varepsilon$ then stop the algorithm. Otherwise, assign $V^{(t)} = V^{(t+1)}$ and return to Step 3.
The outputs of this algorithm are the cluster centers, the final membership values and the hesitation levels.
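The six steps can be sketched as a short iterative loop. The Python code below is a hedged illustration, not the authors' C implementation. Equations (37)–(40) and (1), (2), (4) are not reproduced in this section, so the sketch assumes FCM-style formulas for the membership and hesitation values, a PFCM-style formula for the typicality values, and a geographic step of the form $u' = \alpha u + \beta \cdot (\text{weighted neighbour average})$ with a precomputed neighbour weight matrix `W`. The names `ipfgwc`, `W`, `max_iter` and `seed` are hypothetical.

```python
import numpy as np

def sq_dists(X, V):
    """Squared Euclidean distances between points and centers, shape (N, C)."""
    return ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)

def fcm_like(d2, p):
    """Assumed FCM-style degrees with exponent p; each row sums to one."""
    inv = np.maximum(d2, 1e-12) ** (-1.0 / (p - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def ipfgwc(X, W, C, m=3.0, eta=2.0, tau=2.0, a=(1.0, 1.0, 1.0), gamma=1.0,
           alpha=0.7, beta=0.3, eps=1e-3, max_iter=100, seed=0):
    """Sketch of Steps 1-6; W is an (N, N) neighbour weight matrix
    with positive row sums (assumed to be given)."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=C, replace=False)]        # Step 2
    row = W.sum(axis=1, keepdims=True)
    for _ in range(max_iter):                                # safeguard, not in the text
        d2 = sq_dists(X, V)                                  # Step 3
        u = fcm_like(d2, m)                                  # membership
        h = fcm_like(d2, tau)                                # hesitation
        t = 1.0 / (1.0 + (a[1] * d2 / gamma) ** (1.0 / (eta - 1.0)))  # typicality
        u = alpha * u + beta * (W @ u) / row                 # Step 4 (assumed form)
        w = a[0] * u**m + a[1] * t**eta + a[2] * h**tau      # Step 5
        V_new = (w.T @ X) / w.sum(axis=0)[:, None]
        done = np.linalg.norm(V_new - V) <= eps              # Step 6
        V = V_new
        if done:
            break
    return V, u, h

# Toy usage: two well-separated groups, uniform neighbour weights.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
W = np.ones((40, 40)) - np.eye(40)
V, u, h = ipfgwc(X, W, C=2)
```

With $\alpha + \beta = 1$ and row-normalised neighbour weights, the modified memberships of each point still sum to one, which is why the geographic step is written as a convex combination here.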
4 Results and discussions
4.1 Experiment environment
In this part, we describe the experimental tools, the experimental datasets and the cluster validity measurement.
Experimental tools: We have implemented the proposed algorithm (IPFGWC) in addition to three other algorithms, FCM (Bezdek & Ehrlich, 1984), FGWC (Mason & Jacobson, 2007) and CFGWC (Son et al., 2011), in the C programming language and executed them on a PC with an Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00 GHz (2 CPUs), 2048 MB RAM and Windows 7 Professional 32-bit.
Some parameters of these algorithms are set up as below.
The proposed algorithm: threshold $\varepsilon = 10^{-3}$; parameters $m = 3$, $\tau = 2$, $\eta = 2$, $a_1 = a_2 = a_3 = 1$, $\gamma_j = 1$ ($j = \overline{1,C}$); geographic parameters $a = b = 1$, $\alpha = 0.7$, $\beta = 0.3$.
FCM, FGWC, CFGWC: these algorithms use the same values of the geographic parameters, $m$ and the threshold as the algorithm above.
Experimental datasets: We use a real dataset of socio-economic demographic variables from the United Nations Organization, UNO (UNSD Statistical Databases, 2011). These data have been collected from national statistical authorities since 1948 through a set of questionnaires dispatched annually by the United Nations Statistics Division to over 230 national statistical offices (Fig 5). They contain statistics on population size and composition, births, deaths, marriage and divorce on an annual basis, economic activity, educational attainment, household characteristics, housing, ethnicity and language, etc.
Cluster validity measurement: We use the validity function of fuzzy clustering for spatial data, namely IFV (Chunchun, Lingkui, & Wenzhong, 2008). This index was shown to be robust and stable when clustering spatial data. It is defined as
$$
IFV = \frac{1}{C} \sum_{j=1}^{C} \left\{ \frac{1}{N} \sum_{k=1}^{N} u_{kj}^{2} \left[ \log_2 C - \frac{1}{N} \sum_{k=1}^{N} \log_2 u_{kj} \right]^{2} \right\} \times \frac{SD_{\max}}{\sigma_D}. \quad (88)
$$
The maximal distance between centers is
$$
SD_{\max} = \max_{k \neq j} \|V_k - V_j\|^{2}.
$$
The average deviation between each object and the cluster center is
$$
\sigma_D = \frac{1}{C} \sum_{j=1}^{C} \left( \frac{1}{N} \sum_{k=1}^{N} \|X_k - V_j\|^{2} \right).
$$
When IFV → max, the value of IFV is said to yield the optimal partition of the dataset.
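As an illustration of how the IFV index in (88) could be evaluated, the following Python sketch computes it from a data matrix, a set of centers and a membership matrix. It assumes the squared bracket term and the squared inter-center distance of the commonly cited form of the index; the function name `ifv` and the clipping constant are hypothetical choices made for this example.

```python
import numpy as np

def ifv(X, V, u):
    """IFV validity index (assumed form of (88)).

    X : (N, d) data points
    V : (C, d) cluster centers
    u : (N, C) membership values
    """
    N, C = u.shape
    u = np.clip(u, 1e-12, 1.0)                  # guard against log2(0)
    mean_log = np.log2(u).mean(axis=0)          # (1/N) sum_k log2 u_kj, per cluster
    fuzz = (u**2).mean(axis=0) * (np.log2(C) - mean_log) ** 2
    # Largest squared distance between any two centers.
    sd_max = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2).max()
    # Mean squared distance between objects and centers.
    sigma_d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2).mean()
    return fuzz.mean() * sd_max / sigma_d

# Toy comparison: a sharp partition versus a completely uniform one.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
V = np.array([[0.05], [5.05]])
u_sharp = np.array([[0.99, 0.01], [0.99, 0.01], [0.01, 0.99], [0.01, 0.99]])
u_flat = np.full((4, 2), 0.5)
print(ifv(X, V, u_sharp), ifv(X, V, u_flat))
```

Since both partitions share the same centers, the geometric factor is identical and the difference comes only from the membership term, so the sharper partition receives the larger value.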
4.2 IFV comparison
Firstly, we calculate the values of the IFV index by $m$ and $C$. The results are shown in Table 1.
From these results, we can see that the IFV values of the proposed algorithm (IPFGWC) are larger than those of the CFGWC, FGWC and FCM algorithms. In descending order of the IFV index, the algorithms rank as IPFGWC, CFGWC, FGWC and FCM, so the IPFGWC algorithm obtains the best clustering quality among them. Moreover, this test also reconfirms the results shown in the literature (Son et al., 2011), where the authors compared the three algorithms FCM, FGWC and CFGWC. FCM has the smallest values of the IFV index due to the lack of geographic effects in the algorithm itself. Overcoming this deficiency, FGWC obtains higher IFV values than FCM does. However, by adding a context term, the CFGWC algorithm outperforms FGWC and, of course, FCM. Finally, the IFV values of IPFGWC are shown to be larger than those of CFGWC (Figs 6 and 7). Thus, IPFGWC is considered the most effective algorithm in terms of clustering quality.
When the parameter $m = 1.5$, the optimal numbers of clusters of the IPFGWC, CFGWC, FGWC and FCM algorithms are 3, 7, 4 and 6, respectively. For $m = 2.5$, these numbers are 3, 2, 4 and 7. Generally, the number of clusters produced by IPFGWC is the smallest among all results. Indeed, the IPFGWC method generates results that are closer to the optimal ones than those of the other methods.
In the IPFGWC algorithm, the average IFV value over the numbers of clusters $C$ is 2452.1 and 2391.8 when $m = 1.5$ and $m = 2.5$, respectively. Thus, another conclusion can be drawn from this test: when the parameter $m$ increases, the IFV value of IPFGWC tends to decrease.
Secondly, we measure the values of the IFV index by $\alpha$ and $C$. The results in Table 2 clearly show that the IFV value of IPFGWC is still the largest among all. Furthermore, IPFGWC's result is much better than CFGWC's and FGWC's. On average, the IFV value of IPFGWC is 1.5 times larger than CFGWC's and 37.1 times larger than FGWC's. In particular, when $C = 2$, these numbers are 33 and 1079, respectively. The addition of the typicality values and the hesitation level to the objective function has thus greatly improved the IFV values of the IPFGWC algorithm.
When $\alpha = 0.4$ (Fig 8), the optimal numbers of clusters of the three algorithms are 3, 7 and 5, respectively. For $\alpha = 0.6$ (Fig 9) and $\alpha = 0.8$ (Fig 10), these numbers are 3, 2, 6 and 3, 2, 2, respectively. The results of IPFGWC seem to be stable across different values of $\alpha$. On the contrary, the results of CFGWC and FGWC tend to decrease when $\alpha$ increases. In case of $\alpha < 0.5$, the number of clusters of the IPFGWC algorithm is said to be the most optimal result among all.
Fig 5 Populations of 233 countries in 2001.
Table 1
IFV values by m and C.
Similar to the previous test, when the parameter $\alpha$ increases, the IFV value of IPFGWC tends to decrease.
The final conclusion of this part is that the clustering quality of the IPFGWC algorithm exceeds that of CFGWC, FGWC and FCM.
4.3 The running time comparison
In this part, we measure the speed of these algorithms by $C$ (Table 3) and $\alpha$ (Table 4).
The IPFGWC method is clearly slower than the other ones. Table 3 shows that the running time of IPFGWC is 4.33 times larger than that of CFGWC on average. These numbers for FGWC and FCM are 4.05 and 3.12, respectively. Because of the modifications in the objective function, IPFGWC has to undertake extra calculations, which cause the slow running time shown in this table. Table 3 also reconfirms the experimental results in the literature (Son et al., 2011), where the descending order of running time is FCM, FGWC and CFGWC.
Another finding from Table 3 is the average increment of the running time of these algorithms per added cluster. When a cluster is added, the running time of IPFGWC increases by 0.132 s. These increments for CFGWC, FGWC and FCM are 0.03, 0.032 and 0.043, respectively. This finding helps us to estimate the running time of an algorithm when more clusters are provided.
In Table 4, the running time of IPFGWC is 3.42 and 3.06 times larger than that of CFGWC and FGWC, respectively. These numbers are smaller than the ones in Table 3. Indeed, the number of clusters $C$ contributes to the increment of the running times of these algorithms more than the parameter $\alpha$ does.
Similar to Table 3, the average increment of IPFGWC, CFGWC and FGWC per $\alpha = 0.1$ is 0.22, 0.06 and 0.08, respectively. These increments are roughly double the ones in Table 3. Thus, an increment of the parameter $\alpha$ often leads to higher values of the average increment of these algorithms than an increment of the number of clusters $C$ does.
Fig 11 shows the number of iteration steps of the four algorithms, corresponding to the results in Table 3.
The final conclusion of this part is that the IPFGWC algorithm is slower than CFGWC, FGWC and FCM due to extra calculations. However, it is not too slow and is acceptable.
4.4 The changes of objective function of IPFGWC
Finally, we investigate the changes of the objective function's values of the proposed method by its parameters (Table 5).
Fig 6 IFV values when m = 1.5.
Fig 7 IFV values when m = 2.5.
Table 2
IFV values byaand C.
Fig 8 IFV values whena= 0.4.