A novel kernel fuzzy clustering algorithm for Geo-Demographic
In this paper, we present a novel fuzzy clustering algorithm named Kernel Fuzzy Geographically Clustering (KFGC) that utilizes both the kernel similarity function and the
Some supported properties and theorems of KFGC are also examined in the paper. Specifically, the differences between the solutions of KFGC and those of MIPFGWC and of some variants of KFGC are theoretically validated. Lastly, experimental analysis is performed to compare the performance of KFGC with those of the relevant algorithms in terms of clustering quality.
© 2015 Elsevier Inc. All rights reserved.
1 Introduction
Prior to defining the Geo-Demographic Analysis (GDA) problem, let us consider an example that demonstrates the role of GDA in practical applications.
http://dx.doi.org/10.1016/j.ins.2015.04.050
⇑ Tel.: +84 904171284; fax: +84 0438623938.
E-mail addresses: sonlh@vnu.edu.vn, chinhson2002@gmail.com
Information Sciences
Journal homepage: www.elsevier.com/locate/ins
results are expressed in a map showing various groups determined by intervals of cases such as [0, 47] and [48, 233]. From this fact, decision makers could observe the most dangerous places and issue appropriate medical measures to prevent such a situation in the future. The distribution can be expressed by linguistic labels such as "High cases of viral hemorrhagic fever" and "Low cases of viral hemorrhagic fever" to eliminate the limitations of boundary points in the intervals. Such hot-spot analysis in this example is a kind of Geo-Demographic Analysis regarding the classification of a geographical area according to a given subject, e.g. viral hemorrhagic fever cases. The classification could be done for spatial data both in point and region standards of Geographical Information Systems (GIS), and a number of points/regions that both share common
Intuitively, GDA can be regarded as spatial clustering in GIS.
Definition 1. Given a geo-demographic dataset X consisting of N data points, where each data point is equivalent to a point/region of spatial data in GIS. Each data point is characterized by many geo-demographic attributes, each of which could be considered as a subject for clustering. The objective of GDA is to classify X into C clusters so that,
where $u_{kj}$ is the membership value of data point $X_k$ $(k = \overline{1,N})$ to the $j$th cluster $(j = \overline{1,C})$, $V_j$ is the center of the $j$th cluster, and $w_j$ is the weight of the $j$th cluster showing the influence of spatial relationships in a map. It is often calculated through a
or Spatial Interaction – Modification Model (SIM²) [26].
GDA is widely used in the public and private sectors for planning and provision of products and services. In GDA, geo-demographic attributes are used to characterize essential information about the population of a certain geographical area at a specific point in time. Common geo-demographic attributes include, to name but a few, gender, age, ethnicity, knowledge of languages, disabilities, mobility, home ownership, and employment status. One of the most useful functions of GDA
is the capability to visualize geo-demographic trends by location and time stamp, such as the study of the average age
the results of GDA are depicted on a map that demonstrates the distribution of several distinct groups. Various distribution maps can be combined or overlapped into a single one so that users can observe the tendency of a certain group over time for the analysis of geo-demographic trends. Both distributions and trends of values within a geo-demographic variable are of interest in GDA. By providing essential information about geo-demographic distributions and trends, GDA effectively assists many decision-making processes involving the provision and distribution of products and services in society, the determination of common population characteristics and the study of population variation. This clearly demonstrates the important role of GDA in practical applications nowadays.
List of abbreviations
K-Means    A hard partition-based clustering algorithm
IPFGWC     Intuitionistic Possibilistic Fuzzy Geographically Weighted Clustering
MIPFGWC    Modified Intuitionistic Possibilistic Fuzzy Geographically Weighted Clustering
In GDA, improving the clustering quality, especially in terms of theoretical results over the relevant existing methods, is the
geo-demographic data is essential and significant to the expression and reasoning of events and phenomena concerning the characteristics of the population and to the better support of decision-making. Some of the first methods –
neural networks to determine the underlying demographic and socio-economic phenomena. However, their disadvantages are the requirement of large memory space and high computational complexity. For these reasons, researchers tended to use clustering
clusters represented in the forms of hierarchical trees and isolated groups. Nonetheless, using hard clustering for GDA often leads to the issue of ecological fallacy, which means that statistics accurately describing group characteristics do not necessarily apply to individuals within that group. Thus, fuzzy clustering algorithms like Fuzzy C-Means (FCM) and its variants were considered appropriate methods to determine the distribution of a demographic feature
Fig. 1. The distribution of viral hemorrhagic fever cases in Vietnam in 2011.
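To make the contrast between hard and fuzzy assignment concrete, the following minimal Python sketch compares a nearest-center (hard) assignment with the standard FCM membership formula. It is only an illustration: the function names and the toy coordinates are assumptions introduced here, and no geographical weighting of the kind discussed in this paper is applied.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def hard_assign(x, centers):
    """Hard clustering: index of the single nearest center."""
    d2 = [squared_distance(x, c) for c in centers]
    return d2.index(min(d2))


def fcm_memberships(x, centers, m=2.0):
    """Standard FCM memberships of point x in every cluster (fuzzifier m > 1)."""
    d2 = [squared_distance(x, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    if any(d == 0.0 for d in d2):
        hits = [1.0 if d == 0.0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    power = 1.0 / (m - 1.0)
    inv = [(1.0 / d) ** power for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


# Toy example (hypothetical coordinates): a point closer to the first center.
point = (1.0, 0.0)
centers = [(0.0, 0.0), (3.0, 0.0)]
print(hard_assign(point, centers))      # the point is forced into one cluster
print(fcm_memberships(point, centers))  # graded membership in both clusters
```

With the hard assignment the point is described only by its cluster, which is the situation behind the ecological fallacy; the fuzzy memberships keep the information that it also partly resembles the second cluster.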
fuzzy geographically clustering algorithm has better clustering quality than other relevant algorithms such as NE [6], FGWC
state-of-the-art algorithm for the GDA problem.
membership matrix. However, there exist some problems in this process that may decrease the clustering quality. Let us analyze them more closely.
The Euclidean similarity measure in clustering algorithms often has a high error rate and is sensitive to noise and outliers, since this measure is not suitable for ordinal data, where preferences are listed according to rank instead of according to actual real values. Furthermore, the Euclidean measure cannot determine the correlation between user profiles that
measure results in the contributions to the clustering results among all features being the same, which leads to a strong dependence of clustering on the sample space. For those reasons, a better similarity measure should be used instead of the Euclidean function to obtain high clustering quality.
membership matrix, the hesitation level and the typicality by solving the optimization problem. The membership
not updated by any geographical model so that the new centers, which are calculated from those values and the updated membership matrix, are not "geographically aware". Thus, the final clustering results could be far from the accurate ones.
We clearly recognize that these two drawbacks prevent MIPFGWC from achieving high clustering quality, so they need to be addressed thoroughly. Our motivation in this article is to investigate a new method that can handle those limitations; more specifically,
(a) For the first problem, we consider kernel functions, which can be used in many applications as they provide a simple bridge from linearity to non-linearity for algorithms that can be expressed in terms of dot products. This type of similarity measure makes very good sense as a measure of difference between samples in the context of certain data, e.g. geo-demographic datasets.
giving more closely related to spatial relationships since all elements such as the cluster memberships, the hesitation level, the typicality values and the centers are "geographically aware".
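The idea of replacing the Euclidean measure with a kernel-based one can be sketched in a few lines of Python. The sketch below uses a Gaussian kernel K and the kernel-induced distance 1 − K(x, v). The exact kernel parameterization used by KFGC is not legible in this text, so the form exp(−‖x − v‖² / σ²), the bandwidth name `sigma`, and the function names are assumptions made for illustration only.

```python
import math


def squared_euclidean(x, v):
    """Squared Euclidean distance; grows without bound for outliers."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def gaussian_kernel(x, v, sigma=1.0):
    """Gaussian kernel K(x, v) = exp(-||x - v||^2 / sigma^2) (assumed form)."""
    return math.exp(-squared_euclidean(x, v) / (sigma ** 2))


def kernel_distance(x, v, sigma=1.0):
    """Kernel-induced distance 1 - K(x, v); always lies in [0, 1]."""
    return 1.0 - gaussian_kernel(x, v, sigma)


center = (0.0, 0.0)
near, outlier = (0.5, 0.0), (100.0, 0.0)   # hypothetical sample points
for p in (near, outlier):
    print(p, squared_euclidean(p, center), kernel_distance(p, center, sigma=2.0))
```

The point of the comparison is that the squared Euclidean distance of the outlier dominates any sum it enters, whereas its kernel-induced distance saturates at 1, so a single extreme observation has a bounded influence on the objective function.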
Our contributions in this paper are:
(a) A novel fuzzy clustering algorithm named Kernel Fuzzy Geographically Clustering (KFGC) that utilizes both the kernel
(b) Some supported properties and theorems of KFGC are also examined in the paper. Specifically, the differences between the solutions of KFGC and those of MIPFGWC and of some variants of KFGC are theoretically validated.
(c) Experimental analysis is performed to compare the performance of KFGC with those of the relevant algorithms in terms of clustering quality.
These findings both improve the quality of results for the GDA problem and enrich the knowledge of developing
findings are significant to both theory and practice.
Before we close this section and move to the detailed model and solutions, let us raise an important question: "How can we apply the kernel functions to the clustering algorithm in order to handle the problem of similarity measures?" To answer
appropriate algorithm and kernel function for our considered problem. Through this survey, it is clear that the kernel function is often applied directly to the objective function, with the most frequently used function being the Gaussian kernel-induced distance
function and solutions, some supported properties and theorems, and details of the new algorithm, KFGC. Specifically, in
to handle two limitations that exist in MIPFGWC. By using the Lagrangian method, the optimal solutions including the new
examine some interesting properties and theorems of solutions such as,
Some characteristics of KFGC's solutions, e.g. the limit of membership values when the fuzzifier is large;
The estimation of the difference between solutions of KFGC and those of MIPFGWC and of the variants of KFGC.
directions
2 Kernel Fuzzy Geographically Clustering
2.1 The proposed model and solutions
Suppose there is a geo-demographic dataset X consisting of N data points. Let us divide the dataset into C groups satisfying the objective function (3).
m
þ a2 t0 kj
g
þ a3 h0 kj
cj
XN k¼1
$a_i > 0,\ i = \overline{1,3}$. (17)
function of MIPFGWC. This handles the first limitation of MIPFGWC as shown above.
In Eq. (3), $u'_{kj}$, $t'_{kj}$, $h'_{kj}$ are used instead of the original membership values, typicality values and hesitation level in MIPFGWC. This tackles the second problem of MIPFGWC, where the typicality values and the hesitation level are not updated by any geographical model so that the next center is not correctly calculated. According to this amendment, the membership values
function is kept intact for the update of those values.
Inspired by the spatial bias correction of the group of kernel-based fuzzy clustering, especially the work of Yang and Tsai
i¼1wij
is equivalent to rj in the work of Yang & Tsai. Nonetheless, since there has
constraints, the proposed model is more closely related to spatial relationships.
PC
j¼1 1
PC i¼1wij
Pj1 i¼1wji au ukiþ buuk i1 ð Þþ þ bi1
1 K X k;Vj
!1 m10
@
1A;
Pj1 i¼1wji ah hkiþ bhhk i1ð Þþ þ bi1h hk1
ahþ bh
Pj1 i¼1wji c h
Ahbi1h þ þ b þ 1 1
PC j¼11 K X k;Vj
@
1A;
a2 atþ bt
Pj1 i¼1wji ct
j ¼ 1;C:
ð22Þ
2.2 Some supported properties and theorems
auþ bu
Pj1 i¼1wji cu
XC i¼1
wij
!
buPj1 i¼1wji au ukiþ buuk i1ð Þþ þ bi1u uk1
auþ bu
Pj1 i¼1wji c u
A ubi1u þ þ b þ 1 XC
j¼1
1C
XC i¼1
The results in (24) and (25) differ from those in [26] by a quantity of constants. This means that applying the SIM² model into the objective
Property 2. Similarly, some limits of the hesitation level are
XC i¼1
wij
!
bhXj1 i¼1
Pj1 i¼1wjiat tkiþ bttk i1ð Þþ þ bi1t tk1
atþct
A tbtPj1 i¼1wjibi1t þ þ b þ 1
Pj1 i¼1wjiat tkiþ bttk i1 ð Þþ þ bi1t tk1
atþct
A tbt
Pj1 i¼1wjibi1t þ þ b þ 1
Property 4. If $a_2 = 0$ then $t_{kj} = 1$, $\forall k = \overline{1,N}$, $j = \overline{1,C}$.
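Property 5.
$$
\lim_{(m,\eta,\tau)\to(\infty,\infty,\infty)} V_j
= \lim_{(m,\eta,\tau)\to(\infty,\infty,\infty)} \sum_{k=1}^{N} \frac{X_k}{1 + \sum_{l=1,\, l \neq k}^{N} \dfrac{a_1 (u'_{lj})^{m} + a_2 (t'_{lj})^{\eta} + a_3 (h'_{lj})^{\tau}}{a_1 (u'_{kj})^{m} + a_2 (t'_{kj})^{\eta} + a_3 (h'_{kj})^{\tau}}}
\cong \frac{\sum_{k=1}^{N} X_k}{1 + (N-1) \lim_{(m,\eta,\tau)\to(\infty,\infty,\infty)} \left( \dfrac{a_1 (u'_{lj})^{m} + a_2 (t'_{lj})^{\eta} + a_3 (h'_{lj})^{\tau}}{a_1 (u'_{kj})^{m} + a_2 (t'_{kj})^{\eta} + a_3 (h'_{kj})^{\tau}} \right)}
= \frac{\sum_{k=1}^{N} X_k}{N}.
$$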
When the parameters are quite large, all the cluster centers tend to move to the central point of the dataset.
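The algebra behind this statement can be checked numerically with a small Python sketch. It assumes, for illustration only, that a center is a weighted mean of the data points with one positive weight per point (standing in for the combined membership, typicality and hesitation terms) and ignores any kernel factor. The function names and the toy data are hypothetical.

```python
def weighted_center(points, weights):
    """Center as a weighted mean: sum_k w_k X_k / sum_k w_k."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)]


def center_ratio_form(points, weights):
    """Same center written as sum_k X_k / (1 + sum_{l != k} w_l / w_k)."""
    dim = len(points[0])
    center = [0.0] * dim
    for k, (p, wk) in enumerate(zip(points, weights)):
        denom = 1.0 + sum(wl / wk for l, wl in enumerate(weights) if l != k)
        for d in range(dim):
            center[d] += p[d] / denom
    return center


points = [[0.0, 0.0], [2.0, 0.0], [4.0, 6.0]]    # hypothetical data
print(weighted_center(points, [1.0, 2.0, 5.0]))   # pulled toward the heavy point
print(center_ratio_form(points, [1.0, 2.0, 5.0])) # same value, ratio form
print(center_ratio_form(points, [1.0, 1.0, 1.0])) # all ratios equal 1: plain mean
```

When every weight ratio equals 1, each denominator becomes 1 + (N − 1) = N and the center reduces to the arithmetic mean of the data, which is the "central point" referred to above.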
Property 6. The limits of the ratio between $u_{kj}$ and $h_{kj}$ are
1PC i¼1 wij
bu
Pj1 i¼1wji au ukiþ buuk i1ð Þþ þ bi1u uk1
Auðbi1u þþbþ1Þ
1
PC j¼1
1PC i¼1 w ij
Ahðbi1h þþbþ1Þ
! : ð35Þ
2.2.2 The difference of solutions between algorithms
Theorem 2. Suppose that similar inputs and initialization are given to both KFGC and MIPFGWC. The upper bounds of the difference between the solutions of the two algorithms are:
0
@
1A
0
@
1A;ð36Þ
C
j¼1
bhX j1 i¼1
Ah bi1h þ þ b þ 1
0
@
1 A
0
@
1 A
0
@
1 A; ð37Þ
clustering spatial data. The definition of IFV is stated below.
IFV ¼ 1=Cð ÞXC
j¼1
1=N
ð ÞXN k¼1
IFVKMIPFGWC IFVMIPFGWC¼1
C
XC j¼1
1N
XN k¼1
uKMIPFGWC kj
log2C 1N
XN k¼1
log2uKMIPFGWC kj
XC j¼1
1N
XN k¼1
uMIPFGWC kj
log2C 1N
XN k¼1
log2uMIPFGWC kj
uKMIPFGWC kj
log2C 1N
XN k¼1
log2uKMIPFGWC kj
XN k¼1
uMIPFGWC kj
log2C 1N
XN k¼1
log2uMIPFGWC kj
0
@
1A
Based on the results in Theorem 2, we can recognize that $\mathrm{IFV}_{KMIPFGWC} - \mathrm{IFV}_{MIPFGWC} \geq 0$. In other words, the clustering quality of KFGC is generally better than that of MIPFGWC.
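For readers who want to experiment with the validity index, the following Python sketch computes IFV in the form in which it is commonly stated for spatial fuzzy clustering: the average over clusters of (1/N) Σ_k u_kj² · [log₂ C − (1/N) Σ_k log₂ u_kj]², multiplied by SD_max / σ_D, where SD_max is the largest squared distance between two centers and σ_D is the average squared deviation of the data from the centers. The definition in this text is not fully legible, so this exact form, the function names and the toy data are assumptions; larger values are taken to indicate better clustering.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ifv(points, centers, U):
    """IFV index (assumed form); U[k][j] > 0 is the membership of point k in cluster j."""
    n, c = len(points), len(centers)
    total = 0.0
    for j in range(c):
        mean_sq_u = sum(U[k][j] ** 2 for k in range(n)) / n
        mean_log_u = sum(math.log2(U[k][j]) for k in range(n)) / n
        total += mean_sq_u * (math.log2(c) - mean_log_u) ** 2
    # Largest squared distance between two distinct centers (needs c >= 2).
    sd_max = max(sq_dist(centers[i], centers[j])
                 for i in range(c) for j in range(c) if i != j)
    # Average squared deviation of the data from the centers.
    sigma_d = sum(sum(sq_dist(p, v) for p in points) / n for v in centers) / c
    return (total / c) * sd_max / sigma_d


points = [[0.0], [1.0], [9.0], [10.0]]             # hypothetical 1-D data
centers = [[0.5], [9.5]]
sharp = [[0.99, 0.01], [0.99, 0.01], [0.01, 0.99], [0.01, 0.99]]
flat = [[0.5, 0.5]] * 4
print(ifv(points, centers, sharp), ifv(points, centers, flat))
```

On this toy input the sharply separated membership matrix scores higher than the uniform one, which matches the intended reading of the index.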
Secondly, the effectiveness of KFGC with and without the Gaussian kernel function is verified.
PC
j¼1 1
PC i¼1wij
Pj1 i¼1wji au ukiþ buuk i1 ð Þþ þ bi1
Xk Vj
!1 m10
B
1CA;
Pj1 i¼1wji ah hkiþ bhhk i1ð Þþ þ bi1h hk1
ahþ bh
Pj1 i¼1wji c h
B
1C
; j ¼ 1;C:
ð47Þ
Theorem 4. The upper bounds of the difference between the solutions of KFGC with (a.k.a. K1) and without (a.k.a. K2) using the Gaussian kernel function are:
Xj1 i¼1
A ubi1u þ þ b þ 1
0
@
1A;ð48Þ
Xj1 i¼1
a2 atþ bt
Pj1 i¼1wji ct
ukj¼PC 1
j¼1 1
PC i¼1wij
PC j¼11 K X k;Vj
1 K Xk;Vj
!1 m1
hkj¼PC 1
j¼1 1
PC i¼1wij
PC j¼11 K X k;Vj
model into the objective function are:
buPj1 i¼1wji au ukiþ buuk i1ð Þþ þ bi1u uk1
auþ bu
Pj1 i¼1wji c u
bhPj1 i¼1wji ah hkiþ bhhk i1 ð Þþ þ bi1h hk1
TðK1Þ TðK3Þ
6 N C X
C j¼1
solutions of the system
bu
Pj1 i¼1wji au ukiþ buuk i1 ð Þþ þ bi1
1 K Xk;Vj
!1 m10
@
1A;
ahþ bh
Pj1 i¼1wji ch
Ahbi1h þ þ b þ 1 1
PC j¼1 1 K Xk;Vj
@
1A;
; j ¼ 1;C:
ð63Þ
Theorem 8. The upper bounds of the difference between the solutions of KFGC with (a.k.a. K1) and without (a.k.a. K4) using the standardized
1 bu
Pj1 i¼1wjiau ukiþ buuk i1ð Þþ þ bi1u uk1
cg11
j þ auþcu
A ubuPj1 i¼1wjibi1u þ þ b þ 1
0B
@
1C
C j¼1
1 bh
Pj1 i¼1wjiah hkiþ bhhk i1 ð Þþ þ bi1h hk1
0B
@
1C
TðK1Þ TðK4Þ
1PC i¼1wij
C j¼1
0B
@
1C
In this section, the KFGC algorithm is presented in detail.
Kernel Fuzzy Geographically Clustering
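and other parameters: $\alpha_u, \beta_u, \gamma_u, \alpha_t, \beta_t, \gamma_t, \alpha_h, \beta_h, \gamma_h$; $m, \eta, \tau > 1$; $a_i > 0$ $(i = \overline{1,3})$; $\gamma_j > 0$.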
In this part, the experimental environment is described as follows.
C programming language and executed them on a PC with an Intel(R) Core(TM)2 Duo CPU T6570 @ 2.10 GHz (2 CPUs), 2048 MB RAM, running Windows 7 Professional 32-bit. The experimental results are taken as the average values after 10 runs.
Experimental dataset:
deaths, marriage and divorce on an annual basis, economic activity, educational attainment, household characteristics,
characteristics of KFGC, respectively.
capabilities of KFGC to produce results that are more closely related to spatial relationships.
– All attributes/variables of these datasets are used concurrently for the best evaluation of the algorithms.