
DSpace at VNU: A novel kernel fuzzy clustering algorithm for Geo-Demographic Analysis


A novel kernel fuzzy clustering algorithm for Geo-Demographic Analysis

In this paper, we present a novel fuzzy clustering algorithm named Kernel Fuzzy Geographically Clustering (KFGC) that utilizes both the kernel similarity function and the

Some supporting properties and theorems of KFGC are also examined in the paper. Specifically, the differences between the solutions of KFGC and those of MIPFGWC and of some variants of KFGC are theoretically validated. Lastly, experimental analysis is performed to compare the performance of KFGC with that of the relevant algorithms in terms of clustering quality.

© 2015 Elsevier Inc. All rights reserved.

1. Introduction

Prior to the definition of the Geo-Demographic Analysis (GDA) problem, let us consider an example to demonstrate the role of GDA in practical applications.

results are expressed in a map showing various groups determined by intervals of cases such as [0, 47] and [48, 233]. From this fact, decision makers could observe the most dangerous places and issue appropriate medical measures to prevent such a situation in the future. The distribution can be expressed by linguistic labels such as "High cases of viral hemorrhagic fever" and "Low cases of viral hemorrhagic fever" to eliminate the limitations of boundary points in the intervals. Such hot-spot analysis in this example is a kind of Geo-Demographic Analysis regarding the classification of a geographical area according to a given subject, e.g., viral hemorrhagic fever cases. The classification could be done for spatial data both in point

http://dx.doi.org/10.1016/j.ins.2015.04.050

⇑ Tel.: +84 904171284; fax: +84 0438623938.

E-mail addresses: sonlh@vnu.edu.vn, chinhson2002@gmail.com

Information Sciences

journal homepage: www.elsevier.com/locate/ins


and region standards of Geographical Information Systems (GIS), and a number of points/regions that both share common

Intuitively, GDA can be regarded as spatial clustering in GIS.

Definition 1. Given a geo-demographic dataset $X$ consisting of $N$ data points, where each data point is equivalent to a point/region of spatial data in GIS. This data point is characterized by many geo-demographic attributes, each of which could be considered as a subject for clustering. The objective of GDA is to classify $X$ into $C$ clusters so that,

where $u_{kj}$ is the membership value of data point $X_k$ $(k = 1, \ldots, N)$ to the $j$th cluster $(j = 1, \ldots, C)$, $V_j$ is the center of the $j$th cluster, and $w_j$ is the weight of the $j$th cluster, showing the influence of spatial relationships in a map. It is often calculated through a

or Spatial Interaction – Modification Model (SIM2) [26].

GDA is widely used in the public and private sectors for planning and provision of products and services. In GDA, geo-demographic attributes are used to characterize essential information about the population of a certain geographical area at a specific point in time. Common geo-demographic attributes include, to name but a few, gender, age, ethnicity, knowledge of languages, disabilities, mobility, home ownership, and employment status. One of the most useful functions of GDA

is the capability to visualize geo-demographic trends by location and time stamp, such as the study of the average age

the results of GDA are depicted on a map that demonstrates the distribution of several distinct groups. Various distribution maps can be combined or overlapped into a single one so that users can observe the tendency of a certain group over time for the analysis of geo-demographic trends. Both distributions and trends of values within a geo-demographic variable are of interest in GDA. By providing essential information about geo-demographic distributions and trends, GDA effectively assists many decision-making processes involving the provision and distribution of products and services in society, the determination of common population characteristics and the study of population variation. This clearly demonstrates the important role of GDA in practical applications nowadays.

List of abbreviations

K-Means: a hard-partition-based clustering algorithm

IPFGWC: Intuitionistic Possibilistic Fuzzy Geographically Weighted Clustering

MIPFGWC: Modified Intuitionistic Possibilistic Fuzzy Geographically Weighted Clustering


In GDA, improving the clustering quality, especially in terms of theoretical results over the relevant existing methods, is the

geo-demographic data is essential and significant to the expression and reasoning of events and phenomena concerning the characteristics of the population and to the better support of decision-making. Some of the first methods –

neural networks to determine the underlying demographic and socio-economic phenomena. However, their disadvantages are the requirement of large memory space and high computational complexity. For these reasons, researchers tended to use clustering

clusters represented in the forms of hierarchical trees and isolated groups. Nonetheless, using hard clustering for GDA often leads to the issue of ecological fallacy, which can be briefly understood as follows: statistics accurately describing group characteristics do not necessarily apply to individuals within that group. Thus, fuzzy clustering algorithms like Fuzzy C-Means (FCM) and its variants were considered appropriate methods to determine the distribution of a demographic feature

Fig. 1. The distribution of viral hemorrhagic fever cases in Vietnam in 2011.
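The contrast between hard and fuzzy assignment mentioned in the introduction can be sketched in a few lines of Python. The snippet below compares a nearest-center (K-Means style) label with the standard FCM membership update; the toy coordinates, the function name `fcm_memberships` and the fuzzifier value are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def fcm_memberships(X, V, m=2.0, eps=1e-12):
    """Standard FCM membership update (illustrative helper, not from the paper).

    u_kj = d_kj^(-2/(m-1)) / sum_i d_ki^(-2/(m-1)), with d the Euclidean distance.
    """
    # Squared Euclidean distances between every point and every center, shape (N, C).
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Hypothetical toy data: two areas sitting on the centers and one halfway between them.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]])
V = np.array([[0.0, 0.0], [1.0, 0.0]])

U = fcm_memberships(X, V)      # fuzzy memberships, each row sums to 1
hard = U.argmax(axis=1)        # a hard label discards the 50/50 split of the third area
print(U)
print(hard)
```

The third area belongs equally to both clusters under the fuzzy model, whereas a hard label has to pick one of them arbitrarily; this is the kind of information loss that motivates fuzzy methods for GDA.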


fuzzy geographically clustering algorithm has better clustering quality than other relevant algorithms such as NE [6], FGWC

state-of-the-art algorithm for the GDA problem.

membership matrix. However, there exist some problems in this process that may decrease the clustering quality. Let us analyze this consideration more deeply.

Euclidean similarity measure in clustering algorithms often has a high error rate and is sensitive to noise and outliers, since this measure is not suitable for ordinal data, where preferences are listed according to rank instead of according to actual real values. Furthermore, the Euclidean measure cannot determine the correlation between user profiles that

measure results in the contributions to the clustering results among all features being the same, which leads to a strong dependence of clustering on the sample space. For those reasons, a better similarity measure should be used instead of the Euclidean function to obtain high clustering quality.

membership matrix, the hesitation level and the typicality by solving the optimization problem. The membership

not updated by any geographical model, so the new centers, which are calculated from those values and the updated membership matrix, are not "geographically aware". Thus, the final clustering results could be far from the accurate ones in this situation.
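The dependence of the Euclidean measure on the sample space can be seen in a small numerical sketch. The attribute names (age, income), the three records and the helper names below are hypothetical and serve only to show how one attribute with a large numeric range can dominate the distance.

```python
import numpy as np

# Hypothetical records: (average age in years, average income in currency units).
a = np.array([30.0, 20000.0])
b = np.array([60.0, 20500.0])   # very different age, similar income
c = np.array([31.0, 22000.0])   # similar age, different income

def euclid(x, y):
    """Plain Euclidean distance between two attribute vectors."""
    return float(np.linalg.norm(x - y))

def feature_share(x, y):
    """Fraction of the squared Euclidean distance contributed by each attribute."""
    sq = (x - y) ** 2
    return sq / sq.sum()

d_ab = euclid(a, b)
d_ac = euclid(a, c)
print(d_ab, d_ac)              # a looks closer to b than to c despite the 30-year age gap
print(feature_share(a, b))     # the income attribute carries almost all of the distance
```

Because every attribute enters the sum with the same weight, the attribute measured on the largest scale decides the outcome, which is one reason to look for a different similarity measure.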

We clearly recognize that these two drawbacks prevent MIPFGWC from achieving high clustering quality, so they need to be thoroughly addressed. Our motivation in this article is to investigate a new method that can handle those limitations; more specifically,

(a) For the first problem, we consider kernel functions, which can be used in many applications as they provide a simple bridge from linearity to non-linearity for algorithms that can be expressed in terms of dot products. This type of similarity measure makes very good sense as a measure of difference between samples in the context of certain data, e.g., geo-demographic datasets.

giving more closely related to spatial relationships since all elements such as the cluster memberships, the hesitation level, the typicality values and the centers are "geographically aware".
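The remark that kernels bridge linearity and non-linearity for algorithms written in terms of dot products can be checked numerically. The sketch below is an illustration under stated assumptions: the degree-2 polynomial kernel and its explicit feature map are standard textbook choices used only because the map is finite, and the Gaussian kernel is parametrised with a hypothetical bandwidth `sigma` (the paper's own parametrisation is not visible in this text).

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: K(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def poly2_features(x):
    """Explicit feature map for the 2-D degree-2 kernel, so that K(x, y) = phi(x) . phi(y)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel; sigma is an assumed bandwidth parameter."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def kernel_distance_sq(x, y, kernel):
    """Squared distance in feature space: K(x, x) + K(y, y) - 2 K(x, y)."""
    return kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
outlier = np.array([1000.0, 1000.0])

# The kernel value equals an ordinary dot product after the non-linear map.
print(poly2_kernel(x, y), float(np.dot(poly2_features(x), poly2_features(y))))

# For the Gaussian kernel K(x, x) = 1, so the induced squared distance is 2 (1 - K(x, y)).
print(kernel_distance_sq(x, y, gaussian_kernel), 2.0 * (1.0 - gaussian_kernel(x, y)))

# The induced distance is bounded by 2, so a far outlier cannot dominate a sum of distances.
print(kernel_distance_sq(x, outlier, gaussian_kernel))
```

The boundedness of the Gaussian kernel-induced distance is what makes it less sensitive to outliers than the unbounded Euclidean distance.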

Our contributions in this paper are:

(a) A novel fuzzy clustering algorithm named Kernel Fuzzy Geographically Clustering (KFGC) that utilizes both the kernel

(b) Some supporting properties and theorems of KFGC are also examined in the paper. Specifically, the differences between the solutions of KFGC and those of MIPFGWC and of some variants of KFGC are theoretically validated.

(c) Experimental analysis is performed to compare the performance of KFGC with that of the relevant algorithms in terms of clustering quality.

These findings both improve the quality of results for the GDA problem and enrich the knowledge of developing

findings are significant on both the theoretical and practical sides.

Before we close this section and move to the detailed model and solutions, let us raise an important question: "How can we apply kernel functions to the clustering algorithm in order to handle the problem of similarity measures?" To answer

appropriate algorithm and kernel function for our considered problem. Through this survey, it is clear that the kernel function is often applied directly to the objective function, with the most frequently used function being the Gaussian kernel-induced distance

function and solutions, some supporting properties and theorems, and details of the new algorithm – KFGC. Specifically, in

to handle two limitations that exist in MIPFGWC. By using the Lagrangian method, the optimal solutions including the new

examine some interesting properties and theorems of solutions such as,

• Some characteristics of KFGC's solutions, e.g., the limit of membership values when the fuzzifier is large;

• The estimation of the difference between the solutions of KFGC and those of MIPFGWC and of the variants of KFGC.

directions.

2. Kernel Fuzzy Geographically Clustering

2.1. The proposed model and solutions

Suppose there is a geo-demographic dataset $X$ consisting of $N$ data points. Let us divide the dataset into $C$ groups satisfying the objective function (3).

$$\cdots \left( \cdots^{\,m} + a_2 \,(t'_{kj})^{\eta} + a_3 \,h'_{kj} \cdots \right) \cdots \; \gamma_j \sum_{k=1}^{N} \cdots \qquad (3)$$


$$a_i > 0, \quad i = 1, \ldots, 3. \qquad (17)$$

function of MIPFGWC. This handles the first limitation of MIPFGWC as shown above.

• In Eq. (3), $u'_{kj}$, $t'_{kj}$, $h'_{kj}$ are used instead of the original membership values, typicality values and hesitation level in MIPFGWC. This tackles the second problem of MIPFGWC, where the typicality values and the hesitation level are not updated by any geographical model so that the next center is not correctly calculated. According to this amendment, the membership values

function is kept intact for the update of those values.

• Inspired by the spatial bias correction of the group of kernel-based fuzzy clustering, especially the work of Yang and Tsai

$\sum_{i=1}^{C} w_{ij}$ is equivalent to $r_j$ in the work of Yang & Tsai. Nonetheless, since there has

constraints, the proposed model is getting closely related to spatial relationships.
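What it means for memberships, typicality values and hesitation levels to be "geographically aware" can be illustrated with a neighbourhood-averaging update in the style of FGWC. The sketch below is not the SIM2 formula of the paper, which is not recoverable from this text; the function name `geo_update`, the weights `alpha` and `beta` (assumed to sum to one), the adjacency matrix `W` and the three-area map are all assumptions made for illustration.

```python
import numpy as np

def geo_update(M, W, alpha=0.7, beta=0.3, normalize=False):
    """Blend each area's values with the weighted average of its neighbours' values.

    M : (N, C) matrix of memberships, typicality values or hesitation levels.
    W : (N, N) non-negative spatial weights with a zero diagonal (assumed given).
    """
    assert abs(alpha + beta - 1.0) < 1e-9
    A = W.sum(axis=1, keepdims=True)   # per-area normalising factor
    A[A == 0] = 1.0                    # isolated areas keep a zero neighbour term
    neighbour_avg = (W @ M) / A
    M_new = alpha * M + beta * neighbour_avg
    if normalize:                      # optional, e.g. for memberships that must sum to 1
        M_new = M_new / M_new.sum(axis=1, keepdims=True)
    return M_new

# Hypothetical map: three areas in a row, the middle one adjacent to both ends.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])

U_geo = geo_update(U, W)
print(U_geo)
```

Applying the same kind of update to the typicality values and hesitation levels, rather than to the memberships alone, is the idea behind making every quantity that enters the center update spatially informed.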

[Equations ending at (22): optimal solutions of the model for $u_{kj}$, $h_{kj}$ and $t_{kj}$, $j = 1, \ldots, C$, involving the kernel term $\left(1 - K(X_k, V_j)\right)^{1/(m-1)}$ and the weights $\sum_{i=1}^{C} w_{ij}$.]

2.2. Some supporting properties and theorems

[Eqs. (24)–(25): limits of the membership values $u_{kj}$.]

The results in (24) and (25) differ from those in [26] by a quantity of constants. This means that applying the SIM2 model into the objective

Property 2. Similarly, some limits of the hesitation level are

[Equations: limits of the hesitation level $h_{kj}$ and of the typicality values $t_{kj}$.]

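Property 4. If $a_2 = 0$ then $t_{kj} = 1$, $\forall k = 1, \ldots, N$; $j = 1, \ldots, C$.

$$\lim_{(m,\eta,\tau)\to(\infty,\infty,\infty)} \sum_{k=1}^{N} \frac{X_k}{1 + \sum_{l=1,\, l \neq k}^{N} \dfrac{a_1 (u'_{lj})^{m} + a_2 (t'_{lj})^{\eta} + a_3 (h'_{lj})^{\tau}}{a_1 (u'_{kj})^{m} + a_2 (t'_{kj})^{\eta} + a_3 (h'_{kj})^{\tau}}} = \frac{\sum_{k=1}^{N} X_k}{1 + (N-1) \lim_{(m,\eta,\tau)\to(\infty,\infty,\infty)} \left( \dfrac{a_1 (u'_{lj})^{m} + a_2 (t'_{lj})^{\eta} + a_3 (h'_{lj})^{\tau}}{a_1 (u'_{kj})^{m} + a_2 (t'_{kj})^{\eta} + a_3 (h'_{kj})^{\tau}} \right)} = \frac{\sum_{k=1}^{N} X_k}{N}$$

When the parameters are quite large, all the cluster centers tend to move to the central point of the dataset.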
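The same tendency can be observed in plain FCM, which is used here only as a simpler stand-in for KFGC. In the sketch below (toy data and helper names are assumptions), memberships flatten towards $1/C$ as the fuzzifier grows, and a center computed from exactly uniform memberships coincides with the mean of the dataset.

```python
import numpy as np

def fcm_memberships(X, V, m, eps=1e-12):
    """Standard FCM membership update with squared Euclidean distances."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def centers_from_memberships(X, U, m):
    """Standard FCM center update: weighted mean of the data with weights u_kj^m."""
    Wm = U ** m
    return (Wm.T @ X) / Wm.sum(axis=0)[:, None]

# Hypothetical one-dimensional layout embedded in 2-D, with two starting centers.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
V = np.array([[0.5, 0.0], [9.0, 0.0]])

U_small = fcm_memberships(X, V, m=2.0)     # fairly crisp memberships
U_large = fcm_memberships(X, V, m=200.0)   # memberships close to 1/C = 0.5
print(U_small)
print(U_large)

# With exactly uniform memberships every center is the plain mean of the data.
U_uniform = np.full((len(X), len(V)), 1.0 / len(V))
print(centers_from_memberships(X, U_uniform, m=200.0), X.mean(axis=0))
```

In other words, a very large fuzzifier removes the contrast between clusters, so the exponents should be kept moderate in practice.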

Property 6. The limits of the ratio between $u_{kj}$ and $h_{kj}$ are

[Eq. (35): limit of the ratio $u_{kj} / h_{kj}$.]

2.2.2. The difference of solutions between algorithms

Theorem 2. Supposing that similar inputs and initialization are given to both KFGC and MIPFGWC, the upper bounds of the difference between the solutions of the algorithms are:

[Eqs. (36)–(37): upper bounds of the differences.]

clustering spatial data. The definition of IFV is stated below.

$$\mathrm{IFV} = \frac{1}{C} \sum_{j=1}^{C} \left\{ \frac{1}{N} \sum_{k=1}^{N} \cdots \right\} \cdots$$

[Equation: expansion of $\mathrm{IFV}_{\mathrm{KMIPFGWC}} - \mathrm{IFV}_{\mathrm{MIPFGWC}}$ in terms of the memberships $u^{\mathrm{KMIPFGWC}}_{kj}$ and $u^{\mathrm{MIPFGWC}}_{kj}$, with terms of the form $\log_2 C - \frac{1}{N} \sum_{k=1}^{N} \log_2 u_{kj}$.]

Based upon the results in Theorem 2, we can recognize that $\mathrm{IFV}_{\mathrm{KMIPFGWC}} - \mathrm{IFV}_{\mathrm{MIPFGWC}} \ge 0$. In other words, the clustering quality of KFGC is generally better than that of MIPFGWC.
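For readers who want to experiment with the validity index used in this comparison, the following sketch computes IFV in the form commonly quoted in the fuzzy geographic clustering literature. Since the full expression is not legible in this text, the squared terms and the normalisation by `sd_max / sigma_d` are assumptions, as are the function name `ifv` and the toy data; larger values are read as better clustering quality, consistent with the inequality above.

```python
import numpy as np

def ifv(X, V, U, eps=1e-12):
    """IFV validity index (assumed form; larger is better).

    X : (N, d) data, V : (C, d) centers, U : (N, C) memberships.
    """
    N, C = U.shape
    # Per-cluster terms: mean squared membership and the squared log2 term.
    fuzz_term = (U ** 2).mean(axis=0)
    log_term = (np.log2(C) - np.log2(U + eps).mean(axis=0)) ** 2
    # Largest squared distance between two centers.
    sd_max = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2).max()
    # Average squared deviation of the data from the centers.
    sigma_d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2).mean(axis=0).mean()
    return float((fuzz_term * log_term).mean() * sd_max / sigma_d)

# Hypothetical data: two compact groups and two candidate membership matrices.
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
V = np.array([[0.1, 0.0], [5.1, 5.0]])
U_crisp = np.array([[0.99, 0.01], [0.99, 0.01], [0.01, 0.99], [0.01, 0.99]])
U_flat = np.full((4, 2), 0.5)

print(ifv(X, V, U_crisp), ifv(X, V, U_flat))
```

With the centers held fixed, the well-separated membership matrix scores higher than the uninformative one, which is the behaviour expected from a "larger is better" index.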

Secondly, the effectiveness of KFGC with and without using the Gaussian kernel function is verified.

[Eq. (47): solutions of the variant without the Gaussian kernel, written in terms of $X_k$ and $V_j$, $j = 1, \ldots, C$.]

Theorem 4. The upper bounds of the difference between the solutions of KFGC with (a.k.a. K1) and without (a.k.a. K2) using the Gaussian kernel function are:

[Eq. (48) and the following bound: upper bounds of Theorem 4.]

[Equations: solutions $u_{kj}$ and $h_{kj}$, involving $\sum_{i=1}^{C} w_{ij}$ and $1 - K(X_k, V_j)$.]

model into the objective function are:

[Equations: upper bounds of the differences, including a bound relating $T(K1)$ and $T(K3)$.]

solutions of the system

[Eq. (63): solutions of the system, $j = 1, \ldots, C$.]

Theorem 8. The upper bounds of the difference between the solutions of KFGC with (a.k.a. K1) and without (a.k.a. K4) using the standardized

[Equations: upper bounds of Theorem 8, including a bound relating $T(K1)$ and $T(K4)$.]

In this section, the KFGC algorithm is presented in detail.

Kernel Fuzzy Geographically Clustering

and other parameters: $a_u, b_u, c_u, a_t, b_t, c_t, a_h, b_h, c_h$; $m, \eta, \tau > 1$; $a_i > 0$ $(i = 1, \ldots, 3)$; $\gamma_j > 0$.
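Since the steps of KFGC itself are not legible in this text, the following sketch shows only the generic skeleton that kernel fuzzy clustering algorithms of this family share: the standard kernel fuzzy C-means (KFCM) iteration with a Gaussian kernel. It contains no geographic weighting, typicality or hesitation terms; the function names, the bandwidth `sigma`, the explicit initial centers `V0` and the toy data are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel_matrix(X, V, sigma):
    """K[k, j] = exp(-||X_k - V_j||^2 / (2 sigma^2)); sigma is an assumed bandwidth."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kfcm(X, V0, m=2.0, sigma=2.0, max_iter=100, tol=1e-6):
    """Standard KFCM iteration (not KFGC): alternate membership and center updates."""
    V = np.asarray(V0, dtype=float).copy()
    for _ in range(max_iter):
        K = gaussian_kernel_matrix(X, V, sigma)
        dist = np.maximum(1.0 - K, 1e-12)            # kernel-induced distance (up to a factor 2)
        inv = dist ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)     # memberships, rows sum to 1
        Wm = (U ** m) * K                            # kernel-weighted fuzzy weights
        V_new = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        shift = np.abs(V_new - V).max()
        V = V_new
        if shift < tol:
            break
    return U, V   # U is the membership matrix from the last completed update

# Hypothetical data: two compact groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
U, V = kfcm(X, V0=[[1.0, 1.0], [4.0, 4.0]])
labels = U.argmax(axis=1)
print(V)
print(labels)
```

Passing the initial centers explicitly keeps the example deterministic; with random initialization a kernel method of this kind can stall when two centers start inside the same group, because distant points receive almost no kernel weight.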

In this part, the experimental environment is described as follows:

C programming language and executed them on a PC with an Intel(R) Core(TM)2 Duo CPU T6570 @ 2.10 GHz (2 CPUs) and 2048 MB RAM, running Windows 7 Professional 32-bit. The experimental results are taken as the average values after 10 runs.

• Experimental dataset:

deaths, marriage and divorce on an annual basis, economic activity, educational attainment, household characteristics,

characteristics of KFGC, respectively.

capabilities of KFGC to produce results that are more closely related to spatial relationships.

– All attributes/variables of these datasets are used concurrently for the best evaluation of the algorithms.

Date posted: 16/12/2017, 04:39
