
Spatial interaction – modification model and applications to geo-demographic analysis

Le Hoang Son a,⇑, Bui Cong Cuong b, Hoang Viet Long c

a VNU University of Science, Vietnam National University, Viet Nam

b Institute of Mathematics, Vietnam Academy of Science and Technology, Viet Nam

c Faculty of Basic Sciences, University of Transport and Communications, Viet Nam

Article history:

Received 7 January 2013

Received in revised form 7 May 2013

Accepted 9 May 2013

Available online 23 May 2013

Keywords:

Fuzzy clustering

Geo-demographic analysis

Geographic effects

Spatial interaction modification model

Data mining

Abstract

In this paper, we introduce a novel model called the Spatial Interaction – Modification Model (SIM²) for the classification of spatially referenced demographic data. It is integrated with the main part of IPFGWC, the best fuzzy clustering algorithm for the geo-demographic analysis problem, to form a new method named MIPFGWC. Theoretical and experimental analyses show that MIPFGWC achieves better clustering quality than IPFGWC and other available algorithms.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Geo-Demographic Analysis (GDA) has been widely used in various applications such as the planning and distribution of products and services, the determination of common population characteristics, and the study of population variation in terms of gender, age, sex, ethnicity, etc. Results of this kind of analysis are visualized on a map as several distinct groups that represent different levels of a population characteristic, e.g. "High density of chain-smokers" and "Low density of chain-smokers". This knowledge assists policy makers in two phases: (i) understanding the causes of such a distribution of the population characteristic and (ii) devising effective policies to adjust the distribution in order to achieve a certain goal, e.g. the limitation of smoking. Thus, GDA is of great interest to engineers and managers alike.

In GDA, fuzzy clustering methods are often used to generate a specific distribution of population characteristics. Improving the quality of GDA, or of the fuzzy clustering used for GDA, is therefore an important objective for the accurate description of the distribution. In what follows, we briefly summarize some relevant works toward that goal.

Some authors such as Bezdek and Ehrlich [1], Ji et al. [5], Khashei et al. [6], Son et al. [8,9], Yin et al. [11] and Zadegan et al. [12] introduced Fuzzy C-Means (FCM) and its variants to determine the distribution of a demographic feature on a map. Feng and Flowerdew [4] presented an improvement of FCM called Neighbourhood Effects (NE) to remedy the absence of geographic factors in that algorithm. Based upon the principle of the Spatial Interaction Model (SIM) [2], NE modifies the cluster memberships by geographic parameters so that the final results are spatially referenced. Mason and Jacobson [7] improved NE in two ways: (i) the SIM model was extended to use the population factor instead of the length of the common boundary, and the new model was named Spatial Interaction Model with Population Factor (SIM-PF); and (ii) the modification of cluster memberships was executed in each iteration instead of at the end of the algorithm. Their algorithm was named FGWC. Our previous work [8] improved FGWC by integrating it with some results of Intuitionistic Fuzzy Sets (IFS) to tackle the problems of sensitivity to outliers and crisp memberships that remained in FGWC. The SIM-PF model was kept unchanged in the new algorithm, named IPFGWC. Some remarks drawn from those articles are listed below.

(a) Experimental and theoretical analyses in [8] showed that the clustering quality of IPFGWC is better than those of FGWC, NE and FCM.

(b) The SIM-PF model used in IPFGWC has some limitations that may reduce the quality of the algorithm.

(c) Theoretical analyses of the impact of the SIM-PF (SIM) model on cluster memberships, as well as of the characteristics of the model, were not found in existing articles.

0950-7051/$ - see front matter © 2013 Elsevier B.V. All rights reserved.

⇑ Corresponding author. Address: 334 Nguyen Trai, Thanh Xuan, Hanoi, Viet Nam. Tel.: +84 904171284; fax: +84 0438623938.

E-mail address: sonlh@vnu.edu.vn (L.H. Son).

Contents lists available at SciVerse ScienceDirect

Knowledge-Based Systems

journal homepage: www.elsevier.com/locate/knosys

Let us analyze the second remark more deeply. The modification of the cluster membership based on SIM-PF is described as follows:

$$u'_k = \alpha \times u_k + \beta \times \frac{1}{A} \sum_{j=1}^{C} w_{kj} \times u_j, \tag{1}$$

$$\alpha + \beta = 1, \tag{2}$$

$$w_{kj} = (pop_k \times pop_j)^b / d_{kj}^a. \tag{3}$$

In Eqs. (1)–(3), $u'_k$ ($u_k$) is the new (old) cluster membership of area $k$. The two parameters $\alpha$ and $\beta$ are scaling variables. $A$ is a factor that scales the "sum" term to the range 0 to 1. $w_{kj}$ is the weight between two areas $k$ and $j$, showing the level of influence of one area upon another. It is calculated through Eq. (3), where $pop_k$ ($pop_j$) and $d_{kj}$ are the population of area $k$ ($j$) and the distance between these areas, respectively; $a$ and $b$ are user-defined parameters.
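The SIM-PF modification can be illustrated with a minimal Python sketch. The function names, the toy populations and distances, and the choice of A as the largest neighbour sum are assumptions made for illustration only; the text says only that A scales the sum term to the range 0 to 1.

```python
def sim_pf_weights(pop, dist, a=1.0, b=1.0):
    """Population/distance weights in the style of Eq. (3); w[k][k] is left at 0."""
    C = len(pop)
    w = [[0.0] * C for _ in range(C)]
    for k in range(C):
        for j in range(C):
            if k != j:
                w[k][j] = (pop[k] * pop[j]) ** b / dist[k][j] ** a
    return w


def sim_pf_update(u, w, alpha, beta):
    """One SIM-PF pass: every new membership is computed from the OLD vector u."""
    C = len(u)
    sums = [sum(w[k][j] * u[j] for j in range(C)) for k in range(C)]
    A = max(sums) or 1.0  # assumption: scale by the largest neighbour sum
    return [alpha * u[k] + beta * sums[k] / A for k in range(C)]


# Hypothetical toy data: three areas, memberships of one cluster.
weights = sim_pf_weights([3, 2, 4], [[0, 10, 8], [10, 0, 5], [8, 5, 0]])
updated = sim_pf_update([0.2, 0.5, 0.3], weights, alpha=0.5, beta=0.5)
```

Because `sim_pf_update` reads only the old vector `u`, the update of one area never sees the already modified value of another area, which is the behaviour criticised in Example 1.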

We now give three examples to illustrate the weaknesses of the SIM-PF model.

Example 1 (The limitation of the model). In Eq. (1), a newly updated cluster membership cannot be used in the modification of the next ones. For example, in Fig. 1, area "A" is affected by the others according to the SIM-PF model and moves to the new location "A'". Fig. 2 shows the next modification step of SIM-PF on area "B". However, instead of using "A'", this model still uses "A" for the update of area "B". This may lead to an inaccurate modification that decreases the quality of the algorithm.

Example 2 (The limitation of the weight – Common Boundary). The weight in Eq. (3) does not account for areas having common boundaries. Thus, neighboring areas with low populations may receive a smaller weight than distant areas with high populations. In Fig. 3, we illustrate three areas "A", "B" and "C". Comparing the distances between the three centers, we see that "A" is nearer to "C" than to "B". The populations of {"A", "C"} are larger than those of {"A", "B"}. According to Eq. (3), the weight $w_{AC}$ is definitely larger than $w_{AB}$. However, areas {"A", "B"} have a common boundary containing the element "8" while {"A", "C"} do not. Naturally, one may assume that {"A", "B"} are more closely related to each other than {"A", "C"}. Indeed, it should be $w_{AB} > w_{AC}$.

Example 3 (The limitation of the weight – Immigration). The weight in Eq. (3) cannot reflect immigration behavior, which is the key demographic factor showing the strong connection between groups having high densities of immigrating elements. In Fig. 4, we illustrate three areas in which elements "6" and "8" moved from area "B" to "A", and element "4" went from "A" to "B". The formula in Eq. (3) shows that "A" is related to "C" more than to "B". However, it should be $w_{AB} > w_{AC}$, since the historic immigration points to the interaction between "A" and "B".

Those examples show the need for a new model that can ameliorate the limitations of SIM-PF. Additionally, theoretical analyses of the new model, such as the measurement of the influence between two areas, the difference of cluster memberships between SIM-PF and the new model, and the suitable selection of parameters for the best quality of the algorithm, should be considered. The relevant articles lacked those systematic analyses, as noted in remark (c) above. Last but not least, an improvement of IPFGWC that includes the new model is constructed; the clustering quality of the new algorithm is expected to be better than that of IPFGWC since the limitations of the SIM-PF model are remedied. Those considerations are the motivation and objectives of this article.

The rest of the paper and our contributions are organized as follows. Section 2 presents our main contribution, including the novel model named Spatial Interaction – Modification Model (SIM²) and some theoretical analyses of it. Specifically, in Section 2.1, we examine a new weighting function that handles the limitations of the weight stated in Examples 2 and 3. Moreover, some properties and theorems of that function, which are useful for determining the values of some parameters, are investigated, such as:

• The upper bound of the average influence of two arbitrary areas, which helps us determine the maximal value of any weight.

• The relation between some parameters of the weighting function that ensures the same influence of two arbitrary pairs of areas under both the perfect and imperfect assumptions.

• The sum of weights when the number of areas is very large.

In Section 2.2, a new model that handles the limitation stated in Example 1 is considered. We also investigate some properties and theorems such as:

• The difference of cluster memberships between the SIM-PF and SIM² models.

• Conditions on the parameters for $\{u'_k\}$ to be a completely monotone increasing sequence.

• The relation between a monotone increasing sequence and a completely monotone increasing one.

• The suitable selection of parameters for the best quality of the algorithm in Section 3.

The results of Sections 2.1 and 2.2 are called the SIM² model. In Section 3, we integrate SIM² with the main part of IPFGWC to form the new algorithm called Modified IPFGWC (MIPFGWC). Section 4 validates the proposed approach through a set of experiments involving real-world data. Finally, Section 5 draws the conclusions and delineates future research directions.

Fig. 2. The second modification step of the SIM-PF model.

Fig. 3. The common boundary.

2. Spatial interaction modification model

2.1. A new weighting function

Definition 1. A Weighting Function (WF) $w$ is a function

$$(k, j) \mapsto w_{kj} = \begin{cases} \dfrac{(pop_k \times pop_j)^b \times p_{kj}^c \times IM_{kj}^d}{d_{kj}^a}, & k \neq j, \\ 0, & k = j, \end{cases} \tag{4}$$

where $pop_k$ ($pop_j$) is the population of area $k$ ($j$); $d_{kj}$ is the distance between those areas; and $p_{kj}$ is the maximal distance between elements in the common boundary of the two areas. If these areas do not have a common boundary, or there is only one element in the boundary, $p_{kj}$ is set to one. $IM_{kj}$ is the total number of elements immigrating from area $k$ to $j$ and vice versa. It measures the historic immigration in the previous calculation step. If no immigration between the two areas is found, $IM_{kj} = 1$. Finally, the four numbers $a$, $b$, $c$, $d$ are parameters specified by some theorems in this sub-section. Some constraints are attached to Eq. (4) as follows.

$$\begin{cases} \sum_{k=1}^{C} pop_k = N_0, \\ p_{kj} \le d_{kj}, \\ IM_{kj} \le pop_k + pop_j, \end{cases} \tag{5}$$

where $C$ ($N_0$) is the total number of areas (population and elements in common boundaries). When $p_{kj} = 0$, $N_0$ is equal to the total population ($N$).

Example 4. In Fig. 4, assume that $a = b = c = d = 1$, $d_{AB} = 10$ and $d_{AC} = 8$. We recalculate the weights of all areas according to Eq. (4):

$$w_{AB} = [(3 \times 2) \times 1 \times 3]/10 = 1.8; \quad w_{AC} = [(3 \times 4) \times 1 \times 1]/8 = 1.5.$$
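The arithmetic of Example 4 can be reproduced with a short Python sketch. The function name and argument names are illustrative assumptions; the numbers are the ones quoted in the example, with the SIM-PF case obtained by setting c = d = 0.

```python
def weight(pop_k, pop_j, d_kj, p_kj=1.0, im_kj=1.0, a=1.0, b=1.0, c=1.0, d=1.0):
    """Weight between two distinct areas k and j, following the form of Eq. (4)."""
    return (pop_k * pop_j) ** b * p_kj ** c * im_kj ** d / d_kj ** a


# Values quoted in Example 4: pop_A = 3, pop_B = 2, pop_C = 4, IM_AB = 3.
w_ab = weight(3, 2, 10, p_kj=1, im_kj=3)
w_ac = weight(3, 4, 8)

# Same pairs with c = d = 0, which falls back to the SIM-PF weight of Eq. (3).
w_ab_simpf = weight(3, 2, 10, p_kj=1, im_kj=3, c=0, d=0)
w_ac_simpf = weight(3, 4, 8, c=0, d=0)
```

With the immigration term included, the pair {"A", "B"} outweighs {"A", "C"}; with c = d = 0 the ordering is reversed, which is the situation described in Example 3.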

Some special cases are shown below:

$$(c = d = 0) \vee (p_{kj} = 1 \wedge IM_{kj} = 1): \; w \to w_{SIM\text{-}PF}. \tag{6}$$

We now investigate some properties and theorems of the weighting function.

Property 1. It is easy to check that WF satisfies:

(a) Commutative: $w_{k,j} = w_{j,k}$ ($\forall j, k \wedge j \neq k$);

(b) $\forall j: w_{\phi,j} = 0$, where $\phi$ is an area having no population;

(c) Cardinality: $|w| = C^2$. (7)

Theorem 1. The upper bound of the average influence of area $k$ on another one is

$$w_k^{AVG} \le (1/C) \times (pop_k \times N_0)^b \times (pop_k + N_0)^d. \tag{8}$$

Proof. From Eqs. (4) and (5), we have

$$\sum_{j=1}^{C} w_{kj} = pop_k^b \sum_{j=1}^{C} \left( pop_j^b \times IM_{kj}^d \times \frac{p_{kj}^c}{d_{kj}^a} \right) \le pop_k^b \sum_{j=1}^{C} pop_j^b \times IM_{kj}^d.$$

Using the Hölder inequality, we obtain

$$\sum_{j=1}^{C} w_{kj} \le pop_k^b \left( \sum_{j=1}^{C} pop_j \right)^b \times \left( \sum_{j=1}^{C} IM_{kj} \right)^d \le (pop_k \times N_0)^b \left( \sum_{j=1}^{C} IM_{kj} \right)^d.$$

From constraint (5) and the Minkowski inequality, we get

$$\sum_{j=1}^{C} w_{kj} \le (pop_k \times N_0)^b \times (pop_k + N_0)^d. \tag{13}$$

Thus, we obtain the result in Eq. (8). □

Consequence 1. By a similar proof, we obtain the upper bound of the average influence of an arbitrary area on another one as follows:

$$w^{AVG} = (1/C)^2 \sum_{k=1}^{C} \sum_{j=1}^{C} w_{kj} \le 2 N_0^{2b+d}. \tag{14}$$

This result gives us an estimate of the upper bound of the influence $w_{kj}$, $\forall k \neq j$. This means that wherever two areas are on the map, the maximal impact between them is bounded as in Eq. (14). In what follows, we consider another theorem of the weighting function.

Fig. 4. Immigration interaction.

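Theorem 2. Given the following assumptions,

(a) All areas are not empty and do not intersect the others; (15)

(c) $d_{kj} = d_{ki} + \varepsilon$ with $k \neq i \neq j$, where $\varepsilon$ is an infinitesimal; (17)

(d) $IM_{kj} = (pop_k + pop_j)/2$, (18)

the condition ensuring that the influences from area $k$ on the others are equal is:

$$(1 + \varepsilon)^a > \left[ \frac{2}{(N_0 - C + 1)(N_0 - C + 2)} \right]^{\frac{b+d}{2}}. \tag{19}$$

Proof. Consider the following lemma: for any $a$, $b$, $x$, $y$ satisfying condition (20), the following inequality holds:

$$x^a y^b \ge (xy)^{\frac{a+b}{2}}. \tag{21}$$

Indeed, Eq. (21) can be obtained through the transformations below:

$$\Leftrightarrow a \times \ln x + b \times \ln y \ge a \times \ln y + b \times \ln x, \tag{23}$$

$$\Leftrightarrow 2 \times (a \times \ln x + b \times \ln y) \ge (a + b)(\ln x + \ln y). \tag{24}$$

Because the influences from area $k$ on the others are equal, we have

$$w_{kj} = w_{ki}, \tag{26}$$

$$\Leftrightarrow \frac{(pop_k \times pop_j)^b \times p_{kj}^c \times IM_{kj}^d}{d_{kj}^a} = \frac{(pop_k \times pop_i)^b \times p_{ki}^c \times IM_{ki}^d}{d_{ki}^a}, \tag{27}$$

$$\Leftrightarrow \left(\frac{pop_j}{pop_i}\right)^b \times \left(\frac{IM_{kj}}{IM_{ki}}\right)^d = \left(\frac{d_{kj}}{d_{ki}}\right)^a \times \left(\frac{p_{ki}}{p_{kj}}\right)^c. \tag{28}$$

From constraints (15), (17) and (18), Eq. (28) is rewritten as

$$\left(\frac{pop_j}{pop_i}\right)^b \times \left(\frac{pop_k + pop_j}{pop_k + pop_i}\right)^d = \left(1 + \frac{\varepsilon}{d_{ki}}\right)^a \approx (1 + \varepsilon)^a. \tag{29}$$

Without loss of generality, we can assume that $pop_j \le pop_i$, $\forall i \neq j \neq k$. Applying lemma (21) to the left side of Eq. (29), we have

$$\left(\frac{pop_j}{pop_i}\right)^b \times \left(\frac{pop_k + pop_j}{pop_k + pop_i}\right)^d \ge \left[\frac{pop_j}{pop_i} \times \frac{pop_k + pop_j}{pop_k + pop_i}\right]^{\frac{b+d}{2}}. \tag{30}$$

Because of constraint (15), the right side of Eq. (30) is minimal if its numerator is smallest and its denominator is largest. Thus, $pop_k = pop_j = 1$ and $pop_i = N_0 - C + 1$:

$$\left[\frac{pop_j}{pop_i} \times \frac{pop_k + pop_j}{pop_k + pop_i}\right]^{\frac{b+d}{2}} \ge \left[\frac{2}{(N_0 - C + 1)(N_0 - C + 2)}\right]^{\frac{b+d}{2}}. \tag{31}$$

From Eqs. (29)–(31), we obtain the result in Eq. (19). □

Theorem 2 shows the relation between some parameters that ensures the same impact from area $k$ on the others. If $a > 0$, Eq. (19) always holds. Otherwise, it helps us choose suitable values of $a$, $b$, $d$ given the number of areas and the population.

Consequence 2. By a similar proof, we obtain the condition for the same influence of two arbitrary pairs of areas:

$$(1 + \varepsilon)^a > \left[\frac{2}{N - C + 2}\right]^{\frac{b+d}{2}}. \tag{32}$$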

Theorem 3. If we replace one assumption in Theorem 2 with one of those below, then the new condition ensuring that the influences from area $k$ on the others are equal is as follows.

(a) $IM_{kj} = \min\{pop_k, pop_j\}$. The condition is:

$$(1 + \varepsilon)^a > \frac{1}{N_0 - C + 1}. \tag{33}$$

(b) Assume that $\{d_{kj}\}$, $j = \overline{1, C}$, $j \neq k$ is a Padovan sequence:

$$\begin{cases} d_{k1} = d_{k2} = d_{k3} = D_0, \\ d_{k,j} = d_{k,j-2} + d_{k,j-3}, \end{cases} \quad j = \overline{4, C},\ j \neq k,\ D_0: \text{constant}.$$

The condition is:

$$P(C)^a > \left[\frac{2}{(N_0 - C + 1)(N_0 - C + 2)}\right]^{\frac{b+d}{2}}, \tag{34}$$

where

$$P(C) = \frac{1 + r_1}{r_1^{C+2}(2 + 3r_1)} + \frac{1 + r_2}{r_2^{C+2}(2 + 3r_2)} + \frac{1 + r_3}{r_3^{C+2}(2 + 3r_3)}, \tag{35}$$

and $r_i$ ($i = \overline{1, 3}$) is the $i$th root of the equation $x^3 + x^2 - 1 = 0$.

(c) $p_{kj} = d_{kj}/\delta$ with $\delta$ being a constant. The condition is:

$$(1 + \varepsilon)^{a - c} > \left[\frac{2}{(N_0 - C + 1)(N_0 - C + 2)}\right]^{\frac{b+d}{2}}. \tag{36}$$

In Theorem 2, the assumptions are quite ideal. We will certainly not see any part of the real world containing areas that do not overlap the others or are equidistant from a considered one. The purpose of this theorem is to examine the relation between some parameters of the weighting function that ensures the same influence from a given area on the others under perfect conditions. However, if one of these conditions does not hold, what does the new relation become? Theorem 3 helps us answer this by replacing an assumption in Theorem 2 with a new one. For example, if we use the minimum operator instead of the average one, the new relation is set up as in Eq. (33), which is looser than Eq. (19). Similarly, when the distances from a given area to the others form a Padovan sequence, the new relation described in Eq. (34) is also looser than that in Theorem 2. If all areas intersect the others, then we get the result in Eq. (36), which is stricter than Eq. (19). Thus, the cases in Theorem 3 give us a reference for the relations between the parameters in different environments.

Now, we investigate the sum of weights when the number of areas is very large. Assume that we have the following constraints:

(a) Assumptions (15), (17) and (18) of Theorem 2; (37)

(c) $\{pop_k\}$, $k = \overline{1, C}$ is a Triangular sequence. (39)

Property 2.

(a)

$$\sum_{k=1}^{\infty}\sum_{j=1}^{\infty} w_{kj} = \frac{1}{D_0} \lim_{C \to \infty} \frac{1}{3(C+1)^2}\, 4\big(\pi^2 C^2 - 9C^2 - 6C^2\psi^{(1)}(C+1) + 2\pi^2 C - 12C - 12C\psi^{(1)}(C+1) - 6\psi^{(1)}(C+1) + \pi^2\big), \tag{40}$$

$$= \frac{4}{3D_0}(\pi^2 - 9), \tag{41}$$

where $\psi^{(n)}(x)$ is the $n$th derivative of the digamma function, $2b + d = 2$, and $D_0$ is the distance from an arbitrary area to area $k$.

(b)

$$\sum_{k=1}^{\infty}\sum_{j=1}^{\infty} w_{kj} = \frac{80 - 8\pi^2}{D_0}, \quad \text{where } 2b + d = 3. \tag{42}$$

From this property, we see that the two consecutive series differ by $(92 - 28\pi^2/3)/D_0$.
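The limit in Eq. (40) and the stated difference between the two series can be checked numerically. The Python sketch below is only an illustration: the function names are assumptions, all quantities are coefficients of $1/D_0$, and the trigamma value at an integer argument is computed from the standard identity $\psi^{(1)}(n+1) = \pi^2/6 - \sum_{i=1}^{n} 1/i^2$.

```python
import math


def trigamma_at(n):
    """psi^(1)(n + 1) for a non-negative integer n, via pi^2/6 - sum_{i<=n} 1/i^2."""
    return math.pi ** 2 / 6 - sum(1.0 / (i * i) for i in range(1, n + 1))


def partial_sum_40(C):
    """Finite-C expression inside the limit of Eq. (40), without the 1/D0 factor."""
    psi1 = trigamma_at(C)
    inner = (math.pi ** 2 * C ** 2 - 9 * C ** 2 - 6 * C ** 2 * psi1
             + 2 * math.pi ** 2 * C - 12 * C - 12 * C * psi1
             - 6 * psi1 + math.pi ** 2)
    return 4.0 * inner / (3.0 * (C + 1) ** 2)


limit_a = 4.0 * (math.pi ** 2 - 9.0) / 3.0      # Eq. (41), case 2b + d = 2
limit_b = 80.0 - 8.0 * math.pi ** 2             # Eq. (42), case 2b + d = 3
difference = 92.0 - 28.0 * math.pi ** 2 / 3.0   # stated difference of the two series
```

Evaluating `partial_sum_40` for growing C shows the expression settling at `limit_a`, and `limit_b - limit_a` agrees with `difference`.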

Property 3.

(a)

$$\sum_{k=1}^{\infty}\sum_{i=1}^{\infty}\sum_{j=1}^{\infty} (w_{ki} \times w_{kj}) = \frac{1}{D_0^2} \lim_{C \to \infty} \frac{1}{45(C+1)^4}\, 16\big(\pi^4 C^4 + 150\pi^2 C^4 - 1575C^4 - 900C^4\psi^{(1)}(C+1) - 15C^4\psi^{(3)}(C+1) + 4\pi^4 C^3 + 600\pi^2 C^3 - 5400C^3 - 3600C^3\psi^{(1)}(C+1) - 60C^3\psi^{(3)}(C+1) + 6\pi^4 C^2 + 900\pi^2 C^2 - 6300C^2 - 5400C^2\psi^{(1)}(C+1) - 90C^2\psi^{(3)}(C+1) + 4\pi^4 C + 600\pi^2 C - 2520C - 3600C\psi^{(1)}(C+1) - 60C\psi^{(3)}(C+1) - 900\psi^{(1)}(C+1) - 15\psi^{(3)}(C+1) + \pi^4 + 150\pi^2\big), \tag{43}$$

$$= \frac{16}{45 D_0^2}\,(-1575 + 150\pi^2 + \pi^4), \quad \text{where } 2b + d = 2. \tag{44}$$

ðbÞ X1

k¼1

X1

i¼1

X1

j¼1

ðwki wkjÞ ¼P

2

þP4

107D20 ;where 2b þ d ¼ 3: ð45Þ

Similarly, the difference between two consecutive series in

Property 3is ð2696400  256755P2 1667P4Þ=D2

2.2. A new model

Definition 2. The Spatial Interaction – Modification Model (SIM²) is defined as

$$u'_k = \alpha \times u_k + \beta \sum_{j=1}^{k-1} w_{kj} \times u'_j + \gamma \frac{1}{A} \sum_{j=k}^{C} w_{kj} \times u_j, \tag{46}$$

$$\alpha + \beta + \gamma = 1. \tag{47}$$

In this model, the newly updated areas appear within Eq. (46). They contribute to shifting the next areas toward the goal state. $(\alpha, \beta, \gamma)$ are user-defined parameters satisfying constraint (47). In Eq. (46), the weighting function $w_{kj}$ is defined in Eq. (4). The last parameter, $A$, is a scaling variable that forces the sum of cluster memberships to one. It is often calculated through the maximum operator. A special case of SIM2 model is:

Similar to the previous section, we examine some important theorems derived from Definition 2.
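The update of Definition 2 can be sketched in Python as a single sweep over the areas, in which areas with a smaller index contribute their already updated memberships. This is a minimal sketch under stated assumptions: the function name and the toy data are hypothetical, and A is taken as the largest sum over the not-yet-updated neighbours, since the text says only that A is often calculated through the maximum operator.

```python
def sim2_update(u, w, alpha, beta, gamma):
    """One SIM2-style sweep in the form of Eq. (46).

    u : old memberships of the C areas in one cluster
    w : C x C weight matrix with zeros on the diagonal
    """
    C = len(u)
    # Assumption: A is the largest sum over old memberships of areas j >= k.
    A = max(sum(w[k][j] * u[j] for j in range(k, C)) for k in range(C)) or 1.0
    new_u = []
    for k in range(C):
        updated = sum(w[k][j] * new_u[j] for j in range(k))   # j < k: new values
        pending = sum(w[k][j] * u[j] for j in range(k, C))    # j >= k: old values
        new_u.append(alpha * u[k] + beta * updated + gamma * pending / A)
    return new_u


# Hypothetical toy data: three areas on a line, each linked to its neighbour.
toy_w = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
result = sim2_update([0.2, 0.5, 0.3], toy_w, alpha=0.5, beta=0.3, gamma=0.2)
```

Unlike a SIM-PF pass, the second and third entries of `result` depend on the new value of the first area, which is the remedy for the limitation described in Example 1.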

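Theorem 4. The maximal difference of cluster memberships using the SIM-PF and SIM² models is:

$$MD = \left(\alpha\beta - \frac{\gamma}{A}\right)[N_0(N_0 - C + 1)]^b (2N_0 - C + 1)^d + 2(C - 1)C\,\frac{\beta\gamma}{A}\,N_0^{2b+d}. \tag{49}$$

Proof. Fix index $k$; we have

$$(u'_k)_{SIM^2} - (u'_k)_{SIM\text{-}PF} = \sum_{j=1}^{k-1} w_{kj}\left(\beta \times u'_j - \frac{\gamma}{A} u_j\right). \tag{50}$$

For simplicity, we assume that $(u'_j)_{SIM^2} = (u'_j)_{SIM\text{-}PF}$, $\forall j < k$. Eq. (50) is now rewritten as

$$(u'_k)_{SIM^2} - (u'_k)_{SIM\text{-}PF} = \sum_{j=1}^{k-1} w_{kj}\left[\left(\alpha\beta - \frac{\gamma}{A}\right) \times u_j + \frac{\beta\gamma}{A} \times \sum_{i=1}^{C}(w_{ji} \times u_i)\right], \tag{51}$$

$$(u'_k)_{SIM^2} - (u'_k)_{SIM\text{-}PF} \le \sum_{j=1}^{C} u_j\left[\left(\alpha\beta - \frac{\gamma}{A}\right) \times w_{kj} + \frac{\beta\gamma}{A} \times \sum_{i=1}^{k-1}(w_{ki} \times w_{ij})\right]. \tag{52}$$

Applying the Hölder inequality to Eq. (52), we get

$$(u'_k)_{SIM^2} - (u'_k)_{SIM\text{-}PF} \le \sum_{j=1}^{C} u_j \times \sum_{j=1}^{C}\left[\left(\alpha\beta - \frac{\gamma}{A}\right) \times w_{kj} + \frac{\beta\gamma}{A} \times \sum_{i=1}^{k-1}(w_{ki} \times w_{ij})\right]. \tag{53}$$

Because of the property of the cluster memberships $\sum_{j=1}^{C} u_j = 1$, and applying the Minkowski inequality to the right side of Eq. (53), we obtain

$$(u'_k)_{SIM^2} - (u'_k)_{SIM\text{-}PF} \le \sum_{j=1}^{C}\left(\alpha\beta - \frac{\gamma}{A}\right) \times w_{kj} + \frac{\beta\gamma}{A}\sum_{j=1}^{C}\sum_{i=1}^{k-1}(w_{ki} \times w_{ij}). \tag{54}$$

Applying the result in Eq. (13) of Theorem 1, we get

$$\sum_{j=1}^{C}\left(\alpha\beta - \frac{\gamma}{A}\right) \times w_{kj} \le \left(\alpha\beta - \frac{\gamma}{A}\right) \times (pop_k \times N_0)^b \times (pop_k + N_0)^d. \tag{55}$$

Following Consequence 1,

$$\sum_{j=1}^{C}\sum_{i=1}^{k-1}(w_{ki} \times w_{ij}) \le 2N_0^{2b+d}(k - 1)C. \tag{56}$$

From Eqs. (54)–(56), we obtain the maximal difference of the $k$th cluster membership between the two models SIM² and SIM-PF:

$$(u'_k)_{SIM^2} - (u'_k)_{SIM\text{-}PF} \le \left(\alpha\beta - \frac{\gamma}{A}\right) \times (pop_k \times N_0)^b \times (pop_k + N_0)^d + \frac{2\beta\gamma}{A} N_0^{2b+d}(k - 1)C. \tag{57}$$

Thus, the maximal difference of cluster memberships using the SIM-PF and SIM² models is

$$\left\|(u')_{SIM^2} - (u')_{SIM\text{-}PF}\right\| = \max_{k=\overline{1,C}}\left|(u'_k)_{SIM^2} - (u'_k)_{SIM\text{-}PF}\right| \le MD, \tag{58}$$

where $MD$ is stated in Eq. (49). □

Theorem 4 gives us an estimate of the difference of cluster memberships between the two models. Now, let us consider some further definitions.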

Definition 3. Vector $a = (a_1, a_2, \ldots, a_n)$ is said to be $\tau$-larger than $b = (b_1, b_2, \ldots, b_n)$ if

$$(a > b)_\tau \Leftrightarrow \#\{a_i > b_i\} = \tau, \quad \forall i = \overline{1, n}, \tag{59}$$

where $\tau$ is an integer in $[0, n]$. If $\tau = n$, then $a$ is completely larger than $b$, and this relationship is denoted by $a \succ b$.

Definition 4 (Ordered relationship of vectors). Vector $a = (a_1, a_2, \ldots, a_n)$ is said to be larger than vector $b = (b_1, b_2, \ldots, b_n)$, denoted by $a > b$, if there exist two maximal values $\tau_1$, $\tau_2$ that satisfy $(a > b)_{\tau_1}$ and $(b > a)_{\tau_2}$ with $\tau_1 > \tau_2$.

Property 4.

(a) $(a > b)_0 \Leftrightarrow \exists i \le n: (b > a)_i$;

(b) $a = b \Leftrightarrow (a > b)_0 \wedge (b > a)_0$;

(c) $0 \le \tau_1 + \tau_2 \le n$;

(d) $a \succ b = (a > b)_n$. (60)

Example 5. Given $a = (5, 8, 3, 6, 9)$ and $b = (5, 8, 1, 7, 2)$, we easily see that $(a > b)_2$ and $(b > a)_1$. Moreover, $a > b$.
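The counting in Example 5 can be reproduced with a few lines of Python. The helper names below are illustrative assumptions; they implement the component count of Definition 3 and the comparison of Definition 4.

```python
def count_larger(a, b):
    """Number of positions i with a[i] > b[i], i.e. the count in Eq. (59)."""
    return sum(1 for x, y in zip(a, b) if x > y)


def is_larger(a, b):
    """Definition 4: a > b when a wins on more components than b does."""
    return count_larger(a, b) > count_larger(b, a)


def is_completely_larger(a, b):
    """Definition 3 with the count equal to n: a wins on every component."""
    return count_larger(a, b) == len(a)


vec_a = (5, 8, 3, 6, 9)
vec_b = (5, 8, 1, 7, 2)
wins_a = count_larger(vec_a, vec_b)
wins_b = count_larger(vec_b, vec_a)
```

For these vectors `wins_a` is 2 and `wins_b` is 1, so `vec_a` is larger than `vec_b` without being completely larger.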

Definition 5. A sequence of vectors $\{X_k, k = \overline{1, M}\}$ is said to be monotone increasing if $X_{k+1} > X_k$, $\forall k = \overline{1, M - 1}$. If $X_{k+1} \succ X_k$, then $\{X_k, k = \overline{1, M}\}$ is called completely monotone increasing.

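Theorem 5. The conditions on the parameters ensuring that the sequence $\{u'_k\}$ of the SIM² model is completely monotone increasing are:

$$\beta \ge \max_{k=\overline{2,C}}\ \max_{i=\overline{k+1,C}} \frac{w_{k-1,i} - w_{k,i}}{\sum_{j=1}^{k-1}(w_{k,j} - w_{k-1,j})\,w_{j,i}}, \tag{61}$$

$$\frac{\gamma}{A\alpha} \ge \max_{k=\overline{2,C}}\ \max_{i=\overline{1,k-2}} \frac{w_{k-1,i} - w_{k,i}}{\sum_{j=1}^{k-1}(w_{k,j} - w_{k-1,j})\,w_{j,i}}, \tag{62}$$

$$\frac{\beta\gamma}{A\alpha} \ge \max_{k=\overline{2,C}} \frac{1 - \beta \times w_{k,k-1}}{\sum_{j=1}^{k-1}(w_{k,j} - w_{k-1,j})\,w_{j,k-1}}, \tag{63}$$

$$\alpha \ge \frac{\gamma}{A} \max_{k=\overline{2,C}}\left[w_{k-1,k} - \beta\sum_{j=1}^{k-1}(w_{k,j} - w_{k-1,j})\,w_{j,k}\right]. \tag{64}$$

Proof. From Definition 2, we have:

$$u'_k - u'_{k-1} = \alpha \times (u_k - u_{k-1}) + \beta\sum_{j=1}^{k-2}(w_{k,j} - w_{k-1,j})\,u'_j + \beta \times w_{k,k-1} \times u'_{k-1} + \frac{\gamma}{A}\sum_{j=k}^{C}(w_{k,j} - w_{k-1,j})\,u_j, \quad \forall k = \overline{2, C}, \tag{65}$$

$$\Rightarrow u'_k - u'_{k-1} = \alpha \times [u_k - (1 - \beta \times w_{k,k-1})\,u_{k-1}] + \frac{\beta\gamma}{A}\sum_{j=1}^{k-1}\sum_{i=1}^{C}(w_{k,j} - w_{k-1,j}) \times w_{j,i} \times u_i + \alpha\beta\sum_{j=1}^{k-2}(w_{k,j} - w_{k-1,j})\,u_j + \frac{\gamma}{A}\sum_{j=k}^{C}(w_{k,j} - w_{k-1,j})\,u_j. \tag{66}$$

From Eq. (66), we obtain the linear combination in Eqs. (67) and (68):

$$u'_k - u'_{k-1} = \theta_1 u_1 + \theta_2 u_2 + \cdots + \theta_{k-2} u_{k-2} + \theta_{k-1} u_{k-1} + \theta_k u_k + \theta_{k+1} u_{k+1} + \cdots + \theta_C u_C \ge 0, \tag{67}$$

$$\theta_i = \begin{cases} \dfrac{\beta\gamma}{A}\sum_{j=1}^{k-1}(w_{k,j} - w_{k-1,j})\,w_{j,i} + \alpha\beta\,(w_{k,i} - w_{k-1,i}), & i = \overline{1, k-2}, \\[2mm] \dfrac{\beta\gamma}{A}\sum_{j=1}^{k-1}(w_{k,j} - w_{k-1,j})\,w_{j,k-1} - \alpha\,(1 - \beta \times w_{k,k-1}), & i = k - 1, \\[2mm] \dfrac{\beta\gamma}{A}\sum_{j=1}^{k-1}(w_{k,j} - w_{k-1,j})\,w_{j,k} + \alpha + \dfrac{\gamma}{A}\,(w_{k,k} - w_{k-1,k}), & i = k, \\[2mm] \dfrac{\beta\gamma}{A}\sum_{j=1}^{k-1}(w_{k,j} - w_{k-1,j})\,w_{j,i} + \dfrac{\gamma}{A}\,(w_{k,i} - w_{k-1,i}), & i = \overline{k+1, C}. \end{cases} \tag{68}$$

Inequality (67) holds when $\theta_i \ge 0$, $\forall i = \overline{1, C}$. Using Eq. (68), we obtain the conditions on the parameters for $u'_k - u'_{k-1} \ge 0$ where $k$ is fixed. Repeating the same process for the other $k = \overline{2, C}$, we get the results in Eqs. (61)–(64). □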

Theorem 6.

(a) If $\{u'_k\}$ is completely monotone increasing, then it is monotone increasing.

(b) Conversely, if $\{u'_k\}$ is monotone increasing and all parameters of the SIM² model satisfy conditions (61)–(64), then it is completely monotone increasing.

Proof.

(a) By Definition 5, $\{u'_k, k = \overline{1, C}\}$ is completely monotone increasing. Thus, $u'_{k+1} \succ u'_k$, $\forall k = \overline{1, C - 1}$. From Definition 4, $\exists\, \tau_1 = N \wedge \tau_2 = 0$: $(u'_{k+1} > u'_k)_N$ and $(u'_k > u'_{k+1})_0$. Applying Property 4, we get

$$u'_{k+1} \succ u'_k = (u'_{k+1} > u'_k)_N \wedge (u'_k > u'_{k+1})_0 \to u'_{k+1} > u'_k.$$

Repeating the same argument for the other $k = \overline{1, C - 1}$, we obtain the conclusion.

(b) If $\{u'_k\}$ is monotone increasing, then $\exists\, \tau_1 > \tau_2$: $(u'_{k+1} > u'_k)_{\tau_1}$ and $(u'_k > u'_{k+1})_{\tau_2}$. In order for $\{u'_k\}$ to be completely monotone increasing, the following conditions should hold:

$$\begin{cases} \#\{u'_{k+1} > u'_k\} = N, \\ \#\{u'_k > u'_{k+1}\} = 0. \end{cases} \tag{71}$$

Condition (71) is obtained if the parameters of the SIM² model satisfy constraints (61)–(64). Therefore, we get the conclusion. □

Theorem 6 gives the relation between a monotone increasing sequence and a completely monotone increasing one. While the first result (a) is quite natural, the second one (b) is weaker. In fact, if the parameters satisfy conditions (61)–(64), then $\{u'_k\}$ is completely monotone increasing by Theorem 5, so the condition "$\{u'_k\}$ is monotone increasing" is no longer needed. This remark shows that it is impossible to pass from a monotone increasing sequence to a completely monotone increasing one without the conditions on the parameters of the SIM² model.

Property 5. Assume that $\{u'_k\}$ is a completely monotone increasing sequence. Then, the limits of some special means are shown below.

(a) Modified Harmonic Mean:

lim

k!1

C

1 þPC k¼110 k

0

@

1

(b) Root Mean Square:

$$\lim_{k \to \infty}\left(\sqrt{\frac{\sum_{k=1}^{C}(u'_k)^2}{C}}\right) = 1;$$

(c) Geometric Mean:

$$\lim_{k \to \infty}\left(\prod_{k=1}^{C} u'_k\right) = 1.$$

In what follows, we look for the conditions on the parameters ensuring $u_k \ge u'_k$ for a given area $k$. This is important because such conditions reveal the trend of all areas in the map. For example, if $u_k \ge u'_k$ for most $k = \overline{1, C}$, all elements may concentrate on the last few areas rather than the first ones. Outliers can then occur and affect the final results. Thus, a suitable selection of parameters is required to avoid such cases and to guarantee the best quality of the algorithm.

Theorem 7. For a given area $k$, the condition on the parameters ensuring $u_k \ge u'_k$ is:

maxfab;c=Ag

1 a  maxi¼1;Cfwkig þ bc

Að1 aÞ maxj¼1;k1 maxi¼1;Cfwkjwjig


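Proof. We have

$$u_k \ge u'_k, \tag{74}$$

$$\Leftrightarrow u_k(1 - \alpha) \ge \alpha\beta\sum_{j=1}^{k-1} w_{kj} u_j + \frac{\gamma}{A}\sum_{j=k}^{C} w_{kj} u_j + \frac{\beta\gamma}{A}\sum_{j=1}^{k-1}\sum_{i=1}^{C} w_{kj} w_{ji} u_i. \tag{75}$$

In order to obtain inequality (75), the following condition should hold:

$$u_k(1 - \alpha) \ge \max\left\{\alpha\beta, \frac{\gamma}{A}\right\}\sum_{i=1}^{C} w_{ki} u_i + \frac{\beta\gamma}{A}\sum_{j=1}^{k-1}\sum_{i=1}^{C} w_{kj} w_{ji} u_i, \tag{76}$$

$$\Leftrightarrow 1 \ge \frac{\max\{\alpha\beta, \gamma/A\}}{1 - \alpha} \times \sum_{i=1}^{C} w_{ki}\left(\frac{u_i}{u_k}\right) + \frac{\beta\gamma}{A(1 - \alpha)}\sum_{j=1}^{k-1}\sum_{i=1}^{C} w_{kj} w_{ji}\left(\frac{u_i}{u_k}\right). \tag{77}$$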

Because all components in the sums are positive, each one should

satisfy the condition below to obtain inequality(77)

ðk  1ÞCP

maxab;cA

1 a  maxi¼1;C

wki

ui

uk

 



þ bc

Að1 aÞ

 max

j¼1;k1

max

i¼1;C

wkjwji

ui

uk

 



Due to the following fact,

lim

k;i!1

ui

We can cancel the cluster memberships in inequality (78) and obtain the result in (73). □

The findings in Section 2 help us understand the principal characteristics of the SIM² model and form the basis for the main algorithm, which is presented in the next section.

3. The modified IPFGWC algorithm

In this section, we integrate the SIM² model with the main part of IPFGWC to form the new algorithm named Modified IPFGWC (MIPFGWC). Details of this algorithm are shown below.

Input: Geo-demographic data $X$; the number of elements (clusters) $N$ ($C$); the dimension of the dataset $r$; threshold $\varepsilon$ and other parameters $m$, $\eta$, $\tau$, $a_i$ ($i = \overline{1, 3}$), $\gamma_j$ ($j = \overline{1, C}$); geographic parameters $\alpha$, $\beta$, $\gamma$, $a$, $b$, $c$, $d$.

Output: Final membership values $u'_k$ and centers $V^{(t+1)}$.

MIPFGWC Algorithm:

Step 1: Set the number of clusters $C$, threshold $\varepsilon > 0$ and other parameters such as $m, \eta, \tau > 1$; $a_i > 0$ ($i = \overline{1, 3}$); $\gamma_j$ ($j = \overline{1, C}$) as in the IPFGWC algorithm [8].

Step 2: Initialize the centers of clusters $V_j$, $j = \overline{1, C}$ at $t = 0$.

Step 3: Set the geographic parameters $\alpha$, $\beta$, $\gamma$, $a$, $b$, $c$, $d$ satisfying condition (47).

Step 4: Use formulas (80)–(82) to calculate the membership values, the hesitation level and the typicality values, respectively.

Step 5: Perform the geographic modifications through Eqs. (4) and (46).

Step 6: If $\{u'_k\}$ is a completely monotone increasing sequence (Theorem 5) or $u_k \ge u'_k$ for most $k = \overline{1, C}$ (Theorem 7), then conclude that there is no suitable solution for the given geographic parameters. Otherwise, go to Step 7.

Step 7: Calculate the centers of clusters at $t + 1$ by Eq. (83).

Step 8: If the difference $\|V^{(t+1)} - V^{(t)}\| \le \varepsilon$, then stop the algorithm. Otherwise, assign $V^{(t)} = V^{(t+1)}$ and return to Step 4.

The formulas used in Steps 4 and 7 are:

$$u_{kj} = \frac{1}{\sum_{i=1}^{C}\left(\dfrac{\|X_k - V_j\|}{\|X_k - V_i\|}\right)^{\frac{2}{m-1}}}, \quad k = \overline{1, N},\ j = \overline{1, C}, \tag{80}$$

$$h_{kj} = \frac{1}{\sum_{i=1}^{C}\left(\dfrac{\|X_k - V_j\|}{\|X_k - V_i\|}\right)^{\frac{2}{\tau-1}}}, \quad k = \overline{1, N},\ j = \overline{1, C}, \tag{81}$$

$$t_{kj} = \frac{1}{1 + \left(\dfrac{a_2\|X_k - V_j\|^2}{\gamma_j}\right)^{\frac{1}{\eta-1}}}, \quad k = \overline{1, N},\ j = \overline{1, C}, \tag{82}$$

$$V_j = \frac{\sum_{k=1}^{N}\left(a_1 u_{kj}^m + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right) X_k}{\sum_{k=1}^{N}\left(a_1 u_{kj}^m + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau}\right)}, \quad j = \overline{1, C}. \tag{83}$$
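The membership computation of Step 4 can be sketched in Python. The code below covers only a membership update of the standard FCM form, with exponent 2/(m − 1), which is assumed here to be what Eq. (80) denotes; the function name, the handling of points that coincide with a center, and the toy data are hypothetical, and the hesitation and typicality values are omitted.

```python
import math


def fcm_memberships(X, V, m=3.0):
    """FCM-style memberships u[k][j] of pattern X[k] in the cluster with center V[j]."""
    power = 2.0 / (m - 1.0)
    U = []
    for x in X:
        dists = [math.dist(x, v) for v in V]
        if any(dk == 0.0 for dk in dists):
            # Assumption: a pattern sitting exactly on a center belongs fully to it.
            hits = [1.0 if dk == 0.0 else 0.0 for dk in dists]
            total = sum(hits)
            U.append([h / total for h in hits])
            continue
        U.append([1.0 / sum((dj / di) ** power for di in dists) for dj in dists])
    return U


# Hypothetical one-dimensional layout with two centers at 0 and 1.
centers = [(0.0, 0.0), (1.0, 0.0)]
patterns = [(0.25, 0.0), (0.5, 0.0), (0.0, 0.0)]
memberships = fcm_memberships(patterns, centers, m=3.0)
```

Each row of `memberships` sums to one, and a pattern closer to a center receives a larger membership in that cluster; in the full algorithm these values would then be modified geographically in Step 5.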

4. Results

4.1. Experimental environment

In this part, we describe the experimental environment:

• Experimental tools: We implemented the proposed algorithm (MIPFGWC) in addition to IPFGWC [8] and FGWC [7] in the C programming language and executed them on a PC with an Intel(R) Core(TM)2 Duo CPU T6570 @ 2.10 GHz (2 CPUs), 2048 MB RAM, running Windows 7 Professional 32-bit.

• Experimental datasets: We use two datasets: a small university dataset (Table 1) and a real dataset of socio-economic demographic variables from the United Nations Organization (UNO) [10]. The second dataset was used for the experiments in [8].

• Cluster validity measurement: We use the validity function of fuzzy clustering for spatial data, namely IFV [3]. This index was shown to be robust and stable when clustering spatial data. It was also used to evaluate the clustering quality in [8].

$$IFV = \frac{1}{C}\sum_{j=1}^{C}\left\{\frac{1}{N}\sum_{k=1}^{N} u_{kj}^2\left[\log_2 C - \frac{1}{N}\sum_{k=1}^{N}\log_2 u_{kj}\right]^2\right\} \times \frac{SD_{max}}{\sigma_D}, \tag{84}$$

$$SD_{max} = \max_{k \neq j}\|V_k - V_j\|^2, \tag{85}$$

$$\sigma_D = \frac{1}{C}\sum_{j=1}^{C}\left(\frac{1}{N}\sum_{k=1}^{N}\|X_k - V_j\|^2\right). \tag{86}$$

The clustering of the dataset is considered optimal when IFV attains its maximum.

• Objective: We evaluate the clustering qualities of those algorithms on two different datasets through the IFV index. Some comparisons of computational times are also considered.
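The IFV index can be computed with a short Python sketch. It assumes the commonly used form of the index, with $SD_{max}$ the largest squared distance between two centers and $\sigma_D$ the mean squared distance of patterns to centers, and requires all memberships to be strictly positive; the function name and the toy data are hypothetical.

```python
import math


def ifv(U, X, V):
    """IFV validity index for memberships U (N x C), patterns X and centers V."""
    N, C = len(X), len(V)
    total = 0.0
    for j in range(C):
        mean_log = sum(math.log2(U[k][j]) for k in range(N)) / N
        mean_sq = sum(U[k][j] ** 2 for k in range(N)) / N
        total += mean_sq * (math.log2(C) - mean_log) ** 2
    sd_max = max(math.dist(V[i], V[j]) ** 2
                 for i in range(C) for j in range(C) if i != j)
    sigma_d = sum(sum(math.dist(X[k], V[j]) ** 2 for k in range(N)) / N
                  for j in range(C)) / C
    return total / C * sd_max / sigma_d


# Hypothetical toy data: two tight groups on a line.
pts = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.9, 0.0)]
ctr = [(0.05, 0.0), (0.95, 0.0)]
sharp = [[0.99, 0.01], [0.99, 0.01], [0.01, 0.99], [0.01, 0.99]]
fuzzy = [[0.5, 0.5]] * 4
ifv_sharp = ifv(sharp, pts, ctr)
ifv_fuzzy = ifv(fuzzy, pts, ctr)
```

For the same patterns and centers, the sharper partition yields the larger index, in line with the rule that a larger IFV indicates a better clustering.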

4.2. Evaluations on the university dataset

4.2.1. Case 1

In this case, we divide the students into three groups, "High", "Medium" and "Low", according to all attributes, using the two clustering algorithms MIPFGWC and IPFGWC with the parameters set as follows:

• $\varepsilon = 10^{-3}$, $m = 3$, $\tau = 2$, $\eta = 2$, $a_1 = a_2 = a_3 = 1$, $\gamma_j = 1$ ($j = \overline{1, C}$);

• IPFGWC: geographic parameters $a = b = 1$, $\alpha = 0.5$, $\beta = 0.5$;

• MIPFGWC: geographic parameters $a = b = c = d = 1$, $\alpha = 0.5$, $\beta = 0.3$, $\gamma = 0.2$.

These parameters were used in [8] to evaluate the clustering quality of the algorithms. The initial centers of clusters $V^{(0)}$ used for MIPFGWC and IPFGWC are

$$V^{(0)} = \begin{pmatrix} 17 & 12 & 8.6 & 0.3 \\ 24 & 26 & 7.2 & 0.5 \\ 36 & 47 & 6.0 & 0.8 \end{pmatrix}. \tag{87}$$

The results of the MIPFGWC algorithm with those configurations are shown in Eqs. (88) and (89). The computational time and the number of iteration steps of MIPFGWC are 0.021 s and 15, respectively. Similarly, the results of the IPFGWC algorithm are given in Eqs. (90) and (91). The computational time and the number of iteration steps of IPFGWC are 0.019 s and 20, respectively. Through Eqs. (84)–(86), we compare the IFV values of these algorithms and get the result in Eq. (92).

$$V^{(MIPFGWC)} = \begin{pmatrix} 19.306553 & 10.105724 & 7.374804 & 0.999004 \\ 17.479478 & 28.197003 & 7.735475 & 0.256288 \\ 28.539287 & 33.201849 & 7.386632 & 0.638537 \end{pmatrix}, \tag{88}$$

$$U^{(MIPFGWC)} = \begin{pmatrix} 0.023007 & 0.927891 & 0.049102 \\ 0.091148 & 0.384451 & 0.524402 \\ 0.001554 & 0.994834 & 0.003612 \\ 0.111198 & 0.261117 & 0.627685 \\ 0.379026 & 0.251436 & 0.369538 \\ 0.050533 & 0.886195 & 0.063272 \\ 0.999511 & 0.000320 & 0.000169 \\ 0.012554 & 0.041774 & 0.945672 \end{pmatrix}, \tag{89}$$

$$V^{(IPFGWC)} = \begin{pmatrix} 19.345955 & 10.096919 & 7.370893 & 0.999568 \\ 17.364905 & 27.968125 & 7.694930 & 0.263094 \\ 28.414465 & 34.814743 & 7.481142 & 0.557099 \end{pmatrix}, \tag{90}$$

$$U^{(IPFGWC)} = \begin{pmatrix} 0.044201 & 0.469462 & 0.486337 \\ 0.062008 & 0.228704 & 0.709288 \\ 0.034854 & 0.497449 & 0.467697 \\ 0.068561 & 0.184172 & 0.747267 \\ 0.213137 & 0.178216 & 0.608646 \\ 0.054508 & 0.458405 & 0.487087 \\ 0.499723 & 0.034214 & 0.466063 \\ 0.031181 & 0.132587 & 0.836232 \end{pmatrix}. \tag{91}$$

Thus, the clustering quality of MIPFGWC is better than that of IPFGWC. From Eqs. (89) and (91), we get another result.

Eq. (93) states that the maximal difference of cluster memberships using the SIM-PF and SIM² models calculated by experiments is 1.89, which is smaller than the theoretical value $MD$ stated in Theorem 4. Now, we verify the values of the weighting function of the MIPFGWC algorithm, shown in Eqs. (94)–(97).

$$w = \begin{pmatrix} 0 & 0.018119 & 0.009682 \\ 0.018119 & 0 & 0.060905 \\ 0.009682 & 0.060905 & 0 \end{pmatrix}, \tag{94}$$

$$w_1^{AVG} = 0.027801 < (pop_1 \times N_0)^b \times (pop_1 + N_0)^d = 160, \tag{95}$$

$$w_2^{AVG} = 0.079024 < (pop_2 \times N_0)^b \times (pop_2 + N_0)^d = 384, \tag{96}$$

$$w_3^{AVG} = 0.070587 < (pop_3 \times N_0)^b \times (pop_3 + N_0)^d = 160. \tag{97}$$

These results confirm that the average influence of an area on another one does not exceed the upper bound stated in Theorem 1.
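The figures in Eqs. (94)–(97) can be cross-checked with a short Python sketch. The populations (2, 4, 2) and the total $N_0 = 8$ are not stated in the text; they are assumptions inferred here from the quoted bounds 160, 384 and 160 with $b = d = 1$. The variable and function names are illustrative.

```python
W = [
    [0.0, 0.018119, 0.009682],
    [0.018119, 0.0, 0.060905],
    [0.009682, 0.060905, 0.0],
]
POP = [2, 4, 2]  # assumption: inferred from the bounds 160, 384, 160
N0 = 8           # assumption: inferred together with POP


def row_sums(w):
    """Sum of the weights from each area to all the others."""
    return [sum(row) for row in w]


def theorem1_bound(pop_k, n0, b=1, d=1):
    """Right-hand side of Eq. (13): (pop_k * N0)^b * (pop_k + N0)^d."""
    return (pop_k * n0) ** b * (pop_k + n0) ** d


sums = row_sums(W)
bounds = [theorem1_bound(p, N0) for p in POP]
```

The row sums of the weight matrix reproduce the three left-hand values 0.027801, 0.079024 and 0.070587, and each lies far below the corresponding bound.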

In Table 2, we calculate the sum of weights both by Properties 2 and 3 and by experiments. The results show that when the number of areas is small, the sum of weights is greater than when the number of areas is large. The experimental results in this case were obtained with $C = 3$ and are higher than those of Properties 2 and 3. Additionally, the difference between the results for $2b + d = 2$ and $2b + d = 3$ is small.

In what follows, we study the distribution of pattern sets and centers before and after running the two algorithms. The results are depicted in Figs. 5–8. From Fig. 5, we see that there is an outlier in the original pattern set. The MIPFGWC algorithm handles outliers better than IPFGWC: Fig. 7 shows that the outlier disappears in MIPFGWC, whereas it remains in IPFGWC (Fig. 6).

In IPFGWC, the patterns tend to move toward cluster "C", while those of MIPFGWC are evenly distributed in the data space. The lack of immigration effects, common boundaries between regions and updated cluster memberships in IPFGWC results in the abnormal distribution of patterns shown in Fig. 6. MIPFGWC remedies this limitation through the use of the SIM² model and gives the distribution of patterns in Fig. 7.

In Fig. 8, the changes of centers after running the two algorithms are described. Intuitively, there is not much difference between the locations of these centers. This means that the SIM² model has a greater impact on the distribution of patterns than on the centers. All the results above show that MIPFGWC obtains better clustering quality than IPFGWC.

4.2.2. Case 2

In order to verify the consistency and robustness of the proposed algorithm, we change the parameters of the algorithms as follows:

• IPFGWC: geographic parameters $\alpha = 0.7$, $\beta = 0.3$;

• MIPFGWC: geographic parameters $\alpha = 0.7$, $\beta = 0.2$, $\gamma = 0.1$.

Other parameters are kept as in Case 1. The results of MIPFGWC with the initial centers of clusters $V^{(0)}$ in Eq. (87) are:

$$V^{(MIPFGWC)} = \begin{pmatrix} 20.986869 & 10.595425 & 7.231345 & 0.999561 \\ 17.637678 & 27.865067 & 7.649198 & 0.282628 \\ 31.018697 & 46.161775 & 8.406729 & 0.095751 \end{pmatrix}, \tag{98}$$

Table 1. The university dataset.

$$U^{(MIPFGWC)} = \begin{pmatrix} 0.028586 & 0.948662 & 0.022752 \\ 0.068931 & 0.273096 & 0.657973 \\ 0.001370 & 0.997771 & 0.000859 \\ 0.000859 & 0.036672 & 0.946739 \\ 0.516925 & 0.296017 & 0.187058 \\ 0.045229 & 0.938912 & 0.015859 \\ 0.983921 & 0.013168 & 0.002911 \\ 0.161933 & 0.493241 & 0.344827 \end{pmatrix}. \tag{99}$$

The computational time and the number of iteration steps of MIPFGWC are 0.008 s and 9, respectively. Similarly, the results of the IPFGWC algorithm are given in Eqs. (100) and (101). The computational time and the number of iteration steps of IPFGWC are 0.005 s and 9, respectively. The IFV values of these algorithms are calculated in Eq. (102).

$$V^{(IPFGWC)} = \begin{pmatrix} 20.846367 & 10.502604 & 7.243042 & 0.999342 \\ 17.660246 & 27.829394 & 7.643995 & 0.282196 \\ 30.688095 & 46.148977 & 8.380672 & 0.094085 \end{pmatrix}, \tag{100}$$

$$U^{(IPFGWC)} = \begin{pmatrix} 0.069170 & 0.665294 & 0.265535 \\ 0.063570 & 0.207881 & 0.728549 \\ 0.052247 & 0.698536 & 0.249217 \\ 0.020084 & 0.058535 & 0.921382 \\ 0.375461 & 0.241883 & 0.382656 \\ 0.079164 & 0.660991 & 0.259845 \\ 0.690987 & 0.058643 & 0.250370 \\ 0.139263 & 0.364975 & 0.495762 \end{pmatrix}. \tag{101}$$

These results show that the clustering quality of MIPFGWC is better than that of IPFGWC. Additionally, the IFV values of both algorithms are larger than those of the previous case given in Eq. (92). The computational times of these algorithms are also smaller than those of the previous case. This means that a larger value of the parameter $\alpha$ in this case, compared with the previous one, enhances the clustering quality and reduces the computational time of MIPFGWC.

Table 2. The sum of weights in Case 1.

Columns: Sum of weights; Results of Properties 2 and 3 ($2b + d = 2$ and $2b + d = 3$); Experimental results.

Rows: $\sum_{k=1}^{C}\sum_{j=1}^{C} w_{kj}$; $\sum_{k=1}^{C}\sum_{i=1}^{C}\sum_{j=1}^{C}(w_{ki} \times w_{kj})$.