Spatial interaction – modification model and applications to geo-demographic analysis
Le Hoang Son (a,*), Bui Cong Cuong (b), Hoang Viet Long (c)
(a) VNU University of Science, Vietnam National University, Viet Nam
(b) Institute of Mathematics, Vietnam Academy of Science and Technology, Viet Nam
(c) Faculty of Basic Sciences, University of Transport and Communications, Viet Nam
Article history:
Received 7 January 2013
Received in revised form 7 May 2013
Accepted 9 May 2013
Available online 23 May 2013
Keywords:
Fuzzy clustering
Geo-demographic analysis
Geographic effects
Spatial interaction modification model
Data mining
Abstract
In this paper, we introduce a novel model, the Spatial Interaction – Modification Model (SIM2), for the classification of spatially referenced demographic data. It is integrated with the main part of IPFGWC, the best fuzzy clustering algorithm for the geo-demographic analysis problem, to form a new method named MIPFGWC. Theoretical and experimental analyses show that MIPFGWC achieves better clustering quality than IPFGWC and other available algorithms.
© 2013 Elsevier B.V. All rights reserved.
1 Introduction
Geo-Demographic Analysis (GDA) has been widely used in various applications such as the planning and distribution of products and services, the determination of common population characteristics, and the study of population variation in terms of gender, age, ethnicity, etc. Results of this kind of analysis are visualized on a map as several distinct groups that represent different levels of a population characteristic, e.g. "High density of chain-smokers" and "Low density of chain-smokers". This knowledge assists policy makers in two phases: (i) understanding the causes of such a distribution of the population characteristic and (ii) devising effective policies to adjust the distribution in order to achieve a certain goal, e.g. the limitation of smoking. Thus, GDA is of great interest to engineers and managers alike.
In GDA, fuzzy clustering methods are often used to generate a specific distribution of population characteristics. Improving the quality of GDA, or the quality of the fuzzy clustering used for GDA, is therefore an important objective for the accurate description of the distribution. In what follows, we briefly summarize some relevant works toward that goal.
Some authors such as Bezdek and Ehrlich [1], Ji et al. [5], Khashei et al. [6], Son et al. [8,9], Yin et al. [11] and Zadegan et al. [12] introduced Fuzzy C-Means (FCM) and its variants to determine the distribution of a demographic feature on a map. Feng and Flowerdew [4] presented an improvement of FCM called Neighbourhood Effects (NE) to remedy the lack of geographic factors in that algorithm. Based upon the principle of the Spatial Interaction Model (SIM) [2], NE modifies the cluster memberships by geographic parameters so that the final results are spatially referenced. Mason and Jacobson [7] improved NE in two ways: (i) the SIM model was extended to use the population factor instead of the length of the common boundary, and the new model was named Spatial Interaction Model with Population Factor (SIM-PF); and (ii) the modification of cluster memberships was executed in each iteration instead of at the end of the algorithm. Their algorithm was named FGWC. Our previous work [8] improved FGWC by integrating it with some results of Intuitionistic Fuzzy Sets (IFS) to tackle the problems of sensitivity to outliers and crisp memberships that remained in FGWC. The SIM-PF model was kept unchanged in the new algorithm, named IPFGWC. Some remarks drawn from those articles are listed below.
(a) Experimental and theoretical analyses in the article [8] showed that the clustering quality of IPFGWC is better than those of FGWC, NE and FCM.
(b) The SIM-PF model used in IPFGWC has some limitations that may reduce the quality of the algorithm.
(c) Theoretical analyses of the impact of the SIM-PF (SIM) model on cluster memberships, as well as of the characteristics of the model, were not found in existing articles.
* Corresponding author. Address: 334 Nguyen Trai, Thanh Xuan, Hanoi, Viet Nam. Tel.: +84 904171284; fax: +84 0438623938. E-mail address: sonlh@vnu.edu.vn (L.H. Son).
Knowledge-Based Systems (journal homepage: www.elsevier.com/locate/knosys). 0950-7051/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
Let us make a deeper analysis of the second consideration. The modification of the cluster membership based on SIM-PF is described as follows:

$$u'_k = \alpha\, u_k + \beta\, \frac{1}{A} \sum_{j=1}^{C} w_{kj}\, u_j, \tag{1}$$

$$\alpha + \beta = 1, \tag{2}$$

$$w_{kj} = \frac{(pop_k \cdot pop_j)^b}{d_{kj}^a}. \tag{3}$$

In Eqs. (1)–(3), $u'_k$ ($u_k$) is the new (old) cluster membership of area k. The two parameters $\alpha$ and $\beta$ are scaling variables. A is a factor that scales the "sum" term to the range 0 to 1. $w_{kj}$ is the weight between two areas k and j, showing the level of influence of one area upon another. It is calculated through Eq. (3), where $pop_k$ ($pop_j$) and $d_{kj}$ are the population of area k (j) and the distance between these areas, respectively. a and b are user-defined parameters.
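The SIM-PF update can be read as a weighted average over all areas that uses only the old memberships. The following is a minimal sketch under stated assumptions: the function names, the sample populations and distances are hypothetical, and the scaling factor A is taken as the largest weighted sum over all areas, which is only one plausible way to keep the "sum" term in the range 0 to 1.

```python
def simpf_weights(pop, dist, a=1.0, b=1.0):
    """Weights in the style of Eq. (3): (pop_k * pop_j)^b / d_kj^a, zero on the diagonal."""
    n = len(pop)
    return [[0.0 if k == j else (pop[k] * pop[j]) ** b / dist[k][j] ** a
             for j in range(n)] for k in range(n)]


def simpf_update(u, w, alpha, beta):
    """Update in the style of Eq. (1): every new value is built from the OLD memberships only."""
    n = len(u)
    sums = [sum(w[k][j] * u[j] for j in range(n)) for k in range(n)]
    A = max(sums) or 1.0  # assumption: scale by the largest sum so each term is at most 1
    return [alpha * u[k] + beta * sums[k] / A for k in range(n)]


# Hypothetical toy data: three areas, symmetric distances.
pop = [3, 2, 4]
dist = [[0, 10, 8], [10, 0, 6], [8, 6, 0]]
w = simpf_weights(pop, dist)
new_u = simpf_update([0.2, 0.5, 0.9], w, alpha=0.5, beta=0.5)
```

Because `simpf_update` never reads `new_u` while building it, the limitation discussed in Example 1 (already updated areas are ignored) is visible directly in the code.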
Now, we give three examples to illustrate the weaknesses of the SIM-PF model.
Example 1 (The limitation of the model). In Eq. (1), a newly updated cluster membership cannot be used in the modification of the next ones. For example, in Fig. 1, area "A" is affected by the others according to the SIM-PF model and moves to the new location "A'". Fig. 2 shows the next modification step of SIM-PF on area "B". However, instead of using "A'", this model still uses "A" for the update of area "B". This may lead to an inaccurate modification that decreases the quality of the algorithm.
Example 2 (The limitation of the weight: common boundary). The weight in Eq. (3) does not account for areas having common boundaries. Thus, neighboring areas with low population may receive a smaller weight than distant areas with high population. In Fig. 3, we illustrate three areas "A", "B" and "C". Comparing the distances between the three centers, we see that "A" is nearer to "C" than to "B". The populations of {"A", "C"} are larger than those of {"A", "B"}. According to Eq. (3), the weight $w_{AC}$ is definitely larger than $w_{AB}$. However, areas {"A", "B"} have a common boundary containing the element "8" while {"A", "C"} do not. Naturally, one would assume that {"A", "B"} are more closely related to each other than {"A", "C"}. Indeed, it should be $w_{AB} > w_{AC}$.
Example 3 (The limitation of the weight: immigration). The weight in Eq. (3) cannot reflect immigration behavior, which is the key demographic factor showing the strong connection between groups having high densities of immigrating elements. In Fig. 4, we illustrate three areas in which elements "6" and "8" moved from area "B" to "A", and element "4" went from "A" to "B". The formula in Eq. (3) shows that "A" is related to "C" more than to "B". However, it should be $w_{AB} > w_{AC}$ since the historic immigration points to the interaction between "A" and "B".
Those examples show the need for a new model that can ameliorate the limitations of SIM-PF. Additionally, theoretical analyses of the new model, such as the measurement of the influence between two areas, the difference of cluster memberships between SIM-PF and the new model, and the suitable selection of parameters for the best quality of the algorithm, should be considered. The relevant articles lacked those systematic analyses, as noted in the third consideration above. Last but not least, an improvement of IPFGWC that includes the new model is constructed; its clustering quality is expected to be better than that of IPFGWC since the limitations of the SIM-PF model are remedied. These considerations are our motivation and objectives in this article.
The rest of the paper and our contributions are organized as follows. Section 2 presents our main contribution, including the novel model named Spatial Interaction – Modification Model (SIM2) and some theoretical analyses of it. Specifically, in Section 2.1, we examine a new weighting function that handles the limitations of the weight stated in Examples 2 and 3. Moreover, some interesting properties and theorems of that function, which are useful for determining the values of some parameters, are investigated, such as:
• The upper bound of the average influence between two arbitrary areas, which helps us determine the maximal value of any weight.
• The relation between some parameters of the weighting function that ensures the same influence for two arbitrary pairs of areas, under both the perfect and the imperfect assumptions.
• The sum of weights when the number of areas is very large.
In Section 2.2, a new model that handles the limitation stated in Example 1 is considered. We also investigate some properties and theorems, such as:
• The difference of cluster memberships between the SIM-PF and SIM2 models.
• Conditions on the parameters for $u'_k$ to be a completely monotone increasing sequence.
• The relation between a monotone increasing sequence and a completely monotone increasing one.
• The suitable selection of parameters for the best quality of the algorithm in Section 3.
The results of Sections 2.1 and 2.2 together are called the SIM2 model. In Section 3, we integrate SIM2 with the main part of IPFGWC to form the new algorithm called Modified IPFGWC (MIPFGWC). Section 4 validates the proposed approach through a set of experiments involving real-world data. Finally, Section 5 draws the conclusions and delineates future research directions.
2 Spatial interaction modification model
2.1 A new weighting function
Definition 1. A Weighting Function (WF) w is a function

$$(k, j) \mapsto w_{kj} = \begin{cases} \dfrac{(pop_k\, pop_j)^b\; p_{kj}^c\; IM_{kj}^d}{d_{kj}^a}, & k \ne j, \\[2mm] 0, & k = j, \end{cases} \tag{4}$$

where $pop_k$ ($pop_j$) is the population of area k (j); $d_{kj}$ is the distance between those areas; $p_{kj}$ is the maximal distance between elements in the common boundary of the two areas. If these areas do not have a common boundary, or there is only one element in the boundary, $p_{kj}$ is set to one. $IM_{kj}$ is the total number of elements immigrating from area k to j and vice versa. It measures the historic immigration in the previous calculation step. If no immigration between the two areas is found, $IM_{kj} = 1$. Finally, the four numbers a, b, c, d are parameters specified by some theorems in this sub-section. Some constraints are attached to Eq. (4) as follows:

$$\begin{cases} \sum_{k=1}^{C} pop_k = N_0, \\ p_{kj} \le d_{kj}, \\ IM_{kj} \le pop_k + pop_j, \end{cases} \tag{5}$$

where C ($N_0$) is the total number of areas (population and elements in common boundaries). When $p_{kj} = 0$, $N_0$ is equal to the total population (N).

Fig. 2. The second modification step of SIM-PF model.
Fig. 3. The common boundary.
Example 4. In Fig. 4, assume that a = b = c = d = 1 and $d_{AB} = 10$, $d_{AC} = 8$. We re-calculate the weights of all areas following Eq. (4):

$$w_{AB} = [(3 \times 2)^1 \times 1 \times 3]/10 = 1.8; \qquad w_{AC} = [(3 \times 4)^1 \times 1 \times 1]/8 = 1.5.$$
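The arithmetic of Example 4 can be checked with a few lines of code. This is only a sketch: the function name `weight` and its keyword arguments are illustrative, and the input values (populations 3, 2 and 4, an immigration count of 3 between "A" and "B", and a boundary term of 1) are read off the worked numbers above.

```python
def weight(pop_k, pop_j, d_kj, p_kj=1.0, im_kj=1.0, a=1.0, b=1.0, c=1.0, d=1.0):
    """Weight in the style of Eq. (4) for two distinct areas k and j."""
    return (pop_k * pop_j) ** b * p_kj ** c * im_kj ** d / d_kj ** a


# Areas A and B: three elements migrated between them, so IM_AB = 3.
w_ab = weight(3, 2, 10, p_kj=1, im_kj=3)
# Areas A and C: no common boundary and no immigration, so both terms default to 1.
w_ac = weight(3, 4, 8)
```

With the immigration term included, the pair {"A", "B"} now outweighs {"A", "C"}, which is the ordering argued for in Example 3.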
Some special cases are shown below.

$$(c = d = 0) \vee (p_{kj} = 1 \wedge IM_{kj} = 1): \quad w \to w_{SIM\text{-}PF}. \tag{6}$$

Now, we investigate some interesting properties and theorems of the weighting function.
Property 1. It is easy to check that WF satisfies:

$$\begin{aligned} &(a)\ \text{Commutative: } w_{k,j} = w_{j,k}\ (\forall j, k,\ j \ne k); \\ &(b)\ \forall j: w_{\phi,j} = 0, \text{ where } \phi \text{ is an area having no population}; \\ &(c)\ \text{Cardinality: } |w| = C^2. \end{aligned} \tag{7}$$
Theorem 1. The upper bound of the average influence of area k on another one is

$$w_k^{AVG} \le (1/C)\,(pop_k \cdot N_0)^b\,(pop_k + N_0)^d. \tag{8}$$

Proof. From Eqs. (4) and (5), we have

$$\sum_{j=1}^{C} w_{kj} = pop_k^b \sum_{j=1}^{C} \left( \frac{pop_j^b\, IM_{kj}^d\, p_{kj}^c}{d_{kj}^a} \right) \le pop_k^b \sum_{j=1}^{C} pop_j^b\, IM_{kj}^d.$$

Using the Hölder inequality, we obtain

$$\sum_{j=1}^{C} w_{kj} \le pop_k^b \left( \sum_{j=1}^{C} pop_j \right)^b \left( \sum_{j=1}^{C} IM_{kj} \right)^d \le (pop_k \cdot N_0)^b \left( \sum_{j=1}^{C} IM_{kj} \right)^d.$$

From constraint (5) and the Minkowski inequality, we get

$$\sum_{j=1}^{C} w_{kj} \le (pop_k \cdot N_0)^b\,(pop_k + N_0)^d. \tag{13}$$

Thus, we obtain the result in Eq. (8). □
Consequence 1. By a similar proof, we obtain the upper bound of the average influence of an arbitrary area on another one as follows:

$$w^{AVG} = (1/C)^2 \sum_{k=1}^{C} \sum_{j=1}^{C} w_{kj} \le 2 N_0^{2b+d}. \tag{14}$$

This result gives us an estimate of the upper bound of the influence $w_{kj}$, $\forall k \ne j$. This means that wherever two areas are on the map, the maximal impact between them is given by Eq. (14). In what follows, we consider another theorem of the weighting function.

Fig. 4. Immigration interaction.
Theorem 2. Given the following assumptions,
(a) All areas are not empty and do not intersect the others; (15)
(c) $d_{kj} = d_{ki} + \varepsilon$ with $k \ne i \ne j$, where $\varepsilon$ is an infinitesimal; (17)
(d) $IM_{kj} = (pop_k + pop_j)/2$. (18)
The condition to ensure that the influences from area k to the others are equal is:

$$(1 + \varepsilon)^a > \left[ \frac{2}{(N_0 - C + 1)(N_0 - C + 2)} \right]^{\frac{b+d}{2}}. \tag{19}$$
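Proof. Consider a lemma as follows: for any a, b, x, y satisfying $a \ge b$ and $x \ge y > 0$, the following inequality holds:

$$x^a y^b \ge (xy)^{\frac{a+b}{2}}. \tag{21}$$

Indeed, Eq. (21) can be obtained through the transformations below:

$$a \ln x + b \ln y \ge a \ln y + b \ln x, \tag{23}$$

$$\Leftrightarrow\ 2\,(a \ln x + b \ln y) \ge (a + b)(\ln x + \ln y). \tag{24}$$

Because the influences from area k to the others are equal, we have $w_{kj} = w_{ki}$, that is,

$$\frac{(pop_k \cdot pop_j)^b\, p_{kj}^c\, IM_{kj}^d}{d_{kj}^a} = \frac{(pop_k \cdot pop_i)^b\, p_{ki}^c\, IM_{ki}^d}{d_{ki}^a}, \tag{27}$$

$$\Leftrightarrow\ \left( \frac{pop_j}{pop_i} \right)^b \left( \frac{IM_{kj}}{IM_{ki}} \right)^d = \left( \frac{d_{kj}}{d_{ki}} \right)^a \left( \frac{p_{ki}}{p_{kj}} \right)^c. \tag{28}$$

From the constraints (15), (17), (18), Eq. (28) is re-written as

$$\left( \frac{pop_j}{pop_i} \right)^b \left( \frac{pop_k + pop_j}{pop_k + pop_i} \right)^d = \left( 1 + \frac{\varepsilon}{d_{ki}} \right)^a \approx (1 + \varepsilon)^a. \tag{29}$$

Without loss of generality, we can assume that $pop_j \le pop_i$, $\forall i \ne j \ne k$. Applying lemma (21) to the left side of Eq. (29), we have

$$\left( \frac{pop_j}{pop_i} \right)^b \left( \frac{pop_k + pop_j}{pop_k + pop_i} \right)^d \ge \left[ \frac{pop_j}{pop_i} \cdot \frac{pop_k + pop_j}{pop_k + pop_i} \right]^{\frac{b+d}{2}}. \tag{30}$$

Because of constraint (15), the right side of Eq. (30) is minimal if its numerator is smallest and its denominator is largest. Thus, $pop_k = pop_j = 1$ and $pop_i = N_0 - C + 1$, and

$$\left[ \frac{pop_j}{pop_i} \cdot \frac{pop_k + pop_j}{pop_k + pop_i} \right]^{\frac{b+d}{2}} \ge \left[ \frac{2}{(N_0 - C + 1)(N_0 - C + 2)} \right]^{\frac{b+d}{2}}. \tag{31}$$

From Eqs. (29)–(31), we obtain the result in Eq. (19). □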
Theorem 2 shows the relation between some parameters that ensures the same impact from area k on the others. If a > 0, Eq. (19) always holds. Otherwise, it helps us choose suitable values of a, b, d for a given number of areas and population.
Consequence 2. By a similar proof, we obtain the condition for the same influence of two arbitrary pairs of areas:

$$(1 + \varepsilon)^a > \left[ \frac{2}{N - C + 2} \right]^{\frac{b+d}{2}}.$$
Theorem 3. If we replace one constraint in Theorem 2 with one of those below, then the new condition to ensure that the influences from area k to the others are equal is as follows.
(a) IMkj= min{popk, popj}:
The condition is:
ð1 þeÞa> 1
N0 C þ 1
(b) Assume that $\{d_{kj}\}$, $j = 1, \dots, C$, $j \ne k$, is a Padovan sequence:

$$d_{k1} = d_{k2} = d_{k3} = D_0, \qquad d_{k,j} = d_{k,j-2} + d_{k,j-3}, \quad j = 4, \dots, C,\ j \ne k;\ D_0\text{: constant}.$$

The condition is:

$$P(C)^a > \left[ \frac{2}{(N_0 - C + 1)(N_0 - C + 2)} \right]^{\frac{b+d}{2}}, \tag{34}$$

where

$$P(C) = \frac{1 + r_1}{r_1^{C+2}\,(2 + 3 r_1)} + \frac{1 + r_2}{r_2^{C+2}\,(2 + 3 r_2)} + \frac{1 + r_3}{r_3^{C+2}\,(2 + 3 r_3)}, \tag{35}$$

and $r_i$ ($i = 1, \dots, 3$) is the ith root of the equation $x^3 + x^2 - 1 = 0$.
(c) $p_{kj} = d_{kj}/\delta$ with $\delta$ being a constant.
The condition is:

$$(1 + \varepsilon)^{a - c} > \left[ \frac{2}{(N_0 - C + 1)(N_0 - C + 2)} \right]^{\frac{b+d}{2}}. \tag{36}$$

In Theorem 2, the assumptions are quite ideal. We will certainly not see any part of the real world containing areas that never overlap the others or that are all equidistant from a considered one. The purpose of this theorem is to examine the relation between some parameters of the weighting function that ensures the same influence from a given area on the others under perfect conditions. However, if one of these conditions fails, what does the new relation become? Theorem 3 helps us answer this by replacing an assumption in Theorem 2 with a new one. For example, if we use the minimum operator instead of the average one, the new relation is set up as in Eq. (33), which is looser than Eq. (19). Similarly, when the distances from a given area to the others form a Padovan sequence, the new relation described in Eq. (34) is also looser than that in Theorem 2. If all areas intersect the others, then we get the result in Eq. (36), which is stricter than Eq. (19). Thus, the cases in Theorem 3 give us a reference for the relations between the parameters in different environments.
Now, we investigate the sum of weights when the number of areas is very large. Assume that we have the following constraints:
(a) Assumptions (15), (17) and (18) of Theorem 2; (37)
(c) $\{pop_k\}$, $k = 1, \dots, C$, is a Triangular sequence. (39)
Property 2.
(a)

$$\sum_{k=1}^{\infty} \sum_{j=1}^{\infty} w_{kj} = \frac{1}{D_0} \lim_{C \to \infty} \frac{1}{3 (C+1)^2}\, 4 \Big( \pi^2 C^2 - 9 C^2 - 6 C^2 \psi^{(1)}(C+1) + 2 \pi^2 C - 12 C - 12 C \psi^{(1)}(C+1) - 6 \psi^{(1)}(C+1) + \pi^2 \Big) \tag{40}$$

$$= \frac{4}{3 D_0}\,(\pi^2 - 9), \tag{41}$$

where $\psi^{(n)}(x)$ is the nth derivative of the digamma function, and 2b + d = 2. $D_0$ is the distance from an arbitrary area to area k.
(b)

$$\sum_{k=1}^{\infty} \sum_{j=1}^{\infty} w_{kj} = \frac{80 - 8 \pi^2}{D_0}, \quad \text{where } 2b + d = 3. \tag{42}$$

From this property, we recognize that there is a difference of $(92 - 28\pi^2/3)/D_0$ between two consecutive series.
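Property 3.
(a)

$$\begin{aligned} \sum_{k=1}^{\infty} \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} (w_{ki}\, w_{kj}) = \frac{1}{D_0^2} \lim_{C \to \infty} \frac{1}{45 (C+1)^4}\, 16 \Big( & \pi^4 C^4 + 150 \pi^2 C^4 - 1575 C^4 - 900 C^4 \psi^{(1)}(C+1) - 15 C^4 \psi^{(3)}(C+1) \\ & + 4 \pi^4 C^3 + 600 \pi^2 C^3 - 5400 C^3 - 3600 C^3 \psi^{(1)}(C+1) - 60 C^3 \psi^{(3)}(C+1) \\ & + 6 \pi^4 C^2 + 900 \pi^2 C^2 - 6300 C^2 - 5400 C^2 \psi^{(1)}(C+1) - 90 C^2 \psi^{(3)}(C+1) \\ & + 4 \pi^4 C + 600 \pi^2 C - 2520 C - 3600 C \psi^{(1)}(C+1) - 60 C \psi^{(3)}(C+1) \\ & - 900 \psi^{(1)}(C+1) - 15 \psi^{(3)}(C+1) + \pi^4 + 150 \pi^2 \Big) \end{aligned} \tag{43}$$

$$= \frac{16}{45 D_0^2}\,(-1575 + 150 \pi^2 + \pi^4), \quad \text{where } 2b + d = 2. \tag{44}$$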
ðbÞ X1
k¼1
X1
i¼1
X1
j¼1
ðwki wkjÞ ¼P
2
þP4
107D20 ;where 2b þ d ¼ 3: ð45Þ
Similarly, the difference between two consecutive series in
Property 3is ð2696400 256755P2 1667P4Þ=D2
2.2 A new model
Definition 2. The Spatial Interaction – Modification Model (SIM2) is defined as

$$u'_k = \alpha\, u_k + \beta \sum_{j=1}^{k-1} w_{kj}\, u'_j + \gamma\, \frac{1}{A} \sum_{j=k}^{C} w_{kj}\, u_j, \tag{46}$$

$$\alpha + \beta + \gamma = 1. \tag{47}$$

In this model, the newly updated areas appear within Eq. (46). They contribute to the shift of the next areas toward the goal state. ($\alpha$, $\beta$, $\gamma$) are user-defined parameters satisfying constraint (47). In Eq. (46), the weighting function $w_{kj}$ is defined in Eq. (4). The last parameter, A, is a scaling variable that forces the sum of cluster memberships to one. It is often calculated through the maximum operator. A special case of SIM2 model is:
Similar to the previous section, we examine some important theorems from Definition 2.
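As a concrete reading of Definition 2, the sketch below applies the SIM2 update sequentially, so that areas updated earlier in the pass feed into later ones. It rests on assumptions: the function name `sim2_update` and the toy inputs are hypothetical, memberships are treated as one scalar per area, and A is taken as the largest weighted sum of old memberships, which is just one way to realize "calculated through the maximum operator".

```python
def sim2_update(u, w, alpha, beta, gamma):
    """Sequential update in the style of Eq. (46).

    Areas j < k contribute their NEW values u'_j; areas j >= k contribute
    their OLD values u_j, scaled by gamma / A.
    """
    n = len(u)
    # Assumption: A is the maximum over k of sum_j w_kj * u_j (1.0 if all sums are zero).
    A = max(sum(w[k][j] * u[j] for j in range(n)) for k in range(n)) or 1.0
    new_u = []
    for k in range(n):
        updated = sum(w[k][j] * new_u[j] for j in range(k))     # already modified areas
        pending = sum(w[k][j] * u[j] for j in range(k, n))      # not yet modified areas
        new_u.append(alpha * u[k] + beta * updated + gamma * pending / A)
    return new_u


# Hypothetical two-area example with unit weights between the areas.
w_toy = [[0.0, 1.0], [1.0, 0.0]]
result = sim2_update([0.4, 0.8], w_toy, alpha=0.5, beta=0.3, gamma=0.2)
```

In this toy run the second area is updated from the new value of the first area rather than its old one, which is the behavior Example 1 asks for.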
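Theorem 4. The maximal difference of cluster memberships using the SIM-PF and SIM2 models is:

$$MD = \left( \alpha\beta - \frac{\gamma}{A} \right) [N_0 (N_0 - C + 1)]^b\, (2 N_0 - C + 1)^d + \frac{2 (C - 1) C \beta\gamma}{A}\, N_0^{2b+d}. \tag{49}$$

Proof. Fix index k; we have

$$(u'_k)^{SIM2} - (u'_k)^{SIM\text{-}PF} = \sum_{j=1}^{k-1} w_{kj} \left( \beta\, u'_j - \frac{\gamma}{A}\, u_j \right). \tag{50}$$

For simplicity, we assume that $(u'_j)^{SIM2} = (u'_j)^{SIM\text{-}PF}$, $\forall j < k$. Eq. (50) is now re-written as

$$(u'_k)^{SIM2} - (u'_k)^{SIM\text{-}PF} = \sum_{j=1}^{k-1} w_{kj} \left[ \left( \alpha\beta - \frac{\gamma}{A} \right) u_j + \frac{\beta\gamma}{A} \sum_{i=1}^{C} (w_{ji}\, u_i) \right], \tag{51}$$

$$(u'_k)^{SIM2} - (u'_k)^{SIM\text{-}PF} = \sum_{j=1}^{C} u_j \left[ \left( \alpha\beta - \frac{\gamma}{A} \right) w_{kj} + \frac{\beta\gamma}{A} \sum_{i=1}^{k-1} (w_{ki}\, w_{ij}) \right]. \tag{52}$$

Applying the Hölder inequality to Eq. (52), we get

$$(u'_k)^{SIM2} - (u'_k)^{SIM\text{-}PF} \le \sum_{j=1}^{C} u_j \times \sum_{j=1}^{C} \left[ \left( \alpha\beta - \frac{\gamma}{A} \right) w_{kj} + \frac{\beta\gamma}{A} \sum_{i=1}^{k-1} (w_{ki}\, w_{ij}) \right]. \tag{53}$$

Because of the property of the cluster memberships, $\sum_{j=1}^{C} u_j = 1$, and applying the Minkowski inequality to the right side of Eq. (53), we obtain

$$(u'_k)^{SIM2} - (u'_k)^{SIM\text{-}PF} \le \sum_{j=1}^{C} \left( \alpha\beta - \frac{\gamma}{A} \right) w_{kj} + \frac{\beta\gamma}{A} \sum_{j=1}^{C} \sum_{i=1}^{k-1} (w_{ki}\, w_{ij}). \tag{54}$$

Applying the result in Eq. (13) of Theorem 1, we get

$$\sum_{j=1}^{C} \left( \alpha\beta - \frac{\gamma}{A} \right) w_{kj} \le \left( \alpha\beta - \frac{\gamma}{A} \right) (pop_k \cdot N_0)^b\, (pop_k + N_0)^d. \tag{55}$$

Following Consequence 1,

$$\sum_{j=1}^{C} \sum_{i=1}^{k-1} (w_{ki}\, w_{ij}) \le 2 N_0^{2b+d}\, (k - 1)\, C. \tag{56}$$

From Eqs. (54)–(56), we obtain the maximal difference of the kth cluster membership between the two models SIM2 and SIM-PF:

$$(u'_k)^{SIM2} - (u'_k)^{SIM\text{-}PF} \le \left( \alpha\beta - \frac{\gamma}{A} \right) (pop_k \cdot N_0)^b\, (pop_k + N_0)^d + \frac{2 \beta\gamma\, N_0^{2b+d}\, (k - 1)\, C}{A}. \tag{57}$$

Thus, the maximal difference of cluster memberships using the SIM-PF and SIM2 models is

$$\left\| (u')^{SIM2} - (u')^{SIM\text{-}PF} \right\| = \max_{k = 1, \dots, C} \left| (u'_k)^{SIM2} - (u'_k)^{SIM\text{-}PF} \right| \le MD, \tag{58}$$

where MD is stated in Eq. (49). □
Theorem 4 gives us an estimate of the difference of cluster memberships between the two models. Now, let us see some other definitions below.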
Definition 3. Vector $a = (a_1, a_2, \dots, a_n)$ is said to be $\tau$-larger than $b = (b_1, b_2, \dots, b_n)$ if

$$(a > b)_\tau \Leftrightarrow \#\{a_i > b_i\} = \tau, \quad \forall i = 1, \dots, n, \tag{59}$$

where $\tau$ is an integer belonging to [0, n]. If $\tau = n$ then a is completely larger than b, and this relationship is denoted as $a \succ b$.
Definition 4 (Ordered relationship of vectors). Vector $a = (a_1, a_2, \dots, a_n)$ is said to be larger than vector $b = (b_1, b_2, \dots, b_n)$, denoted by $a > b$, if there exist two maximal values $\tau_1$, $\tau_2$ that satisfy $(a > b)_{\tau_1}$ and $(b > a)_{\tau_2}$ and $\tau_1 > \tau_2$.
Property 4.

$$\begin{aligned} &(a)\ (a > b)_0 \Leftrightarrow \exists i \le n: (b > a)_i; \\ &(b)\ a = b \Leftrightarrow (a > b)_0 \wedge (b > a)_0; \\ &(c)\ 0 \le \tau_1 + \tau_2 \le n; \\ &(d)\ a \succ b = (a > b)_n. \end{aligned} \tag{60}$$

Example 5. Given a = (5, 8, 3, 6, 9) and b = (5, 8, 1, 7, 2), we easily recognize that $(a > b)_2$ and $(b > a)_1$. Moreover, $a > b$.
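The counting relation of Definitions 3 and 4 is easy to state in code. The helper names below (`count_larger`, `is_larger`, `is_completely_larger`) are illustrative only; the vectors are those of Example 5.

```python
def count_larger(a, b):
    """Number of positions i where a_i > b_i (the count used in Definition 3)."""
    return sum(1 for x, y in zip(a, b) if x > y)


def is_larger(a, b):
    """Definition 4: a > b when a wins in more positions than b does."""
    return count_larger(a, b) > count_larger(b, a)


def is_completely_larger(a, b):
    """a is completely larger than b when a wins in every position."""
    return count_larger(a, b) == len(a)


a = (5, 8, 3, 6, 9)
b = (5, 8, 1, 7, 2)
wins_a = count_larger(a, b)   # positions 3 and 5
wins_b = count_larger(b, a)   # position 4
```

Ties, such as the first two positions here, count for neither vector, which is why the two counts need not sum to n.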
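Definition 5. A sequence of vectors $\{X_k,\ k = 1, \dots, M\}$ is said to be monotone increasing if $X_{k+1} > X_k$, $\forall k = 1, \dots, M - 1$. If $X_{k+1}$ is completely larger than $X_k$ ($X_{k+1} \succ X_k$) for all such k, then $\{X_k,\ k = 1, \dots, M\}$ is called completely monotone increasing.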
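Theorem 5. The conditions on the parameters to ensure that the sequence $u'_k$ of the SIM2 model is completely monotone increasing are:

$$\beta \ge \max_{k = 2, \dots, C}\ \max_{i = k+1, \dots, C} \frac{w_{k-1,i} - w_{k,i}}{\sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j})\, w_{j,i}}, \tag{61}$$

$$\frac{\gamma}{A \alpha} \ge \max_{k = 2, \dots, C}\ \max_{i = 1, \dots, k-2} \frac{w_{k-1,i} - w_{k,i}}{\sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j})\, w_{j,i}}, \tag{62}$$

$$\frac{\beta\gamma}{A \alpha} \ge \max_{k = 2, \dots, C} \frac{1 - \beta\, w_{k,k-1}}{\sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j})\, w_{j,k-1}}, \tag{63}$$

$$\alpha \ge \frac{\gamma}{A} \max_{k = 2, \dots, C} \left[ w_{k-1,k} - \beta \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j})\, w_{j,k} \right]. \tag{64}$$

Proof. From Definition 2, we have:

$$u'_k - u'_{k-1} = \alpha\,(u_k - u_{k-1}) + \beta \sum_{j=1}^{k-2} (w_{k,j} - w_{k-1,j})\, u'_j + \beta\, w_{k,k-1}\, u'_{k-1} + \frac{\gamma}{A} \sum_{j=k}^{C} (w_{k,j} - w_{k-1,j})\, u_j, \quad \forall k = 2, \dots, C, \tag{65}$$

$$\Rightarrow\ u'_k - u'_{k-1} = \alpha\,[u_k - (1 - \beta\, w_{k,k-1})\, u_{k-1}] + \frac{\beta\gamma}{A} \sum_{j=1}^{k-1} \sum_{i=1}^{C} (w_{k,j} - w_{k-1,j})\, w_{j,i}\, u_i + \alpha\beta \sum_{j=1}^{k-2} (w_{k,j} - w_{k-1,j})\, u_j + \frac{\gamma}{A} \sum_{j=k}^{C} (w_{k,j} - w_{k-1,j})\, u_j. \tag{66}$$

From Eq. (66), we obtain the linear combination in Eqs. (67) and (68):

$$u'_k - u'_{k-1} = h_1 u_1 + h_2 u_2 + \dots + h_{k-2} u_{k-2} + h_{k-1} u_{k-1} + h_k u_k + h_{k+1} u_{k+1} + \dots + h_C u_C \ge 0, \tag{67}$$

$$h_i = \begin{cases} \dfrac{\beta\gamma}{A} \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j})\, w_{j,i} + \alpha\beta\,(w_{k,i} - w_{k-1,i}), & i = 1, \dots, k-2, \\[2mm] \dfrac{\beta\gamma}{A} \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j})\, w_{j,k-1} - \alpha\,(1 - \beta\, w_{k,k-1}), & i = k-1, \\[2mm] \dfrac{\beta\gamma}{A} \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j})\, w_{j,k} + \alpha + \dfrac{\gamma}{A}\,(w_{k,k} - w_{k-1,k}), & i = k, \\[2mm] \dfrac{\beta\gamma}{A} \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j})\, w_{j,i} + \dfrac{\gamma}{A}\,(w_{k,i} - w_{k-1,i}), & i = k+1, \dots, C. \end{cases} \tag{68}$$

Inequality (67) holds when $h_i \ge 0$, $\forall i = 1, \dots, C$. Using Eq. (68), we obtain the conditions on the parameters for $u'_k - u'_{k-1} \ge 0$ where k is fixed. Repeating the same process for the other $k = 2, \dots, C$, we get the results in Eqs. (61)–(64). □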
Theorem 6.
(a) If $u'_k$ is completely monotone increasing then it is monotone increasing.
(b) Conversely, if $u'_k$ is monotone increasing, and all parameters of the SIM2 model satisfy the conditions (61)–(64), then it is completely monotone increasing.
Proof.
(a) By Definition 5, $\{u'_k,\ k = 1, \dots, C\}$ is completely monotone increasing. Thus, $u'_{k+1} \succ u'_k$, $\forall k = 1, \dots, C - 1$. From Definition 4, $\exists\, \tau_1 = N \wedge \tau_2 = 0$: $(u'_{k+1} > u'_k)_N$ and $(u'_k > u'_{k+1})_0$. Applying Property 4, we get

$$u'_{k+1} \succ u'_k = (u'_{k+1} > u'_k)_N \wedge (u'_k > u'_{k+1})_0 \ \rightarrow\ u'_{k+1} > u'_k. \tag{69}$$

Performing the same proof for the other $k = 1, \dots, C - 1$, we obtain the conclusion.
(b) If $u'_k$ is monotone increasing then $\exists\, \tau_1 > \tau_2$: $(u'_{k+1} > u'_k)_{\tau_1}$ and $(u'_k > u'_{k+1})_{\tau_2}$. In order for $u'_k$ to be completely monotone increasing, the following conditions should hold:

$$\begin{cases} \#\{u'_{k+1} > u'_k\} = N, \\ \#\{u'_k > u'_{k+1}\} = 0. \end{cases} \tag{71}$$

The condition (71) is obtained if the parameters of the SIM2 model satisfy constraints (61)–(64). Therefore, we get the conclusion. □
Theorem 6 gives the relation between a monotone increasing sequence and a completely monotone increasing one. While the first result in 6(a) is quite neat, the second one in 6(b) is weaker. In fact, if the parameters satisfy conditions (61)–(64) then $u'_k$ is completely monotone increasing by Theorem 5, so the condition "$u'_k$ is monotone increasing" is no longer needed. This remark tells us that it is impossible to pass from a monotone increasing sequence to a completely monotone increasing one without the conditions on the parameters of the SIM2 model.
Property 5. Assume that $u'_k$ is a completely monotone increasing sequence. Then, the limits of some special means are shown below.
(a) Modified Harmonic Mean:
lim
k!1
C
1 þPC k¼110 k
0
@
1
(b) Root Mean Square:

$$\lim_{k \to \infty} \left( \sqrt{ \frac{\sum_{k=1}^{C} (u'_k)^2}{C} } \right) = 1;$$

(c) Geometric Mean:

$$\lim_{k \to \infty} \left( \prod_{k=1}^{C} u'_k \right) = 1.$$

In what follows, we look for the conditions on the parameters that ensure $u_k \ge u'_k$ for a given area k. This is quite important because such conditions show the trend of all areas on the map. For example, if $u_k \ge u'_k$ for most $k = 1, \dots, C$, all elements may concentrate on the last few areas rather than the first ones. Outliers can then occur and affect the final results. Thus, a suitable selection of parameters is required to avoid such cases and to guarantee the best quality of the algorithm.
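Theorem 7. For a given area k, the condition on the parameters to ensure $u_k \ge u'_k$ is:

$$\frac{\max\{\alpha\beta,\ \gamma/A\}}{1 - \alpha} \max_{i = 1, \dots, C} \{w_{ki}\} + \frac{\beta\gamma}{A (1 - \alpha)} \max_{j = 1, \dots, k-1}\ \max_{i = 1, \dots, C} \{w_{kj}\, w_{ji}\} \le \frac{1}{(k - 1)\, C}. \tag{73}$$

Proof. From Definition 2, we have

$$u_k \ge u'_k \tag{74}$$

$$\Leftrightarrow\ u_k (1 - \alpha) \ge \alpha\beta \sum_{j=1}^{k-1} w_{kj}\, u_j + \frac{\gamma}{A} \sum_{j=k}^{C} w_{kj}\, u_j + \frac{\beta\gamma}{A} \sum_{j=1}^{k-1} \sum_{i=1}^{C} w_{kj}\, w_{ji}\, u_i. \tag{75}$$

In order to obtain inequality (75), the following condition should hold:

$$u_k (1 - \alpha) \ge \max\left\{ \alpha\beta,\ \frac{\gamma}{A} \right\} \sum_{i=1}^{C} w_{ki}\, u_i + \frac{\beta\gamma}{A} \sum_{j=1}^{k-1} \sum_{i=1}^{C} w_{kj}\, w_{ji}\, u_i, \tag{76}$$

$$\Leftrightarrow\ 1 \ge \frac{\max\{\alpha\beta,\ \gamma/A\}}{1 - \alpha} \sum_{i=1}^{C} w_{ki}\, \frac{u_i}{u_k} + \frac{\beta\gamma}{A (1 - \alpha)} \sum_{j=1}^{k-1} \sum_{i=1}^{C} w_{kj}\, w_{ji}\, \frac{u_i}{u_k}. \tag{77}$$

Because all components in the sums are positive, each one should satisfy the condition below to obtain inequality (77):

$$\frac{1}{(k - 1)\, C} \ge \frac{\max\{\alpha\beta,\ \gamma/A\}}{1 - \alpha} \max_{i = 1, \dots, C} \left\{ w_{ki}\, \frac{u_i}{u_k} \right\} + \frac{\beta\gamma}{A (1 - \alpha)} \max_{j = 1, \dots, k-1}\ \max_{i = 1, \dots, C} \left\{ w_{kj}\, w_{ji}\, \frac{u_i}{u_k} \right\}. \tag{78}$$

Due to the following fact,

$$\lim_{k, i \to \infty} \frac{u_i}{u_k} = 1, \tag{79}$$

we can eliminate the cluster memberships in inequality (78) and get the result in (73). □
The findings in Section 2 help us understand the principal characteristics of the SIM2 model and are the basis for the main algorithm, which is presented in the next section.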
3 The modified IPFGWC algorithm
In this section, we integrate the SIM2 model with the main part of IPFGWC to form the new algorithm named Modified IPFGWC (MIPFGWC). Details of this algorithm are shown below.
Input: Geo-demographic data X. The number of elements (clusters): N (C). The dimension of the dataset r. Threshold $\varepsilon$ and other parameters m, $\eta$, $\tau$, $a_i$ ($i = 1, \dots, 3$), $\gamma_j$ ($j = 1, \dots, C$). Geographic parameters $\alpha$, $\beta$, $\gamma$, a, b, c, d.
Output: Final membership values $u'_k$ and centers $V^{(t+1)}$.
MIPFGWC Algorithm:
Step 1: Set the number of clusters C, threshold $\varepsilon > 0$ and other parameters such as $m, \eta, \tau > 1$; $a_i > 0$ ($i = 1, \dots, 3$); $\gamma_j$ ($j = 1, \dots, C$) as in the IPFGWC algorithm [8].
Step 2: Initialize the centers of clusters $V_j$, $j = 1, \dots, C$, at t = 0.
Step 3: Set the geographic parameters $\alpha$, $\beta$, $\gamma$, a, b, c, d satisfying condition (47).
Step 4: Use the formulas (80)–(82) to calculate the membership values, the hesitation level and the typicality values, respectively.
Step 5: Perform geographic modifications through Eqs. (4) and (46).
Step 6: If $u'_k$ is a completely monotone increasing sequence (Theorem 5) or $u_k \ge u'_k$ for most $k = 1, \dots, C$ (Theorem 7), then conclude that there is no suitable solution for the given geographic parameters. Otherwise, go to Step 7.
Step 7: Calculate the centers of clusters at t + 1 by Eq. (83).

$$u_{kj} = \frac{1}{\sum_{i=1}^{C} \left( \dfrac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{m-1}}}, \quad k = 1, \dots, N;\ j = 1, \dots, C, \tag{80}$$

$$h_{kj} = \frac{1}{\sum_{i=1}^{C} \left( \dfrac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{\tau-1}}}, \quad k = 1, \dots, N;\ j = 1, \dots, C, \tag{81}$$

$$t_{kj} = \frac{1}{1 + \left( \dfrac{a_2 \|X_k - V_j\|^2}{\gamma_j} \right)^{\frac{1}{\eta-1}}}, \quad k = 1, \dots, N;\ j = 1, \dots, C, \tag{82}$$

$$V_j = \frac{\sum_{k=1}^{N} \left( a_1 u_{kj}^m + a_2 t_{kj}^\eta + a_3 h_{kj}^\tau \right) X_k}{\sum_{k=1}^{N} \left( a_1 u_{kj}^m + a_2 t_{kj}^\eta + a_3 h_{kj}^\tau \right)}, \quad j = 1, \dots, C. \tag{83}$$

Step 8: If the difference $\|V^{(t+1)} - V^{(t)}\| \le \varepsilon$ then stop the algorithm. Otherwise, assign $V^{(t)} = V^{(t+1)}$ and return to Step 4.
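The iterative skeleton of Steps 2, 4, 7 and 8 can be sketched as follows. This is a deliberately reduced illustration, not the full MIPFGWC: it keeps only the membership term (as if $a_1 = 1$, $a_2 = a_3 = 0$), assumes the usual fuzzy C-means exponent 2/(m − 1) for the membership formula, omits the geographic modification and the check of Steps 5 and 6, and measures the center shift as the largest per-center Euclidean move. All function names and the toy data are hypothetical.

```python
import math


def fcm_membership(X, V, m=2.0):
    """Membership matrix in the style of Eq. (80), assuming the exponent 2/(m-1)."""
    U = []
    for x in X:
        dists = [math.dist(x, v) for v in V]
        if 0.0 in dists:
            # The point coincides with a center: share full membership among such centers.
            hits = [1.0 if dk == 0.0 else 0.0 for dk in dists]
            U.append([h / sum(hits) for h in hits])
            continue
        U.append([1.0 / sum((dj / di) ** (2.0 / (m - 1.0)) for di in dists)
                  for dj in dists])
    return U


def update_centers(X, U, m=2.0):
    """Center update in the style of Eq. (83), membership term only."""
    dim = len(X[0])
    V = []
    for j in range(len(U[0])):
        wts = [row[j] ** m for row in U]
        total = sum(wts)
        V.append([sum(wt * x[t] for wt, x in zip(wts, X)) / total for t in range(dim)])
    return V


def run(X, V, m=2.0, eps=1e-3, max_iter=100):
    """Repeat the membership and center steps until the centers move by at most eps."""
    for _ in range(max_iter):
        U = fcm_membership(X, V, m)
        V_new = update_centers(X, U, m)
        shift = max(math.dist(p, q) for p, q in zip(V, V_new))
        V = V_new
        if shift <= eps:
            break
    return U, V


# Hypothetical one-dimensional data with two obvious groups.
X_toy = [[0.0], [1.0], [10.0], [11.0]]
U_toy, V_toy = run(X_toy, [[0.0], [5.0]])
```

In the full algorithm, the geographic modification of Step 5 would be applied to the memberships between the two calls inside the loop.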
4 Results
4.1 Experimental environment
In this part, we describe the experimental environment:
• Experimental tools: We have implemented the proposed algorithm (MIPFGWC) in addition to IPFGWC [8] and FGWC [7] in the C programming language and executed them on a PC with an Intel(R) Core(TM)2 Duo CPU T6570 @ 2.10 GHz (2 CPUs) and 2048 MB RAM, running Windows 7 Professional 32-bit.
• Experimental datasets: We use the following data: a small university dataset (Table 1) and a real dataset of socio-economic demographic variables from the United Nations Organization (UNO) [10]. The second dataset was used for the experiments in the article [8].
• Cluster validity measurement: We use the validity function of fuzzy clustering for spatial data, namely IFV [3]. This index was shown to be robust and stable when clustering spatial data. It was also used to evaluate the clustering quality in the article [8].

$$IFV = \frac{1}{C} \sum_{j=1}^{C} \left\{ \frac{1}{N} \sum_{k=1}^{N} u_{kj}^2 \left[ \log_2 C - \frac{1}{N} \sum_{k=1}^{N} \log_2 u_{kj} \right]^2 \right\} \times \frac{SD_{max}}{\sigma_D}, \tag{84}$$

$$SD_{max} = \max_{k \ne j} \|V_k - V_j\|^2, \tag{85}$$

$$\sigma_D = \frac{1}{C} \sum_{j=1}^{C} \left( \frac{1}{N} \sum_{k=1}^{N} \|X_k - V_j\|^2 \right). \tag{86}$$

The larger the IFV value, the better the clustering; the maximal IFV is said to yield the optimal partition of the dataset.
• Objective: We evaluate the clustering quality of those algorithms on two different datasets through the IFV index. Some comparisons of computational times are also considered.
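To make the validity index concrete, the sketch below computes IFV for a given partition. It assumes that all memberships are strictly positive (the logarithm is undefined at zero) and that the separation term is the largest squared distance between two distinct centers; the function name `ifv` and the toy data are hypothetical.

```python
import math


def ifv(X, V, U):
    """IFV validity index in the style of Eqs. (84)-(86); larger means a better partition."""
    N, C = len(X), len(V)
    total = 0.0
    for j in range(C):
        mean_log = sum(math.log2(U[k][j]) for k in range(N)) / N
        bracket = (math.log2(C) - mean_log) ** 2
        total += sum(U[k][j] ** 2 * bracket for k in range(N)) / N
    # Separation: largest squared distance between two distinct centers.
    sd_max = max(math.dist(V[i], V[j]) ** 2
                 for i in range(C) for j in range(C) if i != j)
    # Compactness: mean squared distance from every point to every center.
    sigma_d = sum(sum(math.dist(x, v) ** 2 for x in X) / N for v in V) / C
    return total / C * sd_max / sigma_d


# Hypothetical data: the same points and centers, scored with two different partitions.
X_toy = [[0.0], [1.0], [10.0], [11.0]]
V_toy = [[0.5], [10.5]]
crisp = [[0.99, 0.01], [0.99, 0.01], [0.01, 0.99], [0.01, 0.99]]
fuzzy = [[0.5, 0.5]] * 4
score_crisp = ifv(X_toy, V_toy, crisp)
score_fuzzy = ifv(X_toy, V_toy, fuzzy)
```

On this toy input the nearly crisp partition scores higher than the uniform one, in line with the "larger is better" reading of the index.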
4.2 Evaluations on the university dataset
4.2.1 Case 1
In this case, we divide the students into three groups, "High", "Medium" and "Low", according to all attributes. We use the two clustering algorithms MIPFGWC and IPFGWC with the parameters set as follows:
• $\varepsilon = 10^{-3}$, parameters $m = 3$; $\tau = 2$; $\eta = 2$; $a_1 = a_2 = a_3 = 1$; $\gamma_j = 1$ ($j = 1, \dots, C$),
• IPFGWC: geographic parameters a = b = 1, $\alpha = 0.5$, $\beta = 0.5$,
• MIPFGWC: geographic parameters a = b = c = d = 1, $\alpha = 0.5$, $\beta = 0.3$, $\gamma = 0.2$.
These parameters were used in the article [8] to evaluate the clustering quality of algorithms. The initial centers of clusters $V^{(0)}$ used for MIPFGWC and IPFGWC are

$$V^{(0)} = \begin{pmatrix} 17 & 12 & 8.6 & 0.3 \\ 24 & 26 & 7.2 & 0.5 \\ 36 & 47 & 6.0 & 0.8 \end{pmatrix}. \tag{87}$$

The results of the MIPFGWC algorithm with those configurations are shown in Eqs. (88) and (89). The computational time and the number of iteration steps of MIPFGWC are 0.021 s and 15, respectively. Similarly, the results of the IPFGWC algorithm are given in Eqs. (90) and (91). The computational time and the number of iteration steps of IPFGWC are 0.019 s and 20, respectively. Through Eqs. (84)–(86), we compare the IFV values of these algorithms and get the result in Eq. (92).
$$V^{(MIPFGWC)} = \begin{pmatrix} 19.306553 & 10.105724 & 7.374804 & 0.999004 \\ 17.479478 & 28.197003 & 7.735475 & 0.256288 \\ 28.539287 & 33.201849 & 7.386632 & 0.638537 \end{pmatrix}, \tag{88}$$

$$U^{(MIPFGWC)} = \begin{pmatrix} 0.023007 & 0.927891 & 0.049102 \\ 0.091148 & 0.384451 & 0.524402 \\ 0.001554 & 0.994834 & 0.003612 \\ 0.111198 & 0.261117 & 0.627685 \\ 0.379026 & 0.251436 & 0.369538 \\ 0.050533 & 0.886195 & 0.063272 \\ 0.999511 & 0.000320 & 0.000169 \\ 0.012554 & 0.041774 & 0.945672 \end{pmatrix}, \tag{89}$$

$$V^{(IPFGWC)} = \begin{pmatrix} 19.345955 & 10.096919 & 7.370893 & 0.999568 \\ 17.364905 & 27.968125 & 7.694930 & 0.263094 \\ 28.414465 & 34.814743 & 7.481142 & 0.557099 \end{pmatrix}, \tag{90}$$

$$U^{(IPFGWC)} = \begin{pmatrix} 0.044201 & 0.469462 & 0.486337 \\ 0.062008 & 0.228704 & 0.709288 \\ 0.034854 & 0.497449 & 0.467697 \\ 0.068561 & 0.184172 & 0.747267 \\ 0.213137 & 0.178216 & 0.608646 \\ 0.054508 & 0.458405 & 0.487087 \\ 0.499723 & 0.034214 & 0.466063 \\ 0.031181 & 0.132587 & 0.836232 \end{pmatrix}. \tag{91}$$
Thus, the clustering quality of MIPFGWC is better than that of IPFGWC. From Eqs. (89) and (91), we get another result.
Eq. (93) states that the maximal difference of cluster memberships using the SIM-PF and SIM2 models, calculated by experiments, is 1.89, which is smaller than the theoretical value MD stated in Theorem 4. Now, we verify the values of the weighting function of the MIPFGWC algorithm, shown in Eqs. (94)–(97).

$$w = \begin{pmatrix} 0 & 0.018119 & 0.009682 \\ 0.018119 & 0 & 0.060905 \\ 0.009682 & 0.060905 & 0 \end{pmatrix}, \tag{94}$$

$$w_1^{AVG} = 0.027801 < (pop_1 \cdot N_0)^b\,(pop_1 + N_0)^d = 160, \tag{95}$$

$$w_2^{AVG} = 0.079024 < (pop_2 \cdot N_0)^b\,(pop_2 + N_0)^d = 384, \tag{96}$$

$$w_3^{AVG} = 0.070587 < (pop_3 \cdot N_0)^b\,(pop_3 + N_0)^d = 160. \tag{97}$$

These results affirm that the average influence of an area on another one does not exceed the upper bound stated in Theorem 1.
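The comparison in Eqs. (95)–(97) can be re-traced numerically. The sketch below is built on assumptions: the text does not list the populations of the three areas, so the values 2, 4 and 2 with $N_0 = 8$ are hypothetical choices that happen to reproduce the stated bounds 160, 384 and 160 when b = d = 1. The weight matrix is the one given in Eq. (94).

```python
def influence_bound(pop_k, n0, b=1.0, d=1.0):
    """Right-hand side used in Eqs. (95)-(97): (pop_k * N0)^b * (pop_k + N0)^d."""
    return (pop_k * n0) ** b * (pop_k + n0) ** d


w_case1 = [
    [0.0, 0.018119, 0.009682],
    [0.018119, 0.0, 0.060905],
    [0.009682, 0.060905, 0.0],
]
pops = (2, 4, 2)   # assumption: not stated in the text, chosen to match the bounds
n0 = 8             # assumption: total population of the small dataset

row_sums = [sum(row) for row in w_case1]
bounds = [influence_bound(p, n0) for p in pops]
```

The row sums of the weight matrix match the three reported influence values, and each is several orders of magnitude below its bound, so the bound is very loose on this dataset.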
In Table 2, we calculate the sum of weights both by Properties 2 and 3 and by experiments. Results show that when the number of areas is small, the sum of weights is greater than when the number of areas is large. The experimental results in this case were obtained with C = 3 and are higher than those of Properties 2 and 3. Additionally, the difference between the results for 2b + d = 2 and 2b + d = 3 is small.
In what follows, we study the distribution of pattern sets and centers before and after running the two algorithms. Results are depicted in Figs. 5–8. From Fig. 5, we recognize that there is an outlier in the original pattern set. The MIPFGWC algorithm handles outliers better than IPFGWC: Fig. 7 shows that the outlier disappears in MIPFGWC whilst it still remains in IPFGWC (Fig. 6).
In IPFGWC, the patterns tend to move toward cluster "C" while those of MIPFGWC are evenly distributed in the data space. The lack of immigration effects, common boundaries between regions and updated cluster memberships in IPFGWC results in the abnormal distribution of patterns shown in Fig. 6. MIPFGWC remedies this limitation through the use of the SIM2 model and gives the distribution of patterns in Fig. 7.
In Fig. 8, the changes of centers after running the two algorithms are described. Intuitively, there is not much difference between the locations of these centers. This means that the SIM2 model has a greater impact on the distribution of patterns than on the centers. All results above show that MIPFGWC obtains better clustering quality than IPFGWC.
4.2.2 Case 2
In order to verify the consistency and robustness of the proposed algorithm, we change the parameters of the algorithms as follows:
• IPFGWC: geographic parameters $\alpha = 0.7$, $\beta = 0.3$,
• MIPFGWC: geographic parameters $\alpha = 0.7$, $\beta = 0.2$, $\gamma = 0.1$.
Other parameters are kept as in Case 1. Results of MIPFGWC with the initial centers of clusters $V^{(0)}$ in Eq. (87) are:
$$V^{(MIPFGWC)} = \begin{pmatrix} 20.986869 & 10.595425 & 7.231345 & 0.999561 \\ 17.637678 & 27.865067 & 7.649198 & 0.282628 \\ 31.018697 & 46.161775 & 8.406729 & 0.095751 \end{pmatrix}, \tag{98}$$

Table 1
The university dataset.

$$U^{(MIPFGWC)} = \begin{pmatrix} 0.028586 & 0.948662 & 0.022752 \\ 0.068931 & 0.273096 & 0.657973 \\ 0.001370 & 0.997771 & 0.000859 \\ 0.000859 & 0.036672 & 0.946739 \\ 0.516925 & 0.296017 & 0.187058 \\ 0.045229 & 0.938912 & 0.015859 \\ 0.983921 & 0.013168 & 0.002911 \\ 0.161933 & 0.493241 & 0.344827 \end{pmatrix}. \tag{99}$$
The computational time and the number of iteration steps of MIPFGWC are 0.008 s and 9, respectively. Similarly, the results of the IPFGWC algorithm are given in Eqs. (100) and (101). The computational time and the number of iteration steps of IPFGWC are 0.005 s and 9, respectively. The IFV values of these algorithms are calculated in Eq. (102).

$$V^{(IPFGWC)} = \begin{pmatrix} 20.846367 & 10.502604 & 7.243042 & 0.999342 \\ 17.660246 & 27.829394 & 7.643995 & 0.282196 \\ 30.688095 & 46.148977 & 8.380672 & 0.094085 \end{pmatrix}, \tag{100}$$

$$U^{(IPFGWC)} = \begin{pmatrix} 0.069170 & 0.665294 & 0.265535 \\ 0.063570 & 0.207881 & 0.728549 \\ 0.052247 & 0.698536 & 0.249217 \\ 0.020084 & 0.058535 & 0.921382 \\ 0.375461 & 0.241883 & 0.382656 \\ 0.079164 & 0.660991 & 0.259845 \\ 0.690987 & 0.058643 & 0.250370 \\ 0.139263 & 0.364975 & 0.495762 \end{pmatrix}. \tag{101}$$
These results show that the clustering quality of MIPFGWC is better than that of IPFGWC. Additionally, the IFV values of both algorithms are larger than those of the previous case given in Eq. (92). The computational times of these algorithms are also smaller than those of the previous case. This suggests that a larger value of the parameter $\alpha$, as used in this case compared with the previous one, enhances the clustering quality and reduces the computational time of MIPFGWC.
Table 2
The sum of weights in Case 1.
Columns: Sum of weights | Results of Properties 2 and 3 (2b + d = 2; 2b + d = 3) | Experimental results
Rows: $\sum_{k=1}^{C} \sum_{j=1}^{C} w_{kj}$; $\sum_{k=1}^{C} \sum_{i=1}^{C} \sum_{j=1}^{C} (w_{ki}\, w_{kj})$