Applied Soft Computing
journal homepage: www.elsevier.com/locate/asoc
VNU University of Science, Vietnam National University, Viet Nam
Article info
Article history:
Received in revised form 14 February 2014
Available online xxx
Keywords:
Context clustering
Fuzzy clustering type-2
Geo-demographic analysis
Heuristic algorithms
Particle swarm optimization
Abstract
Geo-Demographic Analysis, one of the most interesting interdisciplinary research topics between Geographic Information Systems and Data Mining, plays a very important role in policy decisions, population migration and service distribution. Among the soft computing methods used for this problem, clustering is the most popular because it has advantages over the others in processing time, quality of results and memory usage. Nonetheless, the state-of-the-art clustering algorithm, FGWC, has low clustering quality since it was constructed on the basis of traditional fuzzy sets. In this paper, we present a novel interval type-2 fuzzy clustering algorithm, deployed on an extension of the traditional fuzzy sets, namely Interval Type-2 Fuzzy Sets, to enhance the clustering quality of FGWC. Additional techniques, such as the interval context variable, Particle Swarm Optimization and parallel computing, are attached to speed up the algorithm. The experimental evaluation through various case studies shows that the proposed method obtains better clustering quality than some best-known ones.
© 2014 Elsevier B.V. All rights reserved.
Introduction
Geo-Demographic Analysis (GDA), which was defined as “the analysis of spatially referenced geo-demographic and lifestyle data” [33], is one of the most interesting interdisciplinary research topics between Geographic Information Systems and Data Mining, and is widely used in the public and private sectors for the planning and provision of products and services. Various examples show the need for GDA in practical applications. Shelton et al. [34] performed a geo-demographic classification of mortality patterns in Britain and mapped the main causes of death in England and Wales from 1981 to 2000 to geographical locations, so that decision makers could better understand the distribution of major causes. Michael [23] conducted a GDA to gather community attitudes on the future growth of Werri Beach and Gerringong, NSW (Nelson), Australia, focusing primarily on what actions the Council should take to manage population growth within existing neighborhoods. Páez et al. [29] presented a geo-demographic framework using data from Montreal, Canada to identify potential commercial partnerships that could exploit the characteristics of smart cards. Campbell et al. [8] provided a detailed GDA of over 37,000 gifted and talented students admitted to the National Academy for Gifted and Talented Youth in England in 2003/2005 and showed that the National Academy had nonetheless reached significant numbers of students in the poorest areas, something over 3000 students, and 8% of students identified as gifted and talented at this stage. Day et al. [11] conducted a survey that determined clusters of nations grouped by health outcomes, comparing life expectancy and a range of health system indicators within and between clusters in order to provide sensible groupings for international comparisons. Other typical applications of GDA, such as the spatial and socio-economic determinants of tuberculosis, urban green space accessibility for different ethnic and religious groups, and the investigation of children's disorders, can be found in [1,6,9,32,36,37].

∗ Correspondence to: 334 Nguyen Trai, Thanh Xuan, Hanoi 010000, Viet Nam.
Tel.: +84 904171284; fax: +84 0438623938.
E-mail addresses: sonlh@vnu.edu.vn, chinhson2002@gmail.com

In order to perform GDA, soft computing methods such as Principal Component Analysis (PCA), Self-Organizing Maps (SOM) and clustering are often used. Walford [41] described a method using PCA to study the spatial distribution of the 1991 census data scores. However, the results of PCA depend on the scaling of the variables, and its applicability is limited by certain assumptions made in the derivation. Loureiro et al. [21] introduced the use of SOM as an adequate tool for GDA. Based on the variations in edge length along a path between two units on the SOM, the authors presented a new way of calculating the fuzzy memberships of a fuzzy clustering method. However, it requires a lot of memory to store all neurons and weights; what is more, the speed of training
http://dx.doi.org/10.1016/j.asoc.2014.04.025
1568-4946/© 2014 Elsevier B.V All rights reserved.
clustering is often used instead because it has many advantages over the others, such as fast processing time, quality of results and memory usage. Our previous work [36] gave an overview of some clustering methods for GDA, such as Fuzzy C-Means (FCM) [3], agglomerative hierarchical clustering [11], Neighborhood Effects (NE) [13], K-Means clustering [20] and Fuzzy Geographically Weighted Clustering (FGWC) [24]. Among them, FGWC was considered the most favored algorithm and was used in most research articles on GDA applications.
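$$u'_k = \alpha \times u_k + \beta \times \frac{1}{A} \times \sum_{j=1}^{c} w_{kj} \times u_j \tag{1}$$

$$\alpha + \beta = 1 \tag{2}$$

$$w_{kj} = \frac{(pop_k \times pop_j)^{b}}{d_{kj}^{a}} \tag{3}$$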
FGWC calculates the influence of one area upon another by Eqs. (1)–(3), where u'_k (u_k) is the new (old) cluster membership of area k. The two parameters α and β are scaling variables; pop_k and pop_j are the populations of areas k and j, respectively, and d_kj is the distance between k and j. The numbers a and b are user-definable parameters. A is a factor that scales the “sum” term; it is calculated across all clusters and ensures that the memberships of a given area sum to one over all clusters.
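The geographic modification of Eqs. (1)–(3) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `fgwc_update`, the list-based data layout, and the choice to sum the neighbour term over the other areas (with A taken as the total of that term over all clusters) are assumptions made here, and α + β = 1 is assumed so that each row of memberships still sums to one.

```python
def fgwc_update(u, pop, dist, alpha=0.7, beta=0.3, a=1.0, b=1.0):
    """Apply one geographic modification pass to a membership matrix.

    u[k][c]    : old membership of area k in cluster c (each row sums to 1)
    pop[k]     : population of area k
    dist[k][j] : distance between areas k and j (only used for j != k)
    Assumes alpha + beta == 1. All names are illustrative.
    """
    n_areas, n_clusters = len(u), len(u[0])
    new_u = []
    for k in range(n_areas):
        # Eq. (3): weight between area k and every other area j.
        w = [
            (pop[k] * pop[j]) ** b / dist[k][j] ** a if j != k else 0.0
            for j in range(n_areas)
        ]
        # Weighted memberships of the neighbours, one value per cluster.
        neighbour = [
            sum(w[j] * u[j][c] for j in range(n_areas))
            for c in range(n_clusters)
        ]
        # A rescales the "sum" term so that it adds up to one over clusters.
        scale = sum(neighbour)
        if scale == 0.0:
            # No neighbour influence: keep the old memberships unchanged.
            new_u.append(list(u[k]))
            continue
        new_u.append([
            alpha * u[k][c] + beta * neighbour[c] / scale
            for c in range(n_clusters)
        ])
    return new_u
```

With α = 0.7 and β = 0.3, an area keeps 70% of its own membership and takes the remaining 30% from its neighbours, with large, close neighbours weighing most.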
Although FGWC is the most popular clustering algorithm for GDA, it still has limitations in computing speed and clustering quality. One of our previous works [35] presented a method called CFGWC that accelerates FGWC by attaching context variable terms. Other works [36,37] showed preliminary results in improving the clustering quality of FGWC through intuitionistic fuzzy sets and geographical spatial effects. Our focus in this work is therefore to continue with the clustering quality problem of FGWC. FGWC was constructed on the basis of the traditional fuzzy sets, which have limitations in their membership degrees, as pointed out by Mendel [25]. This motivates us to deploy FGWC on an extension of the traditional fuzzy sets in order to enhance the clustering quality of the algorithm. We first explain why clustering algorithms on the traditional fuzzy sets have low clustering quality.
According to Mendel [25], the traditional fuzzy sets cannot handle exceptional cases where the membership degrees are not crisp values but fuzzy ones. For example, after examining all symptoms, a doctor may conclude that the possibility of a patient having tuberculosis is from 60 to 80 percent. Even with modern medical machines, the doctor cannot give an exact number for that possibility. This shows that crisp membership values cannot model some real-world situations and should be replaced with fuzzy ones. Rhee [30] stated that using the traditional fuzzy sets often results in bad clustering quality because uncertainties in the distance measure, fuzzifier, centers, prototype and initialization of prototype parameters can create imperfect representations of the pattern sets. For example, when pattern sets contain clusters of different volume or density, patterns lying on the left side of a cluster may contribute more to another cluster than to this one, so choosing a suitable value for the fuzzifier is difficult. A bad selection can yield undesirable clustering results for pattern sets that include noise. Because of those limitations, some preliminary results on deploying fuzzy clustering methods on an extension of the traditional fuzzy sets, called Interval Type-2 Fuzzy Sets (IT2FS), have been introduced. Mendel [25] defined IT2FS as follows:

$$\tilde{A} = \left\{ \left( (x, u),\ \mu_{\tilde{A}}(x, u) = 1 \right) \;\middle|\; \forall x \in A,\ \forall u \in J_X \subseteq [0, 1] \right\} \tag{4}$$
From Eq. (4), we recognize that IT2FS is a generalization of the traditional fuzzy sets, since IT2FS reduces to the traditional fuzzy sets when there is no uncertainty in the third dimension. Based on this definition, several interval type-2 fuzzy clustering algorithms have been introduced, such as those of Hwang and Rhee [15] and Rhee [30]. Specifically, Hwang and Rhee [15] presented a type-2 fuzzy clustering algorithm to solve the problem of choosing distance measures in the FCM algorithm, taking the difference of each type-2 membership function area with the corresponding type-1 membership value. Rhee [30] presented an improvement of this algorithm that uses two different fuzzifier values to handle the uncertainty of the fuzzifier in FCM. Other variants of interval type-2 fuzzy clustering algorithms can be found in [2,10,12,14,17,19,22,26,27,31,42].
Motivated by those results, in this article we present a novel interval type-2 fuzzy clustering algorithm called Context Fuzzy Geographically Weighted Clustering on IT2FS, or CFGWC2 for short, to enhance the clustering quality of FGWC. CFGWC2 differs from the interval type-2 fuzzy clustering algorithms above in two ways. Firstly, CFGWC2 is specially designed for the GDA problem, which requires building geographical spatial effects into the algorithm itself. Secondly, it is equipped with additional techniques to speed up the whole algorithm, namely:
• An interval context variable, which extends the single context variable of Pedrycz [28], is proposed and used to clarify the clustering results and accelerate the computing speed.
• In order to avoid bad initialization, which may occur in other interval type-2 fuzzy clustering algorithms, and to converge quickly to (sub-)optimal solutions, a meta-heuristic optimization method, Particle Swarm Optimization (PSO) [18], is used to determine good initial centers for CFGWC2.
• Since the context values in the interval context variable can be processed simultaneously in CFGWC2, a parallel computing technique is adapted to CFGWC2 to reduce the computational costs.
The items in those bullets are our contributions in this paper. The proposed algorithm is implemented and compared with some relevant methods in terms of clustering quality to verify its efficiency.
The rest of this paper is organized as follows. Section “The proposed methodology” elaborates the proposed method in detail, including the additional techniques one after another. The numerical experiments through various case studies and the discussions are given in Section “Results”. Finally, Section “Conclusions” gives the conclusions and outlines future work.
The proposed methodology
In the previous section, we introduced CFGWC2 as an interval type-2 fuzzy clustering algorithm for the GDA problem, equipped with additional techniques such as the interval context variable, PSO and parallel computing. Since those techniques are necessary for the description of CFGWC2, they are presented first in Sections “Using PSO for the determination of initial centers” and “The interval context”. The CFGWC2 algorithm, accompanied by the parallel computing mechanism, is described in Section “The CFGWC2 algorithm”.
Using PSO for the determination of initial centers
This section describes the technique that finds good initial centers for clustering algorithms by PSO. The idea is to produce a preliminary classification of the original pattern set so that the “temporal” cluster results can be used to orient the classification in the main algorithm. The objective function is shown in Eq. (5), and its constraints are given in Eqs. (6)–(7):
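$$J = \sum_{k=1}^{N} \sum_{j=1}^{C} \left\| X_k - V_j \right\|^2 \tag{5}$$

$$\min_{\substack{j = \overline{1,C} \\ j \neq i}} \left\| V_i - V_j \right\| > \max_{\substack{s = \overline{1,POP(i)} \\ X_s \in Cluster(i)}} \left\| X_s - V_i \right\|, \quad i = \overline{1,C} \tag{6}$$

$$\left| \{ Cluster(i) \} \right| \le \varepsilon_1 \quad \text{where } POP(i) = 1 \text{ and } i = \overline{1,C} \tag{7}$$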
Constraint (6) requires that all clusters are separated from each other. In other words, the minimal distance from a cluster's center to the other centers must be larger than the maximal distance from this center to the data points in its cluster. POP(i) is the population, i.e. the number of patterns, in the cluster Cluster(i). Constraint (7) limits the number of outliers in the result: the number of outliers is not greater than a pre-defined threshold ε1.
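As an illustration of how constraints (6) and (7) could be checked for a candidate partition, consider the sketch below. The function name `check_constraints` and the data layout are hypothetical, and an outlier is read here as a cluster whose population is exactly one, following Eq. (7).

```python
import math

def check_constraints(clusters, centers, eps1):
    """Return (separated, few_outliers) for a candidate partition.

    clusters[i] : list of points (coordinate tuples) assigned to Cluster(i)
    centers[i]  : center V_i of Cluster(i); at least two centers are expected
    eps1        : allowed number of single-point clusters, as in Eq. (7)
    """
    n_clusters = len(centers)
    separated = True
    for i in range(n_clusters):
        if not clusters[i]:
            continue  # an empty cluster has no radius to compare
        # Left-hand side of Eq. (6): distance to the closest other center.
        nearest = min(
            math.dist(centers[i], centers[j])
            for j in range(n_clusters) if j != i
        )
        # Right-hand side of Eq. (6): distance to the farthest member.
        radius = max(math.dist(x, centers[i]) for x in clusters[i])
        if nearest <= radius:
            separated = False
    # Eq. (7): count clusters whose population POP(i) equals one.
    outliers = sum(1 for members in clusters if len(members) == 1)
    return separated, outliers <= eps1
```

Two compact clusters that lie far apart satisfy both constraints, whereas a cluster whose farthest member lies beyond the nearest foreign center violates (6).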
For problem (5)–(7), we use PSO [18] to determine (sub-)optimal solutions, with the beginning population initialized with P particles. Each particle is a vector z = (z1, z2, ..., zC), where zi (i = 1, ..., C) is a pattern randomly chosen from the original pattern set. The velocities of zi are set to zero. Details of the algorithm are described by the pseudo-code in Table 1.
Notice that Eq. (9) is used solely in the first of the MaxStepPSO iterations. In the following iterations, the centers are calculated from the previous ones. Additionally, the value of MDi in Eq. (10) is set to zero if the cluster has no elements. The fitness value of a particle is calculated by Eq. (13), in which the two coefficients indexed 1 and 2 are the ratio constants. Eqs. (14)–(16) are used to update the velocities and positions of all particles. In those equations, c1 is the ratio that keeps the velocity intact, c2 is the ratio that changes the velocity following pBest, and c3 expresses the level of influence of gBest on the velocity. Since the role of zi (i = 1, ..., C) from the second iteration onwards is taken by the center Vi, the domain of the random number in Eq. (14) is set to (−1, 1) in order to keep the values of the centers bounded within the domain of the pattern set. After the number of iteration steps defined by MaxStepPSO, the solution has improved through the amelioration process performed after each “flying step” based on the fitness function. The output V(0) = (V1, V2, ..., VC) is taken from the particle holding the current gBest and is used as the initial center for CFGWC2.
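The velocity update of Eq. (14) can be sketched as below. Only Eq. (14) is taken from the text. The position update (adding the new velocity to the current position) is an assumption standing in for Eqs. (15)–(16), which are not reproduced here, and the function name `update_particle` and the `rng` parameter are illustrative.

```python
import random

def update_particle(z, velocity, p_best, g_best,
                    c1=0.2, c2=0.3, c3=0.5, rng=random):
    """Return (new_z, new_velocity) for one particle.

    z, velocity, p_best, g_best are lists of C centers, each a list of r
    coordinates. rng must provide uniform(a, b).
    """
    new_z, new_velocity = [], []
    for j in range(len(z)):
        z_row, v_row = [], []
        for d in range(len(z[j])):
            # Eq. (14): inertia + pull towards pBest + pull towards gBest.
            v = (c1 * velocity[j][d]
                 + c2 * rng.uniform(-1, 1) * (p_best[j][d] - z[j][d])
                 + c3 * rng.uniform(-1, 1) * (g_best[j][d] - z[j][d]))
            v_row.append(v)
            # Assumed position update: move the center by its new velocity.
            z_row.append(z[j][d] + v)
        new_z.append(z_row)
        new_velocity.append(v_row)
    return new_z, new_velocity
```

The default coefficients are the values (c1, c2, c3) = (0.2, 0.3, 0.5) used in Case 1 of the experiments. A particle that already sits on both pBest and gBest only keeps the c1 fraction of its old velocity.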
The interval context
In order to clarify the clustering results and accelerate the computing speed of clustering algorithms, a context variable can be used. According to Pedrycz [28], a (single) context variable in Y ⊂ X is defined through the map below:

$$A : Y \to [0, 1] \tag{17}$$

where f_k, the value of the map at the kth point, can be understood as the level of relation of the kth point to the supposed context. There are several ways to define the relation between f_k and the memberships of the kth point to the clusters, for instance using the sum operator (18) or the maximum operator (19):

$$\sum_{i=1}^{c} u_{ki} = f_k, \quad k = \overline{1,N} \tag{18}$$

$$\max_{i = \overline{1,c}} u_{ki} = f_k, \quad k = \overline{1,N} \tag{19}$$
In our previous work [35], we defined a context variable to narrow the original geographical dataset under conditions on certain dimensions. There are two reasons for using a context in the clustering algorithm. Firstly, a context variable helps to focus the results on the users' purposes. Because only the subset of the original dataset that is meaningful to the context is invoked, the result focuses on the area that really has many relevant points. Secondly, it improves the computing speed. A traditional clustering method not only takes a long time to process the whole dataset, but also makes the results less meaningful for the considered context. In contrast, context-based clustering methods both accelerate the computation and improve the semantics. Nevertheless, definition (17) has some limitations. Firstly, the importance of the kth point to the supposed context is decided by a single value f_k, which is not enough to reflect the variety of evaluations that different people give to this relation. In other words, one person may assume that the importance is only 0.3 while another affirms that it should be 0.6. Secondly, the old approach excludes the roles of the other data points in the context. This is a misleading assumption, since all characteristics always have relationships, either direct or indirect, with the others. Because of these limitations, we extend the use of context by introducing a new term: “the interval context variable”. An interval context is defined as f = [f1, f2], where each fi (i = 1, 2) is given through the map in Eq. (17). For the most important points, the value of f is high, e.g. [0.6, 0.8]. Similarly, the value of f for less important points is low, e.g. [0, 0.15]. This interval reflects the “fuzziness” of the context. In other words, we have performed a “fuzzy” step for the considered context. It helps us overcome the shortcomings of the single context variable and is suitable for CFGWC2, which works on IT2FS. Details of applying the interval context variable in CFGWC2 are presented in Section “The CFGWC2 algorithm”.
The CFGWC2 algorithm
We have given the general background of choosing initial centers by PSO in Section “Using PSO for the determination of initial centers” and the basic definition of the interval context in Section “The interval context”. Now, we use both of them, together with the parallel computing mechanism, in the main activity of the CFGWC2 algorithm. The mechanism of CFGWC2 is illustrated in Fig. 1.
According to Fig. 1, the parallel computing mechanism of CFGWC2 requires three machines, the first of which (Machine 1) is responsible for generating the initial centers for the remaining machines. The center values of Machine 2 and Machine 3 are different, since the stopping conditions of PSO are not identical. After MaxStepPSO/2 iteration steps, the first center V(0) is output and transferred to Machine 2, and the second center is sent to Machine 3 after MaxStepPSO iterations. This guarantees different results in Machine 2 and Machine 3 and is suitable for the determination of the upper and lower centers and membership degrees of clustering algorithms on IT2FS, i.e. U(1), V(1) (Machine 2) and U(2), V(2) (Machine 3) in Fig. 1.
In Machine 2 and Machine 3, we send the initial centers V(0) to a type-2 fuzzy clustering procedure accompanied by the interval
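Table 1
The pseudo-code of the PSO procedure.
Input:
- The pattern set X, whose dimension is r
- The number of elements (clusters): N (C)
- The number of particles in the beginning population: P
- The maximal number of iteration steps in PSO: MaxStepPSO
Output:
- Final center V(0)
Particle Swarm Optimization (PSO)

$$X_j \in Cluster(i) \Leftrightarrow \left\| z_i - X_j \right\| = \min \left\{ \left\| z_k - X_j \right\| \;\middle|\; k = \overline{1,C} \right\} \tag{8}$$

6: Calculate center V_i and the maximal distance from V_i to the cluster's elements:

$$V_i^{(l)} = \frac{1}{POP(i)} \sum_{X_s \in Cluster(i)} X_s^{(l)} \tag{9}$$

$$MD_i = \max_{s = \overline{1,POP(i)}} \left\| X_s - V_i \right\| = \max_{s = \overline{1,POP(i)}} \left\{ \sqrt{\sum_{l=1}^{r} \left( X_s^{(l)} - V_i^{(l)} \right)^2} \right\}, \quad X_s \in Cluster(i) \tag{10}$$

SEP(z) = |{Cluster(i)}|, where min over j = 1, ..., C, j ≠ i of ‖V_i − V_j‖ is compared with MD_i (11)

$$OUT(z) = \left| \{ Cluster(i) \} \right| \quad \text{where } POP(i) \le 1;\ i = \overline{1,C} \tag{12}$$

Fitness of z: (ratio constant 1) / (1 + SEP(z)) + (ratio constant 2) / (1 + OUT(z)) (13)

$$velocity_{ij} = c_1 \cdot velocity_{ij} + c_2 \cdot rand(-1, 1) \cdot (z_{pBest,j} - z_{ij}) + c_3 \cdot rand(-1, 1) \cdot (z_{gBest,j} - z_{ij}) \tag{14}$$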
context variable, called Context-FGWC2, to get the crisp center V(1) (Machine 2) and V(2) (Machine 3). If the difference between the initial and crisp centers is smaller than a threshold (Eps), or the maximal number of iterations (MaxStep) is reached, we stop the Context-FGWC2 procedure and take the crisp center and membership degree, i.e. U(1), V(1) (Machine 2) and U(2), V(2) (Machine 3), as the final results. Otherwise, we assign V(0) = V(1) in Machine 2 and V(0) = V(2) in Machine 3 and start a new iteration of Context-FGWC2 until the stopping conditions hold.
Once the upper and lower centers and membership degrees are calculated, we use a defuzzification method based on the Partition Coefficient and Exponential Separation (PCAES) [40] validity index to obtain the final center and membership degree as below:

$$V^{(*)} = \begin{cases} V^{(1)} & \text{if } PCAES(V^{(1)}) \ge PCAES(V^{(2)}) \\ V^{(2)} & \text{otherwise} \end{cases} \tag{20}$$
This index measures whether an identified cluster has the potential to be a good cluster. It was compared with other indexes such as Partition Entropy (PE), Partition Coefficient (PC), Fuzzy Hypervolume (FHV), Xie & Beni, Pal & Bezdek, Modification PC (MPC) and Zahid et al., and showed impressive results, even in a noisy environment. The definition of PCAES is given below:
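$$PCAES(C) = \sum_{j=1}^{C} PCAES[j] \tag{21}$$

where

$$PCAES[j] = \sum_{k=1}^{N} \frac{u_{kj}^2}{u_M} - \exp\left( -\frac{\min_{i \neq j} \left\{ \left\| V_j - V_i \right\|^2 \right\}}{\beta_T} \right) \tag{22}$$

$$u_M = \min_{1 \le i \le C} \left\{ \sum_{k=1}^{N} u_{ki}^2 \right\} \tag{23}$$

$$\beta_T = \frac{\sum_{l=1}^{C} \left\| V_l - \bar{V} \right\|^2}{C} \tag{24}$$

$\bar{V} = (\bar{V}_1, \bar{V}_2, \ldots, \bar{V}_r)$, where $\bar{V}_i$ ($i = \overline{1,r}$) is calculated as

$$\bar{V}_i = \frac{\sum_{l=1}^{C} V_{li}}{C} \tag{25}$$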
PCAES[j] measures the compactness and separation of cluster j (j = 1, ..., C). These values are summed up to calculate PCAES(C) ∈ (−C, C). A large PCAES(C) value means that each of these C clusters is compact and separated from the other clusters. It is the criterion used to choose the suitable clustering output. Depending on which center is selected, the related membership degree is used as the final membership U(*).
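A compact Python sketch of the PCAES index and of the selection rule in Eq. (20) is given below. It follows the definitions as stated in this section, with u_M taken as the minimum in Eq. (23) and the mean center and β_T assumed to be averages over the C centers. The helper names `_sq_dist`, `pcaes` and `pick_center` are illustrative.

```python
import math

def _sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def pcaes(u, centers):
    """PCAES index of a fuzzy partition.

    u[k][j]    : membership of pattern k in cluster j
    centers[j] : center V_j
    Expects at least two distinct centers and no empty fuzzy cluster,
    otherwise u_M or beta_T would be zero.
    """
    n_clusters, dim = len(centers), len(centers[0])
    # Per-cluster sum of squared memberships; u_M is the smallest one.
    compact = [sum(row[j] ** 2 for row in u) for j in range(n_clusters)]
    u_m = min(compact)
    # Mean of the centers and average squared spread around it.
    mean = [sum(v[d] for v in centers) / n_clusters for d in range(dim)]
    beta_t = sum(_sq_dist(v, mean) for v in centers) / n_clusters
    total = 0.0
    for j in range(n_clusters):
        # Squared distance from V_j to its closest other center.
        sep = min(
            _sq_dist(centers[j], centers[i])
            for i in range(n_clusters) if i != j
        )
        total += compact[j] / u_m - math.exp(-sep / beta_t)
    return total

def pick_center(result_1, result_2):
    """Selection rule of Eq. (20); each result is a (u, centers) pair."""
    return result_1 if pcaes(*result_1) >= pcaes(*result_2) else result_2
```

For two crisp clusters with centers (0, 0) and (2, 0), each cluster contributes 1 − exp(−4), so the index equals 2 − 2·exp(−4).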
Now, we describe the Context-FGWC2 procedure. Recall from Section “The interval context” that an interval context was defined as f = [f1, f2], so that we can apply fi (i = 1, 2) in each machine.
Table 2
The pseudo-code of the Context-FGWC2 procedure.
Input:
- Initial center V(0), the pattern set X, an interval fuzzifier [m1, m2]
- The number of elements (clusters): N (C), the dimension of the dataset r
- Geographic parameters α, β, a and b, precision ε, and the maximal number of iterations MaxStep
Output:
- Final center V(3)
Context-FGWC2
4: Calculate the upper and lower memberships of U(x) by (26)–(29)
7: Sort X by feature l in ascending order
8: Find the index k0 satisfying (30); otherwise, k0 ← N − 1
9: Calculate U(1)(l), V(1) by (31)–(32)
11: For s = l + 1, ..., r: U_kj(1)(s) ← U_kj (j = 1, ..., C; k = 1, ..., N)
18: Repeat Steps 5 to 17 to calculate V_L, U(2)
19: Perform type-reduction by (36)
20: Determine the population of each cluster by (37)
21: Update U(C)(x) by geo-characteristics in (2), (3) and (38)–(40)
22: Perform type-reduction and compute center V(2) by (41) and (42) to get U_GT(x)
24: Repeat Steps 6 to 18 to calculate V_R, V_L from V(B) and U_GT(x)
25: Perform defuzzification to calculate V(3) by (43)
26: Until ‖V(3) − V(0)‖ ≤ ε or MaxStep is reached
Specifically, f1 (f2) is used in the Context-FGWC2 procedure of Machine 2 (3). Because different context values and initial centers are used in those machines, the upper and lower centers and membership degrees fully reflect the basic principle of IT2FS. The basic idea of the Context-FGWC2 procedure in Machine 2 is to use an interval of primary membership, consisting of the lower and upper memberships calculated from the initial center, and to update the interval by geo-characteristics and the context value f1. The pseudo-code of Context-FGWC2 is shown in Table 2.
In Step 4 of Context-FGWC2, the intervals of primary membership, consisting of the upper and lower memberships, are calculated by Eqs. (26)–(29). Notice that in (26)–(27) the sum of the membership degrees over all clusters is equal to f1k, where f1k is the context value of the kth point in the pattern set. Analogously, the values of the upper and lower memberships depend on this context value, as shown in (28)–(29).
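$$\overline{U}(x) = \left\{ \overline{U}_{kj} \in (0, 1) \;\middle|\; k = \overline{1,N};\ j = \overline{1,C};\ \sum_{j=1}^{C} \overline{U}_{kj} = f_{1k} \right\} \tag{26}$$

$$\underline{U}(x) = \left\{ \underline{U}_{kj} \in (0, 1) \;\middle|\; k = \overline{1,N};\ j = \overline{1,C};\ \sum_{j=1}^{C} \underline{U}_{kj} = f_{1k} \right\} \tag{27}$$

$$\overline{U}_{kj} = \begin{cases} \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|} \right)^{2/(m_1 - 1)}}, & \text{if } \dfrac{f_{1k}}{\sum_{i=1}^{C} \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|}} \ge \dfrac{1}{C} \\[3ex] \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|} \right)^{2/(m_2 - 1)}}, & \text{otherwise} \end{cases} \tag{28}$$

$$\underline{U}_{kj} = \begin{cases} \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|} \right)^{2/(m_1 - 1)}}, & \text{if } \dfrac{f_{1k}}{\sum_{i=1}^{C} \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|}} < \dfrac{1}{C} \\[3ex] \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\| X_k - V_j^{(0)} \|}{\| X_k - V_i^{(0)} \|} \right)^{2/(m_2 - 1)}}, & \text{otherwise} \end{cases} \tag{29}$$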
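The interval of primary membership can be illustrated with the following sketch. It is a simplification made for illustration and not the exact case analysis of Eqs. (28)–(29): for each cluster it evaluates the context-scaled membership once with each fuzzifier and takes the larger value as the upper membership and the smaller one as the lower membership. The function name `interval_memberships` is hypothetical.

```python
import math

def interval_memberships(x, centers, f_k, m1, m2):
    """Return (lower, upper) membership lists of pattern x over all clusters.

    x       : coordinates of the pattern (must not coincide with a center)
    centers : list of initial centers V_j^(0)
    f_k     : context value of the pattern
    m1, m2  : the two fuzzifiers, both greater than 1
    """
    dists = [math.dist(x, v) for v in centers]

    def candidate(j, m):
        # Context-scaled FCM membership for fuzzifier m.
        return f_k / sum((dists[j] / d) ** (2.0 / (m - 1.0)) for d in dists)

    lower, upper = [], []
    for j in range(len(centers)):
        first, second = candidate(j, m1), candidate(j, m2)
        lower.append(min(first, second))
        upper.append(max(first, second))
    return lower, upper
```

When m1 = m2 the interval collapses to a single value per cluster and the memberships of the pattern sum to its context value f_k, which is the type-1 context-based case.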
After we have the interval of primary membership, the maximum (minimum) center V_R (V_L) and the related membership matrix U(1) (U(2)) are calculated by the same steps, from Step 6 to Step 17. Specifically, in Step 8 an index k0 in the range [1, N−1] satisfying Eq. (30) is selected as a pivot to calculate U(1)(l) in Eq. (31).
Xk0l≤C
j =1vjl(A)
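$$U_{kj}^{(1)(l)} = \begin{cases} \underline{U}_{kj} & \text{if } k \le k_0 \\ \overline{U}_{kj} & \text{otherwise} \end{cases}, \quad (j = \overline{1,C},\ k = \overline{1,N}) \tag{31}$$

Using the average of the fuzzifiers, center V(1) is calculated as below:

$$V_{ji}^{(1)} = \frac{\sum_{k=1}^{N} \left( U_{kj}^{(1)(l)} \right)^{(m_1 + m_2)/2} X_{ki}}{\sum_{k=1}^{N} \left( U_{kj}^{(1)(l)} \right)^{(m_1 + m_2)/2}}, \quad (j = \overline{1,C},\ i = \overline{1,r}) \tag{32}$$

Next, in Step 10 we check whether V(1) = V(A) or not. If this condition holds, we conclude that the maximum center V_R = V(1), and the related membership matrix U(1) is found by Eq. (33):

$$U^{(1)} = \sum_{l=1}^{r} U^{(1)(l)} \tag{33}$$

Otherwise, we make another loop with the next feature l in the pattern set. By a similar process, in Step 18 we compute the minimum center V_L and the related membership matrix U(2), where Eqs. (31) and (33) are replaced with (34) and (35), respectively:

$$U_{kj}^{(2)(l)} = \begin{cases} \overline{U}_{kj} & \text{if } k \le k_0 \\ \underline{U}_{kj} & \text{otherwise} \end{cases}, \quad (j = \overline{1,C},\ k = \overline{1,N}) \tag{34}$$

$$U^{(2)} = \sum_{l=1}^{r} U^{(2)(l)} \tag{35}$$

Fig. 1. The mechanism of CFGWC2.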
From these related membership matrices, Step 19 obtains the membership degree of the traditional fuzzy sets (a.k.a. type-1) through Eq. (36). This process is called type-reduction, and its result is used to calculate the population of each cluster. Step 20 calculates the population of each cluster by this rule:

$$\text{If } U_{kj}^{(C)} > U_{ki}^{(C)} \text{ and } i \neq j \text{ then } X_k \text{ is assigned to cluster } j, \quad (k = \overline{1,N};\ i = \overline{1,C}) \tag{37}$$

Based on the population, Step 21 determines the geographical weights of all areas by Eq. (3), and the membership degrees are modified by geo-characteristics through Eqs. (2), (3) and (38)–(40).
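$$U^{G}(x) = G(U^{(C)}(x)) = \left[ \underline{U}_{kj}^{G},\ \overline{U}_{kj}^{G} \right], \quad (j = \overline{1,C},\ k = \overline{1,N}) \tag{38}$$

$$\underline{U}_{kj}^{G} = \alpha \times U_{kj}^{(2)} + \beta \times \frac{1}{A} \times \sum_{i=1}^{C} w_{ji} \times U_{ki}^{(2)} \tag{39}$$

$$\overline{U}_{kj}^{G} = \alpha \times U_{kj}^{(1)} + \beta \times \frac{1}{A} \times \sum_{i=1}^{C} w_{ji} \times U_{ki}^{(1)}, \quad (i, j = \overline{1,C},\ i \neq j,\ k = \overline{1,N}) \tag{40}$$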
Notice that the parameter A in Eqs. (39) and (40) is a factor that scales the “sum” term; it is calculated across all clusters and ensures that the memberships of a given area k sum over all clusters to the context value f1k (k = 1, ..., N). Step 22 performs the type-reduction of the modified membership degree and calculates the new center V(2) by Eqs. (41) and (42), respectively.
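$$U_{kj}^{GT} = \frac{\underline{U}_{kj}^{G} + \overline{U}_{kj}^{G}}{2} \tag{41}$$

$$V_{ji}^{(2)} = \frac{\sum_{k=1}^{N} \left( U_{kj}^{GT} \right)^{(m_1 + m_2)/2} X_{ki}}{\sum_{k=1}^{N} \left( U_{kj}^{GT} \right)^{(m_1 + m_2)/2}}, \quad (j = \overline{1,C},\ i = \overline{1,r}) \tag{42}$$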
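Step 22 can be sketched as follows, under the assumption that type-reduction averages the lower and upper memberships and that the center update is a weighted mean with the exponent (m1 + m2)/2, as in Eq. (42). The function names `type_reduce` and `update_centers` are illustrative.

```python
def type_reduce(lower, upper):
    """Average the lower and upper membership matrices element-wise."""
    return [
        [(lo + up) / 2.0 for lo, up in zip(lower_row, upper_row)]
        for lower_row, upper_row in zip(lower, upper)
    ]

def update_centers(u, patterns, m1, m2):
    """Weighted mean of the patterns with exponent (m1 + m2) / 2.

    u[k][j]     : type-reduced membership of pattern k in cluster j
    patterns[k] : coordinates of pattern k
    """
    m = (m1 + m2) / 2.0
    n_clusters, dim = len(u[0]), len(patterns[0])
    centers = []
    for j in range(n_clusters):
        weights = [row[j] ** m for row in u]
        total = sum(weights)  # assumed non-zero: the cluster is not empty
        centers.append([
            sum(w * x[i] for w, x in zip(weights, patterns)) / total
            for i in range(dim)
        ])
    return centers
```

With crisp memberships each center coincides with the mean of its own patterns, and with uniform memberships all centers collapse onto the mean of the whole pattern set.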
Now, we have the modified membership degree U^G and the crisp center V(2). Since we work on IT2FS, V(2) should be an interval containing the minimum and maximum centers V_L, V_R. This is done in Steps 23 and 24. In order to verify whether the output centers are the solution or not, Step 25 performs the defuzzification of the interval center as in Eq. (43) and gets the crisp center V(3). This center is used to check the stopping condition described in Step 26.

$$V^{(3)} = \begin{cases} V_L & \text{if } \left\| V_L - V^{(0)} \right\| \le \left\| V_R - V^{(0)} \right\| \\ V_R & \text{otherwise} \end{cases} \tag{43}$$

In order to avoid endless iteration, we limit the maximal number of iteration steps to MaxStep. If the number of iteration steps exceeds this threshold, the Context-FGWC2 procedure stops immediately. Once the stopping condition holds, we receive the type-2 membership degree U^G and the interval center [V_L, V_R]. The crisp center V(3) and the distribution of the pattern set after clustering can be extracted from them. (U^G, V(3)) are the output of Context-FGWC2, and the crisp center V(3) is denoted in Fig. 1 as V(1) (Machine 2) and V(2) (Machine 3).
The work of Context-FGWC2 in Machine 3 is analogous to that in Machine 2, except that the maximal number of iteration steps in Machine 3 is half of that in Machine 2 (∼MaxStep/2). The reason for this alteration lies in the synchronization process. Specifically, the results of Machines 2 and 3 are transferred to Machine 1 after completion, so if one machine takes too much time to generate its outputs, it delays the overall system. Because the initial center of Machine 3 is somewhat better than that of Machine 2, the convergence may be faster and is not affected by the reduced number of iteration steps. In practice, the number of machines can be reduced; for instance, the work of Machine 1 can be assigned to one of the two other machines. Because transferring data between machines takes much time, it is better to decrease the waiting time. In that case, the number of transfer steps between machines is halved and the overall processing time is reduced remarkably.
The advantages of CFGWC2 are fourfold. Firstly, it can handle bad initialization and premature convergence through the PSO procedure. Secondly, the clustering results focus on the users' purposes through the interval context. Thirdly, the computing speed of CFGWC2 is improved through the interval context and the parallel computing mechanism. Fourthly, and most importantly, CFGWC2 achieves high clustering quality in comparison with some relevant methods, since the algorithm is deployed on IT2FS, which is more general and able to handle the existing limitations of the traditional fuzzy sets. The disadvantages of CFGWC2 are its computational cost and its complex activities. Nevertheless, we expect that the additional techniques employed ameliorate these disadvantages, so that CFGWC2 achieves good clustering results.
Fig. 2. The two-dimensional distribution of UNO dataset.
Results
Experimental environment
This section describes the experimental environment used in the following sections.
• Experimental tools: We have implemented the proposed algorithm (CFGWC2) in addition to the algorithms NE [13], FGWC [24] and CFGWC [35] in the MPI/C programming language and executed them on a Linux Cluster 1350 with eight computing nodes of 51.2 GFlops. Each node contains two Intel Xeon dual-core 3.2 GHz processors and 2 GB RAM. The experimental results are taken as the average values of 10 runs.
• Cluster validity: We use the PCAES validity function described in Eqs. (21)–(25).
• Datasets: We use the two kinds of datasets below.
- A real dataset of socio-economic demographic variables from the United Nations Organization (UNO) [39], containing population statistics of 230 countries over ten years (2001–2010). Missing data were processed by the Binning method [16]. The two-dimensional distribution is illustrated in Fig. 2.
- A benchmark demographic dataset from The University of Edinburgh, Scotland (Fig. 3), including expression levels of 2880 genes taken in 11 different areas [7]. This dataset has been used in many research papers on gene expression by geographical factors, such as [4,5].
• Objective: We compare the clustering quality of CFGWC2 with those of the other algorithms through the PCAES index. Additionally, the computational times of these algorithms are evaluated.

Fig. 3. The two-dimensional distribution of Colon Cancer dataset.

Table 3
PCAES values of all algorithms in Case 1 on UNO dataset.
     m = 1.5                                          m = 2.0
C    CFGWC2      CFGWC     FGWC       NE              CFGWC2      CFGWC     FGWC       NE
2    1091.30832  11.49441  106.87815  106.87815       730.86493   15.80779  107.95304  107.95304
3    3508.71041  14.20249  102.97090  103.08807       1764.55205  15.48401  104.51216  104.62430
4    1026.1004   9.66077   101.00239  101.05883       1882.45315  9.60082   102.01264  102.07279
5    851.56196   13.83029  98.86012   98.89076        828.00298   20.09243  98.70007   98.73446
6    734.85210   23.45840  105.61367  105.11415       713.06259   13.36007  106.82538  95.32594
     m = 2.5                                          m = 3.0
C    CFGWC2      CFGWC     FGWC       NE              CFGWC2      CFGWC     FGWC       NE
2    435.14908   15.35085  110.80574  110.80576       222.59648   14.84918  111.54395  111.54397
3    699.52639   17.05059  112.36477  112.46454       448.65676   18.15664  121.39454  121.45259
4    758.04253   12.13725  111.70188  111.77472       530.12028   15.16747  123.22859  123.30832
5    729.73602   13.80425  109.59175  109.64291       544.21607   17.33470  122.96865  123.03807
6    660.41492   21.53153  107.14039  107.19830       534.99351   18.78905  122.06920  123.31178

Fig. 4. Average PCAES of algorithms on UNO dataset by fuzzifiers.
Evaluation by various case studies
In this section, we evaluate the proposed algorithm in comparison with the relevant methods through various case studies on the parameters of the algorithms. The main findings are given below.
Case 1. In this case, the parameters of the algorithms are set up as below.
- The default geo-characteristics are a = b = 1, α = 0.7, β = 0.3. These values determine the geo-modification process stated in Eqs. (1)–(3). Our previous work [35] suggested using α ≥ 0.6 in order to increase the clustering quality.
- We use the default context values of [35] for the CFGWC algorithm, as below.
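$$f = (f_1, f_2, \ldots, f_N), \quad \text{where } f_i = \begin{cases} 0 & \text{if } k = 0 \\ \dfrac{rand(0, 1)}{2k} & \text{otherwise} \end{cases}, \quad k = i \bmod 4,\ i = \overline{1,N} \tag{44}$$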
- In CFGWC2, m2 = 2 × m1 = 2 × m, where m is the fuzzifier of NE, FGWC and CFGWC. The interval context is f = [f1, f2], where f1 = f and f2 = 1. A broad interval of fuzzifiers and contexts creates more distinct results than a narrow one.
- In PSO, MaxStepPSO = 100 and the population size is 500. The other parameters are (c1, c2, c3) = (0.2, 0.3, 0.5), and the two ratio constants of Eq. (13) are both set to 1. As suggested by Thien et al. [38], these values make the convergence to the optimum faster.
- The threshold ε and MaxStep of all algorithms are 10^−3 and 500, respectively.
Table 3 describes the PCAES values of all algorithms on the UNO dataset. The experiments are performed with different numbers of clusters and fuzzifier values. The results show that the PCAES values of CFGWC2 are the largest among all, which means that the clustering quality of CFGWC2 is better than those of the other algorithms. To make the experimental results easier to comprehend, we illustrate the average PCAES values of all algorithms for various fuzzifiers in Fig. 4. From this figure, we recognize that the PCAES values of CFGWC2 are larger than those of the other algorithms. For example, the PCAES of CFGWC2 in Fig. 4 is 13 times greater than that of FGWC when m = 1.5. The corresponding ratios for NE and CFGWC are 14 and 99, respectively. Similarly, when m = 3.0, the PCAES of CFGWC2 is still larger than those of the other algorithms, i.e. by 3.79 (FGWC), 3.78 (NE) and 27 times (CFGWC). This evidence confirms that the clustering quality of CFGWC2 is the best among all.

Table 4
The computational time of all algorithms in Case 1 on UNO dataset (s).
     m = 1.5                          m = 2.0
C    CFGWC2  CFGWC  FGWC  NE          CFGWC2  CFGWC  FGWC  NE
2    7.68    0.04   0.04  0.03        10.165  0.04   0.04  0.04
3    14.55   0.03   0.09  0.11        14.31   0.04   0.10  0.13
4    12.94   0.07   0.08  0.12        12.86   0.08   0.11  0.14
5    11.14   0.07   0.16  0.12        17.49   0.07   0.17  0.14
6    20.94   0.07   0.24  0.19        24.56   0.11   0.30  0.22
     m = 2.5                          m = 3.0
C    CFGWC2  CFGWC  FGWC  NE          CFGWC2  CFGWC  FGWC  NE
2    5.23    0.03   0.04  0.03        10.06   0.04   0.04  0.04
3    14.98   0.04   0.08  0.15        15.40   0.06   0.09  0.12
4    15.96   0.09   0.17  0.21        18.06   0.11   0.19  0.17
5    17.57   0.11   0.19  0.19        22.02   0.27   0.23  0.18
6    24.82   0.17   0.31  0.36        24.87   0.23   0.36  0.30

Nonetheless, the PCAES values of CFGWC2 tend to decrease when the fuzzifier increases. For instance, the average PCAES values of CFGWC2 from m = 1.5 to m = 3.0 are 1442, 1183, 656 and 456, respectively. The average reduction ratio per half unit of the fuzzifier is 31%: each time the fuzzifier is increased by 0.5, the PCAES value of CFGWC2 is reduced by 31 percent on average. On the other hand, the average PCAES values of the other algorithms seem to be stable across different fuzzifier values, i.e. 109 (FGWC), 108 (NE) and 15 (CFGWC). By rough calculation, we can easily find the fuzzifier value at which the PCAES value of CFGWC2 becomes smaller than those of the other algorithms, i.e. m ≥ 5.0. This tells us that CFGWC2 should be used when the fuzzifier is small. When designing the FCM algorithm, Bezdek et al. [3] stated that the fuzzifier should be from 1.5 to 2.5, ideally m = 2.0, for the sake of the optimal centers found by the algorithm. Thus, cases such as m ≥ 5.0 will hardly occur in practical applications. However, this finding may be useful for choosing appropriate parameter values. Does the order of the algorithms in terms of PCAES values change with the number of clusters? According to Table 3, the answer is no. For a given number of clusters, the PCAES value of CFGWC2 is always larger than those of the other algorithms. This shows the stability of the proposed algorithm.
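The 31% figure can be reproduced from the four averages quoted above. The short sketch below (the function name `average_drop` is illustrative) computes the relative decrease for each 0.5 step of the fuzzifier and averages the three steps, which are roughly 18.0%, 44.5% and 30.5%.

```python
def average_drop(values):
    """Mean relative decrease between consecutive values."""
    drops = [(prev - cur) / prev for prev, cur in zip(values, values[1:])]
    return sum(drops) / len(drops)

# Average PCAES of CFGWC2 for m = 1.5, 2.0, 2.5 and 3.0, as quoted in the text.
pcaes_cfgwc2 = [1442, 1183, 656, 456]
print(f"{average_drop(pcaes_cfgwc2):.1%}")  # prints 31.0%
```

The individual steps differ considerably, so the 31% average should be read as a rough trend rather than a constant rate.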
The computational time of all algorithms for producing the
results in Table 3 is given in Table 4. Clearly, the computational
time of CFGWC2 is longer than those of the other algorithms.
When m = 3.0, the average computational times of CFGWC2,
FGWC, NE and CFGWC are 18.1, 0.182, 0.162 and 0.142 s,
respectively. Similar results are obtained for m = 2.0 and m = 2.5. As
may be seen in the pseudo-code of Context-FGWC2, it requires heavy
computation to process the interval membership matrix. By using
some additional techniques to speed up this algorithm, the
computational time of CFGWC2 is reduced remarkably. The maximal
(minimal) computational time of CFGWC2 in Table 4 is 24.87 (5.23)
s. With the computing power available nowadays, the
computational cost in this case is acceptable. Table 4 also gives us the
average increase of the computational time of the algorithms
per fuzzifier. Each time the fuzzifier is increased by one unit, the
computational time of CFGWC2 increases by 16.8 percent. The
corresponding values for FGWC, CFGWC and NE are 29.5%, 57% and 64.9%,
respectively. When the fuzzifier is large enough, these computational
times could become close to one another.
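As a quick consistency check, the m = 3.0 averages quoted above can be recomputed from the Table 4 panel whose column means match them. The sketch below copies those five rows (2 to 6 clusters) and reports the per-algorithm mean and how many times slower CFGWC2 is on average. The dictionary layout and variable names are illustrative choices, not part of the paper.

```python
from statistics import mean

# Computational times in seconds for 2..6 clusters, copied from the Table 4
# panel whose column averages reproduce the m = 3.0 figures quoted in the text.
times_m3 = {
    "CFGWC2": [10.06, 15.40, 18.06, 22.02, 24.87],
    "CFGWC": [0.04, 0.06, 0.11, 0.27, 0.23],
    "FGWC": [0.04, 0.09, 0.19, 0.23, 0.36],
    "NE": [0.04, 0.12, 0.17, 0.18, 0.30],
}

# Average time per algorithm over the five cluster counts.
averages = {name: mean(values) for name, values in times_m3.items()}

# How many times slower CFGWC2 is than each other algorithm, on average.
slowdown = {name: averages["CFGWC2"] / avg
            for name, avg in averages.items() if name != "CFGWC2"}

for name, avg in averages.items():
    print(f"{name}: {avg:.3f} s")
for name, factor in slowdown.items():
    print(f"CFGWC2 is about {factor:.0f}x slower than {name}")
```

The recomputed means round to 18.1, 0.142, 0.182 and 0.162 s, so CFGWC2 is roughly two orders of magnitude slower than the other three algorithms at this fuzzifier.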
Now, we evaluate the proposed algorithm on a dataset larger than UNO. In Fig. 5, we measure the average PCAES values of all algorithms on the Colon Cancer dataset by fuzzifier. The results show that the PCAES values of CFGWC2 are larger than those of the other algorithms. For example, when m = 1.5, the average PCAES value of CFGWC2 is 1.13 times larger than that of CFGWC. The corresponding figures for FGWC and NE are 2.2 and 2.19 times, respectively. Similarly, when m = 3.0, the average PCAES of CFGWC2 is 1.32 times, 1.15 times and 1.16 times larger than those of CFGWC, FGWC and NE, respectively. This evidence confirms that the clustering quality of CFGWC2 is the best among all algorithms even on a large dataset such as Colon Cancer. Nonetheless, the PCAES values of CFGWC2 and the other algorithms tend to decrease when the fuzzifier increases. The values of CFGWC2 from m = 1.5 to m = 3.0 are 48.77, 34.18, 26.95 and 22.94, respectively. This result is similar to that on the UNO dataset and shows that we should choose a small value of the fuzzifier in this case in order to obtain good clustering quality from CFGWC2. Even when the PCAES values of CFGWC2 decrease, they are still better than those of the other algorithms: the average PCAES value of CFGWC2 is approximately 1.4 times larger than those of the other algorithms across the various fuzzifiers. In other words, as the fuzzifier increases, the PCAES values of CFGWC2 and the other algorithms all decrease, but those of CFGWC2 remain the largest. However, the small PCAES values of CFGWC2 at large fuzzifiers are not a good choice, and we should keep the fuzzifier as small as possible.
In Fig. 6, we verify whether or not the PCAES values of CFGWC2 are larger than those of the other algorithms by the number of clusters. This figure clearly shows that the line of PCAES values of CFGWC2 is higher than those of the other algorithms. At the starting point of all lines (C = 2), the PCAES values of the algorithms are close to one another, i.e. 7.87 (CFGWC2), 8.67 (CFGWC), 7.182 (FGWC) and 7.184 (NE). However, the differences between the lines become obvious as the number of clusters increases. For example, when C = 3, the PCAES values of CFGWC2, CFGWC, FGWC and NE are 23.4, 19.3, 16.67 and 16.62, respectively. When C = 6, the difference between CFGWC2 and the other algorithms is maximal since the lines spread apart; the PCAES values of the algorithms in this case are 56.2, 47.5, 33.8 and 33.2, respectively. Thus, three remarks are drawn from this figure: (i) the clustering quality of CFGWC2 is the best even when all algorithms are tested by the number of clusters; (ii) the higher the number of clusters, the larger the PCAES value of CFGWC2; (iii) the value of the fuzzifier should be inversely proportional to the number of clusters for the sake of high PCAES values of CFGWC2, as shown in Figs. 5 and 6.
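The widening gap described for Fig. 6 can be made concrete from the three sets of values quoted above. The sketch below computes, for each quoted number of clusters, the spread between the highest and lowest PCAES value and the lead of CFGWC2 over the best of the other algorithms. Only the numbers given in the text are used; the names `spread` and `lead` are illustrative.

```python
# PCAES values on Colon Cancer quoted in the text for C = 2, 3 and 6.
pcaes_by_clusters = {
    2: {"CFGWC2": 7.87, "CFGWC": 8.67, "FGWC": 7.182, "NE": 7.184},
    3: {"CFGWC2": 23.4, "CFGWC": 19.3, "FGWC": 16.67, "NE": 16.62},
    6: {"CFGWC2": 56.2, "CFGWC": 47.5, "FGWC": 33.8, "NE": 33.2},
}

spread = {}  # highest minus lowest PCAES among all algorithms
lead = {}    # CFGWC2 minus the best of the other algorithms
for c, values in pcaes_by_clusters.items():
    others = [v for name, v in values.items() if name != "CFGWC2"]
    spread[c] = max(values.values()) - min(values.values())
    lead[c] = values["CFGWC2"] - max(others)

for c in pcaes_by_clusters:
    print(f"C = {c}: spread = {spread[c]:.2f}, CFGWC2 lead = {lead[c]:+.2f}")
```

Both quantities grow with the number of clusters. With the quoted numbers the lead at C = 2 is slightly negative, since CFGWC is reported at 8.67 against 7.87 for CFGWC2, which fits the statement that the values are close to one another at the starting point.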
In Fig. 7, we verify the changes of the PCAES values of CFGWC2 by fuzzifier on various datasets. Clearly, PCAES values on a large dataset (Colon Cancer) are much smaller than those on a small dataset (UNO). For example, the average PCAES values of CFGWC2 on UNO and Colon Cancer are 1442 and 48.77, respectively, when m = 1.5. Similar results can be seen when m = 3.0, with PCAES values on UNO and Colon Cancer being 456 and 22.94, respectively. Thus, two remarks are drawn from this test: firstly, the input datasets should be small or medium in size for high PCAES values of CFGWC2; secondly, the changes of PCAES values across fuzzifiers on a large dataset are smaller than those on a small one.

Running on a large dataset such as Colon Cancer results in a high computational time of CFGWC2, as shown in Fig. 8. This figure compares the average computational time of CFGWC2 on the UNO and Colon Cancer datasets by fuzzifier. The average processing time of CFGWC2 per fuzzifier on Colon Cancer is 418 s, whilst that on UNO is 15.7 s. From this result, we should keep in mind the first remark about small or medium input datasets when running the CFGWC2 algorithm.
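The two dataset-level observations can be quantified directly from the quoted figures. The sketch below computes the absolute and relative drop in the PCAES value of CFGWC2 between m = 1.5 and m = 3.0 on each dataset, and the ratio of the average processing times. It is a back-of-the-envelope check using the numbers in the text, not a reproduction of the experiments.

```python
# Average PCAES of CFGWC2 at m = 1.5 and m = 3.0, as quoted in the text.
pcaes = {
    "UNO": (1442, 456),
    "Colon Cancer": (48.77, 22.94),
}

absolute_drop = {name: first - last for name, (first, last) in pcaes.items()}
relative_drop = {name: 1 - last / first for name, (first, last) in pcaes.items()}

# Average processing time of CFGWC2 per fuzzifier, in seconds, as quoted.
avg_time = {"UNO": 15.7, "Colon Cancer": 418}
time_ratio = avg_time["Colon Cancer"] / avg_time["UNO"]

for name in pcaes:
    print(f"{name}: drop = {absolute_drop[name]:.2f} "
          f"({relative_drop[name]:.1%} of the m = 1.5 value)")
print(f"Colon Cancer takes about {time_ratio:.1f}x longer than UNO")
```

On these numbers the change on Colon Cancer is smaller than on UNO in both absolute and relative terms, and the processing time is more than 26 times longer, which is consistent with both remarks.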
The major remark in this case is the confirmation of the best clustering quality of CFGWC2 among all algorithms.
Case 2. In Case 2, we change some parameters of all algorithms. Specifically, the geo-characteristics are α = 0.4 and β = 0.6. Other parameters are kept intact as in Case 1. The aim is to verify
Fig. 5. Average PCAES of algorithms on Colon Cancer dataset by fuzzifiers.
Fig. 6. Average PCAES of algorithms on Colon Cancer dataset by number of clusters.
Fig. 7. Changes of PCAES values of CFGWC2 by fuzzifiers on various datasets.
...membership degrees totally reflect the basic principle of IT2FS. The
basic idea of the Context-FGWC2 procedure in Machine 2 is using an
interval of primary membership consisting of the lower and upper
ones calculated from the initial center and updating the interval... (21)–(25)
• Dataset: We use two kinds of datasets below
- A real dataset of socio-economic demographic variables from United Nation Organization (UNO) [39] containing the statistic about population of 230 countries over ten years (2001–2010)...
by geo-characteristics and context value f1. The pseudo-code of
Context-FGWC2 is shown in Table 2.
In Step of the Context-FGWC2, the intervals of primary
membership consisting of the upper and lower memberships are