
DSpace at VNU: Enhancing clustering quality of geo-demographic analysis using context fuzzy clustering type-2 and particle swarm optimization


Applied Soft Computing

journal homepage: www.elsevier.com/locate/asoc

VNU University of Science, Vietnam National University, Viet Nam

Article info

Article history:

Received in revised form 14 February 2014

Available online xxx

Keywords:

Context clustering

Fuzzy clustering type-2

Geo-demographic analysis

Heuristic algorithms

Particle swarm optimization

Abstract

Geo-Demographic Analysis, which is one of the most interesting inter-disciplinary research topics between Geographic Information Systems and Data Mining, plays a very important role in policy decisions, population migration and services distribution. Among the soft computing methods used for this problem, clustering is the most popular one because it has many advantages over the others, such as fast processing time, quality of results and memory usage. Nonetheless, the state-of-the-art clustering algorithm, FGWC, has low clustering quality since it was constructed on the basis of traditional fuzzy sets. In this paper, we present a novel interval type-2 fuzzy clustering algorithm deployed on an extension of the traditional fuzzy sets, namely Interval Type-2 Fuzzy Sets, to enhance the clustering quality of FGWC. Some additional techniques such as the interval context variable, Particle Swarm Optimization and parallel computing are attached to speed up the algorithm. The experimental evaluation through various case studies shows that the proposed method obtains better clustering quality than some best-known ones.

© 2014 Elsevier B.V. All rights reserved.

Introduction

Geo-Demographic Analysis (GDA), which was defined as “the analysis of spatially referenced geo-demographic and lifestyle data” [33], is one of the most interesting inter-disciplinary research topics between Geographic Information Systems and Data Mining, and is widely used in the public and private sectors for the planning and provision of products and services. There are various examples showing the need for GDA in practical applications. Shelton et al. [34] performed a geo-demographic classification of mortality patterns in Britain and found the main causes of death in England and Wales from 1981 to 2000 associated with geographical locations on a map, so that they could assist decision makers in better understanding the distribution of major causes. Michael [23] conducted a GDA analysis to gather community attitudes on the future growth of Werri Beach and Gerringong, NSW (Nelson), Australia, focusing primarily on what actions the Council should take to manage population growth within existing neighborhoods. Páez et al. [29] presented a geo-demographic framework using data from Montreal, Canada to identify potential commercial partnerships that could exploit the characteristics of smart cards. Campbell et al. [8] provided a detailed GDA of over 37,000 gifted and talented students admitted to the National Academy for Gifted and Talented Youth in England in 2003/2005 and showed that the National Academy had nonetheless reached significant numbers of students in the poorest areas, something over 3000 students, and 8% of students identified as gifted and talented at this stage. Day et al. [11] took a survey that determined clusters of nations grouped by health outcomes by comparing life expectancy and a range of health system indicators within and between each cluster in order to provide sensible groupings for international comparisons. Some other typical applications of GDA, such as the spatial and socio-economic determinants of tuberculosis, urban green space accessibility for different ethnic and religious groups, and the investigation of children's disorders, can be found in [1,6,9,32,36,37].

∗ Correspondence to: 334 Nguyen Trai, Thanh Xuan, Hanoi 010000, Viet Nam. Tel.: +84 904171284; fax: +84 0438623938.

E-mail addresses: sonlh@vnu.edu.vn, chinhson2002@gmail.com

In order to perform GDA, some soft computing methods are often used, such as Principal Component Analysis (PCA), Self-Organizing Maps (SOM) and clustering. Walford [41] described a method using PCA to study the spatial distribution of the 1991 census data scores. However, the results of PCA depend on the scaling of the variables, and its applicability is limited by certain assumptions made in the derivation. Loureiro et al. [21] introduced the use of SOM as an adequate tool for GDA. Based on the variations in edge length in a path between two units on the SOM, the authors presented a new way of calculating the fuzzy memberships of a fuzzy clustering method. However, it requires a lot of memory space to store all neurons and weights; what is more, the speed of training

http://dx.doi.org/10.1016/j.asoc.2014.04.025

1568-4946/© 2014 Elsevier B.V All rights reserved.


clustering is often used instead because it has many advantages in comparison with the others, such as fast processing time, quality of results and memory usage. Our previous work [36] gave an overview of some clustering methods for GDA such as Fuzzy C-Means (FCM) [3], agglomerative hierarchical clustering [11], Neighborhood Effects (NE) [13], K-Means clustering [20] and Fuzzy Geographically Weighted Clustering (FGWC) [24]. Among them, FGWC was considered the most favored algorithm and was used in most research articles about GDA applications.

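$$u'_k = \alpha \times u_k + \beta \times \frac{1}{A} \times \sum_{j=1}^{c} w_{kj} \times u_j \qquad (1)$$

$$w_{kj} = \frac{(pop_k \times pop_j)^{b}}{d_{kj}^{a}} \qquad (3)$$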

FGWC calculates the influence of one area upon another by Eqs. (1)–(3), where u′k (uk) is the new (old) cluster membership of area k. The two parameters α and β are scaling variables. popk and popj are the populations of areas k and j, respectively. The number dkj is the distance between k and j. The two numbers a and b are user-definable parameters. A is a factor that scales the “sum” term and is calculated across all clusters, ensuring that the sum of the memberships for a given area over all clusters is equal to one.
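The geographic modification described by Eqs. (1) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `geo_weights` and `fgwc_update` are hypothetical, the sum in Eq. (1) is taken over the other areas, and A is computed as the total of the neighbourhood term so that the memberships of each area still sum to one when α + β = 1.

```python
def geo_weights(pop, dist, a=1.0, b=1.0):
    """w[k][j] = (pop_k * pop_j)^b / d_kj^a for k != j; 0 on the diagonal (assumption)."""
    n = len(pop)
    return [[0.0 if k == j else (pop[k] * pop[j]) ** b / dist[k][j] ** a
             for j in range(n)] for k in range(n)]


def fgwc_update(u, pop, dist, alpha=0.7, beta=0.3, a=1.0, b=1.0):
    """One geographic modification step; u[k][i] is the membership of area k in cluster i."""
    n, c = len(u), len(u[0])
    w = geo_weights(pop, dist, a, b)
    new_u = []
    for k in range(n):
        # Neighbourhood term: memberships of the other areas, weighted by w[k][j].
        nb = [sum(w[k][j] * u[j][i] for j in range(n)) for i in range(c)]
        # A rescales the neighbourhood term so that it sums to one across clusters.
        A = sum(nb)
        if A > 0:
            row = [alpha * u[k][i] + beta * nb[i] / A for i in range(c)]
        else:
            # Isolated area (assumption): fall back to its own membership.
            row = [(alpha + beta) * u[k][i] for i in range(c)]
        new_u.append(row)
    return new_u


if __name__ == "__main__":
    u = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
    pop = [100.0, 200.0, 50.0]
    dist = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.5], [2.0, 1.5, 0.0]]
    for row in fgwc_update(u, pop, dist):
        print([round(v, 4) for v in row])
```

With the default α = 0.7 and β = 0.3, an area keeps 70% of its own membership and takes the remaining 30% from its populous, nearby neighbours.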

Although FGWC is the most popular clustering algorithm for GDA, it still has some limitations such as computing speed and clustering quality. One of our previous works [35] presented a method called CFGWC that accelerates FGWC by attaching context variable terms. Other works [36,37] have shown some preliminary results in improving the clustering quality of FGWC through intuitionistic fuzzy sets and geographical spatial effects. Thus, our focus in this work is to continue with the clustering quality problem of FGWC. FGWC was constructed on the basis of the traditional fuzzy sets, which have some limitations in membership degrees as pointed out by Mendel [25]; this observation motivates us to improve FGWC on an extension of the traditional fuzzy sets in order to enhance the clustering quality of the algorithm. Now, let us explain why clustering algorithms on the traditional fuzzy sets have low clustering quality.

According to Mendel [25], the traditional fuzzy sets cannot process some exceptional cases where the membership degrees are not crisp values but fuzzy ones. For example, a doctor who has examined all symptoms may conclude that the possibility of a patient having tuberculosis is from 60 to 80 percent. Even if modern medical machines are provided, the doctor cannot give an exact number for that possibility. This shows that crisp membership values cannot model some situations in the real world and should be replaced with fuzzy ones. Rhee [30] stated that using the traditional fuzzy sets often results in bad clustering quality because uncertainties such as the distance measure, fuzzifier, centers, prototype and initialization of prototype parameters can create imperfect representations of the pattern sets. For example, in the case of pattern sets that contain clusters of different volume or density, it is possible that patterns staying on the left side of a cluster may contribute more to the other cluster than to this one, so that choosing a suitable value for the fuzzifier is difficult. Bad selection can yield undesirable clustering results for pattern sets that include noise. Because of those limitations, some preliminary results on deploying fuzzy clustering methods on an extension of the traditional fuzzy sets called Interval Type-2 Fuzzy Sets (IT2FS) have been introduced. Mendel [25] described the definition of IT2FS as follows.

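$$\tilde{A} = \left\{ \left( (x,u),\ \mu_{\tilde{A}}(x,u) = 1 \right) \mid \forall x \in A,\ \forall u \in J_X \subseteq [0,1] \right\} \qquad (4)$$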

From Eq. (4), we recognize that IT2FS is a generalization of the traditional fuzzy sets, since IT2FS reduces to the traditional fuzzy sets when there is no uncertainty in the third dimension. Based upon this definition, several interval type-2 fuzzy clustering algorithms have been introduced, such as in the works of Hwang and Rhee [15] and Rhee [30]. Specifically, Hwang and Rhee [15] presented a type-2 fuzzy clustering algorithm to solve the problem of choosing distance measures in the FCM algorithm, taking the difference of each type-2 membership function area with the corresponding type-1 membership value. Rhee [30] presented an improvement of this algorithm using two different values of fuzzifiers to address the uncertainty of the fuzzifier in FCM. Some other variants of the interval type-2 fuzzy clustering algorithms can be found in [2,10,12,14,17,19,22,26,27,31,42].

Motivated by those results, in this article we present a novel interval type-2 fuzzy clustering algorithm called Context Fuzzy Geographically Weighted Clustering on IT2FS, or CFGWC2 for short, to enhance the clustering quality of FGWC. CFGWC2 differs from the interval type-2 fuzzy clustering algorithms above in two ways. Firstly, CFGWC2 is specially designed for the GDA problem, which requires incorporating geographical spatial effects into the algorithm itself. Secondly, it is equipped with some additional techniques to speed up the whole algorithm, namely:

• An interval context variable, which is an extension of the single context variable of Pedrycz [28], is proposed and used to clarify the clustering results and accelerate the computing speed.

• In order to avoid bad initialization, which may occur in other interval type-2 fuzzy clustering algorithms, and to converge quickly to the (sub-)optimal solutions, a meta-heuristic optimization method, namely Particle Swarm Optimization (PSO) [18], is used to determine good initial centers for CFGWC2.

• Since context values in the interval context variable can be processed simultaneously in CFGWC2, a parallel computing technique is adapted to CFGWC2 to reduce the computational costs.

The items listed in those bullets are our contributions in this paper. The proposed algorithm will be implemented and compared with some relevant methods in terms of clustering quality to verify its efficiency.

The rest of this paper is organized as follows. Section “The proposed methodology” elaborates the proposed method in detail, including those additional techniques one after another. The numerical experiments through various case studies and discussions are given in Section “Results”. Finally, Section “Conclusions” gives the conclusions and outlines future work.

The proposed methodology

In the previous section, we have seen that CFGWC2 is an interval type-2 fuzzy clustering algorithm equipped with some additional techniques, such as the interval context variable, PSO and parallel computing, for the GDA problem. Since those techniques are necessary for the description of CFGWC2, they are presented first in Sections “Using PSO for the determination of initial centers” and “The interval context”. The CFGWC2 algorithm accompanied by the parallel computing mechanism is described in Section “The CFGWC2 algorithm”.


Using PSO for the determination of initial centers

This section presents the technique that finds good initial centers for clustering algorithms by PSO. The idea of this technique is to give a preliminary classification of the original pattern set so that “temporal” cluster results can be used to orient the classification in the main algorithm. The objective function is shown in Eq. (5), and its constraints are given in Eqs. (6)–(7):

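$$J = \sum_{k=1}^{N} \sum_{j=1}^{C} \|X_k - V_j\|^2 \qquad (5)$$

$$\min_{j=1,\dots,C;\ j \ne i} \|V_i - V_j\| > \max_{s=1,\dots,POP(i);\ X_s \in Cluster(i)} \|X_s - V_i\|, \quad i = 1,\dots,C \qquad (6)$$

$$\left| \{ Cluster(i) \} \right| \le \varepsilon_1 \ \text{ where } POP(i) = 1 \text{ and } i = 1,\dots,C \qquad (7)$$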

Constraint (6) requires that all clusters are separated from the others. In other words, the minimal distance from a cluster’s center to the other centers is longer than the maximal distance from this center to all data points in the cluster. POP(i) is the population, or number of patterns, in the cluster Cluster(i). Constraint (7) limits the number of outliers in the result: the number of outliers is not greater than a pre-defined threshold ε1.
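The two constraints can be checked with a short helper, sketched here in Python. The function names `is_separated` and `count_outlier_clusters` are hypothetical, Euclidean distance is assumed, and at least two centers are expected.

```python
import math


def is_separated(centers, clusters):
    """Constraint (6): for every cluster, the nearest other center is farther away
    than the farthest member of the cluster. clusters[i] lists the points of centers[i]."""
    for i, v in enumerate(centers):
        nearest_center = min(math.dist(v, w) for j, w in enumerate(centers) if j != i)
        # An empty cluster is given radius 0 (assumption).
        radius = max((math.dist(x, v) for x in clusters[i]), default=0.0)
        if not nearest_center > radius:
            return False
    return True


def count_outlier_clusters(clusters):
    """Left-hand side of constraint (7): number of clusters holding a single pattern."""
    return sum(1 for points in clusters if len(points) == 1)


if __name__ == "__main__":
    centers = [[0.0, 0.0], [10.0, 0.0]]
    clusters = [[[1.0, 0.0], [-1.0, 0.0]], [[9.0, 0.0], [11.0, 0.0]]]
    print(is_separated(centers, clusters), count_outlier_clusters(clusters))
```

A candidate set of centers would satisfy (6)–(7) when `is_separated` holds and `count_outlier_clusters` does not exceed ε1.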

For the problem (5)–(7), we use PSO [18] to determine the (sub-)optimal solutions, with the beginning population initialized with P particles. Each particle is a vector z = (z1, z2, …, zC), where zi (i = 1,…,C) is a pattern randomly chosen from the original pattern set. The velocities of zi are set to zero. Details of the algorithm are described by the pseudo-code in Table 1.

Notice that Eq. (9) is used solely for the first iteration of MaxStepPSO. In the next iterations, the centers are calculated from the previous one. Additionally, the value of MDi in Eq. (10) is set to zero if this cluster does not have any element. The fitness value of a particle is calculated by Eq. (13), whose two coefficients (indexed 1 and 2) are the ratio constants. Eqs. (14)–(16) are used to update the velocities and positions of all particles. In those equations, c1 is the ratio that keeps the velocity intact, c2 is the ratio that changes the velocity following pBest, and c3 shows the level of influence of gBest on the velocity. Since the role of zi (i = 1,…,C) from the second iteration onwards is taken by center Vi, the domain of the random number in Eq. (14) is set to (−1, 1) in order to ensure that the values of the centers are bounded within the domain of the pattern set. After the number of iteration steps defined by MaxStepPSO, the solution has improved because of the amelioration process after each “flying step” based on the fitness function. The output V(0) = (V1, V2, …, VC) is taken from the particle holding the current gBest and is used as the initial center for CFGWC2.
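The velocity update of Eq. (14) can be sketched in Python as follows. The function name `pso_step` is hypothetical, and the position update `z + velocity` is an assumption standing in for Eqs. (15)–(16), which are not reproduced in this text. A particle is represented as a list of C centers, each a list of r coordinates.

```python
import random


def pso_step(z, velocity, p_best, g_best, c1=0.2, c2=0.3, c3=0.5, rng=random):
    """One 'flying step' of a single particle.

    z, velocity, p_best and g_best all have shape C x r (list of lists).
    Returns the new position and the new velocity.
    """
    new_z, new_v = [], []
    for j in range(len(z)):
        zj, vj = [], []
        for d in range(len(z[j])):
            # Eq. (14): inertia term plus pulls towards pBest and gBest,
            # each scaled by a random number drawn from (-1, 1).
            v = (c1 * velocity[j][d]
                 + c2 * rng.uniform(-1, 1) * (p_best[j][d] - z[j][d])
                 + c3 * rng.uniform(-1, 1) * (g_best[j][d] - z[j][d]))
            vj.append(v)
            # Assumed position update: move by the new velocity.
            zj.append(z[j][d] + v)
        new_z.append(zj)
        new_v.append(vj)
    return new_z, new_v


if __name__ == "__main__":
    random.seed(0)
    z = [[1.0, 2.0], [3.0, 4.0]]
    v = [[0.0, 0.0], [0.0, 0.0]]
    best = [[1.5, 2.5], [2.5, 3.5]]
    print(pso_step(z, v, best, best))
```

The default values of (c1, c2, c3) follow the setting (0.2, 0.3, 0.5) reported for Case 1.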

The interval context

In order to clarify the clustering results and accelerate the computing speed of the clustering algorithms, a context variable can be used. According to Pedrycz [28], a (single) context variable in Y ⊂ X is defined through the map below

$$A : Y \to [0,1] \qquad (17)$$

where fk can be understood as the level of relation of the k-th point to the supposed context. There are several ways to define the relation between fk and the membership of the k-th point in the i-th cluster, for instance using the sum operator (18) or the maximum operator (19).

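$$\sum_{i=1}^{c} u_{ki} = f_k, \quad k = 1,\dots,N \qquad (18)$$

$$\max_{i=1,\dots,c} u_{ki} = f_k, \quad k = 1,\dots,N \qquad (19)$$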

In our previous work [35], we defined a context variable to narrow the original geographical dataset under some conditions on certain dimensions. The reason for using the context in the clustering algorithm is twofold. Firstly, a context variable is useful for clarifying the results according to users’ purposes. Because only a subset of the original dataset that has considerable meaning to the context is invoked, the result focuses on the area that really has many relevant points. Secondly, it helps to improve the computing speed. A traditional clustering method not only takes a long time to process the whole data, but also makes the results less meaningful to the considered context. On the contrary, context-based clustering methods both accelerate the speed and improve the semantics. Nevertheless, there are some limitations in definition (17). Firstly, the importance of the k-th point to the supposed context is decided by a single value fk. In fact, this is not enough to reflect the variety of evaluations that different people may give to this relation. In other words, one person can assume that the importance is only 0.3 while another affirms that it should be 0.6. Secondly, the old approach excludes the roles of other data points in the context. This is a misleading assumption, since all characteristics always have relationships, either direct or indirect, with the others. To address these limitations, we extend the use of context by introducing a new term: “the interval context variable”. An interval context is defined as f = [f1, f2], where each fi (i = 1, 2) is stated through the map in Eq. (17). For the most important points, the value of f is high, e.g. [0.6, 0.8]. Similarly, the value of f for less important points is low, e.g. [0, 0.15]. This interval reflects the “fuzziness” of the context. In other words, we have just performed a “fuzzy” step for the considered context. It helps us overcome the shortcomings of the single context variable and is suitable for CFGWC2, which works on IT2FS. Details of applying the interval context variable to CFGWC2 are presented in Section “The CFGWC2 algorithm”.

The CFGWC2 algorithm

We have given a general background on choosing initial centers by PSO in Section “Using PSO for the determination of initial centers” and the basic definition of the interval context in Section “The interval context”. Now, we use both of them, accompanied by the parallel computing mechanism, in the main activity of the CFGWC2 algorithm. The mechanism of CFGWC2 is illustrated in Fig. 1.

According to Fig. 1, the parallel computing mechanism of CFGWC2 requires three machines, the first of which (Machine 1) is responsible for generating initial centers for the remaining machines. Nevertheless, the center values of Machine 2 and Machine 3 are different since the stopping conditions of PSO are not identical. After (MaxStepPSO/2) iteration steps, the first center V(0) is output and transferred to Machine 2, and the second center is sent to Machine 3 after (MaxStepPSO) iterations. This guarantees different results in Machine 2 and Machine 3, and is suitable for the determination of the upper and lower centers and membership degrees of the clustering algorithms on IT2FS, i.e. U(1), V(1) (Machine 2) and U(2), V(2) (Machine 3) in Fig. 1.

In Machine 2 and Machine 3, we send the initial centers V(0) to a type-2 fuzzy clustering procedure accompanied by the interval


Table 1

The pseudo-code of the PSO procedure.

Input:
- The pattern set X whose dimension is r
- The number of elements (clusters): N (C)
- The number of particles in the beginning population: P
- Maximal number of iteration steps in PSO: MaxStepPSO

Output:
- Final center V(0)

Particle Swarm Optimization (PSO)

$$X_j \in Cluster(i) \Leftrightarrow \|z_i - X_j\| = \min \{ \|z_k - X_j\| \mid k = 1,\dots,C \} \qquad (8)$$

6: Calculate center V_i and the maximal distance from V_i to the cluster’s elements:

$$V_i^{(l)} = \frac{1}{POP(i)} \sum_{X_s \in Cluster(i)} X_s^{(l)} \qquad (9)$$

$$MD_i = \max_{s=1,\dots,POP(i)} \|X_s - V_i\| = \max_{s=1,\dots,POP(i)} \sqrt{ \sum_{l=1}^{r} \left( X_s^{(l)} - V_i^{(l)} \right)^2 }, \quad X_s \in Cluster(i) \qquad (10)$$

SEP(z) =Cluster(i) where

min

j = 1, C

j/=i

V i − V j

MD i

OUT (z) =Cluster(i) where POP(i) ≤ 1; i = 1, C (12)

( 1 /1 + SEP(z)) + ( 2 /1 + OUT (z)) (13)

velocityij = c 1 ∗velocityij + c 2 ∗ rand(−1, 1) ∗ (z pBest,j − z ij ) + c 3 ∗ rand(−1, 1) ∗ (z gBest,j − z ij ), (14)

context variable, called Context-FGWC2, to get the crisp center V(1) (Machine 2) and V(2) (Machine 3). If the difference between the initial and crisp centers is smaller than a threshold (Eps) or the maximal number of iterations (MaxStep) is reached, then we stop the Context-FGWC2 procedure and take the crisp center and membership degree, i.e. U(1), V(1) (Machine 2) and U(2), V(2) (Machine 3), as the final results. Otherwise, we assign V(0) = V(1) in Machine 2 and V(0) = V(2) in Machine 3 and start a new iteration in Context-FGWC2 until the stopping conditions hold.

Once the upper and lower centers and membership degrees are calculated, we use the Partition Coefficient and Exponential Separation (PCAES) [40] validity index as a defuzzification method to obtain the final center and membership degree as below.

$$V^{(*)} = \begin{cases} V^{(1)} & \text{if } PCAES(V^{(1)}) \ge PCAES(V^{(2)}) \\ V^{(2)} & \text{otherwise} \end{cases} \qquad (20)$$

This index measures whether an identified cluster has the potential to be a good cluster or not. It was compared with other indexes such as Partition Entropy (PE), Partition Coefficient (PC), Fuzzy Hypervolume (FHV), Xie & Beni, Pal & Bezdek, Modification PC (MPC) and Zahid et al., and showed impressive results, even in a noisy environment. The definition of PCAES is given below.

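$$PCAES(C) = \sum_{j=1}^{C} PCAES[j] \qquad (21)$$

where

$$PCAES[j] = \sum_{k=1}^{N} \frac{u_{kj}^2}{u_M} - \exp\left( - \frac{\min_{i \ne j} \{ \|V_j - V_i\|^2 \}}{\beta_T} \right) \qquad (22)$$

$$u_M = \min_{1 \le i \le C} \left\{ \sum_{k=1}^{N} u_{ki}^2 \right\} \qquad (23)$$

$$\beta_T = \frac{\sum_{l=1}^{C} \|V_l - \bar{V}\|^2}{C} \qquad (24)$$

$\bar{V} = (\bar{V}_1, \bar{V}_2, \dots, \bar{V}_r)$, where $\bar{V}_i$ $(i = 1,\dots,r)$ is calculated as

$$\bar{V}_i = \frac{\sum_{l=1}^{C} V_{li}}{C} \qquad (25)$$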

PCAES[j] is used to measure the compactness and separation of cluster j (j = 1,…,C). These values are summed up to calculate PCAES(C) ∈ (−C, C). A large PCAES(C) value means that each of these C clusters is compact and separated from the other clusters. It is a criterion for choosing the suitable clustering output. Depending on which center is chosen, the related membership degree is used as the final membership U(*).
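As a rough illustration of how the PCAES index of Eqs. (21)–(25) could be evaluated, the following Python sketch computes it from a membership matrix and a list of centers. The function names are hypothetical, and the sketch assumes that both β_T and the mean center are averages over the C centers and that at least two distinct centers exist (otherwise β_T is zero).

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pcaes(u, centers):
    """PCAES validity index; u[k][j] is the membership of point k in cluster j."""
    n, c, r = len(u), len(centers), len(centers[0])
    # Compactness numerators and u_M, the smallest of them (Eq. (23)).
    col_sq = [sum(u[k][j] ** 2 for k in range(n)) for j in range(c)]
    u_m = min(col_sq)
    # Mean center and beta_T (Eqs. (24)-(25)), both averaged over C (assumption).
    v_bar = [sum(centers[l][i] for l in range(c)) / c for i in range(r)]
    beta_t = sum(sq_dist(centers[l], v_bar) for l in range(c)) / c
    total = 0.0
    for j in range(c):
        # Separation: squared distance to the nearest other center.
        sep = min(sq_dist(centers[j], centers[i]) for i in range(c) if i != j)
        total += col_sq[j] / u_m - math.exp(-sep / beta_t)
    return total


if __name__ == "__main__":
    u = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    centers = [[0.0, 0.0], [2.0, 0.0]]
    print(pcaes(u, centers))
```

Under Eq. (20), the center set with the larger returned value would be kept as the final center.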

Now, we describe the Context-FGWC2 procedure. Recall from Section “The interval context” that an interval context was defined as f = [f1, f2], so that we can apply fi (i = 1, 2) in each machine.


Table 2

The pseudo-code of the Context-FGWC2 procedure.

Input:
- Initial center V(0), the pattern set X, an interval fuzzifier [m1, m2]
- The number of elements (clusters): N (C); the dimension of the dataset r
- Geographic parameters α, β, a and b; precision ε; MaxStep iterations

Output:
- Final center V(3)

Context-FGWC2

U(x), U(x)

7: Sort X by feature l in ascending order
8: Find index k0 satisfying (30); otherwise, k0 ← N − 1
9: Calculate U(1)(l), V(1) by (31)–(32)
11: For s = l + 1,…,r: Ukj(1)(s) ← Ukj (j = 1,…,C; k = 1,…,N)
18: Repeat Steps 5 to 17 to calculate VL, U(2)
19: Perform Type-Reduction by (36)
20: Determine the population of each cluster by (37)
21: Update U(C)(x) by geo-characteristics in (2), (3) and (38)–(40)
22: Perform Type-Reduction and compute center V(2) by (41) and (42) to get UGT(x)
24: Repeat Steps 6 to 18 to calculate VR, VL from V(B) and UGT(x)
25: Perform defuzzification to calculate V(3) by (43)
26: Until ‖V(3) − V(0)‖ ≤ ε or MaxStep is reached

Specifically, f1 (f2) is used in the Context-FGWC2 procedure of Machine 2 (3). Because different context values and initial centers are used in those machines, the upper and lower centers and membership degrees fully reflect the basic principle of IT2FS. The basic idea of the Context-FGWC2 procedure in Machine 2 is to use an interval of primary membership, consisting of the lower and upper memberships calculated from the initial center, and to update the interval by geo-characteristics and the context value f1. The pseudo-code of Context-FGWC2 is shown in Table 2.

In Step 4 of Context-FGWC2, the intervals of primary membership consisting of the upper and lower memberships are calculated by Eqs. (26)–(29). Notice that in (26)–(27), the sum of membership degrees over all clusters is equal to f1k, where f1k is the context value of the k-th point in the pattern set. Analogously, the values of the upper and lower memberships depend on this context value, as shown in (28)–(29).

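$$\overline{U}(x) = \left\{ \overline{U}_{kj} \in (0,1) \mid k = 1,\dots,N;\ j = 1,\dots,C;\ \sum_{j=1}^{C} \overline{U}_{kj} = f_{1k} \right\} \qquad (26)$$

$$\underline{U}(x) = \left\{ \underline{U}_{kj} \in (0,1) \mid k = 1,\dots,N;\ j = 1,\dots,C;\ \sum_{j=1}^{C} \underline{U}_{kj} = f_{1k} \right\} \qquad (27)$$

$$\overline{U}_{kj} = \begin{cases} \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\|X_k - V_j^{(0)}\|}{\|X_k - V_i^{(0)}\|} \right)^{2/(m_1 - 1)}} & \text{if } \dfrac{f_{1k}}{\sum_{i=1}^{C} \dfrac{\|X_k - V_j^{(0)}\|}{\|X_k - V_i^{(0)}\|}} \ge \dfrac{1}{C} \\[3ex] \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\|X_k - V_j^{(0)}\|}{\|X_k - V_i^{(0)}\|} \right)^{2/(m_2 - 1)}} & \text{otherwise} \end{cases} \qquad (28)$$

$$\underline{U}_{kj} = \begin{cases} \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\|X_k - V_j^{(0)}\|}{\|X_k - V_i^{(0)}\|} \right)^{2/(m_1 - 1)}} & \text{if } \dfrac{f_{1k}}{\sum_{i=1}^{C} \dfrac{\|X_k - V_j^{(0)}\|}{\|X_k - V_i^{(0)}\|}} < \dfrac{1}{C} \\[3ex] \dfrac{f_{1k}}{\sum_{i=1}^{C} \left( \dfrac{\|X_k - V_j^{(0)}\|}{\|X_k - V_i^{(0)}\|} \right)^{2/(m_2 - 1)}} & \text{otherwise} \end{cases} \qquad (29)$$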
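The role of the context value and of the two fuzzifiers in Eqs. (26)–(29) can be illustrated with a simplified Python sketch. It is not the exact case rule of the paper: instead of testing the 1/C condition, it takes the element-wise maximum and minimum of the two context-scaled FCM memberships as the upper and lower bounds. The function names are hypothetical and Euclidean distance is assumed.

```python
import math


def fcm_membership(x, centers, m, f_k=1.0):
    """Context-scaled FCM memberships of one point x; they sum to f_k."""
    # Guard against a zero distance when x coincides with a center.
    d = [max(math.dist(x, v), 1e-12) for v in centers]
    p = 2.0 / (m - 1.0)
    return [f_k / sum((d[j] / d[i]) ** p for i in range(len(centers)))
            for j in range(len(centers))]


def interval_membership(x, centers, m1, m2, f_k=1.0):
    """Lower and upper memberships as the envelope of the m1 and m2 memberships
    (a simplification of the case distinction in Eqs. (28)-(29))."""
    u1 = fcm_membership(x, centers, m1, f_k)
    u2 = fcm_membership(x, centers, m2, f_k)
    lower = [min(a, b) for a, b in zip(u1, u2)]
    upper = [max(a, b) for a, b in zip(u1, u2)]
    return lower, upper


if __name__ == "__main__":
    lower, upper = interval_membership([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0]],
                                       m1=2.0, m2=4.0, f_k=0.6)
    print(lower, upper)
```

Each of the two candidate membership vectors sums to the context value f1k, which is the property stated for (26)–(27); the envelope itself is only a bound and does not sum to f1k.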

After we have the interval of primary membership, the maximum (minimum) center VR (VL) and the related membership matrix U(1) (U(2)) are calculated by the same steps, from Step 6 to Step 17. Specifically, in Step 8 an index k0 in the range [1, N − 1] satisfying Eq. (30) is selected as a pivot to calculate U(1)(l) in Eq. (31).

Xk0l≤C

j =1vjl(A)

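$$U_{kj}^{(1)(l)} = \begin{cases} \underline{U}_{kj} & \text{if } k \le k_0 \\ \overline{U}_{kj} & \text{otherwise} \end{cases}, \quad (j = 1,\dots,C;\ k = 1,\dots,N) \qquad (31)$$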

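Using the average of the two fuzzifiers, center V(1) is calculated as below.

$$V_{ji}^{(1)} = \frac{\sum_{k=1}^{N} \left( U_{kj}^{(1)(l)} \right)^{(m_1 + m_2)/2} X_{ki}}{\sum_{k=1}^{N} \left( U_{kj}^{(1)(l)} \right)^{(m_1 + m_2)/2}}, \quad (j = 1,\dots,C;\ i = 1,\dots,r) \qquad (32)$$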

Next, in Step 10 we check whether V(1) = V(A) or not. If this condition holds, we conclude that the maximum center is VR = V(1), and the related membership matrix U(1) is found by Eq. (33).

U(1)=

r l=1U(1)(l)

Otherwise, we make another loop with the next feature l in the pattern set. By a similar process, in Step 18 we can compute the


Fig. 1. The mechanism of CFGWC2.

minimum center VL and the related membership matrix U(2), where Eqs. (31) and (33) are replaced with (34) and (35), respectively.

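$$U_{kj}^{(2)(l)} = \begin{cases} \overline{U}_{kj} & \text{if } k \le k_0 \\ \underline{U}_{kj} & \text{otherwise} \end{cases}, \quad (j = 1,\dots,C;\ k = 1,\dots,N) \qquad (34)$$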

U(2)=

r

l =1U(2)(l)

From these related membership matrices, Step 19 obtains the membership degree of traditional fuzzy sets (a.k.a. type-1) through Eq. (36). This process is called type-reduction, and its result is used to calculate the population of each cluster. Step 20 calculates the population of each cluster by this rule:

$$\text{If } U_{kj}^{(C)} > U_{ki}^{(C)} \text{ and } i \ne j \text{ then } X_k \text{ is assigned to cluster } j, \quad (k = 1,\dots,N;\ i = 1,\dots,C) \qquad (37)$$

Based on the population, Step 21 determines the geographical weights of all areas by Eq. (3), and the modification of the membership degree according to geo-characteristics is performed through Eqs. (2), (3) and (38)–(40).

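$$U^{G}(x) = G(U^{(C)}(x)) = \left[ \underline{U}_{kj}^{G},\ \overline{U}_{kj}^{G} \right], \quad (j = 1,\dots,C;\ k = 1,\dots,N) \qquad (38)$$

$$\underline{U}_{kj}^{G} = \alpha \times U_{kj}^{(2)} + \beta \times \frac{1}{A} \times \sum_{i=1}^{C} w_{ji} \times U_{ki}^{(2)} \qquad (39)$$

$$\overline{U}_{kj}^{G} = \alpha \times U_{kj}^{(1)} + \beta \times \frac{1}{A} \times \sum_{i=1}^{C} w_{ji} \times U_{ki}^{(1)}, \quad (i, j = 1,\dots,C;\ i \ne j;\ k = 1,\dots,N) \qquad (40)$$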

Notice that parameter A in Eqs. (39) and (40) is a factor that scales the “sum” term and is calculated across all clusters, ensuring that the sum of the memberships for a given area k over all clusters is equal to the context value f1k (k = 1,…,N). Step 22 performs the type-reduction for the modified membership degree and calculates the new center V(2) by Eqs. (41) and (42), respectively.

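$$U_{kj}^{GT} = \frac{\underline{U}_{kj}^{G} + \overline{U}_{kj}^{G}}{2} \qquad (41)$$

$$V_{ji}^{(2)} = \frac{\sum_{k=1}^{N} \left( U_{kj}^{GT} \right)^{(m_1 + m_2)/2} X_{ki}}{\sum_{k=1}^{N} \left( U_{kj}^{GT} \right)^{(m_1 + m_2)/2}}, \quad (j = 1,\dots,C;\ i = 1,\dots,r) \qquad (42)$$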
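Step 22 can be sketched in Python as follows. The function names `type_reduce` and `centers_from_membership` are hypothetical, and the sketch assumes that type-reduction in Eq. (41) is the midpoint of the lower and upper memberships and that the exponent in Eq. (42) is the average fuzzifier (m1 + m2)/2.

```python
def type_reduce(lower, upper):
    """Midpoint of the lower and upper membership matrices (assumed form of Eq. (41))."""
    return [[(lo + up) / 2.0 for lo, up in zip(l_row, u_row)]
            for l_row, u_row in zip(lower, upper)]


def centers_from_membership(u, X, m1, m2):
    """Weighted means of the patterns, with weights u[k][j] ** ((m1 + m2) / 2) (Eq. (42))."""
    m = (m1 + m2) / 2.0
    n, c, r = len(X), len(u[0]), len(X[0])
    centers = []
    for j in range(c):
        weights = [u[k][j] ** m for k in range(n)]
        total = sum(weights)  # assumed to be positive for every cluster
        centers.append([sum(weights[k] * X[k][i] for k in range(n)) / total
                        for i in range(r)])
    return centers


if __name__ == "__main__":
    X = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
    lower = [[0.7, 0.1], [0.6, 0.2], [0.1, 0.7], [0.2, 0.6]]
    upper = [[0.9, 0.3], [0.8, 0.4], [0.3, 0.9], [0.4, 0.8]]
    u_gt = type_reduce(lower, upper)
    print(centers_from_membership(u_gt, X, m1=2.0, m2=4.0))
```

The same center formula, applied to U(1)(l) instead of the type-reduced matrix, gives Eq. (32).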

Now, we have the modified membership degree UG and the crisp center V(2). Since we work on IT2FS, V(2) should be an interval containing the minimum and maximum centers VL, VR. This is done through Steps 23 and 24. In order to verify whether the output centers are the solution or not, Step 25 performs the defuzzification of the interval center as in Eq. (43) and gets the crisp one, V(3). This center is used to check the stopping condition described in Step 26.

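$$V^{(3)} = \begin{cases} V_L & \text{if } \|V_L - V^{(0)}\| \le \|V_R - V^{(0)}\| \\ V_R & \text{otherwise} \end{cases} \qquad (43)$$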

In order to avoid endless iteration, we limit the maximal number of iteration steps to MaxStep. If the number of iteration steps exceeds this threshold, the Context-FGWC2 procedure stops immediately. Once the stopping condition holds, we receive the type-2 membership degree UG and the interval center [VL, VR]. The crisp center V(3) and the distribution of the pattern set after clustering can be extracted from them. (UG, V(3)) are the output of Context-FGWC2, and the crisp center V(3) is denoted in Fig. 1 as V(1) (Machine 2) and V(2) (Machine 3).

The work of Context-FGWC2 in Machine 3 is analogous to that in Machine 2, except that the maximal number of iteration steps in Machine 3 is equal to half of that in Machine 2 (∼MaxStep/2). The reason for this alteration lies in the synchronization process. Specifically, the results in Machine 2 and 3 are transferred to Machine 1 after completion, so that if a machine takes too much time to generate its outputs, it causes a large delay in the overall system. Because the initial center of Machine 3 is somewhat better than that of Machine 2, the convergence may be faster and is not affected by the number of iteration steps. In practice, the number of machines can be reduced; for instance, the work of Machine 1 can be assigned to one of the two remaining machines. Because it takes much time to transfer data between machines, it is better if we can decrease the waiting time. If so, the number of transfer steps between machines is reduced by half and the overall processing time is reduced remarkably.

The advantages of CFGWC2 are four-fold. Firstly, it is capable of handling bad initialization and premature convergence through the PSO procedure; secondly, the clustering results focus on the users’ purposes through the interval context; thirdly, the computing speed of CFGWC2 is improved through the interval context and the parallel computing mechanism; fourthly, the most important advantage of CFGWC2 is the high clustering quality in comparison with some relevant methods, since this algorithm was deployed on


Fig. 2. The two-dimensional distribution of UNO dataset.

IT2FS, which is more general and able to handle the existing limitations of the traditional fuzzy sets. The disadvantages of CFGWC2 could be the computational costs and its complex activities. Nevertheless, by employing some additional techniques we hope that the disadvantages can be mitigated and that CFGWC2 achieves good clustering results.

Results

Experimental environment

This section describes the experimental environment used in the following sections.

• Experimental tools: We have implemented the proposed algorithm (CFGWC2), in addition to NE [13], FGWC [24] and CFGWC [35], in the MPI/C programming language and executed them on a Linux Cluster 1350 with eight computing nodes of 51.2 GFlops. Each node contains two Intel Xeon dual core 3.2 GHz processors and 2 GB RAM. The experimental results are taken as the average values after 10 runs.

• Cluster validity: We use the PCAES validity function described in Eqs. (21)–(25).

• Dataset: We use the two kinds of datasets below.

- A real dataset of socio-economic demographic variables from the United Nation Organization (UNO) [39] containing statistics about the population of 230 countries over ten years (2001–2010). Missing data were processed by the Binning method [16]. The two-dimensional distribution is illustrated in Fig. 2.

- A benchmark demographic dataset from The University of Edinburgh, Scotland (Fig. 3) including expression levels of 2880 genes taken in 11 different areas [7]. This dataset was used in many different research papers on gene expression by geographical factors, such as [4,5].

• Objective: We compare the clustering quality of CFGWC2 with those of the other algorithms through the PCAES index. Additionally, the

Fig. 3. The two-dimensional distribution of Colon Cancer dataset.


Table 3

PCAES values of all algorithms in Case 1 on UNO dataset.

m = 1.5
C    CFGWC2        CFGWC       FGWC         NE
2    1091.30832    11.49441    106.87815    106.87815
3    3508.71041    14.20249    102.97090    103.08807
4    1026.1004     9.66077     101.00239    101.05883
5    851.56196     13.83029    98.86012     98.89076
6    734.85210     23.45840    105.61367    105.11415

m = 2.0
C    CFGWC2        CFGWC       FGWC         NE
2    730.86493     15.80779    107.95304    107.95304
3    1764.55205    15.48401    104.51216    104.62430
4    1882.45315    9.60082     102.01264    102.07279
5    828.00298     20.09243    98.70007     98.73446
6    713.06259     13.36007    106.82538    95.32594

m = 2.5
C    CFGWC2        CFGWC       FGWC         NE
2    435.14908     15.35085    110.80574    110.80576
3    699.52639     17.05059    112.36477    112.46454
4    758.04253     12.13725    111.70188    111.77472
5    729.73602     13.80425    109.59175    109.64291
6    660.41492     21.53153    107.14039    107.19830

m = 3.0
C    CFGWC2        CFGWC       FGWC         NE
2    222.59648     14.84918    111.54395    111.54397
3    448.65676     18.15664    121.39454    121.45259
4    530.12028     15.16747    123.22859    123.30832
5    544.21607     17.33470    122.96865    123.03807
6    534.99351     18.78905    122.06920    123.31178

Fig. 4. Average PCAES of algorithms on UNO dataset by fuzzifiers.

evaluation of the computational times of these algorithms is also mentioned.

Evaluation by various case studies

In this section, we evaluate the proposed algorithm in comparison with the relevant methods through various case studies on the parameters of the algorithms. The main findings are given below.

Case 1. In this case, the parameters of these algorithms are set up as below.

- The default geo-characteristics are a = b = 1, α = 0.7, β = 0.3. These values determine the geo-modification process stated in Eqs. (1)–(3). Our previous work [35] suggested using α ≥ 0.6 in order to increase the clustering quality.

- We use the default context values in [35] for the CFGWC algorithm, given below.

f=(f1,f2, ,fN), where fi=

0 ifk=0 rand(0,1)

2k otherwise

, k=imod4, i=1,N

(44)

- In CFGWC2, m2 = 2 × m1 = 2 × m, where m is the fuzzifier of NE, FGWC and CFGWC. The interval context is f = [f1, f2], where f1 = f and f2 = 1. A broad interval of fuzzifiers and contexts will create more distinct results than a narrow one.

- In PSO, MaxStepPSO = 100 and the population size is 500. The other parameters are (c1, c2, c3) = (0.2, 0.3, 0.5), and the two ratio constants of Eq. (13) are both set to 1. As suggested by Thien et al. [38], these values make the convergence to the optimum faster.

- Threshold ε and MaxStep of all algorithms are 10^−3 and 500, respectively.

Table 3 describes the PCAES values of all algorithms on the UNO dataset. The experiments are performed with different values of the number of clusters and fuzzifiers. The results show that the PCAES values of CFGWC2 are the largest among all, which means that the clustering quality of CFGWC2 is better than those of the other algorithms. To make the experimental results easier to comprehend, we illustrate the PCAES values of all algorithms for various fuzzifiers in Fig. 4. From this figure, we recognize that the PCAES values of CFGWC2 are larger than those of the other algorithms. For example, PCAES of CFGWC2 in Fig. 4 is 13 times greater than that of FGWC when m = 1.5. These numbers in the cases of NE and CFGWC are 14 and 99 times, respectively. Similarly, when m = 3.0, PCAES of CFGWC2 is still larger than those of the other algorithms, i.e. 3.79 (FGWC), 3.78 (NE) and 27 times (CFGWC). This evidence confirms that the clustering quality of CFGWC2 is the best among


Table 4

The computational time of all algorithms in Case 1 on UNO dataset (s).

m = 1.5
C    CFGWC2    CFGWC    FGWC    NE
2    7.68      0.04     0.04    0.03
3    14.55     0.03     0.09    0.11
4    12.94     0.07     0.08    0.12
5    11.14     0.07     0.16    0.12
6    20.94     0.07     0.24    0.19

m = 2.0
C    CFGWC2    CFGWC    FGWC    NE
2    10.165    0.04     0.04    0.04
3    14.31     0.04     0.10    0.13
4    12.86     0.08     0.11    0.14
5    17.49     0.07     0.17    0.14
6    24.56     0.11     0.30    0.22

m = 2.5
C    CFGWC2    CFGWC    FGWC    NE
2    5.23      0.03     0.04    0.03
3    14.98     0.04     0.08    0.15
4    15.96     0.09     0.17    0.21
5    17.57     0.11     0.19    0.19
6    24.82     0.17     0.31    0.36

m = 3.0
C    CFGWC2    CFGWC    FGWC    NE
2    10.06     0.04     0.04    0.04
3    15.40     0.06     0.09    0.12
4    18.06     0.11     0.19    0.17
5    22.02     0.27     0.23    0.18
6    24.87     0.23     0.36    0.30

all.Nonetheless,PCAESvaluesofCFGWC2tendtodecreasewhen

thefuzzifierincreases.Forinstance,PCAESvaluesofCFGWC2from

m=1.5tom=3.0are1442,1183,656and456,respectively.The

averagereducingratioperhalfofafuzzifieris31%.Thismeansthat

eachtimethevalueoffuzzifierisincreasedby0.5,PCAESvalueof

CFGWC2isreducedby31percentsonaverage.Ontheotherhands,

theaveragePCAESvaluesofotheralgorithmsseemtobestable

throughdifferentvaluesoffuzzifier,i.e.109(FGWC),108(NE)and

15(CFGWC).Byroughcalculation,wecaneasyfindthevalueof

fuzzifierthatmakesPCAESvalueofCFGWC2issmallerthanother

algorithms,i.e.m≥5.0.ThisfacttellsusthetruththatCFGWC2

shouldbeusedwhenthefuzzifierissmall.AsmentionedbyBezdek

etal.[3]whendesigningFCMalgorithm,theauthorsstatedthat

thefuzzifiershouldbefrom1.5to2.5,ideallym=2.0,forthesake

ofoptimalcentersfoundbythealgorithm.Thus,wemayseethat

somecasessuchasm≥5.0willneverhappeninpractical

appli-cations.However,thisfindingmaybeusefulforustochoosethe

appropriatevalueofparameters.Isthereanychangeoftheorder

ofalgorithmsintermsofPCAESvaluesbydifferentvaluesof

num-berofclusters?FollowingbyTable3,theanswerisabsolutelyno

Foragivennumberofclusters,PCAESvalueofCFGWC2isalways

largerthanthoseofalgorithms.Indeed,thisshowsthestabilityof

theproposedalgorithm
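The 31% reduction ratio and the m ≥ 5.0 crossover quoted for CFGWC2 on UNO can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' procedure: it assumes the reduction ratio is the arithmetic mean of the relative drops between consecutive fuzzifier values, and that the same ratio keeps applying beyond m = 3.0. The function names are hypothetical. Under these assumptions, the extrapolated value falls below the FGWC and NE averages (109 and 108), though not below the CFGWC average (15), at m = 5.0.

```python
# PCAES values of CFGWC2 on UNO for m = 1.5, 2.0, 2.5, 3.0 (quoted in the text).
pcaes_cfgwc2 = [1442, 1183, 656, 456]


def average_reduction(values):
    """Arithmetic mean of the relative drops between consecutive values."""
    drops = [1 - later / earlier for earlier, later in zip(values, values[1:])]
    return sum(drops) / len(drops)


def first_fuzzifier_below(start_m, start_value, ratio, threshold, step=0.5):
    """Extrapolate a constant per-step reduction until the value drops below threshold.

    Assumption: the reduction ratio stays constant outside the measured range.
    """
    m, value = start_m, start_value
    while value >= threshold:
        m += step
        value *= 1 - ratio
    return m


avg_drop = average_reduction(pcaes_cfgwc2)
# 109 is the roughly stable FGWC average quoted in the text.
crossover = first_fuzzifier_below(3.0, 456, 0.31, 109)
print(f"average reduction per 0.5 step: {avg_drop:.2f}")
print(f"first fuzzifier with extrapolated PCAES below FGWC: {crossover}")
```

The extrapolation is only as good as the constant-ratio assumption; the measured drops themselves range from about 18% to about 45% per step.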

The computational time of all algorithms for producing the results in Table 3 is described in Table 4. Clearly, the computational time of CFGWC2 is longer than those of the other algorithms. When m = 3.0, the average computational times of CFGWC2, FGWC, NE and CFGWC are 18.1, 0.182, 0.162 and 0.142 s, respectively. Similar results are obtained for m = 2.0 and m = 2.5. As we may see in the pseudo-code of Context-FGWC2, it requires heavy computation to process the interval membership matrix. By using some additional techniques to speed up this algorithm, the computational time of CFGWC2 is reduced remarkably. The maximal (minimal) computational time of CFGWC2 in Table 4 is 24.87 (5.23) s. With the computing power available nowadays, the computational cost in this case is acceptable. Table 4 also gives us the average increment levels of the computational time of the algorithms per fuzzifier. Each time the fuzzifier is increased by one unit, the computational time of CFGWC2 increases by 16.8 percent. The corresponding values for FGWC, CFGWC and NE are 29.5%, 57% and 64.9%, respectively. When the fuzzifier is large enough, these times could become close to one another.
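The averages quoted for m = 3.0 can be checked against Table 4. The sketch below assumes that the bottom-right block of Table 4 is the m = 3.0 block, an assumption supported only by the fact that its column means match the quoted figures. The dictionary layout and the `mean` helper are illustrative names.

```python
# Bottom-right block of Table 4 (assumed to be m = 3.0); rows are 2..6 clusters.
times_s = {
    "CFGWC2": [10.06, 15.40, 18.06, 22.02, 24.87],
    "CFGWC": [0.04, 0.06, 0.11, 0.27, 0.23],
    "FGWC": [0.04, 0.09, 0.19, 0.23, 0.36],
    "NE": [0.04, 0.12, 0.17, 0.18, 0.30],
}


def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


averages = {name: mean(values) for name, values in times_s.items()}
# How many times slower CFGWC2 is than FGWC in this block.
slowdown_vs_fgwc = averages["CFGWC2"] / averages["FGWC"]

for name, avg in averages.items():
    print(f"{name}: {avg:.3f} s")
print(f"CFGWC2 / FGWC: {slowdown_vs_fgwc:.1f}x")
```

In this block CFGWC2 is roughly two orders of magnitude slower than the type-1 algorithms, which is the cost side of the quality gain reported in Table 3.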

Now, we evaluate the proposed algorithm on a larger dataset than UNO. In Fig. 5, we measure the average PCAES values of all algorithms on the Colon Cancer dataset by fuzzifier. The results show that the PCAES values of CFGWC2 are larger than those of the other algorithms. For example, when m = 1.5, the average PCAES value of CFGWC2 is 1.13 times larger than that of CFGWC. The corresponding numbers for FGWC and NE are 2.2 and 2.19 times, respectively. Similarly, when m = 3.0, the average PCAES of CFGWC2 is 1.32 times, 1.15 times and 1.16 times larger than those of CFGWC, FGWC and NE, respectively. This evidence confirms that the clustering quality of CFGWC2 is the best among all even on a large dataset such as Colon Cancer.

Nonetheless, the PCAES values of CFGWC2 and the other algorithms tend to decrease when the fuzzifier increases. The values of CFGWC2 from m = 1.5 to m = 3.0 are 48.77, 34.18, 26.95 and 22.94, respectively. This result is similar to that on the UNO dataset and shows that we should choose a small value of the fuzzifier in this case in order to obtain good clustering quality from CFGWC2. Even when the PCAES values of CFGWC2 decrease, they are still better than those of the other algorithms: the average PCAES value of CFGWC2 is approximately 1.4 times larger than those of the other algorithms across the fuzzifier values tested. However, the small PCAES values of CFGWC2 at large fuzzifiers are not a good choice, so the fuzzifier should be kept as small as possible.

In Fig. 6, we verify whether or not the PCAES values of CFGWC2 are larger than those of the other algorithms by number of clusters. This figure clearly shows that the line of PCAES values of CFGWC2 is higher than those of the other algorithms. At the starting point of all lines (C = 2), the PCAES values of the algorithms are close to one another, i.e. 7.87 (CFGWC2), 8.67 (CFGWC), 7.182 (FGWC) and 7.184 (NE). However, the differences between those lines become obvious when the number of clusters increases. For example, when C = 3, the PCAES values of CFGWC2, CFGWC, FGWC and NE are 23.4, 19.3, 16.67 and 16.62, respectively. When C = 6, the difference between CFGWC2 and the other algorithms is maximal since the lines spread apart. The PCAES values of those algorithms in this case are 56.2, 47.5, 33.8 and 33.2, respectively. Thus, three remarks are extracted from this figure: (i) the clustering quality of CFGWC2 is the best even when all algorithms are tested by number of clusters; (ii) the higher the number of clusters is, the larger the PCAES value of CFGWC2 is; (iii) the value of the fuzzifier should be inversely proportional to the number of clusters for the sake of high PCAES values of CFGWC2, as shown in Figs. 5 and 6.
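The widening gap described for Fig. 6 can be made concrete from the three sets of values quoted in the text. This is a minimal sketch using only those quoted numbers (C = 2, 3 and 6); the helper name `gap_to_best_rival` is hypothetical. By these figures, the CFGWC value at C = 2 is slightly above that of CFGWC2, so the lead of CFGWC2 only appears from C = 3 onwards.

```python
# Average PCAES on Colon Cancer by number of clusters, as quoted for Fig. 6.
pcaes_by_clusters = {
    2: {"CFGWC2": 7.87, "CFGWC": 8.67, "FGWC": 7.182, "NE": 7.184},
    3: {"CFGWC2": 23.4, "CFGWC": 19.3, "FGWC": 16.67, "NE": 16.62},
    6: {"CFGWC2": 56.2, "CFGWC": 47.5, "FGWC": 33.8, "NE": 33.2},
}


def gap_to_best_rival(row, target="CFGWC2"):
    """Difference between the target algorithm and the best of the others."""
    best_rival = max(value for name, value in row.items() if name != target)
    return row[target] - best_rival


gaps = {c: gap_to_best_rival(row) for c, row in pcaes_by_clusters.items()}
widest = max(gaps, key=gaps.get)

for c, gap in gaps.items():
    print(f"C = {c}: gap to best rival = {gap:+.2f}")
print(f"widest gap at C = {widest}")
```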

In Fig. 7, we verify the changes of the PCAES values of CFGWC2 by fuzzifier on various datasets. Clearly, the PCAES values on a large dataset (Colon Cancer) are much smaller than those on a small dataset (UNO). For example, the average PCAES values of CFGWC2 on UNO and Colon Cancer are 1442 and 48.77, respectively, when m = 1.5. Similar results can be seen when m = 3.0, with PCAES values on UNO and Colon Cancer being 456 and 22.94, respectively. Thus, two remarks are found from this test: firstly, the sizes of the input datasets should be small or medium for high PCAES values of CFGWC2; secondly, the changes of PCAES values across fuzzifiers on a large dataset are smaller than those on a small one.

Running on a large dataset such as Colon Cancer results in a high computational time of CFGWC2, as shown in Fig. 8. This figure compares the average computational time of CFGWC2 on the UNO and Colon Cancer datasets by fuzzifier. The average processing time of CFGWC2 per fuzzifier on Colon Cancer is 418 s, whilst that on UNO is 15.7 s. From this result, we should keep in mind the first remark about small or medium input datasets when running the CFGWC2 algorithm.

The major remark in this case is the confirmation of the best clustering quality of CFGWC2 among all.
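The two dataset-size remarks can be quantified from the figures quoted for Figs. 7 and 8. The sketch below is illustrative only: it measures the "change" of PCAES as the relative drop between m = 1.5 and m = 3.0, which is one possible reading of the text, and all names are hypothetical.

```python
# CFGWC2 figures quoted in the text: PCAES at m = 1.5 and m = 3.0, and the
# average processing time per fuzzifier.
uno = {"pcaes": (1442, 456), "time_s": 15.7}
colon_cancer = {"pcaes": (48.77, 22.94), "time_s": 418}


def relative_drop(first, last):
    """Fraction of the initial value lost between the two measurements."""
    return (first - last) / first


uno_drop = relative_drop(*uno["pcaes"])
colon_drop = relative_drop(*colon_cancer["pcaes"])
# How many times longer CFGWC2 runs on Colon Cancer than on UNO.
time_ratio = colon_cancer["time_s"] / uno["time_s"]

print(f"UNO PCAES drop:          {uno_drop:.2f}")
print(f"Colon Cancer PCAES drop: {colon_drop:.2f}")
print(f"time ratio (Colon Cancer / UNO): {time_ratio:.1f}")
```

Under this reading, the drop is smaller on the larger dataset in both absolute and relative terms, while the running time grows by more than an order of magnitude.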

Case 2. In Case 2, we make some changes to the parameters of all algorithms. Specifically, the geo-characteristics are α = 0.4 and β = 0.6. Other parameters are kept intact as in Case 1. The aim is to verify


Fig. 5. Average PCAES of algorithms on Colon Cancer dataset by fuzzifiers.

Fig. 6. Average PCAES of algorithms on Colon Cancer dataset by number of clusters.

Fig. 7. Changes of PCAES values of CFGWC2 by fuzzifiers on various datasets.

...

membership degrees totally reflect the basic principle of IT2FS. The basic idea of the Context-FGWC2 procedure in Machine 2 is using an interval of primary membership consisting of the lower and upper ones calculated from the initial center and updating the interval... (21)–(25)

• Dataset: We use two kinds of datasets below

- A real dataset of socio-economic demographic variables from the United Nations Organization (UNO) [39] containing statistics about the population of 230 countries over ten years (2001–2010)...

by geo-characteristics and context value f1. The pseudo-code of Context-FGWC2 is shown in Table 2

In Step of the Context-FGWC2, the intervals of primary membership consisting of the upper and lower memberships are

Date posted: 16/12/2017, 14:50
