An efficient method for mining frequent itemsets with double constraints

Hai Duong a, Tin Truong a, Bay Vo b,*

a Department of Mathematics and Computer Science, University of Dalat, Dalat, Vietnam

b Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh, Vietnam

Article info

Article history:

Received 23 March 2013

Received in revised form 11 July 2013

Accepted 5 September 2013

Available online 18 October 2013

Keywords:

Frequent itemsets

Closed frequent itemsets

Closed itemset lattice

Generators

Constraint mining

Abstract

Constraint-based frequent itemset mining is necessary when the needs and interests of users are the top priority. In this task, two opposite types of constraint are studied, namely anti-monotone and monotone constraints. Previous approaches have mainly mined frequent itemsets that satisfy one of these two types of constraint. Mining frequent itemsets that satisfy both types is of interest. The present study considers the problem of mining frequent itemsets with the following two conditions: they include a set $C_0$ (monotone) and contain no items of a set $C'_1$ (anti-monotone), where the intersection of $C_0$ and $C'_1$ is empty and they are changed regularly. A unique representation of frequent itemsets restricted on $C_0$ and $C'_1$ using closed itemsets and their generators is proposed. Then, an algorithm called MFS_DoubleCons is developed to quickly and distinctly generate all frequent itemsets that satisfy the constraints from the lattice of closed itemsets and generators instead of mining them directly from the database. The theoretical results are proven to be reliable. Extensive experiments on a broad range of synthetic and real databases that compare MFS_DoubleCons to dEclat-DC (a modified version of dEclat utilized to mine frequent itemsets with constraints) show the effectiveness of our approach.

1. Introduction

Frequent itemsets play an important role in many data mining tasks such as the mining of association rules and classification. Therefore, many frequent itemset mining algorithms, such as Apriori (Agrawal and Srikant, 1994), Eclat (Zaki et al., 1997), FP-Growth (Han et al., 2000), FP-Growth* (Grahne and Zhu, 2005), BitTable-FI (Dong and Han, 2007) and Index-BitTableFI (Song et al., 2008), have been proposed. Some condensed representations of frequent itemsets, such as frequent closed itemsets and maximal frequent itemsets, have also been proposed (Burdick et al., 2001; Lin and Kedem, 2002; Pasquier et al., 1999a, 1999b, 2005; Pei et al., 2000; Vo et al., 2013; Vo et al., 2012; Wang et al., 2003; Zaki and Hsiao, 2005).

Although the set of all frequent itemsets is quite large, users only care about a small number that satisfy some given constraints. A model of constraint-based mining has thus been developed (Bayardo et al., 2000; Nguyen et al., 1998; Srikant et al., 1997). Constraints help to focus on interesting knowledge and to reduce the number of patterns extracted to those of potential interest. In addition, they are used for decreasing the search space and enhancing the mining efficiency. Two important types of constraint have been studied, namely anti-monotone constraints (Nguyen et al., 1998), denoted as $C_{am}$, and monotone constraints (Pei et al., 2001), denoted as $C_m$. A constraint is anti-monotone (or monotone) if, whenever an itemset satisfies it, every subset (or superset) of that itemset also satisfies it. $C_{am}$ is simple and well suited to Apriori-like algorithms, so it is often integrated into them to prune candidates. $C_m$ is more complicated to exploit and less effective for pruning the search space.
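As a small illustration of these two notions, the following Python sketch checks by brute force that the constraint "is contained in $C_1$" is anti-monotone and that "includes $C_0$" is monotone. The four-item universe, the sets `C0` and `C1`, and all helper names are assumptions made for this example only; they do not come from the paper.

```python
from itertools import combinations

def powerset(items):
    """Yield every subset of `items` as a frozenset."""
    pool = sorted(items)
    for size in range(len(pool) + 1):
        for combo in combinations(pool, size):
            yield frozenset(combo)

def is_anti_monotone(constraint, universe):
    """True if every subset of a satisfying itemset also satisfies `constraint`."""
    sets = list(powerset(universe))
    return all(constraint(y) for x in sets if constraint(x) for y in sets if y <= x)

def is_monotone(constraint, universe):
    """True if every superset of a satisfying itemset also satisfies `constraint`."""
    sets = list(powerset(universe))
    return all(constraint(y) for x in sets if constraint(x) for y in sets if y >= x)

# Hypothetical universe and constraint sets, chosen only for illustration.
UNIVERSE = frozenset("abcd")
C0 = frozenset("a")
C1 = frozenset("abc")

def contained_in_c1(itemset):
    return itemset <= C1

def includes_c0(itemset):
    return itemset >= C0

print(is_anti_monotone(contained_in_c1, UNIVERSE))  # True
print(is_monotone(includes_c0, UNIVERSE))           # True
```

The exhaustive check is exponential in the size of the universe, so it is only meant to make the definitions concrete, not to be used on real data.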

Most previous approaches mine frequent itemsets with either $C_{am}$ or $C_m$. Mining frequent itemsets that satisfy both $C_{am}$ and $C_m$ is of interest. This can be accomplished by first mining frequent itemsets that satisfy $C_{am}$ using an algorithm such as Apriori (Agrawal and Srikant, 1994; Mannila and Toivonen, 1997), Eclat (Zaki and Hsiao, 2005), or FP-growth (Pei and Han, 2002), and then filtering the ones matching $C_m$ in a post-processing step. This approach is inefficient because it often has to test a large number of itemsets. A more complicated solution is to integrate both $C_{am}$ and $C_m$ into the algorithm to find all frequent itemsets that satisfy them. However, Jeudy and Boulicaut (2002) showed that the integration of $C_m$ can lead to a reduction in the pruning of $C_{am}$. Therefore, there is a tradeoff between $C_{am}$ and $C_m$ pruning.

Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/engappai

* Corresponding author. Tel.: +84 083974186. E-mail addresses: haidv@dlu.edu.vn (H. Duong), tintc@dlu.edu.vn (T. Truong), vdbay@it.tdt.edu.vn (B. Vo).

The present study considers problems that include many $C_m$ and $C_{am}$ constraints. Such a problem is stated as follows. Let $\mathcal{A}$ be a set of attributes or items and $\mathcal{T}$ be a database of transactions, where each transaction contains a set of items. A set $X \subseteq \mathcal{A}$ is called an itemset. The support of an itemset $X$, denoted as $\mathrm{supp}(X)$, is the ratio of the number of transactions containing $X$ to $N$, the number of transactions in $\mathcal{T}$. Given two threshold values, $s_0$ and $s_1$, and two subsets, $C_0$ and $C_1$, such that $0 < s_0 < s_1 \le 1$ and $C_0 \subseteq C_1 \subseteq \mathcal{A}$, the task is to find the class $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$ of all the frequent itemsets that (1) include a subset $C_0$ and (2) contain no items of a subset $C'_1 \subseteq \mathcal{A}$. Criterion (2) implies that the itemsets are contained in $C_1 = \mathcal{A} \setminus C'_1$ (the complement of $C'_1$ in $\mathcal{A}$). Formally, the goal is to discover all elements $L'$ of $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$ that match the following criteria, called the double-constraint:

(1) the support of $L'$ is in $[s_0, s_1]$: $s_0 \le \mathrm{supp}(L')$ ($C_{am}$) and $\mathrm{supp}(L') \le s_1$ ($C_m$);

(2) $L'$ includes $C_0$ ($C_m$) and $L'$ is contained in $C_1$ ($C_{am}$): $C_0 \subseteq L' \subseteq C_1$.
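The two criteria of the double-constraint can be checked directly for a single itemset. The sketch below does so in Python on a hypothetical four-transaction database; the database, the function names and the use of `Fraction` for supports are assumptions made for illustration, and the database is not the one shown in the paper's Fig. 1(a).

```python
from fractions import Fraction

# Hypothetical toy database: one frozenset of items per transaction.
TRANSACTIONS = [
    frozenset("aceg"),
    frozenset("acfh"),
    frozenset("adfh"),
    frozenset("bceg"),
]

def supp(itemset, transactions=TRANSACTIONS):
    """Support of `itemset`: fraction of transactions that contain it."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def satisfies_double_constraint(itemset, c0, c1, s0, s1, transactions=TRANSACTIONS):
    """Check criterion (1), s0 <= supp <= s1, and criterion (2), C0 <= L' <= C1."""
    itemset, c0, c1 = frozenset(itemset), frozenset(c0), frozenset(c1)
    support_ok = s0 <= supp(itemset, transactions) <= s1
    items_ok = c0 <= itemset <= c1
    return support_ok and items_ok

s0, s1 = Fraction(1, 4), Fraction(3, 4)
print(satisfies_double_constraint("ae", "a", "adefg", s0, s1))  # True
print(satisfies_double_constraint("ac", "a", "adefg", s0, s1))  # False: c is not in C1
```

Testing every candidate itemset this way is exactly the costly post-processing that the rest of the paper aims to avoid; the sketch only fixes the meaning of the criteria.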

1.1. Contributions

The contributions of this paper are as follows. First, the mining of frequent itemsets whose supports are not too high, based on the $s_1$ value, is proposed. Such itemsets can be valuable. For instance, they can help to discover association rules which are only slightly frequent but unique, or to find abnormal rules with high confidence values from low-frequency itemsets. Second, an efficient method for discovering frequent itemsets that satisfy both $C_{am}$ and $C_m$ constraints is proposed. Third, a structure and unique representation of frequent itemsets with double-constraint are proposed. Based on these, the MFS_DoubleCons algorithm is developed to quickly and distinctly mine all constraint-based frequent itemsets. Moreover, the algorithm accesses the database only once, even if the $C_{am}$ and $C_m$ constraints are changed regularly. This considerably enhances mining performance.

The rest of this paper is organized as follows. Some studies related to constraint-based frequent itemset and association rule mining are reviewed in Section 2. Section 3 presents some basic concepts in frequent itemset mining and notations. In Section 4, a unique representation of frequent itemsets with double-constraint and a procedure for quickly determining all closed frequent itemsets and their generators are described. An efficient algorithm for finding all frequent itemsets with double-constraint is also proposed. Experimental results are discussed in Section 5. Finally, conclusions and future work are presented in Section 6.

2. Related work

Constraint mining was first defined by Nguyen et al. (1998). They proposed an Apriori-like algorithm called CAP to reduce the frequent itemset computation. Pei and Han (2002) proposed the concept of convertible constraints and integrated them into the mining process of the FP-growth algorithm. Srikant et al. (1997) considered the problem of integrating constraints that are Boolean expressions over the presence or absence of items in the association rules. Bayardo et al. (2000) restricted the problem of mining association rules to two constraints, namely the consequent and the minimum improvement. Bucila et al. (2003) integrated both $C_{am}$ and $C_m$ into the DualMiner algorithm. Unfortunately, it has to scan the database many times and perform a large number of useless tests on long itemsets, especially when the minimum support is low. An Apriori-like algorithm called ExAMiner (Bonchi et al., 2003) uses both of these constraints to reduce the input data and the search space. A shortcoming of these algorithms is that they need to be rerun whenever the constraints are changed, which slows down mining for users. A solution is to mine and save, only once, all frequent itemsets with small values of minimum support. Then, when the constraints are changed, the system extracts the frequent itemsets that satisfy them. The computational and memory costs for this second step are very high because the number of frequent itemsets generated from the first step is usually enormous.

Recently, many authors have focused on combining the advantages of constrained mining and a condensed representation of frequent itemsets. Instead of mining all frequent itemsets, only the condensed ones are extracted. Using condensed frequent itemsets has three primary advantages. First, they are easier to store because their number is much smaller than the size of the class of all frequent itemsets, especially for dense databases. Second, they are mined only once from the database even when the constraints are changed. Third, they can be used to generate all frequent itemsets without access to the database. There are two types of condensed representation. The first type is maximal frequent itemsets (Burdick et al., 2001; Lin and Kedem, 2002). Since their cardinality is very small, they can be discovered quickly. All frequent itemsets can be generated from the maximal frequent itemsets. However, the generation often produces duplications. In addition, the generated frequent itemsets can lose information about their supports. Therefore, the supports need to be recomputed when mining association rules. The second type is closed frequent itemsets, called maximal ones, and their generators, called minimal ones (Bonchi and Lucchese, 2004; Boulicaut and Bykowski, 2000; Boulicaut et al., 2003; Pasquier et al., 2005). Each closed frequent itemset represents a class of all frequent itemsets that have the same closure. Thus, it (with its generators) can be used to uniquely determine all frequent itemsets in the class without losing their supports.

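3. Some concepts and notations

For a database $\mathcal{T}$, let $\mathcal{O}$ be a non-empty set containing transactions, $\mathcal{A}$ be a set of items appearing in those transactions, and $\mathcal{R}$ be a binary relation on $\mathcal{O} \times \mathcal{A}$. A set of items is called an itemset. Consider two operators $\lambda: 2^{\mathcal{O}} \to 2^{\mathcal{A}}$ and $\rho: 2^{\mathcal{A}} \to 2^{\mathcal{O}}$ defined as follows ($\lambda(\emptyset) := \mathcal{A}$ and $\rho(\emptyset) := \mathcal{O}$): $\forall O \subseteq \mathcal{O}$, $\forall A \subseteq \mathcal{A}$,

$$\lambda(O) = \{a \in \mathcal{A} \mid (o, a) \in \mathcal{R},\ \forall o \in O\}, \qquad \rho(A) = \{o \in \mathcal{O} \mid (o, a) \in \mathcal{R},\ \forall a \in A\}.$$

A closure operator $h$ in $2^{\mathcal{A}}$ (Birkhoff, 1948; Wille, 1992) is defined as $h = \lambda \circ \rho$. Denote $h(A)$ as the closure of a subset $A \subseteq \mathcal{A}$. $A$ is called a closed itemset iff $h(A) = A$. Let $\mathcal{CS}$ be the class of all closed itemsets. The minimum and maximum support thresholds are denoted as $s_0$ and $s_1$, respectively, where $1/|\mathcal{O}| \le s_0 \le s_1 \le 1$. Only non-trivial items in the subset $\mathcal{A}^F := \{a \in \mathcal{A}: \mathrm{supp}(\{a\}) \ge s_0\}$ of $\mathcal{A}$ are considered.

A non-empty itemset $A$ (subset of $\mathcal{A}^F$) is called frequent iff $s_0 \le \mathrm{supp}(A) \le s_1$. Note that if $s_1$ is equal to 1, the traditional frequent itemset concept is obtained. Briefly, $\mathcal{FS} := \{L' \subseteq \mathcal{A}: s_0 \le \mathrm{supp}(L') \le s_1\}$ denotes the class of all frequent itemsets and $\mathcal{FCS} := \mathcal{FS} \cap \mathcal{CS}$ denotes the class of all closed frequent itemsets. For any two sets $G$, $A$ with $\emptyset \ne G \subseteq A \subseteq \mathcal{A}$, $G$ is called a generator (Mannila and Toivonen, 1997) of $A$ if $h(G) = h(A)$ and ($\forall\, \emptyset \ne G' \subset G \Rightarrow h(G') \subset h(G)$). Let $\mathcal{G}(A)$ be the class of all generators of $A$ and $\mathcal{LGA}$ be the lattice of all closed itemsets and their generators.

Given $C_0$ and $C_1$ as two constraint subsets such that $\emptyset \ne C_0 \subseteq C_1 \subseteq \mathcal{A}$, $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1) := \{L' \in \mathcal{FS}: C_0 \subseteq L' \subseteq C_1\}$ denotes the class of all frequent itemsets with double-constraint, $\mathcal{FCS}_{\supseteq C_0}(s_0, s_1) := \{L \in \mathcal{FCS}: L \supseteq C_0\}$ denotes the class of all closed frequent itemsets which include $C_0$, and $\mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1) := \{L_{C_1} = L \cap C_1: L \in \mathcal{FCS}_{\supseteq C_0}(s_0, s_1) \text{ and } (\exists L_i \in \mathcal{G}(L): L_i \subseteq C_1)\}$ denotes the class of all closed frequent itemsets with double-constraint.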
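The operators $\rho$, $\lambda$, $h$ and the notion of a generator can be made concrete with a short brute-force Python sketch. It runs on a hypothetical four-transaction database (not the one in Fig. 1(a)); the function names `rho`, `lam`, `closure` and `generators` are illustrative choices, and the generator search simply enumerates subsets by increasing size, which is only practical for tiny examples.

```python
from itertools import combinations

# Hypothetical toy database: one frozenset of items per transaction.
TRANSACTIONS = [
    frozenset("aceg"),
    frozenset("acfh"),
    frozenset("adfh"),
    frozenset("bceg"),
]
ALL_ITEMS = frozenset().union(*TRANSACTIONS)

def rho(itemset):
    """rho(A): indices of the transactions that contain every item of A."""
    return frozenset(i for i, t in enumerate(TRANSACTIONS) if itemset <= t)

def lam(tids):
    """lambda(O): items shared by all transactions in O; lambda(empty) := all items."""
    if not tids:
        return ALL_ITEMS
    return frozenset.intersection(*(TRANSACTIONS[i] for i in tids))

def closure(itemset):
    """h = lambda o rho."""
    return lam(rho(frozenset(itemset)))

def generators(closed_itemset):
    """Non-empty minimal subsets of `closed_itemset` whose closure equals it."""
    closed_itemset = frozenset(closed_itemset)
    found = []
    for size in range(1, len(closed_itemset) + 1):
        for combo in combinations(sorted(closed_itemset), size):
            candidate = frozenset(combo)
            # A candidate is minimal if no smaller generator is strictly inside it.
            if closure(candidate) == closed_itemset and not any(g < candidate for g in found):
                found.append(candidate)
    return found

print("".join(sorted(closure("e"))))                       # ceg
print(["".join(sorted(g)) for g in generators("aceg")])    # ['ae', 'ag']
```

Because subsets are visited by increasing size, any proper subset with the same closure would already contain a previously found generator, which is why the single `any(...)` test is enough to enforce minimality.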

H. Duong et al. / Engineering Applications of Artificial Intelligence 27 (2014) 148–154

Remark 1. $\forall L' \in \mathcal{FS}$, if $C_0 \subseteq L' \subseteq C_1$, then $\mathrm{supp}(C_0) \ge \mathrm{supp}(L') \ge s_0$ and $\mathrm{supp}(C_1) \le \mathrm{supp}(L') \le s_1$. Thus, only $C_0$ and $C_1$ where $\mathrm{supp}(C_0) \ge s_0$ and $\mathrm{supp}(C_1) \le s_1$ are considered.

4. Mining frequent itemsets with double-constraint

Definition 1 (Tin and Anh, 2010) (Equivalence relation $\sim_h$ over $\mathcal{FS}$). $\forall A, B \in \mathcal{FS}$:

$$A \sim_h B \Leftrightarrow h(A) = h(B)$$

For each $A \in \mathcal{FS}$, $[A] := \{B \in \mathcal{FS}: h(B) = h(A)\}$ denotes the equivalence class of all frequent itemsets with the same closure. If $L \in \mathcal{FCS}$, then $[L] := \{L' \subseteq L: h(L') = L\}$. Using this relation, $\mathcal{FS}$ is partitioned into disjoint equivalence classes. Each class contains frequent itemsets with the same closure $L \in \mathcal{FCS}$ and the same support. The following theorem is derived:

Theorem 1 (Tin and Anh, 2010) (A partition of $\mathcal{FS}$).

$$\mathcal{FS} = \sum_{L \in \mathcal{FCS}} [L]$$

This partition allows us to independently mine frequent itemsets with double-constraint in each equivalence class.

Example 1. The rest of this paper considers the database $\mathcal{T}$ shown in Fig. 1(a). For $s_0 = 1/4$ and $s_1 = 3/4$, Charm-L (Zaki and Hsiao, 2005) and MinimalGenerators (Zaki, 2004) are used to mine the lattice of all closed frequent itemsets and their generators. The results are shown in Fig. 1(b). Then, $\mathcal{FS} = [a] + [c] + [ceg] + [ac] + [afh] + [aceg] + [bceg] + [acfh] + [adfh]$.$^1$ With this disjoint partition, the duplication in the mining process of frequent itemsets in different classes is greatly reduced. However, Theorem 2 in Anh et al. (2011) showed that frequent itemsets generated in each class can still be duplicated.

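4.1. Partition of the class of all frequent itemsets with double-constraint

Let $\mathcal{FS}_{C_0 \subseteq L_{C_1}} := \{L' \subseteq L_{C_1}: C_0 \subseteq L',\ h(L') = h(L_{C_1})\}$ be the class of all frequent itemsets in $[L]$ which include $C_0$ and are contained in $L_{C_1}$, where $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$. Based on the idea of the above partition and Proposition 1 in Hai et al. (2013), the following proposition is derived.

Proposition 1 (The disjoint partition of the class of all frequent itemsets with double-constraint). $\forall C_0 \subseteq C_1 \subseteq \mathcal{A}$:

$$\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1) = \sum_{L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)} \mathcal{FS}_{C_0 \subseteq L_{C_1}}$$

Proof.

The sets on the right-hand side are disjoint. In fact, take any $L'_i \in \mathcal{FS}_{C_0 \subseteq L_{C_1, i}}$, where $L_{C_1, i} := L_i \cap C_1 \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$, $L_i \in \mathcal{FCS}_{\supseteq C_0}(s_0, s_1)$, $i = 1, 2$ and $L_1 \ne L_2$. Then, $h(L'_1) = h(L_{C_1, 1}) = L_1 \ne L_2 = h(L_{C_1, 2}) = h(L'_2)$. Thus, $L'_1$ and $L'_2$ are in different equivalence classes $[L_1]$ and $[L_2]$. Hence, $L'_1 \ne L'_2$.

"$\subseteq$": $\forall L' \in \mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$, assign $L = h(L')$ and $L_{C_1} = L \cap C_1$; then $C_0 \subseteq L' \subseteq h(L') = L$ and $s_0 \le \mathrm{supp}(L') = \mathrm{supp}(L) \le s_1$. Let $L_i \in \mathcal{G}(L')$ (because $\mathcal{G}(L') \ne \emptyset$ (Anh et al., 2012)); then $L_i \subseteq L' \subseteq C_1$, $C_0 \subseteq L_{C_1}$, $h(L_i) = h(L') = L$ and $L_i \in \mathcal{G}(L') \subseteq \mathcal{G}(L)$. Thus, $L \in \mathcal{FCS}_{\supseteq C_0}(s_0, s_1)$ and $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$. Moreover, it is known that $C_0 \subseteq L' \subseteq L_{C_1} \subseteq L$ and $L = h(L') = h(L_i) \subseteq h(L_{C_1}) \subseteq L$. Therefore, $h(L_{C_1}) = h(L')$. It is thus concluded that $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$.

"$\supseteq$": $\forall L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$, $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$, then $h(L') = h(L_{C_1})$, $C_0 \subseteq L' \subseteq L_{C_1} \subseteq C_1$ and there exists $L_i \in \mathcal{G}(L)$: $L_i \subseteq C_1$. Then, $L_i \subseteq L_{C_1} \subseteq L$ and $L = h(L_i) \subseteq h(L_{C_1}) \subseteq h(L) = L$. Thus, $h(L') = h(L_{C_1}) = L$ and $s_0 \le \mathrm{supp}(L') = \mathrm{supp}(L) \le s_1$. Hence, $L' \in \mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$. □

Remark 2. The disjoint partition of the class of all frequent itemsets allows parallel algorithms to be designed to independently exploit each class, significantly reducing mining time.

Example 2. Set $s_0 = 1/4$, $s_1 = 3/4$, $C_0 = a$, and $C_1 = adefg$. Then, consider $L = aceg$, $\mathcal{G}(L) = \{L_1 = ae, L_2 = ag\}$. From Theorem 3 in Anh et al. (2011), $X_U = aeg$, $X_{U,1} = g$, $X_{U,2} = e$, $X_{-} = c$. Thus, $[aceg] = \{ae, aeg, aegc, aec, ag, agc\}$. Similarly, $[bceg] = \{bc, be, bg, bce, bcg, beg, bceg\}$, $[acfh] = \{cf, acf, cfh, acfh, ch, ach\}$, $[adfh] = \{d, ad, df, dh, adf, adh, dfh, adfh\}$, $[ceg] = \{e, ce, eg, ceg, g, cg\}$, $[ac] = \{ac\}$, $[afh] = \{f, af, fh, afh, h, ah\}$, $[c] = \{c\}$, and $[a] = \{a\}$. Frequent itemsets $L'$ of all classes are tested by the condition $C_0 \subseteq L' \subseteq C_1$, yielding $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1) = \{ae, aeg, ag, ad, adf, af, a\}$.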

In Example 2, testing the condition $C_0 \subseteq L' \subseteq C_1$ is very expensive. In the next section, a method for efficiently mining the frequent itemsets with double-constraint in each equivalence class is proposed.

4.2. Extracting closed frequent itemsets and generators with double-constraint

The elements of $\mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$ can be determined as follows. From the lattice $\mathcal{LGA}$, the class of all closed frequent itemsets including $C_0$ is first determined. Second, for each element $L$ of this class, if there exists a generator of $L$ contained in $C_1$, then the closed frequent itemset that satisfies the double-constraint, called $L_{C_1}$, is the intersection of $L$ and $C_1$, i.e. $L_{C_1} = L \cap C_1 \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$. Otherwise, $L$ is discarded. In this way, the generators of $L_{C_1}$ are determined by finding all generators (of $L$) contained in $C_1$. Similar to Proposition 2 in Anh et al. (2011), the following proposition is derived:

Fig. 1. (a) Example database and (b) corresponding lattice of closed itemsets. The frequent itemsets are underlined and their generators are in italics. Supports are shown on the right of the lattice.

$^1$ The symbol $+$ denotes the union of two disjoint sets.

Proposition 2 (Generate closed frequent itemsets and their generators with double-constraint).

a. $\forall L_{C_1} = L \cap C_1 \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$: $\mathcal{G}(L_{C_1}) = \{L_i \in \mathcal{G}(L): L_i \subseteq C_1\}$.

b. The elements of $\mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$ are generated distinctly.

From this proposition, the procedure for generating all closed frequent itemsets with double-constraint and their generators using $\mathcal{LGA}$ is shown in Fig. 2.
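The selection described above can be sketched in Python as follows. This is a minimal sketch in the spirit of the MFCS_DoubleCons procedure, not a transcription of Fig. 2. The lattice is assumed to be given as a plain dictionary from each closed frequent itemset to its generators, already restricted to supports in $[s_0, s_1]$; the generators listed here are read off Example 2 as the minimal members of each class, and the function name is an illustrative choice.

```python
# Closed frequent itemsets and their generators, read off Example 2
# (generators = minimal members of each equivalence class).
LATTICE = {
    "a": ["a"],
    "c": ["c"],
    "ac": ["ac"],
    "ceg": ["e", "g"],
    "afh": ["f", "h"],
    "aceg": ["ae", "ag"],
    "bceg": ["bc", "be", "bg"],
    "acfh": ["cf", "ch"],
    "adfh": ["d"],
}

def mfcs_double_cons(lattice, c0, c1):
    """Map each L_C1 = L & C1 to its generators G(L_C1) = {L_i in G(L): L_i <= C1}."""
    c0, c1 = frozenset(c0), frozenset(c1)
    result = {}
    for closed, gens in lattice.items():
        closed_set = frozenset(closed)
        if not c0 <= closed_set:          # L must include C0
            continue
        kept = [frozenset(g) for g in gens if frozenset(g) <= c1]
        if kept:                          # at least one generator inside C1
            result[closed_set & c1] = kept
    return result

for l_c1, gens in mfcs_double_cons(LATTICE, "a", "adefg").items():
    print("".join(sorted(l_c1)), ["".join(sorted(g)) for g in gens])
# a ['a']
# af ['f']
# aeg ['ae', 'ag']
# adf ['d']
```

Only the lattice is consulted, never the transactions, so the same call can be repeated with new values of $C_0$ and $C_1$ without rescanning the database.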

Example 3. With the values of $C_0$, $C_1$, $s_0$, and $s_1$ as in Example 2, closed frequent itemsets with double-constraint and their generators are mined using MFCS_DoubleCons. Table 1 shows the process.

Remark 3. At line 3 of the MFCS_DoubleCons procedure, if $L$ includes $C_0$, then all its supersets also include $C_0$ (the property of constraint $C_m$). A similar property holds for the test $\mathrm{supp}(L) \le s_1$. Thus, if a candidate passes the tests, its supersets do not need to be tested, significantly reducing the execution time of the algorithm. A similar remark can be made for constraint $s_0$ (constraint $C_{am}$).

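4.3. Mining all frequent itemsets with double-constraint

This section describes the unique representation and the structure of frequent itemsets with double-constraint. For each $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$, let $K_{min,C_0,C_1} := \operatorname{Minimal}\{K_i = L_i \setminus C_0 \mid L_i \in \mathcal{G}(L_{C_1})\}$ be the class of all the minimal itemsets of $\{L_i \setminus C_0 \mid L_i \in \mathcal{G}(L_{C_1})\}$ in terms of the inclusion order "$\subseteq$" on sets. Assign $K_{U,C_0,C_1} := \bigcup_{K_i \in K_{min,C_0,C_1}} K_i$, $K_{U,C_0,C_1,i} := K_{U,C_0,C_1} \setminus K_i$, $K_{-,C_0,C_1} := L_{C_1} \setminus (K_{U,C_0,C_1} + C_0)$, and define:

$$\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}} := \{L' = C_0 + K_i + K'_i + K'' \mid K_i \in K_{min,C_0,C_1},\ K'_i \subseteq K_{U,C_0,C_1,i},\ K'' \subseteq K_{-,C_0,C_1} \text{ and } (K_j \not\subset K_i + K'_i,\ \forall K_j \in K_{min,C_0,C_1}: 1 \le j < i)\} \quad (*)$$

Remark 4. If there exists $L_i \in \mathcal{G}(L_{C_1})$ such that $K_i = L_i \setminus C_0 = \emptyset$, then $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}} = \{L' = C_0 + K'',\ K'' \subseteq L_{C_1} \setminus C_0\}$. In this way, the frequent itemsets with double-constraint are generated more quickly. Indeed, if there exists $L_i \in \mathcal{G}(L_{C_1})$ such that $K_i = L_i \setminus C_0 = \emptyset$, which implies that $L_i \subseteq C_0$, then $K_{min,C_0,C_1} = \{\emptyset\}$, $K_{U,C_0,C_1} = \emptyset$, $K_{U,C_0,C_1,i} = \emptyset$. This implies that $K_{-,C_0,C_1} = L_{C_1} \setminus C_0$. Thus, $L' = C_0 + K'' \in \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$, where $K'' \subseteq L_{C_1} \setminus C_0$.

Theorem 2 (Distinctly generate the elements of $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$). For each $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$:

a) $\mathcal{FS}_{C_0 \subseteq L_{C_1}} = \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$.

b) The elements of $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$ are generated distinctly.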

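Proof. (a)

"$\subseteq$": If $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$, then, according to Theorem 3 in Anh et al. (2011), $L' = L_i + L'_i + \tilde{L}$, where $L_i \in \mathcal{G}(L_{C_1})$, $L_U = \bigcup_{L_i \in \mathcal{G}(L_{C_1})} L_i$, $L'_i \subseteq L_{U,i} = L_U \setminus L_i$, and $\tilde{L} \subseteq L_{-} = L_{C_1} \setminus L_U$. Thus,

$$L' = C_0 \cap (L_i + L'_i + \tilde{L}) + (L_i \setminus C_0) + (L'_i \setminus C_0) + (\tilde{L} \setminus C_0) = C_0 + K_i + (L'_i \setminus C_0) + (\tilde{L} \setminus C_0) = C_0 + K_i + K'_i + K'',$$

where $K_i = L_i \setminus C_0 \in K_{min,C_0,C_1}$,

$$K'_i = (L'_i \setminus C_0) \cap K_{U,C_0,C_1} \subseteq K_{U,C_0,C_1} \cap (L_U \setminus L_i) \subseteq K_{U,C_0,C_1} \setminus L_i \subseteq K_{U,C_0,C_1} \setminus K_i = K_{U,C_0,C_1,i}$$

and

$$K'' = [(L'_i \setminus C_0) \setminus K_{U,C_0,C_1} + (\tilde{L} \setminus C_0)] \subseteq (L_{C_1} \setminus (C_0 + K_{U,C_0,C_1})) + (L_{C_1} \setminus K_{U,C_0,C_1}) \setminus C_0 \subseteq L_{C_1} \setminus (C_0 + K_{U,C_0,C_1}) = K_{-,C_0,C_1}.$$

The lowest index $i$ is always selected such that $K_i = L_i \setminus C_0 \in K_{min,C_0,C_1}$ is the minimal set and $L'$ has the representation in the form $L' = C_0 + K_i + K'_i + K''$.

Moreover, if there exists an index $k < i$ such that $K_k \in K_{min,C_0,C_1}$ and $K_k \subset K_i + K'_i$, then $L'$ also has a different representation: $L' = C_0 + K_k + K'_k + K''$, where $K'_k = (K_i + K'_i) \setminus K_k \subseteq K_{U,C_0,C_1,k}$. This contradicts the selection of index $i$. Thus, $K_k \not\subset K_i + K'_i$, $\forall K_k \in K_{min,C_0,C_1}$: $1 \le k < i$. Hence, $L' \in \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$.

"$\supseteq$": If $L' \in \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$, there exist $K_i = L_i \setminus C_0 \in K_{min,C_0,C_1}$, $L_i \in \mathcal{G}(L_{C_1})$, $K'_i \subseteq K_{U,C_0,C_1,i}$, $K'' \subseteq K_{-,C_0,C_1}$: $C_0 \subseteq L' = C_0 + K_i + K'_i + K'' \subseteq L_{C_1}$. Thus, $h(L') \subseteq h(L_{C_1})$. On the other hand, since $L_i = K_i + (L_i \cap C_0) \subseteq K_i + C_0 \subseteq L'$, $h(L_{C_1}) = h(L_i) \subseteq h(L')$. It is inferred that $h(L') = h(L_{C_1})$. Hence, $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$.

(b) Assume that there exist $i$ and $k$, where $i > k \ge 1$, such that $L'_k = L'_i$, where $L'_k := C_0 + K_k + K'_k + K''_k$, $L'_i := C_0 + K_i + K'_i + K''_i$, $K_k, K_i \in K_{min,C_0,C_1}$, $K_k \ne K_i$, $K''_k, K''_i \subseteq K_{-,C_0,C_1}$, $K'_k \subseteq K_{U,C_0,C_1,k}$, $K'_i \subseteq K_{U,C_0,C_1,i}$. Since $K_k \cap C_0 = \emptyset$ and $K_k \cap K''_i = \emptyset$, $K_k \subset K_i + K'_i$ (the equality does not hold because $K_i$ and $K_k$ are two different minimal sets). This contradicts the way that index $i$ is selected. Therefore, all elements of $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$ are generated distinctly. □

Table 1. Mined closed frequent itemsets and their generators with double constraint.

$L$ | $\mathcal{G}(L)$ | $L \supseteq C_0$ | $L_i \in \mathcal{G}(L)$ and $C_1 \supseteq L_i$ | $L_{C_1} = L \cap C_1$ | $\mathcal{G}(L_{C_1})$
acfh | cf, ch | acfh | | |
ceg | e, g | | | |

[Only these two rows of Table 1 survive in this copy.]

Fig. 2. MFCS_DoubleCons procedure. [Pseudocode not recoverable.]

Fig. 3. MFS_DoubleCons algorithm. [Pseudocode not recoverable; it calls MFCS_DoubleCons and MFS_DoubleCons_OneClass.]

Fig. 4. MFS_DoubleCons_OneClass procedure. [Pseudocode not recoverable; it tests the condition (*).]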

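Example 4. For $s_0 = 1/4$, $s_1 = 3/4$, $C_0 = a$, and $C_1 = adefg$, consider $L = aceg$, $\mathcal{G}(L) = \{L_1 = ae, L_2 = ag\}$. The following results are obtained: $C_0 \subseteq L$, $L_{C_1} = L \cap C_1 = aeg$, $\mathcal{G}(L_{C_1}) = \{L_1 = ae, L_2 = ag\}$, $K_1 = L_1 \setminus C_0 = e$, $K_2 = L_2 \setminus C_0 = g$, $K_{min,C_0,C_1} = \operatorname{Minimal}\{K_1, K_2\} = \{K_1, K_2\} = \{e, g\}$, $K_{U,C_0,C_1} = eg$, $K_{U,C_0,C_1,1} = g$, $K_{U,C_0,C_1,2} = e$, $K_{-,C_0,C_1} = \emptyset$. With $K_1 = e$, $a + e,\ a + e + g \in \mathcal{FS}_{C_0 \subseteq aceg \cap C_1}(s_0, s_1)$, and with $K_2 = g$, $a + g \in \mathcal{FS}_{C_0 \subseteq aceg \cap C_1}(s_0, s_1)$. Thus, $\mathcal{FS}_{C_0 \subseteq aceg \cap C_1}(s_0, s_1) = \{ae, aeg, ag\}$. Similarly, $\mathcal{FS}_{C_0 \subseteq adfh \cap C_1}(s_0, s_1) = \{ad, adf\}$, $\mathcal{FS}_{C_0 \subseteq afh \cap C_1}(s_0, s_1) = \{af\}$ and $\mathcal{FS}_{C_0 \subseteq a \cap C_1}(s_0, s_1) = \{a\}$. Hence, $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1) = \{ae, aeg, ag, ad, adf, af, a\}$.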
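The construction in formula (*) can be sketched in Python for one class at a time. The following is a minimal sketch under stated assumptions, not the authors' MFS_DoubleCons_OneClass pseudocode: the function and variable names are illustrative, the minimal sets $K_i$ are put in a fixed order (by size, then alphabetically) so that "earlier index" is well defined, and all subsets are enumerated with `itertools.combinations`.

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of `items` (including the empty set) as a frozenset."""
    pool = sorted(items)
    for size in range(len(pool) + 1):
        for combo in combinations(pool, size):
            yield frozenset(combo)

def mine_one_class(l_c1, generators, c0):
    """Enumerate the itemsets L' = C0 + K_i + K'_i + K'' of formula (*)."""
    l_c1, c0 = frozenset(l_c1), frozenset(c0)
    # K_i = L_i \ C0 for every generator, deduplicated and put in a fixed order.
    ks = sorted({frozenset(g) - c0 for g in generators},
                key=lambda k: (len(k), sorted(k)))
    # Keep only the sets that are minimal with respect to inclusion.
    k_min = [k for k in ks if not any(other < k for other in ks)]
    k_union = frozenset().union(*k_min)        # K_U
    k_rest = l_c1 - (k_union | c0)             # K_-
    found = []
    for i, k_i in enumerate(k_min):
        for k_prime in subsets(k_union - k_i):         # K'_i inside K_{U,i}
            core = k_i | k_prime
            # Condition (*): no earlier minimal set may lie inside K_i + K'_i.
            if any(k_j <= core for k_j in k_min[:i]):
                continue
            for k_second in subsets(k_rest):           # K'' inside K_-
                found.append(c0 | core | k_second)
    return found

result = mine_one_class("aeg", ["ae", "ag"], "a")
print(sorted("".join(sorted(s)) for s in result))  # ['ae', 'aeg', 'ag']
```

Applying the same function to the other three classes of Example 4, namely `("adf", ["d"])`, `("af", ["f"])` and `("a", ["a"])` with $C_0 = a$, yields the remaining itemsets; the last call exercises the shortcut of Remark 4, because the only $K_i$ is empty.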

According to Theorem 2, the procedure MFS_DoubleCons_OneClass (pseudocode shown in Fig. 4) is used for mining frequent itemsets with double-constraint in a class. Using Proposition 1 and this procedure, the algorithm MFS_DoubleCons, shown in Fig. 3, is proposed for mining all frequent itemsets with double-constraint.

5. Experimental results

Experiments were performed on a PC with an i5-2400 CPU (3.10 GHz @ 3.09 GHz) and 3.16 GB of memory, running Windows XP. The algorithms were coded in C#. To compare the performance, the source code for Charm-L, MinimalGenerators and dEclat (Anon, 2010) was converted to C#. Charm-L and MinimalGenerators were used to mine the lattice of the closed itemsets and their generators. dEclat was used to mine all frequent itemsets. Here, it was modified for mining frequent itemsets with double-constraint. In its first step, only frequent itemsets contained in $C_1$ are taken. Then, the output is filtered to determine frequent itemsets that satisfy $s_1$ and $C_0$. This modified version is called dEclat-DC. For the post-processing approach, Gen_Itemsets_DC is a modification of Gen_Itemsets (Anh et al., 2011).

For the performance test, the following benchmark databases in FIMDR (2009) were chosen: Pumsb, Connect, Mushroom, T10I4D100K, and T40I10D100K. Pumsb, Connect, and Mushroom are real and dense, i.e., they produce many long frequent itemsets even for very high support values. The others are synthetic and sparse. Table 2 shows their characteristics.

Table 2. Database characteristics.

Table 3. Time reductions of MFS_DoubleCons compared to Gen_Itemsets_DC and dEclat-DC.

DB, MS | R_MG (%) | R_ME (%) | RR (%)

Fig. 6. Performance results for T10I4D100K.

The support threshold $s_1$ is fixed at 0.95. Assuming that the size of $C_0$ is $m$, a set $C_1$ with a size of $m + d \times |\mathcal{A}^F| / 100$ ($d \in [1, 100]$) is chosen. With dense databases, for each pair of database (DB) and minimum support (MS), $m$ ranges from 4% to 14% of $|\mathcal{A}^F|$ (step 2%) and $d = 81$. For sparse ones, $m$ ranges from 0.2% to 2% of $|\mathcal{A}^F|$ (step 0.2%) and $d = 93$. For each pair of $C_0$'s size and $C_1$'s size, 10 double-constraints are selected for the dense databases and 6 double-constraints are selected for the sparse ones (each of them contains items randomly selected from $\mathcal{A}^F$). Let T_MD, T_GID, and T_DED be the average execution times of MFS_DoubleCons, Gen_Itemsets_DC, and dEclat-DC for the 60 selected double-constraints.

Table 3 shows the experimental evaluation of MFS_DoubleCons against Gen_Itemsets_DC and dEclat-DC, where column R_MG shows the ratios of T_MD to T_GID, and column R_ME shows the ratios of T_MD to T_DED. NCIn is the number of frequent itemsets contained in $C_1$ and NNC is the number of frequent itemsets contained in $C_1$ without including $C_0$. Column RR gives the average of the ratios of NNC to NCIn. Compared to Gen_Itemsets_DC, MFS_DoubleCons is faster for dense databases. The time is reduced by 30.3–0.96%. For sparse databases, the reduction is lower because the frequent itemsets are few and small, leading to a low cost for testing the constraints. Compared to dEclat-DC, MFS_DoubleCons is much faster for all databases. The time is reduced by 40.6–2.01%. The reason for the reduction is that a large number of candidates (RR ranges from 95.94% to 99.99%) fail the last test of dEclat-DC, leading to lower performance.

Figs. 5–8 show the comparisons of the average execution times for various support values. The performance and scalability of MFS_DoubleCons are superior to those of Gen_Itemsets_DC and dEclat-DC.

Figs. 9–11 show the performance results for various numbers of double-constraints. The performance gap between MFS_DoubleCons and dEclat-DC increases with the number of double-constraints. The main reason is that, when the double-constraint changes, MFS_DoubleCons executes without creating the lattice of closed itemsets and their generators again from the database. MFS_DoubleCons outperforms both Gen_Itemsets_DC and dEclat-DC, especially when the minimum support is lower and the number of double-constraints is high.

6. Conclusion and future work

This paper presented a unique representation and structure of frequent itemsets with double-constraint. The correctness of the theoretical results was proven. An efficient algorithm was developed for mining all frequent itemsets with double-constraint. Tests on the benchmark databases show the efficiency of the proposed approach. Moreover, the tests show that the algorithm outperforms existing algorithms, especially when the minimum support values are very low and the number of double-constraints is high.

In the future, the mining of frequent itemsets with more complicated types of constraint will be studied and association rules based on these frequent itemsets will be derived.

Fig. 8. Performance results for T40I10D100K.

Fig. 9. Performance results for Mushroom for various numbers of double-constraints.

Acknowledgments

This work was funded by Dalat University and Vietnam's National Foundation for Science and Technology Development (NAFOSTED) under Grant no. 102.01-2012.17.

References

Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 478–499.

Anh, T., Hai, D., Tin, T., Bac, L., 2011. Efficient algorithms for mining frequent itemsets with constraint. In: Proceedings of the Third International Conference on Knowledge and Systems Engineering, pp. 19–25.

Anh, T., Hai, D., Tin, T., Bac, L., 2012. Mining frequent itemsets with dualistic constraints. In: Proceedings of the PRICAI 2012, LNAI, vol. 7458, Springer, pp. 807–813.

Anh, T., Tin, T., Bac, L., Hai, D., 2012. Mining association rules restricted on constraint. In: Proceedings of the IEEE-RIVF 2012, pp. 51–56.

Anon. 〈http://www.cs.rpi.edu/zaki/wwwnew/pmwiki.php/Software/Software#patutils〉 (accessed 2010).

Bayardo, R.J., Agrawal, R., Gunopulos, D., 2000. Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery 4 (2–3), 217–240.

Birkhoff, G., 1948. Lattice Theory. American Mathematical Society, New York.

Bonchi, F., Lucchese, C., 2004. On closed constrained frequent pattern mining. In: Proceedings of the IEEE ICDM'04, pp. 35–42.

Bonchi, F., Giannotti, F., Mazzanti, A., Pedreschi, D., 2003. Examiner: optimized level-wise frequent pattern mining with monotone constraints. In: Proceedings of the IEEE ICDM'03, pp. 11–18.

Boulicaut, J.F., Bykowski, A., 2000. Frequent closures as a concise representation for binary data mining. In: Proceedings of the PAKDD'00, vol. 1805, Springer, pp. 62–73.

Boulicaut, J.F., Bykowski, A., Rigotti, C., 2003. Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery 7 (1), 5–22.

Bucila, C., Gehrke, J.E., Kifer, D., White, W., 2003. Dualminer: a dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery 7 (3), 241–272.

Burdick, D., Calimlim, M., Gehrke, J., 2001. MAFIA: a maximal frequent itemset algorithm for transactional databases. In: Proceedings of the IEEE ICDE'01, pp. 443–452.

Dong, J., Han, M., 2007. BitTable-FI: an efficient mining frequent itemsets algorithm. Knowledge-Based Systems 20 (4), 329–335.

Frequent Itemset Mining Dataset Repository (FIMDR), 〈http://fimi.cs.helsinki.fi/data/〉 (accessed 2009).

Grahne, G., Zhu, J., 2005. Fast algorithms for frequent itemset mining using fp-trees. IEEE Transactions on Knowledge and Data Engineering 17 (10), 1347–1362.

Hai, D., Tin, T., Bac, L., 2013. An efficient algorithm for mining frequent itemsets with single constraint. In: Proceedings of ICCSAMA 2013, Advanced Computational Methods for Knowledge Engineering, Springer, pp. 367–378.

Han, J., Pei, J., Yin, Y., 2000. Mining frequent patterns without candidate generation. SIGMOD'00, 1–12.

Jeudy, B., Boulicaut, J.F., 2002. Optimization of association rule mining queries. Intelligent Data Analysis 6 (4), 341–357.

Lin, D.I., Kedem, Z.M., 2002. Pincer search: an efficient algorithm for discovering the maximum frequent sets. IEEE Transactions on Knowledge and Data Engineering 14 (3), 553–566.

Mannila, H., Toivonen, H., 1997. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1 (3), 241–258.

Nguyen, R.T., Lakshmanan, V.S., Han, J., Pang, A., 1998. Exploratory mining and pruning optimizations of constrained association rules. In: Proceedings of the 1998 ACM-SIGMOD International Conference on the Management of Data, pp. 13–24.

Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L., 1999a. Discovering frequent closed itemsets for association rules. ICDT'99, 398–416.

Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L., 1999b. Efficient mining of association rules using closed itemset lattices. Information Systems 24 (1), 25–46.

Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., Lakhal, L., 2005. Generating a condensed representation for association rules. Journal of Intelligent Information Systems 24 (1), 29–60.

Pei, J., Han, J., 2002. Constrained frequent pattern mining: a pattern-growth view. ACM SIGKDD Explorations 4 (1), 31–39.

Pei, J., Han, J., Mao, R., 2000. CLOSET: an efficient algorithm for mining frequent closed itemsets. SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 11–20.

Pei, J., Han, J., Lakshmanan, L.V.S., 2001. Mining frequent itemsets with convertible constraints. In: Proceedings of the IEEE ICDE'01, pp. 433–442.

Song, W., Yang, B., Xu, Z., 2008. Index-BitTableFI: an improved algorithm for mining frequent itemsets. Knowledge-Based Systems 21 (6), 507–513.

Srikant, R., Vu, Q., Agrawal, R., 1997. Mining association rules with item constraints. In: Proceedings of the KDD'97, pp. 67–73.

Tin, T., Anh, T., 2010. Structure of set of association rules based on concept lattice. Advances in Intelligent Information and Database Systems, SCI, vol. 283, Springer, pp. 217–227.

Vo, B., Hong, T.P., Le, B., 2012. DBV-Miner: a dynamic bit-vector approach for fast mining frequent closed itemsets. Expert Systems with Applications 39 (8), 7196–7206.

Vo, B., Hong, T.P., Le, B., 2013. A lattice-based approach for mining most generalization association rules. Knowledge-Based Systems 45, 20–30.

Wang, J., Han, J., Pei, J., 2003. CLOSET+: searching for the best strategies for mining frequent closed itemsets. SIGKDD'03, 236–245.

Wille, R., 1992. Concept lattices and conceptual knowledge systems. Computers & Mathematics with Applications 23, 493–515.

Zaki, M.J., 2004. Mining non-redundant association rules. Data Mining and Knowledge Discovery, pp. 223–248.

Zaki, M.J., Hsiao, C.J., 2005. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering 17 (4), 462–478.

Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W., 1997. New algorithms for fast discovery of association rules. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD'97), pp. 283–296.

Fig. 11. Performance results for synthetic and sparse databases for various numbers of double-constraints.
