An efficient method for mining frequent itemsets with double constraints
Hai Duong a, Tin Truong a, Bay Vo b,*
a Department of Mathematics and Computer Science, University of Dalat, Dalat, Vietnam
b Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh, Vietnam
Article info
Article history:
Received 23 March 2013
Received in revised form 11 July 2013
Accepted 5 September 2013
Available online 18 October 2013
Keywords:
Frequent itemsets
Closed frequent itemsets
Closed itemset lattice
Generators
Constraint mining
Abstract
Constraint-based frequent itemset mining is necessary when the needs and interests of users are the top priority. In this task, two opposite types of constraint are studied, namely anti-monotone and monotone constraints. Previous approaches have mainly mined frequent itemsets that satisfy one of these two types of constraint. Mining frequent itemsets that satisfy both types is of interest. The present study considers the problem of mining frequent itemsets with the following two conditions: they include a set $C_0$ (monotone) and contain no items of a set $C'_1$ (anti-monotone), where the intersection of $C_0$ and $C'_1$ is empty and both sets are changed regularly. A unique representation of frequent itemsets restricted on $C_0$ and $C'_1$ using closed itemsets and their generators is proposed. Then, an algorithm called MFS_DoubleCons is developed to quickly and distinctly generate all frequent itemsets that satisfy the constraints from the lattice of closed itemsets and generators instead of mining them directly from the database. The theoretical results are proven to be reliable. Extensive experiments on a broad range of synthetic and real databases that compare MFS_DoubleCons to dEclat-DC (a modified version of dEclat utilized to mine frequent itemsets with constraints) show the effectiveness of our approach.
© 2013 Elsevier Ltd. All rights reserved.
1 Introduction
Frequent itemsets play an important role in many data mining tasks such as the mining of association rules and classification. Therefore, many frequent itemset mining algorithms, such as Apriori (Agrawal and Srikant, 1994), Eclat (Zaki et al., 1997), FP-Growth (Han et al., 2000), FP-Growth* (Grahne and Zhu, 2005), BitTable-FI (Dong and Han, 2007) and Index-BitTableFI (Song et al., 2008), have been proposed. Some condensed representations of frequent itemsets, such as frequent closed itemsets and maximal frequent itemsets, have also been proposed (Burdick et al., 2001; Lin and Kedem, 2002; Pasquier et al., 1999a, 1999b, 2005; Pei et al., 2000; Vo et al., 2013; Vo et al., 2012; Wang et al., 2003; Zaki and Hsiao, 2005).
Although the set of all frequent itemsets is quite large, users only care about a small number that satisfy some given constraints. A model of constraint-based mining has thus been developed (Bayardo et al., 2000; Nguyen et al., 1998; Srikant et al., 1997). Constraints help to focus on interesting knowledge and to reduce the number of patterns extracted to those of potential interest. In addition, they are used for decreasing the search space and enhancing the mining efficiency. Two important types of constraint have been studied, namely anti-monotone constraints (Nguyen et al., 1998), denoted as $\mathcal{C}_{am}$, and monotone constraints (Pei et al., 2001), denoted as $\mathcal{C}_m$. If an itemset satisfies a constraint $\mathcal{C}_{am}$ (or $\mathcal{C}_m$), then each of its subsets (or supersets) also satisfies the constraint. $\mathcal{C}_{am}$ is simple and well suited to Apriori-like algorithms, so it is often integrated into them to prune candidates. $\mathcal{C}_m$ is more complicated to exploit and less effective for pruning the search space.
Most previous approaches mine frequent itemsets with either $\mathcal{C}_{am}$ or $\mathcal{C}_m$. Mining frequent itemsets that satisfy both $\mathcal{C}_{am}$ and $\mathcal{C}_m$ is of interest. This can be accomplished by first mining frequent itemsets that satisfy $\mathcal{C}_{am}$ using an algorithm such as Apriori (Agrawal and Srikant, 1994; Mannila and Toivonen, 1997), Eclat (Zaki and Hsiao, 2005), or FP-growth (Pei and Han, 2002), and then filtering the ones matching $\mathcal{C}_m$ in a post-processing step. This approach is inefficient because it often has to test a large number of itemsets. A more complicated solution is to integrate both $\mathcal{C}_{am}$ and $\mathcal{C}_m$ into the algorithm to find all frequent itemsets that satisfy them. However, Jeudy and Boulicaut (2002) showed that the integration of $\mathcal{C}_m$ can lead to a reduction in the pruning of $\mathcal{C}_{am}$. Therefore, there is a tradeoff between $\mathcal{C}_{am}$ and $\mathcal{C}_m$ pruning.
The present study considers problems that include many $\mathcal{C}_m$ and $\mathcal{C}_{am}$ constraints. Such a problem is stated as follows. Let $\mathcal{A}$ be a set of attributes or items and $\mathcal{T}$ be a database of transactions,
where each transaction contains a set of items. A set $X \subseteq \mathcal{A}$ is called an itemset. The support of an itemset $X$, denoted as supp($X$), is the ratio of the number of transactions containing $X$ to $N$, the number of transactions in $\mathcal{T}$. Given two threshold values, $s_0$ and $s_1$, and two subsets, $C_0$ and $C_1$, such that $0 < s_0 < s_1 \le 1$ and $C_0 \subseteq C_1 \subseteq \mathcal{A}$, the task is to find the class $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$ of all the frequent itemsets that (1) include a subset $C_0$ and (2) contain no items of a subset $C'_1 \subseteq \mathcal{A}$. Criterion (2) implies that the itemsets are contained in $C_1 = \mathcal{A} \setminus C'_1$ (the complement of $C'_1$ in $\mathcal{A}$). Formally, the goal is to discover all elements $L'$ of $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$ that match the following criteria, called the double-constraint:
(1) the support of $L'$ is in $[s_0, s_1]$: $s_0 \le$ supp($L'$) ($\mathcal{C}_{am}$) and supp($L'$) $\le s_1$ ($\mathcal{C}_m$);
(2) $L'$ includes $C_0$ ($\mathcal{C}_m$) and $L'$ is contained in $C_1$ ($\mathcal{C}_{am}$): $C_0 \subseteq L' \subseteq C_1$.

* Corresponding author. Tel.: +84 083974186.
E-mail addresses: haidv@dlu.edu.vn (H. Duong), tintc@dlu.edu.vn (T. Truong), vdbay@it.tdt.edu.vn (B. Vo).
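The double-constraint can be checked naively by enumerating every itemset between $C_0$ and $C_1$ and testing its support. The following Python sketch illustrates only the problem definition, not the method proposed in this paper; the toy database and the function names `supp` and `brute_force` are hypothetical and are not taken from the paper.

```python
from itertools import combinations

# Hypothetical toy database (NOT the paper's Fig. 1): one frozenset per transaction.
TRANSACTIONS = [frozenset("abc"), frozenset("ac"), frozenset("ad"), frozenset("bc")]


def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def brute_force(transactions, s0, s1, c0, c1):
    """Return all L' with C0 <= L' <= C1 (set inclusion) and s0 <= supp(L') <= s1."""
    c0, c1 = frozenset(c0), frozenset(c1)
    free = sorted(c1 - c0)  # items that may be added to C0
    result = set()
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            candidate = c0 | frozenset(extra)
            if s0 <= supp(candidate, transactions) <= s1:
                result.add(candidate)
    return result


if __name__ == "__main__":
    found = brute_force(TRANSACTIONS, 0.25, 0.75, "a", "acd")
    print(sorted("".join(sorted(x)) for x in found))
```

This enumeration visits $2^{|C_1 \setminus C_0|}$ candidates and scans the database for each one, which is the cost that a lattice-based approach aims to avoid.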
1.1 Contributions
The contributions of this paper are as follows. First, the mining of frequent itemsets whose supports are not too high, bounded by the $s_1$ value, is proposed. Such itemsets can be valuable. For instance, they can help to discover association rules that are infrequent but unique, or to find abnormal rules with high confidence values from itemsets with low frequency. Second, an efficient method for discovering frequent itemsets that satisfy both $\mathcal{C}_{am}$ and $\mathcal{C}_m$ constraints is proposed. Third, a structure and unique representation of frequent itemsets with double-constraint are proposed. Based on these, the MFS_DoubleCons algorithm is developed to quickly and distinctly mine all constraint-based frequent itemsets. Moreover, the algorithm accesses the database only once, even if the $\mathcal{C}_{am}$ and $\mathcal{C}_m$ constraints are changed regularly. This considerably enhances mining performance.
The rest of this paper is organized as follows. Some studies related to constraint-based frequent itemset and association rule mining are reviewed in Section 2. Section 3 presents some basic concepts in frequent itemset mining and notations. In Section 4, a unique representation of frequent itemsets with double-constraint and a procedure for quickly determining all closed frequent itemsets and their generators are described. An efficient algorithm for finding all frequent itemsets with double-constraint is also proposed. Experimental results are discussed in Section 5. Finally, conclusions and future work are presented in Section 6.
2 Related work
Constraint mining was first defined by Nguyen et al. (1998). They proposed an Apriori-like algorithm called CAP to reduce the computation of frequent itemsets. Pei and Han (2002) proposed the concept of convertible constraints and integrated them into the mining process of the FP-growth algorithm. Srikant et al. (1997) considered the problem of integrating constraints that are Boolean expressions over the presence or absence of items in the association rules. Bayardo et al. (2000) restricted the problem of mining association rules to two constraints, the consequent and the minimum improvement. Bucila et al. (2003) integrated both $\mathcal{C}_{am}$ and $\mathcal{C}_m$ into the algorithm DualMiner. Unfortunately, it has to scan the database many times and perform a large number of useless tests on long itemsets, especially when the minimum support is low. An Apriori-like algorithm called ExAMiner (Bonchi et al., 2003) uses both of these constraints to reduce the input data and the search space. A shortcoming of these algorithms is that they need to be rerun whenever the constraints are changed, which slows down mining for users. A solution is to mine and save, only once, all frequent itemsets with small values of minimum support. Then, when the constraints are changed, the system extracts the frequent itemsets that satisfy them. The computational and memory costs for this second step are very high because the number of frequent itemsets generated from the first step is usually enormous.
Recently, many authors have focused on combining the advantages of constrained mining and a condensed representation of frequent itemsets. Instead of mining all frequent itemsets, only the condensed ones are extracted. Using condensed frequent itemsets has three primary advantages. First, they are easier to store because their number is much smaller than the size of the class of all frequent itemsets, especially for dense databases. Second, they are mined only once from the database even when the constraints are changed. Third, they can be used to generate all frequent itemsets without access to the database. There are two types of condensed representation. The first type is maximal frequent itemsets (Burdick et al., 2001; Lin and Kedem, 2002). Since their cardinality is very small, they can be discovered quickly. All frequent itemsets can be generated from the maximal frequent itemsets. However, the generation often produces duplications. In addition, the generated frequent itemsets can lose information about their supports. Therefore, the supports need to be recomputed when mining association rules. The second type is closed frequent itemsets, called maximal ones, and their generators, called minimal ones (Bonchi and Lucchese, 2004; Boulicaut and Bykowski, 2000; Boulicaut et al., 2003; Pasquier et al., 2005). Each closed frequent itemset represents a class of all frequent itemsets that have the same closure. Thus, it (with its generators) can be used to uniquely determine all frequent itemsets in the class without losing their supports.
3 Some concepts and notations
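For a database $\mathcal{T}$, let $\mathcal{O}$ be a non-empty set containing transactions, $\mathcal{A}$ be a set of items appearing in those transactions, and $\mathcal{R}$ be a binary relation on $\mathcal{O} \times \mathcal{A}$. A set of items is called an itemset. Consider two operators $\lambda: 2^{\mathcal{O}} \to 2^{\mathcal{A}}$ and $\rho: 2^{\mathcal{A}} \to 2^{\mathcal{O}}$ defined as follows ($\lambda(\emptyset) := \mathcal{A}$ and $\rho(\emptyset) := \mathcal{O}$): $\forall O \subseteq \mathcal{O}$, $\forall A \subseteq \mathcal{A}$, $\lambda(O) = \{a \in \mathcal{A} \mid (o, a) \in \mathcal{R}, \forall o \in O\}$, $\rho(A) = \{o \in \mathcal{O} \mid (o, a) \in \mathcal{R}, \forall a \in A\}$. A closure operator $h$ in $2^{\mathcal{A}}$ (Birkhoff, 1948; Wille, 1992) is defined as $h = \lambda \circ \rho$. Denote $h(A)$ as the closure of a subset $A \subseteq \mathcal{A}$. $A$ is called a closed itemset iff $h(A) = A$. Let $\mathcal{CS}$ be the class of all closed itemsets. The minimum and maximum support thresholds are denoted as $s_0$ and $s_1$, respectively, where $1/|\mathcal{O}| \le s_0 \le s_1 \le 1$. Only non-trivial items in the subset $\mathcal{A}^F := \{a \in \mathcal{A}: supp(\{a\}) \ge s_0\}$ of $\mathcal{A}$ are considered.
A non-empty itemset $A$ (subset of $\mathcal{A}^F$) is called frequent iff $s_0 \le supp(A) \le s_1$. It is noted that if $s_1$ is equal to 1, then the traditional frequent itemset concept is obtained. Briefly, $\mathcal{FS} := \{L' \subseteq \mathcal{A}: s_0 \le supp(L') \le s_1\}$ denotes the class of all frequent itemsets and $\mathcal{FCS} := \mathcal{FS} \cap \mathcal{CS}$ denotes the class of all closed frequent itemsets. For any two sets $G$, $A$ with $\emptyset \ne G \subseteq A \subseteq \mathcal{A}$, $G$ is called a generator (Mannila and Toivonen, 1997) of $A$ if $h(G) = h(A)$ and ($\forall \emptyset \ne G' \subset G \Rightarrow h(G') \subset h(G)$). Let $\mathcal{G}(A)$ be the class of all generators of $A$ and $\mathcal{LGA}$ be the lattice of all closed itemsets and their generators.
Given $C_0$ and $C_1$ as two constraint subsets such that $\emptyset \ne C_0 \subseteq C_1 \subseteq \mathcal{A}$, $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1) := \{L' \in \mathcal{FS}: C_0 \subseteq L' \subseteq C_1\}$ denotes the class of all frequent itemsets with double-constraint, $\mathcal{FCS}_{\supseteq C_0}(s_0, s_1) := \{L \in \mathcal{FCS}: L \supseteq C_0\}$ denotes the class of all closed frequent itemsets which include $C_0$, and $\mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1) := \{L_{C_1} = L \cap C_1: L \in \mathcal{FCS}_{\supseteq C_0}(s_0, s_1)$ and $(\exists L_i \in \mathcal{G}(L): L_i \subseteq C_1)\}$ denotes the class of all closed frequent itemsets with double-constraint.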
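The operators $\rho$, $\lambda$ and $h = \lambda \circ \rho$ and the generator test can be sketched directly from the definitions above. The Python code below is a minimal illustration under stated assumptions: the toy database is hypothetical (it is not the database of Fig. 1), and the names `rho`, `lam`, `closure` and `is_generator` are chosen here for readability.

```python
from itertools import combinations

# Hypothetical toy database (NOT the paper's Fig. 1).
TRANSACTIONS = [frozenset("abc"), frozenset("ac"), frozenset("ad"), frozenset("bc")]
ITEMS = frozenset().union(*TRANSACTIONS)


def rho(itemset):
    """rho(A): indices of the transactions that contain every item of A."""
    itemset = frozenset(itemset)
    return frozenset(i for i, t in enumerate(TRANSACTIONS) if itemset <= t)


def lam(tids):
    """lambda(O): items shared by all transactions in O; lambda(empty set) := all items."""
    common = ITEMS
    for i in tids:
        common &= TRANSACTIONS[i]
    return common


def closure(itemset):
    """h(A) = lambda(rho(A))."""
    return lam(rho(itemset))


def is_generator(itemset):
    """G is a generator iff every non-empty proper subset has a different closure."""
    g = sorted(frozenset(itemset))
    if not g:
        return False
    target = closure(g)
    return all(
        closure(sub) != target
        for r in range(1, len(g))
        for sub in combinations(g, r)
    )


if __name__ == "__main__":
    print("".join(sorted(closure("b"))), is_generator("ab"), is_generator("bc"))
```

Because $h$ is monotone and $G' \subset G$ implies $h(G') \subseteq h(G)$, testing for a different closure is equivalent to testing for a strictly smaller one.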
H. Duong et al. / Engineering Applications of Artificial Intelligence 27 (2014) 148–154
Remark 1. $\forall L' \in \mathcal{FS}$, if $C_0 \subseteq L' \subseteq C_1$, then $supp(C_0) \ge supp(L') \ge s_0$ and $supp(C_1) \le supp(L') \le s_1$. Thus, only $C_0$ and $C_1$ where $supp(C_0) \ge s_0$ and $supp(C_1) \le s_1$ are considered.
4 Mining frequent itemsets with double-constraint
Definition 1 (Tin and Anh, 2010) (Equivalence relation $\sim_h$ over $\mathcal{FS}$). $\forall A, B \in \mathcal{FS}$:
$A \sim_h B \Leftrightarrow h(A) = h(B)$
For each $A \in \mathcal{FS}$, $[A] := \{B \in \mathcal{FS}: h(B) = h(A)\}$ denotes the equivalence class of all frequent itemsets with the same closure. If $L \in \mathcal{FCS}$, then $[L] := \{L' \subseteq L: h(L') = L\}$. Using this relation, $\mathcal{FS}$ is partitioned into disjoint equivalence classes. Each class contains frequent itemsets with the same closure $L \in \mathcal{FCS}$ and the same support. The following theorem is derived:
Theorem 1 (Tin and Anh, 2010) (A partition of $\mathcal{FS}$).
$\mathcal{FS} = \sum_{L \in \mathcal{FCS}} [L]$
This partition allows us to independently mine frequent itemsets with double-constraint in each equivalence class.
Example 1. The rest of this paper considers the database $\mathcal{T}$ shown in Fig. 1(a). For $s_0 = 1/4$ and $s_1 = 3/4$, Charm-L (Zaki and Hsiao, 2005) and MinimalGenerators (Zaki, 2004) are used to mine a lattice of all closed frequent itemsets and their generators. The results are shown in Fig. 1(b). Then, $\mathcal{FS}$ = [a] + [c] + [ceg] + [ac] + [afh] + [aceg] + [bceg] + [acfh] + [adfh].¹ With this disjoint partition, the duplication in the mining process of frequent itemsets in different classes is greatly reduced. However, Theorem 2 in Anh et al. (2011) showed that frequent itemsets generated in each class can still be duplicated.
4.1 Partition of the class of all frequent itemsets with
double-constraint
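Let $\mathcal{FS}_{C_0 \subseteq L_{C_1}} := \{L' \subseteq L_{C_1}: C_0 \subseteq L', h(L') = h(L_{C_1})\}$ be the class of all frequent itemsets in $[L]$ which include $C_0$ and are contained in $L_{C_1}$, where $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$. Based on the idea of the above partition and Proposition 1 in Hai et al. (2013), the following proposition is derived.
Proposition 1 (The disjoint partition of the class of all frequent itemsets with double-constraint). $\forall C_0 \subseteq C_1 \subseteq \mathcal{A}$:
$\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1) = \sum_{L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)} \mathcal{FS}_{C_0 \subseteq L_{C_1}}$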
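Proof.
The sets on the right-hand side are disjoint. In fact, $\forall L'_i \in \mathcal{FS}_{C_0 \subseteq L_{C_1,i}}$, where $L_{C_1,i} := L_i \cap C_1 \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$, $L_i \in \mathcal{FCS}_{\supseteq C_0}(s_0, s_1)$, $i = 1, 2$ and $L_1 \ne L_2$. Then, $h(L'_1) = h(L_{C_1,1}) = L_1 \ne L_2 = h(L_{C_1,2}) = h(L'_2)$. Thus, $L'_1$ and $L'_2$ are in different equivalence classes $[L_1]$ and $[L_2]$. Hence, $L'_1 \ne L'_2$.
"$\subseteq$": $\forall L' \in \mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$, set $L = h(L')$ and $L_{C_1} = L \cap C_1$; then $C_0 \subseteq L' \subseteq h(L') = L$ and $s_0 \le supp(L') = supp(L) \le s_1$. Let $L_i \in \mathcal{G}(L')$ (because $\mathcal{G}(L') \ne \emptyset$ (Anh et al., 2012)); then $L_i \subseteq L' \subseteq C_1$, $C_0 \subseteq L_{C_1}$, $h(L_i) = h(L') = L$ and $L_i \in \mathcal{G}(L') \subseteq \mathcal{G}(L)$. Thus, $L \in \mathcal{FCS}_{\supseteq C_0}(s_0, s_1)$ and $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$. Moreover, it is known that $C_0 \subseteq L' \subseteq L_{C_1} \subseteq L$ and $L = h(L') = h(L_i) \subseteq h(L_{C_1}) \subseteq L$. Therefore, $h(L_{C_1}) = h(L')$. It is thus concluded that $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$.
"$\supseteq$": $\forall L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$, $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$, then $h(L') = h(L_{C_1})$, $C_0 \subseteq L' \subseteq L_{C_1} \subseteq C_1$ and there exists $L_i \in \mathcal{G}(L)$: $L_i \subseteq C_1$. Then, $L_i \subseteq L_{C_1} \subseteq L$ and $L = h(L_i) \subseteq h(L_{C_1}) \subseteq h(L) = L$. Thus, $h(L') = h(L_{C_1}) = L$ and $s_0 \le supp(L') = supp(L) \le s_1$. Hence, $L' \in \mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$. □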
Remark 2. The disjoint partition of the class of all frequent itemsets allows parallel algorithms to be designed to independently exploit each class, significantly reducing mining time.
Example 2. Set $s_0 = 1/4$, $s_1 = 3/4$, $C_0 = a$, and $C_1 = adefg$. Then, consider $L = aceg$, $\mathcal{G}(L) = \{L_1 = ae, L_2 = ag\}$. From Theorem 3 in Anh et al. (2011), $X_U = aeg$, $X_{U,1} = g$, $X_{U,2} = e$, $X_- = c$. Thus, [aceg] = {ae, aeg, aegc, aec, ag, agc}. Similarly, [bceg] = {bc, be, bg, bce, bcg, beg, bceg}, [acfh] = {cf, acf, cfh, acfh, ch, ach}, [adfh] = {d, ad, df, dh, adf, adh, dfh, adfh}, [ceg] = {e, ce, eg, ceg, g, cg}, [ac] = {ac}, [afh] = {f, af, fh, afh, h, ah}, [c] = {c}, and [a] = {a}. Frequent itemsets $L'$ of all classes are tested by the condition $C_0 \subseteq L' \subseteq C_1$, yielding $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$ = {ae, aeg, ag, ad, adf, af, a}.
In Example 2, testing the condition $C_0 \subseteq L' \subseteq C_1$ is very expensive, since every frequent itemset of every class has to be checked. In the next section, a method for efficiently mining the frequent itemsets with double-constraint in each equivalence class is proposed.
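The cost of this post-processing step can be made concrete with a short Python sketch that replays Example 2. The equivalence classes are copied from Example 2; the variable names are illustrative only, and the support test is omitted because every listed itemset is already frequent for $s_0 = 1/4$ and $s_1 = 3/4$.

```python
# Equivalence classes listed in Example 2, keyed by their closed itemset.
CLASSES = {
    "aceg": ["ae", "aeg", "aegc", "aec", "ag", "agc"],
    "bceg": ["bc", "be", "bg", "bce", "bcg", "beg", "bceg"],
    "acfh": ["cf", "acf", "cfh", "acfh", "ch", "ach"],
    "adfh": ["d", "ad", "df", "dh", "adf", "adh", "dfh", "adfh"],
    "ceg": ["e", "ce", "eg", "ceg", "g", "cg"],
    "ac": ["ac"],
    "afh": ["f", "af", "fh", "afh", "h", "ah"],
    "c": ["c"],
    "a": ["a"],
}

C0, C1 = frozenset("a"), frozenset("adefg")

tested = 0
result = set()
for members in CLASSES.values():
    for itemset in members:
        tested += 1  # one constraint test per frequent itemset
        candidate = frozenset(itemset)
        if C0 <= candidate <= C1:
            result.add(candidate)

if __name__ == "__main__":
    print(tested, sorted("".join(sorted(x)) for x in result))
```

In this small example every frequent itemset is tested although only a few satisfy the double-constraint, which is the inefficiency that the following sections address.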
4.2 Extracting closed frequent itemsets and generators with double-constraint
The elements of $\mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$ can be determined as follows. From the lattice $\mathcal{LGA}$, the class of all closed frequent itemsets including $C_0$ is first determined. Second, for each element $L$ of them, if there exists a generator of $L$ contained in $C_1$, then the closed frequent itemset that satisfies the double-constraint, called $L_{C_1}$, is the intersection of $L$ and $C_1$, i.e. $L_{C_1} = L \cap C_1 \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$. Otherwise, $L$ is discarded. In this way, the generators of $L_{C_1}$ are determined by finding all generators (of $L$) contained in $C_1$. Similar to Proposition 2 in Anh et al. (2011), the following proposition is derived:

Fig. 1. (a) Example database and (b) corresponding lattice of closed itemsets. The frequent itemsets are underlined and their generators are in italics. Supports are shown on the right of the lattice.

¹ The symbol + denotes the union of two disjoint sets.
Proposition 2 (Generate closed frequent itemsets and their generators with double-constraint).
a. $\forall L_{C_1} = L \cap C_1 \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$: $\mathcal{G}(L_{C_1}) = \{L_i \in \mathcal{G}(L): L_i \subseteq C_1\}$.
b. The elements of $\mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$ are generated distinctly.
From this proposition, the procedure for generating all closed frequent itemsets with double-constraint and their generators using $\mathcal{LGA}$ is shown in Fig. 2.
Example 3. With the values of $C_0$, $C_1$, $s_0$, and $s_1$ as those in Example 2, closed frequent itemsets with double-constraint and their generators are mined using MFCS_DoubleCons. Table 1 shows the process.
Remark 3. At line 3 of the MFCS_DoubleCons procedure, if $L$ includes $C_0$, then all its supersets also include $C_0$ (the property of constraint $\mathcal{C}_m$). A similar property holds for the test $supp(L) \le s_1$. Thus, if a candidate passes the tests, its supersets do not need to be tested, significantly reducing the execution time of the algorithm. A similar remark can be made for the constraint $s_0$ (constraint $\mathcal{C}_{am}$).
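The selection described by Proposition 2 can be sketched in Python as follows. This is an illustrative reading of the proposition, not a transcription of the procedure in Fig. 2: the lattice is stored as a plain dictionary rather than traversed with the pruning of Remark 3, the supports are assumed to lie in $[s_0, s_1]$ already (as in Example 1), and the generators are read off as the minimal members of the classes listed in Example 2. The names `LATTICE` and `mfcs_double_cons` are hypothetical.

```python
# Closed frequent itemsets of Example 1 mapped to their generators
# (assumption: generators = minimal members of each class in Example 2).
LATTICE = {
    "a": ["a"],
    "c": ["c"],
    "ceg": ["e", "g"],
    "ac": ["ac"],
    "afh": ["f", "h"],
    "aceg": ["ae", "ag"],
    "bceg": ["bc", "be", "bg"],
    "acfh": ["cf", "ch"],
    "adfh": ["d"],
}


def mfcs_double_cons(lattice, c0, c1):
    """Map each L_C1 = L & C1 to the generators of L that are contained in C1."""
    c0, c1 = frozenset(c0), frozenset(c1)
    selected = {}
    for closed, generators in lattice.items():
        closed_set = frozenset(closed)
        if not c0 <= closed_set:  # L must include C0
            continue
        kept = [frozenset(g) for g in generators if frozenset(g) <= c1]
        if kept:  # at least one generator of L lies inside C1
            selected[closed_set & c1] = kept
    return selected


if __name__ == "__main__":
    for l_c1, gens in mfcs_double_cons(LATTICE, "a", "adefg").items():
        print("".join(sorted(l_c1)), ["".join(sorted(g)) for g in gens])
```

With $C_0 = a$ and $C_1 = adefg$, the closed itemsets ac and acfh are discarded because none of their generators is contained in $C_1$.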
4.3 Mining all frequent itemsets with double-constraint
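This section describes the unique representation and the structure of frequent itemsets with double-constraint. For each $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$, let $\mathcal{K}_{min,C_0,C_1} := Minimal\{K_i = L_i \setminus C_0 \mid L_i \in \mathcal{G}(L_{C_1})\}$ be the class of all the minimal itemsets of $\{L_i \setminus C_0 \mid L_i \in \mathcal{G}(L_{C_1})\}$ in terms of the inclusion order "$\subseteq$" on sets. Assign $K_{U,C_0,C_1} := \bigcup_{K_i \in \mathcal{K}_{min,C_0,C_1}} K_i$, $K_{U,C_0,C_1,i} := K_{U,C_0,C_1} \setminus K_i$, $K_{-,C_0,C_1} := L_{C_1} \setminus (K_{U,C_0,C_1} + C_0)$, and define:
$\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}} := \{L' = C_0 + K_i + K'_i + K'' \mid K_i \in \mathcal{K}_{min,C_0,C_1}, K'_i \subseteq K_{U,C_0,C_1,i}, K'' \subseteq K_{-,C_0,C_1}$ and $(K_j \not\subset K_i + K'_i, \forall K_j \in \mathcal{K}_{min,C_0,C_1}: 1 \le j < i)\}$ (*)
Remark 4. If there exists $L_i \in \mathcal{G}(L_{C_1})$ such that $K_i = L_i \setminus C_0 = \emptyset$, then $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}} = \{L' = C_0 + K'', K'' \subseteq L_{C_1} \setminus C_0\}$. In this way, the frequent itemsets with double-constraint are generated more quickly. Indeed, if there exists $L_i \in \mathcal{G}(L_{C_1})$ such that $K_i = L_i \setminus C_0 = \emptyset$, which implies that $L_i \subseteq C_0$, then $\mathcal{K}_{min,C_0,C_1} = \{\emptyset\}$, $K_{U,C_0,C_1} = \emptyset$, $K_{U,C_0,C_1,i} = \emptyset$. This implies that $K_{-,C_0,C_1} = L_{C_1} \setminus C_0$. Thus, $L' = C_0 + K'' \in \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$, where $K'' \subseteq L_{C_1} \setminus C_0$.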
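Theorem 2 (Distinctly generate the elements of $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$). For each $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$:
a) $\mathcal{FS}_{C_0 \subseteq L_{C_1}} = \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$.
b) The elements of $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$ are generated distinctly.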
Proof. (a)
"$\subseteq$": if $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$, according to Theorem 3 in Anh et al. (2011), $L' = L_i + L'_i + \tilde{L}$, where $L_i \in \mathcal{G}(L_{C_1})$, $L_U = \bigcup_{L_i \in \mathcal{G}(L_{C_1})} L_i$, $L'_i \subseteq L_{U,i} = L_U \setminus L_i$, $\tilde{L} \subseteq L_- = L_{C_1} \setminus L_U$. Thus, $L' = C_0 \cap (L_i + L'_i + \tilde{L}) + (L_i \setminus C_0) + (L'_i \setminus C_0) + (\tilde{L} \setminus C_0) = C_0 + K_i + (L'_i \setminus C_0) + (\tilde{L} \setminus C_0) = C_0 + K_i + K'_i + K''$, where $K_i = L_i \setminus C_0 \in \mathcal{K}_{min,C_0,C_1}$, $K'_i = (L'_i \setminus C_0) \cap K_{U,C_0,C_1} \subseteq K_{U,C_0,C_1} \cap (L_U \setminus L_i) \subseteq K_{U,C_0,C_1} \setminus L_i \subseteq K_{U,C_0,C_1} \setminus K_i = K_{U,C_0,C_1,i}$ and $K'' = [(L'_i \setminus C_0) \setminus K_{U,C_0,C_1} + (\tilde{L} \setminus C_0)] \subseteq (L_{C_1} \setminus (C_0 + K_{U,C_0,C_1})) + (L_{C_1} \setminus K_{U,C_0,C_1}) \setminus C_0 \subseteq L_{C_1} \setminus (C_0 + K_{U,C_0,C_1}) = K_{-,C_0,C_1}$. The lowest index $i$ is always selected such that $K_i = L_i \setminus C_0 \in \mathcal{K}_{min,C_0,C_1}$ is the minimal set and $L'$ has the representation in the form $L' = C_0 + K_i + K'_i + K''$.
Moreover, if there exists an index $k < i$ such that $K_k \in \mathcal{K}_{min,C_0,C_1}$ and $K_k \subset K_i + K'_i$, then $L'$ also has a different representation: $L' = C_0 + K_k + K'_k + K''$, where $K'_k = (K_i + K'_i) \setminus K_k \subseteq K_{U,C_0,C_1,k}$. This contradicts the selection of index $i$. Thus, $K_k \not\subset K_i + K'_i$, $\forall K_k \in \mathcal{K}_{min,C_0,C_1}: 1 \le k < i$. Hence, $L' \in \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$.
"$\supseteq$": if $L' \in \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$, there exist $K_i = L_i \setminus C_0 \in \mathcal{K}_{min,C_0,C_1}$, $L_i \in \mathcal{G}(L_{C_1})$, $K'_i \subseteq K_{U,C_0,C_1,i}$, $K'' \subseteq K_{-,C_0,C_1}$: $C_0 \subseteq L' = C_0 + K_i + K'_i + K'' \subseteq L_{C_1}$. Thus, $h(L') \subseteq h(L_{C_1})$. On the other hand, since $L_i = K_i + (L_i \cap C_0) \subseteq K_i + C_0 \subseteq L'$, it follows that $h(L_{C_1}) = h(L_i) \subseteq h(L')$. It is inferred that $h(L') = h(L_{C_1})$. Hence, $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$.
(b) Assume that there exist $i$ and $k$, where $i > k \ge 1$, such that $L'_k = L'_i$, where $L'_k := C_0 + K_k + K'_k + K''_k$, $L'_i := C_0 + K_i + K'_i + K''_i$, $K_k, K_i \in \mathcal{K}_{min,C_0,C_1}$, $K_k \ne K_i$, $K''_k, K''_i \subseteq K_{-,C_0,C_1}$, $K'_k \subseteq K_{U,C_0,C_1,k}$, $K'_i \subseteq K_{U,C_0,C_1,i}$. Since $K_k \cap C_0 = \emptyset$ and $K_k \cap K''_i = \emptyset$, it follows that $K_k \subset K_i + K'_i$ (the equality does not hold because $K_i$ and $K_k$ are two different minimal sets). This contradicts the way that index $i$ is selected. Therefore, all elements of $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$ are generated distinctly. □

Table 1
Mined closed frequent itemsets and their generators with double constraint.
L       G(L)      L ⊇ C0    Li ∈ G(L) and C1 ⊇ Li    LC1 = L ∩ C1    G(LC1)
acfh    cf, ch    acfh      -                         -               -
ceg     e, g      -         -                         -               -

Fig. 2. MFCS_DoubleCons procedure.

Fig. 3. MFS_DoubleCons algorithm.

Fig. 4. MFS_DoubleCons_OneClass procedure.
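Example 4. For $s_0 = 1/4$, $s_1 = 3/4$, $C_0 = a$, and $C_1 = adefg$, consider $L = aceg$, $\mathcal{G}(L) = \{L_1 = ae, L_2 = ag\}$. The following results are obtained: $C_0 \subseteq L$, $L_{C_1} = L \cap C_1 = aeg$, $\mathcal{G}(L_{C_1}) = \{L_1 = ae, L_2 = ag\}$, $K_1 = L_1 \setminus C_0 = e$, $K_2 = L_2 \setminus C_0 = g$, $\mathcal{K}_{min,C_0,C_1} = Minimal\{K_1, K_2\} = \{K_1, K_2\} = \{e, g\}$, $K_{U,C_0,C_1} = eg$, $K_{U,C_0,C_1,1} = g$, $K_{U,C_0,C_1,2} = e$, $K_{-,C_0,C_1} = \emptyset$. With $K_1 = e$, then $a + e$, $a + e + g \in \mathcal{FS}_{C_0 \subseteq aceg \cap C_1}(s_0, s_1)$, and with $K_2 = g$, then $a + g \in \mathcal{FS}_{C_0 \subseteq aceg \cap C_1}(s_0, s_1)$. Thus, $\mathcal{FS}_{C_0 \subseteq aceg \cap C_1}(s_0, s_1)$ = {ae, aeg, ag}. Similarly, $\mathcal{FS}_{C_0 \subseteq adfh \cap C_1}(s_0, s_1)$ = {ad, adf}, $\mathcal{FS}_{C_0 \subseteq afh \cap C_1}(s_0, s_1)$ = {af} and $\mathcal{FS}_{C_0 \subseteq a \cap C_1}(s_0, s_1)$ = {a}. Hence, $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$ = {ae, aeg, ag, ad, adf, af, a}.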
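The generation scheme (*) can be replayed on Example 4 with a short Python sketch. It is written from the set definitions of Section 4.3 and is not a transcription of the authors' pseudo code; the helper names `powerset`, `minimal_sets` and `mfs_one_class` are hypothetical, and the minimal sets are indexed in an arbitrary but fixed order (by size, then alphabetically).

```python
from itertools import combinations


def powerset(items):
    """Yield every subset of `items` as a frozenset, smallest first."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def minimal_sets(sets):
    """Keep the sets that have no proper subset in the collection, in a fixed order."""
    unique = sorted(set(sets), key=lambda s: (len(s), sorted(s)))
    return [s for s in unique if not any(t < s for t in unique)]


def mfs_one_class(l_c1, generators, c0):
    """Enumerate FS* for one class: L' = C0 + K_i + K'_i + K'' under condition (*)."""
    l_c1, c0 = frozenset(l_c1), frozenset(c0)
    k_min = minimal_sets([frozenset(g) - c0 for g in generators])
    k_union = frozenset().union(*k_min)
    k_rest = l_c1 - (k_union | c0)  # K_-: items outside K_U and C0
    result = []
    for i, k_i in enumerate(k_min):
        for k_prime in powerset(k_union - k_i):  # K'_i is a subset of K_U \ K_i
            core = k_i | k_prime
            # Condition (*): no earlier minimal set K_j may be contained in K_i + K'_i.
            if any(k_j <= core for k_j in k_min[:i]):
                continue
            for k_second in powerset(k_rest):
                result.append(c0 | core | k_second)
    return result


if __name__ == "__main__":
    for l_c1, gens in [("aeg", ["ae", "ag"]), ("adf", ["d"]), ("af", ["f"]), ("a", ["a"])]:
        print(l_c1, ["".join(sorted(x)) for x in mfs_one_class(l_c1, gens, "a")])
```

The inputs in the usage lines are the classes with double-constraint obtained by applying Proposition 2 to the lattice of Example 1. When some generator lies inside $C_0$, the only minimal set is the empty set and the loop reduces to the shortcut of Remark 4.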
According to Theorem 2, the procedure MFS_DoubleCons_OneClass (pseudo code shown in Fig. 4) is used for mining frequent itemsets with double-constraint in a class. Using Proposition 1 and this procedure, the algorithm MFS_DoubleCons, shown in Fig. 3, is proposed for mining all frequent itemsets with double-constraint.
5 Experimental results
Experiments were performed on a PC with an i5-2400 CPU (3.10 GHz @ 3.09 GHz) and 3.16 GB of memory, running Windows XP. The algorithms were coded in C#. To compare the performance, the source code for Charm-L, MinimalGenerators and dEclat (Anon, 2010) was converted to C#. Charm-L and MinimalGenerators were used to mine the lattice of the closed itemsets and their generators. dEclat was used to mine all frequent itemsets. Here, it was modified for mining frequent itemsets with double-constraint. In its first step, only frequent itemsets contained in $C_1$ are taken. Then, the output is filtered to determine the frequent itemsets that satisfy $s_1$ and $C_0$. This modified version is called dEclat-DC. For the post-processing approach, Gen_Itemsets_DC is a modification of Gen_Itemsets (Anh et al., 2011).
For the performance test, the following benchmark databases in FIMDR (2009) were chosen: Pumsb, Connect, Mushroom, T10I4D100K, and T40I10D100K. Pumsb, Connect, and Mushroom are real and dense, i.e., they produce many long frequent itemsets even for very high support values. The others are synthetic and sparse. Table 2 shows their characteristics.
The support threshold $s_1$ is fixed at 0.95. Assuming that the size of $C_0$ is $m$, $C_1$ with a size of $m + d \cdot |\mathcal{A}^F|/100$ ($d \in [1, 100]$) is chosen. With dense databases, for each pair of database (DB) and minimum support (MS), $m$ ranges from 4% to 14% of $|\mathcal{A}^F|$ (step 2%) and $d = 81$. For sparse ones, $m$ ranges from 0.2% to 2% of $|\mathcal{A}^F|$ (step 0.2%) and $d = 93$. For each pair of $C_0$'s size and $C_1$'s size, 10 double-constraints are selected for the dense ones and 6 double-constraints are selected for the sparse ones (each of them contains items randomly selected from $\mathcal{A}^F$). Let T_MD, T_GID, and T_DED be the average execution times of MFS_DoubleCons, Gen_Itemsets_DC, and dEclat-DC for 60 selected double-constraints.

Table 2
Database characteristics.

Table 3
Time reductions of MFS_DoubleCons compared to Gen_Itemsets_DC and dEclat-DC.
Columns: DB, MS; R_MG (%); R_ME (%); RR (%).

Fig. 6. Performance results for T10I4D100K.
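The figure of 60 double-constraints per (DB, MS) pair is consistent with the stated ranges, assuming that every size setting of $C_0$ is combined with the stated number of randomly drawn double-constraints:

```latex
\[
\underbrace{\left(\frac{14-4}{2}+1\right)}_{6\ \text{sizes of } C_0\ \text{(dense)}} \times 10 = 60,
\qquad
\underbrace{\left(\frac{2-0.2}{0.2}+1\right)}_{10\ \text{sizes of } C_0\ \text{(sparse)}} \times 6 = 60 .
\]
```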
Table 3 shows the experimental evaluation of MFS_DoubleCons against Gen_Itemsets_DC and dEclat-DC, where column R_MG shows the ratios of T_MD and T_GID, and column R_ME shows the ratios of T_MD and T_DED. NCIn is the number of frequent itemsets contained in $C_1$ and NNC is the number of frequent itemsets contained in $C_1$ without including $C_0$. Column RR indicates the average of the ratios of NNC and NCIn. Compared to Gen_Itemsets_DC, MFS_DoubleCons is faster for dense databases. The time is reduced by 30.3–0.96%. For sparse databases, the reduction is lower because the frequent itemsets are few and short, leading to a low cost for testing the constraints. Compared to dEclat-DC, MFS_DoubleCons is much faster for all databases. The time is reduced by 40.6–2.01%. The reason for the reduction is that a large number of candidates (RR ranges from 95.94% to 99.99%) fail the last test of dEclat-DC, leading to lower performance.
Figs. 5–8 show the comparisons of the average execution times for various support values. The performance and scalability of MFS_DoubleCons are superior to those of Gen_Itemsets_DC and dEclat-DC.
Figs. 9–11 show the performance results for various numbers of double-constraints. The performance gap between MFS_DoubleCons and dEclat-DC increases with the number of double-constraints. The main reason is that, when the double-constraint changes, MFS_DoubleCons executes without creating the lattice of closed itemsets and their generators again from the database. MFS_DoubleCons outperforms both Gen_Itemsets_DC and dEclat-DC, especially when the minimum support is lower and the number of double-constraints is high.
6 Conclusion and future work
This paper presented a unique representation and structure of frequent itemsets with double-constraint. The correctness of the theoretical results was proven. An efficient algorithm was developed for mining all frequent itemsets with double-constraint. Tests on the benchmark databases show the efficiency of the proposed approach. Moreover, the tests show that the algorithm outperforms existing algorithms, especially when the minimum support values are very low and the number of double-constraints is high.

Fig. 8. Performance results for T40I10D100K.

Fig. 9. Performance results for Mushroom for various numbers of double-constraints.
In the future, the mining of frequent itemsets with more complicated types of constraint will be studied and association rules based on these frequent itemsets will be derived.
Acknowledgments
This work was funded by Dalat University and Vietnam's National Foundation for Science and Technology Development (NAFOSTED) under Grant no. 102.01-2012.17.
References
Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 478–499.
Anh, T., Hai, D., Tin, T., Bac, L., 2011. Efficient algorithms for mining frequent itemsets with constraint. In: Proceedings of the Third International Conference on Knowledge and Systems Engineering, pp. 19–25.
Anh, T., Hai, D., Tin, T., Bac, L., 2012. Mining frequent itemsets with dualistic constraints. In: Proceedings of the PRICAI 2012, LNAI, vol. 7458, Springer, pp. 807–813.
Anh, T., Tin, T., Bac, L., Hai, D., 2012. Mining association rules restricted on constraint. In: Proceedings of the IEEE-RIVF 2012, pp. 51–56.
Anon. 〈http://www.cs.rpi.edu/zaki/wwwnew/pmwiki.php/Software/Software#patutils〉 (accessed 2010).
Bayardo, R.J., Agrawal, R., Gunopulos, D., 2000. Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery 4 (2–3), 217–240.
Birkhoff, G., 1948. Lattice Theory. American Mathematical Society, New York.
Bonchi, F., Lucchese, C., 2004. On closed constrained frequent pattern mining. In: Proceedings of the IEEE ICDM'04, pp. 35–42.
Bonchi, F., Giannotti, F., Mazzanti, A., Pedreschi, D., 2003. Examiner: optimized level-wise frequent pattern mining with monotone constraints. In: Proceedings of the IEEE ICDM'03, pp. 11–18.
Boulicaut, J.F., Bykowski, A., 2000. Frequent closures as a concise representation for binary data mining. In: Proceedings of PAKDD'00, vol. 1805, Springer, pp. 62–73.
Boulicaut, J.F., Bykowski, A., Rigotti, C., 2003. Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery 7 (1), 5–22.
Bucila, C., Gehrke, J.E., Kifer, D., White, W., 2003. Dualminer: a dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery 7 (3), 241–272.
Burdick, D., Calimlim, M., Gehrke, J., 2001. MAFIA: a maximal frequent itemset algorithm for transactional databases. In: Proceedings of the IEEE ICDE'01, pp. 443–452.
Dong, J., Han, M., 2007. BitTable-FI: an efficient mining frequent itemsets algorithm. Knowledge Based Systems 20 (4), 329–335.
Frequent Itemset Mining Dataset Repository (FIMDR), 〈http://fimi.cs.helsinki.fi/data/〉 (accessed 2009).
Grahne, G., Zhu, J., 2005. Fast algorithms for frequent itemset mining using fp-trees. IEEE Transactions on Knowledge and Data Engineering 17 (10), 1347–1362.
Hai, D., Tin, T., Bac, L., 2013. An efficient algorithm for mining frequent itemsets with single constraint. In: Proceedings of ICCSAMA 2013, Advanced Computational Methods for Knowledge Engineering, Springer, pp. 367–378.
Han, J., Pei, J., Yin, Y., 2000. Mining frequent patterns without candidate generation. SIGMOD'00, 1–12.
Jeudy, B., Boulicaut, J.F., 2002. Optimization of association rule mining queries. Intelligent Data Analysis 6 (4), 341–357.
Lin, D.I., Kedem, Z.M., 2002. Pincer search: an efficient algorithm for discovering the maximum frequent sets. IEEE Transactions on Knowledge and Data Engineering 14 (3), 553–566.
Mannila, H., Toivonen, H., 1997. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1 (3), 241–258.
Nguyen, R.T., Lakshmanan, V.S., Han, J., Pang, A., 1998. Exploratory mining and pruning optimizations of constrained association rules. In: Proceedings of the 1998 ACM-SIGMOD International Conference on the Management of Data, pp. 13–24.
Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L., 1999a. Discovering frequent closed itemsets for association rules. ICDT'99, 398–416.
Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L., 1999b. Efficient mining of association rules using closed itemset lattices. Information Systems 24 (1), 25–46.
Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., Lakhal, L., 2005. Generating a condensed representation for association rules. Journal of Intelligent Information Systems 24 (1), 29–60.
Pei, J., Han, J., 2002. Constrained frequent pattern mining: a pattern-growth view. ACM SIGKDD Explorations 4 (1), 31–39.
Pei, J., Han, J., Mao, R., 2000. CLOSET: an efficient algorithm for mining frequent closed itemsets. SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 11–20.
Pei, J., Han, J., Lakshmanan, L.V.S., 2001. Mining frequent itemsets with convertible constraints. In: Proceedings of the IEEE ICDE'01, pp. 433–442.
Song, W., Yang, B., Xu, Z., 2008. Index-BitTableFI: an improved algorithm for mining frequent itemsets. Knowledge Based Systems 21 (6), 507–513.
Srikant, R., Vu, Q., Agrawal, R., 1997. Mining association rules with item constraints. In: Proceedings of KDD'97, pp. 67–73.
Tin, T., Anh, T., 2010. Structure of set of association rules based on concept lattice. Adv. in Intelligent Infor. and Database Systems, SCI, 283, Springer, pp. 217–227.
Vo, B., Hong, T.P., Le, B., 2012. DBV-Miner: a dynamic bit-vector approach for fast mining frequent closed itemsets. Expert Systems with Applications 39 (8), 7196–7206.
Vo, B., Hong, T.P., Le, B., 2013. A lattice-based approach for mining most generalization association rules. Knowledge-Based Systems 45, 20–30.
Wang, J., Han, J., Pei, J., 2003. CLOSET+: searching for the best strategies for mining frequent closed itemsets. SIGKDD'03, 236–245.
Wille, R., 1992. Concept lattices and conceptual knowledge systems. Computers & Mathematics with Applications 23, 493–515.
Zaki, M.J., 2004. Mining non-redundant association rules. Data Mining and Knowledge Discovery, pp. 223–248.
Zaki, M.J., Hsiao, C.J., 2005. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering 17 (4), 462–478.
Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W., 1997. New algorithms for fast discovery of association rules. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD'97), pp. 283–296.
Fig 11 Performance results for synthetic and sparse databases for various numbers of double-constraints.