Combination of dynamic bit vectors and transaction information for
Minh-Thai Tran a, Bac Le b, Bay Vo c,*
a Faculty of Information Technology, Information Technology College, Ho Chi Minh City, Vietnam
b Department of Computer Science, University of Science, VNU-Ho Chi Minh, Vietnam
c Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Article info
Article history:
Received 24 May 2014
Received in revised form 23 October 2014
Accepted 28 October 2014
Keywords:
Dynamic bit vector
Frequent closed sequence
CloFS-DBV
Abstract
Sequence mining algorithms attempt to mine all possible frequent sequences. These algorithms produce redundant results, increasing the required storage space and runtime, especially for large sequence databases. In recent years, many studies have proved that mining frequent closed sequences is more efficient than mining all frequent sequences. The desired information can be fully extracted from frequent closed sequences. Most algorithms for mining frequent closed sequences use a candidate maintenance-and-test paradigm. The present paper proposes an algorithm called CloFS-DBV that uses dynamic bit vectors. Various methods are employed to reduce memory usage and runtime. Experimental results show that CloFS-DBV is more efficient than the BIDE and CloSpan algorithms in terms of execution time and memory usage.
© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
Sequential pattern mining is a fundamental problem in knowledge discovery and data mining with broad applications, including the analysis of customer purchase behavior, web access patterns, scientific experiments, disease treatment, natural disaster prevention, and protein formation. Sequential pattern mining includes two main stages: frequent pattern mining and rule mining. Many studies have modified the AprioriAll algorithm (Agrawal and Srikant, 1995) for mining frequent sequential patterns. Unlike the general mining of frequent sequences, the mining of frequent closed sequences has not been extensively studied. Although some algorithms have been proposed, such as CloSpan (Yan et al., 2003), CLOSET+ (Wang et al., 2003), and BIDE (Wang et al., 2007), their performance is poor for large databases. BIDE detects frequent sequences, not closed ones, and prunes candidates early, instead of using maintenance-and-test patterns.
Recently, many authors have proposed techniques that present data in a vertical format (Song et al., 2005), use database projection (Pei et al., 2001), or use bit vector data structures (Song et al., 2008), all of which have been shown to be effective. However, the storage space and execution time of the mining process can be further reduced for large sequence databases.
The present study proposes the CloFS-DBV algorithm, which uses a vertical data format and data compression, and divides the search space to reduce the storage space and execution time required for mining frequent closed sequences. The rest of the paper is organized as follows. Section 2 gives the problem definition. Section 3 summarizes related work. Sections 4 and 5 present the proposed algorithm and experimental results, respectively. The conclusions and future work are given in Section 6.
http://dx.doi.org/10.1016/j.engappai.2014.10.021
* Corresponding author.
E-mail addresses: minhthai@itc.edu.vn (M.-T. Tran), lhbac@fit.hcmus.edu.vn (B. Le), bayvodinh@gmail.com (B. Vo).
2. Problem definition
Consider a sequence database with a set of distinct events $I = \{i_1, i_2, i_3, \ldots, i_n\}$, where $i_j$ ($1 \le j \le n$) is an event (or an item). A set of unordered events is called an itemset. Each itemset is put in brackets, for example $(ABC)$. To simplify notation, the brackets are omitted for itemsets that contain only a single item, for example $B$. A sequence $S = \{e_1, e_2, e_3, \ldots, e_m\}$ is an ordered list of events, where $e_j$ ($1 \le j \le m$) is an itemset. Let $\ell$ be the number of events (items) in a sequence. A sequence with length $\ell$ is called an $\ell$-sequence. For example, $AB(AE)CB$ is a 6-sequence. A sequence $S_a = a_1, a_2, \ldots, a_m$ is contained in another sequence $S_b = b_1, b_2, \ldots, b_n$ if there exist integers $1 \le i_1 < i_2 < \cdots < i_m \le n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_m \subseteq b_{i_m}$ (each itemset of $S_a$ is matched, in order, by an itemset of $S_b$ that includes it). If sequence $S_a$ is contained in sequence $S_b$, $S_a$ is called a subsequence of $S_b$ and $S_b$ is called a supersequence of $S_a$, denoted as $S_a \subseteq S_b$. A sequence database is denoted as $D = \{s_1, s_2, s_3, \ldots, s_{|D|}\}$, where $|D|$ is the number of sequences in $D$ and $s_i$ ($1 \le i \le |D|$) is a transaction in the form $\langle ID, Sequence \rangle$, where the attribute ID identifies $s_i$ and corresponds to the transaction information over time. The absolute support (support) of a sequence $S_a$ in a sequence database $D$ is the number of transactions of $D$ in which $S_a$ occurs, denoted as $sup_D(S_a)$. The support of a sequence is given in the notation sequence: support. For example, a sequence $AB$ with support 3 is represented as $AB: 3$.
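The containment test and the support count can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `seq`, `contains`, and `support` are hypothetical, and because the body of Table 1 is not reproduced in this text, `DB` is a stand-in database chosen to agree with the position lists quoted in Example 3, with the infrequent items D and E left out.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]


def seq(*itemsets: str) -> Sequence:
    """Build a sequence from one string per itemset: seq("A", "BC") is A(BC)."""
    return [frozenset(s) for s in itemsets]


def contains(sb: Sequence, sa: Sequence) -> bool:
    """Return True if sa is a subsequence of sb (greedy left-to-right matching)."""
    matched = 0
    for itemset in sb:
        # An itemset of sa is matched by any itemset of sb that includes it.
        if matched < len(sa) and sa[matched] <= itemset:
            matched += 1
    return matched == len(sa)


def support(db: List[Sequence], sa: Sequence) -> int:
    """Number of transactions of db that contain sa."""
    return sum(1 for transaction in db if contains(transaction, sa))


# Hypothetical stand-in for Table 1 (assumption; items D and E omitted).
DB = [
    seq("C", "A", "A", "AC"),
    seq("A", "B", "ABC", "B"),
    seq("A", "BC", "A", "B", "C"),
    seq("A", "B", "BC", "A"),
]

print(support(DB, seq("A", "B")))                      # 3, i.e. AB: 3
print(contains(seq("A", "BC", "A"), seq("BC", "A")))   # True
```

Greedy matching is sufficient here because taking the earliest possible match for each itemset never rules out a later one.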
Given a minimum support threshold minSup, a sequence $S_a$ is a frequent sequence on $D$ if $sup_D(S_a) \ge minSup$. If sequence $S_a$ is frequent and there exists no proper supersequence $S_b$ of $S_a$ with the same support, $S_a$ is called a frequent closed sequence, i.e., there does not exist $S_b$ such that $S_a \subset S_b$ and $sup_D(S_a) = sup_D(S_b)$. The problem of mining frequent closed sequences is to find the complete set of frequent closed sequences for an input sequence database $D$ and a given minimum support threshold minSup.
Example 1. Consider the sequence database in Table 1. The database has five unique items, $I = \{A, B, C, D, E\}$, and four transactions, i.e., $|D| = 4$. Assume that the minimum support threshold is minSup = 2 (50%). If all frequent sequences of D are mined with the given minSup, the following 32 sequences are obtained: SFS = {A:4, AA:4, AB:3, AC:4, (AC):2, AAB:2, AAC:2, A(AC):2, ABA:3, ABB:3, ABC:3, A(BC):3, ACA:2, ACB:2, ABAB:2, AB(BC):2, A(BC)A:2, A(BC)B:2, B:3, BA:3, BB:3, BC:3, (BC):3, BAB:2, B(BC):2, (BC)A:2, (BC)B:2, C:4, CA:3, CB:2, CC:2, CAC:2}. In contrast, mining the frequent closed sequences yields SFCS = {AA:4, AC:4, AAC:2, A(AC):2, ABA:3, ABB:3, ABC:3, A(BC):3, ABAB:2, AB(BC):2, A(BC)A:2, A(BC)B:2, CA:3, CAC:2}, which has only 14 sequences.
Frequent closed sequences SFCS are thus more compact than general frequent sequences SFS. This is because a subsequence $S_a$ with the same support as a supersequence $S_b$ is absorbed by $S_b$ without affecting the mining results. For example, sequence $(BC)A: 2$ is absorbed by sequence $A(BC)A: 2$ because $(BC)A \subseteq A(BC)A$ and $sup_D((BC)A) = sup_D(A(BC)A) = 2$.
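The absorption rule can be checked directly against the lists in Example 1. The sketch below is a brute-force filter, not the CloFS-DBV procedure: it takes the 32 frequent sequences with their supports exactly as listed above and keeps those that have no proper supersequence with equal support. The function names `parse`, `contains`, and `closed_sequences` are illustrative, and itemset matching is assumed to use subset inclusion.

```python
import re
from typing import Dict, FrozenSet, List, Tuple

Sequence = Tuple[FrozenSet[str], ...]

# The set SFS of Example 1, written as "sequence:support".
SFS_TEXT = (
    "A:4 AA:4 AB:3 AC:4 (AC):2 AAB:2 AAC:2 A(AC):2 "
    "ABA:3 ABB:3 ABC:3 A(BC):3 ACA:2 ACB:2 ABAB:2 AB(BC):2 "
    "A(BC)A:2 A(BC)B:2 B:3 BA:3 BB:3 BC:3 (BC):3 BAB:2 B(BC):2 "
    "(BC)A:2 (BC)B:2 C:4 CA:3 CB:2 CC:2 CAC:2"
)


def parse(name: str) -> Sequence:
    """Turn 'A(BC)A' into (frozenset('A'), frozenset('BC'), frozenset('A'))."""
    tokens = re.findall(r"\([A-Z]+\)|[A-Z]", name)
    return tuple(frozenset(tok.strip("()")) for tok in tokens)


def contains(sb: Sequence, sa: Sequence) -> bool:
    """True if sa is a subsequence of sb (greedy left-to-right matching)."""
    matched = 0
    for itemset in sb:
        if matched < len(sa) and sa[matched] <= itemset:
            matched += 1
    return matched == len(sa)


def closed_sequences(sfs: Dict[str, int]) -> List[str]:
    """Keep the sequences that no other sequence of equal support contains."""
    parsed = {name: parse(name) for name in sfs}
    closed = []
    for name, sup in sfs.items():
        absorbed = any(
            other != name
            and other_sup == sup
            and contains(parsed[other], parsed[name])
            for other, other_sup in sfs.items()
        )
        if not absorbed:
            closed.append(name)
    return closed


SFS = {name: int(sup) for name, sup in (e.split(":") for e in SFS_TEXT.split())}
SFCS = closed_sequences(SFS)
print(len(SFS), len(SFCS))  # prints: 32 14
```

Comparing only against other frequent sequences is enough, since any supersequence with the same support is itself frequent. The quadratic scan is acceptable for 32 sequences but is exactly the maintenance-and-test cost that the proposed algorithm tries to avoid.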
At first, the frequent sequences with length 1 are mined from a sequence database. These frequent sequences are then combined with (or extended by) each other to form new candidates with length 2. This process is repeated until no new frequent sequences are generated. In general, the sequences with length k are used to generate sequences with length k + 1. Besides the generation of candidates, the check for frequent closed sequences is applied in each pass. The following definitions are used in the process of extending sequences and checking frequent closed sequences.
Definition 1 (substring of a sequence). Let $S$ be a sequence. $sub_{i,j}(S)$ ($i \le j$) is defined as the substring of length $(j - i + 1)$ from position $i$ to position $j$ of $S$. For example, $sub_{1,3}(BABC)$ is $BAB$ and $sub_{4,4}(BABC)$ is $C$.
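Definition 2 (extending a sequence from a 1-sequence). Let $\alpha$ and $\beta$ be two frequent 1-sequences. $\{t_\alpha : p_\alpha\}$ and $\{t_\beta : p_\beta\}$ are the transactions and positions of sequences $\alpha$ and $\beta$, respectively. There are two forms of sequence extension.
Itemset extension: $\langle(\alpha\beta)\rangle\{t_\beta : p_\beta\}$, if $(\alpha < \beta) \wedge (t_\alpha = t_\beta) \wedge (p_\alpha = p_\beta)$. (2.1)
Sequence extension: $\langle\alpha\beta\rangle\{t_\beta : p_\beta\}$, if $(t_\alpha = t_\beta) \wedge (p_\alpha < p_\beta)$. (2.2)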
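A minimal sketch of the two extension forms for 1-sequences follows, working on per-transaction position lists. The position list of A is taken from Example 3; those of B and C are read from Fig. 1, and their assignment to transactions 1 to 4 is an assumption. The function names are illustrative, and the lexicographic condition $\alpha < \beta$ of (2.1) is left to the caller.

```python
from typing import Dict, List

Positions = Dict[int, List[int]]  # transaction id -> positions of the item

A: Positions = {1: [2, 3, 4], 2: [1, 3], 3: [1, 3], 4: [1, 4]}  # Example 3
B: Positions = {2: [2, 3, 4], 3: [2, 4], 4: [2, 3]}             # assumed from Fig. 1
C: Positions = {1: [1, 4], 2: [3], 3: [2, 5], 4: [3]}           # assumed from Fig. 1


def sequence_extension(alpha: Positions, beta: Positions) -> Positions:
    """(2.2): same transaction, beta positioned after the start of alpha."""
    result = {}
    for tid, p_alpha in alpha.items():
        first = min(p_alpha)
        later = [p for p in beta.get(tid, []) if p > first]
        if later:
            result[tid] = later
    return result


def itemset_extension(alpha: Positions, beta: Positions) -> Positions:
    """(2.1): same transaction and same position for alpha and beta."""
    result = {}
    for tid, p_alpha in alpha.items():
        shared = [p for p in beta.get(tid, []) if p in p_alpha]
        if shared:
            result[tid] = shared
    return result


def support(positions: Positions) -> int:
    """Support is the number of transactions with a non-empty position list."""
    return len(positions)


print(sequence_extension(A, A))          # {1: [3, 4], 2: [3], 3: [3], 4: [4]}
print(support(itemset_extension(A, B)))  # 1, so (AB) is below minSup = 2
print(itemset_extension(A, C))           # {1: [4], 2: [3]}
```

The result for (AB) agrees with Example 4, where the itemset is discarded because its support is 1.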
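Definition 3 (extending a sequence from a k-sequence). Let $\alpha$ and $\beta$ be two frequent k-sequences ($k > 1$), $u = sub_{k,k}(\alpha)$, and $v = sub_{k,k}(\beta)$. $\{t_\alpha : p_\alpha\}$ and $\{t_\beta : p_\beta\}$ are the transactions and positions of sequences $\alpha$ and $\beta$, respectively. There are two forms of sequence extension.
Itemset extension: $\alpha +_i \beta = sub_{1,k-1}(\alpha)(uv)\{t_\beta : p_\beta\}$, if $(u < v) \wedge (t_\alpha = t_\beta) \wedge (p_\alpha = p_\beta) \wedge (sub_{1,k-1}(\alpha) = sub_{1,k-1}(\beta))$. (3.1)
Sequence extension: $\alpha +_s \beta = \alpha v\{t_\beta : p_\beta\}$, if $(t_\alpha = t_\beta) \wedge (p_\alpha < p_\beta) \wedge (sub_{1,k-1}(\alpha) = sub_{1,k-1}(\beta))$. (3.2)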
Definition 4. Let $S = e_1 e_2 \cdots e_n$. An item $e'$ can be added to $S$ as a pattern extension in one of three positions:
$S' = e_1 e_2 \cdots e_n e' \wedge (sup_D(S') = sup_D(S))$ (4.1)
$\exists i\ (1 \le i < n)$ such that $S' = e_1 e_2 \cdots e_i e' \cdots e_n \wedge (sup_D(S') = sup_D(S))$ (4.2)
$S' = e' e_1 e_2 \cdots e_n \wedge (sup_D(S') = sup_D(S))$ (4.3)
In (4.1), item $e'$ appears after $e_n$, so item $e'$ is called a forward-extension and $S'$ is called a forward-extension sequence. For example, sequence AC: 4 is a forward-extension of sequence A: 4 because C is appended after A and both have support 4. In (4.2) and (4.3), item $e'$ appears before $e_n$, so item $e'$ is called a backward-extension and $S'$ is called a backward-extension sequence. For example, sequence CAC: 2 is a backward-extension of sequence CC: 2 because A is inserted in the middle of sequence CC and both have support 2.
Definition 5. Let $S = e_1 e_2 \cdots e_n$. The starting position of sequence $S$ is the position of the first appearance of itemset $e_1$. For example, in the sequence $AB(ABC)CB$, the starting position of sequence $(ABC)$ is 3, and that of sequence $ABB$ is 1.
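Definitions 4 and 5 can be illustrated with two small helpers. Both are sketches with hypothetical names: `starting_position` follows Definition 5 literally, and `extension_kind` classifies an extension under the simplifying assumption that every event is a single item, so that a sequence can be written as a plain string such as "CAC".

```python
from typing import List, Optional


def starting_position(transaction: List[str], s: List[str]) -> Optional[int]:
    """1-based position of the first itemset of `transaction` that contains e1 = s[0]."""
    first = set(s[0])
    for pos, itemset in enumerate(transaction, start=1):
        if first <= set(itemset):
            return pos
    return None


def extension_kind(s: str, sup_s: int, s_ext: str, sup_ext: int) -> Optional[str]:
    """Classify s_ext relative to s (single-item events only, equal support required)."""
    if sup_s != sup_ext or len(s_ext) != len(s) + 1:
        return None
    if s_ext[:-1] == s:
        return "forward"       # case (4.1): the new item follows e_n
    for i in range(len(s_ext) - 1):
        if s_ext[:i] + s_ext[i + 1:] == s:
            return "backward"  # cases (4.2) and (4.3): the new item precedes e_n
    return None


T = ["A", "B", "ABC", "C", "B"]                # the sequence AB(ABC)CB
print(starting_position(T, ["ABC"]))           # 3
print(starting_position(T, ["A", "B", "B"]))   # 1
print(extension_kind("A", 4, "AC", 4))         # forward
print(extension_kind("CC", 2, "CAC", 2))       # backward
```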
3. Related work
Mining frequent sequences was first proposed in 1995 by Agrawal and Srikant with their AprioriAll algorithm, which is based on the Apriori property. Agrawal and Srikant then generalized the mining problem with the GSP algorithm (Srikant and Agrawal, 1996). Since then, many frequent sequence mining algorithms have been proposed to improve mining efficiency. The algorithms use various approaches for organizing data and storing mined information. Typical algorithms include SPADE (Zaki, 2001), PrefixSpan (Pei et al., 2001), SPAM (Ayres et al., 2002), and LAPIN-SPAM (Yang and Kitsuregawa, 2005). The SPAM algorithm organizes data in a vertical bitmap format and uses a dictionary tree structure to store mined information. PrefixSpan uses database projection for sequence extension to reduce the search space, with the data presented horizontally. The LAPIN-SPAM algorithm uses a list to store the final positions of items and a set of boundary positions of the prefix to reduce the scope of the search space.
Various algorithms have been proposed for mining non-redundant frequent sequences to reduce the required storage space and runtime for mining rules. Frequent closed sequence mining and frequent closed itemset mining algorithms include A-CLOSE (Pasquier et al., 1999), CLOSET (Pei et al., 2000), CHARM (Zaki and Hsiao, 2002), and CLOSET+ (Wang et al., 2003). Most of these algorithms maintain the mined frequent itemsets in order to test frequent closed sequences, which requires a lot of memory. CLOSET+ uses a two-level hash-index structure and a tree structure for storing the itemsets to reduce memory space and the time required for testing closed itemsets. CloSpan (Yan et al., 2003) uses a maintain-and-test pattern method and combines a hash-index structure with a tree structure for storing sequences. This algorithm prunes patterns using techniques such as Common Prefix and Backward Sub-Pattern to reduce the search space. The ClaSP (Gomariz et al., 2013) algorithm uses a vertical database format strategy, as done by the SPADE algorithm, and a heuristic to prune non-closed sequences, as done by the CloSpan algorithm. However, the algorithm maintains previous candidates to test the closure of sequences and removes them later. The maintenance of candidates increases memory consumption, and the number of test candidates increases with the number of generated frequent closed sequences.
Table 1
Example sequence database D.
In order to overcome these problems, the BIDE algorithm (Wang et al., 2007) does not keep track of historical frequent closed sequences for checking the closure of new patterns. Instead, it uses bi-directional extension techniques to examine frequent closed patterns as candidates before extending a sequence. Moreover, the algorithm uses a BackScan process to identify candidates that cannot be extended, which reduces mining time. The algorithm uses pseudo-projection techniques to reduce database storage space and is efficient for low support thresholds. However, during mining it has to project and scan the database many times for each prefix, which makes it inefficient.
4. Proposed algorithm
This section describes the proposed CloFS-DBV algorithm, which uses a dynamic bit vector (DBV) structure combined with the location information of transactions in the CloFS-DBVPattern structure to mine frequent closed sequences.
4.1. DBV data structure
Sequence mining algorithms based on a vertical data format have proven to be more efficient than those based on a horizontal data format. Typical algorithms that use a vertical format include SPADE (Zaki, 2001), DISC-all (Chiu et al., 2004), HVSM (Song et al., 2005), and MSGPs (Pham et al., 2012). These algorithms scan the database only once and calculate the support of a sequence quickly. However, their disadvantage is that they consume much more memory to store additional information. BitTableFI (Dong and Han, 2007) and Index-BitTableFI (Song et al., 2008) address this problem by compressing data with a bit table (BitTable).
The main drawback of the bit vector structure is its fixed size, which depends on the number of transactions in a sequence database. A '1' bit indicates that the item appears in the transaction and a '0' bit indicates otherwise. In practice, there are usually many '0' bits in a bit vector, i.e., items often appear randomly in the sequence database. In addition, the sequence extension process (which uses bitwise AND) produces even more '0' bits. This increases the required memory and processing time. In order to overcome this problem, the dynamic bit vector architecture is used (Vo et al., 2012). Let A and B be two bit vectors, and let $p_1$ and $p_2$ be the probabilities of '1' bits in A and B, respectively. Assume that $k$ is the probability of '0' bits after joining A and B to get AB by the sequence extension process. The probability of '1' bits in the bit vector AB is then $\min(p_1, p_2) - k$, where $\min(p_1, p_2)$ is the minimum value of $p_1$ and $p_2$. Obviously, the probability of '1' bits in AB decreases while that of '0' bits increases. Moreover, the gap between $p_1$ and $p_2$ grows quickly after several sequence extensions.
Suppose there are 16 transactions in a sequence database. An item i exists in transactions 7, 9, 10, 11, and 13. The bit vector for the item i needs 16 bytes, as shown in Table 2. The first non-zero byte appears at index 7. The DBV only stores the starting index and the sequence of bytes from the first non-zero byte to the last non-zero byte, as shown in Table 3. Only 8 bytes are required to store the information using the DBV structure.
Each DBV consists of two parts: (1) Start bit: the position of the first appearance of '1' and (2) Bit vector: the sequence of bits from the first non-zero byte to the last non-zero byte. The DBV structure is used to store transactions in a vertical format. Sequence supports can easily be calculated by counting the number of '1' bits.
Example 2. Consider database D in Table 1. Sequence A exists in transactions 1, 2, 3, and 4, so the start bit is 1, and the bit vector is 1111. The bit vector has four '1' bits, so the support of sequence A is 4. Sequence B exists in transactions 2, 3, and 4, so the start bit is 2, and thus the bit vector is 111. The bit vector has three '1' bits, so the support of sequence B is 3. Table 4 shows the conversion of database D in Table 1 to DBV format.
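The trimming idea can be sketched as a small Python class. This is an illustrative model rather than the structure used in the experiments: the class name `DBV` and its methods are assumptions, one entry per transaction is stored instead of packed bytes, and the AND is computed through transaction-id sets for readability. The count of 8 for the 16-transaction example is read here as 7 stored entries (indices 7 to 13) plus one slot for the starting index, which is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class DBV:
    start: int             # 1-based index of the first '1' entry
    bits: Tuple[int, ...]  # entries from the first '1' to the last '1'

    @staticmethod
    def from_tids(tids: Iterable[int]) -> "DBV":
        """Build a DBV from the ids of the transactions that contain the sequence."""
        present = set(tids)
        if not present:
            return DBV(0, ())
        lo, hi = min(present), max(present)
        return DBV(lo, tuple(1 if t in present else 0 for t in range(lo, hi + 1)))

    def tids(self) -> List[int]:
        """Transaction ids whose entry is '1'."""
        return [self.start + k for k, bit in enumerate(self.bits) if bit]

    def support(self) -> int:
        """Support is the number of '1' entries."""
        return sum(self.bits)

    def __and__(self, other: "DBV") -> "DBV":
        """Join two DBVs; leading and trailing zeros of the result are trimmed again."""
        return DBV.from_tids(set(self.tids()) & set(other.tids()))


item = DBV.from_tids([7, 9, 10, 11, 13])
print(item.start, item.bits)   # 7 (1, 0, 1, 1, 1, 0, 1)

a = DBV.from_tids([1, 2, 3, 4])   # sequence A of Example 2
b = DBV.from_tids([2, 3, 4])      # sequence B of Example 2
print(a.support(), b.support())   # 4 3
print(a & b)                      # DBV(start=2, bits=(1, 1, 1))
```

Re-trimming after every join is what keeps the vectors short as the '1' entries thin out during extension.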
Table 2
Example of 16-byte bit vector.
Table 3
Conversion of bit vector in Table 2 to DBV.
Table 4
Conversion of database D in Table 1 to DBV format.
4.2. CloFS-DBVPattern data structure
The CloFS-DBVPattern structure combines a DBV structure with a representation of sequence information. Each CloFS-DBVPattern consists of two parts: (1) Sequence: the sequence information and (2) BlockInfo: a DBV and, for each transaction, a list of the positions at which the sequence appears. The position list of each transaction is represented in the form startPos: {list positions}, where startPos is the first appearance of the sequence in that transaction.
Example 3. In database D (Table 1), sequence A exists in transactions 1, 2, 3, and 4. For the first transaction, sequence A appears at positions {2, 3, 4}. The starting position is 2, and thus 2: {2, 3, 4} is stored. For the second transaction, sequence A appears at positions {1, 3}. The starting position is 1, and thus 1: {1, 3} is stored. For the third transaction, sequence A appears at positions {1, 3}. The starting position is 1, and thus 1: {1, 3} is stored. Similarly, for the last transaction, sequence A appears at positions {1, 4} and the starting position is 1, thus 1: {1, 4} is stored. Table 5 presents the CloFS-DBVPattern for sequence A in Table 1.
Table 5
CloFS-DBVPattern for sequence A in Table 1.
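One possible in-memory layout for a CloFS-DBVPattern of a single item is sketched below. The class and function names are hypothetical, the DBV is kept as a (start bit, bit string) pair for brevity, and `DB` is a stand-in for Table 1 (whose body is not reproduced in this text) chosen so that the position lists of A match Example 3; items D and E are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CloFSDBVPattern:
    sequence: str                                # Sequence part
    dbv: Tuple[int, str]                         # BlockInfo: (start bit, bit vector)
    positions: Dict[int, Tuple[int, List[int]]]  # BlockInfo: tid -> (startPos, positions)


def pattern_for_item(item: str, db: List[List[str]]) -> CloFSDBVPattern:
    """Scan the database once and record where `item` occurs (item must occur at least once)."""
    positions = {}
    for tid, transaction in enumerate(db, start=1):
        pos = [p for p, itemset in enumerate(transaction, start=1) if item in itemset]
        if pos:
            positions[tid] = (pos[0], pos)
    tids = sorted(positions)
    bits = "".join("1" if t in positions else "0" for t in range(tids[0], tids[-1] + 1))
    return CloFSDBVPattern(item, (tids[0], bits), positions)


# Hypothetical stand-in for Table 1; each string is one itemset.
DB = [
    ["C", "A", "A", "AC"],
    ["A", "B", "ABC", "B"],
    ["A", "BC", "A", "B", "C"],
    ["A", "B", "BC", "A"],
]

pattern_a = pattern_for_item("A", DB)
print(pattern_a.dbv)        # (1, '1111')
print(pattern_a.positions)  # {1: (2, [2, 3, 4]), 2: (1, [1, 3]), 3: (1, [1, 3]), 4: (1, [1, 4])}
```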
The CloFS-DBV tree is used to store CloFS-DBVPatterns. The CloFS-DBV tree is an extension of the prefix tree. The prefix tree can be constructed in the following way. The root node of the tree is at the top level and labeled NULL. Recursively, each node X at level k in the tree can be extended by adding one item to get a child node X' at level k + 1. The children of node X are generated and arranged in lexicographical order. By using the prefix tree, the generation of sequence rules becomes more efficient. Typical algorithms for building a prefix tree include CloGen (Pham et al., 2013), IMSR_PreTree (Van et al., 2014), and MNSR_PreTree (Pham et al., 2014). In the CloFS-DBV tree, each node is a CloFS-DBVPattern: a sequence, a DBV, and a list of positions of the sequence in each transaction. Each node in the tree is extended in two forms: sequence extension and itemset extension. Fig. 1 shows the candidates for the database in Table 1 obtained using the CloFS-DBV algorithm.
4.3. CloFS-DBV algorithm
Proposition 1 (checking sequence closure). If there exists a sequence $S_b$ that is a forward-extension or backward-extension of sequence $S_a$, sequence $S_a$ is not closed, and $S_a$ can be safely absorbed by $S_b$.
Considering the above example, suppose that $S_a = CC: 2$ and $S_b = CAC: 2$. Then, CC: 2 will be absorbed by CAC because $CC \subseteq CAC$ and $sup_D(CC) = sup_D(CAC) = 2$.
Proposition 2 (pruning a prefix). Consider a prefix $S_p = e_1 e_2 \cdots e_n$. If there exists an item e before the starting position of prefix $S_p$ in each of the transactions containing $S_p$ in sequence database D, the extension of prefix $S_p$ can be pruned. For example, consider the database D in Table 1. There is no need to extend prefix B because there exists a sequence A that occurs before B in each transaction that contains prefix B. If we extend prefix B, the results obtained will be absorbed, because the extension of prefix A already contains B and has the same support.
The CloFS-DBV algorithm consists of four main phases: (1) conversion of the sequence database to the CloFS-DBVPattern structure, (2) examination of the closure of frequent sequences, (3) early pruning of prefix sequences, and (4) extension of sequences. Since CloFS-DBV uses the CloFS-DBVPattern structure, it can check the backward-extension and forward-extension quickly. For each transaction, CloFS-DBV just considers the start position or the last position of the sequence. Therefore, if the sequence appears in N transactions, CloFS-DBV takes only N operations to check each candidate. In contrast, the BIDE algorithm, which is more efficient than CloSpan in almost all cases (Wang et al., 2007), uses a local database to check backward-extension and a projected local database to check forward-extension, i.e., it has to scan each item of each transaction in this database. Let k be the sequence length and N be the number of transactions containing the sequence. BIDE thus requires $k \times N$ operations to check each candidate.
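Proposition 2 can be read as an intersection test over the items that precede the starting position of the prefix. The sketch below follows that reading on raw transactions; it is an illustration, not the authors' implementation, which works on the stored start positions. The function names are hypothetical and `DB` is a stand-in for Table 1 (not reproduced in this text) with the infrequent items D and E omitted.

```python
from typing import List, Optional, Set

Transaction = List[str]  # one string per itemset, e.g. "BC" for (BC)


def contains(transaction: Transaction, pattern: List[str]) -> bool:
    """True if `pattern` is a subsequence of `transaction` (greedy matching)."""
    matched = 0
    for itemset in transaction:
        if matched < len(pattern) and set(pattern[matched]) <= set(itemset):
            matched += 1
    return matched == len(pattern)


def start_position(transaction: Transaction, pattern: List[str]) -> int:
    """1-based position of the first itemset containing the first itemset of `pattern`."""
    first = set(pattern[0])
    return next(p for p, itemset in enumerate(transaction, start=1) if first <= set(itemset))


def can_prune(prefix: List[str], db: List[Transaction]) -> bool:
    """True if some item precedes the prefix in every transaction that contains it."""
    common: Optional[Set[str]] = None
    for transaction in db:
        if not contains(transaction, prefix):
            continue
        start = start_position(transaction, prefix)
        before = set("".join(transaction[: start - 1]))
        common = before if common is None else common & before
    return bool(common)


# Hypothetical stand-in for Table 1.
DB = [
    ["C", "A", "A", "AC"],
    ["A", "B", "ABC", "B"],
    ["A", "BC", "A", "B", "C"],
    ["A", "B", "BC", "A"],
]

print(can_prune(["B"], DB))  # True: A precedes B wherever B occurs
print(can_prune(["A"], DB))  # False: nothing precedes A in transaction 2
```

With start positions already stored per transaction, the same test needs one step per transaction, which is the N-operation cost stated above.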
Tree nodes, listed as sequence; DBV (start bit, bit vector); position lists (startPos:{positions}):
NULL (root)
A; (1,1111); 1:{1,4}, 1:{1,3}, 1:{1,3}, 2:{2,3,4}
B; (2,111); 2:{2,3}, 2:{2,4}, 2:{2,3,4}
C; (1,1111); 3:{3}, 2:{2,5}, 3:{3}, 1:{1,4}
AA; (1,1111); 1:{4}, 1:{3}, 1:{3}, 2:{3,4}
AB; (2,111); 1:{2,3}, 1:{2,4}, 1:{2,3,4}
AC; (1,1111); 1:{3}, 1:{2,5}, 1:{3}, 2:{4}
(AC); (1,11); 3:{3}, 4:{4}
CA; (1,1101); 3:{4}, 2:{3}, 1:{2,3,4}
CB; (2,11); 2:{4}, 3:{4}
CC; (1,101); 2:{5}, 1:{4}
AAB; (2,11); 1:{4}, 1:{4}
AAC; (1,101); 1:{5}, 2:{4}
A(AC); (1,11); 1:{3}, 2:{4}
ABA; (2,111); 1:{4}, 1:{3}, 1:{3}
ABB; (2,111); 1:{3}, 1:{4}, 1:{3,4}
ABC; (2,111); 1:{3}, 1:{5}, 1:{3}
A(BC); (2,111); 1:{3}, 1:{2}, 1:{3}
ACA; (3,11); 1:{4}, 1:{3}
ACB; (2,11); 1:{4}, 1:{4}
CAC; (1,101); 2:{5}, 1:{4}
ABAB; (2,11); 1:{4}, 1:{4}
AB(BC); (2,101); 1:{4}, 1:{4}
A(BC)A; (3,11); 1:{4}, 1:{3}
A(BC)B; (2,11); 1:{4}, 1:{4}
BA; (2,111); 2:{3}, 2:{3}, 2:{4}
BB; (2,111); 2:{3,4}, 2:{4}, 2:{3}
BC; (2,111); 2:{3}, 2:{5}, 2:{3}
(BC); (2,111); 3:{3}, 2:{2}, 3:{3}
BAB; (2,11); 2:{4}, 2:{4}
B(BC); (2,101); 2:{3}, 2:{3}
(BC)A; (3,11); 2:{3}, 3:{4}
(BC)B; (2,11); 3:{4}, 2:{4}
Fig. 1. CloFS-DBV tree for database in Table 1. Shaded rectangles represent candidates that are not closed. Unshaded rectangles represent frequent closed sequences. Lines
Table 6
CloFS-DBV algorithm.
Method: CloFS-DBV (D, minSup)
Input: A sequence database D and a support threshold minSup
Output: A complete set of frequent closed sequences FCS
1 Let FCS.root = NULL;
2 Let fcs1 = {CloFS-DBVPattern(i) | i ∈ I ∧ sup(i) ≥ minSup};
3 Sort fcs1 in increasing order by item i;
4 Add fcs1 as child nodes of FCS.root;
5 For (each child node subNode in FCS.root) do
6 Call DBV-Pattern-Extension (subNode, minSup);
7 End For
Table 6 shows the pseudo code of the proposed CloFS-DBV algorithm. The algorithm first scans database D to find frequent 1-sequences and stores them in fcs1 as CloFS-DBVPatterns (line 2). Then, the items in fcs1 are sorted in ascending order (line 3) to reduce the number of steps in the itemset extension phase. On line 6, the algorithm performs the sequence extension for each child node of FCS.root.
Table 7 shows the DBV-Pattern-Extension algorithm called by the CloFS-DBV algorithm. The extension takes two forms: sequence extension (line 5) and itemset extension (line 8). Before sequence extension, the algorithm tests and eliminates prefixes that cannot be extended into frequent closed sequences using Proposition 2 (line 3). The process executes recursively (line 12) until no frequent closed sequences are generated. Line 14 uses Proposition 1 to check the prefix Sp. If Sp is not a frequent closed sequence, it is set to NULL.
Table 7
DBV-Pattern-Extension algorithm.
Method: DBV-Pattern-Extension (root, minSup)
Input: A root of prefix tree root and a minSup
Output: A set of frequent closed sequences root
1 Let list_node = child nodes of root;
2 For (each Sp in list_node) do
3 If (Sp is not pruned) then
4 For (each Sa in list_node) do
5 If (sup (Let Spa = Sequence-Extension (Sp, Sa)) ≥ minSup) then
8 If (sup (Let Spa = Itemset-Extension (Sp, Sa)) ≥ minSup) then
12 Call DBV-Pattern-Extension (Sp, minSup)
14 If (Sp is not a frequent closed sequence) then
Example 4. This example demonstrates sequence extension for the CloFS-DBV algorithm with sequence database D in Table 1 and minSup = 2 (50%). After line 2 (Table 6) is executed, three frequent 1-sequences are stored, i.e., fcs1 = {A: 4, B: 3, C: 4} (Table 8). In this example, prefix A is not a closed sequence after the backward-extension process and prefix B can be pruned after the pruning prefix process. The algorithm performs sequence extension to create new frequent closed 2-sequences. Starting with prefix A, the extension proceeds with sequences A, B, and C in the forms of sequence extension (Table 9a) and itemset extension (Table 9b). Positions 3 and 4 of itemset (AB) are empty, so the bits corresponding to those positions are set to '0' and this itemset is removed ($sup_D((AB)) = 1 < minSup$). The process of expanding continues until no candidate is generated. The results obtained are shown in Fig. 1.
Table 8
Sequences A, B, and C in the sample database after conversion to CloFS-DBVPattern
Table 9
Example of (a) sequence extension and (b) itemset extension for prefix A.
5. Experimental results
Experiments were performed to evaluate the proposed algorithm. All algorithms were implemented on a personal computer with an Intel Core Duo 2.0-GHz CPU and 4 GB of RAM running Windows 8.1. The BIDE and CloSpan algorithms, two well-known state-of-the-art methods, were used for comparison. The databases used for comparison were generated using the IBM synthetic data generator. The definitions of the parameters used to generate the databases are shown in Table 10.
The comparisons of runtime and memory usage were performed on three databases: C6T5S4I4N1kD10k, T10I4D100k, and N10kD10k. First, experiments were conducted to compare the execution time of the three algorithms. The results are shown in Fig. 2. Fig. 2a shows the runtimes for minSup values of 6% to 10% for the C6T5S4I4N1kD10k database, Fig. 2b shows those for minSup values of 3.5% to 5.5% for the T10I4D100k database, and Fig. 2c shows those for minSup values of 5% to 9% for the N10kD10k database. As minSup decreases, more frequent sequences are obtained, so the number of frequent closed sequence checks also increases. As a result, the execution time of the algorithms increases quickly.
Table 10
Definitions of parameters for standard databases from IBM.
Fig. 2. Comparison of runtime for various minSup values for (a) C6T5S4I4N1kD10k,
Fig. 3. Comparison of memory usage for various minSup values for
Fig. 2 shows that the execution time of the three algorithms increases with decreasing minSup, with CloFS-DBV being the fastest in all cases. For example, Fig. 2a shows the mining time of CloFS-DBV, CloSpan, and BIDE. With minSup = 6%, the mining time of BIDE is 7432 ms, that of CloSpan is 7825 ms, and that of CloFS-DBV is 5040 ms. Most of the execution time of CloFS-DBV is spent in the first stage of the mining process, in which CloFS-DBV scans the sequence database to construct a CloFS-DBVPattern structure for each item. After that stage, the algorithm takes very little time because its operations consist mostly of bit manipulation.
Next, experiments were conducted to compare the total memory usage (MB) of the three algorithms. Fig. 3 shows the memory usage of the three algorithms for various minSup values. Fig. 3a–c shows the memory usage for the C6T5S4I4N1kD10k, T10I4D100k, and N10kD10k databases, respectively. With decreasing minSup, the number of generated candidates and the required memory increase for the three algorithms. CloFS-DBV requires less storage space than BIDE or CloSpan due to its use of a compressed data structure. For example, Fig. 3c shows the total memory usage of CloFS-DBV, CloSpan, and BIDE for the N10kD10k database. With minSup = 5%, the total memory usage of BIDE is 29.5 MB, that of CloSpan is 60.2 MB, and that of CloFS-DBV is 7.37 MB. The total memory usage of CloFS-DBV is less than that of BIDE or CloSpan because CloFS-DBV uses a DBV data structure and stores only the information needed in the mining process. During mining, CloFS-DBV uses neither a hash table nor database projection, as CloSpan and BIDE do. Moreover, extending sequences, counting the support of sequences, and the other operations of CloFS-DBV are mainly based on bit manipulation, so the algorithm consumes less memory in the process.
6. Conclusion and future work
This paper proposed the CloFS-DBV algorithm, which uses DBVs and transaction information to mine frequent closed sequences. The CloFS-DBV algorithm is divided into two main stages: (1) the original sequence database is transformed into a vertical data format called CloFS-DBVPattern, where each CloFS-DBVPattern stores the positions of the frequent closed sequences that appear in the database; (2) frequent closed sequences are generated and tested, and prefixes are pruned early. The CloFS-DBV algorithm scans the database only once and calculates the supports based on the DBV to generate new patterns. Due to its use of a compressed structure, the CloFS-DBV algorithm is more efficient than the BIDE and CloSpan algorithms in terms of memory usage and runtime.
The CloFS-DBV algorithm has a few limitations that will be addressed in the future. Frequent closed inter-sequences will be mined to reduce the number of redundant patterns. Based on mining frequent closed inter-sequences, the generation of rules will be made more compact and efficient. In addition, mining maximal frequent sequences has been proposed in recent years (Guan et al., 2005; García-Hernández et al., 2006; Lin et al., 2007; Fournier-Viger et al., 2013). The DBV data structure will be applied for the efficient mining of such sequences.
Acknowledgment
This work was funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant no. 102.05-2013.20.
References
Agrawal, R., Srikant, R., 1995. Mining sequential patterns. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 3–14.
Ayres, J., Gehrke, J., Yiu, T., Flannick, J., 2002. Sequential pattern mining using a bitmap representation. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, pp. 429–435.
Chiu, D.Y., Wu, Y.H., Chen, A.L.P., 2004. An efficient algorithm for mining frequent sequences by a new strategy without support counting. In: Proceedings of the 20th International Conference on Data Engineering, pp. 375–386.
Dong, J., Han, M., 2007. BitTableFI: an efficient mining frequent itemsets algorithm. Knowl.-Based Syst. 20 (4), 329–335.
Fournier-Viger, P., Wu, C.W., Tseng, V.S., 2013. Mining maximal sequential patterns without candidate maintenance. In: Advanced Data Mining and Applications, Lecture Notes in Computer Science, vol. 8346, pp. 169–180.
Gomariz, A., Campos, M., Marin, R., Goethals, B., 2013. ClaSP: an efficient algorithm for mining frequent closed sequences. In: Advances in Knowledge Discovery and Data Mining, LNA, vol. 7818, pp. 50–61.
Guan, E.Z., Chang, X.Y., Wang, Z., Zhou, C.G., 2005. Mining maximal sequential patterns. In: Proceedings of the Second International Conference on Neural Networks and Brain, pp. 525–528.
García-Hernández, R.A., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., 2006. A new algorithm for fast discovery of maximal sequential patterns in a document collection. In: Computer Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, vol. 3878, pp. 514–523.
Lin, N.P., Hao, W.H., Chen, H.J., Chueh, H.E., Chang, C.I., 2007. Fast mining maximal sequential patterns. In: Proceedings of the 7th International Conference on Simulation, Modeling and Optimization, September 15–17. Beijing, China, pp. 405–408.
Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L., 1999. Discovering frequent closed itemsets for association rules. In: Proceedings of the International Conference on Database Theory (ICDT'99), pp. 398–416.
Pei, J., Han, J., Mao, R., 2000. CLOSET: an efficient algorithm for mining frequent closed itemsets. In: Proceedings of the ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD'00), pp. 21–30.
Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.-C., 2001. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the International Conference on Data Engineering, pp. 215–224.
Pham, T.T., Luo, J., Hong, T.P., Vo, B., 2012. MSGPs: a novel algorithm for mining sequential generator patterns. In: Computational Collective Intelligence, Technologies and Applications, Lecture Notes in Computer Science, vol. 7654, pp. 393–401.
Pham, T.T., Luo, J., Vo, B., 2013. An effective algorithm for mining closed sequential patterns and their minimal generators based on prefix trees. Int. J. Intell. Inf. Database Syst. 7 (4), 324–339.
Pham, T.T., Luo, J., Hong, T.P., Vo, B., 2014. An efficient method for mining non-redundant sequential rules using attributed prefix-trees. Eng. Appl. Artif. Intell. 32, 88–99.
Srikant, R., Agrawal, R., 1996. Mining sequential patterns: Generalizations and performance improvements. In: Proceedings of the International Conference on Extending Database Technology, pp. 3–17.
Song, W., Yang, B., Xu, Z., 2008. Index-BitTableFI: an improved algorithm for mining frequent itemsets. Knowl.-Based Syst. 21 (6), 507–513.
Song, S., Hu, H., Jin, S., 2005. HVSM: a new sequential pattern mining algorithm using bitmap representation. In: Proceedings of the Advanced Data Mining and Applications, pp. 455–463.
Vo, B., Hong, T.P., Le, B., 2012. DBV-Miner: a dynamic bit-vector approach for fast mining frequent itemsets. Exp. Syst. Appl. 39 (8), 7196–7206.
Van, T.T., Vo, B., Le, B., 2014. IMSR_PreTree: an improved algorithm for mining sequential rules based on the prefix-tree. Vietnam J. Comput. Sci. 1 (2), 97–105.
Wang, J., Han, J., Pei, J., 2003. CLOSET+: searching for the best strategies for mining frequent closed itemsets. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'03), pp. 236–245.
Wang, J., Han, J., Li, C., 2007. Frequent closed sequence mining without candidate maintenance. IEEE Trans. Knowl. Data Eng. 19 (8), 1042–1056.
Yan, X., Han, J., Afshar, R., 2003. CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of the SIAM International Conference on Data Mining, pp. 166–177.
Yang, Z., Kitsuregawa, M., 2005. LAPIN–SPAM: an improved algorithm for mining sequential pattern. In: Proceedings of the ICDE Workshops 2005, p. 1222.
Zaki, M., 2001. SPADE: an efficient algorithm for mining frequent sequences. Mach. Learn. 42 (1–2), 31–60.
Zaki, M., Hsiao, C., 2002. CHARM: an efficient algorithm for closed itemset mining. In: Proceedings of the SIAM International Conference on Data Mining (SDM'02), pp. 457–473.