DSpace at VNU: Combination of dynamic bit vectors and transaction information for mining frequent closed sequences efficiently



Combination of dynamic bit vectors and transaction information for mining frequent closed sequences efficiently

Minh-Thai Tran a, Bac Le b, Bay Vo c,*

a Faculty of Information Technology, Information Technology College, Ho Chi Minh City, Vietnam

b Department of Computer Science, University of Science, VNU-Ho Chi Minh, Vietnam

c Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam

Article info

Article history:

Received 24 May 2014

Received in revised form 23 October 2014

Accepted 28 October 2014

Keywords:

Dynamic bit vector

Frequent closed sequence

CloFS-DBV

Abstract

Sequence mining algorithms attempt to mine all possible frequent sequences. These algorithms produce redundant results, increasing the required storage space and runtime, especially for large sequence databases. In recent years, many studies have shown that mining frequent closed sequences is more efficient than mining all frequent sequences, and that the desired information can be fully extracted from frequent closed sequences. Most algorithms for mining frequent closed sequences use a candidate maintenance-and-test paradigm. The present paper proposes an algorithm called CloFS-DBV that uses dynamic bit vectors. Various methods are employed to reduce memory usage and runtime. Experimental results show that CloFS-DBV is more efficient than the BIDE and CloSpan algorithms in terms of execution time and memory usage.

1 Introduction

Sequential pattern mining is a fundamental problem in knowledge discovery and data mining with broad applications, including the analysis of customer purchase behavior, web access patterns, scientific experiments, disease treatment, natural disaster prevention, and protein formation. Sequential pattern mining includes two main stages: frequent pattern mining and rule mining. Many studies have modified the AprioriAll algorithm (Agrawal and Srikant, 1995) for mining frequent sequential patterns. Unlike the general mining of frequent sequences, the mining of frequent closed sequences has not been extensively studied. Although some algorithms have been proposed, such as CloSpan (Yan et al., 2003), CLOSET+ (Wang et al., 2003), and BIDE (Wang et al., 2007), their performance is poor for large databases. BIDE detects frequent sequences, not closed ones, and prunes candidates early, instead of using maintenance-and-test patterns.

Recently, many authors have proposed techniques that present data in a vertical format (Song et al., 2005), use database projection (Pei et al., 2001), or use bit vector data structures (Song et al., 2008), all of which have been shown to be effective. However, the storage space and execution time can be further reduced in the mining process for large sequence databases.

The present study proposes the CloFS-DBV algorithm, which uses a vertical data format and data compression, and divides the search space to reduce the required storage space and execution time for mining frequent closed sequences. The rest of the paper is organized as follows. Section 2 gives the problem definition. Section 3 summarizes related work. Sections 4 and 5 present the proposed algorithm and experimental results, respectively. The conclusions and future work are given in Section 6.

2 Problem definition

Consider a sequence database with a set of distinct events $I = \{i_1, i_2, i_3, \ldots, i_n\}$, where $i_j$ ($1 \le j \le n$) is an event (or an item). A set of unordered events is called an itemset. Each itemset is put in brackets, for example $(ABC)$. To simplify notation, the brackets are omitted for itemsets that contain only a single item, for example $B$. A sequence $S = \{e_1, e_2, e_3, \ldots, e_m\}$ is an ordered list of events, where $e_j$ ($1 \le j \le m$) is an itemset. Suppose that $\ell$ is the number of events in a sequence. A sequence with length $\ell$ is called an $\ell$-sequence. For example, $AB(AE)CB$ is a 6-sequence. A sequence $S_a = a_1, a_2, \ldots, a_m$ is contained in another sequence $S_b = b_1, b_2, \ldots, b_n$ if there exist integers $1 \le i_1 < i_2 < \cdots < i_m \le n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_m \subseteq b_{i_m}$.

If sequence $S_a$ is contained in sequence $S_b$, $S_a$ is called a subsequence of $S_b$ and $S_b$ is called a supersequence of $S_a$, denoted as $S_a \subseteq S_b$. A sequence database is denoted as $D = \{s_1, s_2, s_3, \ldots, s_{|D|}\}$, where $|D|$ is the number of sequences in $D$ and $s_i$ ($1 \le i \le |D|$) is a transaction in the form $\langle ID, Sequence \rangle$, where the attribute $ID$ is used to describe the information of $s_i$ corresponding to transaction information over time. The absolute support (support) of a sequence $S_a$ in a sequence database $D$ is calculated as the number of occurrences of $S_a$ in the

Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/engappai

http://dx.doi.org/10.1016/j.engappai.2014.10.021

* Corresponding author.

E-mail addresses: minhthai@itc.edu.vn (M.-T. Tran), lhbac@fit.hcmus.edu.vn (B. Le), bayvodinh@gmail.com (B. Vo).


transactions of $D$, denoted as $sup_D(S_a)$. The support of a sequence is given in the notation sequence: support. For example, a sequence $AB$ with support 3 is represented as $AB: 3$.

Given a minimum support threshold $minSup$, a sequence $S_a$ is a frequent sequence on $D$ if $sup_D(S_a) \ge minSup$. If sequence $S_a$ is frequent and there exists no proper supersequence $S_b$ of $S_a$ with the same support, $S_a$ is called a frequent closed sequence, i.e., there does not exist $S_b$ such that $S_a \subseteq S_b$ and $sup_D(S_a) = sup_D(S_b)$. The problem of mining frequent closed sequences is to find the complete set of frequent closed sequences for an input sequence database $D$ and a given minimum support threshold $minSup$.

Example 1. Consider the sequence database in Table 1. The database has five unique items $I = \{A, B, C, D, E\}$ and four transactions, i.e., $|D| = 4$. Assume that the minimum support threshold is $minSup = 2$ (50%). If all frequent sequences of $D$ are mined with the given $minSup$, the following 32 sequences are obtained: $S_{FS}$ = {A:4, AA:4, AB:3, AC:4, (AC):2, AAB:2, AAC:2, A(AC):2, ABA:3, ABB:3, ABC:3, A(BC):3, ACA:2, ACB:2, ABAB:2, AB(BC):2, A(BC)A:2, A(BC)B:2, B:3, BA:3, BB:3, BC:3, (BC):3, BAB:2, B(BC):2, (BC)A:2, (BC)B:2, C:4, CA:3, CB:2, CC:2, CAC:2}. In contrast, mining the frequent closed sequences yields $S_{FCS}$ = {AA:4, AC:4, AAC:2, A(AC):2, ABA:3, ABB:3, ABC:3, A(BC):3, ABAB:2, AB(BC):2, A(BC)A:2, A(BC)B:2, CA:3, CAC:2}, which has only 14 sequences.

Frequent closed sequences $S_{FCS}$ are thus more compact than general frequent sequences $S_{FS}$. This is because a subsequence $S_a$ with the same support as its supersequence $S_b$ is absorbed by $S_b$ without affecting the mining results. For example, sequence $(BC)A: 2$ is absorbed by sequence $A(BC)A: 2$ because $(BC)A \subseteq A(BC)A$ and $sup_D((BC)A) = sup_D(A(BC)A) = 2$.
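The absorption rule can be checked directly on the 32 frequent sequences listed in Example 1. The following is a minimal Python sketch, not the CloFS-DBV algorithm itself: it naively compares every pair of sequences. The function names (`parse`, `contains`, `closed_only`) are illustrative, and containment is assumed to follow the usual convention that each itemset of the subsequence must be a subset of the itemset it is matched to.

```python
def parse(text):
    """Parse notation such as 'A(BC)A' into a tuple of itemsets."""
    itemsets, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            itemsets.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            itemsets.append(frozenset(text[i]))
            i += 1
    return tuple(itemsets)


def contains(sub, sup):
    """True if `sub` is contained in `sup` (greedy left-to-right matching)."""
    k = 0
    for itemset in sup:
        if k < len(sub) and sub[k] <= itemset:  # assumed subset matching
            k += 1
    return k == len(sub)


def closed_only(frequent):
    """Keep sequences with no proper supersequence of equal support."""
    parsed = {name: parse(name) for name in frequent}
    closed = {}
    for a, sup_a in frequent.items():
        absorbed = any(
            b != a and sup_b == sup_a and contains(parsed[a], parsed[b])
            for b, sup_b in frequent.items()
        )
        if not absorbed:
            closed[a] = sup_a
    return closed


# The frequent sequences and supports listed in Example 1.
S_FS = {
    "A": 4, "AA": 4, "AB": 3, "AC": 4, "(AC)": 2, "AAB": 2, "AAC": 2, "A(AC)": 2,
    "ABA": 3, "ABB": 3, "ABC": 3, "A(BC)": 3, "ACA": 2, "ACB": 2, "ABAB": 2,
    "AB(BC)": 2, "A(BC)A": 2, "A(BC)B": 2, "B": 3, "BA": 3, "BB": 3, "BC": 3,
    "(BC)": 3, "BAB": 2, "B(BC)": 2, "(BC)A": 2, "(BC)B": 2, "C": 4, "CA": 3,
    "CB": 2, "CC": 2, "CAC": 2,
}

S_FCS = closed_only(S_FS)
print(len(S_FS), len(S_FCS))  # 32 14
print(sorted(S_FCS))
```

The pairwise test is quadratic in the number of frequent sequences, which is the kind of maintain-and-test cost that the closure checks described later in the paper aim to avoid.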

At first, the frequent sequences with length 1 are mined from a sequence database. After that, these frequent sequences are combined with (or extended by) each other to form new candidates with length 2. This process is repeated until no new frequent sequences are generated. In general, the sequences with length $k$ are used to generate sequences with length $k + 1$. Besides the generation of candidates, the check for frequent closed sequences is applied in each step. The following definitions are used in the process of extending sequences and checking frequent closed sequences.

Definition 1 (substring of a sequence). Let $S$ be a sequence. $sub_{i,j}(S)$ ($i \le j$) is defined as a substring of length $(j - i + 1)$ from position $i$ to position $j$ of $S$. For example, $sub_{1,3}(BABC)$ is $BAB$ and $sub_{4,4}(BABC)$ is $C$.

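Definition 2 (extending a sequence from a 1-sequence). Let $\alpha$ and $\beta$ be two frequent 1-sequences. $\{t_\alpha : p_\alpha\}$ and $\{t_\beta : p_\beta\}$ are the transactions and positions of sequences $\alpha$ and $\beta$, respectively. There are two forms of sequence extension.

Itemset extension: $\langle(\alpha\beta)\rangle\{t_\beta : p_\beta\}$, if $(\alpha < \beta) \wedge (t_\alpha = t_\beta) \wedge (p_\alpha = p_\beta)$. (2.1)

Sequence extension: $\langle\alpha\beta\rangle\{t_\beta : p_\beta\}$, if $(t_\alpha = t_\beta) \wedge (p_\alpha < p_\beta)$. (2.2)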

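Definition 3 (extending a sequence from a k-sequence). Let $\alpha$ and $\beta$ be two frequent $k$-sequences ($k > 1$), $u = sub_{k,k}(\alpha)$, and $v = sub_{k,k}(\beta)$. $\{t_\alpha : p_\alpha\}$ and $\{t_\beta : p_\beta\}$ are the transactions and positions of sequences $\alpha$ and $\beta$, respectively. There are two forms of sequence extension.

Itemset extension: $\alpha +_i \beta = sub_{1,k-1}(\alpha)(uv)\{t_\beta : p_\beta\}$, if $(u < v) \wedge (t_\alpha = t_\beta) \wedge (p_\alpha = p_\beta) \wedge (sub_{1,k-1}(\alpha) = sub_{1,k-1}(\beta))$. (3.1)

Sequence extension: $\alpha +_s \beta = \alpha v\{t_\beta : p_\beta\}$, if $(t_\alpha = t_\beta) \wedge (p_\alpha < p_\beta) \wedge (sub_{1,k-1}(\alpha) = sub_{1,k-1}(\beta))$. (3.2)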

Definition 4. Let $S = e_1 e_2 \cdots e_n$. An item $e'$ can be added to a pattern extension of $S$ in one of three positions:

$S' = e_1 e_2 \cdots e_n e' \wedge (sup_D(S') = sup_D(S))$ (4.1)

$\exists i$ ($1 \le i < n$) such that $S' = e_1 e_2 \cdots e_i e' \cdots e_n \wedge (sup_D(S') = sup_D(S))$ (4.2)

$S' = e' e_1 e_2 \cdots e_n \wedge (sup_D(S') = sup_D(S))$ (4.3)

In (4.1), item $e'$ appears after $e_n$, so item $e'$ is called a forward-extension and $S'$ is called a forward-extension sequence. For example, sequence $AC: 4$ is a forward-extension of sequence $A: 4$ because sequence $C$ is extended after sequence $A$ and their support is 4. In (4.2) and (4.3), item $e'$ appears before $e_n$, so item $e'$ is called a backward-extension and $S'$ is called a backward-extension sequence.

For example, sequence $CAC: 2$ is a backward-extension of sequence $CC: 2$ because sequence $A$ is extended in the middle of sequence $CC$ and their support is 2.
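As a small illustration of Definition 4, the sketch below classifies an extension as forward or backward. It is a simplification made for this example only: sequences are restricted to single-item events written as plain strings, the supports are supplied by the caller, and the function name `extension_kind` is hypothetical.

```python
def extension_kind(s, sup_s, s_ext, sup_ext):
    """Classify s_ext as a forward- or backward-extension of s, or None.

    Assumes every event is a single item, so a sequence is a plain string.
    """
    # Definition 4 requires equal support and exactly one added item.
    if sup_s != sup_ext or len(s_ext) != len(s) + 1:
        return None
    if s_ext[:-1] == s:
        return "forward"       # case (4.1): e' placed after e_n
    for i in range(len(s)):
        if s_ext[:i] + s_ext[i + 1:] == s:
            return "backward"  # cases (4.2) and (4.3): e' placed before e_n
    return None


print(extension_kind("A", 4, "AC", 4))    # forward
print(extension_kind("CC", 2, "CAC", 2))  # backward
print(extension_kind("C", 4, "CA", 3))    # None, the supports differ
```

When the added item could be read as either case (for example, extending CC to CCC), this sketch reports the forward case first; that tie-break is a choice of the example rather than something stated in the definition.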

Definition 5. Let $S = e_1 e_2 \cdots e_n$. The starting position of sequence $S$ is the position of the first appearance of itemset $e_1$. For example, in the sequence $AB(ABC)CB$, the starting position of sequence $(ABC)$ is 3, and that of sequence $ABB$ is 1.

3 Related work

Mining frequent sequences was first proposed in 1995 by Agrawal and Srikant with their AprioriAll algorithm, which is based on the Apriori property. Agrawal and Srikant then generalized the mining problem with the GSP algorithm (Srikant and Agrawal, 1996). Since then, many frequent sequence mining algorithms have been proposed to improve mining efficiency. The algorithms use various approaches for organizing data and storing mined information. Typical algorithms include SPADE (Zaki, 2001), PrefixSpan (Pei et al., 2001), SPAM (Ayres et al., 2002), and LAPIN-SPAM (Yang and Kitsuregawa, 2005). The SPAM algorithm organizes data in a vertical bitmap format and uses a dictionary tree structure to store mined information. PrefixSpan uses database projection for sequence extension to reduce the search space, with the data presented horizontally. The LAPIN-SPAM algorithm uses a list to store the final positions of items and a set of boundary positions of the prefix to reduce the scope of the search space.

Various algorithms have been proposed for mining non-redundant frequent sequences to reduce the required storage space and runtime for mining rules. Frequent closed sequence mining and frequent closed itemset mining algorithms include A-CLOSE (Pasquier et al., 1999), CLOSET (Pei et al., 2000), CHARM (Zaki and Hsiao, 2002), and CLOSET+ (Wang et al., 2003). Most of these algorithms maintain mined frequent itemsets in order to test frequent closed sequences, which requires a lot of memory. CLOSET+ uses a two-level hash-index structure and a tree structure for storing the itemsets to reduce memory space and the time required for testing closed itemsets. CloSpan (Yan et al., 2003) uses a maintain-and-test pattern method and combines a hash-index structure with a tree structure for storing sequences. This algorithm prunes patterns

Table 1

Example sequence database D.


using techniques such as Common Prefix and Backward Sub-Pattern to reduce the search space. The ClaSP algorithm (Gomariz et al., 2013) uses a vertical database format strategy, as done by the SPADE algorithm, and a heuristic to prune non-closed sequences, as done by the CloSpan algorithm. However, the algorithm maintains previous candidates to test the closure of sequences and removes them later. The maintenance of candidates increases memory consumption, and the number of test candidates increases with the number of generated frequent closed sequences.

In order to overcome these problems, the BIDE algorithm (Wang et al., 2007) does not keep track of historical frequent closed sequences for checking the closure of new patterns. Instead, it uses bi-directional extension techniques to examine frequent closed patterns as candidates before extending a sequence. Moreover, the algorithm uses a BackScan process to determine candidates that cannot be extended, which reduces mining time. The algorithm uses pseudo projection techniques to reduce database storage space and is efficient for low support thresholds. However, in the process of mining, it has to project and scan databases many times for each prefix, making it inefficient.

4 Proposed algorithm

This section describes the proposed CloFS-DBV algorithm, which uses a dynamic bit vector (DBV) structure combined with the location information of transactions, stored in the CloFS-DBVPattern structure, to mine frequent closed sequences.

4.1 DBV data structure

Sequence mining algorithms based on a vertical data format have proven to be more efficient than those based on a horizontal data format. Typical algorithms that use a vertical format include SPADE (Zaki, 2001), DISC-all (Chiu et al., 2004), HVSM (Song et al., 2005), and MSGPs (Pham et al., 2012). These algorithms scan the database only once and calculate the support of a sequence quickly. However, their disadvantage is that they consume much more memory to store additional information. BitTableFI (Dong and Han, 2007) and Index-BitTableFI (Song et al., 2008) have solved this problem by compressing data using a bit table (BitTable).

The main drawback of the bit vector structure is its fixed size, which depends on the number of transactions in the sequence database. A '1' bit indicates that the item appears in the transaction and a '0' bit indicates otherwise. In practice, a bit vector usually contains many '0' bits because items tend to appear sparsely and irregularly across the sequence database. In addition, the sequence extension process (which uses bitwise AND) produces even more '0' bits, which increases the required memory and processing time. In order to overcome this problem, the dynamic bit vector architecture is used (Vo et al., 2012). Let $A$ and $B$ be two bit vectors, and let $p_1$ and $p_2$ be the probabilities of '1' bits in $A$ and $B$, respectively. Let $k$ be the probability of '0' bits after joining $A$ and $B$ to get $AB$ in the sequence extension process. The probability of '1' bits in the bit vector $AB$ is then $\min(p_1, p_2) - k$, where $\min(p_1, p_2)$ is the minimum value of $p_1$ and $p_2$. Obviously, the probability of '1' bits in $AB$ decreases whereas that of '0' bits increases. Moreover, the gap between $p_1$ and $p_2$ grows quickly after several sequence extensions.

Suppose there are 16 transactions in a sequence database. An item $i$ exists in transactions 7, 9, 10, 11, and 13. The bit vector for the item $i$ needs 16 bytes, as shown in Table 2. The first non-zero byte appears at index 7. The DBV only stores the starting index and the sequence of bytes from the first non-zero byte to the last non-zero byte, as shown in Table 3. Only 8 bytes are required to store the information using the DBV structure.

Each DBV consists of two parts: (1) Start bit: the position of the first appearance of '1' and (2) Bit vector: the sequence of bits from the first non-zero byte to the last non-zero byte. The DBV structure is used to store transactions in a vertical format. Sequence supports can easily be calculated by counting the number of '1' bits.

Example 2. Consider database D in Table 1. Sequence A exists in transactions 1, 2, 3, and 4, so the start bit is 1, and the bit vector is 1111. The bit vector has four '1' bits, so the support of sequence A is 4. Sequence B exists in transactions 2, 3, and 4, so the start bit is 2, and thus the bit vector is 111. The bit vector has three '1' bits, so the support of sequence B is 3. Table 4 shows the conversion of database D in Table 1 to DBV format.
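The DBV layout described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the class name `DBV` and its methods are hypothetical, one cell is kept per transaction (following the counting used in the 16-transaction example), and the cells are held in a tuple instead of packed machine words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DBV:
    start: int   # 1-based index of the first transaction whose bit is '1' (0 if empty)
    bits: tuple  # cells from the first '1' to the last '1', inclusive

    @staticmethod
    def from_tids(tids):
        """Build a DBV from the IDs of the transactions that contain the sequence."""
        tids = sorted(set(tids))
        if not tids:
            return DBV(0, ())
        present = set(tids)
        cells = tuple(int(t in present) for t in range(tids[0], tids[-1] + 1))
        return DBV(tids[0], cells)

    def support(self):
        """The support is the number of '1' cells."""
        return sum(self.bits)

    def __and__(self, other):
        """Intersect two DBVs over their overlapping range, then re-trim."""
        lo = max(self.start, other.start)
        hi = min(self.start + len(self.bits), other.start + len(other.bits))
        common = [t for t in range(lo, hi)
                  if self.bits[t - self.start] and other.bits[t - other.start]]
        return DBV.from_tids(common)


# The 16-transaction example: the item occurs in transactions 7, 9, 10, 11, 13.
item_i = DBV.from_tids([7, 9, 10, 11, 13])
print(item_i.start, item_i.bits)  # 7 (1, 0, 1, 1, 1, 0, 1)
print(1 + len(item_i.bits))       # 8 cells (start index + 7) instead of 16

# Example 2: sequences A and B.
a = DBV.from_tids([1, 2, 3, 4])
b = DBV.from_tids([2, 3, 4])
print(a.support(), b.support())   # 4 3
print(a & b)                      # DBV(start=2, bits=(1, 1, 1))
```

The intersection gives only the transactions that contain both sequences; whether an extension is actually supported in each of them still depends on the position information described in Section 4.2.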

4.2 CloFS-DBVPattern data structure

The CloFS-DBVPattern structure combines a DBV structure with a representation of sequence information. Each CloFS-DBVPattern consists of two parts: (1) Sequence: sequence information and (2) BlockInfo: a DBV and a list of the positions at which the sequence appears in each transaction. The list of positions for each transaction is represented in the form startPos: {list positions}, where startPos is the first appearance of the sequence in each transaction.

Example 3. In database D (Table 1), sequence A exists in transactions 1, 2, 3, and 4. For the first transaction, sequence A appears at positions {2, 3, 4}. The starting position is 2, and thus 2: {2, 3, 4} is stored. For the second transaction, sequence A appears at positions {1, 3}. The starting position is 1, and thus 1: {1, 3} is stored. For the third transaction, sequence A appears at positions {1, 3}. The starting position is 1, and thus 1: {1, 3} is stored. Similarly, for the last transaction, sequence A appears at positions {1, 4} and the starting

Table 2

Example of 16-byte bit vector.

Table 3 Conversion of bit vector in Table 2 to DBV.

Table 4 Conversion of database D in Table 1 to DBV format.

Table 5 CloFS-DBVPattern for sequence A in Table 1


position is 1, thus 1: {1, 4} is stored. Table 5 presents the CloFS-DBVPattern for sequence A in Table 1.
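A CloFS-DBVPattern for sequence A can be assembled from the occurrences listed in Example 3. The sketch below is illustrative only: the class and field names (`BlockInfo`, `CloFSDBVPattern`, `build_pattern`) are assumptions chosen to mirror the description, not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class BlockInfo:
    start_bit: int   # position of the first '1' in the DBV
    bits: list       # DBV cells from the first '1' to the last '1'
    positions: list  # one (startPos, [positions]) entry per '1' cell


@dataclass
class CloFSDBVPattern:
    sequence: str
    block: BlockInfo


def build_pattern(sequence, occurrences):
    """occurrences maps a transaction ID to the positions of the sequence in it."""
    tids = sorted(t for t, pos in occurrences.items() if pos)
    if not tids:
        return CloFSDBVPattern(sequence, BlockInfo(0, [], []))
    bits = [1 if occurrences.get(t) else 0 for t in range(tids[0], tids[-1] + 1)]
    # startPos is the first appearance of the sequence in the transaction.
    positions = [(min(occurrences[t]), sorted(occurrences[t])) for t in tids]
    return CloFSDBVPattern(sequence, BlockInfo(tids[0], bits, positions))


# Occurrences of sequence A as listed in Example 3.
pattern_a = build_pattern("A", {1: [2, 3, 4], 2: [1, 3], 3: [1, 3], 4: [1, 4]})
print(pattern_a.block.start_bit, pattern_a.block.bits)  # 1 [1, 1, 1, 1]
for start_pos, pos in pattern_a.block.positions:
    print(f"{start_pos}: {pos}")
```

Keeping startPos next to the full position list means a check that only needs the first occurrence in a transaction can read a single value per transaction.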

The CloFS-DBV tree is used to store CloFS-DBVPatterns. The CloFS-DBV tree is an extension of the prefix tree. The prefix tree can be constructed in the following way. The root node of the tree is at the top level and labeled NULL. Recursively, each node $X$ at level $k$ in the tree can be extended by adding one item to get a child node $X'$ at level $k + 1$. The children of node $X$ are generated and arranged in lexicographical order. By using the prefix tree, the generation of sequence rules becomes more efficient. Typical algorithms for building a prefix tree include CloGen (Pham et al., 2013), IMSR_PreTree (Van et al., 2014), and MNSR_PreTree (Pham et al., 2014). In the CloFS-DBV tree, each node is a CloFS-DBVPattern: a sequence, a DBV, and a list of positions of the sequence in each transaction. Each node in the tree is extended in two forms: sequence extension and itemset extension. Fig. 1 shows candidates for the database in Table 1 obtained using the CloFS-DBV algorithm.

4.3 CloFS-DBV algorithm

Proposition 1 (checking sequence closure). If there exists a sequence $S_b$ that is a forward-extension or backward-extension of sequence $S_a$, sequence $S_a$ is not closed, and $S_a$ can be safely absorbed by $S_b$. Considering the above example, suppose that $S_a = CC: 2$ and $S_b = CAC: 2$. Then, $CC: 2$ will be absorbed by $CAC$ because $CC \subseteq CAC$ and $sup_D(CC) = sup_D(CAC) = 2$.

Proposition 2 (pruning a prefix). Consider a prefix $S_p = e_1 e_2 \cdots e_n$. If there exists an item $e$ before the starting position of prefix $S_p$ in each of the transactions containing $S_p$ in sequence database $D$, the extension of prefix $S_p$ can be pruned. For example, consider the database D in Table 1. There is no need to extend prefix B because there exists a sequence A that occurs before B in each transaction that contains prefix B. If we extend prefix B, the results obtained will be absorbed because the extension of prefix A already contains B and has the same support.

The CloFS-DBV algorithm consists of four main phases: (1) conversion of the sequence database to the CloFS-DBVPattern structure, (2) examination of the closure of frequent sequences, (3) early pruning of prefix sequences, and (4) extension of sequences. Since CloFS-DBV uses the CloFS-DBVPattern structure, it can check the backward-extension and forward-extension quickly. For each transaction, CloFS-DBV considers only the start position or the last position of the sequence. Therefore, if the sequence occurs in $N$ transactions, CloFS-DBV takes only $N$ operations to check each candidate. In contrast, the BIDE algorithm, which is more efficient than CloSpan in almost all cases (Wang et al., 2007), uses a local database to check backward-extension and a projected local database to check forward-extension, i.e., it has to scan each item of each transaction in this database. Let $k$ be the sequence length and $N$ be the number of transactions containing the sequence. Thus, BIDE requires $k \times N$ operations to check each candidate.
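The pruning test of Proposition 2 can be sketched as follows, reading the proposition literally: a prefix is pruned when some item occurs before its starting position in every transaction that contains it. The toy database below is hypothetical (it is not Table 1), and the helper names `items_before` and `can_prune_prefix` are illustrative.

```python
def items_before(sequence, start_pos):
    """Items that occur strictly before the 1-based position start_pos."""
    return set().union(*sequence[:start_pos - 1])


def can_prune_prefix(database, start_positions):
    """True if some item precedes the prefix in every transaction containing it.

    start_positions maps a transaction ID to the prefix's starting position.
    One pass over the N containing transactions is made.
    """
    common = None
    for tid, start in start_positions.items():
        before = items_before(database[tid], start)
        common = before if common is None else common & before
        if not common:
            return False
    return bool(common)


# Hypothetical toy database: transaction ID -> ordered list of itemsets.
toy_db = {
    1: [{"A"}, {"B"}, {"C"}],
    2: [{"A"}, {"C"}, {"B"}],
    3: [{"A", "D"}, {"B"}],
}

# Prefix B starts at positions 2, 3, and 2; item A precedes it everywhere.
print(can_prune_prefix(toy_db, {1: 2, 2: 3, 3: 2}))  # True
# Prefix A starts at position 1 in every transaction; nothing precedes it.
print(can_prune_prefix(toy_db, {1: 1, 2: 1, 3: 1}))  # False
```

Here the items before the starting position are collected by slicing the transaction; how the authors obtain them from the CloFS-DBVPattern structure is not detailed in the text, so that part is an assumption of the sketch.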

Fig. 1. CloFS-DBV tree for the database in Table 1. Each node lists a sequence, its DBV as (start bit, bit vector), and one startPos:{positions} entry per transaction; edges are labeled s (sequence extension) or i (itemset extension). Shaded rectangles represent candidates that are not closed. Unshaded rectangles represent frequent closed sequences.

Table 6

CloFS-DBV algorithm.

Method: CloFS-DBV (D, minSup)

Input: A sequence database D and a support threshold minSup

Output: A complete set of frequent closed sequences FCS

1 Let FCS.root = NULL;

2 Let fcs1 = {CloFS-DBVPattern(i) | i ∈ I ∧ sup(i) ≥ minSup};

3 Sort (fcs1) in increasing order by item i;

4 Add fcs1 as child nodes of FCS.root;

5 For (each child node subNode in FCS.root) do

6 Call DBV-Pattern-Extension (subNode, minSup);

7 End For


Table 6 shows the pseudo code of the proposed CloFS-DBV algorithm. The algorithm first scans database D to find frequent 1-sequences and stores them in fcs1 as CloFS-DBVPatterns (line 2). Then, the items in fcs1 are sorted in ascending order (line 3) to reduce the number of steps in the itemset extension phase. On line 6, the algorithm performs the sequence extension for each child node of FCS.root.

Table 7 shows the DBV-Pattern-Extension algorithm called by the CloFS-DBV algorithm. The extension takes two forms: sequence extension (line 5) and itemset extension (line 8). Before sequence extension, the algorithm tests and eliminates prefixes that cannot be extended into frequent closed sequences using Proposition 2 (line 3). The process executes recursively (line 12) until no frequent closed sequences are generated. Line 14 uses Proposition 1 to check the prefix Sp. If Sp is not a frequent closed sequence, it is set to NULL.

Example 4. This example demonstrates sequence extension for the CloFS-DBV algorithm with sequence database D in Table 1 and

Table 7

DBV-Pattern-Extension algorithm.

Method: DBV-Pattern-Extension (root, minSup)

Input: A root of prefix tree root and a minSup

Output: A set of frequent closed sequences root

1 Let list_node = child node of root;

2 For (each Sp in list_node) do

3 If (Sp is not pruned) then

4 For (each Sa in list_node) do

5 If (sup (Let Spa = Sequence-Extension (Sp, Sa)) ≥ minSup) then

8 If (sup (Let Spa = Itemset-Extension (Sp, Sa)) ≥ minSup) then

12 Call DBV-Pattern-Extension (Sp, minSup)

14 If (Sp is not a frequent closed sequence) then

Table 8

Sequences A, B, and C in the sample database after conversion to CloFS-DBVPattern

Table 9

Example of (a) sequence extension and (b) itemset extension for prefix A.


$minSup = 2$ (50%). After line 2 (Table 6) is executed, three frequent 1-sequences are stored, i.e., fcs1 = {A: 4, B: 3, C: 4} (Table 8). In this example, prefix A is not a closed sequence after the backward-extension process and prefix B can be pruned after the prefix pruning process. The algorithm performs sequence extension to create new frequent closed 2-sequences. Starting with prefix A, the extension proceeds with sequences A, B, and C in the forms of sequence extension (Table 9a) and itemset extension (Table 9b). Positions 3 and 4 of itemset (AB) are empty, so the bits corresponding to those positions are set to '0' and this itemset is removed ($sup_D((AB)) = 1 < minSup$). The extension process continues until no candidate is generated. The results obtained are shown in Fig. 1.

5 Experimental results

Experiments were performed to evaluate the proposed algorithm. All algorithms were implemented on a personal computer with an Intel Core Duo 2.0-GHz CPU and 4 GB of RAM running Windows 8.1. The BIDE and CloSpan algorithms, two well-known state-of-the-art methods, were used for comparison. The databases used for comparison were generated using the IBM synthetic data generator. The definitions of the parameters used to generate the databases are shown in Table 10.

The comparisons of runtime and memory usage were performed on three databases: C6T5S4I4N1kD10k, T10I4D100k, and

Table 10

Definitions of parameters for standard databases from IBM.

Fig. 2. Comparison of runtime for various minSup values for (a) C6T5S4I4N1kD10k, (b) T10I4D100k, and (c) N10kD10k.

Fig. 3. Comparison of memory usage for various minSup values for (a) C6T5S4I4N1kD10k, (b) T10I4D100k, and (c) N10kD10k.


N10kD10k. First, experiments were conducted to compare the execution time of the three algorithms. The results are shown in Fig. 2. Fig. 2a shows the runtimes for minSup values of 6% to 10% for the C6T5S4I4N1kD10k database, Fig. 2b shows those for minSup values of 3.5% to 5.5% for the T10I4D100k database, and Fig. 2c shows those for minSup values of 5% to 9% for the N10kD10k database. As minSup decreases, more frequent sequences are obtained, so the number of closure checks also increases and the execution time of the algorithms increases quickly.

Fig. 2 shows that the execution time of the three algorithms increases with decreasing minSup, with CloFS-DBV being the fastest in all cases. For example, Fig. 2a shows the mining time of CloFS-DBV, CloSpan, and BIDE. With minSup = 6%, the mining time of BIDE is 7432 ms, that of CloSpan is 7825 ms, and that of CloFS-DBV is 5040 ms. Most of the execution time of CloFS-DBV is spent in the first stage of the mining process, in which CloFS-DBV scans the sequence database to construct a CloFS-DBVPattern structure for each item. After that stage, the algorithm takes very little time because its operations are mostly bit manipulations.

Next, experiments were conducted to compare the total memory usage (in MB) of the three algorithms. Fig. 3 shows the memory usage of the three algorithms for various minSup values. Fig. 3a–c shows the memory usage for the C6T5S4I4N1kD10k, T10I4D100k, and N10kD10k databases, respectively. With decreasing minSup, the number of generated candidates and the required memory increase for the three algorithms. CloFS-DBV requires less storage space than BIDE or CloSpan due to its use of a compressed data structure. For example, Fig. 3c shows the total memory usage of CloFS-DBV, CloSpan, and BIDE for the N10kD10k database. With minSup = 5%, the total memory usage of BIDE is 29.5 MB, that of CloSpan is 60.2 MB, and that of CloFS-DBV is 7.37 MB. The total memory usage of CloFS-DBV is lower than that of BIDE or CloSpan because CloFS-DBV uses the DBV data structure and stores only the information needed in the mining process. Unlike CloSpan and BIDE, CloFS-DBV uses neither a hash table nor database projection. Moreover, sequence extension, support counting, and the other operations of CloFS-DBV are mainly based on bit manipulation, so the algorithm consumes less memory during mining.

6 Conclusion and future work

This paper proposed the CloFS-DBV algorithm, which uses DBVs and transaction information to mine frequent closed sequences. The CloFS-DBV algorithm is divided into two main stages: (1) the original sequence database is transformed into a vertical data format called CloFS-DBVPattern, in which each pattern stores the positions of the frequent closed sequences that appear in the database; (2) frequent closed sequences are generated and tested, and prefixes are pruned early. The CloFS-DBV algorithm scans the database only once and calculates the supports based on the DBV to generate new patterns. Due to its use of a compressed structure, the CloFS-DBV algorithm is more efficient than the BIDE and CloSpan algorithms in terms of memory usage and runtime.

The CloFS-DBV algorithm has a few limitations that will be addressed in the future. Frequent closed inter-sequences will be mined to reduce the number of redundant patterns. Based on mining frequent closed inter-sequences, the generation of rules will be made more compact and efficient. In addition, mining maximal frequent sequences has been proposed in recent years (Guan et al., 2005; García-Hernández et al., 2006; Lin et al., 2007; Fournier-Viger et al., 2013). The DBV data structure will be applied for the efficient mining of such sequences.

Acknowledgment

This work was funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant no. 102.05-2013.20.

References

Agrawal, R., Srikant, R., 1995. Mining sequential patterns. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 3–14.

Ayres, J., Gehrke, J., Yiu, T., Flannick, J., 2002. Sequential pattern mining using a bitmap representation. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, pp. 429–435.

Chiu, D.Y., Wu, Y.H., Chen, A.L.P., 2004. An efficient algorithm for mining frequent sequences by a new strategy without support counting. In: Proceedings of the 20th International Conference on Data Engineering, pp. 375–386.

Dong, J., Han, M., 2007. BitTableFI: an efficient mining frequent itemsets algorithm. Knowl.-Based Syst. 20 (4), 329–335.

Fournier-Viger, P., Wu, C.W., Tseng, V.S., 2013. Mining maximal sequential patterns without candidate maintenance. In: Advanced Data Mining and Applications, Lecture Notes in Computer Science, vol. 8346, pp. 169–180.

Gomariz, A., Campos, M., Marin, R., Goethals, B., 2013. ClaSP: an efficient algorithm for mining frequent closed sequences. In: Advances in Knowledge Discovery and Data Mining, LNAI, vol. 7818, pp. 50–61.

Guan, E.Z., Chang, X.Y., Wang, Z., Zhou, C.G., 2005. Mining maximal sequential patterns. In: Proceedings of the Second International Conference on Neural Networks and Brain, pp. 525–528.

García-Hernández, R.A., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., 2006. A new algorithm for fast discovery of maximal sequential patterns in a document collection. In: Computer Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, vol. 3878, pp. 514–523.

Lin, N.P., Hao, W.H., Chen, H.J., Chueh, H.E., Chang, C.I., 2007. Fast mining maximal sequential patterns. In: Proceedings of the 7th International Conference on Simulation, Modeling and Optimization, September 15–17. Beijing, China, pp. 405–408.

Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L., 1999. Discovering frequent closed itemsets for association rules. In: Proceedings of the International Conference on Database Theory (ICDT’99), pp. 398–416.

Pei, J., Han, J., Mao, R., 2000. CLOSET: an efficient algorithm for mining frequent closed itemsets. In: Proceedings of the ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD’00), pp. 21–30.

Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.-C., 2001. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the International Conference on Data Engineering, pp. 215–224.

Pham, T.T., Luo, J., Hong, T.P., Vo, B., 2012. MSGPs: a novel algorithm for mining sequential generator patterns. In: Computational Collective Intelligence, Technologies and Applications, Lecture Notes in Computer Science, vol. 7654, pp. 393–401.

Pham, T.T., Luo, J., Vo, B., 2013. An effective algorithm for mining closed sequential patterns and their minimal generators based on prefix trees. Int. J. Intell. Inf. Database Syst. 7 (4), 324–339.

Pham, T.T., Luo, J., Hong, T.P., Vo, B., 2014. An efficient method for mining non-redundant sequential rules using attributed prefix-trees. Eng. Appl. Artif. Intell. 32, 88–99.

Srikant, R., Agrawal, R., 1996. Mining sequential patterns: Generalizations and performance improvements. In: Proceedings of the International Conference on Extending Database Technology, pp. 3–17.

Song, W., Yang, B., Xu, Z., 2008. Index-BitTableFI: an improved algorithm for mining frequent itemsets. Knowl.-Based Syst. 21 (6), 507–513.

Song, S., Hu, H., Jin, S., 2005. HVSM: a new sequential pattern mining algorithm using bitmap representation. In: Proceedings of the Advanced Data Mining and Applications, pp. 455–463.

Vo, B., Hong, T.P., Le, B., 2012. DBV-Miner: a dynamic bit-vector approach for fast mining frequent itemsets. Exp. Syst. Appl. 39 (8), 7196–7206.

Van, T.T., Vo, B., Le, B., 2014. IMSR_PreTree: an improved algorithm for mining sequential rules based on the prefix-tree. Vietnam J. Comput. Sci. 1 (2), 97–105.

Wang, J., Han, J., Pei, J., 2003. CLOSET+: searching for the best strategies for mining frequent closed itemsets. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’03), pp. 236–245.

Wang, J., Han, J., Li, C., 2007. Frequent closed sequence mining without candidate maintenance. IEEE Trans. Knowl. Data Eng. 19 (8), 1042–1056.

Yan, X., Han, J., Afshar, R., 2003. CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of the SIAM International Conference on Data Mining, pp. 166–177.

Yang, Z., Kitsuregawa, M., 2005. LAPIN-SPAM: an improved algorithm for mining sequential pattern. In: Proceedings of the ICDE Workshops 2005, p. 1222.

Zaki, M., 2001. SPADE: an efficient algorithm for mining frequent sequences. Mach. Learn. 42 (1–2), 31–60.

Zaki, M., Hsiao, C., 2002. CHARM: an efficient algorithm for closed itemset mining. In: Proceedings of the SIAM International Conference on Data Mining (SDM’02), pp. 457–473.
