DOI 10.1007/s10489-016-0765-3
Mining non-redundant sequential rules with dynamic
bit vectors and pruning techniques
Minh-Thai Tran 1 · Bac Le 2 · Bay Vo 3,4 · Tzung-Pei Hong 5,6
© Springer Science+Business Media New York 2016
Abstract Most algorithms for mining sequential rules focus on generating all sequential rules. These algorithms produce an enormous number of redundant rules, making mining inefficient in intelligent systems. In order to solve this problem, the mining of non-redundant sequential rules was recently introduced. Most algorithms for mining such rules depend on patterns obtained from existing frequent sequence mining algorithms. Several steps are required to
Bay Vo
bayvodinh@gmail.com
vodinhbay@tdt.edu.vn
Minh-Thai Tran
minhthai@huflit.edu.vn
Bac Le
lhbac@fit.hcmus.edu.vn
Tzung-Pei Hong
tphong@nuk.edu.tw
1 Faculty of Information Technology, University of Foreign
Languages - Information Technology, Ho Chi Minh, Vietnam
2 Department of Computer Science, University of Science,
VNU-HCM, Vietnam
3 Division of Data Science, Ton Duc Thang University,
Ho Chi Minh, Vietnam
4 Faculty of Information Technology, Ton Duc Thang
University, Ho Chi Minh, Vietnam
5 Department of CSIE, National University of Kaohsiung,
Kaohsiung, Taiwan, Republic of China
6 Department of Computer Science and Engineering,
National Sun Yat-sen University, Kaohsiung,
Taiwan, Republic of China
organize the data structure of these sequences before rules can be generated. This process requires a great deal of time and memory. The present study proposes a technique for mining non-redundant sequential rules directly from sequence databases. The proposed method uses a dynamic bit vector data structure and adopts a prefix tree in the mining process. In addition, some pruning techniques are used to remove unpromising candidates early in the mining process. Experimental results show the efficiency of the algorithm in terms of runtime and memory usage.
Keywords Data mining · Dynamic bit vector · Non-redundant rule · Sequential rule
1 Introduction
The goal of sequential rule mining is to find the relationships between occurrences of sequential items in sequence databases. A sequential rule is expressed in the form X → Y; i.e., if X occurs in a sequence of a database, then Y also occurs in that sequence following X with high confidence. In general, the mining process is divided into two main phases: (1) mining of frequent sequences and (2) generation of sequential rules based on those sequences.
Since Agrawal and Srikant proposed the AprioriAll algorithm [1], the mining of frequent sequences has been widely studied. The mining of frequent sequences is a necessary step before the generation of sequential rules, and thus researchers mainly focus on improving the efficiency of this step. Several algorithms that use different strategies for the organization of data, data structures, and mining techniques have been proposed. Algorithms for mining sequential rules include Full [15] and MSR_PreTree [18].
With a large sequence database, the number of frequent sequences is very large, which affects the efficiency of mining sequential rules. Some scholars have thus attempted to remove sequences that do not affect the final mining results in order to make the sequences compact. Examples include the mining of frequent closed sequences and the mining of non-redundant sequential rules. Typical algorithms for mining frequent closed sequences are CloSpan [21], BIDE [20] and CloGen [11]. Efficient algorithms for mining non-redundant sequential rules include CNR [8] and MNSR_PreTree [12]. However, these algorithms generate sequential rules based on the results of existing frequent sequence mining algorithms. Thus, they depend entirely on the data structure of the mined frequent sequences. Some algorithms build a prefix tree of frequent sequences before generating sequential rules.
The present study proposes the algorithm NRD-DBV for mining non-redundant sequential rules based on dynamic bit vectors and pruning techniques. It adopts a prefix tree and uses a dynamic bit vector structure to compress the data. The algorithm uses a depth-first search order with prefix pruning in order to traverse the search space efficiently. The pruning techniques reduce the required storage and execution time for mining non-redundant sequential rules directly from sequence databases.
The rest of this paper is organized as follows. Section 2 defines the problem. Section 3 summarizes some related work. Section 4 presents the proposed algorithm. Section 5 shows the experimental results. Conclusions and suggestions for future work are given in Section 6.
2 Problem definitions
Consider a sequence database with a set I of distinct events, where I = {i1, i2, i3, ..., in}, ij is an event (or an item), and 1 ≤ j ≤ n. A set of unordered events is called an itemset. Each itemset is written in brackets. For example, (ABC) represents an itemset with three items, namely A, B, and C. The brackets are omitted to simplify the notation for itemsets with only a single item. For example, the notation B represents an itemset containing only item B. A sequence S = ⟨e1, e2, e3, ..., em⟩ is an ordered list of itemsets, where ej is an itemset and 1 ≤ j ≤ m. The size of a sequence is the number m of itemsets in the sequence. The length of a sequence is the number of items in the sequence. A sequence with length k is called a k-sequence.
Definition 1 (Subsequence and supersequence) Let Sa = ⟨a1, a2, ..., am⟩ and Sb = ⟨b1, b2, ..., bn⟩ be two sequences. The sequence Sa is a subsequence of Sb if there are m integers i1 to im with 1 ≤ i1 < i2 < ... < im ≤ n such that a1 ⊆ bi1, a2 ⊆ bi2, ..., am ⊆ bim. In this case, Sb is also called a supersequence of Sa, and the relation is denoted Sa ⊆ Sb.
Definition 2 (Sequence database) A sequence database D is a set of sequences D = {s1, s2, s3, ..., s|D|}, where |D| is the number of sequences in D and si (1 ≤ i ≤ |D|) is the i-th sequence in D. For example, the database D in Table 1 includes five sequences, i.e., |D| = 5.
Definition 3 (Support of a sequence) The support of a sequence Sa in a sequence database D is the number of sequences with at least one occurrence of Sa in D divided by |D|, and is denoted sup(Sa). A sequence Sa with support sup(Sa) is written in the form Sa: sup(Sa) to simplify the notation. For example, in Table 1, the sequence (AC) appears in three sequences; thus the support of (AC) is 60 %, denoted (AC): 60 %.
Definition 4 (Frequent sequence) Given a minimum support threshold minSup, a sequence Sa is called a frequent sequence in D if sup(Sa) ≥ minSup. The problem of mining frequent sequences is to find the complete set of frequent subsequences for an input sequence database D and a given minimum support threshold minSup.
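As a minimal illustration of Definitions 1–4, the sketch below checks containment and computes support over a small hypothetical database (the actual database of Table 1 is not reproduced here, so the five sequences used are assumptions for illustration only):

```python
def is_subsequence(sa, sb):
    """Definition 1: sa is contained in sb if sa's itemsets can be
    matched, in order, to supersets among sb's itemsets."""
    j = 0
    for itemset in sb:
        if j < len(sa) and sa[j] <= itemset:  # a_j is a subset of b_i
            j += 1
    return j == len(sa)

def support(seq, db):
    """Definition 3: fraction of data sequences containing seq."""
    return sum(is_subsequence(seq, s) for s in db) / len(db)

# Hypothetical database of five sequences (each a list of itemsets)
db = [
    [{'A'}, {'A', 'C'}],
    [{'A'}, {'C'}],
    [{'A'}, {'B', 'C'}, {'C'}],
    [{'A'}, {'B'}, {'B', 'C'}],
    [{'A'}, {'B'}, {'A'}],
]
print(support([{'A'}, {'C'}], db))  # 0.8, so <AC> is frequent for minSup <= 80 %
```

The greedy earliest-match strategy is sufficient here: matching each itemset of Sa to the first qualifying itemset of Sb never causes a false negative for this containment relation.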
Definition 5 (Frequent closed sequence) Let Sa and Sb be two frequent sequences. Sa is called a frequent closed sequence if there is no Sb such that Sa ⊆ Sb ∧ sup(Sa) = sup(Sb). Different from the problem of mining frequent sequences, the problem of mining frequent closed sequences is to find the complete set of frequent closed sequences for an input sequence database D and a given minimum support threshold minSup. Frequent closed sequences are more compact than general frequent sequences because a subsequence Sa that has the same support as its supersequence Sb is absorbed by Sb without affecting the mining results. For example, in Table 1, sequence A(BC) is absorbed by A(BC)C.
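The absorption test behind Definition 5 can be sketched as follows (the function names are illustrative; the supports are the 40 % values of the A(BC)/A(BC)C example):

```python
def is_subsequence(sa, sb):
    """Definition 1: itemsets of sa matched, in order, to supersets in sb."""
    j = 0
    for itemset in sb:
        if j < len(sa) and sa[j] <= itemset:
            j += 1
    return j == len(sa)

def absorbed(sa, sup_a, sb, sup_b):
    """Sa is absorbed (hence not closed) when Sb is a proper
    supersequence with the same support (Definition 5)."""
    return sa != sb and is_subsequence(sa, sb) and sup_a == sup_b

sa = [{'A'}, {'B', 'C'}]           # A(BC): 40 %
sb = [{'A'}, {'B', 'C'}, {'C'}]    # A(BC)C: 40 %
print(absorbed(sa, 0.4, sb, 0.4))  # True, so A(BC) can be discarded
```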
Table 1 Example sequence database
Definition 6 (Substring of a sequence) Let S be a sequence. A substring of S, denoted sub_i,j(S) (i ≤ j), is the segment from position i to position j of S. Its length is (j − i + 1). For example, sub_1,2(AA(AC)) is AA and sub_4,4(AA(AC)) is C.
Definition 7 (Concatenation) Let Sa and Sb be two sequences. The sequence Sa + Sb denotes the concatenation of Sa and Sb, formed by appending Sb after Sa. For example, AB + AC = ABAC.
Definition 8 (Sequential rule) A sequential rule r is denoted pre → post (sup, conf), where pre and post are frequent sequences, and sup and conf are the support and confidence values of r, respectively, with sup = sup(pre + post) and conf = sup(pre + post) / sup(pre).
Definition 9 (Frequent sequential rule and strong sequential rule) Given a minimum support threshold minSup and a minimum confidence threshold minConf, a rule whose support value is higher than or equal to minSup is considered a frequent sequential rule, and a rule whose confidence value is higher than or equal to minConf is a strong sequential rule.

For each frequent sequence f of size k, (k − 1) rules can possibly be generated. For example, if there is a frequent sequence A(BC)C, then the two possible rules are A → (BC)C and A(BC) → C.
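The (k − 1) candidate splits can be enumerated directly over the itemset boundaries of a frequent sequence; a small sketch (the function name is illustrative):

```python
def candidate_rules(seq):
    """Split a frequent sequence of size k at each of its k-1 itemset
    boundaries into (pre, post) candidates (Definitions 8 and 9)."""
    return [(seq[:i], seq[i:]) for i in range(1, len(seq))]

f = [{'A'}, {'B', 'C'}, {'C'}]   # the sequence A(BC)C, size 3
for pre, post in candidate_rules(f):
    print(pre, '->', post)       # A -> (BC)C, then A(BC) -> C
```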
Definition 10 (Rule inference and redundant rule) Let D be a sequence database, Si be the i-th sequence in D (1 ≤ i ≤ |D|), and r1 and r2 be two sequential rules. r1 infers r2 if and only if both of the following conditions hold: (1) ∀Si ∈ D, 1 ≤ i ≤ |D|, r1.pre + r1.post ⊆ Si ∧ r2.pre + r2.post ⊆ Si, and (2) sup(r1) = sup(r2) ∧ conf(r1) = conf(r2). A sequential rule is said to be redundant if it can be inferred by another rule.

For example, assume that the two rules r1: A → (BC)C and r2: A → C have the same support and confidence values. r2 is thus redundant since it can be inferred by r1.
Definition 11 (Prefixed generator) A frequent sequence P is considered a prefixed generator if there is no other P′ such that P′ ⊆ P ∧ sup(P′) = sup(P).
Definition 12 (Non-redundant rule) Based on Definitions 5 and 11, a rule r: pre → post is said to be non-redundant if pre + post is a frequent closed sequence and pre is a prefixed generator.

Given two minimum thresholds minSup and minConf, the goal of this study is to find the non-redundant sequential rules in a sequence database.
3 Related work
In order to mine sequential rules, frequent sequences need to
be mined before sequential rules can be generated The min-ing of frequent sequences was first proposed by Agrawal and Srikant with their AprioriAll algorithm [1], based on the downward closure property Agrawal and Srikant expanded this mining problem in a general way with the GSP rithm [16] Then, several frequent sequence mining algo-rithms have been proposed to improve mining efficiency These algorithms use various approaches for organizing data and storing mined information Most of them trans-form an original database into a vertical trans-format or use the database projection technique to reduce the search space and thus execution time Typical algorithms include SPADE [23], PrefixSpan [9], SPAM [2], LAPIN-SPAM [22], and CMAP [5]
The mining of frequent closed sequences has also been studied. Its runtime and required storage are relatively low due to the compact representation, and the original information of frequent sequences can be entirely recovered from frequent closed sequences. Frequent closed sequence mining and frequent closed itemset mining algorithms include CloSpan [21], BIDE [20], ClaSP [6], and CloFS-DBV [17]. These algorithms prune non-closed sequences using techniques such as the common prefix and backward sub-pattern methods. BIDE differs from the other algorithms in that it does not keep track of previously obtained frequent closed sequences for checking the closure of new patterns. Instead, it uses bi-directional extension techniques to examine candidate frequent closed sequences before extending a sequence. Moreover, the algorithm uses a back-scan process to identify candidates that cannot be extended, reducing mining time. The algorithm also uses the pseudo-projection technique to reduce database storage space, which is efficient for low support thresholds.
In addition to frequent sequence mining algorithms, many researchers have proposed sequential and non-redundant sequential rule mining algorithms. The latter have the advantage of producing complete and compact rule sets because only non-redundant rules are derived, reducing runtime and memory usage. For example, Spiliopoulou [15] proposed a method for generating a complete set of sequential rules from frequent sequences that removes redundant rules in a post-mining phase. Lo et al. [8] proposed the compressed non-redundant algorithm (CNR) for mining a compressed set of non-redundant sequential rules generated from two types of sequence set: LS-Closed and CS-Closed. The premise of a rule is a sequence in the LS-Closed set and the consequence is a sequence in the CS-Closed set. The generation of sequential rules is based on a prefix tree to increase efficiency. Some other typical algorithms include CloGen [11], MNSR_PreTree [12] and IMSR_PreTree [19].
Most sequential or non-redundant sequential rule mining algorithms, however, use a set of frequent sequences or frequent closed sequences mined by existing frequent sequence miners. A lot of time is required to transform or construct frequent sequences for generating sequential rules. An efficient method is proposed here to mine non-redundant sequential rules. It uses a compressed data structure in a vertical data format and some pruning techniques to mine frequent closed sequences and generate non-redundant sequential rules directly.
4 Proposed algorithm
This section describes the proposed NRD-DBV algorithm, which uses the DBVPattern structure to mine frequent closed sequences. A prefix tree is used for storing all frequent closed sequences. Based on the properties of the prefix tree, non-redundant sequential rules can be generated efficiently. Before the proposed approach is discussed, the DBVPattern data structure is first described.
Sequence mining algorithms based on a vertical data format have proven to be more efficient than those based on a horizontal data format. Typical algorithms that use a vertical data format include SPADE [23], DISC-all [3], HVSM [13], and MSGPs [10]. These algorithms scan the database only once and quickly calculate the support of a sequence. However, a disadvantage is that they require a great deal of memory to store additional information. BitTableFI [4] and Index-BitTableFI [14] address this problem by compressing data using a bit table (BitTable).
The main drawback of the bit vector structure is its fixed size, which depends on the number of transactions in the sequence database. A '1' indicates that a given item appears in the corresponding transaction and a '0' indicates otherwise. In practice, there are usually many '0' bits in a bit vector, since items in a sequence database often appear randomly. In addition, during the extension of sequences (using bitwise AND), '0' bits appear more often, which increases the required memory and processing time. In order to overcome this problem, a dynamic bit vector (DBV) architecture is used here (Tran et al. [17]; Le et al. [7]).
Let A and B be two bit vectors, and let p1 and p2 be the probabilities of '1' bits in A and B, respectively. Let k be the probability of '0' bits after joining A and B to get AB by extending the sequence. The probability of '1' bits in bit vector AB is then min(p1, p2) − k, where min(p1, p2)
Table 2 Example of bit vector for 16 transactions of item i

0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0

Table 3 Conversion of the bit vector in Table 2 to DBV

index = 7
DBV = {7, 1 0 1 1 1 0 1}
is the minimum value of p1 and p2. The probability of '1' bits in AB decreases as that of '0' bits increases. Moreover, the gap between p1 and p2 quickly increases after several sequence extensions. For example, suppose that there are 16 transactions in a sequence database and that an item i exists in transactions 7, 9, 10, 11, and 13. The bit vector for item i needs 16 bytes, as shown in Table 2. The first non-zero byte appears at index 7. The DBV stores only the starting index and the sequence of bytes from the first non-zero byte to the last non-zero byte, as shown in Table 3. Only 8 bytes are required to store the information using the DBV structure.
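Treating each transaction flag as one stored unit, the compression of Tables 2 and 3 can be sketched as follows (`to_dbv` is an illustrative name, not one from the paper):

```python
def to_dbv(bits):
    """Keep only the run from the first '1' to the last '1', together
    with the 1-based starting index, as in Tables 2 and 3."""
    ones = [i for i, b in enumerate(bits) if b]
    if not ones:
        return None  # an all-zero vector compresses to nothing
    first, last = ones[0], ones[-1]
    return first + 1, bits[first:last + 1]

# item i present in transactions 7, 9, 10, 11, and 13 out of 16
bits = [0] * 16
for t in (7, 9, 10, 11, 13):
    bits[t - 1] = 1
print(to_dbv(bits))  # (7, [1, 0, 1, 1, 1, 0, 1])
```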
To find frequent closed sequences efficiently, the DBVPattern data structure is used in the proposed algorithm. It combines a DBV structure with the location information of sequences. The DBV structure stores sequences in a vertical format, so the support of a sequence pattern can easily be calculated by counting the number of '1' bits. Each DBV consists of two parts: (1) start bit: the position of the first appearance of '1', and (2) bit vector: the sequence of bits from the first non-zero bit to the last non-zero bit. Table 4 shows the conversion of the example database D in Table 1 into the DBV format. Take the item B as an example. It appears in sequences 3, 4 and 5, so the bit vector for B is (0, 0, 1, 1, 1). Since its leading '1' bit is at the third position, the DBV representation is (3, 111). Note that if a bit vector is (0, 1, 0, 1, 0), then its DBV is (2, 101).
Since a pattern may appear multiple times in a sequence, its starting position and all appearance positions are stored in the form startPos: {list of positions}. For example, the item B in sequence 4 first appears in the second position and then appears in the third position. The list of positions for B in that sequence is thus 2: {2, 3}, where the leading 2 represents the first appearance position. Table 5 shows the DBVPattern of item B in the example. The index field represents the corresponding sequence with bit '1' in Table 4.

Table 4 Conversion of database D in Table 1 into DBV format

Item  Sequences      Bit vector  Start bit  DBV bit vector
A     1, 2, 3, 4, 5  1 1 1 1 1   1          1 1 1 1 1
B     3, 4, 5        0 0 1 1 1   3          1 1 1
C     1, 2, 3, 4     1 1 1 1 0   1          1 1 1 1

Table 5 DBVPattern for item B in Table 1

Index              3       4          5
List of positions  2: {2}  2: {2, 3}  2: {2, 3}
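Putting the two parts together, a DBVPattern for item B of the running example might look like the sketch below (the class name follows the paper; the field names and layout are assumptions):

```python
class DBVPattern:
    """Sketch of a DBVPattern: a DBV over data-sequence indices plus,
    for each covered sequence, occurrences stored as startPos: {positions}."""
    def __init__(self, start_bit, bits, positions):
        self.start_bit = start_bit  # index of first sequence with a '1'
        self.bits = bits            # bits from the first to the last '1'
        self.positions = positions  # one (startPos, {positions}) per '1' bit

    def support(self, db_size):
        # the support is just the count of '1' bits over |D|
        return sum(self.bits) / db_size

# item B: sequences 3-5, DBV (3, 111), positions as in Table 5
b = DBVPattern(3, [1, 1, 1], [(2, {2}), (2, {2, 3}), (2, {2, 3})])
print(b.support(5))  # 0.6, i.e. 60 %
```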
The NRD-DBV algorithm consists of five main steps: (1) conversion of the sequence database to the DBVPattern structure; (2) early pruning of prefix sequences; (3) examination of the downward closure of frequent sequences; (4) sequence extension; and (5) generation of non-redundant sequential rules.
The proposed algorithm uses several kinds of sequence extension:

1. 1-sequence extension: Assume that α and β are two frequent 1-sequences represented in DBVPattern form. Let {DBV_α, p_α} and {DBV_β, p_β} be the DBVs and the lists of positions for α and β, respectively. A bitwise AND operator on two DBVs with the same indices (data sequences) is defined as DBV_αβ = DBV_α ∧ DBV_β. There are two forms of 1-sequence extension:

(a) Itemset extension: α +_i β = (αβ){DBV_αβ, p_β}, if (α < β) ∧ (p_α = p_β), and

(b) Sequence extension: α +_s β = αβ{DBV_αβ, p_β}, if (p_α < p_β).
2. k-sequence extension: Assume that α and β are two frequent k-sequences (k > 1) represented in DBVPattern form. Let u = sub_k,k(α), v = sub_k,k(β), and let {DBV_α, p_α} and {DBV_β, p_β} represent the DBVs and the lists of positions for α and β, respectively. There are two forms of sequence extension:

(a) Itemset extension: α +_i,k β = sub_1,k−1(α)(uv){DBV_αβ, p_β}, if (u < v) ∧ (p_α = p_β) ∧ (sub_1,k−1(α) = sub_1,k−1(β)), and

(b) Sequence extension: α +_s,k β = αv{DBV_αβ, p_β}, if (p_α < p_β) ∧ (sub_1,k−1(α) = sub_1,k−1(β)).

Table 6 NRD-DBV algorithm: mining non-redundant sequential rules

Algorithm: NRD-DBV (D, minSup, minConf)
Input: Sequence database D with item set I, minSup, and minConf
Output: Set of non-redundant sequential rules nr-SeqRule
1   root = root node with value {NULL};
2   nr-SeqRule = ∅;
3   fcs = {Convert pattern i to DBVPattern | i ∈ I in D and sup(i) ≥ minSup};
4   Add fcs as the child nodes of root;
5   For (each child node c of root) do
6       Call ClosedPattern-Extension (c, minSup);
7   For (each child node c of root) do
8       Call Generate-NRRule (c, minConf, nr-SeqRule);
3. Backward extension and forward extension: Let S be a sequence, S = e1 e2 ··· en. An item e can be added to sequence S in one of three positions to form S′:

(a) S′ = e1 e2 ··· en e ∧ (sup(S′) = sup(S)),

(b) ∃i (1 ≤ i < n) such that S′ = e1 e2 ··· ei e ··· en ∧ (sup(S′) = sup(S)), and

(c) S′ = e e1 e2 ··· en ∧ (sup(S′) = sup(S)).

Case (a) is called a forward extension, and cases (b) and (c) are called backward extensions.
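The DBV_αβ = DBV_α ∧ DBV_β join used by both extension forms can be sketched by aligning the two vectors on absolute sequence indices and re-trimming the result (`and_dbv` is an illustrative name, not from the paper):

```python
def and_dbv(a_start, a_bits, b_start, b_bits):
    """Bitwise AND of two DBVs, followed by re-trimming to the run
    between the first and last surviving '1'."""
    lo = min(a_start, b_start)
    hi = max(a_start + len(a_bits), b_start + len(b_bits))
    joined = []
    for i in range(lo, hi):  # absolute sequence indices
        a = a_bits[i - a_start] if a_start <= i < a_start + len(a_bits) else 0
        b = b_bits[i - b_start] if b_start <= i < b_start + len(b_bits) else 0
        joined.append(a & b)
    ones = [i for i, bit in enumerate(joined) if bit]
    if not ones:
        return None  # the extended pattern occurs in no data sequence
    return lo + ones[0], joined[ones[0]:ones[-1] + 1]

# A has DBV (1, 11111) and B has DBV (3, 111) in the running example:
print(and_dbv(1, [1, 1, 1, 1, 1], 3, [1, 1, 1]))  # (3, [1, 1, 1])
```

The resulting DBV (3, 111) has three '1' bits, so the extended pattern AB would have support 3/5 = 60 %.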
Fig. 1 Frequent closed sequences for database D in Table 1. Nodes with dashed borders correspond to pruning prefixes; the shaded node corresponds to a sequence that is not a frequent closed sequence, which is removed. [Nodes shown: {}; A: 100 %; A(AC): 60 %; ABA: 40 %; ABB: 40 %; ABC: 40 %; A(BC): 40 %; ACC: 40 %; A(BC)C: 40 %]
In order to prune and check candidates early, the proposed approach uses the following three operations.

1. Checking a sequence closure: If there is a sequence Sb that is a forward extension or a backward extension of a sequence Sa, then Sa is not closed and can be safely absorbed by Sb. For example, suppose that Sa = A(BC): 40 % and Sb = A(BC)C: 40 %, where the number after the colon represents the support value. According to the cases above, Sb is a forward extension of Sa, so A(BC): 40 % will be absorbed by A(BC)C: 40 % because A(BC) ⊆ A(BC)C and sup(A(BC)) = sup(A(BC)C) = 40 %.
2. Pruning a prefix: Consider a prefix Sp = e1 e2 ··· en. If there is an item e before the starting position of Sp in each of the data sequences containing Sp in a sequence database D, then extension by prefix Sp can be pruned. Based on the starting position (startPos) of each sequence in the DBVPattern, the proposed algorithm can check this quickly by comparing the start positions of two sequences. For example, consider the database D in Table 1. There is no need to extend prefix B because there is a pattern A (startPos = 1) that occurs before B (startPos = 2) in each data sequence that contains prefix B. If prefix B were extended, the results obtained would be absorbed, since the extension of prefix A already contains B and has the same support.

Figure 1 illustrates frequent closed sequence mining using the two operations above for the example database in Table 1.
3. Stopping generation of sequential rules for a subtree of a prefix: Consider three nodes n, n1, and n2, where n1 is a child node of n, and n2 is a child node of n1. Since sup(n2) ≤ sup(n1), if sup(n1)/sup(n) < minConf, then sup(n2)/sup(n) < minConf. Thus, if the confidence of the rule r = pre → post is less than minConf, then we can safely stop generating the rules for all child nodes of post. For example, suppose that minConf = 65 % in Fig. 1; then there is no need to generate rules for nodes ABA, ABB, and ABC (child nodes of AB) because the confidence of rule A → B is 60 % (less than minConf).

Table 7 ClosedPattern-Extension method: mining frequent closed sequences

Method: ClosedPattern-Extension (root, minSup)
Input: Prefix tree root and minSup
Output: Set of frequent closed sequences in prefix tree root
9   listNode = child nodes of root;
10  For (each S_p in listNode) do
11      If (S_p is not pruned) then
12          For (each S_a in listNode) do
13              If (sup(S_pa = Sequence-extension(S_p, S_a)) ≥ minSup) then
14                  Add S_pa as a child node of S_p;
15              If (sup(S_pa = Itemset-extension(S_p, S_a)) ≥ minSup) then
16                  Add S_pa as a child node of S_p;
17          End For
18          Call ClosedPattern-Extension (S_p, minSup);
19      End If
20      Check and set the attribute of S_p: closed, prefixed generator, or NULL;
21  End For

Table 8 Generate-NRRule method: generating non-redundant rules from a prefix tree

Method: Generate-NRRule (root, minConf, nr-SeqRule)
Input: Prefix tree root and minConf
Output: Set of non-redundant sequential rules nr-SeqRule
22  pre = sequence of root;
23  subNode = child nodes of root;
24  For (each node S_r in subNode) do
25      If (S_r is a prefixed generator) then
26          For (each node S_n in the subtree with root S_r) do
27              r = pre → post, where pre + post = the sequence of S_n;
28              If ((sup(S_n) / sup(pre)) ≥ minConf) then
29                  nr-SeqRule = nr-SeqRule ∪ {r};
30              Else Stop generating rules for child nodes of S_n;
31          End For
32      End If
33      Call Generate-NRRule (S_r, minConf, nr-SeqRule);
34  End For
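The confidence-based cut above can be sketched with a tiny prefix-tree fragment from Fig. 1 (`Node` and `generate_rules` are illustrative helpers, not the paper's structures):

```python
class Node:
    """Hypothetical prefix-tree node: a sequence label, its support,
    and its child nodes."""
    def __init__(self, seq, sup, children=()):
        self.seq, self.sup, self.children = seq, sup, list(children)

def generate_rules(pre, pre_sup, node, min_conf, out):
    """Emit pre -> post rules; once confidence falls below minConf,
    skip the whole subtree, since a child's support never exceeds
    its parent's, so confidence cannot recover."""
    conf = node.sup / pre_sup
    if conf < min_conf:
        return  # stop generating rules for all child nodes
    out.append((f'{pre} -> {node.seq}', round(conf, 2)))
    for child in node.children:
        generate_rules(pre, pre_sup, child, min_conf, out)

# A: 100 % with child AB: 60 %, whose children are ABA/ABB/ABC: 40 %
ab = Node('B', 0.6, [Node('BA', 0.4), Node('BB', 0.4), Node('BC', 0.4)])
rules = []
generate_rules('A', 1.0, ab, 0.65, rules)
print(rules)  # [] -- A -> B has confidence 60 % < 65 %, so the subtree is skipped
```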
Table 6 shows the pseudo-code of the proposed NRD-DBV algorithm, which is based on the above principles.
Table 9 Relation between the number of nodes (n) and the average number of child nodes (k) in a prefix tree, with columns: database, minSup (%), number of nodes (n), and average number of child nodes (k)
Table 10 Definitions of parameters for generating databases using the IBM synthetic data generator

C  Average number of itemsets per sequence
T  Average number of items per itemset
S  Average number of itemsets in maximal sequences
I  Average number of items in maximal sequences
N  Number of distinct items
The algorithm first scans the given sequence database D to find frequent 1-sequences and stores them in fcs as DBVPatterns (line 3). Then, the 1-sequences in fcs are added to a prefix tree with the root of the tree set to NULL (line 4). On line 6, the algorithm performs the sequence extension for each child node of the root by calling the ClosedPattern-Extension method in Table 7. After finding all frequent closed sequences, the algorithm generates all significant sequential rules by calling the Generate-NRRule method in Table 8.
In Table 7, the ClosedPattern-Extension method is used to extend sequences of a given prefix group. The method executes line 18 recursively until no more frequent closed sequences are generated. The sequence extension is performed in two forms: sequence extension (line 13) and itemset extension (line 15). Before the sequence extension, the algorithm tests and eliminates prefixes that cannot be used to extend frequent closed sequences using the second extension judgment on line 11. If the sequence results obtained are frequent, they are stored as child nodes of the prefix. The prefix S_p is then checked and marked as a frequent closed sequence or a prefixed generator by using the first extension judgment and Definition 11 (line 20). Otherwise, it is set to NULL.

Fig. 2 Comparison of runtime for (a) C6T5S4I4N1kD1k and (b) C6T5S4I4N1kD10k with various minConf values (minSup = 0.5 %)
After finding all frequent closed sequences, the algorithm generates all significant sequential rules by calling the Generate-NRRule method in Table 8. For a prefixed generator in a node of a given prefix tree, the algorithm generates all rules within the subtree with that node as the prefix (line 25). In this process, the third sequence extension judgment is used to stop generating rules for child nodes that do not meet the minConf value (line 30). The method is executed recursively for all nodes in the prefix tree (line 33).

Fig. 3 Comparison of runtime for (a) C6T5S4I4N1kD1k and (b) C6T5S4I4N1kD10k with various minSup values (minConf = 50 %)
Let n be the number of nodes in a prefix tree (a complete set of frequent closed sequences), and let k be the average number of child nodes in the prefix tree. For each node that is a prefixed generator, the rule generation process is performed on its child nodes once. Thus, based on the prefix tree structure, rule generation is performed n × k times. However, if the set of frequent sequences were not enumerated on the prefix tree, then for each sequence, (n − 1) operations would be needed for checking and generating sequential rules. The complexity of generating rules is therefore O(n × k). Since k << n (Table 9 shows the relation between n and k in some sequence databases), the complexity of the NRD-DBV algorithm is approximately O(n).
5 Experimental results
Experiments were performed to evaluate the proposed algorithm. The CNR algorithm [8], a state-of-the-art method, was used for comparison. Both algorithms were implemented on a personal computer with an Intel Core i7 3.4-GHz CPU and 8.0 GB of RAM running Windows 8.1. The runtime was measured in seconds (s), and the memory usage was measured in megabytes (MB).

Fig. 4 Comparison of memory usage for (a) C6T5S4I4N1kD1k and (b) C6T5S4I4N1kD10k with various minConf values (minSup = 0.5 %)
5.1 Experiments on synthetic databases
The synthetic databases used for comparison were generated using the IBM synthetic data generator. The definitions of the parameters used to generate the databases are shown in Table 10.

Two databases, C6T5S4I4N1kD1k and C6T5S4I4N1kD10k, were used for the comparison of runtime and memory usage. First, experiments were conducted to compare the execution time of the two algorithms for various minConf values. Figure 2a, b show the runtime with various minConf values (from 50 to 90 %) for databases C6T5S4I4N1kD1k and C6T5S4I4N1kD10k, respectively. The minSup value was set to 0.5 % for all cases in this test. Figure 3a, b show the runtime with various minSup values (from 0.3 to 0.7 %). The minConf value was set to 50 % in this test.
As shown in Fig. 2, the runtime increased with decreasing minConf value. This is due to more non-redundant sequential rules being obtained for smaller minConf values. For example, for the database C6T5S4I4N1kD1k with minConf = 90 %, only 29 non-redundant sequential rules were generated. For minConf = 50 %, the number of non-redundant rules increased to 3135. Thus, the execution time for generating rules increased with decreasing minConf value. The results also show that NRD-DBV was faster than CNR in all cases.

Fig. 5 Comparison of memory usage for (a) C6T5S4I4N1kD1k and (b) C6T5S4I4N1kD10k with various minSup values (minConf = 50 %)
Next, experiments were conducted to compare the memory usage of the two algorithms. The results obtained for various minConf and minSup values are shown in Figs. 4 and 5, respectively.

Similar to the trends in Figs. 2 and 3, the amount of memory required increased with decreasing minConf value because of the increasing number of sequential rules. Since NRD-DBV uses the DBVPattern structure and prunes the subtrees of prefixes that infer non-significant rules early in the process, the memory usage of NRD-DBV was lower than that of CNR in all cases.
5.2 Experiments on a real database
A real database named Gazelle was used to evaluate the performance of the algorithms. This database contains 59,601 sequences of clickstream data from an e-commerce website with 497 distinct items. Figure 6a, b show the runtime and memory usage, respectively, for the Gazelle database with minSup = 0.05 % and minConf set from 5 to 9 %. Figure 7a, b show the runtime and memory usage, respectively, with various minSup values (from 0.03 to 0.07 %).

Fig. 6 Comparison of runtime and memory usage for the Gazelle database with various minConf values (minSup = 0.05 %)

Fig. 7 Comparison of runtime and memory usage for the Gazelle database with various minSup values (minConf = 5 %)

The results in Figs. 6 and 7 show that NRD-DBV outperforms CNR in most cases for the real database.
6 Conclusions and future work
This paper proposed the NRD-DBV algorithm, which uses DBVs and data sequence information to generate non-redundant sequential rules. The NRD-DBV algorithm first finds all frequent closed patterns in a given sequence database. A prefix tree, which is suitable for generating sequential rules, is built during the mining of frequent closed sequences. Based on the prefix tree, the algorithm generates non-redundant sequential rules quickly. This process is further improved by stopping rule generation early for a supersequence of the postfix if there is a rule with low confidence. The NRD-DBV algorithm scans the database
only once. Based on the DBV data structure, the supports of patterns can be determined quickly, and bit operators are used to extend new patterns. Due to its use of a compressed structure and a prefix tree, the NRD-DBV algorithm is more efficient than the CNR algorithm in terms of memory usage and runtime.
Non-redundant sequential rules can be applied effectively in the field of Web mining, for example to analyze Web behavior or restructure Web sites, in order to help companies better promote their products or services. Based on an educational database, non-redundant sequential rules can be used to help students select appropriate subjects or majors. In addition, non-redundant sequential rules can be applied to predict the pathways of WiFi users in a university or a company in order to manage bandwidth automatically.
Mining frequent inter-sequences can find itemsets across several transactions and discover the order relationships of itemsets within a transaction. Thus, the mining of non-redundant inter-sequential rules for such patterns can be applied to reduce the number of redundant rules. The NRD-DBV algorithm could be further extended to solve these problems.
Acknowledgments This work was funded by Vietnam’s National
Foundation for Science and Technology Development (NAFOSTED)
under grant number 102.05-2015.07.
References
1. Agrawal R, Srikant R (1995) Mining sequential patterns. In: IEEE international conference on data engineering, pp 3–14
2. Ayres J, Gehrke J, Yiu T, Flannick J (2002) Sequential pattern mining using a bitmap representation. In: 8th ACM SIGKDD international conference on knowledge discovery and data mining, Edmonton, pp 429–435
3. Chiu DY, Wu YH, Chen ALP (2004) An efficient algorithm for mining frequent sequences by a new strategy without support counting. In: 20th international conference on data engineering, pp 375–386
4. Dong J, Han M (2007) BitTableFI: an efficient mining frequent itemsets algorithm. Knowl-Based Syst 20(4):329–335
5. Fournier-Viger P, Gomariz A, Campos M, Thomas R (2014) Fast vertical mining of sequential patterns using co-occurrence information. In: Advances in knowledge discovery and data mining, LNAI, vol 8443, pp 40–52
6. Gomariz A, Campos M, Marin R, Goethals B (2013) ClaSP: an efficient algorithm for mining frequent closed sequences. In: Advances in knowledge discovery and data mining, LNAI, vol 7818, pp 50–61
7. Le B, Tran MT, Vo B (2015) Mining frequent closed inter-sequence patterns efficiently using dynamic bit vectors. Appl Intell 43:74–84
8. Lo D, Khoo SC, Wong L (2009) Non-redundant sequential rules: theory and algorithm. Inf Syst 34(4):438–453
9. Pei J, et al (2001) PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: International conference on data engineering, pp 215–224
10. Pham TT, Luo J, Hong TP, Vo B (2012) MSGPs: a novel algorithm for mining sequential generator patterns. In: Computational collective intelligence, technologies and applications, LNCS, vol 7654, pp 393–401
11. Pham TT, Luo J, Vo B (2013) An effective algorithm for mining closed sequential patterns and their minimal generators based on prefix trees. Int J Intell Inf Database Syst 7(4):324–339
12. Pham TT, Luo J, Hong TP, Vo B (2014) An efficient method for mining non-redundant sequential rules using attributed prefix-trees. Eng Appl Artif Intell 32:88–99
13. Song S, Hu H, Jin S (2005) HVSM: a new sequential pattern mining algorithm using bitmap representation. In: Advanced data mining and applications, pp 455–463
14. Song W, Yang B, Xu Z (2008) Index-BitTableFI: an improved algorithm for mining frequent itemsets. Knowl-Based Syst 21(6):507–513
15. Spiliopoulou M (1999) Managing interesting rules in sequence mining. In: Principles of data mining and knowledge discovery, Springer, Berlin Heidelberg, pp 554–560
16. Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Apers PMG, Bouzeghoub M, Gardarin G (eds) EDBT 1996, LNCS, vol 1057, pp 3–17
17. Tran MT, Le B, Vo B (2015) Combination of dynamic bit vectors and transaction information for mining frequent closed sequences efficiently. Eng Appl Artif Intell 38:183–189
18. Van TT, Vo B, Le B (2011) Mining sequential rules based on prefix-tree. In: New challenges for intelligent information and database systems, Springer, Berlin Heidelberg, pp 147–156
19. Van TT, Vo B, Le B (2014) IMSR_PreTree: an improved algorithm for mining sequential rules based on the prefix-tree. Vietnam J Comput Sci 1(2):97–105
20. Wang J, Han J, Li C (2007) Frequent closed sequence mining without candidate maintenance. IEEE Trans Knowl Data Eng 19(8):1042–1056
21. Yan X, Han J, Afshar R (2003) CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of SIAM international conference on data mining, pp 166–177
22. Yang Z, Kitsuregawa M (2005) LAPIN-SPAM: an improved algorithm for mining sequential pattern. In: ICDE workshops, p 1222
23. Zaki M (2001) SPADE: an efficient algorithm for mining frequent sequences. Mach Learn 42(1–2):31–60