DOI 10.1007/s10489-016-0765-3
Mining non-redundant sequential rules with dynamic
bit vectors and pruning techniques
Minh-Thai Tran 1 · Bac Le 2 · Bay Vo 3,4 · Tzung-Pei Hong 5,6
© Springer Science+Business Media New York 2016
Abstract Most algorithms for mining sequential rules focus on generating all sequential rules. These algorithms produce an enormous number of redundant rules, making mining inefficient in intelligent systems. In order to solve this problem, the mining of non-redundant sequential rules was recently introduced. Most algorithms for mining such rules depend on patterns obtained from existing frequent sequence mining algorithms. Several steps are required to
Bay Vo
bayvodinh@gmail.com
vodinhbay@tdt.edu.vn
Minh-Thai Tran
minhthai@huflit.edu.vn
Bac Le
lhbac@fit.hcmus.edu.vn
Tzung-Pei Hong
tphong@nuk.edu.tw
1 Faculty of Information Technology, University of Foreign
Languages - Information Technology, Ho Chi Minh, Vietnam
2 Department of Computer Science, University of Science,
VNU-HCM, Vietnam
3 Division of Data Science, Ton Duc Thang University,
Ho Chi Minh, Vietnam
4 Faculty of Information Technology, Ton Duc Thang
University, Ho Chi Minh, Vietnam
5 Department of CSIE, National University of Kaohsiung,
Kaohsiung, Taiwan, Republic of China
6 Department of Computer Science and Engineering,
National Sun Yat-sen University, Kaohsiung,
Taiwan, Republic of China
organize the data structure of these sequences before rules can be generated. This process requires a great deal of time and memory. The present study proposes a technique for mining non-redundant sequential rules directly from sequence databases. The proposed method uses a dynamic bit vector data structure and adopts a prefix tree in the mining process. In addition, some pruning techniques are used to remove unpromising candidates early in the mining process. Experimental results show the efficiency of the algorithm in terms of runtime and memory usage.
Keywords Data mining · Dynamic bit vector · Non-redundant rule · Sequential rule
1 Introduction
The goal of sequential rule mining is to find the relationships between occurrences of sequential items in sequence databases. A sequential rule is expressed in the form X → Y; i.e., if X occurs in a sequence of a database, then Y also occurs in that sequence following X with high confidence. In general, the mining process is divided into two main phases: (1) mining of frequent sequences and (2) generation of sequential rules based on those sequences.
Since Agrawal and Srikant proposed the AprioriAll algorithm [1], the mining of frequent sequences has been widely studied. The mining of frequent sequences is a necessary step before the generation of sequential rules, and thus researchers mainly focus on improving the efficiency of this step. Several algorithms that use different strategies for the organization of data, data structures, and mining techniques have been proposed. Algorithms for mining sequential rules include Full [15] and MSR_PreTree [18].
With a large sequence database, the number of frequent sequences is very large, which affects the efficiency of mining sequential rules. Some scholars have thus attempted to remove sequences that do not affect the final mining results in order to make the sequences compact. Examples include the mining of frequent closed sequences and the mining of non-redundant sequential rules. Typical algorithms for mining frequent closed sequences are CloSpan [21], BIDE [20] and CloGen [11]. Efficient algorithms for mining non-redundant sequential rules include CNR [8] and MNSR_PreTree [12]. However, these algorithms generate sequential rules based on the results of existing frequent sequence mining algorithms. Thus, they depend entirely on the data structure of the mined frequent sequences. Some algorithms build a prefix tree of frequent sequences before generating sequential rules.
The present study proposes the algorithm NRD-DBV for mining non-redundant sequential rules based on dynamic bit vectors and pruning techniques. It adopts a prefix tree and uses a dynamic bit vector structure to compress the data. The algorithm uses a depth-first search order with prefix pruning in order to traverse the search space efficiently. The pruning techniques reduce the required storage and execution time for mining non-redundant sequential rules directly from sequence databases.
The rest of this paper is organized as follows. Section 2 defines the problem. Section 3 summarizes some related work. Section 4 presents the proposed algorithm. Section 5 shows the experimental results. Conclusions and suggestions for future work are given in Section 6.
2 Problem definitions
Consider a sequence database with a set I of distinct events, where I = {i1, i2, i3, ..., in}, ij is an event (or an item), and 1 ≤ j ≤ n. A set of unordered events is called an itemset. Each itemset is written in brackets. For example, (ABC) represents an itemset with three items, namely A, B, and C. The brackets are omitted to simplify the notation for itemsets with only a single item. For example, the notation B represents an itemset containing only item B. A sequence S = ⟨e1, e2, e3, ..., em⟩ is an ordered list of itemsets, where ej is an itemset and 1 ≤ j ≤ m. The size of a sequence is the number m of itemsets in the sequence. The length of a sequence is the number of items in the sequence. A sequence with length k is called a k-sequence.
Definition 1 (Subsequence and supersequence) Let Sa = ⟨a1, a2, ..., am⟩ and Sb = ⟨b1, b2, ..., bn⟩ be two sequences. The sequence Sa is a subsequence of Sb if there are m integers i1 to im with 1 ≤ i1 < i2 < ... < im ≤ n such that a1 ⊆ bi1, a2 ⊆ bi2, ..., am ⊆ bim. In this case, Sb is also called a supersequence of Sa, and the relation is denoted Sa ⊆ Sb.
Definition 2 (Sequence database) A sequence database D is a set of sequences D = {s1, s2, s3, ..., s|D|}, where |D| is the number of sequences in D and si (1 ≤ i ≤ |D|) is the i-th sequence in D. For example, the database D in Table 1 includes five sequences, i.e., |D| = 5.
Definition 3 (Support of a sequence) The support of a sequence Sa in a sequence database D is the number of sequences with at least one occurrence of Sa in D divided by |D|, and is denoted sup(Sa). A sequence Sa with support sup(Sa) is written in the form Sa: sup(Sa) to simplify the notation. For example, in Table 1, the sequence (AC) appears in three sequences; thus the support of (AC) is 60 %, denoted (AC): 60 %.
Definition 4 (Frequent sequence) Given a minimum support threshold minSup, a sequence Sa is called a frequent sequence in D if sup(Sa) ≥ minSup. The problem of mining frequent sequences is to find the complete set of frequent subsequences for an input sequence database D and a given minimum support threshold minSup.
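As a minimal illustration of Definitions 1–4, the sketch below checks containment and computes support over a small hypothetical database (the actual database of Table 1 is not reproduced here, so the five sequences used are assumptions for illustration only):

```python
def is_subsequence(sa, sb):
    """Definition 1: sa is contained in sb if sa's itemsets can be
    matched, in order, to supersets among sb's itemsets."""
    j = 0
    for itemset in sb:
        if j < len(sa) and sa[j] <= itemset:  # a_j is a subset of b_i
            j += 1
    return j == len(sa)

def support(seq, db):
    """Definition 3: fraction of data sequences containing seq."""
    return sum(is_subsequence(seq, s) for s in db) / len(db)

# Hypothetical database of five sequences (each a list of itemsets)
db = [
    [{'A'}, {'A', 'C'}],
    [{'A'}, {'C'}],
    [{'A'}, {'B', 'C'}, {'C'}],
    [{'A'}, {'B'}, {'B', 'C'}],
    [{'A'}, {'B'}, {'A'}],
]
print(support([{'A'}, {'C'}], db))  # 0.8, so <AC> is frequent for minSup <= 80 %
```

The greedy earliest-match strategy is sufficient here: matching each itemset of Sa to the first qualifying itemset of Sb never causes a false negative for this containment relation.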
Definition 5 (Frequent closed sequence) Let Sa and Sb be two frequent sequences. Sa is called a frequent closed sequence if there is no Sb such that Sa ⊆ Sb ∧ sup(Sa) = sup(Sb). Different from the problem of mining frequent sequences, the problem of mining frequent closed sequences is to find the complete set of frequent closed sequences for an input sequence database D and a given minimum support threshold minSup. Frequent closed sequences are more compact than general frequent sequences because a subsequence Sa that has the same support as its supersequence Sb is absorbed by Sb without affecting the mining results. For example, in Table 1, sequence A(BC) is absorbed by A(BC)C.
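The absorption test behind Definition 5 can be sketched as follows (the function names are illustrative; the supports are the 40 % values of the A(BC)/A(BC)C example):

```python
def is_subsequence(sa, sb):
    """Definition 1: itemsets of sa matched, in order, to supersets in sb."""
    j = 0
    for itemset in sb:
        if j < len(sa) and sa[j] <= itemset:
            j += 1
    return j == len(sa)

def absorbed(sa, sup_a, sb, sup_b):
    """Sa is absorbed (hence not closed) when Sb is a proper
    supersequence with the same support (Definition 5)."""
    return sa != sb and is_subsequence(sa, sb) and sup_a == sup_b

sa = [{'A'}, {'B', 'C'}]           # A(BC): 40 %
sb = [{'A'}, {'B', 'C'}, {'C'}]    # A(BC)C: 40 %
print(absorbed(sa, 0.4, sb, 0.4))  # True, so A(BC) can be discarded
```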
Table 1 Example sequence database
Definition 6 (Substring of a sequence) Let S be a sequence. A substring of S, denoted sub_i,j(S) (i ≤ j), is the segment from position i to position j of S. Its length is (j − i + 1). For example, sub_1,2(AA(AC)) is AA and sub_4,4(AA(AC)) is C.
Definition 7 (Concatenation) Let Sa and Sb be two sequences. The sequence Sa + Sb denotes the concatenation of Sa and Sb, formed by appending Sb after Sa. For example, AB + AC = ABAC.
Definition 8 (Sequential rule) A sequential rule r is denoted pre → post (sup, conf), where pre and post are frequent sequences, and sup and conf are the support and confidence values of r, respectively, with sup = sup(pre + post) and conf = sup(pre + post) / sup(pre).
Definition 9 (Frequent sequential rule and strong sequential rule) Given a minimum support threshold minSup and a minimum confidence threshold minConf, a rule whose support value is higher than or equal to minSup is considered a frequent sequential rule, and a rule whose confidence value is higher than or equal to minConf is a strong sequential rule.

For each frequent sequence f of size k, (k − 1) rules can possibly be generated. For example, if there is a frequent sequence A(BC)C, then the two possible rules are A → (BC)C and A(BC) → C.
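The (k − 1) candidate splits can be enumerated directly over the itemset boundaries of a frequent sequence; a small sketch (the function name is illustrative):

```python
def candidate_rules(seq):
    """Split a frequent sequence of size k at each of its k-1 itemset
    boundaries into (pre, post) candidates (Definitions 8 and 9)."""
    return [(seq[:i], seq[i:]) for i in range(1, len(seq))]

f = [{'A'}, {'B', 'C'}, {'C'}]   # the sequence A(BC)C, size 3
for pre, post in candidate_rules(f):
    print(pre, '->', post)       # A -> (BC)C, then A(BC) -> C
```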
Definition 10 (Rule inference and redundant rule) Let D be a sequence database, Si be the i-th sequence in D (1 ≤ i ≤ |D|), and r1 and r2 be two sequential rules. r1 infers r2 if and only if both of the following conditions hold: (1) ∀Si ∈ D, 1 ≤ i ≤ |D|, r1.pre + r1.post ⊆ Si ∧ r2.pre + r2.post ⊆ Si, and (2) sup(r1) = sup(r2) ∧ conf(r1) = conf(r2). A sequential rule is said to be redundant if it can be inferred by another rule.

For example, assume that the two rules r1: A → (BC)C and r2: A → C have the same support and confidence values. r2 is thus redundant since it can be inferred by r1.
Definition 11 (Prefixed generator) A frequent sequence P is considered a prefixed generator if there is no other P′ such that P′ ⊆ P ∧ sup(P′) = sup(P).
Definition 12 (Non-redundant rule) Based on Definitions 5 and 11, a rule r: pre → post is said to be non-redundant if pre + post is a frequent closed sequence and pre is a prefixed generator.

Given two minimum thresholds minSup and minConf, the goal of this study is to find the non-redundant sequential rules in a sequence database.
3 Related work
In order to mine sequential rules, frequent sequences need to
be mined before sequential rules can be generated The min-ing of frequent sequences was first proposed by Agrawal and Srikant with their AprioriAll algorithm [1], based on the downward closure property Agrawal and Srikant expanded this mining problem in a general way with the GSP rithm [16] Then, several frequent sequence mining algo-rithms have been proposed to improve mining efficiency These algorithms use various approaches for organizing data and storing mined information Most of them trans-form an original database into a vertical trans-format or use the database projection technique to reduce the search space and thus execution time Typical algorithms include SPADE [23], PrefixSpan [9], SPAM [2], LAPIN-SPAM [22], and CMAP [5]
The mining of frequent closed sequences has also been studied. Its runtime and required storage are relatively low due to the compact representation, and the original information of frequent sequences can be entirely recovered from frequent closed sequences. Frequent closed sequence mining and frequent closed itemset mining algorithms include CloSpan [21], BIDE [20], ClaSP [6], and CloFS-DBV [17]. These algorithms prune non-closed sequences using techniques such as the common prefix and backward sub-pattern methods. BIDE differs from the other algorithms in that it does not keep track of previously obtained frequent closed sequences for checking the closure of new patterns. Instead, it uses bi-directional extension techniques to examine candidate frequent closed sequences before extending a sequence. Moreover, the algorithm uses a back-scan process to identify candidates that cannot be extended, reducing mining time. The algorithm also uses the pseudo-projection technique to reduce database storage space, which is efficient for low support thresholds.
In addition to frequent sequence mining algorithms, many researchers have proposed sequential and non-redundant sequential rule mining algorithms. The latter have the advantage of producing complete and compact rule sets because only non-redundant rules are derived, reducing runtime and memory usage. For example, Spiliopoulou [15] proposed a method for generating a complete set of sequential rules from frequent sequences that removes redundant rules in a post-mining phase. Lo et al. [8] proposed the compressed non-redundant algorithm (CNR) for mining a compressed set of non-redundant sequential rules generated from two types of sequence set: LS-Closed and CS-Closed. The premise of a rule is a sequence in the LS-Closed set and the consequence is a sequence in the CS-Closed set. The generation of sequential rules is based on a prefix tree to increase efficiency. Some other typical algorithms include CloGen [11], MNSR_PreTree [12] and IMSR_PreTree [19].
Most sequential or non-redundant sequential rule mining algorithms, however, use a set of frequent sequences or frequent closed sequences mined by existing frequent sequence miners. A lot of time is required to transform or construct frequent sequences for generating sequential rules. An efficient method is proposed here to mine non-redundant sequential rules. It uses a compressed data structure in a vertical data format and some pruning techniques to mine frequent closed sequences and generate non-redundant sequential rules directly.
4 Proposed algorithm
This section describes the proposed NRD-DBV algorithm, which uses the DBVPattern structure to mine frequent closed sequences. A prefix tree is used for storing all frequent closed sequences. Based on the properties of the prefix tree, non-redundant sequential rules can be generated efficiently. Before the proposed approach is discussed, the DBVPattern data structure is first described.
Sequence mining algorithms based on a vertical data format have proven to be more efficient than those based on a horizontal data format. Typical algorithms that use a vertical data format include SPADE [23], DISC-all [3], HVSM [13], and MSGPs [10]. These algorithms scan the database only once and quickly calculate the support of a sequence. However, a disadvantage is that they require a great deal of memory to store additional information. BitTableFI [4] and Index-BitTableFI [14] address this problem by compressing data using a bit table (BitTable).
The main drawback of the bit vector structure is its fixed size, which depends on the number of transactions in the sequence database. A '1' indicates that a given item appears in the corresponding transaction and a '0' indicates otherwise. In practice, there are usually many '0' bits in a bit vector, since items in a sequence database often appear randomly. In addition, during the extension of sequences (using bitwise AND), '0' bits appear more often, which increases the required memory and processing time. In order to overcome this problem, a dynamic bit vector (DBV) architecture is used here (Tran et al. [17]; Le et al. [7]).
Let A and B be two bit vectors, and let p1 and p2 be the probabilities of '1' bits in A and B, respectively. Let k be the probability of '0' bits after joining A and B to get AB by extending the sequence. The probability of '1' bits in bit vector AB is then min(p1, p2) − k, where min(p1, p2)
Table 2 Example of bit vector for 16 transactions of item i

0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0

Table 3 Conversion of the bit vector in Table 2 to DBV

index = 7
DBV = {7, 1 0 1 1 1 0 1}
is the minimum value of p1 and p2. The probability of '1' bits in AB decreases as that of '0' bits increases. Moreover, the gap between p1 and p2 quickly increases after several sequence extensions. For example, suppose that there are 16 transactions in a sequence database and that an item i exists in transactions 7, 9, 10, 11, and 13. The bit vector for item i needs 16 bytes, as shown in Table 2. The first non-zero byte appears at index 7. The DBV stores only the starting index and the sequence of bytes from the first non-zero byte to the last non-zero byte, as shown in Table 3. Only 8 bytes are required to store the information using the DBV structure.
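Treating each transaction flag as one stored unit, the compression of Tables 2 and 3 can be sketched as follows (`to_dbv` is an illustrative name, not one from the paper):

```python
def to_dbv(bits):
    """Keep only the run from the first '1' to the last '1', together
    with the 1-based starting index, as in Tables 2 and 3."""
    ones = [i for i, b in enumerate(bits) if b]
    if not ones:
        return None  # an all-zero vector compresses to nothing
    first, last = ones[0], ones[-1]
    return first + 1, bits[first:last + 1]

# item i present in transactions 7, 9, 10, 11, and 13 out of 16
bits = [0] * 16
for t in (7, 9, 10, 11, 13):
    bits[t - 1] = 1
print(to_dbv(bits))  # (7, [1, 0, 1, 1, 1, 0, 1])
```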
To find frequent closed sequences efficiently, the DBVPattern data structure is used in the proposed algorithm. It combines a DBV structure with the location information of sequences. The DBV structure stores sequences in a vertical format, so the support of a sequence pattern can easily be calculated by counting the number of '1' bits. Each DBV consists of two parts: (1) start bit: the position of the first appearance of '1', and (2) bit vector: the sequence of bits from the first non-zero bit to the last non-zero bit. Table 4 shows the conversion of the example database D in Table 1 into the DBV format. Take the item B as an example. It appears in sequences 3, 4 and 5, so the bit vector for B is (0, 0, 1, 1, 1). Since its leading '1' bit is at the third position, the DBV representation is (3, 111). Note that if a bit vector is (0, 1, 0, 1, 0), then its DBV is (2, 101).
Since a pattern may appear multiple times in a sequence, its starting position and all appearance positions are stored in the form startPos: {list of positions}. For example, the item B in sequence 4 first appears in the second position and then appears in the third position. The list of positions for B in that sequence is thus 2: {2, 3}, where the leading 2 represents the first appearance position. Table 5 shows the DBVPattern of item B in the example. The index field represents the corresponding sequence with bit '1' in Table 4.

Table 4 Conversion of database D in Table 1 into DBV format

Item  Sequences      Bit vector  Start bit  DBV bit vector
A     1, 2, 3, 4, 5  1 1 1 1 1   1          1 1 1 1 1
B     3, 4, 5        0 0 1 1 1   3          1 1 1
C     1, 2, 3, 4     1 1 1 1 0   1          1 1 1 1

Table 5 DBVPattern for item B in Table 1

Index              3       4          5
List of positions  2: {2}  2: {2, 3}  2: {2, 3}
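Putting the two parts together, a DBVPattern for item B of the running example might look like the sketch below (the class name follows the paper; the field names and layout are assumptions):

```python
class DBVPattern:
    """Sketch of a DBVPattern: a DBV over data-sequence indices plus,
    for each covered sequence, occurrences stored as startPos: {positions}."""
    def __init__(self, start_bit, bits, positions):
        self.start_bit = start_bit  # index of first sequence with a '1'
        self.bits = bits            # bits from the first to the last '1'
        self.positions = positions  # one (startPos, {positions}) per '1' bit

    def support(self, db_size):
        # the support is just the count of '1' bits over |D|
        return sum(self.bits) / db_size

# item B: sequences 3-5, DBV (3, 111), positions as in Table 5
b = DBVPattern(3, [1, 1, 1], [(2, {2}), (2, {2, 3}), (2, {2, 3})])
print(b.support(5))  # 0.6, i.e. 60 %
```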
The NRD-DBV algorithm consists of five main steps: (1) conversion of the sequence database to the DBVPattern structure; (2) early pruning of prefix sequences; (3) examination of the downward closure of frequent sequences; (4) sequence extension; and (5) generation of non-redundant sequential rules.
The proposed algorithm uses several kinds of sequence extension:

1. 1-sequence extension: Assume that α and β are two frequent 1-sequences represented in DBVPattern form. Let {DBV_α, p_α} and {DBV_β, p_β} be the DBVs and the lists of positions for α and β, respectively. A bitwise AND operator on two DBVs with the same indices (data sequences) is defined as DBV_αβ = DBV_α ∧ DBV_β. There are two forms of 1-sequence extension:

(a) Itemset extension: α +_i β = (αβ){DBV_αβ, p_β}, if (α < β) ∧ (p_α = p_β), and

(b) Sequence extension: α +_s β = αβ{DBV_αβ, p_β}, if (p_α < p_β).
2. k-sequence extension: Assume that α and β are two frequent k-sequences (k > 1) represented in DBVPattern form. Let u = sub_k,k(α), v = sub_k,k(β), and let {DBV_α, p_α} and {DBV_β, p_β} represent the DBVs and the lists of positions for α and β, respectively. There are two forms of sequence extension:

(a) Itemset extension: α +_i,k β = sub_1,k−1(α)(uv){DBV_αβ, p_β}, if (u < v) ∧ (p_α = p_β) ∧ (sub_1,k−1(α) = sub_1,k−1(β)), and

(b) Sequence extension: α +_s,k β = αv{DBV_αβ, p_β}, if (p_α < p_β) ∧ (sub_1,k−1(α) = sub_1,k−1(β)).

Table 6 NRD-DBV algorithm: mining non-redundant sequential rules

Algorithm: NRD-DBV (D, minSup, minConf)
Input: Sequence database D with item set I, minSup, and minConf
Output: Set of non-redundant sequential rules nr-SeqRule
1   root = root node with value {NULL};
2   nr-SeqRule = ∅;
3   fcs = {Convert pattern i to DBVPattern | i ∈ I in D and sup(i) ≥ minSup};
4   Add fcs as the child nodes of root;
5   For (each child node c of root) do
6       Call ClosedPattern-Extension (c, minSup);
7   For (each child node c of root) do
8       Call Generate-NRRule (c, minConf, nr-SeqRule);
3. Backward extension and forward extension: Let S be a sequence, S = e1 e2 ··· en. An item e can be added to sequence S in one of three positions to form S′:

(a) S′ = e1 e2 ··· en e ∧ (sup(S′) = sup(S)),

(b) ∃i (1 ≤ i < n) such that S′ = e1 e2 ··· ei e ··· en ∧ (sup(S′) = sup(S)), and

(c) S′ = e e1 e2 ··· en ∧ (sup(S′) = sup(S)).

Case (a) is called a forward extension, and cases (b) and (c) are called backward extensions.
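The DBV_αβ = DBV_α ∧ DBV_β join used by both extension forms can be sketched by aligning the two vectors on absolute sequence indices and re-trimming the result (`and_dbv` is an illustrative name, not from the paper):

```python
def and_dbv(a_start, a_bits, b_start, b_bits):
    """Bitwise AND of two DBVs, followed by re-trimming to the run
    between the first and last surviving '1'."""
    lo = min(a_start, b_start)
    hi = max(a_start + len(a_bits), b_start + len(b_bits))
    joined = []
    for i in range(lo, hi):  # absolute sequence indices
        a = a_bits[i - a_start] if a_start <= i < a_start + len(a_bits) else 0
        b = b_bits[i - b_start] if b_start <= i < b_start + len(b_bits) else 0
        joined.append(a & b)
    ones = [i for i, bit in enumerate(joined) if bit]
    if not ones:
        return None  # the extended pattern occurs in no data sequence
    return lo + ones[0], joined[ones[0]:ones[-1] + 1]

# A has DBV (1, 11111) and B has DBV (3, 111) in the running example:
print(and_dbv(1, [1, 1, 1, 1, 1], 3, [1, 1, 1]))  # (3, [1, 1, 1])
```

The resulting DBV (3, 111) has three '1' bits, so the extended pattern AB would have support 3/5 = 60 %.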
Fig. 1 Frequent closed sequences for database D in Table 1. Nodes with dashed borders correspond to pruning prefixes; the shaded node corresponds to a sequence that is not a frequent closed sequence, which is removed. [Nodes shown: {}; A: 100 %; A(AC): 60 %; ABA: 40 %; ABB: 40 %; ABC: 40 %; A(BC): 40 %; ACC: 40 %; A(BC)C: 40 %]
In order to prune and check candidates early, the proposed approach uses the following three operations.

1. Checking a sequence closure: If there is a sequence Sb that is a forward extension or a backward extension of a sequence Sa, then Sa is not closed and can be safely absorbed by Sb. For example, suppose that Sa = A(BC): 40 % and Sb = A(BC)C: 40 %, where the number after the colon represents the support value. According to the cases above, Sb is a forward extension of Sa, so A(BC): 40 % will be absorbed by A(BC)C: 40 % because A(BC) ⊆ A(BC)C and sup(A(BC)) = sup(A(BC)C) = 40 %.
2. Pruning a prefix: Consider a prefix Sp = e1 e2 ··· en. If there is an item e before the starting position of Sp in each of the data sequences containing Sp in a sequence database D, then extension by prefix Sp can be pruned. Based on the starting position (startPos) of each sequence in the DBVPattern, the proposed algorithm can check this quickly by comparing the start positions of two sequences. For example, consider the database D in Table 1. There is no need to extend prefix B because there is a pattern A (startPos = 1) that occurs before B (startPos = 2) in each data sequence that contains prefix B. If prefix B were extended, the results obtained would be absorbed, since the extension of prefix A already contains B and has the same support.

Figure 1 illustrates frequent closed sequence mining using the two operations above for the example database in Table 1.
3. Stopping generation of sequential rules for a subtree of a prefix: Consider three nodes n, n1, and n2, where n1 is a child node of n, and n2 is a child node of n1. Since sup(n2) ≤ sup(n1), if sup(n1)/sup(n) < minConf, then sup(n2)/sup(n) < minConf. Thus, if the confidence of the rule r = pre → post is less than minConf, then we can safely stop generating the rules for all child nodes of post. For example, suppose that minConf = 65 % in Fig. 1; then there is no need to generate rules for nodes ABA, ABB, and ABC (child nodes of AB) because the confidence of rule A → B is 60 % (less than minConf).

Table 7 ClosedPattern-Extension method: mining frequent closed sequences

Method: ClosedPattern-Extension (root, minSup)
Input: Prefix tree root and minSup
Output: Set of frequent closed sequences in prefix tree root
9   listNode = child nodes of root;
10  For (each S_p in listNode) do
11      If (S_p is not pruned) then
12          For (each S_a in listNode) do
13              If (sup(S_pa = Sequence-extension(S_p, S_a)) ≥ minSup) then
14                  Add S_pa as a child node of S_p;
15              If (sup(S_pa = Itemset-extension(S_p, S_a)) ≥ minSup) then
16                  Add S_pa as a child node of S_p;
17          End For
18          Call ClosedPattern-Extension (S_p, minSup);
19      End If
20      Check and set the attribute of S_p: closed, prefixed generator, or NULL;
21  End For

Table 8 Generate-NRRule method: generating non-redundant rules from a prefix tree

Method: Generate-NRRule (root, minConf, nr-SeqRule)
Input: Prefix tree root and minConf
Output: Set of non-redundant sequential rules nr-SeqRule
22  pre = sequence of root;
23  subNode = child nodes of root;
24  For (each node S_r in subNode) do
25      If (S_r is a prefixed generator) then
26          For (each node S_n in the subtree with root S_r) do
27              r = pre → post, where pre + post = the sequence of S_n;
28              If ((sup(S_n) / sup(pre)) ≥ minConf) then
29                  nr-SeqRule = nr-SeqRule ∪ {r};
30              Else Stop generating rules for child nodes of S_n;
31          End For
32      End If
33      Call Generate-NRRule (S_r, minConf, nr-SeqRule);
34  End For
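The confidence-based cut above can be sketched with a tiny prefix-tree fragment from Fig. 1 (`Node` and `generate_rules` are illustrative helpers, not the paper's structures):

```python
class Node:
    """Hypothetical prefix-tree node: a sequence label, its support,
    and its child nodes."""
    def __init__(self, seq, sup, children=()):
        self.seq, self.sup, self.children = seq, sup, list(children)

def generate_rules(pre, pre_sup, node, min_conf, out):
    """Emit pre -> post rules; once confidence falls below minConf,
    skip the whole subtree, since a child's support never exceeds
    its parent's, so confidence cannot recover."""
    conf = node.sup / pre_sup
    if conf < min_conf:
        return  # stop generating rules for all child nodes
    out.append((f'{pre} -> {node.seq}', round(conf, 2)))
    for child in node.children:
        generate_rules(pre, pre_sup, child, min_conf, out)

# A: 100 % with child AB: 60 %, whose children are ABA/ABB/ABC: 40 %
ab = Node('B', 0.6, [Node('BA', 0.4), Node('BB', 0.4), Node('BC', 0.4)])
rules = []
generate_rules('A', 1.0, ab, 0.65, rules)
print(rules)  # [] -- A -> B has confidence 60 % < 65 %, so the subtree is skipped
```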
Table 6 shows the pseudo-code of the proposed NRD-DBV algorithm, which is based on the above principles.
Table 9 Relation between the number of nodes (n) and the average number of child nodes (k) in a prefix tree, with columns: database, minSup (%), number of nodes (n), and average number of child nodes (k)
Table 10 Definitions of parameters for generating databases using the IBM synthetic data generator

C  Average number of itemsets per sequence
T  Average number of items per itemset
S  Average number of itemsets in maximal sequences
I  Average number of items in maximal sequences
N  Number of distinct items
The algorithm first scans the given sequence database D to find frequent 1-sequences and stores them in fcs as DBVPatterns (line 3). Then, the 1-sequences in fcs are added to a prefix tree with the root of the tree set to NULL (line 4). On line 6, the algorithm performs the sequence extension for each child node of the root by calling the ClosedPattern-Extension method in Table 7. After finding all frequent closed sequences, the algorithm generates all significant sequential rules by calling the Generate-NRRule method in Table 8.
In Table 7, the ClosedPattern-Extension method is used to extend sequences of a given prefix group. The method executes line 18 recursively until no more frequent closed sequences are generated. The sequence extension is performed in two forms: sequence extension (line 13) and itemset extension (line 15). Before the sequence extension, the algorithm tests and eliminates prefixes that cannot be used to extend frequent closed sequences using the second extension judgment on line 11. If the sequence results obtained are frequent, they are stored as child nodes of the prefix. The prefix S_p is then checked and marked as a frequent closed sequence or a prefixed generator by using the first extension judgment and Definition 11 (line 20). Otherwise, it is set to NULL.

Fig. 2 Comparison of runtime for (a) C6T5S4I4N1kD1k and (b) C6T5S4I4N1kD10k with various minConf values (minSup = 0.5 %)
After finding all frequent closed sequences, the algorithm generates all significant sequential rules by calling the Generate-NRRule method in Table 8. For a prefixed generator in a node of a given prefix tree, the algorithm generates all rules within the subtree with that node as the prefix (line 25). In this process, the third sequence extension judgment is used to stop generating rules for child nodes that do not meet the minConf value (line 30). The method is executed recursively for all nodes in the prefix tree (line 33).

Fig. 3 Comparison of runtime for (a) C6T5S4I4N1kD1k and (b) C6T5S4I4N1kD10k with various minSup values (minConf = 50 %)
Let n be the number of nodes in a prefix tree (a complete set of frequent closed sequences), and let k be the average number of child nodes in the prefix tree. For each node that is a prefixed generator, the rule generation process is performed on its child nodes once. Thus, based on the prefix tree structure, rule generation is performed n × k times. However, if the set of frequent sequences were not enumerated on the prefix tree, then for each sequence, (n − 1) operations would be needed for checking and generating sequential rules. The complexity of generating rules is therefore O(n × k). Since k << n (Table 9 shows the relation between n and k in some sequence databases), the complexity of the NRD-DBV algorithm is approximately O(n).
5 Experimental results
Experiments were performed to evaluate the proposed algorithm. The CNR algorithm [8], a state-of-the-art method, was used for comparison. Both algorithms were implemented on a personal computer with an Intel Core i7 3.4-GHz CPU and 8.0 GB of RAM running Windows 8.1. The runtime was measured in seconds (s), and the memory usage was measured in megabytes (MB).

Fig. 4 Comparison of memory usage for (a) C6T5S4I4N1kD1k and (b) C6T5S4I4N1kD10k with various minConf values (minSup = 0.5 %)
5.1 Experiments on synthetic databases
The synthetic databases used for comparison were generated using the IBM synthetic data generator. The definitions of the parameters used to generate the databases are shown in Table 10.

Two databases, C6T5S4I4N1kD1k and C6T5S4I4N1kD10k, were used for the comparison of runtime and memory usage. First, experiments were conducted to compare the execution time of the two algorithms for various minConf values. Figure 2a, b show the runtime with various minConf values (from 50 to 90 %) for databases C6T5S4I4N1kD1k and C6T5S4I4N1kD10k, respectively. The minSup value was set to 0.5 % for all cases in this test. Figure 3a, b show the runtime with various minSup values (from 0.3 to 0.7 %). The minConf value was set to 50 % in this test.
As shown in Fig. 2, the runtime increased with decreasing minConf value. This is due to more non-redundant sequential rules being obtained for smaller minConf values. For example, for the database C6T5S4I4N1kD1k with minConf = 90 %, only 29 non-redundant sequential rules were generated. For minConf = 50 %, the number of non-redundant rules increased to 3135. Thus, the execution time for generating rules increased with decreasing minConf value. The results also show that NRD-DBV was faster than CNR in all cases.

Fig. 5 Comparison of memory usage for (a) C6T5S4I4N1kD1k and (b) C6T5S4I4N1kD10k with various minSup values (minConf = 50 %)
Next, experiments were conducted to compare the memory usage of the two algorithms. The results obtained for various minConf and minSup values are shown in Figs. 4 and 5, respectively.

Similar to the trends in Figs. 2 and 3, the amount of memory required increased with decreasing minConf value because of the increasing number of sequential rules. Since NRD-DBV uses the DBVPattern structure and prunes the subtrees of prefixes that infer non-significant rules early in the process, the memory usage of NRD-DBV was lower than that of CNR in all cases.
5.2 Experiments on a real database
A real database named Gazelle was used to evaluate the performance of the algorithms. This database contains 59,601 sequences of clickstream data from an e-commerce website with 497 distinct items. Figure 6a, b show the runtime and memory usage, respectively, for the Gazelle database with minSup = 0.05 % and minConf set from 5 to 9 %. Figure 7a, b show the runtime and memory usage, respectively, with various minSup values (from 0.03 to 0.07 %).

Fig. 6 Comparison of runtime and memory usage for the Gazelle database with various minConf values (minSup = 0.05 %)

Fig. 7 Comparison of runtime and memory usage for the Gazelle database with various minSup values (minConf = 5 %)

The results in Figs. 6 and 7 show that NRD-DBV outperforms CNR in most cases for the real database.
6 Conclusions and future work
This paper proposed the NRD-DBV algorithm, which uses DBVs and data sequence information to generate non-redundant sequential rules. The NRD-DBV algorithm first finds all frequent closed patterns in a given sequence database. A prefix tree, which is suitable for generating sequential rules, is built during the mining of frequent closed sequences. Based on the prefix tree, the algorithm generates non-redundant sequential rules quickly. This process is further improved by stopping rule generation early for a supersequence of the postfix if there is a rule with low confidence. The NRD-DBV algorithm scans the database
only once. Based on the DBV data structure, the supports of patterns can be determined quickly, and bit operators are used to extend new patterns. Due to its use of a compressed structure and a prefix tree, the NRD-DBV algorithm is more efficient than the CNR algorithm in terms of memory usage and runtime.
Non-redundant sequential rules can be applied effectively in the field of Web mining, for example to analyze Web behavior or restructure Web sites, in order to help companies better promote their products or services. Based on an educational database, non-redundant sequential rules can be used to help students select appropriate subjects or majors. In addition, non-redundant sequential rules can be applied to predict the pathways of WiFi users in a university or a company in order to manage bandwidth automatically.
Mining frequent inter-sequences can find itemsets across several transactions and discover the order relationships of itemsets within a transaction. Thus, the mining of non-redundant inter-sequential rules for such patterns can be applied to reduce the number of redundant rules. The NRD-DBV algorithm could be further extended to solve these problems.
Acknowledgments This work was funded by Vietnam’s National
Foundation for Science and Technology Development (NAFOSTED)
under grant number 102.05-2015.07.
References
1. Agrawal R, Srikant R (1995) Mining sequential patterns. In: IEEE international conference on data engineering, pp 3–14
2. Ayres J, Gehrke J, Yiu T, Flannick J (2002) Sequential pattern mining using a bitmap representation. In: 8th ACM SIGKDD international conference on knowledge discovery and data mining, Edmonton, pp 429–435
3. Chiu DY, Wu YH, Chen ALP (2004) An efficient algorithm for mining frequent sequences by a new strategy without support counting. In: 20th international conference on data engineering, pp 375–386
4. Dong J, Han M (2007) BitTableFI: an efficient mining frequent itemsets algorithm. Knowl-Based Syst 20(4):329–335
5. Fournier-Viger P, Gomariz A, Campos M, Thomas R (2014) Fast vertical mining of sequential patterns using co-occurrence information. In: Advances in knowledge discovery and data mining, LNAI, vol 8443, pp 40–52
6. Gomariz A, Campos M, Marin R, Goethals B (2013) ClaSP: an efficient algorithm for mining frequent closed sequences. In: Advances in knowledge discovery and data mining, LNAI, vol 7818, pp 50–61
7. Le B, Tran MT, Vo B (2015) Mining frequent closed inter-sequence patterns efficiently using dynamic bit vectors. Appl Intell 43:74–84
8. Lo D, Khoo SC, Wong L (2009) Non-redundant sequential rules: theory and algorithm. Inf Syst 34(4):438–453
9. Pei J, et al (2001) PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: International conference on data engineering, pp 215–224
10. Pham TT, Luo J, Hong TP, Vo B (2012) MSGPs: a novel algorithm for mining sequential generator patterns. In: Computational collective intelligence, technologies and applications, LNCS, vol 7654, pp 393–401
11. Pham TT, Luo J, Vo B (2013) An effective algorithm for mining closed sequential patterns and their minimal generators based on prefix trees. Int J Intell Inf Database Syst 7(4):324–339
12. Pham TT, Luo J, Hong TP, Vo B (2014) An efficient method for mining non-redundant sequential rules using attributed prefix-trees. Eng Appl Artif Intell 32:88–99
13. Song S, Hu H, Jin S (2005) HVSM: a new sequential pattern mining algorithm using bitmap representation. In: Advanced data mining and applications, pp 455–463
14. Song W, Yang B, Xu Z (2008) Index-BitTableFI: an improved algorithm for mining frequent itemsets. Knowl-Based Syst 21(6):507–513
15. Spiliopoulou M (1999) Managing interesting rules in sequence mining. In: Principles of data mining and knowledge discovery, Springer, Berlin Heidelberg, pp 554–560
16. Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Apers PMG, Bouzeghoub M, Gardarin G (eds) EDBT 1996, LNCS, vol 1057, pp 3–17
17. Tran MT, Le B, Vo B (2015) Combination of dynamic bit vectors and transaction information for mining frequent closed sequences efficiently. Eng Appl Artif Intell 38:183–189
18. Van TT, Vo B, Le B (2011) Mining sequential rules based on prefix-tree. In: New challenges for intelligent information and database systems, Springer, Berlin Heidelberg, pp 147–156
19. Van TT, Vo B, Le B (2014) IMSR_PreTree: an improved algorithm for mining sequential rules based on the prefix-tree. Vietnam J Comput Sci 1(2):97–105
20. Wang J, Han J, Li C (2007) Frequent closed sequence mining without candidate maintenance. IEEE Trans Knowl Data Eng 19(8):1042–1056
21. Yan X, Han J, Afshar R (2003) CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of SIAM international conference on data mining, pp 166–177
22. Yang Z, Kitsuregawa M (2005) LAPIN-SPAM: an improved algorithm for mining sequential pattern. In: ICDE workshops, p 1222
23. Zaki M (2001) SPADE: an efficient algorithm for mining frequent sequences. Mach Learn 42(1–2):31–60