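University Code: 10532 Student ID: LB2010034
Classification No.: TP391 Security Level: Normal
Doctoral Dissertation: Research on Sequential Pattern Mining Algorithms Based on the Prefix-Tree Structure (English Edition) Applicant's Name: PHAM THI THIET
Institution: College of Information Science and Engineering
Supervisor and Title: Professor Luo Jiawei
Major: Computer Science
Research Field: Data Mining and Knowledge Discovery
Thesis Submission Date: 2013-06-07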
University ID: 10532 Student ID: LB2010034
Security Level: Normal
Hunan University Doctoral Thesis
RESEARCH ON THE SEQUENTIAL PATTERN MINING ALGORITHMS USING PREFIX-TREE STRUCTURE
Applicant’s Name : PHAM THI THIET
College : Information Science and Engineering
Supervisor : Professor Luo Jiawei
Major : Computer Science
Research Field : Data Mining and Knowledge Discovery
Submission Date : 2013-05
Defense Date : 2013-06-07
Defense Committee Chairman : Professor Bo Liao
Research On The Sequential Pattern Mining Algorithms Using Prefix-Tree Structure
By
PHAM THI THIET
Masters in Computer Science (Guru Gobind Singh Indraprastha University) 2008
A dissertation submitted in partial satisfaction of the
Requirements for the degree of
HUNAN UNIVERSITY
DECLARATION
I, Pham Thi Thiet, hereby declare that the work presented in this PhD thesis
titled “Research on The Sequential Pattern Mining Algorithms Using Prefix-Tree
Structure”, is my original work and has not been presented elsewhere for any academic
qualification. Where references have been used from books, published papers, reports and web sites, they are fully acknowledged in accordance with the standard referencing practices of the discipline.
Student’s signature: ……… Date: ……… …
This thesis belongs to:
1 Secure □, and this power of attorney is valid after
DOCTORAL THESIS
ABSTRACT
With the rapid development of computer and internet technology, the amounts of data gathered from various kinds of applications have become enormous and far exceed what humans can comprehend without powerful tools. This has been described as a data-rich but information-poor situation. Therefore, finding the valuable information and necessary knowledge hidden in vast amounts of data has become one of the most important tasks in data mining research. The variety and richness of data have given rise to different kinds of data, including transaction data, sequence data, stream data, time-series data, and so on.
Sequence data is an important type of data that occurs frequently in many applications. It is composed of sequences of ordered elements or events, listed with or without a specific notion of time. Although many general data mining methods exist for other kinds of data, they cannot be applied directly to sequence data, because sequence data has its own unique features. Sequence data appears in many interesting applications, which leads to many new kinds of knowledge to be discovered, including sequential patterns, approximate biological sequence patterns, partially ordered patterns, periodic patterns, motifs, and so on. These kinds of patterns assist the development of new classification, clustering and outlier analysis methods, which in turn call for the development of new kinds of applications.
Sequential pattern mining is one of the important tasks of data mining research and is widely used in sequence data mining applications. The process of sequential pattern mining is to extract frequent subsequences from a sequence database. This work has attracted much attention from researchers in data mining. Many studies have examined sequential pattern mining; however, the main challenges remain: large search spaces and ineffectiveness in handling dense datasets. To address these challenges, the problems of mining closed sequential patterns, sequential generator patterns, and sequential rules have been proposed. In this thesis, we propose novel algorithms to address these problems with the following two main objectives:
⚫ Exploit secondary information, namely sequential patterns, closed sequential patterns, and sequential generator patterns, based on the corresponding prefix-tree structures.
⚫ Generate the kinds of sequential rules based on the secondary information in the prefix-tree structure.
In this thesis, we make four main contributions, which can be briefly described as follows:
⚫ Firstly, this thesis considers several interestingness measures, such as Lift, Conviction, Piatetsky-Shapiro, Cosine, and Jaccard, which have been proposed for mining association rules and classification rules but, unlike the traditional support and confidence measures, have not been applied to mining sequential rules in sequence databases. We then propose an efficient algorithm to generate all relevant sequential rules with these interestingness measures from a prefix-tree that stores the whole set of sequential patterns, where each child node stores a sequential pattern and its corresponding support value. By traversing the prefix-tree, the algorithm can easily identify the components of a rule and calculate the measure values of the rule. The experimental results show that sequential rule mining with interestingness measures using the proposed prefix-tree-based algorithm was always much faster than using the existing modified Full algorithm. Especially when mining large sequence databases with low minimum support values, the number of sequential patterns generated was large and the proposed algorithm performed much better, because it only traverses the prefix-tree to immediately determine which sequences are the left- and right-hand sides of a rule, as well as their support values, in order to compute the interestingness measure values of the rule. In addition, the experimental results show that the time for mining sequential rules with the confidence measure was the smallest, because it did not need to revisit the prefix-tree to determine the support of Y (the consequent of a rule), while the other interestingness measures need to revisit the prefix-tree to determine the support of the consequent, or of both the antecedent and the consequent.
⚫ Secondly, the characteristics of sequential generator patterns are combined with the extension of a sequence on the prefix-tree to propose two efficient algorithms, called MSGPs and MSGP_PreTree, for finding all the sequential generator patterns while the sequential patterns are being generated. Using the prefix-tree, new sequences, which are child nodes, can be easily created by appending an item to the last position of a parent node as an itemset extension or a sequence extension. The proposed algorithms use the prime block encoding approach to represent candidate sequences and use join operations over the prime blocks to determine the frequency of each candidate. The MSGPs algorithm uses a hash table to store sequential generator patterns, with the support of the pattern as the hash key for fast checking. The MSGP_PreTree algorithm improves on MSGPs to generate all sequential generator patterns. The idea of the improved algorithm is to modify the prefix-tree so that each node has additional fields indicating whether the sequence stored in the node is a sequential generator pattern. All the information about the sequence is stored on the prefix-tree, so MSGP_PreTree does not need a hash table to store sequential generator patterns, which significantly reduces memory use. Supersequence frequency-based pruning and non-generator-based pruning on the prefix-tree are applied in MSGP_PreTree to reduce the search space. The process of extending the prefix-tree and determining sequential patterns in MSGP_PreTree is similar to that of MSGPs. All the experimental results on synthetic and real databases show that the number of sequential generator patterns is always smaller than the number of sequential patterns, and that in all cases the proposed algorithms outperform the other algorithm in terms of running time.
⚫ Thirdly, we propose an efficient algorithm, called CloGen (Closed sequential pattern-sequential Generator pattern), for directly finding both closed sequential patterns and their sequential generator patterns during the sequential pattern generation process. It is based on the combination of the child-parent relationship on the prefix-tree structure and the definitions of closed sequential pattern and sequential generator pattern. Each node on the prefix-tree in our approach stores a sequential pattern and its corresponding support value. In addition, each node has one field (IsmSGP) that indicates whether the node is a minimal sequential generator pattern, and another field (IsCSP) that indicates whether the node is a closed sequential pattern. Based on these fields, the algorithm easily determines whether the sequence at each node is a minimal sequential generator pattern or a closed sequential pattern, so the mining time is reduced significantly. This algorithm also uses the prime block encoding approach, based on prime factorization theory, to represent candidate sequences, and join operations over the prime blocks to determine the frequency of each candidate. Experimental results show that mining closed sequential patterns and their minimal sequential generator patterns with the CloGen algorithm is faster by more than one order of magnitude. The CloGen algorithm can generate all sequential patterns, sequential generator patterns, and closed sequential patterns at the same time. Furthermore, the prefix-tree built in our approach is expected to serve as an efficient structure for future mining of non-redundant sequential rules, and also for mining all sequential rules.
⚫ Fourthly, an efficient algorithm called MNSR-Pretree for mining non-redundant sequential rules is proposed. The algorithm is decomposed into two phases. In the first phase, it builds a prefix-tree that stores all the sequential patterns from a given sequence database. In the second phase, it mines non-redundant sequential rules from this prefix-tree. In the prefix-tree building process, each node on the prefix-tree has a field (IsmSGP) that indicates whether the node is a minimal sequential generator pattern, and another field (IsCSP) that indicates whether the node is a closed sequential pattern; both are set by the CloGen algorithm of the previous contribution. By traversing the prefix-tree, non-redundant sequential rules can be easily mined from a minimal sequential generator pattern X to a closed sequential pattern Y such that X is a prefix of Y, which greatly reduces the mining time required. Based on the values of IsmSGP and IsCSP, the MNSR-Pretree algorithm only mines rules from a parent node whose IsmSGP value is true to child nodes whose IsCSP value is true. The sequence at the parent node is the antecedent of the rules to be generated, and the consequents are generated by removing the prefix part, which is the sequence at the parent node, from the closed sequential patterns. The experimental results on synthetic and real databases show that the number of non-redundant sequential rules is much smaller than that of sequential rules, and that the time required for mining non-redundant sequential rules is much less than that required for mining sequential rules. In addition, the results show that the proposed algorithm mines non-redundant sequential rules in less time than an existing algorithm.
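The traversal described in the fourth contribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the MNSR-Pretree algorithm itself: the names `Node`, `is_msgp`, `is_csp`, `descendants`, and `mine_rules` are hypothetical, patterns are simplified to sequences of single items rather than itemsets, the flags are set by hand instead of by CloGen, and rules are drawn from a flagged node to all of its flagged descendants (one reading of "X is a prefix of Y").

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    pattern: tuple            # sequence of single-item events, e.g. ("A", "B")
    support: int              # number of database sequences containing the pattern
    is_msgp: bool = False     # stands in for the IsmSGP field
    is_csp: bool = False      # stands in for the IsCSP field
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node below `node`; each one has node.pattern as a prefix."""
    for child in node.children:
        yield child
        yield from descendants(child)


def mine_rules(root, min_conf):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    stack = list(root.children)
    while stack:
        x = stack.pop()
        stack.extend(x.children)
        if not x.is_msgp:
            continue                      # antecedent must be a minimal generator
        for y in descendants(x):
            if not y.is_csp:
                continue                  # rule target must be a closed pattern
            conf = y.support / x.support
            if conf >= min_conf:
                post = y.pattern[len(x.pattern):]   # strip the prefix X from Y
                rules.append((x.pattern, post, y.support, conf))
    return rules


# Hand-built toy tree; supports and flags are illustrative, not mined.
abc = Node(("A", "B", "C"), 2, is_msgp=False, is_csp=True)
ab = Node(("A", "B"), 4, is_msgp=False, is_csp=True, children=[abc])
a = Node(("A",), 4, is_msgp=True, is_csp=False, children=[ab])
root = Node((), 0, children=[a])

for pre, post, sup, conf in mine_rules(root, min_conf=0.5):
    print(pre, "->", post, f"(sup={sup}, conf={conf:.2f})")
```

Because both supports are already stored in the nodes, the confidence of each rule is obtained without rescanning the sequence database, which is the source of the time savings claimed above.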
In summary, in this thesis we have proposed efficient algorithms and met the objective stated at the outset: "To improve the efficiency of algorithms that exploit secondary information, such as sequential patterns, closed sequential patterns, and sequential generator patterns, based on the prefix-tree structure". The main contribution is "the use of the prefix-tree to efficiently generate kinds of sequential rules, namely sequential rules with interestingness measures and non-redundant sequential rules, from the secondary information". The goal of this thesis has been achieved by using the child-parent relationship on the prefix-tree structure and the extension of sequences to propose novel algorithms for mining tasks related to sequential patterns in a sequence database, including algorithms for mining sequential rules with interestingness measures, mining sequential generator patterns, mining closed sequential patterns and their sequential generator patterns, and mining non-redundant sequential rules. The proposed methods were evaluated on both synthetic and real datasets. Experimental results illustrate the effectiveness and efficiency of our algorithms.
Keywords: Sequential pattern, closed sequential pattern, sequential generator pattern, interestingness measure, sequential rule, non-redundant sequential rule, prefix-tree
so that the other terms can easily be expressed as $n_{X\overline{Y}} = n_X - n_{XY}$,
sequential pattern. If pre is concatenated with post, denoted pre++post, the result is the original sequential pattern. A sequential rule r can thus be formed as pre → post (Sup, imv). The support of r, Sup(r),
rule "pre → post" is formed, where post is the postfix of SP with respect to the prefix pre.
Most interestingness measures of a rule depend on the support of Post. To obtain the support of Post, the procedure FIND_SUP_POST(RNode, Post) is called, where RNode is the first root node in the prefix-tree for Post and is non-empty. The FIND_SUP_POST(RNode, Post) procedure obtains the support of Post by traversing all branches of the prefix-tree rooted at RNode, where RNode is, for Post, the
EXTEND_SEQUENCE creates a new pattern Pnew by appending each item in dbpat to the last position of the node being extended. Each added item … the last itemset of the new node Pnew. If Pnew
generator, then Pnew is added to pretree as a child node of the node being extended. EXTEND_SEQUENCE creates a new pattern Pnew by appending each item in dbpat to the last position of the node being extended. Each added item … the last itemset of the new node Pnew. If Pnew's
block. Because each newly created child node Pnew is assigned {IsmSGP, IsCSP} = {true, true}, if Sup(Pnew) = Sup(P), then Pnew.IsmSGP and P.IsCSP are set to false. UPDATE_PRETREE(Pnew, pretree) is called to update the closed sequential patterns and the sequential generators of the prefix-tree
TABLE OF CONTENTS
DECLARATION
ABSTRACT
ABSTRACT (IN CHINESE)
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1 Overview of the sequence database in data mining
1.2 Motivation
1.3 Sequential pattern
1.4 Closed sequential pattern
1.5 Sequential generator pattern
1.6 Sequential rule
1.7 Objective of the thesis
1.8 Contributions of the thesis
1.9 Organization of the thesis
CHAPTER 2: DEFINITIONS AND RELATED WORKS
2.1 Introduction
2.2 Sequential Pattern Mining
2.2.1 Definitions
2.2.2 Organization of the sequence data
2.2.3 Prefix-tree Structure
2.2.4 Sequential patterns mining algorithms
2.2.4.1 AprioriAll
2.2.4.2 GSP
2.2.4.3 PSP
2.2.4.4 SPADE
2.2.4.5 PrefixSpan
2.2.4.6 SPAM
2.2.4.7 PRISM
2.3 Closed sequential patterns mining
2.3.1 CloSpan
2.3.2 BIDE
2.4 Sequential generator patterns mining
2.4.1 GenMiner
2.4.2 FEAT
2.4.3 FSGP
2.5 Sequential rules mining
2.6 Non-redundant sequential rules mining
2.7 Summary
CHAPTER 3: MINING SEQUENTIAL RULE WITH INTERESTINGNESS MEASURES USING PREFIX-TREE
3.1 Introduction
3.2 Problem statement
3.3 Mining sequential rules with interestingness measures
3.3.1 Interestingness measures
3.3.2 Algorithm
3.3.3 Illustration
3.3.4 Experiments
3.4 Summary
CHAPTER 4: SEQUENTIAL GENERATOR PATTERN MINING
4.1 Introduction
4.2 Unique Characteristics of Sequential Generator Patterns
4.3 Mining sequential generator pattern on hash table
4.3.1 Algorithm
4.3.2 Illustration
4.3.3 Experiments
4.4 Mining sequential generator pattern on prefix-tree
4.4.1 Algorithm
4.4.2 Illustration
4.4.3 Experiments
4.5 Summary
CHAPTER 5: CLOSED SEQUENTIAL PATTERNS AND THEIR MINIMAL SEQUENTIAL GENERATOR PATTERNS MINING
5.1 Introduction
5.2 Definitions
5.3 Mining closed sequential patterns and their minimal sequential generator patterns
5.3.1 CloGen Algorithm
5.3.2 Illustration
5.3.3 Experiments
5.4 Summary
CHAPTER 6: NON-REDUNDANT SEQUENTIAL RULE MINING
6.1 Introduction
6.2 Definitions
6.3 Mining non-redundant sequential rules based on prefix-tree
6.3.1 Algorithm
6.3.2 Illustration
6.3.3 Experiments
6.4 Summary
CONCLUSION AND FUTURE RESEARCH WORKS
1 Summary of the thesis
2 Future works
REFERENCES
APPENDIX A: LIST OF RESEARCH PUBLICATIONS
APPENDIX B: PROJECTS
ACKNOWLEDGMENTS
LIST OF FIGURES
Figure 1.1 A DNA sequence fragment
Figure 1.2 A protein sequence fragment
Figure 1.3 A weblog sequence
Figure 1.4 A customer purchase history
Figure 1.5 A storewide sales history
Figure 2.1 The Prefix-tree structure
Figure 2.2 AprioriAll Algorithm
Figure 2.3 A prefix-tree structure used in PSP algorithm
Figure 2.4 The SPADE Algorithm
Figure 2.5 The pseudo-code for the Enumerate_Seq[X] procedure
Figure 2.6 The PrefixSpan algorithm
Figure 2.7 The lexicographical tree of sequences
Figure 2.8 A bitmap representation of the sequence database in Table 2.5
Figure 2.9 The SPAM algorithm
Figure 2.10 Lattice built over P(G), each node shows a set S ∈ P(G) under bit-vector S_B and the value obtained by multiplying its elements, ⊗S
Figure 2.11 Example of primal block encoding
Figure 2.12 Extensions via Prime Block Joins
Figure 2.13 The CloSpan algorithm
Figure 2.14 The BIDE algorithm
Figure 2.15 The GenMiner algorithm
Figure 2.16 A sample Prefix Search Tree (a) and Prefix Search Lattice (b)
Figure 2.17 The FEAT algorithm
Figure 2.18 The FSGP algorithm
Figure 2.19 The Full algorithm
Figure 2.20 The MSR_ImpFull algorithm
Figure 2.21 The MSR_PreTree algorithm
Figure 3.1 A prefix-tree structure storing sequential patterns from Table 2.1
Figure 3.2 The interestingness measures roles in data mining process
Figure 3.3 The proposed algorithm for generating sequential rules based on a prefix-tree
Figure 3.4 The mining times of the two algorithms for different interestingness measures in
database in Table 2.1 with minSup = 50%
Figure 4.4 The MSGP-PreTree algorithm for generating set of sequential generator patterns
Figure 4.5 The comparison between number of sequential patterns and sequential generator patterns in databases
Figure 4.6 The mining sequential generator patterns times of two algorithms in databases
Figure 5.1 The CSGM algorithm
Figure 5.2 An algorithm for generating closed sequential patterns and their minimal sequential generator patterns
Figure 5.3 Level 1 of the pretree tree (each node contains: sequential pattern, support, IsmSGP, and IsCSP)
= 50%
Figure 6.2 Algorithm for generating a set of non-redundant sequential rules
Figure 6.3 Runtime for mining sequential rules and non-redundant sequential rules for the database
LIST OF TABLES
Table 2.1 An example sequence database (SD)
Table 2.2 Sequence database
Table 2.3 Horizontal Format
Table 2.4 Vertical Format
Table 2.5 A sequence database
Table 2.6 Sequential patterns
Table 2.7 The set of sequential rules generated from the set of sequential patterns
Table 3.1 Some interestingness measures for a rule X → Y
Table 3.2 The sequential rules generated for any interestingness measure in Table 3.1 with minThreshold = 0
Table 3.3 The sequential rules with minThreshold = 0.8
Table 3.4 The time ratios for different interestingness measures
Table 4.1 The list of sequential patterns and sequential generator patterns
Table 4.2 Experimental results for three databases
Table 5.1 Results of all sequential patterns, closed sequential patterns and sequential generator patterns
Table 6.1 Sequential rules and non-redundant rules obtained with minConf = 50%
Table 6.2 Numbers of sequential rules and non-redundant sequential rules obtained from three databases with minConf = 0%
FP-tree Frequent Pattern-tree
id-list identifiers list
imv interestingness measure value
IsCSP Is Closed Sequential Pattern
IsmSGP Is minimal Sequential Generator Pattern
litemset large itemset
minThreshold minimum interestingness measure Threshold
mSGP() set of all minimal Sequential Generator Patterns of
MSGP_PreTree Mining Sequential Generator Pattern on Prefix-Tree
S_B bit-vector of sequence S, where B is a bit-vector of length N
SGP() set of Sequential Generator Patterns of
CHAPTER 1: INTRODUCTION
1.1 Overview of the sequence database in data mining
Due to the rapid development of computer and internet technology, the amounts of data gathered from various kinds of applications have become enormous and far exceed what humans can comprehend without powerful tools. This has been described as a data-rich but information-poor situation. Therefore, finding the valuable information and necessary knowledge hidden in vast amounts of data has become one of the most important tasks in data mining research. The diversity and richness of data have produced different kinds of data [1], including transaction data, sequence data, stream data, time-series data, and so on.
Sequence data is an important type of data that occurs frequently in many scientific and engineering [2~4], business [5~7], customer behavior analysis [8~9], stock trend prediction [10~11], DNA sequence analysis [12], web usage behaviour analysis [13~15], and other applications. It is composed of sequences of ordered elements or events, listed with or without a specific notion of time, such as biological sequences (Figure 1.1 and Figure 1.2), weblog sequences (Figure 1.3), sequences of customer purchase and sales histories (Figure 1.4 and Figure 1.5), and sequences of events in science, nature, or society. Although many general data mining methods exist for other kinds of data, they cannot be applied directly to sequence data, because sequence data has its own unique features. Sequence data appears in many interesting applications, which leads to many new kinds of knowledge to be discovered, including sequential patterns, approximate biological sequence patterns, partially ordered patterns, periodic patterns, motifs, and so on. These kinds of patterns assist the development of new classification, clustering and outlier analysis methods, which in turn call for the development of new kinds of applications. Besides, sequence data clearly describes the temporal relationships among data, so mining rules from sequence data is also expected to provide a lot of valuable hidden knowledge that is meaningful over time.
THE ALGORITHMS RESEARCH ON SEQUENTIAL PATTERNS MINING USING PREFIX-TREE STRUCTURE
GAATTCTCTGTAACACTAAGCTCTCTTCCTCAAAACCAGAGGTAGATAGAATGTGTAATAAT TTACAGAATTTCTAGACTTCAACGATCTGATTTTTTAAATTTATTTTTATTTTTTCAGGTTGAG ACTGAGCTAAAGTTAATCTGTGGC
Figure 1.1 A DNA sequence fragment
SSQIRQNYSTEVEAAVNRLVNLYLRASYTYLSLGFYFDRDDVALEGVCHEFRELAEEKREGAE RLLKMQNQRGGRALFQDLQKPSQDEWGTTPDAMKAAIVLEKSLNQALLDLHALGSAQADPH LCDFLESHFLDEEVKLIKKMGDHLTNIQRLVGSQAGLGEYLFERLTLKHD
Figure 1.2 A protein sequence fragment
100, a, 100,b, 200, a, 300, b 400, a, 100, a, 400, b, 300, a, 100, c, 200, c, 400, a,
400, e
Figure 1.3 A weblog sequence
223100, 05/26/06, 10am, CentralStation, {WholeMealBread, AppleJuice},
223100, 05/26/06, 11am, CentralStation, {Burger, Pepsi, Banana },
223100, 05/26/06, 4am, WalMart, {Milk, Cereal, Vegetable},
223100, 05/26/06, 10am, CentralStation, {WholeMealBread, AppleJuice}
Figure 1.4 A customer purchase history
97100, 05/06, {Apple : $85K, Bread : $100K, Cereal : $150K, …},
90089, 05/06, {Apple : $65K, Bread : $105K, Diaper : $20K, …},
97100, 05/06, {Apple : $95K, Bread : $110K, Cereal : $160K, …},
90089, 05/06, {Apple : $66K, Bread : $95K, Diaper : $22K, …}
Figure 1.5 A storewide sales history
Sequence data has several distinct characteristics compared with other kinds of data. As a result, sequence data mining brings many opportunities and challenges, and has drawn the attention of researchers. These characteristics include the following [5]:
⚫ Sequences can be very long. In a given sequence database, the sequences differ in length, and the variation may be very large. For example, the length of a gene can be as small as several hundred, but as large as over 100K.
⚫ A pattern can be a substring or a subsequence. Sometimes a pattern must occur as a substring of a sequence, i.e. the elements of the substring must be consecutive elements of the original supersequence, without gaps between elements. At other times, a pattern can occur as a subsequence of a sequence, allowing gaps between matching elements.
⚫ Absolute positions of elements in sequences may or may not be significant. For example, when we only want to know whether a sequence contains a pattern, we do not need to care at which absolute position the pattern occurs in the sequence.
⚫ The relative ordering/positional relationship between elements in sequences often plays an important role. For example, sequence XY is usually different from sequence YX. Furthermore, the distance between two elements in sequences is also often significant. The relative ordering/positional relationship between elements is a feature unique to sequences. This is the basic difference between sequence data and other kinds of data.
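The difference between substring and subsequence matching described in the list above can be made concrete with a short sketch. The function names `is_substring` and `is_subsequence` are hypothetical, chosen for illustration, and elements are single symbols rather than itemsets.

```python
def is_substring(pattern, sequence):
    """True if pattern occurs as consecutive elements of sequence (no gaps)."""
    n, m = len(sequence), len(pattern)
    return any(list(sequence[i:i + m]) == list(pattern) for i in range(n - m + 1))


def is_subsequence(pattern, sequence):
    """True if pattern occurs in order within sequence, with gaps allowed."""
    remaining = iter(sequence)
    # `element in remaining` consumes the iterator up to the first match,
    # so later elements of the pattern can only match later positions.
    return all(element in remaining for element in pattern)


dna = "GAACAC"
print(is_substring("ACA", dna), is_subsequence("ACA", dna))
print(is_substring("GCC", dna), is_subsequence("GCC", dna))
print(is_subsequence("XY", "YX"))
```

Every substring is also a subsequence, but not the other way round, and the last call shows why relative order matters: XY does not occur in YX under either notion.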
Several data mining tasks are commonly used in sequence data mining applications [5]: mining sequential patterns, classification of sequences, and clustering of sequences. A sequential pattern is a sequence of itemsets that frequently appear in a specific order, where all items in the same itemset are assumed to have the same transaction-time value or to fall within a time gap. Finding sequential patterns in a sequence database is an important problem and a focus of data mining research.
1.2 Motivation
Sequential pattern mining is one of the important tasks of data mining research and is commonly used in sequence data mining applications. It plays a fundamental role in mining associations [9,16~19], correlations [20], and many other interesting relationships among data. Moreover, it serves data classification [2], clustering [21~23], and other data mining tasks. The process of sequential pattern mining is to extract frequent subsequences from a sequence database. Sequential pattern mining methods have been widely studied for many related problems, including general sequential pattern mining [24~30], constraint-based sequential pattern mining [31~33], incremental sequential pattern mining [34~36], approximate sequential pattern mining [37~38], partial periodic pattern mining [6, 39], and temporal pattern mining in data streams [40].
Although many problems related to sequential pattern mining have been examined, the development of general sequential pattern mining methods is the most basic one. Hence, in this thesis, we only investigate general sequential pattern mining and the generation of rules from a sequence database. This work has attracted much attention from researchers in data mining. In this thesis, sequential pattern stands for general sequential pattern. Many studies have examined sequential pattern mining [24~30]; however, the main challenges remain: large search spaces and ineffectiveness in handling dense datasets. To address these challenges, the problems of mining sequential rules, closed sequential patterns, and sequential generator patterns have been proposed.
Sequential rules are generated from the set of sequential patterns. They express the temporal relationships between event sequences in a sequence database. Sequential rules can be considered a natural extension of sequential patterns, just as association rules are a natural extension of frequent itemsets. Like sequential patterns, sequential rules are applied in many areas, including trade [5], the stock market [8~9, 41], weather observation [42], e-learning [43], and software engineering [44~48]. Sequential rules have been used to remove irrelevant or spurious patterns from the set of sequential patterns by applying interestingness measures to the rules. To the best of our knowledge, there are many studies on interestingness measures for mining association rules [33,37,49~51] or classification rules [33,52] in transaction databases, but apart from the traditional measures, these measures have not been used to mine sequential rules in sequence databases.
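To illustrate what applying an interestingness measure to a rule involves, the sketch below computes several measures for a rule X → Y from four counts. The formulas are the commonly used textbook definitions and are an assumption here; the thesis's own definitions may differ in detail. The function name `interestingness` and its parameters are hypothetical.

```python
import math


def interestingness(n, n_x, n_y, n_xy):
    """Measures for a rule X -> Y from raw counts.

    n    : number of sequences in the database
    n_x  : support count of the antecedent X
    n_y  : support count of the consequent Y
    n_xy : support count of the whole rule (X followed by Y)
    """
    p_x, p_y, p_xy = n_x / n, n_y / n, n_xy / n
    confidence = n_xy / n_x
    return {
        "support": p_xy,
        "confidence": confidence,
        "lift": p_xy / (p_x * p_y),
        # Conviction is unbounded when the rule never fails.
        "conviction": math.inf if confidence == 1 else (1 - p_y) / (1 - confidence),
        "piatetsky_shapiro": p_xy - p_x * p_y,
        "cosine": p_xy / math.sqrt(p_x * p_y),
        "jaccard": n_xy / (n_x + n_y - n_xy),
    }


for name, value in interestingness(n=10, n_x=5, n_y=4, n_xy=2).items():
    print(f"{name}: {value:.3f}")
```

Note that confidence needs only the supports of X and of the whole rule, whereas every other measure listed here also needs the support of Y.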
Sequential generator patterns, used together with closed sequential patterns, can provide additional information that closed sequential patterns alone cannot, and they are often used for mining non-redundant sequential rules. Many efficient methods have been proposed to mine sequential patterns [24~30], closed sequential patterns [41,53~56], and sequential generator patterns [57~59]. However, these algorithms generate the different types of patterns separately, which consumes much time.
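As a point of reference for the two pattern types, the brute-force sketch below labels each pattern in a small, hand-made set. It assumes the usual definitions: a pattern is closed if no proper supersequence has the same support, and a generator if no proper subsequence has the same support. It also assumes single-item elements and ignores the empty sequence. The function names and the toy supports are illustrative only, and the quadratic comparison is exactly the kind of separate checking the proposed algorithms aim to avoid.

```python
def is_subsequence(sub, seq):
    """True if sub occurs in order within seq, with gaps allowed."""
    remaining = iter(seq)
    return all(item in remaining for item in sub)


def closed_and_generators(patterns):
    """patterns maps each frequent pattern (a tuple of items) to its support."""
    closed, generators = [], []
    for p, sup in patterns.items():
        same_support = [q for q, s in patterns.items() if s == sup and q != p]
        if not any(is_subsequence(p, q) for q in same_support):
            closed.append(p)          # no supersequence with equal support
        if not any(is_subsequence(q, p) for q in same_support):
            generators.append(p)      # no subsequence with equal support
    return closed, generators


# Frequent patterns of the toy database {ABC, ABC, AB, B} with minSup = 2.
patterns = {
    ("A",): 3, ("B",): 4, ("C",): 2,
    ("A", "B"): 3, ("A", "C"): 2, ("B", "C"): 2,
    ("A", "B", "C"): 2,
}
closed, generators = closed_and_generators(patterns)
print("closed:", closed)
print("generators:", generators)
```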
Non-redundant sequential rule mining can remove many low-quality sequential rules that are almost meaningless, and it reduces the time spent compared with generating the full set of sequential rules from the complete set of sequential patterns. Two algorithms were recently proposed by Lo et al., 2009 [47] and Zang et al., 2010 [60~61] to address this problem. These methods remove a significant number of redundant sequential rules but require a lot of time for checking sequential generator patterns and closed sequential patterns to generate rules.
1.3 Sequential pattern
Sequential patterns play an important role in data mining research. The sequential pattern mining problem was first proposed by Agrawal and Srikant [24] in 1995, and has attracted more and more attention from researchers in the field [25~30]. Given a sequence database, the sequential pattern mining problem is to find all frequent sequences, i.e. those that satisfy a user-specified minimum support threshold. Sequential patterns have a broad range of applications, including customer purchase behavior analysis [8~9], DNA sequence pattern analysis [12], web usage behavior analysis [13~14], guidance systems [62], and so on.
In the last decade, many algorithms and techniques have been proposed to improve the efficiency of mining sequential patterns. The SPADE [27] algorithm divides candidate sequences into distinct groups such that each group can be stored completely in main memory. PrefixSpan [28] examines the prefix subsequences and projects the corresponding postfix subsequences into projected databases. The SPAM [29] algorithm speeds up the mining process using a lexicographic sequence tree and a bitmap representation. The PRISM [30] algorithm uses the primal block encoding approach to represent candidate sequences and join operations over the primal blocks to determine the frequency of each candidate. Experimental results [30] showed that PRISM was one of the best methods for mining sequential patterns: it outperformed existing methods by an order of magnitude or more and had a low memory footprint.
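The core idea behind prime-based block encoding can be sketched in a few lines. This is a simplified illustration, not the PRISM algorithm: it assumes a single 8-bit block in which bit i records whether sequence i contains an item, encodes the block as the product of the primes at its set positions, and intersects two blocks with a greatest common divisor. The names `PRIMES`, `encode`, `decode`, and `join` are hypothetical, and sequence extensions (which need position masks) are left out.

```python
from math import gcd

PRIMES = (2, 3, 5, 7, 11, 13, 17, 19)   # one prime per bit of an 8-bit block


def encode(bits):
    """Encode a block as the product of the primes at its set bits (1 if empty)."""
    value = 1
    for bit, prime in zip(bits, PRIMES):
        if bit:
            value *= prime
    return value


def decode(value):
    """Recover the bit block from its square-free encoding."""
    return [1 if value % prime == 0 else 0 for prime in PRIMES]


def join(a, b):
    """Intersect two encoded blocks: common prime factors are common set bits."""
    return gcd(a, b)


item_a = encode([1, 0, 1, 0, 0, 1, 0, 0])   # sequences containing item a
item_b = encode([1, 1, 0, 0, 0, 1, 0, 0])   # sequences containing item b
both = join(item_a, item_b)
print(item_a, item_b, both, decode(both), "support =", sum(decode(both)))
```

Because each encoding is a product of distinct primes, the greatest common divisor keeps exactly the primes shared by both blocks, so the support of the joined candidate is the number of set bits in the result.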
1.4 Closed sequential pattern
When mining long frequent sequences, or when using very low support thresholds, the mining process generates an explosive number of frequent subsequences, because a long pattern contains a combinatorial number of frequent subsequences. This is prohibitively expensive in both time and space, so the performance of sequential pattern mining algorithms often degrades unexpectedly. To overcome this difficulty, the closed sequential pattern mining problem has been developed. A sequence is called closed if it has no supersequence with the same support in the sequence database. Mining closed sequential patterns may significantly reduce the number of patterns generated without losing any information, because the closed patterns can be used to derive the complete set of sequential patterns; the number of closed sequential patterns is usually smaller than the number of sequential patterns. Several studies have recently been proposed to mine closed sequential patterns [41,53~56]. One of them is the CloSpan algorithm [53]. Like most frequent closed itemset mining algorithms, such as CLOSET [63] and CHARM [64], CloSpan uses a candidate maintenance-and-test approach. It needs to maintain the set of already mined closed sequence candidates in order to perform the backward subpattern and backward superpattern checks that verify whether a newly found frequent sequence is promising to be closed. Consequently, it consumes much memory and leads to a huge search space for pattern closure checking when there are many frequent closed sequences. BIDE [54] is another, faster closed sequence mining algorithm. Unlike CloSpan, it uses a novel sequence closure checking scheme called BI-Directional Extension and prunes the search space further by using the BackScan pruning method and the ScanSkip optimization technique, so that it obtains the complete set of frequent closed sequential patterns directly, without candidate maintenance. Thus, in most cases, BIDE is more efficient than CloSpan, especially when a database is dense or the minimum support value is low. However, to implement the closure check, BIDE repeatedly scans the pseudo-projected database to verify the existence of extension positions for a prefix sequence, which costs much time in the mining process. To reduce this scanning time, Huang et al. [41] proposed the FCSM-PD algorithm, in which positional data is used to store the position information of items in the data sequences. In the pattern growth process, the extension positions for a prefix sequence are checked directly, and all the position information of the new prefix sequences is recorded. However, FCSM-PD must store all the position information of a prefix sequence in advance during pattern growth, so it consumes more memory.
1.5 Sequential generator pattern
In a sequence database, a sequential generator pattern is a pattern that has no subsequence with the same support. Sequential generator patterns used together with closed sequential patterns can provide additional information that closed sequential patterns alone cannot provide. According to the Minimum Description Length (MDL) principle [65], sequential generator patterns are the minimal members, and they are shorter than closed sequential patterns. Sequential generator patterns are therefore preferable to sequential patterns and closed sequential patterns for mining non-redundant sequential rules: the sequential generator patterns serve as the antecedents of rules, and for each sequential generator pattern the consequents are generated by removing from the closed sequential patterns the prefix part that they share with that generator. Several sequential generator mining methods [57~59] have recently been proposed. Lo et al. [57] proposed the first sequential generator mining algorithm, called GenMiner. The method extracts sequential generators in a three-step compact-generate-and-filter approach. In the first step, it traverses all the sequential patterns and produces a compact representation of the space of sequential patterns in a lattice format [54]. In the second step, it retrieves from the compact lattice a set of candidate generators, which is a superset of all generators, and prunes the sub-search spaces containing non-generators by using the unique characteristics of sequential generators [65], to ensure that the candidate generator set is not too large. In the final step, all non-generators are filtered from the candidate set. The FEAT algorithm was introduced by Gao et al. [58]. It is based on sequential pattern growth with forward and backward pruning strategies, along with a sequential generator checking technique to speed up the mining process. However, pruning non-generator sequences is time-consuming. To avoid the cost of pruning, the FSGP algorithm [59] was proposed. In FSGP, a safe pruning strategy based on the inclusion relationship between a sequence and its subsequence is used. Each valid frequent sequential pattern is checked against the sequential generator checking theorem within the set of valid frequent sequential patterns. The non-generators are then removed, yielding the resulting set of sequential generators.
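The two definitions (closed: no supersequence with the same support; generator: no subsequence with the same support) can be checked by brute force on a toy example. The sketch below is not CloSpan, BIDE, GenMiner, FEAT or FSGP. It assumes, for brevity, that every itemset holds a single item, so that a sequence can be written as a string, and that the supports of all frequent patterns are already known. The function names and the data are hypothetical.

```python
def is_subseq(a, b):
    """True if the events of a appear in b in the same order (gaps allowed)."""
    it = iter(b)
    return all(event in it for event in a)

def closed_and_generators(supports):
    """Split frequent patterns into closed patterns and generators."""
    closed, generators = set(), set()
    for p, sup in supports.items():
        same = [q for q, s in supports.items() if q != p and s == sup]
        if not any(is_subseq(p, q) for q in same):
            closed.add(p)       # no supersequence with the same support
        if not any(is_subseq(q, p) for q in same):
            generators.add(p)   # no subsequence with the same support
    return closed, generators

# Hypothetical supports for the toy database {"AB", "AB", "A"}.
supports = {"A": 3, "B": 2, "AB": 2}
closed, generators = closed_and_generators(supports)
print(sorted(closed), sorted(generators))
```

In this toy case the pattern B is a generator but not closed, while AB is closed but not a generator, which is why the two sets carry complementary information.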
1.6 Sequential rule
Many different kinds of rules based on sequence databases have been studied in recent years, such as recurrent rules [46], sequential rules [47,66~68], sequential classification rules [66], and interesting rules [67].
Of all these kinds of rules, the sequential rule is the most basic; the remaining kinds are usually sequential rules modified by adding, removing, or binding some information. Consequently, this thesis focuses on the sequential rule mining problem.
Sequential rules are generated from the set of sequential patterns. They express the temporal relationships between sequential patterns in a sequence database [67]. Sequential rules can be considered a natural extension of sequential patterns, just as association rules are a natural extension of frequent itemsets [25]. The sequential rule mining problem is thus to find relationships between occurrences of sequential events of the form “if event(s) X appear in a sequence of the sequence database, then event(s) Y are likely to appear afterward in that sequence, following X, with a given confidence”. Compared with sequential patterns, sequential rules help users better understand the chronological order of the sequences in the sequence database. For example, at a video store, a customer who purchases the fourth Star Wars movie disc will go on to buy the fifth and the sixth. The purchasing sequence (4, 5, 6) thus represents purchasing activity. In practice, however, the store has hundreds of customers with different preferences, so the sequence (4, 5, 6) tends to occur with low support. Mining sequential patterns from a sequence database with low support values yields many sequential patterns, which may include irrelevant or spurious ones. Sequential rules are therefore used to remove these spurious patterns by applying both support and confidence to rules: only the rules that satisfy both a minimum support threshold and a minimum confidence threshold are mined. In addition, sequential rule mining is applied to the prediction problem [18~19,42,69~73]. For prediction, knowing that a sequence of events appears frequently in a database is not sufficient, whereas sequential rules allow a better understanding of the prediction problem in a sequence database. For example, suppose some event C appears frequently after events A and B, but there are also many cases where A and B are not followed by C. In this case, predicting that C will occur whenever A and B occur, on the basis of a sequential pattern ABC, could be a huge mistake. For prediction, it is thus desirable to have patterns that indicate how many times C appeared after AB and how many times AB appeared and C did not. Using sequential rules, we can know the series of events that will usually occur after a series of previous ones. Sequential rules are rather simple, but the information they carry has many important implications: they are used for decision making, management, and orientation, so an appropriate sequential rule mining process, rather than mining only sequential patterns, is desirable. Like sequential patterns, sequential rules are applied in many areas, including trade [5], the stock market [8~9,73], weather observation [42], e-learning [43], and software engineering [45~48].
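The A, B, C example can be made concrete with a small sketch. It assumes the usual definition of rule confidence, namely the support of X followed by Y divided by the support of X, and it treats each sequence as a string of single events. The database and the function names are hypothetical.

```python
def follows(pattern, sequence):
    """True if the events of pattern occur in sequence in the given order."""
    it = iter(sequence)
    return all(event in it for event in pattern)

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(follows(pattern, s) for s in database)

def rule_confidence(x, y, database):
    """Confidence of the rule x -> y: sup(x followed by y) / sup(x)."""
    sup_x = support(x, database)
    return support(x + y, database) / sup_x if sup_x else 0.0

# Hypothetical database: C follows A, B in only two of the four sequences.
database = ["ABC", "ABD", "ABC", "ABE"]
print(support("ABC", database), rule_confidence("AB", "C", database))
```

Here the pattern ABC has support 2, yet the rule AB -> C holds in only half of the sequences that contain AB, which is the information a sequential pattern alone does not expose.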
1.7 Objective of the thesis
The goal of this thesis is to study and propose new algorithms that are efficient and effective to address the following two main objectives:
⚫ Mine secondary information, namely sequential patterns, closed sequential patterns, and sequential generator patterns, based on the corresponding prefix-tree structures.
⚫ Generate the various kinds of sequential rules from this secondary information on the prefix-tree structure.
1.8 Contributions of the thesis
In this thesis, we propose efficient and effective algorithms for mining problems related to sequential patterns. All of these algorithms are based on the prefix-tree structure, and their input database is organized in the vertical format. The prime-block encoding approach is also used throughout the work on generating sequential patterns. In particular, the main contributions of this thesis can be briefly summarized as follows:
- Introduce the definitions related to our work and survey some existing algorithms for mining sequential patterns.
- Introduce several interestingness measures, such as lift, cosine, and Jaccard, which are used to mine association rules, and propose an algorithm to generate all relevant sequential rules from a sequence database using these interestingness measures.
- Propose efficient algorithms for mining sequential generator patterns.
- Modify the prefix-tree structure to propose a new algorithm, called CloGen, for mining closed sequential patterns and their sequential generator patterns at the same time.
- Propose an efficient algorithm for mining non-redundant sequential rules based on the closed sequential pattern and sequential generator pattern fields of the prefix-tree.
- Conduct an extensive experimental evaluation of these techniques on both real and synthetic datasets, together with a comparison with the existing methods.
In summary, we have proposed efficient algorithms for mining sequential patterns in sequence databases by using prefix-tree structures. They not only improve performance but also reduce the number of redundant rules when mining a huge number of sequences from sequence databases.
1.9 Organization of the thesis
The remainder of this thesis is organized as follows:
Chapter 2: Problem Definition and Related Work
Chapter 3: Sequential Rules with Interestingness Measures Mining
Chapter 4: Sequential Generator Pattern Mining
Chapter 5: Closed Sequential Patterns and Their Sequential Generator Patterns Mining
Chapter 6: Non-Redundant Sequential Rule Mining
Finally: Conclusion and Future Works
In Chapter 2, we give the common definitions and survey several existing sequential pattern mining algorithms. In addition, some algorithms for closed sequential pattern mining, sequential generator pattern mining, and sequential rule generation are also covered in this chapter.
Chapter 3 examines some specific interestingness measures that have been used for association rules and classification rules, and then builds an efficient algorithm to find sequential rules with these interestingness measures.
In Chapter 4, we provide a novel algorithm called MSGPs that finds sequential generator patterns in the same process as generating sequential patterns. By modifying the prefix-tree structure, another efficient algorithm called MSGP_PreTree is then developed to solve this problem.
Chapter 5 designs an algorithm called CloGen to find closed sequential patterns and their sequential generator patterns by constructing a corresponding prefix-tree structure that stores the generator and closed properties of each sequential pattern.
In Chapter 6, based on the prefix-tree obtained from the CloGen algorithm, an efficient algorithm for generating non-redundant sequential rules is proposed.
Finally, the conclusion and future work are discussed.
The correctness and efficiency of the proposed algorithms are also verified by experimental results in each corresponding chapter.
CHAPTER 2: DEFINITIONS AND RELATED WORKS
2.1 Introduction
In data mining on sequence databases, sequence mining is essentially an enumeration problem over the subsequence partial order, looking for those sequences that are frequent. Sequential pattern mining on a sequence database identifies the patterns whose occurrences in the database satisfy the minimum support threshold (minSup). The first algorithms proposed for the sequential pattern mining problem were AprioriAll [24] in 1995 and GSP [25] in 1996, both by Agrawal and Srikant. Other algorithms, such as PSP [26], SPADE [27], PrefixSpan [28], SPAM [29], and CloSpan [53], were developed afterwards and successively improved the task of finding sequential patterns. Sequential pattern mining is applied in many fields, such as market analysis, web analysis, and predicting the shopping needs of customers.
Sequential rules extend the usefulness and expressiveness of sequential patterns and of the knowledge implicit in sequence data. A sequential rule is generated from sequential patterns; it represents the relationship between two series of events, in which one series of events occurs after the other.
In this chapter, we present the common definitions of the sequential pattern mining problem and introduce several existing sequential pattern mining methods that are the foundation for our contributions in Chapters 3, 4, 5, and 6. In addition, the definitions and some algorithms for closed sequential pattern mining, sequential generator pattern mining, and sequential rule generation are also presented.
2.2 Sequential Pattern Mining
2.2.1 Definitions
Definition 2.1: Sequence & sequence database [1,30,68]. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. An itemset is a non-empty subset of $I$; an itemset is denoted by $(i_1, i_2, \ldots, i_k)$, where each $i_j$ is an item. Without loss of generality, we assume that the items in an itemset are sorted in lexicographic order. Let $S = \{s_1, s_2, \ldots, s_n\}$ be a set of sequences, where each sequence $s_x$ is an ordered list of itemsets, $s_x = \{x_1, x_2, \ldots, x_p\}$, where each $x_i$ is an itemset, $p$ is the number of itemsets, and $x_1, x_2, \ldots, x_p \subseteq I$. In $s_x$, $x_1$ occurs before $x_2$, which occurs before $x_3$, and so on. The size of a sequence is the number of itemsets in the sequence. The number of instances of items in a sequence is called the i-length of the sequence, defined by $l = \sum_{j=1}^{p} |x_j|$. A sequence with i-length $l$ is called an $l$-sequence. For example, the sequence s = (AB)(B)(B)(AB)(B)(AC) has 6 itemsets, namely (AB), (B), (B), (AB), (B), and (AC), and 9 items. So, the size of s is 6 and its i-length is 9; s is called a 9-sequence. A sequence database SD is composed of a set (S) of sequences.
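As a quick check of the example in Definition 2.1, the sketch below computes the size and the i-length of s = (AB)(B)(B)(AB)(B)(AC). Representing a sequence as a list of Python sets is an assumption made only for illustration, and the function names are hypothetical.

```python
def size(sequence):
    """Size of a sequence: the number of itemsets it contains."""
    return len(sequence)

def i_length(sequence):
    """i-length of a sequence: the total number of item instances."""
    return sum(len(itemset) for itemset in sequence)

# The example sequence s = (AB)(B)(B)(AB)(B)(AC) from the definition.
s = [{"A", "B"}, {"B"}, {"B"}, {"A", "B"}, {"B"}, {"A", "C"}]
print(size(s), i_length(s))
```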
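Definition 2.2: Subsequence & supersequence [1,30,68]. A sequence $\alpha = \alpha_1 \alpha_2 \ldots \alpha_n$ is called a subsequence of $\beta = \beta_1 \beta_2 \ldots \beta_m$, and $\beta$ a supersequence of $\alpha$ (where $\alpha_i$ and $\beta_j$ are itemsets), denoted $\alpha \subseteq \beta$, if there exist integers $1 \le j_1 < j_2 < \ldots < j_n \le m$ ($n \le m$) such that $\alpha_1 \subseteq \beta_{j_1}, \alpha_2 \subseteq \beta_{j_2}, \ldots, \alpha_n \subseteq \beta_{j_n}$. For example, if $\alpha$ = (AB)(D) and $\beta$ = (ABC)(DE), where A, B, C, D, and E are items, then $\alpha$ is a subsequence of $\beta$ and $\beta$ is a supersequence of $\alpha$.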
Definition 2.3: Pattern. A pattern is a subsequence of a sequence. Each itemset in a pattern is called an element or event.
Definition 2.4: Support & sequential pattern [1,30,68]. Given a sequence database SD and a sequence s, the absolute support of s in SD is the number of sequences in SD that contain s, denoted $Sup_{SD}(s) = |\{S_i \in SD \mid s \subseteq S_i\}|$. The relative support of s in SD is the ratio of the absolute support of s in SD to the number of sequences in SD. Without loss of generality, in the remainder of this dissertation, whenever support is mentioned, the absolute and relative supports of s are used interchangeably and denoted Sup(s).
Definition 2.5: Sequential pattern [1,30,68]. Let minSup be a given minimum support threshold, with $minSup \in (0, 1]$. A sequence s is called a sequential pattern in SD if Sup(s) ≥ minSup. A sequential pattern with length l is called an l-pattern.
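Definitions 2.2, 2.4 and 2.5 can be sketched together in a few lines. The code below is a naive illustration, not one of the mining algorithms discussed in this thesis. It assumes a sequence is stored as a tuple of frozensets and uses a small hypothetical database, not the one in Table 2.1. The helper names `seq`, `is_subsequence` and `support` are also hypothetical.

```python
def seq(*itemsets):
    """Build a sequence from strings such as "AB", one string per itemset."""
    return tuple(frozenset(s) for s in itemsets)

def is_subsequence(alpha, beta):
    """Definition 2.2: the itemsets of alpha embed, in order, into itemsets of beta."""
    j = 0
    for itemset in alpha:
        while j < len(beta) and not itemset <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next itemset of alpha must match strictly later in beta
    return True

def support(s, database):
    """Definition 2.4: absolute support, the number of sequences containing s."""
    return sum(is_subsequence(s, t) for t in database)

# Hypothetical database, not Table 2.1.
database = [seq("AB", "C"), seq("A", "BC", "C"), seq("B", "A")]
p = seq("A", "C")
min_sup = 2  # absolute minimum support threshold
print(support(p, database), support(p, database) >= min_sup)
```

Matching each itemset of the candidate against the earliest possible itemset of the data sequence is sufficient here, because an earlier match never rules out a later one.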
Table 2.1 An example sequence database (SD)
… is 9. In s1, item A occurs three times, so it contributes 3 to the length of the sequence. However, when counting the support of item A, the whole sequence s1 is counted only once. The sequence p = (AB)(C) is a subsequence of s1; therefore, p is called a pattern. In SD, only the sequences s1, s2 and s5 contain pattern p, so p has a support of 3, i.e., Sup(p) = 3. Since Sup(p) > minSup, p is a sequential pattern. The length of p is 3, hence p is called a 3-pattern.
Given a sequence database SD and minSup, the sequential pattern mining problem is to find the full set of sequential patterns in SD. The sequential pattern mining problem [24,30] was also simultaneously identified as the frequent episode mining problem by Mannila et al. [74]. In this thesis, we use the sequence database in Table 2.1 as a running example to illustrate our work throughout the chapters.
Definition 2.6: Prefix, incomplete prefix & postfix [68]. Given two sequences $s_1 = a_1 a_2 \ldots a_n$ and $s_2 = b_1 b_2 \ldots b_m$, where $a_i$ and $b_j$ are itemsets and $m \ge n$, sequence $s_1$ is a prefix of $s_2$ if and only if $a_i = b_i$ for all $1 \le i \le n$. The remaining part of $s_2$ (after removal of the prefix $s_1$) is called a postfix of $s_2$. Sequence $s_1$ is an incomplete prefix of $s_2$ if and only if $a_i = b_i$ for all $1 \le i \le n-1$, $a_n \subset b_n$, and all the items in $(b_n - a_n)$ are lexicographically after those in $a_n$. From the above definition, it follows that a sequence of size k has (k-1) prefixes. For example, the sequence (A)(BC)(D) has 2 prefixes: (A) and (A)(BC). Accordingly, (BC)(D) is the postfix for prefix (A), and (D) is the postfix for prefix (A)(BC). Neither (A)(B) nor (BC) is a prefix of the given sequence, but (A)(B) is an incomplete prefix of it.
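The prefix, postfix and incomplete-prefix notions can be checked against the example sequence (A)(BC)(D) with a short sketch. It assumes sequences are tuples of frozensets and that items are single letters compared lexicographically. Following the count given in the text, only prefixes shorter than the sequence itself are enumerated. All function names are hypothetical.

```python
def seq(*itemsets):
    """Build a sequence from strings, one string of item letters per itemset."""
    return tuple(frozenset(s) for s in itemsets)

def is_prefix(s1, s2):
    """s1 is a prefix of s2 if its itemsets equal the leading itemsets of s2."""
    return 0 < len(s1) <= len(s2) and s2[:len(s1)] == s1

def postfix(s1, s2):
    """Part of s2 that remains after removing the prefix s1."""
    return s2[len(s1):]

def is_incomplete_prefix(s1, s2):
    """All but the last itemset match, and the last one is extended in s2."""
    n = len(s1)
    if n == 0 or n > len(s2) or s1[:n - 1] != s2[:n - 1]:
        return False
    a_n, b_n = s1[-1], s2[n - 1]
    # a_n must be a proper subset of b_n, and the extra items must sort after a_n.
    return a_n < b_n and min(b_n - a_n) > max(a_n)

def proper_prefixes(s):
    """Prefixes of s that are shorter than s itself."""
    return [s[:k] for k in range(1, len(s))]

s = seq("A", "BC", "D")
print(len(proper_prefixes(s)), is_incomplete_prefix(seq("A", "B"), s))
```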
Definition 2.7: Projected database [28,47,54]. Let $\alpha$ be a sequential pattern in sequence database SD. The $\alpha$-projected database, denoted $SD_\alpha$, is the set of postfixes of the sequences in SD with respect to the prefix $\alpha$. For example, given SD = {(A)(BC)(CD), (AB)(C)(DE)(F), (A)(CE)(F)} and the sequential pattern $\alpha$ = (A)(C), then $SD_\alpha$ = {(CD), (DE)(F), (E)(F)}.
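The example of Definition 2.7 can be reproduced with a simplified projection routine. The sketch assumes the postfix is taken after the leftmost occurrence of the pattern and that the items left over in the last matched itemset are kept as an itemset of their own. It does not implement pseudo-projection or any other optimization from the cited algorithms. The function names are hypothetical.

```python
def seq(*itemsets):
    """Build a sequence from strings, one string of item letters per itemset."""
    return tuple(frozenset(s) for s in itemsets)

def project(alpha, s):
    """Postfix of s after the leftmost occurrence of alpha, or None if absent."""
    j = 0
    for itemset in alpha:
        while j < len(s) and not itemset <= s[j]:
            j += 1
        if j == len(s):
            return None
        j += 1
    last = s[j - 1]
    # Items of the last matched itemset that come after the matched items.
    rest = frozenset(x for x in last if x > max(alpha[-1]))
    return ((rest,) if rest else ()) + s[j:]

def projected_database(alpha, database):
    """Collect the non-empty postfixes of all sequences that contain alpha."""
    postfixes = (project(alpha, s) for s in database)
    return [p for p in postfixes if p]

sd = [seq("A", "BC", "CD"), seq("AB", "C", "DE", "F"), seq("A", "CE", "F")]
alpha = seq("A", "C")
for postfix in projected_database(alpha, sd):
    print(["".join(sorted(itemset)) for itemset in postfix])
```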
2.2.2 Organization of the sequence data
Each sequence database can be represented in two basic ways:
• Horizontal Format: The database is organized horizontally; each row represents the series of events corresponding to an object, as shown in Table 2.3.
• Vertical Format: The database is organized vertically; each row represents the series of objects corresponding to an event, as shown in Table 2.4.
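The relation between the two formats can be sketched as a simple conversion. The code assumes that the vertical format lists, for every item, the (object id, itemset position) pairs where it occurs, which is one common way of building id-lists. The data and the name `to_vertical` are hypothetical and do not reproduce Tables 2.3 and 2.4.

```python
from collections import defaultdict

def to_vertical(horizontal):
    """Map each item to its (sequence id, itemset position) occurrences."""
    vertical = defaultdict(list)
    for sid, sequence in horizontal.items():
        for pos, itemset in enumerate(sequence, start=1):
            for item in sorted(itemset):
                vertical[item].append((sid, pos))
    return dict(vertical)

# Hypothetical horizontal database: object id -> ordered list of itemsets.
horizontal = {1: ["AB", "C"], 2: ["A", "BC"]}
vertical = to_vertical(horizontal)
for item in sorted(vertical):
    print(item, vertical[item])
```

With this layout, the support of a single item is the number of distinct object ids in its list, which is why vertical algorithms can count support without rescanning the whole database.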
Table 2.2 Sequence database
Object Series of events
2.2.3 Prefix-tree Structure
Prefix-tree is an ordered tree data structure used to store sequences for a fast look-up, where all the children nodes of a parent node have a common prefix of the sequences associated with that node, and the root is associated with the empty sequence Its simplest form can often be used as a list of keywords or a dictionary Unlike a binary search tree, no