Research on the sequential pattern mining algorithms using prefix tree structure


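University Code: 10532 Student ID: LB2010034

Classification Number: TP391 Security Level: Normal

Doctoral Dissertation: Research on Sequential Pattern Mining Algorithms Based on the Prefix-Tree Structure (English Edition) Applicant's Name: PHAM THI THIET

Training Unit: College of Information Science and Engineering

Supervisor's Name and Title: Professor Luo Jiawei

Discipline: Computer Science

Research Field: Data Mining and Knowledge Discovery

Thesis Submission Date: 2013-06-07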


University ID: 10532 Student ID: LB2010034

Security Level: Normal

Hunan University Doctoral Thesis

RESEARCH ON THE SEQUENTIAL PATTERN MINING ALGORITHMS USING PREFIX-TREE STRUCTURE

Applicant's Name : PHAM THI THIET
College : Information Science and Engineering
Supervisor : Professor Luo Jiawei
Major : Computer Science
Research Field : Data Mining and Knowledge Discovery
Submission Date : 2013-05
Defense Date : 2013-06-07
Defense Committee Chairman : Professor Bo Liao

Research On The Sequential Pattern Mining Algorithms Using Prefix-Tree Structure

By

PHAM THI THIET

Masters in Computer Science (Guru Gobind Singh Indraprastha University) 2008

A dissertation submitted in partial satisfaction of the

Requirements for the degree of


HUNAN UNIVERSITY

DECLARATION

I, Pham Thi Thiet, hereby declare that the work presented in this PhD thesis, titled “Research on The Sequential Pattern Mining Algorithms Using Prefix-Tree Structure”, is my original work and has not been presented elsewhere for any academic qualification. Where references have been used from books, published papers, reports and web sites, they are fully acknowledged in accordance with the standard referencing practices of the discipline.

Student’s signature: ……… Date: ……… …

This thesis belongs to:

1 Secure □, and this power of attorney is valid after


DOCTORAL THESIS

ABSTRACT

With the rapid development of computer and internet technology, the amounts of data gathered from various kinds of applications have become enormous and have far exceeded human capacity for comprehension without powerful tools. This has been described as a data-rich but information-poor situation. Therefore, finding the valuable information and knowledge hidden in vast amounts of data has become one of the most important tasks in data mining research. The variety and richness of data have given rise to different kinds of data, including transaction data, sequence data, stream data, time-series data and so on.

Sequence data is an important type of data that occurs frequently in many applications. It is composed of sequences of ordered elements or events, listed with or without a specific notion of time. Although many general data mining methods exist for other kinds of data, they cannot be applied directly to sequence data, because sequence data has its own unique sequential features. Sequence data arises in many interesting applications, which leads to many new kinds of knowledge to be discovered, including sequential patterns, approximate biological sequence patterns, partially ordered patterns, periodic patterns, motifs, and so on. These kinds of patterns assist the development of new classification, clustering and outlier analysis methods, which in turn call for the development of new kinds of applications.

Sequential pattern mining is one of the important tasks of data mining research and is widely used in sequence data mining applications. The process of sequential pattern mining is to extract frequent subsequences from a sequence database. This problem has attracted much attention from researchers in data mining. Many works have studied sequential pattern mining; however, the main challenges remain, namely large search spaces and ineffectiveness in handling dense datasets. To address these challenges, the problems of mining closed sequential patterns, sequential generator patterns, and sequential rules have been proposed. In this thesis, we propose novel algorithms for these problems with the following two main objectives:

⚫ Exploitation of secondary information such as sequential patterns, closed sequential patterns, and sequential generator patterns based on the corresponding prefix-tree structures.

⚫ Generation of the kinds of sequential rules based on the secondary information in the prefix-tree structure.

In this thesis, we make four main contributions, which can be briefly described as follows:

⚫ Firstly, this thesis considers several interestingness measures, such as Lift, Conviction, Piatetsky-Shapiro, Cosine and Jaccard, which have been proposed for mining association rules and classification rules but, apart from the traditional support and confidence measures, have not been applied to mining sequential rules in sequence databases. We then propose an efficient algorithm to generate all relevant sequential rules with these interestingness measures from a prefix-tree that stores the whole set of sequential patterns, where each child node stores a sequential pattern and its corresponding support value. By traversing the prefix-tree, the algorithm can easily identify the components of a rule and calculate the measure values of the rule. The experimental results show that sequential rule mining with interestingness measures using the proposed prefix-tree-based algorithm was always much faster than that using the other existing algorithm, a modified Full algorithm. Especially when mining large sequence databases with low minimum support values, the number of sequential patterns generated was large and the proposed algorithm performed much better, because it only traverses the prefix-tree to immediately determine which sequences are the left- and right-hand sides of a rule, as well as their support values, in order to compute the interestingness measure values of the rule from the sequential pattern set. In addition, the experimental results show that the time for mining sequential rules with the confidence measure was the smallest, because it did not need to revisit the prefix-tree to determine the support of Y (the consequent of a rule), while the other interestingness measures need to revisit the prefix-tree to determine the support values of the consequent of a rule, or of both the antecedent and the consequent.

⚫ Secondly, the characteristics of sequential generator patterns are combined with the extension of a sequence on the prefix-tree to propose two efficient algorithms, called MSGPs and MSGP_PreTree, for finding all sequential generator patterns while sequential patterns are being generated. Using the prefix-tree, new sequences, which are child nodes, can be easily created by appending an item to the last position of a parent node as an itemset extension or a sequence extension. The proposed algorithms use the prime block encoding approach to represent candidate sequences and use join operations over the prime blocks to determine the frequency of each candidate. The MSGPs algorithm uses a hash table to store sequential generator patterns, with the support of the pattern as the hash key for fast checking. The MSGP_PreTree algorithm improves on MSGPs for generating all sequential generator patterns. Its idea is to modify the prefix-tree so that each node has additional fields indicating whether the sequence stored in that node is a sequential generator pattern. Because the whole information of a sequence is stored on the prefix-tree, MSGP_PreTree does not need a hash table to store sequential generator patterns, which significantly reduces memory use. Supersequence frequency-based pruning and non-generator-based pruning on the prefix-tree are applied in MSGP_PreTree to reduce the search space. The process of extending the prefix-tree and determining sequential patterns in MSGP_PreTree is similar to that in MSGPs. All experimental results on synthetic and real databases show that the number of sequential generator patterns is always smaller than the number of sequential patterns, and that in all cases the proposed algorithms outperform the other algorithm in terms of running time.

⚫ Thirdly, we propose an efficient algorithm, called CloGen (Closed sequential pattern-sequential Generator pattern), for directly finding both closed sequential patterns and their sequential generator patterns during the sequential pattern generation process. It is based on the combination of the child-parent relationship on the prefix-tree structure and the definitions of closed sequential pattern and sequential generator pattern. Each node on the prefix-tree in our approach stores a sequential pattern and its corresponding support value. In addition, each node has one field (IsmSGP) indicating whether the node is a minimal sequential generator pattern, and another field (IsCSP) indicating whether the node is a closed sequential pattern. Based on these fields, the algorithm easily determines whether the sequence at each node is a minimal sequential generator pattern or a closed sequential pattern, so the mining time is reduced significantly. This algorithm also uses the prime block encoding approach, based on prime factorization theory, to represent candidate sequences, and join operations over the prime blocks to determine the frequency of each candidate. Experimental results show that mining closed sequential patterns and their minimal sequential generator patterns with the CloGen algorithm is faster by more than one order of magnitude. The CloGen algorithm can generate all sequential patterns, sequential generator patterns, and closed sequential patterns at the same time. Furthermore, the prefix-tree built in our approach is expected to be an efficient structure for mining non-redundant sequential rules in the future, and also for mining all sequential rules.

⚫ Fourthly, an efficient algorithm called MNSR-Pretree for mining non-redundant sequential rules is proposed. The algorithm consists of two phases. In the first phase, it builds a prefix-tree that stores all the sequential patterns from a given sequence database. In the second phase, it mines non-redundant sequential rules from this prefix-tree. In the prefix-tree building process, each node on the prefix-tree has a field (IsmSGP) that indicates whether the node is a minimal sequential generator pattern, and another field (IsCSP) that indicates whether the node is a closed sequential pattern; these fields are set by the CloGen algorithm of the previous contribution. By traversing the prefix-tree, non-redundant sequential rules can be easily mined from a minimal sequential generator pattern X to a closed sequential pattern Y such that X is a prefix of Y, which greatly reduces the mining time required. Based on the values of IsmSGP and IsCSP, the MNSR-Pretree algorithm only mines rules from a parent node whose IsmSGP value is true to child nodes whose IsCSP value is true. The sequence at the parent node is the antecedent of the rules to be generated, and the consequents are generated by removing the prefix part, which is the sequence at the parent node, from the closed sequential patterns. The experimental results on synthetic and real databases show that the number of non-redundant sequential rules is much smaller than that of sequential rules, and that the time required for mining non-redundant sequential rules is much less than that required for mining sequential rules. The results also show that the proposed algorithm mines non-redundant sequential rules in less time than an existing algorithm.
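The rule-generation step described in the fourth contribution can be illustrated with a minimal Python sketch. It is not the thesis's MNSR-Pretree implementation: the `Node` class, the names `is_msgp` and `is_csp` (mirroring IsmSGP and IsCSP), and the toy tree are hypothetical, the flags are assumed to have been computed beforehand, every itemset is assumed to hold a single item, and every descendant of a node is treated as a candidate because any pattern that has X as a prefix lies in the subtree of X.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

Pattern = Tuple[str, ...]  # simplification: one item per itemset


@dataclass
class Node:
    pattern: Pattern
    support: int
    is_msgp: bool  # mirrors IsmSGP: minimal sequential generator pattern?
    is_csp: bool   # mirrors IsCSP: closed sequential pattern?
    children: List["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield every node strictly below `node` in the prefix-tree."""
    for child in node.children:
        yield child
        yield from descendants(child)


def mine_rules(root: Node, min_conf: float):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for x in descendants(root):
        if not x.is_msgp:
            continue
        for y in descendants(x):
            if not y.is_csp:
                continue
            conf = y.support / x.support
            if conf >= min_conf:
                # The consequent is Y with the prefix X removed.
                rules.append((x.pattern, y.pattern[len(x.pattern):], y.support, conf))
    return rules


# Hypothetical toy tree with precomputed flags.
abd = Node(("a", "b", "d"), 3, False, True)
ab = Node(("a", "b"), 3, False, False, [abd])
ad = Node(("a", "d"), 3, False, False)
a = Node(("a",), 3, True, False, [ab, ad])
bd = Node(("b", "d"), 4, False, True)
b = Node(("b",), 4, True, False, [bd])
root = Node((), 0, False, False, [a, b])

rules = mine_rules(root, min_conf=0.5)
for rule in rules:
    print(rule)
```

Only generator-to-closed pairs are visited, so the non-generator nodes `ab` and `ad` never become antecedents.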

In summary, this thesis has proposed efficient algorithms and achieved the objective introduced at the outset, namely "to improve the efficiency of algorithms for exploiting secondary information such as sequential patterns, closed sequential patterns, and sequential generator patterns based on the prefix-tree structure", with the main contribution being "the use of the prefix-tree to generate kinds of sequential rules, such as sequential rules with interestingness measures and non-redundant sequential rules, from the secondary information". The goal of this thesis has been achieved by using the child-parent relationship on the prefix-tree structure and the extension of sequences to propose novel algorithms for mining tasks related to sequential patterns in a sequence database, including algorithms for mining sequential rules with interestingness measures, mining sequential generator patterns, mining closed sequential patterns and their sequential generator patterns, and mining non-redundant sequential rules. The proposed methods were evaluated on both synthetic and real datasets. Experimental results illustrate the effectiveness and efficiency of our algorithms.

Keywords: Sequential pattern, closed sequential pattern, sequential generator pattern, interestingness measure, sequential rule, non-redundant sequential rule, prefix-tree


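… so that the other terms can easily be expressed as n(X¬Y) = n(X) − n(XY),

… sequential pattern. If pre is concatenated with post, denoted pre++post, the result is the original sequential pattern. A sequential rule r can thus be formed as pre → post (Sup, imv). The support Sup(r) of r therefore …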


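… the rule “pre → post” is formed, where post is the postfix of SP with respect to the prefix pre.

Most interestingness measures of a rule depend on the support of Post. To obtain the support of Post, the procedure FIND_SUP_POST(RNode, Post) is called, where RNode is the first root node of Post in the prefix-tree and is non-empty. The procedure FIND_SUP_POST(RNode, Post) obtains the support of Post by traversing all branches of the prefix-tree rooted at RNode, where RNode is the … of Post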


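EXTEND_SEQUENCE creates a new pattern Pnew by appending each item in dbpat to the last position of the node being extended. Each added item is attached at the last itemset of the new node Pnew. If Pnew …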


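… generator, and then Pnew is added to pretree as an extension child of the node being extended. EXTEND_SEQUENCE creates a new pattern Pnew by appending each item in dbpat to the last position of the node being extended. Each added item is attached at the last itemset of the new node Pnew. If the … of Pnew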


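… block. Because each newly created child node Pnew is assigned {IsmSGP, IsCSP} = {true, true}, if Sup(Pnew) = Sup(P), then Pnew.IsmSGP and P.IsCSP are set to false. UPDATE_PRETREE(Pnew, pretree) is called to update the closed sequential patterns and the sequential generators of the prefix-tree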


TABLE OF CONTENTS

DECLARATION
ABSTRACT
ABSTRACT (IN CHINESE)
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

CHAPTER 1: INTRODUCTION
1.1 Overview of the sequence database in data mining
1.2 Motivation
1.3 Sequential pattern
1.4 Closed sequential pattern
1.5 Sequential generator pattern
1.6 Sequential rule
1.7 Objective of the thesis
1.8 Contributions of the thesis
1.9 Organization of the thesis

CHAPTER 2: DEFINITIONS AND RELATED WORKS
2.1 Introduction
2.2 Sequential Pattern Mining
2.2.1 Definitions


2.2.2 Organization of the sequence data
2.2.3 Prefix-tree Structure
2.2.4 Sequential patterns mining algorithms
2.2.4.1 AprioriAll
2.2.4.2 GSP
2.2.4.3 PSP
2.2.4.4 SPADE
2.2.4.5 PrefixSpan
2.2.4.6 SPAM
2.2.4.7 PRISM
2.3 Closed sequential patterns mining
2.3.1 CloSpan
2.3.2 BIDE
2.4 Sequential generator patterns mining
2.4.1 GenMiner
2.4.2 FEAT
2.4.3 FSGP
2.5 Sequential rules mining
2.6 Non-redundant sequential rules mining
2.7 Summary

CHAPTER 3: MINING SEQUENTIAL RULE WITH INTERESTINGNESS MEASURES USING PREFIX-TREE
3.1 Introduction
3.2 Problem statement


3.3 Mining sequential rules with interestingness measures
3.3.1 Interestingness measures
3.3.2 Algorithm
3.3.3 Illustration
3.3.4 Experiments
3.4 Summary

CHAPTER 4: SEQUENTIAL GENERATOR PATTERN MINING
4.1 Introduction
4.2 Unique Characteristics of Sequential Generator Patterns
4.3 Mining sequential generator pattern on hash table
4.3.1 Algorithm
4.4 Mining sequential generator pattern on prefix-tree
4.4.1 Algorithm
4.5 Summary

CHAPTER 5: CLOSED SEQUENTIAL PATTERNS AND THEIR MINIMAL SEQUENTIAL GENERATOR PATTERNS MINING
5.1 Introduction
5.2 Definitions
5.3 Mining closed sequential patterns and their minimal sequential generator patterns
5.3.1 CloGen Algorithm


5.4 Summary

CHAPTER 6: NON-REDUNDANT SEQUENTIAL RULE MINING
6.1 Introduction
6.2 Definitions
6.3 Mining non-redundant sequential rules based on prefix-tree
6.3.1 Algorithm
6.4 Summary

CONCLUSION AND FUTURE RESEARCH WORKS
1 Summary of the thesis
2 Future works

REFERENCES
APPENDIX A: LIST OF RESEARCH PUBLICATIONS
APPENDIX B: PROJECTS
ACKNOWLEDGMENTS


LIST OF FIGURES

Figure 1.1 A DNA sequence fragment
Figure 1.2 A protein sequence fragment
Figure 1.3 A weblog sequence
Figure 1.4 A customer purchase history
Figure 1.5 A storewide sales history
Figure 2.1 The Prefix-tree structure
Figure 2.2 AprioriAll Algorithm
Figure 2.3 A prefix-tree structure used in PSP algorithm
Figure 2.4 The SPADE Algorithm
Figure 2.5 The pseudo-code for the Enumerate_Seq[X] procedure
Figure 2.6 The PrefixSpan algorithm
Figure 2.7 The lexicographical tree of sequences
Figure 2.8 A bitmap representation of the sequence database in Table 2.5
Figure 2.9 The SPAM algorithm
Figure 2.10 Lattice built over P(G); each node shows a set S ∈ P(G) under its bit-vector SB and the value obtained by multiplying its elements
Figure 2.11 Example of primal block encoding
Figure 2.12 Extensions via Prime Block Joins
Figure 2.13 The CloSpan algorithm
Figure 2.14 The BIDE algorithm
Figure 2.15 The GenMiner algorithm
Figure 2.16 A sample Prefix Search Tree (a) and Prefix Search Lattice (b)


Figure 2.17 The FEAT algorithm
Figure 2.18 The FSGP algorithm
Figure 2.19 The Full algorithm
Figure 2.20 The MSR_ImpFull algorithm
Figure 2.21 The MSR_PreTree algorithm
Figure 3.1 A prefix-tree structure storing sequential patterns from Table 2.1
Figure 3.2 The interestingness measures roles in data mining process
Figure 3.3 The proposed algorithm for generating sequential rules based on a prefix-tree
Figure 3.4 The mining times of the two algorithms for different interestingness measures in …
… database in Table 2.1 with minSup = 50%
Figure 4.4 The MSGP-PreTree algorithm for generating set of sequential generator patterns
Figure 4.5 The comparison between number of sequential patterns and sequential generator patterns in databases
Figure 4.6 The mining sequential generator patterns times of two algorithms in databases
Figure 5.1 The CSGM algorithm
Figure 5.2 An algorithm for generating closed sequential patterns and their minimal sequential generator patterns
Figure 5.3 Level 1 of the pretree tree (each node contains: sequential pattern, support, IsmSGP, and IsCSP)

… = 50%
Figure 6.2 Algorithm for generating a set of non-redundant sequential rules
Figure 6.3 Runtime for mining sequential rules and non-redundant sequential rules for the database


LIST OF TABLES

Table 2.1 An example sequence database (SD)
Table 2.2 Sequence database
Table 2.3 Horizontal Format
Table 2.4 Vertical Format
Table 2.5 A sequence database
Table 2.6 Sequential patterns
Table 2.7 The set of sequential rules generated from the set of sequential patterns
Table 3.1 Some interestingness measures for a rule X → Y
Table 3.2 The sequential rules generated for any interestingness measure in Table 3.1 with minThreshold = 0
Table 3.3 The sequential rules with minThreshold = 0.8
Table 3.4 The time ratios for different interestingness measures
Table 4.1 The list of sequential patterns and sequential generator patterns
Table 4.2 Experimental results for three databases
Table 5.1 Results of all sequential patterns, closed sequential patterns and sequential generator patterns
Table 6.1 Sequential rules and non-redundant rules obtained with minConf = 50%
Table 6.2 Numbers of sequential rules and non-redundant sequential rules obtained from three databases with minConf = 0%


FP-tree Frequent Pattern-tree

id-list identifiers list

imv interestingness measure value

IsCSP Is Closed Sequential Pattern

IsmSGP Is minimal Sequential Generator Pattern

litemset large itemset

minThreshold minimum interestingness measure Threshold

mSGP(α) set of all minimal Sequential Generator Patterns of α

MSGP_PreTree Mining Sequential Generator Pattern on Prefix-Tree

SB bit-vector of sequence S with B be a bit-vector of length N

SGP(α) set of Sequential Generator Patterns of α


CHAPTER 1: INTRODUCTION

1.1 Overview of the sequence database in data mining

Due to the rapid development of computer and internet technology, the amounts of data gathered from various kinds of applications have become enormous and have far exceeded human capacity for comprehension without powerful tools. This has been described as a data-rich but information-poor situation. Therefore, finding the valuable information and knowledge hidden in vast amounts of data has become one of the most important tasks in data mining research. The diversity and richness of data have given rise to different kinds of data [1], including transaction data, sequence data, stream data, time-series data and so on.

Sequence data is an important type of data that occurs frequently in many scientific and engineering [2~4], business [5~7], customer behavior analysis [8~9], stock trend prediction [10~11], DNA sequence analysis [12], web usage behaviour analysis [13~15] and other applications. It is composed of sequences of ordered elements or events, listed with or without a specific notion of time, such as biological sequences (Figure 1.1 and Figure 1.2), weblog sequences (Figure 1.3), sequences of customer purchase and sales histories (Figure 1.4 and Figure 1.5), and sequences of events in science, nature or society. Although many general data mining methods exist for other kinds of data, they cannot be applied directly to sequence data, because sequence data has its own unique sequential features. Sequence data is seen in many interesting applications, which leads to many new kinds of knowledge to be discovered, including sequential patterns, approximate biological sequence patterns, partially ordered patterns, periodic patterns, motifs, and so on. These kinds of patterns assist the development of new classification, clustering and outlier analysis methods, which in turn call for the development of new kinds of applications. Besides, sequence data clearly describes the temporal relationships among data, so mining rules from sequence data is also expected to provide a lot of valuable hidden knowledge that is meaningful over time.

THE ALGORITHMS RESEARCH ON SEQUENTIAL PATTERNS MINING USING PREFIX-TREE STRUCTURE

GAATTCTCTGTAACACTAAGCTCTCTTCCTCAAAACCAGAGGTAGATAGAATGTGTAATAAT TTACAGAATTTCTAGACTTCAACGATCTGATTTTTTAAATTTATTTTTATTTTTTCAGGTTGAG ACTGAGCTAAAGTTAATCTGTGGC

Figure 1.1 A DNA sequence fragment

SSQIRQNYSTEVEAAVNRLVNLYLRASYTYLSLGFYFDRDDVALEGVCHEFRELAEEKREGAE RLLKMQNQRGGRALFQDLQKPSQDEWGTTPDAMKAAIVLEKSLNQALLDLHALGSAQADPH LCDFLESHFLDEEVKLIKKMGDHLTNIQRLVGSQAGLGEYLFERLTLKHD

Figure 1.2 A protein sequence fragment

100, a, 100,b, 200, a, 300, b 400, a, 100, a, 400, b, 300, a, 100, c, 200, c, 400, a,

400, e

Figure 1.3 A weblog sequence

223100, 05/26/06, 10am, CentralStation, {WholeMealBread, AppleJuice},

223100, 05/26/06, 11am, CentralStation, {Burger, Pepsi, Banana },

223100, 05/26/06, 4am, WalMart, {Milk, Cereal, Vegetable},

223100, 05/26/06, 10am, CentralStation, {WholeMealBread, AppleJuice}

Figure 1.4 A customer purchase history

97100, 05/06, {Apple : $85K, Bread : $100K, Cereal : $150K, …},

90089, 05/06, {Apple : $65K, Bread : $105K, Diaper : $20K, …},

97100, 05/06, {Apple : $95K, Bread : $110K, Cereal : $160K, …},

90089, 05/06, {Apple : $66K, Bread : $95K, Diaper : $22K, …}

Figure 1.5 A storewide sales history

Sequence data has several distinct characteristics compared with other kinds of data. As a result, sequence data mining offers many opportunities and challenges and has drawn the attention of researchers. These characteristics include the following [5]:

⚫ Sequences can be very long. In a given sequence database, the lengths of the sequences differ and may vary greatly. For example, the length of a gene can be as small as several hundred, but as large as over 100K.

⚫ A pattern can be a substring or a subsequence. Sometimes a pattern must occur as a substring of a sequence, i.e. the elements of the pattern must be consecutive elements of the original supersequence, without gaps between them. At other times, the elements of a pattern can occur as a subsequence of a sequence, allowing gaps between matching elements.

⚫ Absolute positions of elements in sequences may or may not be significant. For example, when we only want to know whether a sequence contains a pattern, we do not need to care at which absolute position the pattern occurs in the sequence.

⚫ The relative ordering/positional relationship between elements in sequences often plays an important role. For example, sequence XY is usually different from sequence YX. Furthermore, the distance between two elements in a sequence is also often significant. The relative ordering/positional relationship between elements is a feature unique to sequences, and is the basic difference between sequence data and other kinds of data.

Several data mining tasks are commonly used in sequence data mining applications [5]: mining sequential patterns, classification of sequences, and clustering of sequences. A sequential pattern is a sequence of itemsets that frequently appears in a specific order, where all items in the same itemset are assumed to have the same transaction-time value or to fall within a time gap. Finding sequential patterns in a sequence database is an important problem and a focus of data mining research.
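The notions of subsequence (with gaps allowed) and of a frequent sequence of itemsets can be made concrete with a small Python sketch. The helper names `seq`, `is_subsequence` and `support`, as well as the toy database and threshold, are hypothetical illustrations and are not taken from the thesis.

```python
from typing import FrozenSet, Sequence, Tuple

Itemset = FrozenSet[str]
Seq = Tuple[Itemset, ...]


def seq(*itemsets: str) -> Seq:
    """Build a sequence from strings, one string per itemset: seq("a", "bc")."""
    return tuple(frozenset(s) for s in itemsets)


def is_subsequence(pattern: Seq, sequence: Seq) -> bool:
    """True if the itemsets of `pattern` are contained, in order and with
    gaps allowed, in distinct itemsets of `sequence` (greedy matching)."""
    pos = 0
    for itemset in pattern:
        while pos != len(sequence) and not itemset.issubset(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern: Seq, database: Sequence[Seq]) -> int:
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database)


database = [
    seq("a", "bc", "d"),
    seq("ab", "c"),
    seq("a", "d", "c"),
    seq("b", "d"),
]
min_sup = 2  # hypothetical minimum support count

count = support(seq("a", "c"), database)
print(count, count >= min_sup)
```

Under these assumptions the pattern "a followed by c" is contained in three of the four sequences, so it is a sequential pattern for the chosen threshold, whereas the reversed order is not contained in the first sequence at all.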

1.2 Motivation

Sequential pattern mining is one of the important tasks of data mining research and is commonly used in sequence data mining applications. It plays a fundamental role in mining associations [9,16~19], correlations [20], and many other interesting relationships among data. Moreover, it serves data classification [2], clustering [21~23], and other data mining tasks. The process of sequential pattern mining is to extract frequent subsequences from a sequence database. Sequential pattern mining methods have been studied widely in many related problems, including general sequential pattern mining [24~30], constraint-based sequential pattern mining [31~33], incremental sequential pattern mining [34~36], approximate sequential pattern mining [37~38], partial periodic pattern mining [6, 39], and temporal pattern mining in data streams [40].

Although many problems related to sequential pattern mining have been studied, we consider the development of general sequential pattern methods to be the most basic one. Hence, in this thesis we investigate only the tasks of general sequential pattern mining and of generating rules from a sequence database. This work has attracted much attention from researchers in data mining. In this thesis, sequential pattern stands for general sequential pattern. Many works have studied sequential pattern mining [24~30]; however, the main challenges remain, namely large search spaces and ineffectiveness in handling dense datasets. To address these challenges, the problems of mining sequential rules, closed sequential patterns, and sequential generator patterns have been proposed.

Sequential rules are generated from the set of sequential patterns. They express the temporal relationships between event sequences in a sequence database. Sequential rules can be considered a natural extension of sequential patterns, just as association rules are a natural extension of frequent itemsets. Like sequential patterns, sequential rules are applied in many application areas, including trade [5], the stock market [8~9, 41], weather observation [42], e-learning [43], and software engineering [44~48]. Sequential rules have been used to remove irrelevant or spurious patterns from the set of sequential patterns by applying interestingness measures to rules. To the best of our knowledge, there are many studies of interestingness measures for mining association rules [33,37,49~51] or classification rules [33,52] in transaction databases, but these measures, apart from the traditional ones, have not been used to mine sequential rules in sequence databases.
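As an illustration of how such measures are computed once the relevant support counts are known, the following Python sketch evaluates a rule X → Y using the standard textbook definitions of the measures named in the abstract. The function name `rule_measures` and its parameters are hypothetical, and it is an assumption that Table 3.1 of the thesis uses exactly these forms.

```python
import math
from typing import Dict


def rule_measures(n: int, n_x: int, n_y: int, n_xy: int) -> Dict[str, float]:
    """Interestingness of X -> Y from counts over n sequences.

    n_x, n_y: supports of the antecedent and the consequent.
    n_xy: support of the pattern formed by X followed by Y.
    """
    p_x, p_y, p_xy = n_x / n, n_y / n, n_xy / n
    confidence = n_xy / n_x
    return {
        "support": p_xy,
        "confidence": confidence,
        "lift": p_xy / (p_x * p_y),
        # Conviction is unbounded when the rule never fails.
        "conviction": math.inf if confidence == 1 else (1 - p_y) / (1 - confidence),
        "piatetsky_shapiro": p_xy - p_x * p_y,
        "cosine": p_xy / math.sqrt(p_x * p_y),
        "jaccard": n_xy / (n_x + n_y - n_xy),
    }


measures = rule_measures(n=10, n_x=5, n_y=4, n_xy=3)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Confidence needs only the supports of X and of the whole pattern, while every other measure here also needs the support of the consequent Y, which matches the abstract's observation that confidence was the cheapest measure to mine with.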

Sequential generator patterns used together with closed sequential patterns can provide additional information that closed sequential patterns alone cannot, and they are often used for mining non-redundant sequential rules. Many efficient methods have been proposed to mine sequential patterns [24~30], closed sequential patterns [41,53~56], and sequential generator patterns [57~59]. However, these algorithms generate the different types of patterns separately, which consumes much time.

Non-redundant sequential rule mining can remove many low-quality sequential rules that are almost meaningless, and reduces the time spent generating a full set of sequential rules from the complete set of sequential patterns. Two algorithms were recently proposed by Lo et al., 2009 [47] and Zang et al., 2010 [60~61] to address this problem. These methods remove a significant number of redundant sequential rules but require a lot of time for checking sequential generator patterns and closed sequential patterns to generate rules.

1.3 Sequential pattern

Sequential patterns play an important role in data mining research. The sequential pattern mining problem was first proposed by Agrawal and Srikant [24] in 1995, and has since attracted more and more attention from researchers in the field [25~30]. Given a sequence database, the sequential pattern mining problem is to find all frequent sequences, i.e. those that satisfy a user-specified minimum support threshold. Sequential patterns have a broad range of applications, including customer purchase behavior analysis [8~9], DNA sequence pattern analysis [12], web usage behavior analysis [13~14], guidance systems [62], and so on.

In the last decade, many algorithms and techniques have been proposed to improve the efficiency of mining sequential patterns. The SPADE [27] algorithm divides candidate sequences into distinct groups such that each group can be stored completely in main memory. PrefixSpan [28] examines the prefix subsequences and projects the corresponding postfix subsequences into projected databases. The SPAM [29] algorithm speeds up the mining process using a lexicographic sequence tree and a bitmap representation. The PRISM [30] algorithm uses the primal block encoding approach to represent candidate sequences and join operations over the primal blocks to determine the frequency of each candidate. Experimental results [30] showed that PRISM was one of the best methods for mining sequential patterns: it outperformed existing methods by an order of magnitude or more and had a low memory footprint.
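To make the idea of prefix projection concrete, here is a minimal Python sketch in the spirit of PrefixSpan. It is deliberately restricted to sequences whose elements are single items, so it covers sequence extensions only and omits itemset extensions and the optimizations of the published algorithm; the function name `prefix_span` and the toy database are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def prefix_span(database: List[List[str]], min_sup: int) -> Dict[Tuple[str, ...], int]:
    """Return every frequent sequence of single items with its support count."""
    patterns: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], projected: List[List[str]]) -> None:
        # Count each item once per projected suffix.
        counts: Counter = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, sup in sorted(counts.items()):
            if sup >= min_sup:
                new_prefix = prefix + (item,)
                patterns[new_prefix] = sup
                # Project: keep what follows the first occurrence of `item`.
                new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
                grow(new_prefix, new_projected)

    grow((), database)
    return patterns


db = [list("abcd"), list("acbd"), list("abd"), list("bcd")]
patterns = prefix_span(db, min_sup=3)
for pattern, sup in patterns.items():
    print("".join(pattern), sup)
```

Each recursive call corresponds to descending one level in a prefix-tree: the child patterns of a prefix are exactly the frequent single-item extensions found in its projected database.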

1.4 Closed sequential pattern

When mining long frequent sequences, or when using very low support thresholds, the number of frequent subsequences grows combinatorially, which is prohibitively expensive in both time and space. As a result, the performance of sequential pattern mining algorithms often degrades sharply. To overcome this difficulty, the problem of mining closed sequential patterns has been developed. A sequence is called closed if no supersequence of it has the same support in the sequence database. Mining closed sequential patterns may significantly reduce the number of patterns generated without losing any information, because the closed patterns can be used to derive the complete set of sequential patterns; the number of closed sequential patterns is usually smaller than the number of sequential patterns.

Several methods have recently been proposed to mine closed sequential patterns [41,53~56]. One of them is the CloSpan algorithm [53]. Like most frequent closed itemset mining algorithms, such as CLOSET [63] and CHARM [64], CloSpan uses a candidate maintenance-and-test approach. It needs to maintain the set of already mined closed sequence candidates in order to perform the backward subpattern and backward superpattern checks that verify whether a newly found frequent sequence is likely to be closed. Consequently, it consumes much memory and leads to a huge search space for pattern closure checking when there are many frequent closed sequences. BIDE [54] is another, faster closed sequence mining algorithm. Unlike CloSpan, it uses a novel sequence closure checking scheme called BI-Directional Extension and prunes the search space further using the BackScan pruning method and the ScanSkip optimization technique, so that it obtains the complete set of frequent closed sequential patterns directly, without candidate maintenance. Thus, in most cases, BIDE is more efficient than CloSpan, especially when the database is dense or the minimum support value is low. However, to implement the closure check, BIDE repeatedly scans the pseudo-projected database to verify the existence of extension positions with respect to a prefix sequence, which costs much time in the mining process. To reduce this scanning time, Huang et al. [41] proposed the FCSM-PD algorithm, in which positional data are used to record the position information of items in the data sequences. In the pattern growth process, the extension positions of a prefix sequence are checked directly, and all the position information of the new prefix sequences is recorded. However, FCSM-PD must store all the position information of a prefix sequence in advance during pattern growth, so it consumes more memory.
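The definition of a closed sequence can be made concrete with a small sketch. The code below is only a naive quadratic check over an already mined pattern set, not the procedure used by CloSpan, BIDE or FCSM-PD. The representation (a sequence as a tuple of itemsets, each itemset a string of single-character items), the function names and the support table are assumptions made for illustration; the supports correspond to a hypothetical database {(A)(B)(C), (A)(B)(C), (A)(C)}.

```python
# Sketch only: a sequence is a tuple of itemsets, and each itemset is a string
# of single-character items, so ("AB", "C") stands for (AB)(C).

def is_subsequence(alpha, beta):
    """True if the itemsets of alpha fit, in order, inside itemsets of beta."""
    j = 0
    for itemset in alpha:
        # Greedily look for the earliest itemset of beta that contains it.
        while j < len(beta) and not set(itemset) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def closed_patterns(supports):
    """Keep the patterns that have no proper supersequence with equal support."""
    return {
        p
        for p, sup in supports.items()
        if not any(
            q != p and supports[q] == sup and is_subsequence(p, q)
            for q in supports
        )
    }


# Hypothetical supports of all frequent patterns of the database above.
supports = {
    ("A",): 3, ("B",): 2, ("C",): 3,
    ("A", "B"): 2, ("A", "C"): 3, ("B", "C"): 2,
    ("A", "B", "C"): 2,
}
closed = closed_patterns(supports)
print(sorted(closed))
```

In this example only (A)(C) and (A)(B)(C) remain; the support of every other pattern can be read off from the shortest closed pattern that contains it.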

1.5 Sequential generator pattern

In a sequence database, a sequential generator pattern is a pattern that has no subsequence with the same support. Sequential generator patterns used together with closed sequential patterns can provide additional information that closed sequential patterns alone cannot. According to the Minimum Description Length (MDL) principle [65], sequential generator patterns are the minimal members and are shorter than closed sequential patterns, so they are preferable to both sequential patterns and closed sequential patterns for mining non-redundant sequential rules. In such rules, the sequential generator patterns are the antecedents, and for each sequential generator pattern the consequents are generated by removing from the closed sequential patterns the prefix part that coincides with the generator.

Several sequential generator mining methods [57~59] have recently been proposed. Lo et al. [57] proposed the first sequential generator mining algorithm, called GenMiner. The method extracts sequential generators in a three-step compact-generate-and-filter approach. In the first step, it traverses all the sequential patterns and builds a compact representation of the space of sequential patterns in a lattice format [54]. In the second step, it retrieves from the compact lattice a set of candidate generators, which is a superset of all generators, and prunes the sub-search spaces containing non-generators by using the unique characteristics of sequential generators [65], so that the candidate generator set does not become too large. In the final step, all non-generators are filtered out of the candidate set. The FEAT algorithm was introduced by Gao et al. [58]. It is based on sequential pattern growth with forward and backward pruning strategies, along with a sequential generator checking technique to speed up the mining process. However, pruning non-generator sequences is time-consuming. To avoid the cost of pruning, the FSGP algorithm [59] was proposed. In FSGP, a safe pruning strategy based on the inclusion relationship between a sequence and its subsequences is used. Each valid frequent sequential pattern is checked against the set of valid frequent sequential patterns using the sequential generator checking theorem. The non-generators are then removed, and the resulting set of sequential generators is produced.
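As a rough illustration of the generator definition, the following sketch filters an already mined pattern set by brute force. It is not GenMiner, FEAT or FSGP. The tuple-of-strings representation, the function names and the support table (taken from a hypothetical database {(A)(B)(C), (A)(B)(C), (A)(C)}) are assumptions, and the empty sequence is deliberately ignored when looking for subsequences.

```python
# Sketch only: a sequence is a tuple of itemsets, and each itemset is a string
# of single-character items, so ("AB", "C") stands for (AB)(C).

def is_subsequence(alpha, beta):
    """True if the itemsets of alpha fit, in order, inside itemsets of beta."""
    j = 0
    for itemset in alpha:
        while j < len(beta) and not set(itemset) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def generator_patterns(supports):
    """Keep the patterns that have no proper subsequence with equal support."""
    return {
        p
        for p, sup in supports.items()
        if not any(
            q != p and supports[q] == sup and is_subsequence(q, p)
            for q in supports
        )
    }


# Hypothetical supports of all frequent patterns of the database above.
supports = {
    ("A",): 3, ("B",): 2, ("C",): 3,
    ("A", "B"): 2, ("A", "C"): 3, ("B", "C"): 2,
    ("A", "B", "C"): 2,
}
generators = generator_patterns(supports)
print(sorted(generators))
```

Here the generators are the three single-item patterns, which are the shortest patterns carrying each distinct support value.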

1.6 Sequential rule

Many different kinds of rules based on sequence databases have been studied in recent years, such as recurrent rules [46], sequential rules [47,66~68], sequential classification rules [66], and interesting rules [67].

Of all the above kinds of rules, the sequential rule is the most basic; the remaining kinds are usually sequential rules modified by adding or removing some information or constraints. Consequently, this thesis focuses on the problem of mining sequential rules.

Sequential rules are generated from the set of sequential patterns. They express the temporal relationships between sequential patterns in a sequence database [67]. Sequential rules can be considered a natural extension of sequential patterns, just as association rules are a natural extension of frequent itemsets [25]. The sequential rule mining problem is thus to find relationships between occurrences of sequential events of the form “if event(s) X appear in a sequence of the sequence database, then event(s) Y are likely to appear afterward in that sequence, following X, with a given confidence”. Compared with sequential patterns, sequential rules help users better understand the chronological order of the sequences present in the sequence database. For example, at a video store, customers who purchase the disc of the fourth Star Wars movie will go on to buy the fifth and the sixth. The purchasing sequence (4, 5, 6) therefore represents this purchasing activity. In practice, however, the store has hundreds of customers with different preferences, so the sequence (4, 5, 6) tends to occur with low support. Mining sequential patterns from a sequence database with low support values yields many sequential patterns, which may include irrelevant or spurious ones. Sequential rules have therefore been used to remove these spurious patterns by applying support and confidence to the rules: only the rules that satisfy both a minimum support threshold and a minimum confidence threshold are mined.

In addition, sequential rule mining is also applied to the prediction problem [18~19,42,69~73]. For prediction, knowing that a sequence of events appears frequently in a database is not sufficient, whereas sequential rules allow a better understanding of the prediction problem in a sequence database. For example, suppose some event C appears frequently after events A and B, but there are also many cases where A and B are not followed by C. In this case, predicting that C will occur whenever A and B occur, on the basis of the sequential pattern ABC, could be a serious mistake. Thus, for prediction, it is desirable to have patterns that indicate how many times C appeared after AB and how many times AB appeared and C did not. Using sequential rules, we can know the series of events that will usually occur after a series of previous ones. Sequential rules are rather simple, but the information they carry has many important implications: they are used for decision making, management and orientation. An appropriate sequential rule mining process, rather than mining only sequential patterns, is therefore also desirable. Like sequential patterns, sequential rules are applied in many areas, including trade [5], the stock market [8~9,73], weather observation [42], e-learning [43], and software engineering [45~48].
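The A, B, C example can be quantified with a short sketch. It assumes the usual definitions, which are not spelled out in this section: the support of a rule X → Y is the support of the concatenated pattern XY, and its confidence is Sup(XY) / Sup(X). The four-sequence database, the tuple-of-strings representation and the function names are hypothetical.

```python
# Sketch only: a sequence is a tuple of itemsets, and each itemset is a string
# of single-character items, so ("A", "B", "C") stands for (A)(B)(C).

def is_subsequence(alpha, beta):
    """True if the itemsets of alpha fit, in order, inside itemsets of beta."""
    j = 0
    for itemset in alpha:
        while j < len(beta) and not set(itemset) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def support(pattern, database):
    """Absolute support: number of sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)


def rule_measures(x, y, database):
    """Support and confidence of the rule x -> y (y must follow x)."""
    sup_xy = support(x + y, database)
    sup_x = support(x, database)
    confidence = sup_xy / sup_x if sup_x else 0.0
    return sup_xy, confidence


# Hypothetical database: A then B always occurs, but C follows only half the time.
database = [
    ("A", "B", "C"),
    ("A", "B", "C"),
    ("A", "B", "D"),
    ("A", "B"),
]
sup, conf = rule_measures(("A", "B"), ("C",), database)
print(sup, conf)
```

With these numbers the pattern (A)(B)(C) is frequent at a minimum support of 2, yet the rule (A)(B) → (C) holds in only half of the sequences that contain (A)(B), which is exactly the information a pattern alone does not expose.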

1.7 Objective of the thesis

The goal of this thesis is to study and propose new algorithms that are efficient and effective to address the following two main objectives:

⚫ Mine secondary information, namely sequential patterns, closed sequential patterns, and sequential generator patterns, based on the corresponding prefix-tree structures

⚫ Generate the various kinds of sequential rules from this secondary information on the prefix-tree structure

1.8 Contributions of the thesis

In this thesis, we propose efficient and effective algorithms for mining problems related to sequential patterns. All the algorithms in our work are based on the prefix-tree structure, and their input database is organized in the vertical format. The prime-block encoding approach is also used throughout the work on generating sequential patterns. In particular, the main contributions of this thesis can be summarized as follows:

- Introduce the definitions related to our work and survey some existing algorithms for mining sequential patterns.

- Introduce several interestingness measures, such as lift, cosine and Jaccard, which are used to mine association rules, and propose an algorithm to generate all relevant sequential rules from a sequence database using these interestingness measures.

- Propose efficient algorithms for mining sequential generator patterns.

- Modify the prefix-tree structure to propose a new algorithm, called CloGen, for mining closed sequential patterns and their sequential generator patterns at the same time.

- Propose an efficient algorithm for mining non-redundant sequential rules based on the closed sequential pattern and sequential generator pattern fields of the prefix-tree.

- Conduct an extensive experimental evaluation of these techniques on both real and synthetic datasets and compare them with existing methods.

In summary, we have proposed efficient algorithms for mining sequential patterns in sequence databases using prefix-tree structures. They not only improve performance but also reduce the number of redundant rules when mining a huge number of sequences from sequence databases.

1.9 Organization of the thesis

The remainder of this thesis is organized as follows:

Chapter 2: Problem Definition and Related Work

Chapter 3: Sequential Rules with Interestingness Measures Mining

Chapter 4: Sequential Generator Pattern Mining

Chapter 5: Closed Sequential Patterns and Their Sequential Generator Patterns Mining

Chapter 6: Non-Redundant Sequential Rule Mining

Finally: Conclusion and Future Works

In Chapter 2, we give the common definitions and a survey of several existing sequential pattern mining algorithms. In addition, some algorithms for closed sequential pattern mining, sequential generator pattern mining and sequential rule generation are also described in this chapter.

Chapter 3 examines some specific interestingness measures that have been used for association rules and classification rules, and then builds an efficient algorithm to find sequential rules with these interestingness measures.

In Chapter 4, we provide a novel algorithm called MSGPs, which finds sequential generator patterns in the same process that generates sequential patterns. By modifying the prefix-tree structure, another efficient algorithm called MSGP_PreTree is then developed to solve this problem.

Chapter 5 designs an algorithm called CloGen to find closed sequential patterns and their sequential generator patterns by constructing a corresponding prefix-tree structure that stores the generator and closed properties of each sequential pattern.

In Chapter 6, based on the prefix-tree obtained from the CloGen algorithm, an efficient algorithm for generating non-redundant sequential rules is proposed.

Finally, the conclusion and future work are discussed.

The correctness and efficiency of the proposed algorithms are also verified by experimental results in each corresponding chapter.


CHAPTER 2: DEFINITIONS AND RELATED WORKS

2.1 Introduction

In data mining on sequence databases, sequence mining is essentially an enumeration problem over the subsequence partial order, looking for those sequences that are frequent. Sequential pattern mining on a sequence database identifies the patterns that appear in the database and satisfy the minimum support threshold (minSup). The first algorithms proposed for the sequential pattern mining problem were AprioriAll [24] in 1995 and GSP [25] in 1996, by Agrawal and Srikant. Other algorithms, such as PSP [26], SPADE [27], PrefixSpan [28], SPAM [29] and CloSpan [53], were developed afterwards and successively improved the task of finding sequential patterns. Sequential patterns are applied in many fields, such as market analysis, web analysis, and predicting the shopping needs of customers.

Sequential rules extend the usability and the expressive power of sequential patterns, which represent implicit knowledge in the sequence data. A sequential rule is generated from sequential patterns; it represents the relationship between two series of events, in which one series of events occurs after the other.

In this chapter, we present the common definitions of the sequential pattern mining problem and introduce several existing sequential pattern mining methods that are the foundation for our contributions in Chapters 3, 4, 5, and 6. In addition, the definitions of, and some algorithms for, closed sequential pattern mining, sequential generator pattern mining and sequential rule generation are also presented.

2.2 Sequential Pattern Mining

2.2.1 Definitions

Definition 2.1: Sequence & sequence database [1,30,68]. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. An itemset is a non-empty subset of $I$; an itemset is denoted by $(i_1, i_2, \ldots, i_k)$, where each $i_j$ is an item. Without loss of generality, we assume that the items in an itemset are sorted in lexicographic order. Let $S = \{s_1, s_2, \ldots, s_n\}$ be a set of sequences, where each sequence $s_x$ is an ordered list of itemsets, $s_x = \{x_1, x_2, \ldots, x_p\}$, where each $x_i$ is an itemset, $p$ is the number of itemsets, and $x_1, x_2, \ldots, x_p \subseteq I$. In $s_x$, $x_1$ occurs before $x_2$, which occurs before $x_3$, and so on. The size of a sequence is the number of itemsets in the sequence. The number of instances of items in a sequence is called the i-length of the sequence, defined by $l = \sum_{i=1}^{p} |x_i|$. A sequence with i-length $l$ is called an $l$-sequence. For example, the sequence $s$ = (AB)(B)(B)(AB)(B)(AC) has 6 itemsets, namely (AB), (B), (B), (AB), (B) and (AC), and 9 items. So the size of $s$ is 6 and its i-length is 9, and $s$ is called a 9-sequence. A sequence database SD is composed of a set $S$ of sequences.
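The two measures in Definition 2.1 can be checked on the example sequence with a few lines of code. This is a minimal sketch; representing a sequence as a tuple of strings, with one character per item, is an assumption made for brevity, and the function names are illustrative.

```python
# Sketch only: a sequence is a tuple of itemsets, and each itemset is a string
# of single-character items, so "AB" stands for the itemset (AB).

def size(sequence):
    """Number of itemsets in the sequence."""
    return len(sequence)


def i_length(sequence):
    """Number of item instances in the sequence (sum of the itemset sizes)."""
    return sum(len(itemset) for itemset in sequence)


# The example sequence s = (AB)(B)(B)(AB)(B)(AC) from Definition 2.1.
s = ("AB", "B", "B", "AB", "B", "AC")
print(size(s), i_length(s))
```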

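Definition 2.2: Subsequence & supersequence [1,30,68]. A sequence $\alpha = \alpha_1 \alpha_2 \ldots \alpha_n$ is called a subsequence of $\beta = \beta_1 \beta_2 \ldots \beta_m$, and $\beta$ a supersequence of $\alpha$ (where $\alpha_i$ and $\beta_j$ are itemsets), denoted $\alpha \sqsubseteq \beta$, if there exist integers $1 \le j_1 < j_2 < \ldots < j_n \le m$ ($n \le m$) such that $\alpha_1 \subseteq \beta_{j_1}, \alpha_2 \subseteq \beta_{j_2}, \ldots, \alpha_n \subseteq \beta_{j_n}$. For example, if $\alpha$ = (AB)(D) and $\beta$ = (ABC)(DE), where A, B, C, D, and E are items, then $\alpha$ is a subsequence of $\beta$ and $\beta$ is a supersequence of $\alpha$.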
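The containment test in Definition 2.2 can be sketched with a greedy scan: each itemset of α is matched to the earliest possible itemset of β that contains it. The tuple-of-strings representation and the function name are assumptions of this sketch.

```python
# Sketch only: a sequence is a tuple of itemsets, and each itemset is a string
# of single-character items, so ("AB", "D") stands for (AB)(D).

def is_subsequence(alpha, beta):
    """True if alpha is a subsequence of beta in the sense of Definition 2.2."""
    j = 0
    for itemset in alpha:
        # Advance to the earliest itemset of beta that contains this itemset.
        while j < len(beta) and not set(itemset) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False  # no itemset left in beta to host this one
        j += 1  # later itemsets of alpha must match strictly later positions
    return True


alpha = ("AB", "D")
beta = ("ABC", "DE")
print(is_subsequence(alpha, beta))
print(is_subsequence(("D", "AB"), beta))
```

Matching greedily at the earliest position is safe here because it leaves the longest possible remainder of β for the itemsets still to be matched.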

Definition 2.3: Pattern. A pattern is a subsequence of a sequence. Each itemset in a pattern is called an element or event.

Definition 2.4: Support [1,30,68]. Given a sequence database SD and a sequence $s$, the absolute support of $s$ in SD is the number of sequences in SD that contain $s$, denoted $Sup_{SD}(s) = |\{S_i \in SD \mid s \sqsubseteq S_i\}|$. The relative support of $s$ in SD is the ratio of the absolute support of $s$ in SD to the number of sequences in SD. Without loss of generality, in the remainder of this dissertation, whenever support is mentioned, the absolute and the relative support of $s$ are used interchangeably and denoted Sup(s).

Definition 2.5: Sequential pattern [1,30,68]. Given a minimum support threshold, denoted minSup, with $minSup \in (0, 1]$, a sequence $s$ is called a sequential pattern in SD if $Sup(s) \ge minSup$. A sequential pattern with length $l$ is called an $l$-pattern.
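Definitions 2.4 and 2.5 can be sketched as follows. The database used is the three-sequence example that appears under Definition 2.7; the threshold of 0.5, the tuple-of-strings representation and the function names are assumptions made for illustration.

```python
# Sketch only: a sequence is a tuple of itemsets, and each itemset is a string
# of single-character items, so ("A", "BC", "CD") stands for (A)(BC)(CD).

def is_subsequence(alpha, beta):
    """True if the itemsets of alpha fit, in order, inside itemsets of beta."""
    j = 0
    for itemset in alpha:
        while j < len(beta) and not set(itemset) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def absolute_support(pattern, database):
    """Number of sequences of the database that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)


def relative_support(pattern, database):
    """Absolute support divided by the number of sequences."""
    return absolute_support(pattern, database) / len(database)


def is_sequential_pattern(pattern, database, min_sup):
    """min_sup is a relative threshold in (0, 1]."""
    return relative_support(pattern, database) >= min_sup


database = [
    ("A", "BC", "CD"),
    ("AB", "C", "DE", "F"),
    ("A", "CE", "F"),
]
print(absolute_support(("A", "C"), database))
print(is_sequential_pattern(("A", "F"), database, min_sup=0.5))
print(is_sequential_pattern(("BC",), database, min_sup=0.5))
```

Each sequence contributes at most 1 to the support of a pattern, however many times the pattern occurs inside it.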

Table 2.1 An example sequence database (SD)

is 9. In s1, item A occurs three times, so it contributes 3 to the length of the sequence. However, when counting the support of item A, the whole sequence s1 is counted only once. The sequence p = (AB)(C) is a subsequence of s1; therefore, p is called a pattern. In SD, only sequences s1, s2 and s5 contain pattern p, so p has a support of 3, Sup(p) = 3. Since Sup(p) > minSup, p is a sequential pattern. The length of p is 3; hence p is called a 3-pattern.

Given a sequence database SD and minSup, the sequential pattern mining problem is to find the complete set of sequential patterns in SD. The sequential pattern mining problem [24,30] was simultaneously formulated as the frequent episode mining problem by Mannila et al. [74]. In this thesis, we use the sequence database in Table 2.1 as a running example to illustrate our work throughout the chapters.

Definition 2.6: Prefix, incomplete prefix & postfix [68]. Given two sequences $s_1 = a_1 a_2 \ldots a_n$ and $s_2 = b_1 b_2 \ldots b_m$, where $a_i$ and $b_j$ are itemsets and $m \ge n$, sequence $s_1$ is a prefix of $s_2$ if and only if $a_i = b_i$ for all $1 \le i \le n$. The remaining part of $s_2$ (after the removal of the prefix $s_1$) is called a postfix of $s_2$. Sequence $s_1$ is an incomplete prefix of $s_2$ if and only if $a_i = b_i$ for all $1 \le i \le n-1$, $a_n \subset b_n$, and all the items in $(b_n - a_n)$ are lexicographically after those in $a_n$. From the above definition, it can be inferred that a sequence of size $k$ has $(k-1)$ prefixes. For example, the sequence (A)(BC)(D) has 2 prefixes: (A) and (A)(BC). Accordingly, (BC)(D) is the postfix for prefix (A), and (D) is the postfix for prefix (A)(BC). Neither (A)(B) nor (BC) is a prefix of the given sequence, but (A)(B) is an incomplete prefix of it.
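The example in Definition 2.6 can be reproduced with the sketch below. As in the example, only prefixes shorter than the sequence itself are enumerated. The tuple-of-strings representation and all function names are assumptions of this sketch.

```python
# Sketch only: a sequence is a tuple of itemsets, and each itemset is a string
# of single-character items, so ("A", "BC", "D") stands for (A)(BC)(D).

def is_prefix(s1, s2):
    """True if the itemsets of s1 equal the first len(s1) itemsets of s2."""
    return len(s1) <= len(s2) and all(set(a) == set(b) for a, b in zip(s1, s2))


def postfix(s1, s2):
    """What remains of s2 after removing its prefix s1."""
    if not is_prefix(s1, s2):
        raise ValueError("s1 is not a prefix of s2")
    return s2[len(s1):]


def is_incomplete_prefix(s1, s2):
    """All but the last itemset match exactly; the last one is a proper subset
    whose missing items all come lexicographically after its own items."""
    n = len(s1)
    if n == 0 or n > len(s2) or not is_prefix(s1[:-1], s2):
        return False
    last, target = set(s1[-1]), set(s2[n - 1])
    return last < target and all(x > max(last) for x in target - last)


def proper_prefixes(s):
    """Prefixes that are shorter than s itself (k - 1 of them for size k)."""
    return [s[:k] for k in range(1, len(s))]


s = ("A", "BC", "D")
print(proper_prefixes(s))
print(postfix(("A",), s), postfix(("A", "BC"), s))
print(is_incomplete_prefix(("A", "B"), s))
```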

Definition 2.7: Projected database [28,47,54]. Let $\alpha$ be a sequential pattern in sequence database SD. The $\alpha$-projected database, denoted $SD_{\alpha}$, is the set of postfixes of the sequences in SD with respect to the prefix $\alpha$. For example, given SD = {(A)(BC)(CD), (AB)(C)(DE)(F), (A)(CE)(F)} and the sequential pattern $\alpha$ = (A)(C), then $SD_{\alpha}$ = {(CD), (DE)(F), (E)(F)}.

2.2.2 Organization of the sequence data

Each sequence database can be represented in two basic ways:

• Horizontal Format: The database is organized horizontally; each row represents the series of events corresponding to the object as shown in Table 2.3

• Vertical Format: The database is organized vertically; each row represents the series of objects corresponding to the event as shown in Table 2.4
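The relationship between the two formats can be sketched as a simple conversion. In this sketch each row of the vertical format lists, for one item, the pairs (object identifier, itemset position) where the item occurs; recording the position alongside the object is an assumption of the sketch, as are the sample data, the representation and the function names.

```python
from collections import defaultdict

# Sketch only: a sequence is a tuple of itemsets, and each itemset is a string
# of single-character items. The horizontal database maps an object id to its
# sequence of events.

def to_vertical(horizontal):
    """Map every item to the (object id, itemset position) pairs where it occurs."""
    vertical = defaultdict(list)
    for sid, sequence in horizontal.items():
        for position, itemset in enumerate(sequence, start=1):
            for item in itemset:
                vertical[item].append((sid, position))
    return dict(vertical)


def support_of_item(vertical, item):
    """Number of distinct objects whose sequence contains the item."""
    return len({sid for sid, _ in vertical.get(item, [])})


horizontal = {
    1: ("A", "BC", "CD"),
    2: ("AB", "C", "DE", "F"),
    3: ("A", "CE", "F"),
}
vertical = to_vertical(horizontal)
print(vertical["A"])
print(vertical["F"], support_of_item(vertical, "F"))
```

With the vertical layout, the support of a single item is obtained from its own row alone, without rescanning the whole database.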


Table 2.2 Sequence database

Object Series of events

2.2.3 Prefix-tree Structure

A prefix-tree is an ordered tree data structure used to store sequences for fast look-up, in which all the child nodes of a node share, as a common prefix, the sequence associated with that node, and the root is associated with the empty sequence. In its simplest form, it can be used as a list of keywords or a dictionary. Unlike a binary search tree, no
