
DATA MINING LECTURE 5 Sequential Pattern Mining


Trang 1

DATA MINING

LECTURE 5

Sequential Pattern Mining

Trang 2

Outline

• Sequence database

• Methods for sequential pattern mining

• GSP

• SPADE

• PrefixSpan

Trang 4

Applications

• Applications of sequential pattern mining

– Customer shopping sequences:

• First buy a computer, then a CD-ROM, and then a digital camera, within 3 months.

– Medical treatments, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.

– Telephone calling patterns, Weblog click streams

– DNA sequences and gene structures

Trang 5

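• A sequence α = <a1 a2 … an> is a subsequence of β = <b1 b2 … bm>, written α ⊆ β, if there exist integers $1 \le j_1 < j_2 < \dots < j_n \le m$ such that $a_1 \subseteq b_{j_1},\ a_2 \subseteq b_{j_2},\ \ldots,\ a_n \subseteq b_{j_n}$

• β is a super sequence of α

– E.g., α = <(ab)(d)> and β = <(abc)(de)>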
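The containment test above can be sketched in a few lines of Python. This is a minimal illustration, assuming each sequence is stored as a list of sets (one set per element); the function name `is_subsequence` is chosen here and does not come from the slides.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Both arguments are lists of sets; each set is one element (itemset).
    """
    j = 0  # next position in beta that may still be used
    for a in alpha:
        # Greedily take the earliest element of beta that contains a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the indices j1 < j2 < ... must be strictly increasing
    return True


alpha = [{"a", "b"}, {"d"}]            # <(ab)(d)>
beta = [{"a", "b", "c"}, {"d", "e"}]   # <(abc)(de)>
print(is_subsequence(alpha, beta))     # True
print(is_subsequence(beta, alpha))     # False
```

Matching each element as early as possible is sufficient here: if any valid choice of indices exists, the leftmost one exists too.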

Trang 6

What Is Sequential Pattern Mining?

• Given a set of sequences and a support threshold, find the complete set of frequent subsequences

A sequence: <(ef)(ab)(df)cb>

Items within an element are unordered, and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
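Support counting can be sketched as follows. The four-sequence database `DB` is an assumption: only <a(abc)(ac)d(cf)> and <(ef)(ab)(df)cb> are quoted directly in this deck, and the other two rows are inferred from the <a>-projected database shown on a later slide. Each sequence is written as a tuple of strings, one alphabetically sorted string per element; `contains` and `support` are illustrative names.

```python
# Assumed running example (see the note above); one string per element.
DB = [
    ("a", "abc", "ac", "d", "cf"),     # <a(abc)(ac)d(cf)>
    ("ad", "c", "bc", "ae"),           # <(ad)c(bc)(ae)>
    ("ef", "ab", "df", "c", "b"),      # <(ef)(ab)(df)cb>
    ("e", "g", "af", "c", "b", "c"),   # <eg(af)cbc>
]


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (leftmost greedy matching)."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(db, pattern):
    """Number of sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db)


min_sup = 2
pattern = ("ab", "c")                    # <(ab)c>
print(support(DB, pattern))              # 2
print(support(DB, pattern) >= min_sup)   # True
```

Under this assumed database, <(ab)c> is contained in the first and third sequences only, which gives the support of 2 stated on the slide.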

Trang 7

Challenges on Sequential Pattern Mining

• A huge number of possible sequential patterns are hidden in databases

• A mining algorithm should:

– find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold

– be highly efficient and scalable, involving only a small number of database scans

– be able to incorporate various kinds of user-specific constraints

Trang 9

The Apriori Property of Sequential Patterns

• A basic property: Apriori (Agrawal & Srikant '94)

– If a sequence S is not frequent, then none of the super-sequences of S is frequent

Given support threshold min_sup = 2

Trang 10

• Outline of the method

– Initially, every item in DB is a candidate of length-1

– for each level (i.e., sequences of length-k) do

• scan database to collect support count for each candidate sequence

• generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori

– repeat until no frequent sequence or no candidate can be found

• Major strength: Candidate pruning by Apriori

Trang 11

Finding Length-1 Sequential Patterns

Trang 12

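Without Apriori pruning, 8 items give 8*8 + 8*7/2 = 92 length-2 candidates

With the 6 frequent items, 6*6 + 6*5/2 = 51 candidates remain

Apriori prunes (92 - 51)/92 = 44.57% of the candidates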
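The candidate counts can be checked by enumerating the length-2 candidates directly. This sketch assumes items a to h, of which a to f are frequent (as the candidate lattice on the GSP slide suggests); `length2_candidates` is an illustrative name.

```python
from itertools import combinations, product


def length2_candidates(items):
    """All length-2 candidates over items: <xy> sequences and <(xy)> itemsets."""
    sequences = [(x, y) for x, y in product(items, repeat=2)]   # n * n
    itemsets = [(x + y,) for x, y in combinations(items, 2)]    # n * (n - 1) / 2
    return sequences + itemsets


all_items = list("abcdefgh")
frequent_items = list("abcdef")

without_pruning = len(length2_candidates(all_items))
with_pruning = len(length2_candidates(frequent_items))
pruned_pct = round(100 * (without_pruning - with_pruning) / without_pruning, 2)

print(without_pruning, with_pruning, pruned_pct)  # 92 51 44.57
```

The two kinds of candidates differ in size because <xy> and <yx> are distinct sequences, whereas (xy) and (yx) are the same itemset.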

Trang 13

Finding Length-2 Sequential Patterns

• Scan database one more time, collect support count for each length-2 candidate

• There are 19 length-2 candidates which pass the minimum support threshold

– They are length-2 sequential patterns

Trang 14

The GSP Mining Process

<a> <b> <c> <d> <e> <f> <g> <h>

<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>

<abb> <aab> <aba> <baa> <bab> …

<abba> <(bd)bc> …

<(bd)cba>

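1st scan: 8 candidates, 6 length-1 sequential patterns

2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates not in DB at all

3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates not in DB at all

[Figure legend: candidates that cannot pass the support threshold; candidates not in DB at all]

min_sup = 2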

Trang 15

– Form $C_{k+1}$ (the set of length-(k+1) candidates) from $F_k$;

– If $C_{k+1}$ is not empty, scan the database once and find $F_{k+1}$, the set of length-(k+1) sequential patterns;

– Let k = k + 1;
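The level-wise loop can be sketched as below. This is a simplified illustration rather than the full GSP of Srikant and Agrawal: it has no time constraints or hash tree, and support is counted with a plain scan. The database `DB` is an assumption (the prefix-projection running example used later in the deck, not the database behind the 51/19/46 counts on the GSP slide), and all function names are chosen here.

```python
from itertools import combinations

DB = [  # assumed example database; one sorted string per element
    ("a", "abc", "ac", "d", "cf"),
    ("ad", "c", "bc", "ae"),
    ("ef", "ab", "df", "c", "b"),
    ("e", "g", "af", "c", "b", "c"),
]


def contains(seq, pattern):
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def drop_item(pattern, i, j):
    """Remove the j-th item of the i-th element, dropping emptied elements."""
    element = pattern[i][:j] + pattern[i][j + 1:]
    middle = (element,) if element else ()
    return pattern[:i] + middle + pattern[i + 1:]


def join(s1, s2):
    """Join step: s1 minus its first item must equal s2 minus its last item."""
    if drop_item(s1, 0, 0) != drop_item(s2, len(s2) - 1, len(s2[-1]) - 1):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:                     # last item was an element of its own
        return s1 + (last_item,)
    return s1[:-1] + (s1[-1] + last_item,)   # last item joins s1's last element


def generate_candidates(frequent, k):
    """Form C_{k+1} from F_k (a set of length-k patterns)."""
    if k == 1:  # special case: items combine into <xy> and <(xy)>
        items = sorted(p[0] for p in frequent)
        cands = {(x, y) for x in items for y in items}
        return cands | {(x + y,) for x, y in combinations(items, 2)}
    cands = {join(s1, s2) for s1 in frequent for s2 in frequent}
    cands.discard(None)
    # Apriori pruning: every length-k subsequence must itself be frequent.
    return {c for c in cands
            if all(drop_item(c, i, j) in frequent
                   for i in range(len(c)) for j in range(len(c[i])))}


def gsp(db, min_sup):
    """Return {k: {pattern: support}} for every level k that has patterns."""
    items = {x for seq in db for element in seq for x in element}
    candidates = {(x,) for x in items}
    levels, k = {}, 1
    while candidates:
        counts = {c: sum(contains(s, c) for s in db) for c in candidates}  # one scan
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        if not frequent:
            break
        levels[k] = frequent
        candidates = generate_candidates(set(frequent), k)
        k += 1
    return levels


levels = gsp(DB, min_sup=2)
for k in sorted(levels):
    print(k, len(levels[k]))
```

Each pass of the `while` loop corresponds to one database scan, which is why the number of scans grows with the length of the longest pattern.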

Trang 16

The GSP Algorithm

• Benefits from the Apriori pruning

– Reduces search space

• Bottlenecks

– Scans the database multiple times

– Generates a huge set of candidate sequences

There is a need for more efficient mining methods

Trang 17

The SPADE Algorithm

• SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki (2001)

• A vertical format sequential pattern mining method

• A sequence database is mapped to a large set of Item: <SID, EID>

• Sequential pattern mining is performed by

– growing the subsequences (patterns) one item at a time by Apriori candidate generation
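The vertical format can be sketched by building one id-list of (SID, EID) pairs per item and then joining id-lists instead of rescanning the database. The database, the SIDs 1 to 4, and the function names are assumptions made for illustration. The joins shown cover only two single items, whereas SPADE joins id-lists of longer patterns that share a prefix.

```python
from collections import defaultdict

DB = {  # assumed example database, keyed by a hypothetical SID
    1: ("a", "abc", "ac", "d", "cf"),
    2: ("ad", "c", "bc", "ae"),
    3: ("ef", "ab", "df", "c", "b"),
    4: ("e", "g", "af", "c", "b", "c"),
}


def vertical(db):
    """Map each item to its id-list of (SID, EID) pairs; EIDs start at 1."""
    idlists = defaultdict(list)
    for sid, seq in db.items():
        for eid, element in enumerate(seq, start=1):
            for item in element:
                idlists[item].append((sid, eid))
    return idlists


def temporal_join(list_x, list_y):
    """Id-list of <xy>: occurrences of y after some x in the same sequence."""
    first_x = {}
    for sid, eid in list_x:
        first_x[sid] = min(eid, first_x.get(sid, eid))
    return [(sid, eid) for sid, eid in list_y
            if sid in first_x and eid > first_x[sid]]


def equality_join(list_x, list_y):
    """Id-list of <(xy)>: x and y occur in the same element."""
    ys = set(list_y)
    return [pair for pair in list_x if pair in ys]


def support(idlist):
    """Support is the number of distinct SIDs in an id-list."""
    return len({sid for sid, _ in idlist})


v = vertical(DB)
print(support(temporal_join(v["a"], v["b"])))  # support of <ab>
print(support(temporal_join(v["b"], v["a"])))  # support of <ba>
print(support(equality_join(v["a"], v["b"])))  # support of <(ab)>
```

Once the id-lists are built, support comes from the joined lists alone, so the original sequences are not read again.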

Trang 18

The SPADE Algorithm

Trang 19

The SPADE Algorithm

Trang 20

The SPADE Algorithm

• Generates candidates by merging patterns from the same equivalence class (an equivalence class of size k is defined as the set of all frequent patterns … items)

Trang 21

• 1,000 frequent length-1 sequences generate $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates

Trang 22

PrefixSpan (Prefix-Projected Sequential Pattern Growth)

• PrefixSpan: Prefix-Projected Sequential Pattern Growth

• J. Pei, J. Han, … PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.

Trang 23

Prefix and Suffix (Projection)

• <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

• Given sequence <a(abc)(ac)d(cf)>, each prefix determines a suffix (prefix-based projection)
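Prefix-based projection can be sketched as a function that returns the suffix of a sequence with respect to a prefix. Sequences are assumed to be tuples of alphabetically sorted strings (one string per element), and an element that starts with `_` marks items left over from the element where the prefix ended. `project` is a name chosen for this sketch.

```python
def project(seq, prefix):
    """Return the suffix of seq with respect to prefix, or None if seq lacks it."""
    pos, last = 0, None
    for element in prefix:
        # Leftmost element of seq that contains this prefix element.
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return None
        last = pos
        pos += 1
    # Items of the final matched element that come after the prefix's last item.
    rest = "".join(x for x in seq[last] if x > prefix[-1][-1])
    head = ("_" + rest,) if rest else ()
    return head + seq[last + 1:]


s = ("a", "abc", "ac", "d", "cf")   # <a(abc)(ac)d(cf)>
print(project(s, ("a",)))           # ('abc', 'ac', 'd', 'cf')   i.e. <(abc)(ac)d(cf)>
print(project(s, ("a", "a")))       # ('_bc', 'ac', 'd', 'cf')   i.e. <(_bc)(ac)d(cf)>
print(project(s, ("a", "b")))       # ('_c', 'ac', 'd', 'cf')    i.e. <(_c)(ac)d(cf)>
```

Applying `project` with prefix `("a",)` to every sequence of a database yields its <a>-projected database.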

Trang 24

• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:

– The ones having prefix <a>;

– The ones having prefix <b>;

Trang 25

Finding Seq Patterns with Prefix <a>

• Consider projections with respect to <a>

– <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

• Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, …

– Further partition into 6 subsets

• Having prefix <aa>;

Trang 26

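[Figure: recursive partitioning of the search space. Length-2 patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>; these are divided further into patterns having prefix <aa>, and so on.]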

Trang 28

Example in detail

PrefixSpan – Example (2)

3. Find subsets of sequential patterns

[Figure: projected databases of the worked example; recoverable entries: <(cf)>, <c(bc)(ae)>, <c>, <b>, <dcb>]

Trang 29

The Algorithm of PrefixSpan

• Input: A sequence database S, and the minimum support threshold min_sup

• Output: The complete set of sequential patterns

Trang 30

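1. Scan S|α once and find the set of frequent items b such that

a) b can be assembled to the last element of α to form a sequential pattern (I-Concatenation), or

b) (b) can be appended to α to form a sequential pattern (S-Concatenation)

2. For each frequent item b, append it to α to form a sequential pattern α', and output α';

3. For each α', construct the α'-projected database S|α', and mine it recursively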
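The three steps can be sketched as a short recursive routine. This is an illustrative reading of the steps above, not the authors' implementation: projected databases are copied physically, the database `DB` is the assumed running example, and `extend` and `prefixspan` are names chosen here. An extension of kind `"i"` is an I-concatenation (the item joins the last element) and kind `"s"` is an S-concatenation (the item starts a new element).

```python
from collections import Counter

DB = [  # assumed example database; one sorted string per element
    ("a", "abc", "ac", "d", "cf"),
    ("ad", "c", "bc", "ae"),
    ("ef", "ab", "df", "c", "b"),
    ("e", "g", "af", "c", "b", "c"),
]


def extend(suffix, last, kind, x):
    """Project one suffix on item x; return the new suffix or None."""
    for i, element in enumerate(suffix):
        if element.startswith("_"):          # leftover of the prefix's last element
            hit = kind == "i" and x in element
        elif kind == "i":                    # x must co-occur with the whole last element
            hit = x in element and set(last) <= set(element)
        else:
            hit = x in element
        if hit:
            rest = "".join(y for y in element.lstrip("_") if y > x)
            return (("_" + rest,) if rest else ()) + suffix[i + 1:]
    return None


def prefixspan(db, min_sup):
    """Return {pattern: support}; patterns are tuples of sorted strings."""
    patterns = {}

    def mine(prefix, projected):
        last = prefix[-1] if prefix else ""
        counts = Counter()
        for suffix in projected:             # step 1: one scan of S|prefix
            seen = set()                     # count each extension once per sequence
            for element in suffix:
                if element.startswith("_"):
                    seen.update(("i", x) for x in element[1:])
                else:
                    seen.update(("s", x) for x in element)
                    if last and set(last) <= set(element):
                        seen.update(("i", x) for x in element if x > last[-1])
            counts.update(seen)
        for (kind, x), sup in sorted(counts.items()):
            if sup < min_sup:
                continue
            if kind == "i":                  # step 2: grow the pattern and output it
                grown = prefix[:-1] + (last + x,)
            else:
                grown = prefix + (x,)
            patterns[grown] = sup
            narrowed = [extend(s, last, kind, x) for s in projected]
            mine(grown, [s for s in narrowed if s is not None])  # step 3

    mine((), list(db))
    return patterns


patterns = prefixspan(DB, min_sup=2)
print(patterns[("ab", "c")])        # support of <(ab)c>
print(patterns[("a", "bc", "a")])   # support of <a(bc)a>
```

No candidate is generated and then tested against the whole database: every extension that is counted already occurs in some projected suffix.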

Trang 31

Efficiency of PrefixSpan

• No candidate sequence needs to be generated

• Projected databases keep shrinking

• Major cost of PrefixSpan: constructing projected databases

Trang 32

Optimization in PrefixSpan

• Bi-level projection technique

– Bi-level projection can reduce the number and size of projected databases

• Pseudo-projection technique

– Pseudo-projection can reduce the effort of projection when the projected database fits in main memory

Trang 33

Scaling Up by Bi-Level Projection

• Partition search space based on length-2 sequential patterns

• Create projected databases and pursue recursive mining over bi-level projected databases

Trang 34

Speed-up by Pseudo-projection

• Postfixes of sequences often appear repeatedly in recursive projected databases

• Instead of collecting all the postfixes, we can use pointers referring to the sequences in the database as a pseudo-projection

• Every projection consists of two pieces of information: a pointer to the sequence in the database and the offset of the postfix in the sequence

s1 = <a(abc)(ac)d(cf)>
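A pseudo-projection can be sketched as a single integer offset into the original sequence, with the postfix rebuilt only on demand. The 0-based item offset and the names `pseudo_project` and `materialize` are assumptions of this sketch; the sequence is stored as a tuple of alphabetically sorted strings, one per element.

```python
from itertools import accumulate


def pseudo_project(seq, prefix):
    """Return the 0-based item offset where the postfix starts, or None."""
    starts = [0] + list(accumulate(len(e) for e in seq))  # item offset of each element
    pos, offset = 0, None
    for element in prefix:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return None
        # Just past the last prefix item inside the matched element.
        offset = starts[pos] + seq[pos].index(element[-1]) + 1
        pos += 1
    return offset


def materialize(seq, offset):
    """Rebuild the postfix from a (sequence, offset) pointer when it is needed."""
    starts = [0] + list(accumulate(len(e) for e in seq))
    out = []
    for i, element in enumerate(seq):
        if starts[i + 1] <= offset:
            continue                          # element lies entirely before the offset
        cut = max(offset - starts[i], 0)
        out.append(("_" if cut else "") + element[cut:])
    return tuple(out)


s1 = ("a", "abc", "ac", "d", "cf")            # <a(abc)(ac)d(cf)>
for prefix in [("a",), ("a", "b")]:
    offset = pseudo_project(s1, prefix)
    print(prefix, offset, materialize(s1, offset))
```

Storing one integer per sequence instead of a copied postfix is what makes this attractive when the database fits in main memory.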

Trang 35

Bi-Level Projection: Pair-wise Checking Using S-matrix

<(ac)> happens twice

All length-2 sequential patterns are found in S-matrix

Trang 36

Mining <ab>-projected Database

S-matrix

No hope to form (a_c), so no need to count it.

Leads to pattern <a(bc)a>

Trang 37

Benefits of Bi-level Projection

• In the example, there are 51 patterns.

• 51 level-by-level projections

• 22 bi-level projections (the S-matrix has 22 cells with a value …

Trang 38

3-way Apriori Checking

From the S-matrix above, when constructing the <ac>-projected database we drop item d, because <acd> is certainly not a pattern (<cd> is not frequent)

Trang 39

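Example - Bi-level Projection

• Scan to get length-1 sequences

• Construct a triangular matrix instead of projected databases for each length-1 pattern

         a          b          c
a        2
b     (4,2,2)       1
c     (4,2,1)    (3,3,2)       3

A diagonal cell holds the support of <xx>; an off-diagonal cell in row y, column x holds (support(<xy>), support(<yx>), support(<(xy)>)).

Support(<cc>) = 3

All length-2 sequential patterns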
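The S-matrix can be sketched as a dictionary with one entry per cell of the triangular matrix. The database `DB` is the assumed running example and `s_matrix` is an illustrative name. A real implementation would fill all cells in a single scan, whereas this sketch recounts each pattern separately for clarity.

```python
from itertools import combinations

DB = [  # assumed example database; one sorted string per element
    ("a", "abc", "ac", "d", "cf"),
    ("ad", "c", "bc", "ae"),
    ("ef", "ab", "df", "c", "b"),
    ("e", "g", "af", "c", "b", "c"),
]


def contains(seq, pattern):
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def s_matrix(db, items):
    """Triangular S-matrix over the sorted frequent items."""
    def sup(pattern):
        return sum(contains(seq, pattern) for seq in db)

    matrix = {}
    for x in items:
        matrix[(x, x)] = sup((x, x))                  # diagonal: <xx>
    for x, y in combinations(items, 2):               # x sorts before y
        matrix[(y, x)] = (sup((x, y)), sup((y, x)), sup((x + y,)))
    return matrix


m = s_matrix(DB, "abcdef")
print(m[("a", "a")], m[("b", "a")], m[("c", "a")], m[("c", "b")], m[("c", "c")])

# Every count that reaches min_sup is one length-2 sequential pattern.
min_sup = 2
n_patterns = sum(
    (cell >= min_sup) if isinstance(cell, int) else sum(c >= min_sup for c in cell)
    for cell in m.values()
)
print(n_patterns)
```

Under the assumed database the a, b and c cells reproduce the values in the matrix above.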

Trang 40

Example - Bi-level Projection

Bi-level projection (2)

• For each length-2 sequential pattern α, construct the α-projected database and find the frequent items

• Construct the corresponding S-matrix

[Figure: an α-projected database containing <(_c)(ac)(cf)> and its S-matrix over the items a, b, c and (_c); the cell values are not fully recoverable]

Date posted: 08/11/2022, 14:02
