DSpace at VNU: Mining erasable itemsets with subset and superset itemset constraints


Accepted Manuscript

Mining Erasable Itemsets with Subset and Superset Itemset Constraints

Bay Vo, Tuong Le, Witold Pedrycz, Giang Nguyen, Sung Wook Baik

To appear in: Expert Systems With Applications

Received date: 19 July 2016

Revised date: 12 October 2016

Accepted date: 13 October 2016

Please cite this article as: Bay Vo, Tuong Le, Witold Pedrycz, Giang Nguyen, Sung Wook Baik, Mining Erasable Itemsets with Subset and Superset Itemset Constraints, Expert Systems With Applications (2016), doi:10.1016/j.eswa.2016.10.028

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

• We state the problem of mining EIs with subset and superset itemset constraints.

• Two propositions supporting quick pruning of nodes are proposed.

• The pMEIC algorithm, based on these two propositions, is proposed.

• Experiments are conducted to show the effectiveness of pMEIC.


Mining Erasable Itemsets with Subset and Superset Itemset Constraints

Bay Vo (1, 2), Tuong Le (3, 4, *), Witold Pedrycz (5, 6, 7), Giang Nguyen (1), Sung Wook Baik (2)

1 Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam

2 College of Electronics and Information Engineering, Sejong University, Seoul, Republic of Korea

3 Division of Data Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam

4 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam

5 Department of Electrical and Computer Engineering, University of Alberta, Edmonton, T6R 2V4 AB, Canada

6 Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah, 21589, Saudi Arabia

7 Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Email: Bay Vo (bayvodinh@gmail.com), Tuong Le (lecungtuong@tdt.edu.vn and tuonglecung@gmail.com), Witold Pedrycz (wpedrycz@ualberta.ca), Giang Nguyen (nh.giang@hutech.edu.vn), Sung Wook Baik (sbaik@sejong.ac.kr)

Abstract. Erasable itemset (EI) mining, a branch of pattern mining, helps managers to establish new plans for the development of new products. Since the problem of mining EIs was first proposed in 2009, many efficient algorithms for mining them have been developed. However, these algorithms usually require a lot of time and memory. In reality, users only need a small number of EIs which satisfy a particular condition. With this observation in mind, in this study we develop an efficient algorithm for mining EIs with subset and superset itemset constraints (C0 ⊆ X ⊆ C1). Firstly, based on the MEI (Mining Erasable Itemsets) algorithm, we present the MEIC (Mining Erasable Itemsets with subset and superset itemset Constraints) algorithm, in which each EI is checked against the constraints before being added to the results. Next, two propositions supporting quick pruning of nodes that do not satisfy the constraints are established. Based on these, we propose an efficient algorithm for mining EIs with subset and superset itemset constraints (called pMEIC, where p stands for pruning). The experimental results show that pMEIC outperforms MEIC in terms of mining time and memory usage.

Keywords: data mining, erasable itemset, subset and superset itemset constraint, pruning techniques

1 Introduction

Data mining and knowledge discovery is the process of discovering interesting patterns and rules in large databases. This process combines a variety of methods stemming from artificial intelligence, machine learning, and statistics. Many problems in data mining have attracted the attention of researchers, such as mining association rules (Lin et al., 2016; Sahoo et al., 2015), application of association rules (Cheng et al., 2016; Parkinson et al., 2016; Khader et al., 2016), classification (Sun et al., 2015; Jia et al., 2015; Wang et al., 2015), and clustering (Agarwal & Bharadwaj, 2015; Das & Maji, 2015; Nanda & Panda, 2015). Pattern mining is the fundamental approach to solving the above problems, and there are currently many methods of mining frequent patterns, including Apriori (Agrawal et al., 1993), FP-growth (Han et al., 2000), dEclat (Zaki & Hsiao, 2005), NSFI (Vo et al., 2016), and FIN+ (Deng, 2016). Deng et al. (2009) formulated the problem of mining erasable patterns (EPs), a variant of pattern mining. In this problem a factory produces many products, which are composed of a number of items (components). Each product generates revenue, and each item has a cost of purchase and storage. During a financial crisis, a factory will not have enough money to purchase all the required components as usual. The problem of EP mining is thus to find the patterns that can be removed to reduce the loss to the factory's profit under some conditions. Managers can then utilize the knowledge obtained from EPs to make a new production plan. Although many algorithms have been developed for mining EIs, the problem of mining EIs with itemset constraints remains unexplored. In this study we thus propose an efficient algorithm for mining EIs with subset and superset itemset constraints. The result of this problem has the form Y = {X ⊆ I | X is an EI and C0 ⊆ X ⊆ C1}, where I is the set of items in the database and C0 ⊆ C1 ⊆ I. Mining EIs with the subset and superset itemset constraint (Y) can be achieved simply by mining all EIs and then keeping only the results which satisfy this constraint. However, this approach is inefficient in both time and memory usage.

We thus develop two pruning techniques to reduce the search space, based on two propositions concerning the parent-child relationships of nodes in the search tree. The main contributions of this study are as follows: (1) presenting the problem of mining EIs with subset and superset itemset constraints, (2) proposing two pruning techniques to reduce the search space, (3) presenting an efficient algorithm for mining EIs with subset and superset itemset constraints, and (4) conducting experiments to show the effectiveness of the proposed algorithm.

The paper is organized as follows. In Section 2, we present the underlying concepts of EI mining and the problem statement of mining erasable itemsets with subset and superset itemset constraints. Section 3 reviews related work on EI mining. In Section 4, a modified version of the MEI algorithm for mining erasable itemsets with subset and superset itemset constraints, named the MEIC algorithm, is presented. The main contribution of this article, the pMEIC algorithm, is then proposed in Section 5. We show the results of the performance evaluation of our algorithm and MEIC in Section 6, and the conclusions of this work are presented in Section 7.


2 Basic concepts and problem statement

Let I = {i1, i2, ..., im} be the set of all items, which are the abstract representations of components of products. A product dataset is denoted by DB = {P1, P2, ..., Pn}, where each Pi is a product presented in the form ⟨Items, Val⟩: Items are the items (or components) that constitute Pi, and Val is the profit that the factory generates by selling the product. A set X ⊆ I is also called an itemset, and an itemset with k items is called a k-itemset. The example dataset shown in Table 1 will be used throughout this article, in which {a, b, c, d, e, f, g, h} is the set of items and {P1, P2, ..., P11} is the set of products.

Table 1. An example database (DBe)

Product Items Val (USD)

The gain of an itemset X, written g(X), is the sum of Val over all products that contain at least one item of X. For example, X = {ac} is an itemset. The products that contain {a}, {c}, or {ac} are {P1, P2, P3, P4, P6, P7, P11}. Therefore, g(X) = P1.Val + P2.Val + P3.Val + P4.Val + P6.Val + P7.Val + P11.Val = 4,650 USD.
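The gain computation in the example above can be sketched in Python. The rows of `DB` are an assumption: Table 1's rows are not reproduced in this text, so the values below were chosen only to agree with the figures the text quotes (the seven products containing a or c, and their 4,650 USD total). The names `DB`, `covering_products`, and `gain` are illustrative.

```python
# Assumed rows (pid -> (items, Val in USD)); Table 1's rows are not shown in
# this text, so these values are illustrative and only match its quoted totals.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}


def covering_products(itemset):
    """Identifiers of the products that contain at least one item of itemset."""
    wanted = set(itemset)
    return sorted(pid for pid, (items, _) in DB.items() if wanted & set(items))


def gain(itemset):
    """Total Val of the products returned by covering_products."""
    return sum(DB[pid][1] for pid in covering_products(itemset))


print(covering_products("ac"))  # [1, 2, 3, 4, 6, 7, 11]
print(gain("ac"))               # 4650
```

Note that a product counts once even when it contains several items of the itemset, which is why the gain is a sum over a union of products rather than a sum of per-item gains.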


Given a threshold ξ and product dataset DB, pattern X is said to be erasable if and only if g(X) ≤ T × ξ, where T is the total profit of product dataset DB: T = Σ_{Pk ∈ DB} Pk.Val. The total gain of the factory is the sum of the gain of all products. For DBe, T = 5,000 USD. An itemset X is therefore an EI if and only if g(X) ≤ T × ξ, where ξ is a user-given threshold. For example, let ξ = 16%; then g(e) = 600 USD, and e is an EI because g(e) = 600 ≤ 5,000 × 16% = 800. From Definitions 1 and 2, the problem of mining EIs is to find all itemsets X whose gain g(X) is smaller than or equal to T × ξ. Table 2 shows all EIs for DBe with ξ = 16%.

Table 2. All EIs for DBe with ξ = 16%

Erasable Itemsets Val (USD)
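As a rough illustration of the erasability test, the sketch below enumerates every non-empty itemset by brute force and keeps those whose gain does not exceed T × ξ. It is not one of the mining algorithms discussed in this paper, and it is only feasible because the example has eight items. The rows of `DB` are assumed (Table 1's rows are not reproduced in this text); they were chosen to agree with the quoted values T = 5,000 and gain of {e} = 600. The threshold is passed as an integer percentage so that the comparison stays exact.

```python
from itertools import combinations

# Assumed rows (pid -> (items, Val in USD)), chosen to match the quoted totals.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}
T = sum(val for _, val in DB.values())  # total profit of the dataset


def gain(itemset):
    """Sum Val over the products that contain at least one item of itemset."""
    wanted = set(itemset)
    return sum(val for items, val in DB.values() if wanted & set(items))


def is_erasable(itemset, xi_percent):
    # Integer form of gain <= T x xi; avoids floating-point rounding at the bound.
    return gain(itemset) * 100 <= T * xi_percent


def all_erasable(xi_percent):
    """Brute force over all non-empty itemsets (exponential in the item count)."""
    items = sorted({i for its, _ in DB.values() for i in its})
    return ["".join(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if is_erasable(c, xi_percent)]


print(T)                       # 5000
print(is_erasable("e", 16))    # True: 600 <= 800
print(is_erasable("ef", 16))   # False: 900 > 800
```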


Le and Vo (2014) defined the index of gain as an array G with G[i] = Pi.Val for 1 ≤ i ≤ n. For DBe, the index of gain is shown in Table 3. For example, the gain of product P4 is the value of the element at position 4 in G, denoted by G[4] = 150 USD.

Table 3. Index of gain for DBe

In addition, Le and Vo (2014) proposed the pidset and dPidset structures for mining EIs, which are summarized as follows. For an itemset X, the pidset (the set of product identifiers) is computed by p(X) = ⋃_{A ∈ X} p(A), where A is an item in X and p(A) is the pidset of item A, i.e., the set of identifiers of the products that include A. Let XA and XB be two itemsets with the same prefix X. The pidset of XAB is computed by p(XAB) = p(XB) ∪ p(XA), where p(XA) and p(XB) are the pidsets of XA and XB, respectively. The dPidset of pidsets p(XA) and p(XB), denoted as dP(XAB), is computed by dP(XAB) = p(XB) \ p(XA). Moreover, assume that dP(XA) and dP(XB) are the dPidsets of XA and XB, respectively. The dPidset of XAB is then computed by dP(XAB) = dP(XB) \ dP(XA).
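The pidset and dPidset formulas can be checked on a small case. The sketch below assumes the gain of a joined itemset is updated as the gain of the first itemset plus the values in G over the dPidset, which mirrors the `E.gain = [k].gain + Gain` line of Algorithm 1; the `Sub_pidsets` procedure itself is defined in Le and Vo (2014) and is not reproduced here. The rows of `DB` are assumed (Table 1's rows are not shown in this text), and `pidset` and `total` are illustrative names.

```python
# Assumed rows (pid -> (items, Val in USD)), chosen to match the quoted totals.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}
# Index of gain: G[pid] is the Val of product pid.
G = {pid: val for pid, (_, val) in DB.items()}


def pidset(item):
    """p(A): identifiers of the products that contain the single item A."""
    return {pid for pid, (items, _) in DB.items() if item in items}


def total(pids):
    """Sum of G over a set of product identifiers."""
    return sum(G[pid] for pid in pids)


p_e, p_d, p_h = pidset("e"), pidset("d"), pidset("h")

# e and d share the empty prefix, so dP(ed) = p(d) \ p(e).
dP_ed = p_d - p_e
dP_eh = p_h - p_e
gain_e = total(p_e)
gain_ed = gain_e + total(dP_ed)

# ed and eh share the prefix e, so dP(edh) = dP(eh) \ dP(ed).
dP_edh = dP_eh - dP_ed
gain_edh = gain_ed + total(dP_edh)

# The incremental gains agree with the gains computed from the full pidsets.
assert gain_ed == total(p_e | p_d)
assert gain_edh == total(p_e | p_d | p_h)
print(gain_e, gain_ed, gain_edh)  # 600 650 800
```

The appeal of the dPidset is that it is usually much smaller than the pidset of the joined itemset, so both the set difference and the gain update touch few identifiers.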

Definition 1. Given two itemsets C0 and C1 with C0 ⊆ C1 ⊆ I, the problem of mining EIs with a subset and superset itemset constraint is to find all EIs X with C0 ⊆ X ⊆ C1.

For example, consider DBe (with T = 5,000 USD), and let ξ = 16%, C0 = {f} and C1 = {fdh}. Table 4 shows all EIs satisfying this constraint.
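The constraint itself is a pair of set inclusions, which a minimal Python sketch can express directly. The function name `satisfies` and the candidate list are hypothetical; the list is just a sample of itemsets, not the contents of Table 2.

```python
def satisfies(itemset, c0, c1):
    """True when C0 is a subset of itemset and itemset is a subset of C1."""
    x = frozenset(itemset)
    return frozenset(c0) <= x <= frozenset(c1)


# Hypothetical sample of itemsets, filtered with C0 = {f} and C1 = {fdh}.
candidates = ["e", "f", "d", "fd", "fh", "fg", "fdh", "edh"]
kept = [x for x in candidates if satisfies(x, "f", "fdh")]
print(kept)  # ['f', 'fd', 'fh', 'fdh']
```

Applying this filter to the complete list of EIs after mining corresponds to the simple post-filtering approach; the algorithms in the following sections differ in when the check is made and how much of the search it lets them skip.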

3 Related work

Xu, 2012), MEI (Mining Erasable Itemsets) (Le & Vo, 2014), and EIFDD (Erasable Itemsets for very Dense Datasets) (Nguyen et al., 2015). First, META is an Apriori-based algorithm, which is slow because it generates candidate patterns level by level. An erasable (k-1)-itemset X is checked against all the remaining erasable (k-1)-itemsets for combination to generate candidate erasable k-itemsets, although only the small number of them that have the same prefix as X are actually combined. Second, MERIT uses the NC_Sets structure to reduce memory usage, which is its main advantage. However, there are still some disadvantages with regard to storing its structure, as this leads to high memory consumption and long execution time. Third, MEI uses a divide-and-conquer strategy and the concept of the difference pidset (dPidset). Some theorems for efficiently computing itemset information to reduce mining time and memory usage were derived for MEI. Although its mining time and memory usage are better than those of META and MERIT, MEI's performance in mining EIs from very dense databases is relatively weak. Fourth, EIFDD was thus proposed to overcome this weakness of MEI for very dense databases by using the subsume concept. This is used to help in the early determination of information about a large number of EIs, without the usual computational cost. In summary, EIFDD is now generally used to mine EIs for very dense databases, while MEI is used to mine EIs for the remaining types of databases.
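The level-wise, same-prefix combination described for META can be illustrated with a generic Apriori-style join. This is a sketch of the general technique only, not META's actual code; `join_same_prefix` is a hypothetical name, and itemsets are represented as tuples in a fixed item order.

```python
def join_same_prefix(level):
    """Join (k-1)-itemsets that agree on their first k-2 items.

    level is a list of equal-length tuples in a fixed item order; the result
    is the list of candidate k-itemsets, whose gain must still be checked.
    """
    candidates = []
    for i, x in enumerate(level):
        for y in level[i + 1:]:        # x is compared with every later itemset
            if x[:-1] == y[:-1]:       # but joined only when the prefix matches
                candidates.append(x + y[-1:])
    return candidates


level2 = [("e", "d"), ("e", "h"), ("e", "g"), ("f", "d")]
print(join_same_prefix(level2))
# [('e', 'd', 'h'), ('e', 'd', 'g'), ('e', 'h', 'g')]
```

The inner loop makes the cost visible: every pair is compared even though few pairs share a prefix, which is the inefficiency the paragraph above attributes to the level-wise approach.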

Besides the problem of mining EIs, a number of related problems have been proposed, as follows. (1) The problem of mining top-rank-k EIs (Deng, 2013; Nguyen et al., 2014) is to find the EIs whose gains rank in the top k, which avoids finding all EIs. Deng (2013) first proposed solving this problem with a basic algorithm named VM, which uses the PID_list structure. Nguyen et al. (2014) then presented an improved version of PID_list named dPID_list. Based on this, the authors proposed a fast algorithm called dVM for mining top-rank-k EIs. (2) For the problem of mining erasable closed itemsets (ECIs), Nguyen et al. (2015) first represented and compressed the mined EIs without loss of information. They then proposed an effective algorithm to deal with this problem, named the MECP algorithm. (3) The problem of mining weighted erasable patterns was first proposed in Lee et al. (2015), which considers the distinct weight of each item according to its quality, size, price, and so on. In 2016, the same group of authors (Yun et al., 2016) proposed a new approach for mining weighted erasable patterns for streaming data applications.

In this study, we present the problem of mining EIs with subset and superset itemset constraints. The following three approaches can be applied to solve this problem. (1) Use one of the existing algorithms to mine all EIs satisfying the threshold and then single out the EIs satisfying the constraint. In the experiment section, we use MEI to mine all EIs and then single out those satisfying the constraint; we call this approach MEI-N. (2) In the process of mining EIs, check whether each EI satisfies the constraint at the moment it is created. If so, it is added to the results. This approach is presented in Section 4. (3) Expand only the itemsets that can satisfy the constraint. This approach is proposed in Section 5.


4 MEIC algorithm

In this section, we propose the MEIC algorithm for mining EIs with subset and superset itemset constraints C0 and C1. In the process of mining EIs, MEIC checks whether each EI satisfies the constraint at the moment it is created. If so, it is added to the results.

4.1 The algorithm

MEIC uses a divide-and-conquer strategy, the dPidset structure, and related theorems (Le & Vo, 2014) for mining EIs with subset and superset itemset constraints C0 and C1. In the first step, MEIC scans the dataset to determine T, G, and EI1 with their pidsets (Line 1). MEIC then sorts EI1 in descending order of pidsets' size (Line 2). Next, MEIC scans all elements in EI1 to find EIs that satisfy the threshold and constraints (Lines 4-11). Each itemset X (X ∈ EI1) which satisfies the constraint C0 ⊆ X ⊆ C1 is added to the result (Lines 10-11). In the second step, for EIk (k > 1) in the same equivalence class, the algorithm combines the first element with the remaining elements to create candidate (k+1)-itemsets. If a candidate itemset X satisfies g(X) ≤ T × ξ (Line 21), the algorithm will: (i) add X to EIk+1 (Line 22); (ii) add X to EIC if it satisfies the constraints C0 ⊆ X ⊆ C1 (Lines 23-24); and (iii) combine the elements in EIk+1 together to create EIk+2 (Line 26). The MEIC algorithm is outlined below. Note that the Sub_pidsets procedure is presented in Le and Vo (2014).

Algorithm 1 MEIC algorithm

Input: Database DB, threshold ξ, and the constraints C0, C1

Output: EIC (the erasable itemsets that satisfy the constraints C0, C1)

Scan DB to calculate T, G, and EI1 with their pidsets

Sort EI1 in descending order of pidsets' size

[...]

EInext ← ∅

for k ← 1 to |…| − 1 do

E.Items = …[k].Items

[...]

E.Items = …[k].Items ∪ …[j].Items

(E.pidset, Gain) ← Sub_pidsets(…[k].pidset, …[j].pidset)

E.gain = …[k].gain + Gain

[...]

call Expand_E(EInext)

MEIC algorithm: an outline
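Because the listing above survives only in part, the following Python sketch is a reading of the description in Section 4.1 rather than a transcription of Algorithm 1. It assumes the gain of a joined itemset is the first itemset's gain plus the values of G over the dPidset, and it replaces the `Sub_pidsets` procedure with an inline set difference. The rows of `DB` are assumed (Table 1's rows are not reproduced in this text); the function name `meic` and its helpers are illustrative.

```python
# Assumed rows (pid -> (items, Val in USD)), chosen to match the quoted totals.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}
G = {pid: val for pid, (_, val) in DB.items()}  # index of gain
T = sum(G.values())                             # total profit


def meic(xi_percent, c0, c1):
    """Mine every EI, recording those with C0 <= X <= C1 as they are created."""
    c0, c1 = frozenset(c0), frozenset(c1)
    results = []

    def erasable(g):
        # Integer form of g <= T x xi, so a gain equal to the bound is kept.
        return g * 100 <= T * xi_percent

    def check(items):
        if c0 <= items <= c1:
            results.append(items)

    # Step 1: erasable 1-itemsets with their pidsets, largest pidset first.
    level = []
    for item in sorted({i for items, _ in DB.values() for i in items}):
        pids = {pid for pid, (items, _) in DB.items() if item in items}
        g = sum(G[p] for p in pids)
        if erasable(g):
            level.append((frozenset(item), pids, g))
    level.sort(key=lambda node: -len(node[1]))
    for items, _, _ in level:
        check(items)

    # Step 2: within a class, join each node with the nodes that follow it.
    def expand(nodes):
        for k, (items_k, d_k, g_k) in enumerate(nodes):
            nxt = []
            for items_j, d_j, _ in nodes[k + 1:]:
                diff = d_j - d_k                  # dPidset of the joined itemset
                g = g_k + sum(G[p] for p in diff)
                if erasable(g):
                    nxt.append((items_k | items_j, diff, g))
                    check(items_k | items_j)
            expand(nxt)                           # next level of this class

    expand(level)
    return results


found = meic(16, "f", "fdh")
print(sorted("".join(sorted(x)) for x in found))
```

The constraint check costs one pair of set inclusions per EI, so this variant does the same search work as plain EI mining and only saves the second pass over the results.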

4.2 Illustration of the MEIC process

An execution of the MEIC algorithm for DBe with ξ = 16%, C0 = {f} and C1 = {fdh} proceeds through the following steps:

1. MEIC scans DBe to calculate T = 5,000 USD, the index of gain (G), and EI1 = {e, f, d, h, g} with their pidsets (Line 1).

2. MEIC sorts EI1 in descending order of pidsets' size. After this step, the order of EI1 is {e, f, d, h, g} (Line 2).

3. In EI1, {e, f, d, h, g} are added to EInext and only {f} satisfies the constraints; therefore, the algorithm adds {f} to EIC (Lines 4-11).

4. For the first equivalence class (e), the algorithm combines e with all remaining EIs {f, d, h, g} to create {ef}, {ed}, {eh} and {eg}. {ef} is excluded because its gain is 900 > T × ξ = 800. The remaining itemsets are EIs and they are added to EInext = {ed, eh, eg}. At this stage, none of these EIs satisfies the constraints, so nothing is added to EIC. MEIC uses the Expand_E procedure with EInext as a parameter for combining these EIs to create EIs at the next level. This step is then repeated recursively until no more EIs are created, at which point the algorithm finishes this equivalence class. The result for this equivalence class is shown in Figure 1; note that this class contributes no itemsets to EIC.

Fig. 1. The search tree of equivalence class (e) for DBe with ξ = 16%, C0 = {f} and C1 = {fdh}

5. Repeat step 4 with each remaining equivalence class in EI1. The process completes when no more EIs are created. The results, EIC, are {f, fd, fh, fdh}. Figure 2 shows the search tree of MEIC for DBe with ξ = 16%, C0 = {f} and C1 = {fdh}.


Fig. 2. The search tree of MEIC for DBe with ξ = 16%, C0 = {f} and C1 = {fdh}

The search tree of MEI-N is the same as the search tree of MEIC in Fig. 2, but without the set of EIs that satisfy the constraints C0 and C1. Therefore, MEI-N has to scan the whole search tree again to find the EIs that satisfy the constraints C0 and C1, whereas MEIC gives the result without rescanning the tree. MEIC is therefore more efficient than MEI-N. However, in the process of searching for EIs satisfying the constraints C0 and C1, MEIC creates many redundant nodes. For example, the branches of {e}, {d}, {h} and {g} are redundant. Therefore, we propose the pMEIC algorithm to alleviate this disadvantage.
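To get a feel for how much of the tree is redundant, the sketch below counts the EIs under each top-level branch by brute force, using the item order {e, f, d, h, g} stated in the illustration. It is a counting aid, not part of MEIC. The rows of `DB` are assumed (Table 1's rows are not reproduced in this text), and `branch_sizes` is a hypothetical name.

```python
from itertools import combinations

# Assumed rows (pid -> (items, Val in USD)), chosen to match the quoted totals.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}
T = sum(val for _, val in DB.values())
ORDER = "efdhg"  # order of EI1 after sorting, as stated in the illustration


def gain(itemset):
    """Sum Val over the products that contain at least one item of itemset."""
    wanted = set(itemset)
    return sum(val for items, val in DB.values() if wanted & set(items))


def branch_sizes(xi_percent):
    """Number of EIs in each top-level branch, keyed by the branch's root item."""
    sizes = {}
    for k in range(1, len(ORDER) + 1):
        # combinations() keeps ORDER's order, so combo[0] is the branch root.
        for combo in combinations(ORDER, k):
            if gain(combo) * 100 <= T * xi_percent:
                sizes[combo[0]] = sizes.get(combo[0], 0) + 1
    return sizes


sizes = branch_sizes(16)
print(sizes)
```

Under these assumed rows, only the branch rooted at f can contain itemsets that include C0 = {f}, so every EI counted under the other four roots is work that a constraint-aware search could avoid.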

5 pMEIC algorithm

In this section we present two propositions for fast mining of EIs with subset and superset itemset constraints C0 and C1. The two propositions are as follows.

Proposition 1. Let EIC be the set of EIs that satisfy the constraint, and EC(C0) be the equivalence class of C0 in the search tree. If X ∉ EC(C0), then X does not belong to EIC.
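The kind of pruning that Proposition 1 permits can be sketched as follows. This is not the pMEIC algorithm, whose details (including the second proposition) lie beyond this excerpt. It rests on two assumptions labelled in the code: only items of C1 are considered, and the items of C0 are placed first in the item order, so that every itemset containing C0 falls in the class rooted at the first item. The rows of `DB` are assumed (Table 1's rows are not reproduced in this text), and `mine_pruned` is a hypothetical name.

```python
# Assumed rows (pid -> (items, Val in USD)), chosen to match the quoted totals.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}
T = sum(val for _, val in DB.values())


def gain(itemset):
    """Sum Val over the products that contain at least one item of itemset."""
    wanted = set(itemset)
    return sum(val for items, val in DB.values() if wanted & set(items))


def mine_pruned(xi_percent, c0, c1):
    """Return (EIs with C0 <= X <= C1, number of gain evaluations performed)."""
    c0, c1 = frozenset(c0), frozenset(c1)
    calls = 0

    def erasable(x):
        nonlocal calls
        calls += 1
        return gain(x) * 100 <= T * xi_percent

    # Assumption 1: X <= C1, so only erasable items of C1 can occur in X.
    items = [i for i in sorted(c1) if erasable(i)]
    if not c0 <= set(items):
        return set(), calls            # some item of C0 is not erasable
    # Assumption 2: C0's items come first, so every X >= C0 starts with items[0].
    items.sort(key=lambda i: i not in c0)
    results = set()

    def expand(prefix, start):
        for j in range(start, len(items)):
            x = prefix + items[j]
            if erasable(x):            # a non-erasable node has no erasable superset
                if c0 <= set(x):
                    results.add(frozenset(x))
                expand(x, j + 1)

    roots = [0] if c0 else range(len(items))   # skip all classes but C0's
    for r in roots:
        if c0 <= {items[r]}:
            results.add(frozenset(items[r]))
        expand(items[r], r + 1)
    return results, calls


results, calls = mine_pruned(16, "f", "fdh")
print(sorted("".join(sorted(x)) for x in results), calls)
```

For the running example this sketch evaluates the gain of six itemsets, compared with the 23 EIs that exist in the unconstrained search space under the same assumed rows.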
