Accepted Manuscript
Mining Erasable Itemsets with Subset and Superset Itemset
Constraints
Bay Vo, Tuong Le, Witold Pedrycz, Giang Nguyen, Sung Wook Baik
To appear in: Expert Systems With Applications
Received date: 19 July 2016
Revised date: 12 October 2016
Accepted date: 13 October 2016
Please cite this article as: Bay Vo, Tuong Le, Witold Pedrycz, Giang Nguyen, Sung Wook Baik, Mining Erasable Itemsets with Subset and Superset Itemset Constraints, Expert Systems With Applications (2016), doi:10.1016/j.eswa.2016.10.028
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights
We state the problem of mining EIs with subset and superset itemset constraints.
Two propositions supporting quick pruning of nodes are proposed.
The pMEIC algorithm, based on these two propositions, is proposed.
Experiments are conducted to show the effectiveness of pMEIC.
Mining Erasable Itemsets with Subset and Superset Itemset Constraints
Bay Vo (1, 2), Tuong Le (3, 4, *), Witold Pedrycz (5, 6, 7), Giang Nguyen (1), Sung Wook Baik (2)
(1) Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
(2) College of Electronics and Information Engineering, Sejong University, Seoul, Republic of Korea
(3) Division of Data Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam
(4) Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
(5) Department of Electrical and Computer Engineering, University of Alberta, Edmonton, T6R 2V4 AB, Canada
(6) Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah, 21589, Saudi Arabia
(7) Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Email: Bay Vo (bayvodinh@gmail.com), Tuong Le (lecungtuong@tdt.edu.vn and tuonglecung@gmail.com), Witold Pedrycz (wpedrycz@ualberta.ca), Giang Nguyen (nh.giang@hutech.edu.vn), Sung Wook Baik (sbaik@sejong.ac.kr)
Abstract. Erasable itemset (EI) mining, a branch of pattern mining, helps managers establish new plans for the development of new products. Since the problem of mining EIs was first proposed in 2009, many efficient algorithms for mining them have been developed. However, these algorithms usually require a lot of time and memory. In practice, users often need only the small number of EIs that satisfy a particular condition. With this observation in mind, in this study we develop an efficient algorithm for mining EIs with subset and superset itemset constraints ($C_0 \subseteq X \subseteq C_1$). First, based on the MEI (Mining Erasable Itemsets) algorithm, we present the MEIC (Mining Erasable Itemsets with subset and superset itemset Constraints) algorithm, in which each EI is checked against the constraints before being added to the results. Next, two propositions supporting quick pruning of nodes that do not satisfy the constraints are established. Based on these, we propose an efficient algorithm for mining EIs with subset and superset itemset constraints (called pMEIC; p: pruning). The experimental results show that pMEIC outperforms MEIC in terms of mining time and memory usage.
Keywords: data mining, erasable itemset, subset and superset itemset constraint, pruning techniques
1 Introduction
Data mining and knowledge discovery is the process of discovering interesting patterns and rules in large databases. This process combines a variety of methods stemming from artificial intelligence, machine learning, and statistics. Many problems in data mining have attracted the attention of researchers, such as mining association rules (Lin et al., 2016; Sahoo et al., 2015), application of association rules (Cheng et al., 2016; Parkinson et al., 2016; Khader et al., 2016), classification (Sun et al., 2015; Jia et al., 2015; Wang et al., 2015), and clustering (Agarwal & Bharadwaj, 2015; Das & Maji, 2015; Nanda & Panda, 2015). Pattern mining is the fundamental approach to solving the above problems, and there are currently many methods of mining frequent patterns, including Apriori (Agrawal et al., 1993), FP-growth (Han et al., 2000), dEclat (Zaki & Hsiao, 2005), NSFI (Vo et al., 2016), and FIN+ (Deng, 2016). In 2009, Deng et al. (2009) formulated the problem of mining erasable patterns (EPs), a variant of pattern mining. In this problem a factory produces many products, each of which is composed of a number of items (components). Each product generates revenue, and each item has a cost of purchase and storage. During a financial crisis, a factory will not have enough money to purchase all the required components as usual. The problem of EP mining is thus to find the patterns that can be removed while keeping the loss to the factory's profit within some conditions. Managers can then use the knowledge obtained from EPs to make a new production plan. Although many algorithms have been developed for mining EIs, the problem of mining EIs with itemset constraints remains unexplored. In this study we thus propose an efficient algorithm for mining EIs with subset and superset itemset constraints. The results of this problem take the form $Y = \{X \subseteq I \mid X \text{ is an EI and } C_0 \subseteq X \subseteq C_1\}$, where $I$ is the set of items in the database and $C_0 \subseteq C_1 \subseteq I$. Mining EIs with the subset and superset itemset constraint (Y) can be achieved simply by mining all EIs and then filtering out the results which satisfy this constraint. However, this approach is inefficient in both time and memory usage. We thus develop two pruning techniques that reduce the search space, based on two propositions about the parent-child relationships of nodes in the search tree. The main contributions of this study are as follows: (1) presenting the problem of mining EIs with subset and superset itemset constraints, (2) proposing two pruning techniques to reduce the search space, (3) presenting an efficient algorithm for mining EIs with subset and superset itemset constraints, and (4) conducting experiments to show the effectiveness of the proposed algorithm.
The paper is organized as follows. In Section 2, we present the underlying concept of EI mining and the problem statement of mining erasable itemsets with subset and superset itemset constraints. Section 3 reviews related works on EI mining. In Section 4, a modified version of the MEI algorithm for mining erasable itemsets with subset and superset itemset constraints, named the MEIC algorithm, is presented. The main contribution of this article, the pMEIC algorithm, is then proposed in Section 5. We show the results of the performance evaluation of our algorithm and MEIC in Section 6, and the conclusions of this work are presented in Section 7.
2 Basic concepts and problem statement
Let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of all items, which are abstract representations of the components of products. A product dataset is denoted by $DB = \{P_1, P_2, \ldots, P_n\}$, where each $P_i$ is a product presented in the form $\langle Items, Val \rangle$; Items are the items (or components) that constitute $P_i$, and Val is the profit that the factory generates by selling the product. A set $X \subseteq I$ is also called an itemset, and an itemset with k items is called a k-itemset. The example dataset shown in Table 1 will be used throughout this article. In it, {a, b, c, d, e, f, g, h} is the set of items, and {P1, P2, ..., P11} is the set of products.
Table 1. An example database ($DB_e$)
Product Items Val (USD)
For example, X = {ac} is an itemset. The products that contain a, c, or both are {P1, P2, P3, P4, P6, P7, P11}. Therefore, the gain of X is $g(X)$ = P1.Val + P2.Val + P3.Val + P4.Val + P6.Val + P7.Val + P11.Val = 4,650 USD.
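The gain computation in this example can be sketched in a few lines of Python. The product table below is a hypothetical stand-in, not the authors' Table 1; it was chosen only so that it agrees with the figures quoted in the text (g({ac}) = 4,650 USD over products P1, P2, P3, P4, P6, P7, P11). The function name `gain` is likewise an assumption made for illustration.

```python
# Hypothetical product table: id -> (items, profit in USD); not the authors' Table 1.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}


def gain(itemset, db=DB):
    """Sum the profit of every product that contains at least one item of `itemset`."""
    items = set(itemset)
    return sum(val for prod_items, val in db.values() if items & set(prod_items))


def products_hit(itemset, db=DB):
    """Identifiers of the products that contain at least one item of `itemset`."""
    items = set(itemset)
    return sorted(pid for pid, (prod_items, _) in db.items() if items & set(prod_items))


print(products_hit("ac"))  # [1, 2, 3, 4, 6, 7, 11]
print(gain("ac"))          # 4650
```

A product counts toward the gain as soon as it shares one item with X, which is why P2 (containing a but not c) is included.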
Given a threshold $\xi$ and a product dataset DB, a pattern X is said to be erasable if and only if
$$g(X) \le T \times \xi,$$
where $T$ is the total profit of the product dataset DB: $T = \sum_{P_k \in DB} P_k.Val$. The total gain of the factory is the sum of the gains of all products. For $DB_e$, $T$ = 5,000 USD. An itemset X is an EI if and only if $g(X) \le T \times \xi$, where $\xi$ is a user-given threshold. For example, let $\xi$ = 16%; then $g(e)$ = 600 USD, and e is an EI because $g(e) = 600 \le 5{,}000 \times 16\% = 800$. From Definitions 1 and 2, the problem of mining EIs is to find all itemsets X whose gain $g(X)$ is smaller than or equal to $T \times \xi$. Table 2 shows all EIs for $DB_e$ with $\xi$ = 16%.
Table 2. All EIs for $DB_e$ with $\xi$ = 16%
Erasable Itemsets Val (USD)
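The erasability test can be illustrated as follows. This is a minimal sketch under stated assumptions: the product table is a hypothetical stand-in for Table 1 (chosen to match T = 5,000 USD and g(e) = 600 USD as quoted in the text), the threshold is kept as an integer percentage to avoid floating-point comparisons, and the names `gain` and `is_erasable` are illustrative.

```python
# Hypothetical product table: id -> (items, profit in USD); not the authors' Table 1.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}

T = sum(val for _, val in DB.values())  # total profit of the dataset
XI_PERCENT = 16                         # threshold xi, as an integer percentage


def gain(itemset):
    """Profit lost when every product using an item of `itemset` is dropped."""
    items = set(itemset)
    return sum(val for prod_items, val in DB.values() if items & set(prod_items))


def is_erasable(itemset):
    """g(X) <= T * xi, written with integers: g(X) * 100 <= T * xi_percent."""
    return gain(itemset) * 100 <= T * XI_PERCENT


erasable_items = [item for item in "abcdefgh" if is_erasable(item)]
print(T)               # 5000
print(erasable_items)  # ['d', 'e', 'f', 'g', 'h']
```

With this table the erasable 1-itemsets are d, e, f, g, and h, which agrees with the set $EI_1$ used later in the MEIC walkthrough.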
Le and Vo (2014) defined the index of gain as an array $G$ with $G[i] = P_i.Val$ for $1 \le i \le n$. For $DB_e$, the index of gain is shown in Table 3. For example, the gain of product P4 is the value of the element at position 4 in G, denoted by G[4] = 150 dollars.
Table 3. Index of gain for $DB_e$
In addition, Le and Vo (2014) proposed the pidset and dPidset structures for mining EIs, which are summarized as follows. For an itemset X, the pidset (the set of product identifiers) is computed by $p(X) = \bigcup_{A \in X} p(A)$, where A is an item in X and p(A) is the pidset of item A, i.e., the set of identifiers of the products which include A. Let XA and XB be two itemsets with the same prefix X. The pidset of XAB is computed by $p(XAB) = p(XB) \cup p(XA)$, where p(XA) and p(XB) are the pidsets of XA and XB, respectively. The dPidset of pidsets p(XA) and p(XB), denoted as dP(XAB), is computed by $dP(XAB) = p(XB) \setminus p(XA)$. Moreover, assume that dP(XA) and dP(XB) are the dPidsets of XA and XB, respectively. The dPidset of XAB is then computed by $dP(XAB) = dP(XB) \setminus dP(XA)$.
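The pidset and dPidset formulas above can be checked on a small example. The sketch below again uses a hypothetical product table (not the authors' Table 1) and assumes, following the `E.gain = ... + Gain` step of the MEIC outline, that the gain of XAB equals the gain of XA plus the profits of the products in dP(XAB). Variable names are illustrative.

```python
# Hypothetical product table: id -> (items, profit in USD); not the authors' Table 1.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}
G = {pid: val for pid, (_, val) in DB.items()}  # index of gain: G[i] = P_i.Val


def pidset(item):
    """Identifiers of the products that include `item`."""
    return {pid for pid, (items, _) in DB.items() if item in items}


p_e, p_d, p_h = pidset("e"), pidset("d"), pidset("h")
gain_e = sum(G[i] for i in p_e)

# Level 2: dP(XAB) = p(XB) \ p(XA) with XA = {e}, XB = {d} or {h}.
dP_ed = p_d - p_e
dP_eh = p_h - p_e
gain_ed = gain_e + sum(G[i] for i in dP_ed)

# Level 3: dP(XAB) = dP(XB) \ dP(XA) with XA = {ed}, XB = {eh}.
dP_edh = dP_eh - dP_ed
gain_edh = gain_ed + sum(G[i] for i in dP_edh)

print(dP_ed, gain_ed)    # {9} 650
print(dP_edh, gain_edh)  # {10} 800
```

Only the small difference sets are stored and summed at each level, which is what keeps the incremental gain computation cheap compared with recomputing full pidsets.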
Definition 1. Given two itemsets $C_0$ and $C_1$ with $C_0 \subseteq C_1 \subseteq I$, the problem of mining EIs with a subset and superset itemset constraint is to find all EIs X with:
$$C_0 \subseteq X \subseteq C_1$$
For example, consider $DB_e$ (with T = 5,000 USD), and let $\xi$ = 16%, $C_0$ = {f} and $C_1$ = {fdh}. Table 4 shows all EIs satisfying this constraint.
Xu, 2012), MEI (Mining Erasable Itemsets) (Le & Vo, 2014), and EIFDD (Erasable Itemsets for very Dense Datasets) (Nguyen et al., 2015). First, META is an Apriori-based algorithm, which is slow because it generates candidate patterns level by level. An erasable (k-1)-itemset X is checked against all the remaining erasable (k-1)-itemsets for combination to generate candidate erasable k-itemsets, although only the small number of remaining erasable (k-1)-itemsets that have the same prefix as X are actually combined. Second, MERIT uses the NC_Sets structure to reduce memory usage, which is its main advantage. However, there are still some disadvantages with regard to storing this structure, as it leads to high memory consumption and long execution time. Third, MEI uses a divide-and-conquer strategy and the concept of the difference pidset (dPidset). Some theorems for efficiently computing itemset information to reduce mining time and memory usage were derived for MEI. Although its mining time and memory usage are better than those of META and MERIT, MEI's performance in mining EIs from very dense databases is relatively weak. Fourth, EIFDD was proposed to overcome this weakness of MEI for very dense databases by using the subsume concept. This concept helps determine information about a large number of EIs early, without the usual computational cost. In summary, EIFDD is now generally used to mine EIs from very dense databases, while MEI is used to mine EIs from the remaining types of databases.
Besides the problem of mining EIs, a number of related problems have been proposed, as follows. (1) The problem of mining top-rank-k EIs (Deng, 2013; Nguyen et al., 2014) is to find the EIs with the top k ranks of gain, which avoids finding all EPs. Deng (2013) first proposed solving this problem with a basic algorithm named VM, which uses the PID_list structure. Nguyen et al. (2014) then presented an improved version of PID_list named dPID_list. Based on this, the authors proposed a fast algorithm called dVM for mining top-rank-k EIs. (2) For the problem of mining erasable closed itemsets (ECIs), Nguyen et al. (2015) first represented and compressed the mined EIs without loss of information. They then proposed an effective algorithm to deal with this problem, named the MECP algorithm. (3) The problem of mining weighted erasable patterns was first proposed in Lee et al. (2015), which considers the distinct weight of each item according to its quality, size, price, and so on. In 2016, the same group of authors (Yun et al., 2016) proposed a new approach for mining weighted erasable patterns for streaming data applications.
In this study, we present the problem of mining EIs with subset and superset itemset constraints. The following three approaches can be applied to solve this problem. (1) Use one of the existing algorithms to mine all EIs satisfying the threshold, and then single out the EIs satisfying the constraint. In the experiment section, we use MEI to mine all EIs and single out the EIs satisfying the constraint. We call this approach MEI-N. (2) In the process of mining EIs, check whether each EI satisfies the constraint when it is created. If so, it is added to the results. This approach is presented in Section 4. (3) Expand only the itemsets that can satisfy the constraint. This approach is proposed in Section 5.
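Approach (1), mining everything and then filtering, can be sketched as below. This is not the MEI-N implementation: a brute-force enumeration of all itemsets stands in for MEI, the product table is a hypothetical stand-in for Table 1, and the helper names (`all_erasable`, `satisfies`) are assumptions made for illustration. Only the final filtering step, the test $C_0 \subseteq X \subseteq C_1$, is the point of the example.

```python
from itertools import combinations

# Hypothetical product table: id -> (items, profit in USD); not the authors' Table 1.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}
T = sum(val for _, val in DB.values())
XI_PERCENT = 16


def gain(itemset):
    items = set(itemset)
    return sum(val for prod_items, val in DB.values() if items & set(prod_items))


def all_erasable():
    """Brute-force stand-in for an EI miner: test every non-empty itemset."""
    items = sorted({i for prod_items, _ in DB.values() for i in prod_items})
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            if gain(combo) * 100 <= T * XI_PERCENT:
                yield frozenset(combo)


def satisfies(x, c0, c1):
    """Subset and superset itemset constraint: C0 <= X <= C1."""
    return c0 <= x <= c1


C0, C1 = frozenset("f"), frozenset("fdh")
result = {x for x in all_erasable() if satisfies(x, C0, C1)}
print(sorted("".join(sorted(x)) for x in result))  # ['df', 'dfh', 'f', 'fh']
```

The filter itself is trivial; the cost of this approach lies in materialising every EI first, which is the inefficiency that approaches (2) and (3) aim to reduce.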
4 MEIC algorithm
In this section, we propose the MEIC algorithm for mining EIs with subset and superset itemset constraints $C_0$ and $C_1$. In the process of mining EIs, MEIC checks whether each EI satisfies the constraint when it is created. If so, it is added to the results.
4.1 The algorithm
MEIC uses a divide-and-conquer strategy, the dPidset structure, and related theorems (Le & Vo, 2014) for mining EIs with subset and superset itemset constraints $C_0$ and $C_1$. In the first step, MEIC scans the dataset to determine T, G, and $EI_1$ with their pidsets (Line 1). MEIC then sorts $EI_1$ in descending order of pidset size (Line 2). Next, MEIC scans all elements in $EI_1$ to find the EIs that satisfy the threshold and constraints (Lines 4-11). Each itemset X ($X \in EI_1$) which satisfies the constraint $C_0 \subseteq X \subseteq C_1$ is added to the result (Lines 10-11). In the second step, for $EI_k$ (k > 1) in the same equivalence class, the algorithm combines the first element with the remaining elements to create candidate (k+1)-itemsets. If the gain of a candidate itemset X does not exceed $T \times \xi$ (Line 21), the algorithm will: (i) add X to $EI_{k+1}$ (Line 22); (ii) add X to $EI_C$ if it satisfies the constraints $C_0 \subseteq X \subseteq C_1$ (Lines 23-24); and (iii) combine the elements in $EI_{k+1}$ together to create $EI_{k+2}$ (Line 26). The MEIC algorithm is outlined below. Note that the Sub_pidsets procedure is presented in Le and Vo (2014).
Algorithm 1 MEIC algorithm
Input: Database DB, threshold ξ, and the constraints C0, C1
Output: EI_C (the erasable itemsets that satisfy the constraints C0, C1)
Scan DB to calculate T, G, and EI_1 with their pidsets
Sort EI_1 in descending order of pidsets' size
EI_next
for k ← 1 to | | - 1 do
E.items = [k].item
E.Items = [k].Items ∪ [j].Items
(E.pidset, Gain) ← Sub_pidsets([k].pidset, [j].pidset)
E.gain = [k].gain + Gain
call Expand_E(EI_next)
MEIC algorithm: an outline
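Because the outline above is only partially legible, the following Python sketch may help convey the idea described in Section 4.1. It is an illustration under stated assumptions, not the authors' listing: the function name `meic`, the tuple-based node layout, and the integer-percentage threshold are choices made here, the set difference stands in for the Sub_pidsets procedure of Le and Vo (2014), and the product table is a hypothetical stand-in for Table 1.

```python
# Hypothetical product table: id -> (items, profit in USD); not the authors' Table 1.
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}


def meic(db, xi_percent, c0, c1):
    """Depth-first EI mining; each new EI is tested against c0 <= X <= c1."""
    G = {pid: val for pid, (_, val) in db.items()}  # index of gain
    limit = sum(G.values()) * xi_percent            # T * xi, scaled by 100

    pidsets = {}
    for pid, (items, _) in db.items():
        for item in items:
            pidsets.setdefault(item, set()).add(pid)

    # Erasable 1-itemsets as (itemset, pidset, gain), largest pidset first.
    level1 = []
    for item, pids in pidsets.items():
        g = sum(G[p] for p in pids)
        if g * 100 <= limit:
            level1.append((frozenset(item), pids, g))
    level1.sort(key=lambda node: -len(node[1]))

    result = [x for x, _, _ in level1 if c0 <= x <= c1]

    def expand(nodes):
        for k in range(len(nodes) - 1):
            xk, sk, gk = nodes[k]
            nxt = []
            for j in range(k + 1, len(nodes)):
                xj, sj, _ = nodes[j]
                diff = sj - sk  # p(XB) \ p(XA) at level 1, dP(XB) \ dP(XA) below
                g = gk + sum(G[p] for p in diff)
                if g * 100 <= limit:
                    x = xk | xj
                    nxt.append((x, diff, g))
                    if c0 <= x <= c1:
                        result.append(x)
            expand(nxt)  # no-op when fewer than two nodes remain

    expand(level1)
    return result


found = meic(DB, 16, frozenset("f"), frozenset("fdh"))
print(sorted("".join(sorted(x)) for x in found))  # ['df', 'dfh', 'f', 'fh']
```

Note that the constraint test only decides whether an EI is reported; every erasable node is still expanded, so the whole search tree is built regardless of $C_0$ and $C_1$.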
4.2 Illustration of the MEIC process
An execution of the MEIC algorithm for $DB_e$ with $\xi$ = 16%, $C_0$ = {f} and $C_1$ = {fdh} is described as a sequence of the following steps:
1. MEIC scans $DB_e$ to calculate T = 5,000 USD; the index of gain (G); and $EI_1$ = {e, f, d, h, g} with their pidsets (Line 1).
2. Sort $EI_1$ in descending order of pidset size. After this step, the new order of $EI_1$ is {e, f, d, h, g} (Line 2).
3. In $EI_1$, {e, f, d, h, g} are added to $EI_{next}$, and only {f} satisfies the constraints; therefore, the algorithm adds {f} to $EI_C$ (Lines 4-11).
4. For the first equivalence class (e), the algorithm combines e with all remaining EIs {f, d, h, g} to create {ef}, {ed}, {eh} and {eg}. {ef} is excluded because its gain is 900 > $T \times \xi$ = 800. The remaining itemsets are EIs and are added to $EI_{next}$ = {ed, eh, eg}. At this stage, no EIs satisfy the constraints, and therefore $EI_C = \emptyset$. MEIC calls the Expand_E procedure with $EI_{next}$ as a parameter to combine these EIs and create the EIs at the next level. This step is repeated recursively until no more EIs are created, at which point the algorithm stops processing this equivalence class. The result for this equivalence class is shown in Figure 1; note that $EI_C$ is still empty.
Fig. 1. The search tree of equivalence class (e) for $DB_e$ with $\xi$ = 16%, $C_0$ = {f} and $C_1$ = {fdh}
5. Repeat step 4 with each remaining equivalence class in $EI_1$. The process completes when no more EIs are created. The results, $EI_C$, are {f, fd, fh, fdh}. Figure 2 shows the search tree of MEIC for $DB_e$ with $\xi$ = 16%, $C_0$ = {f} and $C_1$ = {fdh}.
Fig. 2. The search tree of MEIC for $DB_e$ with $\xi$ = 16%, $C_0$ = {f} and $C_1$ = {fdh}
The search tree of MEI-N is the same as the search tree of MEIC in Fig. 2, but without the set of EIs that satisfy the constraints $C_0$ and $C_1$. MEI-N therefore has to scan the whole search tree again to find the EIs that satisfy the constraints, whereas MEIC gives the result without rescanning the tree. In this respect, MEIC is better than MEI-N. However, in the process of searching for EIs satisfying the constraints $C_0$ and $C_1$, MEIC creates many redundant nodes. For example, the branches of {e}, {d}, {h} and {g} are redundant. Therefore, we propose the pMEIC algorithm to alleviate this disadvantage.
5 pMEIC algorithm
In this section we propose two propositions for fast mining of EIs with subset and superset itemset constraints $C_0$ and $C_1$. The two propositions are as follows.
Proposition 1. Let $EI_C$ be the set of EIs that satisfy the constraint, and let $EC(C_0)$ be the equivalence class of $C_0$ in the search tree. If $X \notin EC(C_0)$, then X does not belong to $EI_C$.