A Fast Parallel Algorithm for Discovering Frequent Patterns


Kawuu W. Lin
Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, R.O.C.
linwc@cc.kuas.edu.tw

Yu-Chin Luo
Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, R.O.C.
kim-x@yahoo.com.tw

Abstract

Fast discovery of frequent patterns is one of the most extensively discussed problems in data mining because of its wide applications. As the size of the database increases, the computation time and the required memory increase severely. The difficulty of mining large databases has motivated research on parallel and distributed algorithms. Most past studies parallelize the computation by dividing the database and distributing the parts to other nodes for mining. This approach might leak data and is evidently not suitable for sensitive domains such as health care. In this paper, we propose a novel data mining algorithm named FD-Mine that efficiently utilizes the nodes to discover frequent patterns in cloud computing environments with data privacy preserved. Through empirical evaluations under various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.

Keywords: association rule mining; frequent pattern mining; privacy preserved

I. Introduction

With the progress of information technology, data mining techniques have been extensively applied in various domains. The goal of data mining is to discover hidden, useful information in large databases. The discovered information can support decision processes, aid commercial promotion, and so forth. Data mining includes four main topics: association rule mining [2], sequential pattern mining [3], clustering [11] and classification [5]. Among these, the problem of frequent pattern mining, i.e. association rule mining and sequential pattern mining, is the most discussed because of its wide applications.

The basic concept of the frequent pattern mining problem is to discover the patterns whose frequency of appearance in the database is greater than a specific threshold. An association rule is defined as X=>Y, where X and Y are sets of items. The concept of association rule mining is to discover the sets of items that tend to be associated with others in the database. Studies on association rule mining can be classified into two types: 1) the generate-and-test (Apriori-like) approach [2] and 2) the frequent pattern growth (FP-growth-like) approach [6]. The Apriori-like approach generates candidate itemsets of size (k+1) from the frequent itemsets of size k and scans the database repeatedly to test the frequency of each candidate itemset. The Apriori-like methods therefore suffer from the large number of candidate itemsets, especially when the support threshold is small. For this reason, Han et al. [6] proposed a novel data structure, named frequent pattern tree (FP-tree), in which the transactions are compressed and stored. A mining algorithm, namely FP-growth, was also proposed for discovering the frequent patterns from the FP-tree. FP-growth needs only two scans of the physical database and therefore greatly improves the execution time.
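The generate-and-test idea can be illustrated with a minimal sketch. It is not taken from the paper: the function name `frequent_itemsets`, the absolute threshold `min_count`, and the use of "at least `min_count`" as the frequency test are assumptions made for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Apriori-style generate-and-test (illustrative sketch, not the paper's code).

    Assumption: an itemset is frequent when it appears in at least
    `min_count` transactions.
    """
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count}
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = sum(s <= t for t in transactions)
        # Generate: join frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Test: one pass over the database per level.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_count}
        k += 1
    return result
```

Each loop iteration rescans all transactions, which is the repeated-scan cost that the FP-tree is meant to avoid.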

As the size of the database increases, the computation time and the required memory increase severely. Many studies on association rule mining were proposed mainly to improve the efficiency in terms of execution time. In the past decades, parallel and distributed computing (PDC) techniques have attracted extensive attention for their ability to manage and process large amounts of data. The difficulty of mining large databases has motivated research on parallel and distributed algorithms for the problem [7], [8], [10], [13], [14]. The main approach of the existing studies is to divide the database and then distribute each part to nodes or processors for mining, with the goal of distributing the computational load. During the mining process, the nodes exchange required transactions with each other. The workload of exchanging data among nodes becomes heavy when the average transaction length is long or the database is large. Although many algorithms have been proposed, the execution efficiency of frequent pattern mining remains a challenge because of the data explosion. In addition to the exchange workload, data privacy is also a major concern, since this kind of algorithm duplicates the database to the nodes in the PDC architecture. This approach is evidently not suitable for sensitive domains such as health care.

In this paper, we propose a novel data mining method named FD-Mine that efficiently utilizes the cloud nodes to quickly discover frequent patterns in cloud computing environments with data privacy preserved. Through empirical evaluations under various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 proposes the architecture and presents the data mining algorithm. Section 4 presents the empirical evaluation for the performance study. Section 5 gives the conclusions.

II. Related Work

To improve the performance of association rule mining, many researchers have tried to distribute the computation across processors or nodes. In [9], the authors proposed a parallel algorithm named Parallel FP-tree (PFP-tree), based on the FP-tree data structure, for mining frequent patterns on message passing multiprocessor systems. The algorithm divides the database into several non-overlapping parts according to the number of available processors, and lets each processor construct its FP-tree by exchanging necessary information with the other processors. Because the exchange is done within the same node, the overhead might not be severe. To parallelize frequent pattern mining, past studies relied mainly on the database dividing method [4], [15]. The database is divided equally or by some criteria, and each part is sent to a node for mining. An approach that duplicates the database to other nodes risks leaking the data, so data privacy cannot be preserved by this approach.

Note that in cloud computing environments network latency is an important issue that should be carefully considered. The targeted database in mining applications is generally large. Transmitting the database and exchanging large amounts of data over the internet will greatly degrade performance. In [12], the proposed method, named QFP-growth, divides the database equally and constructs FP-trees from the assigned parts of the database. The FP-trees are then merged into a single FP-tree to complete the mining task.

The data transmission overhead was studied in [14]. The authors observed that the time spent exchanging transactions is much greater than the mining time. To exchange transactions efficiently among nodes under the database dividing approach, TPFP-tree was proposed; it uses a transaction identification set (Tidset) to select the transactions directly instead of scanning the physical database. The Tidset is a table recording the IDs of the transactions that contain a certain item, so the memory required by the Tidset is about the same size as the assigned partial database. Therefore, the memory requirement of TPFP-tree is bound to the size of the targeted database.
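A Tidset as described above can be sketched in a few lines. This is only an illustration of the data structure, not the implementation from [14]; the names `build_tidset` and `select_transactions` and the dictionary layout of a partition (transaction ID mapped to a list of items) are assumptions.

```python
from collections import defaultdict


def build_tidset(partition):
    """Map each item to the set of IDs of the transactions that contain it."""
    tidset = defaultdict(set)
    for tid, items in partition.items():
        for item in items:
            tidset[item].add(tid)
    return dict(tidset)


def select_transactions(partition, tidset, wanted_items):
    """Pick the transactions containing any wanted item via Tidset lookups,
    without scanning every transaction in the partition."""
    tids = set()
    for item in wanted_items:
        tids |= tidset.get(item, set())
    return {tid: partition[tid] for tid in sorted(tids)}
```

The table holds one entry per (item, transaction) occurrence, which is why its size grows with the assigned partial database.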

To balance the computational load of TPFP-tree, the authors of [15] proposed the BTP-tree algorithm, a balanced Tidset-based parallel FP-tree algorithm for mining frequent patterns. The algorithm divides the database equally into p parts, where p is the number of nodes. The partial databases are sent to the nodes individually. Each node establishes the Tidset and header table in accordance with its assigned database. A global header table, named GHT, is derived by gathering the header tables of all the nodes into one table and filtering out the items whose support is smaller than the threshold. Before executing the mining task, the BTP-tree algorithm calculates a performance index for each node and records the sum of the performance indexes. The mining task is then separated into p sub-tasks, where the load of each sub-task is measured by the number of items in the header table. The task assignment is based on the performance indexing. After the task assignment, each node constructs its Tidset for fast selection. The required transactions are exchanged among nodes to generate the new sub-databases by referring to the items of the header tables. Finally, FP-growth is performed on each node to discover the frequent patterns, which are then gathered from all the nodes to obtain the complete set of frequent patterns.
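The derivation of the global header table can be sketched as follows. This is a hedged illustration rather than the code of [15]: the function name `global_header_table`, the representation of a header table as an item-to-count dictionary, and the "at least `min_count`" test are assumptions.

```python
from collections import Counter


def global_header_table(local_tables, min_count):
    """Gather per-node header tables and drop items below the threshold.

    Assumption: each local table maps an item to its local support count.
    """
    total = Counter()
    for table in local_tables:
        total.update(table)  # add local counts into the global counts
    kept = [(item, count) for item, count in total.items() if count >= min_count]
    # Most frequent items first; ties broken by item name for determinism.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```

An item that is infrequent on every single node can still be globally frequent, which is why the filtering happens only after the counts are summed.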

III. Proposed Algorithm: FD-Mine

In this section, we describe the proposed algorithm, which efficiently distributes the computation in cloud computing environments. The cloud architecture for mining frequent patterns is introduced in Section 3.1. In Section 3.2, we formulate the problem. The details of the proposed algorithm are described in Section 3.3.

3.1 Proposed Cloud Architecture for Frequent Pattern Mining

Note that in cloud computing environments data privacy is an important issue. Since the clouds are physically distributed and each cloud node provides only its computation ability, the trustworthiness of the nodes cannot be guaranteed. Therefore, to preserve data privacy, only a node that is safe, rather than every node, can access the database. In our architecture, we call this node the trusted node or kernel node, and the cloud in which it is located the kernel cloud. For efficient data transmission among clouds, each cloud is designed to have only one node that connects to other clouds, named the connection node and abbreviated as conn-node. If a node N needs data from the trusted node, N asks the conn-node of its cloud whether the conn-node has the data. If the conn-node has the data, N downloads the data from the conn-node via the intranet. Otherwise, the data is first duplicated to the conn-node via the internet, and then N downloads it from the conn-node via the intranet. With this transmission policy, each cloud fetches the data over the internet at most once, which keeps the network latency low.
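The transmission policy amounts to a per-cloud cache at the conn-node. The sketch below is a simplified in-process model, not the paper's implementation; the class names `TrustedNode` and `ConnNode`, the method names, and the transfer counters are hypothetical and exist only to make the policy observable.

```python
class TrustedNode:
    """Holds the data; every send() models one transfer over the internet."""

    def __init__(self, data):
        self.data = data
        self.internet_transfers = 0

    def send(self, key):
        self.internet_transfers += 1
        return self.data[key]


class ConnNode:
    """Connection node of one cloud; caches whatever it has fetched."""

    def __init__(self, trusted):
        self.trusted = trusted
        self.cache = {}
        self.intranet_transfers = 0

    def fetch(self, key):
        # Go over the internet only if this cloud does not hold the data yet.
        if key not in self.cache:
            self.cache[key] = self.trusted.send(key)
        # Serving a node inside the cloud models one intranet transfer.
        self.intranet_transfers += 1
        return self.cache[key]
```

Under this model, three nodes of the same cloud requesting the same item cause one internet transfer and three intranet transfers.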

[Figure 1 legend: Physical Machine; Trusted Node (Virtual Machine); Connection Node (Virtual Machine); Computing Node (Virtual Machine)]

Figure 1. Proposed architecture for frequent pattern mining.

In this architecture, each conn-node maintains a table that records the status of the nodes in its cloud. The information recorded for each node contains the node's ID and its availability. All of the tables are then gathered at the kernel node, so that the kernel node has complete information about the available computation capacity in terms of available nodes. The information is updated periodically.

3.2 Mining frequent patterns in cloud computing environments

One characteristic of the proposed algorithm is that data privacy is preserved. Unlike the parallel Apriori-like algorithms [4], which need to duplicate the database to remote nodes, or the BTP-tree algorithm [15], which distributes parts of the database directly to cloud nodes, only the kernel node is permitted to access the database in our architecture and algorithm. Besides the data leakage problem of the conventional algorithms, the time required to duplicate the physical database is considerable.

The data structure used by the proposed algorithm is based on that of FP-growth. The FP-tree is a data structure that stores the frequent items in compressed form. Because the items with support smaller than the support threshold are filtered out, and the filtered transactions are merged into the FP-tree, the complete transaction of any user cannot be retrieved back from the FP-tree. Moreover, because the FP-tree is often implemented as a linked list and our algorithm also compresses the FP-tree with ZIP to reduce the transmission time, the original transactions are not directly exposed during transmission. Data privacy can therefore be preserved.
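The two properties used in this argument, filtering of infrequent items and compression before transmission, can be seen in a small sketch. It is not the authors' Java implementation: the names `FPNode`, `build_fp_tree`, `to_nested`, `pack` and `unpack` are hypothetical, header-table node links are omitted, and `zlib` (DEFLATE, the method commonly used inside ZIP archives) stands in for the ZIP step.

```python
import pickle
import zlib
from collections import Counter


class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Two-scan FP-tree construction (sketch; header-table links omitted)."""
    # Scan 1: count item supports and keep only the frequent items.
    support = Counter(i for t in transactions for i in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_count}
    root = FPNode(None)
    # Scan 2: insert each filtered transaction in descending-support order.
    for t in transactions:
        path = sorted((i for i in set(t) if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, frequent


def to_nested(node):
    """Plain nested-dict view of the tree: item -> (count, subtree)."""
    return {item: (child.count, to_nested(child))
            for item, child in node.children.items()}


def pack(root):
    """Serialize and compress the tree before sending it to another node."""
    return zlib.compress(pickle.dumps(to_nested(root)))


def unpack(blob):
    """Decompress and deserialize a tree received from another node."""
    return pickle.loads(zlib.decompress(blob))
```

In this sketch, items that fall below the threshold never enter the tree, so a receiving node sees only shared prefixes of frequent items with counts, not individual transactions.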

3.3 FD-Mine algorithm

The purpose of FD-Mine is fast mining. In cloud computing environments, distributing the mining computation entails data transmission over the network. In BTP-tree [15], the database is divided equally into several parts, which are sent to the available nodes. The nodes then request the required data from each other to finish the mining task. In practice, the database is often large. This approach therefore not only leaks the data but also incurs a large amount of data transmission, so the performance of this kind of approach is expected to be poor.

An intuitive way to save time is to minimize the amount of data transmission. FD-Mine is designed to transmit as little data as possible in order to save network latency and disk I/O time. The algorithm is presented in Figure 2.

The details of FD-Mine are as follows. The trusted node TN follows the FP-tree construction algorithm: it scans the database twice and constructs the corresponding FP-tree, which is stored in TN (line 1). The next step is to obtain the header table HT (line 2) and to divide HT into |N| disjoint sets, stored in IS (line 3). Since the frequent patterns are not predictable, HT is divided randomly with the goal of balancing the load of each node. Regarding execution efficiency, the most important issue is that the amount of data transmission should be minimized.
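The random division of the header table can be sketched as below. The paper names the step divideHT but does not give its body, so the function `divide_ht`, its `seed` parameter, and the shuffle-then-deal strategy are assumptions chosen for illustration.

```python
import random


def divide_ht(header_items, num_nodes, seed=None):
    """Randomly split header-table items into `num_nodes` disjoint sets.

    Assumption: items are shuffled and then dealt round-robin, so the set
    sizes differ by at most one.
    """
    rng = random.Random(seed)
    items = list(header_items)
    rng.shuffle(items)
    return [items[i::num_nodes] for i in range(num_nodes)]
```

Equal set sizes do not guarantee equal mining time, since the work for each conditional item depends on the patterns it yields; the random assignment only avoids a systematic imbalance.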

To minimize the amount of data transmission, the FP-tree constructed on TN is duplicated to each idle node. In cloud computing environments, we also consider the problem of network latency. Since internet latency is always larger than intranet latency, the FP-tree duplication should be done over the intranet. For this reason, the FP-tree duplication proceeds as follows. First, the algorithm selects an idle node n (line 5) and selects the connection node cn of n from the cloud architecture C (line 6). If cn has no duplicated FP-tree, TN duplicates one to cn (lines 7 to 9). Note that, to minimize the transmission overhead, the FP-tree should be compressed in advance. Afterwards, node n obtains the compressed FP-tree via the intranet and decompresses it (line 10). After receiving the FP-tree, node n is assigned a subset of IS (line 11) and batch-runs FP-growth for each conditional item in the subset to mine the frequent patterns (lines 12 to 13). Each node thus needs only one data transmission, i.e. the FP-tree duplication, and that transmission takes place over the intranet to minimize the network latency. After all of the |N| disjoint sets are processed, the frequent patterns are returned (line 15).

Algorithm FD-Mine
Input: A transaction database DB, a minimum support threshold, the trusted node TN, and a set of nodes N with cloud architecture C
Output: The complete set of frequent patterns, FP
1      // TN reads the DB and constructs the corresponding FP-tree FPT
2      // Obtain the header table HT of FPT
3      IS ← divideHT(|N|)    // Randomly divide the items of HT into |N| disjoint sets
5      n ← selectNode(N, i)  // Select the i-th node
6      // Select the conn-node cn of n
7-9    // Duplicate FPT from TN if cn does not have FPT
10     // Duplicate FPT from the conn-node of n
11     is_i ← getSet(IS, i)  // Obtain the i-th set of IS
12-13  // Batch-run FP-growth for each conditional item in is_i to mine the frequent patterns

Figure 2. FD-Mine Algorithm.

IV. Experimental Results

To evaluate the performance of the proposed algorithm, we use IBM's Quest Synthetic Data Generator [1] to generate the workload data for mining. The experiments were conducted on a cloud system with three clouds. The first cloud contains four nodes, including the kernel node, each equipped with an E8400 2.4 GHz CPU, 1 GB of available RAM and 320 GB of disk storage. The second and third clouds contain four and three nodes respectively, each equipped with a P8600 2.4 GHz CPU, 1 GB of available RAM and 160 GB of disk storage. Note that the kernel node is responsible for receiving the requests and is not used for mining. Therefore, ten nodes in total can be used for mining in the system. Since there are very few parallel, privacy-preserving algorithms for frequent pattern mining, we select BTP-tree for comparison, which is one of the most efficient algorithms that can parallelize the mining task on grid systems. Both FD-Mine and BTP-tree were implemented in Java, and the message passing among nodes and remote [...] technology. Since most of the existing parallel algorithms follow the database dividing approach, we select the most efficient one, BTP-tree, for performance comparison.

4.1 Effects of varying the number of cloud nodes

In the following experiments, we investigate the performance of FD-Mine in terms of execution time by varying the number of cloud nodes from 1 to 10. The results on dataset T20.I5.N100K.D100K are described first. The support threshold is set to 0.03%, which is a very small value, in order to verify the performance of both algorithms, FD-Mine and BTP-tree. Figure 3 shows that the execution time of both FD-Mine and BTP-tree decreases as the number of nodes increases. The execution time of FD-Mine is almost the same as that of BTP-tree when only one node is available. This is expected, because both of them perform FP-growth on a single node. The execution time of FD-Mine is slightly more than that of BTP-tree when the number of nodes is 2 or 3. This is because the time spent on compression and decompression is more than the time needed to transmit the divided parts of the database directly. When there are more than 3 nodes, FD-Mine shows the advantage of sending after compression: less time is required to complete the whole mining task.

Figure 3. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T20.I5.N100K.D100K. (x-axis: Number of Nodes; y-axis: execution time)

Figure 4 shows the impact on execution time when the average transaction length is increased to 40. FD-Mine delivers better performance than BTP-tree when the number of nodes is greater than 2. The reason is that BTP-tree, as a database dividing approach, needs to exchange transactions among nodes, and its performance suffers from the large number of exchanged transactions.

Figure 4. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T40.I5.N100K.D100K. (x-axis: Number of Nodes; y-axis: execution time)

Figure 5 shows the performance of FD-Mine and BTP-tree with the number of transactions set to 200K. In this experiment, FD-Mine outperforms BTP-tree when the number of nodes is greater than 2, which demonstrates the intrinsic drawback of the database dividing approach. Across this series of experiments, FD-Mine not only preserves data privacy but also delivers better performance than BTP-tree in terms of execution time, especially when the database is large.

Figure 5. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T20.I5.N100K.D200K. (x-axis: Number of Nodes; y-axis: execution time)

4.2 Effects of varying the parameters of dataset

In this section, we study the effects of varying the support threshold and the parameters of the data generator, namely the number of transactions and the average transaction length. Two algorithms, FD-Mine and BTP-tree, are compared in the experiment.

In Figure 6, we explore the impact on execution time of varying the support threshold from 0.05% to 0.01% with ten cloud nodes. FD-Mine always requires less time than BTP-tree. The efficiency of FD-Mine in execution time is mainly achieved by reducing the transmission overhead and the disk I/O time. In this experiment, the time required by FD-Mine is only about 82% of the execution time of BTP-tree on average.

Figure 6. The execution time for FD-Mine and BTP-tree with support threshold varied on dataset T20.I5.N100K.D100K. (x-axis: Support Threshold (%); y-axis: execution time)

V. Conclusions

In this paper, we have presented an algorithm named FD-Mine that efficiently utilizes the cloud nodes to discover frequent patterns in cloud computing environments with data privacy preserved. The proposed approach is composed of two algorithms, namely HD-Mine and FD-Mine. Conventional algorithms for mining datasets with a large number of frequent patterns are limited by the available memory. The proposed HD-Mine is able to discover the frequent patterns from this kind of dataset by merging the memory of several nodes. The proposed FD-Mine focuses on the fast discovery of frequent patterns by utilizing the cloud nodes, and is useful for applications that emphasize real-time mining. Another important characteristic of the proposed algorithms is that data privacy is preserved. Unlike the parallel Apriori-like algorithms, which need to duplicate the database to remote nodes, or the BTP-tree algorithm, which distributes parts of the database directly to cloud nodes, the database is never duplicated and only the kernel node is permitted to access the database in our architecture and algorithms. Through empirical evaluations under various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.

Acknowledgement

This research was partially supported by National [...] No. 97-2218-E-151-003-MY2.

References

[1] R. Agrawal and R. Srikant, Quest Synthetic Data Generator, IBM Almaden Research Center, San Jose, California, http://www.almaden.ibm.com/cs/quest/syndata.html.

[2] R. Agrawal, T. Imielinski, A. Swami, "Mining association rules between sets of items in large databases," Proc. ACM SIGMOD Intl. Conf. Management of Data, 1993.

[3] R. Agrawal, R. Srikant, "Mining Sequential Patterns," Proc. of the 11th Int'l Conf. on Data Engineering, 1995, pp. 3-14.

[4] R. Agrawal, John C. Shafer, "Parallel Mining of Association Rules," IEEE Transactions on Knowledge and Data Engineering, December 1996.

[5] R. J. Bayardo, Jr., "Brute-force mining of high-confidence classification rules," Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD'97), Newport Beach, California, USA.

[6] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns Without Candidate Generation," Proc. of ACM Int. Conf. on Management of Data (SIGMOD), 2000, pp. 1-12.

[7] J.D. Holt, S.M. Chung, "Parallel mining of association rules from text databases on a cluster of workstations," Proceedings of 18th International Symposium on Parallel and Distributed Processing, 2004, p. 86.

[8] P. Iko and M. Kitsuregawa, "Shared Nothing Parallel Execution of FPgrowth," DBSJ Letters, Volume 2, No. 1, 2003, pp. 43-46.

[9] A. Javed, A. Khokhar, "Frequent Pattern Mining on Message Passing Multiprocessor Systems," Distributed and Parallel Databases, Volume 16, Issue 3, 2004, pp. 321-334.

[10] T. Li, S. Zhu, M. Ogihara, "A New Distributed Data Mining Model Based on Similarity," Symposium on Applied Computing, 2003, pp. 432-436.

[11] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 226-231.

[12] Y. Qiu, Y. J. Lan and Q. S. Xie, "An improved algorithm of mining from FP-tree," Proceedings of the Third International Conference on Machine Learning and Cybernetics, 2004, pp. 26-29.

[13] E.-H. S. Han, G. Karypis, and V. Kumar, "Scalable parallel data mining for association rules," IEEE Transactions on Knowledge and Data Engineering, 12(3):352-377, 2000.

[14] J. Zhou, K.-M. Yu, "Tidset-based Parallel FP-tree Algorithm for the Frequent Pattern Mining Problem on PC Clusters," Lecture Notes in Computer Science 5036, 2008, pp. 18-28.

[15] J. Zhou, K.-M. Yu, "Balanced Tidset-based Parallel FP-tree Algorithm for the Frequent Pattern Mining on Grid System," Fourth International Conference on Semantics, Knowledge and Grid, 2008.
