A Fast Parallel Algorithm for Discovering Frequent Patterns
Kawuu W. Lin
Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, R.O.C.
linwc@cc.kuas.edu.tw

Yu-Chin Luo
Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, R.O.C.
kim-x@yahoo.com.tw

Abstract
Fast discovery of frequent patterns is among the most extensively discussed problems in data mining due to its wide applications. As the size of the database increases, the computation time and the required memory increase severely. The difficulty of mining large databases has motivated research on parallel and distributed algorithms to solve the problem. Most of the past studies tried to parallelize the computation by dividing the database and distributing the divided parts to other nodes for mining. This approach might leak data and is evidently not suitable for sensitive domains like health care. In this paper, we propose a novel data mining algorithm named FD-Mine that efficiently utilizes the nodes to discover frequent patterns in cloud computing environments while preserving data privacy. Through empirical evaluations under various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.

Keywords: association rule mining; frequent pattern mining; privacy preserved

I Introduction
With the progress of information technology, data mining techniques have been extensively applied to many applications in various domains. The goal of data mining is to discover hidden useful information from large databases. The discovered information can help decision processes, aid commercial promotion, and so forth. Data mining includes four main topics: association rule mining [2], sequential pattern mining [3], clustering [11] and classification [5]. Among the data mining studies, the problem of frequent pattern mining, i.e. association rule mining and sequential pattern mining, is the most discussed due to its wide applications.
The basic concept of the frequent pattern mining problem is to discover the patterns whose frequency of appearance in the database is greater than a specific threshold. An association rule is defined as X=>Y, where X and Y are sets of items. The concept of association rule mining is to discover the sets of items that tend to be associated with one another in the database. The studies on association rule mining can be classified into two types: 1) the generate-and-test (Apriori-like) approach [2] and 2) the frequent pattern growth (FP-growth-like) approach [6]. The Apriori-like methods generate candidate itemsets of size (k+1) from the frequent itemsets of size k and scan the database repetitively to test the frequency of each candidate itemset. Clearly, the Apriori-like methods suffer from the large number of candidate itemsets, especially when the support threshold is small. For this reason, Han et al. [6] proposed a novel data structure, named frequent pattern tree (FP-tree), in which the transactions are compressed and stored. A mining algorithm, namely FP-growth, was also proposed for discovering the frequent patterns from the FP-tree. FP-growth needs only two scans of the physical database and therefore greatly improves the execution time.
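To make the two-scan idea concrete, the following is a minimal Python sketch of FP-tree construction. It is an illustration only, not the implementation evaluated in this paper: the names FPNode and build_fp_tree are hypothetical, the support threshold is assumed to be given as an absolute count, and ties in support are broken alphabetically as an arbitrary choice.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count the support of every item and keep the frequent ones.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_count}

    root = FPNode(None)
    # Scan 2: insert the frequent items of each transaction,
    # ordered by descending support, so that common prefixes are shared.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, frequent
```

The returned dictionary of frequent items plays the role of the header table (without the node links that a full implementation would keep).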
As the size of the database increases, the computation time and the required memory increase severely. Many studies on association rule mining were proposed mainly to improve the efficiency in terms of execution time. In the past decades, parallel and distributed computing (PDC) techniques have attracted extensive attention for their ability to manage and process significant amounts of data. The difficulty of mining large databases has motivated research on parallel and distributed algorithms to solve the problem [7], [8], [10], [13], [14]. The main approach of the existing studies is to divide the database and then to distribute each part of the database to nodes or processors for mining, with the goal of distributing the computation load. During the mining process, the nodes exchange the required transactions with each other. The workload of exchanging data among nodes becomes heavy when the average transaction length is long or the database is large. Although many algorithms have been proposed, the execution efficiency of frequent pattern mining is still a challenge to researchers due to the data explosion. In addition to the exchange workload, data privacy is also a major concern, since this kind of algorithm duplicates the database to every node in the PDC architecture. This approach is evidently not suitable for sensitive domains like health care.
In this paper, we propose a novel data mining method named FD-Mine that efficiently utilizes the cloud nodes to quickly discover frequent patterns in cloud computing environments while preserving data privacy. Through empirical evaluations under various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.
The remainder of this paper is organized as follows. We briefly review related work in Section 2. In Section 3, we propose the architecture and present the data mining algorithm. The empirical evaluation for the performance study is presented in Section 4. The conclusions are given in Section 5.
II Related Work
In order to improve the performance of association rule mining, many researchers tried to distribute the computation across processors/nodes. In [9], the authors proposed a parallel algorithm named Parallel FP-tree (PFP-tree), based on the FP-tree data structure, for mining frequent patterns on message passing multiprocessor systems. The proposed algorithm divides the database into several non-overlapping parts according to the number of available processors, and lets each processor construct its FP-tree by exchanging the necessary information with other processors. Because the exchange is done within the same node, the overhead might not be severe. To parallelize frequent pattern mining, the past studies relied mainly on the database dividing method [4], [15]. The database is divided equally or by some criteria, and each part of the database is sent to a node for mining. The approach that duplicates the database to other nodes risks leaking the data. The data privacy cannot be preserved by this approach.
Note that in cloud computing environments the network latency is an important issue that should be carefully considered. Generally, the targeted database in mining applications is large. Transmitting the database and exchanging large amounts of data over the internet will greatly degrade the performance. In [12], the proposed method, named QFP-growth, divides the database equally and constructs the FP-trees based on the assigned parts of the database. The FP-trees are then merged into a single FP-tree to complete the mining task.
The data transmission overhead was studied in [14]. The authors observed that the time spent exchanging transactions is much greater than the mining time. To exchange transactions among nodes efficiently in the database dividing approach, TPFP-tree was proposed, which uses a transaction identification set (Tidset) to select the transactions directly instead of scanning the physical database. The Tidset is a table recording the IDs of the transactions that contain a certain item, so the memory required by the Tidset is as large as the assigned partial database. Therefore, TPFP-tree is bounded by the size of the targeted database.
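The Tidset idea described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the TPFP-tree implementation of [14]: the function names build_tidset and select_transactions are hypothetical, and transaction IDs are taken to be list positions.

```python
from collections import defaultdict


def build_tidset(transactions):
    """Map each item to the set of IDs of the transactions containing it."""
    tidset = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidset[item].add(tid)
    return tidset


def select_transactions(tidset, transactions, item):
    """Fetch the transactions containing `item` by ID lookup,
    without scanning the whole database."""
    return [transactions[tid] for tid in sorted(tidset.get(item, ()))]
```

Since every (item, transaction) occurrence is recorded once, the table grows with the partial database it indexes, which is the memory bound noted above.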
To balance the computing load of TPFP-tree, the authors of [15] proposed the BTP-tree algorithm, a balanced Tidset-based parallel FP-tree algorithm for mining frequent patterns. The algorithm equally divides the database into p parts, where p is the number of nodes. The partial databases are sent to the nodes individually. Each node establishes the Tidset and header table in accordance with the assigned database. A global header table, named GHT, is derived by gathering the header tables of all the nodes into one table and filtering out the items with support smaller than the threshold. Before executing the mining task, the BTP-tree algorithm calculates a performance index for each node and records the sum of the performance indexes. A mining task is then separated into p sub-tasks, where the load of each sub-task is measured by the number of items in the header table. The task assignment follows the performance indexing. After the task assignment, each node constructs its Tidset for fast selection. The required transactions are exchanged among nodes to generate the new sub-databases by referring to the items of the header tables. Finally, FP-growth is performed on each node to discover the frequent patterns. The frequent patterns are then gathered from all the nodes to obtain the complete set of frequent patterns.
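The derivation of the global header table can be illustrated with a short Python sketch. It assumes, for illustration only, that each node reports its header table as a dictionary from item to local support count and that the threshold is an absolute count; the function name global_header_table is hypothetical and is not taken from [15].

```python
from collections import Counter


def global_header_table(node_header_tables, min_count):
    """Sum the per-node supports, then drop items below the threshold."""
    total = Counter()
    for header_table in node_header_tables:
        total.update(header_table)  # adds the local counts item by item
    return {item: count for item, count in total.items() if count >= min_count}
```

Note that an item may be infrequent on every single node and still be globally frequent, which is why the filtering is applied only after the tables are gathered.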
III Proposed Algorithm: FD-Mine
In this section, we describe the proposed algorithm, which efficiently distributes the computation in cloud computing environments. The cloud architecture for mining frequent patterns is introduced in Section 3.1. In Section 3.2, we formulate the problem. The details of the proposed algorithm are described in Section 3.3.
3.1 Proposed Cloud Architecture for Frequent Pattern Mining
Note that in cloud computing environments data privacy is an important issue. Since the clouds are physically distributed and each cloud node provides only its computation ability, the trustworthiness of the nodes cannot be guaranteed. Therefore, in order to preserve the data privacy, only a node that is safe, rather than every node, can access the database. In our architecture, we name this node the trusted node or kernel node, and the cloud in which the node is located the kernel cloud. Considering the efficiency of data transmission among clouds, each cloud is designed to have only one node that connects to other clouds, named the connection-node and abbreviated as conn-node. If a node N needs data from the trusted node, N asks the conn-node of its cloud whether the conn-node has the data. If the conn-node has the data, N can download the data from the conn-node via intranet. Otherwise, the data is first duplicated to the conn-node via internet, and then N can download the data from the conn-node via intranet. By using this transmission policy, the network latency can be minimized.
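The transmission policy above amounts to a per-cloud cache at the conn-node. The following Python sketch illustrates it under simplifying assumptions: network transfers are modeled as plain function calls, and the class and function names (TrustedNode, ConnNode, fetch) are hypothetical and are not part of the described system.

```python
class TrustedNode:
    """Holds the data and counts how often it is sent over the internet."""

    def __init__(self, data):
        self.data = data
        self.internet_transfers = 0

    def send(self, key):
        self.internet_transfers += 1
        return self.data[key]


class ConnNode:
    """The single node of a cloud that talks to other clouds."""

    def __init__(self):
        self.cache = {}


def fetch(key, conn_node, trusted_node):
    """What a computing node does when it needs `key`."""
    if key not in conn_node.cache:
        # Cache miss: one internet transfer from the trusted node.
        conn_node.cache[key] = trusted_node.send(key)
    # Cache hit or freshly filled cache: download via intranet.
    return conn_node.cache[key]
```

With this policy, the number of internet transfers for a given piece of data is bounded by the number of clouds rather than the number of nodes.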
Figure 1. Proposed architecture for frequent pattern mining. Legend: physical machine; trusted node (virtual machine); connection node (virtual machine); computing node (virtual machine).
In this architecture, each conn-node maintains a table that records the status of the nodes of its cloud. The recorded information for each node contains the node's ID and its availability. All of the tables are then gathered in the kernel node, so that the kernel node has complete information about the computation ability in terms of available nodes. The information is updated periodically.
3.2 Mining frequent patterns in cloud computing
environments
One of the characteristics of the proposed algorithm is that data privacy is preserved. Unlike the parallel Apriori-like algorithms [4], which need to duplicate the database to remote nodes, or the BTP-tree algorithm [15], which distributes parts of the database directly to cloud nodes, only the kernel node is permitted to access the database in our architecture and algorithms. In addition to the data leaking problem of the conventional algorithms, the time required for duplicating the physical database is considerable.
The data structure used by the proposed algorithm is based on that of FP-growth. The FP-tree is a data structure that stores the frequent items in compressed form. Because the items with support smaller than the support threshold are filtered out and the filtered transactions are merged into the FP-tree, reconstructing the complete transaction of any user from the FP-tree is impossible. Moreover, because the FP-tree is often implemented as a linked list and our algorithm also compresses the FP-tree with ZIP to reduce the transmission time, the transactions cannot be reconstructed. The data privacy can thus be preserved.
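The compression step before transmission can be sketched in Python as follows. This is only an illustration: the paper does not specify the serialization format, so the sketch assumes the tree is represented as nested dictionaries serialized to JSON, and it uses zlib (the DEFLATE method commonly used inside ZIP archives) as a stand-in for the ZIP step. The names pack_tree and unpack_tree are hypothetical.

```python
import json
import zlib


def pack_tree(tree):
    """Serialize a nested-dict tree and compress it for transmission."""
    raw = json.dumps(tree, sort_keys=True).encode("utf-8")
    return zlib.compress(raw)


def unpack_tree(blob):
    """Decompress and deserialize a tree on the receiving node."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

For example, a tree such as `{"b": {"count": 4, "children": {"a": {"count": 2, "children": {}}}}}` survives the round trip unchanged. The size reduction depends on how repetitive the serialized tree is; very small inputs may not shrink at all.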
3.3 FD-Mine algorithm
The purpose of FD-Mine is fast mining. In cloud computing environments, the distribution of the mining computation is accompanied by data transmission over the network. In BTP-tree [15], the database is divided equally into several parts and sent to the available nodes. Then the nodes request the required data from each other to finish the mining task. In practice, the database is often large. This approach therefore not only leaks the data but also incurs a lot of data transmission, so the performance of this kind of approach is expected to be poor.
An intuitive way to save time is to minimize the amount of data transmission. Our proposed FD-Mine is designed to transmit as little data as possible in order to save the time spent on network latency and disk I/O. The algorithm is presented in Figure 2.
We describe the details of FD-Mine below. The trusted node TN follows the FP-tree construction algorithm to scan the database twice and constructs the corresponding FP-tree, which is stored in TN (line 1). The next step is to obtain the header table HT (line 2) and to divide HT into |N| disjoint sets, stored in IS (line 3). Since the frequent patterns are not predictable, HT is divided randomly with the goal of balancing the load of each node. Considering the execution efficiency, the most important issue is that the amount of data transmission should be minimized. To minimize the amount of data transmission, the FP-tree constructed on TN is duplicated to each idle node. In cloud computing environments, we also consider the problem of network latency. Since the internet latency is larger than the intranet latency, the FP-tree duplication should be done over the intranet.
Algorithm FD-Mine
Input: A transaction database DB, a minimum support threshold, the trusted node TN, and a set of nodes N with cloud architecture C
Output: The complete set of frequent patterns, FP
1      // TN reads the DB and constructs the corresponding FP-tree FPT
2      // Obtain the header table HT of FPT
3      IS ← divideHT(|N|)      // Randomly divide the items of HT into |N| disjoint sets
5      n ← selectNode(N, i)    // Select the ith node
6      // Select the conn-node cn of n
7-9    // Duplicate FPT from TN if cn does not have FPT
10     // Duplicate FPT from the conn-node of n
11     is_i ← getSet(IS, i)    // Obtain the ith set of IS
12-13  // Batch-run FP-growth for each conditional item in is_i to mine the frequent patterns
Figure 2. FD-Mine algorithm.
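The random division of the header table in line 3 can be sketched in Python as follows. This is a minimal illustration and not the authors' divideHT: the function name divide_header_table and the seed parameter are assumptions added here, the latter only to make the example reproducible.

```python
import random


def divide_header_table(items, n_nodes, seed=None):
    """Randomly split the header-table items into n_nodes disjoint sets
    whose sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Deal the shuffled items out round-robin, one slice per node.
    return [shuffled[i::n_nodes] for i in range(n_nodes)]
```

Shuffling before dealing the items out keeps the set sizes balanced while avoiding any systematic assignment of the most frequent items to the same node.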
Figure 3. The execution time for FD-Mine and BTP-tree with the number of nodes varied on dataset T20.I5.N100K.D100K (x-axis: number of nodes).
The execution time of BTP-tree decreases as the number of nodes increases. It is observed that the execution time of FD-Mine is almost the same as that of BTP-tree when only one node is available. This is expected, because both of them perform FP-growth on a single node. The execution time of FD-Mine is slightly more than that of BTP-tree when the number of nodes is 2 or 3. This is because the time for compression and decompression is more than the time to directly transmit the divided parts of the database. When there are more than 3 nodes, FD-Mine exhibits the advantage of sending after compression: less time is required to complete the whole mining task.
Figure 4 shows the impact on execution time when the average transaction length is increased to 40. It is found that FD-Mine delivers better performance than BTP-tree when the number of nodes is greater than 2. The reason is that BTP-tree, as a database dividing approach, needs to exchange transactions among nodes, and its performance suffers from the large number of exchanged transactions.
Figure 5 shows the performance of FD-Mine and BTP-tree with the number of transactions set to 200K. In this experiment, FD-Mine outperforms BTP-tree when the number of nodes is greater than 2, which demonstrates the intrinsic drawback of the database dividing approach. In this series of experiments, it is observed that FD-Mine not only preserves the data privacy but also delivers better performance than BTP-tree in terms of execution time, especially when the database is large.
5.2 Effects of varying the parameters of dataset
In this section, we study the effects of varying the support threshold and the parameters of the data generator, namely the number of transactions and the average transaction length. Two algorithms, FD-Mine and BTP-tree, are compared in the experiment.
IV Experimental Results
To evaluate the performance of the proposed algorithm, we use IBM's Quest Synthetic Data Generator [1] to generate the workload data for mining. The experiments were conducted on a cloud system with three clouds. The first cloud contains four nodes, including the kernel node, each equipped with an E8400 2.4GHz CPU, 1GB of available RAM and 320GB of disk storage. The second and third clouds contain four and three nodes respectively, each equipped with a P8600 2.4GHz CPU, 1GB of available RAM and 160GB of disk storage. Note that the kernel node is responsible for receiving the requests and is not used for mining. Therefore, ten nodes in total can be used for mining in the system. To verify the performance, since there are very few parallel and privacy-preserving algorithms for frequent pattern mining, we select BTP-tree for comparison, which is one of the most efficient algorithms that can parallelize the mining task on grid systems. Both FD-Mine and BTP-tree were implemented in Java, and the message passing among nodes and remote [...] technology. Since most of the existing parallel algorithms follow the database dividing approach, we select the most efficient one, BTP-tree, for performance comparison.
5.1 Effects of varying the number of cloud nodes
In the following experiments, we investigate the performance of FD-Mine in terms of execution time by varying the number of cloud nodes from 1 to 10. The results on dataset T20.I5.N100K.D100K are described. The support threshold is set to 0.03%, which is a very small value, in order to verify the performance of both algorithms, FD-Mine and BTP-tree. Figure 3 shows the results.
For this reason, the FP-tree duplication is processed as follows. First, the algorithm selects an idle node n (line 5), and selects the connection node cn of n from the cloud architecture C (line 6). If cn has no duplicated FP-tree, TN duplicates one to cn (line 7 to line 9). Note that in order to minimize the transmission overhead, the FP-tree should be compressed in advance. Afterwards, node n can obtain the compressed FP-tree via intranet and decompress it (line 10). After receiving the FP-tree, node n is assigned a subset of IS (line 11), and batch-runs FP-growth for each conditional item in the subset to mine the frequent patterns (line 12 to line 13). Consequently, each node needs only one data transmission, i.e. the FP-tree duplication, and the transmission takes place over the intranet to minimize the network latency. After all of the |N| disjoint sets are processed, the frequent patterns are returned (line 15).
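The reason the header table can be split among nodes without losing patterns is that every frequent pattern is found exactly once, under its lowest-ranked item in the header-table order. The Python sketch below illustrates this property under stated assumptions: it replaces FP-growth with a brute-force enumeration (exponential, suitable only for toy inputs), runs the "nodes" sequentially in one process, and uses hypothetical names (mine_for_items, fd_mine_sketch) that do not come from the paper.

```python
from itertools import combinations


def mine_for_items(transactions, order, subset, min_count):
    """Frequent itemsets whose lowest-ranked item (per `order`) is in `subset`.

    Brute-force stand-in for running FP-growth on each conditional item.
    `order` lists the frequent items by descending support.
    """
    rank = {item: r for r, item in enumerate(order)}
    tsets = [set(t) for t in transactions]
    found = {}
    for item in subset:
        prefix = order[:rank[item]]  # items ranked above `item`
        for k in range(len(prefix) + 1):
            for combo in combinations(prefix, k):
                pattern = frozenset(combo) | {item}
                count = sum(1 for t in tsets if pattern <= t)
                if count >= min_count:
                    found[pattern] = count
    return found


def fd_mine_sketch(transactions, order, item_sets, min_count):
    """Union of the patterns mined for each node's subset of items."""
    result = {}
    for subset in item_sets:  # one subset per node
        result.update(mine_for_items(transactions, order, subset, min_count))
    return result
```

However the items are divided, the union of the per-node results is the same complete set of frequent patterns; only the load per node changes.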
Figure 4. The execution time for FD-Mine and BTP-tree with the number of nodes varied on dataset T40.I5.N100K.D100K (x-axis: number of nodes).
Figure 5. The execution time for FD-Mine and BTP-tree with the number of nodes varied on dataset T20.I5.N100K.D200K (x-axis: number of nodes).
Figure 6. The execution time for FD-Mine and BTP-tree with the support threshold varied on dataset T20.I5.N100K.D100K (x-axis: support threshold (%)).
In Figure 6, we explore the impact on execution time of varying the support threshold from 0.05% to 0.01% with ten cloud nodes. It can be seen that FD-Mine always requires less time than BTP-tree. The efficiency of FD-Mine in execution time is mainly achieved by reducing the transmission overhead and the disk I/O time. In the experiment, the time required by FD-Mine is only about 82% of the execution time of BTP-tree on average.
V Conclusions
In this paper, we have presented an efficient algorithm named FD-Mine that efficiently utilizes the cloud nodes to discover frequent patterns in cloud computing environments while preserving data privacy. The proposed FD-Mine is composed of two algorithms, namely HD-Mine and FD-Mine. Conventional algorithms for mining datasets with a large number of frequent patterns are limited by the available memory. The proposed HD-Mine is able to discover the frequent patterns from this kind of dataset by merging the memory of several nodes. The proposed FD-Mine focuses on the fast discovery of frequent patterns by utilizing the cloud nodes, and is useful for applications that emphasize real-time mining. Another important characteristic of the proposed algorithms is that the data privacy is preserved. Unlike the parallel Apriori-like algorithms, which need to duplicate the database to remote nodes, or the BTP-tree algorithm, which distributes parts of the database directly to cloud nodes, the database is never duplicated and only the kernel node is permitted to access the database in our architecture and algorithms. Through empirical evaluations under various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.
Acknowledgement
This research was partially supported by National [...] No. 97-2218-E-151-003-MY2.
References
[1] R. Agrawal and R. Srikant, Quest Synthetic Data Generator, IBM Almaden Research Center, San Jose, California, http://www.almaden.ibm.com/cs/quest/syndata.html.
[2] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proc. ACM SIGMOD Int'l Conf. Management of Data, 1993.
[3] R. Agrawal, R. Srikant, Mining Sequential Patterns, in: Proc. of the 11th Int'l Conf. on Data Engineering, 1995, pp. 3-14.
[4] R. Agrawal, J. C. Shafer, "Parallel Mining of Association Rules," IEEE Transactions on Knowledge and Data Engineering, December 1996.
[5] R. J. Bayardo, Jr., Brute-force mining of high-confidence classification rules, in: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD'97), Newport Beach, California, USA.
[6] J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns Without Candidate Generation, Proc. of ACM Int. Conf. on Management of Data (SIGMOD), pp. 1-12, 2000.
[7] J. D. Holt, S. M. Chung, "Parallel mining of association rules from text databases on a cluster of workstations," Proceedings of 18th International Symposium on Parallel and Distributed Processing, 2004, pp. 86.
[8] I. Pramudiono and M. Kitsuregawa, "Shared Nothing Parallel Execution of FP-growth," DBSJ Letters, Volume 2, No. 1, 2003, pp. 43-46.
[9] A. Javed, A. Khokhar, "Frequent Pattern Mining on Message Passing Multiprocessor Systems," Distributed and Parallel Databases, Volume 16, Issue 3, 2004, pp. 321-334.
[10] T. Li, S. Zhu, M. Ogihara, "A New Distributed Data Mining Model Based on Similarity," Symposium on Applied Computing, 2003, pp. 432-436.
[11] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 226-231.
[12] Y. Qiu, Y. J. Lan and Q. S. Xie, "An improved algorithm of mining from FP-tree," Proceedings of the Third International Conference on Machine Learning and Cybernetics, pp. 26-29, 2004.
[13] E.-H. S. Han, G. Karypis, and V. Kumar, Scalable parallel data mining for association rules, IEEE Transactions on Knowledge and Data Engineering, 12(3):352-377, 2000.
[14] J. Zhou, K.-M. Yu, "Tidset-based Parallel FP-tree Algorithm for the Frequent Pattern Mining Problem on PC Clusters," Lecture Notes in Computer Science 5036, 2008, pp. 18-28.
[15] J. Zhou, K.-M. Yu, Balanced Tidset-based Parallel FP-tree Algorithm for the Frequent Pattern Mining on Grid System, Fourth International Conference on Semantics, Knowledge and Grid, 2008.