International Journal of Advanced Computer Science, Vol. 1, No. 4, pp. 142-145, Oct. 2011.
Manuscript received: 9 Sep. 2011; revised: 11 Oct. 2011; accepted: 25 Oct. 2011; published: 30 Oct. 2011.
Keywords: data mining, association rule mining, frequent closed itemset.
Abstract: Efficient frequent itemset mining is the most important problem in association rule mining. To date, a number of algorithms have been developed in the field, but the large number of frequent itemsets in a database leads to redundant rules. To overcome this problem, recent studies focus on frequent closed itemset mining, since the set of frequent closed itemsets is significantly smaller than the set of all frequent itemsets yet has the same expressive power. In this paper, a new algorithm called FC-Close is introduced for frequent closed itemset mining. The algorithm employs a pruning technique to improve the efficiency of mining. Test results show that FC-Close is more efficient than the existing FP-close algorithm.
1 Introduction
Association Rule Mining (ARM) is one of the most important data mining techniques. ARM aims at extracting hidden relations and interesting associations between the items in a transactional database. It is highly useful in market basket analysis for stores and business centers. For example, mining the database of a department store's customers may reveal that customers who buy milk also buy butter on 60% of occasions, and that this rule is observed in 80% of transactions. In this example, the former percentage is called the confidence, and the percentage of transactions which cover the rule is termed the support. To find such rules, the user sets minimum values for support and confidence, called the minimum support (min-sup) and the minimum confidence (min-conf), respectively [1].
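The support and confidence of a rule as defined above can be sketched as follows (a minimal illustration over a hypothetical toy transaction list, not data from the paper):

```python
# Toy transaction list; each transaction is a set of purchased items.
transactions = [
    {"milk", "butter"}, {"milk", "butter"}, {"milk", "bread"},
    {"butter"}, {"milk", "butter"},
]

def rule_metrics(antecedent, consequent, db):
    """support = fraction of transactions containing antecedent ∪ consequent;
    confidence = that count divided by the antecedent's count."""
    both = sum(1 for t in db if antecedent | consequent <= t)
    ante = sum(1 for t in db if antecedent <= t)
    return both / len(db), both / ante

sup, conf = rule_metrics({"milk"}, {"butter"}, transactions)
# milk and butter co-occur in 3 of 5 transactions, milk appears in 4,
# so support = 0.6 and confidence = 0.75
```

A rule is reported only when both values clear the user-supplied min-sup and min-conf thresholds.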
The main step in association rule mining is mining the frequent itemsets. In effect, with the frequent itemsets in hand, generating the association rules is straightforward. Frequent itemset mining often produces a very large number of frequent itemsets and rules, which reduces the efficiency and power of mining. To overcome this problem, condensed representations of the frequent itemsets have been used in recent years [3, 7]. A popular condensed representation is the set of frequent closed itemsets. Compared with the full set of frequent itemsets, the frequent closed itemsets form a much smaller set with the same power.

This work was supported by Islamic Azad University, Sarvestan Branch, Shiraz, Iran. Maryam Shekofteh is with the Department of Computer Engineering, Sarvestan Branch, Shiraz, Iran (Shekofteh-m@iau.sarv.ac.ir).
In addition, this representation decreases redundant rules and increases mining efficiency. Many algorithms have been presented for mining frequent closed itemsets.

In this paper, a new algorithm called FC-Close is introduced for frequent closed itemset mining. This algorithm, which is developed based on FP-growth [5], employs a pruning technique to improve its efficiency. The rest of the paper is structured as follows: Section 2 introduces frequent closed itemset mining and related concepts. Section 3 sketches out the concepts and structures underlying the new algorithm. Section 4 describes our developed algorithm. The evaluation of the findings is presented in Section 5, and Section 6 is devoted to the conclusion.
2 Problem Definition
Let D be a transactional database. A transactional database consists of a set of transactions. Each transaction t is represented by <TID, x>, in which x is a set of items and TID is the unique identifier of the transaction. Further, let I = {i1, i2, ..., in} be the complete set of distinct items in D. Each non-empty subset y of I is termed an itemset, and if it includes k items it is called a k-itemset. The number of transactions in D that include itemset y is called the support of y, denoted sup(y); it is usually expressed as a percentage. Given a minimum support min-sup, an itemset y is a frequent itemset if sup(y) ≥ min-sup.

Definition 1 (Closed Itemset): An itemset y is a closed itemset if there is no proper superset y′ of y such that sup(y) = sup(y′).
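These definitions can be made concrete with a brute-force sketch (a hypothetical toy database, not the paper's sample; real miners such as FP-close avoid this exhaustive enumeration):

```python
from itertools import combinations

# A toy transactional database: each transaction is a set of items.
DB = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "b", "c", "e"},
    {"b", "e"},
]
MIN_SUP = 2  # absolute minimum support

def support(itemset, db):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

# Enumerate all frequent itemsets by brute force.
items = sorted(set().union(*DB))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo), DB)
        if s >= MIN_SUP:
            frequent[frozenset(combo)] = s

# An itemset y is closed if no proper superset has the same support.
closed = {
    y: s for y, s in frequent.items()
    if not any(y < z and frequent[z] == s for z in frequent)
}
```

On this database, {a} is frequent but not closed, because its superset {a, c} occurs in exactly the same three transactions; the closed set retains {a, c} and discards {a}, which is why the closed representation is smaller without losing support information.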
3 Related Literature
A FP-tree and the FP-growth Method
As FC-Close is an extended version of FP-growth, an introduction to FP-growth and the FP-tree structure is needed. FP-growth uses a structure called the FP-tree. The FP-tree is a dense data structure that stores all the information on the frequent itemsets of a database that is needed for mining. Each branch of the FP-tree represents one frequent itemset, with the nodes along the branch ordered by descending item count.
An Efficient Algorithm for Association Rule Mining
Maryam Shekofteh
International Journal Publishers Group (IJPG)©
Each node in the FP-tree has three fields: item-name, count, and node-link. The item-name holds the name of the item the node represents. The count shows the number of transactions along the path covered up to the node, and the node-link points to the next node in the FP-tree with the same item name; if there is no such node, the node-link is null. Likewise, the FP-tree has its own header table. The single items of the database are stored in this table in descending order of support. Each entry of the header table includes two fields: the item name and the head of the node-link, which refers to the first node in the FP-tree carrying that item name.
Compared with Apriori [1] and its variants, which require many passes over the database, FP-growth needs just two passes to mine all frequent itemsets. In the first pass, the support of each item is computed and the frequent single items (frequent itemsets of length 1) are sorted in descending order of support. In the second pass, an FP-tree that includes all the frequency information of the database is created. In other words, each transaction of the ordered database is read and added to the FP-tree structure one at a time. To add a new transaction to the FP-tree, if the transaction shares a prefix with previously added transactions, no new nodes are created for the items inside the prefix; only their counts are increased. Therefore, mining the database reduces to mining the FP-tree. Figure 1 shows the FP-tree built in the second pass over the database of Table 1 with a minimum support of 20%. When adding item i to itemset y, where y ∪ {i} is called z, the path from the parent of the node for i to the root of the FP-tree related to y is called the prefix path of z.
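The two-pass construction described above can be sketched as follows (a minimal illustration; the class and field names are hypothetical, chosen to mirror the item-name, count, and node-link fields described in the text):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}  # item -> child FPNode

def build_fptree(db, min_sup):
    """Two passes: count item supports, then insert support-ordered transactions."""
    # Pass 1: compute the support of each single item.
    sup = defaultdict(int)
    for t in db:
        for i in t:
            sup[i] += 1
    freq = {i for i, s in sup.items() if s >= min_sup}
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> its node-link chain, as a list
    # Pass 2: insert each transaction with frequent items in descending support order.
    for t in db:
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-sup[i], i))
        node = root
        for i in ordered:
            if i not in node.children:
                child = FPNode(i, node)
                node.children[i] = child
                header[i].append(child)
            node = node.children[i]
            node.count += 1  # shared prefixes only bump counts, no new nodes
    return root, header

root, header = build_fptree([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"d"}], 2)
```

Here the three transactions containing "a" share a single root child whose count is 3, illustrating how shared prefixes compress the database.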
Let us review further details of FP-growth. Once the FP-tree is created, the frequent patterns are mined from it with the FP-growth algorithm. FP-growth works by recursive projection: the database is repeatedly restricted to the transactions containing a given itemset y, and the database restricted to y is named the conditional pattern base of y, denoted T_y. To create the conditional pattern base of an itemset, all of its prefix paths (beginning from the root) are collected. After creating the conditional pattern base of an itemset, its conditional FP-tree is built by following the same steps used to build the initial FP-tree; in this stage, however, instead of the whole database, the conditional pattern base of that itemset is used. Therefore, the total support of each item over all prefix paths of the conditional pattern base is computed, and if it is higher than the threshold, the item is added to the header table and the conditional FP-tree. The mining procedure on the conditional FP-tree is conducted recursively until the tree is empty or consists of a single branch, from which all remaining frequent patterns are extracted directly.
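The prefix-path collection that builds a conditional pattern base can be sketched as follows (a hand-built tree fragment with hypothetical node and variable names, standing in for a full FP-tree):

```python
class Node:
    def __init__(self, item, count, parent):
        self.item, self.count, self.parent = item, count, parent

# Hand-built FP-tree fragment: root -> a(3) -> b(2) -> c(1), and a(3) -> c(1).
root = Node(None, 0, None)
a = Node("a", 3, root)
b = Node("b", 2, a)
c1 = Node("c", 1, b)
c2 = Node("c", 1, a)
node_links = {"a": [a], "b": [b], "c": [c1, c2]}  # header-table node-links

def conditional_pattern_base(item):
    """For each node holding `item`, walk the parent pointers up to the root
    and record the prefix path together with that node's count."""
    base = []
    for node in node_links[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base
```

Conditioning on "c" yields the prefix paths a-b (count 1) and a (count 1); a conditional FP-tree for "c" would then be built from exactly these weighted paths.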
B CFI-tree
In the FP-close algorithm, the CFI-tree is introduced as a special structure for storing closed frequent itemsets. A CFI-tree is like an FP-tree: it includes a root node labeled root. Each node of the tree has four fields: item-name, count, node level, and node-link. All nodes with the same item name are connected; the node-link refers to the next node with the same item name. A header table is created for the items in the CFI-tree, where the order of items in the table is the same as the order of items in the FP-tree built on the first database pass. Each entry of the header table includes two fields: the item name and the head of the node-link, which links to the first node with that item name in the CFI-tree. The level field is used in the subset test. The count field is required when comparing an itemset y against a stored itemset z: the tree is tested until it is confirmed that there is no z with y ⊆ z such that y and z have the same count.
TABLE 1.
A SAMPLE DATABASE

tid   items
1     a, b, c, e, f, o
2     a, c, g
3     e, i
4     a, c, d, e, g
5     a, c, e, g, l
6     e, j
7     a, b, c, e, f, p
8     a, c, d
9     a, c, e, g, m
10    a, c, e, g, n
Fig 1 Structure of FP-tree
The insertion of a frequent itemset into the CFI-tree is similar to the insertion of a transaction into the FP-tree. However, when adding an itemset whose prefix matches already-inserted itemsets, the counts of the prefix nodes are not accumulated; instead, each count is updated to the maximum of the counts. In effect, FP-close puts a newly discovered itemset into the CFI-tree unless that itemset is a subset of an itemset already in the tree with the same count of occurrences. Figure 2 displays the CFI-tree of the database in Table 1 with a minimum support of 20%. In this figure, a node (x, c, l) indicates item x with count c and level l. More details on the CFI-tree are available in [4].
Fig 2 Structure of CFI-tree
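The closedness test that the count field supports can be sketched as follows (a simplified subset-with-equal-support check over a flat list of already-found itemsets, standing in for the actual CFI-tree traversal that the level and count fields accelerate):

```python
def is_subsumed(y, sup_y, found):
    """y need not be stored if some already-found closed itemset z is a
    superset of y with the same support (FP-close's closedness test)."""
    return any(y <= z and sup_y == s for z, s in found)

# Hypothetical closed itemsets discovered so far, with their supports.
found = [(frozenset("ac"), 3), (frozenset("abc"), 2)]

assert is_subsumed(frozenset("a"), 3, found)       # a ⊆ ac with equal support
assert not is_subsumed(frozenset("a"), 4, found)   # support differs: keep it
```

The CFI-tree makes this check cheap by only comparing y against stored itemsets that share items and counts, rather than scanning every found itemset as this flat sketch does.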
4 FC-Close Algorithm
The FC-Close algorithm employs the same FP-tree and CFI-tree structures for mining frequent closed itemsets. Here, however, the search space is reduced effectively thanks to an optimization technique called pruning. The smaller search space and number of trees imply less running time compared with similar algorithms such as FP-close. Let us review the pruning technique.
A Optimal Pruning Technique in the FC-Close Algorithm
Suppose y is an itemset. The first optimization compares the support of y with that of y ∪ {i}, where i is an item of the conditional pattern base of y. If sup(y) and sup(y ∪ {i}) are equal, then every transaction including y also includes i. This guarantees that every frequent itemset z that includes y but does not include i has a frequent superset z ∪ {i}, and these two sets have the same support. By the definition of closed frequent itemsets, counting the itemsets that include y but not i is therefore unnecessary: we can move i into y and delete item i from the conditional pattern base of y.

The second optimization compares the itemsets y ∪ {i} and z ∪ {i}, where z consists of the items of y except an item j already added to y. If the supports of y ∪ {i} and z ∪ {i} are equal, it is guaranteed that every frequent itemset containing z ∪ {i} has the frequent superset y ∪ {i}, and these two sets have the same count (the same support). According to the definition of closed frequent itemsets, the search of the branch for itemset z ∪ {i} can therefore be avoided. Figure 3 shows the pseudo-code of the function that performs the pruning technique:
Pruning(current itemset: y)
{
  For each item i in y's conditional pattern base
  {
    Newitem = y ∪ {i}
    If (support(Newitem) == support(y))
    {
      Move i to y
      Remove i from y's conditional pattern base
    }
    If (support(Newitem) == support(z ∪ {i}))
      Stop searching the branch containing the itemset z ∪ {i}
  }
}
Fig 3 Pseudo-Code of Pruning Function
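The first pruning rule of Figure 3 can be sketched in executable form as follows (the conditional pattern base is represented as hypothetical (prefix-path, count) pairs; this is a simplified sketch, not the paper's implementation):

```python
def prune(y, cond_base):
    """First pruning rule: if support(y ∪ {i}) == support(y), every
    transaction holding y also holds i, so i is merged into y and
    dropped from the conditional pattern base."""
    sup_y = sum(c for _, c in cond_base)  # every prefix path carries y
    for i in {i for path, _ in cond_base for i in path}:
        sup_yi = sum(c for path, c in cond_base if i in path)
        if sup_yi == sup_y:
            y = y | {i}  # move i into y
            cond_base = [([j for j in path if j != i], c)
                         for path, c in cond_base]
    return y, cond_base
```

For example, with y = {c} and prefix paths a-b (count 1) and a (count 1), item a appears in every path, so it is merged into y and removed from the base, shrinking the recursion that follows.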
B Mining Frequent Closed Itemsets using FC-Close
In this article, a new algorithm called FC-Close is introduced that uses the pruning technique. It is developed from the FP-growth method. Like FP-growth, FC-Close is recursive. In the first call, an FP-tree is built during the first pass over the database. A linked list y holds the items that make up the conditional pattern of the current call. If the FP-tree consists of just a single path, each frequent itemset x generated from this path is combined with the frequent itemset y. It is then checked whether this itemset is a closed frequent itemset: in the second line, if the itemset y ∪ x is a closed frequent itemset, y ∪ x is put into the CFI-tree. If the FP-tree is not a single-path tree, then for each item in the header table, the item is added to y. The if-closed function is then called to analyze whether the itemset y is a closed frequent itemset; if so, y is put into the CFI-tree. In the next line, the FP-tree of y is built and the pruning function is called. Then FC-Close is called recursively. Figure 4 displays the FC-Close pseudo-code:
FC-Close(T)
Input: T: an FP-tree
Global:
  y: a linked list of items
  CFI-tree: the CFI-tree
Output: the CFI-tree, which contains the closed frequent itemsets
Method:
(1)  If T only contains a single path p {
(2)    Generate all frequent itemsets from p
(3)    For each x among the frequent itemsets
(4)      If not if-closed(y ∪ x) {
(5)        Insert y ∪ x into CFI-tree } }
(6)  Else for each i in the header table of T {
(7)    Append i to y
(8)    If not if-closed(y)
(9)    {
(10)     Insert y into CFI-tree }
(11)   Construct y's FP-tree T_y
(12)   Pruning(y)
(13)   FC-Close(T_y)
(14)   Remove i from y }
Fig 4 FC-Close Algorithm Pseudo-Code
5 Results
In this section, our developed algorithm, i.e., FC-Close, is compared with the existing FP-close algorithm. The experiments were run on a computer with a 3 GHz Pentium processor, 1 GB of RAM, and a 200 GB disk, running Windows XP 2003. All code is implemented in C++. The algorithms are tested on two databases: the dense database Chess and the sparse database T40.I10.D100K. The specifications of these databases are shown in Table 2. Both databases are taken from [11].
TABLE 2.
SPECIFICATIONS OF THE TESTED DATABASES

Dataset          #transactions   Avg. transaction size
Chess            3,196           37
T40.I10.D100K    100,000         40
Figures 5 and 6 contrast the execution time of our developed algorithm, FC-Close, with that of the existing FP-close algorithm on the Chess and T40.I10.D100K databases.
Fig 5 Comparison of execution time on Chess database
Fig 6 Comparison of execution time on T40.I10.D100K database
As shown in Figures 5 and 6, FC-Close is more efficient than FP-close.
6 Conclusion
In this article, FC-Close is introduced as an effective algorithm for mining closed frequent itemsets. The algorithm decreases the search space and the FP-tree size using a pruning technique. The experiments show that FC-Close is more efficient than the FP-close algorithm.
References
[1] R. Agrawal & R. Srikant, "Fast algorithms for mining association rules," Proc. VLDB, Santiago de Chile, 1994.
[2] C.-C. Chang & C.-Y. Lin, "Perfect hashing schemes for mining association rules," Oxford University Press on behalf of the British Computer Society, vol. 48, no. 2, 2005.
[3] B. Goethals, "Survey on frequent pattern mining," Department of Computer Science, University of Helsinki, 2004.
[4] G. Grahne & J. Zhu, "Efficiently using prefix-trees in mining frequent itemsets," IEEE ICDM Workshop on Frequent Itemset Mining Implementations, 2003.
[5] J. Han, J. Pei, Y. Yin, & R. Mao, "Mining frequent patterns without candidate generation," Data Mining and Knowledge Discovery, 2003.
[6] J.S. Park, M.-S. Chen, & P.S. Yu, "An effective hash based algorithm for mining association rules," ACM SIGMOD Int'l Conf. on Management of Data, vol. 24, pp. 175-186, 1995.
[7] N. Pasquier, Y. Bastide, R. Taouil, & L. Lakhal, "Discovering frequent closed itemsets for association rules," Proc. Int'l Conf. Database Theory, pp. 398-416, 1999.
[8] J. Pei, J. Han, & R. Mao, "CLOSET: An efficient algorithm for mining frequent closed itemsets," ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 21-30, 2000.
[9] J. Wang, J. Han, & J. Pei, "CLOSET+: Searching for the best strategies for mining frequent closed itemsets," Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 236-245, 2003.
[10] M.J. Zaki & C. Hsiao, "CHARM: An efficient algorithm for closed itemset mining," Proc. SIAM Int'l Conf. Data Mining, pp. 457-473, 2002.
[11] http://fimi.cs.helsinki.fi, 2003.
[12] http://www.cs.bme.hu/~bodon, 2005.