An Efficient Algorithm for Association Rule Mining

Maryam Shekofteh
Department of Computer Engineering, Sarvestan Branch, Shiraz, Iran
Shekofteh-m@iau.sarv.ac.ir

This work was supported by Islamic Azad University, Sarvestan Branch, Shiraz, Iran.

International Journal of Advanced Computer Science, Vol. 1, No. 4, pp. 142-145, Oct. 2011.

Manuscript received: 9 Sep. 2011; revised: 11 Oct. 2011; accepted: 25 Oct. 2011; published: 30 Oct. 2011.

Keywords: data mining, association rule mining, frequent closed itemset.

Abstract — Efficient frequent itemset mining is the most important problem in association rule mining, and to date a number of algorithms have been developed in the field. However, the high number of frequent itemsets in a database results in redundancy in the rules. To overcome this problem, recent studies deal with frequent closed itemset mining, since the set of frequent closed itemsets is significantly smaller than the set of all frequent itemsets while having similar strength. In this paper a new algorithm called FC-Close is introduced for frequent closed itemset mining. The algorithm employs a pruning technique to improve the efficiency of mining. The results of the tests show that FC-Close is more efficient than the existing FP-close algorithm.

1. Introduction

Association Rule Mining (ARM) is one of the most important data mining techniques. ARM aims at extracting hidden relations and interesting associations between the items in a transactional database. It is highly useful in market basket analysis for stores and business centers. For example, mining the database of a department store's customers may reveal that those who buy milk also buy butter on 60% of occasions, and that this pattern is observed in 80% of transactions. In this example, the first percentage is called the confidence percentage, and the percentage of transactions that cover the rule is termed the support percentage. To find the rules, the user should set minimum thresholds for support and confidence, called minimum support (min-sup) and minimum confidence (min-conf), respectively [1].
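To make the support and confidence percentages concrete, here is a small Python sketch; the five toy transactions are invented for illustration and are not the department-store data described above.

```python
# A minimal sketch (not from the paper) showing how the support and confidence
# of a candidate rule are computed over a toy transaction list.

transactions = [  # hypothetical mini market-basket data
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"milk", "bread"},
    {"butter", "bread"},
    {"milk", "butter", "eggs"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """sup(antecedent ∪ consequent) / sup(antecedent)."""
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

print(support({"milk", "butter"}, transactions))      # support of the rule milk -> butter
print(confidence({"milk"}, {"butter"}, transactions)) # confidence of the rule milk -> butter
```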

The main step in association rule mining is mining the frequent itemsets; indeed, with the frequent itemsets in hand, generating the association rules is straightforward. Frequent itemset mining often generates a very large number of frequent itemsets and rules, which reduces the efficiency and power of mining. To overcome this problem, condensed representations of the frequent itemsets have been used in recent years [3, 7]. A popular condensed representation is the set of frequent closed itemsets. Compared with the frequent itemsets, the frequent closed itemsets form a much smaller set but with similar power. In addition, they decrease the number of redundant rules and increase mining efficiency. Many algorithms have been presented for mining frequent closed itemsets.

In this paper a new algorithm called FC-Close is introduced for frequent closed itemset mining. The algorithm, which is developed on the basis of FP-growth [5], employs a pruning technique to improve its efficiency. The rest of the paper is structured as follows: Section 2 introduces frequent closed itemset mining and related concepts. Section 3 sketches out the concepts and structures used by the new algorithm. Section 4 describes our developed algorithm. The evaluation of the findings is presented in Section 5, and Section 6 is devoted to the conclusion.

2. Problem Development

Let D be a transactional database. A transactional database consists of a set of transactions. Each transaction t is represented by <TID, x>, where x is a set of items and TID is the unique identifier of the transaction. Further, let I = {i1, i2, ..., in} be the complete set of distinct items in D. Each non-empty subset y of I is termed an itemset; if it includes k items, it is called a k-itemset. The number of transactions in D that contain itemset y is called the support of y, denoted sup(y), and it is usually expressed as a percentage. Given a minimum support min-sup, an itemset y is a frequent itemset if sup(y) ≥ min-sup.

Definition 1 (Closed Itemset): An itemset y is a closed itemset if there is no proper superset y' of y such that sup(y) = sup(y').
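As an illustration of these definitions, the brute-force Python sketch below enumerates all frequent itemsets of the sample database given later in Table 1 (minimum support 20%, i.e. 2 of 10 transactions) and keeps those that are closed according to Definition 1. It is for illustration only and is far less efficient than the tree-based algorithms discussed in the paper.

```python
# Brute-force illustration of Definition 1: enumerate frequent itemsets and
# keep those with no proper superset of equal support.
from itertools import combinations

# Transactions of Table 1 (tid 1..10), each string listing the items bought.
db = [set(t) for t in ["abcefo", "acg", "ei", "acdeg", "acegl",
                       "ej", "abcefp", "acd", "acegm", "acegn"]]
min_sup = 2  # 20% of 10 transactions

def sup(itemset):
    return sum(itemset <= t for t in db)

items = sorted(set().union(*db))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = sup(set(combo))
        if s >= min_sup:
            frequent[frozenset(combo)] = s

# An itemset is closed if no proper superset has the same support.
closed = [y for y, s in frequent.items()
          if not any(y < z and s == frequent[z] for z in frequent)]
print(len(frequent), "frequent itemsets,", len(closed), "of them closed")
```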

3. Related Literature

A. FP-tree and FP-growth Method

As FC-Close is an extended version of FP-growth, an introduction to FP-growth and the FP-tree structure is needed. FP-growth uses a new structure called the FP-tree. The FP-tree is a compact data structure that stores all the information about the frequent itemsets of a database that is needed for mining. Each branch of the FP-tree represents a set of frequent items, and the nodes along the branch are arranged in descending order of item count.


Each node in the FP-tree has three fields: item-name, count, and node-link. The item-name field holds the name of the item the node represents. The count shows the number of transactions along the path covered up to the node, and the node-link points to the next node in the FP-tree that holds the same item; if there is no such node, the node-link is null. Likewise, the FP-tree has its own header table. The single items of the database are saved in this table in descending order. Each entry of the header table includes two fields: the item name and the head of the node-link, which refers to the first node in the FP-tree that holds that item.

Compared with Apriori [1] and its variants, which require a considerable number of passes over the database, FP-growth needs just two passes to mine all frequent itemsets. In the first pass, the support of each item is calculated, the frequent single items (frequent itemsets of length 1) are determined, and the items of each transaction are ordered by descending support. In the second pass, an FP-tree that includes all the frequent information of the database is created. In other words, each transaction of the ordered database is read, and one transaction at a time is added to the FP-tree structure. To add a new transaction to the FP-tree, if the transaction shares a prefix with previously added transactions, no new nodes are created for the items of that prefix and only the support count of the shared nodes is increased. Mining the database is thereby reduced to mining the FP-tree. Figure 1 shows the FP-tree built in the second pass with a minimum support of 20%. When item i is added to itemset y and y ∪ i is called z, the path from the parent of the corresponding node for i to the root of the FP-tree of y is called the prefix path of z.
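The following Python sketch outlines the two-pass construction just described, using the node fields and header table introduced above. It is an illustrative reconstruction under those assumptions, not the paper's implementation.

```python
# Minimal two-pass FP-tree construction sketch (node fields: item-name, count,
# node-link; plus a header table), following the description in the text.
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}          # item -> child FPNode
        self.node_link = None       # next node in the tree with the same item name

def build_fptree(db, min_count):
    # Pass 1: count single items and keep the frequent ones.
    counts = Counter(i for t in db for i in t)
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode(None, None)
    header = {}                     # item -> first node holding that item
    # Pass 2: insert each transaction with its items in descending support order.
    for t in db:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item in node.children:          # shared prefix: only bump the count
                node = node.children[item]
                node.count += 1
            else:                              # new node, chained into the node-links
                child = FPNode(item, node)
                node.children[item] = child
                child.node_link, header[item] = header.get(item), child
                node = child
    return root, header, frequent
```

Running this on the transactions of Table 1 with a minimum count of 2 (20% of 10 transactions) yields a tree of the kind sketched in Fig. 1.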

Let us review FP-growth in more detail. Once the FP-tree is created, the frequent patterns are mined from it with the FP-growth algorithm. The FP-growth algorithm proceeds recursively: the database is repeatedly restricted to the existing itemsets, and the database restricted to an item i is called the conditional pattern base of i, from which the conditional FP-tree T_i is built. To create the conditional pattern base of an itemset, all of its prefix paths (starting from the root) are written out. After creating the conditional pattern base of an itemset, its conditional FP-tree is made by following the same steps used to make the initial FP-tree; in this stage, however, the conditional pattern base of that item is used instead of the whole database. Accordingly, the supports of each item over all conditional patterns related to the item are summed, and if the total is above the threshold, the item is added to the header table and to the conditional FP-tree. The mining procedure is applied to the conditional FP-tree recursively until the tree is empty or consists of a single branch, from which all remaining frequent patterns are extracted.
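As an illustration of the conditional pattern base, the short Python sketch below derives the prefix paths of a single item directly from the support-ordered transactions rather than by following node-links in the tree; both views yield the same paths. The item 'g' and the Table 1 transactions are used only as an example.

```python
# Illustrative sketch: the conditional pattern base of one item, read off the
# support-ordered transactions (equivalent to collecting its prefix paths in
# the FP-tree; in a real tree the counts of identical prefixes would be merged).
from collections import Counter

def conditional_pattern_base(db, item, min_count):
    counts = Counter(i for t in db for i in t)
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    base = []  # list of (prefix_path, count) pairs
    for t in db:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        if item in ordered:
            prefix = tuple(ordered[:ordered.index(item)])
            if prefix:
                base.append((prefix, 1))
    return base

# Prefix paths of item 'g' in the Table 1 database, minimum count 2.
db = [set(t) for t in ["abcefo", "acg", "ei", "acdeg", "acegl",
                       "ej", "abcefp", "acd", "acegm", "acegn"]]
print(conditional_pattern_base(db, "g", min_count=2))
```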

B. CFI-tree

In the FP-close algorithm, the CFI-tree is introduced as a special structure for storing closed frequent itemsets. A CFI-tree is similar to an FP-tree: it includes a root node, labelled root, and each node below the root has four fields: item-name, count, node surface, and node-link. All nodes with the same item name are connected; the node-link refers to the next node with the same item name. A header table is created for the items in the CFI-tree, where the order of the items in the table is the same as the order of items in the FP-tree first built over the database in the first pass. Each entry of the header table includes two fields: the item name and the head of the node-link, which links to the first node with the same item name in the CFI-tree. The surface field is used in subset testing. The count field is needed when comparing a candidate itemset y with a set z already in the tree: it is tested repeatedly until it is confirmed that there is no case where y ⊆ z and y and z have the same count.

TABLE 1. A sample database.

tid   items
1     a, b, c, e, f, o
2     a, c, g
3     e, i
4     a, c, d, e, g
5     a, c, e, g, l
6     e, j
7     a, b, c, e, f, p
8     a, c, d
9     a, c, e, g, m
10    a, c, e, g, n

Fig. 1. Structure of the FP-tree.

The arrangement of a frequent itemset in the CFI-tree is similar to the arrangement of a transaction in the FP-tree. However, when adding an itemset whose prefix matches already added itemsets, the counts of the shared nodes are not increased; instead, the maximum of the counts is kept. In effect, in FP-close a newly discovered itemset is put in the CFI-tree unless it is a subset of an itemset already in the tree with the same count of occurrences. Figure 2 displays the CFI-tree of the database in Table 1 when the minimum support equals 20%. In this figure, a node labelled (x, c, 1) represents item x with count c and surface 1. More details on the CFI-tree are available in [4].

Fig. 2. Structure of the CFI-tree.
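The acceptance test that governs insertion can be illustrated without the tree itself. The sketch below is a simplified stand-in for the CFI-tree of [4] (an assumption made only for illustration, not the actual structure): a flat dictionary of stored itemsets and counts that applies the same rule, namely that a newly mined itemset is discarded when some already-stored superset has an equal count.

```python
# Simplified stand-in for the CFI-tree (illustration only): keep discovered
# closed candidates in a dict and reject any itemset subsumed by a stored
# superset with the same count.
def insert_if_closed(store, itemset, count):
    """store: dict mapping frozenset -> count of itemsets kept so far."""
    y = frozenset(itemset)
    for z, c in store.items():
        if y <= z and c == count:   # a superset with equal support subsumes y
            return False
    store[y] = count
    return True

store = {}
insert_if_closed(store, {"a", "c", "e"}, 6)
insert_if_closed(store, {"a", "c"}, 6)   # rejected: subsumed by {a, c, e} with count 6
insert_if_closed(store, {"a", "c"}, 8)   # kept: higher count, not subsumed
print(store)
```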


4. FC-Close Algorithm

The FC-Close algorithm employs the same FP-tree and CFI-tree structures for mining frequent closed itemsets. Here, however, the search space is reduced effectively thanks to an optimization technique called pruning. The smaller search space and smaller number of trees imply less running time compared with similar algorithms such as FP-close. Let us review the pruning technique.

A. Optimal Pruning Technique in the FC-Close Algorithm

Suppose y is an itemset. One optimization is to compare the support of itemset y with that of (y ∪ i), where i is a member of y's conditional pattern base. If the supports of y and (y ∪ i) are equal, then each transaction including y also includes i. This guarantees that each frequent itemset z that includes y but does not include i has the frequent superset z ∪ i, and these two sets have the same support. By the definition of closed frequent itemsets, counting the itemsets that include y but do not include i is therefore not needed. It is thus possible to move i into y and to delete item i from y's conditional pattern base.

Another optimization compares the itemsets (y ∪ i) and (z ∪ i), where z consists of the items of y except an item j that has already been added to y. If the support of (y ∪ i) equals that of (z ∪ i), it is guaranteed that each frequent itemset containing (z ∪ i) has the frequent superset (y ∪ i), and these two sets have the same count (the same support).

By the definition of closed frequent itemsets, it is therefore possible to avoid searching the branch that contains the itemset (z ∪ i). Figure 3 shows the pseudo-code of the function that performs the pruning technique:

Pruning(current itemset: y)
{
  For each item i in y's conditional pattern base
  {
    NewItem = y ∪ i
    If (support(NewItem) == support(y))
    {
      Move i to y
      Remove i from y's conditional pattern base
    }
    If (support(NewItem) == support(z ∪ i))
      Stop searching the branch that contains the itemset (z ∪ i)
  }
}

Fig. 3. Pseudo-code of the pruning function.
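As a concrete illustration of the first rule only (the comparison against z ∪ i is omitted for brevity), the Python sketch below moves into y every item that occurs in all patterns of y's conditional pattern base; the example values are the prefix paths of item g from Table 1. This is a sketch under those assumptions, not the paper's implementation.

```python
# Minimal sketch of the first pruning rule in Fig. 3: any item i with
# sup(y ∪ i) == sup(y) is moved into y and removed from y's conditional
# pattern base.
def prune(y, cond_base):
    """y: set of items; cond_base: list of (pattern_set, count) pairs."""
    # sup(y) equals the total count of its conditional pattern base.
    sup_y = sum(c for _, c in cond_base)
    for i in {i for pattern, _ in cond_base for i in pattern}:
        sup_yi = sum(c for pattern, c in cond_base if i in pattern)
        if sup_yi == sup_y:                    # every transaction with y also has i
            y.add(i)                           # move i into y ...
            cond_base = [(p - {i}, c) for p, c in cond_base]  # ... and drop it from the base
    return y, cond_base

# Prefix paths of item 'g' in Table 1: (a, c) once and (a, c, e) four times.
y, base = prune({"g"}, [({"a", "c"}, 1), ({"a", "c", "e"}, 4)])
print(y, base)   # 'a' and 'c' are moved into y; 'e' stays in the base
```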

B. Mining Frequent Closed Itemsets Using FC-Close

In this article a new algorithm called FC-Close is introduced using the pruning technique. It is developed from the FP-growth method and, like FP-growth, FC-Close is recursive. In the first call, an FP-tree is made from the first pass over the database. A linked list y holds the items that make up the conditional pattern of the current call. If there is just a single path in the FP-tree, each frequent itemset x created from this single path is combined with the frequent itemset y, and it should then be checked whether the resulting itemset is a closed frequent itemset. In the second line, if the itemset (y ∪ x) is a closed frequent itemset, (y ∪ x) is put in the CFI-tree. If the FP-tree is not a single-path tree, then for each item in the header table the item is added to y. The if-closed function is then called to analyze whether the itemset y is a closed frequent itemset; if so, y is put in the CFI-tree. In the next line, the FP-tree of y is made and the pruning function is called. Then FC-Close is called recursively. Figure 4 displays the FC-Close pseudo-code:

FC-Close(T)
Input: T: an FP-tree
Global:
  y: a linked list of items
  CFI-tree: a CFI-tree
Output: the CFI-tree, which contains the closed frequent itemsets
Method:
(1)  If T only contains a single path p {
(2)    Generate all frequent itemsets from p
(3)    For each x in the frequent itemsets
(4)      If not if-closed(y ∪ x) {
(5)        Insert y ∪ x into the CFI-tree } }
(6)  Else for each i in the header table of T {
(7)    Append i to y
(8)    If not if-closed(y)
(9)    {
(10)     Insert y into the CFI-tree
(11)     Construct y's FP-tree T_y
(12)     Pruning(y)
(13)     FC-Close(T_y)
(14)     Remove i from y }

Fig. 4. FC-Close algorithm pseudo-code.

5. Results

In this section our developed algorithm, i.e. FC-Close, is compared with the existing FP-close algorithm. The tests are run on a computer with a 3 GHz Pentium processor, 1 GB of RAM, and a 200 GB disk running Windows XP 2003. All code is implemented in C++. The results are obtained on two databases: a dense database called Chess, and the sparse database T40.I10.D100K. The specifications of these databases are shown in Table 2. Both databases are taken from [11].

TABLE 2. Specifications of the tested databases (dataset, number of transactions, average transaction size).

Figures 5 and 6 contrast the execution time of our developed algorithm, i.e. FC-Close, with that of the existing FP-close on the Chess and T40.I10.D100K databases.


Fig. 5. Comparison of execution time on the Chess database.

Fig. 6. Comparison of execution time on the T40.I10.D100K database.

As shown in Figures 5 and 6, FC-Close has higher efficiency than FP-close.

6. Conclusion

In this article FC-Close is introduced as an efficient algorithm for mining closed frequent itemsets. The algorithm decreases the search space and the FP-tree size using a pruning technique. The experiments show that FC-Close is more efficient than the FP-close algorithm.

References

[1] R. Agrawal & R. Srikant, "Fast algorithms for mining association rules," (1994) Proceedings of the VLDB, Santiago de Chile.
[2] C.-C. Chang & C.-Y. Lin, "Perfect hashing schemes for mining association rules," (2005) Oxford University Press on behalf of the British Computer Society, vol. 48, no. 2.
[3] B. Goethals, "Survey on frequent pattern mining," (2004) Department of Computer Science, University of Helsinki.
[4] G. Grahne & J. Zhu, "Efficiently using prefix-trees in mining frequent itemsets," (2003) IEEE ICDM Workshop on Frequent Itemset Mining Implementations.
[5] J. Han, J. Pei, Y. Yin, & R. Mao, "Mining frequent patterns without candidate generation," (2003) Data Mining and Knowledge Discovery.
[6] J.S. Park, M.-S. Chen, & P.S. Yu, "An effective hash based algorithm for mining association rules," (1995) ACM SIGMOD International Conference on Management of Data, vol. 24, pp. 175-186.
[7] N. Pasquier, Y. Bastide, R. Taouil, & L. Lakhal, "Discovering frequent closed itemsets for association rules," (1999) Proc. Int'l Conf. Database Theory, pp. 398-416.
[8] J. Pei, J. Han, & R. Mao, "CLOSET: An efficient algorithm for mining frequent closed itemsets," (2000) ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 21-30.
[9] J. Wang, J. Han, & J. Pei, "CLOSET+: Searching for the best strategies for mining frequent closed itemsets," (2003) Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 236-245.
[10] M.J. Zaki & C. Hsiao, "CHARM: An efficient algorithm for closed itemset mining," (2002) Proc. SIAM Int'l Conf. Data Mining, pp. 457-473.
[11] http://fimi.cs.helsinki.fi, 2003.
[12] http://www.cs.bme.hu/~bodon, 2005.
