2012 IEEE International Conference on Granular Computing
A Fast Algorithm for Classification Based on
Association Rules
Loan T.T. Nguyen
Faculty of Information Technology
VOV Broadcasting College II
Ho Chi Minh City, Viet Nam
nguyenthithuyloan@vov.org.vn
Tzung-Pei Hong
Department of Computer Science and Information Engineering
National University of Kaohsiung
Kaohsiung City, Taiwan, R.O.C.
tphong@nuk.edu.tw
Abstract: In this paper, we propose a new method for mining class-association rules (CARs) using a tree structure. Firstly, we design a tree structure for storing the frequent itemsets of a dataset. Some theorems for pruning nodes and computing information in the tree are then developed. We then propose an efficient algorithm for mining CARs based on them. Experimental results show that our approach is more efficient than those used previously.
Keywords: accuracy, classification, class-association rules, data mining, tree structure
I. INTRODUCTION
A lot of methods for mining classification rules have been developed in recent years, such as C4.5 and ILA. These methods are, however, based on heuristic and greedy approaches to generate rule sets that are either too general or too overfitting for a given dataset, and they thus often yield high error ratios. Recently, a new classification method based on data mining, called Classification Based on Associations (CBA), has been proposed for mining Class-Association Rules (CARs). This method has advantages over the heuristic and greedy methods in that it can easily remove noise, and its accuracy is thus higher. It can additionally generate a rule set that is more complete than those of C4.5 and ILA. Thus, several algorithms for mining classification rules based on association rule mining have been proposed. Examples include CPAR [17], CMAR [4], CBA [6-7], MMAC [10], MCAR [11], ACME [12], Noah [1], LOCA and PLOCA [8], and the use of the Equivalence Class Rule-tree (ECR-tree) [16]. Some researchers have also reported that classifiers based on class-association rules are more accurate than traditional methods such as C4.5 [9] and ILA [13-14], both theoretically [15] and with regard to experimental results [6].
Bay Vo
Information Technology College
Ho Chi Minh City, Viet Nam
vdbay@itc.edu.vn
Hoang Chi Thanh
Department of Informatics
Ha Noi University of Science
Ha Noi, Viet Nam
thanhhc@vnu.vn
All the above methods focused on the design of algorithms for mining CARs or building classifiers, but did not discuss their mining time in much detail. Vo and Le [16] proposed a new method for mining CARs using an ECR-tree, together with an efficient algorithm named ECR-CARM. ECR-CARM scans the dataset once and uses object identifiers to quickly determine the supports of itemsets, and a tree structure is applied for fast mining of CARs. It was, however, time-consuming in generating and testing candidates because the authors grouped all values of the same attributes into one node of the tree. In this paper, we improve this approach by modifying the tree structure: each node in the tree contains one value of an attribute instead of a group of values. Some theorems are also designed. Based on the tree and these theorems, we propose an algorithm for mining CARs efficiently.
II. PRELIMINARY CONCEPTS

Let D be a training dataset with n attributes A1, A2, ..., An and |D| objects (cases). Let C = {c1, c2, ..., ck} be the list of class labels. A specific value of an attribute Ai and a class are denoted by lowercase letters a and c, respectively.
Definition 1: An itemset is a set of pairs, each consisting of an attribute and a specific value, denoted {(Ai1, ai1), (Ai2, ai2), ..., (Aim, aim)}.

Definition 2: A class-association rule r has the form {(Ai1, ai1), ..., (Aim, aim)} → c, where {(Ai1, ai1), ..., (Aim, aim)} is an itemset and c ∈ C is a class label.

Definition 3: The actual occurrence ActOcc(r) of a rule r in D is the number of rows of D that match r's condition.

Definition 4: The support of r, denoted Sup(r), is the number of rows of D that match r's condition and belong to r's class.
Table 1. An example of a training dataset

OID  A   B   C   class
1    a1  b1  c1  y
2    a1  b2  c1  n
3    a2  b2  c1  n
4    a3  b3  c1  y
5    a3  b1  c2  n
6    a3  b3  c1  y
7    a1  b3  c2  y
8    a2  b2  c2  n
For example, consider r = {(A, a1)} → y on the dataset in Table 1. We have ActOcc(r) = 3 and Sup(r) = 2, because there are three objects with A = a1, two of which belong to class y.
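To make Definitions 3 and 4 concrete, the following Python sketch recomputes ActOcc(r) and Sup(r) for the rule {(A, a1)} → y over the Table 1 data. The row encoding and the function names act_occ and sup are our own illustration, not part of the paper.

# A minimal sketch (Python) of Definitions 3-4, checked against Table 1.
# Each row: (OID, {attribute: value}, class label).
DATASET = [
    (1, {"A": "a1", "B": "b1", "C": "c1"}, "y"),
    (2, {"A": "a1", "B": "b2", "C": "c1"}, "n"),
    (3, {"A": "a2", "B": "b2", "C": "c1"}, "n"),
    (4, {"A": "a3", "B": "b3", "C": "c1"}, "y"),
    (5, {"A": "a3", "B": "b1", "C": "c2"}, "n"),
    (6, {"A": "a3", "B": "b3", "C": "c1"}, "y"),
    (7, {"A": "a1", "B": "b3", "C": "c2"}, "y"),
    (8, {"A": "a2", "B": "b2", "C": "c2"}, "n"),
]

def act_occ(itemset, dataset):
    """ActOcc(r): the number of rows matching r's condition (Definition 3)."""
    return sum(1 for _, attrs, _ in dataset
               if all(attrs[a] == v for a, v in itemset))

def sup(itemset, cls, dataset):
    """Sup(r): rows matching r's condition and r's class (Definition 4)."""
    return sum(1 for _, attrs, c in dataset
               if c == cls and all(attrs[a] == v for a, v in itemset))

r_condition, r_class = [("A", "a1")], "y"
print(act_occ(r_condition, DATASET))       # 3
print(sup(r_condition, r_class, DATASET))  # 2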
III. MINING CLASS-ASSOCIATION RULES

A. Tree structure
We modify the ECR-tree structure [16] into an MECR-tree structure (M stands for Modification) as follows. In the ECR-tree, the authors arranged all itemsets with the same attributes into one group and joined itemsets of different groups together. This led to more time being consumed in generating and checking itemsets. In our work, each node in the tree contains one itemset along with the following information:

a) Obidset: the set of identifiers of the objects that contain the itemset;

b) (c1, c2, ..., ck), where ci is the number of records in Obidset that belong to class ci; and

c) pos: the position of the class with the maximum count, i.e., pos = argmax_{i∈[1,k]}{ci}.

In the ECR-tree, the authors did not store the ci and pos; therefore, the algorithm had to compute them for all nodes. In the MECR-tree, however, we need not compute this information for some nodes, using the theorems presented in Section III.B.
For example, consider the node containing the itemset X = {(A, a3), (B, b3)}. X is contained in objects 4 and 6, all of which belong to class y. Therefore, we have a node in the tree written as ⟨{(A, a3), (B, b3)}, 46, (2, 0)⟩, or more compactly as 3×a3b3(46):(2,0). Its pos is 1 because the count of class y is the maximum (2 as compared with 0). The compact form is another representation of the node that saves memory when the tree structure is used to store itemsets: we use a bit representation for the itemset's attributes. For example, the attribute set AB can be represented as 11 in binary, so the value of these attributes is 3. With this representation, we can use bitwise operations to join itemsets faster.
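The following Python sketch renders this node layout and bit representation (the class name MECRNode and the field names are our own; the authors' implementation was written in C#). Joining the attribute sets of two nodes then reduces to a single bitwise OR.

from dataclasses import dataclass, field

# A sketch of the MECR-tree node layout described above. Attributes are
# encoded as a bitmask: A = 1 (binary 001), B = 2 (010), C = 4 (100).
A, B, C = 1, 2, 4

@dataclass
class MECRNode:
    att: int            # bitmask of the itemset's attributes
    values: tuple       # the attribute values, e.g. ("a3", "b3")
    obidset: frozenset  # identifiers of the objects containing the itemset
    count: tuple        # count[i]: objects in obidset belonging to class i
    pos: int = field(init=False, default=0)

    def __post_init__(self):
        # The paper numbers positions from 1; Python indexes from 0.
        self.pos = max(range(len(self.count)), key=self.count.__getitem__)

# The node 3 x a3b3(46):(2,0) from the text (classes ordered (y, n)):
n = MECRNode(att=A | B, values=("a3", "b3"),
             obidset=frozenset({4, 6}), count=(2, 0))
print(bin(n.att), n.pos)  # 0b11 0 -> att = 3, pos points at class y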
B. Proposed algorithm

In this section, some theorems for fast mining of CARs are designed. Based on these theorems, we propose an efficient algorithm for mining CARs.
Theorem 1: Given two nodes ⟨att1×values1, Obidset1, (c11, ..., c1k)⟩ and ⟨att2×values2, Obidset2, (c21, ..., c2k)⟩, if att1 = att2 and values1 ≠ values2, then Obidset1 ∩ Obidset2 = ∅.

Proof: Since att1 = att2 and values1 ≠ values2, there exist val1 ∈ values1 and val2 ∈ values2 such that val1 and val2 have the same attribute but different values. Thus, a record that contains val1 cannot contain val2. Therefore, for every OID ∈ Obidset1, it can be inferred that OID ∉ Obidset2. Thus, Obidset1 ∩ Obidset2 = ∅.
In this theorem, each itemset is written in the form att×values for ease of use. Theorem 1 implies that if two itemsets X and Y have the same attributes (but different values), they need not be combined into the itemset XY, because Sup(XY) = 0. For example, consider the two nodes 1×a1(127):(2,1) and 1×a2(38):(0,2), in which Obidset({(A, a1)}) = 127 and Obidset({(A, a2)}) = 38. Then Obidset({(A, a1), (A, a2)}) = Obidset({(A, a1)}) ∩ Obidset({(A, a2)}) = ∅. Similarly, Obidset({(A, a1), (B, b1)}) = 1 and Obidset({(A, a1), (B, b2)}) = 2, and it can be inferred that Obidset({(A, a1), (B, b1)}) ∩ Obidset({(A, a1), (B, b2)}) = ∅, because these two itemsets have the same attributes AB but different values.
Theorem 2: Given two nodes ⟨itemset1, Obidset1, (c11, ..., c1k)⟩ and ⟨itemset2, Obidset2, (c21, ..., c2k)⟩, if itemset1 ⊂ itemset2 and |Obidset1| = |Obidset2|, then ∀i ∈ [1, k]: c1i = c2i.

Proof: We have itemset1 ⊂ itemset2. This means that all records containing itemset2 also contain itemset1, and therefore Obidset2 ⊆ Obidset1. Additionally, by the hypothesis, |Obidset1| = |Obidset2|. This means that Obidset2 = Obidset1, and thus ∀i ∈ [1, k]: c1i = c2i.
From Theorem 2, when we join two parent nodes into a child node, the itemset of the child node is always a superset of the itemset of each parent node. Therefore, we check their cardinalities, and if they are the same, we need not compute the count for each class or the pos of the child node, because they are the same as those of the parent node.
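As a small illustration of Theorem 2 on the running example (a sketch with our own variable names), the child node for a2b2 can reuse the class counts of its parent a2 without rescanning the data.

# Theorem 2 on the Table 1 data: {(A,a2)} is a subset of {(A,a2),(B,b2)},
# and their Obidsets have the same cardinality, so the child's class
# counts equal the parent's and need no recomputation.
obidset_a2 = {3, 8}                    # objects containing {(A, a2)}
obidset_a2b2 = obidset_a2 & {2, 3, 8}  # join with {(B, b2)} -> {3, 8}

count_a2 = (0, 2)                      # (class y, class n) counts at the parent
if len(obidset_a2b2) == len(obidset_a2):
    count_a2b2 = count_a2              # copied, not recomputed (Theorem 2)
print(count_a2b2)                      # (0, 2)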
Using these theorems, we develop an algorithm for mining CARs efficiently. By Theorem 1, we need not join two nodes with the same attributes, and by Theorem 2, we need not compute the information of some child nodes. First of all, the root node of the tree (Lr) contains child nodes such that each node contains a single frequent itemset. After that, procedure CAR-Miner is called with the parameter Lr to mine all CARs from the dataset D.
Input: A dataset D, minSup and minConf
Output: all CARs satisfying minSup and minConf
Procedure:
CAR-Miner(Lr, minSup, minConf)
1.  CARs = ∅;
2.  for all li ∈ Lr.children do
3.    ENUMERATE-CAR(li, minConf);
4.    Pi = ∅;
5.    for all lj ∈ Lr.children, with j > i do
6.      if li.att ≠ lj.att then // Using Theorem 1
7.        O.att = li.att ∪ lj.att; // Using a bitwise operation
8.        O.itemset = li.itemset ∪ lj.itemset;
9.        O.Obidset = li.Obidset ∩ lj.Obidset;
10.       if |O.Obidset| = |li.Obidset| then // Using Theorem 2
11.         O.count = li.count;
12.         O.pos = li.pos;
13.       else if |O.Obidset| = |lj.Obidset| then // Using Theorem 2
14.         O.count = lj.count;
15.         O.pos = lj.pos;
16.       else
17.         O.count[i] = |{x ∈ O.Obidset | class(x) = ci}|, ∀i ∈ [1, k];
18.         O.pos = argmax_{i∈[1,k]}{O.count[i]}; // k is the number of classes
19.       if O.count[O.pos] ≥ minSup then
20.         Pi = Pi ∪ {O};
21.   CAR-Miner(Pi, minSup, minConf);

ENUMERATE-CAR(l, minConf)
22. conf = l.count[l.pos] / |l.Obidset|;
23. if conf ≥ minConf then
24.   CARs = CARs ∪ {l.itemset → c_pos (l.count[l.pos], conf)};

Figure 1. The proposed algorithm for mining CARs
The CAR-Miner procedure (Figure 1) considers each node li together with every node lj that follows it in Lr, with j > i (Lines 2 and 5), to generate a candidate child node O. For each pair (li, lj), the algorithm checks whether li.att ≠ lj.att (Line 6, using Theorem 1). If the attributes differ, it computes the three elements att, itemset, and Obidset of the new node O (Lines 7-9). Line 10 checks whether the number of object identifiers of li is equal to that of O (by Theorem 2). If this is true, then by Theorem 2 the algorithm can copy all the information from node li to node O (Lines 11-12). Similarly, if the check in Line 10 fails, the algorithm compares lj with O, and if their numbers of object identifiers are the same (Line 13), it copies the information from node lj to node O (Lines 14-15). Otherwise, the algorithm computes O.count using O.Obidset, and then O.pos (Lines 17-18). After computing all the information of node O, the algorithm adds it to Pi (initialized as empty in Line 4) if O.count[O.pos] ≥ minSup (Lines 19-20). Finally, CAR-Miner is called recursively with the new set Pi as its input parameter (Line 21).

The procedure ENUMERATE-CAR(l, minConf) generates a rule from node l. It first computes the confidence of the rule (Line 22); if the confidence satisfies minConf (Line 23), the rule is added to the set of CARs (Line 24).
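The pseudocode of Figure 1 can be rendered as the following runnable Python sketch. This is our own transcription, not the authors' C# implementation; in particular, minsup is treated here as an absolute object count, while the paper states it as a percentage of |D|.

# A runnable Python sketch of CAR-Miner (Figure 1).
class Node:
    """MECR-tree node: att is an attribute bitmask, itemset a frozenset
    of (attribute, value) pairs, obidset the matching object identifiers."""
    def __init__(self, att, itemset, obidset, count, pos=None):
        self.att, self.itemset, self.obidset, self.count = att, itemset, obidset, count
        self.pos = pos if pos is not None else max(
            range(len(count)), key=count.__getitem__)

def car_miner(lr, minsup, minconf, cars, classes, class_of):
    """Mine all CARs reachable from the child nodes in lr."""
    for i, li in enumerate(lr):
        enumerate_car(li, minconf, cars, classes)             # Line 3
        p_i = []                                              # Line 4
        for lj in lr[i + 1:]:                                 # Line 5
            if li.att == lj.att:                              # Line 6, Theorem 1
                continue
            att = li.att | lj.att                             # Line 7, bitwise OR
            itemset = li.itemset | lj.itemset                 # Line 8
            obidset = li.obidset & lj.obidset                 # Line 9
            if len(obidset) == len(li.obidset):               # Line 10, Theorem 2
                count, pos = li.count, li.pos
            elif len(obidset) == len(lj.obidset):             # Line 13, Theorem 2
                count, pos = lj.count, lj.pos
            else:                                             # Lines 17-18
                count = [sum(1 for x in obidset if class_of[x] == c)
                         for c in classes]
                pos = None                                    # recomputed by Node
            o = Node(att, itemset, obidset, count, pos)
            if o.count[o.pos] >= minsup:                      # Line 19
                p_i.append(o)                                 # Line 20
        car_miner(p_i, minsup, minconf, cars, classes, class_of)  # Line 21

def enumerate_car(l, minconf, cars, classes):
    conf = l.count[l.pos] / len(l.obidset)                    # Line 22
    if conf >= minconf:                                       # Line 23
        cars.append((l.itemset, classes[l.pos],
                     l.count[l.pos], conf))                   # Line 24

# Usage on the Table 1 data (classes ordered (y, n); minSup = 1 object,
# i.e. 10% of 8 objects rounded up; minConf = 0.6):
A, B, C = 1, 2, 4
CLASS_OF = {1: "y", 2: "n", 3: "n", 4: "y", 5: "n", 6: "y", 7: "y", 8: "n"}
root = [
    Node(A, frozenset({("A", "a1")}), frozenset({1, 2, 7}), (2, 1)),
    Node(A, frozenset({("A", "a2")}), frozenset({3, 8}), (0, 2)),
    Node(A, frozenset({("A", "a3")}), frozenset({4, 5, 6}), (2, 1)),
    Node(B, frozenset({("B", "b1")}), frozenset({1, 5}), (1, 1)),
    Node(B, frozenset({("B", "b2")}), frozenset({2, 3, 8}), (0, 3)),
    Node(B, frozenset({("B", "b3")}), frozenset({4, 6, 7}), (3, 0)),
    Node(C, frozenset({("C", "c1")}), frozenset({1, 2, 3, 4, 6}), (3, 2)),
    Node(C, frozenset({("C", "c2")}), frozenset({5, 7, 8}), (1, 2)),
]
cars = []
car_miner(root, 1, 0.6, cars, ("y", "n"), CLASS_OF)
print(any(i == frozenset({("A", "a2")}) and c == "n" for i, c, _, _ in cars))  # True

The usage at the bottom rebuilds the root nodes of Table 1 and recovers, among others, the rule {(A, a2)} → n derived in the example of Section III.C below.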
C. An example

In this section, we use the example in Table 1 to describe the CAR-Miner process with minSup = 10% and minConf = 60%. The MECR-tree is built from the dataset in Table 1 as follows. First, the root node Lr contains all frequent 1-itemsets:

{ 1×a1(127):(2,1), 1×a2(38):(0,2), 1×a3(456):(2,1), 2×b1(15):(1,1), 2×b2(238):(0,3), 2×b3(467):(3,0), 4×c1(12346):(3,2), 4×c2(578):(1,2) }

After that, procedure CAR-Miner is called with the parameter Lr. We use node li = 1×a2(38):(0,2) as an example to illustrate the CAR-Miner process. li joins with all nodes following it in Lr:

• With node lj = 1×a3(456):(2,1): li and lj have the same attribute but different values, so by Theorem 1 nothing is generated from them.

• With node lj = 2×b1(15):(1,1): Because their attributes are different, three elements are computed: O.att = li.att ∪ lj.att = 1 | 2 = 3 (11 in binary), O.itemset = a2 ∪ b1 = a2b1, and O.Obidset = li.Obidset ∩ lj.Obidset = {3,8} ∩ {1,5} = ∅. Because O.count[O.pos] = 0 < minSup, O is not added to Pi.

• With node lj = 2×b2(238):(0,3): Because their attributes are different, three elements are computed: O.att = 1 | 2 = 3 (11 in binary), O.itemset = a2 ∪ b2 = a2b2, and O.Obidset = {3,8} ∩ {2,3,8} = {3,8}. Because |li.Obidset| = |O.Obidset|, the algorithm copies all the information of li to O by Theorem 2: O.count = li.count = (0,2) and O.pos = 2. Because O.count[O.pos] = 2 ≥ minSup, O is added to Pi ⇒ Pi = { 3×a2b2(38):(0,2) }.

• With node lj = 2×b3(467):(3,0): Because their attributes are different, three elements are computed: O.att = 1 | 2 = 3, O.itemset = a2 ∪ b3 = a2b3, and O.Obidset = {3,8} ∩ {4,6,7} = ∅. Because O.count[O.pos] = 0 < minSup, O is not added to Pi.

• With node lj = 4×c1(12346):(3,2): Because their attributes are different, three elements are computed: O.att = li.att ∪ lj.att = 1 | 4 = 5 (101 in binary), O.itemset = a2 ∪ c1 = a2c1, and O.Obidset = {3,8} ∩ {1,2,3,4,6} = {3}. The algorithm computes the additional information O.count = (0,1) and O.pos = 2. Because O.count[O.pos] = 1 ≥ minSup, O is added to Pi ⇒ Pi = { 3×a2b2(38):(0,2), 5×a2c1(3):(0,1) }.

• With node lj = 4×c2(578):(1,2): Because their attributes are different, three elements are computed: O.att = 1 | 4 = 5 (101 in binary), O.itemset = a2 ∪ c2 = a2c2, and O.Obidset = {3,8} ∩ {5,7,8} = {8}. The algorithm computes the additional information O.count = (0,1) and O.pos = 2. Because O.count[O.pos] = 1 ≥ minSup, O is added to Pi ⇒ Pi = { 3×a2b2(38):(0,2), 5×a2c1(3):(0,1), 5×a2c2(8):(0,1) }.

• After Pi is created, CAR-Miner is called recursively with parameters Pi, minSup, and minConf to create all child nodes of Pi.

Rules are generated in the same traversal of each node li (Line 3) by calling procedure ENUMERATE-CAR(li, minConf). For example, when traversing node li = 1×a2(38):(0,2), the procedure computes the confidence of the candidate rule: conf = li.count[li.pos] / |li.Obidset| = 2/2 = 1. Because conf ≥ minConf (60%), the rule {(A, a2)} → n (2, 1) is added to the rule set CARs. This rule means "If A = a2 then class = n" (support = 2 and confidence = 100%).
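The joins above can be checked mechanically. The following standalone Python sketch (the encoding is ours) reproduces the joins of li = 1×a2 with the five nodes of different attributes and prints their Obidsets and class counts:

# Reproduce the joins of li = 1 x a2 (Obidset {3,8}) from the walkthrough.
# Attribute bitmasks: A=1, B=2, C=4; counts are ordered (class y, class n).
li = (1, "a2", {3, 8})
followers = [
    (1, "a3", {4, 5, 6}), (2, "b1", {1, 5}), (2, "b2", {2, 3, 8}),
    (2, "b3", {4, 6, 7}), (4, "c1", {1, 2, 3, 4, 6}), (4, "c2", {5, 7, 8}),
]
CLASS_OF = {1: "y", 2: "n", 3: "n", 4: "y", 5: "n", 6: "y", 7: "y", 8: "n"}

for att, vals, obids in followers:
    if att == li[0]:                  # Theorem 1: same attribute, skip the join
        continue
    inter = li[2] & obids             # O.Obidset
    counts = (sum(CLASS_OF[x] == "y" for x in inter),
              sum(CLASS_OF[x] == "n" for x in inter))
    print(f"{li[0] | att} x {li[1]}{vals}: Obidset={sorted(inter)}, count={counts}")
# Output matches the text: a2b1 and a2b3 are empty; a2b2 -> {3,8} (0,2);
# a2c1 -> {3} (0,1); a2c2 -> {8} (0,1).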
IV. EXPERIMENTAL RESULTS

A. Characteristics of the experimental datasets

The algorithms used in the experiments were coded with C# 2008 on a personal computer with Windows 7, a Centrino 2×2.53 GHz processor, and 4 GB of RAM. The experiments were conducted on datasets obtained from the UCI Machine Learning Repository (http://mlearn.ics.uci.edu). Table 4 shows the characteristics of the experimental datasets.
Table 4. The characteristics of the experimental datasets
Dataset | #attrs | #classes | #distinct values | #Objs
The experimental datasets had different features. The Breast, German, and Vehicle datasets had many attributes and distinct values but relatively few objects (records). The Led7 dataset had only a few attributes, distinct values, and objects.
B. Numbers of rules for the experimental datasets

Table 5 shows the numbers of rules generated from the datasets in Table 4 for different minimum support thresholds. We used minConf = 50% for all experiments.
Table 5. Numbers of rules for different minSups
Dataset | minSup(%) | #rules
The results in Table 5 show that some datasets produced a large number of rules. For example, the Lymph dataset had 4,039,186 rules with minSup = 1%, the German dataset had 752,643 rules with minSup = 1%, etc.
C. Execution time

Experiments were then conducted to compare the execution times of CAR-Miner and ECR-CARM [16]. The results are shown in Table 6.

Table 6. The execution times (in seconds) for different minSups
Dataset | minSup(%) | ECR-CARM | CAR-Miner
The results in Table 6 show that CAR-Miner is more efficient than ECR-CARM in all of the experiments. For example, on the Breast dataset with minSup = 0.1%, the mining time of CAR-Miner was 1.517 seconds, while that of ECR-CARM was 17.136 seconds.
V. CONCLUSIONS AND FUTURE WORK

This paper proposed a new algorithm for mining CARs using a tree structure. Each node in the tree contains information for quickly computing the support of a candidate rule. In addition, using Obidsets, we were able to compute the supports of itemsets quickly. Some theorems were also developed; based on them, we did not need to compute the information of many nodes in the tree.

Mining itemsets from incremental databases has been developed in recent years [2-3, 5], and it saves a lot of time and memory when compared with re-mining the integrated database. Therefore, in the future, we will study how to apply this approach to mining CARs.
REFERENCES
1. G. Giuffrida, W.W. Chu, D.M. Hanssens, "Mining classification rules from datasets with large number of many-valued attributes", The 7th International Conference on Extending Database Technology: Advances in Database Technology (EDBT'00), Munich, Germany, 2000, 335-349.
2. T.P. Hong, C.Y. Wang, "An efficient and effective association-rule maintenance algorithm for record modification", Expert Systems with Applications, 37(1), 2010, 618-626.
3. T.P. Hong, C.W. Lin, Y.L. Wu, "Maintenance of fast updated frequent pattern trees for record deletion", Computational Statistics and Data Analysis, 53(7), 2009, 2485-2499.
4. W. Li, J. Han, J. Pei, "CMAR: Accurate and efficient classification based on multiple class-association rules", The 1st IEEE International Conference on Data Mining, San Jose, California, USA, 2001, 369-376.
5. C.W. Lin, T.P. Hong, W.H. Lu, "The Pre-FUFP algorithm for incremental mining", Expert Systems with Applications, 36(5), 2009, 9498-9505.
6. B. Liu, W. Hsu, Y. Ma, "Integrating classification and association rule mining", The 4th International Conference on Knowledge Discovery and Data Mining, New York, USA, 1998, 80-86.
7. B. Liu, Y. Ma, C.K. Wong, "Improving an association rule based classifier", The 4th European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France, 2000, 80-86.
8. L.T.T. Nguyen, B. Vo, T.P. Hong, H.C. Thanh, "Classification based on association rules: A lattice-based approach", Expert Systems with Applications, 39(13), 2012, 11357-11366.
9. J.R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1992.
10. F. Thabtah, P. Cowling, Y. Peng, "MMAC: A new multi-class, multi-label associative classification approach", The 4th IEEE International Conference on Data Mining, Brighton, UK, 2004, 217-224.
11. F. Thabtah, P. Cowling, Y. Peng, "MCAR: Multi-class classification based on association rule", The 3rd ACS/IEEE International Conference on Computer Systems and Applications, Tunis, Tunisia, 2005, 33-39.
12. R. Thonangi, V. Pudi, "ACME: An associative classifier based on maximum entropy principle", The 16th International Conference on Algorithmic Learning Theory, LNAI 3734, Singapore, 2005, 122-134.
13. M.R. Tolun, S.M. Abu-Soud, "ILA: An inductive learning algorithm for production rule discovery", Expert Systems with Applications, 14(3), 1998, 361-370.
14. M.R. Tolun, H. Sever, M. Uludag, S.M. Abu-Soud, "ILA-2: An inductive learning algorithm for knowledge discovery", Cybernetics and Systems, 30(7), 1999, 609-628.
15. A. Veloso, W. Meira Jr., M.J. Zaki, "Lazy associative classification", The 2006 IEEE International Conference on Data Mining (ICDM'06), Hong Kong, China, 2006, 645-654.
16. B. Vo, B. Le, "A novel classification algorithm based on association rules mining", The 2008 Pacific Rim Knowledge Acquisition Workshop (Held with PRICAI'08), LNAI 5465, Ha Noi, Viet Nam, 2008, 61-75.
17. X. Yin, J. Han, "CPAR: Classification based on predictive association rules", SIAM International Conference on Data Mining (SDM'03), San Francisco, CA, USA, 2003, 331-335.