2012 IEEE International Conference on Granular Computing
A Fast Algorithm for Classification Based on
Association Rules
Loan T.T. Nguyen
Faculty of Information Technology
VOV Broadcasting College II
Ho Chi Minh City, Viet Nam
nguyenthithuyloan@vov.org.vn
Tzung-Pei Hong
Department of Computer Science and Information Engineering
National University of Kaohsiung
Kaohsiung City, Taiwan, R.O.C.
tphong@nuk.edu.tw
Abstract: In this paper, we propose a new method for mining class-association rules (CARs) using a tree structure. Firstly, we design a tree structure for storing the frequent itemsets of a dataset. Some theorems for pruning nodes and computing information in the tree are then developed. We then propose an efficient algorithm for mining CARs based on them. Experimental results show that our approach is more efficient than those used previously.
Keywords: accuracy, classification, class-association rules, data mining, tree structure
I. INTRODUCTION
A lot of methods for mining classification rules have been developed in recent years, such as C4.5 and ILA. These methods are, however, based on heuristic and greedy approaches to generate rule sets that are either too general or too overfitting for a given dataset, and they thus often yield high error ratios. Recently, a new classification method based on data mining, called Classification Based on Associations (CBA), has been proposed for mining Class-Association Rules (CARs). This method has advantages over the heuristic and greedy methods in that it can easily remove noise, and its accuracy is thus higher. It can additionally generate a rule set that is more complete than those of C4.5 and ILA. Thus, several algorithms for mining classification rules based on association rule mining have been proposed. Examples include CPAR [17], CMAR [4], CBA [6-7], MMAC [10], MCAR [11], ACME [12], Noah [1], LOCA and PLOCA [8], and the use of the Equivalence Class Rule-tree (ECR-tree) [16]. Some researchers have also reported that classifiers based on class-association rules are more accurate than traditional methods such as C4.5 [9] and ILA [13-14], both theoretically [15] and with regard to experimental results [6].
Bay Vo
Information Technology College
Ho Chi Minh City, Viet Nam
vdbay@itc.edu.vn
Hoang Chi Thanh
Department of Informatics
Ha Noi University of Science
Ha Noi, Viet Nam
thanhhc@vnu.vn
All the above methods focused on the design of algorithms for mining CARs or building classifiers, but did not discuss their mining time in much detail. Vo and Le [16] proposed a new method for mining CARs using an ECR-tree, together with an efficient algorithm named ECR-CARM. ECR-CARM scans the dataset once and uses object identifiers to quickly determine the supports of itemsets, and a tree structure is applied for fast mining of CARs. It was, however, time-consuming in generating and testing candidates because the authors grouped all values of the same attributes into one node of the tree. In this paper, we improve this approach by modifying the tree structure: each node in the tree contains one value of an attribute instead of a group of values. Some theorems are also designed. Based on the tree and these theorems, we propose an algorithm for mining CARs efficiently.
II. PRELIMINARY CONCEPTS

Let D be a training dataset with n attributes A1, A2, ..., An and |D| objects (cases). Let C = {c1, c2, ..., ck} be the list of class labels. A specific value of an attribute Ai and a class are denoted by lowercase letters a and c, respectively.
Definition 1: An itemset is a set of pairs, each consisting of an attribute and a specific value, denoted {(Ai1, ai1), (Ai2, ai2), ..., (Aim, aim)}.

Definition 2: A class-association rule r has the form {(Ai1, ai1), ..., (Aim, aim)} → c, where {(Ai1, ai1), ..., (Aim, aim)} is an itemset and c ∈ C is a class label.

Definition 3: The actual occurrence ActOcc(r) of a rule r in D is the number of rows of D that match r's condition.

Definition 4: The support of r, denoted Sup(r), is the number of rows of D that match r's condition and belong to r's class.
Table 1. An example of a training dataset

OID  A   B   C   class
1    a1  b1  c1  y
2    a1  b2  c1  n
3    a2  b2  c1  n
4    a3  b3  c1  y
5    a3  b1  c2  n
6    a3  b3  c1  y
7    a1  b3  c2  y
8    a2  b2  c2  n
For example, consider r = {(A, a1)} → y on the dataset in Table 1. We have ActOcc(r) = 3 and Sup(r) = 2, because there are three objects with A = a1, two of which belong to class y.
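To make Definitions 3 and 4 concrete, the following Python sketch recomputes ActOcc(r) and Sup(r) for the rule {(A, a1)} → y over the Table 1 data. The row encoding and the function names act_occ and sup are our own illustration, not part of the paper.

# A minimal sketch (Python) of Definitions 3-4, checked against Table 1.
# Each row: (OID, {attribute: value}, class label).
DATASET = [
    (1, {"A": "a1", "B": "b1", "C": "c1"}, "y"),
    (2, {"A": "a1", "B": "b2", "C": "c1"}, "n"),
    (3, {"A": "a2", "B": "b2", "C": "c1"}, "n"),
    (4, {"A": "a3", "B": "b3", "C": "c1"}, "y"),
    (5, {"A": "a3", "B": "b1", "C": "c2"}, "n"),
    (6, {"A": "a3", "B": "b3", "C": "c1"}, "y"),
    (7, {"A": "a1", "B": "b3", "C": "c2"}, "y"),
    (8, {"A": "a2", "B": "b2", "C": "c2"}, "n"),
]

def act_occ(itemset, dataset):
    """ActOcc(r): the number of rows matching r's condition (Definition 3)."""
    return sum(1 for _, attrs, _ in dataset
               if all(attrs[a] == v for a, v in itemset))

def sup(itemset, cls, dataset):
    """Sup(r): rows matching r's condition and r's class (Definition 4)."""
    return sum(1 for _, attrs, c in dataset
               if c == cls and all(attrs[a] == v for a, v in itemset))

r_condition, r_class = [("A", "a1")], "y"
print(act_occ(r_condition, DATASET))       # 3
print(sup(r_condition, r_class, DATASET))  # 2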
III. MINING CLASS-ASSOCIATION RULES

A. Tree structure
We modify the ECR-tree structure [16] into an MECR-tree structure (M stands for Modification) as follows. In the ECR-tree, the authors arranged all itemsets with the same attributes into one group and joined itemsets of different groups together. This led to more time being consumed in generating and checking itemsets. In our work, each node in the tree contains one itemset along with the following information:

a) Obidset: the set of identifiers of the objects that contain the itemset;

b) (c1, c2, ..., ck), where ci is the number of records in Obidset that belong to class ci; and

c) pos: the position of the class with the maximum count, i.e., pos = argmax_{i∈[1,k]}{ci}.

In the ECR-tree, the authors did not store the ci and pos; therefore, the algorithm had to compute them for all nodes. In the MECR-tree, however, we need not compute this information for some nodes, using the theorems presented in Section III.B.
For example, consider the node containing the itemset X = {(A, a3), (B, b3)}. X is contained in objects 4 and 6, all of which belong to class y. Therefore, we have a node in the tree written as ⟨{(A, a3), (B, b3)}, 46, (2, 0)⟩, or more compactly as 3×a3b3(46):(2,0). Its pos is 1 because the count of class y is the maximum (2 as compared with 0). The compact form is another representation of the node that saves memory when the tree structure is used to store itemsets: we use a bit representation for the itemset's attributes. For example, the attribute set AB can be represented as 11 in binary, so the value of these attributes is 3. With this representation, we can use bitwise operations to join itemsets faster.
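The following Python sketch renders this node layout and bit representation (the class name MECRNode and the field names are our own; the authors' implementation was written in C#). Joining the attribute sets of two nodes then reduces to a single bitwise OR.

from dataclasses import dataclass, field

# A sketch of the MECR-tree node layout described above. Attributes are
# encoded as a bitmask: A = 1 (binary 001), B = 2 (010), C = 4 (100).
A, B, C = 1, 2, 4

@dataclass
class MECRNode:
    att: int            # bitmask of the itemset's attributes
    values: tuple       # the attribute values, e.g. ("a3", "b3")
    obidset: frozenset  # identifiers of the objects containing the itemset
    count: tuple        # count[i]: objects in obidset belonging to class i
    pos: int = field(init=False, default=0)

    def __post_init__(self):
        # The paper numbers positions from 1; Python indexes from 0.
        self.pos = max(range(len(self.count)), key=self.count.__getitem__)

# The node 3 x a3b3(46):(2,0) from the text (classes ordered (y, n)):
n = MECRNode(att=A | B, values=("a3", "b3"),
             obidset=frozenset({4, 6}), count=(2, 0))
print(bin(n.att), n.pos)  # 0b11 0 -> att = 3, pos points at class y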
B. Proposed algorithm

In this section, some theorems for fast mining of CARs are designed. Based on these theorems, we propose an efficient algorithm for mining CARs.
Theorem 1: Given two nodes ⟨att1×values1, Obidset1, (c11, ..., c1k)⟩ and ⟨att2×values2, Obidset2, (c21, ..., c2k)⟩, if att1 = att2 and values1 ≠ values2, then Obidset1 ∩ Obidset2 = ∅.

Proof: Since att1 = att2 and values1 ≠ values2, there exist val1 ∈ values1 and val2 ∈ values2 such that val1 and val2 have the same attribute but different values. Thus, a record that contains val1 cannot contain val2. Therefore, for every OID ∈ Obidset1, it can be inferred that OID ∉ Obidset2. Thus, Obidset1 ∩ Obidset2 = ∅.
In this theorem, each itemset is written in the form att×values for ease of use. Theorem 1 implies that if two itemsets X and Y have the same attributes (but different values), they need not be combined into the itemset XY, because Sup(XY) = 0. For example, consider the two nodes 1×a1(127):(2,1) and 1×a2(38):(0,2), in which Obidset({(A, a1)}) = 127 and Obidset({(A, a2)}) = 38. Then Obidset({(A, a1), (A, a2)}) = Obidset({(A, a1)}) ∩ Obidset({(A, a2)}) = ∅. Similarly, Obidset({(A, a1), (B, b1)}) = 1 and Obidset({(A, a1), (B, b2)}) = 2, and it can be inferred that Obidset({(A, a1), (B, b1)}) ∩ Obidset({(A, a1), (B, b2)}) = ∅, because these two itemsets have the same attributes AB but different values.
Theorem 2: Given two nodes ⟨itemset1, Obidset1, (c11, ..., c1k)⟩ and ⟨itemset2, Obidset2, (c21, ..., c2k)⟩, if itemset1 ⊂ itemset2 and |Obidset1| = |Obidset2|, then ∀i ∈ [1, k]: c1i = c2i.

Proof: We have itemset1 ⊂ itemset2. This means that all records containing itemset2 also contain itemset1, and therefore Obidset2 ⊆ Obidset1. Additionally, by the hypothesis, |Obidset1| = |Obidset2|. This means that Obidset2 = Obidset1, and thus ∀i ∈ [1, k]: c1i = c2i.
From Theorem 2, when we join two parent nodes into a child node, the itemset of the child node is always a superset of the itemset of each parent node. Therefore, we check their cardinalities, and if they are the same, we need not compute the count for each class or the pos of the child node, because they are the same as those of the parent node.
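As a small illustration of Theorem 2 on the running example (a sketch with our own variable names), the child node for a2b2 can reuse the class counts of its parent a2 without rescanning the data.

# Theorem 2 on the Table 1 data: {(A,a2)} is a subset of {(A,a2),(B,b2)},
# and their Obidsets have the same cardinality, so the child's class
# counts equal the parent's and need no recomputation.
obidset_a2 = {3, 8}                    # objects containing {(A, a2)}
obidset_a2b2 = obidset_a2 & {2, 3, 8}  # join with {(B, b2)} -> {3, 8}

count_a2 = (0, 2)                      # (class y, class n) counts at the parent
if len(obidset_a2b2) == len(obidset_a2):
    count_a2b2 = count_a2              # copied, not recomputed (Theorem 2)
print(count_a2b2)                      # (0, 2)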
Using these theorems, we develop an algorithm for mining CARs efficiently. By Theorem 1, we need not join two nodes with the same attributes, and by Theorem 2, we need not compute the information of some child nodes. First of all, the root node of the tree (Lr) contains child nodes such that each node contains a single frequent itemset. After that, procedure CAR-Miner is called with the parameter Lr to mine all CARs from the dataset D.
Input: A dataset D, minSup and minConf
Output: all CARs satisfying minSup and minConf
Procedure:
CAR-Miner(Lr, minSup, minConf)
1.  CARs = ∅;
2.  for all li ∈ Lr.children do
3.    ENUMERATE-CAR(li, minConf);
4.    Pi = ∅;
5.    for all lj ∈ Lr.children, with j > i do
6.      if li.att ≠ lj.att then // Using Theorem 1
7.        O.att = li.att ∪ lj.att; // Using a bitwise operation
8.        O.itemset = li.itemset ∪ lj.itemset;
9.        O.Obidset = li.Obidset ∩ lj.Obidset;
10.       if |O.Obidset| = |li.Obidset| then // Using Theorem 2
11.         O.count = li.count;
12.         O.pos = li.pos;
13.       else if |O.Obidset| = |lj.Obidset| then // Using Theorem 2
14.         O.count = lj.count;
15.         O.pos = lj.pos;
16.       else
17.         O.count[i] = |{x ∈ O.Obidset | class(x) = ci}|, ∀i ∈ [1, k];
18.         O.pos = argmax_{i∈[1,k]}{O.count[i]}; // k is the number of classes
19.       if O.count[O.pos] ≥ minSup then
20.         Pi = Pi ∪ {O};
21.   CAR-Miner(Pi, minSup, minConf);

ENUMERATE-CAR(l, minConf)
22. conf = l.count[l.pos] / |l.Obidset|;
23. if conf ≥ minConf then
24.   CARs = CARs ∪ {l.itemset → c_pos (l.count[l.pos], conf)};

Figure 1. The proposed algorithm for mining CARs
The CAR-Miner procedure (Figure 1) considers each node li together with every node lj that follows it in Lr, with j > i (Lines 2 and 5), to generate a candidate child node O. For each pair (li, lj), the algorithm checks whether li.att ≠ lj.att (Line 6, using Theorem 1). If the attributes differ, it computes the three elements att, itemset, and Obidset of the new node O (Lines 7-9). Line 10 checks whether the number of object identifiers of li is equal to that of O (by Theorem 2). If this is true, then by Theorem 2 the algorithm can copy all the information from node li to node O (Lines 11-12). Similarly, if the check in Line 10 fails, the algorithm compares lj with O, and if their numbers of object identifiers are the same (Line 13), it copies the information from node lj to node O (Lines 14-15). Otherwise, the algorithm computes O.count using O.Obidset, and then O.pos (Lines 17-18). After computing all the information of node O, the algorithm adds it to Pi (initialized as empty in Line 4) if O.count[O.pos] ≥ minSup (Lines 19-20). Finally, CAR-Miner is called recursively with the new set Pi as its input parameter (Line 21).

The procedure ENUMERATE-CAR(l, minConf) generates a rule from node l. It first computes the confidence of the rule (Line 22); if the confidence satisfies minConf (Line 23), the rule is added to the set of CARs (Line 24).
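The pseudocode of Figure 1 can be rendered as the following runnable Python sketch. This is our own transcription, not the authors' C# implementation; in particular, minsup is treated here as an absolute object count, while the paper states it as a percentage of |D|.

# A runnable Python sketch of CAR-Miner (Figure 1).
class Node:
    """MECR-tree node: att is an attribute bitmask, itemset a frozenset
    of (attribute, value) pairs, obidset the matching object identifiers."""
    def __init__(self, att, itemset, obidset, count, pos=None):
        self.att, self.itemset, self.obidset, self.count = att, itemset, obidset, count
        self.pos = pos if pos is not None else max(
            range(len(count)), key=count.__getitem__)

def car_miner(lr, minsup, minconf, cars, classes, class_of):
    """Mine all CARs reachable from the child nodes in lr."""
    for i, li in enumerate(lr):
        enumerate_car(li, minconf, cars, classes)             # Line 3
        p_i = []                                              # Line 4
        for lj in lr[i + 1:]:                                 # Line 5
            if li.att == lj.att:                              # Line 6, Theorem 1
                continue
            att = li.att | lj.att                             # Line 7, bitwise OR
            itemset = li.itemset | lj.itemset                 # Line 8
            obidset = li.obidset & lj.obidset                 # Line 9
            if len(obidset) == len(li.obidset):               # Line 10, Theorem 2
                count, pos = li.count, li.pos
            elif len(obidset) == len(lj.obidset):             # Line 13, Theorem 2
                count, pos = lj.count, lj.pos
            else:                                             # Lines 17-18
                count = [sum(1 for x in obidset if class_of[x] == c)
                         for c in classes]
                pos = None                                    # recomputed by Node
            o = Node(att, itemset, obidset, count, pos)
            if o.count[o.pos] >= minsup:                      # Line 19
                p_i.append(o)                                 # Line 20
        car_miner(p_i, minsup, minconf, cars, classes, class_of)  # Line 21

def enumerate_car(l, minconf, cars, classes):
    conf = l.count[l.pos] / len(l.obidset)                    # Line 22
    if conf >= minconf:                                       # Line 23
        cars.append((l.itemset, classes[l.pos],
                     l.count[l.pos], conf))                   # Line 24

# Usage on the Table 1 data (classes ordered (y, n); minSup = 1 object,
# i.e. 10% of 8 objects rounded up; minConf = 0.6):
A, B, C = 1, 2, 4
CLASS_OF = {1: "y", 2: "n", 3: "n", 4: "y", 5: "n", 6: "y", 7: "y", 8: "n"}
root = [
    Node(A, frozenset({("A", "a1")}), frozenset({1, 2, 7}), (2, 1)),
    Node(A, frozenset({("A", "a2")}), frozenset({3, 8}), (0, 2)),
    Node(A, frozenset({("A", "a3")}), frozenset({4, 5, 6}), (2, 1)),
    Node(B, frozenset({("B", "b1")}), frozenset({1, 5}), (1, 1)),
    Node(B, frozenset({("B", "b2")}), frozenset({2, 3, 8}), (0, 3)),
    Node(B, frozenset({("B", "b3")}), frozenset({4, 6, 7}), (3, 0)),
    Node(C, frozenset({("C", "c1")}), frozenset({1, 2, 3, 4, 6}), (3, 2)),
    Node(C, frozenset({("C", "c2")}), frozenset({5, 7, 8}), (1, 2)),
]
cars = []
car_miner(root, 1, 0.6, cars, ("y", "n"), CLASS_OF)
print(any(i == frozenset({("A", "a2")}) and c == "n" for i, c, _, _ in cars))  # True

The usage at the bottom rebuilds the root nodes of Table 1 and recovers, among others, the rule {(A, a2)} → n derived in the example of Section III.C below.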
C. An example

In this section, we use the example in Table 1 to describe the CAR-Miner process with minSup = 10% and minConf = 60%. The MECR-tree is built from the dataset in Table 1 as follows. First, the root node Lr contains all frequent 1-itemsets:

{ 1×a1(127):(2,1), 1×a2(38):(0,2), 1×a3(456):(2,1), 2×b1(15):(1,1), 2×b2(238):(0,3), 2×b3(467):(3,0), 4×c1(12346):(3,2), 4×c2(578):(1,2) }

After that, procedure CAR-Miner is called with the parameter Lr. We use node li = 1×a2(38):(0,2) as an example to illustrate the CAR-Miner process. li joins with all nodes following it in Lr:

• With node lj = 1×a3(456):(2,1): li and lj have the same attribute but different values, so by Theorem 1 nothing is generated from them.

• With node lj = 2×b1(15):(1,1): Because their attributes are different, three elements are computed: O.att = li.att ∪ lj.att = 1 | 2 = 3 (11 in binary), O.itemset = a2 ∪ b1 = a2b1, and O.Obidset = li.Obidset ∩ lj.Obidset = {3,8} ∩ {1,5} = ∅. Because O.count[O.pos] = 0 < minSup, O is not added to Pi.

• With node lj = 2×b2(238):(0,3): Because their attributes are different, three elements are computed: O.att = 1 | 2 = 3 (11 in binary), O.itemset = a2 ∪ b2 = a2b2, and O.Obidset = {3,8} ∩ {2,3,8} = {3,8}. Because |li.Obidset| = |O.Obidset|, the algorithm copies all the information of li to O by Theorem 2: O.count = li.count = (0,2) and O.pos = 2. Because O.count[O.pos] = 2 ≥ minSup, O is added to Pi ⇒ Pi = { 3×a2b2(38):(0,2) }.

• With node lj = 2×b3(467):(3,0): Because their attributes are different, three elements are computed: O.att = 1 | 2 = 3, O.itemset = a2 ∪ b3 = a2b3, and O.Obidset = {3,8} ∩ {4,6,7} = ∅. Because O.count[O.pos] = 0 < minSup, O is not added to Pi.

• With node lj = 4×c1(12346):(3,2): Because their attributes are different, three elements are computed: O.att = li.att ∪ lj.att = 1 | 4 = 5 (101 in binary), O.itemset = a2 ∪ c1 = a2c1, and O.Obidset = {3,8} ∩ {1,2,3,4,6} = {3}. The algorithm computes the additional information O.count = (0,1) and O.pos = 2. Because O.count[O.pos] = 1 ≥ minSup, O is added to Pi ⇒ Pi = { 3×a2b2(38):(0,2), 5×a2c1(3):(0,1) }.

• With node lj = 4×c2(578):(1,2): Because their attributes are different, three elements are computed: O.att = 1 | 4 = 5 (101 in binary), O.itemset = a2 ∪ c2 = a2c2, and O.Obidset = {3,8} ∩ {5,7,8} = {8}. The algorithm computes the additional information O.count = (0,1) and O.pos = 2. Because O.count[O.pos] = 1 ≥ minSup, O is added to Pi ⇒ Pi = { 3×a2b2(38):(0,2), 5×a2c1(3):(0,1), 5×a2c2(8):(0,1) }.

• After Pi is created, CAR-Miner is called recursively with parameters Pi, minSup, and minConf to create all child nodes of Pi.

Rules are generated in the same traversal of each node li (Line 3) by calling procedure ENUMERATE-CAR(li, minConf). For example, when traversing node li = 1×a2(38):(0,2), the procedure computes the confidence of the candidate rule: conf = li.count[li.pos] / |li.Obidset| = 2/2 = 1. Because conf ≥ minConf (60%), the rule {(A, a2)} → n (2, 1) is added to the rule set CARs. This rule means "If A = a2 then class = n" (support = 2 and confidence = 100%).
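The joins above can be checked mechanically. The following standalone Python sketch (the encoding is ours) reproduces the joins of li = 1×a2 with the five nodes of different attributes and prints their Obidsets and class counts:

# Reproduce the joins of li = 1 x a2 (Obidset {3,8}) from the walkthrough.
# Attribute bitmasks: A=1, B=2, C=4; counts are ordered (class y, class n).
li = (1, "a2", {3, 8})
followers = [
    (1, "a3", {4, 5, 6}), (2, "b1", {1, 5}), (2, "b2", {2, 3, 8}),
    (2, "b3", {4, 6, 7}), (4, "c1", {1, 2, 3, 4, 6}), (4, "c2", {5, 7, 8}),
]
CLASS_OF = {1: "y", 2: "n", 3: "n", 4: "y", 5: "n", 6: "y", 7: "y", 8: "n"}

for att, vals, obids in followers:
    if att == li[0]:                  # Theorem 1: same attribute, skip the join
        continue
    inter = li[2] & obids             # O.Obidset
    counts = (sum(CLASS_OF[x] == "y" for x in inter),
              sum(CLASS_OF[x] == "n" for x in inter))
    print(f"{li[0] | att} x {li[1]}{vals}: Obidset={sorted(inter)}, count={counts}")
# Output matches the text: a2b1 and a2b3 are empty; a2b2 -> {3,8} (0,2);
# a2c1 -> {3} (0,1); a2c2 -> {8} (0,1).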
IV. EXPERIMENTAL RESULTS

A. Characteristics of the experimental datasets

The algorithms used in the experiments were coded with C# 2008 on a personal computer with Windows 7, a Centrino 2×2.53 GHz processor, and 4 GB of RAM. The experiments were conducted on datasets obtained from the UCI Machine Learning Repository (http://mlearn.ics.uci.edu). Table 4 shows the characteristics of the experimental datasets.
Table 4. The characteristics of the experimental datasets
Dataset | #attrs | #classes | #distinct values | #Objs
The experimental datasets had different features. The Breast, German, and Vehicle datasets had many attributes and distinct values but relatively few objects (records). The Led7 dataset had only a few attributes, distinct values, and objects.
B. Numbers of rules for the experimental datasets

Table 5 shows the numbers of rules generated from the datasets in Table 4 for different minimum support thresholds. We used minConf = 50% for all experiments.
Table 5. Numbers of rules for different minSups
Dataset | minSup(%) | #rules
The results in Table 5 show that some datasets produced a large number of rules. For example, the Lymph dataset had 4,039,186 rules with minSup = 1%, the German dataset had 752,643 rules with minSup = 1%, etc.
C. Execution time

Experiments were then conducted to compare the execution times of CAR-Miner and ECR-CARM [16]. The results are shown in Table 6.

Table 6. The execution times (in seconds) for different minSups
Dataset | minSup(%) | ECR-CARM | CAR-Miner
The results in Table 6 show that CAR-Miner is more efficient than ECR-CARM in all of the experiments. For example, on the Breast dataset with minSup = 0.1%, the mining time of CAR-Miner was 1.517 seconds, while that of ECR-CARM was 17.136 seconds.
V. CONCLUSIONS AND FUTURE WORK

This paper proposed a new algorithm for mining CARs using a tree structure. Each node in the tree contains information for quickly computing the support of a candidate rule. In addition, using Obidsets, we were able to compute the supports of itemsets quickly. Some theorems were also developed; based on them, we did not need to compute the information of many nodes in the tree.

Mining itemsets from incremental databases has been developed in recent years [2-3, 5], and it saves a lot of time and memory when compared with re-mining the integrated database. Therefore, in the future, we will study how to apply this approach to mining CARs.
REFERENCES
1. G. Giuffrida, W.W. Chu, D.M. Hanssens, "Mining classification rules from datasets with large number of many-valued attributes", The 7th International Conference on Extending Database Technology: Advances in Database Technology (EDBT'00), Munich, Germany, 2000, 335-349.
2. T.P. Hong, C.Y. Wang, "An efficient and effective association-rule maintenance algorithm for record modification", Expert Systems with Applications, 37(1), 2010, 618-626.
3. T.P. Hong, C.W. Lin, Y.L. Wu, "Maintenance of fast updated frequent pattern trees for record deletion", Computational Statistics and Data Analysis, 53(7), 2009, 2485-2499.
4. W. Li, J. Han, J. Pei, "CMAR: Accurate and efficient classification based on multiple class-association rules", The 1st IEEE International Conference on Data Mining, San Jose, California, USA, 2001, 369-376.
5. C.W. Lin, T.P. Hong, W.H. Lu, "The Pre-FUFP algorithm for incremental mining", Expert Systems with Applications, 36(5), 2009, 9498-9505.
6. B. Liu, W. Hsu, Y. Ma, "Integrating classification and association rule mining", The 4th International Conference on Knowledge Discovery and Data Mining, New York, USA, 1998, 80-86.
7. B. Liu, Y. Ma, C.K. Wong, "Improving an association rule based classifier", The 4th European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France, 2000, 80-86.
8. L.T.T. Nguyen, B. Vo, T.P. Hong, H.C. Thanh, "Classification based on association rules: A lattice-based approach", Expert Systems with Applications, 39(13), 2012, 11357-11366.
9. J.R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1992.
10. F. Thabtah, P. Cowling, Y. Peng, "MMAC: A new multi-class, multi-label associative classification approach", The 4th IEEE International Conference on Data Mining, Brighton, UK, 2004, 217-224.
11. F. Thabtah, P. Cowling, Y. Peng, "MCAR: Multi-class classification based on association rule", The 3rd ACS/IEEE International Conference on Computer Systems and Applications, Tunis, Tunisia, 2005, 33-39.
12. R. Thonangi, V. Pudi, "ACME: An associative classifier based on maximum entropy principle", The 16th International Conference on Algorithmic Learning Theory, LNAI 3734, Singapore, 2005, 122-134.
13. M.R. Tolun, S.M. Abu-Soud, "ILA: An inductive learning algorithm for production rule discovery", Expert Systems with Applications, 14(3), 1998, 361-370.
14. M.R. Tolun, H. Sever, M. Uludag, S.M. Abu-Soud, "ILA-2: An inductive learning algorithm for knowledge discovery", Cybernetics and Systems, 30(7), 1999, 609-628.
15. A. Veloso, W. Meira Jr., M.J. Zaki, "Lazy associative classification", The 2006 IEEE International Conference on Data Mining (ICDM'06), Hong Kong, China, 2006, 645-654.
16. B. Vo, B. Le, "A novel classification algorithm based on association rules mining", The 2008 Pacific Rim Knowledge Acquisition Workshop (Held with PRICAI'08), LNAI 5465, Ha Noi, Viet Nam, 2008, 61-75.
17. X. Yin, J. Han, "CPAR: Classification based on predictive association rules", SIAM International Conference on Data Mining (SDM'03), San Francisco, CA, USA, 2003, 331-335.