A Classification Algorithm Based on Association Rule Mining Yang Junrui College of Computer Science and Technology Xi’an University of Science and Technology Xi’an, China E-mail: Yangjun
Trang 1A Classification Algorithm Based on Association Rule Mining
Yang Junrui College of Computer Science and Technology
Xi’an University of Science and Technology
Xi’an, China E-mail: Yangjunrui66@sina.com
Xu Lisha College of Computer Science and Technology Xi’an University of Science and Technology
Xi’an, China E-mail: beizhan09@163.com
He Hongde College of Computer Science and Technology Xi’an University of Science and Technology
Xi’an, China E-mail: laohu0526@126.com
Abstract—The main difference of the associative classification
algorithms is how to mine frequent item setsǃ analyze the
rules exported and use for classification This paper presents
an associative classification algorithm based on Trie-tree that
named CARPT, which remove the frequent items that cannot
generate frequent rules directly by adding the count of class
labels And we compress the storage of database using the
two-dimensional array of vertical data format, reduce the number
of scanning the database significantly, at the same time, it is
convenient to count the support of candidate sets So, time and
space can be saved effectively The experiment results show
that the algorithm is feasible and effective
Keywords-data mining; associative classification;
classification algorithm; Trie-tree
I INTRODUCTION The classification rules mining and association rules
mining are two important areas of data mining [1] The classic
associative classification algorithm based on class
association rules named CBA[2] which integrated the above
two important mining technologies was presented by Bing
Liu of National University of Singapore in the knowledge
discovery in databases (KDD) International Conference held
in New York, 1998 Since then the prelude of the associative
classification was opened Good classification accuracy of
associative classification algorithm has been confirmed in
the past ten years through a number of studies and
experiments
The earliest Associative classification algorithm CBA
generates classification association rules using the iterative
method which similar to the Apriori [3] algorithm In order to
generate and test longer item sets, the database needs to be
scanned many times, so the number of rules increases
In [5], CMAR algorithm was presented based on the CBA algorithm, using the deformation method of FP-Growth
[6] The CMAR algorithm finds frequent patterns and generates classification association rules simultaneously, uses χ 2 weights to measure the strength of the rules and then classify a new instance, overcomes the bias of using a single rule It greatly improves the efficiency of the algorithm using CR-tree, a high degree compression tree, to store, back, pruning the classification rules While the CMAR algorithm does not take full advantage of the characteristics of classification, there are many redundant nodes in the FP-tree
Trie, also known as dictionary tree or word search tree, is
a variant of the hash tree The typical applications of this tree structure is used to store a large number of strings (but not limited to) The core idea of Trie-tree is using the common prefix of the string to reduce the cost of the query time to improve efficiency
This paper presents a classification algorithm based on the trie-tree of association rules that named CARPT, which effectively reduces the number of scanning during the stage
of association rule mining by changing the data storage structure and storage manner It removes the frequent items that cannot generate frequent rules directly by adding the count of class labels during the construction of Trie-tree, so, time and space can be saved effectively Based on this, the algorithm it also draws the pruning idea of CDDP [7]
algorithm, reduces the number of candidate frequent item sets once again
II THEORIES AND DEFINITIONS
A Association Rules Mining
Association rules mining in a transaction database can be
2012 International Conference on Computer Science and Service System
Ketnooi.com share
Trang 2(Item) The collection of items I called data set, named item
sets for short, an item set contains k items is named k-itemset
Definition 1 Association rules is manifested as the
relationship between the item sets, we express it as: X=>Y,
where X and Y are item sets, X called rule antecedent, Y,
called rule consequent
set D is defined as the percentage of a transaction with I1 in
D, namely Support˄I1˅={t ∈ D I1⊆ t } / D
Definition 3 Association rules that defined in I and D,
namely I1=>I2, are given to meet a certain degree of
confidence
Confidence of rules refers to the ratio of the number of
transactions containing I1 and I2 and transactions containing
I1, namely Confidence (I1=>I2) =
䯵 䯴
䯵 䯴
1
2 1
I Support
I I Support *
, where I1, I2⊆I, I1I2=∅
Definition 4 Frequent Itemsets is defined as all the item
sets that satisfy user-specified minimum support
(Minsupport) in T for I and D, namely the non-empty subset
of I that greater than or equal to Minsupport
Theorem 1: For any given database D, let minsup stands
for the minimum support, I is a frequent item If rule R: iėc
is not frequent rule to all of the category labels, then all the
frequent rules in D do not include the frequent item i
Proof: Assume that R˃: Iėc is an any frequent rule of
DB If I contain only one item, then R˃is a single item rule
For rule R: iėc do not frequent to all of the category labels,
R˃can not include item i If I was a conclusion of several
items and item i was included in I, then for R˃: Iėc, there
must be a sub-rule R: iėc For R is sub-rule of R˃, so
R˃.countİR.count, namely R˃.supİR.supİminisup, it
is a contradiction of ĀR˃is a frequent ruleā, we can see
that rule R˃can not include item i Therefore, Theorem 1 is
proved to be established
B Associative classification
m˅is the value of the category attribute, it named Category
labels
Definition 6 Method of mining association rules with
class labels as rule consequent using association rule mining
algorithm is known as associative classification
Associative classification is essentially classification that
based on association rules, which both reflects the
application characteristics of knowledge-classification or
prediction and embodies the inherent associated
characteristics of knowledge.[8] Associative classification in
data mining is divided into the following four steps
1) Attributes can be discrete and also can be
continuous, for a continuous attribute value, discrete
it fist
2) Mine all possible rules(PR) that frequent and
accurate using a standard association rule mining
algorithm, such as Apriori and FP-Growth, these
frequent rule item sets that meet the minimum confidence constitute the set of Categorical Association Rules(CARs)
3) Construct a classifier base on the categorical association rules that mined
4) Classify the category unknown data using the classifier
C Trie-tree
Amir used tree to mine association rules in [9] Trie-tree can be defined as following:
Set S= {s1,s2,…,sn} as a collection of strings that defined
on the set of characters Ȉ, All of the non-terminal nodes except the root node are represented by a character of Ȉ, each leaf node corresponds to a string which happens to be character connection of the path that from the root to the leaf node
In most reference, the character in each node is known as buckets, such as node ABCD, A is a bucket, B, C, D are each a bucket, the path from each bucket to the root node stands for a frequent itemset
Property 1 of Trie-tree: If a sub-tree takes a
non-frequent bucket for root node, then all the buckets of the sub-tree are not frequent
Proof: According to the property of Apriori, Superset of
any non-frequent item sets is non-frequent Itemset that represented by a bucket is right the superset of itemset that represented by the parents bucket of the path in Trie-tree Associative classification algorithm CARPT proposed in this paper is based on Trie-tree, it reduces the number of scanning during the stage of association rule mining by changing the data storage structure and storage manner and removes the frequent items that cannot generate frequent rules directly by adding the count of class labels during the construction of Trie-tree, so as to achieve the purpose of improving the efficiency of the algorithm
So, how to construct a Trie-tree? First of all, find all the frequent 1-itemset as the first layer of buckets in Trie-tree, and then arrange them according to a certain order Let the set of frequent 1-itemset I = {i1,i2,…,in} and its order is
<i1,i2,…,in>ˈ so in is the rightmost bucket of the first layer
of Trie-tree, and we can see from the property 2 flowing, frequent item in, as the last item of the sort has no child node
itemset which contains two or more items take in for a prefix When p>q, frequent item ip cannot take iq for a prefix
According to <i1,i2,…,in>, the construct process of Trie-tree can be simply described as follows:
Initialize the Trie-tree The initial Trie-tree contains only one bucket in of the first layer;
Add the second bucket in-1 of the first layer to the Trie-tree, add the sub-tree that take in for root node after in-1, at the same time, cut off all the non-frequent non-empty subset of the sub-tree that take in-1 for root node
Similarly, add the third bucket in-1 of the first layer to the Trie-tree, ĂĂˈ until the nth bucket i1 of the first layer is added, a Trie-tree contains all the frequent items is constructed
Trang 3III ALGORITHM CARPT
A description of the data structure involved in CARPT
has been given above, we will now introduce the general
process of the algorithm that proposed in this paper that
named classification algorithm based on Trie-tree of
associative rules (CARPT)
Preprocessing, discretization of continuous attributes and
determine of the frequent items should be completed before
the commencement of the algorithm
The training dataset D in TABLE I is given as an
example, let minimum support=2, and minimum
confidence=60%
TABLE I the training dataset D
1
2
3
4
5
a, c, f, i
a, d, f, j
b, e, g, k
a, d, h, k
a, d, f, k
A
B
A
C
C Scan the database D once, count the support for each
item, get the frequent 1-itemset F= {a, d, f, k} that meets the
minimum support threshold
Database D can be described as a two-dimensional array
shown in TABLE II, in which the horizontal position said the
item number and types of properties, the vertical position
said the transaction number
TABLE II vertical bitmap of two-dimensional array for
database D
1
2
3
4
5
1
1
0
1
1
0
1
0
1
1
1
1
0
0
1
0
0
1
1
1
1
0
1
0
0
0
1
0
0
0
0
0
0
1
1 According to the definition and the construction method
of Trie-tree, drawing the pruning ideology of the CDDP
algorithm, we can obtain the Trie-tree shown in Figure
1from TABLE II
Figure 1 Trie-tree The next work is to export those association rules that
meet the given minimum confidence and take category labels
for rule consequent from Trie-tree
Actually, we can see that f is in the frequent itemset F,
but there is any rule contain f appears in the associative
classification rules which we finally mine after carefully
study, that is all the rules contain f do not meet the minimum
confidence So, item f can be removed directly
According to Theorem 1, remove the frequent item f that
can not generate frequent rules directly when transform the
database into vertical bitmap of two-dimensional array to
improve the achievement of Trie-tree and reduce the number
of its nodes Scan the database D, record the support of each item and the support of category label that items correspond, results are as follows: a:4(A:1,B:1,C:2); d:3(A:0,B:1,C:2); f:3(A:1,B:1,C:1); k:3(A:1,B:0,C:2),find out the items whose support and the corresponding category label support are both greater than given minimum support, then we get F={a,
d, k} The resulting improved vertical bitmap of the two-dimensional array as shown in TABLE III
TABLE III The improved vertical bitmap of the
two-dimensional array
1
2
3
4
5
1
1
0
1
1
0
1
0
1
1
0
0
1
1
1
1
0
1
0
0
0
1
0
0
0
0
0
0
1
1 Reconstruct the Trie-tree, shown in Figure 2
Figure 2 Trie-tree after improved Contrast Figure 1 and Figure 2, the Trie-tree in Figure 1 has 11nodes, while the one in Figure 2 has only 7 It can be seen that the number of nodes of Trie-tree is reduced after adding the count of category labels; the storage space is effectively saved and the generation efficiency of Trie-tree is improved
In addition, according to Theorem 1, the cropped Trie-tree contains only the items that can generate frequent rules Therefore, if the category labels non-frequent itself, it will not be included in the tree for it can not generate frequent rules This case does not have universal significance, not much affect the classification accuracy, so it can be ignored
IV ALGORITHM TESTING AND ANALYSIS
In order to test the performance of CARPT, we compared
it with CBA and CMAR The experimental datasets using 6 datasets that come from the UCI machine learning database
[10], before the experiment started, we discretized the datasets
by weak
Let minsupport=1%, support error threshold=0.01, minconfidence=50%, confidence error threshold=20% Test the accuracy of the algorithm, the results are shown
in Figure 3
Trang 4
'DWDVHWV
&%$
&0$5
&$537
Figure 3 comparison of classification accuracy
In addition, we tested memory usage of algorithm
CMAR and CARPT, the results are shown in Figure 4
Datasets arrange from left to right according to their size
Figure 4 comparison of memory usage
We can easily find from the results above that the
classification accuracy of the CARPT algorithm is improved
and the efficiency of the algorithm is also improved after
adding the count of category labels, using the Trie-tree
storage structure and dynamic pruning strategy Compared
with CMAR, CARPT can effectively reduce the memory
usage and the effect of large data sets is relatively significant
V CONCLUSIONS This paper presents a classification algorithm of
associative rules based on Trie-tree that named CARPT The
algorithm removes the frequent items that cannot generate
frequent rules to improve efficiency of the algorithm by add the support count of the category labels; reduce the number
of database scanning using two-dimensional array of vertical data format to compressed database storage and add pruning strategy to the construction process of Trie-tree, all of these can save time and space effectively The experimental results show that the algorithm is feasible and effective
REFERENCES [1] Fan M, Meng X Data Mining Concepts and Techniques [M] Beijing: Mechanical Industry Press, 2001
[2] Liu B, Hsu W, Ma Y Integrating classification and association rule mining [C] Proc of the KDD New York, 1998: 80-86
[3] Agrawal R, Srikant R Fast algorithms for mining association rules [A] In VLDB’94[C] Santiago, Chile, Sept 1994, pp 487-499
[4] Hu W, Li M Classification algorithm of associative based on the importance of attribute [J] Computer Engineering and Design, 2008.5
[5] Li W, Han J, Pei J CMAR: Accurate and efficient classification based on multiple class-association rules [A] In ICDM’01[C] San Jose, CA, 2001, pp 369-376
[6] Han J, Pei J, Yin Y Mining frequent patterns without candidate generation[A] In SIGMOD’00[C] Dallas, TX, May 2000.1-12
[7] Qin Ch Association rules mining algorithm based on Trie, Journal of Beijing Jiao tong University, June 2011
[8] Zhang J Associated with the classification algorithm and its system implementation, Journal of Nanjing Normal University, 2008
[9] Amir, A., Feldman, R and Kashi, R (1997).A New and Versati1e Method for Association Generation In: Principles
of Data Mining and Knowledge Discovery, Proceedings of the First European Symposium(PKDD’97) Trondheim Norway, 1997, pp 221-231
[10] Merz C J, Murphy P UCI repository of machine learning database[EB/OL].http://www.cs.uci.edu/mlearn/MLRepositor y.html,1996
70
80
90
100
110
Auto Iono Sonar Vehicle Sick Hypo
Datasets