A CLASSIFICATION ALGORITHM BASED ON ASSOCIATION RULE MINING


A Classification Algorithm Based on Association Rule Mining

Yang Junrui College of Computer Science and Technology

Xi’an University of Science and Technology

Xi’an, China E-mail: Yangjunrui66@sina.com

Xu Lisha College of Computer Science and Technology Xi’an University of Science and Technology

Xi’an, China E-mail: beizhan09@163.com

He Hongde College of Computer Science and Technology Xi’an University of Science and Technology

Xi’an, China E-mail: laohu0526@126.com

Abstract-Associative classification algorithms differ mainly in how they mine frequent item sets, analyze the exported rules, and use them for classification. This paper presents CARPT, an associative classification algorithm based on the Trie-tree. By adding the count of class labels, CARPT directly removes the frequent items that cannot generate frequent rules. It also compresses the storage of the database with a two-dimensional array in vertical data format, which significantly reduces the number of database scans and makes it convenient to count the support of candidate sets, so time and space are saved. The experimental results show that the algorithm is feasible and effective.

Keywords-data mining; associative classification; classification algorithm; Trie-tree

I INTRODUCTION

Classification rule mining and association rule mining are two important areas of data mining [1]. The classic associative classification algorithm CBA [2], which is based on class association rules and integrates these two mining techniques, was presented by Bing Liu of the National University of Singapore at the International Conference on Knowledge Discovery in Databases (KDD) held in New York in 1998. This work opened the field of associative classification. Over the past ten years, a number of studies and experiments have confirmed the good classification accuracy of associative classification algorithms.

CBA, the earliest associative classification algorithm, generates classification association rules with an iterative method similar to the Apriori algorithm [3]. To generate and test longer item sets, the database must be scanned many times, and the number of rules increases as well.

In [5], the CMAR algorithm was presented as an extension of CBA that uses a variant of FP-Growth [6]. CMAR finds frequent patterns and generates classification association rules simultaneously. It uses weighted χ² to measure the strength of the rules when classifying a new instance, which overcomes the bias of relying on a single rule. It also greatly improves efficiency by using the CR-tree, a highly compressed tree, to store, retrieve, and prune the classification rules. However, CMAR does not take full advantage of the characteristics of classification, and its FP-tree contains many redundant nodes.

The Trie, also known as a dictionary tree or word search tree, is a variant of the hash tree. A typical application of this structure is storing a large number of strings, although it is not limited to strings. The core idea of the Trie-tree is to use the common prefixes of the strings to reduce query time and so improve efficiency.

This paper presents CARPT, a classification algorithm based on a Trie-tree of association rules. CARPT effectively reduces the number of database scans during the association rule mining stage by changing the data storage structure and storage manner. By adding the count of class labels during the construction of the Trie-tree, it directly removes the frequent items that cannot generate frequent rules, so time and space are saved. On this basis, the algorithm also borrows the pruning idea of the CDDP algorithm [7] to further reduce the number of candidate frequent item sets.

II THEORIES AND DEFINITIONS

A Association Rules Mining

Association rules mining in a transaction database can be

2012 International Conference on Computer Science and Service System


(Item). The collection of items I is called the data set, or item set for short. An item set that contains k items is called a k-itemset.

Definition 1 An association rule expresses a relationship between item sets and is written X=>Y, where X and Y are item sets; X is called the rule antecedent and Y the rule consequent.

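Definition 2 The support of an item set $I_1$ in the transaction set D is defined as the percentage of transactions in D that contain $I_1$, namely $\mathrm{Support}(I_1) = |\{t \in D \mid I_1 \subseteq t\}| \,/\, |D|$.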

Definition 3 An association rule defined on I and D, namely $I_1 \Rightarrow I_2$, is required to meet a certain degree of confidence. The confidence of a rule is the ratio of the number of transactions that contain both $I_1$ and $I_2$ to the number of transactions that contain $I_1$, namely

$$\mathrm{Confidence}(I_1 \Rightarrow I_2) = \frac{\mathrm{Support}(I_1 \cup I_2)}{\mathrm{Support}(I_1)},$$

where $I_1, I_2 \subseteq I$ and $I_1 \cap I_2 = \emptyset$.
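The support and confidence definitions above can be illustrated with a minimal Python sketch. It uses the five transactions that the paper later gives as its training example (TABLE I); the function names `count`, `support`, and `confidence` are illustrative choices, not part of the paper.

```python
# Items of the five transactions in TABLE I (class labels are not needed here).
D = [
    {"a", "c", "f", "i"},
    {"a", "d", "f", "j"},
    {"b", "e", "g", "k"},
    {"a", "d", "h", "k"},
    {"a", "d", "f", "k"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support(I1 u I2) / Support(I1), computed from raw counts.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)


print(support({"a", "d"}, D))         # share of transactions with both a and d
print(confidence({"a"}, {"d"}, D))    # how often d accompanies a
```

Working from counts rather than from two separately rounded support values keeps the confidence ratio exact for small examples.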

Definition 4 The frequent item sets are all the item sets that satisfy the user-specified minimum support (Minsupport) for I and D, namely the non-empty subsets of I whose support is greater than or equal to Minsupport.

Theorem 1: For any given database D, let minsup stand for the minimum support and let i be a frequent item. If the rule R: i→c is not a frequent rule for any category label c, then no frequent rule in D includes the item i.

Proof: Assume that R′: I→c is an arbitrary frequent rule of D. If I contains only one item, then R′ is a single-item rule; since R: i→c is not frequent for any category label, R′ cannot include item i. If I is a collection of several items and item i is included in I, then R′: I→c must have a sub-rule R: i→c. Because R is a sub-rule of R′, R′.count ≤ R.count, namely R′.sup ≤ R.sup < minsup, which contradicts the assumption that R′ is a frequent rule. Hence R′ cannot include item i, and Theorem 1 is proved.

B Associative classification

m) is a value of the category attribute and is called a category label.

Definition 6 Associative classification is the method of using an association rule mining algorithm to mine association rules that take class labels as the rule consequent.

Associative classification is essentially classification based on association rules. It both reflects the application characteristics of knowledge (classification or prediction) and embodies the inherent associative characteristics of knowledge [8]. Associative classification in data mining is divided into the following four steps:

1) Attributes can be discrete or continuous; discretize continuous attribute values first.

2) Mine all possible rules (PR) that are frequent and accurate using a standard association rule mining algorithm, such as Apriori or FP-Growth. The frequent rule item sets that meet the minimum confidence constitute the set of Categorical Association Rules (CARs).

3) Construct a classifier based on the mined categorical association rules.

4) Use the classifier to classify data whose category is unknown.
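Steps 2 to 4 can be sketched in Python as follows. This is a brute-force illustration on the paper's TABLE I data, not the CARPT algorithm or CBA itself: the exhaustive enumeration, the ranking order, the `max_len` cap, and the names `mine_cars` and `classify` are all assumptions made for the example.

```python
from itertools import combinations

# TABLE I as (items, class label) pairs; attributes are already discrete (step 1).
D = [
    ({"a", "c", "f", "i"}, "A"),
    ({"a", "d", "f", "j"}, "B"),
    ({"b", "e", "g", "k"}, "A"),
    ({"a", "d", "h", "k"}, "C"),
    ({"a", "d", "f", "k"}, "C"),
]


def mine_cars(data, min_count, min_conf, max_len=3):
    """Step 2 (brute force): keep rules X => c that are frequent and accurate."""
    items = sorted(set().union(*(t for t, _ in data)))
    rules = []
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            x = set(combo)
            covered = [label for t, label in data if x <= t]
            for label in sorted(set(covered)):
                rule_count = covered.count(label)
                conf = rule_count / len(covered)
                if rule_count >= min_count and conf >= min_conf:
                    rules.append((combo, label, rule_count, conf))
    # Step 3 (assumed ranking): higher confidence, then higher count, then shorter rule.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules


def classify(rules, instance):
    """Step 4: return the label of the first ranked rule whose antecedent matches."""
    for antecedent, label, _, _ in rules:
        if set(antecedent) <= instance:
            return label
    return None  # no rule fires; a full classifier would fall back to a default class


rules = mine_cars(D, min_count=2, min_conf=0.6)
for antecedent, label, rule_count, conf in rules:
    print(antecedent, "=>", label, rule_count, round(conf, 2))
```

With a minimum support count of 2 and a minimum confidence of 60%, every surviving rule on this small dataset predicts class C, and none of them contains item f.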

C Trie-tree

Amir et al. used the Trie-tree to mine association rules in [9]. A Trie-tree can be defined as follows:

Let S = {s1, s2, …, sn} be a collection of strings defined on the character set Σ. Every non-terminal node except the root node is represented by a character of Σ, and each leaf node corresponds to a string that is exactly the concatenation of the characters on the path from the root to that leaf node.

In most references, the character in each node is called a bucket. For example, in the node ABCD, A, B, C, and D are each a bucket, and the path from the root node to each bucket stands for a frequent itemset.

Property 1 of Trie-tree: If a sub-tree takes a non-frequent bucket as its root node, then none of the buckets of the sub-tree is frequent.

Proof: By the Apriori property, any superset of a non-frequent item set is non-frequent. In the Trie-tree, the itemset represented by a bucket is exactly a superset of the itemset represented by its parent bucket on the same path.

The associative classification algorithm CARPT proposed in this paper is based on the Trie-tree. It reduces the number of database scans during the association rule mining stage by changing the data storage structure and storage manner, and it directly removes the frequent items that cannot generate frequent rules by adding the count of class labels during the construction of the Trie-tree, so as to improve the efficiency of the algorithm.

How is a Trie-tree constructed? First, find all the frequent 1-itemsets, which form the first layer of buckets in the Trie-tree, and arrange them in a fixed order. Let the set of frequent 1-itemsets be I = {i_1, i_2, …, i_n} with the order <i_1, i_2, …, i_n>, so that i_n is the rightmost bucket of the first layer of the Trie-tree. As Property 2 below shows, the frequent item i_n, being the last item in the order, has no child node.

Property 2 of Trie-tree: No frequent itemset that contains two or more items takes i_n as a prefix. More generally, when p > q, a frequent itemset that begins with i_p cannot contain i_q.

Following the order <i_1, i_2, …, i_n>, the construction of the Trie-tree can be described simply as follows:

Initialize the Trie-tree. The initial Trie-tree contains only one bucket, i_n, in the first layer.

Add the second first-layer bucket, i_{n-1}, to the Trie-tree and attach the sub-tree rooted at i_n beneath i_{n-1}; at the same time, prune every non-frequent bucket from the sub-tree rooted at i_{n-1}.

Similarly, add the third first-layer bucket, i_{n-2}, and so on, until the nth first-layer bucket, i_1, has been added. The result is a Trie-tree that contains all the frequent items.
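The structure that this construction produces can be sketched in Python. The sketch below is a simplification and not the authors' right-to-left procedure: it grows the tree depth-first in item order and applies Property 1 to drop a non-frequent bucket together with its sub-tree. The nested-dict representation and the names `build_trie` and `node_count` are assumptions for illustration; the data are the transactions of the paper's TABLE I example with a minimum support count of 2.

```python
# Items of the five transactions in TABLE I.
D = [
    {"a", "c", "f", "i"},
    {"a", "d", "f", "j"},
    {"b", "e", "g", "k"},
    {"a", "d", "h", "k"},
    {"a", "d", "f", "k"},
]
MIN_COUNT = 2


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def build_trie(order, transactions, min_count):
    """Return a nested dict: each key is a bucket, each value is its sub-tree."""

    def grow(prefix, start):
        node = {}
        for pos in range(start, len(order)):
            candidate = prefix + (order[pos],)
            # Property 1: a non-frequent bucket is dropped with its whole sub-tree.
            if count(set(candidate), transactions) >= min_count:
                node[order[pos]] = grow(candidate, pos + 1)
        return node

    return grow((), 0)


def node_count(trie):
    """Number of buckets in the tree (the root is not counted)."""
    return sum(1 + node_count(child) for child in trie.values())


trie = build_trie(["a", "d", "f", "k"], D, MIN_COUNT)
print(trie)
print(node_count(trie))
```

Because children are only ever drawn from items later in the order, the last item never has a child, which matches Property 2.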


III ALGORITHM CARPT

The data structures involved in CARPT have been described above. We now introduce the general process of the proposed algorithm, the classification algorithm based on a Trie-tree of associative rules (CARPT).

Preprocessing, discretization of continuous attributes, and determination of the frequent items should be completed before the algorithm starts.

The training dataset D in TABLE I is given as an example. Let the minimum support be 2 and the minimum confidence be 60%.

TABLE I the training dataset D

TID   Items        Class
1     a, c, f, i   A
2     a, d, f, j   B
3     b, e, g, k   A
4     a, d, h, k   C
5     a, d, f, k   C

Scan the database D once, count the support of each item, and obtain the frequent 1-itemset F = {a, d, f, k} that meets the minimum support threshold.

Database D can be described as a two-dimensional array, shown in TABLE II, in which the horizontal position represents the item and the category attribute and the vertical position represents the transaction number.
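The vertical layout can be sketched in Python as one 0/1 column per frequent item and per class label, with one entry per transaction. Counting the support of a candidate set then needs only a row-wise AND of the relevant columns and no further database scan. The column encoding and the names `to_vertical` and `support_count` are assumptions for illustration; the paper's exact array layout may differ.

```python
# TABLE I as (items, class label) pairs.
D = [
    ({"a", "c", "f", "i"}, "A"),
    ({"a", "d", "f", "j"}, "B"),
    ({"b", "e", "g", "k"}, "A"),
    ({"a", "d", "h", "k"}, "C"),
    ({"a", "d", "f", "k"}, "C"),
]


def to_vertical(data, items):
    """Build one 0/1 column per item and per class label (rows are transactions)."""
    item_cols = {i: [int(i in t) for t, _ in data] for i in items}
    labels = sorted({label for _, label in data})
    class_cols = {c: [int(label == c) for _, label in data] for c in labels}
    return item_cols, class_cols


def support_count(columns):
    """AND the given columns row by row and count the rows that survive."""
    return sum(all(bits) for bits in zip(*columns))


item_cols, class_cols = to_vertical(D, ["a", "d", "f", "k"])

# Support count of the candidate set {a, d}.
print(support_count([item_cols["a"], item_cols["d"]]))
# Support count of the rule {a, d} -> C.
print(support_count([item_cols["a"], item_cols["d"], class_cols["C"]]))
```

Storing each column as a machine word or bit array instead of a Python list would make the AND step cheaper; lists are used here only to keep the sketch readable.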

TABLE II vertical bitmap of two-dimensional array for database D

TID   a   d   f   k   Class
1     1   0   1   0   A
2     1   1   1   0   B
3     0   0   0   1   A
4     1   1   0   1   C
5     1   1   1   1   C

According to the definition and the construction method of the Trie-tree, and borrowing the pruning idea of the CDDP algorithm, we can obtain the Trie-tree shown in Figure 1 from TABLE II.

Figure 1 Trie-tree

The next task is to extract from the Trie-tree the association rules that meet the given minimum confidence and take category labels as the rule consequent.

In fact, although f is in the frequent itemset F, careful study shows that no rule containing f appears among the associative classification rules finally mined; that is, none of the rules containing f meets the minimum confidence. So item f can be removed directly.

According to Theorem 1, the frequent item f, which cannot generate frequent rules, is removed directly when the database is transformed into the vertical bitmap of the two-dimensional array. This improves the construction of the Trie-tree and reduces the number of its nodes. Scan the database D and record the support of each item together with the support of each category label that the item occurs with. The results are as follows: a:4(A:1,B:1,C:2); d:3(A:0,B:1,C:2); f:3(A:1,B:1,C:1); k:3(A:1,B:0,C:2). Keeping the items whose support and at least one corresponding category label support both reach the given minimum support, we get F = {a, d, k}. The resulting improved vertical bitmap of the two-dimensional array is shown in TABLE III.
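The class-label counting step can be sketched in Python on the TABLE I data. The sketch records, for every item, how often it occurs with each category label and keeps an item only if its total count and at least one per-label count reach the minimum support. The function name `class_aware_frequent_items` and the use of `collections.Counter` are illustrative assumptions.

```python
from collections import Counter, defaultdict

# TABLE I as (items, class label) pairs.
D = [
    ({"a", "c", "f", "i"}, "A"),
    ({"a", "d", "f", "j"}, "B"),
    ({"b", "e", "g", "k"}, "A"),
    ({"a", "d", "h", "k"}, "C"),
    ({"a", "d", "f", "k"}, "C"),
]


def class_aware_frequent_items(data, min_count):
    """Return the kept items and the per-item, per-label occurrence counts."""
    per_item = defaultdict(Counter)
    for items, label in data:
        for item in items:
            per_item[item][label] += 1
    kept = []
    for item in sorted(per_item):
        total = sum(per_item[item].values())
        best = max(per_item[item].values())
        # Theorem 1: if no rule item -> c is frequent, the item can be discarded.
        if total >= min_count and best >= min_count:
            kept.append(item)
    return kept, per_item


kept, per_item = class_aware_frequent_items(D, min_count=2)
print(kept)
print(dict(per_item["f"]))
```

Item f is frequent on its own (count 3) but is dropped because it occurs only once with each of the three labels.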

TABLE III The improved vertical bitmap of the two-dimensional array

TID   a   d   k   Class
1     1   0   0   A
2     1   1   0   B
3     0   0   1   A
4     1   1   1   C
5     1   1   1   C

Reconstructing the Trie-tree gives the tree shown in Figure 2.

Figure 2 Trie-tree after improvement

Comparing Figure 1 and Figure 2, the Trie-tree in Figure 1 has 11 nodes, while the one in Figure 2 has only 7. The number of nodes of the Trie-tree is therefore reduced after adding the count of category labels; storage space is saved and the Trie-tree is generated more efficiently.
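The two node counts can be checked with a short Python sketch, under the assumption that each bucket of the Trie-tree corresponds to exactly one frequent itemset over the retained items (the root is not counted). The helper name `frequent_itemsets` is illustrative, and the enumeration is brute force rather than the paper's construction.

```python
from itertools import combinations

# Items of the five transactions in TABLE I.
D = [
    {"a", "c", "f", "i"},
    {"a", "d", "f", "j"},
    {"b", "e", "g", "k"},
    {"a", "d", "h", "k"},
    {"a", "d", "f", "k"},
]


def frequent_itemsets(items, transactions, min_count):
    """Enumerate every non-empty subset of `items` that reaches `min_count`."""
    found = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            hits = sum(1 for t in transactions if set(combo) <= t)
            if hits >= min_count:
                found.append(combo)
    return found


nodes_before = len(frequent_itemsets(["a", "d", "f", "k"], D, 2))
nodes_after = len(frequent_itemsets(["a", "d", "k"], D, 2))
print(nodes_before, nodes_after)
```

On this example the four buckets that disappear are exactly the frequent itemsets that contain f: f, af, df, and adf.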

In addition, according to Theorem 1, the pruned Trie-tree contains only the items that can generate frequent rules. Therefore, if a category label is itself non-frequent, it will not be included in the tree, because it cannot generate frequent rules. This case is not common and has little effect on classification accuracy, so it can be ignored.

IV ALGORITHM TESTING AND ANALYSIS

To test the performance of CARPT, we compared it with CBA and CMAR. The experiments use 6 datasets from the UCI machine learning repository [10]. Before the experiments, we discretized the datasets using Weka.

Let minsupport = 1%, support error threshold = 0.01, minconfidence = 50%, and confidence error threshold = 20%. The classification accuracy results are shown in Figure 3.


Figure 3 comparison of classification accuracy (legend: CBA, CMAR, CARPT; horizontal axis: Datasets)

In addition, we tested the memory usage of CMAR and CARPT; the results are shown in Figure 4. The datasets are arranged from left to right according to their size.

Figure 4 comparison of memory usage

The results above show that, after adding the count of category labels and using the Trie-tree storage structure and the dynamic pruning strategy, both the classification accuracy and the efficiency of the CARPT algorithm are improved. Compared with CMAR, CARPT effectively reduces memory usage, and the effect is more significant on large data sets.

V CONCLUSIONS

This paper presents CARPT, a classification algorithm of associative rules based on the Trie-tree. The algorithm adds the support count of the category labels to remove the frequent items that cannot generate frequent rules, stores the database compactly in a two-dimensional array of vertical data format to reduce the number of database scans, and adds a pruning strategy to the construction of the Trie-tree. Together these measures save time and space. The experimental results show that the algorithm is feasible and effective.

REFERENCES

[1] Fan M, Meng X. Data Mining Concepts and Techniques [M]. Beijing: Mechanical Industry Press, 2001.

[2] Liu B, Hsu W, Ma Y. Integrating classification and association rule mining [C]. Proc of the KDD. New York, 1998: 80-86.

[3] Agrawal R, Srikant R. Fast algorithms for mining association rules [A]. In VLDB’94 [C]. Santiago, Chile, Sept 1994, pp 487-499.

[4] Hu W, Li M. Classification algorithm of associative based on the importance of attribute [J]. Computer Engineering and Design, 2008.5.

[5] Li W, Han J, Pei J. CMAR: Accurate and efficient classification based on multiple class-association rules [A]. In ICDM’01 [C]. San Jose, CA, 2001, pp 369-376.

[6] Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation [A]. In SIGMOD’00 [C]. Dallas, TX, May 2000: 1-12.

[7] Qin Ch. Association rules mining algorithm based on Trie. Journal of Beijing Jiao tong University, June 2011.

[8] Zhang J. Associated with the classification algorithm and its system implementation. Journal of Nanjing Normal University, 2008.

[9] Amir A, Feldman R, Kashi R. A New and Versatile Method for Association Generation. In: Principles of Data Mining and Knowledge Discovery, Proceedings of the First European Symposium (PKDD’97). Trondheim, Norway, 1997, pp 221-231.

[10] Merz C J, Murphy P. UCI repository of machine learning database [EB/OL]. http://www.cs.uci.edu/mlearn/MLRepository.html, 1996.

[Figure axis residue: horizontal axis Datasets (Auto, Iono, Sonar, Vehicle, Sick, Hypo); vertical scale 70 to 110]
