MINING ASSOCIATION RULES WITH ADJUSTABLE INTERESTINGNESS
BY NGUYEN THANH TRUNG
SUPERVISED BY
DR HA QUANG THUY
A THESIS SUBMITTED FOR THE DEGREE OF BACHELOR OF SCIENCE
AT THE FACULTY OF TECHNOLOGY VIETNAM NATIONAL UNIVERSITY, HANOI
ACKNOWLEDGEMENTS
This thesis for a bachelor's degree has been accomplished over three months. During this time, many people have made substantial contributions in one way or another, whom I would like to mention herein.

First and foremost, I would especially like to thank my research advisor, Dr. Ha Quang Thuy, for his invaluable guidance and the tremendous motivation he provided at every step of this work. His enthusiastic support and untiring interest in the subject are deeply appreciated. I have gained immensely from his deep technical insight and thoroughness in problem solving.
Some portions of this thesis have been previously published in the Conference of Junior Scientists 2002 of Vietnam National University, Hanoi, and I owe thanks to
Dr. Do Van Thanh, M.Sc. Pham Tho Hoan, and B.Sc. Phan Xuan Hieu for their valuable contributions as the co-authors of that paper.

My thanks also go to all of my lecturers at the Faculty of Technology of Vietnam National University, Hanoi, who provided me with indispensable scientific knowledge throughout four school years. Special thanks to the following individuals, and many others who are not mentioned by name, for their teaching: M.Sc. Le Quang Hieu, M.Sc. Nguyen Quang Vinh, M.Sc. Nguyen Dinh Viet, M.Sc. Pham Hong Thai, Dr. Nguyen Tue, M.Sc. Nguyen Nam Hai, M.Sc. Dao Kien Quoc, M.Sc. Le Anh Cuong, Assoc. Prof. Trinh Nhat Tien, Dr. Dinh Manh Tuong, M.Sc. Vu Ba Duy, Assoc. Prof. Nguyen Quoc Toan, M.Sc. Ngo Le Minh, Assoc. Prof. Ngo Quoc Tao. Without the knowledge they equipped me with, my thesis would never have taken shape.

I am particularly grateful to my family for providing me with a source of strength and encouragement, giving me the best possible education, and instilling in me a thirst for learning.

Last but not least, I thank my girlfriend Nguyen Thi Thu Thuy, who sacrificed time and energy so that this work could be completed. I appreciate it, and hope that the effort proves worthwhile.
ABSTRACT
Over the last several years, the problem of efficiently generating large numbers of association rules has been an active research topic in the data mining community. Many different algorithms have been developed with promising results. There are two current approaches to the association rule mining problem. The first is to mine the frequent itemsets regardless of their coefficients. The second is to assign weights to the items to reflect their importance to the users. However, both rely on the use of a minimum support threshold, which can be confusing. In practice, we may want to mine the best rules to our knowledge rather than those which satisfy a certain threshold, especially if this threshold is an equation. To overcome this problem, we introduce the concept of adjustable interestingness and propose a novel approach to mining association rules based on it. Our algorithm works only with the most interesting rules, thus significantly reducing the search space by skipping many uninteresting itemsets and pruning, at an early stage, those that cannot generate interesting itemsets. Therefore, the total time needed for the mining is substantially decreased.
TABLE OF CONTENTS
Acknowledgements
Abstract
Table of contents
List of tables and figures
CHAPTER 1: Introduction
1.1 What is data mining?
1.2 Data mining versus query tools
1.3 Mining association rules
1.4 Outline of the thesis
CHAPTER 2: Mining association rules with weighted items
2.1 Introduction
2.2 Problem definition
CHAPTER 3: Mining association rules with adjustable interestingness
3.1 Interestingness and interesting itemsets
3.2 Interestingness constraints
3.3 Motivation behind interesting itemsets and adjustable interestingness
CHAPTER 4: Algorithm for mining association rules with adjustable interestingness (MARAI)
4.1 Motivation
4.2 Preliminaries
4.3 Basic properties of itemset-tidset pairs
4.4 MARAI: Algorithm design and implementation
4.5 Experimental evaluation
CHAPTER 5: Conclusion
References
Appendix
LIST OF TABLES AND FIGURES

Table 1. Database of a stationery store
Table 2. Transactions of a stationery store
Table 3. Itemsets sorted into descending order of their interestingness
Table 4. Itemsets sorted into descending order of the interestingness
Table 5. All interesting itemsets
Table 6. Database characteristics
Figure 1. Example database and frequent itemsets
Figure 2. Example database
Figure 3. The MARAI algorithm
Figure 4. Search process using adjustable interestingness
Figure 5. Performance of the MARAI algorithm on Cosmetic
Figure 6. Performance of the MARAI algorithm on Census
CHAPTER 1
INTRODUCTION
In this chapter, we introduce the concept of data mining and explain why it is regarded as such an important development. We also present the background of mining association rules.

1.1 What is data mining?
There is confusion about the exact meaning of the terms 'data mining' and 'knowledge discovery in databases (KDD)'. At the first international KDD conference in Montreal in 1995, it was proposed that the term 'KDD' be used to describe the whole process of extraction of knowledge from data. An official definition of KDD is: 'the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data' [2]. The knowledge which is discovered must be new, not obvious, and humans must be able to use it for a particular purpose. It was also proposed that the term 'data mining' be used exclusively for the discovery stage of the KDD process. The whole KDD process includes selection, preprocessing, transformation, data mining, and interpretation or evaluation. Data mining has received the most attention, as it is the most significant and most time-consuming of the KDD steps.
The sudden rise of interest in data mining can partly be explained by the following factors [2]:
1. In the 1980s, all major organizations built infrastructural databases containing data about their clients, competitors, and products. These databases form a potential gold-mine; they contain gigabytes of data with much 'hidden' information that standard query tools cannot easily uncover. Data mining algorithms can find interesting regularities in databases, whereas SQL is just a query language; it only helps to find data under constraints of what we already know.
2. As the use of networks continues to grow, it will become increasingly easy to connect databases. Thus, connecting a client's file to a file with demographic data may lead to unexpected views on the spending patterns of certain population groups.
3. Over the past few years, machine-learning techniques have expanded enormously. Neural networks, genetic algorithms and other simple, generally applicable learning techniques often make it easier to find interesting connections in databases.

4. The client/server revolution gives the individual knowledge worker access to central information systems, from a terminal on his or her desk.

1.2 Data mining versus query tools
What is the difference between data mining and a normal query environment? What can a data mining tool do that SQL cannot?
It is significant to realize that data mining tools are complementary to query tools. A data mining tool does not replace a query tool but gives a lot of additional possibilities [2]. Suppose that we have a large file containing millions of records that describe customers' purchases in a supermarket. There is a wealth of potentially useful knowledge which can be found by triggering normal queries, such as 'Who bought butter and bread last week?', 'Is the profit of this month more than that of last month?' and so on. There is, however, knowledge hidden in the databases that is much harder to find using SQL. Examples would be the answers to questions such as 'What products were often purchased together?', or 'What are the subsequent purchases after buying a gas cooker?'. Of course, these questions could be answered using SQL, but proceeding in such a way could take days or months, whereas a data mining tool could find the answers in a much shorter time, sometimes even in minutes or a couple of hours. It is said that if we know exactly what we are looking for, use SQL; but if we know only vaguely what we are looking for, turn to data mining.
1.3 Mining association rules
There are various kinds of methods to mine information from a database, such as mining association rules, multi-level data generalization and summarization, classification, and clustering [4]. The most common type is mining association rules.
The problem of mining association rules in databases was first introduced in 1993 by Agrawal [1]. An example of such a rule might be that "90% of customers who purchase bread and butter also purchase milk and coffee". Since its introduction, Association Rules Mining (ARM) [1] has become one of the core data mining tasks. ARM is an undirected or unsupervised data mining technique which works on massive data, and it produces clear and understandable results. ARM is aimed at finding regularities in data.

The following is a formal statement of the problem [1]: Let I = {i1, i2, ..., im} be a set of literals, called items. A set of items is also called an itemset. An itemset with k items is called a k-itemset. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its TID. We say that a transaction T contains X, a set of some items in I, if X ⊆ T. The support of an itemset X, denoted σ(X, D), is the number of transactions in D in which it occurs as a subset. An itemset is frequent or large if its support is more than a user-specified minimum support (min_sup) value.
An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. X is called the antecedent of the rule, and Y is called the consequent of the rule. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. The support of the rule X ⇒ Y is the probability that X ∪ Y occurs in the set of transactions in the database D; it is denoted by support(X ∪ Y). The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The confidence of the association rule X ⇒ Y is the probability that a transaction contains Y given that the transaction contains X, or it may be given mathematically as support(X ∪ Y)/support(X).

Example 1.1. Consider a set of items I = {A, B, C, D, E, F}. Let D be a set of four transactions as follows:

[Table of four transactions (TID, items bought) not recovered in this copy]

confidence = support({A} ∪ {C}) / support({A}) = 66%
The problem of discovering all association rules can be decomposed into two subproblems [1]:

1. Find all sets of items (itemsets) that have transaction support above minimum support. The support for an itemset is the number of transactions that contain the itemset. Recall that an itemset is frequent or large if its support is more than a user-specified minimum support (min_sup) value.

Min support = 50%
Min confidence = 50%
Example 1.2. From the above database, we obtain four frequent itemsets {A}, {B}, {C} and {A, C}, with supports of 75%, 50%, 50% and 50% respectively.
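The support computation of section 1.3 and the frequent-itemset step of example 1.2 can be sketched in Python. The four transactions below are illustrative stand-ins (the original transaction table did not survive in this copy); they are chosen so that the frequent itemsets at min_sup = 50% match example 1.2:

```python
from itertools import combinations

# Illustrative database: four transactions chosen so that, at
# min_sup = 50%, the frequent itemsets are exactly {A}, {B}, {C}, {A, C}.
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "E"}, {"B", "D"}]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_sup):
    """Naive enumeration of all itemsets with support >= min_sup."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(candidate, transactions)
            if s >= min_sup:
                result[candidate] = s
    return result

print(frequent_itemsets(D, 0.5))
# -> {('A',): 0.75, ('B',): 0.5, ('C',): 0.5, ('A', 'C'): 0.5}
```

Real miners such as Apriori avoid this exhaustive enumeration by pruning candidates whose subsets are already known to be infrequent; the brute-force version above is only meant to make the definitions concrete.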
2. Use the large itemsets to generate the desired rules. Here is a straightforward algorithm for this task. For every large itemset l, find all non-empty subsets of l. For every such subset a, output a rule of the form a ⇒ (l − a) if the ratio of support(l) to support(a) is at least minconf. We need to consider all subsets of l to generate rules with multiple consequents.
Example 1.3. From the frequent itemset {A, C} found in example 1.2, we can generate two rules whose confidences are greater than or equal to the minconf value.

Itemset {A, C}:
rule A ⇒ C: confidence = support({A} ∪ {C}) / support({A}) = 50% / 75% = 66%
rule C ⇒ A: confidence = support({A} ∪ {C}) / support({C}) = 50% / 50% = 100%
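The straightforward rule-generation algorithm of subproblem 2 can be sketched as follows. Here `support` is assumed to be any callable returning an itemset's support, and the sample values mirror example 1.3:

```python
from itertools import combinations

def gen_rules(large_itemset, support, minconf):
    """For every non-empty proper subset a of l, output a => (l - a)
    whenever support(l) / support(a) >= minconf."""
    l = frozenset(large_itemset)
    rules = []
    for k in range(1, len(l)):
        for a in map(frozenset, combinations(l, k)):
            conf = support(l) / support(a)
            if conf >= minconf:
                rules.append((set(a), set(l - a), conf))
    return rules

# Supports taken from examples 1.2 and 1.3:
sup = {frozenset("A"): 0.75, frozenset("C"): 0.5, frozenset("AC"): 0.5}
for antecedent, consequent, conf in gen_rules("AC", lambda x: sup[x], 0.5):
    print(antecedent, "=>", consequent, f"(confidence {conf:.0%})")
```

With minconf = 50%, both A ⇒ C (66%) and C ⇒ A (100%) are emitted, as in example 1.3.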
1.4 Outline of the thesis
The remainder of this thesis is organized as follows. In chapter 2, we state the definition of mining association rules with weighted items; the main aim of this chapter is to provide a background for the weight-based problems on which we base our approach. In chapter 3, we describe the main idea of the thesis; a new term, adjustable interestingness, is also introduced there. After the extensive discussion of mining association rules with adjustable interestingness in chapter 3, we devote chapter 4 to the algorithm for it; experiments on real databases are also described. Finally, we conclude the thesis with a summary and a discussion of future work.
CHAPTER 2
MINING ASSOCIATION RULES WITH
WEIGHTED ITEMS
In the last chapter, we discussed mining association rules for the unweighted case. In the following, we introduce the conceptual framework of weights and apply it to mining association rules. The concept of weight will be used in the coming chapters.
2.1 Introduction
There have been two approaches to the association rule mining problem. The first one is to mine the frequent itemsets regardless of their coefficients [1, 7]. The second trend is to assign weights to the items to reflect their importance to the users. Some previous works focused on mining frequent itemsets with weighted items [5] and different supports [6].
The association rules mentioned in the previous chapter are called 'unweighted' association rules [6], as the items are treated uniformly.
Example 2.1. The following rule is the unweighted binary association rule from [1]:
(Bread = Yes) => (Ham = Yes) with support = 60% & confidence = 80%
The above rule states that the probability of buying bread and ham together in a set of transactions is 0.6, and the confidence states that the probability of buying ham, given that the customer buys bread, is 0.8.
The above rule is an unweighted case. However, in the following cases it is better to consider the importance of the items or attributes.
For example, the rule
(Income = High) => (School level = High)
is, in human interpretation, probably more interesting than
(Price = High) => (Sales = Low), even if the support of the latter rule is much higher than that of the former.
By using the weights, the importance of the attributes or items can be reflected, and we can mine the rules with interestingness. For example, we can add weights to the sales transactions where the items are under promotion, or where they carry more profit. The unweighted association rules would stay the same as long as the database did not change; thus they cannot provide a flexible way for the users to adjust the priority of the rules. Therefore, mining association rules with weighted items was presented in [6] to resolve this problem.
2.2 Problem definition
Similar to section 1.3, we consider a database with a set of transactions D and a set of attributes or items I, where each transaction is assigned a transaction identifier TID. Based on the definitions in section 1.3, weights and weighted association rules are defined as follows [6]:

Definition 1. The weight of an item is a value between 0 and 1: 0 indicates the least important item, and 1 denotes the most important item. For example, if the weight of the itemset X is 0.95, it tells us the itemset is important in the set of transactions D. A weight of 0.1 indicates a less important itemset.
Definition 2. A weighted association rule (or association rule with weighted items) has the form X ⇒ Y, where X ⊆ I, Y ⊆ I, X ∩ Y = ∅, and the items in X and Y are given weights.
Definition 3. The weighted support of the rule X ⇒ Y is the adjusting ratio of the support, or mathematically,

wsupport(X ⇒ Y) = (Σ_{j ∈ (X ∪ Y)} w_j) · support(X ∪ Y)

where the weights of the items {i1, i2, ..., im} are {w1, w2, ..., wm} respectively.
In order to find the interesting rules, two thresholds, the minimum weighted support (wminsup) and the minimum confidence (minconf), must be specified.
Definition 4. An itemset X is called a large weighted itemset if the weighted support of the itemset X is greater than or equal to the weighted support threshold, or mathematically,

wsupport(X) ≥ wminsup
Definition 5. An association rule X ⇒ Y is called an interesting rule if the confidence of itemset (X ∪ Y) is greater than or equal to a minimum confidence threshold, and (X ∪ Y) is a large weighted itemset.
Table 1. Database of a stationery store (contents not recovered in this copy)

Table 2. Transactions of a stationery store (columns: TID, Product ID; contents not recovered in this copy)
Example 2.2. Suppose that in a stationery store, a database is as shown in Table 1. Each item includes information on its name, profit and given weight. Table 2 gives the transaction database. For each transaction, there is a transaction identifier (TID) and the names of the items. Suppose there are only 5 items and a total of 8 transactions in the transaction database.
Regardless of the weights given, if the value of minsup is set to 0.4, {1, 4} will be a large itemset, since its support is 50%. However, {1, 4, 5} is not a large itemset, as it appears only two times in the database.
But if we take the weights of items into account, and the given value of wminsup is 0.4, {1, 4} will not be a large weighted itemset, since its weighted support falls below wminsup. By the same argument, {5} and {1, 2, 5} will be large weighted itemsets.
Although itemset {1, 4} has a greater support than that of {1, 2, 5}, it seems true that the latter makes a greater profit than the former can. In this case, we say that itemset {1, 2, 5} is more interesting than itemset {1, 4}, or that the interestingness of itemset {1, 2, 5} is greater than that of itemset {1, 4}.
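Definition 3's weighted support is straightforward to compute once the weights are fixed. The sketch below uses hypothetical weights and transactions, since the actual values of Tables 1 and 2 are not reproduced in this copy:

```python
# Hypothetical item weights (Table 1's actual values are not shown here).
weights = {1: 0.4, 2: 0.2, 3: 0.3, 4: 0.2, 5: 1.0}

# Eight illustrative transactions over items 1..5.
transactions = [{1, 4, 5}, {1, 4, 5}, {1, 4}, {1, 4},
                {2, 5}, {2, 5}, {2, 5}, {3, 5}]

def support(itemset, db):
    """Fraction of transactions in db containing every item of itemset."""
    return sum(set(itemset) <= t for t in db) / len(db)

def wsupport(itemset, db, w):
    """Weighted support: total item weight times ordinary support."""
    return sum(w[j] for j in itemset) * support(itemset, db)

# {1, 4} has support 50%, but its items carry little weight:
print(round(wsupport({1, 4}, transactions, weights), 3))   # -> 0.3
# {5} has both high support and a heavy weight:
print(round(wsupport({5}, transactions, weights), 3))      # -> 0.75
```

With wminsup = 0.4, this hypothetical {1, 4} is a large itemset by raw support but not a large weighted itemset, matching the shape of the argument in example 2.2.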
CHAPTER 3

MINING ASSOCIATION RULES WITH ADJUSTABLE INTERESTINGNESS

3.1 Interestingness and interesting itemsets
Based on the definitions of weighted itemsets in the previous chapter, we extend them to the definitions of interestingness and interesting itemsets.

Definition. The interestingness of an itemset X combines the fraction of transactions in which it occurs as a subset with the total weight of its items, or mathematically,

interest(X) = (Σ_{j ∈ X} w_j) · (σ(X, D) / |D|)
In order to find the interesting itemsets, a threshold, the minimum interestingness (min_int), must be specified.

Definition. An itemset X is called an interesting itemset if the interestingness of the itemset X is greater than or equal to the interestingness threshold, or mathematically,

interest(X) ≥ min_int
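These definitions translate directly into code. The sketch below computes interest(X) for a few itemsets and ranks them; the weights and transactions are hypothetical, since the thesis's Table 1 and Table 2 values are not reproduced here:

```python
def interest(itemset, transactions, weights):
    """interest(X) = (sum of the weights of X's items) * support(X)."""
    sup = sum(set(itemset) <= t for t in transactions) / len(transactions)
    return sum(weights[j] for j in itemset) * sup

# Hypothetical weights and transactions: item 5 is heavily weighted, so
# itemsets containing it score high even at modest support.
weights = {1: 0.3, 2: 0.2, 3: 0.2, 4: 0.3, 5: 1.0}
db = [{5}, {2, 5}, {1, 4, 5}, {1, 4}, {2, 5}, {5}, {1, 4}, {4, 5}]

for X in sorted([{5}, {2, 5}, {1, 4}],
                key=lambda X: interest(X, db, weights), reverse=True):
    print(sorted(X), round(interest(X, db, weights), 4))
```

Sorting itemsets by this score, descending, is exactly the ordering used by Table 3 below.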
Example 3.1. From the database in Tables 1 and 2, we can calculate the interestingness of itemsets as in the following table. The itemsets are sorted into descending order of their interestingness.

Itemset    W    S      I
{5}        1    62.5%  0.625
{2, 5}     1.2  37.5%  0.45
{1, 3, 4}  0.9  12.5%  0.1125
{1, 2, 4}  0.8  12.5%  0.1
(remaining rows not recovered in this copy)

* W, S and I are acronyms for Weight, Support and Interestingness, respectively.

Table 3. Itemsets sorted into descending order of their interestingness
If the value of min_int is 0.3, we obtain six interesting itemsets; these are: {5}, {2, 5}, {1, 4, 5}, {4, 5}, {3, 5}, {1, 4}. Of these six interesting itemsets, five contain item 5, which represents pens. This shows that the interestingness of an itemset is made up of both its weight and its support.

3.2 Interestingness constraints
By sorting the itemsets into descending order of their interestingness, we have two diverse ways to mine the most interesting itemsets. The first is to set a threshold for minimum interestingness, or min_int. In example 3.1, when the min_int value is set to 0.3, there are six most interesting itemsets found in the database. Since the number of itemsets found is unpredictable, this approach may be cumbersome, especially when min_int is lowered towards 0.
In this thesis, we present an alternative way to mine the most interesting itemsets. In this approach, min_int is adjusted throughout the mining process. The term constraint is defined as the number of itemsets we desire to mine, and it must be specified. From example 3.1, if the constraint value is set to 5, we can mine the five most interesting itemsets, whose interestingness values are 0.325 or over. Therefore, the min_int value is adjusted to 0.325 afterwards. Similarly, if the constraint is 10, min_int is adjusted to 0.1875, since the interestingness values of the ten most interesting itemsets are greater than or equal to 0.1875. It is clear that the greater the constraint is, the smaller the value that min_int is adjusted to.
3.3 Motivation behind interesting itemsets and adjustable interestingness
By setting the interestingness of an itemset, we can strike a balance between the two measures, weight and support. If supports are considered separately from weights, we can only find itemsets having sufficient support; however, this may ignore some interesting knowledge. Special items and special groups of items may be specified individually and given higher priority. For example, there may be few customers buying pens, but the profit the pens make is much more than that of other products. As a matter of course, the store clerk will want to put the pens under promotion rather than the others. For this reason, the weight, which is a measure of the importance of an item, is applied.

The interestingness of an itemset can be computed as the multiplication of weight and support. Interestingness, in some cases, can be 'the potential usefulness of the knowledge', but this seems to be difficult to grasp. It is clear that most end-users are not statisticians; they thus have trouble setting the threshold for min_int. Putting a query 'Show me the twenty most interesting itemsets' is definitely more comprehensible. Furthermore, it is impractical to generate the entire set of interesting itemsets. Our purpose is to mine only the most interesting ones. Hence, we design a new concept, adjustable interestingness, in this thesis.
they can generate more useful rules. By setting the minimum support for each item, we generate closed sets using a triple minsup-itemset-tidset, and then restrict the number of itemsets to be found; thus the search space is fairly reduced.