Knowledge Discovery in Databases (KDD)

Keywords: Data mining, association rules, binary association rules, quantitative association rules, fuzzy association rules, parallel algorithms
Contents

List of figures
List of tables
Notations & Abbreviations
Acknowledgements
Abstract
Chapter 1 Introduction to Data mining
1.1 Data mining
1.1.1 Data mining: Motivation
1.1.2 Data mining: Definition
1.1.3 Main steps in Knowledge discovery in databases (KDD)
1.2 Major approaches and techniques in Data mining
1.2.1 Major approaches and techniques in Data mining
1.2.2 Kinds of data that can be mined
1.3 Applications of Data mining
1.3.1 Applications of Data mining
1.3.2 Classification of Data mining systems
1.4 Focused issues in Data mining
Chapter 2 Association rules
2.1 Association rules: Motivation
2.2 Association rules mining - Problem statement
2.3 Main research trends in Association rules mining
Chapter 3 Fuzzy association rules mining
3.1 Quantitative association rules
3.1.1 Association rules with quantitative and categorical attributes
3.1.2 Methods of data discretization
3.2 Fuzzy association rules
3.2.1 Data discretization based on fuzzy sets
3.2.2 Fuzzy association rules
3.2.3 Algorithm for fuzzy association rules mining
3.2.4 Relation between fuzzy association rules and quantitative ones
3.2.5 Experiments and conclusions
Chapter 4 Parallel mining of fuzzy association rules
4.1 Several previously proposed parallel algorithms
4.2 A new parallel algorithm for fuzzy association rules mining
4.2.1 Our approach
4.2.2 The new algorithm
4.2.3 Proof of correctness and computational complexity
4.3 Experiments and conclusions
Conclusion
Achievements throughout the dissertation
Future work
References
Appendix
List of figures

Figure 1 - The volume of data has increased strongly over the past two decades
Figure 2 - Steps in the KDD process
Figure 3 - Illustration of an association rule
Figure 4 - The "sharp boundary problem" in data discretization
Figure 5 - Membership functions of fuzzy sets associated with the "Age" attribute
Figure 6 - Membership functions of "Cholesterol_Low" and "Cholesterol_High"
Figure 7 - Processing time increases dramatically as fminsup decreases
Figure 8 - The number of itemsets and rules increases strongly as fminsup decreases
Figure 9 - The number of rules grows remarkably as fminsup decreases
Figure 10 - Processing time increases sharply as the number of attributes increases slightly
Figure 11 - Processing time increases linearly with the number of records
Figure 12 - Optional choices for the T-norm operator
Figure 13 - The mining results reflect changes in the threshold values
Figure 14 - Count distribution algorithm on a 3-processor parallel system
Figure 15 - Data distribution algorithm on a 3-processor parallel system
Figure 16 - Rule generation time decreases sharply as minconf increases
Figure 17 - The number of rules decreases sharply as minconf increases
Figure 18 - Illustration of the division algorithm
Figure 19 - Processing time decreases sharply as the number of processors increases
Figure 20 - Mining time depends largely on the number of processes (logical, physical)
Figure 21 - The main interface window of the FuzzyARM tool
Figure 22 - The sub-window for adding new fuzzy sets
Figure 23 - The window for viewing mining results
List of tables

Table 1 - An example of a transactional database
Table 2 - Frequent itemsets in the sample database in Table 1 with support = 50%
Table 3 - Association rules generated from frequent itemset ACW
Table 4 - Diagnostic database of heart disease for 17 patients
Table 5 - Data discretization for attributes having finite values
Table 6 - Data discretization for the "Serum cholesterol" attribute
Table 7 - Data discretization for the "Age" attribute
Table 8 - The diagnostic database of heart disease for 13 patients
Table 9 - Notations used in the fuzzy association rules mining algorithm
Table 10 - The algorithm for mining fuzzy association rules
Table 11 - T_F: values of records at attributes after fuzzification
Table 12 - C_1: set of candidate 1-itemsets
Table 13 - F_2: set of frequent 2-itemsets
Table 14 - Fuzzy association rules generated from the database in Table 8
Table 15 - The sequential algorithm for generating association rules
Table 16 - Fuzzy attributes obtained after fuzzifying the database in Table 8
Table 17 - Algorithm for dividing fuzzy attributes among processors
Chapter 1 Introduction to Data mining
1.1 Data mining
1.1.1 Data mining: Motivation
The past two decades have seen a dramatic increase in the amount of information or data being stored on electronic devices (hard disks, CD-ROMs, etc.). This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every two years, and that the size and number of databases are increasing even faster. Figure 1 illustrates this data explosion [3].
Figure 1 - The volume of data has increased strongly over the past two decades
We are drowning in data, but starving for useful knowledge. The vast amount of accumulated data is actually a valuable resource, because information is a vital factor for business operations, and decision-makers can use the data to gain precious insight into the business before making decisions. Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most significant information in their data collections (databases, data warehouses, data repositories). The automated, prospective analyses offered by data mining go beyond the retrospective analyses of past events provided by the tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. This is where data mining and knowledge discovery in databases demonstrate their obvious benefits for today's competitive business environment. Data mining and KDD have become key topics in computer science and knowledge engineering.
The initial applications of data mining were in commerce (retail) and finance (the stock market). However, data mining is now widely and successfully applied in other fields such as bio-informatics, medical treatment, telecommunications, and education.
1.1.2 Data mining: Definition
Before discussing some definitions of data mining, a short note on terminology will help readers avoid unnecessary confusion. As mentioned before, we can roughly understand data mining as a process of extracting nontrivial, implicit, previously unknown, and potentially useful knowledge from huge sets of data. Strictly speaking, this process should be called knowledge discovery in databases (KDD) rather than data mining. However, most researchers agree that the two terms are equivalent and can be used interchangeably. They explain this "humorous misnomer" by noting that the core motivation of KDD is useful knowledge, but the main object handled during the mining process is data. Thus, in a sense, data mining and KDD carry the same meaning. In several materials, however, data mining refers to just one step in the whole KDD process [3] [43].
There are numerous definitions of data mining, and all of them are descriptive. I restate here some that are widely accepted.
Definition one (W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, 1991 [43]):

"Knowledge discovery in databases, also known as data mining, is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."
Definition two (M. Holshemier and A. Siebes, 1994):

"Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amounts of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database."
1.1.3 Main steps in Knowledge discovery in databases (KDD)
The whole KDD process is usually decomposed into the following steps [3] [14] [23]:
Data selection: selecting or segmenting the data that needs to be mined from large data sets (databases, data warehouses, data repositories) according to some criteria.

Data preprocessing: the data cleaning and reconfiguration stage, where techniques are applied to deal with incomplete, noisy, and inconsistent data. This step also tries to reduce the data by using aggregate and group functions, data compression methods, histograms, sampling, etc. Furthermore, discretization techniques (binning, histograms, cluster analysis, entropy-based discretization, segmentation) can be used to reduce the number of values of a given continuous attribute by dividing its range into separate intervals. After this step, the data is clean, complete, uniform, reduced, and discretized.

Data transformation: in this step, data is transformed or consolidated into forms appropriate for mining. Data transformation can involve data smoothing and normalization. After this step, the data is ready for the mining step.

Data mining: considered the most important step in the KDD process. It applies data mining techniques (chiefly borrowed from machine learning and other fields) to discover and extract useful patterns or relationships from the data.

Knowledge representation and evaluation: the patterns identified by the system in the previous step are interpreted into knowledge that can be used to support human decision-making (e.g., prediction and classification tasks, summarizing the contents of a database, or explaining observed phenomena). Knowledge representation also converts patterns into user-readable expressions such as trees, graphs, charts and tables, rules, etc.
1.2 Major approaches and techniques in Data mining
1.2.1 Major approaches and techniques in Data mining
Data mining comprises many approaches. They can be classified according to functionality, kind of knowledge, type of data to be mined, or other appropriate criteria [14]. The major approaches are described below:
Classification & prediction: this method tries to arrange a given object into the appropriate class among a set of classes whose number and names are known in advance. For example, we can classify or predict geographic regions according to weather and climate data. This approach normally uses typical techniques and concepts from machine learning such as decision trees, artificial neural networks, k-nearest neighbours, support vector machines, etc. Classification is also called supervised learning.
Association rules: this is a relatively simple form of rule, e.g., "80 percent of men that purchase beer also purchase dry beef". Association rules are now successfully applied in supermarkets (retail), medicine, bio-informatics, finance and stock markets, etc.
Sequential/temporal patterns mining: this method is somewhat similar to association rules, except that the data and mining results (a kind of rule) always contain a temporal attribute to exhibit the order or sequence in which events or objects affect each other. This approach plays a key role in finance and stock markets thanks to its capability of prediction.
Clustering & segmentation: this method tries to arrange a given object into a suitable category (also known as a cluster). The number of clusters may be dynamic, and their labels (names) are unknown. Clustering and segmentation are also called unsupervised learning.
Concept description & summarization: the main objective of this method is to describe or summarize an object so that the obtained information is compact and condensed. Document or text summarization is a typical example.
1.2.2 Kinds of data that can be mined
Data mining can work on various kinds of data. The most typical data types are as follows:
Relational databases: databases organized according to the relational model. Most existing database management systems support this model, such as Oracle, IBM DB2, MS SQL Server, and MS Access.

Multidimensional databases: this kind of database is also called a data warehouse, data mart, etc. The data, selected from different sources, carries a historical feature thanks to an implicit or explicit temporal attribute. This kind of database is used primarily in data mining and decision-making support systems.

Transactional databases: this kind of database is commonly used in supermarkets, banking, etc. Each transaction includes a certain number of items (e.g., the goods in an order), and a transactional database, in turn, contains a certain number of transactions.

Object-relational databases: this database model is a hybrid of the object-oriented model and the relational model.

Spatial, temporal, and time-series data: this kind of data always contains either spatial (e.g., maps) or temporal (e.g., stock market) attributes.

Multimedia databases: this kind of data includes audio, image, video, text, web, and many other data formats. Today, this kind of data is widely used on the Internet thanks to its useful applications.
1.3 Applications of Data mining
1.3.1 Applications of Data mining
Although data mining is a relatively new research trend, it strongly attracts researchers because of its practical applications in many areas. Typical applications include: (1) Data analysis and decision-making support: popular in commerce (the retail industry), finance and stock markets, etc. (2) Medical treatment: finding the potential relevance among symptoms, diagnoses, and treatment methods (nutrition, prescriptions, surgery, etc.). (3) Text and web mining: document summarization, text retrieval and searching, text and hypertext classification. (4) Bio-informatics: searching and comparing typical or special genetic information such as genomes and DNA, or the implicit relations between a set of genomes and a genetic disease. (5) Finance and stock markets: examining data to extract predictive information for the price of a certain kind of security. (6) Others (telecommunications, medical insurance, astronomy, anti-terrorism, sports, etc.).
1.3.2 Classification of Data mining systems
Data mining is a knowledge engineering field that involves many other research areas such as databases, machine learning, artificial intelligence, high performance computing, and data & knowledge visualization. We can classify data mining systems according to different criteria as follows:
Classification based on the kind of data to be mined: data mining systems that work with relational databases, data warehouses, transactional databases, object-oriented databases, spatial and temporal databases, multimedia databases, text and web databases, etc.

Classification based on the type of mined knowledge: data mining tools that return summarization or description, association rules, classification or prediction, clustering, etc.

Classification based on the kind of techniques used: data mining tools that work as online analytical processing (OLAP) systems, use machine learning techniques (decision trees, artificial neural networks, k-nearest neighbours, genetic algorithms, support vector machines, rough sets, fuzzy sets, etc.), data visualization, etc.

Classification based on the fields the data mining systems are applied to: data mining systems used in different fields such as commerce (the retail industry), telecommunications, bio-informatics, medical treatment, finance and stock markets, medical insurance, etc.
1.4 Focused issues in Data mining
Data mining is a relatively new research topic, so there are several pending or unconvincingly solved issues. I relate here some of them that are attracting much attention from data mining researchers.
(1) OLAM (Online Analytical Mining): a smooth combination of databases, data warehouses, and data mining. Nowadays, database management systems like Oracle, MS SQL Server, and IBM DB2 have integrated OLAP and data warehouse functionalities to facilitate data retrieval and analysis, although these add-ins charge users an additional sum of money. Researchers in these fields hope to go beyond the current limitations by developing multi-purpose OLAM systems that support data transactions for daily business operations as well as data analysis for decision making [14]. (2) Data mining systems that can mine various forms of knowledge from different types of data [14] [7]. (3) How to enhance the performance, accuracy, scalability, and integration of data mining systems? How to decrease their computational complexity? How to improve their ability to deal with incomplete, inconsistent, and noisy data? These three questions will remain a focus of attention in the future [14]. (4) Taking advantage of background knowledge or knowledge from users (experts or specialists) to improve the overall performance of data mining systems [7] [1]. (5) Parallel and distributed data mining: an interesting research trend because it makes use of powerful computing systems to reduce response time, which is essential because more and more real-time applications are needed in today's competitive world [5] [8] [12] [18] [26] [31] [32] [34] [42]. (6) Data Mining Query Language (DMQL): researchers in this area try to design a standard query language for data mining, to be used in OLAM systems just as SQL is widely used in relational databases [14]. (7) Knowledge representation and visualization, to express knowledge in human-readable and easy-to-use forms. Knowledge can be represented in more intuitive expressions thanks to multidimensional or multilevel data structures.
This thesis primarily focuses on mining fuzzy association rules and on parallel algorithms for mining fuzzy association rules.
Chapter 2 Association rules
2.1 Association rules: Motivation
Association rules have the form "70 percent of customers that purchase beer also purchase dry beef; 20 percent of customers purchase both" or "75 percent of patients who smoke cigarettes and live near polluted areas also get lung cancer; 25 percent of patients smoke and live near polluted areas as well as suffer from lung cancer". "Purchase beer" and "smoke cigarettes and live near polluted areas" are the antecedents, while "purchase dry beef" and "get lung cancer" are called the consequents of the association rules. 20% and 25% are called support factors (the percentage of transactions or records that contain both the antecedent and the consequent of a rule); 70% and 75% are called confidence factors (the percentage of transactions or records holding the antecedent that also hold the consequent of the rule). The following figure pictorially depicts the former example of an association rule.
Figure 3 - Illustration of an association rule
The knowledge and information derived from association rules differ in meaning from the results of normal queries (usually expressed in SQL syntax). This knowledge captures previously unknown relationships and predictions hidden in massive volumes of data; it results not merely from the usual group, aggregate, or sort operations but from a complicated and time-consuming computing process.
Although association rules are a simple kind of rule, they carry useful knowledge and contribute substantially to decision-making processes. Unearthing significant rules from databases is the main motivation of researchers.
2.2 Association rules mining - Problem statement
Let I = {i1, i2, ..., in} be a set of n items or attributes (in transactional or relational databases, respectively), and let T = {t1, t2, ..., tm} be a set of m transactions or records. Each transaction is identified by a unique TID number. A (transactional) database D is a binary relation δ on the Cartesian product I×T (also written δ ⊆ I×T). If an item i occurs in a transaction t, we write (i, t) ∈ δ, or iδt. Generally speaking, a transactional database is a set of transactions, where each transaction t contains a set of items, i.e., t ∈ 2^I (where 2^I denotes the power set of I). The support of an itemset X ⊆ I, denoted s(X), is the percentage of transactions in D that contain X, and an itemset is called frequent if its support is greater than or equal to a user-specified minimum support threshold, minsup [36].
The following table enumerates all frequent itemsets in the sample database in Table 1 with a minsup value of 50%.
Table 2 - Frequent itemsets in the sample database in Table 1 with support = 50%
An association rule is an implication of the form X → Y, where X and Y are disjoint frequent itemsets, i.e., X ∩ Y = ∅. The confidence factor c of the rule is the conditional probability that a transaction contains Y given that it contains X, i.e., c = s(X ∪ Y) / s(X). A rule is confident if its confidence factor is greater than or equal to a user-specified minimum confidence (minconf) value, i.e., c ≥ minconf [36].
The association rules mining task can be stated as follows:
Let D be a (transactional) database, and let minsup and minconf be the minimum support and minimum confidence, respectively. The mining task is to discover all frequent and confident association rules X → Y, i.e., all rules with s(X ∪ Y) ≥ minsup and c(X → Y) = s(X ∪ Y) / s(X) ≥ minconf.
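To make these definitions concrete, here is a minimal Python sketch of the support and confidence computations; the toy transactions and items below are hypothetical, not taken from Table 1.

def support(itemset, transactions):
    # fraction of transactions that contain every item of `itemset`
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # conditional probability s(X u Y) / s(X) of the rule X -> Y
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

transactions = [{"A", "C", "W"}, {"A", "C"}, {"A", "W"}, {"C", "W"}]
print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # ~0.67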
Almost all previously proposed algorithms decompose this mining task into two separate phases [4] [5] [20] [24] [34] [35]:
Phase one: finding all frequent itemsets in database D. This stage is usually complicated and time-consuming because it requires much CPU time (CPU-bound) and many I/O operations (I/O-bound).
Phase two: generating confident association rules from the frequent itemsets discovered in the previous step. If X is a frequent itemset, the confident association rules created from X have the form X' → X \ X', where X' is any non-empty proper subset of X and X \ X' is the set difference of X and X'. This step is relatively straightforward and much less time-consuming than phase one.
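The following sketch enumerates the rules of phase two for one frequent itemset, reusing the confidence helper from the previous sketch; every non-empty proper subset X' of X is tried as an antecedent.

from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    # generate all confident rules X' -> X \ X' from a frequent itemset X
    itemset = set(itemset)
    rules = []
    for r in range(1, len(itemset)):          # sizes of the antecedent X'
        for antecedent in combinations(sorted(itemset), r):
            consequent = itemset - set(antecedent)
            c = confidence(antecedent, consequent, transactions)
            if c >= minconf:
                rules.append((set(antecedent), consequent, c))
    return rules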
The following table lists all association rules generated from the frequent itemset ACW (from the database in Table 1) with minconf = 70%.
Table 3 - Association rules generated from frequent itemset ACW
2.3 Main research trends in Association rules mining
Since being proposed by R. Agrawal in 1993 [36], the field of association rules mining has developed in various new directions thanks to a variety of improvements from researchers. Some proposals try to enhance precision and performance, some try to tune the interestingness of rules, etc. I list here some of the dominant trends.
Mining binary or boolean association rules: this is the initial research direction of association rules, and most of the early mining algorithms are related to this kind of rule [20] [35] [36]. In binary association rules, an item is only determined as present or absent; the quantity associated with each item is fully ignored, e.g., a transaction buying twenty bottles of beer is treated the same as a transaction buying only one bottle. The most well-known algorithms for mining binary association rules are Apriori and its variants (AprioriTid and AprioriHybrid) [35]. An example of this type of rule is "buying bread = 'yes' AND buying sugar = 'yes' => buying milk = 'yes' AND buying butter = 'yes', with support 20% and confidence 80%".
Quantitative and categorical association rules: attributes in databases may be binary (boolean), numeric (quantitative), nominal (categorical), etc. To discover association rules involving these data types, quantitative and categorical attributes must be discretized and converted into binary ones; several discretization methods were proposed in [34] [39]. An example of this kind of rule is "sex = 'male' AND age ∈ [50..65] AND weight ∈ [60..80] AND sugar in blood > 120 mg/ml => blood pressure = 'high', with support 30% and confidence 65%".
Fuzzy association rules: this type of rule was suggested to overcome several drawbacks of quantitative association rules, such as the "sharp boundary problem" and poor semantic expression. Fuzzy association rules are more natural and intuitive to users thanks to their "fuzzy" characteristics. An example is "dry cough = 'yes' AND high fever AND muscle aches = 'yes' AND breathing difficulties = 'yes' => get SARS (Severe Acute Respiratory Syndrome) = 'yes', with support 4% and confidence 80%". High fever in the above rule is a fuzzy attribute: we measure the body temperature based on a fuzzy concept.
Multi-level association rules: all the kinds of association rules above are too concrete, so they cannot reflect relationships from a general point of view. Multi-level or generalized association rules were devised to surmount this problem [15] [37]. In this approach, we would prefer a rule like "buy PC = 'yes' => buy operating system = 'yes' AND buy office tools = 'yes'" to "buy IBM PC = 'yes' => buy Microsoft Windows = 'yes' AND buy Microsoft Office = 'yes'". Obviously, the former rule is the generalized form of the latter, and the latter is a specific form of the former.
Association rules with weighted items (or attributes): a weight associated with each item indicates the level to which that item contributes to the rule. In other words, weights are used to measure the importance of items. For example, while surveying the SARS plague within a certain group of people, the information on body temperature and respiratory system is much more essential than that on age. To reflect this difference, we attach greater weight values to the body temperature and respiratory system attributes. This is an attractive research branch, and solutions to it were presented in several papers [10] [44]. By using weights, we can discover rare association rules of high interestingness; this means we can retain rules with small support but a special meaning.
Besides examining variants of association rules, researchers pay attention to accelerating the phase of discovering frequent itemsets. Most of the recommended algorithms try to reduce the number of frequent itemsets that need to be mined by developing new theories of maximal frequent itemsets [11] (the MAFIA algorithm) and closed itemsets [13] (the CLOSET algorithm), [24] (the CHARM algorithm), [30]. These new approaches considerably decrease mining time owing to their "delicate pruning strategies". Experiments show that these algorithms outperform known ones like Apriori, AprioriTid, etc.
Parallel and distributed algorithms for association rules mining: in addition to sequential or serial algorithms, parallel algorithms have been invented to enhance the total performance of the mining process by making use of robust parallel systems. The advent of parallel and distributed data mining is widely welcomed because database sizes have been increasing sharply and real-time applications have become common in recent years. Numerous parallel algorithms for mining association rules have been devised during the past ten years [5] [12] [18] [26] [31] [32] [34]; they are both platform dependent and platform independent.
Mining association rules from the point of view of rough set theory [41].
Furthermore, there exist other research trends such as online association rule mining [33], in which data mining tools are integrated with or directly connected to data warehouses or data repositories based on well-known technologies such as OLAP, MOLAP, ROLAP, ADO, etc.
Chapter 3 Fuzzy association rules mining
3.1 Quantitative association rules
3.1.1 Association rules with quantitative and categorical attributes
Mining quantitative and categorical association rules is an important task because of its practical applications to real-world databases. This kind of association rule was first introduced in [38].
Table 4 - Diagnostic database of heart disease for 17 patients
In the above database, three attributes (Age, Serum cholesterol, Maximum heart rate) are quantitative, two attributes (Chest pain type and Resting electrocardiographics) are categorical, and the rest (Sex, Heart disease, Fasting blood sugar) are binary. In fact, the binary data type can also be considered a special form of category. From the data in Table 4, we can extract rules such as:
<Age: 54..74> AND <Sex: Female> AND <Cholesterol: 200..300> => <Heart disease: Yes>, with support 23.53% and confidence 80%

<Sex: Male> AND <Resting electrocardiographics: 0> AND <Fasting blood sugar < 120> => <Heart disease: No>, with support 17.65% and confidence 100%
The approach proposed in [34] discovers this kind of rule by partitioning the value ranges of quantitative and categorical attributes into separate intervals, thereby converting them into binary attributes. Traditional well-known algorithms such as Apriori [35], CHARM [24], and CLOSET [20] can then work on these new binary attributes as in the original problem of mining boolean association rules.
3.1.2 Methods of data discretization
Binary association rules mining algorithms [20] [24] [35] [36] only work with relational databases containing binary attributes, or with transactional databases as shown in Table 1; they cannot be applied directly to practical databases like the one in Table 4. To overcome this obstacle, quantitative and categorical columns must first be converted into boolean ones [34] [39]. However, data discretization suffers from some limitations that influence the quality of the discovered rules, so the output rules often do not satisfy researchers' expectations. The following section describes the major discretization methods and contrasts their disadvantages.
The first case: let A be a discrete quantitative or categorical attribute with finite value domain {v1, v2, ..., vk}, where k is small enough (k < 100). After being discretized, the original attribute is expanded into k new binary attributes named A_V1, A_V2, ..., A_Vk. The value of a record at column A_Vi is True (Yes, or 1) if the original value of this record at attribute A equals vi; in all other cases A_Vi is set to False (No, or 0). The attributes Chest pain type and Resting electrocardiographics in Table 4 belong to this case. After transforming, the initial attribute Chest pain type is converted into four binary columns Chest_pain_type_1, Chest_pain_type_2, Chest_pain_type_3, and Chest_pain_type_4, as shown in the following table.
Table 5 - Data discretization for attributes having finite values
The second case: if A is a continuous quantitative attribute, or a categorical one with value domain {v1, v2, ..., vp} where p is relatively large, A is mapped to q new binary columns of the form <A: start1..end1>, <A: start2..end2>, ..., <A: startq..endq>. The value of a given record at column <A: starti..endi> is True (Yes, or 1) if the original value v of this record at A lies between starti and endi; <A: starti..endi> receives False (No, or 0) in all other cases. The attributes Age, Serum cholesterol, and Maximum heart rate in Table 4 are of this form. Serum cholesterol and Age could be discretized as shown in the two following tables:
Table 6 - Data discretization for the "Serum cholesterol" attribute
Table 7 - Data discretization for the "Age" attribute
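The interval mapping of the second case can be sketched as follows; the partition of the Age domain mirrors the [1..29], [30..59], [60..120] split discussed below, while the function and column names are ours, not from the thesis.

AGE_INTERVALS = [(1, 29), (30, 59), (60, 120)]

def discretize(value, intervals, name="Age"):
    # one boolean column per interval, True where the value falls inside
    return {f"<{name}: {lo}..{hi}>": lo <= value <= hi for lo, hi in intervals}

print(discretize(59, AGE_INTERVALS))  # only <Age: 30..59> is True
print(discretize(60, AGE_INTERVALS))  # only <Age: 60..120> is True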
Unfortunately, the above discretization methods encounter some pitfalls, such as the "sharp boundary problem" [4] [9]. The figure below displays the support distribution of an attribute A with a value range from 1 to 10. Suppose we divide A into two separate intervals, [1..5] and [6..10]. If the minsup value is 41%, the interval [6..10] will not gain sufficient support: it cannot satisfy minsup (40% < minsup = 41%) even though there is a large amount of support near its left boundary. For example, [4..7] has support 55% and [5..8] has support 45%. This partition thus creates a "sharp boundary" between 5 and 6, and mining algorithms cannot generate confident rules involving the interval [6..10].
Another attribute partitioning method [38] divides the attribute domain into overlapped regions, so the boundaries of adjacent intervals overlap each other. As a result, elements located near a boundary contribute to more than one interval, and some intervals may become interesting in this case. This is, however, not reasonable, because the total support of all intervals exceeds 100% and we unintentionally overemphasize the importance of values located near the boundaries. This is neither natural nor consistent.
Furthermore, partitioning the attribute domain into separate ranges creates a problem in rule interpretation. Table 7 shows that the two ages 29 and 30 belong to different intervals even though they indicate very similar levels of age. Also, supposing that the range [1..29] denotes young people, [30..59] middle-aged people, and [60..120] old people, the age of 59 implies a middle-aged person whereas the age of 60 implies an old person. This is neither intuitive nor natural for understanding the meaning of quantitative association rules.
Fuzzy association rules were proposed to address these problems [4] [9]. This kind of rule not only successfully alleviates the "sharp boundary problem" but also allows us to express association rules in a more intuitive and friendly format. For example, the quantitative rule "<Age: 54..74> AND <Sex: Female> AND <Cholesterol: 200..300> => <Heart disease: Yes>" can now be replaced by "<Age_Old> AND <Sex: Female> AND <Cholesterol_High> => <Heart disease: Yes>". Age_Old and Cholesterol_High in the above rule are fuzzy attributes.
3.2 Fuzzy association rules
3.2.1 Data discretization based on fuzzy sets
In fuzzy set theory [21] [47], an element can belong to a set with a membership value in [0, 1]. This value is assigned by the membership function associated with each fuzzy set. For an attribute x with domain Dx (also known as the universal set), the membership function associated with a fuzzy set fx is the mapping:

m_fx(x): Dx → [0, 1]    (3.1)
Trang 23I he fuzzy sei provides a smooth change over the boundaries and allows us to express association rules in a more expressive form, b e t' s use the fuzzy set in data
d iserctizing to make the most o f its benefits
For the attribute Age with universal domain [0, 120], we attach three fuzzy sets: Age_Young, Age_Middle-aged, and Age_Old. The graphical representations of these fuzzy sets are shown in the following figure.
Figure 5 - Membership functions of fuzzy sets associated with the "Age" attribute
By using fuzzy sets we completely get rid of the "sharp boundary problem" thanks to their inherent characteristics. For example, the graph in Figure 5 indicates that the ages of 59 and 60 have membership values in the fuzzy set Age_Old of approximately 0.85 and 0.90, respectively. Similarly, the membership values of the ages 30 and 29 in the fuzzy set Age_Young are 0.70 and 0.75. Obviously, this transformation method is much more intuitive and natural than the known discretization methods.
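A minimal sketch of such membership functions follows. The breakpoints are assumptions chosen only so that the curves reproduce the values quoted above (0.85 and 0.90 for Age_Old at ages 59 and 60; 0.75 and 0.70 for Age_Young at ages 29 and 30); they are not read off Figure 5.

def ramp_up(x, a, b):
    # 0 below a, 1 above b, linear in between: a smooth boundary
    return min(1.0, max(0.0, (x - a) / (b - a)))

def age_old(x):    # assumed: starts rising at 42, full membership from 62
    return ramp_up(x, 42, 62)

def age_young(x):  # assumed: full membership up to 24, fades out by 44
    return 1.0 - ramp_up(x, 24, 44)

print(age_old(59), age_old(60))      # 0.85 0.9
print(age_young(29), age_young(30))  # 0.75 0.7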
As another example, the original attribute Serum cholesterol is decomposed into two new fuzzy attributes, Cholesterol_Low and Cholesterol_High. The following figure portrays the membership functions of these fuzzy concepts.
Figure 6 - Membership functions of "Cholesterol_Low" and "Cholesterol_High"
If A is a categorical attribute with value domain {v1, v2, ..., vk} and k is relatively small, we fuzzify this attribute by attaching a new fuzzy attribute A_Vi to each value vi. The membership function value m_A_Vi(x) equals 1 if x = vi and 0 otherwise. Strictly speaking, A_Vi is then an ordinary (crisp) set, because its membership function value is either 0 or 1. If k is too large, we can fuzzify the attribute by dividing its domain into intervals and attaching a new fuzzy attribute to each partition. However, developers or users should consult experts for the necessary knowledge about the data to achieve an appropriate division.
Data discretization using fuzzy sets can bring the following benefits:
Firstly, the smooth transition of membership functions helps us eliminate the "sharp boundary problem".
Secondly, data discretization using fuzzy sets significantly reduces the number of new attributes, because the number of fuzzy sets associated with an original attribute is relatively small compared to the number of intervals produced for an attribute in quantitative association rules. For instance, if we use normal discretization methods on the attribute Serum cholesterol, we obtain five sub-ranges (and thus five new attributes) from its original domain [100, 600], whereas we create only two new attributes, Cholesterol_Low and Cholesterol_High, by applying fuzzy sets. This advantage is essential because it compacts the set of candidate itemsets and therefore shortens the total mining time.
Thirdly, fuzzy association rules are more intuitive and natural than the previously known kinds of rules.
All values of records at the new attributes after fuzzification lie in [0, 1], expressing the possibility that a given element belongs to a fuzzy set. As a result, this flexible encoding offers an exact way to measure the contribution or impact of each record on the overall support of an itemset.
The next advantage, which will become clearer in the next section, is that fuzzified databases still hold the "downward closure property" (all subsets of a frequent itemset are also frequent, and any superset of a non-frequent itemset is non-frequent), provided the T-norm operator is chosen wisely. Thus, conventional algorithms such as Apriori also work well on fuzzified databases with only slight modifications.
Another benefit is that this data discretization method can easily be applied to both relational and transactional databases.
Table 8 - The diagnostic database of heart disease for 13 patients
Let I = {i1, i2, ..., in} be a set of n attributes, where iu denotes the u-th attribute in I, and let T = {t1, t2, ..., tm} be a set of m records, where tv is the v-th record in T. The value of record tv at attribute iu is referred to as tv[iu]. For instance, in Table 8 the value of t5[i2] (i.e., the value of t5[Serum cholesterol]) is 274 (mg/ml). Using the fuzzification method of the previous section, we associate each attribute iu with a set F_iu of fuzzy sets. For example, with the database in Table 8 we have:

F_Age = {Age_Young, Age_Middle-aged, Age_Old} (with k = 3)
F_Serum cholesterol = {Cholesterol_Low, Cholesterol_High} (with k = 2)
A fuzzy association rule [4] [9] is an implication of the form:

X is A => Y is B    (3.3)

Where:

• X, Y ⊆ I are itemsets: X = {x1, x2, ..., xp} (xi ≠ xj if i ≠ j) and Y = {y1, y2, ..., yq} (yi ≠ yj if i ≠ j);
• A = {fx1, fx2, ..., fxp} and B = {fy1, fy2, ..., fyq} are sets of fuzzy sets corresponding to the attributes in X and Y, with fxi ∈ F_xi and fyj ∈ F_yj.

We can rewrite the fuzzy association rule in the two following forms:

X = {x1, ..., xp} is A = {fx1, ..., fxp} => Y = {y1, ..., yq} is B = {fy1, ..., fyq}    (3.4)

(x1 is fx1) AND ... AND (xp is fxp) => (y1 is fy1) AND ... AND (yq is fyq)    (3.5)
A f u z z y item se t is now defined as a pair <X A>, in which X ( c I) is an itemset and A
is a set o f fuzzy sets associated with attributes in X
The support of a fuzzy itemset <X, A> is denoted fs(<X, A>) and determined by the following formula:

fs(<X, A>) = ( Σv=1..|T| [ m_fx1(tv[x1]) ⊗ m_fx2(tv[x2]) ⊗ ... ⊗ m_fxp(tv[xp]) ] ) / |T|    (3.6)

Where:

• X = {x1, ..., xp} and tv is the v-th record in T;
• ⊗ is a T-norm operator from fuzzy logic theory; its role is similar to that of the logical operator AND in classical logic;
• |T| (the cardinality of T) is the total number of records in T (equal to m).
A frequent fuzzy itemset: a fuzzy itemset <X, A> is frequent if its support is greater than or equal to a user-specified fuzzy minimum support (fminsup), i.e., fs(<X, A>) ≥ fminsup.
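Formula (3.6) translates directly into code. In this sketch the T-norm defaults to the min function (one of the choices discussed next), and the two sample records with their membership values are hypothetical.

from functools import reduce

def fuzzy_support(fuzzy_attrs, records, tnorm=min):
    # fs(<X, A>): sum over records of the T-norm of memberships, divided by |T|
    total = sum(reduce(tnorm, (rec[a] for a in fuzzy_attrs)) for rec in records)
    return total / len(records)

records = [
    {"Age_Old": 0.85, "BloodSugar_0": 1.0},
    {"Age_Old": 0.0,  "BloodSugar_0": 1.0},
]
print(fuzzy_support(["Age_Old", "BloodSugar_0"], records))  # (0.85 + 0.0) / 2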
The T-norm operator (⊗): there are various ways to choose the T-norm operator [1] [2] [21] [47] in formula (3.6), such as (see the sketch after this list):

• Min function: a ⊗ b = min(a, b)
• Normal (algebraic) multiplication: a ⊗ b = a·b
• Limited multiplication: a ⊗ b = max(0, a + b - 1)
• Drastic multiplication: a ⊗ b = a (if b = 1), = b (if a = 1), = 0 (if a, b < 1)
• Yager operator: a ⊗ b = 1 - min[1, ((1 - a)^w + (1 - b)^w)^(1/w)] (with w > 0). If w = 1 it becomes limited multiplication; as w grows to +∞ it approaches the min function; as w decreases to 0 it becomes drastic multiplication.
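Sketches of these operators follow; yager(w) returns a T-norm for a given w, and the printed checks illustrate the limiting behaviour noted in the last item.

def t_min(a, b):      return min(a, b)
def t_product(a, b):  return a * b
def t_limited(a, b):  return max(0.0, a + b - 1.0)

def t_drastic(a, b):
    if b == 1.0: return a
    if a == 1.0: return b
    return 0.0

def yager(w):
    def t(a, b):
        return 1.0 - min(1.0, ((1 - a) ** w + (1 - b) ** w) ** (1.0 / w))
    return t

print(yager(1)(0.7, 0.6), t_limited(0.7, 0.6))  # both ~0.3 (w = 1)
print(yager(50)(0.7, 0.6), t_min(0.7, 0.6))     # a large w approaches min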
Based on experiments, we conclude that the min function and normal multiplication are the two most preferable choices for the T-norm operator, because they make it convenient to calculate support factors and they highlight the logical relations among fuzzy attributes in frequent fuzzy itemsets. The two following formulas, (3.12) and (3.13), are derived from formula (3.6) by applying the min function and normal multiplication, respectively:

fs(<X, A>) = ( Σv=1..|T| min( m_fx1(tv[x1]), ..., m_fxp(tv[xp]) ) ) / |T|    (3.12)

fs(<X, A>) = ( Σv=1..|T| Πj=1..p m_fxj(tv[xj]) ) / |T|    (3.13)
Another reason for choosing the min function and algebraic multiplication for the T-norm operator is related to the question "how do we understand the meaning of the implication operator (→ or =>) in fuzzy logic theory?". In classical logic, the implication operator, used to link two clauses P and Q into a compound clause P → Q, expresses the idea "if P then Q". This is a relatively sophisticated logical link because it is used to represent a cause-and-effect relation. While formalizing, however, we treat the truth value of this relation as a regular combination of the truth values of P and Q. This assumption may lead us to a misconception or misunderstanding of this kind of relation [1].
In fuzzy logic theory, the implication operator expresses a compound clause of the form "if u is P then v is Q", in which P and Q are two fuzzy sets on the universal domains U and V, respectively. The cause-and-effect rule "if u is P then v is Q" is understood to mean that the pair (u, v) forms a fuzzy set on the universal domain U×V. The fuzzy implication P → Q is thus considered a fuzzy set, and we need to identify its membership function m_{P→Q} from the membership functions m_P and m_Q of the fuzzy sets P and Q. There is a variety of research on this issue; we relate here several ways to determine this membership function [1]:

If we adopt the idea of the implication operator from classical logic, we have: ∀(u, v) ∈ U×V: m_{P→Q}(u, v) = ⊕(1 - m_P, m_Q), in which ⊕ is an S-norm operator in fuzzy logic theory. If ⊕ is replaced with the max function, we obtain the Dienes formula m_{P→Q}(u, v) = max(1 - m_P, m_Q). If ⊕ is replaced with the probabilistic sum, we receive the Mizumoto formula m_{P→Q}(u, v) = 1 - m_P + m_P·m_Q. And if ⊕ is substituted by the limited sum, we get the Lukasiewicz formula m_{P→Q}(u, v) = min(1, 1 - m_P + m_Q).

In general, ⊕ can be substituted by any valid function satisfying the conditions of an S-norm operator.
Another way to interpret this kind of relation is that the truth value of the compound clause "if u is P then v is Q" increases if and only if the truth values of both the antecedent and the consequent are large. This means that m_{P→Q}(u, v) = ⊗(m_P, m_Q). If the ⊗ operator is substituted with the min function, we receive the Mamdani formula m_{P→Q}(u, v) = min(m_P, m_Q). Similarly, if ⊗ is replaced by normal multiplication, we obtain the formula m_{P→Q}(u, v) = m_P·m_Q [2].
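For reference, the implication formulas just mentioned can be written out directly; the function names are ours.

def dienes(mp, mq):       return max(1 - mp, mq)
def mizumoto(mp, mq):     return 1 - mp + mp * mq
def lukasiewicz(mp, mq):  return min(1.0, 1 - mp + mq)
def mamdani_min(mp, mq):  return min(mp, mq)
def mamdani_prod(mp, mq): return mp * mq

for f in (dienes, mizumoto, lukasiewicz, mamdani_min, mamdani_prod):
    print(f.__name__, f(0.8, 0.6))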
A fuzzy association rule is, in a sense, a form of fuzzy implication. Thus, it should in part comply with the above ideas. Although there are many combinations of m_P and m_Q that can form m_{P→Q}(u, v), the Mamdani formulas are the most favorable ones. This is the main reason behind our choice of the min function and algebraic multiplication for the T-norm operator.
3.2.3 Algorithm for fuzzy association rules mining
The problem of discovering fuzzy association rules is usually decomposed into the two following phases:

Phase one: finding all frequent fuzzy itemsets <X, A> in the input database, i.e., itemsets with fs(<X, A>) ≥ fminsup.

Phase two: generating all confident fuzzy association rules from the frequent fuzzy itemsets discovered above. This subproblem is relatively straightforward and less time-consuming compared to the previous step. If <X, A> is a frequent fuzzy itemset, the rules derived from <X, A> have the form X' is A' => X \ X' is A \ A', in which X' and A' are non-empty proper subsets of X and A, respectively. The backslash (\) denotes set subtraction, and fc, the fuzzy confidence factor of the rule, must satisfy fc ≥ fminconf.
The inputs of the algorithm are a database D with attribute set I and record set T, together with fminsup and fminconf.

The outputs of the algorithm are all confident fuzzy association rules.

Notation table:

I_F: set of fuzzy attributes in D_F; each of them is attached to a fuzzy set. Each fuzzy set f, in turn, has a threshold w_f as used in formula (3.7).
T_F: set of records in D_F; the value of each record at a given fuzzy attribute is in [0, 1].
C_k: set of candidate fuzzy k-itemsets.
F_k: set of frequent fuzzy k-itemsets.
F: set of all frequent itemsets in database D_F.
fminsup: fuzzy minimum support.
fminconf: fuzzy minimum confidence.

Table 9 - Notations used in the fuzzy association rules mining algorithm
The algorithm:

2. (D_F, I_F, T_F) = FuzzyMaterialization(D, I, T);
3. F_1 = Counting(D_F, I_F, T_F, fminsup);

Table 10 - The algorithm for mining fuzzy association rules
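The level-wise control flow around steps 2 and 3 can be sketched as follows, assuming the subprograms described below (Join, Prune, Checking); the helpers are passed as parameters since their bodies are given separately.

def mine_fuzzy_frequent_itemsets(D, I, T, fminsup, fuzzy_materialization,
                                 counting, join, prune, checking):
    # level-wise loop: F_1, then C_k -> Prune -> F_k until nothing is frequent
    DF, IF, TF = fuzzy_materialization(D, I, T)    # step 2 in Table 10
    F_prev = counting(DF, IF, TF, fminsup)         # step 3: frequent 1-itemsets
    all_frequent = []
    while F_prev:
        all_frequent.extend(F_prev)
        Ck = join(F_prev)                   # candidate k-itemsets from F_{k-1}
        Ck = prune(Ck, F_prev)              # downward-closure pruning
        F_prev = checking(Ck, DF, fminsup)  # keep candidates with fs >= fminsup
    return all_frequent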
The algorithm in Table 10 uses the following subprograms:

(D_F, I_F, T_F) = FuzzyMaterialization(D, I, T): this function converts the original database D into the fuzzified database D_F; I and T are transformed into I_F and T_F, respectively. For example, with the database in Table 8, after running this function we obtain:

I_F = {[Age, Age_Young] (1), [Age, Age_Middle-aged] (2), [Age, Age_Old] (3), [Cholesterol, Cholesterol_Low] (4), [Cholesterol, Cholesterol_High] (5), [BloodSugar, BloodSugar_0] (6), [BloodSugar, BloodSugar_1] (7), [HeartDisease, HeartDisease_No] (8), [HeartDisease, HeartDisease_Yes] (9)}
After converting, I_F contains 9 new fuzzy attributes, compared to 4 attributes in I. Each fuzzy attribute is a pair, surrounded by square brackets, consisting of the name of the original attribute and the name of the corresponding fuzzy set. For instance, after fuzzifying the Age attribute, we receive three new fuzzy attributes: [Age, Age_Young], [Age, Age_Middle-aged], and [Age, Age_Old].
In addition, the function FuzzyMaterialization converts T into T_F, as shown in the following table:

Table 11 - T_F: values of records at attributes after fuzzification
Note that the characters A, C, S, and H in Table 11 are the first characters of Age, Cholesterol, Sugar, and Heart, respectively. Each fuzzy set f is accompanied by a threshold w_f, so only values greater than or equal to that threshold are taken into consideration; all other values are set to 0. The gray cells in Table 11 indicate values greater than or equal to the threshold (all thresholds in Table 11 are 0.5), while all values located in white cells are 0.
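A sketch of the per-record fuzzification with thresholding that FuzzyMaterialization performs; it reuses the age_old and age_young helpers from the earlier membership-function sketch, and the dictionary layout is an assumption.

W_F = 0.5  # the threshold used for every fuzzy set in Table 11

def fuzzify_record(record, fuzzy_sets):
    # map an original record to membership values of each fuzzy attribute,
    # zeroing out memberships that fall below the threshold w_f
    out = {}
    for attr, value in record.items():
        for fset_name, mfunc in fuzzy_sets.get(attr, {}).items():
            m = mfunc(value)
            out[f"[{attr}, {fset_name}]"] = m if m >= W_F else 0.0
    return out

fuzzy_sets = {"Age": {"Age_Old": age_old, "Age_Young": age_young}}
print(fuzzify_record({"Age": 60}, fuzzy_sets))
# {'[Age, Age_Old]': 0.9, '[Age, Age_Young]': 0.0}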
F_1 = Counting(D_F, I_F, T_F, fminsup): this function generates F_1, the set of all frequent fuzzy 1-itemsets. All elements of F_1 must have supports greater than or equal to fminsup. For instance, applying normal multiplication as the T-norm (⊗) operator in formula (3.6) with fminsup = 46%, we obtain the candidate 1-itemsets and supports shown in the following table (the itemsets marked frequent form F_1):

{[Age, Age_Young]} (1): support 10%, not frequent
{[Age, Age_Middle-aged]} (2): support 45%, not frequent
{[Age, Age_Old]} (3): support 76%, frequent
{[Serum cholesterol, Cholesterol_Low]} (4): support 43%, not frequent
{[Serum cholesterol, Cholesterol_High]} (5): support 16%, not frequent
{[BloodSugar, BloodSugar_0]} (6): support 85%, frequent
{[BloodSugar, BloodSugar_1]} (7): support 15%, not frequent
{[HeartDisease, HeartDisease_No]} (8): support 54%, frequent
{[HeartDisease, HeartDisease_Yes]} (9): support 46%, frequent

Table 12 - C_1: set of candidate 1-itemsets
Trang 32INS ERT INTO Ck
S E L E C T p.i i, p.i2 p.ik-i, q.ik-i
FR OM Lk., p Lk_, q
W H E R E p.i| = q.i|, p.ik_2 = q.ik-2, P - i k - i < q.ik-i AND p.ik.,.o * q.ik.|.o;
Here, p.i_j and q.i_j are the index numbers of the j-th fuzzy attributes in itemsets p and q, respectively, while p.i_j.o and q.i_j.o are the index numbers of their original attributes. Two fuzzy attributes sharing a common original attribute must not occur in the same fuzzy itemset. For example, after running the above SQL command we obtain C_2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}. The 2-itemset {8, 9} is invalid because its two fuzzy attributes are derived from the common attribute HeartDisease.

C_k = Prune(C_k): this function prunes unnecessary candidate k-itemsets from C_k thanks to the downward closure property ("all subsets of a frequent itemset are also frequent, and any superset of a non-frequent itemset is non-frequent"). To accept a k-itemset in C_k, the Prune function must make sure that all of its (k-1)-subsets are present in F_{k-1}. For instance, after pruning, C_2 is still {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}, since every 1-subset of these candidates is frequent.
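A Python sketch of the Join and Prune steps, mirroring the SQL join above and the downward-closure check; representing itemsets as sorted tuples of fuzzy-attribute indexes and supplying an origin map are our assumptions.

from itertools import combinations

def join(F_prev, origin):
    Ck = []
    for p in F_prev:
        for q in F_prev:
            # share the first k-2 members; p's last index is smaller than q's;
            # and the joined attributes come from different original attributes
            if p[:-1] == q[:-1] and p[-1] < q[-1] and origin[p[-1]] != origin[q[-1]]:
                Ck.append(p + (q[-1],))
    return Ck

def prune(Ck, F_prev):
    kept = set(F_prev)
    return [c for c in Ck
            if all(sub in kept for sub in combinations(c, len(c) - 1))]

origin = {3: "Age", 6: "BloodSugar", 8: "HeartDisease", 9: "HeartDisease"}
F1 = [(3,), (6,), (8,), (9,)]
print(join(F1, origin))  # [(3, 6), (3, 8), (3, 9), (6, 8), (6, 9)]; {8, 9} is rejected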
F_k = Checking(C_k, D_F, fminsup): this function first scans the whole set of transactions in the database to update the support factors of the candidate itemsets in C_k. Afterwards, Checking eliminates any infrequent candidate itemset, i.e., any whose support is smaller than fminsup. All frequent itemsets are retained and put into F_k. After running F_2 = Checking(C_2, D_F, 46%), we receive F_2 = {{3, 6}, {6, 8}}. The following table displays the detailed information.