Knowledge Discovery in Databases (KDD)

Keywords: Data mining, association rules, binary association rules, quantitative association rules, fuzzy association rules, parallel algorithms
Contents

List of figures
List of tables
Notations & Abbreviations
Acknowledgements
Abstract
Chapter 1 Introduction to Data mining
1.1 Data mining
1.1.1 Data mining: Motivation
1.1.2 Data mining: Definition
1.1.3 Main steps in Knowledge discovery in databases (KDD)
1.2 Major approaches and techniques in Data mining
1.2.1 Major approaches and techniques in Data mining
1.2.2 Kinds of data that can be mined
1.3 Applications of Data mining
1.3.1 Applications of Data mining
1.3.2 Classification of Data mining systems
1.4 Focused issues in Data mining
Chapter 2 Association rules
2.1 Association rules: Motivation
2.2 Association rules mining - Problem statement
2.3 Main research trends in Association rules mining
Chapter 3 Fuzzy association rules mining
3.1 Quantitative association rules
3.1.1 Association rules with quantitative and categorical attributes
3.1.2 Methods of data discretization
3.2 Fuzzy association rules
3.2.1 Data discretization based on fuzzy sets
3.2.2 Fuzzy association rules
3.2.3 Algorithm for fuzzy association rules mining
3.2.4 Relation between fuzzy association rules and quantitative ones
3.2.5 Experiments and conclusions
Chapter 4 Parallel mining of fuzzy association rules
4.1 Several previously proposed parallel algorithms
4.2 A new parallel algorithm for fuzzy association rules mining
4.2.1 Our approach
4.2.2 The new algorithm
4.2.3 Proof of correctness and computational complexity
4.3 Experiments and conclusions
Conclusion
Achievements throughout the dissertation
Future work
References
Appendix
List of figures

Figure 1 - The volume of data has increased strongly over the past two decades
Figure 2 - Steps in the KDD process
Figure 3 - Illustration of an association rule
Figure 4 - The "sharp boundary problem" in data discretization
Figure 5 - Membership functions of fuzzy sets associated with the "Age" attribute
Figure 6 - Membership functions of "Cholesterol_Low" and "Cholesterol_High"
Figure 7 - Processing time increases dramatically as fminsup decreases
Figure 8 - The number of itemsets and rules increases strongly as fminsup decreases
Figure 9 - The number of rules grows remarkably as fminsup decreases
Figure 10 - Processing time increases sharply as the number of attributes increases slightly
Figure 11 - Processing time increases linearly with the number of records
Figure 12 - Optional choices for the T-norm operator
Figure 13 - The mining results reflect changes in the threshold values
Figure 14 - Count distribution algorithm on a 3-processor parallel system
Figure 15 - Data distribution algorithm on a 3-processor parallel system
Figure 16 - Rule generation time decreases sharply as minconf increases
Figure 17 - The number of rules decreases sharply as minconf increases
Figure 18 - Illustration of the division algorithm
Figure 19 - Processing time decreases sharply as the number of processors increases
Figure 20 - Mining time depends largely on the number of processes (logical, physical)
Figure 21 - The main interface window of the FuzzyARM tool
Figure 22 - The sub-window for adding new fuzzy sets
Figure 23 - The window for viewing mining results
List of tables

Table 1 - An example of a transactional database
Table 2 - Frequent itemsets in the sample database in Table 1 with support = 50%
Table 3 - Association rules generated from frequent itemset ACW
Table 4 - Diagnostic database of heart disease for 17 patients
Table 5 - Data discretization for attributes having finite values
Table 6 - Data discretization for the "Serum cholesterol" attribute
Table 7 - Data discretization for the "Age" attribute
Table 8 - The diagnostic database of heart disease for 13 patients
Table 9 - Notations used in the fuzzy association rules mining algorithm
Table 10 - The algorithm for mining fuzzy association rules
Table 11 - T_F: values of records at attributes after fuzzification
Table 12 - C_1: set of candidate 1-itemsets
Table 13 - F_2: set of frequent 2-itemsets
Table 14 - Fuzzy association rules generated from the database in Table 8
Table 15 - The sequential algorithm for generating association rules
Table 16 - Fuzzy attributes obtained after fuzzifying the database in Table 8
Table 17 - Algorithm for dividing fuzzy attributes among processors
Chapter 1 Introduction to Data mining
1.1 Data mining
1.1.1 Data mining: Motivation
The past two decades have seen a dramatic increase in the amount of information or data being stored on electronic devices (hard disks, CD-ROMs, etc.). This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every two years, and that the size and number of databases are increasing even faster. Figure 1 illustrates this data explosion [3].
Figure 1 - The volume of data has increased strongly over the past two decades
We are drowning in data, but starving for useful knowledge. The vast amount of accumulated data is actually a valuable resource, because information is a vital factor for business operations, and decision-makers can use the data to gain precious insight into the business before making decisions. Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most significant information in their data collections (databases, data warehouses, data repositories). The automated, prospective analyses offered by data mining go beyond the retrospective analyses of past events provided by the tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. This is where data mining and knowledge discovery in databases demonstrate their obvious benefits for today's competitive business environment. Data mining and KDD have become key topics in computer science and knowledge engineering.
The initial applications of data mining were in commerce (retail) and finance (the stock market). However, data mining is now widely and successfully applied in other fields such as bio-informatics, medical treatment, telecommunications, and education.
1.1.2 Data mining: Definition
Before discussing some definitions of data mining, a short note on terminology will help readers avoid unnecessary confusion. As mentioned before, we can roughly understand data mining as a process of extracting nontrivial, implicit, previously unknown, and potentially useful knowledge from huge sets of data. Strictly speaking, this process should be called knowledge discovery in databases (KDD) rather than data mining. However, most researchers agree that the two terms are equivalent and can be used interchangeably. They explain this "humorous misnomer" by noting that the core motivation of KDD is useful knowledge, but the main object handled during the mining process is data. Thus, in a sense, data mining and KDD carry the same meaning. In several materials, however, data mining refers to just one step in the whole KDD process [3] [43].
There are numerous definitions of data mining, and all of them are descriptive. I restate here some that are widely accepted.
Definition one (W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, 1991 [43]):

"Knowledge discovery in databases, also known as data mining, is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."
Definition two (M. Holshemier and A. Siebes, 1994):

"Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amounts of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database."
1.1.3 Main steps in Knowledge discovery in databases (KDD)
The whole KDD process is usually decomposed into the following steps [3] [14] [23]:
Data selection: selecting or segmenting the data that needs to be mined from large data sets (databases, data warehouses, data repositories) according to some criteria.

Data preprocessing: the data cleaning and reconfiguration stage, where techniques are applied to deal with incomplete, noisy, and inconsistent data. This step also tries to reduce the data by using aggregate and group functions, data compression methods, histograms, sampling, etc. Furthermore, discretization techniques (binning, histograms, cluster analysis, entropy-based discretization, segmentation) can be used to reduce the number of values of a given continuous attribute by dividing its range into separate intervals. After this step, the data is clean, complete, uniform, reduced, and discretized.

Data transformation: in this step, data is transformed or consolidated into forms appropriate for mining. Data transformation can involve data smoothing and normalization. After this step, the data is ready for the mining step.

Data mining: considered the most important step in the KDD process. It applies data mining techniques (chiefly borrowed from machine learning and other fields) to discover and extract useful patterns or relationships from the data.

Knowledge representation and evaluation: the patterns identified by the system in the previous step are interpreted into knowledge that can be used to support human decision-making (e.g., prediction and classification tasks, summarizing the contents of a database, or explaining observed phenomena). Knowledge representation also converts patterns into user-readable expressions such as trees, graphs, charts and tables, rules, etc.
1.2 Major approaches and techniques in Data mining
1.2.1 Major approaches and techniques in Data mining
Data mining comprises many approaches. They can be classified according to functionality, kind of knowledge, type of data to be mined, or other appropriate criteria [14]. The major approaches are described below:
Classification & prediction: this method tries to arrange a given object into the appropriate class among a set of classes whose number and names are known in advance. For example, we can classify or predict geographic regions according to weather and climate data. This approach normally uses typical techniques and concepts from machine learning such as decision trees, artificial neural networks, k-nearest neighbours, support vector machines, etc. Classification is also called supervised learning.
Association rules: this is a relatively simple form of rule, e.g., "80 percent of men that purchase beer also purchase dry beef". Association rules are now successfully applied in supermarkets (retail), medicine, bio-informatics, finance and stock markets, etc.
Sequential/temporal patterns mining: this method is somewhat similar to association rules, except that the data and mining results (a kind of rule) always contain a temporal attribute to exhibit the order or sequence in which events or objects affect each other. This approach plays a key role in finance and stock markets thanks to its capability of prediction.
Clustering & segmentation: this method tries to arrange a given object into a suitable category (also known as a cluster). The number of clusters may be dynamic, and their labels (names) are unknown. Clustering and segmentation are also called unsupervised learning.
Concept description & summarization: the main objective of this method is to describe or summarize an object so that the obtained information is compact and condensed. Document or text summarization is a typical example.
1.2.2 Kinds of data that can be mined
Data mining can work on various kinds of data. The most typical data types are as follows:
Relational databases: databases organized according to the relational model. Most existing database management systems support this model, such as Oracle, IBM DB2, MS SQL Server, and MS Access.

Multidimensional databases: this kind of database is also called a data warehouse, data mart, etc. The data, selected from different sources, carries a historical feature thanks to an implicit or explicit temporal attribute. This kind of database is used primarily in data mining and decision-making support systems.

Transactional databases: this kind of database is commonly used in supermarkets, banking, etc. Each transaction includes a certain number of items (e.g., the goods in an order), and a transactional database, in turn, contains a certain number of transactions.

Object-relational databases: this database model is a hybrid of the object-oriented model and the relational model.

Spatial, temporal, and time-series data: this kind of data always contains either spatial (e.g., maps) or temporal (e.g., stock market) attributes.

Multimedia databases: this kind of data includes audio, image, video, text, web, and many other data formats. Today, this kind of data is widely used on the Internet thanks to its useful applications.
1.3 Applications of Data mining
1.3.1 Applications of Data mining
Although data mining is a relatively new research trend, it strongly attracts researchers because of its practical applications in many areas. Typical applications include: (1) Data analysis and decision-making support: popular in commerce (the retail industry), finance and stock markets, etc. (2) Medical treatment: finding the potential relevance among symptoms, diagnoses, and treatment methods (nutrition, prescriptions, surgery, etc.). (3) Text and web mining: document summarization, text retrieval and searching, text and hypertext classification. (4) Bio-informatics: searching and comparing typical or special genetic information such as genomes and DNA, or the implicit relations between a set of genomes and a genetic disease. (5) Finance and stock markets: examining data to extract predictive information for the price of a certain kind of security. (6) Others (telecommunications, medical insurance, astronomy, anti-terrorism, sports, etc.).
1.3.2 Classification of Data mining systems
Data mining is a knowledge engineering field that involves many other research areas such as databases, machine learning, artificial intelligence, high performance computing, and data & knowledge visualization. We can classify data mining systems according to different criteria as follows:
Classification based on the kind of data to be mined: data mining systems that work with relational databases, data warehouses, transactional databases, object-oriented databases, spatial and temporal databases, multimedia databases, text and web databases, etc.

Classification based on the type of mined knowledge: data mining tools that return summarization or description, association rules, classification or prediction, clustering, etc.

Classification based on the kind of techniques used: data mining tools that work as online analytical processing (OLAP) systems, use machine learning techniques (decision trees, artificial neural networks, k-nearest neighbours, genetic algorithms, support vector machines, rough sets, fuzzy sets, etc.), data visualization, etc.

Classification based on the fields the data mining systems are applied to: data mining systems used in different fields such as commerce (the retail industry), telecommunications, bio-informatics, medical treatment, finance and stock markets, medical insurance, etc.
1.4 Focused issues in Data mining
Data mining is a relatively new research topic, so there are several pending or unconvincingly solved issues. I relate here some of them that are attracting much attention from data mining researchers.
(1) OLAM (Online Analytical Mining): a smooth combination of databases, data warehouses, and data mining. Nowadays, database management systems like Oracle, MS SQL Server, and IBM DB2 have integrated OLAP and data warehouse functionalities to facilitate data retrieval and analysis, although these add-ins charge users an additional sum of money. Researchers in these fields hope to go beyond the current limitations by developing multi-purpose OLAM systems that support data transactions for daily business operations as well as data analysis for decision making [14]. (2) Data mining systems that can mine various forms of knowledge from different types of data [14] [7]. (3) How to enhance the performance, accuracy, scalability, and integration of data mining systems? How to decrease their computational complexity? How to improve their ability to deal with incomplete, inconsistent, and noisy data? These three questions will remain a focus of attention in the future [14]. (4) Taking advantage of background knowledge or knowledge from users (experts or specialists) to improve the overall performance of data mining systems [7] [1]. (5) Parallel and distributed data mining: an interesting research trend because it makes use of powerful computing systems to reduce response time, which is essential because more and more real-time applications are needed in today's competitive world [5] [8] [12] [18] [26] [31] [32] [34] [42]. (6) Data Mining Query Language (DMQL): researchers in this area try to design a standard query language for data mining, to be used in OLAM systems just as SQL is widely used in relational databases [14]. (7) Knowledge representation and visualization, to express knowledge in human-readable and easy-to-use forms. Knowledge can be represented in more intuitive expressions thanks to multidimensional or multilevel data structures.
This thesis primarily focuses on mining fuzzy association rules and on parallel algorithms for mining fuzzy association rules.
Chapter 2 Association rules
2.1 Association rules: Motivation
Association rules have the form "70 percent of customers that purchase beer also purchase dry beef; 20 percent of customers purchase both" or "75 percent of patients who smoke cigarettes and live near polluted areas also get lung cancer; 25 percent of patients smoke and live near polluted areas as well as suffer from lung cancer". "Purchase beer" and "smoke cigarettes and live near polluted areas" are the antecedents, while "purchase dry beef" and "get lung cancer" are called the consequents of the association rules. 20% and 25% are called support factors (the percentage of transactions or records that contain both the antecedent and the consequent of a rule); 70% and 75% are called confidence factors (the percentage of transactions or records holding the antecedent that also hold the consequent of the rule). The following figure pictorially depicts the former example of an association rule.
Figure 3 - Illustration of an association rule
The knowledge and information derived from association rules differ in meaning from the results of normal queries (usually expressed in SQL syntax). This knowledge captures previously unknown relationships and predictions hidden in massive volumes of data; it results not merely from the usual group, aggregate, or sort operations but from a complicated and time-consuming computing process.
Although association rules are a simple kind of rule, they carry useful knowledge and contribute substantially to decision-making processes. Unearthing significant rules from databases is the main motivation of researchers.
2.2 Association rules mining - Problem statement
Let I = {i1, i2, ..., in} be a set of n items or attributes (in transactional or relational databases, respectively), and let T = {t1, t2, ..., tm} be a set of m transactions or records. Each transaction is identified by a unique TID number. A (transactional) database D is a binary relation δ on the Cartesian product I×T (also written δ ⊆ I×T). If an item i occurs in a transaction t, we write (i, t) ∈ δ, or iδt. Generally speaking, a transactional database is a set of transactions, where each transaction t contains a set of items, i.e., t ∈ 2^I (where 2^I denotes the power set of I). The support of an itemset X ⊆ I, denoted s(X), is the percentage of transactions in D that contain X, and an itemset is called frequent if its support is greater than or equal to a user-specified minimum support threshold, minsup [36].
The following table enumerates all frequent itemsets in the sample database in Table 1 with a minsup value of 50%.
Table 2 - Frequent itemsets in the sample database in Table 1 with support = 50%
An association rule is an implication of the form X → Y, where X and Y are disjoint frequent itemsets, i.e., X ∩ Y = ∅. The confidence factor c of the rule is the conditional probability that a transaction contains Y given that it contains X, i.e., c = s(X ∪ Y) / s(X). A rule is confident if its confidence factor is greater than or equal to a user-specified minimum confidence (minconf) value, i.e., c ≥ minconf [36].
The association rules mining task can be stated as follows:
Let D be a (transactional) database, and let minsup and minconf be the minimum support and minimum confidence, respectively. The mining task is to discover all frequent and confident association rules X → Y, i.e., all rules with s(X ∪ Y) ≥ minsup and c(X → Y) = s(X ∪ Y) / s(X) ≥ minconf.
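To make these definitions concrete, here is a minimal Python sketch of the support and confidence computations; the toy transactions and items below are hypothetical, not taken from Table 1.

def support(itemset, transactions):
    # fraction of transactions that contain every item of `itemset`
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # conditional probability s(X u Y) / s(X) of the rule X -> Y
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

transactions = [{"A", "C", "W"}, {"A", "C"}, {"A", "W"}, {"C", "W"}]
print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # ~0.67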
Almost all previously proposed algorithms decompose this mining task into two separate phases [4] [5] [20] [24] [34] [35]:
Phase one: finding all frequent itemsets in database D. This stage is usually complicated and time-consuming because it requires much CPU time (CPU-bound) and many I/O operations (I/O-bound).
Phase two: generating confident association rules from the frequent itemsets discovered in the previous step. If X is a frequent itemset, the confident association rules created from X have the form X' → X \ X', where X' is any non-empty proper subset of X and X \ X' is the set difference of X and X'. This step is relatively straightforward and much less time-consuming than phase one.
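The following sketch enumerates the rules of phase two for one frequent itemset, reusing the confidence helper from the previous sketch; every non-empty proper subset X' of X is tried as an antecedent.

from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    # generate all confident rules X' -> X \ X' from a frequent itemset X
    itemset = set(itemset)
    rules = []
    for r in range(1, len(itemset)):          # sizes of the antecedent X'
        for antecedent in combinations(sorted(itemset), r):
            consequent = itemset - set(antecedent)
            c = confidence(antecedent, consequent, transactions)
            if c >= minconf:
                rules.append((set(antecedent), consequent, c))
    return rules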
The following table lists all association rules generated from the frequent itemset ACW (from the database in Table 1) with minconf = 70%.
Table 3 - Association rules generated from frequent itemset ACW
2.3 Main research trends in Association rules mining
Since being proposed by R. Agrawal in 1993 [36], the field of association rules mining has developed in various new directions thanks to a variety of improvements from researchers. Some proposals try to enhance precision and performance, some try to tune the interestingness of rules, etc. I list here some of the dominant trends.
Mining binary or boolean association rules: this is the initial research direction of association rules, and most of the early mining algorithms are related to this kind of rule [20] [35] [36]. In binary association rules, an item is only determined as present or absent; the quantity associated with each item is fully ignored, e.g., a transaction buying twenty bottles of beer is treated the same as a transaction buying only one bottle. The most well-known algorithms for mining binary association rules are Apriori and its variants (AprioriTid and AprioriHybrid) [35]. An example of this type of rule is "buying bread = 'yes' AND buying sugar = 'yes' => buying milk = 'yes' AND buying butter = 'yes', with support 20% and confidence 80%".
Quantitative and categorical association rules: attributes in databases may be binary (boolean), numeric (quantitative), nominal (categorical), etc. To discover association rules involving these data types, quantitative and categorical attributes must be discretized and converted into binary ones; several discretization methods were proposed in [34] [39]. An example of this kind of rule is "sex = 'male' AND age ∈ [50..65] AND weight ∈ [60..80] AND sugar in blood > 120 mg/ml => blood pressure = 'high', with support 30% and confidence 65%".
Fuzzy association rules: this type of rule was suggested to overcome several drawbacks of quantitative association rules, such as the "sharp boundary problem" and poor semantic expression. Fuzzy association rules are more natural and intuitive to users thanks to their "fuzzy" characteristics. An example is "dry cough = 'yes' AND high fever AND muscle aches = 'yes' AND breathing difficulties = 'yes' => get SARS (Severe Acute Respiratory Syndrome) = 'yes', with support 4% and confidence 80%". High fever in the above rule is a fuzzy attribute: we measure the body temperature based on a fuzzy concept.
Multi-level association rules: all the kinds of association rules above are too concrete, so they cannot reflect relationships from a general point of view. Multi-level or generalized association rules were devised to surmount this problem [15] [37]. In this approach, we would prefer a rule like "buy PC = 'yes' => buy operating system = 'yes' AND buy office tools = 'yes'" to "buy IBM PC = 'yes' => buy Microsoft Windows = 'yes' AND buy Microsoft Office = 'yes'". Obviously, the former rule is the generalized form of the latter, and the latter is a specific form of the former.
Association rules with weighted items (or attributes): a weight associated with each item indicates the level to which that item contributes to the rule. In other words, weights are used to measure the importance of items. For example, while surveying the SARS plague within a certain group of people, the information on body temperature and respiratory system is much more essential than that on age. To reflect this difference, we attach greater weight values to the body temperature and respiratory system attributes. This is an attractive research branch, and solutions to it were presented in several papers [10] [44]. By using weights, we can discover rare association rules of high interestingness; this means we can retain rules with small support but a special meaning.
Besides examining variants of association rules, researchers pay attention to accelerating the phase of discovering frequent itemsets. Most of the recommended algorithms try to reduce the number of frequent itemsets that need to be mined by developing new theories of maximal frequent itemsets [11] (the MAFIA algorithm) and closed itemsets [13] (the CLOSET algorithm), [24] (the CHARM algorithm), [30]. These new approaches considerably decrease mining time owing to their "delicate pruning strategies". Experiments show that these algorithms outperform known ones like Apriori, AprioriTid, etc.
Parallel and distributed algorithms for association rules mining: in addition to sequential or serial algorithms, parallel algorithms have been invented to enhance the total performance of the mining process by making use of robust parallel systems. The advent of parallel and distributed data mining is widely welcomed because database sizes have been increasing sharply and real-time applications have become common in recent years. Numerous parallel algorithms for mining association rules have been devised during the past ten years [5] [12] [18] [26] [31] [32] [34]; they are both platform dependent and platform independent.
Mining association rules from the point of view of rough set theory [41].
Furthermore, there exist other research trends such as online association rule mining [33], in which data mining tools are integrated with or directly connected to data warehouses or data repositories based on well-known technologies such as OLAP, MOLAP, ROLAP, ADO, etc.
Chapter 3 Fuzzy association rules mining
3.1 Quantitative association rules
3.1.1 Association rules with quantitative and categorical attributes
Mining quantitative and categorical association rules is an important task because of its practical applications to real-world databases. This kind of association rule was first introduced in [38].
Table 4 - Diagnostic database of heart disease for 17 patients
In the above database, three attributes (Age, Serum cholesterol, Maximum heart rate) are quantitative, two attributes (Chest pain type and Resting electrocardiographics) are categorical, and the rest (Sex, Heart disease, Fasting blood sugar) are binary. In fact, the binary data type can also be considered a special form of category. From the data in Table 4, we can extract rules such as:
<Age: 54..74> AND <Sex: Female> AND <Cholesterol: 200..300> => <Heart disease: Yes>, with support 23.53% and confidence 80%

<Sex: Male> AND <Resting electrocardiographics: 0> AND <Fasting blood sugar < 120> => <Heart disease: No>, with support 17.65% and confidence 100%
The approach proposed in [34] discovers this kind of rule by partitioning the value ranges of quantitative and categorical attributes into separate intervals, thereby converting them into binary attributes. Traditional well-known algorithms such as Apriori [35], CHARM [24], and CLOSET [20] can then work on these new binary attributes as in the original problem of mining boolean association rules.
3.1.2 Methods of data discretization
Binary association rules mining algorithms [20] [24] [35] [36] only work with relational databases containing binary attributes, or with transactional databases as shown in Table 1; they cannot be applied directly to practical databases like the one in Table 4. To overcome this obstacle, quantitative and categorical columns must first be converted into boolean ones [34] [39]. However, data discretization suffers from some limitations that influence the quality of the discovered rules, so the output rules often do not satisfy researchers' expectations. The following section describes the major discretization methods and contrasts their disadvantages.
The first case: let A be a discrete quantitative or categorical attribute with finite value domain {v1, v2, ..., vk}, where k is small enough (k < 100). After being discretized, the original attribute is expanded into k new binary attributes named A_V1, A_V2, ..., A_Vk. The value of a record at column A_Vi is True (Yes, or 1) if the original value of this record at attribute A equals vi; in all other cases A_Vi is set to False (No, or 0). The attributes Chest pain type and Resting electrocardiographics in Table 4 belong to this case. After transforming, the initial attribute Chest pain type is converted into four binary columns Chest_pain_type_1, Chest_pain_type_2, Chest_pain_type_3, and Chest_pain_type_4, as shown in the following table.
Table 5 - Data discretization for attributes having finite values
The second case: if A is a continuous quantitative attribute, or a categorical one with value domain {v1, v2, ..., vp} where p is relatively large, A is mapped to q new binary columns of the form <A: start1..end1>, <A: start2..end2>, ..., <A: startq..endq>. The value of a given record at column <A: starti..endi> is True (Yes, or 1) if the original value v of this record at A lies between starti and endi; <A: starti..endi> receives False (No, or 0) in all other cases. The attributes Age, Serum cholesterol, and Maximum heart rate in Table 4 are of this form. Serum cholesterol and Age could be discretized as shown in the two following tables:
Table 6 - Data discretization for the "Serum cholesterol" attribute
Table 7 - Data discretization for the "Age" attribute
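The interval mapping of the second case can be sketched as follows; the partition of the Age domain mirrors the [1..29], [30..59], [60..120] split discussed below, while the function and column names are ours, not from the thesis.

AGE_INTERVALS = [(1, 29), (30, 59), (60, 120)]

def discretize(value, intervals, name="Age"):
    # one boolean column per interval, True where the value falls inside
    return {f"<{name}: {lo}..{hi}>": lo <= value <= hi for lo, hi in intervals}

print(discretize(59, AGE_INTERVALS))  # only <Age: 30..59> is True
print(discretize(60, AGE_INTERVALS))  # only <Age: 60..120> is True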
Unfortunately, the above discretization methods encounter some pitfalls, such as the "sharp boundary problem" [4] [9]. The figure below displays the support distribution of an attribute A with a value range from 1 to 10. Suppose we divide A into two separate intervals, [1..5] and [6..10]. If the minsup value is 41%, the interval [6..10] will not gain sufficient support: it cannot satisfy minsup (40% < minsup = 41%) even though there is a large amount of support near its left boundary. For example, [4..7] has support 55% and [5..8] has support 45%. This partition thus creates a "sharp boundary" between 5 and 6, and mining algorithms cannot generate confident rules involving the interval [6..10].
Another attribute partitioning method [38] divides the attribute domain into overlapped regions, so the boundaries of adjacent intervals overlap each other. As a result, elements located near a boundary contribute to more than one interval, and some intervals may become interesting in this case. This is, however, not reasonable, because the total support of all intervals exceeds 100% and we unintentionally overemphasize the importance of values located near the boundaries. This is neither natural nor consistent.
Furthermore, partitioning the attribute domain into separate ranges creates a problem in rule interpretation. Table 7 shows that the two ages 29 and 30 belong to different intervals even though they indicate very similar levels of age. Also, supposing that the range [1..29] denotes young people, [30..59] middle-aged people, and [60..120] old people, the age of 59 implies a middle-aged person whereas the age of 60 implies an old person. This is neither intuitive nor natural for understanding the meaning of quantitative association rules.
Fuzzy association rules were proposed to address these problems [4] [9]. This kind of rule not only successfully alleviates the "sharp boundary problem" but also allows us to express association rules in a more intuitive and friendly format. For example, the quantitative rule "<Age: 54..74> AND <Sex: Female> AND <Cholesterol: 200..300> => <Heart disease: Yes>" can now be replaced by "<Age_Old> AND <Sex: Female> AND <Cholesterol_High> => <Heart disease: Yes>". Age_Old and Cholesterol_High in the above rule are fuzzy attributes.
3.2 Fuzzy association rules
3.2.1 Data discretization based on fuzzy sets
In fuzzy set theory [21] [47], an element can belong to a set with a membership value in [0, 1]. This value is assigned by the membership function associated with each fuzzy set. For an attribute x with domain Dx (also known as the universal set), the membership function associated with a fuzzy set fx is the mapping:

m_fx(x): Dx → [0, 1]    (3.1)
Trang 23I he fuzzy sei provides a smooth change over the boundaries and allows us to express association rules in a more expressive form, b e t' s use the fuzzy set in data
d iserctizing to make the most o f its benefits
For the attribute Age with universal domain [0, 120], we attach three fuzzy sets: Age_Young, Age_Middle-aged, and Age_Old. The graphical representations of these fuzzy sets are shown in the following figure.
Figure 5 - Membership functions of fuzzy sets associated with the "Age" attribute
By using fuzzy sets we completely get rid of the "sharp boundary problem" thanks to their inherent characteristics. For example, the graph in Figure 5 indicates that the ages of 59 and 60 have membership values in the fuzzy set Age_Old of approximately 0.85 and 0.90, respectively. Similarly, the membership values of the ages 30 and 29 in the fuzzy set Age_Young are 0.70 and 0.75. Obviously, this transformation method is much more intuitive and natural than the known discretization methods.
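A minimal sketch of such membership functions follows. The breakpoints are assumptions chosen only so that the curves reproduce the values quoted above (0.85 and 0.90 for Age_Old at ages 59 and 60; 0.75 and 0.70 for Age_Young at ages 29 and 30); they are not read off Figure 5.

def ramp_up(x, a, b):
    # 0 below a, 1 above b, linear in between: a smooth boundary
    return min(1.0, max(0.0, (x - a) / (b - a)))

def age_old(x):    # assumed: starts rising at 42, full membership from 62
    return ramp_up(x, 42, 62)

def age_young(x):  # assumed: full membership up to 24, fades out by 44
    return 1.0 - ramp_up(x, 24, 44)

print(age_old(59), age_old(60))      # 0.85 0.9
print(age_young(29), age_young(30))  # 0.75 0.7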
As another example, the original attribute Serum cholesterol is decomposed into two new fuzzy attributes, Cholesterol_Low and Cholesterol_High. The following figure portrays the membership functions of these fuzzy concepts.
Figure 6 - Membership functions of "Cholesterol_Low" and "Cholesterol_High"
If A is a categorical attribute with value domain {v1, v2, ..., vk} and k is relatively small, we fuzzify this attribute by attaching a new fuzzy attribute A_Vi to each value vi. The membership function value m_A_Vi(x) equals 1 if x = vi and 0 otherwise. Strictly speaking, A_Vi is then an ordinary (crisp) set, because its membership function value is either 0 or 1. If k is too large, we can fuzzify the attribute by dividing its domain into intervals and attaching a new fuzzy attribute to each partition. However, developers or users should consult experts for the necessary knowledge about the data to achieve an appropriate division.
Data discretization using fuzzy sets can bring the following benefits:
Firstly, the smooth transition of membership functions helps us eliminate the "sharp boundary problem".
Secondly, data discretization using fuzzy sets significantly reduces the number of new attributes, because the number of fuzzy sets associated with an original attribute is relatively small compared to the number of intervals produced for an attribute in quantitative association rules. For instance, if we use normal discretization methods on the attribute Serum cholesterol, we obtain five sub-ranges (and thus five new attributes) from its original domain [100, 600], whereas we create only two new attributes, Cholesterol_Low and Cholesterol_High, by applying fuzzy sets. This advantage is essential because it compacts the set of candidate itemsets and therefore shortens the total mining time.
Thirdly, fuzzy association rules are more intuitive and natural than the previously known kinds of rules.
All values of records at the new attributes after fuzzification lie in [0, 1], expressing the possibility that a given element belongs to a fuzzy set. As a result, this flexible encoding offers an exact way to measure the contribution or impact of each record on the overall support of an itemset.
The next advantage, which will become clearer in the next section, is that fuzzified databases still hold the "downward closure property" (all subsets of a frequent itemset are also frequent, and any superset of a non-frequent itemset is non-frequent), provided the T-norm operator is chosen wisely. Thus, conventional algorithms such as Apriori also work well on fuzzified databases with only slight modifications.
Another benefit is that this data discretization method can easily be applied to both relational and transactional databases.
Table 8 - The diagnostic database of heart disease for 13 patients
Let I = {i1, i2, ..., in} be a set of n attributes, where iu denotes the u-th attribute in I, and let T = {t1, t2, ..., tm} be a set of m records, where tv is the v-th record in T. The value of record tv at attribute iu is referred to as tv[iu]. For instance, in Table 8 the value of t5[i2] (i.e., the value of t5[Serum cholesterol]) is 274 (mg/ml). Using the fuzzification method of the previous section, we associate each attribute iu with a set F_iu of fuzzy sets. For example, with the database in Table 8 we have:

F_Age = {Age_Young, Age_Middle-aged, Age_Old} (with k = 3)
F_Serum cholesterol = {Cholesterol_Low, Cholesterol_High} (with k = 2)
A fuzzy association rule [4] [9] is an implication of the form:

X is A => Y is B    (3.3)

Where:

• X, Y ⊆ I are itemsets: X = {x1, x2, ..., xp} (xi ≠ xj if i ≠ j) and Y = {y1, y2, ..., yq} (yi ≠ yj if i ≠ j);
• A = {fx1, fx2, ..., fxp} and B = {fy1, fy2, ..., fyq} are sets of fuzzy sets corresponding to the attributes in X and Y, with fxi ∈ F_xi and fyj ∈ F_yj.

We can rewrite the fuzzy association rule in the two following forms:

X = {x1, ..., xp} is A = {fx1, ..., fxp} => Y = {y1, ..., yq} is B = {fy1, ..., fyq}    (3.4)

(x1 is fx1) AND ... AND (xp is fxp) => (y1 is fy1) AND ... AND (yq is fyq)    (3.5)
A f u z z y item se t is now defined as a pair <X A>, in which X ( c I) is an itemset and A
is a set o f fuzzy sets associated with attributes in X
The support of a fuzzy itemset <X, A> is denoted fs(<X, A>) and determined by the following formula:

fs(<X, A>) = ( Σv=1..|T| [ m_fx1(tv[x1]) ⊗ m_fx2(tv[x2]) ⊗ ... ⊗ m_fxp(tv[xp]) ] ) / |T|    (3.6)

Where:

• X = {x1, ..., xp} and tv is the v-th record in T;
• ⊗ is a T-norm operator from fuzzy logic theory; its role is similar to that of the logical operator AND in classical logic;
• |T| (the cardinality of T) is the total number of records in T (equal to m).
A frequent fuzzy itemset: a fuzzy itemset <X, A> is frequent if its support is greater than or equal to a user-specified fuzzy minimum support (fminsup), i.e., fs(<X, A>) ≥ fminsup.
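Formula (3.6) translates directly into code. In this sketch the T-norm defaults to the min function (one of the choices discussed next), and the two sample records with their membership values are hypothetical.

from functools import reduce

def fuzzy_support(fuzzy_attrs, records, tnorm=min):
    # fs(<X, A>): sum over records of the T-norm of memberships, divided by |T|
    total = sum(reduce(tnorm, (rec[a] for a in fuzzy_attrs)) for rec in records)
    return total / len(records)

records = [
    {"Age_Old": 0.85, "BloodSugar_0": 1.0},
    {"Age_Old": 0.0,  "BloodSugar_0": 1.0},
]
print(fuzzy_support(["Age_Old", "BloodSugar_0"], records))  # (0.85 + 0.0) / 2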
The T-norm operator (⊗): there are various ways to choose the T-norm operator [1] [2] [21] [47] in formula (3.6), such as (see the sketch after this list):

• Min function: a ⊗ b = min(a, b)
• Normal (algebraic) multiplication: a ⊗ b = a·b
• Limited multiplication: a ⊗ b = max(0, a + b - 1)
• Drastic multiplication: a ⊗ b = a (if b = 1), = b (if a = 1), = 0 (if a, b < 1)
• Yager operator: a ⊗ b = 1 - min[1, ((1 - a)^w + (1 - b)^w)^(1/w)] (with w > 0). If w = 1 it becomes limited multiplication; as w grows to +∞ it approaches the min function; as w decreases to 0 it becomes drastic multiplication.
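Sketches of these operators follow; yager(w) returns a T-norm for a given w, and the printed checks illustrate the limiting behaviour noted in the last item.

def t_min(a, b):      return min(a, b)
def t_product(a, b):  return a * b
def t_limited(a, b):  return max(0.0, a + b - 1.0)

def t_drastic(a, b):
    if b == 1.0: return a
    if a == 1.0: return b
    return 0.0

def yager(w):
    def t(a, b):
        return 1.0 - min(1.0, ((1 - a) ** w + (1 - b) ** w) ** (1.0 / w))
    return t

print(yager(1)(0.7, 0.6), t_limited(0.7, 0.6))  # both ~0.3 (w = 1)
print(yager(50)(0.7, 0.6), t_min(0.7, 0.6))     # a large w approaches min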
Based on experiments, we conclude that the min function and normal multiplication are the two most preferable choices for the T-norm operator, because they make it convenient to calculate support factors and they highlight the logical relations among fuzzy attributes in frequent fuzzy itemsets. The two following formulas, (3.12) and (3.13), are derived from formula (3.6) by applying the min function and normal multiplication, respectively:

fs(<X, A>) = ( Σv=1..|T| min( m_fx1(tv[x1]), ..., m_fxp(tv[xp]) ) ) / |T|    (3.12)

fs(<X, A>) = ( Σv=1..|T| Πj=1..p m_fxj(tv[xj]) ) / |T|    (3.13)
Another reason for choosing the min function and algebraic multiplication for the T-norm operator is related to the question "how do we understand the meaning of the implication operator (→ or =>) in fuzzy logic theory?". In classical logic, the implication operator, used to link two clauses P and Q into a compound clause P → Q, expresses the idea "if P then Q". This is a relatively sophisticated logical link because it is used to represent a cause-and-effect relation. While formalizing, however, we treat the truth value of this relation as a regular combination of the truth values of P and Q. This assumption may lead us to a misconception or misunderstanding of this kind of relation [1].
In fuzzy logic theory, the implication operator expresses a compound clause of the form "if u is P then v is Q", in which P and Q are two fuzzy sets on the universal domains U and V, respectively. The cause-and-effect rule "if u is P then v is Q" is understood to mean that the pair (u, v) forms a fuzzy set on the universal domain U×V. The fuzzy implication P → Q is thus considered a fuzzy set, and we need to identify its membership function m_{P→Q} from the membership functions m_P and m_Q of the fuzzy sets P and Q. There is a variety of research on this issue; we relate here several ways to determine this membership function [1]:

If we adopt the idea of the implication operator from classical logic, we have: ∀(u, v) ∈ U×V: m_{P→Q}(u, v) = ⊕(1 - m_P, m_Q), in which ⊕ is an S-norm operator in fuzzy logic theory. If ⊕ is replaced with the max function, we obtain the Dienes formula m_{P→Q}(u, v) = max(1 - m_P, m_Q). If ⊕ is replaced with the probabilistic sum, we receive the Mizumoto formula m_{P→Q}(u, v) = 1 - m_P + m_P·m_Q. And if ⊕ is substituted by the limited sum, we get the Lukasiewicz formula m_{P→Q}(u, v) = min(1, 1 - m_P + m_Q).

In general, ⊕ can be substituted by any valid function satisfying the conditions of an S-norm operator.
Another way to interpret this kind of relation is that the truth value of the compound clause "if u is P then v is Q" increases if and only if the truth values of both the antecedent and the consequent are large. This means that m_{P→Q}(u, v) = ⊗(m_P, m_Q). If the ⊗ operator is substituted with the min function, we receive the Mamdani formula m_{P→Q}(u, v) = min(m_P, m_Q). Similarly, if ⊗ is replaced by normal multiplication, we obtain the formula m_{P→Q}(u, v) = m_P·m_Q [2].
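For reference, the implication formulas just mentioned can be written out directly; the function names are ours.

def dienes(mp, mq):       return max(1 - mp, mq)
def mizumoto(mp, mq):     return 1 - mp + mp * mq
def lukasiewicz(mp, mq):  return min(1.0, 1 - mp + mq)
def mamdani_min(mp, mq):  return min(mp, mq)
def mamdani_prod(mp, mq): return mp * mq

for f in (dienes, mizumoto, lukasiewicz, mamdani_min, mamdani_prod):
    print(f.__name__, f(0.8, 0.6))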
A fuzzy association rule is, in a sense, a form of fuzzy implication. Thus, it should in part comply with the above ideas. Although there are many combinations of m_P and m_Q that can form m_{P→Q}(u, v), the Mamdani formulas are the most favorable ones. This is the main reason behind our choice of the min function and algebraic multiplication for the T-norm operator.
3.2.3 Algorithm for fuzzy association rules mining
The problem of discovering fuzzy association rules is usually decomposed into the two following phases:

Phase one: finding all frequent fuzzy itemsets <X, A> in the input database, i.e., itemsets with fs(<X, A>) ≥ fminsup.

Phase two: generating all confident fuzzy association rules from the frequent fuzzy itemsets discovered above. This subproblem is relatively straightforward and less time-consuming compared to the previous step. If <X, A> is a frequent fuzzy itemset, the rules derived from <X, A> have the form X' is A' => X \ X' is A \ A', in which X' and A' are non-empty proper subsets of X and A, respectively. The backslash (\) denotes set subtraction, and fc, the fuzzy confidence factor of the rule, must satisfy fc ≥ fminconf.
The inputs of the algorithm are a database D with attribute set I and record set T, together with fminsup and fminconf.

The outputs of the algorithm are all confident fuzzy association rules.

Notation table:

I_F: set of fuzzy attributes in D_F; each of them is attached to a fuzzy set. Each fuzzy set f, in turn, has a threshold w_f as used in formula (3.7).
T_F: set of records in D_F; the value of each record at a given fuzzy attribute is in [0, 1].
C_k: set of candidate fuzzy k-itemsets.
F_k: set of frequent fuzzy k-itemsets.
F: set of all frequent itemsets in database D_F.
fminsup: fuzzy minimum support.
fminconf: fuzzy minimum confidence.

Table 9 - Notations used in the fuzzy association rules mining algorithm
The algorithm:

2. (D_F, I_F, T_F) = FuzzyMaterialization(D, I, T);
3. F_1 = Counting(D_F, I_F, T_F, fminsup);

Table 10 - The algorithm for mining fuzzy association rules
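The level-wise control flow around steps 2 and 3 can be sketched as follows, assuming the subprograms described below (Join, Prune, Checking); the helpers are passed as parameters since their bodies are given separately.

def mine_fuzzy_frequent_itemsets(D, I, T, fminsup, fuzzy_materialization,
                                 counting, join, prune, checking):
    # level-wise loop: F_1, then C_k -> Prune -> F_k until nothing is frequent
    DF, IF, TF = fuzzy_materialization(D, I, T)    # step 2 in Table 10
    F_prev = counting(DF, IF, TF, fminsup)         # step 3: frequent 1-itemsets
    all_frequent = []
    while F_prev:
        all_frequent.extend(F_prev)
        Ck = join(F_prev)                   # candidate k-itemsets from F_{k-1}
        Ck = prune(Ck, F_prev)              # downward-closure pruning
        F_prev = checking(Ck, DF, fminsup)  # keep candidates with fs >= fminsup
    return all_frequent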
The algorithm in Table 10 uses the following subprograms:

(D_F, I_F, T_F) = FuzzyMaterialization(D, I, T): this function converts the original database D into the fuzzified database D_F; I and T are transformed into I_F and T_F, respectively. For example, with the database in Table 8, after running this function we obtain:

I_F = {[Age, Age_Young] (1), [Age, Age_Middle-aged] (2), [Age, Age_Old] (3), [Cholesterol, Cholesterol_Low] (4), [Cholesterol, Cholesterol_High] (5), [BloodSugar, BloodSugar_0] (6), [BloodSugar, BloodSugar_1] (7), [HeartDisease, HeartDisease_No] (8), [HeartDisease, HeartDisease_Yes] (9)}
After converting, I_F contains 9 new fuzzy attributes, compared to 4 attributes in I. Each fuzzy attribute is a pair, surrounded by square brackets, consisting of the name of the original attribute and the name of the corresponding fuzzy set. For instance, after fuzzifying the Age attribute, we receive three new fuzzy attributes: [Age, Age_Young], [Age, Age_Middle-aged], and [Age, Age_Old].
In addition, the function FuzzyMaterialization converts T into T_F, as shown in the following table:

Table 11 - T_F: values of records at attributes after fuzzification
Note that the characters A, C, S, and H in Table 11 are the first characters of Age, Cholesterol, Sugar, and Heart, respectively. Each fuzzy set f is accompanied by a threshold w_f, so only values greater than or equal to that threshold are taken into consideration; all other values are set to 0. The gray cells in Table 11 indicate values greater than or equal to the threshold (all thresholds in Table 11 are 0.5), while all values located in white cells are 0.
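A sketch of the per-record fuzzification with thresholding that FuzzyMaterialization performs; it reuses the age_old and age_young helpers from the earlier membership-function sketch, and the dictionary layout is an assumption.

W_F = 0.5  # the threshold used for every fuzzy set in Table 11

def fuzzify_record(record, fuzzy_sets):
    # map an original record to membership values of each fuzzy attribute,
    # zeroing out memberships that fall below the threshold w_f
    out = {}
    for attr, value in record.items():
        for fset_name, mfunc in fuzzy_sets.get(attr, {}).items():
            m = mfunc(value)
            out[f"[{attr}, {fset_name}]"] = m if m >= W_F else 0.0
    return out

fuzzy_sets = {"Age": {"Age_Old": age_old, "Age_Young": age_young}}
print(fuzzify_record({"Age": 60}, fuzzy_sets))
# {'[Age, Age_Old]': 0.9, '[Age, Age_Young]': 0.0}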
F_1 = Counting(D_F, I_F, T_F, fminsup): this function generates F_1, the set of all frequent fuzzy 1-itemsets. All elements of F_1 must have supports greater than or equal to fminsup. For instance, applying normal multiplication as the T-norm (⊗) operator in formula (3.6) with fminsup = 46%, we obtain the candidate 1-itemsets and supports shown in the following table (the itemsets marked frequent form F_1):

{[Age, Age_Young]} (1): support 10%, not frequent
{[Age, Age_Middle-aged]} (2): support 45%, not frequent
{[Age, Age_Old]} (3): support 76%, frequent
{[Serum cholesterol, Cholesterol_Low]} (4): support 43%, not frequent
{[Serum cholesterol, Cholesterol_High]} (5): support 16%, not frequent
{[BloodSugar, BloodSugar_0]} (6): support 85%, frequent
{[BloodSugar, BloodSugar_1]} (7): support 15%, not frequent
{[HeartDisease, HeartDisease_No]} (8): support 54%, frequent
{[HeartDisease, HeartDisease_Yes]} (9): support 46%, frequent

Table 12 - C_1: set of candidate 1-itemsets
Trang 32INS ERT INTO Ck
S E L E C T p.i i, p.i2 p.ik-i, q.ik-i
FR OM Lk., p Lk_, q
W H E R E p.i| = q.i|, p.ik_2 = q.ik-2, P - i k - i < q.ik-i AND p.ik.,.o * q.ik.|.o;
Here, p.i_j and q.i_j are the index numbers of the j-th fuzzy attributes in itemsets p and q, respectively, while p.i_j.o and q.i_j.o are the index numbers of their original attributes. Two fuzzy attributes sharing a common original attribute must not occur in the same fuzzy itemset. For example, after running the above SQL command we obtain C_2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}. The 2-itemset {8, 9} is invalid because its two fuzzy attributes are derived from the common attribute HeartDisease.

C_k = Prune(C_k): this function prunes unnecessary candidate k-itemsets from C_k thanks to the downward closure property ("all subsets of a frequent itemset are also frequent, and any superset of a non-frequent itemset is non-frequent"). To accept a k-itemset in C_k, the Prune function must make sure that all of its (k-1)-subsets are present in F_{k-1}. For instance, after pruning, C_2 is still {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}, since every 1-subset of these candidates is frequent.
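A Python sketch of the Join and Prune steps, mirroring the SQL join above and the downward-closure check; representing itemsets as sorted tuples of fuzzy-attribute indexes and supplying an origin map are our assumptions.

from itertools import combinations

def join(F_prev, origin):
    Ck = []
    for p in F_prev:
        for q in F_prev:
            # share the first k-2 members; p's last index is smaller than q's;
            # and the joined attributes come from different original attributes
            if p[:-1] == q[:-1] and p[-1] < q[-1] and origin[p[-1]] != origin[q[-1]]:
                Ck.append(p + (q[-1],))
    return Ck

def prune(Ck, F_prev):
    kept = set(F_prev)
    return [c for c in Ck
            if all(sub in kept for sub in combinations(c, len(c) - 1))]

origin = {3: "Age", 6: "BloodSugar", 8: "HeartDisease", 9: "HeartDisease"}
F1 = [(3,), (6,), (8,), (9,)]
print(join(F1, origin))  # [(3, 6), (3, 8), (3, 9), (6, 8), (6, 9)]; {8, 9} is rejected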
F_k = Checking(C_k, D_F, fminsup): this function first scans the whole set of transactions in the database to update the support factors of the candidate itemsets in C_k. Afterwards, Checking eliminates any infrequent candidate itemset, i.e., any whose support is smaller than fminsup. All frequent itemsets are retained and put into F_k. After running F_2 = Checking(C_2, D_F, 46%), we receive F_2 = {{3, 6}, {6, 8}}. The following table displays the detailed information.