VIETNAM NATIONAL UNIVERSITY, HANOI
FACULTY OF TECHNOLOGY

PHAN XUAN HIEU

PARALLEL MINING FOR FUZZY ASSOCIATION RULES

Major: Information Technology
Chapter 1 Introduction to Data mining
1.1 Data mining
1.1.1 Data mining: Motivation
1.1.2 Data mining: Definition
1.1.3 Main steps in Knowledge discovery in databases (KDD)
1.2 Major approaches and techniques in Data mining
1.2.1 Major approaches and techniques in Data mining
1.2.2 Kinds of data could be mined
1.2.3 Applications of Data mining
1.2.4 Classification of Data mining systems
1.3 Focused issues in Data mining
2.1 Association rules: Motivation
2.2 Association rules mining - Problem statement
2.3 Main research trends in Association rules mining
Chapter 3 Fuzzy association rules mining
3.1 Quantitative association rules
3.1.1 Association rules with quantitative and categorical attributes
3.1.2 Methods of data discretization
3.2 Fuzzy association rules
3.2.1 Data discretization based on fuzzy set
3.2.2 Fuzzy association rules
3.2.3 Algorithm for fuzzy association rules mining
4.1 Several previously proposed parallel algorithms
4.2 A new parallel algorithm for fuzzy association rules mining
References
Figure 1 - The volume of data strongly increases in the past two decades
Figure 3 - Illustration of an association rule
Figure 4 - "Sharp boundary problem" in data discretization
Figure 5 - Membership functions of fuzzy sets associated with "Age" attribute
Figure 6 - Membership functions of "Cholesterol_Low" and "Cholesterol_High"
Figure 7 - The processing time increases dramatically as decreasing the minsup
Figure 8 - Number of itemsets and rules strongly increase as reducing the minsup
Figure 9 - The number of rules enlarges remarkably as decreasing the minsup
Figure 10 - Processing time increases largely as slightly increasing the number of attributes
Figure 11 - Processing time increases linearly as increasing the number of records
Figure 12 - Optional choices for T-norm operator
Figure 13 - The mining results reflect the changing of threshold values
Figure 14 - Count distribution algorithm on a 3-processor parallel system
Figure 15 - Data distribution algorithm on a 3-processor parallel system
Figure 16 - The rule generating time largely reduces as increasing the minconf
Figure 17 - The number of rules largely reduces when increasing the minconf
Figure 18 - The illustration for the division algorithm
Figure 19 - Processing time largely reduces as increasing the number of processors
Figure 20 - Mining time largely depends on the number of processors (logical, physical)
Figure 21 - The main interface window of the FuzzyARM tool
Figure 22 - The sub-window for adding new entries
Figure 23 - The window for viewing mining results
Table 5 - Data discretization for attributes having finite values
Table 6 - Data discretization for "Serum cholesterol" attribute
Table 7 - Data discretization for "Age" attribute
Table 8 - The diagnostic database of heart disease on 13 patients
Table 10 - The algorithm for mining fuzzy association rules
Table 11 - Values of records at attributes after fuzzifying
Table 13 - The set of frequent 2-itemsets
Table 14 - Fuzzy association rules generated from the database in table 8
Table 15 - The sequential algorithm for generating association rules
Table 16 - Fuzzy attributes received after fuzzifying the database in table 8
Table 17 - Fuzzy attributes dividing algorithm among processors
Chapter 1 Introduction to Data mining
1.1 Data mining
1.1.1 Data mining: Motivation
The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic devices (e.g. hard disk, CD-ROM, etc.). This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every two years, and the size and number of databases are increasing even faster. Figure 1 illustrates the data explosion [3].
Figure 1 - The volume of data strongly increases in the past two decades
We are drowning in data, but starving for useful knowledge. The vast amount of accumulated data is actually a valuable resource, because information is the vital factor for business operations, and decision-makers could make the most of the data to gain precious insight into the business before making decisions. Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most significant information in their data collections (databases, data warehouses, data repositories). The automated, prospective analyses offered by data mining go beyond the normal analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. This is where data mining & knowledge discovery in databases demonstrates its obvious benefits for today's competitive business environment. Nowadays, data mining & KDD has been taking on a key role in computer science and knowledge engineering.
The initial applications of data mining were only in commerce (retail) and finance (stock market). However, data mining is now widely and successfully applied in other fields such as bio-informatics, medical treatment, telecommunication, education, etc.
1.1.2 Data mining: Definition
Before discussing some definitions of data mining, I have a small explanation about terminology so that readers can avoid unnecessary confusion. As mentioned before, we can roughly understand data mining as a process of extracting nontrivial, implicit, previously unknown, and potentially useful knowledge from huge sets of data. Thus we should name this process knowledge discovery in databases (KDD) instead of data mining. However, most researchers agree that the two terminologies (data mining and KDD) are similar and can be used interchangeably. They explain this "humorous misnomer" by noting that the core motivation of KDD is the useful knowledge, but the main object they have to deal with during the mining process is data. Thus, in a sense, data mining and KDD imply the same meaning. However, in several materials, data mining is sometimes referred to as one step in the whole KDD process [3] [43].
There are numerous definitions of data mining, and they are all descriptive. I would like to restate herein some of them that are widely accepted.

Definition one: W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, 1991 [43]:
"Knowledge discovery in databases, also known as data mining, is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Definition two: M. Holsheimer and A. Siebes (1994):
"Data Mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database."
1.1.3 Main steps in Knowledge discovery in databases (KDD)
The whole KDD process is usually decomposed into the following steps [3] [14] [23]:
Data selection: selecting or segmenting the necessary data that needs to be mined from large data sets (databases, data warehouses, data repositories) according to some criteria.
Data preprocessing: this is the data cleaning and reconfiguration stage, where some techniques are applied to deal with incomplete, noisy, and inconsistent data. This step also tries to reduce data by using aggregate and group functions, data compression methods, histograms, sampling, etc. Furthermore, discretization techniques (binning, histograms, cluster analysis, entropy-based discretization, segmentation) can be used to reduce the number of values of a given continuous attribute by dividing the range of the attribute into separated intervals. After this step, data is clean, complete, uniform, reduced, and discretized.
Data transformation: in this step, data are transformed or consolidated into forms appropriate for mining. Data transformation can involve data smoothing and normalization. After this step, data are ready for the mining step.
Data mining: this is considered to be the most important step in the KDD process. It applies data mining techniques (chiefly borrowed from machine learning and other fields) to discover and extract useful patterns or relationships from data.
Knowledge representation and evaluation: the patterns identified by the system in the previous step are interpreted into knowledge that can then be used to support human decision-making (e.g. prediction and classification tasks, summarizing the contents of a database, or explaining observed phenomena). Knowledge representation also converts patterns into user-readable expressions such as trees, graphs, charts, and tables.
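The binning discretization mentioned in the preprocessing step can be sketched in a few lines. The routine below is only an illustrative equal-width binning sketch; the function name and the sample ages are hypothetical, not taken from the thesis.

```python
def equal_width_bins(values, k):
    """Partition the range of a continuous attribute into k equal-width
    intervals and map each value to the index of its interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = []
    for v in values:
        # Clamp so the maximum value falls into the last interval.
        idx = min(int((v - lo) / width), k - 1)
        bins.append(idx)
    return bins

ages = [23, 29, 35, 41, 52, 60, 74]
print(equal_width_bins(ages, 3))  # -> [0, 0, 0, 1, 1, 2, 2]
```

Entropy-based or clustering-based discretization would choose the cut points from the data distribution instead of splitting the range evenly, but the overall mapping from values to interval labels is the same.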
1.2 Major approaches and techniques in Data mining
1.2.1 Major approaches and techniques in Data mining
Data mining consists of many approaches. They can be classified according to functionality, kind of knowledge, type of data to be mined, or whatever criteria are appropriate [14]. I describe the major approaches below:

Classification & prediction: this method tries to arrange a given object into an appropriate class among the others. The number of classes and their names are definitely known. For example, we can classify or anticipate geographic regions according to weather and climate data. This approach normally uses typical techniques and concepts in machine learning such as decision tree, artificial neural network, k-NN, support vector machine, etc. Classification is also called supervised learning.
Association rules: this is a relatively simple form of rule, e.g. "80 percent of men that purchase beer also purchase dry beef". Association rules are now successfully applied in supermarkets (retail), medicine, bio-informatics, finance & stock market, etc.
Sequential/temporal patterns mining: this method is somewhat similar to association rules except that data and mining results (a kind of rule) always contain a temporal attribute to exhibit the order or sequence in which events or objects affect each other. This approach plays a key role in finance and stock market thanks to its capability of prediction.
Clustering & segmentation: this method tries to arrange a given object into a suited category (also known as a cluster). The number of clusters may be dynamic and their labels (names) are unknown. Clustering and segmentation are also called unsupervised learning.
Concept description & summarization: the main objective of this method is to describe or summarize an object so that the obtained information is compact and condensed. Document or text summarization may be a typical example.
1.2.2 Kinds of data could be mined
Data mining can work on various kinds of data. The most typical data types are as follows:
Relational databases: databases organized according to the relational model. Most of the existing database management systems support this kind of model, such as Oracle, IBM DB2, MS SQL Server, MS Access, etc.
Multidimensional databases: this kind of database is also called data warehouse, data mart, etc. The data selected from different sources contain a historical feature thanks to an implicit or explicit temporal attribute. This kind of database is used primarily in data mining and decision-making support systems.
Transactional databases: this kind of database is commonly used in supermarkets, banking, etc. Each transaction includes a certain number of items (e.g. items may be goods in an order) and a transactional database, in turn, contains a certain number of transactions.
Object-relational databases: this database model is a hybrid of the object-oriented model and the relational model.
Spatial, temporal, and time-series data: this kind of data always contains either spatial (e.g. map) or temporal (e.g. stock market) attributes.
Multimedia databases: this kind of data includes audio, image, video, text, www, and many other data formats. Today, this kind of data is widely used on the Internet thanks to its useful applications.
1.2.3 Applications of Data mining
Although data mining is a relatively new research trend, it attracts many researchers because of its practical applications in many areas. The following are typical applications: (1) Data analysis and decision-making support: this application is popular in commerce (retail industry), finance & stock market, etc. (2) Medical treatment: finding the potential relevance among symptoms, diagnoses, and treatment methods (nutrient prescription, surgery, etc.). (3) Text and Web mining: document summarization, text retrieval and text searching, text and hypertext classification. (4) Bio-informatics: searching and comparing typical or special genetic information such as genomes and DNA, or the implicit relations between a set of genomes and a genetic disease. (5) Finance & stock market: examining data to extract predicted information for the price of a certain kind of coupon. (6) Others: telecommunication, medical insurance, astronomy, anti-terrorism, sports, etc.
1.2.4 Classification of Data mining systems
Data mining is a knowledge engineering related field that involves many other research areas such as databases, machine learning, artificial intelligence, high performance computing, data & knowledge visualization, etc. We could classify data mining systems according to different criteria as follows:

Classifying based on the kind of data to be mined: data mining systems work with relational databases, data warehouses, transactional databases, object-oriented databases, spatial and temporal databases, multimedia databases, text and web databases, etc.

Classifying based on the type of mined knowledge: data mining tools that return summarization or description, association rules, classification or prediction, clustering, etc.

Classifying based on the kind of techniques used: data mining tools that work as online analytical processing (OLAP) systems, use machine learning techniques (decision tree, artificial neural network, k-NN, genetic algorithm, support vector machine, rough set, fuzzy set, etc.), data visualization, etc.

Classifying based on the fields the data mining systems are applied to: data mining systems are used in different fields such as commerce (retail industry), telecommunication, bio-informatics, medical treatment, finance & stock market, medical insurance, etc.
1.3 Focused issues in Data mining
Data mining is a relatively new research topic, so there are several pending or unconvincingly solved issues. I relate herein some of them that are attracting much attention from data mining researchers.

(1) OLAM (Online Analytical Mining) is a smooth combination of databases, data warehouses, and data mining. Nowadays, database management systems like Oracle, MS SQL Server, and IBM DB2 have integrated OLAP and data warehouse functionalities to facilitate users in data retrieval and data analysis, although these add-in supports also charge users an additional sum of money. Researchers in these fields hope to go beyond the current limitation by developing multi-purpose OLAM systems that support data transactions for daily business operations as well as data analysis for decision making [14]. (2) Data mining systems can mine various forms of knowledge from different types of data [14] [7]. (3) How to enhance the performance, accuracy, scalability, and integration of data mining systems? How to decrease the computational complexity? How to improve the ability to deal with incomplete, inconsistent, and noisy data? These three questions should still be concentrated on in the future [14]. (4) Taking advantage of background knowledge or knowledge from users (experts or specialists) to upgrade the total performance of data mining systems [7] [1]. (5) Parallel and distributed data mining is an interesting research trend because it makes use of powerful computing systems to reduce response time. This is essential because more and more real-time applications are needed in today's competitive world [5] [8] [12] [18] [26] [31] [32] [34] [42]. (6) Data Mining Query Language (DMQL): researchers in this area try to design a standard query language for data mining. This language would be used in OLAM systems just as SQL is widely used in relational databases [14]. (7) Knowledge representation and visualization are also taken into consideration to express knowledge in human-readable and easy-to-use forms. Knowledge can be represented in more intuitive expressions thanks to multidimensional or multilevel data structures.
This thesis primarily involves mining fuzzy association rules and parallel algorithms for mining fuzzy association rules.
2.1 Association rules: Motivation
An association rule has the form of "70 percent of customers that purchase beer also purchase dry beef, 20 percent of customers purchase both" or "75 percent of patients who smoke cigarettes and live near polluted areas also get lung cancer, 25 percent of patients smoke and live near polluted areas as well as suffer from lung cancer". "Purchase beer" and "smoke cigarettes and live near polluted areas" are called antecedents; "purchase dry beef" and "get lung cancer" are called consequents of the association rules. 20% and 25% are called support factors (the percentage of transactions or records that contain both the antecedent and the consequent of a rule); 70% and 75% are called confidence factors (the percentage of transactions or records that hold the antecedent and also hold the consequent of a rule). The following figure pictorially depicts the former example of association rules.
[Figure: diagram relating the number of transactions that buy beer, the number of transactions that buy dry beef, and the 70% of beer-buying transactions that also purchase dry beef]

Figure 3 - Illustration of an association rule
The knowledge and information derived from association rules differ obviously in meaning from those of normal queries (usually in SQL syntax). This knowledge contains previously unknown relationships and predictions hidden in massive volumes of data. It does not merely result from usual group, aggregate, or sort operations, but from a complicated and time-consuming computing process.

Although association rules are a simple kind of rule, they carry useful knowledge and contribute substantially to the decision-making process. Unearthing significant rules from databases is the main motivation of researchers.
2.2 Association rules mining - Problem statement
Let I = {i1, i2, ..., im} be a set of m items or attributes (in transactional or relational databases, respectively) and T = {t1, t2, ..., tn} be a set of n transactions or records (in transactional or relational databases, respectively). Each transaction is identified by its unique TID number. A (transactional) database D is a binary relation δ on the Cartesian product I×T (also written δ ⊆ I×T). If an item i occurs in a transaction t, we write (i, t) ∈ δ, or iδt. Generally speaking, a transactional database is a set of transactions, where each transaction t contains a set of items, i.e. t ∈ 2^I (where 2^I is the power set of I) [24] [36].
For example, consider the sample transactional database shown in table 1, with I = {A, C, D, T, W} and T = {1, 2, 3, 4, 5, 6}.

TID | Items
 1  | A, C, T, W
 2  | C, D, W
 3  | A, C, T, W
 4  | A, C, D, W
 5  | A, C, D, T, W
 6  | C, D, T

Table 1 - An example of transactional databases
X ⊆ I is called an itemset. The support factor of an itemset X, denoted s(X), is the percentage of transactions that contain X. An itemset X is frequent if its support is greater than or equal to a user-specified minimum support (minsup) value, i.e. s(X) ≥ minsup [36].

The following table enumerates all frequent itemsets in the sample database in table 1, with a minsup value of 50%.

Table 2 - Frequent itemsets in the sample database in table 1 with support ≥ 50%
An association rule is an implication of the form X → Y, where X and Y are frequent itemsets that are disjoint, i.e. X ∩ Y = ∅, and c, the confidence factor of the rule, is the conditional probability that a transaction contains Y given that it contains X, i.e. c = s(X ∪ Y) / s(X). A rule is confident if its confidence factor is larger than or equal to a user-specified minimum confidence (minconf) value, i.e. c ≥ minconf [36].

The association rules mining task can be stated as follows:

Let D be a (transactional) database, and let minsup and minconf be the minimum support and minimum confidence, respectively. The mining task tries to discover all frequent and confident association rules X → Y, i.e. s(X ∪ Y) ≥ minsup and c ≥ minconf. The task is usually decomposed into two phases:

Phase one: discovering all frequent itemsets. This phase is the most expensive one, dominated by heavy computation (CPU-bound) and I/O operations (I/O-bound).
Phase two: generating confident association rules from the frequent itemsets discovered in the previous phase. If X is a frequent itemset, the confident association rules created from X have the form X' → X \ X', where X' is any non-empty subset of X and X \ X' is the subtraction of X' from X. This step is relatively straightforward and much less time-consuming than phase one.
The following table lists all possible association rules generated from the frequent itemset ACW (from the database in table 1) with minconf = 70%.

Table 3 - Association rules generated from frequent itemset ACW
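The two phases just described can be illustrated with a small brute-force sketch over the table 1 database. This is a naive enumeration written for clarity, not the optimized candidate-generation strategy of real miners such as Apriori; the function names are illustrative.

```python
from itertools import combinations

# The sample transactional database from table 1.
DB = {1: {"A", "C", "T", "W"}, 2: {"C", "D", "W"}, 3: {"A", "C", "T", "W"},
      4: {"A", "C", "D", "W"}, 5: {"A", "C", "D", "T", "W"}, 6: {"C", "D", "T"}}

def support(itemset):
    """Support factor s(X): the fraction of transactions containing X."""
    return sum(itemset <= t for t in DB.values()) / len(DB)

def frequent_itemsets(minsup):
    """Phase one: enumerate every itemset and keep the frequent ones."""
    items = sorted(set().union(*DB.values()))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            if support(x) >= minsup:
                freq[x] = support(x)
    return freq

def rules_from(itemset, minconf):
    """Phase two: confident rules X' -> X \\ X' from one frequent itemset X."""
    found = []
    for k in range(1, len(itemset)):
        for ante in combinations(sorted(itemset), k):
            a = frozenset(ante)
            c = support(itemset) / support(a)  # confidence c = s(X) / s(X')
            if c >= minconf:
                found.append((set(a), set(itemset - a), c))
    return found

print(len(frequent_itemsets(0.5)))            # -> 19 frequent itemsets at 50%
print(len(rules_from(frozenset("ACW"), 0.7))) # -> 5 confident rules from ACW
```

Phase one dominates the cost here too: the loop examines all 2^5 - 1 candidate itemsets, which is exactly the exponential blow-up the pruning strategies discussed in section 2.3 are designed to avoid.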
2.3 Main research trends in Association rules mining
Since being proposed by R. Agrawal in 1993 [36], the field of association rules mining has developed in various new directions thanks to a variety of improvements from researchers. Some proposals try to enhance precision and performance, some try to tune the interestingness of rules, etc. I list herein some of the dominant trends.
Mining binary or boolean association rules: this is the initial research direction of association rules. Most of the early mining algorithms are related to this kind of rule [20] [38] [36]. In binary association rules, an item is only determined to be present or absent; the quantity associated with each item is fully ignored, e.g. a transaction buying twenty bottles of beer is treated the same as a transaction that buys only one bottle. The most well-known algorithms for mining binary association rules are Apriori and its variants (AprioriTid and AprioriHybrid) [35]. An example of this type of rule is "buying bread = 'yes' AND buying sugar = 'yes' => buying milk = 'yes' AND buying butter = 'yes', with support 20% and confidence 80%".
Quantitative and categorical association rules: attributes in databases may be binary (boolean), numeric (quantitative), nominal (categorical), etc. To discover association rules that involve these data types, quantitative and categorical attributes need to be discretized into binary ones. Some discretization methods are proposed in [34] [39]. An example of this kind of rule is "sex = 'male' AND age ∈ [50..65] AND weight ∈ [60..80] AND sugar in blood > 120mg/ml => blood pressure = 'high', with support 30% and confidence 65%".
Fuzzy association rules: this type of rule was suggested to overcome several drawbacks of quantitative association rules, such as the "sharp boundary problem" and poor semantic expression. Fuzzy association rules are more natural and intuitive to users thanks to their "fuzzy" characteristics. An example is "dry cough = 'yes' AND high fever AND muscle aches = 'yes' AND breathing difficulties = 'yes' => get SARS (Severe Acute Respiratory Syndrome) = 'yes', with support 4% and confidence 80%". High fever in the above rule is a fuzzy attribute: we measure the body temperature based on a fuzzy concept.
Multi-level association rules: all the kinds of association rules above are too concrete, so they cannot reflect relationships from a general view. Multi-level or generalized association rules were devised to surmount this problem [15] [37]. In this approach, we would prefer a rule like "buy PC = 'yes' => buy operating system = 'yes' AND buy office tools = 'yes'" rather than "buy IBM PC = 'yes' => buy Microsoft Windows = 'yes' AND buy Microsoft Office = 'yes'". Obviously, the former rule is the generalized form of the latter and the latter is the specific form of the former.
Association rules with weighted items (or attributes): we use a weight associated with each item to indicate the level at which that item contributes to the rule. In other words, weights are used to measure the importance of items. For example, while surveying the SARS plague within a certain group of people, the information on body temperature and respiratory system is much more essential than that on age. To reflect the difference between the above attributes, we attach greater weight values to the body temperature and respiratory system attributes. This is an attractive research branch, and solutions were presented in several papers [10] [44]. By using weights, we can discover scarce association rules of high interestingness, i.e. we can retain rules with small supports but a special meaning.
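As a toy illustration of the idea: the tiny database, the weight values, and the mean-weight formula below are all assumptions made for demonstration, since the cited papers define weighted support in several different ways.

```python
# Hypothetical symptom transactions and item weights (not from the thesis).
DB = [{"fever", "cough"}, {"fever", "age_old"}, {"fever", "cough", "age_old"}]
WEIGHTS = {"fever": 0.9, "cough": 0.8, "age_old": 0.2}

def support(itemset):
    """Plain (unweighted) support: fraction of transactions containing X."""
    return sum(itemset <= t for t in DB) / len(DB)

def weighted_support(itemset):
    """Scale support by the mean item weight, so itemsets built from
    important items (fever, cough) keep higher scores than itemsets
    diluted by unimportant ones (age_old)."""
    mean_w = sum(WEIGHTS[i] for i in itemset) / len(itemset)
    return mean_w * support(itemset)

print(weighted_support({"fever", "cough"}))    # 0.85 * 2/3
print(weighted_support({"fever", "age_old"}))  # 0.55 * 2/3
```

Both itemsets have the same plain support (2/3), yet the weighting separates them, which is exactly how a rare but heavily weighted itemset can survive a threshold that plain support would fail.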
Besides examining variants of association rules, researchers pay attention to accelerating the phase of discovering frequent itemsets. Most of the recommended algorithms try to reduce the number of frequent itemsets that need to be mined by developing new theories of maximal frequent itemsets [11] (the MAFIA algorithm) and closed itemsets [13] (the CLOSET algorithm), [24] (the CHARM algorithm), [30]. These new approaches considerably decrease mining time owing to their delicate pruning strategies. Experiments show that these algorithms outperform known ones like Apriori, AprioriTid, etc.
Parallel and distributed algorithms for association rules mining: in addition to sequential or serial algorithms, parallel algorithms have been invented to enhance the total performance of the mining process by making use of robust parallel systems. The advent of parallel and distributed data mining is widely welcomed because the size of databases increases sharply and real-time applications have become common in recent years. Numerous parallel algorithms for mining association rules were devised during the past ten years [5] [12] [18] [26] [31] [32] [34]. They are both platform dependent and platform independent.
Mining association rules from the point of view of rough set theory [41].
Furthermore, there exist other research trends such as online association rule mining [33], in which data mining tools are integrated with or directly connected to data warehouses or data repositories based on well-known technologies such as OLAP, MOLAP, ROLAP, ADO, etc.
Chapter 3 Fuzzy association rules mining
3.1 Quantitative association rules
3.1.1 Association rules with quantitative and categorical attributes
Mining quantitative and categorical association rules is an important task because of its practical applications on real-world databases. This kind of association rule first appeared in [34].

Table 4 - Diagnostic database of heart disease on 17 patients
In the above database, three attributes (Age, Serum cholesterol, and Maximum heart rate) are quantitative, two attributes (Chest pain type and Resting electrocardiographics) are categorical, and all the rest (Sex, Heart disease, Fasting blood sugar) are binary. In fact, the binary data type is also considered to be a special form of category. From the data in table 4, we can extract rules such as:

<Age: 54..74> AND <Sex: Female> AND <Cholesterol: 200..300> => <Heart disease: Yes>, with support 23.53% and confidence 80%

<Sex: Male> AND <Resting electrocardiographics: 0> AND <Fasting blood sugar < 120> => <Heart disease: No>, with support 17.65% and confidence 100%
The approach proposed in [34] discovers this kind of rule by partitioning the value ranges of quantitative and categorical attributes into separated intervals to convert them into binary ones. Traditional well-known algorithms such as Apriori [35], CHARM [24], and CLOSET [20] can then work on these new binary attributes as in the original problem of mining boolean association rules.
3.1.2 Methods of data discretization
Binary association rules mining algorithms [20] [24] [35] [36] only work with relational databases containing only binary attributes, or with transactional databases as shown in table 1. They cannot be applied directly to practical databases like the one shown in table 4. In order to conquer this obstacle, quantitative and categorical columns must first be converted into boolean ones [34] [39]. However, there remain some limitations in data discretization that influence the quality of the discovered rules, and the output rules do not satisfy researchers' expectations. The following section describes the major discretization methods to contrast their disadvantages.
The first case: let A be a discrete quantitative or categorical attribute with finite value domain {v1, v2, ..., vk}, where k is small enough (k < 100). After being discretized, the original attribute is developed into k new binary attributes named A_V1, A_V2, ..., A_Vk. The value of a record at column A_Vi is equal to True (Yes or 1) if the original value of this record at attribute A equals vi; in all remaining cases the value of A_Vi is set to False (No or 0). The attributes Chest pain type and Resting electrocardiographics in table 4 belong to this case. After transforming, the initial attribute Chest pain type will be converted into four binary columns, Chest_pain_type_1, Chest_pain_type_2, Chest_pain_type_3, and Chest_pain_type_4, as shown in the following table.

Chest pain type (1, 2, 3, 4) | Chest_pain_type_1 | Chest_pain_type_2 | Chest_pain_type_3 | Chest_pain_type_4

Table 5 - Data discretization for attributes having finite values
The second case: if A is a continuous quantitative attribute, or a categorical one having value domain {v1, v2, ..., vp} (p relatively large), A will be mapped to q new binary columns of the form <A: start1..end1>, <A: start2..end2>, ..., <A: startq..endq>. The attributes Serum cholesterol and Maximum heart rate in table 4 belong to this form. Serum cholesterol and Age could be discretized as shown in the two following tables:

Table 6 - Data discretization for "Serum cholesterol" attribute

Table 7 - Data discretization for "Age" attribute
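A minimal sketch of this second-case mapping for the Age attribute, using the intervals [1..29], [30..59], and [60..120] discussed later in this section (the function and column names are illustrative, not the thesis's exact table 7):

```python
# Binary interval columns <Age: start..end> for the second discretization case.
INTERVALS = [("Age_1_29", 1, 29), ("Age_30_59", 30, 59), ("Age_60_120", 60, 120)]

def discretize_age(age):
    """Set exactly one interval column to 1 (True) and all others to 0 (False)."""
    return {name: int(lo <= age <= hi) for name, lo, hi in INTERVALS}

# Ages 59 and 60 differ by one year yet land in different columns,
# which is the interpretation problem discussed later in this section.
print(discretize_age(59))  # {'Age_1_29': 0, 'Age_30_59': 1, 'Age_60_120': 0}
print(discretize_age(60))  # {'Age_1_29': 0, 'Age_30_59': 0, 'Age_60_120': 1}
```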
Unfortunately, the mentioned discretization methods encounter some pitfalls, such as the "sharp boundary problem" [4] [9]. The figure below displays a given attribute A having a value range from 1 to 10. Suppose that we divide A into two separated intervals, [1..5] and [6..10]. If the minsup value is 41%, the range [6..10] will not gain sufficient support; therefore [6..10] cannot satisfy minsup even though there is a large support near its left boundary. For example, [4..7] has support 54% and [5..8] has support 45%. So this partition results in a "sharp boundary" between 5 and 6, and therefore mining algorithms cannot generate confident rules involving the interval [6..10].
Moreover, these discretization methods unintentionally overemphasize the importance of values located near the boundaries, which is neither natural nor consistent.

Furthermore, partitioning an attribute domain into separated ranges results in a problem of rule interpretation. Table 7 shows that the two values 29 and 30 belong to different intervals even though they are very similar in indicating age. Also, supposing that the interval [1..29] denotes young people, [30..59] middle-aged people, and [60..120] old ones, the age of 59 implies a middle-aged person whereas the age of 60 implies an old person. This is not intuitive and natural in understanding the meaning of quantitative association rules.
Fuzzy association rules were recommended to overcome the above shortcomings [4] [9]. This kind of rule not only successfully mitigates the "sharp boundary problem" but also allows us to express association rules in a more intuitive and friendly format. For instance, the quantitative rule "<Age: 54..74> AND <Sex: Female> AND <Cholesterol: 200..300> => <Heart disease: Yes>" is now replaced by "<Age_Old> AND <Sex: Female> AND <Cholesterol_High> => <Heart disease: Yes>". Age_Old and Cholesterol_High in the above rule are fuzzy attributes.
3.2 Fuzzy association rules
3.2.1 Data discretization based on fuzzy set
In fuzzy set theory [21] [47], an element can belong to a set with a membership value in [0, 1]. This value is assigned by the membership function associated with each fuzzy set. For an attribute x and its domain D_x (also known as the universal set), the membership function associated with a fuzzy set f_x has the form:

m_{f_x}: D_x -> [0, 1]   (3.1)
Fuzzy sets provide a smooth change over the boundaries and allow us to express association rules in a more expressive form. Let us therefore use fuzzy sets in data discretization to make the most of these benefits.

For the attribute Age and its universal domain [0, 120], we attach three fuzzy sets: Age_Young, Age_Middle-aged, and Age_Old. The graphic representations of these fuzzy sets are shown in the following figure.

Figure 5 - Membership functions of fuzzy sets associated with the "Age" attribute
By using fuzzy sets, we completely get rid of the "sharp boundary problem" thanks to their smooth characteristics. For example, the graph in figure 5 indicates that the ages 59 and 60 have membership values in the fuzzy set Age_Old of approximately 0.85 and 0.90 respectively. Similarly, the membership values of the ages 30 and 29 in the fuzzy set Age_Young are 0.70 and 0.75 respectively. Obviously, this transformation method is much more intuitive and natural than the known discretization methods.
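The smooth boundary behaviour described above can be reproduced with simple piecewise-linear membership functions. The following is only a sketch: the breakpoint values are hypothetical, chosen so that the functions return the membership degrees quoted in the text, and are not taken from the thesis's actual figure.

```python
def rising_shoulder(x, a, b):
    """Membership rising linearly from 0 at a to 1 at b (right shoulder)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

# Hypothetical breakpoints (not from the thesis figure): Age_Old rises from
# 0 at age 42 to 1 at age 62; Age_Young falls from 1 at age 24 to 0 at 44.
def age_old(age):
    return rising_shoulder(age, 42, 62)

def age_young(age):
    return 1.0 - rising_shoulder(age, 24, 44)

# Neighbouring ages get almost identical degrees -- no sharp boundary:
# age_old gives roughly 0.85 / 0.90 for ages 59 / 60,
# age_young gives roughly 0.75 / 0.70 for ages 29 / 30.
```

With crisp intervals, ages 59 and 60 would fall on opposite sides of a hard cut; here their degrees differ by only 0.05.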
As another example, the original attribute Serum cholesterol is decomposed into two new fuzzy attributes, Cholesterol_Low and Cholesterol_High. The following figure portrays the membership functions of these fuzzy concepts.

Figure 6 - Membership functions of "Cholesterol_Low" and "Cholesterol_High"
If A is a categorical attribute having the value domain {v_1, v_2, ..., v_k} and k is relatively small, we fuzzify this attribute by attaching a new fuzzy attribute A_V_i to each value v_i. The value of the membership function m_{A_V_i}(x) equals 1 if x = v_i and equals 0 otherwise. Strictly speaking, A_V_i is also a normal (crisp) set, because its membership value is either 0 or 1. If k is too large, we can fuzzify this attribute by ...
Data discretization using fuzzy sets brings the following benefits:

Firstly, the smooth transition of the membership functions helps us eliminate the "sharp boundary problem".

Secondly, data discretization using fuzzy sets significantly reduces the number of new attributes, because the number of fuzzy sets associated with each original attribute is relatively small compared to the number of intervals that attribute yields in quantitative association rules. For instance, if we use normal discretization methods over the attribute Serum cholesterol, we obtain five sub-ranges (and thus five new attributes) from its original domain [100, 600], whereas we create only two new attributes, Cholesterol_Low and Cholesterol_High, by applying fuzzy sets. This advantage is essential because it allows us to compact the set of candidate itemsets and therefore shorten the total mining time.

Thirdly, fuzzy association rules are more intuitive and natural than the known kinds.

Fourthly, all values of the records at the new attributes after fuzzifying lie in [0, 1], expressing the degree to which a given element belongs to a fuzzy set. As a result, this flexible encoding offers an exact method to measure the contribution or impact of each record on the overall support of an itemset.
The next advantage, which we will see more clearly in the next section, is that fuzzified databases still hold the "downward closure property" (all subsets of a frequent itemset are also frequent, and any superset of a non-frequent itemset is not frequent), provided that we make a wise choice of the T-norm operator. Thus, conventional algorithms such as Apriori also work well on fuzzified databases with just slight modifications.

Another benefit is that this data discretization method can be easily applied to both relational and transactional databases.
Table 8 - The diagnostic database of heart disease of 13 patients
Let I = {i_1, i_2, ..., i_n} be a set of n attributes, where i_u denotes the u-th attribute in I, and let T = {t_1, t_2, ..., t_m} be a set of m records, where t_v is the v-th record in T. The value of record t_v at attribute i_u is referred to as t_v[i_u]. For instance, in table 8 the value of t_3[i_2] (that is, the value of t_3[Serum cholesterol]) is 274 (mg/ml). Using the fuzzification method in the previous section, we associate each attribute i_u with a set of fuzzy sets F_{i_u} = {f_u1, f_u2, ..., f_uk}.

For example, with the database in table 8, we have:

F_Age = {Age_Young, Age_Middle-aged, Age_Old} (with k = 3)

F_Serum_cholesterol = {Cholesterol_Low, Cholesterol_High} (with k = 2)
A fuzzy association rule is an implication of the form:

(x_1 is f_1) AND ... AND (x_p is f_p) => (y_1 is g_1) AND ... AND (y_q is g_q)   (3.5)
A fuzzy itemset is now defined as a pair <X, A>, in which X (⊆ I) is an itemset and A is a set of fuzzy sets associated with the attributes in X.

The support of a fuzzy itemset <X, A> is denoted fs(<X, A>) and determined by the following formula:

fs(<X, A>) = ( Σ_{v=1..|T|} [ m_{f_1}(t_v[x_1]) ⊗ m_{f_2}(t_v[x_2]) ⊗ ... ⊗ m_{f_p}(t_v[x_p]) ] ) / |T|   (3.6)

where:

* X = {x_1, x_2, ..., x_p}, t_v is the v-th record in T, and m_{f_u} is the membership function of the fuzzy set f_u ∈ A attached to attribute x_u

* ⊗ is the T-norm operator in fuzzy logic theory; its role is similar to that of the logical operator AND in traditional logic

A frequent fuzzy itemset: a fuzzy itemset <X, A> is frequent if its support is greater than or equal to the fuzzy minimum support (fminsup) specified by users, i.e. fs(<X, A>) ≥ fminsup.
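As a minimal sketch of the support computation in formula (3.6), assume the records have already been fuzzified so that each value is a membership degree in [0, 1]. The attribute names and data below are hypothetical, and the T-norm defaults to the min function:

```python
def fuzzy_support(itemset, records, tnorm=min):
    """fs(<X, A>): combine each record's membership degrees for the itemset's
    fuzzy attributes with the T-norm, then average over all records (3.6)."""
    total = 0.0
    for rec in records:
        degree = rec[itemset[0]]
        for attr in itemset[1:]:
            degree = tnorm(degree, rec[attr])
        total += degree
    return total / len(records)

# Two toy fuzzified records (hypothetical values, not the thesis's table 8):
records = [
    {"Age_Old": 0.9, "BloodSugar_0": 1.0},
    {"Age_Old": 0.4, "BloodSugar_0": 0.0},
]
# min(0.9, 1.0) = 0.9 and min(0.4, 0.0) = 0.0, so fs = (0.9 + 0.0) / 2 = 0.45
fs = fuzzy_support(["Age_Old", "BloodSugar_0"], records)
```

Note how each record contributes its combined degree, not a 0/1 vote as in crisp association rules.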
The support of a fuzzy association rule is defined as:

fs(<X is A => Y is B>) = fs(<X ∪ Y, A ∪ B>)   (3.10)

A fuzzy association rule is frequent if its support is larger than or equal to fminsup, i.e. fs(<X is A => Y is B>) ≥ fminsup.
The confidence factor of a fuzzy association rule is denoted fc(X is A => Y is B) and defined as:

fc(X is A => Y is B) = fs(<X is A => Y is B>) / fs(<X, A>)   (3.11)

A fuzzy association rule is considered confident if its confidence is greater than or equal to a fuzzy minimum confidence (fminconf) threshold specified by users. This means that the confidence must satisfy the condition:

fc(X is A => Y is B) ≥ fminconf
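The rule measures (3.10) and (3.11) follow directly: the rule's support is the support of the combined itemset, and its confidence divides that by the antecedent's support. A sketch over already-fuzzified records (hypothetical data), reusing the averaging of formula (3.6):

```python
def fs(cols, rows, tnorm=min):
    """Fuzzy support of an itemset over fuzzified rows, as in (3.6)."""
    def combine(row):
        v = row[cols[0]]
        for c in cols[1:]:
            v = tnorm(v, row[c])
        return v
    return sum(combine(r) for r in rows) / len(rows)

def rule_support(x_cols, y_cols, rows, tnorm=min):
    """fs(<X is A => Y is B>) = fs(<X U Y, A U B>)   (3.10)"""
    return fs(x_cols + y_cols, rows, tnorm)

def rule_confidence(x_cols, y_cols, rows, tnorm=min):
    """fc = fs(<X U Y, A U B>) / fs(<X, A>)   (3.11)"""
    return rule_support(x_cols, y_cols, rows, tnorm) / fs(x_cols, rows, tnorm)

rows = [{"A": 0.8, "B": 0.5}, {"A": 0.6, "B": 0.6}]
# rule_support = (min(0.8, 0.5) + min(0.6, 0.6)) / 2 = 0.55
# confidence   = 0.55 / ((0.8 + 0.6) / 2) = 0.55 / 0.7
```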
Choosing the T-norm (⊗): there are various ways to choose the T-norm operator [1] [2] [21] [47] for formula (3.6), such as:

* Min function: a ⊗ b = min(a, b)

* Normal multiplication: a ⊗ b = ab

* Limited multiplication: a ⊗ b = max(0, a + b - 1)

* Drastic multiplication: a ⊗ b = a (if b = 1), = b (if a = 1), = 0 (if a, b < 1)

* Yager joint operator: a ⊗ b = 1 - min[1, ((1 - a)^w + (1 - b)^w)^(1/w)] (with w > 0). If w = 1, it becomes limited multiplication; as w runs up to infinity, it develops into the min function; as w decreases to 0, it becomes drastic multiplication.
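The listed operators can be written down directly; a sketch follows, in which the Yager operator's limit behaviour is checked numerically rather than symbolically:

```python
def t_min(a, b):
    return min(a, b)

def t_product(a, b):          # algebraic (normal) multiplication
    return a * b

def t_bounded(a, b):          # limited multiplication
    return max(0.0, a + b - 1.0)

def t_drastic(a, b):          # drastic multiplication
    if b == 1.0:
        return a
    if a == 1.0:
        return b
    return 0.0

def t_yager(a, b, w):         # Yager joint operator, w > 0
    return 1.0 - min(1.0, ((1.0 - a) ** w + (1.0 - b) ** w) ** (1.0 / w))

# t_yager(., ., 1) coincides with t_bounded; for large w it approaches t_min.
```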
Based on experiments, we conclude that the min function and normal multiplication are the two most preferable choices for the T-norm operator, because the support computations derived from formula (3.6) by applying the min function and normal multiplication are convenient to implement and evaluate.

Another reason for choosing the min function and algebraic multiplication for the T-norm operator is related to the question "how do we understand the meaning of the implication operator (=>) in fuzzy logic theory?". In classical logic, the implication operator, used to link two clauses P and Q into the compound clause P -> Q, expresses the idea "if P then Q". This is a relatively sophisticated logical link because it is used to represent a cause-and-effect relation. While formalizing it, however, we consider the truth value of this relation as a regular combination of the truth values of P and Q. This assumption may lead to a misconception or a misunderstanding of this kind of relation.

Suppose P and Q are fuzzy sets on the universal domains U and V respectively. The cause-and-effect rule "if u is P then v is Q" is understood so that the pair (u, v) forms a fuzzy set on the universal domain U x V. The fuzzy implication P -> Q is then considered a fuzzy set, and we need to identify its membership function m_{P->Q} from the membership functions m_P and m_Q of the fuzzy sets P and Q. There is a variety of research around this issue; we relate herein several ways to determine the membership function m_{P->Q}.

If we adopt the idea of the implication operator in classical logic theory, we have, for all (u, v) ∈ U x V: m_{P->Q}(u, v) = ⊕(1 - m_P, m_Q), in which ⊕ is an S-norm operator in fuzzy logic theory. If ⊕ is replaced with the max function, we obtain the Dienes formula m_{P->Q}(u, v) = max(1 - m_P, m_Q). If ⊕ is replaced with the probability sum, we receive the Mizumoto formula m_{P->Q}(u, v) = 1 - m_P + m_P * m_Q. And if ⊕ is substituted by the limited sum, we get the Lukasiewicz formula m_{P->Q}(u, v) = min(1, 1 - m_P + m_Q).
In general, ⊕ can be substituted by any valid function satisfying the conditions of an S-norm operator.

Another way to interpret the meaning of this kind of relation is that the truth value of the compound clause "if u is P then v is Q" is large if and only if the truth values of both the antecedent and the consequent are large. This means that m_{P->Q}(u, v) = ⊗(m_P, m_Q). If the ⊗ operator is substituted with the min function, we receive the Mamdani formula m_{P->Q}(u, v) = min(m_P, m_Q). Similarly, if ⊗ is replaced by normal multiplication, we obtain the formula m_{P->Q}(u, v) = m_P * m_Q [2].

A fuzzy association rule, in a sense, is a form of fuzzy implication; thus it must, in part, comply with the above ideas. Although there are many combinations of m_P and m_Q to form m_{P->Q}(u, v), the Mamdani formulas should be the most favorable ones. This is the main reason that influences our choice of the min function and algebraic multiplication for the T-norm operator.
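The implication formulas above are one-liners; the sketch below writes them out so the two readings (an S-norm over (1 - m_P, m_Q) versus a direct conjunction) can be compared on concrete membership degrees:

```python
def dienes(mp, mq):        # S-norm = max
    return max(1.0 - mp, mq)

def mizumoto(mp, mq):      # S-norm = probability sum
    return 1.0 - mp + mp * mq

def lukasiewicz(mp, mq):   # S-norm = limited sum
    return min(1.0, 1.0 - mp + mq)

def mamdani(mp, mq):       # conjunction reading, combined with min
    return min(mp, mq)

def product_form(mp, mq):  # conjunction reading, normal multiplication
    return mp * mq
```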
3.2.3 Algorithm for fuzzy association rules mining
Phase two: generating all possible confident fuzzy association rules from the frequent fuzzy itemsets discovered above. This subproblem is relatively straightforward and less time-consuming compared to the previous step. If <X, A> is a frequent fuzzy itemset, the rules we derive from <X, A> have the form (X \ X') is (A \ A') => X' is A', in which X' and A' are non-empty subsets of X and A respectively. The backslash (i.e. the \ sign) in the implication denotes the subtraction operator between two sets. fc is the fuzzy confidence factor of the rule and must meet the condition fc ≥ fminconf.
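Phase two can be sketched as follows over already-fuzzified records. The data and confidence threshold are hypothetical, and itemset supports are recomputed on the fly rather than cached as a real implementation would:

```python
from itertools import combinations

def generate_rules(frequent_itemsets, rows, fminconf, tnorm=min):
    """For each frequent fuzzy itemset X, emit every rule whose antecedent is
    a non-empty proper subset of X and whose confidence meets fminconf."""
    def fs(cols):
        total = 0.0
        for row in rows:
            v = row[cols[0]]
            for c in cols[1:]:
                v = tnorm(v, row[c])
            total += v
        return total / len(rows)

    rules = []
    for itemset in frequent_itemsets:
        whole = fs(itemset)
        for r in range(1, len(itemset)):           # non-empty proper antecedents
            for ante in combinations(itemset, r):
                conf = whole / fs(ante)            # (3.11)
                if conf >= fminconf:
                    cons = tuple(i for i in itemset if i not in ante)
                    rules.append((ante, cons, conf))
    return rules

rows = [{3: 0.8, 6: 1.0}, {3: 0.6, 6: 0.5}]
rules = generate_rules([(3, 6)], rows, 0.9)
# fs({3,6}) = 0.65, fs({3}) = 0.7, fs({6}) = 0.75: only 3 => 6 reaches 0.9
```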
The inputs of the algorithm are a database D with attribute set I and record set T, together with fminsup and fminconf. The outputs of the algorithm are all possible confident fuzzy association rules.
Notation table:
D_F, I_F, T_F: the fuzzified database, its set of fuzzy attributes, and its set of records. Each fuzzy attribute in I_F is attached with a fuzzy set, and each fuzzy set f has a threshold w_f. The value of each record in T_F at a given fuzzy attribute lies in [0, 1].

F: the set of all possible frequent itemsets from the database.

fminsup: the fuzzy minimum support; fminconf: the fuzzy minimum confidence.
Table 10 - The algorithm for mining fuzzy association rules

The algorithm in table 10 uses the following sub-programs:
(D_F, I_F, T_F) = FuzzyMaterialization(D, I, T): this function converts the original database D into the fuzzified database D_F; I and T are transformed into I_F and T_F respectively. For example, with the database in table 8, after running this function we obtain:

I_F = {[Age, Age_Young] (1), [Age, Age_Middle-aged] (2), [Age, Age_Old] (3), [Cholesterol, Cholesterol_Low] (4), [Cholesterol, Cholesterol_High] (5), [BloodSugar, BloodSugar_0] (6), [BloodSugar, BloodSugar_1] (7), [HeartDisease, HeartDisease_No] (8), [HeartDisease, HeartDisease_Yes] (9)}
After converting, I_F contains 9 new fuzzy attributes compared to the 4 attributes in I. Each fuzzy attribute is a pair, surrounded by square brackets, that includes the name of the original attribute and the name of the corresponding fuzzy set. For instance, after fuzzifying the Age attribute we receive three new fuzzy attributes: [Age, Age_Young], [Age, Age_Middle-aged], and [Age, Age_Old].
In addition, the function FuzzyMaterialization converts T into T_F, as shown in table 11, which holds the values of the records at the attributes after fuzzifying. Note that the characters A, C, S, and H in table 11 are the first characters of Age, Cholesterol, Sugar, and Heart respectively. Each fuzzy set f is accompanied by a threshold w_f, so only values greater than or equal to that threshold are taken into consideration; all other values are set to 0. The gray cells in table 11 indicate values larger than or equal to the threshold (all thresholds in table 11 are 0.5), and all values located in white cells are equal to 0.
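FuzzyMaterialization can be sketched as below. The membership function and its breakpoints are hypothetical; the only behaviour taken from the text is that each new attribute is keyed by the (original attribute, fuzzy set) pair and that degrees below the fuzzy set's threshold w_f are zeroed:

```python
def fuzzy_materialize(records, fuzzy_sets, thresholds):
    """Turn crisp records into fuzzified ones: each (attribute, fuzzy set)
    pair becomes a new attribute whose value is the membership degree,
    zeroed when it falls below that fuzzy set's threshold w_f."""
    fuzzified = []
    for rec in records:
        frec = {}
        for attr, sets in fuzzy_sets.items():
            for set_name, member in sets.items():
                degree = member(rec[attr])
                if degree < thresholds.get(set_name, 0.0):
                    degree = 0.0
                frec[(attr, set_name)] = degree
        fuzzified.append(frec)
    return fuzzified

# Hypothetical Age_Old membership rising from 0 at age 42 to 1 at age 62:
age_old = lambda a: min(1.0, max(0.0, (a - 42) / 20.0))
rows = fuzzy_materialize(
    [{"Age": 60}, {"Age": 45}],
    {"Age": {"Age_Old": age_old}},
    {"Age_Old": 0.5},
)
# (60-42)/20 = 0.90 is kept; (45-42)/20 = 0.15 < 0.5 is zeroed
```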
F_1 = Counting(D_F, I_F, T_F, fminsup): this function generates F_1, the set of all frequent fuzzy 1-itemsets. All elements in F_1 must have supports greater than or equal to fminsup. For instance, applying normal multiplication for the T-norm (⊗) operator in formula (3.6) with fminsup = 46%, we achieve the F_1 shown in the following table:
{[BloodSugar, BloodSugar_0]} (6)    ...    Yes
{[HeartDisease, HeartDisease_No]} (8)    54%    Yes

Table 12 - C_1: set of candidate 1-itemsets
Hence F_1 = {{3}, {6}, {8}, {9}}.
C_k = Join(F_{k-1}): this function produces the set of all fuzzy candidate k-itemsets (C_k) based on the set of frequent fuzzy (k-1)-itemsets (F_{k-1}) discovered in the previous step. The following SQL statement indicates how elements of F_{k-1} are combined to form candidate k-itemsets:
INSERT INTO C_k
SELECT p.i_1, p.i_2, ..., p.i_{k-1}, q.i_{k-1}
FROM F_{k-1} p, F_{k-1} q
WHERE p.i_1 = q.i_1, ..., p.i_{k-2} = q.i_{k-2}, p.i_{k-1} < q.i_{k-1} AND p.i_{k-1}.o <> q.i_{k-1}.o
In this statement, p.i_j and q.i_j are the index numbers of the j-th fuzzy attributes in itemsets p and q respectively, while p.i_j.o and q.i_j.o are the index numbers of the corresponding original attributes. Two fuzzy attributes sharing a common original attribute must not appear in the same fuzzy itemset. For example, after running the above SQL command we obtain C_2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}. The 2-itemset {8, 9} is invalid because its two fuzzy attributes are derived from the common attribute HeartDisease.
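The Join step can be sketched in plain code using the running example's indices. Here `origin` maps each fuzzy attribute's index to its original attribute, mirroring the `p.i.o <> q.i.o` condition in the SQL; this is a sketch, not the thesis's implementation:

```python
def join(f_prev, origin):
    """C_k = Join(F_{k-1}): combine pairs of (k-1)-itemsets that share their
    first k-2 fuzzy attributes, ordering the last items (p < q) and rejecting
    pairs whose last items come from the same original attribute."""
    candidates = []
    for p in f_prev:
        for q in f_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1] and origin[p[-1]] != origin[q[-1]]:
                candidates.append(p + q[-1:])
    return candidates

# F_1 and the attribute origins from the running example:
origin = {3: "Age", 6: "BloodSugar", 8: "HeartDisease", 9: "HeartDisease"}
c2 = join([(3,), (6,), (8,), (9,)], origin)
# {8, 9} is rejected (both from HeartDisease), leaving the five candidates of C_2
```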
C_k = Prune(C_k): this function prunes unnecessary candidate k-itemsets from C_k thanks to the downward closure property: "all subsets of a frequent itemset are also frequent, and any superset of a non-frequent itemset is not frequent". To evaluate the usefulness of a k-itemset in C_k, the Prune function makes sure that all of its (k-1)-subsets are present in F_{k-1}. For instance, after pruning we still have C_2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}.
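The Prune step is a direct application of downward closure; a minimal sketch:

```python
from itertools import combinations

def prune(candidates, f_prev):
    """Keep a candidate k-itemset only if every one of its (k-1)-subsets is
    a frequent (k-1)-itemset (the downward closure property)."""
    prev = set(f_prev)
    return [c for c in candidates
            if all(sub in prev for sub in combinations(c, len(c) - 1))]

# Running example: every 1-subset of C_2 is in F_1, so nothing is pruned.
f1 = [(3,), (6,), (8,), (9,)]
c2 = [(3, 6), (3, 8), (3, 9), (6, 8), (6, 9)]
pruned = prune(c2, f1)
```

A hypothetical candidate such as {3, 7} would be dropped, since {7} is not frequent.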
F_k = Checking(C_k, D_F, fminsup): this function first scans the whole set of records in the database to update the support factors of the candidate itemsets in C_k. Afterwards, Checking eliminates any infrequent candidate itemset, i.e. one whose support is smaller than fminsup. All frequent itemsets are retained and put into F_k. After running F_2 = Checking(C_2, D_F, 46%), we receive F_2 = {{3, 6}, {6, 8}}. The following table displays the detailed information.
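Finally, the Checking step can be sketched as one scan that accumulates each candidate's support, as in formula (3.6), and filters by fminsup. The rows below are hypothetical fuzzified records keyed by fuzzy-attribute index:

```python
def checking(candidates, rows, fminsup, tnorm=min):
    """F_k = Checking(C_k, D_F, fminsup): compute each candidate's fuzzy
    support in one pass over the records and keep only the frequent ones."""
    frequent = []
    for cand in candidates:
        total = 0.0
        for row in rows:
            v = row[cand[0]]
            for item in cand[1:]:
                v = tnorm(v, row[item])
            total += v
        if total / len(rows) >= fminsup:
            frequent.append(cand)
    return frequent

rows = [{3: 0.9, 6: 1.0}, {3: 0.6, 6: 1.0}]
# support of {3, 6} = (0.9 + 0.6) / 2 = 0.75 >= 0.46, so it is kept
f2 = checking([(3, 6)], rows, 0.46)
```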