VIETNAM NATIONAL UNIVERSITY, HANOI
FACULTY OF TECHNOLOGY

PHAN XUAN HIEU

PARALLEL MINING FOR FUZZY ASSOCIATION RULES

Major: Information Technology
Chapter 1 Introduction to Data mining
1.1 Data mining
1.1.1 Data mining: Motivation
1.1.2 Data mining: Definition
1.1.3 Main steps in Knowledge discovery in databases (KDD)
1.2 Major approaches and techniques in Data mining
1.2.1 Major approaches and techniques in Data mining
1.2.2 Kinds of data could be mined
1.2.3 Applications of Data mining
1.2.4 Classification of Data mining systems
1.3 Focused issues in Data mining
2.1 Association rules: Motivation
2.2 Association rules mining - Problem statement
2.3 Main research trends in Association rules mining
Chapter 3 Fuzzy association rules mining
3.1 Quantitative association rules
3.1.1 Association rules with quantitative and categorical attributes
3.1.2 Methods of data discretization
3.2 Fuzzy association rules
3.2.1 Data discretization based on fuzzy set
3.2.2 Fuzzy association rules
3.2.3 Algorithm for fuzzy association rules mining
4.1 Several previously proposed parallel algorithms
4.2 A new parallel algorithm for fuzzy association rules mining
References
Figure 1 - The volume of data strongly increases in the past two decades
Figure 3 - Illustration of an association rule
Figure 4 - "Sharp boundary problem" in data discretization
Figure 5 - Membership functions of fuzzy sets associated with "Age" attribute
Figure 6 - Membership functions of "Cholesterol_Low" and "Cholesterol_High"
Figure 7 - The processing time increases dramatically as decreasing the minsup
Figure 8 - Number of itemsets and rules strongly increase as reducing the minsup
Figure 9 - The number of rules enlarges remarkably as decreasing the minsup
Figure 10 - Processing time increases largely as slightly increasing the number of attributes
Figure 11 - Processing time increases linearly as increasing the number of records
Figure 12 - Optional choices for T-norm operator
Figure 13 - The mining results reflect the changing of threshold values
Figure 14 - Count distribution algorithm on a 3-processor parallel system
Figure 15 - Data distribution algorithm on a 3-processor parallel system
Figure 16 - The rule generating time largely reduces as increasing the minconf
Figure 17 - The number of rules largely reduces when increasing the minconf
Figure 18 - The illustration for the division algorithm
Figure 19 - Processing time largely reduces as increasing the number of processors
Figure 20 - Mining time largely depends on the number of processors (logical, physical)
Figure 21 - The main interface window of the FuzzyARM tool
Figure 22 - The sub-window for adding new entries
Figure 23 - The window for viewing mining results
Table 5 - Data discretization for attributes having finite values
Table 6 - Data discretization for "Serum cholesterol" attribute
Table 7 - Data discretization for "Age" attribute
Table 8 - The diagnostic database of heart disease on 13 patients
Table 10 - The algorithm for mining fuzzy association rules
Table 11 - Values of records at attributes after fuzzifying
Table 13 - The set of frequent 2-itemsets
Table 14 - Fuzzy association rules generated from the database in table 8
Table 15 - The sequential algorithm for generating association rules
Table 16 - Fuzzy attributes received after fuzzifying the database in table 8
Table 17 - Fuzzy attributes dividing algorithm among processors
Chapter 1 Introduction to Data mining
1.1 Data mining
1.1.1 Data mining: Motivation
The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic devices (e.g. hard disk, CD-ROM, etc.). This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every two years, and the size and number of databases are increasing even faster. Figure 1 illustrates the data explosion [3].
Figure 1 - The volume of data strongly increases in the past two decades
We are drowning in data, but starving for useful knowledge. The vast amount of accumulated data is actually a valuable resource, because information is the vital factor for business operations, and decision-makers could make the most of the data to gain precious insight into the business before making decisions. Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most significant information in their data collections (databases, data warehouses, data repositories). The automated, prospective analyses offered by data mining go beyond the normal analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. This is where data mining & knowledge discovery in databases demonstrates its obvious benefits for today's competitive business environment. Nowadays, data mining & KDD has been taking on a key role in computer science and knowledge engineering.
The initial applications of data mining were only in commerce (retail) and finance (stock market). However, data mining is now widely and successfully applied in other fields such as bio-informatics, medical treatment, telecommunication, education, etc.
1.1.2 Data mining: Definition
Before discussing some definitions of data mining, I have a small explanation about terminology so that readers can avoid unnecessary confusion. As mentioned before, we can roughly understand data mining as a process of extracting nontrivial, implicit, previously unknown, and potentially useful knowledge from huge sets of data. Thus we should name this process knowledge discovery in databases (KDD) instead of data mining. However, most researchers agree that the two terminologies (data mining and KDD) are similar and can be used interchangeably. They explain this "humorous misnomer" by noting that the core motivation of KDD is the useful knowledge, but the main object they have to deal with during the mining process is data. Thus, in a sense, data mining and KDD imply the same meaning. However, in several materials, data mining is sometimes referred to as one step in the whole KDD process [3] [43].
There are numerous definitions of data mining, and they are all descriptive. I would like to restate herein some of them that are widely accepted.

Definition one: W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, 1991 [43]:
"Knowledge discovery in databases, also known as data mining, is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Definition two: M. Holsheimer and A. Siebes (1994):
"Data Mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database."
1.1.3 Main steps in Knowledge discovery in databases (KDD)
The whole KDD process is usually decomposed into the following steps [3] [14] [23]:
Data selection: selecting or segmenting the necessary data that needs to be mined from large data sets (databases, data warehouses, data repositories) according to some criteria.
Data preprocessing: this is the data cleaning and reconfiguration stage, where some techniques are applied to deal with incomplete, noisy, and inconsistent data. This step also tries to reduce data by using aggregate and group functions, data compression methods, histograms, sampling, etc. Furthermore, discretization techniques (binning, histograms, cluster analysis, entropy-based discretization, segmentation) can be used to reduce the number of values of a given continuous attribute by dividing the range of the attribute into separated intervals. After this step, data is clean, complete, uniform, reduced, and discretized.
Data transformation: in this step, data are transformed or consolidated into forms appropriate for mining. Data transformation can involve data smoothing and normalization. After this step, data are ready for the mining step.
Data mining: this is considered to be the most important step in the KDD process. It applies data mining techniques (chiefly borrowed from machine learning and other fields) to discover and extract useful patterns or relationships from data.
Knowledge representation and evaluation: the patterns identified by the system in the previous step are interpreted into knowledge that can then be used to support human decision-making (e.g. prediction and classification tasks, summarizing the contents of a database, or explaining observed phenomena). Knowledge representation also converts patterns into user-readable expressions such as trees, graphs, charts, and tables.
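The binning discretization mentioned in the preprocessing step can be sketched in a few lines. The routine below is only an illustrative equal-width binning sketch; the function name and the sample ages are hypothetical, not taken from the thesis.

```python
def equal_width_bins(values, k):
    """Partition the range of a continuous attribute into k equal-width
    intervals and map each value to the index of its interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = []
    for v in values:
        # Clamp so the maximum value falls into the last interval.
        idx = min(int((v - lo) / width), k - 1)
        bins.append(idx)
    return bins

ages = [23, 29, 35, 41, 52, 60, 74]
print(equal_width_bins(ages, 3))  # -> [0, 0, 0, 1, 1, 2, 2]
```

Entropy-based or clustering-based discretization would choose the cut points from the data distribution instead of splitting the range evenly, but the overall mapping from values to interval labels is the same.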
1.2 Major approaches and techniques in Data mining
1.2.1 Major approaches and techniques in Data mining
Data mining consists of many approaches. They can be classified according to functionality, kind of knowledge, type of data to be mined, or whatever criteria are appropriate [14]. I describe the major approaches below:

Classification & prediction: this method tries to arrange a given object into an appropriate class among the others. The number of classes and their names are definitely known. For example, we can classify or anticipate geographic regions according to weather and climate data. This approach normally uses typical techniques and concepts in machine learning such as decision tree, artificial neural network, k-NN, support vector machine, etc. Classification is also called supervised learning.
Association rules: this is a relatively simple form of rule, e.g. "80 percent of men that purchase beer also purchase dry beef". Association rules are now successfully applied in supermarkets (retail), medicine, bio-informatics, finance & stock market, etc.
Sequential/temporal patterns mining: this method is somewhat similar to association rules except that data and mining results (a kind of rule) always contain a temporal attribute to exhibit the order or sequence in which events or objects affect each other. This approach plays a key role in finance and stock market thanks to its capability of prediction.
Clustering & segmentation: this method tries to arrange a given object into a suited category (also known as a cluster). The number of clusters may be dynamic and their labels (names) are unknown. Clustering and segmentation are also called unsupervised learning.
Concept description & summarization: the main objective of this method is to describe or summarize an object so that the obtained information is compact and condensed. Document or text summarization may be a typical example.
1.2.2 Kinds of data could be mined
Data mining can work on various kinds of data. The most typical data types are as follows:
Relational databases: databases organized according to the relational model. Most of the existing database management systems support this kind of model, such as Oracle, IBM DB2, MS SQL Server, MS Access, etc.
Multidimensional databases: this kind of database is also called data warehouse, data mart, etc. The data selected from different sources contain a historical feature thanks to an implicit or explicit temporal attribute. This kind of database is used primarily in data mining and decision-making support systems.
Transactional databases: this kind of database is commonly used in supermarkets, banking, etc. Each transaction includes a certain number of items (e.g. items may be goods in an order) and a transactional database, in turn, contains a certain number of transactions.
Object-relational databases: this database model is a hybrid of the object-oriented model and the relational model.
Spatial, temporal, and time-series data: this kind of data always contains either spatial (e.g. map) or temporal (e.g. stock market) attributes.
Multimedia databases: this kind of data includes audio, image, video, text, www, and many other data formats. Today, this kind of data is widely used on the Internet thanks to its useful applications.
1.2.3 Applications of Data mining
Although data mining is a relatively new research trend, it attracts many researchers because of its practical applications in many areas. The following are typical applications: (1) Data analysis and decision-making support: this application is popular in commerce (retail industry), finance & stock market, etc. (2) Medical treatment: finding the potential relevance among symptoms, diagnoses, and treatment methods (nutrient prescription, surgery, etc.). (3) Text and Web mining: document summarization, text retrieval and text searching, text and hypertext classification. (4) Bio-informatics: searching and comparing typical or special genetic information such as genomes and DNA, or the implicit relations between a set of genomes and a genetic disease. (5) Finance & stock market: examining data to extract predicted information for the price of a certain kind of coupon. (6) Others: telecommunication, medical insurance, astronomy, anti-terrorism, sports, etc.
1.2.4 Classification of Data mining systems
Data mining is a knowledge engineering related field that involves many other research areas such as databases, machine learning, artificial intelligence, high performance computing, data & knowledge visualization, etc. We could classify data mining systems according to different criteria as follows:

Classifying based on the kind of data to be mined: data mining systems work with relational databases, data warehouses, transactional databases, object-oriented databases, spatial and temporal databases, multimedia databases, text and web databases, etc.

Classifying based on the type of mined knowledge: data mining tools that return summarization or description, association rules, classification or prediction, clustering, etc.

Classifying based on the kind of techniques used: data mining tools that work as online analytical processing (OLAP) systems, use machine learning techniques (decision tree, artificial neural network, k-NN, genetic algorithm, support vector machine, rough set, fuzzy set, etc.), data visualization, etc.

Classifying based on the fields the data mining systems are applied to: data mining systems are used in different fields such as commerce (retail industry), telecommunication, bio-informatics, medical treatment, finance & stock market, medical insurance, etc.
1.3 Focused issues in Data mining
Data mining is a relatively new research topic, so there are several pending or unconvincingly solved issues. I relate herein some of them that are attracting much attention from data mining researchers.

(1) OLAM (Online Analytical Mining) is a smooth combination of databases, data warehouses, and data mining. Nowadays, database management systems like Oracle, MS SQL Server, and IBM DB2 have integrated OLAP and data warehouse functionalities to facilitate users in data retrieval and data analysis, although these add-in supports also charge users an additional sum of money. Researchers in these fields hope to go beyond the current limitation by developing multi-purpose OLAM systems that support data transactions for daily business operations as well as data analysis for decision making [14]. (2) Data mining systems can mine various forms of knowledge from different types of data [14] [7]. (3) How to enhance the performance, accuracy, scalability, and integration of data mining systems? How to decrease the computational complexity? How to improve the ability to deal with incomplete, inconsistent, and noisy data? These three questions should still be concentrated on in the future [14]. (4) Taking advantage of background knowledge or knowledge from users (experts or specialists) to upgrade the total performance of data mining systems [7] [1]. (5) Parallel and distributed data mining is an interesting research trend because it makes use of powerful computing systems to reduce response time. This is essential because more and more real-time applications are needed in today's competitive world [5] [8] [12] [18] [26] [31] [32] [34] [42]. (6) Data Mining Query Language (DMQL): researchers in this area try to design a standard query language for data mining. This language would be used in OLAM systems just as SQL is widely used in relational databases [14]. (7) Knowledge representation and visualization are also taken into consideration to express knowledge in human-readable and easy-to-use forms. Knowledge can be represented in more intuitive expressions thanks to multidimensional or multilevel data structures.
This thesis primarily involves mining fuzzy association rules and parallel algorithms for mining fuzzy association rules.
2.1 Association rules: Motivation
An association rule has the form of "70 percent of customers that purchase beer also purchase dry beef, 20 percent of customers purchase both" or "75 percent of patients who smoke cigarettes and live near polluted areas also get lung cancer, 25 percent of patients smoke and live near polluted areas as well as suffer from lung cancer". "Purchase beer" and "smoke cigarettes and live near polluted areas" are called antecedents; "purchase dry beef" and "get lung cancer" are called consequents of the association rules. 20% and 25% are called support factors (the percentage of transactions or records that contain both the antecedent and the consequent of a rule); 70% and 75% are called confidence factors (the percentage of transactions or records that hold the antecedent and also hold the consequent of a rule). The following figure pictorially depicts the former example of association rules.
[Figure: diagram relating the number of transactions that buy beer, the number of transactions that buy dry beef, and the 70% of beer-buying transactions that also purchase dry beef]

Figure 3 - Illustration of an association rule
The knowledge and information derived from association rules differ obviously in meaning from those of normal queries (usually in SQL syntax). This knowledge contains previously unknown relationships and predictions hidden in massive volumes of data. It does not merely result from usual group, aggregate, or sort operations, but from a complicated and time-consuming computing process.

Although association rules are a simple kind of rule, they carry useful knowledge and contribute substantially to the decision-making process. Unearthing significant rules from databases is the main motivation of researchers.
2.2 Association rules mining - Problem statement
Let I = {i1, i2, ..., im} be a set of m items or attributes (in transactional or relational databases, respectively) and T = {t1, t2, ..., tn} be a set of n transactions or records (in transactional or relational databases, respectively). Each transaction is identified by its unique TID number. A (transactional) database D is a binary relation δ on the Cartesian product I×T (also written δ ⊆ I×T). If an item i occurs in a transaction t, we write (i, t) ∈ δ, or iδt. Generally speaking, a transactional database is a set of transactions, where each transaction t contains a set of items, i.e. t ∈ 2^I (where 2^I is the power set of I) [24] [36].
For example, consider the sample transactional database shown in table 1, with I = {A, C, D, T, W} and T = {1, 2, 3, 4, 5, 6}.

TID | Items
 1  | A, C, T, W
 2  | C, D, W
 3  | A, C, T, W
 4  | A, C, D, W
 5  | A, C, D, T, W
 6  | C, D, T

Table 1 - An example of transactional databases
X ⊆ I is called an itemset. The support factor of an itemset X, denoted s(X), is the percentage of transactions that contain X. An itemset X is frequent if its support is greater than or equal to a user-specified minimum support (minsup) value, i.e. s(X) ≥ minsup [36].

The following table enumerates all frequent itemsets in the sample database in table 1, with a minsup value of 50%.

Table 2 - Frequent itemsets in the sample database in table 1 with support ≥ 50%
An association rule is an implication of the form X → Y, where X and Y are frequent itemsets that are disjoint, i.e. X ∩ Y = ∅, and c, the confidence factor of the rule, is the conditional probability that a transaction contains Y given that it contains X, i.e. c = s(X ∪ Y) / s(X). A rule is confident if its confidence factor is larger than or equal to a user-specified minimum confidence (minconf) value, i.e. c ≥ minconf [36].

The association rules mining task can be stated as follows:

Let D be a (transactional) database, and let minsup and minconf be the minimum support and minimum confidence, respectively. The mining task tries to discover all frequent and confident association rules X → Y, i.e. s(X ∪ Y) ≥ minsup and c ≥ minconf. The task is usually decomposed into two phases:

Phase one: discovering all frequent itemsets. This phase is the most expensive one, dominated by heavy computation (CPU-bound) and I/O operations (I/O-bound).
Phase two: generating confident association rules from the frequent itemsets discovered in the previous phase. If X is a frequent itemset, the confident association rules created from X have the form X' → X \ X', where X' is any non-empty subset of X and X \ X' is the subtraction of X' from X. This step is relatively straightforward and much less time-consuming than phase one.
The following table lists all possible association rules generated from the frequent itemset ACW (from the database in table 1) with minconf = 70%.

Table 3 - Association rules generated from frequent itemset ACW
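The two phases just described can be illustrated with a small brute-force sketch over the table 1 database. This is a naive enumeration written for clarity, not the optimized candidate-generation strategy of real miners such as Apriori; the function names are illustrative.

```python
from itertools import combinations

# The sample transactional database from table 1.
DB = {1: {"A", "C", "T", "W"}, 2: {"C", "D", "W"}, 3: {"A", "C", "T", "W"},
      4: {"A", "C", "D", "W"}, 5: {"A", "C", "D", "T", "W"}, 6: {"C", "D", "T"}}

def support(itemset):
    """Support factor s(X): the fraction of transactions containing X."""
    return sum(itemset <= t for t in DB.values()) / len(DB)

def frequent_itemsets(minsup):
    """Phase one: enumerate every itemset and keep the frequent ones."""
    items = sorted(set().union(*DB.values()))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            if support(x) >= minsup:
                freq[x] = support(x)
    return freq

def rules_from(itemset, minconf):
    """Phase two: confident rules X' -> X \\ X' from one frequent itemset X."""
    found = []
    for k in range(1, len(itemset)):
        for ante in combinations(sorted(itemset), k):
            a = frozenset(ante)
            c = support(itemset) / support(a)  # confidence c = s(X) / s(X')
            if c >= minconf:
                found.append((set(a), set(itemset - a), c))
    return found

print(len(frequent_itemsets(0.5)))            # -> 19 frequent itemsets at 50%
print(len(rules_from(frozenset("ACW"), 0.7))) # -> 5 confident rules from ACW
```

Phase one dominates the cost here too: the loop examines all 2^5 - 1 candidate itemsets, which is exactly the exponential blow-up the pruning strategies discussed in section 2.3 are designed to avoid.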
2.3 Main research trends in Association rules mining
Since being proposed by R. Agrawal in 1993 [36], the field of association rules mining has developed in various new directions thanks to a variety of improvements from researchers. Some proposals try to enhance precision and performance, some try to tune the interestingness of rules, etc. I list herein some of the dominant trends.
Mining binary or boolean association rules: this is the initial research direction of association rules. Most of the early mining algorithms are related to this kind of rule [20] [38] [36]. In binary association rules, an item is only determined to be present or absent; the quantity associated with each item is fully ignored, e.g. a transaction buying twenty bottles of beer is treated the same as a transaction that buys only one bottle. The most well-known algorithms for mining binary association rules are Apriori and its variants (AprioriTid and AprioriHybrid) [35]. An example of this type of rule is "buying bread = 'yes' AND buying sugar = 'yes' => buying milk = 'yes' AND buying butter = 'yes', with support 20% and confidence 80%".
Quantitative and categorical association rules: attributes in databases may be binary (boolean), numeric (quantitative), nominal (categorical), etc. To discover association rules that involve these data types, quantitative and categorical attributes need to be discretized into binary ones. Some discretization methods are proposed in [34] [39]. An example of this kind of rule is "sex = 'male' AND age ∈ [50..65] AND weight ∈ [60..80] AND sugar in blood > 120mg/ml => blood pressure = 'high', with support 30% and confidence 65%".
Fuzzy association rules: this type of rule was suggested to overcome several drawbacks of quantitative association rules, such as the "sharp boundary problem" and poor semantic expression. Fuzzy association rules are more natural and intuitive to users thanks to their "fuzzy" characteristics. An example is "dry cough = 'yes' AND high fever AND muscle aches = 'yes' AND breathing difficulties = 'yes' => get SARS (Severe Acute Respiratory Syndrome) = 'yes', with support 4% and confidence 80%". High fever in the above rule is a fuzzy attribute: we measure the body temperature based on a fuzzy concept.
Multi-level association rules: all the kinds of association rules above are too concrete, so they cannot reflect relationships from a general view. Multi-level or generalized association rules were devised to surmount this problem [15] [37]. In this approach, we would prefer a rule like "buy PC = 'yes' => buy operating system = 'yes' AND buy office tools = 'yes'" rather than "buy IBM PC = 'yes' => buy Microsoft Windows = 'yes' AND buy Microsoft Office = 'yes'". Obviously, the former rule is the generalized form of the latter and the latter is the specific form of the former.
Association rules with weighted items (or attributes): we use a weight associated with each item to indicate the level at which that item contributes to the rule. In other words, weights are used to measure the importance of items. For example, while surveying the SARS plague within a certain group of people, the information on body temperature and respiratory system is much more essential than that on age. To reflect the difference between the above attributes, we attach greater weight values to the body temperature and respiratory system attributes. This is an attractive research branch, and solutions were presented in several papers [10] [44]. By using weights, we can discover scarce association rules of high interestingness, i.e. we can retain rules with small supports but a special meaning.
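As a toy illustration of the idea: the tiny database, the weight values, and the mean-weight formula below are all assumptions made for demonstration, since the cited papers define weighted support in several different ways.

```python
# Hypothetical symptom transactions and item weights (not from the thesis).
DB = [{"fever", "cough"}, {"fever", "age_old"}, {"fever", "cough", "age_old"}]
WEIGHTS = {"fever": 0.9, "cough": 0.8, "age_old": 0.2}

def support(itemset):
    """Plain (unweighted) support: fraction of transactions containing X."""
    return sum(itemset <= t for t in DB) / len(DB)

def weighted_support(itemset):
    """Scale support by the mean item weight, so itemsets built from
    important items (fever, cough) keep higher scores than itemsets
    diluted by unimportant ones (age_old)."""
    mean_w = sum(WEIGHTS[i] for i in itemset) / len(itemset)
    return mean_w * support(itemset)

print(weighted_support({"fever", "cough"}))    # 0.85 * 2/3
print(weighted_support({"fever", "age_old"}))  # 0.55 * 2/3
```

Both itemsets have the same plain support (2/3), yet the weighting separates them, which is exactly how a rare but heavily weighted itemset can survive a threshold that plain support would fail.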
Besides examining variants of association rules, researchers pay attention to accelerating the phase of discovering frequent itemsets. Most of the recommended algorithms try to reduce the number of frequent itemsets that need to be mined by developing new theories of maximal frequent itemsets [11] (the MAFIA algorithm) and closed itemsets [13] (the CLOSET algorithm), [24] (the CHARM algorithm), [30]. These new approaches considerably decrease mining time owing to their delicate pruning strategies. Experiments show that these algorithms outperform known ones like Apriori, AprioriTid, etc.
Parallel and distributed algorithms for association rules mining: in addition to sequential or serial algorithms, parallel algorithms have been invented to enhance the total performance of the mining process by making use of robust parallel systems. The advent of parallel and distributed data mining is widely welcomed because the size of databases increases sharply and real-time applications have become common in recent years. Numerous parallel algorithms for mining association rules were devised during the past ten years [5] [12] [18] [26] [31] [32] [34]. They are both platform dependent and platform independent.
Mining association rules from the point of view of rough set theory [41].
Furthermore, there exist other research trends such as online association rule mining [33], in which data mining tools are integrated with or directly connected to data warehouses or data repositories based on well-known technologies such as OLAP, MOLAP, ROLAP, ADO, etc.
Chapter 3 Fuzzy association rules mining
3.1 Quantitative association rules
3.1.1 Association rules with quantitative and categorical attributes
Mining quantitative and categorical association rules is an important task because of its practical applications on real-world databases. This kind of association rule first appeared in [34].

Table 4 - Diagnostic database of heart disease on 17 patients
In the above database, three attributes (Age, Serum cholesterol, and Maximum heart rate) are quantitative, two attributes (Chest pain type and Resting electrocardiographics) are categorical, and all the rest (Sex, Heart disease, Fasting blood sugar) are binary. In fact, the binary data type is also considered to be a special form of category. From the data in table 4, we can extract rules such as:

<Age: 54..74> AND <Sex: Female> AND <Cholesterol: 200..300> => <Heart disease: Yes>, with support 23.53% and confidence 80%

<Sex: Male> AND <Resting electrocardiographics: 0> AND <Fasting blood sugar < 120> => <Heart disease: No>, with support 17.65% and confidence 100%
The approach proposed in [34] discovers this kind of rule by partitioning the value ranges of quantitative and categorical attributes into separated intervals to convert them into binary ones. Traditional well-known algorithms such as Apriori [35], CHARM [24], and CLOSET [20] can then work on these new binary attributes as in the original problem of mining boolean association rules.
3.1.2 Methods of data discretization
Binary association rules mining algorithms [20] [24] [35] [36] only work with relational databases containing only binary attributes, or with transactional databases as shown in table 1. They cannot be applied directly to practical databases like the one shown in table 4. In order to conquer this obstacle, quantitative and categorical columns must first be converted into boolean ones [34] [39]. However, there remain some limitations in data discretization that influence the quality of the discovered rules, and the output rules do not satisfy researchers' expectations. The following section describes the major discretization methods to contrast their disadvantages.
The first case: let A be a discrete quantitative or categorical attribute with finite value domain {v1, v2, ..., vk}, where k is small enough (k < 100). After being discretized, the original attribute is developed into k new binary attributes named A_V1, A_V2, ..., A_Vk. The value of a record at column A_Vi is equal to True (Yes or 1) if the original value of this record at attribute A equals vi; in all remaining cases the value of A_Vi is set to False (No or 0). The attributes Chest pain type and Resting electrocardiographics in table 4 belong to this case. After transforming, the initial attribute Chest pain type will be converted into four binary columns, Chest_pain_type_1, Chest_pain_type_2, Chest_pain_type_3, and Chest_pain_type_4, as shown in the following table.

Chest pain type (1, 2, 3, 4) | Chest_pain_type_1 | Chest_pain_type_2 | Chest_pain_type_3 | Chest_pain_type_4

Table 5 - Data discretization for attributes having finite values
The second case: if A is a continuous quantitative attribute, or a categorical one having value domain {v1, v2, ..., vp} (p relatively large), A will be mapped to q new binary columns of the form <A: start1..end1>, <A: start2..end2>, ..., <A: startq..endq>. The attributes Serum cholesterol and Maximum heart rate in table 4 belong to this form. Serum cholesterol and Age could be discretized as shown in the two following tables:

Table 6 - Data discretization for "Serum cholesterol" attribute

Table 7 - Data discretization for "Age" attribute
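A minimal sketch of this second-case mapping for the Age attribute, using the intervals [1..29], [30..59], and [60..120] discussed later in this section (the function and column names are illustrative, not the thesis's exact table 7):

```python
# Binary interval columns <Age: start..end> for the second discretization case.
INTERVALS = [("Age_1_29", 1, 29), ("Age_30_59", 30, 59), ("Age_60_120", 60, 120)]

def discretize_age(age):
    """Set exactly one interval column to 1 (True) and all others to 0 (False)."""
    return {name: int(lo <= age <= hi) for name, lo, hi in INTERVALS}

# Ages 59 and 60 differ by one year yet land in different columns,
# which is the interpretation problem discussed later in this section.
print(discretize_age(59))  # {'Age_1_29': 0, 'Age_30_59': 1, 'Age_60_120': 0}
print(discretize_age(60))  # {'Age_1_29': 0, 'Age_30_59': 0, 'Age_60_120': 1}
```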
Unfortunately, the mentioned discretization methods encounter some pitfalls, such as the "sharp boundary problem" [4] [9]. The figure below displays a given attribute A having a value range from 1 to 10. Suppose that we divide A into two separated intervals, [1..5] and [6..10]. If the minsup value is 41%, the range [6..10] will not gain sufficient support; therefore [6..10] cannot satisfy minsup even though there is a large support near its left boundary. For example, [4..7] has support 54% and [5..8] has support 45%. So this partition results in a "sharp boundary" between 5 and 6, and therefore mining algorithms cannot generate confident rules involving the interval [6..10].
Moreover, these discretization methods unintentionally overemphasize the importance of values located near the boundaries, which is neither natural nor consistent.

Furthermore, partitioning an attribute domain into separated ranges results in a problem of rule interpretation. Table 7 shows that the two values 29 and 30 belong to different intervals even though they are very similar in indicating age. Also, supposing that the interval [1..29] denotes young people, [30..59] middle-aged people, and [60..120] old ones, the age of 59 implies a middle-aged person whereas the age of 60 implies an old person. This is not intuitive and natural in understanding the meaning of quantitative association rules.
Fuzzy association rules were recommended to overcome the above shortcomings [4] [9]. This kind of rule not only successfully mitigates the "sharp boundary problem" but also allows us to express association rules in a more intuitive and friendly format. For instance, the quantitative rule "<Age: 54..74> AND <Sex: Female> AND <Cholesterol: 200..300> => <Heart disease: Yes>" is now replaced by "<Age_Old> AND <Sex: Female> AND <Cholesterol_High> => <Heart disease: Yes>". Age_Old and Cholesterol_High in the above rule are fuzzy attributes.
3.2 Fuzzy association rules
3.2.1 Data discretization based on fuzzy set
In fuzzy set theory [21] [47], an element can belong to a set with a membership value in [0, 1]. This value is assigned by the membership function associated with each fuzzy set. For an attribute x and its domain D_x (also known as the universal set), the membership function associated with a fuzzy set f_x has the form:

m_{f_x}: D_x -> [0, 1]   (3.1)
Fuzzy sets provide a smooth change over the boundaries and allow us to express association rules in a more expressive form. Let us therefore use fuzzy sets in data discretization to make the most of these benefits.

For the attribute Age and its universal domain [0, 120], we attach three fuzzy sets: Age_Young, Age_Middle-aged, and Age_Old. The graphic representations of these fuzzy sets are shown in the following figure.

Figure 5 - Membership functions of fuzzy sets associated with the "Age" attribute
By using fuzzy sets, we completely get rid of the "sharp boundary problem" thanks to their smooth characteristics. For example, the graph in figure 5 indicates that the ages 59 and 60 have membership values in the fuzzy set Age_Old of approximately 0.85 and 0.90 respectively. Similarly, the membership values of the ages 30 and 29 in the fuzzy set Age_Young are 0.70 and 0.75 respectively. Obviously, this transformation method is much more intuitive and natural than the known discretization methods.
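The smooth boundary behaviour described above can be reproduced with simple piecewise-linear membership functions. The following is only a sketch: the breakpoint values are hypothetical, chosen so that the functions return the membership degrees quoted in the text, and are not taken from the thesis's actual figure.

```python
def rising_shoulder(x, a, b):
    """Membership rising linearly from 0 at a to 1 at b (right shoulder)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

# Hypothetical breakpoints (not from the thesis figure): Age_Old rises from
# 0 at age 42 to 1 at age 62; Age_Young falls from 1 at age 24 to 0 at 44.
def age_old(age):
    return rising_shoulder(age, 42, 62)

def age_young(age):
    return 1.0 - rising_shoulder(age, 24, 44)

# Neighbouring ages get almost identical degrees -- no sharp boundary:
# age_old gives roughly 0.85 / 0.90 for ages 59 / 60,
# age_young gives roughly 0.75 / 0.70 for ages 29 / 30.
```

With crisp intervals, ages 59 and 60 would fall on opposite sides of a hard cut; here their degrees differ by only 0.05.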
As another example, the original attribute Serum cholesterol is decomposed into two new fuzzy attributes, Cholesterol_Low and Cholesterol_High. The following figure portrays the membership functions of these fuzzy concepts.

Figure 6 - Membership functions of "Cholesterol_Low" and "Cholesterol_High"
If A is a categorical attribute having the value domain {v_1, v_2, ..., v_k} and k is relatively small, we fuzzify this attribute by attaching a new fuzzy attribute A_V_i to each value v_i. The value of the membership function m_{A_V_i}(x) equals 1 if x = v_i and equals 0 otherwise. Strictly speaking, A_V_i is also a normal (crisp) set, because its membership value is either 0 or 1. If k is too large, we can fuzzify this attribute by ...
Data discretization using fuzzy sets brings the following benefits:

Firstly, the smooth transition of the membership functions helps us eliminate the "sharp boundary problem".

Secondly, data discretization using fuzzy sets significantly reduces the number of new attributes, because the number of fuzzy sets associated with each original attribute is relatively small compared to the number of intervals that attribute yields in quantitative association rules. For instance, if we use normal discretization methods over the attribute Serum cholesterol, we obtain five sub-ranges (and thus five new attributes) from its original domain [100, 600], whereas we create only two new attributes, Cholesterol_Low and Cholesterol_High, by applying fuzzy sets. This advantage is essential because it allows us to compact the set of candidate itemsets and therefore shorten the total mining time.

Thirdly, fuzzy association rules are more intuitive and natural than the known kinds.

Fourthly, all values of the records at the new attributes after fuzzifying lie in [0, 1], expressing the degree to which a given element belongs to a fuzzy set. As a result, this flexible encoding offers an exact method to measure the contribution or impact of each record on the overall support of an itemset.
The next advantage, which we will see more clearly in the next section, is that fuzzified databases still hold the "downward closure property" (all subsets of a frequent itemset are also frequent, and any superset of a non-frequent itemset is not frequent), provided that we make a wise choice of the T-norm operator. Thus, conventional algorithms such as Apriori also work well on fuzzified databases with just slight modifications.

Another benefit is that this data discretization method can be easily applied to both relational and transactional databases.
Table 8 - The diagnostic database of heart disease of 13 patients
Let I = {i_1, i_2, ..., i_n} be a set of n attributes, where i_u denotes the u-th attribute in I, and let T = {t_1, t_2, ..., t_m} be a set of m records, where t_v is the v-th record in T. The value of record t_v at attribute i_u is referred to as t_v[i_u]. For instance, in table 8 the value of t_3[i_2] (that is, the value of t_3[Serum cholesterol]) is 274 (mg/ml). Using the fuzzification method in the previous section, we associate each attribute i_u with a set of fuzzy sets F_{i_u} = {f_u1, f_u2, ..., f_uk}.

For example, with the database in table 8, we have:

F_Age = {Age_Young, Age_Middle-aged, Age_Old} (with k = 3)

F_Serum_cholesterol = {Cholesterol_Low, Cholesterol_High} (with k = 2)
A fuzzy association rule is an implication of the form:

(x_1 is f_1) AND ... AND (x_p is f_p) => (y_1 is g_1) AND ... AND (y_q is g_q)   (3.5)
A fuzzy itemset is now defined as a pair <X, A>, in which X (⊆ I) is an itemset and A is a set of fuzzy sets associated with the attributes in X.

The support of a fuzzy itemset <X, A> is denoted fs(<X, A>) and determined by the following formula:

fs(<X, A>) = ( Σ_{v=1..|T|} [ m_{f_1}(t_v[x_1]) ⊗ m_{f_2}(t_v[x_2]) ⊗ ... ⊗ m_{f_p}(t_v[x_p]) ] ) / |T|   (3.6)

where:

* X = {x_1, x_2, ..., x_p}, t_v is the v-th record in T, and m_{f_u} is the membership function of the fuzzy set f_u ∈ A attached to attribute x_u

* ⊗ is the T-norm operator in fuzzy logic theory; its role is similar to that of the logical operator AND in traditional logic

A frequent fuzzy itemset: a fuzzy itemset <X, A> is frequent if its support is greater than or equal to the fuzzy minimum support (fminsup) specified by users, i.e. fs(<X, A>) ≥ fminsup.
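As a minimal sketch of the support computation in formula (3.6), assume the records have already been fuzzified so that each value is a membership degree in [0, 1]. The attribute names and data below are hypothetical, and the T-norm defaults to the min function:

```python
def fuzzy_support(itemset, records, tnorm=min):
    """fs(<X, A>): combine each record's membership degrees for the itemset's
    fuzzy attributes with the T-norm, then average over all records (3.6)."""
    total = 0.0
    for rec in records:
        degree = rec[itemset[0]]
        for attr in itemset[1:]:
            degree = tnorm(degree, rec[attr])
        total += degree
    return total / len(records)

# Two toy fuzzified records (hypothetical values, not the thesis's table 8):
records = [
    {"Age_Old": 0.9, "BloodSugar_0": 1.0},
    {"Age_Old": 0.4, "BloodSugar_0": 0.0},
]
# min(0.9, 1.0) = 0.9 and min(0.4, 0.0) = 0.0, so fs = (0.9 + 0.0) / 2 = 0.45
fs = fuzzy_support(["Age_Old", "BloodSugar_0"], records)
```

Note how each record contributes its combined degree, not a 0/1 vote as in crisp association rules.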
The support of a fuzzy association rule is defined as:

fs(<X is A => Y is B>) = fs(<X ∪ Y, A ∪ B>)   (3.10)

A fuzzy association rule is frequent if its support is larger than or equal to fminsup, i.e. fs(<X is A => Y is B>) ≥ fminsup.
The confidence factor of a fuzzy association rule is denoted fc(X is A => Y is B) and defined as:

fc(X is A => Y is B) = fs(<X is A => Y is B>) / fs(<X, A>)   (3.11)

A fuzzy association rule is considered confident if its confidence is greater than or equal to a fuzzy minimum confidence (fminconf) threshold specified by users. This means that the confidence must satisfy the condition:

fc(X is A => Y is B) ≥ fminconf
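The rule measures (3.10) and (3.11) follow directly: the rule's support is the support of the combined itemset, and its confidence divides that by the antecedent's support. A sketch over already-fuzzified records (hypothetical data), reusing the averaging of formula (3.6):

```python
def fs(cols, rows, tnorm=min):
    """Fuzzy support of an itemset over fuzzified rows, as in (3.6)."""
    def combine(row):
        v = row[cols[0]]
        for c in cols[1:]:
            v = tnorm(v, row[c])
        return v
    return sum(combine(r) for r in rows) / len(rows)

def rule_support(x_cols, y_cols, rows, tnorm=min):
    """fs(<X is A => Y is B>) = fs(<X U Y, A U B>)   (3.10)"""
    return fs(x_cols + y_cols, rows, tnorm)

def rule_confidence(x_cols, y_cols, rows, tnorm=min):
    """fc = fs(<X U Y, A U B>) / fs(<X, A>)   (3.11)"""
    return rule_support(x_cols, y_cols, rows, tnorm) / fs(x_cols, rows, tnorm)

rows = [{"A": 0.8, "B": 0.5}, {"A": 0.6, "B": 0.6}]
# rule_support = (min(0.8, 0.5) + min(0.6, 0.6)) / 2 = 0.55
# confidence   = 0.55 / ((0.8 + 0.6) / 2) = 0.55 / 0.7
```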
Choosing the T-norm (⊗): there are various ways to choose the T-norm operator [1] [2] [21] [47] for formula (3.6), such as:

* Min function: a ⊗ b = min(a, b)

* Normal multiplication: a ⊗ b = ab

* Limited multiplication: a ⊗ b = max(0, a + b - 1)

* Drastic multiplication: a ⊗ b = a (if b = 1), = b (if a = 1), = 0 (if a, b < 1)

* Yager joint operator: a ⊗ b = 1 - min[1, ((1 - a)^w + (1 - b)^w)^(1/w)] (with w > 0). If w = 1, it becomes limited multiplication; as w runs up to infinity, it develops into the min function; as w decreases to 0, it becomes drastic multiplication.
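The listed operators can be written down directly; a sketch follows, in which the Yager operator's limit behaviour is checked numerically rather than symbolically:

```python
def t_min(a, b):
    return min(a, b)

def t_product(a, b):          # algebraic (normal) multiplication
    return a * b

def t_bounded(a, b):          # limited multiplication
    return max(0.0, a + b - 1.0)

def t_drastic(a, b):          # drastic multiplication
    if b == 1.0:
        return a
    if a == 1.0:
        return b
    return 0.0

def t_yager(a, b, w):         # Yager joint operator, w > 0
    return 1.0 - min(1.0, ((1.0 - a) ** w + (1.0 - b) ** w) ** (1.0 / w))

# t_yager(., ., 1) coincides with t_bounded; for large w it approaches t_min.
```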
Based on experiments, we conclude that the min function and normal multiplication are the two most preferable choices for the T-norm operator, because the support computations derived from formula (3.6) by applying the min function and normal multiplication are convenient to implement and evaluate.

Another reason for choosing the min function and algebraic multiplication for the T-norm operator is related to the question "how do we understand the meaning of the implication operator (=>) in fuzzy logic theory?". In classical logic, the implication operator, used to link two clauses P and Q into the compound clause P -> Q, expresses the idea "if P then Q". This is a relatively sophisticated logical link because it is used to represent a cause-and-effect relation. While formalizing it, however, we consider the truth value of this relation as a regular combination of the truth values of P and Q. This assumption may lead to a misconception or a misunderstanding of this kind of relation.

Suppose P and Q are fuzzy sets on the universal domains U and V respectively. The cause-and-effect rule "if u is P then v is Q" is understood so that the pair (u, v) forms a fuzzy set on the universal domain U x V. The fuzzy implication P -> Q is then considered a fuzzy set, and we need to identify its membership function m_{P->Q} from the membership functions m_P and m_Q of the fuzzy sets P and Q. There is a variety of research around this issue; we relate herein several ways to determine the membership function m_{P->Q}.

If we adopt the idea of the implication operator in classical logic theory, we have, for all (u, v) ∈ U x V: m_{P->Q}(u, v) = ⊕(1 - m_P, m_Q), in which ⊕ is an S-norm operator in fuzzy logic theory. If ⊕ is replaced with the max function, we obtain the Dienes formula m_{P->Q}(u, v) = max(1 - m_P, m_Q). If ⊕ is replaced with the probability sum, we receive the Mizumoto formula m_{P->Q}(u, v) = 1 - m_P + m_P * m_Q. And if ⊕ is substituted by the limited sum, we get the Lukasiewicz formula m_{P->Q}(u, v) = min(1, 1 - m_P + m_Q).
In general, ⊕ can be substituted by any valid function satisfying the conditions of an S-norm operator.

Another way to interpret the meaning of this kind of relation is that the truth value of the compound clause "if u is P then v is Q" is large if and only if the truth values of both the antecedent and the consequent are large. This means that m_{P->Q}(u, v) = ⊗(m_P, m_Q). If the ⊗ operator is substituted with the min function, we receive the Mamdani formula m_{P->Q}(u, v) = min(m_P, m_Q). Similarly, if ⊗ is replaced by normal multiplication, we obtain the formula m_{P->Q}(u, v) = m_P * m_Q [2].

A fuzzy association rule, in a sense, is a form of fuzzy implication; thus it must, in part, comply with the above ideas. Although there are many combinations of m_P and m_Q to form m_{P->Q}(u, v), the Mamdani formulas should be the most favorable ones. This is the main reason that influences our choice of the min function and algebraic multiplication for the T-norm operator.
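The implication formulas above are one-liners; the sketch below writes them out so the two readings (an S-norm over (1 - m_P, m_Q) versus a direct conjunction) can be compared on concrete membership degrees:

```python
def dienes(mp, mq):        # S-norm = max
    return max(1.0 - mp, mq)

def mizumoto(mp, mq):      # S-norm = probability sum
    return 1.0 - mp + mp * mq

def lukasiewicz(mp, mq):   # S-norm = limited sum
    return min(1.0, 1.0 - mp + mq)

def mamdani(mp, mq):       # conjunction reading, combined with min
    return min(mp, mq)

def product_form(mp, mq):  # conjunction reading, normal multiplication
    return mp * mq
```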
3.2.3 Algorithm for fuzzy association rules mining
Phase two: generating all possible confident fuzzy association rules from the frequent fuzzy itemsets discovered above. This subproblem is relatively straightforward and less time-consuming compared to the previous step. If <X, A> is a frequent fuzzy itemset, the rules we derive from <X, A> have the form (X \ X') is (A \ A') => X' is A', in which X' and A' are non-empty subsets of X and A respectively. The backslash (i.e. the \ sign) in the implication denotes the subtraction operator between two sets. fc is the fuzzy confidence factor of the rule and must meet the condition fc ≥ fminconf.
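Phase two can be sketched as follows over already-fuzzified records. The data and confidence threshold are hypothetical, and itemset supports are recomputed on the fly rather than cached as a real implementation would:

```python
from itertools import combinations

def generate_rules(frequent_itemsets, rows, fminconf, tnorm=min):
    """For each frequent fuzzy itemset X, emit every rule whose antecedent is
    a non-empty proper subset of X and whose confidence meets fminconf."""
    def fs(cols):
        total = 0.0
        for row in rows:
            v = row[cols[0]]
            for c in cols[1:]:
                v = tnorm(v, row[c])
            total += v
        return total / len(rows)

    rules = []
    for itemset in frequent_itemsets:
        whole = fs(itemset)
        for r in range(1, len(itemset)):           # non-empty proper antecedents
            for ante in combinations(itemset, r):
                conf = whole / fs(ante)            # (3.11)
                if conf >= fminconf:
                    cons = tuple(i for i in itemset if i not in ante)
                    rules.append((ante, cons, conf))
    return rules

rows = [{3: 0.8, 6: 1.0}, {3: 0.6, 6: 0.5}]
rules = generate_rules([(3, 6)], rows, 0.9)
# fs({3,6}) = 0.65, fs({3}) = 0.7, fs({6}) = 0.75: only 3 => 6 reaches 0.9
```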
The inputs of the algorithm are a database D with attribute set I and record set T, together with fminsup and fminconf. The outputs of the algorithm are all possible confident fuzzy association rules.
Notation table:
D_F, I_F, T_F: the fuzzified database, its set of fuzzy attributes, and its set of records. Each fuzzy attribute in I_F is attached with a fuzzy set, and each fuzzy set f has a threshold w_f. The value of each record in T_F at a given fuzzy attribute lies in [0, 1].

F: the set of all possible frequent itemsets from the database.

fminsup: the fuzzy minimum support; fminconf: the fuzzy minimum confidence.
Table 10 - The algorithm for mining fuzzy association rules

The algorithm in table 10 uses the following sub-programs:
(D_F, I_F, T_F) = FuzzyMaterialization(D, I, T): this function converts the original database D into the fuzzified database D_F; I and T are transformed into I_F and T_F respectively. For example, with the database in table 8, after running this function we obtain:

I_F = {[Age, Age_Young] (1), [Age, Age_Middle-aged] (2), [Age, Age_Old] (3), [Cholesterol, Cholesterol_Low] (4), [Cholesterol, Cholesterol_High] (5), [BloodSugar, BloodSugar_0] (6), [BloodSugar, BloodSugar_1] (7), [HeartDisease, HeartDisease_No] (8), [HeartDisease, HeartDisease_Yes] (9)}
After converting, I_F contains 9 new fuzzy attributes compared to the 4 attributes in I. Each fuzzy attribute is a pair, surrounded by square brackets, that includes the name of the original attribute and the name of the corresponding fuzzy set. For instance, after fuzzifying the Age attribute we receive three new fuzzy attributes: [Age, Age_Young], [Age, Age_Middle-aged], and [Age, Age_Old].
In addition, the function FuzzyMaterialization converts T into T_F, as shown in table 11, which holds the values of the records at the attributes after fuzzifying. Note that the characters A, C, S, and H in table 11 are the first characters of Age, Cholesterol, Sugar, and Heart respectively. Each fuzzy set f is accompanied by a threshold w_f, so only values greater than or equal to that threshold are taken into consideration; all other values are set to 0. The gray cells in table 11 indicate values larger than or equal to the threshold (all thresholds in table 11 are 0.5), and all values located in white cells are equal to 0.
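FuzzyMaterialization can be sketched as below. The membership function and its breakpoints are hypothetical; the only behaviour taken from the text is that each new attribute is keyed by the (original attribute, fuzzy set) pair and that degrees below the fuzzy set's threshold w_f are zeroed:

```python
def fuzzy_materialize(records, fuzzy_sets, thresholds):
    """Turn crisp records into fuzzified ones: each (attribute, fuzzy set)
    pair becomes a new attribute whose value is the membership degree,
    zeroed when it falls below that fuzzy set's threshold w_f."""
    fuzzified = []
    for rec in records:
        frec = {}
        for attr, sets in fuzzy_sets.items():
            for set_name, member in sets.items():
                degree = member(rec[attr])
                if degree < thresholds.get(set_name, 0.0):
                    degree = 0.0
                frec[(attr, set_name)] = degree
        fuzzified.append(frec)
    return fuzzified

# Hypothetical Age_Old membership rising from 0 at age 42 to 1 at age 62:
age_old = lambda a: min(1.0, max(0.0, (a - 42) / 20.0))
rows = fuzzy_materialize(
    [{"Age": 60}, {"Age": 45}],
    {"Age": {"Age_Old": age_old}},
    {"Age_Old": 0.5},
)
# (60-42)/20 = 0.90 is kept; (45-42)/20 = 0.15 < 0.5 is zeroed
```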
F_1 = Counting(D_F, I_F, T_F, fminsup): this function generates F_1, the set of all frequent fuzzy 1-itemsets. All elements in F_1 must have supports greater than or equal to fminsup. For instance, applying normal multiplication for the T-norm (⊗) operator in formula (3.6) with fminsup = 46%, we achieve the F_1 shown in the following table:
{[BloodSugar, BloodSugar_0]} (6)    ...    Yes
{[HeartDisease, HeartDisease_No]} (8)    54%    Yes

Table 12 - C_1: set of candidate 1-itemsets
Hence F_1 = {{3}, {6}, {8}, {9}}.
C_k = Join(F_{k-1}): this function produces the set of all fuzzy candidate k-itemsets (C_k) based on the set of frequent fuzzy (k-1)-itemsets (F_{k-1}) discovered in the previous step. The following SQL statement indicates how elements of F_{k-1} are combined to form candidate k-itemsets:
INSERT INTO C_k
SELECT p.i_1, p.i_2, ..., p.i_{k-1}, q.i_{k-1}
FROM F_{k-1} p, F_{k-1} q
WHERE p.i_1 = q.i_1, ..., p.i_{k-2} = q.i_{k-2}, p.i_{k-1} < q.i_{k-1} AND p.i_{k-1}.o <> q.i_{k-1}.o
In this statement, p.i_j and q.i_j are the index numbers of the j-th fuzzy attributes in itemsets p and q respectively, while p.i_j.o and q.i_j.o are the index numbers of the corresponding original attributes. Two fuzzy attributes sharing a common original attribute must not appear in the same fuzzy itemset. For example, after running the above SQL command we obtain C_2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}. The 2-itemset {8, 9} is invalid because its two fuzzy attributes are derived from the common attribute HeartDisease.
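The Join step can be sketched in plain code using the running example's indices. Here `origin` maps each fuzzy attribute's index to its original attribute, mirroring the `p.i.o <> q.i.o` condition in the SQL; this is a sketch, not the thesis's implementation:

```python
def join(f_prev, origin):
    """C_k = Join(F_{k-1}): combine pairs of (k-1)-itemsets that share their
    first k-2 fuzzy attributes, ordering the last items (p < q) and rejecting
    pairs whose last items come from the same original attribute."""
    candidates = []
    for p in f_prev:
        for q in f_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1] and origin[p[-1]] != origin[q[-1]]:
                candidates.append(p + q[-1:])
    return candidates

# F_1 and the attribute origins from the running example:
origin = {3: "Age", 6: "BloodSugar", 8: "HeartDisease", 9: "HeartDisease"}
c2 = join([(3,), (6,), (8,), (9,)], origin)
# {8, 9} is rejected (both from HeartDisease), leaving the five candidates of C_2
```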
C_k = Prune(C_k): this function prunes unnecessary candidate k-itemsets from C_k thanks to the downward closure property: "all subsets of a frequent itemset are also frequent, and any superset of a non-frequent itemset is not frequent". To evaluate the usefulness of a k-itemset in C_k, the Prune function makes sure that all of its (k-1)-subsets are present in F_{k-1}. For instance, after pruning we still have C_2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}.
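The Prune step is a direct application of downward closure; a minimal sketch:

```python
from itertools import combinations

def prune(candidates, f_prev):
    """Keep a candidate k-itemset only if every one of its (k-1)-subsets is
    a frequent (k-1)-itemset (the downward closure property)."""
    prev = set(f_prev)
    return [c for c in candidates
            if all(sub in prev for sub in combinations(c, len(c) - 1))]

# Running example: every 1-subset of C_2 is in F_1, so nothing is pruned.
f1 = [(3,), (6,), (8,), (9,)]
c2 = [(3, 6), (3, 8), (3, 9), (6, 8), (6, 9)]
pruned = prune(c2, f1)
```

A hypothetical candidate such as {3, 7} would be dropped, since {7} is not frequent.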
F_k = Checking(C_k, D_F, fminsup): this function first scans the whole set of records in the database to update the support factors of the candidate itemsets in C_k. Afterwards, Checking eliminates any infrequent candidate itemset, i.e. one whose support is smaller than fminsup. All frequent itemsets are retained and put into F_k. After running F_2 = Checking(C_2, D_F, 46%), we receive F_2 = {{3, 6}, {6, 8}}. The following table displays the detailed information.
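Finally, the Checking step can be sketched as one scan that accumulates each candidate's support, as in formula (3.6), and filters by fminsup. The rows below are hypothetical fuzzified records keyed by fuzzy-attribute index:

```python
def checking(candidates, rows, fminsup, tnorm=min):
    """F_k = Checking(C_k, D_F, fminsup): compute each candidate's fuzzy
    support in one pass over the records and keep only the frequent ones."""
    frequent = []
    for cand in candidates:
        total = 0.0
        for row in rows:
            v = row[cand[0]]
            for item in cand[1:]:
                v = tnorm(v, row[item])
            total += v
        if total / len(rows) >= fminsup:
            frequent.append(cand)
    return frequent

rows = [{3: 0.9, 6: 1.0}, {3: 0.6, 6: 1.0}]
# support of {3, 6} = (0.9 + 0.6) / 2 = 0.75 >= 0.46, so it is kept
f2 = checking([(3, 6)], rows, 0.46)
```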