Discretization Methods
Ying Yang1, Geoffrey I. Webb2, and Xindong Wu3

1 School of Computer Science and Software Engineering, Monash University, Melbourne, Australia (yyang@mail.csse.monash.edu.au)
2 Faculty of Information Technology, Monash University, Australia (geoff.webb@infotech.monash.edu)
3 Department of Computer Science, University of Vermont, USA (xwu@cs.uvm.edu)
Summary. Data-mining applications often involve quantitative data. However, learning from quantitative data is often less effective and less efficient than learning from qualitative data. Discretization addresses this issue by transforming quantitative data into qualitative data. This chapter presents a comprehensive introduction to discretization. It clarifies the definition of discretization. It provides a taxonomy of discretization methods together with a survey of major discretization methods. It also discusses issues that affect the design and application of discretization methods.
Key words: Discretization, quantitative data, qualitative data
Introduction
Discretization is a data-processing procedure that transforms quantitative data into qualitative data.
Data Mining applications often involve quantitative data. However, there exist many learning algorithms that are primarily oriented to handle qualitative data (Kerber, 1992, Dougherty et al., 1995, Kohavi and Sahami, 1996). Even for algorithms that can directly deal with quantitative data, learning is often less efficient and less effective (Catlett, 1991, Kerber, 1992, Richeldi and Rossotto, 1995, Frank and Witten, 1999). Hence discretization has long been an active topic in Data Mining and knowledge discovery. Many discretization algorithms have been proposed. Evaluation of these algorithms has frequently shown that discretization helps improve the performance of learning and helps understand the learning results.
This chapter presents an overview of discretization. Section 6.1 explains the terminology involved in discretization. It clarifies the definition of discretization, which has been defined in many differing ways in previous literature. Section 6.2 presents a
comprehensive taxonomy of discretization approaches. Section 6.3 introduces typical discretization algorithms corresponding to the taxonomy. Section 6.4 addresses the issue that different discretization strategies are appropriate for different learning problems; hence designing or applying discretization should not be blind to its learning context. Section 6.5 provides a summary of this chapter.
6.1 Terminology
Discretization transforms one type of data to another type. In the large amount of existing literature that addresses discretization, there is considerable variation in the terminology used to describe these two data types, including 'quantitative' vs. 'qualitative', 'continuous' vs. 'discrete', 'ordinal' vs. 'nominal', and 'numeric' vs. 'categorical'. It is necessary to make clear the difference among the various terms and accordingly choose the most suitable terminology for discretization.

We adopt the terminology of statistics (Bluman, 1992, Samuels and Witmer, 1999), which provides two parallel ways to classify data into different types. Data can be classified into either qualitative or quantitative. Data can also be classified into different levels of measurement scales. Sections 6.1.1 and 6.1.2 summarize this terminology.
6.1.1 Qualitative vs. quantitative
Qualitative data, also often referred to as categorical data, are data that can be placed into distinct categories. Qualitative data sometimes can be arrayed in a meaningful order, but no arithmetic operations can be applied to them. Examples of qualitative data are: blood type of a person: A, B, AB, O; and assignment evaluation: fail, pass, good, excellent.

Quantitative data are numeric in nature. They can be ranked in order. They also admit meaningful arithmetic operations. Quantitative data can be further classified into two groups, discrete or continuous.

Discrete data assume values that can be counted. The data cannot assume all values on the number line within their value range. An example is: number of children in a family.

Continuous data can assume all values on the number line within their value range. The values are obtained by measuring. An example is: temperature.
6.1.2 Levels of measurement scales
In addition to being classified into either qualitative or quantitative, data can also be classified by how they are categorized, counted or measured. This type of classification uses measurement scales, and four common levels of scales are: nominal, ordinal, interval and ratio.
The nominal level of measurement scales classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no meaningful order or ranking can be imposed on the data. An example is: blood type of a person: A, B, AB, O.
The ordinal level of measurement scales classifies data into categories that can be ranked. However, the differences between the ranks cannot be calculated by arithmetic. An example is: assignment evaluation: fail, pass, good, excellent. It is meaningful to say that the assignment evaluation of pass ranks higher than that of fail. It is not meaningful in the same way to say that the blood type of A ranks higher than that of B.
The interval level of measurement scales ranks data, and the differences between units of measure can be calculated by arithmetic. However, zero in the interval level of measurement does not mean 'nil' or 'nothing' as zero in arithmetic means. An example is: Fahrenheit temperature. It has a meaningful difference of one degree between each unit, but 0 degrees Fahrenheit does not mean there is no heat. It is meaningful to say that 74 degrees is two degrees higher than 72 degrees. It is not meaningful in the same way to say that the evaluation of excellent is two degrees higher than the evaluation of good.
The ratio level of measurement scales possesses all the characteristics of interval measurement, and there exists a zero that, the same as arithmetic zero, means 'nil' or 'nothing'. In consequence, true ratios exist between different units of measure. An example is: number of children in a family. It is meaningful to say that family X has twice as many children as does family Y. It is not meaningful in the same way to say that 100 degrees Fahrenheit is twice as hot as 50 degrees Fahrenheit.
The nominal level is the lowest level of measurement scales. It is the least powerful in that it provides the least information about the data. The ordinal level is higher, followed by the interval level. The ratio level is the highest. Any data conversion from a higher level of measurement scales to a lower level of measurement scales will lose information. Table 6.1 gives a summary of the characteristics of different levels of measurement scales.
Table 6.1 Measurement Scales

Level     Ranking?  Arithmetic operation?  Arithmetic zero?
Nominal   no        no                     no
Ordinal   yes       no                     no
Interval  yes       yes                    no
Ratio     yes       yes                    yes
6.1.3 Summary
In summary, the following classification of data types applies:
1. qualitative data:
a) nominal;
b) ordinal;
2. quantitative data:
a) interval, either discrete or continuous;
b) ratio, either discrete or continuous.
We believe that 'discretization' as it is usually applied in data mining is best defined as the transformation from quantitative data to qualitative data. In consequence, we will refer to data as either quantitative or qualitative throughout this chapter.
6.2 Taxonomy
There exist diverse taxonomies in the existing literature to classify discretization methods. Different taxonomies emphasize different aspects of the distinctions among discretization methods.
Typically, discretization methods can be either primary or composite. Primary methods accomplish discretization without reference to any other discretization method. Composite methods are built on top of some primary method(s).
Primary methods can be classified as per the following taxonomies.
1. Supervised vs. Unsupervised (Dougherty et al., 1995). Methods that use the class information of the training instances to select discretization cut points are supervised. Methods that do not use the class information are unsupervised. Supervised discretization can be further characterized as error-based, entropy-based or statistics-based according to whether intervals are selected using metrics based on error on the training data, entropy of the intervals, or some statistical measure.
2. Parametric vs. Non-parametric. Parametric discretization requires input from the user, such as the maximum number of discretized intervals. Non-parametric discretization only uses information from data and does not need input from the user.
3. Hierarchical vs. Non-hierarchical. Hierarchical discretization selects cut points in an incremental process, forming an implicit hierarchy over the value range. The procedure can be split or merge (Kerber, 1992). Split discretization initially has the whole value range as an interval, then continues splitting it into sub-intervals until some threshold is met. Merge discretization initially puts each value into an interval, then continues merging adjacent intervals until some threshold is met. Some discretization methods utilize both split and merge processes. For example, intervals are initially formed by splitting, and then a merge process is performed to post-process the formed intervals. Non-hierarchical discretization does not form any hierarchy during discretization. For example, many methods scan the ordered values only once, sequentially forming the intervals.
4. Univariate vs. Multivariate (Bay, 2000). Methods that discretize each attribute in isolation are univariate. Methods that take into consideration relationships among attributes during discretization are multivariate.
5. Disjoint vs. Non-disjoint (Yang and Webb, 2002). Disjoint methods discretize the value range of the attribute under discretization into disjoint intervals. No intervals overlap. Non-disjoint methods discretize the value range into intervals that can overlap.
6. Global vs. Local (Dougherty et al., 1995). Global methods discretize with respect to the whole training data space. They perform discretization once only, using a single set of intervals throughout a single classification task. Local methods allow different sets of intervals to be formed for a single attribute, each set being applied in a different classification context. For example, different discretizations of a single attribute might be applied at different nodes of a decision tree (Quinlan, 1993).
7. Eager vs. Lazy (Hsu et al., 2000, Hsu et al., 2003). Eager methods perform discretization prior to classification time. Lazy methods perform discretization during classification time.
8. Time-sensitive vs. Time-insensitive. Under time-sensitive discretization, the qualitative value associated with a quantitative value can change over time. That is, the same quantitative value can be discretized into different values depending on the previous values observed in the time series. Time-insensitive discretization only uses the stationary properties of the quantitative data.
9. Ordinal vs. Nominal. Ordinal discretization transforms quantitative data into ordinal qualitative data. It aims at taking advantage of the ordering information implicit in quantitative attributes, so as not to make values 1 and 2 as dissimilar as values 1 and 10. Nominal discretization transforms quantitative data into nominal qualitative data. The ordering information is hence discarded.
10. Fuzzy vs. Non-fuzzy (Wu, 1995, Wu, 1999, Ishibuchi et al., 2001). Fuzzy discretization first discretizes quantitative attribute values into intervals. It then places some kind of membership function at each cut point as fuzzy borders. The membership function measures the degree of each value belonging to each interval. With these fuzzy borders, a value can be discretized into a few different intervals at the same time, with varying degrees. Non-fuzzy discretization forms sharp borders without employing any membership function. (A small illustrative sketch of fuzzy borders follows this list.)
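To make the idea of fuzzy borders concrete, the following minimal sketch (our own illustration, not taken from the cited papers) places a simple linear membership function of half-width w at each cut point, so that a value close to a cut point belongs partially to both neighbouring intervals:

```python
def fuzzy_memberships(value, cut_points, w):
    """Return {interval_index: degree} for a value, given sorted cut points.

    Intervals are numbered 0..len(cut_points); a linear membership of
    half-width w is assumed around each cut point (our own simplification).
    """
    memberships = {}
    boundaries = [float("-inf")] + list(cut_points) + [float("inf")]
    for i, (low, high) in enumerate(zip(boundaries[:-1], boundaries[1:])):
        # Degree rises from 0 to 1 across [low - w, low + w] and
        # falls from 1 to 0 across [high - w, high + w].
        rise = 1.0 if low == float("-inf") else (value - (low - w)) / (2 * w)
        fall = 1.0 if high == float("inf") else ((high + w) - value) / (2 * w)
        degree = max(0.0, min(1.0, rise, fall))
        if degree > 0.0:
            memberships[i] = degree
    return memberships

# A value near the cut point 5.0 belongs to intervals 0 and 1 with degrees summing to 1.
print(fuzzy_memberships(4.8, cut_points=[5.0, 10.0], w=0.5))  # approximately {0: 0.7, 1: 0.3}
```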
Composite methods first choose some primary discretization method to form the initial cut points. They then focus on how to adjust these initial cut points to achieve certain goals. The taxonomy of a composite method sometimes is flexible, depending on the taxonomy of its primary method.
6.3 Typical methods
Corresponding to our taxonomy in the previous section, we here enumerate some typical discretization methods. There are many other methods that are not reviewed due to the space limit. For a more comprehensive study on existing discretization algorithms, Yang (2003) and Wu (1995) offer good sources.
6.3.1 Background and terminology
A term often used for describing a discretization approach is 'cut point'. Discretization forms intervals according to the value range of the quantitative data. It then associates a qualitative value with each interval. A cut point is a value among the quantitative data where an interval boundary is located by a discretization method. Another commonly-mentioned term is 'boundary cut point', which refers to values between two instances with different classes in the sequence of instances sorted by a quantitative attribute. It has been proved that evaluating only the boundary cut points is sufficient for finding the minimum class information entropy (Fayyad and Irani, 1993).
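As a rough illustration of boundary cut points (our own sketch, not code from the cited work; function and variable names are ours), candidates can be collected by scanning the instances sorted on the attribute and recording a midpoint wherever the class changes between two adjacent distinct values:

```python
def boundary_cut_points(values, classes):
    """Return candidate boundary cut points for one quantitative attribute.

    values  : list of attribute values
    classes : list of class labels, parallel to values
    """
    pairs = sorted(zip(values, classes))
    cut_points = []
    for (v1, c1), (v2, c2) in zip(pairs[:-1], pairs[1:]):
        # A boundary lies between two adjacent distinct values with different classes.
        if v1 != v2 and c1 != c2:
            cut_points.append((v1 + v2) / 2.0)
    return cut_points

print(boundary_cut_points([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))  # [2.5]
```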
We use the following terminology. Data comprise a set or sequence of instances. Each instance is described by a vector of attribute values. For classification learning, each instance is also labelled with a class. Each attribute is either qualitative or quantitative. Classes are qualitative. Instances from which one learns cut points or other knowledge are training instances. If a test instance is presented, a learning algorithm is asked to make a prediction about the test instance according to the evidence provided by the training instances.
6.3.2 Equal-width, equal-frequency and fixed-frequency discretization
We present these three methods together because they are seemingly similar but actually different. They are all typical of unsupervised discretization. They are also typical of parametric discretization.
When discretizing a quantitative attribute, equal width discretization (EWD) (Catlett, 1991, Kerber, 1992, Dougherty et al., 1995) predefines k, the number of intervals. It then divides the number line between v_min and v_max into k intervals of equal width, where v_min is the minimum observed value and v_max is the maximum observed value. Thus the intervals have width w = (v_max − v_min)/k and the cut points are at v_min + w, v_min + 2w, ..., v_min + (k − 1)w.
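A minimal sketch of EWD might look as follows (our own illustration; the function name is ours):

```python
def equal_width_cut_points(values, k):
    """Equal-width discretization: return the k - 1 cut points for k intervals."""
    v_min, v_max = min(values), max(values)
    w = (v_max - v_min) / k
    return [v_min + i * w for i in range(1, k)]

print(equal_width_cut_points([1.0, 3.0, 7.0, 9.0], k=4))  # [3.0, 5.0, 7.0]
```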
When discretizing a quantitative attribute, equal-frequency discretization (EFD) (Catlett, 1991, Kerber, 1992, Dougherty et al., 1995) predefines k, the number of intervals. It then divides the sorted values into k intervals so that each interval contains approximately the same number of training instances. Suppose there are n training instances; each interval then contains n/k training instances with adjacent (possibly identical) values. Note that training instances with identical values must be placed in the same interval. In consequence it is not always possible to generate k equal-frequency intervals.
When discretizing a quantitative attribute, fixed-frequency discretization (FFD) (Yang and Webb, 2004) predefines a sufficient interval frequency k. It then discretizes the sorted values into intervals so that each interval has approximately the same number k of training instances with adjacent (possibly identical) values. (Just as for EFD, because of the existence of identical values, some intervals can have instance frequency exceeding k.)

It is worthwhile contrasting EFD and FFD, both of which form intervals of equal frequency. EFD fixes the interval number, which is usually arbitrarily chosen. FFD fixes the interval frequency, which is not arbitrary but is chosen to ensure that each interval contains sufficient instances to supply information, such as for estimating probability.
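The contrast can be made concrete with a small sketch (ours; it ignores the careful handling of identical values for brevity): EFD derives the per-interval frequency from a fixed interval number k, whereas FFD takes the frequency k itself as the parameter and lets the interval number follow from the data size.

```python
def equal_frequency_intervals(values, k):
    """EFD sketch: split the sorted values into k intervals of (roughly) equal size."""
    ordered = sorted(values)
    size = max(1, len(ordered) // k)          # frequency derived from the interval number k
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def fixed_frequency_intervals(values, k):
    """FFD sketch: each interval holds approximately k instances; the interval number follows."""
    ordered = sorted(values)
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]

data = [5, 1, 4, 2, 9, 7, 3, 8, 6, 10]
print(equal_frequency_intervals(data, k=5))   # five intervals of two values each
print(fixed_frequency_intervals(data, k=5))   # two intervals of five values each
```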
6.3.3 Multi-interval-entropy-minimization discretization (MIEMD)
Multi-interval-entropy-minimization discretization (MIEMD) (Fayyad and Irani, 1993) is typical of supervised discretization. It is also typical of non-parametric discretization. To discretize an attribute, MIEMD evaluates as a candidate cut point the midpoint between each successive pair of the sorted values. For evaluating each candidate cut point, the data are discretized into two intervals and the resulting class information entropy is calculated. A binary discretization is determined by selecting the cut point for which the entropy is minimal amongst all candidates. The binary discretization is applied recursively, always selecting the best cut point. A minimum description length (MDL) criterion is applied to decide when to stop discretization.
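The core step, choosing the binary cut point that minimizes class information entropy, can be sketched as follows (our own simplified code; the full method applies this step recursively and uses the MDL criterion, omitted here, to decide when to stop):

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    """Entropy of the class distribution in a list of labels."""
    counts, total = Counter(labels), len(labels)
    return sum((c / total) * log2(total / c) for c in counts.values())

def best_binary_cut(values, classes):
    """Return (cut_point, weighted_entropy) minimizing class information entropy."""
    pairs = sorted(zip(values, classes))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [c for v, c in pairs if v <= cut]
        right = [c for v, c in pairs if v > cut]
        weighted = (len(left) * class_entropy(left) + len(right) * class_entropy(right)) / n
        if best is None or weighted < best[1]:
            best = (cut, weighted)
    return best

print(best_binary_cut([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))  # (2.5, 0.0)
```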
6.3.4 ChiMerge, StatDisc and InfoMerge discretization
EWD and EFD are non-hierarchical discretization. MIEMD involves a split procedure and hence is hierarchical discretization. A typical merge approach to hierarchical discretization is ChiMerge (Kerber, 1992). It uses the χ2 (chi-square) statistic to determine if the relative class frequencies of adjacent intervals are distinctly different or if they are similar enough to justify merging them into a single interval. The ChiMerge algorithm consists of an initialization process and a bottom-up merging process. The initialization process contains two steps: (1) ascendingly sort the training instances according to their values for the attribute being discretized, (2) construct the initial discretization, in which each instance is put into its own interval. The interval merging process contains two steps, repeated continuously: (1) compute the χ2 for each pair of adjacent intervals, (2) merge the pair of adjacent intervals with the lowest χ2 value. Merging continues until all pairs of intervals have χ2 values exceeding a predefined χ2-threshold. That is, all intervals are considered significantly different by the χ2 independence test. The recommended χ2-threshold is at the 0.90, 0.95 or 0.99 significance level.
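A compact sketch of the merging loop is given below (our own code; the χ2 statistic is computed directly from the class counts of two adjacent intervals, each instance starts in its own interval, and the threshold is supplied by the user rather than looked up from a significance table):

```python
from collections import Counter

def chi2_pair(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals given their class-count Counters."""
    classes = set(counts_a) | set(counts_b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    chi2 = 0.0
    for counts in (counts_a, counts_b):
        row_total = sum(counts.values())
        for c in classes:
            expected = row_total * (counts_a[c] + counts_b[c]) / total
            expected = max(expected, 0.1)          # avoid division by zero for empty cells
            chi2 += (counts[c] - expected) ** 2 / expected
    return chi2

def chimerge(values, classes, threshold):
    """ChiMerge sketch: merge adjacent intervals until every pair exceeds the threshold."""
    pairs = sorted(zip(values, classes))
    intervals = [[v, v, Counter([c])] for v, c in pairs]   # [low, high, class counts]
    while len(intervals) > 1:
        scores = [chi2_pair(a[2], b[2]) for a, b in zip(intervals[:-1], intervals[1:])]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] > threshold:
            break
        low, _, counts_a = intervals[i]
        _, high, counts_b = intervals[i + 1]
        intervals[i:i + 2] = [[low, high, counts_a + counts_b]]
    return [(low, high) for low, high, _ in intervals]

# A threshold of about 2.7 corresponds to the 0.90 level with one degree of freedom.
print(chimerge([1, 2, 3, 7, 8, 9], ["a", "a", "a", "b", "b", "b"], threshold=2.7))  # [(1, 3), (7, 9)]
```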
StatDisc discretization (Richeldi and Rossotto, 1995) extends ChiMerge to allow any number of intervals to be merged at a time instead of only 2 as ChiMerge does. Both ChiMerge and StatDisc are based on a statistical measure of dependency. These statistical measures treat an attribute and a class symmetrically. A third merge discretization method, InfoMerge (Freitas and Lavington, 1996), argues that an attribute and a class should be treated asymmetrically, since one wants to predict the value of the class attribute given the discretized attribute but not the reverse. Hence InfoMerge uses information loss, calculated as the difference between the amount of information necessary to identify the class of an instance after merging and the amount of information before merging, to direct the merge procedure.
6.3.5 Cluster-based discretization
The above-mentioned methods are all univariate. A typical multivariate discretization technique is cluster-based discretization (Chmielewski and Grzymala-Busse, 1996). This method consists of two steps. The first step is cluster formation to determine initial intervals for the quantitative attributes. The second step is post-processing to minimize the number of discretized intervals.

Instances here are deemed points in n-dimensional space, which is defined by the n attribute values. During cluster formation, the median cluster analysis method is used. Clusters are initialized by allowing each instance to be a cluster. New clusters are formed by merging the two existing clusters that exhibit the greatest similarity between each other. Cluster formation continues as long as the level of consistency of the partition is not less than the level of consistency of the original data. Once this process is completed, instances that belong to the same cluster are indiscernible by the subset of quantitative attributes, and thus a partition on the set of training instances is induced. Clusters can be analyzed in terms of all attributes to find out cut points for each attribute simultaneously.

After discretized intervals are formed, post-processing picks for merging the pair of adjacent intervals, among all quantitative attributes, whose resulting class entropy is the smallest. If the consistency of the dataset after the merge is above a given threshold, the merge is performed. Otherwise this pair of intervals is marked as non-mergeable and the next candidate is processed. The process stops when every possible pair of adjacent intervals is marked as non-mergeable.
6.3.6 ID3 discretization
ID3 provides a typical example of local discretization. ID3 (Quinlan, 1986) is an inductive learning program that constructs classification rules in the form of a decision tree. It uses local discretization to deal with quantitative attributes. For each quantitative attribute, ID3 divides its sorted values into two intervals in all possible ways. For each division, the resulting information gain of the data is calculated. The attribute that obtains the maximum information gain is chosen to be the current tree node, and the data are divided into subsets corresponding to its two value intervals. In each subset, the same process is recursively conducted to grow the decision tree. The same attribute can be discretized differently if it appears in different branches of the decision tree.
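As a rough sketch of the node-level choice (our own code; the attribute names and toy data are hypothetical), the attribute whose best binary cut yields the lowest weighted class entropy, which is equivalent to the highest information gain, becomes the test at the current node; each subtree then repeats the search on its own subset and may choose a different cut for the same attribute.

```python
from collections import Counter
from math import log2

def weighted_split_entropy(values, classes, cut):
    """Weighted class entropy after splitting at cut (lower means higher information gain)."""
    def entropy(labels):
        counts, n = Counter(labels), len(labels)
        return sum((c / n) * log2(n / c) for c in counts.values())
    left = [c for v, c in zip(values, classes) if v <= cut]
    right = [c for v, c in zip(values, classes) if v > cut]
    n = len(classes)
    return (len(left) * entropy(left) + len(right) * entropy(right)) / n

def choose_split(data, classes):
    """data: {attribute_name: list of values}; return the (attribute, cut) with lowest entropy."""
    best = None
    for attr, values in data.items():
        for a, b in zip(sorted(set(values))[:-1], sorted(set(values))[1:]):
            cut = (a + b) / 2.0
            score = weighted_split_entropy(values, classes, cut)
            if best is None or score < best[2]:
                best = (attr, cut, score)
    return best[0], best[1]

# Hypothetical toy data: 'temperature' separates the classes perfectly, 'humidity' does not.
data = {"temperature": [15.0, 18.0, 27.0, 30.0], "humidity": [40.0, 80.0, 45.0, 85.0]}
print(choose_split(data, ["no", "no", "yes", "yes"]))   # ('temperature', 22.5)
```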
6.3.7 Non-disjoint discretization
The above-mentioned methods are all disjoint discretization. Non-disjoint discretization (NDD) (Yang and Webb, 2002), on the other hand, forms overlapping intervals for a quantitative attribute, always locating a value toward the middle of its discretized interval. This strategy is desirable since it can efficiently form for each single quantitative value a most appropriate interval.
When discretizing a quantitative attribute, suppose there are N instances. NDD identifies among the sorted values t atomic intervals, (a'_1, b'_1], (a'_2, b'_2], ..., (a'_t, b'_t], each containing s' instances, so that

s' = s/3.    (6.1)

(Theoretically, any odd number besides 3 is acceptable in (6.1), as long as the same number of atomic intervals is grouped together later for the probability estimation. For simplicity, we take 3 for demonstration.)

One interval is formed for each set of three consecutive atomic intervals, such that the kth (1 ≤ k ≤ t − 2) interval (a_k, b_k] satisfies a_k = a'_k and b_k = b'_{k+2}. Each value v is assigned to the interval (a'_{i−1}, b'_{i+1}], where i is the index of the atomic interval (a'_i, b'_i] such that a'_i < v ≤ b'_i, except when i = 1, in which case v is assigned to the interval (a'_1, b'_3], and when i = t, in which case v is assigned to the interval (a'_{t−2}, b'_t]. Figure 6.1 illustrates the procedure. As a result, except in the case of falling into the first or the last atomic interval, a numeric value is always toward the middle of its corresponding interval, and intervals can overlap with each other.
Fig. 6.1 Atomic Intervals Compose Actual Intervals
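A rough sketch of the assignment rule follows (our own code; it assumes at least three atomic intervals whose upper boundaries have already been formed, for example by an equal-frequency pass):

```python
import bisect

def ndd_interval(value, atomic_boundaries):
    """Return the overlapping interval (low, high] that NDD associates with a value.

    atomic_boundaries: sorted upper boundaries b'_1 < b'_2 < ... < b'_t of the t
    atomic intervals; the implicit lower boundary of the first interval is -infinity.
    Assumes t >= 3.
    """
    t = len(atomic_boundaries)
    # 1-based index of the atomic interval (a'_i, b'_i] that contains the value.
    i = bisect.bisect_left(atomic_boundaries, value) + 1
    i = max(2, min(i, t - 1))            # clamp so first/last values reuse the end intervals
    low = float("-inf") if i == 2 else atomic_boundaries[i - 3]   # a'_{i-1} = b'_{i-2}
    high = atomic_boundaries[i]                                   # b'_{i+1}
    return (low, high)

# With atomic boundaries at 2, 4, 6, 8, 10, the value 5 falls in atomic interval (4, 6]
# and is placed in the overlapping interval (2, 8], i.e. toward the middle of it.
print(ndd_interval(5.0, [2.0, 4.0, 6.0, 8.0, 10.0]))   # (2.0, 8.0)
```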
6.3.8 Lazy discretization
The above-mentioned methods are all eager. In comparison, lazy discretization (LD) (Hsu et al., 2000, Hsu et al., 2003) defers discretization until classification time. It waits until a test instance is presented to determine the cut points for each quantitative attribute of this test instance. When classifying an instance, LD creates only one interval for each quantitative attribute, containing the instance's value, and leaves other value regions untouched. In particular, it selects a pair of cut points for each quantitative attribute such that the value is in the middle of its corresponding interval. Where the cut points are located is decided by LD's primary discretization method, such as EWD.
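A minimal sketch with equal-width spacing as the primary method (our own code; parameter names are ours) shows the idea: only the single interval around the test value is ever materialized.

```python
def lazy_interval(test_value, v_min, v_max, k):
    """Lazy discretization sketch with equal-width spacing as the primary method.

    Instead of forming all k intervals eagerly, build only the one interval
    (of width (v_max - v_min) / k) centred on the test value.
    """
    w = (v_max - v_min) / k
    return (test_value - w / 2.0, test_value + w / 2.0)

# For an attribute observed between 0 and 100 with k = 10, the test value 42
# gets the single interval (37.0, 47.0); no other intervals are created.
print(lazy_interval(42.0, v_min=0.0, v_max=100.0, k=10))   # (37.0, 47.0)
```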