For validity indices that increase (or decrease) as the number of clusters increases, we search for the values of n_c at which a significant local change in the value of the index occurs. This change appears as a 'knee' in the plot and is an indication of the number of clusters underlying the data set. Moreover, the absence of such a knee may be an indication that the data set possesses no clustering structure. Below, some representative validity indices for crisp and fuzzy clustering are presented.
Crisp Clustering
Crisp clustering considers non-overlapping partitions, meaning that a data point either belongs to a class or not. In this section we discuss validity indices suitable for crisp clustering.

The modified Hubert Γ statistic

The definition of the modified Hubert Γ statistic (Theodoridis and Koutroubas, 1999) is given by the equation
\Gamma = \frac{1}{M}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} P(i,j)\, Q(i,j)
where N is the number of objects in the data set, M = N(N − 1)/2, P is the proximity matrix of the data set, and Q is an N × N matrix whose (i, j) element is equal to the distance d(v_{c_i}, v_{c_j}) between the representative points of the clusters to which the objects x_i and x_j belong.
Similarly, we can define the normalized Hubert \hat{\Gamma} statistic, given by the equation

\hat{\Gamma} = \frac{(1/M)\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left(P(i,j)-\mu_P\right)\left(Q(i,j)-\mu_Q\right)}{\sigma_P\,\sigma_Q}

where \mu_P, \mu_Q, \sigma_P, \sigma_Q are the respective means and variances of the P and Q matrices.
If d(v_{c_i}, v_{c_j}) is close to d(x_i, x_j) for i, j = 1, 2, ..., N, then P and Q will be in close agreement and the values of Γ and \hat{\Gamma} (normalized Γ) will be high. Equivalently, a high value of Γ (\hat{\Gamma}) indicates the existence of compact clusters. Thus, in the plot of normalized Γ versus n_c, we seek a significant knee that corresponds to a significant increase of normalized Γ. The number of clusters at which the knee occurs is an indication of the number of clusters underlying the data. We note that for n_c = 1 and n_c = N the index is not defined.
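To make the computation concrete, the following is a minimal NumPy sketch of both the modified and the normalized Hubert statistics, assuming Euclidean distances and that each point's representative is the center of its cluster; the function name and arguments are illustrative, not part of the original text.

```python
import numpy as np

def hubert_gamma(X, labels, centers, normalized=False):
    """Modified Hubert Gamma statistic.

    P[i, j] : distance between points x_i and x_j (proximity matrix).
    Q[i, j] : distance between the representatives of the clusters
              to which x_i and x_j belong.
    """
    N = len(X)
    M = N * (N - 1) / 2
    P = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    reps = centers[labels]                      # representative point of each object's cluster
    Q = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    iu = np.triu_indices(N, k=1)                # all pairs with i < j
    p, q = P[iu], Q[iu]
    if not normalized:
        return np.sum(p * q) / M
    # normalized statistic: correlation-like value between P and Q
    return np.sum((p - p.mean()) * (q - q.mean())) / (M * p.std() * q.std())
```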
Dunn family of indices
A cluster validity index for crisp clustering, proposed in (Dunn, 1974), aims at the identification of 'compact and well separated clusters'. The index is defined by the following equation for a specific number of clusters:
D_{n_c} = \min_{i=1,\dots,n_c}\left\{\min_{j=i+1,\dots,n_c}\left(\frac{d(c_i,c_j)}{\max_{k=1,\dots,n_c} diam(c_k)}\right)\right\} \qquad (31.18)
where d(c i,c j ) is the dissimilarity function between two clusters c i and c j defined as
d (c i,c j ) = min x ∈C i ,y∈C j d (x,y) , and diam(c) is the diameter of a cluster, which may be con-sidered as a measure of clusters’ dispersion The diameter of a cluster C can be defined as
follows:
If the data set contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small. Thus, based on Dunn's index definition, we may conclude that large values of the index indicate the presence of compact and well-separated clusters.
The index D_{n_c} does not exhibit any trend with respect to the number of clusters, so the maximum in the plot of D_{n_c} versus the number of clusters can be taken as an indication of the number of clusters that fits the data.

The main problems of the Dunn index are: i) its considerable time complexity, and ii) its sensitivity to the presence of noise in data sets, since noise is likely to increase the values of diam(c) (i.e., the denominator of equation 31.18).
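A brute-force sketch of the Dunn index, using Euclidean distances, single-linkage dissimilarity between clusters and the maximum pairwise distance as the diameter, is shown below; it also illustrates why the index is expensive, since all pairwise distances are examined.

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index (Eq. 31.18): min inter-cluster distance over max cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]

    def single_link(a, b):
        # dissimilarity d(c_i, c_j): minimum pairwise distance between the two clusters
        return np.min(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))

    def diameter(c):
        # diam(C): maximum pairwise distance within the cluster (0 for singletons)
        if len(c) < 2:
            return 0.0
        return np.max(np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1))

    max_diam = max(diameter(c) for c in clusters)
    min_sep = min(single_link(ci, cj)
                  for i, ci in enumerate(clusters)
                  for cj in clusters[i + 1:])
    return min_sep / max_diam
```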
Three indices that are more robust to the presence of noise are proposed in (Pal and Biswas, 1997). They are known as Dunn-like indices since they are based on the Dunn index. Moreover, the three indices use for their definitions the concepts of the Minimum Spanning Tree (MST), the Relative Neighborhood Graph (RNG) and the Gabriel Graph (GG), respectively (Theodoridis and Koutroubas, 1999). Consider the index based on the MST. Let c_i be a cluster and G_i the complete graph whose vertices correspond to the vectors of c_i. The weight, w_e, of an edge, e, of this graph equals the distance between its two end points, x, y. Let E_i^{MST} be the set of edges of the MST of the graph G_i, and e_i^{MST} the edge in E_i^{MST} with the maximum weight. Then the diameter of c_i is defined as the weight of e_i^{MST}. The Dunn-like index based on the concept of the MST is given by the equation
D_{n_c}^{MST} = \min_{i=1,\dots,n_c}\left\{\min_{j=i+1,\dots,n_c}\left(\frac{d(c_i,c_j)}{\max_{k=1,\dots,n_c} diam_k^{MST}}\right)\right\} \qquad (31.20)
The number of clusters at which D_{n_c}^{MST} takes its maximum value indicates the number of clusters in the underlying data. Based on similar arguments we may define the Dunn-like indices for the GG and RNG graphs.
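A sketch of the MST-based diameter, which is the only ingredient that changes relative to the original Dunn index, can be written with SciPy's minimum spanning tree routine (an implementation choice, not prescribed by the source):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_diameter(cluster_points):
    """MST-based diameter used in Eq. 31.20: the weight of the heaviest edge
    of the minimum spanning tree of the complete graph G_i over the cluster."""
    if len(cluster_points) < 2:
        return 0.0
    weights = squareform(pdist(cluster_points))   # complete graph with Euclidean edge weights
    mst = minimum_spanning_tree(weights)          # sparse matrix holding the MST edges
    return mst.toarray().max()                    # weight of e_i^MST
```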
The Davies-Bouldin (DB) index
A similarity measure R_{ij} between the clusters c_i and c_j is defined based on a measure of dispersion of a cluster c_i, denoted by s_i, and a dissimilarity measure between two clusters, d_{ij}. The R_{ij} index is defined to satisfy the following conditions (Davies and Bouldin, 1979):
1. R_{ij} ≥ 0
2. R_{ij} = R_{ji}
3. if s_i = 0 and s_j = 0 then R_{ij} = 0
4. if s_j > s_k and d_{ij} = d_{ik} then R_{ij} > R_{ik}
5. if s_j = s_k and d_{ij} < d_{ik} then R_{ij} > R_{ik}
These conditions state that R_{ij} is non-negative and symmetric. A simple choice for R_{ij} that satisfies the above conditions is (Davies and Bouldin, 1979):

R_{ij} = \frac{s_i + s_j}{d_{ij}} \qquad (31.21)

Then the DB index is defined as
DB_{n_c} = \frac{1}{n_c}\sum_{i=1}^{n_c} R_i, \qquad R_i = \max_{j=1,\dots,n_c,\ j \neq i}\{R_{ij}\},\ i = 1,\dots,n_c \qquad (31.22)
It is clear from the above definition that DB_{n_c} is the average similarity between each cluster c_i, i = 1, ..., n_c, and its most similar one. Since it is desirable for the clusters to have the minimum possible similarity to each other, we seek partitionings that minimize DB_{n_c}. The DB_{n_c} index exhibits no trend with respect to the number of clusters, and thus we seek the minimum value of DB_{n_c} in its plot versus the number of clusters.
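The following sketch computes the DB index from a data matrix, integer labels and cluster centers; it assumes Euclidean distances, with the dispersion s_i taken as the mean distance of a cluster's points to its center (a common choice, not fixed by the definition above).

```python
import numpy as np

def davies_bouldin(X, labels, centers):
    """Davies-Bouldin index: average over clusters of max_j (s_i + s_j) / d_ij."""
    n_c = len(centers)
    # s_i: dispersion of cluster i (mean distance of its members to the center)
    s = np.array([np.mean(np.linalg.norm(X[labels == i] - centers[i], axis=1))
                  for i in range(n_c)])
    total = 0.0
    for i in range(n_c):
        # R_i: similarity of cluster i to its most similar cluster
        total += max((s[i] + s[j]) / np.linalg.norm(centers[i] - centers[j])
                     for j in range(n_c) if j != i)
    return total / n_c
```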
Some alternative definitions of the dissimilarity between two clusters, as well as of the dispersion of a cluster c_i, are given in (Davies and Bouldin, 1979).

Three variants of the DB_{n_c} index are proposed in (Pal and Biswas, 1997). They are based on the MST, RNG and GG concepts, similarly to the cases of the Dunn-like indices.
Other validity indices for crisp clustering have been proposed in (Dave, 1996) and (Milligan and Cooper, 1985). The implementation of most of these indices is computationally very expensive, especially when the number of clusters and objects in the data set grows very large (Xie and Beni, 1991). In (Milligan and Cooper, 1985), an evaluation study of 30 validity indices proposed in the literature is presented. It is based on tiny data sets (about 50 points each) with well-separated clusters. The results of this study place Calinski and Harabasz (1974), Je(2)/Je(1) (1984), the C-index (1976), Gamma and Beale among the six best indices. However, it is noted that although the results concerning these methods are encouraging, they are likely to be data dependent. Thus, the behavior of the indices may change if different data structures are used. Also, some indices are based on a sample of the clustering results. A representative example is Je(2)/Je(1), whose computation is based only on the information provided by the items involved in the last cluster merge.
RMSSTD, SPR, RS, CD
This family of validity indices is applicable in cases where hierarchical algorithms are used to cluster the data sets. Below we present the definitions of four validity indices, which have to be used simultaneously to determine the number of clusters existing in the data set. These four indices are applied at each step of a hierarchical clustering algorithm (Sharma, 1996).
• RMSSTD (root mean square standard deviation) of a new clustering scheme defined at a level of the clustering hierarchy is the square root of the variance of all the variables (attributes used in the clustering process). This index measures the homogeneity of the formed clusters at each step of the hierarchical algorithm. Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible. Where the values of RMSSTD are higher than those of the previous step, we have an indication that the new clustering scheme is worse.
In the following definitions we shall use the term SS, which means Sum of Squares and refers to the equation:

SS = \sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
Along with this we shall use the following additional notation:
1. SS_w, referring to the sum of squares within a group,
2. SS_b, referring to the sum of squares between groups,
3. SS_t, referring to the total sum of squares of the whole data set.
• SPR (Semi-Partial R-squared) of the new cluster is the difference between the SS_w of the new cluster and the sum of the SS_w values of the clusters joined to obtain the new cluster (loss of homogeneity), divided by the SS_t of the whole data set. This index measures the loss of homogeneity after merging two clusters in a single algorithm step. If the index value is zero, the new cluster is obtained by merging two perfectly homogeneous clusters. If its value is high, the new cluster is obtained by merging two heterogeneous clusters.
• RS (R-Squared) of the new cluster is the ratio of SS_b to SS_t. SS_b is a measure of the difference between groups. Since SS_t = SS_b + SS_w, the greater the SS_b the smaller the SS_w, and vice versa. As a result, the greater the differences between groups, the more homogeneous each group is, and vice versa. Thus, RS may be considered as a measure of dissimilarity between clusters; furthermore, it measures the degree of homogeneity between groups. The values of RS range between 0 and 1. Where the value of RS is zero, there is an indication that no difference exists among groups. On the other hand, when RS equals 1, there is an indication of a significant difference among groups.
• The CD index measures the distance between the two clusters that are merged at a given step of the hierarchical clustering. This distance depends on the representatives selected for the hierarchical clustering we perform. For instance, in the case of centroid hierarchical clustering, the representatives of the formed clusters are the centers of each cluster, so CD is the distance between the centers of the clusters. Where we use single linkage, CD measures the minimum Euclidean distance between all possible pairs of points. In the case of complete linkage, CD is the maximum Euclidean distance between all pairs of data points, and so on.
Using these four indices, we determine the number of clusters that exist in a data set by plotting a graph of their values over the different stages of the clustering algorithm. In this graph we search for the steepest knee or, in other words, the greatest jump of these indices' values from the higher to the smaller number of clusters.
The SD validity index
Another clustering validity approach is proposed in (Halkidi et al., 2000). The SD validity index definition is based on the concepts of average scattering for clusters and total separation between clusters. Below, we give the fundamental definitions for this index.

Average scattering for clusters. It evaluates the scattering of the points in the clusters by comparing the variance of the considered clustering scheme with the variance of the whole data set. The average scattering for clusters is defined as follows:
Scat(n_c) = \frac{1}{n_c}\sum_{i=1}^{n_c}\frac{\|\sigma(v_i)\|}{\|\sigma(S)\|} \qquad (31.24)

The term σ(S) is the variance of the data set; its p-th dimension is defined as follows:
\sigma_p = \frac{1}{n}\sum_{k=1}^{n}\left(x_k^p - \bar{x}^p\right)^2 \qquad (31.25)

where \bar{x}^p is the p-th dimension of \bar{X} = \frac{1}{n}\sum_{k=1}^{n} x_k, \forall x_k \in S.
The term σ(v_i) is the variance of cluster c_i; its p-th dimension is given by the equation

\sigma(v_i)^p = \frac{1}{n_i}\sum_{k=1}^{n_i}\left(x_k^p - v_i^p\right)^2 \qquad (31.26)

Further, the norm of a vector Y = (y_1, \dots, y_k) (e.g. σ(v_i)) is defined as \|Y\| = \left(Y^T Y\right)^{1/2}.
Total separation between clusters. The definition of the total scattering (separation) between clusters is given by the following equation:
Dis(n_c) = \frac{D_{max}}{D_{min}}\sum_{k=1}^{n_c}\left(\sum_{z=1}^{n_c}\|v_k - v_z\|\right)^{-1} \qquad (31.27)
where D_{max} = \max(\|v_i - v_j\|), \forall i, j \in \{1, 2, \dots, n_c\}, is the maximum distance between cluster centers and D_{min} = \min(\|v_i - v_j\|), \forall i, j \in \{1, 2, \dots, n_c\}, i \neq j, is the minimum distance between cluster centers.
Now, we can define a validity index based on equations 31.24 and 31.27 as follows:

SD(n_c) = a \cdot Scat(n_c) + Dis(n_c) \qquad (31.28)

where a is a weighting factor equal to Dis(c_{max}) and c_{max} is the maximum number of input clusters.
The first term (i.e., Scat(n_c)), defined in Eq. 31.24, indicates the average compactness of the clusters (i.e., intra-cluster distance). A small value of this term indicates compact clusters, and as the scattering within clusters increases (i.e., they become less compact) the value of Scat(n_c) also increases. The second term, Dis(n_c), indicates the total separation between the n_c clusters (i.e., an indication of inter-cluster distance). Contrary to the first term, Dis(n_c) is influenced by the geometry of the clusters and increases with the number of clusters. The two terms of SD have different ranges, thus a weighting factor is needed in order to incorporate both terms in a balanced way. The number of clusters, n_c, that minimizes the above index is an optimal value. Also, the influence of the maximum number of clusters c_{max}, which is related to the weighting factor, on the selection of the optimal clustering scheme is discussed in (Halkidi et al., 2000). It is shown that SD proposes an optimal number of clusters almost irrespectively of the c_{max} value.
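A compact sketch of the SD index, following Eqs. 31.24, 31.27 and 31.28 and assuming Euclidean norms, integer labels 0..n_c−1 and a precomputed weighting factor a = Dis(c_max), is given below.

```python
import numpy as np

def sd_index(X, labels, centers, a):
    """SD validity index: SD(n_c) = a * Scat(n_c) + Dis(n_c)."""
    n_c = len(centers)
    sigma_S = np.var(X, axis=0)                                    # variance of the whole data set
    sigma_v = np.array([np.var(X[labels == i], axis=0) for i in range(n_c)])
    # average scattering for clusters (Eq. 31.24)
    scat = np.mean(np.linalg.norm(sigma_v, axis=1)) / np.linalg.norm(sigma_S)
    # total separation between clusters (Eq. 31.27)
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    d_max, d_min = d.max(), d[d > 0].min()
    dis = (d_max / d_min) * np.sum(1.0 / d.sum(axis=1))
    return a * scat + dis
```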
The S_Dbw validity index

A more recent validity index is proposed in (Halkidi and Vazirgiannis, 2001a). It exploits the inherent features of clusters to assess the validity of results and select the optimal partitioning for the data under concern. Similarly to the SD index, its definition is based on the compactness and separation of clusters. The average scattering for clusters is defined as above, in Eq. 31.24.

Inter-cluster Density (ID). It evaluates the average density in the region among clusters in relation to the density of the clusters themselves. The goal is for the density among clusters to be significantly low in comparison with the density within the considered clusters. Then, considering a partitioning of the data set into more than one cluster (i.e., n_c > 1), the inter-cluster density is defined as follows:
Dens\_bw(n_c) = \frac{1}{n_c\,(n_c - 1)}\sum_{i=1}^{n_c}\sum_{j=1,\, j \neq i}^{n_c}\frac{density(u_{ij})}{\max\{density(v_i),\, density(v_j)\}}

where v_i, v_j are the centers of clusters c_i, c_j, respectively, and u_{ij} is the middle point of the line segment defined by the clusters' centers v_i, v_j.
The term density(u) is defined by the following equation:

density(u) = \sum_{l=1}^{n_{ij}} f(x_l, u)

where x_l is a point of the data set S and n_{ij} is the number of points (tuples) that belong to the clusters c_i and c_j, i.e., x_l \in c_i \cup c_j \subseteq S. It represents the number of points in the neighborhood of u. In our work, the neighborhood of a data point, u, is defined to be a hyper-sphere with center u and radius the average standard deviation of the clusters, stdev. The standard deviation of the clusters is given by the following equation:
stdev = \frac{1}{n_c}\sqrt{\sum_{i=1}^{n_c}\|\sigma(v_i)\|}

where n_c is the number of clusters and σ(v_i) is the variance of cluster c_i.
More specifically, the function f(x, u) is defined as:

f(x, u) = \begin{cases} 0 & \text{if } d(x, u) > stdev \\ 1 & \text{otherwise} \end{cases}
It is obvious that a point belongs to the neighborhood of u if its distance from u is smaller than the average standard deviation of the clusters. Here we assume that the data has been scaled to bring all dimensions into comparable ranges, since each dimension is equally important during the process of finding the neighbors of a multidimensional point (Berry and Linoff, 1996).

Then the validity index S_Dbw is defined as:

S\_Dbw(n_c) = Scat(n_c) + Dens\_bw(n_c)
The above definitions refer to the case where the data set presents clustering tendency, i.e., it can be partitioned into at least two clusters. The index is not defined for n_c = 1.

The definition of S_Dbw indicates that both criteria of 'good' clustering (i.e., compactness and separation) are properly combined, enabling reliable evaluation of clustering results. Also, the density variations among clusters are taken into account to achieve more reliable results. The number of clusters, n_c, that minimizes the above index is an optimal value, indicating the number of clusters present in the data set.
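The definitions above translate into the following sketch, which assumes Euclidean distances, integer labels 0..n_c−1 and given cluster centers; it is meant to illustrate the structure of the index (Scat plus inter-cluster density), not to reproduce the authors' implementation.

```python
import numpy as np

def s_dbw(X, labels, centers):
    """S_Dbw = Scat(n_c) + Dens_bw(n_c)."""
    n_c = len(centers)
    sigma_S = np.var(X, axis=0)
    sigma_v = np.array([np.var(X[labels == i], axis=0) for i in range(n_c)])
    scat = np.mean(np.linalg.norm(sigma_v, axis=1)) / np.linalg.norm(sigma_S)
    # average standard deviation of clusters: radius of the neighborhood hyper-sphere
    stdev = np.sqrt(np.sum(np.linalg.norm(sigma_v, axis=1))) / n_c

    def density(u, i, j):
        # number of points of c_i and c_j whose distance to u does not exceed stdev
        pts = X[(labels == i) | (labels == j)]
        return np.sum(np.linalg.norm(pts - u, axis=1) <= stdev)

    dens_bw = 0.0
    for i in range(n_c):
        for j in range(n_c):
            if i == j:
                continue
            u_ij = (centers[i] + centers[j]) / 2.0        # midpoint of the two centers
            denom = max(density(centers[i], i, j), density(centers[j], i, j))
            dens_bw += density(u_ij, i, j) / denom        # assumes denom > 0
    dens_bw /= n_c * (n_c - 1)
    return scat + dens_bw
```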
Moreover, an approach based on the S_Dbw index is proposed in (Halkidi and Vazirgiannis, 2001b). It evaluates the clustering schemes of a data set as defined by different clustering algorithms and selects the algorithm resulting in the optimal partitioning of the data.

In general terms, S_Dbw enables the selection of both the algorithm and the parameter values for which the optimal partitioning of a data set is defined (assuming that the data set presents clustering tendency). However, the index cannot properly handle arbitrarily shaped clusters; the same applies to all the aforementioned indices.

There are a number of applications, such as medical or spatial data applications, where it is important to identify non-convex clusters. An approach to handling arbitrarily shaped clusters in the cluster validity process is presented in (Halkidi and Vazirgiannis, 2002).
31.4.5 Fuzzy Clustering
In this section, we present validity indices suitable for fuzzy clustering. The objective is to seek clustering schemes where most of the vectors of the data set exhibit a high degree of membership in one cluster. Fuzzy clustering is defined by a matrix U = [u_{ij}], where u_{ij} denotes the degree of membership of the vector x_i in cluster j. Also, a set of cluster representatives is defined. Similarly to the crisp clustering case, a validity index, q, is defined and we search for the minimum or maximum in the plot of q versus n_c. Also, where q exhibits a trend with respect to the number of clusters, we seek a significant knee of decrease (or increase) in the plot of q.
Below, two categories of fuzzy validity indices are discussed. The first category uses only the membership values, u_{ij}, of a fuzzy partition of the data. The second involves both the U matrix and the data set itself.
Validity Indices involving only the membership values

Bezdek proposed in (Bezdek et al., 1984) the partition coefficient, which is defined as
PC = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{n_c} u_{ij}^2
The PC index values range in [1/n_c, 1], where n_c is the number of clusters. The closer the index is to unity, the 'crisper' the clustering is. In case all membership values of the fuzzy partition are equal, that is, u_{ij} = 1/n_c, the PC obtains its lowest value. Thus, the closer the value of PC is to 1/n_c, the fuzzier the clustering is. Furthermore, a value close to 1/n_c indicates that there is no clustering tendency in the considered data set or that the clustering algorithm failed to reveal it.
The partition entropy coefficient is another index of this category. It is defined as follows:

PE = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{n_c} u_{ij}\,\log_a\left(u_{ij}\right)
where a is the base of the logarithm. The index is computed for values of n_c greater than 1 and its values range in [0, log_a n_c]. The closer the value of PE to 0, the 'crisper' the clustering is. As in the previous case, index values close to the upper bound (i.e., log_a n_c) indicate the absence of any clustering structure in the data set or the inability of the algorithm to extract it.
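Both coefficients follow directly from the membership matrix U (N rows, n_c columns); a short sketch, with the zero-membership terms of PE handled explicitly, is given below.

```python
import numpy as np

def partition_coefficient(U):
    """PC = (1/N) * sum_i sum_j u_ij^2; ranges in [1/n_c, 1]."""
    return np.sum(U ** 2) / U.shape[0]

def partition_entropy(U, a=np.e):
    """PE = -(1/N) * sum_i sum_j u_ij * log_a(u_ij); ranges in [0, log_a(n_c)]."""
    safe = np.where(U > 0, U, 1.0)    # u*log(u) -> 0 as u -> 0, so zero entries contribute nothing
    return -np.sum(U * np.log(safe) / np.log(a)) / U.shape[0]
```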
The drawbacks of these indices are:
• their monotonic dependency on the number of clusters; thus, we seek significant knees of increase (for PC) or decrease (for PE) in the plots of the indices versus the number of clusters,
• their sensitivity to the fuzzifier, m; more specifically, as m → 1 the indices give the same values for all values of n_c, whereas when m → ∞ both PC and PE exhibit a significant knee at n_c = 2,
• the lack of direct connection to the geometry of the data (Dave, 1996), since they do not use the data itself.
Indices involving the membership values and the data set
The Xie-Beni index (Xie and Beni, 1991), XB, also called the compactness and separation validity function, is a representative index of this category. Consider a fuzzy partition of the data set X = {x_j; j = 1, ..., N}, with v_i (i = 1, ..., n_c) the centers of the clusters and u_{ij} the membership of the j-th data point to the i-th cluster. The fuzzy deviation of x_j from cluster i, d_{ij}, is defined as the distance between x_j and the center of cluster i, weighted by the fuzzy membership of data point j with respect to cluster i. It is given by the following equation:

d_{ij} = u_{ij}\,\|x_j - v_i\|
Also, for a cluster i, the sum of the squares of the fuzzy deviations of the data points in X, denoted σ_i, is called the variation of cluster i.

The term π_i = σ_i / n_i is called the compactness of cluster i. Since n_i is the number of points belonging to cluster i, π_i is the average variation in cluster i. Then the compactness of a partitioning of n_c clusters is defined as the average compactness of the defined clusters, given by the equation:

\pi = \frac{\sum_{i=1}^{n_c}\pi_i}{n_c}

Also, the separation of the fuzzy partitions is defined as the minimum distance between cluster centers, that is
d_{min} = \min_{i \neq j}\|v_i - v_j\| \qquad (31.37)

Then the XB index is defined as

XB = \frac{\pi}{N \cdot d_{min}} \qquad (31.38)

where N is the number of points in the data set.
It is clear that small values of XB are expected for compact and well-separated clusters. We note, however, that XB decreases monotonically when the number of clusters n_c gets very large and close to N. One way to eliminate this decreasing tendency of the index is to determine a starting point, c_{max}, of the monotonic behavior and to search for the minimum value of XB in the range [2, c_{max}]. Moreover, the values of the index XB depend on the fuzzifier value m: as m → ∞, XB → ∞.
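For reference, a sketch of the Xie-Beni index in its commonly used form, with squared Euclidean distances in both the compactness and the separation terms (the formulation of Xie and Beni, 1991, which differs only in normalization details from the π/(N · d_min) presentation above), is given below; U is an N × n_c membership matrix with rows indexed by points.

```python
import numpy as np

def xie_beni(X, U, V):
    """XB = sum_ij u_ij^2 ||x_j - v_i||^2 / (N * min_{i != j} ||v_i - v_j||^2)."""
    N = X.shape[0]
    # squared fuzzy deviations for every (point, cluster) pair
    sq_dev = (U ** 2) * np.linalg.norm(X[:, None, :] - V[None, :, :], axis=-1) ** 2
    compactness = sq_dev.sum() / N
    # minimum squared distance between distinct cluster centers
    center_d2 = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1) ** 2
    separation = center_d2[center_d2 > 0].min()
    return compactness / separation
```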
Another index of this category is the Fukuyama-Sugeno index, which is defined as
FS_m = \sum_{i=1}^{N}\sum_{j=1}^{n_c} u_{ij}^m\left(\|x_i - v_j\|_A^2 - \|v_j - v\|_A^2\right) \qquad (31.39)
where v is the mean vector of X and A is an l × l positive definite, symmetric matrix. When A = I, the above distance becomes the squared Euclidean distance. It is clear that for compact and well-separated clusters we expect small values of FS_m. The first term in the parentheses measures the compactness of the clusters, while the second one measures the distances of the cluster representatives.
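A sketch of FS_m with A = I (squared Euclidean distances), assuming a data matrix X (N × d), a membership matrix U (N × n_c, rows indexed by points), centers V (n_c × d) and fuzzifier m, is:

```python
import numpy as np

def fukuyama_sugeno(X, U, V, m=2.0):
    """FS_m = sum_ij u_ij^m * (||x_i - v_j||^2 - ||v_j - v_bar||^2) with A = I."""
    v_bar = X.mean(axis=0)                                                # mean vector of the data set
    comp = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=-1) ** 2    # compactness term, N x n_c
    sep = np.linalg.norm(V - v_bar, axis=1) ** 2                          # separation term, one value per cluster
    return np.sum((U ** m) * (comp - sep[None, :]))
```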
Some other fuzzy validity indices, based on the concepts of hyper-volume and density, are also proposed in (Gath and Geva, 1989).
31.4.6 Other Approaches for Cluster Validity
Another approach for finding the optimal number of clusters of a data set was proposed in (Smyth, 1996). It introduces a practical clustering algorithm based on Monte Carlo cross-validation. More specifically, the algorithm consists of M cross-validation runs over M chosen train/test partitions of a data set, D. For each partition u, the EM algorithm is used to define n_c clusters on the training data, while n_c is varied from 1 to c_{max}. Then, the log-likelihood L_u(D) is calculated for each model with n_c clusters. It is defined using the probability density function of the data as

L_k(D) = \sum_{i=1}^{N}\log f_k\left(x_i / \Phi_k\right)

where f_k is the probability density function for the data and \Phi_k denotes the parameters that have been estimated from the data. This procedure is repeated M times and the M cross-validated estimates are averaged for each value of n_c. Based on these estimates we may define the posterior probabilities for each value of the number of clusters n_c, p(n_c/D). If one of the p(n_c/D) is near 1, there is strong evidence that the particular number of clusters best fits the data set. The evaluation approach proposed in (Smyth, 1996) is based on density functions considered for the data set. Thus, it relies on concepts related to probabilistic models in order to estimate the number of clusters that better fits a data set, and it does not use concepts directly related to the data (i.e., inter-cluster and intra-cluster distances).
References

Athanasopoulos, D. (1991). Probabilistic Theory. Stamoulis, Piraeus.
Berry, M. and Linoff, G. (1996). Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley and Sons, Inc.
Bezdek, J., Ehrlich, R., and Full, W. (1984). FCM: Fuzzy c-means algorithm. Computers and Geosciences.
Brin, S., Motwani, R., Ullman, J., and Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. SIGMOD Record (ACM Special Interest Group on Management of Data), 26(2).
Dave, R. (1996). Validating fuzzy partitions obtained through c-shells clustering. Pattern Recognition Letters, 10:613–623.
Davies, D. and Bouldin, D. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2).
Dietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7).
Dong, G. and Li, J. (1998). Research and development in knowledge discovery and data mining. In Proceedings of the 2nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
Dunn, J. (1974). Well separated clusters and optimal fuzzy partitions. Cybernetics, 4:95–104.
Fayyad, M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. AAAI Press.
Gago, P. and Bentos, C. (1998). A metric for selection of the most promising rules. In Proceedings of the 2nd European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'98).
Gath, I. and Geva, A. (1989). Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7).
Gray, B. and Orlowska, M. (1998). CCAIIA: Clustering categorical attributes into interesting association rules. In Proceedings of the 2nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD '98).
Guha, S., Rastogi, R., and Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the IEEE Conference on Data Engineering.
Halkidi, M. and Vazirgiannis, M. (2001a). Clustering validity assessment: Finding the optimal partitioning of a data set. In Proceedings of ICDM, California, USA.
Halkidi, M. and Vazirgiannis, M. (2001b). A data set oriented approach for clustering algorithm selection. In Proceedings of PKDD, Freiburg, Germany.
Halkidi, M. and Vazirgiannis, M. (2002). Clustering validity assessment: Finding the optimal partitioning of a data set. Poster paper in the Proceedings of the SETN Conference, April, Thessaloniki, Greece.
Halkidi, M., Vazirgiannis, M., and Batistakis, I. (2000). Quality scheme assessment in the clustering process. In Proceedings of PKDD, Lyon, France.
Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
Hilderman, R. and Hamilton, H. (1999). Knowledge discovery and interestingness measures: A survey. Technical Report CS 99-04, Department of Computer Science, University of Regina.
Jain, A., Murty, M., and Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3).
Liu, H., Hsu, W., and Chen, S. (1997). Using general impressions to analyze discovered classification rules. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD'97), Newport Beach, California.
MAGOpus V1.1 software, G. I. Webb and assoc. RuleQuest Research Pty Ltd, 30 Athena Avenue, St Ives NSW 2075, Australia.
Milligan, G. and Cooper, M. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(3):159–179.
Pal, N. and Biswas, J. (1997). Cluster validation using graph theoretic concepts. Pattern Recognition, 30(6).
Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. (1999). Discovering frequent closed itemsets for association rules. In Proceedings of the 7th International Conference on Database Theory.
Pei, J., Han, J., and Mao, R. (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proceedings of the ACM-SIGMOD International Workshop on Data Mining and Knowledge Discovery (DMKD'00).
Piatetsky-Shapiro, G. (1991). Discovery, analysis and presentation of strong rules. In Knowledge Discovery in Databases, AAAI/MIT Press.
Roberto, J., Bayardo, J., Agrawal, R., and Gunopulos, D. (1999). Constraint-based rule mining in large, dense databases. In Proceedings of the 15th ICDE.
Rokach, L., Averbuch, M., and Maimon, O. (2004). Information retrieval system for medical narrative reports. Lecture Notes in Artificial Intelligence, 3055, pp. 217–228, Springer-Verlag.
Sharma, S. (1996). Applied Multivariate Techniques. John Wiley and Sons.
Smyth, P. (1996). Clustering using Monte Carlo cross-validation. In Proceedings of the KDD Conference.
Smyth, P. and Goodman, R. (1991). Rule induction using information theory. In Knowledge Discovery in Databases, AAAI/MIT Press.
Snedecor, G. and Cochran, W. (1989). Statistical Methods. Iowa State University Press, Ames, IA, 8th Edition.
Theodoridis, S. and Koutroubas, K. (1999). Pattern Recognition. Academic Press.
Xie, X. and Beni, G. (1991). A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4).
Zaki, M. and Hsiao, C. (2002). CHARM: An efficient algorithm for closed itemset mining. In Proceedings of the 2nd SIAM International Conference on Data Mining.
Zhong, N., Yao, Y., and Ohsuga, S. (1999). Peculiarity-oriented multi-database mining. In Proceedings of the 3rd European Conference on the Principles of Data Mining and Knowledge Discovery.