
Data Mining: Concepts and Techniques, Part 7



Experiments on PROCLUS show that the method is efficient and scalable at finding high-dimensional clusters. Unlike CLIQUE, which outputs many overlapped clusters, PROCLUS finds nonoverlapped partitions of points. The discovered clusters may help better understand the high-dimensional data and facilitate other subsequence analyses.

This section looks at how methods of frequent pattern mining can be applied to clustering, resulting in frequent pattern–based cluster analysis. Frequent pattern mining, as the name implies, searches for patterns (such as sets of items or objects) that occur frequently in large data sets. Frequent pattern mining can lead to the discovery of interesting associations and correlations among data objects. Methods for frequent pattern mining were introduced in Chapter 5. The idea behind frequent pattern–based cluster analysis is that the frequent patterns discovered may also indicate clusters. Frequent pattern–based cluster analysis is well suited to high-dimensional data. It can be viewed as an extension of the dimension-growth subspace clustering approach. However, the boundaries of different dimensions are not obvious, since here they are represented by sets of frequent itemsets. That is, rather than growing the clusters dimension by dimension, we grow sets of frequent itemsets, which eventually lead to cluster descriptions. Typical examples of frequent pattern–based cluster analysis include the clustering of text documents that contain thousands of distinct keywords, and the analysis of microarray data that contain tens of thousands of measured values or "features." In this section, we examine two forms of frequent pattern–based cluster analysis: frequent term–based text clustering and clustering by pattern similarity in microarray data analysis.

In frequent term–based text clustering, text documents are clustered based on the frequent terms they contain. Using the vocabulary of text document analysis, a term is any sequence of characters separated from other terms by a delimiter. A term can be made up of a single word or several words. In general, we first remove nontext information (such as HTML tags and punctuation) and stop words. Terms are then extracted. A stemming algorithm is then applied to reduce each term to its basic stem. In this way, each document can be represented as a set of terms. Each set is typically large. Collectively, a large set of documents will contain a very large set of distinct terms. If we treat each term as a dimension, the dimension space will be of very high dimensionality! This poses great challenges for document cluster analysis. The dimension space can be referred to as term vector space, where each document is represented by a term vector.

This difficulty can be overcome by frequent term–based analysis. That is, by using an efficient frequent itemset mining algorithm introduced in Section 5.2, we can mine a set of frequent terms from the set of text documents. Then, instead of clustering on the high-dimensional term vector space, we need only consider the low-dimensional frequent term sets as "cluster candidates." Notice that a frequent term set is not a cluster but rather the description of a cluster. The corresponding cluster consists of the set of documents containing all of the terms of the frequent term set. A well-selected subset of the set of all frequent term sets can be considered as a clustering.


"How, then, can we select a good subset of the set of all frequent term sets?" This step is critical because such a selection will determine the quality of the resulting clustering. Let F_i be a set of frequent term sets and cov(F_i) be the set of documents covered by F_i. That is, cov(F_i) refers to the documents that contain all of the terms in F_i. The general principle for finding a well-selected subset, F_1, ..., F_k, of the set of all frequent term sets is to ensure that (1) ∪_{i=1}^{k} cov(F_i) = D (i.e., the selected subset should cover all of the documents to be clustered); and (2) the overlap between any two partitions, F_i and F_j (for i ≠ j), should be minimized. An overlap measure based on entropy⁹ is used to assess cluster overlap by measuring the distribution of the documents supporting some cluster over the remaining cluster candidates.
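As a concrete illustration of this selection step, here is a minimal Python sketch. It assumes the frequent term sets have already been mined by a frequent itemset algorithm, computes cov(F_i) for each candidate, and greedily picks candidates until every document is covered; a plain document-overlap count stands in for the entropy-based overlap measure, and all names are illustrative rather than taken from the text.

```python
# Minimal sketch: selecting a subset of frequent term sets as cluster candidates.
# Assumes frequent term sets were already mined (e.g., by an Apriori-style miner).

def cover(term_set, docs):
    """cov(F_i): indices of documents containing every term of term_set."""
    return {i for i, doc in enumerate(docs) if term_set <= doc}

def select_clustering(candidates, docs):
    """Greedy selection: cover all documents while keeping overlap low.
    A plain overlap count stands in for the entropy-based measure."""
    covers = [cover(f, docs) for f in candidates]
    covered, chosen = set(), []
    while covered != set(range(len(docs))):
        # Prefer candidates that add many new documents and overlap little.
        best = max(range(len(candidates)),
                   key=lambda i: (len(covers[i] - covered), -len(covers[i] & covered)))
        if not covers[best] - covered:
            break  # remaining candidates add nothing new
        chosen.append(candidates[best])
        covered |= covers[best]
    return chosen

docs = [{"data", "mining", "cluster"}, {"data", "mining", "pattern"},
        {"gene", "expression", "cluster"}, {"gene", "expression", "microarray"}]
candidates = [frozenset({"data", "mining"}), frozenset({"gene", "expression"}),
              frozenset({"cluster"})]
print(select_clustering(candidates, docs))
```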

An advantage of frequent term–based text clustering is that it automatically generates a description for the generated clusters in terms of their frequent term sets. Traditional clustering methods produce only clusters—a description for the generated clusters requires an additional processing step.

Another interesting approach for clustering high-dimensional data is based on pattern similarity among the objects on a subset of dimensions. Here we introduce the pCluster method, which performs clustering by pattern similarity in microarray data analysis. In DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli or conditions. Under the pCluster model, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. This is illustrated in Example 7.15. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks.

Example 7.15 Clustering by pattern similarity in DNA microarray analysis. Figure 7.22 shows a fragment of microarray data containing only three genes (taken as "objects" here) and ten attributes (columns a to j). No patterns among the three objects are visibly explicit. However, if two subsets of attributes, {b, c, h, j, e} and {f, d, a, g, i}, are selected and plotted as in Figure 7.23(a) and (b) respectively, it is easy to see that they form some interesting patterns: Figure 7.23(a) forms a shift pattern, where the three curves are similar to each other with respect to a shift operation along the y-axis; while Figure 7.23(b) forms a scaling pattern, where the three curves are similar to each other with respect to a scaling operation along the y-axis.

Let us first examine how to discover shift patterns. In DNA microarray data, each row corresponds to a gene and each column or attribute represents a condition under which the gene is developed. The usual Euclidean distance measure cannot capture pattern similarity, since the y values of different curves can be quite far apart. Alternatively, we could first transform the data to derive new attributes, such as A_ij = v_i − v_j (where v_i and v_j are object values for attributes A_i and A_j, respectively), and then cluster on the derived attributes.

⁹ Entropy is a measure from information theory. It was introduced in Chapter 2 regarding data discretization and is also described in Chapter 6 regarding decision tree construction.


Figure 7.22 Raw data from a fragment of microarray data containing only 3 objects and 10 attributes.

Figure 7.23 Objects in Figure 7.22 form (a) a shift pattern in subspace {b, c, h, j, e}, and (b) a scaling pattern in subspace {f, d, a, g, i}.

However, this would introduce d(d − 1)/2 dimensions for a d-dimensional data set, which is undesirable for a nontrivial value of d. A biclustering method was proposed in an attempt to overcome these difficulties. It introduces a new measure, the mean squared residue score, which measures the coherence of the genes and conditions in a submatrix of a DNA array.


Let I ⊂ X and J ⊂ Y be subsets of genes, X, and conditions, Y, respectively. The pair, (I, J), specifies a submatrix, A_IJ, with the mean squared residue score defined as

H(I, J) = (1 / (|I| |J|)) Σ_{i∈I, j∈J} (d_ij − d_iJ − d_Ij + d_IJ)²,

where d_iJ and d_Ij are the row and column means, respectively, and d_IJ is the mean of the subcluster matrix, A_IJ. A submatrix, A_IJ, is called a δ-bicluster if H(I, J) ≤ δ for some δ > 0. A randomized algorithm is designed to find such clusters in a DNA array. There are two major limitations of this method. First, a submatrix of a δ-bicluster is not necessarily a δ-bicluster, which makes it difficult to design an efficient pattern growth–based algorithm. Second, because of the averaging effect, a δ-bicluster may contain some undesirable outliers yet still satisfy a rather small δ threshold.
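A small numeric sketch of the mean squared residue score and the δ-bicluster test, assuming the expression data sit in a NumPy array with genes as rows and conditions as columns (function names are illustrative):

```python
import numpy as np

def mean_squared_residue(A, I, J):
    """H(I, J) for the submatrix A[I, J] of a gene (rows) x condition (columns) array."""
    sub = A[np.ix_(I, J)]
    row_means = sub.mean(axis=1, keepdims=True)   # d_iJ
    col_means = sub.mean(axis=0, keepdims=True)   # d_Ij
    overall = sub.mean()                          # d_IJ
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())

def is_delta_bicluster(A, I, J, delta):
    return mean_squared_residue(A, I, J) <= delta

A = np.array([[10., 20., 30.],
              [15., 25., 35.],
              [40., 10., 50.]])
print(mean_squared_residue(A, [0, 1], [0, 1, 2]))   # 0.0: rows 0 and 1 are perfectly coherent
print(is_delta_bicluster(A, [0, 1, 2], [0, 1, 2], delta=5.0))
```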

To overcome the problems of the biclustering method, a pCluster model was introduced as follows. Given objects x, y ∈ O and attributes a, b ∈ T, pScore is defined on the 2 × 2 matrix of their values by

pScore( [[d_xa, d_xb], [d_ya, d_yb]] ) = |(d_xa − d_xb) − (d_ya − d_yb)|,   (7.41)

where d_xa is the value of object (or gene) x for attribute (or condition) a, and so on. A pair, (O, T), forms a δ-pCluster if, for any 2 × 2 matrix, X, in (O, T), we have pScore(X) ≤ δ for some δ > 0. Intuitively, this means that the change of values on the two attributes between the two objects is confined by δ for every pair of objects in O and every pair of attributes in T.

It is easy to see that δ-pCluster has the downward closure property; that is, if (O, T) forms a δ-pCluster, then any of its submatrices is also a δ-pCluster. Moreover, because a pCluster requires that every two objects and every two attributes conform with the inequality, the clusters modeled by the pCluster method are more homogeneous than those modeled by the bicluster method.
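The pScore condition can be checked directly by enumerating every pair of objects and every pair of attributes. The following brute-force sketch (illustrative names, not an efficient miner) verifies whether a given (O, T) pair forms a δ-pCluster:

```python
from itertools import combinations

def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(data, objects, attrs, delta):
    """True if (objects, attrs) forms a delta-pCluster in data[obj][attr]."""
    for x, y in combinations(objects, 2):
        for a, b in combinations(attrs, 2):
            if p_score(data[x][a], data[x][b], data[y][a], data[y][b]) > delta:
                return False
    return True

# Two genes whose expression shifts by a constant across three conditions.
data = {"g1": {"a": 1.0, "b": 4.0, "c": 2.0},
        "g2": {"a": 6.0, "b": 9.0, "c": 7.0}}
print(is_delta_pcluster(data, ["g1", "g2"], ["a", "b", "c"], delta=0.1))  # True
```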

In frequent itemset mining, itemsets are considered frequent if they satisfy a minimum support threshold, which reflects their frequency of occurrence. Based on the definition of pCluster, the problem of mining pClusters becomes one of mining frequent patterns in which each pair of objects and their corresponding features must satisfy the specified δ threshold. A frequent pattern–growth method can easily be extended to mine such patterns efficiently.


Now, let's look into how to discover scaling patterns. Notice that the original pScore definition, though defined for shift patterns in Equation (7.41), can easily be extended for scaling by introducing a new inequality,

(d_xa / d_ya) / (d_xb / d_yb) ≤ δ.   (7.42)

The pCluster model, though developed in the study of microarray data cluster analysis, can be applied to many other applications that require finding similar or coherent patterns involving a subset of numerical dimensions in large, high-dimensional data sets.

7.10 Constraint-Based Cluster Analysis

In the above discussion, we assume that cluster analysis is an automated, algorithmic computational process, based on the evaluation of similarity or distance functions among a set of objects to be clustered, with little user guidance or interaction. However, users often have a clear view of the application requirements, which they would ideally like to use to guide the clustering process and influence the clustering results. Thus, in many applications, it is desirable to have the clustering process take user preferences and constraints into consideration. Examples of such information include the expected number of clusters, the minimal or maximal cluster size, weights for different objects or dimensions, and other desirable characteristics of the resulting clusters. Moreover, when a clustering task involves a rather high-dimensional space, it is very difficult to generate meaningful clusters by relying solely on the clustering parameters. User input regarding important dimensions or the desired results will serve as crucial hints or meaningful constraints for effective clustering. In general, we contend that knowledge discovery would be most effective if one could develop an environment for human-centered, exploratory mining of data, that is, where the human user is allowed to play a key role in the process. Foremost, a user should be allowed to specify a focus—directing the mining algorithm toward the kind of "knowledge" that the user is interested in finding. Clearly, user-guided mining will lead to more desirable results and capture the application semantics.

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches. Here are a few categories of constraints.

1. Constraints on individual objects: We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing (e.g., performing selection using an SQL query), after which the problem reduces to an instance of unconstrained clustering.

2. Constraints on the selection of clustering parameters: A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or ε (the radius) and MinPts (the minimum number of points) in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-based clustering.

3. Constraints on distance or similarity functions: We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process. This can be seen in the following example.

Example 7.16 Clustering with obstacle objects. A city may have rivers, bridges, highways, lakes, and mountains. We do not want to swim across a river to reach an automated banking machine. Such obstacle objects and their effects can be captured by redefining the distance functions among objects. Clustering with obstacle objects using a partitioning approach requires that the distance between each object and its corresponding cluster center be reevaluated at each iteration whenever the cluster center is changed. However, such reevaluation is quite expensive with the existence of obstacles. In this case, efficient new methods should be developed for clustering with obstacle objects in large data sets.

4. User-specified constraints on the properties of individual clusters: A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process. Such constraint-based clustering arises naturally in practice, as in Example 7.17.

Example 7.17 User-constrained cluster analysis. Suppose a package delivery company would like to determine the locations for k service stations in a city. The company has a database of customers that registers the customers' names, locations, length of time since the customers began using the company's services, and average monthly charge. We may formulate this location selection problem as an instance of unconstrained clustering using a distance function computed based on customer location. However, a smarter approach is to partition the customers into two classes: high-value customers (who need frequent, regular service) and ordinary customers (who require occasional service). In order to save costs and provide good service, the manager adds the following constraints: (1) each station should serve at least 100 high-value customers; and (2) each station should serve at least 5,000 ordinary customers. Constraint-based clustering will take such constraints into consideration during the clustering process.

5. Semi-supervised clustering based on "partial" supervision: The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different cluster). Such a constrained clustering process is called semi-supervised clustering.

In this section, we examine how efficient constraint-based clustering methods can be developed for large data sets. Since cases 1 and 2 above are trivial, we focus on cases 3 to 5 as typical forms of constraint-based cluster analysis.

Example 7.16 introduced the problem of clustering with obstacle objects regarding the placement of automated banking machines. The machines should be easily accessible to the bank's customers. This means that during clustering, we must take obstacle objects into consideration, such as rivers, highways, and mountains. Obstacles introduce constraints on the distance function. The straight-line distance between two points is meaningless if there is an obstacle in the way. As pointed out in Example 7.16, we do not want to have to swim across a river to get to a banking machine!

"How can we approach the problem of clustering with obstacles?" A partitioning clustering method is preferable because it minimizes the distance between objects and their cluster centers. If we choose the k-means method, a cluster center may not be accessible given the presence of obstacles. For example, the cluster mean could turn out to be in the middle of a lake. On the other hand, the k-medoids method chooses an object within the cluster as a center and thus guarantees that such a problem cannot occur. Recall that every time a new medoid is selected, the distance between each object and its newly selected cluster center has to be recomputed. Because there could be obstacles between two objects, the distance between two objects may have to be derived by geometric computations (e.g., involving triangulation). The computational cost can get very high if a large number of objects and obstacles are involved.

The clustering with obstacles problem can be represented using a graphical notation.

First, a point, p, is visible from another point, q, in the region, R, if the straight line joining p and q does not intersect any obstacles. A visibility graph is the graph, VG = (V, E), such that each vertex of the obstacles has a corresponding node in V and two nodes, v1 and v2, in V are joined by an edge in E if and only if the corresponding vertices they represent are visible to each other. Let VG′ = (V′, E′) be a visibility graph created from VG by adding two additional points, p and q, in V′. E′ contains an edge joining two points in V′ if the two points are mutually visible. The shortest path between two points, p and q, will be a subpath of VG′, as shown in Figure 7.24(a). We see that it begins with an edge from p to either v1, v2, or v3, goes through some path in VG, and then ends with an edge from either v4 or v5 to q.
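To make the shortest obstructed path concrete, here is a minimal Dijkstra sketch over such a visibility graph. It assumes VG′ (obstacle vertices plus the two query points, with edges between mutually visible points) has already been built; the visibility test itself, which needs segment–obstacle intersection checks, is omitted, and the names are illustrative.

```python
import heapq
from math import dist

def shortest_obstructed_distance(edges, p, q):
    """Dijkstra over a visibility graph.
    edges: dict mapping a point (x, y) to the set of points visible from it."""
    pq = [(0.0, p)]
    best = {p: 0.0}
    while pq:
        d, u = heapq.heappop(pq)
        if u == q:
            return d
        if d > best.get(u, float("inf")):
            continue  # stale entry
        for v in edges[u]:
            nd = d + dist(u, v)
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

# Tiny example: p and q on opposite sides of an obstacle with vertices v1, v2.
p, q, v1, v2 = (0.0, 0.0), (4.0, 0.0), (2.0, 1.0), (2.0, -1.0)
edges = {p: {v1, v2}, q: {v1, v2}, v1: {p, q}, v2: {p, q}}
print(shortest_obstructed_distance(edges, p, q))  # about 4.47, going around the obstacle
```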

To reduce the cost of distance computation between any two pairs of objects or points, several preprocessing and optimization techniques can be used. One method groups points that are close together into microclusters. This can be done by first triangulating the region R into triangles, and then grouping nearby points in the same triangle into microclusters, using a method similar to BIRCH or DBSCAN, as shown in Figure 7.24(b). By processing microclusters rather than individual points, the overall computation is reduced. After that, precomputation can be performed to build two kinds of join indices based on the computation of the shortest paths: (1) VV indices, for any pair of obstacle vertices, and (2) MV indices, for any pair of microcluster and obstacle vertex. Use of the indices helps further optimize the overall performance.

With such precomputation and optimization, the distance between any two points (at the granularity level of microcluster) can be computed efficiently. Thus, the clustering process can be performed in a manner similar to a typical efficient k-medoids algorithm, such as CLARANS, and achieve good clustering quality for large data sets. Given a large set of points, Figure 7.25(a) shows the result of clustering a large set of points without considering obstacles, whereas Figure 7.25(b) shows the result with consideration of obstacles. The latter represents rather different but more desirable clusters. For example, if we carefully compare the upper left-hand corner of the two graphs, we see that Figure 7.25(a) has a cluster center on an obstacle (making the center inaccessible), whereas all cluster centers in Figure 7.25(b) are accessible. A similar situation occurs in the bottom right-hand corner of the graphs.

Figure 7.24 Clustering with obstacle objects (o1 and o2): (a) a visibility graph, and (b) triangulation of regions with microclusters. From [THH01].


Figure 7.25 Clustering results obtained without and with consideration of obstacles (where rivers and inaccessible highways or city blocks are represented by polygons): (a) clustering without considering obstacles, and (b) clustering with obstacles.

Let's examine the problem of relocating package delivery centers, as illustrated in Example 7.17. Specifically, a package delivery company with n customers would like to determine locations for k service stations so as to minimize the traveling distance between customers and service stations. The company's customers are regarded as either high-value customers (requiring frequent, regular services) or ordinary customers (requiring occasional services). The manager has stipulated two constraints: each station should serve (1) at least 100 high-value customers and (2) at least 5,000 ordinary customers.

This can be considered as a constrained optimization problem. We could consider using a mathematical programming approach to handle it. However, such a solution is difficult to scale to large data sets. To cluster n customers into k clusters, a mathematical programming approach will involve at least k × n variables. As n can be as large as a few million, we could end up having to solve a few million simultaneous equations—a very expensive feat. A more efficient approach is proposed that explores the idea of microclustering, as illustrated below.

The general idea of clustering a large data set into k clusters satisfying user-specified constraints goes as follows. First, we can find an initial "solution" by partitioning the data set into k groups, satisfying the user-specified constraints, such as the two constraints in our example. We then iteratively refine the solution by moving objects from one cluster to another, trying to satisfy the constraints. For example, we can move a set of m customers from cluster C_i to C_j if C_i has at least m surplus customers (under the specified constraints), or if the result of moving customers into C_i from some other clusters (including from C_j) would result in such a surplus.


The movement is desirable if the total sum of the distances of the objects to their corresponding cluster centers is reduced. Such movement can be directed by selecting promising points to be moved, such as objects that are currently assigned to some cluster, C_i, but that are actually closer to a representative (e.g., centroid) of some other cluster, C_j. We need to watch out for and handle deadlock situations (where a constraint is impossible to satisfy), in which case a deadlock resolution strategy can be employed.

To increase the clustering efficiency, data can first be preprocessed using the microclustering idea to form microclusters (groups of points that are close together), thereby avoiding the processing of all of the points individually. Object movement, deadlock detection, and constraint satisfaction can be tested at the microcluster level, which reduces the number of points to be computed. Occasionally, such microclusters may need to be broken up in order to resolve deadlocks under the constraints. This methodology ensures that effective clustering can be performed on large data sets under the user-specified constraints with good efficiency and scalability.
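A minimal sketch of the constraint bookkeeping behind such relocation, assuming each cluster (or microcluster group) keeps counts of high-value and ordinary customers; the surplus test below is an illustration, not the published algorithm. A full implementation would additionally require that a move reduce the total distance of objects to their cluster representatives, as described above.

```python
# Illustrative constraint check for moving m customers of a given type out of cluster i
# under minimum-count constraints from Example 7.17.
MIN_HIGH, MIN_ORDINARY = 100, 5000

def surplus(counts, kind):
    """How many customers of this kind the cluster can give away and stay feasible."""
    minimum = MIN_HIGH if kind == "high" else MIN_ORDINARY
    return counts[kind] - minimum

def can_move(clusters, i, kind, m):
    """Moving m customers of `kind` out of cluster i keeps cluster i feasible."""
    return surplus(clusters[i], kind) >= m

clusters = [{"high": 130, "ordinary": 5200},
            {"high": 95,  "ordinary": 6100}]
# Cluster 1 is short of high-value customers; cluster 0 has a surplus of 30.
print(can_move(clusters, 0, "high", 5))   # True
print(can_move(clusters, 0, "high", 40))  # False
```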

In comparison with supervised learning, clustering lacks guidance from users or classifiers (such as class label information), and thus may not generate highly desirable clusters. The quality of unsupervised clustering can be significantly improved using some weak form of supervision, for example, in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different clusters). Such a clustering process based on user feedback or guidance constraints is called semi-supervised clustering.

Methods for semi-supervised clustering can be categorized into two classes: constraint-based semi-supervised clustering and distance-based semi-supervised clustering. Constraint-based semi-supervised clustering relies on user-provided labels or constraints to guide the algorithm toward a more appropriate data partitioning. This includes modifying the objective function based on constraints, or initializing and constraining the clustering process based on the labeled objects. Distance-based semi-supervised clustering employs an adaptive distance measure that is trained to satisfy the labels or constraints in the supervised data. Several different adaptive distance measures have been used, such as string-edit distance trained using Expectation-Maximization (EM), and Euclidean distance modified by a shortest distance algorithm.
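Constraint-based semi-supervised clustering can be illustrated with a constrained assignment step in the style of COP-k-means: each object is assigned to the nearest center that does not violate any must-link or cannot-link constraint. This is a simplified sketch with illustrative names, not one of the specific methods cited above.

```python
import numpy as np

def violates(i, c, assign, must_link, cannot_link):
    """Would assigning object i to cluster c break a pairwise constraint?"""
    for a, b in must_link:
        j = b if a == i else a if b == i else None
        if j is not None and assign[j] != -1 and assign[j] != c:
            return True
    for a, b in cannot_link:
        j = b if a == i else a if b == i else None
        if j is not None and assign[j] == c:
            return True
    return False

def constrained_assign(X, centers, must_link, cannot_link):
    assign = np.full(len(X), -1)
    for i, x in enumerate(X):
        order = np.argsort([np.linalg.norm(x - m) for m in centers])
        for c in order:                       # nearest feasible center first
            if not violates(i, c, assign, must_link, cannot_link):
                assign[i] = c
                break
    return assign

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(constrained_assign(X, centers, must_link=[(0, 1)], cannot_link=[(1, 2)]))
```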

An interesting clustering method, called CLTree (CLustering based on decision TREEs), integrates unsupervised clustering with the idea of supervised classification. It is an example of constraint-based semi-supervised clustering. It transforms a clustering task into a classification task by viewing the set of points to be clustered as belonging to one class, labeled as "Y", and adds a set of relatively uniformly distributed "nonexistence points" with a different class label, "N". The problem of partitioning the data space into data (dense) regions and empty (sparse) regions can then be transformed into a classification problem. For example, Figure 7.26(a) contains a set of data points to be clustered. These points can be viewed as a set of "Y" points. Figure 7.26(b) shows the addition of a set of uniformly distributed "N" points, represented by the "◦" points.


Figure 7.26 Clustering through decision tree construction: (a) the set of data points to be clustered, viewed as a set of "Y" points, (b) the addition of a set of uniformly distributed "N" points, represented by "◦", and (c) the clustering result with "Y" points only.

The original clustering problem is thus transformed into a classification problem, which works out a scheme that distinguishes "Y" and "N" points. A decision tree induction method can be applied¹⁰ to partition the two-dimensional space, as shown in Figure 7.26(c). Two clusters are identified, which are from the "Y" points only.

Adding a large number of "N" points to the original data may introduce unnecessary overhead in computation. Furthermore, it is unlikely that any points added would truly be uniformly distributed in a very high-dimensional space, as this would require an exponential number of points. To deal with this problem, we do not physically add any of the "N" points, but only assume their existence. This works because the decision tree method does not actually require the points. Instead, it only needs the number of "N" points at each decision tree node. This number can be computed when needed, without having to add points to the original data. Thus, CLTree can achieve the results in Figure 7.26(c) without actually adding any "N" points to the original data. Again, two clusters are identified.

The question then is how many (virtual) "N" points should be added in order to achieve good clustering results. The answer follows this simple rule: At the root node, the number of inherited "N" points is 0. At any current node, E, if the number of "N" points inherited from the parent node of E is less than the number of "Y" points in E, then the number of "N" points for E is increased to the number of "Y" points in E. (That is, we set the number of "N" points to be as big as the number of "Y" points.) Otherwise, the number of inherited "N" points is used in E. The basic idea is to use an equal number of "N" points to the number of "Y" points.
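The rule for the number of virtual "N" points translates directly into code; a minimal sketch (the node representation is illustrative):

```python
def n_points_for_node(inherited_n, y_count):
    """Number of virtual 'N' points used at a CLTree node.
    The root inherits 0; otherwise use the larger of the inherited count
    and the number of 'Y' points falling in the node."""
    return max(inherited_n, y_count)

# Root: 0 inherited, 25 'Y' points -> use 25 'N' points.
root_n = n_points_for_node(0, 25)
# A child holding 10 of the 'Y' points but inheriting 12 'N' points keeps 12.
child_n = n_points_for_node(12, 10)
print(root_n, child_n)   # 25 12
```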

Decision tree classification methods use a measure, typically based on information gain, to select the attribute test for a decision node (Section 6.3.2). The data are then split or partitioned according to the test or "cut." Unfortunately, with clustering, this can lead to the fragmentation of some clusters into scattered regions. To address this problem, methods were developed that use information gain, but allow the ability to look ahead.

¹⁰ Decision tree induction was described in Chapter 6 on classification.


That is, CLTree first finds initial cuts and then looks ahead to find better partitions that cut less into cluster regions. It finds those cuts that form regions with a very low relative density. The idea is that we want to split at the cut point that may result in a big empty ("N") region, which is more likely to separate clusters. With such tuning, CLTree can perform high-quality clustering in high-dimensional space. It can also find subspace clusters, as the decision tree method normally selects only a subset of the attributes. An interesting by-product of this method is the empty (sparse) regions, which may also be useful in certain applications. In marketing, for example, clusters may represent different segments of existing customers of a company, while empty regions reflect the profiles of noncustomers. Knowing the profiles of noncustomers allows the company to tailor their services or marketing to target these potential customers.

7.11 Outlier Analysis

"What is an outlier?" Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.

Outliers can be caused by measurement or execution error. For example, the display of a person's age as −999 could be caused by a program default setting of an unrecorded age. Alternatively, outliers may be the result of inherent data variability. The salary of the chief executive officer of a company, for instance, could naturally stand out as an outlier among the salaries of the other employees in the firm.

Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information because one person's noise could be another person's signal. In other words, the outliers may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.

Outlier mining has wide applications. As mentioned previously, it can be used in fraud detection, for example, by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medical analysis for finding unusual responses to various medical treatments.

Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier mining problem can be viewed as two subproblems: (1) define what data can be considered as inconsistent in a given data set, and (2) find an efficient method to mine the outliers so defined.

The problem of defining outliers is nontrivial. If a regression model is used for data modeling, analysis of the residuals can give a good estimation for data "extremeness." The task becomes tricky, however, when finding outliers in time-series data, as they may be hidden in trend, seasonal, or other cyclic changes. When multidimensional data are analyzed, not any particular one but rather a combination of dimension values may be extreme. For nonnumeric (i.e., categorical) data, the definition of outliers requires special consideration.

"What about using data visualization methods for outlier detection?" This may seem like an obvious choice, since human eyes are very fast and effective at noticing data inconsistencies. However, this does not apply to data containing cyclic plots, where values that appear to be outliers could be perfectly valid values in reality. Data visualization methods are weak in detecting outliers in data with many categorical attributes or in data of high dimensionality, since human eyes are good at visualizing numeric data of only two to three dimensions.

In this section, we instead examine computer-based methods for outlier detection. These can be categorized into four approaches: the statistical approach, the distance-based approach, the density-based local outlier approach, and the deviation-based approach, each of which is studied here. Notice that while clustering algorithms discard outliers as noise, they can be modified to include outlier detection as a by-product of their execution. In general, users must check that each outlier discovered by these approaches is indeed a "real" outlier.

The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.

"How does the discordancy testing work?" A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis. A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

H : o_i ∈ F, where i = 1, 2, ..., n.   (7.43)

The hypothesis is retained if there is no statistically significant evidence supporting its rejection. A discordancy test verifies whether an object, o_i, is significantly large (or small) in relation to the distribution F. Different test statistics have been proposed for use as a discordancy test, depending on the available knowledge of the data. Assuming that some statistic, T, has been chosen for discordancy testing, and the value of the statistic for object o_i is v_i, then the distribution of T is constructed. The significance probability, SP(v_i) = Prob(T > v_i), is evaluated. If SP(v_i) is sufficiently small, then o_i is discordant and the working hypothesis is rejected. An alternative hypothesis, H̄, which states that o_i comes from another distribution model, G, is adopted. The result is very much dependent on which model F is chosen, because o_i may be an outlier under one model and a perfectly valid value under another.
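As a concrete illustration, suppose the working hypothesis is a normal model F with known mean and standard deviation and the test statistic T is the absolute z-score of an object. Then SP(v_i) can be evaluated from the normal tail probability, as in the following sketch (the choice of statistic and the 0.05 cut-off are illustrative assumptions):

```python
from math import erfc, sqrt

def significance_probability(value, mu, sigma):
    """SP(v) = Prob(T > v) for T = |z-score| under a normal working hypothesis."""
    z = abs(value - mu) / sigma
    return erfc(z / sqrt(2))          # two-sided normal tail probability

def is_discordant(value, mu, sigma, alpha=0.05):
    """Reject the working hypothesis H for this object if SP is sufficiently small."""
    return significance_probability(value, mu, sigma) < alpha

ages = [23, 25, 31, 28, 27, 30, 26, 24, 29, 95]
mu = sum(ages) / len(ages)
sigma = (sum((a - mu) ** 2 for a in ages) / len(ages)) ** 0.5
print([a for a in ages if is_discordant(a, mu, sigma)])   # flags the age 95 as discordant
```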


The alternative distribution is very important in determining the power of the test, that is, the probability that the working hypothesis is rejected when o_i is really an outlier. There are different kinds of alternative distributions.

Inherent alternative distribution: In this case, the working hypothesis that all of the objects come from distribution F is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, G:

H̄ : o_i ∈ G, where i = 1, 2, ..., n.   (7.44)

Mixture alternative distribution: The mixture alternative states that discordant values are not outliers in the F population, but contaminants from some other population, G. In this case, the alternative hypothesis is

H̄ : o_i ∈ (1 − λ)F + λG, where i = 1, 2, ..., n.   (7.45)

Slippage alternative distribution: This alternative states that all of the objects (apart from some prescribed small number) arise independently from the initial model, F, with its given parameters, whereas the remaining objects are independent observations from a modified version of F in which the parameters have been shifted.

There are two basic types of procedures for detecting outliers:

Block procedures: In this case, either all of the suspect objects are treated as outliers or all of them are accepted as consistent.

Consecutive (or sequential) procedures: An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least "likely" to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.

"How effective is the statistical approach at outlier detection?" A major drawback is that most tests are for single attributes, yet many data mining problems require finding outliers in multidimensional space. Moreover, the statistical approach requires knowledge about parameters of the data set, such as the data distribution. However, in many cases, the data distribution may not be known. Statistical methods do not guarantee that all outliers will be found for the cases where no specific test was developed, or where the observed distribution cannot be adequately modeled with any standard distribution.


7.11.2 Distance-Based Outlier Detection

The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin,¹¹ that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as those objects that do not have "enough" neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical-based methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions. Distance-based outlier detection avoids the excessive computation that can be associated with fitting the observed distribution into some standard distribution and in selecting discordancy tests.
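The definition translates directly into a naive, quadratic check; the efficient algorithms described below avoid this brute force, but the sketch makes the pct and dmin parameters concrete:

```python
from math import dist

def is_db_outlier(o, data, pct, dmin):
    """DB(pct, dmin)-outlier: at least a fraction pct of the other objects
    lie at a distance greater than dmin from o."""
    others = [x for x in data if x is not o]
    far = sum(1 for x in others if dist(o, x) > dmin)
    return far >= pct * len(others)

data = [(0, 0), (0.5, 0.2), (0.3, 0.8), (0.9, 0.4), (8, 8)]
print([o for o in data if is_db_outlier(o, data, pct=0.95, dmin=2.0)])  # [(8, 8)]
```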

For many discordancy tests, it can be shown that if an object, o, is an outlier according to the given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct and dmin. For example, if objects that lie three or more standard deviations from the mean are considered to be outliers, assuming a normal distribution, then this definition can be generalized by a DB(0.9988, 0.13σ)-outlier.¹² Several efficient algorithms for mining distance-based outliers have been developed. These are outlined as follows.

Index-based algorithm: Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that object. Let M be the maximum number of objects within the dmin-neighborhood of an outlier. Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a worst-case complexity of O(n²k), where n is the number of objects in the data set and k is the dimensionality. The index-based algorithm scales well as k increases. However, this complexity evaluation takes only the search time into account, even though the task of building an index in itself can be computationally intensive.

Nested-loop algorithm: The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os. It divides the memory buffer space into two halves and the data set into several logical blocks. By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.

¹¹ The parameter dmin is the neighborhood radius around object o. It corresponds to the parameter ε in Section 7.6.1.

¹² The parameters pct and dmin are computed using the normal curve's probability density function to satisfy the probability condition P(|x − 3| ≤ dmin) < 1 − pct, that is, P(3 − dmin ≤ x ≤ 3 + dmin) < 1 − pct, where x is an object. (Note that the solution may not be unique.) A dmin-neighborhood of radius 0.13 indicates a spread of ±0.13 units around the 3σ mark (i.e., [2.87, 3.13]). For a complete proof of the derivation, see [KN97].


Cell-based algorithm: To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality. In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k). Each cell has two layers surrounding it. The first layer is one cell thick, while the second is ⌈2√k − 1⌉ cells thick, rounded up to the closest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three counts—the number of objects in the cell, in the cell and the first layer together, and in the cell and both layers together. Let's refer to these counts as cell count, cell + 1 layer count, and cell + 2 layers count, respectively.

"How are outliers determined in this method?" Let M be the maximum number of objects that can exist in the dmin-neighborhood of an outlier.

An object, o, in the current cell is considered an outlier only if cell + 1 layer count is less than or equal to M. If this condition does not hold, then all of the objects in the cell can be removed from further investigation as they cannot be outliers. If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are considered outliers. Otherwise, if this number is more than M, then it is possible that some of the objects in the cell may be outliers. To detect these outliers, object-by-object processing is used where, for each object, o, in the cell, objects in the second layer of o are examined. For objects in the cell, only those objects having no more than M points in their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of the object's cell, all of its first layer, and some of its second layer.

A variation to the algorithm is linear with respect to n and guarantees that no more than three passes over the data set are required. It can be used for large disk-resident data sets, yet does not scale well for high dimensions.

Distance-based outlier detection requires the user to set both the pct and dmin parameters. Finding suitable settings for these parameters can involve much trial and error.

Statistical and distance-based outlier detection both depend on the overall or "global" distribution of the given set of data points, D. However, data are usually not uniformly distributed. These methods encounter difficulties when analyzing data with rather different density distributions, as illustrated in the following example.

Example 7.18 Necessity for density-based local outlier detection. Figure 7.27 shows a simple 2-D data set containing 502 objects, with two obvious clusters. Cluster C1 contains 400 objects. Cluster C2 contains 100 objects. Two additional objects, o1 and o2, are clearly outliers.


Figure 7.27 The necessity of density-based local outlier analysis. From [BKNS00].

However, by distance-based outlier detection (which generalizes many notions from statistical-based outlier detection), only o1 is a reasonable DB(pct, dmin)-outlier, because if dmin is set to be less than the minimum distance between o2 and C2, then all 501 objects are further away from o2 than dmin. Thus, o2 would be considered a DB(pct, dmin)-outlier, but so would all of the objects in C1! On the other hand, if dmin is set to be greater than the minimum distance between o2 and C2, then even when o2 is not regarded as an outlier, some points in C1 may still be considered outliers.

This brings us to the notion of local outliers. An object is a local outlier if it is outlying relative to its local neighborhood, particularly with respect to the density of the neighborhood. In this view, o2 of Example 7.18 is a local outlier relative to the density of C2. Object o1 is an outlier as well, and no objects in C1 are mislabeled as outliers. This forms the basis of density-based local outlier detection. Another key idea of this approach to outlier detection is that, unlike previous methods, it does not consider being an outlier as a binary property. Instead, it assesses the degree to which an object is an outlier. This degree of "outlierness" is computed as the local outlier factor (LOF) of an object. It is local in the sense that the degree depends on how isolated the object is with respect to the surrounding neighborhood. This approach can detect both global and local outliers.

To define the local outlier factor of an object, we need to introduce the concepts of k-distance, k-distance neighborhood, reachability distance,¹³ and local reachability density. These are defined as follows:

The k-distance of an object p is the maximal distance that p gets from its k-nearest neighbors. This distance is denoted as k-distance(p). It is defined as the distance, d(p, o), between p and an object o ∈ D, such that (1) for at least k objects, o′ ∈ D, it holds that d(p, o′) ≤ d(p, o); that is, there are at least k objects in D that are as close as or closer to p than o; and (2) for at most k − 1 objects, o″ ∈ D, it holds that d(p, o″) < d(p, o); that is, there are at most k − 1 objects that are closer to p than o. You may be wondering at this point how k is determined. The LOF method links to density-based clustering in that it sets k to the parameter MinPts, which specifies the minimum number of points for use in identifying clusters based on density (Sections 7.6.1 and 7.6.2). Here, MinPts (as k) is used to define the local neighborhood of an object, p.

The k-distance neighborhood of an object p is denoted N_{k-distance(p)}(p), or N_k(p) for short. By setting k to MinPts, we get N_MinPts(p). It contains the MinPts-nearest neighbors of p. That is, it contains every object whose distance is not greater than the MinPts-distance of p.

¹³ The reachability distance here is similar to the reachability distance defined for OPTICS in Section 7.6.2, although it is given in a somewhat different context.

The reachability distance of an object p with respect to object o (where o is within the MinPts-nearest neighbors of p) is defined as reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}. Intuitively, if an object p is far away from o, then the reachability distance between the two is simply their actual distance. However, if they are "sufficiently" close (i.e., where p is within the MinPts-distance neighborhood of o), then the actual distance is replaced by the MinPts-distance of o. This helps to significantly reduce the statistical fluctuations of d(p, o) for all of the p close to o. The higher the value of MinPts is, the more similar is the reachability distance for objects within the same neighborhood.

Intuitively, the local reachability density of p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p. It is defined as

lrd_MinPts(p) = |N_MinPts(p)| / Σ_{o ∈ N_MinPts(p)} reach_dist_MinPts(p, o).   (7.46)

The local outlier factor (LOF) of p captures the degree to which we call p an outlier. It is defined as

LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|,   (7.47)

that is, the average of the ratio of the local reachability densities of p's MinPts-nearest neighbors to that of p. The lower p's local reachability density is, and the higher the local reachability densities of p's MinPts-nearest neighbors are, the higher the LOF value of p. Experiments based on both synthetic and real-world large data sets have demonstrated the power of LOF at identifying local outliers.
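Putting the definitions together, the following compact, brute-force sketch computes LOF for every point with k = MinPts; a production implementation would use spatial indexing instead of an all-pairs distance matrix, and the data set below merely mimics the flavor of Example 7.18:

```python
import numpy as np

def lof_scores(X, min_pts):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    neighbors = np.argsort(D, axis=1)[:, :min_pts]      # MinPts-nearest neighbors
    k_dist = D[np.arange(n), neighbors[:, -1]]          # MinPts-distance of each point
    lrd = np.empty(n)
    for p in range(n):
        # reach_dist(p, o) = max(MinPts-distance(o), d(p, o))
        reach = np.maximum(k_dist[neighbors[p]], D[p, neighbors[p]])
        lrd[p] = min_pts / reach.sum()
    # LOF(p) = average of lrd(o)/lrd(p) over p's neighbors o
    return np.array([lrd[neighbors[p]].mean() / lrd[p] for p in range(n)])

# Two clusters of different density plus two extra points, roughly as in Example 7.18.
rng = np.random.default_rng(0)
C1 = rng.normal([0, 0], 0.3, size=(400, 2))
C2 = rng.normal([5, 5], 1.5, size=(100, 2))
X = np.vstack([C1, C2, [[2.5, 2.5], [9.5, 9.5]]])
scores = lof_scores(X, min_pts=10)
print(np.argsort(scores)[-2:])   # the two appended points should rank among the highest LOF scores
```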


7.11.4 Deviation-Based Outlier Detection

Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers. Hence, in this approach the term deviations is typically used to refer to outliers. In this section, we study two techniques for deviation-based outlier detection. The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.

Sequential Exception Technique

The sequential exception technique simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects. It uses implicit redundancy of the data. Given a data set, D, of n objects, it builds a sequence of subsets, {D1, D2, ..., Dm}, of these objects with 2 ≤ m ≤ n such that

D_{j−1} ⊂ D_j, where D_j ⊆ D.

Dissimilarities are assessed between subsets in the sequence. The technique introduces the following key terms.

Exception set: This is the set of deviations or outliers. It is defined as the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the residual set.¹⁴

Dissimilarity function: This function does not require a metric distance between the objects. It is any function that, if given a set of objects, returns a low value if the objects are similar to one another. The greater the dissimilarity among the objects, the higher the value returned by the function. The dissimilarity of a subset is incrementally computed based on the subset prior to it in the sequence. Given a subset of n numbers, {x_1, ..., x_n}, a possible dissimilarity function is the variance of the numbers in the set, that is,

(1/n) Σ_{i=1}^{n} (x_i − x̄)²,

where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity function may be in the form of a pattern string (e.g., containing wildcard characters) that is used to cover all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the strings in D_{j−1} does not cover any string in D_j that is not in D_{j−1}.

¹⁴ For interested readers, this is equivalent to the greatest reduction in Kolmogorov complexity for the amount of data discarded.


The general task of finding an exception set can be NP-hard (i.e., intractable). A sequential approach is computationally feasible and can be implemented using a linear algorithm.

"How does this technique work?" Instead of assessing the dissimilarity of the current subset with respect to its complementary set, the algorithm selects a sequence of subsets from the set for analysis. For every subset, it determines the dissimilarity difference of the subset with respect to the preceding subset in the sequence.

"Can't the order of the subsets in the sequence affect the results?" To help alleviate any possible influence of the input order on the results, the above process can be repeated several times, each with a different random ordering of the subsets. The subset with the largest smoothing factor value, among all of the iterations, becomes the exception set.
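A much-simplified sketch of this idea, using the variance dissimilarity above: scan the objects in several random orders and report the object whose addition increases the dissimilarity the most. The real technique's smoothing factor also weighs in the cardinality of the removed subset; that refinement is omitted here, so this is only an illustration.

```python
import random

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def sequential_exception(data, trials=5, seed=42):
    """Scan random orderings; report the object whose addition raises dissimilarity most."""
    rng = random.Random(seed)
    best = None   # (dissimilarity gain, object)
    for _ in range(trials):
        order = data[:]
        rng.shuffle(order)
        prefix = [order[0]]
        for x in order[1:]:
            gain = variance(prefix + [x]) - variance(prefix)
            if best is None or gain > best[0]:
                best = (gain, x)
            prefix.append(x)
    return best[1]

data = [12, 11, 13, 12, 14, 11, 97, 13, 12]
print(sequential_exception(data))   # flags 97, the obvious exception, for virtually any ordering
```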

OLAP Data Cube Technique

An OLAP approach to deviation detection uses data cubes to identify regions of anomalies in large multidimensional data. This technique was described in detail in Chapter 4. For added efficiency, the deviation detection process is overlapped with cube computation. The approach is a form of discovery-driven exploration, in which precomputed measures indicating data exceptions are used to guide the user in data analysis, at all levels of aggregation. A cell value in the cube is considered an exception if it is significantly different from the expected value, based on a statistical model. The method uses visual cues such as background color to reflect the degree of exception of each cell. The user can choose to drill down on cells that are flagged as exceptions. The measure value of a cell may reflect exceptions occurring at more detailed or lower levels of the cube, where these exceptions are not visible from the current level.

The model considers variations and patterns in the measure value across all of the dimensions to which a cell belongs. For example, suppose that you have a data cube for sales data and are viewing the sales summarized per month. With the help of the visual cues, you notice an increase in sales in December in comparison to all other months. This may seem like an exception in the time dimension. However, by drilling down on the month of December to reveal the sales per item in that month, you note that there is a similar increase in sales for other items during December. Therefore, an increase in total sales in December is not an exception if the item dimension is considered. The model considers exceptions hidden at all aggregated group-by's of a data cube. Manual detection of such exceptions is difficult because the search space is typically very large, particularly when there are many dimensions involving concept hierarchies with several levels.


7.12 Summary

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

Cluster analysis has wide applications, including market or customer segmentation, pattern recognition, biological studies, spatial data analysis, Web document classification, and many others. Cluster analysis can be used as a stand-alone data mining tool to gain insight into the data distribution or can serve as a preprocessing step for other data mining algorithms operating on the detected clusters.

The quality of clustering can be assessed based on a measure of dissimilarity of objects, which can be computed for various types of data, including interval-scaled, binary, categorical, ordinal, and ratio-scaled variables, or combinations of these variable types. For nonmetric vector data, the cosine measure and the Tanimoto coefficient are often used in the assessment of similarity.

Clustering is a dynamic field of research in data mining. Many clustering algorithms have been developed. These can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern–based methods), and constraint-based methods. Some algorithms may belong to more than one category.

A partitioning method first creates an initial set of k partitions, where parameter k is the number of partitions to construct. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. Typical partitioning methods include k-means, k-medoids, CLARANS, and their improvements.

A hierarchical method creates a hierarchical decomposition of the given set of data objects. The method can be classified as being either agglomerative (bottom-up) or divisive (top-down), based on how the hierarchical decomposition is formed. To compensate for the rigidity of merge or split, the quality of hierarchical agglomeration can be improved by analyzing object linkages at each hierarchical partitioning (such as in ROCK and Chameleon), or by first performing microclustering (that is, grouping objects into "microclusters") and then operating on the microclusters with other clustering techniques, such as iterative relocation (as in BIRCH).

A density-based method clusters objects based on the notion of density. It either grows clusters according to the density of neighborhood objects (such as in DBSCAN) or according to some density function (such as in DENCLUE). OPTICS is a density-based method that generates an augmented ordering of the clustering structure of the data.

density-A grid-based method first quantizes the object space into a finite number of cells that

form a grid structure, and then performs clustering on the grid structure STING is



a typical example of a grid-based method based on statistical information stored in grid cells. WaveCluster and CLIQUE are two clustering algorithms that are both grid-based and density-based.

A model-based method hypothesizes a model for each of the clusters and finds the

best fit of the data to that model. Examples of model-based clustering include the EM algorithm (which uses a mixture density model), conceptual clustering (such as COBWEB), and neural network approaches (such as self-organizing feature maps).

Clustering high-dimensional data is of crucial importance, because in many

advanced applications, data objects such as text documents and microarray data are high-dimensional in nature. There are three typical methods to handle high-dimensional data sets: dimension-growth subspace clustering, represented by CLIQUE, dimension-reduction projected clustering, represented by PROCLUS, and frequent pattern–based clustering, represented by pCluster.

A constraint-based clustering method groups objects based on

application-dependent or user-specified constraints. Typical examples include clustering with the existence of obstacle objects, clustering under user-specified constraints, and semi-supervised clustering based on "weak" supervision (such as pairs of objects labeled as belonging to the same or different cluster).

One person’s noise could be another person’s signal Outlier detection and analysis are

very useful for fraud detection, customized marketing, medical analysis, and many

other tasks. Computer-based outlier analysis methods typically follow either a statistical distribution-based approach, a distance-based approach, a density-based local outlier detection approach, or a deviation-based approach.

Exercises

7.1 Briefly outline how to compute the dissimilarity between objects described by the

following types of variables:

(a) Numerical (interval-scaled) variables

(b) Asymmetric binary variables

(c) Categorical variables

(d) Ratio-scaled variables

(e) Nonmetric vector objects

7.2 Given the following measurements for the variable age:

18, 22, 25, 42, 28, 43, 33, 35, 56, 28,


standardize the variable by the following:

(a) Compute the mean absolute deviation of age.

(b) Compute the z-score for the first four measurements

7.3 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):

(a) Compute the Euclidean distance between the two objects.

(b) Compute the Manhattan distance between the two objects.

(c) Compute the Minkowski distance between the two objects, using q = 3.

7.4 Section 7.2.3 gave a method wherein a categorical variable having M states can be encoded

by M asymmetric binary variables. Propose a more efficient encoding scheme and state

why it is more efficient

7.5 Briefly describe the following approaches to clustering: partitioning methods, hierarchical

methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data, and constraint-based methods. Give examples in each case.

7.6 Suppose that the data mining task is to cluster the following eight points (with (x, y)

representing location) into three clusters:

A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9)

The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only

(a) The three cluster centers after the first round of execution
(b) The final three clusters

7.7 Both k-means and k-medoids algorithms can perform effective clustering Illustrate the

strength and weakness of k-means in comparison with the k-medoids algorithm. Also, illustrate the strength and weakness of these schemes in comparison with a hierarchical clustering scheme (such as AGNES).

7.8 Use a diagram to illustrate how, for a constant MinPts value, density-based clusters with

respect to a higher density (i.e., a lower value for ε, the neighborhood radius) are completely contained in density-connected sets obtained with respect to a lower density.

7.9 Why is it that BIRCH encounters difficulties in finding clusters of arbitrary shape but

OPTICS does not? Can you propose some modifications to BIRCH to help it find clusters

of arbitrary shape?

7.10 Present conditions under which density-based clustering is more suitable than

partitioning-based clustering and hierarchical clustering. Give some application examples to support your argument.

7.11 Give an example of how specific clustering methods may be integrated, for example,

where one clustering algorithm is used as a preprocessing step for another. In



addition, provide reasoning on why the integration of two methods may sometimes lead

to improved clustering quality and efficiency

7.12 Clustering has been popularly recognized as an important data mining task with broad

applications. Give one application example for each of the following cases:

(a) An application that takes clustering as a major data mining function

(b) An application that takes clustering as a preprocessing tool for data preparation for

other data mining tasks

7.13 Data cubes and multidimensional databases contain categorical, ordinal, and numerical

data in hierarchical or aggregate forms. Based on what you have learned about the clustering methods, design a clustering method that finds clusters in large data cubes effectively and efficiently.

7.14 Subspace clustering is a methodology for finding interesting clusters in high-dimensional

space. This methodology can be applied to cluster any kind of data. Outline an efficient algorithm that may extend density connectivity-based clustering for finding clusters of arbitrary shapes in projected dimensions in a high-dimensional data set.

7.15 [Contributed by Alex Kotov] Describe each of the following clustering algorithms in terms

of the following criteria: (i) shapes of clusters that can be determined; (ii) input parameters that must be specified; and (iii) limitations.

7.16 [Contributed by Tao Cheng] Many clustering algorithms handle either only numerical

data, such as BIRCH, or only categorical data, such as ROCK, but not both. Analyze why this is the case. Note, however, that the EM clustering algorithm can easily be extended

to handle data with both numerical and categorical attributes. Briefly explain why it can

do so and how

7.17 Human eyes are fast and effective at judging the quality of clustering methods for

two-dimensional data. Can you design a data visualization method that may help humans visualize data clusters and judge the clustering quality for three-dimensional data? What about for even higher-dimensional data?

7.18 Suppose that you are to allocate a number of automatic teller machines (ATMs) in a

given region so as to satisfy a number of constraints. Households or places of work may be clustered so that typically one ATM is assigned per cluster. The clustering, however, may be constrained by two factors: (1) obstacle objects (i.e., there are bridges,


rivers, and highways that can affect ATM accessibility), and (2) additional user-specified constraints, such as each ATM should serve at least 10,000 households. How can a clustering algorithm such as k-means be modified for quality clustering under both constraints?

7.19 For constraint-based clustering, aside from having the minimum number of customers

in each cluster (for ATM allocation) as a constraint, there could be many other kinds of constraints. For example, a constraint could be in the form of the maximum number of customers per cluster, average income of customers per cluster, maximum distance between every two clusters, and so on. Categorize the kinds of constraints that can be imposed on the clusters produced and discuss how to perform clustering efficiently under such kinds of constraints.

7.20 Design a privacy-preserving clustering method so that a data owner would be able to

ask a third party to mine the data for quality clustering without worrying about the potential inappropriate disclosure of certain private or sensitive information stored

in the data

7.21 Why is outlier mining important? Briefly describe the different approaches behind

statistical-based outlier detection, distance-based outlier detection, density-based local outlier detection, and deviation-based outlier detection.

7.22 Local outlier factor (LOF) is an interesting notion for the discovery of local outliers in an environment where data objects are distributed rather unevenly. However, its performance should be further improved in order to efficiently discover local outliers. Can you propose an efficient method for effective discovery of local outliers in large data sets?

Bibliographic Notes

Clustering has been studied extensively for more than 40 years and across many disciplines due to its broad applications. Most books on pattern classification and machine learning contain chapters on cluster analysis or unsupervised learning. Several textbooks are dedicated to the methods of cluster analysis, including Hartigan [Har75], Jain and Dubes [JD88], Kaufman and Rousseeuw [KR90], and Arabie, Hubert, and De Sorte [AHS96]. There are also many survey articles on different aspects of clustering methods. Recent ones include Jain, Murty, and Flynn [JMF99] and Parsons, Haque, and Liu [PHL04].

Methods for combining variables of different types into a single dissimilarity matrix were introduced by Kaufman and Rousseeuw [KR90].

For partitioning methods, the k-means algorithm was first introduced by Lloyd [Llo57] and then MacQueen [Mac67]. The k-medoids algorithms of PAM and CLARA were proposed by Kaufman and Rousseeuw [KR90]. The k-modes (for clustering categorical data) and k-prototypes (for clustering hybrid data) algorithms were proposed by Huang [Hua98]. The k-modes clustering algorithm was also proposed independently

by Chaturvedi, Green, and Carroll [CGC94, CGC01]



The CLARANS algorithm was proposed by Ng and Han [NH94]. Ester, Kriegel, and Xu [EKX95] proposed techniques for further improvement of the performance of CLARANS using efficient spatial access methods, such as R*-tree and focusing techniques. A k-means–based scalable clustering algorithm was proposed by Bradley,

Fayyad, and Reina [BFR98]

An early survey of agglomerative hierarchical clustering algorithms was conducted by Day and Edelsbrunner [DE84]. Agglomerative hierarchical clustering, such as AGNES, and divisive hierarchical clustering, such as DIANA, were introduced by Kaufman and Rousseeuw [KR90]. An interesting direction for improving the clustering quality of hierarchical clustering methods is to integrate hierarchical clustering with distance-based iterative relocation or other nonhierarchical clustering methods. For example, BIRCH,

by Zhang, Ramakrishnan, and Livny [ZRL96], first performs hierarchical clustering with

a CF-tree before applying other techniques. Hierarchical clustering can also be performed by sophisticated linkage analysis, transformation, or nearest-neighbor analysis, such as CURE by Guha, Rastogi, and Shim [GRS98], ROCK (for clustering categorical attributes) by Guha, Rastogi, and Shim [GRS99b], and Chameleon by Karypis, Han, and Kumar [KHK99].

For density-based clustering methods, DBSCAN was proposed by Ester, Kriegel, Sander, and Xu [EKSX96]. Ankerst, Breunig, Kriegel, and Sander [ABKS99] developed OPTICS, a cluster-ordering method that facilitates density-based clustering without worrying about parameter specification. The DENCLUE algorithm, based on a set of density distribution functions, was proposed by Hinneburg and Keim [HK98].

A grid-based multiresolution approach called STING, which collects statistical information in grid cells, was proposed by Wang, Yang, and Muntz [WYM97]. WaveCluster, developed by Sheikholeslami, Chatterjee, and Zhang [SCZ98], is a multiresolution clustering approach that transforms the original feature space by wavelet transform.

For model-based clustering, the EM (Expectation-Maximization) algorithm was developed by Dempster, Laird, and Rubin [DLR77]. AutoClass is a Bayesian statistics-based method for model-based clustering by Cheeseman and Stutz [CS96a] that uses a variant of the EM algorithm. There are many other extensions and applications of EM, such as Lauritzen [Lau95]. For a set of seminal papers on conceptual clustering, see Shavlik and Dietterich [SD90]. Conceptual clustering was first introduced by Michalski and Stepp [MS83]. Other examples of the conceptual clustering approach include COBWEB by Fisher [Fis87], and CLASSIT by Gennari, Langley, and Fisher [GLF89]. Studies of the neural network approach [He99] include SOM (self-organizing feature maps) by Kohonen [Koh82], [Koh89], by Carpenter and Grossberg [Ce91], and by Kohonen, Kaski, Lagus, et al. [KKL+00], and competitive learning by Rumelhart and Zipser [RZ85].

Scalable methods for clustering categorical data were studied by Gibson, Kleinberg, and Raghavan [GKR98], Guha, Rastogi, and Shim [GRS99b], and Ganti, Gehrke, and Ramakrishnan [GGR99]. There are also many other clustering paradigms. For example, fuzzy clustering methods are discussed in Kaufman and Rousseeuw [KR90], Bezdek [Bez81], and Bezdek and Pal [BP92].

For high-dimensional clustering, an Apriori-based dimension-growth subspace clustering algorithm called CLIQUE was proposed by Agrawal, Gehrke, Gunopulos, and


Raghavan [AGGR98]. It integrates density-based and grid-based clustering methods.

A sampling-based, dimension-reduction subspace clustering algorithm called PROCLUS, and its extension, ORCLUS, were proposed by Aggarwal et al. [APW+99] and by Aggarwal and Yu [AY00], respectively. An entropy-based subspace clustering algorithm for mining numerical data, called ENCLUS, was proposed by Cheng, Fu, and Zhang [CFZ99]. For a frequent pattern–based approach to handling high-dimensional data, Beil, Ester, and Xu [BEX02] proposed a method for frequent term–based text clustering. H. Wang, W. Wang, Yang, and Yu proposed pCluster, a pattern similarity–based clustering method [WWYY02].

Recent studies have proceeded to clustering stream data, as in Babcock, Babu, Datar,

et al. [BBD+02]. A k-median-based data stream clustering algorithm was proposed by Guha, Mishra, Motwani, and O'Callaghan [GMMO00], and by O'Callaghan, Mishra, Meyerson, et al. [OMM+02]. A method for clustering evolving data streams was proposed by Aggarwal, Han, Wang, and Yu [AHWY03]. A framework for projected clustering of high-dimensional data streams was proposed by Aggarwal, Han, Wang, and Yu [AHWY04a].

A framework for constraint-based clustering based on user-specified constraints was built by Tung, Han, Lakshmanan, and Ng [THLN01]. An efficient method for constraint-based spatial clustering in the existence of physical obstacle constraints was proposed by Tung, Hou, and Han [THH01]. The quality of unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints (i.e., pairs of instances labeled as belonging to the same or different cluster). Such a process is considered semi-supervised clustering. A probabilistic framework for semi-supervised clustering was proposed by Basu, Bilenko, and Mooney [BBM04]. The CLTree method, which transforms the clustering problem into a classification problem and then uses decision tree induction for cluster analysis, was proposed by Liu, Xia, and Yu [LXY01].

Outlier detection and analysis can be categorized into four approaches: the statistical approach, the distance-based approach, the density-based local outlier detection, and the deviation-based approach. The statistical approach and discordancy tests are described in Barnett and Lewis [BL94]. Distance-based outlier detection is described in Knorr and Ng [KN97, KN98]. The detection of density-based local outliers was proposed by Breunig, Kriegel, Ng, and Sander [BKNS00]. Outlier detection for high-dimensional data is studied by Aggarwal and Yu [AY01]. The sequential problem approach to deviation-based outlier detection was introduced in Arning, Agrawal, and Raghavan [AAR96]. Sarawagi, Agrawal, and Megiddo [SAM98] introduced a discovery-driven method for identifying exceptions in large multidimensional data using OLAP data cubes. Jagadish, Koudas, and Muthukrishnan [JKM99] introduced an efficient method for mining deviants in time-series databases.


Mining Stream, Time-Series,

and Sequence Data

Our previous chapters introduced the basic concepts and techniques of data mining. The techniques studied, however, were for simple and structured data sets, such as data in relational databases, transactional databases, and data warehouses. The growth of data in various complex forms (e.g., semi-structured and unstructured, spatial and temporal, hypertext and multimedia) has been explosive owing to the rapid progress of data collection and advanced database system technologies, and the World Wide Web. Therefore, an increasingly important task in data mining is to mine complex types of data. Furthermore, many data mining applications need to mine patterns that are more sophisticated than those discussed earlier, including sequential patterns, subgraph patterns, and features in interconnected networks. We treat such tasks as advanced topics in data mining.

In the following chapters, we examine how to further develop the essential data mining techniques (such as characterization, association, classification, and clustering) and how to develop new ones to cope with complex types of data. We start off, in this chapter, by discussing the mining of stream, time-series, and sequence data. Chapter 9 focuses on the mining of graphs, social networks, and multirelational data. Chapter 10 examines mining object, spatial, multimedia, text, and Web data. Research into such mining is fast evolving. Our discussion provides a broad introduction. We expect that many new books dedicated to the mining of complex kinds of data will become available in the future.

As this chapter focuses on the mining of stream data, time-series data, and sequence data, let's look at each of these areas.

Imagine a satellite-mounted remote sensor that is constantly generating data. The data are massive (e.g., terabytes in volume), temporally ordered, fast changing, and potentially infinite. This is an example of stream data. Other examples include telecommunications data, transaction data from the retail industry, and data from electric power grids. Traditional OLAP and data mining methods typically require multiple scans of the data and are therefore infeasible for stream data applications. In Section 8.1, we study advanced mining methods for the analysis of such constantly flowing data.

A time-series database consists of sequences of values or events obtained over repeated

measurements of time. Suppose that you are given time-series data relating to stock market prices. How can the data be analyzed to identify trends? Given such data for



two different stocks, can we find any similarities between the two? These questions are explored in Section 8.2. Other applications involving time-series data include economic and sales forecasting, utility studies, and the observation of natural phenomena (such as atmosphere, temperature, and wind).

A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time. Sequential pattern mining is the discovery of frequently occurring ordered events or subsequences as patterns. An example of a

sequential pattern is "Customers who buy a Canon digital camera are likely to buy an HP color printer within a month." Periodic patterns, which recur in regular periods or durations, are another kind of pattern related to sequences. Section 8.3 studies methods of sequential pattern mining.

Recent research in bioinformatics has resulted in the development of numerous methods for the analysis of biological sequences, such as DNA and protein sequences. Section 8.4 introduces several popular methods, including biological sequence alignment algorithms and the hidden Markov model.

Tremendous and potentially infinite volumes of data streams are often generated by

real-time surveillance systems, communication networks, Internet traffic, on-line transactions in the financial market or retail industry, electric power grids, industry production processes, scientific and engineering experiments, remote sensors, and other dynamic environments. Unlike traditional data sets, stream data flow in and out of a computer system continuously and with varying update rates. They are temporally ordered, fast changing, massive, and potentially infinite. It may be impossible to store an entire data stream or to scan through it multiple times due to its tremendous volume. Moreover, stream data tend to be of a rather low level of abstraction, whereas most analysts are interested in relatively high-level dynamic changes, such as trends and deviations. To discover knowledge or patterns from data streams, it is necessary to develop single-scan, on-line, multilevel, multidimensional stream processing and analysis methods.

Such single-scan, on-line data analysis methodology should not be confined to only stream data. It is also critically important for processing nonstream data that are massive. With data volumes mounting by terabytes or even petabytes, stream data nicely capture our data processing needs of today: even when the complete set of data is collected and can be stored in massive data storage devices, single scan (as in data stream systems) instead of random access (as in database systems) may still be the most realistic processing mode, because it is often too expensive to scan such a data set multiple times.

In this section, we introduce several on-line stream data analysis and mining methods. Section 8.1.1 introduces the basic methodologies for stream data processing and querying. Multidimensional analysis of stream data, encompassing stream data cubes and multiple granularities of time, is described in Section 8.1.2. Frequent-pattern mining and classification are presented in Sections 8.1.3 and 8.1.4, respectively. The clustering of dynamically evolving data streams is addressed in Section 8.1.5.



Stream Data Systems

As seen from the previous discussion, it is impractical to scan through an entire data stream more than once. Sometimes we cannot even "look" at every element of a stream because the stream flows in so fast and changes so quickly. The gigantic size of such data sets also implies that we generally cannot store the entire stream data set in main memory or even on disk. The problem is not just that there is a lot of data, it is that the universes that we are keeping track of are relatively large, where a universe is the domain of possible values for an attribute. For example, if we were tracking the ages of millions of people, our universe would be relatively small, perhaps between zero and one hundred and twenty. We could easily maintain exact summaries of such data. In contrast, the universe corresponding to the set of all pairs of IP addresses on the Internet is very large, which makes exact storage intractable. A reasonable way of thinking about data streams is to actually think of a physical stream of water. Heraclitus once said that you can never step in the same stream twice,1 and so it is with stream data.

For effective processing of stream data, new data structures, techniques, and algorithms are needed. Because we do not have an infinite amount of space to store stream data, we often trade off between accuracy and storage. That is, we generally are willing to settle for approximate rather than exact answers. Synopses allow for this by providing summaries of the data, which typically can be used to return approximate answers to queries. Synopses use synopsis data structures, which are any data structures that are substantially smaller than their base data set (in this case, the stream data). From the algorithmic point of view, we want our algorithms to be efficient in both space and time. Instead of storing all or most elements seen so far, using O(N) space, we often want to use polylogarithmic space, O(log^k N), where N is the number of elements in the stream data. We may relax the requirement that our answers are exact, and ask for approximate answers within a small error range with high probability. That is, many data stream–based algorithms compute an approximate answer within a factor ε of the actual answer, with high probability. Generally, as the approximation factor (1 + ε) goes down, the space requirements go up. In this section, we examine some common synopsis data structures and techniques.

Random Sampling

Rather than deal with an entire data stream, we can think of sampling the stream at periodic intervals. "To obtain an unbiased sampling of the data, we need to know the length of the stream in advance. But what can we do if we do not know this length in advance?" In this case, we need to modify our approach.

1 Plato citing Heraclitus: “Heraclitus somewhere says that all things are in process and nothing stays still, and likening existing things to the stream of a river he says you would not step twice into the same river.”


A technique called reservoir sampling can be used to select an unbiased random

sample of s elements without replacement. The idea behind reservoir sampling is relatively simple. We maintain a sample of size at least s, called the "reservoir," from which a random sample of size s can be generated. However, generating this sample from the reservoir can be costly, especially when the reservoir is large. To avoid this step, we maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability of replacing an old element in the reservoir. Let's say we have seen N elements thus far in the stream. The probability that a new element replaces an old one, chosen at random, is then s/N. This maintains the invariant that the set of s candidates in our reservoir forms a random sample of the elements seen so far.
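As a concrete illustration, here is a minimal Python sketch of this replacement rule (the function name and stream interface are our own, not from the text): each arriving element is kept with probability s/N and, if kept, overwrites a reservoir slot chosen uniformly at random.

import random

def reservoir_sample(stream, s):
    # Maintain a uniform random sample of size s over a stream of unknown length.
    reservoir = []
    for n, element in enumerate(stream, start=1):    # n = number of elements seen so far
        if n <= s:
            reservoir.append(element)                # fill the reservoir first
        elif random.random() < s / n:                # new element replaces an old one with probability s/N
            reservoir[random.randrange(s)] = element # the replaced candidate is chosen at random
    return reservoir

# Example: a uniform sample of 10 values from a stream of 1,000,000 integers
sample = reservoir_sample(iter(range(1_000_000)), s=10)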

Sliding Windows

Instead of sampling the data stream randomly, we can use the sliding window model to

analyze stream data. The basic idea is that rather than running computations on all of the data seen so far, or on some sample, we can make decisions based only on recent data. More formally, at every time t, a new data element arrives. This element "expires" at time t + w, where w is the window "size" or length. The sliding window model is useful for stocks or sensor networks, where only recent events may be important. It also reduces memory requirements because only a small window of data is stored.
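A minimal sliding-window sketch in Python (illustrative only; the window length and the statistic computed are arbitrary choices, not prescribed by the text): a deque keeps just the last w elements, so any aggregate is always computed over recent data.

from collections import deque

def sliding_window_averages(stream, w):
    # Yield the running average over the most recent w elements of the stream.
    window = deque(maxlen=w)   # older elements "expire" automatically once w newer ones arrive
    for element in stream:
        window.append(element)
        yield sum(window) / len(window)

# Example: smooth a stream of sensor readings with a window of 100 readings
# for avg in sliding_window_averages(sensor_readings, w=100): ...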

Histograms

The histogram is a synopsis data structure that can be used to approximate the frequency

distribution of element values in a data stream. A histogram partitions the data into a set of contiguous buckets. Depending on the partitioning rule used, the width (bucket value range) and depth (number of elements per bucket) can vary. The equal-width partitioning rule is a simple way to construct histograms, where the range of each bucket is the same. Although easy to implement, this may not sample the probability distribution function well. A better approach is to use V-Optimal histograms (see Section 2.5.4). Similar to clustering, V-Optimal histograms define bucket sizes that minimize the frequency variance within each bucket, which better captures the distribution of the data. These histograms can then be used to approximate query answers rather than using sampling techniques.
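As a small illustration of the equal-width rule (a sketch under the assumption that the value range and bucket count are known in advance; the function name is ours), bucket counts can be maintained incrementally as elements stream in:

def equal_width_histogram(stream, lo, hi, num_buckets):
    # Count stream elements into num_buckets equal-width buckets covering [lo, hi).
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for x in stream:
        if lo <= x < hi:
            counts[int((x - lo) / width)] += 1   # index of the bucket containing x
    return counts

# Example: approximate the distribution of packet sizes between 0 and 1500 bytes
# counts = equal_width_histogram(packet_sizes, lo=0, hi=1500, num_buckets=30)

A V-Optimal histogram, by contrast, would choose variable bucket boundaries so as to minimize the within-bucket frequency variance, which is typically done with dynamic programming over a summary of the data.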

Multiresolution Methods

A common way to deal with a large amount of data is through the use of data reduction

methods (see Section 2.5). A popular data reduction method is the use of divide-and-conquer strategies such as multiresolution data structures. These allow a program to trade off between accuracy and storage, but also offer the ability to understand a data stream at multiple levels of detail.



A concrete example is a balanced binary tree, where we try to maintain this balance as

new data come in. Each level of the tree provides a different resolution. The farther away

we are from the tree root, the more detailed is the level of resolution

A more sophisticated way to form multiple resolutions is to use a clustering method

to organize stream data into a hierarchical structure of trees. For example, we can use a typical hierarchical clustering data structure like CF-tree in BIRCH (see Section 7.5.2) to form a hierarchy of microclusters. With dynamic stream data flowing in and out, summary statistics of data streams can be incrementally updated over time in the hierarchy of microclusters. Information in such microclusters can be aggregated into larger macroclusters, depending on the application requirements, to derive general data statistics at multiresolution.

Wavelets (Section 2.5.3), a technique from signal processing, can be used to build a

multiresolution hierarchy structure over an input signal, in this case, the stream data. Given an input signal, we would like to break it down or rewrite it in terms of simple, orthogonal basis functions. The simplest basis is the Haar wavelet. Using this basis corresponds to recursively performing averaging and differencing at multiple levels of resolution. Haar wavelets are easy to understand and implement. They are especially good at dealing with spatial and multimedia data. Wavelets have been used as approximations to histograms for query optimization. Moreover, wavelet-based histograms can be dynamically maintained over time. Thus, wavelets are a popular multiresolution method for data stream compression.
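The recursive averaging and differencing can be sketched in a few lines of Python (an unnormalized Haar decomposition for illustration; practical implementations usually scale the coefficients and handle lengths that are not powers of two):

def haar_decompose(signal):
    # Return (overall average, detail coefficients per level) for a signal whose length is a power of two.
    details = []
    current = list(signal)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        diffs    = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details.append(diffs)     # finer-resolution detail coefficients come first
        current = averages        # recurse on the coarser averages
    return current, details

# Example: haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
# returns ([2.75], [[0.0, -1.0, -1.0, 0.0], [0.5, 0.0], [-1.25]])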

Sketches

Synopses techniques mainly differ by how exactly they trade off accuracy for storage. Sampling techniques and sliding window models focus on a small part of the data, whereas other synopses try to summarize the entire data, often at multiple levels of detail. Some techniques require multiple passes over the data, such as histograms and wavelets,

whereas other methods, such as sketches, can operate in a single pass.

Suppose that, ideally, we would like to maintain the full histogram over the universe

of objects or elements in a data stream, where the universe is U = {1, 2, . . . , v} and the stream is A = {a_1, a_2, . . . , a_N}. That is, for each value i in the universe, we want to maintain the frequency or number of occurrences of i in the sequence A. If the universe is large, this structure can be quite large as well. Thus, we need a smaller representation instead.

Let's consider the frequency moments of A. These are the numbers F_k, defined as

F_k = \sum_{i=1}^{v} m_i^k,    (8.1)

where m_i is the frequency (number of occurrences) of value i in the sequence and k >= 0. In particular, F_0 is the number of distinct elements in the sequence, and F_1 is the length of the sequence (that is, N, here). F_2 is known as the self-join size, the repeat rate, or as Gini's index of homogeneity. The frequency moments of a data set provide useful information about the data for database applications, such as query answering. In

addition, they indicate the degree of skew or asymmetry in the data (Section 2.2.1), which


is useful in parallel database applications for determining an appropriate partitioning algorithm for the data.

When the amount of memory available is smaller than v, we need to employ a

synopsis. The estimation of the frequency moments can be done by synopses that are known as sketches. These build a small-space summary for a distribution vector (e.g., histogram) using randomized linear projections of the underlying data vectors. Sketches provide probabilistic guarantees on the quality of the approximate answer (e.g., the answer to the given query is 12 ± 1 with a probability of 0.90). Given N elements and a universe U of v values, such sketches can approximate F_0, F_1, and F_2 in O(log v + log N) space. The basic idea is to hash every element i uniformly at random to a value z_i ∈ {−1, +1}, and then maintain a random variable X = \sum_i m_i z_i. It can be shown that X^2 is a good estimate for F_2. To explain why this works, we can think of hashing elements to −1 or +1 as assigning each element value to an arbitrary side of a tug of war. When we sum up to get X, we can think of measuring the displacement of the rope from the center point. By squaring X, we square this displacement, capturing the data skew, F_2.

To get an even better estimate, we can maintain multiple random variables, X_i. Then by choosing the median value of the square of these variables, we can increase our confidence that the estimated value is close to F_2.
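A toy version of this F_2 sketch in Python (a sketch in every sense: the helper names are ours, a generic hash stands in for the limited-independence hash families used in practice, and no error analysis is attempted):

import hashlib
import statistics

def sign_hash(element, seed):
    # Map an element pseudo-randomly to -1 or +1 (a stand-in for a proper hash family).
    digest = hashlib.sha1(f"{seed}:{element}".encode()).digest()
    return 1 if digest[0] % 2 == 0 else -1

def estimate_f2(stream, num_copies=9):
    # Estimate the second frequency moment F2 as the median of several X^2 values.
    xs = [0] * num_copies
    for element in stream:
        for seed in range(num_copies):
            xs[seed] += sign_hash(element, seed)   # X = sum over i of m_i * z_i, built incrementally
    return statistics.median(x * x for x in xs)

# Example: estimate_f2(["a", "b", "a", "c", "a"]) approximates F2 = 3^2 + 1^2 + 1^2 = 11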

From a database perspective, sketch partitioning was developed to improve the

performance of sketching on data stream query optimization. Sketch partitioning uses

coarse statistical information on the base data to intelligently partition the domain of the

underlying attributes in a way that provably tightens the error guarantees

Randomized Algorithms

Randomized algorithms, in the form of random sampling and sketching, are often used

to deal with massive, high-dimensional data streams. The use of randomization often leads to simpler and more efficient algorithms in comparison to known deterministic algorithms.

If a randomized algorithm always returns the right answer but the running times vary,

it is known as a Las Vegas algorithm. In contrast, a Monte Carlo algorithm has bounds on the running time but may not return the correct result. We mainly consider Monte Carlo algorithms. One way to think of a randomized algorithm is simply as a probability distribution over a set of deterministic algorithms.

Given that a randomized algorithm returns a random variable as a result, we would like to have bounds on the tail probability of that random variable. This tells us that the probability that a random variable deviates from its expected value is small. One basic tool is Chebyshev's inequality. Let X be a random variable with mean µ and standard deviation σ (variance σ^2). Chebyshev's inequality says that

P(|X − µ| > k) ≤ σ^2 / k^2    (8.2)

for any positive number k.
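For instance (an invented illustration, not an example from the text), if an estimator has mean $\mu = 100$ and standard deviation $\sigma = 5$, then

\[
P(|X - 100| > 20) \;\le\; \frac{5^2}{20^2} \;=\; \frac{25}{400} \;=\; 0.0625,
\]

so the estimate lies within $\pm 20$ of its mean with probability at least $93.75\%$, no matter how $X$ is distributed.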



In many cases, multiple random variables can be used to boost the confidence in our

results. As long as these random variables are fully independent, Chernoff bounds can be used. Let X_1, X_2, . . . , X_n be independent Poisson trials. In a Poisson trial, the probability of success varies from trial to trial. If X is the sum of X_1 to X_n with mean µ, then a weaker version of the Chernoff bound tells us that

Pr[X > (1 + δ)µ] < e^{−µδ^2/4}    (8.3)

where δ ∈ (0, 1]. This shows that the probability decreases exponentially as we move away from the mean, which makes poor estimates much more unlikely.

Data Stream Management Systems and Stream Queries

In traditional database systems, data are stored in finite and persistent databases. However, stream data are infinite and impossible to store fully in a database. In a Data Stream Management System (DSMS), there may be multiple data streams. They arrive on-line and are continuous, temporally ordered, and potentially infinite. Once an element from a data stream has been processed, it is discarded or archived, and it cannot be easily retrieved unless it is explicitly stored in memory.

A stream data query processing architecture includes three parts: end user, query processor, and scratch space (which may consist of main memory and disks). An end user issues a query to the DSMS, and the query processor takes the query, processes it using the information stored in the scratch space, and returns the results to the user.

Queries can be either one-time queries or continuous queries. A one-time query is evaluated once over a point-in-time snapshot of the data set, with the answer returned to the user. A continuous query is evaluated continuously as data streams continue to arrive.

The answer to a continuous query is produced over time, always reflecting the stream

data seen so far. A continuous query can act as a watchdog, as in "sound the alarm if the

power consumption for Block 25 exceeds a certain threshold.” Moreover, a query can be

predefined (i.e., supplied to the data stream management system before any relevant data have arrived) or ad hoc (i.e., issued on-line after the data streams have already begun). A predefined query is generally a continuous query, whereas an ad hoc query can be either one-time or continuous.

Stream Query Processing

The special properties of stream data introduce new challenges in query processing

In particular, data streams may grow unboundedly, and it is possible that queries may require unbounded memory to produce an exact answer. How can we distinguish between queries that can be answered exactly using a given bounded amount of memory and queries that must be approximated? Actually, without knowing the size of the input data streams, it is impossible to place a limit on the memory requirements for most common queries, such as those involving joins, unless the domains of the attributes involved in the query are restricted. This is because without domain restrictions, an unbounded


number of attribute values must be remembered because they might turn out to join with tuples that arrive in the future.

Providing an exact answer to a query may require unbounded main memory; therefore

a more realistic solution is to provide an approximate answer to the query. Approximate query answering relaxes the memory requirements and also helps in handling system load, because streams can come in too fast to process exactly. In addition, ad hoc queries need approximate history to return an answer. We have already discussed common synopses that are useful for approximate query answering, such as random sampling, sliding windows, histograms, and sketches.

As this chapter focuses on stream data mining, we will not go into any further details

of stream query processing methods. For additional discussion, interested readers may consult the literature recommended in the bibliographic notes of this chapter.

Stream data are generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior. It is impossible to store such data streams completely in a data warehouse. Most stream data represent low-level information, consisting of various kinds of detailed temporal and other features. To find interesting or unusual patterns, it is essential to perform multidimensional analysis on aggregate measures (such as sum and average). This would facilitate the discovery of critical changes in the data at higher levels of abstraction, from which users can drill down to examine more detailed levels, when needed. Thus multidimensional OLAP analysis is still needed in stream data analysis, but how can we implement it?

Consider the following motivating example

Example 8.1 Multidimensional analysis for power supply stream data. A power supply station generates infinite streams of power usage data. Suppose individual user, street address, and second are the attributes at the lowest level of granularity. Given a large number of users, it is only realistic to analyze the fluctuation of power usage at certain high levels, such as by city or street district and by quarter (of an hour), making timely power supply adjustments and handling unusual situations.

Conceptually, for multidimensional analysis, we can view such stream data as a virtual

data cube, consisting of one or a few measures and a set of dimensions, including one

time dimension, and a few other dimensions, such as location, user-category, and so on.

However, in practice, it is impossible to materialize such a data cube, because the materialization requires a huge amount of data to be computed and stored. Some efficient methods must be developed for systematic analysis of such data.

Data warehouse and OLAP technology is based on the integration and consolidation

of data in multidimensional space to facilitate powerful and fast on-line data analysis

A fundamental difference in the analysis of stream data from that of relational and warehouse data is that the stream data are generated in huge volume, flowing in and out dynamically and changing rapidly. Due to limited memory, disk space, and processing



power, it is impossible to register completely the detailed level of data and compute a fully materialized cube. A realistic design is to explore several data compression techniques, including (1) tilted time frame on the time dimension, (2) storing data only at some critical layers, and (3) exploring efficient computation of a very partially materialized data cube. The (partial) stream data cubes so constructed are much smaller than those constructed from the raw stream data but will still be effective for multidimensional stream data analysis. We examine such a design in more detail.

Time Dimension with Compressed Time Scale: Tilted Time Frame

In stream data analysis, people are usually interested in recent changes at a fine scale but

in long-term changes at a coarse scale. Naturally, we can register time at different levels of granularity. The most recent time is registered at the finest granularity; the more distant time is registered at a coarser granularity; and the level of coarseness depends on the application requirements and on how old the time point is (from the current time). Such a time dimension model is called a tilted time frame. This model is sufficient for many

analysis tasks and also ensures that the total amount of data to retain in memory or to

be stored on disk is small

There are many possible ways to design a tilted time frame. Here we introduce three

models, as illustrated in Figure 8.1: (1) natural tilted time frame model, (2) logarithmic tilted time frame model, and (3) progressive logarithmic tilted time frame model.

A natural tilted time frame model is shown in Figure 8.1(a), where the time frame

(or window) is structured in multiple granularities based on the "natural" or usual time scale: the most recent 4 quarters (15 minutes), followed by the last 24 hours, then 31 days, and then 12 months (the actual scale used is determined by the application). Based on this model, we can compute frequent itemsets in the last hour with the precision of a quarter of an hour, or in the last day with the precision of an hour, and

[Figure 8.1 Three models for tilted time frames: (a) a natural tilted time frame model, (b) a logarithmic tilted time frame model, and (c) a progressive logarithmic tilted time frame table.]


so on until the whole year with the precision of a month.2 This model registers only

4 + 24 + 31 + 12 = 71 units of time for a year instead of 365 × 24 × 4 = 35,040 units, with an acceptable trade-off of the grain of granularity at a distant time.

The second model is the logarithmic tilted time frame model, as shown in

Figure 8.1(b), where the time frame is structured in multiple granularities according

to a logarithmic scale. Suppose that the most recent slot holds the transactions of the current quarter. The remaining slots are for the last quarter, the next two quarters (ago), 4 quarters, 8 quarters, 16 quarters, and so on, growing at an exponential rate. According to this model, with one year of data and the finest precision at a quarter, we would need log2(365 × 24 × 4) + 1 ≈ 16.1 units of time instead of 365 × 24 × 4 = 35,040 units. That

is, we would just need 17 time frames to store the compressed information

The third method is the progressive logarithmic tilted time frame model, where

snapshots are stored at differing levels of granularity depending on the recency. Let T be the clock time elapsed since the beginning of the stream. Snapshots are classified into different frame numbers, which can vary from 0 to max_frame, where log2(T) − max_capacity ≤ max_frame ≤ log2(T), and max_capacity is the maximal number of snapshots held in

each frame

Each snapshot is represented by its timestamp. The rules for insertion of a snapshot t (at time t) into the snapshot frame table are defined as follows: (1) if (t mod 2^i) = 0 but (t mod 2^(i+1)) ≠ 0, t is inserted into frame number i if i ≤ max_frame; otherwise (i.e., i > max_frame), t is inserted into max_frame; and (2) each slot has a max_capacity. At the insertion of t into frame number i, if the slot already reaches its max_capacity, the oldest snapshot in this frame is removed and the new snapshot inserted.

Example 8.2 Progressive logarithmic tilted time frame. Consider the snapshot frame table of Figure 8.1(c), where max_frame is 5 and max_capacity is 3. Let's look at how timestamp 64 was inserted into the table. We know (64 mod 2^6) = 0 but (64 mod 2^7) ≠ 0, that is, i = 6. However, since this value of i exceeds max_frame, 64 was inserted into frame 5 instead of frame 6. Suppose we now need to insert a timestamp of 70. At time 70, since (70 mod 2^1) = 0 but (70 mod 2^2) ≠ 0, we would insert 70 into frame number 1. This would knock out the oldest snapshot of 58, given the slot capacity of 3. From the table, we see that the closer a timestamp is to the current time, the denser are the snapshots stored.
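The insertion rules translate into a short routine (an illustrative Python sketch; the frame-table representation as a dict of lists is our own choice, and timestamps are assumed to start at 1):

def insert_snapshot(frames, t, max_frame, max_capacity):
    # Insert timestamp t (t >= 1) into a progressive logarithmic tilted time frame table,
    # where frames maps a frame number to its list of stored timestamps (oldest first).
    i = 0
    while t % (2 ** (i + 1)) == 0:   # find i with (t mod 2^i) = 0 but (t mod 2^(i+1)) != 0
        i += 1
    i = min(i, max_frame)            # frame numbers beyond max_frame collapse into max_frame
    slot = frames.setdefault(i, [])
    if len(slot) >= max_capacity:    # a full slot drops its oldest snapshot
        slot.pop(0)
    slot.append(t)

# Example mirroring Example 8.2: with max_frame = 5 and max_capacity = 3,
# insert_snapshot(frames, 70, 5, 3) places 70 in frame 1, evicting the oldest entry there.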

In the logarithmic and progressive logarithmic models discussed above, we have assumed that the base is 2. Similar rules can be applied to any base α, where α is an integer and α > 1. All three tilted time frame models provide a natural way for incremental insertion of data and for gradually fading out older values.

The tilted time frame models shown are sufficient for typical time-related queries, and at the same time, ensure that the total amount of data to retain in memory and/or

to be computed is small

2 We align the time axis with the natural calendar time. Thus, for each granularity level of the tilted time frame, there might be a partial interval, which is less than a full unit at that level.



Depending on the given application, we can provide different fading factors in the tilted time frames, such as by placing more weight on the more recent time frames. We can also have flexible alternative ways to design the tilted time frames. For example, suppose that we are interested in comparing the stock average from each day of the current week with the corresponding averages from the same weekdays last week, last month, or last year. In this case, we can single out Monday to Friday instead of compressing them into the whole week as one unit.

Critical Layers

Even with the tilted time frame model, it can still be too costly to dynamically compute

and store a materialized cube. Such a cube may have quite a few dimensions, each containing multiple levels with many distinct values. Because stream data analysis has only limited memory space but requires fast response time, we need additional strategies that work in conjunction with the tilted time frame model. One approach is to compute and store only some mission-critical cuboids of the full data cube.

In many applications, it is beneficial to dynamically and incrementally compute and store two critical cuboids (or layers), which are determined based on their conceptual and computational importance in stream data analysis. The first layer, called the minimal interest layer, is the minimally interesting layer that an analyst would like to study. It is

necessary to have such a layer because it is often neither cost effective nor interesting

in practice to examine the minute details of stream data. The second layer, called the

observation layer, is the layer at which an analyst (or an automated system) would like

to continuously study the data. This can involve making decisions regarding the signaling of exceptions, or drilling down along certain paths to lower layers to find cells indicating data exceptions.

Example 8.3 Critical layers for a power supply stream data cube. Let's refer back to Example 8.1

regarding the multidimensional analysis of stream data for a power supply station

Dimensions at the lowest level of granularity (i.e., the raw data layer) included individual user, street address, and second. At the minimal interest layer, these three dimensions are user group, street block, and minute, respectively. Those at the observation layer are ∗ (meaning all user), city, and quarter, respectively, as shown in Figure 8.2.

Based on this design, we would not need to compute any cuboids that are lower than the minimal interest layer because they would be beyond user interest. Thus, to compute our base cuboid, representing the cells of minimal interest, we need to compute and store

the (three-dimensional) aggregate cells for the (user group, street block, minute)

group-by. This can be done by aggregations on the dimensions user and address by rolling up from individual user to user group and from street address to street block, respectively, and by rolling up on the time dimension from second to minute.

Similarly, the cuboids at the observation layer should be computed dynamically, taking the tilted time frame model into account as well. This is the layer that an analyst takes as an observation deck, watching the current stream data by examining the slope of changes at this layer to make decisions. This layer can be obtained by rolling up the


[Figure 8.2 Two critical layers in a "power supply station" stream data cube: the observation layer (*, city, quarter), the minimal interest layer (user_group, street_block, minute), and the primitive data layer (individual_user, street_address, second).]

cube along the user dimension to ∗ (for all user), along the address dimension to city, and along the time dimension to quarter. If something unusual is observed, the analyst can

investigate by drilling down to lower levels to find data exceptions

Partial Materialization of a Stream Cube

“What if a user needs a layer that would be between the two critical layers?” Materializing

a cube at only two critical layers leaves much room for how to compute the cuboids in between. These cuboids can be precomputed fully, partially, or not at all (i.e., leave everything to be computed on the fly). An interesting method is popular path cubing, which rolls up the cuboids from the minimal interest layer to the observation layer by following one popular drilling path, materializes only the layers along the path, and leaves other layers to be computed only when needed. This method achieves a reasonable trade-off between space, computation time, and flexibility, and has quick incremental aggregation time, quick drilling time, and small space requirements.

To facilitate efficient computation and storage of the popular path of the stream cube,

a compact data structure needs to be introduced so that the space taken in the

computation of aggregations is minimized. A hyperlinked tree structure called H-tree is revised and adopted here to ensure that a compact structure is maintained in memory for efficient computation of multidimensional and multilevel aggregations.

Each branch of the H-tree is organized in the same order as the specified popular path. The aggregate cells are stored in the nonleaf nodes of the H-tree, forming the computed cuboids along the popular path. Aggregation for each corresponding slot in the tilted time frame is performed from the minimal interest layer all the way up to the observation layer by aggregating along the popular path. The step-by-step aggregation is performed while inserting the new generalized tuples into the corresponding time slots.
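A highly simplified sketch of the step-by-step aggregation along one popular path (our own nested-dict stand-in for an H-tree branch; the tilted-time-frame slots that each node would also keep are omitted): every incoming tuple, already generalized to the minimal interest layer, adds its measure to each node on its branch, so every prefix of the path holds a higher-level aggregate.

def aggregate_along_popular_path(root, path_values, measure):
    # path_values lists a tuple's values in popular-path order, e.g. a hypothetical
    # ("city=Chicago", "street_block=SB-17", "user_group=residential").
    node = root
    for value in path_values:                      # walk one branch, creating nodes as needed
        children = node.setdefault("children", {})
        node = children.setdefault(value, {"total": 0})
        node["total"] += measure                   # aggregate cell stored in the (nonleaf) node

# Example: root = {}
# aggregate_along_popular_path(root, ("city=Chicago", "street_block=SB-17", "user_group=residential"), 42.5)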

