Experiments on PROCLUS show that the method is efficient and scalable at finding high-dimensional clusters. Unlike CLIQUE, which outputs many overlapped clusters, PROCLUS finds nonoverlapped partitions of points. The discovered clusters may help better understand the high-dimensional data and facilitate subsequent analyses.
This section looks at how methods of frequent pattern mining can be applied to clustering, resulting in frequent pattern–based cluster analysis. Frequent pattern mining, as the name implies, searches for patterns (such as sets of items or objects) that occur frequently in large data sets. Frequent pattern mining can lead to the discovery of interesting associations and correlations among data objects. Methods for frequent pattern mining were introduced in Chapter 5. The idea behind frequent pattern–based cluster analysis is that the frequent patterns discovered may also indicate clusters. Frequent pattern–based cluster analysis is well suited to high-dimensional data. It can be viewed as an extension of the dimension-growth subspace clustering approach. However, the boundaries of different dimensions are not obvious, since here they are represented by sets of frequent itemsets. That is, rather than growing the clusters dimension by dimension, we grow sets of frequent itemsets, which eventually lead to cluster descriptions. Typical examples of frequent pattern–based cluster analysis include the clustering of text documents that contain thousands of distinct keywords, and the analysis of microarray data that contain tens of thousands of measured values or “features.” In this section, we examine two forms of frequent pattern–based cluster analysis: frequent term–based text clustering and clustering by pattern similarity in microarray data analysis.
In frequent term–based text clustering, text documents are clustered based on the frequent terms they contain. Using the vocabulary of text document analysis, a term is any sequence of characters separated from other terms by a delimiter. A term can be made up of a single word or several words. In general, we first remove nontext information (such as HTML tags and punctuation) and stop words. Terms are then extracted. A stemming algorithm is then applied to reduce each term to its basic stem. In this way, each document can be represented as a set of terms. Each set is typically large. Collectively, a large set of documents will contain a very large set of distinct terms. If we treat each term as a dimension, the dimension space will be of very high dimensionality! This poses great challenges for document cluster analysis. The dimension space can be referred to as term vector space, where each document is represented by a term vector.
This difficulty can be overcome by frequent term–based analysis. That is, by using an efficient frequent itemset mining algorithm introduced in Section 5.2, we can mine a set of frequent terms from the set of text documents. Then, instead of clustering on the high-dimensional term vector space, we need only consider the low-dimensional frequent term sets as “cluster candidates.” Notice that a frequent term set is not a cluster but rather the description of a cluster. The corresponding cluster consists of the set of documents containing all of the terms of the frequent term set. A well-selected subset of the set of all frequent term sets can be considered as a clustering.
“How, then, can we select a good subset of the set of all frequent term sets?” This step is critical because such a selection will determine the quality of the resulting clustering. Let F_i be a set of frequent term sets and cov(F_i) be the set of documents covered by F_i. That is, cov(F_i) refers to the documents that contain all of the terms in F_i. The general principle for finding a well-selected subset, F_1, ..., F_k, of the set of all frequent term sets is to ensure that (1) Σ_{i=1}^{k} cov(F_i) = D (i.e., the selected subset should cover all of the documents to be clustered); and (2) the overlap between any two partitions, F_i and F_j (for i ≠ j), should be minimized. An overlap measure based on entropy⁹ is used to assess cluster overlap by measuring the distribution of the documents supporting some cluster over the remaining cluster candidates.
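To make the selection principle concrete, the following minimal Python sketch greedily picks frequent term sets as cluster candidates so that they cover all documents while adding little overlap. The document representation, the coverage test, and the greedy strategy are illustrative assumptions rather than the specific selection algorithm referenced above.

def cover(term_set, documents):
    """Documents that contain every term of the frequent term set."""
    return {doc_id for doc_id, terms in documents.items() if term_set <= terms}

def greedy_select(frequent_term_sets, documents):
    """Greedily pick frequent term sets (cluster candidates) so that together
    they cover all documents while each step adds as little overlap as possible."""
    uncovered = set(documents)                      # document ids not yet covered
    candidates = [frozenset(ts) for ts in frequent_term_sets]
    selected = []
    while uncovered and candidates:
        best = max(candidates, key=lambda ts: len(cover(ts, documents) & uncovered))
        gain = cover(best, documents) & uncovered
        if not gain:                                # nothing new is covered: stop
            break
        selected.append(best)
        uncovered -= gain
        candidates.remove(best)
    return selected

# toy usage: documents given as id -> set of stemmed terms
docs = {1: {"data", "mining", "cluster"},
        2: {"data", "mining", "pattern"},
        3: {"gene", "microarray", "cluster"},
        4: {"gene", "microarray", "expression"}}
frequent = [{"data", "mining"}, {"gene", "microarray"}, {"cluster"}]
print(greedy_select(frequent, docs))   # picks the "data mining" and "gene microarray" candidates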
An advantage of frequent term–based text clustering is that it automatically generates a description for the generated clusters in terms of their frequent term sets. Traditional clustering methods produce only clusters; a description for the generated clusters requires an additional processing step.

Another interesting approach for clustering high-dimensional data is based on pattern similarity among the objects on a subset of dimensions. Here we introduce the pCluster method, which performs clustering by pattern similarity in microarray data analysis. In DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli or conditions. Under the pCluster model, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. This is illustrated in Example 7.15. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks.
Example 7.15 Clustering by pattern similarity in DNA microarray analysis. Figure 7.22 shows a fragment of microarray data containing only three genes (taken as “objects” here) and ten attributes (columns a to j). No patterns among the three objects are visibly explicit. However, if two subsets of attributes, {b, c, h, j, e} and {f, d, a, g, i}, are selected and plotted as in Figure 7.23(a) and (b) respectively, it is easy to see that they form some interesting patterns: Figure 7.23(a) forms a shift pattern, where the three curves are similar to each other with respect to a shift operation along the y-axis; while Figure 7.23(b) forms a scaling pattern, where the three curves are similar to each other with respect to a scaling operation along the y-axis.
Let us first examine how to discover shift patterns. In DNA microarray data, each row corresponds to a gene and each column or attribute represents a condition under which the gene is developed. The usual Euclidean distance measure cannot capture pattern similarity, since the y values of different curves can be quite far apart.
⁹ Entropy is a measure from information theory. It was introduced in Chapter 2 regarding data discretization and is also described in Chapter 6 regarding decision tree construction.
Figure 7.22 Raw data from a fragment of microarray data containing only 3 objects and 10 attributes.

Figure 7.23 Objects in Figure 7.22 form (a) a shift pattern in subspace {b, c, h, j, e}, and (b) a scaling pattern in subspace {f, d, a, g, i}.
Alternatively, we could first transform the data to derive new attributes, such as A_ij = v_i − v_j (where v_i and v_j are object values for attributes A_i and A_j, respectively), and then cluster on the derived attributes. However, this would introduce d(d − 1)/2 dimensions for a d-dimensional data set, which is undesirable for a nontrivial d value. A biclustering method was proposed in an attempt to overcome these difficulties. It introduces a new measure, the mean squared residue score, which measures the coherence of the genes and conditions in a submatrix of a DNA array. Let I ⊂ X and J ⊂ Y be subsets of genes, X, and conditions, Y, respectively. The pair, (I, J), specifies a submatrix, A_IJ, with the mean squared residue score defined as

H(I, J) = (1 / (|I||J|)) Σ_{i∈I, j∈J} (d_ij − d_iJ − d_Ij + d_IJ)²,   (7.40)

where d_iJ and d_Ij are the row and column means, respectively, and d_IJ is the mean of the subcluster matrix, A_IJ. A submatrix, A_IJ, is called a δ-bicluster if H(I, J) ≤ δ for some δ > 0. A randomized algorithm is designed to find such clusters in a DNA array. There are two major limitations of this method. First, a submatrix of a δ-bicluster is not necessarily a δ-bicluster, which makes it difficult to design an efficient pattern growth–based algorithm. Second, because of the averaging effect, a δ-bicluster may contain some undesirable outliers yet still satisfy a rather small δ threshold.
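As a concrete illustration, here is a minimal Python sketch that computes the mean squared residue of a submatrix and checks the δ-bicluster condition, assuming the residue form given in Equation (7.40); the toy data and index representation are illustrative.

def mean_squared_residue(matrix, rows, cols):
    """H(I, J) for the submatrix A_IJ selected by row indices `rows` (I) and
    column indices `cols` (J): the average of (d_ij - d_iJ - d_Ij + d_IJ)^2."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]                          # d_iJ
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows         # d_Ij
                 for j in range(n_cols)]
    overall = sum(map(sum, sub)) / (n_rows * n_cols)                    # d_IJ
    total = sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)

def is_delta_bicluster(matrix, rows, cols, delta):
    """A submatrix is a delta-bicluster if its mean squared residue is at most delta."""
    return mean_squared_residue(matrix, rows, cols) <= delta

# a perfectly coherent (shift-pattern) submatrix has residue 0
gene_data = [[10, 20, 30],
             [15, 25, 35],
             [12, 22, 32]]
print(mean_squared_residue(gene_data, rows=[0, 1, 2], cols=[0, 1, 2]))   # 0.0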
To overcome the problems of the biclustering method, a pCluster model was introduced as follows. Given objects x, y ∈ O and attributes a, b ∈ T, the pScore is defined on a 2 × 2 matrix as

pScore([d_xa, d_xb; d_ya, d_yb]) = |(d_xa − d_xb) − (d_ya − d_yb)|,   (7.41)

where d_xa is the value of object (or gene) x for attribute (or condition) a, and so on. A pair, (O, T), forms a δ-pCluster if, for any 2 × 2 matrix, X, in (O, T), we have pScore(X) ≤ δ for some δ > 0. Intuitively, this means that the change of values on the two attributes between the two objects is confined by δ for every pair of objects in O and every pair of attributes in T.
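The δ-pCluster condition can be checked directly from the definition. The brute-force sketch below (illustrative only; the actual mining algorithm grows patterns rather than testing every pair) examines every 2 × 2 submatrix of (O, T).

from itertools import combinations

def pscore(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 matrix [[d_xa, d_xb], [d_ya, d_yb]]:
    |(d_xa - d_xb) - (d_ya - d_yb)|."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(data, objects, attributes, delta):
    """(O, T) is a delta-pCluster if every pair of objects and every pair of
    attributes satisfies pScore <= delta; data[x][a] is object x's value on a."""
    for x, y in combinations(objects, 2):
        for a, b in combinations(attributes, 2):
            if pscore(data[x][a], data[x][b], data[y][a], data[y][b]) > delta:
                return False
    return True

# two genes with a near-perfect shift pattern on conditions a, b, c
data = {"g1": {"a": 10, "b": 40, "c": 25},
        "g2": {"a": 32, "b": 61, "c": 47}}
print(is_delta_pcluster(data, ["g1", "g2"], ["a", "b", "c"], delta=2))   # True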
It is easy to see that the δ-pCluster has the downward closure property; that is, if (O, T) forms a δ-pCluster, then any of its submatrices is also a δ-pCluster. Moreover, because a pCluster requires that every two objects and every two attributes conform with the inequality, the clusters modeled by the pCluster method are more homogeneous than those modeled by the bicluster method.
In frequent itemset mining, itemsets are considered frequent if they satisfy a minimum support threshold, which reflects their frequency of occurrence. Based on the definition of pCluster, the problem of mining pClusters becomes one of mining frequent patterns in which each pair of objects and their corresponding features must satisfy the specified δ threshold. A frequent pattern–growth method can easily be extended to mine such patterns efficiently.
Now, let’s look into how to discover scaling patterns. Notice that the original pScore definition, though defined for shift patterns in Equation (7.41), can easily be extended for scaling by introducing a new inequality,

(d_xa / d_ya) / (d_xb / d_yb) ≤ δ′.   (7.42)
The pCluster model, though developed in the study of microarray data cluster analysis, can be applied to many other applications that require finding similar or coherent patterns involving a subset of numerical dimensions in large, high-dimensional data sets.
In the above discussion, we assume that cluster analysis is an automated, algorithmic computational process, based on the evaluation of similarity or distance functions among a set of objects to be clustered, with little user guidance or interaction. However, users often have a clear view of the application requirements, which they would ideally like to use to guide the clustering process and influence the clustering results. Thus, in many applications, it is desirable to have the clustering process take user preferences and constraints into consideration. Examples of such information include the expected number of clusters, the minimal or maximal cluster size, weights for different objects or dimensions, and other desirable characteristics of the resulting clusters. Moreover, when a clustering task involves a rather high-dimensional space, it is very difficult to generate meaningful clusters by relying solely on the clustering parameters. User input regarding important dimensions or the desired results will serve as crucial hints or meaningful constraints for effective clustering. In general, we contend that knowledge discovery would be most effective if one could develop an environment for human-centered, exploratory mining of data, that is, where the human user is allowed to play a key role in the process. Foremost, a user should be allowed to specify a focus, directing the mining algorithm toward the kind of “knowledge” that the user is interested in finding. Clearly, user-guided mining will lead to more desirable results and capture the application semantics.
Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches. Here are a few categories of constraints.
1. Constraints on individual objects: We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing (e.g., performing selection using an SQL query), after which the problem reduces to an instance of unconstrained clustering.
2. Constraints on the selection of clustering parameters: A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or ε (the radius) and MinPts (the minimum number of points) in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-based clustering.
3. Constraints on distance or similarity functions: We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process. This can be seen in the following example.
Example 7.16 Clustering with obstacle objects. A city may have rivers, bridges, highways, lakes, and mountains. We do not want to swim across a river to reach an automated banking machine. Such obstacle objects and their effects can be captured by redefining the distance functions among objects. Clustering with obstacle objects using a partitioning approach requires that the distance between each object and its corresponding cluster center be reevaluated at each iteration whenever the cluster center is changed. However, such reevaluation is quite expensive with the existence of obstacles. In this case, efficient new methods should be developed for clustering with obstacle objects in large data sets.
4. User-specified constraints on the properties of individual clusters: A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process. Such constraint-based clustering arises naturally in practice, as in Example 7.17.
Example 7.17 User-constrained cluster analysis. Suppose a package delivery company would like to determine the locations for k service stations in a city. The company has a database of customers that registers the customers’ names, locations, length of time since the customers began using the company’s services, and average monthly charge. We may formulate this location selection problem as an instance of unconstrained clustering using a distance function computed based on customer location. However, a smarter approach is to partition the customers into two classes: high-value customers (who need frequent, regular service) and ordinary customers (who require occasional service). In order to save costs and provide good service, the manager adds the following constraints: (1) each station should serve at least 100 high-value customers; and (2) each station should serve at least 5,000 ordinary customers. Constraint-based clustering will take such constraints into consideration during the clustering process.
5. Semi-supervised clustering based on “partial” supervision: The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different cluster). Such a constrained clustering process is called semi-supervised clustering.
In this section, we examine how efficient constraint-based clustering methods can be developed for large data sets. Since cases 1 and 2 above are trivial, we focus on cases 3 to 5 as typical forms of constraint-based cluster analysis.
Example 7.16 introduced the problem of clustering with obstacle objects regarding the placement of automated banking machines. The machines should be easily accessible to the bank’s customers. This means that during clustering, we must take obstacle objects into consideration, such as rivers, highways, and mountains. Obstacles introduce constraints on the distance function. The straight-line distance between two points is meaningless if there is an obstacle in the way. As pointed out in Example 7.16, we do not want to have to swim across a river to get to a banking machine!
“How can we approach the problem of clustering with obstacles?” A partitioning clustering method is preferable because it minimizes the distance between objects and their cluster centers. If we choose the k-means method, a cluster center may not be accessible given the presence of obstacles. For example, the cluster mean could turn out to be in the middle of a lake. On the other hand, the k-medoids method chooses an object within the cluster as a center and thus guarantees that such a problem cannot occur. Recall that every time a new medoid is selected, the distance between each object and its newly selected cluster center has to be recomputed. Because there could be obstacles between two objects, the distance between two objects may have to be derived by geometric computations (e.g., involving triangulation). The computational cost can get very high if a large number of objects and obstacles are involved. The clustering with obstacles problem can be represented using a graphical notation.
First, a point, p, is visible from another point, q, in the region, R, if the straight line joining p and q does not intersect any obstacles. A visibility graph is the graph, VG = (V, E), such that each vertex of the obstacles has a corresponding node in V and two nodes, v_1 and v_2, in V are joined by an edge in E if and only if the corresponding vertices they represent are visible to each other. Let VG′ = (V′, E′) be a visibility graph created from VG by adding two additional points, p and q, in V′. E′ contains an edge joining two points in V′ if the two points are mutually visible. The shortest path between two points, p and q, will be a subpath of VG′, as shown in Figure 7.24(a). We see that it begins with an edge from p to either v_1, v_2, or v_3, goes through some path in VG, and then ends with an edge from either v_4 or v_5 to q.
To reduce the cost of distance computation between any two pairs of objects or points, several preprocessing and optimization techniques can be used. One method groups points that are close together into microclusters. This can be done by first triangulating the region R into triangles, and then grouping nearby points in the same triangle into microclusters, using a method similar to BIRCH or DBSCAN, as shown in Figure 7.24(b). By processing microclusters rather than individual points, the overall computation is reduced. After that, precomputation can be performed to build two kinds of join indices based on the computation of the shortest paths: (1) VV indices, for any pair of obstacle vertices, and (2) MV indices, for any pair of microcluster and obstacle vertex. Use of the indices helps further optimize the overall performance.
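The core geometric idea can be sketched in a few lines of Python: the obstructed distance between two points is the length of the shortest path in the visibility graph VG′. The visibility predicate (whether a straight segment crosses an obstacle) and the VV/MV index shortcuts are assumed to be supplied elsewhere; all names here are illustrative.

import heapq
import math

def shortest_obstructed_distance(p, q, obstacle_vertices, visible):
    """Obstructed distance between points p and q: length of the shortest path
    in the visibility graph VG' built over the obstacle vertices plus p and q.
    `visible(u, v)` is assumed to return True when the straight segment u-v
    crosses no obstacle (that geometric test is not shown here)."""
    nodes = [p, q] + list(obstacle_vertices)

    def euclid(u, v):
        return math.hypot(u[0] - v[0], u[1] - v[1])

    best = {node: math.inf for node in nodes}
    best[p] = 0.0
    heap = [(0.0, p)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == q:
            return d                  # first time q is popped, d is optimal
        if d > best[u]:
            continue                  # stale heap entry
        for v in nodes:
            if v != u and visible(u, v):
                nd = d + euclid(u, v)
                if nd < best[v]:
                    best[v] = nd
                    heapq.heappush(heap, (nd, v))
    return math.inf                   # q is not reachable from p

# toy usage: no obstacles at all, so every pair of points is mutually visible
print(shortest_obstructed_distance((0, 0), (3, 4), [], lambda u, v: True))   # 5.0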
With such precomputation and optimization, the distance between any two points (at the granularity level of microcluster) can be computed efficiently. Thus, the clustering process can be performed in a manner similar to a typical efficient k-medoids algorithm, such as CLARANS, and achieve good clustering quality for large data sets. Given a large set of points, Figure 7.25(a) shows the result of clustering without considering obstacles, whereas Figure 7.25(b) shows the result with consideration of obstacles. The latter represents rather different but more desirable clusters. For example, if we carefully compare the upper left-hand corner of the two graphs, we see that Figure 7.25(a) has a cluster center on an obstacle (making the center inaccessible), whereas all cluster centers in Figure 7.25(b) are accessible. A similar situation has occurred with respect to the bottom right-hand corner of the graphs.
Figure 7.24 Clustering with obstacle objects (o_1 and o_2): (a) a visibility graph, and (b) triangulation of regions with microclusters. From [THH01].
Figure 7.25 Clustering results obtained without and with consideration of obstacles (where rivers and inaccessible highways or city blocks are represented by polygons): (a) clustering without considering obstacles, and (b) clustering with obstacles.
Let’s examine the problem of relocating package delivery centers, as illustrated in Example 7.17. Specifically, a package delivery company with n customers would like to determine locations for k service stations so as to minimize the traveling distance between customers and service stations. The company’s customers are regarded as either high-value customers (requiring frequent, regular services) or ordinary customers (requiring occasional services). The manager has stipulated two constraints: each station should serve (1) at least 100 high-value customers and (2) at least 5,000 ordinary customers.

This can be considered as a constrained optimization problem. We could consider using a mathematical programming approach to handle it. However, such a solution is difficult to scale to large data sets. To cluster n customers into k clusters, a mathematical programming approach will involve at least k × n variables. As n can be as large as a few million, we could end up having to solve a few million simultaneous equations, a very expensive feat. A more efficient approach is proposed that explores the idea of microclustering, as illustrated below.
The general idea of clustering a large data set into k clusters satisfying user-specified constraints goes as follows. First, we can find an initial “solution” by partitioning the data set into k groups, satisfying the user-specified constraints, such as the two constraints in our example. We then iteratively refine the solution by moving objects from one cluster to another, trying to satisfy the constraints. For example, we can move a set of m customers from cluster C_i to C_j if C_i has at least m surplus customers (under the specified constraints), or if the result of moving customers into C_i from some other clusters (including from C_j) would result in such a surplus. The movement is desirable
if the total sum of the distances of the objects to their corresponding cluster centers is reduced. Such movement can be directed by selecting promising points to be moved, such as objects that are currently assigned to some cluster, C_i, but that are actually closer to a representative (e.g., centroid) of some other cluster, C_j. We need to watch out for and handle deadlock situations (where a constraint is impossible to satisfy), in which case a deadlock resolution strategy can be employed.

To increase the clustering efficiency, data can first be preprocessed using the microclustering idea to form microclusters (groups of points that are close together), thereby avoiding the processing of all of the points individually. Object movement, deadlock detection, and constraint satisfaction can be tested at the microcluster level, which reduces the number of points to be computed. Occasionally, such microclusters may need to be broken up in order to resolve deadlocks under the constraints. This methodology ensures that effective clustering can be performed in large data sets under the user-specified constraints with good efficiency and scalability.
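A highly simplified sketch of the move step under the constraints of Example 7.17 is given below. The cluster bookkeeping, the distance_gain callback, and the microcluster granularity are illustrative assumptions, and deadlock handling is omitted.

def can_move(group_size, group_kind, source_counts, min_required):
    """A group of customers of a given kind may leave its source cluster only if
    the source still meets the user-specified minimum count after the move."""
    return source_counts[group_kind] - group_size >= min_required[group_kind]

def try_move(group, source, target, clusters, min_required, distance_gain):
    """Relocate a microcluster-sized group from `source` to `target` when the
    constraints stay satisfied and the total distance to cluster centers drops."""
    kind, size = group["kind"], group["size"]
    if not can_move(size, kind, clusters[source]["counts"], min_required):
        return False
    if distance_gain(group, source, target) <= 0:   # movement must reduce total distance
        return False
    clusters[source]["counts"][kind] -= size
    clusters[target]["counts"][kind] += size
    return True

# the two constraints of Example 7.17
min_required = {"high_value": 100, "ordinary": 5000}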
In comparison with supervised learning, clustering lacks guidance from users or classifiers (such as class label information), and thus may not generate highly desirable clusters. The quality of unsupervised clustering can be significantly improved using some weak form of supervision, for example, in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different clusters). Such a clustering process based on user feedback or guidance constraints is called semi-supervised clustering.

Methods for semi-supervised clustering can be categorized into two classes: constraint-based semi-supervised clustering and distance-based semi-supervised clustering. Constraint-based semi-supervised clustering relies on user-provided labels or constraints to guide the algorithm toward a more appropriate data partitioning. This includes modifying the objective function based on constraints, or initializing and constraining the clustering process based on the labeled objects. Distance-based semi-supervised clustering employs an adaptive distance measure that is trained to satisfy the labels or constraints in the supervised data. Several different adaptive distance measures have been used, such as string-edit distance trained using Expectation-Maximization (EM), and Euclidean distance modified by a shortest distance algorithm.
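As an illustration of the constraint-based flavor, the sketch below shows the kind of pairwise-constraint check used in COP-k-means-style assignment steps: a point may only join the nearest cluster that violates no must-link or cannot-link constraint. This is a generic illustration, not the specific methods surveyed above.

def violates_constraints(point, cluster_id, assignment, must_link, cannot_link):
    """Would assigning `point` to `cluster_id` break a pairwise constraint,
    given the assignments made so far?"""
    for a, b in must_link:
        other = b if a == point else a if b == point else None
        if other is not None and other in assignment and assignment[other] != cluster_id:
            return True               # must-link partner sits in another cluster
    for a, b in cannot_link:
        other = b if a == point else a if b == point else None
        if other is not None and assignment.get(other) == cluster_id:
            return True               # cannot-link partner already in this cluster
    return False

def constrained_assign(point, ranked_clusters, assignment, must_link, cannot_link):
    """Assign `point` to the nearest cluster (given nearest-first) that violates
    no constraint; None signals a constraint conflict."""
    for cluster_id in ranked_clusters:
        if not violates_constraints(point, cluster_id, assignment, must_link, cannot_link):
            assignment[point] = cluster_id
            return cluster_id
    return None

# toy usage: "p3" must join "p1", and may not share a cluster with "p2"
assignment = {"p1": 0, "p2": 1}
print(constrained_assign("p3", [1, 0], assignment,
                         must_link=[("p1", "p3")], cannot_link=[("p2", "p3")]))   # -> 0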
An interesting clustering method, called CLTree (CLustering based on decision TREEs), integrates unsupervised clustering with the idea of supervised classification. It is an example of constraint-based semi-supervised clustering. It transforms a clustering task into a classification task by viewing the set of points to be clustered as belonging to one class, labeled as “Y”, and adding a set of relatively uniformly distributed “nonexistence points” with a different class label, “N”. The problem of partitioning the data space into data (dense) regions and empty (sparse) regions can then be transformed into a classification problem. For example, Figure 7.26(a) contains a set of data points to be clustered. These points can be viewed as a set of “Y” points. Figure 7.26(b) shows the addition of a set of uniformly distributed “N” points, represented by the “◦” points.
Figure 7.26 Clustering through decision tree construction: (a) the set of data points to be clustered, viewed as a set of “Y” points, (b) the addition of a set of uniformly distributed “N” points, represented by “◦”, and (c) the clustering result with “Y” points only.
The original clustering problem is thus transformed into a classification problem, which works out a scheme that distinguishes “Y” and “N” points. A decision tree induction method can be applied¹⁰ to partition the two-dimensional space, as shown in Figure 7.26(c). Two clusters are identified, which are from the “Y” points only.
Adding a large number of “N” points to the original data may introduce unnecessary overhead in computation. Furthermore, it is unlikely that any points added would truly be uniformly distributed in a very high-dimensional space, as this would require an exponential number of points. To deal with this problem, we do not physically add any of the “N” points, but only assume their existence. This works because the decision tree method does not actually require the points. Instead, it only needs the number of “N” points at each decision tree node. This number can be computed when needed, without having to add points to the original data. Thus, CLTree can achieve the results in Figure 7.26(c) without actually adding any “N” points to the original data. Again, two clusters are identified.
The question then is how many (virtual) “N” points should be added in order to achieve good clustering results. The answer follows this simple rule: At the root node, the number of inherited “N” points is 0. At any current node, E, if the number of “N” points inherited from the parent node of E is less than the number of “Y” points in E, then the number of “N” points for E is increased to the number of “Y” points in E. (That is, we set the number of “N” points to be as big as the number of “Y” points.) Otherwise, the number of inherited “N” points is used in E. The basic idea is to use a number of “N” points equal to the number of “Y” points.
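The inheritance rule reduces to a single max computation per node, as the small sketch below shows; the node bookkeeping around it is illustrative.

def n_points_at_node(inherited_n, y_count):
    """Number of virtual "N" points used at a decision tree node E: keep the
    count inherited from the parent, but raise it to the number of "Y" points
    in E whenever the inherited count is smaller."""
    return max(inherited_n, y_count)

# the root inherits 0 "N" points, so a root with 40 "Y" points uses 40 "N" points
assert n_points_at_node(0, 40) == 40
# a child inheriting 40 "N" points but holding only 25 "Y" points keeps the 40
assert n_points_at_node(40, 25) == 40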
Decision tree classification methods use a measure, typically based on information gain, to select the attribute test for a decision node (Section 6.3.2). The data are then split or partitioned according to the test or “cut.” Unfortunately, with clustering, this can lead to the fragmentation of some clusters into scattered regions. To address this problem, methods were developed that use information gain but allow the ability to look ahead.
¹⁰ Decision tree induction was described in Chapter 6 on classification.
That is, CLTree first finds initial cuts and then looks ahead to find better partitions that cut less into cluster regions. It finds those cuts that form regions with a very low relative density. The idea is that we want to split at the cut point that may result in a big empty (“N”) region, which is more likely to separate clusters. With such tuning, CLTree can perform high-quality clustering in high-dimensional space. It can also find subspace clusters, as the decision tree method normally selects only a subset of the attributes. An interesting by-product of this method is the empty (sparse) regions, which may also be useful in certain applications. In marketing, for example, clusters may represent different segments of existing customers of a company, while empty regions reflect the profiles of noncustomers. Knowing the profiles of noncustomers allows the company to tailor their services or marketing to target these potential customers.
“What is an outlier?” Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
Outliers can be caused by measurement or execution error. For example, the display of a person’s age as −999 could be caused by a program default setting of an unrecorded age. Alternatively, outliers may be the result of inherent data variability. The salary of the chief executive officer of a company, for instance, could naturally stand out as an outlier among the salaries of the other employees in the firm.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information, because one person’s noise could be another person’s signal. In other words, the outliers may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.
Outlier mining has wide applications. As mentioned previously, it can be used in fraud detection, for example, by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medical analysis for finding unusual responses to various medical treatments.
Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier mining problem can be viewed as two subproblems: (1) define what data can be considered as inconsistent in a given data set, and (2) find an efficient method to mine the outliers so defined.
The problem of defining outliers is nontrivial. If a regression model is used for data modeling, analysis of the residuals can give a good estimation for data “extremeness.” The task becomes tricky, however, when finding outliers in time-series data, as they may be hidden in trend, seasonal, or other cyclic changes. When multidimensional data are analyzed, not any particular one but rather a combination of dimension values may be extreme. For nonnumeric (i.e., categorical) data, the definition of outliers requires special consideration.
“What about using data visualization methods for outlier detection?” This may seem like an obvious choice, since human eyes are very fast and effective at noticing data inconsistencies. However, this does not apply to data containing cyclic plots, where values that appear to be outliers could be perfectly valid values in reality. Data visualization methods are weak in detecting outliers in data with many categorical attributes or in data of high dimensionality, since human eyes are good at visualizing numeric data of only two to three dimensions.
In this section, we instead examine computer-based methods for outlier detection. These can be categorized into four approaches: the statistical approach, the distance-based approach, the density-based local outlier approach, and the deviation-based approach, each of which is studied here. Notice that while clustering algorithms discard outliers as noise, they can be modified to include outlier detection as a by-product of their execution. In general, users must check that each outlier discovered by these approaches is indeed a “real” outlier.
The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.
“How does discordancy testing work?” A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis. A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

H : o_i ∈ F, where i = 1, 2, ..., n.   (7.43)

The hypothesis is retained if there is no statistically significant evidence supporting its rejection.
A discordancy test verifies whether an object, o_i, is significantly large (or small) in relation to the distribution F. Different test statistics have been proposed for use as a discordancy test, depending on the available knowledge of the data. Assuming that some statistic, T, has been chosen for discordancy testing, and the value of the statistic for object o_i is v_i, then the distribution of T is constructed. The significance probability, SP(v_i) = Prob(T > v_i), is evaluated. If SP(v_i) is sufficiently small, then o_i is discordant and the working hypothesis is rejected. An alternative hypothesis, H̄, which states that o_i comes from another distribution model, G, is adopted. The result is very much dependent on which model F is chosen, because o_i may be an outlier under one model and a perfectly valid value under another.
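A minimal sketch of a discordancy test is shown below, assuming a normal working distribution F and using the standardized deviation as the test statistic T, so that the significance probability SP(v_i) is a normal tail probability. The statistic, the toy data, and the significance level are illustrative choices, not the specific tests referenced in the text.

import math
import statistics

def significance_probability(value, sample):
    """SP(v) = Prob(T > v) for the test statistic T = |x - mean| / stdev,
    under the working hypothesis that the data follow a normal distribution F."""
    mu = statistics.mean(sample)
    sigma = statistics.stdev(sample)
    t = abs(value - mu) / sigma
    return math.erfc(t / math.sqrt(2))      # two-sided normal tail probability

def is_discordant(value, sample, alpha=0.01):
    """Reject the working hypothesis for `value` when SP is sufficiently small."""
    return significance_probability(value, sample) < alpha

salaries = [52, 55, 49, 61, 58, 50, 53, 57, 54, 900]        # toy data with one extreme value
print([s for s in salaries if is_discordant(s, salaries)])  # -> [900]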
The alternative distribution is very important in determining the power of the test, that is, the probability that the working hypothesis is rejected when o_i is really an outlier. There are different kinds of alternative distributions.
Inherent alternative distribution: In this case, the working hypothesis that all of the objects come from distribution F is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, G:

H̄ : o_i ∈ G, where i = 1, 2, ..., n.   (7.44)
Mixture alternative distribution: The mixture alternative states that discordant values are not outliers in the F population, but contaminants from some other population, G. In this case, the alternative hypothesis is

H̄ : o_i ∈ (1 − λ)F + λG, where i = 1, 2, ..., n.   (7.45)
Slippage alternative distribution: This alternative states that all of the objects (apart from some prescribed small number) arise independently from the initial model, F, with its given parameters, whereas the remaining objects are independent observations from a modified version of F in which the parameters have been shifted.
There are two basic types of procedures for detecting outliers:

Block procedures: In this case, either all of the suspect objects are treated as outliers or all of them are accepted as consistent.

Consecutive (or sequential) procedures: An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least “likely” to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.
“How effective is the statistical approach at outlier detection?” A major drawback is that most tests are for single attributes, yet many data mining problems require finding outliers in multidimensional space. Moreover, the statistical approach requires knowledge about parameters of the data set, such as the data distribution. However, in many cases, the data distribution may not be known. Statistical methods do not guarantee that all outliers will be found for the cases where no specific test was developed, or where the observed distribution cannot be adequately modeled with any standard distribution.
7.11.2 Distance-Based Outlier Detection
The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin,¹¹ that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as those objects that do not have “enough” neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical-based methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions. Distance-based outlier detection avoids the excessive computation that can be associated with fitting the observed distribution into some standard distribution and in selecting discordancy tests.

For many discordancy tests, it can be shown that if an object, o, is an outlier according to the given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct and dmin. For example, if objects that lie three or more standard deviations from the mean are considered to be outliers, assuming a normal distribution, then this definition can be generalized by a DB(0.9988, 0.13σ)-outlier.¹²

Several efficient algorithms for mining distance-based outliers have been developed. These are outlined as follows.
Index-based algorithm: Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that object. Let M be the maximum number of objects within the dmin-neighborhood of an outlier. Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a worst-case complexity of O(n²k), where n is the number of objects in the data set and k is the dimensionality. The index-based algorithm scales well as k increases. However, this complexity evaluation takes only the search time into account, even though the task of building an index in itself can be computationally intensive.
Nested-loop algorithm: The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os. It divides the memory buffer space into two halves and the data set into several logical blocks. By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
¹¹ The parameter dmin is the neighborhood radius around object o. It corresponds to the parameter ε in Section 7.6.1.
¹² The parameters pct and dmin are computed using the normal curve’s probability density function to satisfy the probability condition P(|x − 3| ≤ dmin) < 1 − pct, that is, P(3 − dmin ≤ x ≤ 3 + dmin) < 1 − pct, where x is an object. (Note that the solution may not be unique.) A dmin-neighborhood of radius 0.13 indicates a spread of ±0.13 units around the 3σ mark (i.e., [2.87, 3.13]). For a complete proof of the derivation, see [KN97].
Cell-based algorithm: To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality. In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k). Each cell has two layers surrounding it. The first layer is one cell thick, while the second is ⌈2√k − 1⌉ cells thick, rounded up to the closest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three counts: the number of objects in the cell, in the cell and the first layer together, and in the cell and both layers together. Let’s refer to these counts as cell count, cell + 1 layer count, and cell + 2 layers count, respectively.
“How are outliers determined in this method?” Let M be the maximum number of objects that can exist in the dmin-neighborhood of an outlier.

An object, o, in the current cell is considered an outlier only if cell + 1 layer count is less than or equal to M. If this condition does not hold, then all of the objects in the cell can be removed from further investigation as they cannot be outliers.

If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are considered outliers. Otherwise, if this number is more than M, then it is possible that some of the objects in the cell may be outliers. To detect these outliers, object-by-object processing is used where, for each object, o, in the cell, objects in the second layer of o are examined. For objects in the cell, only those objects having no more than M points in their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of the object’s cell, all of its first layer, and some of its second layer.
A variation to the algorithm is linear with respect to n and guarantees that no more than three passes over the data set are required. It can be used for large disk-resident data sets, yet does not scale well for high dimensions.

Distance-based outlier detection requires the user to set both the pct and dmin parameters. Finding suitable settings for these parameters can involve much trial and error.
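A brute-force sketch of the DB(pct, dmin) definition is given below. It uses the early-termination idea shared by the index-based and nested-loop algorithms (stop counting once M + 1 neighbors are found) but none of their indexing or block-based I/O machinery; the data layout and parameter values are illustrative.

import math

def is_db_outlier(o, data, pct, dmin):
    """DB(pct, dmin)-outlier test: at least a fraction pct of the objects in the
    data set must lie farther than dmin from o.  Equivalently, o may have at most
    M = n * (1 - pct) neighbors within dmin, so counting can stop early once
    M + 1 neighbors are seen."""
    n = len(data)
    m = int(n * (1 - pct))              # maximum allowed dmin-neighbors
    neighbors = 0
    for x in data:
        if x is o:
            continue
        if math.dist(o, x) <= dmin:
            neighbors += 1
            if neighbors > m:
                return False            # too many close neighbors: not an outlier
    return True

def db_outliers(data, pct, dmin):
    return [o for o in data if is_db_outlier(o, data, pct, dmin)]

points = [(0, 0), (0.5, 0.3), (0.2, 0.8), (0.7, 0.4), (10, 10)]
print(db_outliers(points, pct=0.8, dmin=2.0))   # -> [(10, 10)]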
Statistical and distance-based outlier detection both depend on the overall, or “global,” distribution of the given set of data points, D. However, data are usually not uniformly distributed. These methods encounter difficulties when analyzing data with rather different density distributions, as illustrated in the following example.
Example 7.18 Necessity for density-based local outlier detection. Figure 7.27 shows a simple 2-D data set containing 502 objects, with two obvious clusters. Cluster C_1 contains 400 objects. Cluster C_2 contains 100 objects. Two additional objects, o_1 and o_2, are clearly outliers.
Figure 7.27 The necessity of density-based local outlier analysis. From [BKNS00].
However, by distance-based outlier detection (which generalizes many notions from statistical-based outlier detection), only o_1 is a reasonable DB(pct, dmin)-outlier, because if dmin is set to be less than the minimum distance between o_2 and C_2, then all 501 objects are farther away from o_2 than dmin. Thus, o_2 would be considered a DB(pct, dmin)-outlier, but so would all of the objects in C_1! On the other hand, if dmin is set to be greater than the minimum distance between o_2 and C_2, then even when o_2 is not regarded as an outlier, some points in C_1 may still be considered outliers.
This brings us to the notion of local outliers. An object is a local outlier if it is outlying relative to its local neighborhood, particularly with respect to the density of the neighborhood. In this view, o_2 of Example 7.18 is a local outlier relative to the density of C_2. Object o_1 is an outlier as well, and no objects in C_1 are mislabeled as outliers. This forms the basis of density-based local outlier detection. Another key idea of this approach to outlier detection is that, unlike previous methods, it does not consider being an outlier as a binary property. Instead, it assesses the degree to which an object is an outlier. This degree of “outlierness” is computed as the local outlier factor (LOF) of an object. It is local in the sense that the degree depends on how isolated the object is with respect to the surrounding neighborhood. This approach can detect both global and local outliers.
To define the local outlier factor of an object, we need to introduce the concepts of k-distance, k-distance neighborhood, reachability distance,¹³ and local reachability density. These are defined as follows:

The k-distance of an object p is the maximal distance that p gets from its k-nearest neighbors. This distance is denoted as k-distance(p). It is defined as the distance, d(p, o), between p and an object o ∈ D, such that (1) for at least k objects, o′ ∈ D, it holds that d(p, o′) ≤ d(p, o), that is, there are at least k objects in D that are as close as or closer to p than o, and (2) for at most k − 1 objects, o″ ∈ D, it holds that d(p, o″) < d(p, o), that is, there are at most k − 1 objects that are closer to p than o. You may be wondering at this point how k is determined. The LOF method links to density-based clustering in that it sets k to the parameter MinPts, which specifies the minimum number of points for use in identifying clusters based on density (Sections 7.6.1 and 7.6.2). Here, MinPts (as k) is used to define the local neighborhood of an object, p.

The k-distance neighborhood of an object p is denoted N_{k-distance(p)}(p), or N_k(p) for short. By setting k to MinPts, we get N_MinPts(p). It contains the MinPts-nearest neighbors of p. That is, it contains every object whose distance is not greater than the MinPts-distance of p.

¹³ The reachability distance here is similar to the reachability distance defined for OPTICS in Section 7.6.2, although it is given in a somewhat different context.
The reachability distance of an object p with respect to object o (where o is within the MinPts-nearest neighbors of p) is defined as reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}. Intuitively, if an object p is far away from o, then the reachability distance between the two is simply their actual distance. However, if they are “sufficiently” close (i.e., where p is within the MinPts-distance neighborhood of o), then the actual distance is replaced by the MinPts-distance of o. This helps to significantly reduce the statistical fluctuations of d(p, o) for all of the p close to o. The higher the value of MinPts is, the more similar is the reachability distance for objects within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p. It is defined as

lrd_MinPts(p) = |N_MinPts(p)| / Σ_{o ∈ N_MinPts(p)} reach_dist_MinPts(p, o).   (7.46)
The local outlier factor (LOF) of p captures the degree to which we call p an outlier. It is defined as

LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|.   (7.47)

That is, it is the average, over p’s MinPts-nearest neighbors, of the ratio of each neighbor’s local reachability density to that of p. Experiments based on both synthetic and real-world large data sets have demonstrated the power of LOF at identifying local outliers.
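The definitions above translate directly into a brute-force computation, sketched below. N_MinPts(p) is approximated by exactly the k nearest neighbors (ignoring distance ties), and the LOF formula follows the standard form assumed in Equation (7.47); the toy data are illustrative.

import math

def knn(p, data, k):
    """The k nearest neighbors of p (excluding p itself), brute force."""
    others = [o for o in data if o is not p]
    return sorted(others, key=lambda o: math.dist(p, o))[:k]

def k_distance(p, data, k):
    return math.dist(p, knn(p, data, k)[-1])

def reach_dist(p, o, data, k):
    """reach_dist_k(p, o) = max{k-distance(o), d(p, o)}."""
    return max(k_distance(o, data, k), math.dist(p, o))

def lrd(p, data, k):
    """Local reachability density: inverse of the average reachability
    distance from p to its k-nearest neighbors."""
    neighbors = knn(p, data, k)
    return len(neighbors) / sum(reach_dist(p, o, data, k) for o in neighbors)

def lof(p, data, k):
    """Local outlier factor: average ratio of the neighbors' lrd to p's lrd."""
    neighbors = knn(p, data, k)
    return sum(lrd(o, data, k) / lrd(p, data, k) for o in neighbors) / len(neighbors)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
print({p: round(lof(p, points, k=3), 2) for p in points})   # (8, 8) gets a large LOF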
7.11.4 Deviation-Based Outlier Detection
Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group. Objects that “deviate” from this description are considered outliers. Hence, in this approach the term deviations is typically used to refer to outliers. In this section, we study two techniques for deviation-based outlier detection. The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.
Sequential Exception Technique
The sequential exception technique simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects. It uses implicit redundancy of the data. Given a data set, D, of n objects, it builds a sequence of subsets, {D_1, D_2, ..., D_m}, of these objects with 2 ≤ m ≤ n such that

D_1 ⊂ D_2 ⊂ ··· ⊂ D_m, where D_j ⊆ D.

Dissimilarities are assessed between subsets in the sequence. The technique introduces the following key terms.
Exception set: This is the set of deviations or outliers. It is defined as the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the residual set.¹⁴
Dissimilarity function: This function does not require a metric distance between the objects. It is any function that, if given a set of objects, returns a low value if the objects are similar to one another. The greater the dissimilarity among the objects, the higher the value returned by the function. The dissimilarity of a subset is incrementally computed based on the subset prior to it in the sequence. Given a subset of n numbers, {x_1, ..., x_n}, a possible dissimilarity function is the variance of the numbers in the set, that is,

(1/n) Σ_{i=1}^{n} (x_i − x̄)²,

where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity function may be in the form of a pattern string (e.g., containing wildcard characters) that is used to cover all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the strings in D_{j−1} does not cover any string in D_j that is not in D_{j−1}.
¹⁴ For interested readers, this is equivalent to the greatest reduction in Kolmogorov complexity for the amount of data discarded.
The general task of finding an exception set can be NP-hard (i.e., intractable). A sequential approach is computationally feasible and can be implemented using a linear algorithm.
“How does this technique work?” Instead of assessing the dissimilarity of the current subset with respect to its complementary set, the algorithm selects a sequence of subsets from the set for analysis. For every subset, it determines the dissimilarity difference of the subset with respect to the preceding subset in the sequence.
“Can’t the order of the subsets in the sequence affect the results?” To help alleviate any possible influence of the input order on the results, the above process can be repeated several times, each with a different random ordering of the subsets. The subset with the largest smoothing factor value, among all of the iterations, becomes the exception set.
OLAP Data Cube Technique
An OLAP approach to deviation detection uses data cubes to identify regions of anomalies in large multidimensional data. This technique was described in detail in Chapter 4. For added efficiency, the deviation detection process is overlapped with cube computation. The approach is a form of discovery-driven exploration, in which precomputed measures indicating data exceptions are used to guide the user in data analysis, at all levels of aggregation. A cell value in the cube is considered an exception if it is significantly different from the expected value, based on a statistical model. The method uses visual cues such as background color to reflect the degree of exception of each cell. The user can choose to drill down on cells that are flagged as exceptions. The measure value of a cell may reflect exceptions occurring at more detailed or lower levels of the cube, where these exceptions are not visible from the current level.

The model considers variations and patterns in the measure value across all of the dimensions to which a cell belongs. For example, suppose that you have a data cube for sales data and are viewing the sales summarized per month. With the help of the visual cues, you notice an increase in sales in December in comparison to all other months. This may seem like an exception in the time dimension. However, by drilling down on the month of December to reveal the sales per item in that month, you note that there is a similar increase in sales for other items during December. Therefore, an increase in total sales in December is not an exception if the item dimension is considered. The model considers exceptions hidden at all aggregated group-by’s of a data cube. Manual detection of such exceptions is difficult because the search space is typically very large, particularly when there are many dimensions involving concept hierarchies with several levels.
7.12 Summary
A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
Cluster analysis has wide applications, including market or customer segmentation, pattern recognition, biological studies, spatial data analysis, Web document classification, and many others. Cluster analysis can be used as a stand-alone data mining tool to gain insight into the data distribution or can serve as a preprocessing step for other data mining algorithms operating on the detected clusters.

The quality of clustering can be assessed based on a measure of dissimilarity of objects, which can be computed for various types of data, including interval-scaled, binary, categorical, ordinal, and ratio-scaled variables, or combinations of these variable types. For nonmetric vector data, the cosine measure and the Tanimoto coefficient are often used in the assessment of similarity.
Clustering is a dynamic field of research in data mining. Many clustering algorithms have been developed. These can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern–based methods), and constraint-based methods. Some algorithms may belong to more than one category.
A partitioning method first creates an initial set of k partitions, where parameter k is the number of partitions to construct. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. Typical partitioning methods include k-means, k-medoids, CLARANS, and their improvements.
A hierarchical method creates a hierarchical decomposition of the given set of data objects. The method can be classified as being either agglomerative (bottom-up) or divisive (top-down), based on how the hierarchical decomposition is formed. To compensate for the rigidity of merge or split, the quality of hierarchical agglomeration can be improved by analyzing object linkages at each hierarchical partitioning (such as in ROCK and Chameleon), or by first performing microclustering (that is, grouping objects into “microclusters”) and then operating on the microclusters with other clustering techniques, such as iterative relocation (as in BIRCH).
A density-based method clusters objects based on the notion of density. It either grows clusters according to the density of neighborhood objects (such as in DBSCAN) or according to some density function (such as in DENCLUE). OPTICS is a density-based method that generates an augmented ordering of the clustering structure of the data.
density-A grid-based method first quantizes the object space into a finite number of cells that
form a grid structure, and then performs clustering on the grid structure STING is
Trang 22Exercises 461
a typical example of a grid-based method based on statistical information stored ingrid cells WaveCluster and CLIQUE are two clustering algorithms that are both grid-based and density-based
A model-based method hypothesizes a model for each of the clusters and finds the best fit of the data to that model. Examples of model-based clustering include the EM algorithm (which uses a mixture density model), conceptual clustering (such as COBWEB), and neural network approaches (such as self-organizing feature maps).
Clustering high-dimensional data is of crucial importance, because in many advanced applications, data objects such as text documents and microarray data are high-dimensional in nature. There are three typical methods to handle high-dimensional data sets: dimension-growth subspace clustering, represented by CLIQUE, dimension-reduction projected clustering, represented by PROCLUS, and frequent pattern–based clustering, represented by pCluster.
A constraint-based clustering method groups objects based on application-dependent or user-specified constraints. Typical examples include clustering with the existence of obstacle objects, clustering under user-specified constraints, and semi-supervised clustering based on "weak" supervision (such as pairs of objects labeled as belonging to the same or different cluster).
One person's noise could be another person's signal. Outlier detection and analysis are very useful for fraud detection, customized marketing, medical analysis, and many other tasks. Computer-based outlier analysis methods typically follow either a statistical distribution-based approach, a distance-based approach, a density-based local outlier detection approach, or a deviation-based approach.
Exercises
7.1 Briefly outline how to compute the dissimilarity between objects described by the
following types of variables:
(a) Numerical (interval-scaled) variables
(b) Asymmetric binary variables
(c) Categorical variables
(d) Ratio-scaled variables
(e) Nonmetric vector objects
7.2 Given the following measurements for the variable age:
18, 22, 25, 42, 28, 43, 33, 35, 56, 28,
standardize the variable by the following:
(a) Compute the mean absolute deviation of age.
(b) Compute the z-score for the first four measurements.
7.3 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
7.4 Section 7.2.3 gave a method wherein a categorical variable having M states can be encoded by M asymmetric binary variables. Propose a more efficient encoding scheme and state why it is more efficient.
7.5 Briefly describe the following approaches to clustering: partitioning methods, hierarchical
methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data, and constraint-based methods. Give examples in each case.
7.6 Suppose that the data mining task is to cluster the following eight points (with (x, y)
representing location) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9)
The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only
(a) The three cluster centers after the first round of execution
(b) The final three clusters
7.7 Both k-means and k-medoids algorithms can perform effective clustering. Illustrate the strength and weakness of k-means in comparison with the k-medoids algorithm. Also, illustrate the strength and weakness of these schemes in comparison with a hierarchical clustering scheme (such as AGNES).
7.8 Use a diagram to illustrate how, for a constant MinPts value, density-based clusters with
respect to a higher density (i.e., a lower value for ε, the neighborhood radius) are completely contained in density-connected sets obtained with respect to a lower density.
7.9 Why is it that BIRCH encounters difficulties in finding clusters of arbitrary shape but
OPTICS does not? Can you propose some modifications to BIRCH to help it find clusters
of arbitrary shape?
7.10 Present conditions under which density-based clustering is more suitable than
partitioning-based clustering and hierarchical clustering. Give some application examples to support your argument.
7.11 Give an example of how specific clustering methods may be integrated, for example,
where one clustering algorithm is used as a preprocessing step for another. In
addition, provide reasoning on why the integration of two methods may sometimes lead
to improved clustering quality and efficiency.
7.12 Clustering has been popularly recognized as an important data mining task with broad
applications. Give one application example for each of the following cases:
(a) An application that takes clustering as a major data mining function
(b) An application that takes clustering as a preprocessing tool for data preparation for
other data mining tasks
7.13 Data cubes and multidimensional databases contain categorical, ordinal, and numerical
data in hierarchical or aggregate forms. Based on what you have learned about the clustering methods, design a clustering method that finds clusters in large data cubes effectively and efficiently.
7.14 Subspace clustering is a methodology for finding interesting clusters in high-dimensional space. This methodology can be applied to cluster any kind of data. Outline an efficient algorithm that may extend density connectivity-based clustering for finding clusters of arbitrary shapes in projected dimensions in a high-dimensional data set.
7.15 [Contributed by Alex Kotov] Describe each of the following clustering algorithms in terms
of the following criteria: (i) shapes of clusters that can be determined; (ii) input parameters that must be specified; and (iii) limitations.
7.16 [Contributed by Tao Cheng] Many clustering algorithms handle either only numerical
data, such as BIRCH, or only categorical data, such as ROCK, but not both. Analyze why this is the case. Note, however, that the EM clustering algorithm can easily be extended to handle data with both numerical and categorical attributes. Briefly explain why it can do so and how.
7.17 Human eyes are fast and effective at judging the quality of clustering methods for
two-dimensional data. Can you design a data visualization method that may help humans visualize data clusters and judge the clustering quality for three-dimensional data? What about for even higher-dimensional data?
7.18 Suppose that you are to allocate a number of automatic teller machines (ATMs) in a
given region so as to satisfy a number of constraints. Households or places of work may be clustered so that typically one ATM is assigned per cluster. The clustering, however, may be constrained by two factors: (1) obstacle objects (i.e., there are bridges, rivers, and highways that can affect ATM accessibility), and (2) additional user-specified constraints, such as each ATM should serve at least 10,000 households. How can a clustering algorithm such as k-means be modified for quality clustering under both constraints?
7.19 For constraint-based clustering, aside from having the minimum number of customers
in each cluster (for ATM allocation) as a constraint, there could be many other kinds of constraints. For example, a constraint could be in the form of the maximum number of customers per cluster, the average income of customers per cluster, the maximum distance between every two clusters, and so on. Categorize the kinds of constraints that can be imposed on the clusters produced and discuss how to perform clustering efficiently under such kinds of constraints.
7.20 Design a privacy-preserving clustering method so that a data owner would be able to
ask a third party to mine the data for quality clustering without worrying about the potential inappropriate disclosure of certain private or sensitive information stored in the data.
7.21 Why is outlier mining important? Briefly describe the different approaches behind
statistical-based outlier detection, distance-based outlier detection, density-based local outlier detection, and deviation-based outlier detection.
7.22 Local outlier factor (LOF) is an interesting notion for the discovery of local outliers
in an environment where data objects are distributed rather unevenly. However, its performance should be further improved in order to efficiently discover local outliers. Can you propose an efficient method for effective discovery of local outliers in large data sets?
Bibliographic Notes
Clustering has been studied extensively for more than 40 years and across many disciplines due to its broad applications. Most books on pattern classification and machine learning contain chapters on cluster analysis or unsupervised learning. Several textbooks are dedicated to the methods of cluster analysis, including Hartigan [Har75], Jain and Dubes [JD88], Kaufman and Rousseeuw [KR90], and Arabie, Hubert, and De Sorte [AHS96]. There are also many survey articles on different aspects of clustering methods. Recent ones include Jain, Murty, and Flynn [JMF99] and Parsons, Haque, and Liu [PHL04].
Methods for combining variables of different types into a single dissimilarity matrix were introduced by Kaufman and Rousseeuw [KR90].
For partitioning methods, the k-means algorithm was first introduced by Lloyd [Llo57] and then MacQueen [Mac67]. The k-medoids algorithms of PAM and CLARA were proposed by Kaufman and Rousseeuw [KR90]. The k-modes (for clustering categorical data) and k-prototypes (for clustering hybrid data) algorithms were proposed by Huang [Hua98]. The k-modes clustering algorithm was also proposed independently by Chaturvedi, Green, and Carroll [CGC94, CGC01].
The CLARANS algorithm was proposed by Ng and Han [NH94]. Ester, Kriegel, and Xu [EKX95] proposed techniques for further improvement of the performance of CLARANS using efficient spatial access methods, such as R*-tree and focusing techniques. A k-means–based scalable clustering algorithm was proposed by Bradley, Fayyad, and Reina [BFR98].
An early survey of agglomerative hierarchical clustering algorithms was conducted by Day and Edelsbrunner [DE84]. Agglomerative hierarchical clustering, such as AGNES, and divisive hierarchical clustering, such as DIANA, were introduced by Kaufman and Rousseeuw [KR90]. An interesting direction for improving the clustering quality of hierarchical clustering methods is to integrate hierarchical clustering with distance-based iterative relocation or other nonhierarchical clustering methods. For example, BIRCH, by Zhang, Ramakrishnan, and Livny [ZRL96], first performs hierarchical clustering with a CF-tree before applying other techniques. Hierarchical clustering can also be performed by sophisticated linkage analysis, transformation, or nearest-neighbor analysis, such as CURE by Guha, Rastogi, and Shim [GRS98], ROCK (for clustering categorical attributes) by Guha, Rastogi, and Shim [GRS99b], and Chameleon by Karypis, Han, and Kumar [KHK99].
For density-based clustering methods, DBSCAN was proposed by Ester, Kriegel, Sander, and Xu [EKSX96]. Ankerst, Breunig, Kriegel, and Sander [ABKS99] developed OPTICS, a cluster-ordering method that facilitates density-based clustering without worrying about parameter specification. The DENCLUE algorithm, based on a set of density distribution functions, was proposed by Hinneburg and Keim [HK98].
A grid-based multiresolution approach called STING, which collects statistical information in grid cells, was proposed by Wang, Yang, and Muntz [WYM97]. WaveCluster, developed by Sheikholeslami, Chatterjee, and Zhang [SCZ98], is a multiresolution clustering approach that transforms the original feature space by wavelet transform.
For model-based clustering, the EM (Expectation-Maximization) algorithm was developed by Dempster, Laird, and Rubin [DLR77]. AutoClass is a Bayesian statistics-based method for model-based clustering by Cheeseman and Stutz [CS96a] that uses a variant of the EM algorithm. There are many other extensions and applications of EM, such as Lauritzen [Lau95]. For a set of seminal papers on conceptual clustering, see Shavlik and Dietterich [SD90]. Conceptual clustering was first introduced by Michalski and Stepp [MS83]. Other examples of the conceptual clustering approach include COBWEB by Fisher [Fis87] and CLASSIT by Gennari, Langley, and Fisher [GLF89]. Studies of the neural network approach [He99] include SOM (self-organizing feature maps) by Kohonen [Koh82, Koh89], by Carpenter and Grossberg [Ce91], and by Kohonen, Kaski, Lagus, et al. [KKL+00], and competitive learning by Rumelhart and Zipser [RZ85]. Scalable methods for clustering categorical data were studied by Gibson, Kleinberg, and Raghavan [GKR98], Guha, Rastogi, and Shim [GRS99b], and Ganti, Gehrke, and Ramakrishnan [GGR99]. There are also many other clustering paradigms. For example, fuzzy clustering methods are discussed in Kaufman and Rousseeuw [KR90], Bezdek [Bez81], and Bezdek and Pal [BP92].
For high-dimensional clustering, an Apriori-based dimension-growth subspace clustering algorithm called CLIQUE was proposed by Agrawal, Gehrke, Gunopulos, and Raghavan [AGGR98]. It integrates density-based and grid-based clustering methods. A sampling-based, dimension-reduction subspace clustering algorithm called PROCLUS, and its extension, ORCLUS, were proposed by Aggarwal et al. [APW+99] and by Aggarwal and Yu [AY00], respectively. An entropy-based subspace clustering algorithm for mining numerical data, called ENCLUS, was proposed by Cheng, Fu, and Zhang [CFZ99]. For a frequent pattern–based approach to handling high-dimensional data, Beil, Ester, and Xu [BEX02] proposed a method for frequent term–based text clustering. H. Wang, W. Wang, Yang, and Yu proposed pCluster, a pattern similarity–based clustering method [WWYY02].
Recent studies have proceeded to the clustering of stream data, as in Babcock, Babu, Datar, et al. [BBD+02]. A k-median-based data stream clustering algorithm was proposed by Guha, Mishra, Motwani, and O'Callaghan [GMMO00], and by O'Callaghan, Mishra, Meyerson, et al. [OMM+02]. A method for clustering evolving data streams was proposed by Aggarwal, Han, Wang, and Yu [AHWY03]. A framework for projected clustering of high-dimensional data streams was proposed by Aggarwal, Han, Wang, and Yu [AHWY04a].
A framework for constraint-based clustering based on user-specified constraints was built by Tung, Han, Lakshmanan, and Ng [THLN01]. An efficient method for constraint-based spatial clustering in the existence of physical obstacle constraints was proposed by Tung, Hou, and Han [THH01]. The quality of unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints (i.e., pairs of instances labeled as belonging to the same or different clusters). Such a process is considered semi-supervised clustering. A probabilistic framework for semi-supervised clustering was proposed by Basu, Bilenko, and Mooney [BBM04]. The CLTree method, which transforms the clustering problem into a classification problem and then uses decision tree induction for cluster analysis, was proposed by Liu, Xia, and Yu [LXY01].
Outlier detection and analysis can be categorized into four approaches: the statistical approach, the distance-based approach, density-based local outlier detection, and the deviation-based approach. The statistical approach and discordancy tests are described in Barnett and Lewis [BL94]. Distance-based outlier detection is described in Knorr and Ng [KN97, KN98]. The detection of density-based local outliers was proposed by Breunig, Kriegel, Ng, and Sander [BKNS00]. Outlier detection for high-dimensional data is studied by Aggarwal and Yu [AY01]. The sequential problem approach to deviation-based outlier detection was introduced in Arning, Agrawal, and Raghavan [AAR96]. Sarawagi, Agrawal, and Megiddo [SAM98] introduced a discovery-driven method for identifying exceptions in large multidimensional data using OLAP data cubes. Jagadish, Koudas, and Muthukrishnan [JKM99] introduced an efficient method for mining deviants in time-series databases.
Mining Stream, Time-Series, and Sequence Data
Our previous chapters introduced the basic concepts and techniques of data mining. The techniques
studied, however, were for simple and structured data sets, such as data in relational databases, transactional databases, and data warehouses. The growth of data in various complex forms (e.g., semi-structured and unstructured, spatial and temporal, hypertext and multimedia) has been explosive owing to the rapid progress of data collection and advanced database system technologies, and the World Wide Web. Therefore, an increasingly important task in data mining is to mine complex types of data. Furthermore, many data mining applications need to mine patterns that are more sophisticated than those discussed earlier, including sequential patterns, subgraph patterns, and features in interconnected networks. We treat such tasks as advanced topics in data mining.
In the following chapters, we examine how to further develop the essential data mining techniques (such as characterization, association, classification, and clustering) and how to develop new ones to cope with complex types of data. We start off, in this chapter, by discussing the mining of stream, time-series, and sequence data. Chapter 9 focuses on the mining of graphs, social networks, and multirelational data. Chapter 10 examines mining object, spatial, multimedia, text, and Web data. Research into such mining is fast evolving. Our discussion provides a broad introduction. We expect that many new books dedicated to the mining of complex kinds of data will become available in the future.
As this chapter focuses on the mining of stream data, time-series data, and sequence data, let's look at each of these areas.
Imagine a satellite-mounted remote sensor that is constantly generating data. The data are massive (e.g., terabytes in volume), temporally ordered, fast changing, and potentially infinite. This is an example of stream data. Other examples include telecommunications data, transaction data from the retail industry, and data from electric power grids. Traditional OLAP and data mining methods typically require multiple scans of the data and are therefore infeasible for stream data applications. In Section 8.1, we study advanced mining methods for the analysis of such constantly flowing data.
A time-series database consists of sequences of values or events obtained over repeated measurements of time. Suppose that you are given time-series data relating to stock market prices. How can the data be analyzed to identify trends? Given such data for
two different stocks, can we find any similarities between the two? These questions are explored in Section 8.2. Other applications involving time-series data include economic and sales forecasting, utility studies, and the observation of natural phenomena (such as atmosphere, temperature, and wind).
A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time. Sequential pattern mining is the discovery of frequently occurring ordered events or subsequences as patterns. An example of a sequential pattern is "Customers who buy a Canon digital camera are likely to buy an HP color printer within a month." Periodic patterns, which recur in regular periods or durations, are another kind of pattern related to sequences. Section 8.3 studies methods of sequential pattern mining.
Recent research in bioinformatics has resulted in the development of numerous methods for the analysis of biological sequences, such as DNA and protein sequences. Section 8.4 introduces several popular methods, including biological sequence alignment algorithms and the hidden Markov model.
Tremendous and potentially infinite volumes of data streams are often generated by real-time surveillance systems, communication networks, Internet traffic, on-line transactions in the financial market or retail industry, electric power grids, industry production processes, scientific and engineering experiments, remote sensors, and other dynamic environments. Unlike traditional data sets, stream data flow in and out of a computer system continuously and with varying update rates. They are temporally ordered, fast changing, massive, and potentially infinite. It may be impossible to store an entire data stream or to scan through it multiple times due to its tremendous volume. Moreover, stream data tend to be of a rather low level of abstraction, whereas most analysts are interested in relatively high-level dynamic changes, such as trends and deviations. To discover knowledge or patterns from data streams, it is necessary to develop single-scan, on-line, multilevel, multidimensional stream processing and analysis methods.
Such single-scan, on-line data analysis methodology should not be confined to only stream data. It is also critically important for processing nonstream data that are massive. With data volumes mounting by terabytes or even petabytes, stream data nicely capture our data processing needs of today: even when the complete set of data is collected and can be stored in massive data storage devices, single scan (as in data stream systems) instead of random access (as in database systems) may still be the most realistic processing mode, because it is often too expensive to scan such a data set multiple times.
In this section, we introduce several on-line stream data analysis and mining methods. Section 8.1.1 introduces the basic methodologies for stream data processing and querying. Multidimensional analysis of stream data, encompassing stream data cubes and multiple granularities of time, is described in Section 8.1.2. Frequent-pattern mining and classification are presented in Sections 8.1.3 and 8.1.4, respectively. The clustering of dynamically evolving data streams is addressed in Section 8.1.5.
Methodologies for Stream Data Processing and Stream Data Systems
As seen from the previous discussion, it is impractical to scan through an entire data stream more than once. Sometimes we cannot even "look" at every element of a stream because the stream flows in so fast and changes so quickly. The gigantic size of such data sets also implies that we generally cannot store the entire stream data set in main memory or even on disk. The problem is not just that there is a lot of data; it is that the universes that we are keeping track of are relatively large, where a universe is the domain of possible values for an attribute. For example, if we were tracking the ages of millions of people, our universe would be relatively small, perhaps between zero and one hundred and twenty. We could easily maintain exact summaries of such data. In contrast, the universe corresponding to the set of all pairs of IP addresses on the Internet is very large, which makes exact storage intractable. A reasonable way of thinking about data streams is to actually think of a physical stream of water. Heraclitus once said that you can never step in the same stream twice,1 and so it is with stream data.
For effective processing of stream data, new data structures, techniques, and algorithms are needed. Because we do not have an infinite amount of space to store stream data, we often trade off between accuracy and storage. That is, we generally are willing to settle for approximate rather than exact answers. Synopses allow for this by providing summaries of the data, which typically can be used to return approximate answers to queries. Synopses use synopsis data structures, which are any data structures that are substantially smaller than their base data set (in this case, the stream data). From the algorithmic point of view, we want our algorithms to be efficient in both space and time. Instead of storing all or most elements seen so far, using O(N) space, we often want to use polylogarithmic space, O(log^k N), where N is the number of elements in the stream data. We may relax the requirement that our answers are exact, and ask for approximate answers within a small error range with high probability. That is, many data stream–based algorithms compute an approximate answer within a factor ε of the actual answer, with high probability. Generally, as the approximation factor (1 + ε) goes down, the space requirements go up. In this section, we examine some common synopsis data structures and techniques.
Random Sampling
Rather than deal with an entire data stream, we can think of sampling the stream at periodic intervals. "To obtain an unbiased sampling of the data, we need to know the length of the stream in advance. But what can we do if we do not know this length in advance?" In this case, we need to modify our approach.
1 Plato citing Heraclitus: “Heraclitus somewhere says that all things are in process and nothing stays still, and likening existing things to the stream of a river he says you would not step twice into the same river.”
A technique called reservoir sampling can be used to select an unbiased random sample of s elements without replacement. The idea behind reservoir sampling is relatively simple. We maintain a sample of size at least s, called the "reservoir," from which a random sample of size s can be generated. However, generating this sample from the reservoir can be costly, especially when the reservoir is large. To avoid this step, we maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability of replacing an old element in the reservoir. Let's say we have seen N elements thus far in the stream. The probability that a new element replaces an old one, chosen at random, is then s/N. This maintains the invariant that the set of s candidates in our reservoir forms a random sample of the elements seen so far.
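To make the replacement rule concrete, here is a minimal sketch of reservoir sampling in Python; the function name and the way the stream is passed in are illustrative, not taken from the text.

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of s elements from a stream of unknown length."""
    reservoir = []
    for n, element in enumerate(stream, start=1):   # n = number of elements seen so far
        if n <= s:
            reservoir.append(element)               # fill the reservoir with the first s elements
        else:
            # With probability s/n, the new element replaces a reservoir element chosen at random.
            j = random.randrange(n)                 # uniform over 0, 1, ..., n-1
            if j < s:
                reservoir[j] = element
    return reservoir

# Example: sample 5 values from a stream of 10,000 integers.
sample = reservoir_sample(range(10000), 5)
```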
Sliding Windows
Instead of sampling the data stream randomly, we can use the sliding window model to analyze stream data. The basic idea is that rather than running computations on all of the data seen so far, or on some sample, we can make decisions based only on recent data. More formally, at every time t, a new data element arrives. This element "expires" at time t + w, where w is the window "size" or length. The sliding window model is useful for stocks or sensor networks, where only recent events may be important. It also reduces memory requirements because only a small window of data is stored.
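For illustration, a time-based window of length w can be maintained with a double-ended queue; the class and method names below are hypothetical, not from the text.

```python
from collections import deque

class SlidingWindow:
    """Time-based sliding window: keeps only elements that arrived within the last w time units."""
    def __init__(self, w):
        self.w = w
        self.window = deque()          # (timestamp, element) pairs in arrival order

    def add(self, t, element):
        self.window.append((t, element))
        # Expire elements whose timestamp is older than t - w.
        while self.window and self.window[0][0] <= t - self.w:
            self.window.popleft()

    def elements(self):
        return [e for _, e in self.window]

# Example: a window of length 10 time units.
win = SlidingWindow(w=10)
for t, reading in [(1, 5.0), (4, 7.2), (12, 6.1), (15, 5.5)]:
    win.add(t, reading)
print(win.elements())   # only the readings from times 12 and 15 remain
```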
Histograms
The histogram is a synopsis data structure that can be used to approximate the frequency distribution of element values in a data stream. A histogram partitions the data into a set of contiguous buckets. Depending on the partitioning rule used, the width (bucket value range) and depth (number of elements per bucket) can vary. The equal-width partitioning rule is a simple way to construct histograms, where the range of each bucket is the same. Although easy to implement, this may not sample the probability distribution function well. A better approach is to use V-Optimal histograms (see Section 2.5.4). Similar to clustering, V-Optimal histograms define bucket sizes that minimize the frequency variance within each bucket, which better captures the distribution of the data. These histograms can then be used to approximate query answers rather than using sampling techniques.
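As a small illustration, an equal-width histogram can be built in a few lines; fixing the value range [lo, hi) in advance is an assumption here, since a stream's true range may not be known beforehand.

```python
def equal_width_histogram(values, lo, hi, num_buckets):
    """Build an equal-width histogram over [lo, hi) with num_buckets buckets."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1   # index of the bucket containing v
    return counts

# Example: 4 buckets of width 25 over the range [0, 100).
print(equal_width_histogram([3, 27, 29, 55, 80, 99], 0, 100, 4))  # [1, 2, 1, 2]
```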
Multiresolution Methods
A common way to deal with a large amount of data is through the use of data reduction methods (see Section 2.5). A popular data reduction method is the use of divide-and-conquer strategies such as multiresolution data structures. These allow a program to trade off between accuracy and storage, but also offer the ability to understand a data stream at multiple levels of detail.
A concrete example is a balanced binary tree, where we try to maintain this balance as new data come in. Each level of the tree provides a different resolution. The farther away we are from the tree root, the more detailed is the level of resolution.
A more sophisticated way to form multiple resolutions is to use a clustering method to organize stream data into a hierarchical structure of trees. For example, we can use a typical hierarchical clustering data structure like the CF-tree in BIRCH (see Section 7.5.2) to form a hierarchy of microclusters. With dynamic stream data flowing in and out, summary statistics of data streams can be incrementally updated over time in the hierarchy of microclusters. Information in such microclusters can be aggregated into larger macroclusters, depending on the application requirements, to derive general data statistics at multiple resolutions.
Wavelets (Section 2.5.3), a technique from signal processing, can be used to build a multiresolution hierarchy structure over an input signal, in this case, the stream data. Given an input signal, we would like to break it down or rewrite it in terms of simple, orthogonal basis functions. The simplest basis is the Haar wavelet. Using this basis corresponds to recursively performing averaging and differencing at multiple levels of resolution. Haar wavelets are easy to understand and implement. They are especially good at dealing with spatial and multimedia data. Wavelets have been used as approximations to histograms for query optimization. Moreover, wavelet-based histograms can be dynamically maintained over time. Thus, wavelets are a popular multiresolution method for data stream compression.
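The recursive averaging and differencing that the Haar basis corresponds to can be sketched as follows; this illustrative version assumes the signal length is a power of two and omits the normalization factors used in some formulations.

```python
def haar_decompose(signal):
    """One-dimensional Haar decomposition by repeated pairwise averaging and differencing."""
    coeffs = []
    current = list(signal)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details  = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = details + coeffs      # coarser-level details end up before finer-level ones
        current = averages
    return current + coeffs            # overall average followed by all detail coefficients

# Example: an 8-value signal; the first output value is its overall average.
print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```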
Sketches
Synopsis techniques mainly differ by how exactly they trade off accuracy for storage. Sampling techniques and sliding window models focus on a small part of the data, whereas other synopses try to summarize the entire data, often at multiple levels of detail. Some techniques require multiple passes over the data, such as histograms and wavelets, whereas other methods, such as sketches, can operate in a single pass.
Suppose that, ideally, we would like to maintain the full histogram over the universe of objects or elements in a data stream, where the universe is U = {1, 2, ..., v} and the stream is A = {a1, a2, ..., aN}. That is, for each value i in the universe, we want to maintain the frequency or number of occurrences of i in the sequence A. If the universe is large, this structure can be quite large as well. Thus, we need a smaller representation instead.
Let's consider the frequency moments of A. These are the numbers Fk, defined as

Fk = ∑_{i=1}^{v} mi^k,  (8.1)

where mi is the frequency (number of occurrences) of value i in the sequence A. In particular, F0 is the number of distinct elements in the sequence, and F1 is the length of the sequence (that is, N, here). F2 is known as the self-join size, the repeat rate, or Gini's index of homogeneity. The frequency moments of a data set provide useful information about the data for database applications, such as query answering. In addition, they indicate the degree of skew or asymmetry in the data (Section 2.2.1), which is useful in parallel database applications for determining an appropriate partitioning algorithm for the data.
When the amount of memory available is smaller than v, we need to employ a synopsis. The estimation of the frequency moments can be done by synopses that are known as sketches. These build a small-space summary for a distribution vector (e.g., histogram) using randomized linear projections of the underlying data vectors. Sketches provide probabilistic guarantees on the quality of the approximate answer (e.g., the answer to the given query is 12 ± 1 with a probability of 0.90). Given N elements and a universe U of v values, such sketches can approximate F0, F1, and F2 in O(log v + log N) space. The basic idea is to hash every element uniformly at random to a sign zi ∈ {−1, +1} and then maintain a random variable X = ∑i mi zi. It can be shown that X^2 is a good estimate for F2. To explain why this works, we can think of hashing elements to −1 or +1 as assigning each element value to an arbitrary side of a tug of war. When we sum up
to get X, we can think of measuring the displacement of the rope from the center point.
By squaring X, we square this displacement, capturing the data skew, F2.
To get an even better estimate, we can maintain multiple random variables, Xi. Then, by choosing the median value of the squares of these variables, we can increase our confidence that the estimated value is close to F2.
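The tug-of-war estimator described above (often called the AMS sketch) can be sketched as follows; the hash-based sign function is only a stand-in for the 4-wise independent hash family a real implementation would use, and all names are illustrative.

```python
import hashlib
import statistics

def sign(seed, element):
    """Map (seed, element) to +1 or -1 via a hash (stand-in for a 4-wise independent family)."""
    digest = hashlib.md5(f"{seed}|{element}".encode()).digest()
    return 1 if digest[0] % 2 == 0 else -1

def ams_f2_estimate(stream, num_estimators=9):
    """Tug-of-war (AMS) sketch: estimate the second frequency moment F2 of a stream."""
    sums = [0] * num_estimators
    for element in stream:
        for k in range(num_estimators):
            sums[k] += sign(k, element)   # each occurrence of the element pulls X_k by its sign z_i
    # Each X_k^2 is an unbiased estimate of F2; taking the median boosts confidence.
    return statistics.median(x * x for x in sums)

# Example: the stream a, a, b, c has frequencies (2, 1, 1), so F2 = 2^2 + 1 + 1 = 6.
print(ams_f2_estimate(["a", "a", "b", "c"]))   # prints a (noisy) estimate of F2 = 6
```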
From a database perspective, sketch partitioning was developed to improve the
performance of sketching on data stream query optimization. Sketch partitioning uses
coarse statistical information on the base data to intelligently partition the domain of the
underlying attributes in a way that provably tightens the error guarantees.
Randomized Algorithms
Randomized algorithms, in the form of random sampling and sketching, are often used
to deal with massive, high-dimensional data streams. The use of randomization often leads to simpler and more efficient algorithms in comparison to known deterministic algorithms.
If a randomized algorithm always returns the right answer but the running times vary, it is known as a Las Vegas algorithm. In contrast, a Monte Carlo algorithm has bounds on the running time but may not return the correct result. We mainly consider Monte Carlo algorithms. One way to think of a randomized algorithm is simply as a probability distribution over a set of deterministic algorithms.
Given that a randomized algorithm returns a random variable as a result, we would like to have bounds on the tail probability of that random variable. This tells us that the probability that a random variable deviates from its expected value is small. One basic tool is Chebyshev's inequality. Let X be a random variable with mean µ and standard deviation σ (variance σ²). Chebyshev's inequality says that

P(|X − µ| > k) ≤ σ²/k²  (8.2)

for any constant k > 0.
In many cases, multiple random variables can be used to boost the confidence in our results. As long as these random variables are fully independent, Chernoff bounds can be used. Let X1, X2, ..., Xn be independent Poisson trials. In a Poisson trial, the probability of success varies from trial to trial. If X is the sum of X1 to Xn with mean µ = E[X], then a weaker version of the Chernoff bound tells us that

Pr[X > (1 + δ)µ] < e^{−µδ²/4}  (8.3)

where δ ∈ (0, 1]. This shows that the probability of landing far above the mean decreases exponentially, which makes poor estimates much more unlikely.
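For example, with the (hypothetical) values µ = 100 and δ = 0.5, the bound gives Pr[X > 150] < e^{−100 × 0.25/4} = e^{−6.25} ≈ 0.002, so overshooting the mean by 50% is very unlikely.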
Data Stream Management Systems and Stream Queries
In traditional database systems, data are stored in finite and persistent databases. However, stream data are infinite and impossible to store fully in a database. In a Data Stream Management System (DSMS), there may be multiple data streams. They arrive on-line and are continuous, temporally ordered, and potentially infinite. Once an element from a data stream has been processed, it is discarded or archived, and it cannot be easily retrieved unless it is explicitly stored in memory.
A stream data query processing architecture includes three parts: end user, query processor, and scratch space (which may consist of main memory and disks). An end user issues a query to the DSMS, and the query processor takes the query, processes it using the information stored in the scratch space, and returns the results to the user.
Queries can be either one-time queries or continuous queries. A one-time query is evaluated once over a point-in-time snapshot of the data set, with the answer returned to the user. A continuous query is evaluated continuously as data streams continue to arrive. The answer to a continuous query is produced over time, always reflecting the stream data seen so far. A continuous query can act as a watchdog, as in "sound the alarm if the power consumption for Block 25 exceeds a certain threshold." Moreover, a query can be predefined (i.e., supplied to the data stream management system before any relevant data have arrived) or ad hoc (i.e., issued on-line after the data streams have already begun). A predefined query is generally a continuous query, whereas an ad hoc query can be either one-time or continuous.
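To make the watchdog notion concrete, a continuous query over a stream of readings might look like the following sketch; the stream format and threshold value are illustrative assumptions, not part of any DSMS interface described in the text.

```python
def power_alarm(readings, threshold):
    """Continuous-query watchdog: emit an alert whenever a block's consumption exceeds the threshold."""
    for block, kwh in readings:
        if kwh > threshold:
            yield f"ALERT: power consumption for {block} is {kwh}, above {threshold}"

# Hypothetical stream of (block, consumption) readings.
stream = [("Block 25", 180.0), ("Block 7", 90.0), ("Block 25", 260.0)]
for alert in power_alarm(stream, threshold=200.0):
    print(alert)   # fires once, for the 260.0 reading
```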
Stream Query Processing
The special properties of stream data introduce new challenges in query processing
In particular, data streams may grow unboundedly, and it is possible that queries mayrequire unbounded memory to produce an exact answer How can we distinguishbetween queries that can be answered exactly using a given bounded amount of memoryand queries that must be approximated? Actually, without knowing the size of the inputdata streams, it is impossible to place a limit on the memory requirements for most com-mon queries, such as those involving joins, unless the domains of the attributes involved
in the query are restricted This is because without domain restrictions, an unbounded
Trang 35number of attribute values must be remembered because they might turn out to joinwith tuples that arrive in the future.
Providing an exact answer to a query may require unbounded main memory; therefore a more realistic solution is to provide an approximate answer to the query. Approximate query answering relaxes the memory requirements and also helps in handling system load, because streams can come in too fast to process exactly. In addition, ad hoc queries need approximate history to return an answer. We have already discussed common synopses that are useful for approximate query answering, such as random sampling, sliding windows, histograms, and sketches.
As this chapter focuses on stream data mining, we will not go into any further details of stream query processing methods. For additional discussion, interested readers may consult the literature recommended in the bibliographic notes of this chapter.
Stream data are generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior. It is impossible to store such data streams completely in a data warehouse. Most stream data represent low-level information, consisting of various kinds of detailed temporal and other features. To find interesting or unusual patterns, it is essential to perform multidimensional analysis on aggregate measures (such as sum and average). This would facilitate the discovery of critical changes in the data at higher levels of abstraction, from which users can drill down to examine more detailed levels, when needed. Thus multidimensional OLAP analysis is still needed in stream data analysis, but how can we implement it?
Consider the following motivating example.
Example 8.1 Multidimensional analysis for power supply stream data. A power supply station generates infinite streams of power usage data. Suppose individual user, street address, and second are the attributes at the lowest level of granularity. Given a large number of users, it is only realistic to analyze the fluctuation of power usage at certain high levels, such as by city or street district and by quarter (of an hour), making timely power supply adjustments and handling unusual situations.
Conceptually, for multidimensional analysis, we can view such stream data as a virtual data cube, consisting of one or a few measures and a set of dimensions, including one time dimension and a few other dimensions, such as location, user-category, and so on. However, in practice, it is impossible to materialize such a data cube, because the materialization requires a huge amount of data to be computed and stored. Some efficient methods must be developed for systematic analysis of such data.
mate-Data warehouse and OLAP technology is based on the integration and consolidation
of data in multidimensional space to facilitate powerful and fast on-line data analysis
A fundamental difference in the analysis of stream data from that of relational and house data is that the stream data are generated in huge volume, flowing in and outdynamically and changing rapidly Due to limited memory, disk space, and processing
Trang 36ware-8.1 Mining Data Streams 475
power, it is impossible to register completely the detailed level of data and compute a fullymaterialized cube A realistic design is to explore several data compression techniques,
including (1) tilted time frame on the time dimension, (2) storing data only at some ical layers, and (3) exploring efficient computation of a very partially materialized data cube The (partial) stream data cubes so constructed are much smaller than those con-
crit-structed from the raw stream data but will still be effective for multidimensional streamdata analysis We examine such a design in more detail
Time Dimension with Compressed Time Scale: Tilted Time Frame
In stream data analysis, people are usually interested in recent changes at a fine scale but in long-term changes at a coarse scale. Naturally, we can register time at different levels of granularity. The most recent time is registered at the finest granularity; the more distant time is registered at a coarser granularity; and the level of coarseness depends on the application requirements and on how old the time point is (from the current time). Such a time dimension model is called a tilted time frame. This model is sufficient for many analysis tasks and also ensures that the total amount of data to retain in memory or to be stored on disk is small.
There are many possible ways to design a tilted time frame. Here we introduce three models, as illustrated in Figure 8.1: (1) the natural tilted time frame model, (2) the logarithmic tilted time frame model, and (3) the progressive logarithmic tilted time frame model.
A natural tilted time frame model is shown in Figure 8.1(a), where the time frame (or window) is structured in multiple granularities based on the "natural" or usual time scale: the most recent 4 quarters (15 minutes), followed by the last 24 hours, then 31 days, and then 12 months (the actual scale used is determined by the application).

Figure 8.1 Three models for tilted time frames: (a) a natural tilted time frame model; (b) a logarithmic tilted time frame model; (c) a progressive logarithmic tilted time frame table.

Based on this model, we can compute frequent itemsets in the last hour with the precision of a quarter of an hour, or in the last day with the precision of an hour, and so on until the whole year with the precision of a month.2 This model registers only 4 + 24 + 31 + 12 = 71 units of time for a year instead of 365 × 24 × 4 = 35,040 units, with an acceptable trade-off of the grain of granularity at a distant time.
The second model is the logarithmic tilted time frame model, as shown in Figure 8.1(b), where the time frame is structured in multiple granularities according to a logarithmic scale. Suppose that the most recent slot holds the transactions of the current quarter. The remaining slots are for the last quarter, the next two quarters (ago), 4 quarters, 8 quarters, 16 quarters, and so on, growing at an exponential rate. According to this model, with one year of data and the finest precision at a quarter, we would need log2(365 × 24 × 4) + 1 ≈ 16.1 units of time instead of 365 × 24 × 4 = 35,040 units. That is, we would just need 17 time frames to store the compressed information.
The third method is the progressive logarithmic tilted time frame model, where snapshots are stored at differing levels of granularity depending on the recency. Let T be the clock time elapsed since the beginning of the stream. Snapshots are classified into different frame numbers, which can vary from 0 to max_frame, where log2(T) − max_capacity ≤ max_frame ≤ log2(T), and max_capacity is the maximal number of snapshots held in each frame.
Each snapshot is represented by its timestamp. The rules for the insertion of a snapshot t (at time t) into the snapshot frame table are defined as follows: (1) if (t mod 2^i) = 0 but (t mod 2^{i+1}) ≠ 0, t is inserted into frame number i if i ≤ max_frame; otherwise (i.e., i > max_frame), t is inserted into max_frame; and (2) each slot has a max_capacity. At the insertion of t into frame number i, if the slot already reaches its max_capacity, the oldest snapshot in this frame is removed and the new snapshot is inserted.
Example 8.2 Progressive logarithmic tilted time frame. Consider the snapshot frame table of Figure 8.1(c), where max_frame is 5 and max_capacity is 3. Let's look at how timestamp 64 was inserted into the table. We know (64 mod 2^6) = 0 but (64 mod 2^7) ≠ 0, that is, i = 6. However, since this value of i exceeds max_frame, 64 was inserted into frame 5 instead of frame 6. Suppose we now need to insert a timestamp of 70. At time 70, since (70 mod 2^1) = 0 but (70 mod 2^2) ≠ 0, we would insert 70 into frame number 1. This would knock out the oldest snapshot of 58, given the slot capacity of 3. From the table, we see that the closer a timestamp is to the current time, the denser are the snapshots stored.
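The insertion rules above translate directly into code. The sketch below is illustrative: it represents the frame table as a plain list of lists and reuses the parameter values of Example 8.2, none of which is a data structure prescribed by the text.

```python
def insert_snapshot(frames, t, max_frame, max_capacity):
    """Insert timestamp t (t >= 1) into a progressive logarithmic tilted time frame table.

    frames[i] holds the snapshots of frame number i, oldest first.
    """
    # Find the largest i such that t is divisible by 2^i but not by 2^(i+1).
    i = 0
    while t % (2 ** (i + 1)) == 0:
        i += 1
    i = min(i, max_frame)            # frame numbers beyond max_frame fold into max_frame
    frames[i].append(t)
    if len(frames[i]) > max_capacity:
        frames[i].pop(0)             # drop the oldest snapshot held in this frame
    return frames

# Example 8.2 style usage: max_frame = 5, max_capacity = 3.
frames = [[] for _ in range(6)]
for t in range(1, 71):
    insert_snapshot(frames, t, max_frame=5, max_capacity=3)
print(frames)   # frames holding recent timestamps keep denser snapshots
```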
In the logarithmic and progressive logarithmic models discussed above, we have assumed that the base is 2. Similar rules can be applied to any base α, where α is an integer and α > 1. All three tilted time frame models provide a natural way for incremental insertion of data and for gradually fading out older values.
The tilted time frame models shown are sufficient for typical time-related queries, and at the same time, ensure that the total amount of data to retain in memory and/or to be computed is small.
2 We align the time axis with the natural calendar time. Thus, for each granularity level of the tilted time frame, there might be a partial interval, which is less than a full unit at that level.
Depending on the given application, we can provide different fading factors in the tilted time frames, such as by placing more weight on the more recent time frames. We can also have flexible alternative ways to design the tilted time frames. For example, suppose that we are interested in comparing the stock average from each day of the current week with the corresponding averages from the same weekdays last week, last month, or last year. In this case, we can single out Monday to Friday instead of compressing them into the whole week as one unit.
Critical Layers
Even with the tilted time frame model, it can still be too costly to dynamically compute and store a materialized cube. Such a cube may have quite a few dimensions, each containing multiple levels with many distinct values. Because stream data analysis has only limited memory space but requires fast response time, we need additional strategies that work in conjunction with the tilted time frame model. One approach is to compute and store only some mission-critical cuboids of the full data cube.
In many applications, it is beneficial to dynamically and incrementally compute and store two critical cuboids (or layers), which are determined based on their conceptual and computational importance in stream data analysis. The first layer, called the minimal interest layer, is the minimally interesting layer that an analyst would like to study. It is necessary to have such a layer because it is often neither cost-effective nor interesting in practice to examine the minute details of stream data. The second layer, called the observation layer, is the layer at which an analyst (or an automated system) would like to continuously study the data. This can involve making decisions regarding the signaling of exceptions, or drilling down along certain paths to lower layers to find cells indicating data exceptions.
Example 8.3 Critical layers for a power supply stream data cube. Let's refer back to Example 8.1 regarding the multidimensional analysis of stream data for a power supply station. Dimensions at the lowest level of granularity (i.e., the raw data layer) included individual_user, street_address, and second. At the minimal interest layer, these three dimensions are user_group, street_block, and minute, respectively. Those at the observation layer are ∗ (meaning all user), city, and quarter, respectively, as shown in Figure 8.2.
Based on this design, we would not need to compute any cuboids that are lower than the minimal interest layer because they would be beyond user interest. Thus, to compute our base cuboid, representing the cells of minimal interest, we need to compute and store the (three-dimensional) aggregate cells for the (user_group, street_block, minute) group-by. This can be done by aggregations on the dimensions user and address, by rolling up from individual_user to user_group and from street_address to street_block, respectively, and by rolling up on the time dimension from second to minute.
Similarly, the cuboids at the observation layer should be computed dynamically, taking the tilted time frame model into account as well. This is the layer that an analyst takes as an observation deck, watching the current stream data by examining the slope of changes at this layer to make decisions. This layer can be obtained by rolling up the cube along the user dimension to ∗ (for all user), along the address dimension to city, and along the time dimension to quarter. If something unusual is observed, the analyst can investigate by drilling down to lower levels to find data exceptions.

Figure 8.2 Two critical layers in a "power supply station" stream data cube: the observation layer (∗, city, quarter), the minimal interest layer (user_group, street_block, minute), and the primitive data layer (individual_user, street_address, second).
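Computing the base cuboid at the minimal interest layer is essentially a group-by on coarser keys. The sketch below is illustrative only: the mapping functions for user group and street block are hypothetical, and total power usage is assumed as the measure.

```python
from collections import defaultdict

def roll_up(readings, to_user_group, to_street_block):
    """Aggregate (user, address, second, kwh) readings into (user_group, street_block, minute) cells."""
    cells = defaultdict(float)
    for user, address, second, kwh in readings:
        key = (to_user_group(user), to_street_block(address), second // 60)  # second -> minute
        cells[key] += kwh            # the measure here is total power usage per cell
    return dict(cells)

# Hypothetical mappings and readings for illustration only.
readings = [("u1", "12 Main St", 61, 0.4), ("u2", "14 Main St", 75, 0.6)]
print(roll_up(readings,
              to_user_group=lambda u: "residential",
              to_street_block=lambda a: "Main St block 1"))
# {('residential', 'Main St block 1', 1): 1.0}
```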
Partial Materialization of a Stream Cube
“What if a user needs a layer that would be between the two critical layers?” Materializing
a cube at only two critical layers leaves much room for how to compute the cuboids in between. These cuboids can be precomputed fully, partially, or not at all (i.e., leave everything to be computed on the fly). An interesting method is popular path cubing, which rolls up the cuboids from the minimal interest layer to the observation layer by following one popular drilling path, materializes only the layers along the path, and leaves other layers to be computed only when needed. This method achieves a reasonable trade-off between space, computation time, and flexibility, and has quick incremental aggregation time, quick drilling time, and small space requirements.
To facilitate efficient computation and storage of the popular path of the stream cube, a compact data structure needs to be introduced so that the space taken in the computation of aggregations is minimized. A hyperlinked tree structure called the H-tree is revised and adopted here to ensure that a compact structure is maintained in memory for efficient computation of multidimensional and multilevel aggregations.
Each branch of the H-tree is organized in the same order as the specified popular path. The aggregate cells are stored in the nonleaf nodes of the H-tree, forming the computed cuboids along the popular path. Aggregation for each corresponding slot in the tilted time frame is performed from the minimal interest layer all the way up to the observation layer by aggregating along the popular path. The step-by-step aggregation is performed while inserting the new generalized tuples into the corresponding time slots.