What is clustering
• Clustering can be considered the most important unsupervised learning problem.
• A working definition of clustering is “the process of organizing objects into groups whose members are similar”.
• A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
• In the illustrated example, we can identify the 4 clusters into which the data can be divided.
• The similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (this is called distance-based clustering).
• Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects.
• In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
• Marketing: finding groups of customers with
similar behavior given a large database of
customer data containing their properties and past
• City-planning: identifying groups of houses according to their house type, value and geographical location;
• Earthquake studies: clustering observed earthquakes to identify dangerous zones;
• WWW: document classification; clustering weblog
data to discover groups of similar access patterns
• Dealing with a large number of dimensions and a large number of data items.
• The effectiveness of the method depends on the definition of “distance” (for distance-based clustering).
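To illustrate how much the definition of “distance” can matter, the following minimal sketch compares two common distance measures on hypothetical points. The coordinates and the function names (`euclidean`, `manhattan`, `nearest`) are assumptions made for this illustration and do not come from the slides.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points of equal dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest(query, candidates, dist):
    # Name of the candidate closest to `query` under the given distance.
    return min(candidates, key=lambda name: dist(query, candidates[name]))

# Hypothetical data: one query point and two candidate neighbours.
query = (0.0, 0.0)
candidates = {"a": (3.0, 3.0), "b": (0.0, 5.0)}

print(nearest(query, candidates, euclidean))  # "a": sqrt(18) is less than 5
print(nearest(query, candidates, manhattan))  # "b": 5 is less than 6
```

The same query point has a different nearest neighbour under each measure, so a distance-based method could place it in a different cluster depending on the measure chosen.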
Classification of clustering algorithms
• four of the most used clustering algorithms:
• K-Means Algorithm Properties
– There are always K clusters.
– There is always at least one item in each cluster.
– The clusters are non-hierarchical and they do not overlap.
– Every member of a cluster is closer to its own cluster than to any other cluster.
• Assumes instances are real-valued vectors.
• Clusters are based on centroids, i.e. the mean of the points in a cluster c:
  $\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$
• Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means
Let d be the distance measure between instances.
Select k random instances {s_1, s_2, …, s_k} as seeds.
Until clustering converges or other stopping criterion:
    For each instance x_i:
        Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster c_j:
        s_j = µ(c_j)
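The pseudocode above can be sketched in plain Python as follows. This is one possible reading rather than a reference implementation. Several choices are assumptions not fixed by the slides: squared Euclidean distance as d, "no assignment changed" as the convergence test, a `max_iter` cap, a fixed random `seed`, and keeping the old seed when a cluster ends up empty. The sample points are hypothetical.

```python
import random

def squared_distance(p, q):
    # Assumed choice of d: squared Euclidean distance.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    # Mean of the points in a cluster, coordinate by coordinate.
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(instances, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    seeds = rng.sample(instances, k)  # select k random instances as seeds
    assignment = None
    for _ in range(max_iter):
        # Assign each instance to the cluster whose seed is closest.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(x, seeds[j]))
            for x in instances
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        # Update the seeds to the centroid of each cluster.
        for j in range(k):
            members = [x for x, a in zip(instances, assignment) if a == j]
            if members:  # assumption: keep the old seed if a cluster is empty
                seeds[j] = centroid(members)
    return seeds, assignment

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

On this data the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random seeds.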
K-Means Example (K=2)
[Figure: pick seeds, reassign clusters, compute centroids]
Hierarchical Clustering
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
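The steps above can be sketched as a short Python function. The slides do not say how the distance between two clusters is measured, so this sketch assumes single linkage (the smallest distance between any two members) and recomputes it from the members instead of updating a stored matrix. The function name `agglomerative` and the one-dimensional sample values are hypothetical.

```python
def agglomerative(points, dist):
    # Step 1: every item starts in its own cluster (stored as a tuple of indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []  # records (cluster_a, cluster_b, merge_distance)

    def linkage(a, b):
        # Assumed single linkage: closest pair of members across two clusters.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        # Step 3 is implicit: distances to the merged cluster are recomputed
        # by `linkage` on the next pass. Step 4 is the enclosing loop.
        clusters = [c for i, c in enumerate(clusters) if i not in (p, q)]
        clusters.append(merged)
    return merges

# Hypothetical one-dimensional data with absolute difference as distance.
values = [1.0, 2.0, 6.0, 7.5, 20.0]
merges = agglomerative(values, lambda a, b: abs(a - b))
for a, b, d in merges:
    print(a, b, d)
```

The recorded merge distances give the levels at which clusters are joined, which is the information a dendrogram displays.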
• The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
  L(NA/RM) = 219
  m = 2
• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
  L(BA/NA/RM) = 255
  m = 3
            BA/NA/RM   FI     MI/TO
BA/NA/RM    0          268    564
FI          268        0      295
MI/TO       564        295    0
• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
  L(BA/FI/NA/RM) = 268
  m = 4
• Finally, we merge the last two clusters at level 295.
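The last two merges can be reproduced from the three-cluster distance matrix given earlier. The sketch below assumes the single-linkage update rule, where the distance from a merged cluster to any other cluster is the minimum of the two old distances. That rule is consistent with the levels 268 and 295 reported in the slides, although the slides do not name it. The helper `merge_closest` and the dictionary layout are hypothetical.

```python
def merge_closest(dist):
    # dist maps frozenset({a, b}) to the distance between clusters a and b.
    pair = min(dist, key=dist.get)           # closest pair of clusters
    level = dist[pair]
    a, b = sorted(pair)
    # Name the new cluster by joining the member cities alphabetically.
    new_name = "/".join(sorted(a.split("/") + b.split("/")))
    others = {name for p in dist for name in p} - {a, b}
    # Keep distances that do not involve either merged cluster.
    updated = {p: d for p, d in dist.items() if not (p & pair)}
    for other in others:
        # Assumed single-linkage update: take the smaller of the old distances.
        updated[frozenset({new_name, other})] = min(
            dist[frozenset({a, other})], dist[frozenset({b, other})]
        )
    return new_name, level, updated

# Distances taken from the three-cluster matrix in the slides.
dist = {
    frozenset({"BA/NA/RM", "FI"}): 268,
    frozenset({"BA/NA/RM", "MI/TO"}): 564,
    frozenset({"FI", "MI/TO"}): 295,
}
name1, level1, dist = merge_closest(dist)
name2, level2, dist = merge_closest(dist)
print(name1, level1)  # BA/FI/NA/RM 268
print(name2, level2)  # BA/FI/MI/NA/RM/TO 295
```

After the first merge the only remaining distance is min(564, 295) = 295, which is the level of the final merge.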