Machine Learning Clustering



What is clustering

• Clustering can be considered the most important unsupervised learning problem.
• Another definition of clustering is “the process of organizing objects into groups whose members are similar”.


• A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.


• In this case we identify the 4 clusters into which the data can be divided.
• The similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering.
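The distance criterion can be illustrated with a minimal sketch. It assumes Euclidean distance (the slides do not fix a particular measure), and the three points are hypothetical values chosen only for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 2-D points, made up for this example.
a, b, c = (1.0, 1.0), (1.5, 1.2), (8.0, 9.0)

# a is "close" to b and far from c, so a distance-based method
# would tend to group a with b rather than with c.
print(euclidean(a, b) < euclidean(a, c))
```

Any other distance measure could be swapped in for `euclidean`; the resulting clusters depend on that choice.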


• Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects.
• In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.


• Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past …


• City planning: identifying groups of houses according to their house type, value and geographical location.
• Earthquake studies: clustering observed earthquakes to identify dangerous zones.
• WWW: document classification; clustering weblog data to discover groups of similar access patterns.


• Dealing with a large number of dimensions and a large number of data items.
• The effectiveness of the method depends on the definition of “distance” (for distance-based clustering).


Classification of clustering algorithms


• four of the most used clustering algorithms:


• K-Means Algorithm Properties
– There are always K clusters.
– There is always at least one item in each cluster.
– The clusters are non-hierarchical and they do not overlap.
– Every member of a cluster is closer to its own cluster than to any other cluster.


• Assumes instances are real-valued vectors.
• Clusters are based on centroids, that is, the mean of the points in a cluster c:

$$\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$$

• Reassignment of instances to clusters is based on distance to the current cluster centroids.


K-Means

Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster.)
    For each cluster cj:
        sj = μ(cj)
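The loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance for d, stops when no assignment changes or after a hypothetical `max_iter` cap, and leaves a seed unchanged if its cluster happens to become empty. The function names and the sample points are invented for this example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance (assumed choice for d)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Mean of a non-empty list of points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    seeds = rng.sample(points, k)      # select k random instances as seeds
    assignment = None
    for _ in range(max_iter):
        # Assign each instance to the cluster whose seed is nearest.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(x, seeds[j])) for x in points
        ]
        if new_assignment == assignment:   # converged: nothing moved
            break
        assignment = new_assignment
        # Update each seed to the centroid of its cluster.
        for j in range(k):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:                    # keep the old seed if empty
                seeds[j] = centroid(members)
    return seeds, assignment

# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
seeds, labels = k_means(points, k=2)
print(labels)
```

Which numeric label each group receives depends on the random seeds, so only the grouping itself is meaningful.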


K-Means Example (K=2)

[Figure: pick seeds; reassign clusters; compute centroids]


Hierarchical Clustering

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing one item.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
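The procedure above can be sketched in a few lines of Python. The steps do not say how the distance to a merged cluster is computed, so this sketch assumes single linkage (the minimum of the two old distances). The function name, the `"/"` naming of merged clusters, and the four-item distance table are hypothetical choices for illustration.

```python
def single_linkage(items, dist):
    """Merge clusters bottom-up; return [(merged_name, level), ...].

    dist maps frozenset({a, b}) to the distance between items a and b.
    Item names are assumed not to contain "/".
    """
    clusters = set(items)          # start: every item is its own cluster
    d = dict(dist)                 # working copy of the distance table
    merges = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)   # closest pair of clusters
        level = d[pair]
        a, b = pair
        clusters -= {a, b}
        new = "/".join(sorted(a.split("/") + b.split("/")))
        # Distance from the new cluster to each remaining cluster
        # (single linkage: the smaller of the two old distances).
        updated = {p: v for p, v in d.items() if a not in p and b not in p}
        for other in clusters:
            updated[frozenset((new, other))] = min(
                d[frozenset((a, other))], d[frozenset((b, other))]
            )
        clusters.add(new)
        d = updated
        merges.append((new, level))
    return merges

# Hypothetical distances between four items, made up for this example.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
         ("B", "C"): 5, ("B", "D"): 9, ("C", "D"): 4}
dist = {frozenset(p): v for p, v in pairs.items()}
print(single_linkage(["A", "B", "C", "D"], dist))
```

Replacing `min` with `max` in the update step would give complete linkage instead; the merge order can change with that choice.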


• The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.


• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m = 2


• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m = 3


            BA/NA/RM   FI     MI/TO
BA/NA/RM    0          268    564
FI          268        0      295
MI/TO       564        295    0


• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m = 4


• Finally, we merge the last two clusters at level 295.
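The last two merge levels can be checked directly from the three-cluster distance table. The short sketch below uses only the three distances from that table and assumes single linkage (minimum distance) for the update, which is consistent with the levels 268 and 295 quoted in the slides; the variable names are illustrative.

```python
# Distances copied from the three-cluster table in the slides.
d = {
    frozenset(("BA/NA/RM", "FI")): 268,
    frozenset(("BA/NA/RM", "MI/TO")): 564,
    frozenset(("FI", "MI/TO")): 295,
}

closest = min(d, key=d.get)    # pair of clusters merged at step m = 4
merge_level = d[closest]

# Assumed single-linkage update: distance from the merged cluster to
# MI/TO is the smaller of the two old distances.
final_level = min(
    d[frozenset(("BA/NA/RM", "MI/TO"))],
    d[frozenset(("FI", "MI/TO"))],
)

print(sorted(closest), merge_level, final_level)
```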
