Multidimensional Scaling
Trung Duc Tran
Institute of Mathematics, University of Silesia in Katowice
Contents
1 Introduction and first solution using Kruskal's algorithm
  1.1 Introduction
  1.2 Numerical Technique for 2D multidimensional scaling
  1.3 3D multidimensional scaling
2 Comparison of the above results with classical MDS
3 Packages for non-metric MDS (NMDS) in R
  3.1 Function monoMDS
  3.2 Function metaMDS
4 Negative distances in NMDS
  4.1 Using the classical algorithm (cmdscale)
  4.2 Using Nonmetric Multidimensional Scaling (metaMDS)
Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS scales n-dimensional data down to m-dimensional data (m < n). In this paper we work through an example with the Iris dataset in R, for which n = 4 and m = 2, 3, because in 2D and 3D space we can visualize the dissimilarities between the objects.
1 Introduction and first solution using Kruskal's algorithm
1.1 Introduction
We suppose that there are n objects 1, ..., n, and that the values $\hat{d}_{ij}$ are the dissimilarities between every two objects. For a configuration of points $x_1, \ldots, x_n$ in t-dimensional space, with distances $d_{ij}$ between every two points, we define the stress of the configuration by
$$S = \sqrt{\frac{S_1}{T_1}} = \sqrt{\frac{\sum (d_{ij} - \hat{d}_{ij})^2}{\sum d_{ij}^2}}. \tag{1.1}$$
First, we assign the initial data to a matrix:
a <- data.matrix(iris)                 # convert the iris data frame to a numeric matrix
b <- subset(a, select = -c(Species))   # drop the Species column
n <- nrow(b)
We define the stress function in R:
stress <- function(c) {
  e <- dist(c)                   # distances within the configuration c
  f <- (e - d)^2                 # squared residuals against the dissimilarities d
  s <- sqrt(sum(f) / sum(e^2))
  return(s)
}
The stress is intended to be a measure of how well the configuration matches the data. By definition, the best-fitting configuration in t-dimensional space, for a fixed value of t, is the configuration which minimizes the stress.
Of major interest is the ordinary case in which the distances are Euclidean. If the point $x_i$ has (orthogonal) coordinates $x_{i1}, \ldots, x_{it}$, then the Euclidean (or Pythagorean) distance from $x_i$ to $x_j$ is given by
$$d_{ij} = \sqrt{\sum_{l=1}^{t} (x_{il} - x_{jl})^2}.$$
We define a distance function in R:
dist <- function(b) {
  d <- matrix(nrow = n, ncol = n)
  for (i in 1:n) {
    for (j in 1:i) {
      d[i, j] <- d[j, i] <- sqrt(sum((b[i, ] - b[j, ])^2))
    }
  }
  return(d)
}
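The dissimilarity matrix of the Iris data, used throughout the rest of the paper, is then obtained by applying this function to b (a usage line implied by, but not shown in, the listings):
d <- dist(b)   # 150 x 150 matrix of pairwise Euclidean distances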
1.2 Numerical Technique for 2D multidimensional scaling
In principle the iterative technique we use to minimize the stress is not difficult. It requires starting from an arbitrary configuration, computing the (negative) gradient, moving along it a suitable distance, and then repeating the last two steps a sufficient number of times. If a fairly good configuration is conveniently available for use as the starting configuration, it may save quite a few iterations; if not, an arbitrary starting configuration is quite satisfactory. Only two conditions should be met: no two points in the configuration should be the same, and the configuration should not lie in a lower-dimensional subspace than has been chosen for the analysis. If no configuration is conveniently available, an arbitrary configuration must be generated. One satisfactory way to do this is to use the first n points from the list
(1, 0, 0, ..., 0, 0), (0, 1, 0, ..., 0, 0), ..., (0, 0, 0, ..., 0, 1),
(2, 0, 0, ..., 0, 0), (0, 2, 0, ..., 0, 0), etc.
So we have the code in R:
c <- matrix(0, nrow = n, ncol = 2)
for (i in 1:n) {
  # the first n points of the pattern (1,0), (0,1), (2,0), (0,2), ...
  c[i, (i - 1) %% 2 + 1] <- (i + 1) %/% 2
}
Suppose we have arrived at the configuration x, consisting of the n points $x_1, \ldots, x_n$ in t dimensions. Let the coordinates of $x_i$ be $x_{i1}, \ldots, x_{it}$. We shall call all the numbers $x_{is}$, with i = 1, ..., n and s = 1, 2, the coordinates of the configuration x. Suppose the (negative) gradient of the stress at x is given by g, whose coordinates are $g_{is}$. Then we form the next configuration by starting from x and moving along g a distance which we call the step-size α. In symbols, the new configuration x' is given by
$$x'_{is} = x_{is} + \frac{\alpha}{\operatorname{mag}(g)}\, g_{is}$$
for all i and s. Here mag(g) means the relative magnitude of g and is given by:
$$\operatorname{mag}(g) = \frac{\sqrt{\sum_{i,s} g_{is}^2}}{\sqrt{\sum_{i,s} x_{is}^2}}.$$
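In R, with the current configuration stored in the matrix c and the (negative) gradient in a matrix g of the same shape, mag(g) is simply a ratio of two Euclidean norms; a one-line sketch using the names from the listings below:
mag <- sqrt(sum(g^2)) / sqrt(sum(c^2))   # relative magnitude of the gradient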
The initial value of α with an arbitrary starting configuration should be about 0.2. For a configuration that already has low stress, a smaller value should be used. (A poorly chosen value results only in extra iterations.)
We have:
$$T_1 = \sum d_{ij}^2, \qquad S_1 = \sum (d_{ij} - \hat{d}_{ij})^2.$$
To calculate the (negative) gradient we use the following formula. For the Euclidean distance (r = 2):
$$g_{kl} = S \sum_{i,j} (\sigma_{ki} - \sigma_{kj}) \left[ \frac{d_{ij} - \hat{d}_{ij}}{S_1} - \frac{d_{ij}}{T_1} \right] (x_{il} - x_{jl}), \tag{1.3}$$
where $\sigma_{ki}$ and $\sigma_{kj}$ denote the Kronecker symbols.
From (1.3):
+ if i = j, then $\sigma_{ki} - \sigma_{kj} = 0$;
+ if i ≠ j, we can take k = i, so that $\sigma_{ki} - \sigma_{kj} = \sigma_{ii} - \sigma_{ij} = 1$; thus
$$g_{il} = S \sum_{j} \left[ \frac{d_{ij} - \hat{d}_{ij}}{S_1} - \frac{d_{ij}}{T_1} \right] (x_{il} - x_{jl}).$$
We define the gradient function:
grad <- function(c) {            # (negative) gradient of the stress, following the formula for g_il
  e <- dist(c)                   # configuration distances d_ij
  s1 <- sum((e - d)^2)           # S1
  t1 <- sum(e^2)                 # T1
  s <- sqrt(s1 / t1)             # stress S
  g <- matrix(0, nrow = n, ncol = 2)
  for (i in 1:n) {
    for (l in 1:2) {
      for (j in 1:n) {
        if (j != i)
          g[i, l] <- g[i, l] +
            s * ((e[i, j] - d[i, j]) / s1 - e[i, j] / t1) * (c[i, l] - c[j, l])
      }
    }
  }
  return(g)
}
And the function for the next configuration:
nextconf <- function(c) {
  c1 <- matrix(nrow = n, ncol = 2)
  g <- grad(c)
  mag <- sqrt(sum(g^2)) / sqrt(sum(c^2))     # relative magnitude mag(g)
  for (i in 1:n) {
    c1[i, ] <- c[i, ] + (alpha / mag) * g[i, ]   # x'_is = x_is + (alpha / mag(g)) g_is
  }
  return(c1)
}
Here we take α = 0.2257106, obtained by applying the step-size rules discussed above, and we repeat the iteration until the stress drops below 6%:
alpha <- 0.2257106
h <- stress(c)
while (h > 0.06) {
  c <- nextconf(c)
  h <- stress(c)
}
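After the loop finishes, the final 2D configuration is stored in c and its stress can be checked directly; for the Iris data the reported value is about 5.4% (see Section 2):
stress(c)   # final stress of the 2D configuration, roughly 0.054 for Iris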
According to Kruskal, we can be satisfied with this level of stress.
Finally, we plot the result:
plot(c[, 2], c[, 1], col = "red")
lines(c[51:100, 2], c[51:100, 1], col = "blue")
lines(c[1:50, 2], c[1:50, 1], col = "black")
lines(c[101:150, 2], c[101:150, 1], col = "purple")
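Optionally, a legend mapping the colours to the species can be added; this is an extra line not in the original listing, with the colour order following the row ranges used above (rows 1-50 setosa, 51-100 versicolor, 101-150 virginica):
legend("topright", legend = levels(iris$Species),
       col = c("black", "blue", "purple"), lty = 1)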
Figure 1: 2D-MDS plot using Kruskal's algorithm
We can easily see that the three types of flower are separated in the graph.
1.3 3D multidimensional scaling
We repeat the construction of Section 1.2, now with a three-dimensional configuration: the configuration, gradient, and update matrices have three columns, and the loop runs until the stress drops below 7%.
a <- data.matrix(iris)
b <- subset(a, select = -c(Species))
n <- nrow(b)
c <- matrix(0, nrow = n, ncol = 3)
for (i in 1:n) {
  # starting points (1,0,0), (0,1,0), (0,0,1), (2,0,0), ...
  c[i, (i - 1) %% 3 + 1] <- (i + 2) %/% 3
}

dist <- function(b) {
  d <- matrix(nrow = n, ncol = n)
  for (i in 1:n) {
    for (j in 1:i) {
      d[i, j] <- d[j, i] <- sqrt(sum((b[i, ] - b[j, ])^2))
    }
  }
  return(d)
}
d <- dist(b)                     # dissimilarities of the Iris data

stress <- function(c) {
  e <- dist(c)
  f <- (e - d)^2
  s <- sqrt(sum(f) / sum(e^2))
  return(s)
}

grad <- function(c) {
  e <- dist(c)
  s1 <- sum((e - d)^2)
  t1 <- sum(e^2)
  s <- sqrt(s1 / t1)
  g <- matrix(0, nrow = n, ncol = 3)
  for (i in 1:n) {
    for (l in 1:3) {
      for (j in 1:n) {
        if (j != i)
          g[i, l] <- g[i, l] +
            s * ((e[i, j] - d[i, j]) / s1 - e[i, j] / t1) * (c[i, l] - c[j, l])
      }
    }
  }
  return(g)
}

nextconf <- function(c) {
  c1 <- matrix(nrow = n, ncol = 3)
  g <- grad(c)
  mag <- sqrt(sum(g^2)) / sqrt(sum(c^2))
  for (i in 1:n) {
    c1[i, ] <- c[i, ] + (alpha / mag) * g[i, ]
  }
  return(c1)
}

alpha <- 0.2257106
h <- stress(c)
while (h > 0.07) {
  c <- nextconf(c)
  h <- stress(c)
}
To plot the result:
library("scatterplot3d")
colors <- c("black", "blue", "purple")
colors <- colors[as.numeric(iris$Species)]
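The listing does not show the actual plotting call; a minimal sketch that produces a picture like Figure 2, assuming the converged 3D configuration is stored in c, is:
scatterplot3d(c[, 1], c[, 2], c[, 3], color = colors, pch = 16)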
Figure 2: 3D-MDS plot using Kruskal's algorithm
2 Comparison of the above results with classical MDS
In this section, we use cmdscale, which is a built-in function in R.
We still use the code from the first section, where d is the distance matrix of the Iris data.
The code for 2D MDS:
> c <- cmdscale(d, k = 2)
> stress(c)
And for 3D MDS:
> c <- cmdscale(d, k = 3)
> stress(c)
We can easily see that the classical algorithm gives a lower stress than Kruskal's algorithm. For the Iris dataset in particular, Kruskal's algorithm gives 5.4% in 2D-MDS and 6.4% in 3D-MDS, compared to 4.2% and 1.2%, respectively, for the classical algorithm. Another advantage of the classical algorithm is that it gives a result in polynomial time.
On the other hand, the classical algorithm is specific to metric MDS: if we use a distance operator different from the Euclidean distance, it will give a higher stress.
Figure 3: 2D-MDS plot using classical algorithm
Figure 4: 3D-MDS plot using classical algorithm
3 Packages for non-metric MDS (NMDS) in R
3.1 Function monoMDS
First, we need to load the library:
> library(vegan)
Function monoMDS uses Kruskal's (1964b) original monotone regression to minimize the stress. There are two alternative forms of the stress: Kruskal's (1964a, b) original, or stress 1, and an alternative version, or stress 2 (Sibson 1972). Both of these stresses can be expressed with the general formula
$$s^2 = \frac{\sum (d - \hat{d})^2}{\sum (d - d_0)^2},$$
where d are the distances among points in the ordination configuration, $\hat{d}$ are the fitted ordination distances, and $d_0$ are the ordination distances under the null model.
For stress 1, $d_0 = 0$, and for stress 2, $d_0$ is the mean distance.
Stress 2 can therefore be expressed as $s^2 = 1 - R^2$, where $R^2$ is the squared correlation between fitted values and ordination distances, and so it is related to the linear fit of the stress plot. For the Iris data, for 2D NMDS:
[monoMDS output for k = 2: 150 points, dissimilarity "unknown"; the run stopped on the sratmax criterion.]
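The output above (and the 3D run below) can be produced with calls along the following lines; this is a sketch, since the exact arguments used for the report are not shown, with d the Euclidean distance matrix from Section 1:
m2 <- monoMDS(as.dist(d), k = 2)   # 2D non-metric MDS by monotone regression
m2                                 # print stress and convergence information
The 3D solution is obtained in the same way with k = 3.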
And 3D NMDS:
[monoMDS output for k = 3: 150 points, dissimilarity "unknown"; the stopping criterion was reached.]
Figure 5: 2D-NMDS stress plot
Figure 6: 3D-NMDS stress plot
The figures above show the stress after each iteration.
We can easily see that already in the first few iterations the stress of the 3D NMDS is lower than that of the 2D NMDS, that it converges faster, and that its final stress is better.
3.2 Function metaMDS
Function metaMDS performs Nonmetric Multidimensional Scaling (NMDS) and tries to find a stable solution using several random starts. In addition, it standardizes the scaling in the result, so that the configurations are easier to interpret, and adds species scores to the site ordination. The metaMDS function does not perform the actual NMDS itself, but calls another function for that purpose.
For the Iris data in particular, for 2D NMDS:
[metaMDS output for k = 2: the run stopped because the scale factor of the gradient fell below sfgrmin.]
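A sketch of the corresponding call; the exact arguments used for the report are not shown, d is the distance matrix from Section 1, and vegan's plot method gives ordination pictures like Figures 7 and 8:
m <- metaMDS(as.dist(d), k = 2)    # NMDS with several random starts
m                                  # print the summary, including the final stress
plot(m, display = "sites")         # ordination plot of the 150 Iris points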
The 3D solution is obtained analogously with k = 3.
The stress obtained with this function is 2.4% for 2D MDS and 0.097% for 3D MDS, slightly better than classical MDS for the Euclidean distance.
Figure 7: 2D-NMDS plot
Figure 8: 3D-NMDS plot
4 Negative distances in NMDS
4.1 Using the classical algorithm (cmdscale)
First, we assign negative values to some elements of the distance matrix; the rest of the distance matrix is obtained using the Euclidean distance.
dist <- function(b) {
  d <- matrix(nrow = n, ncol = n)
  for (i in 1:n) {
    for (j in 1:i) {
      d[i, j] <- d[j, i] <- sqrt(sum((b[i, ] - b[j, ])^2)) } }
  # example sign flip; the entries actually chosen are not shown in the source listing
  d[1, 2] <- d[2, 1] <- -d[1, 2]
  return(d)
}
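The modified distance matrix has to be recomputed before rerunning cmdscale (a usage line implied by, but not shown in, the listing):
d <- dist(b)   # Euclidean distances with the selected entries made negative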
For 2D classical MDS:
> c <- cmdscale(d, k = 2)
> stress(c)
And 3D classical MDS:
> c <- cmdscale(d, k = 3)
> stress(c)
We can see that classical MDS is not a sufficient method for this case.
4.2 Using Nonmetric Multidimensional Scaling (metaMDS)
For 2D NMDS:
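A sketch of the call on the modified distance matrix; the printed output could not be recovered, and the object name is our own:
m2neg <- metaMDS(as.dist(d), k = 2)   # 2D NMDS on the matrix containing negative entries
m2neg                                 # stress and convergence information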
For 3D NMDS:
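And analogously in three dimensions, again with an assumed object name:
m3neg <- metaMDS(as.dist(d), k = 3)   # 3D NMDS on the same modified matrix
m3neg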
In this case, we can easily see that NMDS still gives a good stress, so it is a good method for solving a general MDS problem.
References
[1] Faith, D. P., Minchin, P. R. and Belbin, L. (1987). Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69, 57–68.
[2] Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325–328.
[3] Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness-of-fit to a nonmetric hypothesis. Psychometrika 29, 1–28.
[4] Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: a numerical method. Psychometrika 29, 115–129.
[5] Minchin, P. R. (1987). An evaluation of relative robustness of techniques for ecological ordinations. Vegetatio 69, 89–107.
[6] Sibson, R. (1972). Order invariant methods for data analysis. Journal of the Royal Statistical Society B 34, 311–349.