Multidimensional Scaling
Trung Duc Tran
Institute of Mathematics, University of Silesia in Katowice
Contents
1 Introduction and first solution using Kruskal's algorithm
  1.1 Introduction
  1.2 Numerical Technique for 2D multidimensional scaling
  1.3 3D multidimensional scaling
2 Comparison of the above results with classical MDS
3 Packages for non-metric MDS (NMDS) in R
  3.1 Function monoMDS
  3.2 Function metaMDS
4 Negative distances in NMDS
  4.1 Using the classical algorithm (cmdscale)
  4.2 Using Nonmetric Multidimensional Scaling (metaMDS)
Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS scales n-dimensional data down to m-dimensional data (m < n). In this paper we work through an example with the Iris dataset in R, for which n = 4 and m = 2, 3, because in 2D and 3D space we can visualize the dissimilarities between the objects.
1 Introduction and first solution using Kruskal's algorithm
1.1 Introduction
We suppose that there are n objects 1, ..., n, and that the values $\hat{d}_{ij}$ are the dissimilarities between every two objects. For a configuration of points $x_1, \ldots, x_n$ in t-dimensional space, with distances $d_{ij}$ between every two points, we define the stress of the configuration by
$$S = \sqrt{\frac{S_1}{T_1}} = \sqrt{\frac{\sum (d_{ij} - \hat{d}_{ij})^2}{\sum d_{ij}^2}}. \tag{1.1}$$
First, we assign the initial data to a matrix:
a <- data.matrix(iris)                 # convert the iris data frame to a numeric matrix
b <- subset(a, select = -c(Species))   # drop the Species column
n <- nrow(b)
We define the stress function in R:
stress <- function(c) {
  e <- dist(c)                   # distances within the configuration c
  f <- (e - d)^2                 # squared residuals against the dissimilarities d
  s <- sqrt(sum(f) / sum(e^2))
  return(s)
}
The stress is intended to be a measure of how well the configuration matches the data. By definition, the best-fitting configuration in t-dimensional space, for a fixed value of t, is the configuration which minimizes the stress.
Of major interest is the ordinary case in which the distances are Euclidean. If the point $x_i$ has (orthogonal) coordinates $x_{i1}, \ldots, x_{it}$, then the Euclidean (or Pythagorean) distance from $x_i$ to $x_j$ is given by
$$d_{ij} = \sqrt{\sum_{l=1}^{t} (x_{il} - x_{jl})^2}.$$
We define a distance function in R:
dist <- function(b) {
  d <- matrix(nrow = n, ncol = n)
  for (i in 1:n) {
    for (j in 1:i) {
      d[i, j] <- d[j, i] <- sqrt(sum((b[i, ] - b[j, ])^2))
    }
  }
  return(d)
}
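The dissimilarity matrix of the Iris data, used throughout the rest of the paper, is then obtained by applying this function to b (a usage line implied by, but not shown in, the listings):
d <- dist(b)   # 150 x 150 matrix of pairwise Euclidean distances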
1.2 Numerical Technique for 2D multidimensional scaling
In principle the iterative technique we use to minimize the stress is not difficult. It requires starting from an arbitrary configuration, computing the (negative) gradient, moving along it a suitable distance, and then repeating the last two steps a sufficient number of times. If a fairly good configuration is conveniently available for use as the starting configuration, it may save quite a few iterations; if not, an arbitrary starting configuration is quite satisfactory. Only two conditions should be met: no two points in the configuration should be the same, and the configuration should not lie in a lower-dimensional subspace than has been chosen for the analysis. If no configuration is conveniently available, an arbitrary configuration must be generated. One satisfactory way to do this is to use the first n points from the list
(1, 0, 0, ..., 0, 0), (0, 1, 0, ..., 0, 0), ..., (0, 0, 0, ..., 0, 1),
(2, 0, 0, ..., 0, 0), (0, 2, 0, ..., 0, 0), etc.
So we have the code in R:
c <- matrix(0, nrow = n, ncol = 2)
for (i in 1:n) {
  # the first n points of the pattern (1,0), (0,1), (2,0), (0,2), ...
  c[i, (i - 1) %% 2 + 1] <- (i + 1) %/% 2
}
Suppose we have arrived at the configuration x, consisting of the n points $x_1, \ldots, x_n$ in t dimensions. Let the coordinates of $x_i$ be $x_{i1}, \ldots, x_{it}$. We shall call all the numbers $x_{is}$, with i = 1, ..., n and s = 1, 2, the coordinates of the configuration x. Suppose the (negative) gradient of the stress at x is given by g, whose coordinates are $g_{is}$. Then we form the next configuration by starting from x and moving along g a distance which we call the step-size α. In symbols, the new configuration x' is given by
$$x'_{is} = x_{is} + \frac{\alpha}{\operatorname{mag}(g)}\, g_{is}$$
for all i and s. Here mag(g) means the relative magnitude of g and is given by:
$$\operatorname{mag}(g) = \frac{\sqrt{\sum_{i,s} g_{is}^2}}{\sqrt{\sum_{i,s} x_{is}^2}}.$$
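In R, with the current configuration stored in the matrix c and the (negative) gradient in a matrix g of the same shape, mag(g) is simply a ratio of two Euclidean norms; a one-line sketch using the names from the listings below:
mag <- sqrt(sum(g^2)) / sqrt(sum(c^2))   # relative magnitude of the gradient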
The initial value of α with an arbitrary starting configuration should be about 0.2. For a configuration that already has low stress, a smaller value should be used. (A poorly chosen value results only in extra iterations.)
We have:
$$T_1 = \sum d_{ij}^2, \qquad S_1 = \sum (d_{ij} - \hat{d}_{ij})^2.$$
To calculate the (negative) gradient we use the following formula. For the Euclidean distance (r = 2):
$$g_{kl} = S \sum_{i,j} (\sigma_{ki} - \sigma_{kj}) \left[ \frac{d_{ij} - \hat{d}_{ij}}{S_1} - \frac{d_{ij}}{T_1} \right] (x_{il} - x_{jl}), \tag{1.3}$$
where $\sigma_{ki}$ and $\sigma_{kj}$ denote the Kronecker symbols.
From (1.3):
+ if i = j, then $\sigma_{ki} - \sigma_{kj} = 0$;
+ if i ≠ j, we can take k = i, so that $\sigma_{ki} - \sigma_{kj} = \sigma_{ii} - \sigma_{ij} = 1$; thus
$$g_{il} = S \sum_{j} \left[ \frac{d_{ij} - \hat{d}_{ij}}{S_1} - \frac{d_{ij}}{T_1} \right] (x_{il} - x_{jl}).$$
We define the gradient function:
grad <- function(c) {            # (negative) gradient of the stress, following the formula for g_il
  e <- dist(c)                   # configuration distances d_ij
  s1 <- sum((e - d)^2)           # S1
  t1 <- sum(e^2)                 # T1
  s <- sqrt(s1 / t1)             # stress S
  g <- matrix(0, nrow = n, ncol = 2)
  for (i in 1:n) {
    for (l in 1:2) {
      for (j in 1:n) {
        if (j != i)
          g[i, l] <- g[i, l] +
            s * ((e[i, j] - d[i, j]) / s1 - e[i, j] / t1) * (c[i, l] - c[j, l])
      }
    }
  }
  return(g)
}
And the function for the next configuration:
nextconf <- function(c) {
  c1 <- matrix(nrow = n, ncol = 2)
  g <- grad(c)
  mag <- sqrt(sum(g^2)) / sqrt(sum(c^2))     # relative magnitude mag(g)
  for (i in 1:n) {
    c1[i, ] <- c[i, ] + (alpha / mag) * g[i, ]   # x'_is = x_is + (alpha / mag(g)) g_is
  }
  return(c1)
}
Here we take α = 0.2257106, obtained by applying the step-size rules discussed above, and we repeat the iteration until the stress drops below 6%:
alpha <- 0.2257106
h <- stress(c)
while (h > 0.06) {
  c <- nextconf(c)
  h <- stress(c)
}
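After the loop finishes, the final 2D configuration is stored in c and its stress can be checked directly; for the Iris data the reported value is about 5.4% (see Section 2):
stress(c)   # final stress of the 2D configuration, roughly 0.054 for Iris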
According to Kruskal, we can be satisfied with this level of stress.
Finally, we plot the result:
plot(c[, 2], c[, 1], col = "red")
lines(c[51:100, 2], c[51:100, 1], col = "blue")
lines(c[1:50, 2], c[1:50, 1], col = "black")
lines(c[101:150, 2], c[101:150, 1], col = "purple")
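Optionally, a legend mapping the colours to the species can be added; this is an extra line not in the original listing, with the colour order following the row ranges used above (rows 1-50 setosa, 51-100 versicolor, 101-150 virginica):
legend("topright", legend = levels(iris$Species),
       col = c("black", "blue", "purple"), lty = 1)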
Figure 1: 2D-MDS plot using Kruskal's algorithm
We can easily see that the three types of flower are separated in the graph.
1.3 3D multidimensional scaling
We repeat the construction of Section 1.2, now with a three-dimensional configuration: the configuration, gradient, and update matrices have three columns, and the loop runs until the stress drops below 7%.
a <- data.matrix(iris)
b <- subset(a, select = -c(Species))
n <- nrow(b)
c <- matrix(0, nrow = n, ncol = 3)
for (i in 1:n) {
  # starting points (1,0,0), (0,1,0), (0,0,1), (2,0,0), ...
  c[i, (i - 1) %% 3 + 1] <- (i + 2) %/% 3
}

dist <- function(b) {
  d <- matrix(nrow = n, ncol = n)
  for (i in 1:n) {
    for (j in 1:i) {
      d[i, j] <- d[j, i] <- sqrt(sum((b[i, ] - b[j, ])^2))
    }
  }
  return(d)
}
d <- dist(b)                     # dissimilarities of the Iris data

stress <- function(c) {
  e <- dist(c)
  f <- (e - d)^2
  s <- sqrt(sum(f) / sum(e^2))
  return(s)
}

grad <- function(c) {
  e <- dist(c)
  s1 <- sum((e - d)^2)
  t1 <- sum(e^2)
  s <- sqrt(s1 / t1)
  g <- matrix(0, nrow = n, ncol = 3)
  for (i in 1:n) {
    for (l in 1:3) {
      for (j in 1:n) {
        if (j != i)
          g[i, l] <- g[i, l] +
            s * ((e[i, j] - d[i, j]) / s1 - e[i, j] / t1) * (c[i, l] - c[j, l])
      }
    }
  }
  return(g)
}

nextconf <- function(c) {
  c1 <- matrix(nrow = n, ncol = 3)
  g <- grad(c)
  mag <- sqrt(sum(g^2)) / sqrt(sum(c^2))
  for (i in 1:n) {
    c1[i, ] <- c[i, ] + (alpha / mag) * g[i, ]
  }
  return(c1)
}

alpha <- 0.2257106
h <- stress(c)
while (h > 0.07) {
  c <- nextconf(c)
  h <- stress(c)
}
To plot the result:
library("scatterplot3d")
colors <- c("black", "blue", "purple")
colors <- colors[as.numeric(iris$Species)]
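The listing does not show the actual plotting call; a minimal sketch that produces a picture like Figure 2, assuming the converged 3D configuration is stored in c, is:
scatterplot3d(c[, 1], c[, 2], c[, 3], color = colors, pch = 16)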
Figure 2: 3D-MDS plot using Kruskal's algorithm
2 Comparison of the above results with classical MDS
In this section, we use cmdscale, which is a built-in function in R.
We still use the code from the first section, where d is the distance matrix of the Iris data.
The code for 2D MDS:
> c <- cmdscale(d, k = 2)
> stress(c)
And for 3D MDS:
> c <- cmdscale(d, k = 3)
> stress(c)
We can easily see that the classical algorithm gives a lower stress than Kruskal's algorithm. For the Iris dataset in particular, Kruskal's algorithm gives 5.4% in 2D-MDS and 6.4% in 3D-MDS, compared to 4.2% and 1.2%, respectively, for the classical algorithm. Another advantage of the classical algorithm is that it gives a result in polynomial time.
On the other hand, the classical algorithm is specific to metric MDS: if we use a distance operator different from the Euclidean distance, it will give a higher stress.
Figure 3: 2D-MDS plot using classical algorithm
Figure 4: 3D-MDS plot using classical algorithm
3 Packages for non-metric MDS (NMDS) in R
3.1 Function monoMDS
First, we need to load the library:
> library(vegan)
Function monoMDS uses Kruskal's (1964b) original monotone regression to minimize the stress. There are two alternative forms of the stress: Kruskal's (1964a, b) original, or stress 1, and an alternative version, or stress 2 (Sibson 1972). Both of these stresses can be expressed with the general formula
$$s^2 = \frac{\sum (d - \hat{d})^2}{\sum (d - d_0)^2},$$
where d are the distances among points in the ordination configuration, $\hat{d}$ are the fitted ordination distances, and $d_0$ are the ordination distances under the null model.
For stress 1, $d_0 = 0$, and for stress 2, $d_0$ is the mean distance.
Stress 2 can therefore be expressed as $s^2 = 1 - R^2$, where $R^2$ is the squared correlation between fitted values and ordination distances, and so it is related to the linear fit of the stress plot. For the Iris data, for 2D NMDS:
[monoMDS output for k = 2: 150 points, dissimilarity "unknown"; the run stopped on the sratmax criterion.]
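The output above (and the 3D run below) can be produced with calls along the following lines; this is a sketch, since the exact arguments used for the report are not shown, with d the Euclidean distance matrix from Section 1:
m2 <- monoMDS(as.dist(d), k = 2)   # 2D non-metric MDS by monotone regression
m2                                 # print stress and convergence information
The 3D solution is obtained in the same way with k = 3.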
And 3D NMDS:
[monoMDS output for k = 3: 150 points, dissimilarity "unknown"; the stopping criterion was reached.]
Figure 5: 2D-NMDS stress plot
Figure 6: 3D-NMDS stress plot
The figures above show the stress after each iteration.
We can easily see that already in the first few iterations the stress of the 3D NMDS is lower than that of the 2D NMDS, that it converges faster, and that its final stress is better.
3.2 Function metaMDS
Function metaMDS performs Nonmetric Multidimensional Scaling (NMDS) and tries to find a stable solution using several random starts. In addition, it standardizes the scaling in the result, so that the configurations are easier to interpret, and adds species scores to the site ordination. The metaMDS function does not perform the actual NMDS itself, but calls another function for that purpose.
For the Iris data in particular, for 2D NMDS:
[metaMDS output for k = 2: the run stopped because the scale factor of the gradient fell below sfgrmin.]
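A sketch of the corresponding call; the exact arguments used for the report are not shown, d is the distance matrix from Section 1, and vegan's plot method gives ordination pictures like Figures 7 and 8:
m <- metaMDS(as.dist(d), k = 2)    # NMDS with several random starts
m                                  # print the summary, including the final stress
plot(m, display = "sites")         # ordination plot of the 150 Iris points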
The 3D solution is obtained analogously with k = 3.
The stress obtained with this function is 2.4% for 2D MDS and 0.097% for 3D MDS, slightly better than classical MDS for the Euclidean distance.
Figure 7: 2D-NMDS plot
Figure 8: 3D-NMDS plot
4 Negative distances in NMDS
4.1 Using the classical algorithm (cmdscale)
First, we assign negative values to some elements of the distance matrix; the rest of the distance matrix is obtained using the Euclidean distance.
dist <- function(b) {
  d <- matrix(nrow = n, ncol = n)
  for (i in 1:n) {
    for (j in 1:i) {
      d[i, j] <- d[j, i] <- sqrt(sum((b[i, ] - b[j, ])^2)) } }
  # example sign flip; the entries actually chosen are not shown in the source listing
  d[1, 2] <- d[2, 1] <- -d[1, 2]
  return(d)
}
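The modified distance matrix has to be recomputed before rerunning cmdscale (a usage line implied by, but not shown in, the listing):
d <- dist(b)   # Euclidean distances with the selected entries made negative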
For 2D classical MDS:
> c <- cmdscale(d, k = 2)
> stress(c)
And 3D classical MDS:
> c <- cmdscale(d, k = 3)
> stress(c)
We can see that classical MDS is not a sufficient method for this case.
4.2 Using Nonmetric Multidimensional Scaling (metaMDS)
For 2D NMDS:
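A sketch of the call on the modified distance matrix; the printed output could not be recovered, and the object name is our own:
m2neg <- metaMDS(as.dist(d), k = 2)   # 2D NMDS on the matrix containing negative entries
m2neg                                 # stress and convergence information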
For 3D NMDS:
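And analogously in three dimensions, again with an assumed object name:
m3neg <- metaMDS(as.dist(d), k = 3)   # 3D NMDS on the same modified matrix
m3neg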
In this case, we can easily see that NMDS still gives a good stress, so it is a good method for solving a general MDS problem.
References
[1] Faith, D. P., Minchin, P. R. and Belbin, L. (1987). Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69, 57–68.
[2] Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325–328.
[3] Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness-of-fit to a nonmetric hypothesis. Psychometrika 29, 1–28.
[4] Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: a numerical method. Psychometrika 29, 115–129.
[5] Minchin, P. R. (1987). An evaluation of relative robustness of techniques for ecological ordinations. Vegetatio 69, 89–107.
[6] Sibson, R. (1972). Order invariant methods for data analysis. Journal of the Royal Statistical Society B 34, 311–349.