MDS plots of binary data: counterintuitive clustering


Say I have the following binary data frame, df.

structure(list(a = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0), b = c(0, 
0, 0, 0, 0, 0, 1, 0, 1, 0, 1), c = c(0, 0, 0, 0, 1, 0, 0, 1, 
0, 1, 0), d = c(1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0), e = c(0, 0, 
1, 0, 0, 0, 0, 0, 0, 0, 1), f = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 
0, 1), g = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0), h = c(1, 0, 0, 
0, 0, 0, 0, 1, 1, 0, 0), i = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 
0)), class = "data.frame", row.names = c(NA, -11L), .Names = c("a", 
"b", "c", "d", "e", "f", "g", "h", "i"))

> df
   a b c d e f g h i
1  0 0 0 1 0 0 0 1 0
2  0 0 0 0 0 0 1 0 0
3  0 0 0 0 1 1 1 0 0
4  0 0 0 0 0 0 1 0 0
5  1 0 1 0 0 0 0 0 0
6  0 0 0 0 0 0 0 0 1
7  0 1 0 1 0 0 0 0 0
8  1 0 1 0 0 0 0 1 0
9  0 1 0 0 0 0 0 1 1
10 0 0 1 1 0 0 0 0 1
11 0 1 0 0 1 1 0 0 0

I want to examine the similarities between rows and therefore use an MDS plot. I perform classical MDS using the binary (i.e., Jaccard) method in dist().

# Load libraries
library(dplyr)
library(ggplot2)
library(magrittr)

# Perform MDS scaling using binary method
mds_df <- df %>% 
  dist(method = "binary") %>% 
  cmdscale 

Next, I label my columns, bind them to my original data frame, and add row numbers to be used as labels in my plot.

# Name columns
colnames(mds_df) <- c("mds_x", "mds_y")

# Bind to original data frame
df %<>% 
  cbind(mds_df) %>% 
  mutate(tags = row_number())

Finally, I plot my results with ggplot2.

g <- ggplot(df) + geom_point(aes(x = mds_x, y = mds_y), size = 5)
g <- g + geom_text(aes(x = mds_x, y = mds_y, label = tags), position = position_jitter(width = 0.05, height = 0.05))
g <- g + xlab("Coordinate 1") + ylab("Coordinate 2")
print(g)

[Figure: classical MDS plot of the 11 rows, labelled by row number]

Now, notice rows 2 and 4 in the matrix are exactly the same. In the figure they fall right on top of each other. Great! Makes sense. Next, look at rows 6 & 7. They have no common 1 values yet fall fairly close together. Hmm. To make things worse, rows 3 & 11 have two 1s in common yet are plotted much further apart. Weird.

I realise that the Jaccard approach compares the shared elements with the total number of elements in both sets (i.e., intersection over union), but rows 6 & 7 have three distinct elements between them with none shared (d = 1 − 0/3 = 1), whereas rows 3 & 11 have two elements in common and two not in common (d = 1 − 2/4 = 0.5). Intuitively, I feel like 3 & 11 should fall closer together than 6 & 7. Is this due to a poor choice of distance metric or a flaw in my coding/logic? Is there another plotting method that would show these results in a more intuitive way?
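For what it's worth, the raw Jaccard distances do agree with my intuition, so it seems to be the 2-D projection that distorts them. A quick check (note that df now carries the mds_x, mds_y, and tags columns bound above, so I subset back to the original nine columns):

# Raw Jaccard distances for the two pairs in question
d <- as.matrix(dist(df[, letters[1:9]], method = "binary"))
d[6, 7]   # 1.0: rows 6 & 7 share none of their 3 combined elements
d[3, 11]  # 0.5: rows 3 & 11 share 2 of their 4 combined elements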


Solution

  • Since you have 9 variables, you are plotting 11 observations in 9-dimensional space. When you squash that down to 2-dimensional space, details are lost. If you run cmdscale() with eig=TRUE you will get more information about the final solution. The GOF value is the goodness of fit, with 1.0 being a perfect score. You have .52, so your 2-dimensional plot displays about 52% of the spatial information in the 9 dimensions. That is good but not great. If you increase to 3 dimensions, the GOF rises to .68. Note that cmdscale() computes metric multidimensional scaling (also known as principal coordinates analysis).
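    For example, a minimal sketch of that check (it rebuilds the distance matrix from the original nine columns, since your df gained extra columns earlier; the GOF values are the ones quoted above):

    df.dst <- dist(df[, letters[1:9]], method = "binary")
    # 2-D solution: GOF around .52
    cmdscale(df.dst, k = 2, eig = TRUE)$GOF
    # 3-D solution: GOF around .68
    cmdscale(df.dst, k = 3, eig = TRUE)$GOF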

    If you load the vegan package, you have the option to try non-metric multidimensional scaling (NMDS) with monoMDS() or metaMDS(). The problem with NMDS is that the solution can get stuck in a local minimum, so it is best to try several runs and pick the best one. That is what metaMDS() does: by default it tries 20 random starting configurations, and if 2 of them are essentially identical, they are considered convergent. Your data does not yield 2 identical solutions in those 20 tries, so I am just plotting the best one of the 20. Using trymax=100, I finally got convergent solutions, but that solution is not noticeably different from the one using the default 20 tries:

    library(vegan)

    # Jaccard distances on the original nine binary columns
    # (df gained mds_x, mds_y, and tags columns earlier, so drop them here)
    df.dst <- dist(df[, letters[1:9]], method = "binary")
    df.meta <- metaMDS(df.dst)
    plot(df.meta, "sites")
    text(df.meta, "sites", pos = 3)
    

    [Figure: NMDS plot of the 11 rows, sites labelled 1–11]

    I think the distances are a bit better represented in this plot. Certainly 11 and 3 are now closer together than 6 and 7.
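    If you want to check how faithfully the plotted distances track the original dissimilarities, vegan's stressplot() draws a Shepard diagram for the NMDS fit:

    # Shepard diagram: original dissimilarities vs. ordination distances
    stressplot(df.meta)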