Say I have the following binary data frame, df.
structure(list(a = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0), b = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 1), c = c(0, 0, 0, 0, 1, 0, 0, 1,
0, 1, 0), d = c(1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0), e = c(0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 1), f = c(0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 1), g = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0), h = c(1, 0, 0,
0, 0, 0, 0, 1, 1, 0, 0), i = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
0)), class = "data.frame", row.names = c(NA, -11L), .Names = c("a",
"b", "c", "d", "e", "f", "g", "h", "i"))
> df
a b c d e f g h i
1 0 0 0 1 0 0 0 1 0
2 0 0 0 0 0 0 1 0 0
3 0 0 0 0 1 1 1 0 0
4 0 0 0 0 0 0 1 0 0
5 1 0 1 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 1
7 0 1 0 1 0 0 0 0 0
8 1 0 1 0 0 0 0 1 0
9 0 1 0 0 0 0 0 1 1
10 0 0 1 1 0 0 0 0 1
11 0 1 0 0 1 1 0 0 0
I want to examine the similarities between rows and, therefore, use an MDS plot. I perform classical MDS scaling using the binary (i.e., Jaccard) method in dist().
# Load libraries
library(dplyr)
library(ggplot2)
library(magrittr)
# Perform MDS scaling using binary method
mds_df <- df %>%
  dist(method = "binary") %>%
  cmdscale()
Next, I label my columns, bind them to my original data frame, and add row numbers to be used as labels in my plot.
# Name columns
colnames(mds_df) <- c("mds_x", "mds_y")
# Bind to original data frame
df %<>%
  cbind(mds_df) %>%
  mutate(tags = row_number())
Finally, I plot my results with ggplot2.
g <- ggplot(df) + geom_point(aes(x = mds_x, y = mds_y), size = 5)
g <- g + geom_text(aes(x = mds_x, y = mds_y, label = tags), position = position_jitter(width = 0.05, height = 0.05))
g <- g + xlab("Coordinate 1") + ylab("Coordinate 2")
print(g)
Now, notice rows 2 and 4 in the matrix are exactly the same. In the figure they fall right on top of each other. Great! Makes sense. Next, look at rows 6 & 7. They have no common 1 values yet fall fairly close together. Hmm. To make things worse, rows 3 & 11 have two 1s in common yet are plotted much further apart. Weird.
I realise that the Jaccard approach compares the common elements to the total number of elements in both sets (i.e., intersection over union), but rows 6 & 7 have no elements in common and three not in common, whereas rows 3 & 11 have two elements in common and two not in common. Intuitively, I feel like 3 & 11 should fall closer together than 6 & 7. Is this due to a poor choice of distance metric or a flaw in my coding/logic? Is there another plotting method that would show these results in a more intuitive way?
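As a quick sanity check of that arithmetic, the two pairwise distances can be computed directly with the same dist() call, restricted to the relevant rows (using only the original binary columns a–i, since df now also holds the MDS coordinates):
# Pairwise Jaccard distances for the two pairs discussed above
dist(df[c(6, 7), letters[1:9]], method = "binary")   # 1.0: no 1s shared, 3 columns in the union
dist(df[c(3, 11), letters[1:9]], method = "binary")  # 0.5: 2 of the 4 columns in the union shared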
Since you have 9 variables, you are plotting 11 observations in 9-dimensional space. When you squash that down to 2-dimensional space, details are lost. If you run cmdscale() with eig = TRUE, you will get more information about the final solution. The GOF value is the goodness of fit, with 1.0 being a perfect score. You have 0.52, so you are displaying about 52% of the spatial information from the 9 dimensions in 2 dimensions. That is good but not great. If you increase to 3 dimensions, you will get a GOF value of 0.68. The cmdscale() function computes metric multidimensional scaling (aka principal coordinates analysis).
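For example, something along these lines returns those GOF values (just a sketch, re-using the Jaccard distances on your original binary columns):
# Classical MDS with eig = TRUE so the goodness-of-fit values are returned
d <- dist(df[, letters[1:9]], method = "binary")
mds_2d <- cmdscale(d, k = 2, eig = TRUE)
mds_2d$GOF   # two goodness-of-fit values, about 0.52 in 2 dimensions
mds_3d <- cmdscale(d, k = 3, eig = TRUE)
mds_3d$GOF   # about 0.68 in 3 dimensions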
If you load the vegan package, you also have the option to try non-metric multidimensional scaling (NMDS) with monoMDS() or metaMDS(). The problem with NMDS is that the solution can land in a local minimum, so it is best to try several runs and pick the best one. That is what metaMDS() does: by default it tries 20 random starting configurations, and if 2 of them are essentially identical, they are convergent. Your data does not produce 2 identical solutions, so I am just plotting the best one of the 20. Using trymax = 100, I finally got convergent solutions, but that solution is not noticeably different from the one using the default 20 tries:
library(vegan)

# Jaccard distances on the original binary columns, then NMDS via metaMDS()
df.dst <- dist(df[, letters[1:9]], method = "binary")
df.meta <- metaMDS(df.dst)
plot(df.meta, "sites")          # plot the NMDS configuration
text(df.meta, "sites", pos = 3) # label each point with its row number
I think the distances are a bit better represented in this plot. Certainly rows 11 and 3 are now closer together than rows 6 and 7.