I'm having a frustrating problem that I can't reproduce (I wish I could). I've generated dendrograms with three ecological datasets, using the same code but unique objects for each. Each leaf in the dendrograms is a survey plot, with species presence/abundance driving the clustering.
I cut the dendrogram into 3 groups and color-code each group. This works fine for all three datasets when clustering with Euclidean distance, and for two of my datasets when using Bray-Curtis distance. However, the third dataset puts two leaves into their own cluster when using Bray-Curtis and forces the color code to recycle, creating k = 4 groups despite specifying k = 3.
My question is: why would two leaves (plots) be forced into their own 'cluster,' and force the dendrogram to have 4 clusters when I've specified k = 3 groups?
I've pasted below an example of the code, and images of the "correct" and "wrong" dendrograms. Curious if anyone has any troubleshooting suggestions, since I can't offer code that reproduces this error. TIA.
I've tried:
Example code (same format with unique objects used for each dendrogram figure). Please access csv from https://drive.google.com/file/d/12eXIXVuHTu4BLGxcGu18bqhT85ZOHkNW/view?usp=sharing. See file clusterdata.csv for the troublesome dataset. Colnames are species; rows are plot ID; values are cover class bins (0 = absent, 1 = < 25%, 2 = 25-50%, etc.)
library(vegan)       # vegdist()
library(dendextend)  # set(); also provides the %>% pipe
d <- read.csv("clusterdata.csv")
dend <- d %>%
  vegdist(method = "bray") %>%
  hclust(method = "ward.D") %>%
  # cutree(h = 3) %>%
  as.dendrogram()
mycol <- c("#009E73", "#0072B2", "#E69F00")
dend.plot <- as.dendrogram(dend) %>%
  set("branches_lwd", 2) %>%                # Branches line width
  set("branches_k_color", mycol, k = 3) %>% # Color branches by groups
  set("labels_cex", 0.5)                    # Change label size
plot(dend.plot, ylab = "Bray-Curtis Distance", main = "why would clusters be different?")
I found a solution in the post linked below. It adds an intermediate step that rounds the height component of the hclust object, which works around differences between successive heights being too small or negative.
dend <- d %>%
  vegdist(method = "bray") %>%
  hclust(method = "ward.D")
dend$height <- round(dend$height, 6)
dend <- as.dendrogram(dend)
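To see why rounding can help, here is a minimal sketch on a made-up height vector (the numbers are illustrative and not taken from clusterdata.csv). It assumes the problem is a floating-point inversion, where one merge height comes out a hair below the previous one:

```r
# Toy merge heights (hypothetical values): the third is smaller than the
# second by about 1e-12, the kind of gap floating-point error can produce.
h <- c(0.1, 0.3, 0.3 - 1e-12, 0.5)

before <- is.unsorted(h)   # are the raw heights out of order?
gaps   <- diff(h)          # differences between successive heights

h_round <- round(h, 6)     # same rounding step as in the fix above
after   <- is.unsorted(h_round)

print(gaps)
print(c(before = before, after = after))
```

After rounding, the two near-identical heights become an exact tie, and is.unsorted() accepts ties by default, so the vector counts as sorted again.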
The original code worked fine with average linkage, but with Ward some differences between successive merge heights are extremely small. If you run the following, you'll notice the values are a lot more reasonable after rounding:
dend <- d %>%
  vegdist(method = "bray") %>%
  hclust(method = "ward.D")
dend$height                  # raw merge heights
diff(round(dend$height, 6))  # successive differences after rounding
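The same inspection can be wrapped in a small helper for any hclust object. This is only a sketch: check_heights is a hypothetical name, and the toy matrix stands in for the real data, with base R's dist() used in place of vegdist() so the block runs without extra packages.

```r
# Hypothetical helper: summarise the gaps between successive merge heights.
check_heights <- function(hc, digits = 6) {
  gaps <- diff(hc$height)
  list(
    unsorted   = is.unsorted(hc$height),         # TRUE if any height decreases
    n_negative = sum(gaps < 0),                  # number of inversions
    n_tiny     = sum(abs(gaps) < 10^(-digits)),  # gaps smaller than the rounding step
    min_gap    = min(gaps)                       # smallest (possibly negative) gap
  )
}

# Toy cover-class data: 12 plots x 5 species, values 0-4 (made up).
set.seed(1)
x  <- matrix(sample(0:4, 60, replace = TRUE), nrow = 12)
hc <- hclust(dist(x), method = "ward.D")

res <- check_heights(hc)
str(res)
```

If n_negative or n_tiny is nonzero for the Bray-Curtis tree, that would be consistent with the rounding fix; if both are zero, the cause may lie elsewhere.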
Referenced post: The 'height' component of 'tree' is not sorted Error in cutree
There is also an interesting discussion on Cross Validated about issues with combining hierarchical clustering (HC) with Bray-Curtis distance: HC With Bray Distance