r cluster-analysis hierarchical-clustering dendrogram dendextend

specifying three cluster groups, but dendrogram clusters into four


I'm having a frustrating problem that I can't reproduce (I wish I could). I've generated dendrograms with three ecological datasets, using the same code but unique objects for each. Each leaf in the dendrograms is a survey plot, with species presence/abundance driving the clustering.

I cut the dendrogram into 3 groups and color-code each group. This works fine for all three datasets when clustering with Euclidean distance, and for two of my datasets when using Bray-Curtis distance. However, with Bray-Curtis the third dataset puts two leaves into a cluster of their own and forces the color code to recycle, creating k = 4 groups despite specifying k = 3.

My question is: why would two leaves (plots) be forced into their own 'cluster,' and force the dendrogram to have 4 clusters when I've specified k = 3 groups?

I've pasted below an example of the code, and images of the "correct" and "wrong" dendrograms. Curious if anyone has any troubleshooting suggestions, since I can't offer code that reproduces this error. TIA.

I've tried:

Example code (same format with unique objects used for each dendrogram figure). Please access csv from https://drive.google.com/file/d/12eXIXVuHTu4BLGxcGu18bqhT85ZOHkNW/view?usp=sharing. See file clusterdata.csv for the troublesome dataset. Colnames are species; rows are plot ID; values are cover class bins (0 = absent, 1 = < 25%, 2 = 25-50%, etc.)

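library(vegan)      # vegdist()
library(dendextend) # set(); also provides %>%
d <- read.csv("clusterdata.csv")

# Bray-Curtis distance, then Ward clustering, converted to a dendrogram
dend <- d %>% 
  vegdist(method = "bray") %>% 
  hclust(method = "ward.D") %>% 
  # cutree(h = 3) %>% 
  as.dendrogram()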

mycol <- c("#009E73", "#0072B2", "#E69F00")

dend.plot <-  as.dendrogram(dend) %>%
   set("branches_lwd", 2) %>% # Branches line width
   set("branches_k_color", mycol, k = 3) %>% # Color branches by groups
   set("labels_cex", 0.5) # Change label size
plot(dend.plot, ylab = "Bray-Curtis Distance", main = "why would clusters be different?")

[Figure: "correct" dendrogram - cut & color coded at three highest levels]
[Figure: "wrong" dendrogram - why are the two orange leaves their own cluster?!?]


Solution

  • I found a solution in the post linked below: add an intermediate step that rounds the height component of the hclust object. This gets around the differences between successive merge heights being too small or negative.

    dend <- d %>% 
    vegdist(method = "bray") %>% 
    hclust(method = "ward.D")
    dend$height <- round(dend$height, 6)
    dend <- as.dendrogram(dend)
    

    It worked fine with average linkage, but with Ward some differences between successive merge heights are extremely small. If you run this, you'll notice the values are a lot more reasonable after rounding:

    dend <- d %>% 
    vegdist(method = "bray") %>% 
    hclust(method = "ward.D")
    
    dend$height
    diff(round(dend$height,6))
    

    Post: "The 'height' component of 'tree' is not sorted Error in cutree"

    There is also an interesting discussion on Cross Validated about issues with combining hierarchical clustering and Bray-Curtis distance: HC With Bray Distance
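
To check whether unsorted merge heights are what is going on in a given tree, a small diagnostic along these lines may help. This is only a sketch: `check_heights` is a hypothetical helper (not part of dendextend, vegan, or base R), and the toy `height` vector stands in for the `height` component of a real `hclust` object.

```r
# Hypothetical helper: summarise the gaps between successive merge heights.
# It only reads the `height` component, so it accepts an hclust object
# or any list with a numeric `height` element.
check_heights <- function(hc) {
  gaps <- diff(hc$height)
  list(
    sorted     = !is.unsorted(hc$height), # FALSE if any height decreases
    n_negative = sum(gaps < 0),           # number of decreasing steps
    min_gap    = min(gaps)                # smallest step between merges
  )
}

# Toy stand-in for an hclust object (assumption: made-up heights in which
# the third merge is a hair lower than the second).
toy <- list(height = c(1, 2, 2 - 1e-12, 3))
before <- check_heights(toy)

# Same workaround as in the answer: round the heights, then re-check.
toy$height <- round(toy$height, 6)
after <- check_heights(toy)

str(before)
str(after)
```

On real data, pass the result of `hclust()` before converting it with `as.dendrogram()`. If `sorted` is `FALSE` or `n_negative` is above zero before rounding, and both clear up afterwards, the height ordering is a plausible cause of the extra colored cluster.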