I am doing agglomerative hierarchical clustering (AHC) with two R packages, FactoMineR and factoextra, after running PCA on the dataset. The two packages disagree on the cluster membership of at least one individual, and it could be more than one with other datasets.
This discrepancy is strange because I use the same distance metric ("euclidean") and the same "ward" method in both. It turns out that there is more than one version of the "ward" method: "ward.D", which is equivalent to the old "ward", and a newer one called "ward.D2". I tried both, and they gave the same clusters on this dataset. My question, however, is about why two R packages differ when both use the same "ward" method.
MWE: the temperature dataset from FactoMineR
temperature <- read.table("http://factominer.free.fr/bookV2/temperature.csv", header=TRUE,sep=";",dec=".",row.names=1)
PCA
library(FactoMineR)
res.pca <- PCA(temperature[1:23,],scale.unit=TRUE,ncp=Inf,graph=FALSE,quanti.sup=13:16,quali.sup=17)
Code
library(factoextra)
library(tidyverse)
## FactoMineR::HCPC()
set.seed(123)
res.hcpc <- FactoMineR::HCPC(res.pca, nb.clust = 3, graph = FALSE, method = "ward")
clustDF <- NULL
clustDF$PC1 <- res.pca$ind$coord[, 1]
clustDF$PC2 <- res.pca$ind$coord[, 2]
names <- row.names(res.pca$ind$coord)
clustDF <- as.data.frame(clustDF, row.names = names)
## factoextra::eclust("hclust")
res.eclust2 <- eclust(clustDF[c("PC1", "PC2")], "hclust", hc_metric = "euclidean", hc_method = "ward.D2", k = 3, graph = FALSE, seed = 123)
res.eclust <- eclust(clustDF[c("PC1", "PC2")], "hclust", hc_metric = "euclidean", hc_method = "ward.D", k = 3, graph = FALSE, seed = 123)
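# Collect the three partitions in one data frame
clustDF$eclust <- res.eclust$cluster
clustDF$eclust2 <- res.eclust2$cluster
clustDF$HCPC <- res.hcpc$data.clust[rownames(clustDF), "clust"]
# Relabel the eclust clusters so that their numbering lines up with HCPC's
clustDF <- clustDF %>%
  mutate(eclust = as.factor(eclust),
         eclust2 = as.factor(eclust2)) %>%
  mutate(eclust = fct_recode(eclust, "2" = "1", "3" = "2", "1" = "3")) %>%
  mutate(eclust2 = fct_recode(eclust2, "2" = "1", "3" = "2", "1" = "3")) %>%
  mutate(match = eclust == HCPC)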
Output
              PC1     PC2 eclust eclust2 HCPC match
Amsterdam   0.227 -1.3714      2       2    2  TRUE
Athens      7.601  0.9304      3       3    3  TRUE
Berlin     -0.288  0.0165      2       2    2  TRUE
Brussels    0.631 -1.1772      2       2    2  TRUE
Budapest    1.668  1.7127      2       2    2  TRUE
Copenhagen -1.462 -0.4921      2       2    2  TRUE
Dublin     -0.505 -2.6735      2       2    2  TRUE
Elsinki    -4.036  0.4620      1       1    1  TRUE
Kiev       -1.712  2.0076      2       2    1 FALSE
Krakow     -1.259  0.8750      2       2    2  TRUE
Lisbon      5.599 -1.5543      3       3    3  TRUE
London      0.058 -1.5738      2       2    2  TRUE
Madrid      4.064  0.6977      3       3    3  TRUE
Minsk      -3.238  1.3913      1       1    1  TRUE
Moscow     -3.463  2.1820      1       1    1  TRUE
Oslo       -3.306  0.3101      1       1    1  TRUE
Paris       1.420 -0.8976      2       2    2  TRUE
Prague     -0.109  0.6980      2       2    2  TRUE
Reykjavik  -4.704 -2.9572      1       1    1  TRUE
Rome        5.382  0.2937      3       3    3  TRUE
Sarajevo    0.163  0.3195      2       2    2  TRUE
Sofia       0.418  0.7951      2       2    2  TRUE
Stockholm  -3.149  0.0056      1       1    1  TRUE
Question
As you can see, the eclust function gave the same clusters with ward, ward.D and ward.D2. However, factoextra::eclust() differed from FactoMineR::HCPC() for one individual, Kiev. What explains this discrepancy?
Notes
factoextra::eclust is a wrapper for factoextra::hcut, which uses the stats::hclust() function.
FactoMineR::HCPC(), on the other hand, uses the flashClust::hclust() function for hierarchical clustering.
I reported this issue to the FactoMineR package here.
In the function HCPC, a k-means consolidation of the clusters is performed by default (consol=TRUE). This step improves the partition in the sense that the clusters are more homogeneous than those obtained by simply cutting the hierarchical tree. However, the clusters are then no longer consistent with the hierarchical tree because, as in the example, some individuals may change cluster. If you use consol=FALSE, the results are the same:
res.hcpc <- FactoMineR::HCPC(res.pca, nb.clust = 3, consol=FALSE, graph = FALSE, method = "ward")
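The effect of consolidation can be sketched in base R. This is only a rough analogue under stated assumptions, not HCPC's actual code: it uses the built-in USArrests data instead of the PCA coordinates, stats::hclust() with "ward.D2" for the tree, and stats::kmeans() started from the centres of the tree-cut clusters as a stand-in for the consolidation step. The helper wss() is a hypothetical name introduced here for illustration.

```r
# Illustration only: cut a Ward tree, then "consolidate" the partition with k-means.
x <- scale(USArrests)  # any numeric matrix would do here

# Step 1: Ward hierarchical clustering and a 3-cluster cut of the tree
tree <- hclust(dist(x, method = "euclidean"), method = "ward.D2")
cut_clusters <- cutree(tree, k = 3)

# Step 2: k-means started from the centres of the tree-cut clusters
# (assumed stand-in for what consol = TRUE does)
centres <- apply(x, 2, function(col) tapply(col, cut_clusters, mean))
km <- kmeans(x, centers = centres)
consolidated <- km$cluster

# Hypothetical helper: total within-cluster sum of squares of a partition
wss <- function(cl) {
  sum(sapply(split(as.data.frame(x), cl),
             function(g) sum(scale(g, scale = FALSE)^2)))
}

# Compare the two partitions and list individuals that changed cluster
print(table(tree_cut = cut_clusters, consolidated = consolidated))
print(rownames(x)[cut_clusters != consolidated])
print(c(tree_cut = wss(cut_clusters), consolidated = km$tot.withinss))
```

Any row that is reported as having changed cluster plays the role that Kiev plays in the question: it sits in one cluster according to the tree and in another after consolidation.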