
Why do I get different clustering between FactoMineR and factoextra packages in R given I use the same metric and method?


I am doing agglomerative hierarchical clustering (AHC) after PCA using two R packages, FactoMineR and factoextra, and the two disagree on at least one cluster membership. With other datasets the disagreement could affect more than one observation.

This discrepancy is a bit strange because I use the same distance metric ("euclidean") and the same "ward" method in both. It turns out that there is more than one version of the "ward" method: "ward.D", which is equivalent to the old "ward", and a newer one called "ward.D2". I tried both in eclust and they gave identical clusters on this dataset. However, my question is about the difference between the two R packages when they use the same "ward" method.
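For reference, the relationship between the two Ward variants in stats::hclust() can be checked directly. The sketch below uses the built-in USArrests data as a stand-in (an assumption, it is not the temperature dataset) and compares "ward.D2" on Euclidean distances with "ward.D" on squared distances, which are expected to give the same partition.

```r
# Stand-in data (assumption): built-in USArrests, standardised
d <- dist(scale(USArrests), method = "euclidean")

hc_D    <- hclust(d,   method = "ward.D")   # the old "ward"
hc_D2   <- hclust(d,   method = "ward.D2")  # Ward's criterion on unsquared input
hc_D_sq <- hclust(d^2, method = "ward.D")   # "ward.D" fed squared distances

# Same 3-cluster partition from "ward.D2" and from "ward.D" on d^2?
same_partition <- identical(cutree(hc_D_sq, k = 3), cutree(hc_D2, k = 3))
print(same_partition)

# Cross-tabulate plain "ward.D" against "ward.D2" to see where they differ, if at all
print(table(ward.D = cutree(hc_D, k = 3), ward.D2 = cutree(hc_D2, k = 3)))
```

Plain "ward.D" on unsquared distances can still agree with "ward.D2" on a given dataset, as it did here, but that agreement is not guaranteed in general.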

MWE: the temperature dataset from FactoMineR

temperature <- read.table("http://factominer.free.fr/bookV2/temperature.csv", header=TRUE,sep=";",dec=".",row.names=1)

PCA

res.pca <- FactoMineR::PCA(temperature[1:23, ], scale.unit = TRUE, ncp = Inf, graph = FALSE, quanti.sup = 13:16, quali.sup = 17)

Code

library(FactoMineR)
library(factoextra)
library(tidyverse)

## FactoMineR::HCPC()
set.seed(123)
res.hcpc <- FactoMineR::HCPC(res.pca, nb.clust = 3, graph = FALSE, method = "ward")

clustDF <- NULL
clustDF$PC1 <- res.pca$ind$coord[, 1]
clustDF$PC2 <- res.pca$ind$coord[, 2]

names <- row.names(res.pca$ind$coord)
clustDF <- as.data.frame(clustDF, row.names = names)

## factoextra::eclust("hclust")
res.eclust2 <- eclust(clustDF[c("PC1", "PC2")], "hclust", hc_metric = "euclidean", hc_method = "ward.D2", k = 3, graph = FALSE, seed = 123)
res.eclust <- eclust(clustDF[c("PC1", "PC2")], "hclust", hc_metric = "euclidean", hc_method = "ward.D", k = 3, graph = FALSE, seed = 123)

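## collect the cluster labels from each result, aligned by city name
clustDF$eclust  <- res.eclust$cluster[rownames(clustDF)]
clustDF$eclust2 <- res.eclust2$cluster[rownames(clustDF)]
clustDF$HCPC    <- res.hcpc$data.clust[rownames(clustDF), "clust"]
## relabel the eclust clusters so the numbering matches HCPC, then compare
clustDF <- clustDF %>%
    mutate(eclust = as.factor(eclust),
           eclust2 = as.factor(eclust2)) %>%
    mutate(eclust = fct_recode(eclust, "2" = "1", "3" = "2", "1" = "3")) %>%
    mutate(eclust2 = fct_recode(eclust2, "2" = "1", "3" = "2", "1" = "3")) %>%
    mutate(match = as.character(eclust) == as.character(HCPC))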

Output

               PC1     PC2 eclust eclust2 HCPC match
Amsterdam    0.227 -1.3714      2       2    2  TRUE
Athens       7.601  0.9304      3       3    3  TRUE
Berlin      -0.288  0.0165      2       2    2  TRUE
Brussels     0.631 -1.1772      2       2    2  TRUE
Budapest     1.668  1.7127      2       2    2  TRUE
Copenhagen  -1.462 -0.4921      2       2    2  TRUE
Dublin      -0.505 -2.6735      2       2    2  TRUE
Elsinki     -4.036  0.4620      1       1    1  TRUE
Kiev        -1.712  2.0076      2       2    1 FALSE
Krakow      -1.259  0.8750      2       2    2  TRUE
Lisbon       5.599 -1.5543      3       3    3  TRUE
London       0.058 -1.5738      2       2    2  TRUE
Madrid       4.064  0.6977      3       3    3  TRUE
Minsk       -3.238  1.3913      1       1    1  TRUE
Moscow      -3.463  2.1820      1       1    1  TRUE
Oslo        -3.306  0.3101      1       1    1  TRUE
Paris        1.420 -0.8976      2       2    2  TRUE
Prague      -0.109  0.6980      2       2    2  TRUE
Reykjavik   -4.704 -2.9572      1       1    1  TRUE
Rome         5.382  0.2937      3       3    3  TRUE
Sarajevo     0.163  0.3195      2       2    2  TRUE
Sofia        0.418  0.7951      2       2    2  TRUE
Stockholm   -3.149  0.0056      1       1    1  TRUE

Question

As you can see, eclust gave the same clusters whichever method was used (ward, ward.D or ward.D2). However, factoextra::eclust() differed from FactoMineR::HCPC() in one case, Kiev. What is the explanation behind this discrepancy?

Notes

factoextra::eclust is a wrapper for factoextra::hcut which uses stats::hclust() function

FactoMineR::HCPC() on the other hand uses flashClust::hclust() function for hierarchical clustering

I reported this issue to FactoMineR package here.


Solution

  • In the function HCPC, a consolidation of the clusters (a k-means step) is performed by default (consol=TRUE). This improves the partition in the sense that the clusters are more homogeneous than those obtained by simply cutting the hierarchical tree. However, the clusters are then no longer consistent with the hierarchical tree because, as in this example, some individuals may change cluster. If you use consol=FALSE, the results are the same:

    res.hcpc <- FactoMineR::HCPC(res.pca, nb.clust = 3, consol=FALSE, graph = FALSE, method = "ward")
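To illustrate what consolidation does, here is a minimal base-R sketch, not HCPC's actual code. It assumes consolidation amounts to running k-means started from the centroids of the tree-cut clusters, and it uses only the PC1/PC2 coordinates printed in the question's output table (HCPC itself worked on all components, so the numbers need not match exactly). The helper `within_ss` is a hypothetical name introduced for this example.

```r
# PC1/PC2 coordinates copied from the Output table in the question
coords <- data.frame(
  PC1 = c(0.227, 7.601, -0.288, 0.631, 1.668, -1.462, -0.505, -4.036,
          -1.712, -1.259, 5.599, 0.058, 4.064, -3.238, -3.463, -3.306,
          1.420, -0.109, -4.704, 5.382, 0.163, 0.418, -3.149),
  PC2 = c(-1.3714, 0.9304, 0.0165, -1.1772, 1.7127, -0.4921, -2.6735, 0.4620,
          2.0076, 0.8750, -1.5543, -1.5738, 0.6977, 1.3913, 2.1820, 0.3101,
          -0.8976, 0.6980, -2.9572, 0.2937, 0.3195, 0.7951, 0.0056),
  row.names = c("Amsterdam", "Athens", "Berlin", "Brussels", "Budapest",
                "Copenhagen", "Dublin", "Elsinki", "Kiev", "Krakow", "Lisbon",
                "London", "Madrid", "Minsk", "Moscow", "Oslo", "Paris",
                "Prague", "Reykjavik", "Rome", "Sarajevo", "Sofia", "Stockholm")
)

# Step 1: Ward tree on Euclidean distances, cut into 3 clusters
hc <- hclust(dist(coords), method = "ward.D2")
tree_cut <- cutree(hc, k = 3)

# Step 2 (assumed form of consolidation): k-means started from the
# centroids of the tree-cut clusters, so cluster labels stay aligned
centers <- apply(coords, 2, function(col) tapply(col, tree_cut, mean))
km <- kmeans(coords, centers = centers)
consolidated <- km$cluster

# Hypothetical helper: total within-cluster sum of squares of a partition
within_ss <- function(x, cl) {
  sum(sapply(split(x, cl), function(g) sum(scale(as.matrix(g), scale = FALSE)^2)))
}

comparison <- data.frame(tree_cut, consolidated,
                         changed = tree_cut != consolidated,
                         row.names = rownames(coords))
print(comparison[comparison$changed, ])   # rows that moved, if any
print(c(tree_cut = within_ss(coords, tree_cut),
        consolidated = within_ss(coords, consolidated)))
```

Any row listed as changed no longer sits where the dendrogram puts it, which is the trade-off described above: the consolidated partition cannot have a larger within-cluster sum of squares than the tree cut, but it may stop agreeing with the tree.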