Tags: r, cluster-analysis, pca, factoextra

PCA and cluster analysis: very slow computing


My data has 30,000 rows and 140 columns, and I am trying to cluster it. I run a PCA and then use about 12 PCs in the cluster analysis. I took a random sample of 3,000 observations, and running both the PCA and the hierarchical clustering took 44 minutes.

A co-worker did the same in SPSS and it took significantly less time. Any idea why?

Here is a simplified version of my code. It works fine but is really slow on anything over 2,000 observations. I included the USArrests dataset, which is really small, so it doesn't really represent my problem, but it shows what I'm trying to do. I'm hesitant to post a large dataset as that seems rude.

I'm not sure how to speed the clustering up. I know I can take a random sample of the data and then use a predict function to assign clusters to the held-out data. But ideally I'd like to use all of the data in the clustering, since the data is static and is never going to change or be updated.

library(factoextra)
library(FactoMineR)       
library(FactoInvestigate) 

## Data

# mydata = my real data: 30,000 rows and 140 variables.
# example data with small data set 
data("USArrests")
mydata <- USArrests

## Initial PCA on mydata

res.pca <- PCA(mydata, ncp=4, scale.unit=TRUE, graph = TRUE)

Investigate(res.pca)  # this report is very helpful! I determined to keep 12 PCs and start with 3 clusters for my real data.

## Keep the PCA results with only 2 PCs (for this small example)
res.pca1 <- PCA(mydata, ncp=2, scale.unit=TRUE, graph = TRUE)

## Run hierarchical clustering on the PCs: start with the suggested number of clusters
res.hcpc <- HCPC(res.pca1, nb.clust=4, graph = FALSE)

## Dendrogram
fviz_dend(res.hcpc,
          cex = 0.7, 
          palette = "jco",
          rect = TRUE, rect_fill = TRUE, 
          rect_border = "jco", 
          labels_track_height = 0.8 
)

## Cluster Viz
fviz_cluster(res.hcpc,
             geom = "point",  
             ellipse.type = "convex",
             #repel = TRUE, 
             show.clust.cent = TRUE, 
             palette = "jco", 
             ggtheme = theme_minimal(),
             main = "Factor map"
)


#### Cluster 1: Means of Variables
res.hcpc$desc.var$quanti$'1'

#### Cluster 2: Means of Variables
res.hcpc$desc.var$quanti$'2'

#### Cluster 3: Means of Variables
res.hcpc$desc.var$quanti$'3'

#### Cluster 4: Means of Variables
res.hcpc$desc.var$quanti$'4'

#### Number of Observations in each cluster
cluster_hd = res.hcpc$data.clust$clust
summary(cluster_hd)  

Any idea why SPSS is so much faster?

Any idea how to speed this up? I know clustering is computationally intensive, but I'm not sure where the efficiency threshold lies for my data of 30,000 records and 140 variables.

Are some of the other clustering packages more efficient? Suggestions?


Solution

  • HCPC performs hierarchical clustering on the principal components using the Ward criterion. You can use the k-means algorithm for the clustering part instead, which is way faster: hierarchical clustering has a time complexity of O(n³), whereas k-means is roughly O(n), where n is the number of observations.

    Since the criteria optimized by k-means and hierarchical clustering with Ward's method are the same (minimizing the total within-cluster variance), you can first run k-means with a high number of clusters (say 300, for instance) and then run hierarchical clustering on the cluster centers if you need to keep the hierarchical aspect.
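
    A minimal sketch of this two-step approach in base R, using the same small USArrests example (the variable names and the choice of 10 pre-clusters are mine; on 30,000 rows you would use something closer to the 300 suggested above):

    ```r
    set.seed(42)
    dat <- scale(USArrests)          # standardize the variables
    pca <- prcomp(dat)               # base-R PCA
    scores <- pca$x[, 1:2]           # keep the first 2 PCs (12 on the real data)

    # Step 1: k-means with many clusters (fast, roughly linear in n)
    km <- kmeans(scores, centers = 10, nstart = 10, iter.max = 100)

    # Step 2: Ward hierarchical clustering on the 10 centers only,
    # so the O(n^3) step runs on 10 points instead of 30,000
    hc <- hclust(dist(km$centers), method = "ward.D2")

    # Cut the tree into the final number of clusters, then map each
    # observation to a final cluster through its k-means center
    final <- cutree(hc, k = 4)
    obs_cluster <- final[km$cluster]
    table(obs_cluster)
    ```

    Note that, if I recall correctly, HCPC itself exposes a `kk` argument that performs exactly this kind of k-means preprocessing before the hierarchical step, so `HCPC(res.pca1, nb.clust = 4, kk = 300, graph = FALSE)` may get you the same speedup without leaving FactoMineR; check `?HCPC` to confirm.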