Tags: r, memory, cluster-analysis, k-means, silhouette

Silhouette calculation in R for a large dataset


I want to calculate the silhouette for cluster evaluation. There are some packages in R for this, for example cluster and clValid. Here is my code using the cluster package:

# load the data
# a dataset from the UCI website with 434874 obs. and 3 variables
data <- read.csv("./data/spatial_network.txt", sep = "\t", header = FALSE)

# apply kmeans
km_res <- kmeans(data,20,iter.max = 1000,
               nstart=20,algorithm="MacQueen")

# calculate silhouette
library(cluster)   
sil <- silhouette(km_res$cluster, dist(data))

# plot silhouette
library(factoextra)
fviz_silhouette(sil)

The code works well for smaller data, say with 50,000 observations; however, I get an error like "Error: cannot allocate vector of size 704.5 Gb" when the dataset gets larger. This is presumably also a problem for the Dunn index and other internal indices on large datasets.

I have 32 GB of RAM on my computer. The problem comes from calculating dist(data). I am wondering whether it is possible to avoid computing dist(data) in advance and instead calculate the required distances on the fly, when they are needed in the silhouette formula.
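A quick back-of-the-envelope check (using only the numbers already mentioned above) confirms that the full distance matrix is the culprit: dist() stores n(n-1)/2 doubles.

n <- 434874
n * (n - 1) / 2 * 8 / 2^30   # ~704.5 GiB of doubles, matching the error message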

I would appreciate any help with this problem and with how to calculate the silhouette for large and very large datasets.


Solution

  • You can implement Silhouette yourself.

    It only needs every distance twice, so storing an entire distance matrix is not necessary. It may run a bit slower because it computes each distance twice, but at the same time the much better memory efficiency may well make up for that (see the sketch below this list).

    It will still take a LONG time though.

    You should also consider using only a subsample (do you really need to evaluate all points?) as well as alternatives such as the Simplified Silhouette, in particular with k-means... Such evaluation measures gain very little from extra data, so you may just use a subsample; sketches for both options follow below.
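    Below is a minimal sketch of such a self-implemented silhouette in plain R; the function name silhouette_lowmem and the choice of Euclidean distance are mine, not part of any package. It recomputes the distances from each point on the fly, so memory stays at O(n) per iteration while the runtime is O(n^2):

    silhouette_lowmem <- function(x, cl) {
      x <- as.matrix(x)
      n <- nrow(x)
      s <- numeric(n)
      for (i in seq_len(n)) {
        # Euclidean distances from point i to every point, computed on the fly
        d_i <- sqrt(rowSums(sweep(x, 2, x[i, ])^2))
        # mean distance from point i to each cluster (includes the 0 self-distance)
        m   <- tapply(d_i, cl, mean)
        own <- as.character(cl[i])
        nk  <- sum(cl == cl[i])
        if (nk > 1) {
          a <- m[own] * nk / (nk - 1)    # drop the self-distance from the own-cluster mean
          b <- min(m[names(m) != own])   # closest of the other clusters
          s[i] <- (b - a) / max(a, b)
        } else {
          s[i] <- 0                      # usual convention for singleton clusters
        }
      }
      s
    }

    # e.g. s <- silhouette_lowmem(data, km_res$cluster); mean(s) is the average width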
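    The Simplified Silhouette replaces the mean pairwise distances by distances to the cluster centroids, so only n*k distances are needed. A sketch of that idea, reusing km_res from the question (simplified_silhouette is again my own helper, not a package function):

    simplified_silhouette <- function(x, cl, centers) {
      x <- as.matrix(x)
      centers <- as.matrix(centers)
      # n x k matrix of distances from every point to every centroid
      d <- sapply(seq_len(nrow(centers)),
                  function(k) sqrt(rowSums(sweep(x, 2, centers[k, ])^2)))
      idx <- cbind(seq_len(nrow(x)), cl)
      a <- d[idx]              # distance to the own centroid
      d[idx] <- Inf            # mask the own cluster
      b <- apply(d, 1, min)    # distance to the nearest other centroid
      (b - a) / pmax(a, b)
    }

    ss <- simplified_silhouette(data, km_res$cluster, km_res$centers)
    mean(ss)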
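    And the simplest option, a subsample: evaluate the standard silhouette from the cluster package on a random subset only. The subset size of 10000 below is an arbitrary choice, and a(i)/b(i) are then computed within the subsample, so the result is an approximation:

    library(cluster)
    set.seed(1)
    idx <- sample(nrow(data), 10000)
    sil_sub <- silhouette(km_res$cluster[idx], dist(data[idx, ]))
    mean(sil_sub[, "sil_width"])
    fviz_silhouette(sil_sub)   # the plot from the question works unchanged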