Tags: python, r, bigdata, cluster-analysis, mixed

Clustering on large, mixed-type data


I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric, some are categorical, and there are occasional missing values. It is essential that the clustering is run on all data points, and we are looking to produce around 400,000 clusters (so subsampling the dataset is not an option).

I have looked at using Gower's distance metric for mixed-type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is simply not feasible to work with: that is 1.6 × 10^13 entries, or over 100 TB at 8 bytes each. So the method needs to avoid dissimilarity matrices entirely.

Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.

What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.

Plan B is to use a simple rules-based grouping method based on the categorical variables alone, handpicking only a few variables to group on, since otherwise we would suffer from the curse of dimensionality.
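For illustration, a minimal sketch of that fallback, assuming pandas; the file name and key columns are hypothetical placeholders:

    import pandas as pd

    df = pd.read_csv("data.csv")                      # 4 million x 70 in the real case
    key_cols = ["region", "product_type", "channel"]  # handpicked placeholders

    # Treat missing values as their own category so no rows are dropped.
    keys = df[key_cols].fillna("MISSING")

    # ngroup() assigns one integer label per unique combination of the keys.
    df["group"] = keys.groupby(key_cols).ngroup()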


Solution

  • The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all-numeric attributes on the same scale; a sketch of both steps follows.
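    As a rough sketch of those two steps (assuming pandas/scikit-learn; the file name and missing-value choices are placeholder assumptions):

        import pandas as pd
        from sklearn.preprocessing import StandardScaler

        df = pd.read_csv("data.csv")  # 4 million x 70 in the real case

        # Split the columns by type.
        num_cols = df.select_dtypes(include="number").columns
        cat_cols = df.select_dtypes(exclude="number").columns

        # Simple missing-value handling: median for numeric columns,
        # a sentinel category for categorical ones.
        df[num_cols] = df[num_cols].fillna(df[num_cols].median())
        df[cat_cols] = df[cat_cols].fillna("MISSING")

        # One-hot encode the categoricals, then put everything on the same scale.
        X = pd.get_dummies(df, columns=list(cat_cols))
        X_scaled = StandardScaler().fit_transform(X)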

    Clustering is computationally expensive, so you might add a third step of reducing the columns by representing the data with the top 10 components of a PCA (or however many components have an eigenvalue > 1).
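    Continuing the sketch above, the "eigenvalue > 1" rule (the Kaiser criterion) can be read off scikit-learn's PCA: because X_scaled is standardized, explained_variance_ holds the eigenvalues of the correlation matrix.

        from sklearn.decomposition import PCA

        # Fit a full PCA once to inspect the eigenvalues.
        pca = PCA().fit(X_scaled)
        n_keep = int((pca.explained_variance_ > 1).sum())  # Kaiser criterion

        # Keep only the leading components (they are sorted by variance).
        X_reduced = pca.transform(X_scaled)[:, :n_keep]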

    For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you: even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way up to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some experimentation to figure out), since it runs in Java.

    ELKI's output is a little strange: it writes one file per cluster, so you unfortunately have to loop through the files in Python afterwards to create your final mapping. But it's all doable from Python (including executing the ELKI command) if you're building an automated pipeline, as sketched below.
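    Here is a rough sketch of driving ELKI from Python and stitching its per-cluster output files into a single mapping. The jar name, the algorithm parameters, and the ID=<n> line format are assumptions that vary by ELKI version, so check them against the ELKI documentation before relying on this.

        import re
        import subprocess
        from pathlib import Path

        out_dir = Path("elki_out")

        # Run OPTICSXi through ELKI's command-line interface. The jar path and
        # parameter flags are assumptions; verify them for your ELKI version.
        subprocess.run([
            "java", "-jar", "elki.jar", "KDDCLIApplication",
            "-dbc.in", "data_reduced.csv",
            "-algorithm", "clustering.optics.OPTICSXi",
            "-optics.minpts", "50",
            "-opticsxi.xi", "0.05",
            "-resulthandler", "ResultWriter",
            "-out", str(out_dir),
        ], check=True)

        # ELKI writes one text file per cluster; each row carries a database ID.
        # The ID=<n> pattern is an assumption about the ResultWriter format.
        mapping = {}
        for cluster_file in sorted(out_dir.glob("cluster*.txt")):
            for line in cluster_file.read_text().splitlines():
                match = re.search(r"ID=(\d+)", line)
                if match:
                    mapping[int(match.group(1))] = cluster_file.stem

        # mapping now sends each point's ELKI ID to a cluster label.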