python, scikit-learn, cluster-analysis, mean-shift

Generating high-dimensional datasets with Scikit-Learn


I am working with the Mean Shift clustering algorithm, which is based on a kernel density estimate of the dataset. I would like to generate a large, high-dimensional dataset, and the scikit-learn function make_blobs seemed suitable. But when I generate a 1-million-point, 8-dimensional dataset, almost every point ends up being treated as a separate cluster.

I am generating the blobs with standard deviation 1 and then setting the Mean Shift bandwidth to the same value (I think that makes sense, right?). For two-dimensional datasets this produced fine results, but for higher dimensions I think I'm running into the curse of dimensionality: the distances between points become too large for meaningful clustering. At a reduced scale, the setup looks like the sketch below.
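For reference, a sketch of the setup described above (scaled down to 5,000 points so it runs quickly; centers=10 is an arbitrary assumed value):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift

# 8-dimensional blobs with per-dimension stddev 1
X, y = make_blobs(n_samples=5_000, n_features=8, centers=10,
                  cluster_std=1.0, random_state=0)

# Bandwidth set equal to the per-dimension stddev, as described
ms = MeanShift(bandwidth=1.0, bin_seeding=True)
ms.fit(X)
print(ms.cluster_centers_.shape[0])  # many more clusters than the 10 generated
```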

Does anyone have tips or tricks for generating a good high-dimensional dataset that is suitable for (something like) Mean Shift clustering? Or am I doing something wrong, which is of course a real possibility?


Solution

  • The standard deviation of the clusters isn't 1.

    You have 8 dimensions, each with a stddev of 1, so the total (Euclidean) standard deviation of a cluster is sqrt(8) ≈ 2.83, not 1; a bandwidth of 1 therefore covers only a small fraction of each blob. (The first snippet below checks this numerically.)

    More generally, kernel density estimation works poorly on high-dimensional data because of this bandwidth problem: a bandwidth chosen per axis is far too small in the joint space, so almost every point sits alone in its own kernel window. Estimating the bandwidth from pairwise distances instead, as in the second snippet below, is one way around it.
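A quick numerical check of the sqrt(8) claim (a sketch; the 100,000-sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 8))  # one blob: stddev 1 per axis, 8 dims

# Root-mean-square distance from the blob's center
rms = np.sqrt((X ** 2).sum(axis=1).mean())
print(rms, np.sqrt(8))  # both come out near 2.83
```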
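Given that, one workaround is to pick the bandwidth from the data rather than from the per-axis stddev, e.g. with scikit-learn's estimate_bandwidth (the quantile and sample counts below are assumptions, not tuned values):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_blobs(n_samples=10_000, n_features=8, centers=5,
                  cluster_std=1.0, random_state=0)

# Data-driven bandwidth; comes out well above the per-axis stddev of 1
bw = estimate_bandwidth(X, quantile=0.1, n_samples=2_000, random_state=0)

ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(X)
print(bw, ms.cluster_centers_.shape[0])  # cluster count near the 5 generated
```

With a distance-based bandwidth the kernel windows actually contain neighbors, so the modes collapse back toward the generated blobs.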