I'm having some trouble with Weka's XMeans clusterer. I've talked to a couple of other humans and we all agree that there are six clusters in the screenshot below, or at least a minimum of two if you squint really hard. Either way, xMeans does not seem to agree.
XMeans seems to systematically underestimate the number of clusters, based on whatever I have the minimum cluster count set to. With the maximum number of clusters held at 100, here are the results I get:
-L 1 // 1 cluster
-L 2 // 2 clusters
-L 3 // 3 clusters
-L 4 // 5 clusters
-L 5 // 6 clusters
-L 6 // 6 clusters
Most egregiously, with -L 1
(and -H 100
) only a single cluster is found. Only by getting the minimum cluster count to five do I actually see the six clusters. Cranking the improve-structure parameter way up (to 100,000) does not seem to have any effect. (I've also played with other options without seeing any difference.) Here are the options that generated the above screenshot, which found one center:
private static final String[] XMEANS_OPTIONS = {
"-H", "100", // max number of clusters (default 4)
"-L", "1", // min number of clusters (default 2)
"-I", "100", // max overall iterations (default 1)
"-M", "1000000", // max improve-structure iterations (default 1000)
"-J", "1000000", // max improve-parameter iterations (default 1000)
};
Obviously I'm missing something here. How do I make XMeans behave as expected?
I think it is what I was afraid of. A psychological problem (I think if you search for Gestalttheorie you might find some explanation on the perception aspect).
The human eye is grasping the form of the clusters and finds six circles. However, k-means and therefore x-means only looks at the distance. Hence, the cluster look rather awkward. Also after multiple restarts with 6-means I almost always achieved clusters like: This one might be a local minima which might be solved by xmeans
or for 3-means
Which is quite interesting as some points clearly violate the expectations.
If you use the K-Means in R you can analyse the withinss of the cluster result. These show that these awkward looking cluster typically perform quite well. Hence, there is no convergence to a different result to be expected.
I think this could be resolvable by using a different distance measure. For example the squared euclidean distance which enforce circle like shapes. Or using some kernel-based clustering technique with a RBF Kernel
===============================================
EDIT1:
However, one aspect of the results of weka are still rather awkward, I used RWeka to run a few experiments. Basically I did run 100 cluster-runs for each initial cluster size between 2
and 7
. My expectation was that for 2
and 6
the clusters are rather stable for the other cluster sizes I would have expected them to grow.
However, the results are differently
So basically 2
and 6
are fairly stable, however, the cluster sizes are always expended by at most 1.
So lets have a look at the BIC
What we can observe is that the BIC is not increasing when increasing the cluster size, however, strongly dependent on the initial cluster size.
Let's drill down a bit further and look at initial cluster size of 3. Running multiple restarts with different initial sizes generates (reproducible) the following two situations:
Nevertheless the results on the BIC seem to suggest that there is a BUG in the BIC calculation.