I have two implementations: one is K-Means and the other is EM doing soft clustering. But I do not know how to validate them in terms of accuracy, i.e. which one performs better by retrieving better clusters. My assumption is that EM performs better because it makes soft assignments instead of the hard ones K-Means makes, but I do not know how to do this comparison...
How can I benchmark the accuracy of EM soft clustering vs. K-Means? Also, any suggestions for the synthetic data?
Evaluating fuzzy clustering is itself hard. I believe I have seen a fuzzy variation of one of the common evaluation indexes somewhere.
But first try to answer this question:
Clustering algorithms are supposed to be an exploratory tool, so can you really judge their performance on synthetic, labeled data at all? Or shouldn't you measure quality by going out into the "field" and trying to learn something new from the data?
This is not a math problem.
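EM, because of its fuzzy assignments, should be less likely than k-means to get stuck in a local minimum. At least in theory. At the same time, it never converges exactly; in practice you stop it once the change between iterations falls below a threshold. Lloyd's k-means must converge (with squared Euclidean distance, not necessarily with other distances) because of a finiteness argument: there are only finitely many ways to assign the points to clusters, and the objective never increases. The same argument no longer holds for fuzzy algorithms.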
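To make the finiteness argument concrete, here is a minimal sketch of Lloyd's algorithm in NumPy. The function name `lloyd`, the seed, and the toy data are my own choices for illustration, not anything from your implementation. The loop stops at the exact condition "no point changed its cluster", which must eventually happen because the squared-error objective never increases and there are only finitely many assignments. An EM loop has no such exact condition, so it is usually stopped on a tolerance instead.

```python
import numpy as np

def lloyd(X, k, seed=0, max_iter=1000):
    """Plain Lloyd's k-means; stops when no point changes its cluster."""
    rng = np.random.default_rng(seed)
    # Hypothetical initialization: k distinct data points picked at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    sse_history = []
    for it in range(max_iter):
        # Assignment step: nearest center under squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            return labels, centers, it, sse_history  # exact fixed point
        labels = new_labels
        # Update step: each non-empty cluster's center moves to its mean.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        sse_history.append(float(((X - centers[labels]) ** 2).sum()))
    return labels, centers, max_iter, sse_history

# Toy data (an assumption for this sketch): three Gaussian blobs in 2D.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 2))
               for c in ([0, 0], [6, 0], [3, 5])])

labels, centers, n_iter, sse_history = lloyd(X, k=3)
print("stopped after", n_iter, "iterations")
```

The recorded `sse_history` should be non-increasing, which is the other half of the argument.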
Maybe try constructing a scenario where k-means does not yield the optimal solution, and then check whether EM yields better results.
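A rough sketch of such an experiment, using scikit-learn's `KMeans` and `GaussianMixture` as stand-ins for your own two implementations (that substitution is an assumption on my part). The data are Gaussian blobs sheared by a matrix I picked arbitrarily so that the clusters become elongated, which violates the spherical-cluster assumption of k-means. The soft EM assignments are hardened with an argmax so both results can be scored against the known labels with the adjusted Rand index. One would expect the mixture model to do better here, but treat that as something to check, not a given.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic labeled data: three blobs, then sheared so they are elongated.
# The seed and the shear matrix are arbitrary choices for this sketch.
X, y_true = make_blobs(n_samples=1500, centers=3, random_state=170)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# Hard clustering with k-means.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Soft clustering with EM on a Gaussian mixture with full covariances.
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      n_init=5, random_state=0).fit(X)
soft = gmm.predict_proba(X)        # one row of membership probabilities per point
em_labels = soft.argmax(axis=1)    # harden: pick the most probable component

# The adjusted Rand index ignores how the cluster ids are numbered.
ari_km = adjusted_rand_score(y_true, km_labels)
ari_em = adjusted_rand_score(y_true, em_labels)
print(f"ARI k-means: {ari_km:.3f}")
print(f"ARI EM:      {ari_em:.3f}")
```

Hardening throws away the soft information, so this only compares the two methods as hard clusterings. Repeating the run over several seeds and data shapes gives a fairer picture than a single number.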