r, performance, machine-learning, data-mining, expectation-maximization

A framework for comparing the time performance of Expectation Maximization


I have my own implementation of the Expectation Maximization (EM) algorithm based on this paper, and I would like to compare its time performance with that of another implementation. For the tests, I am using k centroids with 1 GB of text data, and I am measuring only the time it takes to compute the new centroids in one iteration. I tried an EM implementation in R, but I couldn't use it: the result is plotted in a graph, and it gets stuck when there is a large amount of text data. I was following the examples here.

Does anybody know of an EM implementation whose performance can be measured this way, or know how to do it in R?


Solution

  • Fair benchmarking of EM is hard. Very hard.

    1. The initialization usually involves randomness and can differ a lot between implementations. As far as I know, the R implementation by default uses hierarchical clustering to find the initial clusters, which comes at O(n^2) memory and most likely O(n^3) runtime cost. In my benchmarks, R would run out of memory because of this. I assume there is a way to specify initial cluster centers/models (see the first sketch after this list). A random-objects initialization will of course be much faster. In practice, k-means++ is probably a good way to choose initial centers.

    2. EM theoretically never terminates. It just at some point does not change much anymore, so you set a threshold to stop. However, the exact definition of the stopping threshold varies between implementations (a log-likelihood-based variant appears in the iteration sketch below).

    3. There exist all kinds of model variations. A method using only fuzzy assignments, such as fuzzy c-means, will of course be much faster than an implementation using multivariate Gaussian mixture models with covariance matrices, in particular with higher dimensionality. Covariance matrices also need O(k * d^2) memory, and the inversion will take O(k * d^3) time, and thus they are clearly not appropriate for text data.

    4. Data may or may not be appropriate. If you run EM on a data set that actually has Gaussian clusters, it will usually work much better than on a data set that doesn't provide a good fit at all. When there is no good fit, you will see a high variance in runtime even with the same implementation.
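
    To make point 1 concrete: assuming the R implementation in question is the mclust package, its default model-based hierarchical initialization can be restricted to a random subset of the data, which sidesteps the O(n^2) memory cost. A minimal sketch with placeholder data (the matrix X and the value of k are illustrative):

    ```r
    library(mclust)

    ## Placeholder data: substitute your own numeric matrix.
    set.seed(42)
    X <- matrix(rnorm(50000 * 10), ncol = 10)
    k <- 5

    ## mclust's default initialization runs model-based hierarchical
    ## clustering on all n points (O(n^2) memory). The 'subset' option
    ## restricts that phase to a random sample, which scales far better.
    idx <- sample(nrow(X), 1000)
    elapsed <- system.time(
      fit <- Mclust(X, G = k, initialization = list(subset = idx))
    )
    print(elapsed)
    ```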

    As a starting point, try running your own algorithm several times with different initializations, and check the runtime for variance. How large is the variance compared to the total runtime?
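
    One way to do this, and to time a single E-step/M-step in isolation, is a hand-rolled iteration. Below is a minimal sketch for a spherical Gaussian mixture; em_iteration and all variable names are illustrative, and the log-likelihood it returns is what a stopping threshold (point 2) would be compared against. The loop at the end times one iteration under several random initializations and reports the spread:

    ```r
    ## Minimal sketch: one EM iteration for a spherical Gaussian mixture.
    ## (No log-sum-exp guard against underflow, for brevity.)
    em_iteration <- function(X, mu, sigma2, w) {
      n <- nrow(X); d <- ncol(X); k <- nrow(mu)
      ## E-step: unnormalized responsibilities w_j * N(x_i; mu_j, sigma2_j * I)
      r <- sapply(1:k, function(j)
        w[j] * (2 * pi * sigma2[j])^(-d / 2) *
          exp(-rowSums(sweep(X, 2, mu[j, ])^2) / (2 * sigma2[j])))
      loglik <- sum(log(rowSums(r)))  # drives the stopping threshold
      r <- r / rowSums(r)
      ## M-step: re-estimate weights, means, and spherical variances
      nj <- colSums(r)
      mu <- t(sapply(1:k, function(j) colSums(r[, j] * X) / nj[j]))
      sigma2 <- sapply(1:k, function(j)
        sum(r[, j] * rowSums(sweep(X, 2, mu[j, ])^2)) / (d * nj[j]))
      list(mu = mu, sigma2 = sigma2, w = nj / n, loglik = loglik)
    }

    ## Placeholder data, then one timed iteration per random initialization.
    X <- matrix(rnorm(50000 * 10), ncol = 10)
    k <- 5
    times <- sapply(1:10, function(seed) {
      set.seed(seed)
      mu <- X[sample(nrow(X), k), , drop = FALSE]
      system.time(em_iteration(X, mu, rep(1, k), rep(1 / k, k)))["elapsed"]
    })
    cat(sprintf("mean %.3fs, sd %.3fs\n", mean(times), sd(times)))
    ```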

    You can try benchmarking against the EM implementation in ELKI. But I doubt the implementation will work with sparse data such as text - that data just is not Gaussian, so it is not a proper benchmark. Most likely it will not be able to process the data at all because of this. This is expected, and can be explained from theory. Try to find data sets that are dense and that can be expected to have multiple Gaussian clusters (sorry, I can't give you many recommendations here; the classic Iris and Old Faithful data sets are too small to be useful for benchmarking).
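
    If no suitable real data set turns up, one fallback is to generate a dense synthetic set with known Gaussian clusters. A purely illustrative generator (the function name and parameters are my own, not from any package):

    ```r
    ## Illustrative generator: n points in d dimensions drawn from k
    ## unit-variance Gaussian clusters with uniformly random centers.
    make_gaussian_mixture <- function(n, k, d, spread = 10) {
      centers <- matrix(runif(k * d, 0, spread), nrow = k)
      labels  <- sample(k, n, replace = TRUE)
      X <- centers[labels, ] + matrix(rnorm(n * d), nrow = n)
      list(X = X, labels = labels)
    }

    bench <- make_gaussian_mixture(n = 100000, k = 5, d = 10)
    ```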