Tags: hadoop, machine-learning, cluster-analysis, k-means, openimaj

K-Means clustering in the OpenIMAJ library


I'm not very experienced in machine learning and cluster analysis, but I have the following problem:

I have ~100kk-1000kk pieces of data which I cannot load into memory all at once, and I need to divide them into a number of classes (like 1-10k or even 100k classes) for further analysis. To do that I've chosen the K-Means algorithm implemented in the OpenIMAJ library (the FloatKMeans class). I understand that the K-Means algorithm can be divided into 2 phases:

  1. Learning phase - where I pass in all the data I have to create/fill the classes
  2. Assignment phase - where I can ask the cluster which class a given piece of data belongs to

I'm planning to build the cluster model using the Hadoop reduce phase, where I'll receive the data pieces one by one (that's why I cannot pass all the data to the algorithm at once).

My questions are:

Thanks for your help


Solution

  • K-Means clustering is an iterative algorithm that makes multiple passes over the data. In each pass, points are assigned to cluster centroids, and then, once all points have been assigned, the centroids are recomputed as the mean of their assigned points. You can't "stream" data to the algorithm in the traditional sense, as you'll need to come back to it during the subsequent iterations (the first sketch at the end of this answer illustrates one such iteration).

    Regarding the OpenIMAJ FloatKMeans implementation: yes, this can handle "big data" in the sense that it doesn't mind where it gets the data from - the DataSource instance that it takes as input can read data from disk if necessary. The only requirement is that you can hold all the centroids in memory for the duration of the algorithm. The implementation is multi-threaded, so all CPU cores can be used during computation. There is example code here: https://github.com/openimaj/openimaj/blob/master/demos/examples/src/main/java/org/openimaj/examples/ml/clustering/kmeans/BigDataClusterExample.java. The OpenIMAJ IOUtils.writeBinary(...) methods can be used to save the resultant cluster centroids held in the FloatCentroidsResult object (a usage sketch is included at the end of this answer).

    One of the biggest costs in K-Means is the computation of distances between each data point and each cluster centroid in order to find the closest one. The cost of this is related to the dimensionality of the data and the number of centroids. If you've got a large number of centroids and high-dimensional data, then using an approximate K-Means implementation can bring big speed benefits at the cost of a slight loss in accuracy (see FloatKMeans.createKDTreeEnsemble() for example -- this uses an ensemble of KD-Trees to speed up the nearest-neighbour computations; also sketched below).

    Regarding integration with Hadoop, it is possible to implement K-Means as a series of Map-Reduce tasks (each Map-Reduce pair corresponds to an iteration of the algorithm). See this paper for a discussion: http://eprints.soton.ac.uk/344243/1/paper.pdf. If you want to go down this route, OpenIMAJ has a very rough implementation here, which you could build off: https://github.com/openimaj/openimaj/tree/master/hadoop/tools/HadoopFastKMeans. As mentioned in the linked paper, Apache Mahout also contains an implementation: https://mahout.apache.org.

    One problem with both of these implementations is that they require quite a lot of data to be transferred between the mappers and the reducer (each mapper emits the current data point together with its assigned cluster id). Depending on the size of the dataset and the processing resources you have available, this overhead could make a non-Hadoop implementation of the algorithm faster overall. The amount of data transferred between the map and reduce phases could probably also be reduced with a clever Hadoop Combiner that computes weighted centroids from subsets of the data and then passes these to the (modified) reducer to compute the actual centroids (the last sketch below outlines this idea).
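
To make the iterative structure concrete, here is a minimal, self-contained sketch of a single K-Means (Lloyd) iteration in plain Java. This is not the OpenIMAJ implementation, just an illustration of why every iteration has to revisit every data point, which is what rules out a single streaming pass:

```java
// One Lloyd iteration: assign every point to its nearest centroid, then
// recompute each centroid as the mean of the points assigned to it.
static float[][] kmeansIteration(float[][] points, float[][] centroids) {
    final int k = centroids.length, d = centroids[0].length;
    final float[][] sums = new float[k][d];
    final int[] counts = new int[k];

    // Assignment step: every point is compared against every centroid
    for (final float[] p : points) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
            double dist = 0;
            for (int i = 0; i < d; i++) {
                final double diff = p[i] - centroids[c][i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        counts[best]++;
        for (int i = 0; i < d; i++)
            sums[best][i] += p[i];
    }

    // Update step: new centroid = mean of assigned points (unchanged if empty)
    final float[][] updated = new float[k][d];
    for (int c = 0; c < k; c++)
        for (int i = 0; i < d; i++)
            updated[c][i] = counts[c] > 0 ? sums[c][i] / counts[c] : centroids[c][i];

    return updated; // the caller repeats this until the centroids stop moving
}
```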
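For reference, here is a minimal sketch of the learning and assignment phases using the classes named above. It assumes a recent OpenIMAJ release; factory-method signatures such as FloatKMeans.createExact(int) have changed between versions, so treat the linked BigDataClusterExample as the authoritative reference. The class name, output file name and K value are just placeholders, and in your case the toy float[][] array would be replaced by a disk-backed DataSource<float[]>:

```java
import java.io.File;
import java.io.IOException;

import org.openimaj.io.IOUtils;
import org.openimaj.ml.clustering.FloatCentroidsResult;
import org.openimaj.ml.clustering.assignment.HardAssigner;
import org.openimaj.ml.clustering.kmeans.FloatKMeans;
import org.openimaj.util.pair.IntFloatPair;

public class KMeansSketch {
    public static void main(String[] args) throws IOException {
        // Toy data; in practice this would come from a DataSource<float[]> reading from disk
        final float[][] data = { { 0f, 0f }, { 1f, 1f }, { 10f, 10f }, { 11f, 11f } };

        // Learning phase: build K cluster centroids (K = 2 here; you'd use 1k-100k)
        final FloatKMeans kmeans = FloatKMeans.createExact(2);
        final FloatCentroidsResult centroids = kmeans.cluster(data);

        // Persist the learned centroids for later use
        IOUtils.writeBinary(new File("centroids.bin"), centroids);

        // Assignment phase: map a point to the id of its nearest centroid
        final HardAssigner<float[], float[], IntFloatPair> assigner = centroids.defaultHardAssigner();
        final int clusterId = assigner.assign(new float[] { 9f, 9f });
        System.out.println("assigned to cluster " + clusterId);
    }
}
```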
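If the exact implementation is too slow at your scale, switching to the approximate KD-Tree ensemble variant mentioned above should only change the factory call relative to the previous sketch (again assuming the same factory-method naming):

```java
// Approximate K-Means: an ensemble of KD-Trees accelerates the nearest-centroid
// search, trading a small amount of assignment accuracy for a large speed-up
// when K and the dimensionality are both large.
final FloatKMeans kmeans = FloatKMeans.createKDTreeEnsemble(10000);
final FloatCentroidsResult centroids = kmeans.cluster(data);
```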
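Finally, a rough sketch of the combiner idea from the last point, assuming a standard Hadoop MapReduce job; the class name and the value encoding here are made up for illustration. The mapper would emit each point keyed by the id of its nearest current centroid, with the value "1|x0,x1,...,xd-1"; the combiner below collapses those records into one partial sum and count per cluster (a "weighted centroid"); the reducer performs the same merge and then divides the summed vector by the total count to obtain the new centroid for the next iteration:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner for one K-Means iteration. Value format shared by map, combine and
// reduce: "count|v0,v1,...,vd-1", where the vector is a (partial) sum of points.
public class WeightedCentroidCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable clusterId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        double[] sum = null;

        // Merge all partial (count, sum) pairs for this cluster seen on this node
        for (final Text value : values) {
            final String[] parts = value.toString().split("\\|");
            final String[] comps = parts[1].split(",");
            if (sum == null)
                sum = new double[comps.length];
            for (int i = 0; i < comps.length; i++)
                sum[i] += Double.parseDouble(comps[i]);
            count += Long.parseLong(parts[0]);
        }

        // Emit a single weighted centroid per cluster instead of one record per point
        final StringBuilder out = new StringBuilder().append(count).append('|');
        for (int i = 0; i < sum.length; i++) {
            if (i > 0)
                out.append(',');
            out.append(sum[i]);
        }
        context.write(clusterId, new Text(out.toString()));
    }
}
```

Because the merge is associative, the same class can be registered via job.setCombinerClass(WeightedCentroidCombiner.class) and reused as the basis of the reducer, so roughly K records per map task cross the network instead of one record per data point.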