I am trying to use the KMeans clustering from faiss on a human pose dataset of body joints. I have 16 body parts so a dimension of 32. The joints are scaled in a range between 0 and 1. My dataset consists of ~ 900.000 instances. As mentioned by faiss (faiss_FAQ):
As a rule of thumb there is no consistent improvement of the k-means quantizer beyond 20 iterations and 1000 * k training points
Applying this to my problem I randomly select 50000 instances for training. As I want to check for a number of clusters k between 1 and 30.
Now to my "problem":
The inertia is increasing directly as the number of cluster increases (n_cluster on the x-axis):
I tried varying the number of iterations, the number of redos, verbose and spherical, but the results stay the same or get worse. I do not think that it is a problem of my implementation; I tested it on a small example with 2D data and very clear clusters and it worked.
Is it that the data is just bad clustered or is there another problem/mistake I have missed? Maybe the scaling of the values between 0 and 1? Should I try another approach?
I found my mistake. I had to increase the parameter max_points_per_centroid. As I have so many data points it sampled a sub-batch for the fit. For a larger number of clusters this sub-batch is larger. See FAQ of faiss:
max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled
The larger subbatch of course has a larger inertia as there are more points in total.