I have an application where I need to implement K-means myself. I was trying my implementation in a small program to see how things are going; however, I find it really confusing that I get the exact same cluster centers every time I run the program. I am surprised because my cluster centers are randomly initialized, and I am using FLANN to update the clusters (both should make the output nondeterministic, right?). I'm probably missing something, and would appreciate some help.
cv::Mat kMeans(cv::Mat data, int k, int max_itr){
cv::Mat centroids = cv::Mat(cv::Size(data.cols, k), data.type());
std::vector<int> cluster_ids(data.rows, 0);
//randomly initialize centroids
for(int i = 0; i < k; i++)
{
int idx = rand() % data.rows;
data.row(idx).copyTo(centroids.row(i));
}
int itr = 0;
cv::flann::Index kdtree(centroids, cv::flann::KDTreeIndexParams(4));
cv::Mat indices, dists;
//TODO: add tolerance stopping condition
while (itr < max_itr)
{
//assign data points to centroids
for(int i = 0; i < data.rows; i++)
{
kdtree.knnSearch(data.row(i), indices, dists, 1, cv::flann::SearchParams(32));
cluster_ids[i] = indices.data[0];
}
//update centroids
for(int i = 0; i < centroids.rows; i++)
{
cv::Mat sum = cv::Mat::zeros(cv::Size(data.cols, 1), data.type());
int count = 0;
for(int j = 0; j < data.rows; j++)
{
if(cluster_ids[j] != i) {continue;}
sum += data.row(j);
count++;
}
centroids.row(i) = sum / float(count);
}
itr += 1;
}
return centroids;
}
int main()
{
cv::Mat_<float> data;
cv::Mat_<float> centroids;
cv::Mat dst;
//initialize with dummy data
for(int i = 0; i < 1000; i++)
{
cv::Mat row = (cv::Mat_<float>(1, 3) << float(i), float(i), float(i));
data.push_back(row);
}
centroids = kMeans(data, 20, 20);
std::cout << centroids << std::endl;
return 0;
}
You need this line:
std::srand(std::time(nullptr)); // use current time as seed for random generator
at the start of your program to get different sequences from the rand
function every time you run the program.
You can use some other seed if you want, but you still need to call srand
with some source of randomness at the start of the program.