Structure and an algorithm for grouping a large set of pairwise image similarity distances (C++)

Tags: image, grouping, distance, similarity, pairwise


I want to find similar images in a very large dataset (at least 50K+ images, potentially far more). I have already implemented several "distance" functions (hashes compared with L2 or Hamming distance, image features with a percentage of similarity, etc.); the result is always a double.
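For reference, here is a minimal sketch of one such distance function, assuming OpenCV is built with the contrib img_hash module (a lower result means more similar images):

```cpp
// Minimal sketch of one "distance" function: perceptual hash + Hamming distance.
// Assumes OpenCV was built with the contrib img_hash module.
#include <opencv2/core.hpp>
#include <opencv2/img_hash.hpp>

double phashDistance(const cv::Mat& imgA, const cv::Mat& imgB)
{
    auto hasher = cv::img_hash::PHash::create();
    cv::Mat hashA, hashB;
    hasher->compute(imgA, hashA);
    hasher->compute(imgB, hashB);
    // compare() returns the Hamming distance between the two hashes as a double
    return hasher->compare(hashA, hashB);
}
```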

What I want now is to "group" (cluster?) images by similarity. I have already achieved some pretty good results, but the groups are not perfect: some images that could be grouped with others are left aside, so my method is not good enough.

I've been looking for a solution for the last 3 days, but things are not so clear in my head. Maybe I overlooked a possible method?

I already have image pairs with a distance: [image A (index, int), image B (index, int), distance (double)], and a list of duplicates (image X is similar to images Y, Z, T; image Y is similar to X, T, G, F; etc.).
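Roughly, the data looks like this (a minimal sketch; the names are just illustrative):

```cpp
#include <unordered_map>
#include <vector>

// One scored pair, as produced by the distance functions above.
struct ImagePair {
    int    imageA;    // index of the first image
    int    imageB;    // index of the second image
    double distance;  // similarity distance (lower = more similar)
};

std::vector<ImagePair> pairs;                          // all scored pairs
std::unordered_map<int, std::vector<int>> duplicates;  // image index -> indices of similar images
```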

My environment:

I'm coding in C++, using Qt6 for the interface and OpenCV 4.6 for some image functions, some hashing methods, etc.

Any idea/library/structure to propose? Thanks in advance.

EDIT - to better explain what I want to achieve

Example

In the diagram, the images are the yellow circles. Image 1 is similar to image 4 with a score of 3, and to image 5 with a score of 2, etc.

The problem is that image 4 is also similar to image 5, while image 4 is more similar to image 1 (score 3) than image 5 is (score 2). The example here is very simple, because each image has no more than 2 similar images. With a bigger sample, image 4 could be similar to n images... And what about equal scores?

So, is there an algorithm to create groups of images such that no image is listed twice?
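For illustration, one common way to obtain non-overlapping groups from such pairs is a union-find (disjoint-set) over every pair whose distance falls below a threshold. The sketch below reuses the ImagePair struct from above; the threshold handling is only an example, not necessarily the right method:

```cpp
#include <numeric>
#include <unordered_map>
#include <vector>

// Minimal disjoint-set (union-find) with path compression.
struct DisjointSet {
    std::vector<int> parent;
    explicit DisjointSet(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Groups images so that each image appears in exactly one group: two images
// end up together if they are connected by pairs under the threshold.
std::unordered_map<int, std::vector<int>> groupImages(const std::vector<ImagePair>& pairs,
                                                      int imageCount, double threshold)
{
    DisjointSet ds(imageCount);
    for (const auto& p : pairs)
        if (p.distance <= threshold)
            ds.unite(p.imageA, p.imageB);

    std::unordered_map<int, std::vector<int>> groups;  // group root -> member image indices
    for (int i = 0; i < imageCount; ++i)
        groups[ds.find(i)].push_back(i);
    return groups;
}
```

Note that this kind of transitive grouping can chain weakly similar images together, which is exactly where better similarity scores and refined thresholds matter.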


Solution

  • The answer to my own question:

    Many thanks to @Similar_Pictures for taking the time to answer me, and for opening my eyes to the fact that the better the similarity algorithm(s), the less need there is for complicated clustering techniques...

    I am currently testing how to combine several similarity techniques: each one has its shortcomings, but together some of them work better, using refined thresholds.
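As an illustration only, the combination currently looks something like this; the threshold values are placeholders, and featureSimilarity() stands for any of the other measures:

```cpp
#include <opencv2/core.hpp>

// phashDistance() is the sketch from earlier in this post; featureSimilarity()
// is a hypothetical placeholder for another measure returning a % of similarity.
double phashDistance(const cv::Mat& imgA, const cv::Mat& imgB);
double featureSimilarity(const cv::Mat& imgA, const cv::Mat& imgB);

// Illustrative combination of two measures with refined thresholds
// (the threshold values are placeholders, not the final tuning).
bool areLikelyDuplicates(const cv::Mat& imgA, const cv::Mat& imgB)
{
    const double hashDist   = phashDistance(imgA, imgB);     // Hamming distance, lower = closer
    const double featureSim = featureSimilarity(imgA, imgB); // % similarity, higher = closer

    // A very strict hash match is accepted on its own; otherwise both
    // measures must agree within looser thresholds.
    if (hashDist <= 2.0)
        return true;
    return hashDist <= 8.0 && featureSim >= 90.0;
}
```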