scalacluster-analysisapache-flinkminhash

How to cluster sets (users/documents) with distributed MinHash using the banding technique?


I have a big doubt about the way I should cluster sets using MinHash together with the banding technique.

I assume everyone reading has a good knowledge of MinHash so I won't define most of the terms I'm using.

My goal is to use MinHash to cluster users according to the similarity of their signatures. In a local, non-banded settings this would be trivial: if their signature hash is the same, they go in the same cluster.

If we split signatures in bands and process them indipendently, I can treat a band as I said before and generate a group of clusters for every band. My question is: how should I aggregate these clusters? Just merge them if they have at least an element in common? Or should I do something different?

Thanks


Solution

  • MinHash is not really meant as standalone clustering algorithm. It is meant as a candidate filter for near-duplicate detection.

    When looking for similar documents, you compute the minhashes to retrieve candidates. You then still need to check these candidates - they could be false positives! The more signatures agree, the more likely they really match.

    So if you consider the near-duplicate scenario again: if a is a near duplicate of b and b is a near duplicate of c, then a should also be a near duplicate of c. If this holds, you can throw all these matches (after verification) together. If it doesn't consider a hierarchical clustering like strategy to merge (or not merge) candidates.