Tags: performance, cluster-analysis, approximation, set-intersection, set-union

How Fast Can We Approximate Set Jaccard Scores?


I'm trying to compute about 1 trillion different pairwise Jaccard scores on a laptop. I'm also not very patient.

Each set tends to contain about 200-800 unique, pseudorandom 32-bit hashes. I've already minimized the size of each set.

Is there an approach that would allow me to trade precision for faster performance? I can probably tolerate an error rate of +/- 0.15.

To be clear, I'm not looking for a marginal performance boost from porting the same function to something like Numba. I'm looking for an alternative approach (perhaps matrix math?) that will boost performance by at least an order of magnitude, preferably more.

Here's an example that computes an error-free Jaccard score:

def jaccard(hash_set_1: set, hash_set_2: set) -> float:
    # An empty set on either side means zero similarity (and avoids a division by zero).
    if not hash_set_1 or not hash_set_2:
        return 0.0
    intersect = len(hash_set_1 & hash_set_2)
    union = len(hash_set_1) + len(hash_set_2) - intersect
    return intersect / union

For some additional context: I have roughly 100 million known sets, and some input sets of interest. For each of the sets of interest, I'm scoring against the 100 million known sets to generate clusters. As matches (scores above a threshold) are found, they are added to the input set. After fully exploring the space, I expect to find roughly 10K matches.

This operation gets repeated periodically and needs to be as fast as possible. Each time it is repeated, roughly 10%-25% of both the known sets and the input sets of interest will have aged out and been replaced with new data.


Solution

  • Two algorithms are well suited to this problem: MinHash and SimHash. MinHash has lower average variance and beats SimHash when the similarity scores of interest are below 0.5; SimHash has lower variance than MinHash when the scores of interest are above 0.5. Both are useful in different contexts. A sketch of the MinHash approach follows below.
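As a concrete illustration of the MinHash side, here is a minimal sketch using NumPy. It is not production code, and the constants K and P and the coefficient arrays A and B are illustrative choices, not part of the answer above. With K = 64 random hash functions, each signature position agrees between two sets with probability equal to the true Jaccard score, so the standard error of the estimate is at most sqrt(0.25 / 64) ≈ 0.06, comfortably inside the ±0.15 tolerance.

import numpy as np

# K random hash functions h_i(x) = (a_i * x + b_i) mod P.
# P is the smallest prime above 2**32; keeping a_i, b_i, and x below 2**32
# keeps every intermediate product within uint64 range.
P = np.uint64(4294967311)
K = 64

rng = np.random.default_rng(0)
A = rng.integers(1, 1 << 32, size=K, dtype=np.uint64)
B = rng.integers(0, 1 << 32, size=K, dtype=np.uint64)

def minhash_signature(hash_set) -> np.ndarray:
    """Return the K-element MinHash signature of a set of 32-bit hashes."""
    x = np.fromiter(hash_set, dtype=np.uint64)
    # Shape (K, n): each hash function applied to every element of the set.
    permuted = (A[:, None] * x[None, :] + B[:, None]) % P
    return permuted.min(axis=1)

def estimated_jaccard(sig_1: np.ndarray, sig_2: np.ndarray) -> float:
    # The fraction of positions where the two signatures agree is an
    # unbiased estimate of the true Jaccard similarity.
    return float(np.mean(sig_1 == sig_2))

Signatures only need to be computed once per set, and recomputed only for the 10%-25% that age out between runs. If the known-set signatures are stacked into a single (N, K) array (memory permitting, or processed in chunks), one input signature can be scored against all of them with a single vectorized expression such as (signature_matrix == sig).mean(axis=1), which is the kind of matrix-style batch computation the question hints at.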