pythonclistdictionaryminhash

Storing the result of Minhash


The result is a fixed number of arrays, let's say lists (all of the same length) in .

One could see it as a matrix too, so in I would use an array, where every cell would point to another array. How to do it in Python?

A list where every item is a list or something else?

I thought of a dictionary, but the keys are trivial, 1, 2, ..., M, so I am not sure if that is the pythonic way to go here.

I am not interested in the implementation, I am interested in which approach I should follow, in which choice I should made!


Solution

  • Whatever container you choose, it should contain hash-itemID pairs, and should be indexed or sorted by the hash. Unsorted arrays will not be remotely efficient.

    Assuming you're using a decent sized hash and your various hash algorithms are well-implemented, you should be able to just as effectively store all minhashes in a single container, since the chance of collision between a minhash from one algorithm and a minhash from another is negligible, and if any such collision occurs it won't substantially alter the similarity measure.

    Using a single container as opposed to multiple reduces the memory overhead for indexing, though it also slightly increases the amount of processing required. As memory is usually the limiting factor for minhash, a single container may be preferable.