[SOLVED] Choosing between SimHash and MinHash for a production system

Choosing between SimHash and MinHash for a production system

I'm familiar with the LSH (Locality Sensitive Hashing) techniques of SimHash and MinHash. SimHash uses cosine similarity over real-valued data. MinHash calculates resemblance similarity over binary vectors. But I can't decide which one would be better to use.

I am creating a backend system for a website to find near duplicates of semi-structured text data. For example, each record will have a title, location, and a brief text description (<500 words).

Specific language implementation aside, which algorithm would be best for a greenfield production system?

Solution

Simhash is faster (very fast) and typically requires less storage, but imposes a strict limitation on how dissimilar two documents can be and still be detected as duplicates. If you are using a 64-bit simhash (a common choice), and depending on how many permuted tables you are capable of storing, you might be limited to hamming distances of as low as 3 or possibly as high as 6 or 7. Those are small hamming distances! You'll be limited to detecting documents that are mostly identical, and even then you may need to do some careful tuning of what features you choose to go into the simhash and what weightings you give to them.

The generation of simhashes is patented by google, though in practice they seem to allow at least non-commercial use.

Minhash uses more memory, since you'd be typically storing 50-400 hashes per document, and it's not as CPU-efficient as simhash, but it allows you to find quite distant similarities, e.g. as low as 5% estimated similarity, if you want. It's also a bit easier to understand than simhash, particularly in terms of how the tables work. It's quite straightforward to implement, typically using shingling, and doesn't need a lot of tuning to get good results. It's not (to my knowledge) patented.

If you're dealing with big data, the most CPU-intensive part of the minhash approach will likely be after you've generated the minhashes for your document, when you're hunting through your table to find other documents that share some of its hashes. There may be tens or hundreds of thousands of documents that share at least one hash with it, and you've got to weed through all of these to find those few that share e.g. a minimum of half its hashes. Simhash is a lot quicker here.

As Otmar points out in his comment below, there are optimizations of minhash that allow you to achieve the same precision on your similarity estimates with fewer hashes per document. This can substantially reduce the amount of weeding you have to do.

Edit:

I have now tried superminhash. It's fairly fast, though my implementation of minhash using a single hash function plus bit-transformations to produce all the other hashes was faster for my purposes. It offers more accurate jaccard estimates, about 15% better under some situations I tested (though almost no difference under others). This should mean you need about a third fewer hashes to achieve the same accuracy. Storing fewer hashes in your table means less "weeding" is needed to identify near duplicates, which delivers a significant speed-up. I'm not aware of any patent on superminhash. Thanks Otmar!