Tags: string, hash, similarity, simhash

Simhash-like algorithm to compare two text documents


The problem: I have a collection of text documents, and I want to pick the one most similar to an input document. The input may match a document in the collection exactly, or it may be a partly modified version of one. The algorithm must be very fast.

So far I have found simhash, which I can use to compute a fingerprint for each document in the collection. Are there other algorithms that do the same thing?
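For reference, here is a minimal sketch of the simhash approach as I understand it, not a tuned implementation. It assumes word-level tokens, equal token weights, and a 64-bit fingerprint taken from MD5; the names `simhash`, `hamming_distance`, and `most_similar` are just illustrative.

```python
import hashlib
import re

BITS = 64  # fingerprint width (an assumption; 64 is a common choice)


def simhash(text):
    """Return a BITS-wide simhash fingerprint built from word tokens."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Use a stable hash; Python's built-in hash() is salted per process.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:BITS // 8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def most_similar(query, documents):
    """Index of the document whose fingerprint is closest to the query's."""
    q = simhash(query)
    fingerprints = [simhash(d) for d in documents]  # precompute once in practice
    return min(range(len(documents)),
               key=lambda i: hamming_distance(q, fingerprints[i]))
```

The fingerprints for the collection would be computed once and stored, so that a lookup only hashes the input document and compares integers. The linear scan in `most_similar` is the part I would like to make faster.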


Solution

  • Have you tried LSH (locality-sensitive hashing) techniques?
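
As one possible illustration of the LSH family, below is a rough sketch of MinHash, which estimates the Jaccard similarity of two documents from short signatures. This is a swapped-in example, not something the question mentions: the shingle size, the number of hash functions, and the helper names `shingles`, `minhash_signature`, and `estimated_jaccard` are all assumptions made for the sketch.

```python
import hashlib

NUM_HASHES = 64  # signature length (an assumption; more hashes give a better estimate)


def shingles(text, k=3):
    """Set of overlapping k-word shingles from the text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    # Simulate a family of hash functions by prefixing a seed to the input.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(text):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

To avoid comparing the input against every document, an LSH index typically splits each signature into bands and uses each band as a bucket key, so only documents that share a bucket with the input are compared in full. That indexing step is left out here to keep the sketch short.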