The problem: I have a collection of text documents, and I want to pick the one most similar to an input document. The input document may be an exact match for a stored document or a partly modified version of one. The algorithm must be very fast.
So far, I have found simhash, which computes a fingerprint for each document in the collection. Are there other algorithms that do the same thing?
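To make the fingerprinting idea concrete, here is a minimal sketch of simhash in Python. It assumes whitespace tokens with equal weights and uses MD5 only as a stable token hash; the names `simhash`, `hamming` and `most_similar` are hypothetical, chosen for illustration, and the lookup is a plain linear scan rather than an optimized index.

```python
import hashlib

BITS = 64  # fingerprint width (assumed; matches the 8 hash bytes used below)

def simhash(text: str) -> int:
    """Return a 64-bit simhash fingerprint of `text` (unweighted word tokens)."""
    counts = [0] * BITS
    for token in text.lower().split():
        # Stable 64-bit token hash: first 8 bytes of the MD5 digest.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i is set when the votes for that position are positive overall.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def most_similar(fingerprints: list, query_fp: int) -> int:
    """Index of the stored fingerprint closest to `query_fp` (linear scan)."""
    return min(range(len(fingerprints)),
               key=lambda i: hamming(fingerprints[i], query_fp))
```

An exact match gives a Hamming distance of 0, and a lightly modified document usually differs in only a few bits, so the nearest fingerprint is a reasonable candidate for the most similar document.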
Have you tried LSH (locality-sensitive hashing) techniques?
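As a rough illustration of what an LSH lookup could look like, here is a minimal sketch of MinHash with banding in Python. Everything in it is an assumption made for the example: the word 3-shingles, the signature length of 32, the 8 bands, MD5 as the hash, and the helper names (`shingles`, `minhash`, `band_keys`, `build_index`, `candidates`). It is not tuned for any particular collection.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32  # MinHash signature length (assumed value)
BANDS = 8        # number of LSH bands; each band covers NUM_HASHES // BANDS rows

def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(text: str) -> list:
    """MinHash signature: the minimum hash of the shingle set per seed."""
    items = shingles(text)
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in items)
        for seed in range(NUM_HASHES)
    ]

def band_keys(sig: list) -> list:
    """Split a signature into BANDS bucket keys of the form (band number, rows)."""
    rows = NUM_HASHES // BANDS
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(BANDS)]

def build_index(docs: list) -> dict:
    """Map every band key to the ids of the documents that produce it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for key in band_keys(minhash(text)):
            index[key].add(doc_id)
    return index

def candidates(index: dict, text: str) -> set:
    """Ids of documents sharing at least one band bucket with `text`."""
    found = set()
    for key in band_keys(minhash(text)):
        found |= index.get(key, set())
    return found
```

The lookup only touches the buckets the query falls into, so its cost does not grow with a full scan of the collection. An exact copy always lands in the same buckets as its original; a modified document is found with a probability that depends on how much it changed and on the band settings. The candidate set can then be ranked with an exact similarity measure.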