I'm currently writing a program that computes a near-duplicate score within a corpus of text documents (5000+ docs). I'm using Simhash to generate a unique fingerprint for each document (thanks to this GitHub repo).
My data is:
data = {
1: u'Im testing simhash algorithm.',
2: u'test of simhash algorithm',
3: u'This is simhash test.',
}
and this gives me 3 hashes like these:
00100110101110100011111000100010010101011001000001110000111001011100110101001101111010100010001011001011000110000100110101100110
00001001110010000000011000001000110010001010000101010000001100000100100011100100110010100000010000000110001001010110000010000100
10001110101100000100101010000010010001011010001000000000101000101100001100100000110011000000011001000000000110000000100110000000
Now, how do I compare those 3 hashes? I know that I have to split them into blocks, but I don't know the exact method.
What I want is to output all the duplicate documents (>70% similarity) with their IDs and the IDs of their duplicates.
Can someone help?
Before I answer your question, it is important to keep in mind that Simhash fingerprints of near-duplicate documents differ in only a few bit positions, so comparing two documents comes down to measuring how many bits their fingerprints share (i.e. their Hamming distance).
Now, I have also replied to your question on the GitHub issue that you raised here.
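As a minimal sketch of that idea (the helper names bit_similarity and near_duplicates and the 0.70 threshold are my own, not part of the repo, and it assumes the fingerprints are equal-length bit strings like the three you posted), you could score the documents pairwise like this:

from itertools import combinations

def bit_similarity(bits_a, bits_b):
    # fraction of bit positions on which two equal-length fingerprints agree
    # (i.e. 1 - normalized Hamming distance)
    return sum(a == b for a, b in zip(bits_a, bits_b)) / len(bits_a)

def near_duplicates(fingerprints, threshold=0.70):
    # fingerprints = { doc_id: bit_string }, e.g. your three 128-bit strings
    pairs = []
    for (id_a, bits_a), (id_b, bits_b) in combinations(fingerprints.items(), 2):
        score = bit_similarity(bits_a, bits_b)
        if score > threshold:
            pairs.append((id_a, id_b, score))
    return pairs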
For reference, here is some sample code you can use to print the final near-duplicate documents after hashing them.
# assuming that you have a dictionary with the document id as the key
# and the document text as the value, i.e. documents = { doc_id: doc }, you can do:
from simhash import simhash

def split_hash(hash_str, num):
    # split the fingerprint string into consecutive chunks of `num` characters
    return [hash_str[start:start + num] for start in range(0, len(hash_str), num)]

hashes = {}
for doc_id, doc in documents.items():
    doc_hash = simhash(doc)
    # you can either use the whole hash for higher precision
    # or split it into chunks for higher recall
    hash_chunks = split_hash(doc_hash, 4)
    for chunk in hash_chunks:
        # bucket together every document that shares this chunk
        if chunk not in hashes:
            hashes[chunk] = []
        hashes[chunk].append(doc_id)

# now you can print the duplicate documents:
for chunk, doc_list in hashes.items():
    if len(doc_list) > 1:  # more than one document fell into this bucket
        print("Duplicate documents:", doc_list)
Please let me know if something is not clear.