
How to compare the similarity of documents with the Simhash algorithm?


I'm currently creating a program that computes a near-duplicate score within a corpus of text documents (5000+ docs). I'm using Simhash to generate a unique footprint for each document (thanks to this github repo).

My data is:

data = {
    1: u'Im testing simhash algorithm.',
    2: u'test of simhash algorithm',
    3: u'This is simhash test.',
}

This gives me 3 hashes like these:

00100110101110100011111000100010010101011001000001110000111001011100110101001101111010100010001011001011000110000100110101100110

00001001110010000000011000001000110010001010000101010000001100000100100011100100110010100000010000000110001001010110000010000100

10001110101100000100101010000010010001011010001000000000101000101100001100100000110011000000011001000000000110000000100110000000

Now, how do I compare those 3 hashes? I know that I have to split them into blocks, but I don't know the exact method.

What I want to do is output all the duplicated documents (>70% similarity) with their IDs and the IDs of their duplicate docs.

Can someone help?


Solution

  • Before I answer your question, it is important to keep in mind:

    1. Simhash is useful because it detects near duplicates. This means that near duplicates will end up with very similar hashes, differing in only a few bit positions; a sketch of how to measure that difference follows this list.
    2. For exact duplicates you can simply use any one-way, consistent hashing mechanism (e.g. md5).
    3. The examples that you pasted here are too small, and given their size, their differences are significant. The algorithm is tailored to work with large Web documents, not small sentences.
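
    To compare two simhash fingerprints directly, you measure their Hamming distance: the number of bit positions in which they differ. Here is a minimal sketch of that comparison (the hamming_distance and similarity helpers are my own illustrative names, not part of the linked repo); the 70% threshold from the question translates to a similarity of 0.70:

    def hamming_distance(h1, h2):
        # number of bit positions where two equal-length bit strings differ
        return sum(b1 != b2 for b1, b2 in zip(h1, h2))

    def similarity(h1, h2):
        # fraction of matching bits between two fingerprints
        return 1.0 - hamming_distance(h1, h2) / float(len(h1))

    # toy example with short bit strings; real simhash fingerprints
    # are typically 64 or 128 bits long
    print(similarity('11001010', '11001110'))  # -> 0.875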

    Now, I have replied to your question on the Github issue that you raised here.

    For reference, though, here is some sample code you can use to print the final near-duplicate documents after hashing them.

    # assuming that you have a dictionary with document id as the key
    # and the document as the value: documents = { doc_id: doc }

    from simhash import simhash

    def split_hash(s, num):
        # split the bit string into consecutive chunks of num characters
        return [s[start:start + num] for start in range(0, len(s), num)]

    hashes = {}
    for doc_id, doc in documents.items():
        doc_hash = simhash(doc)

        # you can either use the whole hash for higher precision
        # or split it into chunks for higher recall
        hash_chunks = split_hash(doc_hash, 4)

        # bucket documents by chunk: documents that share a chunk
        # become near-duplicate candidates
        for chunk in hash_chunks:
            if chunk not in hashes:
                hashes[chunk] = []
            hashes[chunk].append(doc_id)

    # now you can print the candidate duplicate documents:
    for chunk, doc_list in hashes.items():
        if len(doc_list) > 1:  # more than one document shares this chunk
            print("Duplicate documents: ", doc_list)
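
    Note that bucketing by chunk only yields candidate pairs. To produce the >70% output asked for in the question, you can compare the full fingerprints of documents that share a chunk, reusing the similarity helper sketched above. This is a hedged sketch that assumes you also kept a full_hashes = { doc_id: doc_hash } mapping while hashing; that mapping is my addition, filled e.g. with full_hashes[doc_id] = doc_hash inside the loop above:

    from itertools import combinations

    # full_hashes = { doc_id: doc_hash } is assumed to have been filled
    # inside the hashing loop above; it is not part of the original code
    checked = set()
    for chunk, doc_list in hashes.items():
        for a, b in combinations(sorted(set(doc_list)), 2):
            if (a, b) in checked:
                continue  # pair already compared via another shared chunk
            checked.add((a, b))
            if similarity(full_hashes[a], full_hashes[b]) > 0.70:
                print("Near-duplicate pair:", a, b)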

    Please let me know if something is not clear.