python-dedupe

Is there a performance difference between `dedupe.match(generator=True)` and `dedupe.matchBlocks()` for large datasets?


I'm preparing to run dedupe on a fairly large dataset (400,000 rows) in Python. The documentation for the DedupeMatching class lists both the match and matchBlocks methods, and the docs suggest using match only on small to moderately sized datasets. From looking through the code, I can't see how matchBlocks in tandem with _blockData performs better on larger datasets than just match with generator=True.

I've tried running both methods on a small-ish dataset (10,000 entities) and didn't notice a difference.

data_d = {'id1': {'name': 'George Bush', 'address': '123 main st.'},
          'id2': {'name': 'Bill Clinton', 'address': '1600 pennsylvania ave.'},
          ...
          'id10000': {...}}

Then either method A:

blocks = deduper._blockData(data_d)
clustered_dupes = deduper.matchBlocks(blocks, threshold=threshold)

or method B:

clustered_dupes = deduper.match(data_d, threshold=threshold, generator=True)

Then the computationally intensive part is running a for-loop over the clustered_dupes object:

cluster_membership = {}
for (cluster_id, cluster) in enumerate(clustered_dupes):
    # Do something with each cluster_id like below
    cluster_membership[cluster_id] = cluster

Is there actually a performance difference here? If so, could you point me to the code that shows it and explain why?


Solution

  • There is no difference between calling _blockData and then matchBlocks versus just calling match. Indeed, if you look at the code, you'll see that match simply calls those two methods (see the sketch below).

    The reason why matchBlocks is exposed is that _blockData can take a lot of memory, and you may want to generate the blocks another way, such as taking advantage of a relational database.
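
    Roughly, match just strings those two calls together, so matchBlocks only buys you something on 400,000 rows if you replace _blockData with a leaner source of blocks. The sketch below paraphrases that relationship and shows one way blocks might be streamed out of a database instead; the blocking_map table, its columns, and the exact block tuple layout are assumptions rather than part of dedupe's documented API, so check them against what _blockData yields in the version you have installed.

    import itertools
    import sqlite3

    # Rough paraphrase of what match() does internally in the 1.x API; not the
    # library's exact source, just the shape of it.
    def match_equivalent(deduper, data_d, threshold=0.5):
        blocks = deduper._blockData(data_d)   # every block is built in memory here
        return deduper.matchBlocks(blocks, threshold=threshold)

    # Hypothetical alternative: stream blocks out of a relational database so the
    # whole blocking structure never sits in memory at once. The blocking_map
    # table, its columns, and the block layout are assumptions -- inspect what
    # _blockData yields in your dedupe version before copying this.
    def blocks_from_db(db_path):
        con = sqlite3.connect(db_path)
        rows = con.execute(
            "SELECT block_key, record_id, name, address "
            "FROM blocking_map ORDER BY block_key"
        )
        for _, group in itertools.groupby(rows, key=lambda row: row[0]):
            block = []
            seen_ids = set()
            for _, record_id, name, address in group:
                record = {'name': name, 'address': address}
                # Mirror the (record_id, record, smaller_ids) tuples that
                # _blockData produces; smaller_ids is simplified here to the
                # ids already seen within this block.
                block.append((record_id, record, set(seen_ids)))
                seen_ids.add(record_id)
            yield block

    # Assuming `deduper` is the trained Dedupe instance from the question:
    # clustered_dupes = deduper.matchBlocks(blocks_from_db('records.db'),
    #                                       threshold=0.5)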