I'm preparing to run dedupe on a fairly large dataset (400,000 rows) with Python. In the documentation for the `DedupeMatching` class there are both a `match` and a `matchBlocks` method. For `match`, the docs suggest using it only on small to moderately sized datasets. From looking through the code, I can't see how `matchBlocks` used in tandem with `block_data` performs better than plain `match` on larger datasets when `generator=True` is passed to `match`.

I've tried both approaches on a small-ish dataset (10,000 entities) and didn't notice a difference. The data looks like this:
data_d = {'id1': {'name': 'George Bush', 'address': '123 main st.'},
          'id2': {'name': 'Bill Clinton', 'address': '1600 pennsylvania ave.'},
          ...
          'id10000': {...}}
Then I call either method A:

blocks = deduper._blockData(data_d)
clustered_dupes = deduper.matchBlocks(blocks, threshold=threshold)

or method B:

clustered_dupes = deduper.match(data_d, threshold=threshold, generator=True)
(The computationally intensive part is then running a for-loop over the `clustered_dupes` object.)

cluster_membership = {}
for cluster_id, cluster in enumerate(clustered_dupes):
    # do something with each cluster, e.g. record its members
    cluster_membership[cluster_id] = cluster
I expected there to be a performance difference between the two. Is there one? If so, could you point me to the code that shows it and explain why?
There is no difference between calling `_blockData` and then `matchBlocks` versus just calling `match`. Indeed, if you look at the code, you'll see that `match` calls those two methods.
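Paraphrasing from the dedupe 1.x source (this is a rough sketch, not a verbatim copy; exact names and defaults may differ in your version), `match` boils down to something like this:

```python
# Rough sketch of what DedupeMatching.match does internally (paraphrased,
# not the actual source):
def match(self, data, threshold=0.5, generator=False):
    blocked_pairs = self._blockData(data)           # build all blocks in memory
    clusters = self.matchBlocks(blocked_pairs,      # score and cluster them
                                threshold=threshold)
    if generator:
        return clusters          # lazy iterator over clusters
    else:
        return list(clusters)    # materialize every cluster at once
```

So calling `_blockData` and `matchBlocks` yourself does exactly the same work, which is why you see no difference in your test.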
The reason `matchBlocks` is exposed is that `_blockData` can take a lot of memory, and you may want to generate the blocks another way, for example by taking advantage of a relational database, as in the sketch below.
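As a rough illustration of that pattern (a made-up sketch, loosely modeled on the PostgreSQL example that ships with dedupe; the `records` and `blocking_map` tables are hypothetical, and you should check the block format your dedupe version expects):

```python
# Hypothetical sketch: stream blocks out of a relational database instead of
# holding everything _blockData builds in memory. `deduper` is the trained
# Dedupe object from the question.
import itertools
import psycopg2
import psycopg2.extras

con = psycopg2.connect(dbname='mydb')
cur = con.cursor(cursor_factory=psycopg2.extras.RealDictCursor)

# blocking_map(block_key, record_id) is assumed to have been populated
# beforehand by running deduper.blocker over the records table.
cur.execute("""
    SELECT b.block_key, r.record_id, r.name, r.address
    FROM blocking_map b
    JOIN records r USING (record_id)
    ORDER BY b.block_key
""")

def candidate_blocks(rows):
    # Group rows by block_key and yield one block at a time, so only a
    # single block is ever in memory. Each block is a list of
    # (record_id, record, smaller_ids) tuples, which is the format
    # matchBlocks expects in the 1.x series.
    for _, group in itertools.groupby(rows, key=lambda row: row['block_key']):
        block, smaller_ids = [], set()
        for row in group:
            record = {'name': row['name'], 'address': row['address']}
            block.append((row['record_id'], record, set(smaller_ids)))
            smaller_ids.add(row['record_id'])
        yield block

clustered_dupes = deduper.matchBlocks(candidate_blocks(cur), threshold=0.5)
```

Generating blocks this way keeps peak memory roughly proportional to the largest block rather than to the whole blocked dataset, which is the point of exposing `matchBlocks` separately.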