python, pyspark, record-linkage, python-dedupe

Low resource usage when using the Python dedupe library


I need to find duplicates in a large dataset, so I'm testing the dedupe Python library.

I know it is recommended for small datasets, so I thought a powerful machine might improve performance. I have a machine with 56 GB of RAM, and I'm running a test similar to "csv_example" on a dataset with 200,000 rows. It works, but memory usage is very low, and so is CPU usage.

It seems to take too long in the blocking stage:

INFO:dedupe.blocking:10000, 110.6458142 seconds
INFO:dedupe.blocking:20000, 300.6112282 seconds
INFO:dedupe.blocking:30000, 557.1010122 seconds
INFO:dedupe.blocking:40000, 915.3087222 seconds
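
To make the slowdown easier to see, here is a rough calculation on the four log lines above. It assumes the reported seconds are cumulative since blocking started; the `batch_durations` helper is just an illustrative name, not part of dedupe.

```python
# Cumulative (records processed, elapsed seconds) pairs copied from the log above.
# Assumption: each timing is measured from the start of the blocking stage.
log = [
    (10000, 110.6458142),
    (20000, 300.6112282),
    (30000, 557.1010122),
    (40000, 915.3087222),
]


def batch_durations(entries):
    """Return the seconds spent on each batch, given cumulative timings."""
    durations = []
    previous = 0.0
    for _, elapsed in entries:
        durations.append(elapsed - previous)
        previous = elapsed
    return durations


durations = batch_durations(log)
for (count, _), seconds in zip(log, durations):
    print(f"up to {count} records: {seconds:.1f} s for this batch")
```

Under that assumption, each batch of 10,000 records takes longer than the one before it (roughly 111 s, 190 s, 256 s, and 358 s), so the blocking time is growing faster than linearly with the number of records.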

Could anyone help me improve resource usage, or tell me whether there is a library or setting that makes the program use more of the available resources?


Solution

  • What version of dedupe are you running? As of 1.6.8, it should handle a record set of this size pretty easily.

    However, the general guidance is that when you run into memory problems, you should switch to blocking with a database, as in the postgres example.

    (I'm a main author of dedupe).
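
To illustrate what blocking with a database means in general, here is a minimal sketch. It is not the postgres example itself and does not call dedupe: it uses the standard-library `sqlite3` module in place of PostgreSQL so it can run on its own, and the `blocking_map` table, the `candidate_pairs` function, and the block keys are all hypothetical names chosen for illustration. The idea is that (block key, record id) pairs are stored in an indexed table, and the database, rather than Python memory, finds the records that share a block.

```python
import sqlite3

# Hypothetical (block_key, record_id) pairs. In a real pipeline a blocker
# would generate these from the records; here they are made up.
blocked = [
    ("smith:1970", 1),
    ("smith:1970", 2),
    ("jones:1982", 3),
    ("smith:1970", 4),
    ("jones:1982", 5),
    ("brown:1990", 6),
]


def candidate_pairs(block_rows):
    """Return sorted, distinct pairs of record ids that share a block key."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE blocking_map (block_key TEXT, record_id INTEGER)"
        )
        conn.executemany("INSERT INTO blocking_map VALUES (?, ?)", block_rows)
        # The index lets the self-join below look up matching keys efficiently.
        conn.execute("CREATE INDEX idx_block_key ON blocking_map (block_key)")
        rows = conn.execute(
            """
            SELECT DISTINCT a.record_id, b.record_id
            FROM blocking_map AS a
            JOIN blocking_map AS b
              ON a.block_key = b.block_key
             AND a.record_id < b.record_id
            ORDER BY a.record_id, b.record_id
            """
        ).fetchall()
    finally:
        conn.close()
    return rows


pairs = candidate_pairs(blocked)
print(pairs)
```

Only records that share a block key come back as candidate pairs; record 6 is alone in its block and is never compared with anything. With an on-disk database instead of `:memory:`, the blocking table does not have to fit in RAM.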