elasticsearch duplicates apache-nifi record-linkage python-dedupe

Apache NiFi - Federated Search


My team has been thrown in at the deep end and asked to build a federated search of customers across a variety of large datasets. Each dataset holds different data, in varying detail, about each individual, and there are no matching identifiers between them. I was wondering how to go about implementing it.

I was thinking Apache NiFi would be a good fit: query our various databases, merge the results, deduplicate the entries via an external tool, and then push the result into a database, which is then queried to populate an Elasticsearch instance for the application's use.

So, roughly speaking, something like this:

[image: diagram of the proposed flow]

For example's sake, the following data then exists in the result database after the first flow:

[image: example rows in the result database]

I would then run https://github.com/dedupeio/dedupe over this database table, which adds cluster IDs to aid the record linkage, e.g.:

[image: example rows with cluster IDs added]
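To make the cluster ID idea concrete, here is a minimal standard-library sketch of what the clustering step produces. It is not dedupe's API or algorithm (dedupe learns its matching rules from labelled examples); the field names `name` and `email`, the sample rows, and the 0.85 threshold are all made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical merged rows from the source databases (field names are made up).
records = [
    {"id": 1, "name": "John Smith", "email": "john.smith@example.com"},
    {"id": 2, "name": "Jon Smith", "email": "john.smith@example.com"},
    {"id": 3, "name": "Mary Jones", "email": "mary.jones@example.com"},
    {"id": 4, "name": "John  Smith", "email": ""},
]


def similar(a, b, threshold=0.85):
    """Crude pairwise rule: same non-empty email, or very similar names."""
    if a["email"] and a["email"] == b["email"]:
        return True
    name_a = " ".join(a["name"].lower().split())
    name_b = " ".join(b["name"].lower().split())
    return SequenceMatcher(None, name_a, name_b).ratio() >= threshold


def assign_cluster_ids(rows):
    """Union-find over all matching pairs; returns {record id: cluster id}."""
    parent = {r["id"]: r["id"] for r in rows}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(rows, 2):  # O(n^2): fine for a demo only
        if similar(a, b):
            parent[find(b["id"])] = find(a["id"])

    # Renumber the roots to small consecutive cluster ids.
    roots = {}
    return {r["id"]: roots.setdefault(find(r["id"]), len(roots)) for r in rows}


cluster_ids = assign_cluster_ids(records)
for r in records:
    print(r["id"], r["name"], "-> cluster", cluster_ids[r["id"]])
```

Comparing every pair will not scale to large tables, which is one reason to use a dedicated record-linkage tool rather than something hand-rolled like this.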

A second flow would then query the result database and feed the result into an Elasticsearch instance. The application's API would query that instance and use the cluster ID to link the duplicates.
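As a rough illustration of that second flow, the sketch below builds a request body in the newline-delimited format that Elasticsearch's `_bulk` endpoint expects, plus a `term` query that fetches every record in one cluster. The index name `customers`, the field name `cluster_id`, and the sample rows are assumptions, and it assumes `id` is unique within the result table.

```python
import json

# Hypothetical rows from the result table, after clustering.
rows = [
    {"id": 1, "source": "crm", "name": "John Smith", "cluster_id": 0},
    {"id": 2, "source": "billing", "name": "Jon Smith", "cluster_id": 0},
    {"id": 3, "source": "crm", "name": "Mary Jones", "cluster_id": 1},
]


def to_bulk_body(rows, index="customers"):
    """Build an NDJSON body for _bulk: an action line, then the document."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": str(row["id"])}}))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"  # the body must end with a newline


def linked_records_query(cluster_id):
    """Search body that returns every record sharing one cluster id."""
    return {"query": {"term": {"cluster_id": cluster_id}}}


bulk_body = to_bulk_body(rows)
print(bulk_body)
print(json.dumps(linked_records_query(0)))
```

In NiFi itself the indexing would more likely be done by one of the Elasticsearch processors than by hand-built requests; the point here is only the shape of the documents and the lookup by cluster ID.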

Couple questions:-

I also haven’t considered any CDC process here; the databases will be constantly updated, which I'd need to handle. So I'd be really interested to hear whether anybody has solved a similar problem or used a different approach (happy to consider other technologies too).

Thanks!


Solution

  • For de-duplicating...

    You will probably need to write a custom processor or use ExecuteScript. Since dedupe looks like a Python library, I'm guessing a script for ExecuteScript is the way to go, unless there is an equivalent Java library.

    For triggering the second flow...

    Do you need that intermediate DB table for something else?

    If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.

    If you don't need it, then you can just go MergeContent -> Dedupe -> Elasticsearch.
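On the scripting suggestion above: as far as I know, ExecuteScript's Python engine is Jython, which cannot load libraries with native dependencies, so one option is to run a normal Python script through ExecuteStreamCommand, which passes the flow file content to the command's stdin and takes its stdout as the result. Here is a minimal sketch of such a script, assuming one JSON record per line. The `match_key` rule is only a placeholder for the real dedupe step, and the `name` field is hypothetical.

```python
import io
import json


def match_key(record):
    """Placeholder matching rule: exact match on a normalised name.
    This is where the real dedupe step would go."""
    return " ".join(record.get("name", "").lower().split())


def process(stream_in, stream_out):
    """Read one JSON record per line, add a cluster_id, write one per line."""
    cluster_for_key = {}
    for line in stream_in:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        key = match_key(record)
        # First time a key is seen it gets the next free cluster id.
        record["cluster_id"] = cluster_for_key.setdefault(key, len(cluster_for_key))
        stream_out.write(json.dumps(record) + "\n")


# Demo with in-memory streams; a real script would call
# process(sys.stdin, sys.stdout) instead.
demo_in = io.StringIO(
    '{"name": "John Smith"}\n{"name": "Mary Jones"}\n{"name": "john  smith"}\n'
)
demo_out = io.StringIO()
process(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the logic in a function that takes streams makes it easy to test outside NiFi before wiring it into the flow.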