machine-learning cluster-analysis supervised-learning

Which ML algorithm can find pairs in two datasets?


I have two datasets for which I need to find matching pairs.

In dataset 1 I have reported data from payment providers, which will result in bank payouts. Each record contains a monetary amount, potentially a reference, a date, etc.

The actual incoming bank transactions are in dataset 2. Each transaction also contains a monetary amount, complementary bank info, a date, an account number, etc.

In the case where the "reference" from dataset 1 is part of the "complementary bank info" from dataset 2, finding pairs is trivial. Also, finding the correct pairs based on monetary amount and dates works fine.

However, there are cases where the monetary amounts are off by a few cents, or where the "reference" appears in the complementary bank info with extra whitespace. Those cases are currently handled manually, i.e. a person picks and matches the pairs by hand.

I'd like to train a machine-learning algorithm to find these pairs, but I am struggling with the correct approach. I assume this is a case for supervised learning, but which algorithm would find those pairs across the two datasets?


Solution

  • Try some "nearest neighbors" algorithm, with the number of neighbors set to 1.

    Specifically, you want to perform a "search", which is a somewhat broader concept than "classification" or "regression": for each record in dataset 1, you look up the closest transaction in dataset 2. See the sklearn.neighbors.NearestNeighbors class for more ideas.

    You wrote: "there are cases where the monetary amounts are off by a few cents or the 'reference' has whitespace in the complementary bank info"

    A nearest-neighbors search can be limited by a distance cutoff, i.e. a "threshold" on how far apart two records may be and still count as a match (in sklearn.neighbors.NearestNeighbors this is the radius parameter). Find a cutoff value that captures "a few cents difference" but doesn't capture "a dollar or more difference".
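To make the idea concrete, here is a minimal sketch of a 1-nearest-neighbor search with a distance cutoff, written as a brute-force loop in plain Python so it runs without extra packages. Everything specific in it is an assumption for illustration only: the field names (amount_cents, reference, bank_info, date), the sample records, the distance function (cents plus days, with a whitespace-insensitive reference hit counting as distance 0), and the threshold of 5.

```python
import re
from datetime import date


def normalize(text):
    """Drop all whitespace and upper-case, so 'INV 1001' equals 'inv1001'."""
    return re.sub(r"\s+", "", text).upper()


def distance(report, txn):
    """Hypothetical distance between a reported record and a bank transaction."""
    ref = normalize(report["reference"])
    # Assumption: a reference found in the bank info is a certain match.
    if ref and ref in normalize(txn["bank_info"]):
        return 0
    cents = abs(report["amount_cents"] - txn["amount_cents"])
    days = abs((report["date"] - txn["date"]).days)
    return cents + days


def match_pairs(reports, txns, threshold):
    """For each report, find its single nearest transaction (k = 1).

    Returns (report_index, txn_index) pairs; txn_index is None when even
    the nearest transaction is farther away than the threshold.
    Assumes txns is non-empty.
    """
    pairs = []
    for i, report in enumerate(reports):
        dists = [distance(report, txn) for txn in txns]
        j = min(range(len(txns)), key=dists.__getitem__)
        pairs.append((i, j if dists[j] <= threshold else None))
    return pairs


# Made-up sample data (amounts in cents to avoid float rounding).
reports = [
    {"amount_cents": 10000, "reference": "INV1001", "date": date(2023, 1, 2)},
    {"amount_cents": 2550, "reference": "", "date": date(2023, 1, 3)},
    {"amount_cents": 9900, "reference": "", "date": date(2023, 1, 5)},
]
txns = [
    {"amount_cents": 2548, "bank_info": "payout acme", "date": date(2023, 1, 4)},
    {"amount_cents": 10000, "bank_info": "ref INV 1001", "date": date(2023, 1, 9)},
    {"amount_cents": 50000, "bank_info": "salary", "date": date(2023, 1, 5)},
]

pairs = match_pairs(reports, txns, threshold=5)
print(pairs)
```

With this sample data, the first report is matched through its reference despite the whitespace, the second is matched because it is only two cents and one day off, and the third stays unmatched because its nearest transaction is a full dollar away. Note that this sketch does not stop two reports from claiming the same transaction, so a real reconciliation would need an extra one-to-one step. For larger datasets, sklearn.neighbors.NearestNeighbors with n_neighbors=1 performs the same kind of search on numeric feature vectors, and its kneighbors method returns the distances you would compare against the cutoff.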