I have an input and a master file in which I first tried doing an exact match using the fpd.fuzzy_merge
function, which seems to be working perfectly. Afterwards, I want to start applying the levenshtein method with different thresholds on the remaining rows from the input file, excluding the ones that actually had an exact match in the previously executed command.
Sample input:
Raw desc Clean desc Brand
CAULIFLOWER TRI COLOR 6 CT SAL cauliflower tri color sal SYSCO
!SYRUP BLACKBERRY 2LB syrup blackberry TYSON
PUREE BRKFST SAUSAGE LINK 24CT puree brkfst sausage link TYSON
Master file:
Master raw desc Master clean desc SKU
SYRUP & BLACKBERRY 50Z syrup blackberry 2356123
VEGETABLES MXD 6-10 GCHC vegetables mxd gchc 4412620
3 LB PAPER FOOD BOAT 500/CS paper food boat 4551210
After I am using the following code for an exact match, the output is the following which is the desired one. The challenge is to run the remaining of the sample input that didn't have an exact match through the Levenshtein method with different thresholds.
results1 = fpd.fuzzy_merge(
sample_df, master_df, left_on="Clean desc", right_on="Master clean desc"
)
Output 1):
Raw desc Clean desc Brand Master raw desc Master clean desc SKU
!SYRUP BLACKBERRY 2LB syrup blackberry TYSON SYRUP & BLACKBERRY 5OZ syrup blackberry 2356123
Afterwards, I use the following code using the Levenshtein method with a specific threshold, but no matter what parameter I use it only runs on the two columns I specified. I want to be able to run and return all the columns from both dataframes.
results = fpd.fuzzy_merge(
sample_df,
master_df,
left_on="Clean desc",
right_on="Master clean desc",
method="levenshtein",
threshold=0.85,
join="left-outer",
keep="match",
)
Any recs would be much appreciated!
With an extended version of your dataframes (to provide more than one match):
import pandas as pd
sample_df = pd.DataFrame(
{
"Raw desc": [
"CAULIFLOWER TRI COLOR 6 CT SAL",
"!SYRUP BLACKBERRY 2LB",
"PUREE BRKFST SAUSAGE LINK 24CT",
],
"Clean desc": [
"cauliflower tri color sal",
"syrup blackberry",
"puree brkfst sausage link",
],
"Brand": ["SYSCO", "TYSON", "TYSON"],
}
)
master_df = pd.DataFrame(
{
"Master raw desc": [
"SYRUP & BLACKBERRY 50Z",
"VEGETABLES MXD 6-10 GCHC",
"3 LB PAPER FOOD BOAT 500/CS",
"CAUFLOWER 3 COLOR SIX SAL",
],
"Master clean desc": [
"syrup blackberry",
"vegetables mxd gchc",
"paper food boat",
"flwrs multi shape vet",
],
"SKU": [2356123, 4412620, 4551210, 999999],
}
)
Here is one way to it with Pandas concat and a list comprehension:
import fuzzy_pandas as fpd
results = pd.concat(
[
fpd.fuzzy_merge(
sample_df,
master_df,
left_on=left_on,
right_on=right_on,
method="levenshtein",
ignore_case=True,
)
for left_on, right_on in (
("Raw desc", "Master raw desc"),
("Clean desc", "Master clean desc"),
)
]
).drop_duplicates(ignore_index=True)
Then:
print(results)
# Output
Raw desc Clean desc Brand Master raw desc Master clean desc SKU
0 CAULIF... caulif... SYSCO CAUFLO... flwrs ... 999999
1 !SYRUP... syrup ... TYSON SYRUP ... syrup ... 2356123