I have two large datasets. df1 is about 1m lines, and df2 is about 10m lines. I need to find matches for lines in df1 from df2.
I have posted an original version of this question separately. See here. Well answered by @laurent but I have some added specificities now. I would now like to:
Get the fuzz ratios for each of fname and lname in a column in my final matched dataframe
Write the code such that fuzz ratio for fname is set to >60, while fuzz ratio for lname is set to >75. In other words, a true match occurs if fuzz_ratio for fname>60 and fuzz ratio for lname>75; otherwise not a true match. A match would not be true if fuzz ratio for fname==80 while fuzz ratio for lname==60. While I understand that this can be done from (1) as a post-hoc filtering, it would make sense to do this at the stage of coding for a different matching.
I post here an example of my data. The solution by @laurent for the original problem can be found in the above link.
import pandas as pd
df1 = pd.DataFrame(
{
"ein": {0: 1001, 1: 1500, 2: 3000},
"ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
"lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
"fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
}
)
df2 = pd.DataFrame(
{
"lname": {0: "Cupper", 1: "Cruise", 2: "Cruz", 3: "Couper"},
"fname": {0: "Bradley", 1: "Tom", 2: "Thomas", 3: "M Brad"},
"score": {0: 3, 1: 3.5, 2: 4, 3: 2.5},
}
)
Expected output is:
df3 = pd.DataFrame(
{
"df1_ein": {0: 1001, 1: 1500, 2: 3000},
"df1_ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
"df1_lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
"df1_fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
"fuzz_ratio_lname": {0: 83, 1: 100, 2: NA},
"fuzz_ratio_fname": {0: 62, 1: 67, 2: NA},
"df2_lname": {0: "Couper", 1: "Cruise", 2: "NA"},
"df2_fname": {0: "M Brad", 1: "Tom", 2: "NA"},
"df2_score": {0: 2.5, 1: 3.5, 2: NA},
}
)
Note from the above expected output: Bradley Cupper is a bad match for Bradley Cooper based on the fuzz ratios that I assigned. The better match for Bradley Cooper is M Brad Couper. Similarly, Thomas Cruise matches with Tom Cruise rather than with Thomas Cruz.
I am a user of Stata primarily (haha) and the reclink2 ado file can do the above in theory, i.e. if Stata can handle the size of the data. However, with the size of data I have, nothing even starts after hours.
Here is one way to do it:
import pandas as pd
from fuzzywuzzy import fuzz
# Setup
df1.columns = [f"df1_{col}" for col in df1.columns]
# Add new columns
df1["fuzz_ratio_lname"] = (
df1["df1_lname"]
.apply(
lambda x: max(
[(value, fuzz.ratio(x, value)) for value in df2["lname"]],
key=lambda x: x[1],
)
)
.apply(lambda x: x if x[1] > 75 else pd.NA)
)
df1[["df2_lname", "fuzz_ratio_lname"]] = pd.DataFrame(
df1["fuzz_ratio_lname"].tolist(), index=df1.index
)
df1 = (
pd.merge(left=df1, right=df2, how="left", left_on="df2_lname", right_on="lname")
.drop(columns="lname")
.rename(columns={"fname": "df2_fname"})
)
df1["df2_fname"] = df1["df2_fname"].fillna(value="")
for i, (x, value) in enumerate(zip(df1["df1_fname"], df1["df2_fname"])):
ratio = fuzz.ratio(x, value)
df1.loc[i, "fuzz_ratio_fname"] = ratio if ratio > 60 else pd.NA
# Cleanup
df1["df2_fname"] = df1["df2_fname"].replace("", pd.NA)
df1 = df1[
[
"df1_ein",
"df1_ein_name",
"df1_lname",
"df1_fname",
"fuzz_ratio_lname",
"fuzz_ratio_fname",
"df2_lname",
"df2_fname",
"score",
]
]
print(df1)
# Output
df1_ein df1_ein_name df1_lname df1_fname fuzz_ratio_lname \
0 1001 H for Humanity Cooper Bradley 83.0
1 1500 Labor Union Cruise Thomas 100.0
2 3000 Something something Pitt Brad NaN
fuzz_ratio_fname df2_lname df2_fname score
0 62.0 Couper M Brad 2.5
1 67.0 Cruise Tom 3.5
2 <NA> <NA> <NA> NaN