I'm encountering the following issue during a small project of mine. I'm having a large dataset where some string values are accidentally not written properly. My goal is to write a function that ensures that all names that look fairly similar (.75) will be looked for in a loop, and will get the same name. In the example below I described a subset of the data where "Bob Fisherman", "Bob Felony" & "Bob Haris" are the correct names. I would like to have the misspelled names changes to the above if they match
Here is a subset of the dataframe:
columns = ["Name", "Type","Amount", "Year"]
data = [("Bob fisherman", "Income", 150, 2022), ("Bob fisherman","Income", 100, 2021), ("Bob Felony", "Income", 100, 2021), ("Bob Felany", "Expense", 50, 2022), ("Bob Haris", "Expense", 100, 2022), ("Bob Disherman", "Expense", 100, 2021)]
data = spark.createDataFrame(data).toDF(*columns)
So eventually I would like to have something like this:
Name | Type | Amount | Year |
---|---|---|---|
Bob Fisherman | Income | 150 | 2022 |
Bob Fisherman | Income | 100 | 2021 |
Bob Felony | Income | 100 | 2021 |
Bob Felany | Income | 50 | 2022 |
Bob Haris | Income | 100 | 2022 |
Bob Felony | Income | 100 | 2021 |
Bob Fisherman | Income | 100 | 2022 |
In the example it only goes about Bob. But in the total sample, I have much more names so the use of pre-specified list is not going to cut it unfortunately.
I tried to get some inspiration from the following question but I didn't seem to make it work: Replace similar strings in a column with the same string
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
[
("Bob Fisherman", "Income", 150, 2022),
("Bob Heris", "Income", 100, 2021),
("Bob Felony", "Income", 100, 2021),
("Bob Felany", "Expense", 50, 2022),
("Bob Haris", "Expense", 100, 2022),
("Bob Disherman", "Expense", 100, 2021),
],
columns=["Name", "Type", "Amount", "Year"],
)
print(df)
# Output
Name Type Amount Year
0 Bob Fisherman Income 150 2022
1 Bob Heris Income 100 2021
2 Bob Felony Income 100 2021
3 Bob Felany Expense 50 2022
4 Bob Haris Expense 100 2022
5 Bob Disherman Expense 100 2021
Here is one way to do it with the help of SequenceMatcher class from Python standard library difflib
module:
from difflib import SequenceMatcher
def similar(a, b):
"""Get similarity ratio between a and b.
Args:
a: value.
b: other value.
Returns:
Similatity ratio.
"""
return SequenceMatcher(None, a, b).ratio()
def replace_similar_values(df, col, value):
"""Main helper function.
Args:
df: target dataframe.
col: target column.
value: value to use for replacement.
Returns:
Modified dataframe.
"""
df = df.assign(
Match=df[col].map(
lambda x: max(
[similar(x, value)],
key=lambda x: x if x != 1 else 0,
)
)
)
df["Name"] = df.apply(lambda x: value if x["Match"] >= 0.8 else x["Name"], axis=1)
df = df.drop(columns="Match")
return df
And then:
correct_names = ["Bob Fisherman", "Bob Felony", "Bob Haris"]
for correct_name in correct_names:
df = replace_similar_values(df, "Name", correct_name)
print(df)
# Output
Name Type Amount Year
0 Bob Fisherman Income 150 2022
1 Bob Haris Income 100 2021
2 Bob Felony Income 100 2021
3 Bob Felony Expense 50 2022
4 Bob Haris Expense 100 2022
5 Bob Fisherman Expense 100 2021