I have a similar problem to the links provided in the following references with minor differences but want the same results:
I have on dataframe and want to get the partial ratio and token between 2 columns within the dataframe. Column 1 is just one word per row, but column 2 is a list of words with each row varying in size(I changed it to a tuple to make the functions in the references work).
The main issue I get is that in the compare it goes through column 1 and compares each element to every element in column 2 thus creating a massive dataframe when I just want it 1 to 1. How can I fix this?
df = pd.DataFrame(
{
"id": [1, 2, 3, 4, 5, 6],
"fruits": ["apple", "apples", "orange", "apple tree", "oranges", "mango"],
"choices": [
("app", "apull", "apple"),
("app", "apull", "apple", "appple"),
("orange", "org"),
("apple"),
("oranges", "orang"),
("mango"),
],
}
)
id fruits choices
0 1 apple ('app', 'apull', 'apple')
1 2 apples ('app', 'apull', 'apple', 'appple')
2 3 orange ('orange', 'org')
3 4 apple tree ('apple')
4 5 oranges ('oranges', 'orang')
5 6 mango ('mango')
What compare gives me in the variable explorer:
compare = pd.MultiIndex.from_product([df['fruits'], df['choices']]).to_series()
fruits choices
0 apple ('app', 'apull', 'apple')
1 apple ('app', 'apull', 'apple', 'appple')
2 apple ('orange', 'org')
3 apple ('apple')
4 apple ('oranges', 'orang')
5 apple ('mango')
6 apples ('app', 'apull', 'apple')
7 apples ('app', 'apull', 'apple', 'appple')
8 apples ('orange', 'org')
...
Is it possible to get the desired output like the first output in reference 1 but the multi-indexed elements as the choices?
Expected output like in reference #1, but I want the choices multi-indexed:
Here is one way to do it with Pandas explode and groupby:
from fuzzywuzzy import fuzz
new_df = (
df.explode("choices")
.drop(columns="id")
.pipe(
lambda df_: df_.assign(
ratio=df_.apply(lambda x: fuzz.ratio(x["fruits"], x["choices"]), axis=1),
token=df_.apply(
lambda x: fuzz.token_sort_ratio(x["fruits"], x["choices"]), axis=1
),
)
)
.groupby(["fruits", "choices"])
.agg(list)
.applymap(lambda x: x[0])
)
Then:
print(new_df)
# Output
ratio token
fruits choices
apple app 75 75
apple 100 100
apull 60 60
apple tree apple 67 67
apples app 67 67
apple 91 91
appple 83 83
apull 55 55
mango mango 100 100
orange orange 100 100
org 67 67
oranges orang 83 83
oranges 100 100