Consider this simple dataframe:
import pandas as pd

df = pd.DataFrame({'category': [['Restaurants', 'Pizza'], ['Pizza', 'Restaurants'], ['Restaurants', 'Mexican']]})
df:
category
0 [Restaurants, Pizza]
1 [Pizza, Restaurants]
2 [Restaurants, Mexican]
The issue is that the category values in the first two rows are essentially identical, just in a different order. My goal is to collapse the two into one (it does not matter which one). So, the resulting df should look like:
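category
0 [Restaurants, Pizza]
1 [Restaurants, Pizza]
2 [Restaurants, Mexican]
or:
category
0 [Pizza, Restaurants]
1 [Pizza, Restaurants]
2 [Restaurants, Mexican]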
I thought about getting the indices of the rows with essentially the same categories (rows 0 and 1 in this example) and then finding a way to replace them all with one. But I am not sure my code is correct, and the whole dataset is huge, so this nested loop is inefficient:
identical_idx = []
df_length = len(df)
for i in range(df_length):
    for j in range(df_length):
        if i != j:
            if set(df.category.iloc[i]) == set(df.category.iloc[j]):
                identical_idx.append([i, j])
What is the most efficient way to achieve this?
An easy option would be to sort the lists:
df['category'] = df['category'].map(sorted)
Output:
category
0 [Pizza, Restaurants]
1 [Pizza, Restaurants]
2 [Mexican, Restaurants]
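Sorted lists compare equal but are still unhashable, so they cannot be deduplicated directly. A minimal sketch, assuming you also want to flag or drop the repeated rows: build a sorted tuple per row and use it as a key. The names key, is_repeat and unique_rows are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'category': [['Restaurants', 'Pizza'],
                                ['Pizza', 'Restaurants'],
                                ['Restaurants', 'Mexican']]})

# Sorted tuples are hashable, so pandas can compare and deduplicate them.
key = df['category'].map(lambda cats: tuple(sorted(cats)))

# Boolean mask: True for rows whose categories repeat an earlier row.
is_repeat = key.duplicated()

# Keep only the first occurrence of each category combination.
unique_rows = df[~is_repeat]
```

Here is_repeat marks row 1 as a repeat of row 0, so unique_rows keeps rows 0 and 2.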
I would probably convert them to a set/frozenset (frozenset is hashable, so it can also serve as a grouping key), which will be more efficient for grouping operations:
df['category'] = df['category'].map(frozenset)
Output:
category
0 (Pizza, Restaurants)
1 (Pizza, Restaurants)
2 (Mexican, Restaurants)
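If the original lists should be kept rather than replaced by frozensets, the same key can drive a single pass that avoids the nested loop. This is a rough sketch, not the only way to do it: it assumes that "collapsing" means giving every row the first list seen for its category set, and first_seen is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'category': [['Restaurants', 'Pizza'],
                                ['Pizza', 'Restaurants'],
                                ['Restaurants', 'Mexican']]})

# Map each order-insensitive key to the first list seen with that key.
first_seen = {}
for cats in df['category']:
    first_seen.setdefault(frozenset(cats), cats)

# Replace every row with the representative list for its key.
df['category'] = [first_seen[frozenset(cats)] for cats in df['category']]
```

Each row is visited twice and every lookup is a dictionary access, so the work grows roughly linearly with the number of rows instead of quadratically. As with the set comparison in the question, repeated items inside a single list are ignored when matching.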