python pandas dataframe

Merging lists with identical elements but in different order in a pandas Series into one unique list


Consider this simple dataframe:

import pandas as pd

df = pd.DataFrame({'category': [['Restaurants', 'Pizza'], ['Pizza', 'Restaurants'], ['Restaurants', 'Mexican']]})

df:

                 category
0    [Restaurants, Pizza]
1    [Pizza, Restaurants]
2  [Restaurants, Mexican]

The issue is that the categories in the first two rows are essentially identical, just in a different order. My goal is to collapse the two into one (it does not matter which one), so that in the resulting df rows 0 and 1 both hold the same list: either ['Restaurants', 'Pizza'] or ['Pizza', 'Restaurants'].

I thought about getting the indices of the rows with essentially the same categories (rows 0 and 1 in this example) and then finding a way to replace them all with one. But I am not sure my code is correct, and the whole dataset is huge, so this nested loop is inefficient:

identical_idx = []
df_length = len(df)
for i in range(df_length):
    for j in range(df_length):
        # compare every ordered pair of distinct rows, ignoring element order
        if i != j and set(df.category.iloc[i]) == set(df.category.iloc[j]):
            identical_idx.append([i, j])

What is the most efficient way to achieve this?


Solution

  • An easy option would be to sort the lists:

    df['category'] = df['category'].map(sorted)
    

    Output:

                     category
    0    [Pizza, Restaurants]
    1    [Pizza, Restaurants]
    2  [Mexican, Restaurants]
    

    I would probably convert them to frozenset instead. Unlike a list or a plain set, a frozenset is hashable, so it can be used directly as a key in grouping operations:

    df['category'] = df['category'].map(frozenset)
    

    Output:

                     category
    0    (Pizza, Restaurants)
    1    (Pizza, Restaurants)
    2  (Mexican, Restaurants)
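
To connect this back to the question, here is a minimal sketch of how the frozenset key could replace the nested loop. It assumes the goal is to keep the first list seen for each set of categories. The names `key`, `dup_mask` and `first_seen` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'category': [['Restaurants', 'Pizza'],
                                ['Pizza', 'Restaurants'],
                                ['Restaurants', 'Mexican']]})

# Hashable, order-insensitive key for each row (one pass over the data).
key = df['category'].map(frozenset)

# Rows whose key occurs more than once, i.e. the "essentially identical" rows.
dup_mask = key.duplicated(keep=False)
identical_idx = df.index[dup_mask].tolist()

# Remember the first list seen for each key (assumption: any representative is fine).
first_seen = {}
for k, lst in zip(key, df['category']):
    first_seen.setdefault(k, lst)

# Replace every list with the representative of its key.
df['category'] = pd.Series([first_seen[k] for k in key], index=df.index)

print(identical_idx)
print(df)
```

Each row is hashed once, so the work grows roughly linearly with the number of rows instead of quadratically. Note that a frozenset also ignores repeated elements, so ['Pizza', 'Pizza'] and ['Pizza'] would be treated as the same category here.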