[SOLVED] Replacing Duplicates in a column of list type in Pandas

Replacing Duplicates in a column of list type in Pandas

Background Info: I have a dataframe df which has various columns out of which the focus is the column called 'genres'

Goal:

The problem can be seen in this image, there are entries where duplicates are found for example '[Drama, Romance]' and '[Romance, Drama]' are the same thing

Now the objective is to remove/replace duplicates programmatically such that the variation is replaced with its equivalent form such as this.

Example:

'[Drama, Romance]' and '[Romance, Drama]'

Now [Romance, Drama] is replaced with [Drama, Romance] or vice versa instead of complete removal we just replace the list's content

Output - Before Replacing Duplicates '[Drama, Romance]' and '[Romance, Drama]'

Expected Output - After Replacing Duplicates '[Drama, Romance]'

Filter df column 'genres' to only include genres with list entries not exceeding 3 genres for example remove any rows with more than 3 genres. Example of acceptable result in 'genres' column:

[Romance, Drama, Comedy]
[Romance, Drama]
[Drama]

I have tried the following:

#to delist the 'genres' column
df['genres'] = df.genres.apply(', '.join)

# code sample of manually replaced duplicated content in genres column
df['genres'] = df['genres'].str.replace("Romance, Drama","Drama, Romance")
df['genres'] = df['genres'].str.replace("Drama, Comedy","Comedy, Drama")

The above code works but it is done manually for individual duplicates so I want to find a way to code this for all of the duplicates found in 'genres' column of df

Solution

Assuming list data type for each row within the column:

You can first sort the list per row with sorted

Then filter the rows of the dataframe with loc and get the value_counts()

df['genres'] = df['genres'].apply(lambda x: sorted(x))
df.loc[df['genres'].apply(lambda x: len(x) <= 3), 'genres'].value_counts()