Background Info: I have a dataframe df
which has various columns out of which the focus is the column called 'genres'
Goal:
Example:
'[Drama, Romance]' and '[Romance, Drama]'
Now [Romance, Drama] is replaced with [Drama, Romance] or vice versa instead of complete removal we just replace the list's content
Output - Before Replacing Duplicates '[Drama, Romance]' and '[Romance, Drama]'
Expected Output - After Replacing Duplicates '[Drama, Romance]'
df
column 'genres' to only include genres with list entries not exceeding 3 genres for example remove any rows with more than 3 genres. Example of acceptable result in 'genres' column:I have tried the following:
#to delist the 'genres' column
df['genres'] = df.genres.apply(', '.join)
# code sample of manually replaced duplicated content in genres column
df['genres'] = df['genres'].str.replace("Romance, Drama","Drama, Romance")
df['genres'] = df['genres'].str.replace("Drama, Comedy","Comedy, Drama")
The above code works but it is done manually for individual duplicates so I want to find a way to code this for all of the duplicates found in 'genres' column of df
Assuming list
data type for each row within the column:
You can first sort the list per row with sorted
Then filter the rows of the dataframe with loc
and get the value_counts()
df['genres'] = df['genres'].apply(lambda x: sorted(x))
df.loc[df['genres'].apply(lambda x: len(x) <= 3), 'genres'].value_counts()