Tags: python, list, duplicates, multilabel-classification, data-preprocessing

Create a new DataFrame column that contains a list


I'm working on a multi-label image classification task. I have a dataframe with two columns (id and labels). I want to create a new column that checks the ids for duplicates and, if an id is duplicated (which is the case here), collects the additional labels for that id. The result should be a new column containing all labels for each id. I'm struggling to write the labels into a new column as a list. Can anyone help me with this?

My df has the following structure:

| id       | labels         |
| -------- | -------------- |
| x.jpg    | label_1        |
| x.jpg    | label_2        |

Desired new dataframe:

| id       | labels         | all_labels                        |
| -------- | -------------- | --------------------------------- |
| x.jpg    | label_1        | [label_1, label_2, and any others] |
| x.jpg    | label_2        |                                   |

Solution

  • I think this does what you want, although the format is a bit different (one row per id, with the labels collected into a list):

    newdf = df.groupby('id')['labels'].agg(list).reset_index(name='labels')
    

    produces

          id              labels
    0  x.jpg  [label_1, label_2]
    1  y.jpg           [label_3]
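
If the list should instead appear on every original row, as in the question's desired table, one possible approach is to map each id back to its aggregated list. This is only a minimal sketch: the sample data is assumed (the y.jpg row is taken from the output above), and `labels_per_id` is an illustrative name, not something from the question.

```python
import pandas as pd

# Assumed sample data: the two x.jpg rows from the question,
# plus the y.jpg row that appears in the answer's output.
df = pd.DataFrame({
    "id": ["x.jpg", "x.jpg", "y.jpg"],
    "labels": ["label_1", "label_2", "label_3"],
})

# One list of labels per id (same aggregation as in the answer above).
# The result is a Series indexed by id.
labels_per_id = df.groupby("id")["labels"].agg(list)

# Look up each row's id in that Series, so every original row
# carries the full list of labels for its id.
df["all_labels"] = df["id"].map(labels_per_id)

print(df)
```

Unlike the one-liner above, this keeps the original row count and the original `labels` column, so duplicated ids repeat the same list on each of their rows.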