pythonpandasdataframeconditional-statementslookup

Get column names where its value contains something in pandas


I have a dataframe with various columns, three of which are columns with lists (each cell has a list). These three columns have mutually exclusive values.

I have another column which has the label (A, B, C, D or E)

   vot_in_favour vot_against vot_abstention  label
0   [A, B, C]      []          [D, E]          A
1   [A, D, E]      [C]         [B]             C
2   [B, C]         [A]         [D, E]          D

I want a column vote which has the name of the column according to the label, like the following:

       vote
0   vot_in_favour
1   vot_against
2   vot_abstention

I tried something like df1['vote'] = df.drop("label", axis=1).isin(df["label"]).any(1), but do not know how to make this to match any value of the lists. I have visited similar issues, but the list columns are posing a challenge.


Solution

  • You will somehow need to get the values out of the lists, compare with the corresponding labels and map that with the column labels.

    The simplest way is to explode the columns sequentially, compare the values with 'label' column values (using eq), get the column labels (mul) and get back into original shape (stack + groupby.first).

    exp_df  = df.explode('vot_in_favour').explode('vot_against').explode('vot_abstention')
    df['vote'] = (
        exp_df
        .filter(like='vot')
        .eq(exp_df['label'], axis=0)
        .pipe(lambda x: x.mul(x.columns))
        .stack()
        .replace('', pd.NA)
        .groupby(level=0).first()
    )
    

    Yet another method is to explode the columns, compare with 'label' and filter out duplicate rows. # explode all three columns exp_df = df.explode('vot_in_favour').explode('vot_against').explode('vot_abstention') # compare labels with votes to find matches and assign index to be used later to remove duplicates exp_df = exp_df.eq(exp_df.pop('label'), axis=0).assign(index=lambda x: x.index) # remove duplicates and all False rows and get the matches in each row df['vote'] = exp_df[exp_df.any(axis=1) & ~exp_df.duplicated() & ~exp_df.pop('index').duplicated()].idxmax(axis=1)

    Both version produce the desired output as follows:

      vot_in_favour vot_against vot_abstention label            vote
    0     [A, B, C]          []         [D, E]     A   vot_in_favour
    1     [A, D, E]         [C]            [B]     C     vot_against
    2        [B, C]         [A]         [D, E]     D  vot_abstention