pandasdataframe

Get the value of a column by max value in others


I have the df:

import pandas as pd
data = {"id_1":["a","b","c"],
        "id_2":["q","w","e"],
        "val_1":[1,2,3],
        "val_2":[2,0,0]}
df = pd.DataFrame(data)

#   id_1 id_2  val_1  val_2
# 0    a    q      1      2
# 1    b    w      2      0
# 2    c    e      3      0

I want to compare the values of val_1 and val_2 columns. If val_1 is more than val_2 (they are never the same) I want to populate a new column with the value of id_1, else the value of id_2. To create:

  id_1 id_2  val_1  val_2 max_id
0    a    q      1      2      q
1    b    w      2      0      b
2    c    e      3      0      c

max_id of the first row is q because val_2>val_1...

I can get the max value with: df[["val_1", "val_2"]].max(axis=1)

And the column name from which the fetch the ids with:

index_map = {0:"id_1", 1:"id_2"}
df.apply(lambda x: [x.val_1, x.val_2].index(max([x.val_1, x.val_2])), axis=1).map(index_map)
# 0    id_2
# 1    id_1
# 2    id_1

But then my ideas run out.


Solution

  • Since you have two columns, a simple approach would be to use where:

    df['max_id'] = df['id_1'].where(df['val_1'].gt(df['val_2']), df['id_2'])
    

    If you want to generalize to an arbitrary number of id/values pairs, you could combine idxmax and indexing lookup:

    idx, cols = pd.factorize(df.filter(like='val_').idxmax(axis=1)
                               .str.replace('val_', 'id_'))
    df['max_id'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
    

    Or with wide_to_long:

    tmp = pd.wide_to_long(df.reset_index(), stubnames=['id', 'val'],
                          i='index', j='col', sep='_')
    df['max_id'] = (tmp.loc[tmp.groupby(level='index')['val'].idxmax(), 'id']
                       .droplevel('col')
                   )
    

    Or, if the columns are already sorted in the correct order (1,2,3... ; 1,2,3...), using :

    ids = df.filter(like='id_').to_numpy()
    vals = df.filter(like='val_').to_numpy()
    
    df['max_id'] = ids[np.arange(ids.shape[0]), np.argmax(vals, axis=1)]
    

    Output:

      id_1 id_2  val_1  val_2 max_id
    0    a    q      1      2      q
    1    b    w      2      0      b
    2    c    e      3      0      c
    

    Intermediates:

    # idxmax + factorize approach:
    
      id_1 id_2  val_1  val_2 idxmax str.replace
    0    a    q      1      2  val_2        id_2
    1    b    w      2      0  val_1        id_1
    2    c    e      3      0  val_1        id_1
    
    # wide_to_long output:
    
              id  val
    index col        
    0     1    a    1
    1     1    b    2    # max from index = 1
    2     1    c    3    # max from index = 2
    0     2    q    2    # max from index = 0
    1     2    w    0
    2     2    e    0