r pandas dataframe calculated-columns recode

How to create a new column using complex formulas and variable outputs (i.e., recoding variables) in a Pandas DataFrame, similar to R's `case_when`?


Main Problem

Suppose I have the following Pandas DataFrame (code shared below):

[Image: input table]

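Suppose further that I want to create a new column called NewVal according to the following rule (a process often called "recoding"):

  • if (OldVal1 * OldVal2) % 3 == 0, then NewVal = OldVal2 * 100;
  • otherwise, if (OldVal1 - OldVal2) % 4 == 1, then NewVal = OldVal2 + 100;
  • in all other cases, NewVal = OldVal2.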

How would I do this recoding in Pandas?

Side note: the conditions above are just a simple illustrative example. The real-world case I'm working with has considerably more conditions, and they don't cover the possibility space so nicely, which is why I also need that "in all other cases/default" condition.

Solution in R using case_when

In R, I know I can use tidyverse's case_when function to create a new column using these complex formulas:

# Loading the tidyverse library
library(tidyverse)

# Creating the dataframe
my_df = as.data.frame(
  list('OldVal1'=c(1,1,1,1,2,2,2,2,3,3,3,3),
       'OldVal2'=c(1,2,3,4,5,6,7,8,9,10,11,12)))

# Using case_when to create the new column
my_df = my_df %>%
  mutate(NewVal = case_when((OldVal1 * OldVal2) %% 3 == 0 ~ OldVal2 * 100,
                            (OldVal1 - OldVal2) %% 4 == 1 ~ OldVal2 + 100,
                            .default = OldVal2))

My workaround in Python using apply

I found a workaround in Python: define an auxiliary function and apply it to the DataFrame. But this is clunky and roundabout, since it requires defining a whole new function that only gets used once. On top of that, apply with axis=1 can be very slow on bigger DataFrames because it is not vectorized: it calls the Python function once per row.

# Importing the pandas library
import pandas as pd

# Creating the dataframe
my_df = pd.DataFrame({'OldVal1':(1,1,1,1,2,2,2,2,3,3,3,3),
                      'OldVal2':(1,2,3,4,5,6,7,8,9,10,11,12)})

# Auxiliary function: takes one row and returns the scalar value for NewVal
def my_colmaker(in_row):
    if (in_row['OldVal1'] * in_row['OldVal2']) % 3 == 0:
        out_cel = in_row['OldVal2'] * 100
    elif (in_row['OldVal1'] - in_row['OldVal2']) % 4 == 1:
        out_cel = in_row['OldVal2'] + 100
    else:
        out_cel = in_row['OldVal2']
    return out_cel

# Applying the function to the dataframe, row by row
my_df['NewVal'] = my_df.apply(my_colmaker, axis=1)

This is what the output dataframe looks like:

[Image: expected results]

Is there a better way to create this new column, one that runs faster and is cleaner to write?

Edit: not a duplicate of older questions

I know of two SE threads that deal with a similar problem:

However, I posit that this specific question is not a duplicate of those other ones. In those two cases linked above, the values for the new column are static values, such as 'Easy', 'Medium' and 'Hard'. For those cases, we can easily use functions such as np.select or np.where.

But in the case I illustrated above, the values of the new column are not static. They depend on values of the other columns within that same row. Specifically, they can be OldVal2 * 100, OldVal2 + 100 or just OldVal2 (depending on the case).

Functions like np.select or np.where do not allow for this type of variable output.

This is why I believe my question is not a duplicate of those linked above.


Solution

  • np.select suffices. Nothing has to be hardcoded here: the choices can be arrays or Series computed from the columns themselves:

    import numpy as np

    # (my_df.OldVal1 * my_df.OldVal2) % 3 == 0
    # note: prod(axis=1) multiplies every column, so my_df must hold only OldVal1 and OldVal2 here
    cond1 = my_df.prod(axis=1).mod(3).eq(0)
    # (my_df.OldVal1 - my_df.OldVal2) % 4 == 1
    cond2 = my_df.agg(np.subtract.reduce, axis=1).mod(4).eq(1)
    condlist = [cond1, cond2]
    choicelist = [my_df.OldVal2.mul(100), my_df.OldVal2.add(100)]
    result = np.select(condlist, choicelist, default=my_df.OldVal2)
    my_df.assign(newval=result)
        OldVal1  OldVal2  newval
    0         1        1       1
    1         1        2       2
    2         1        3     300
    3         1        4     104
    4         2        5     105
    5         2        6     600
    6         2        7       7
    7         2        8       8
    8         3        9     900
    9         3       10    1000
    10        3       11    1100
    11        3       12    1200
    

    You can also use Series.case_when (available since pandas 2.2). Call it on the Series holding the default values and pass a list of (condition, replacement) tuples:

    default = my_df.OldVal2
    outcome = default.case_when([(cond1, my_df.OldVal2.mul(100)), 
                                 (cond2, my_df.OldVal2.add(100))])
    my_df.assign(newval=outcome)
        OldVal1  OldVal2  newval
    0         1        1       1
    1         1        2       2
    2         1        3     300
    3         1        4     104
    4         2        5     105
    5         2        6     600
    6         2        7       7
    7         2        8       8
    8         3        9     900
    9         3       10    1000
    10        3       11    1100
    11        3       12    1200
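
For reference, here is a minimal self-contained sketch of the same recoding with the conditions written as explicit column arithmetic. This is a variation on the code above, not a different method. It assumes only that the frame contains OldVal1 and OldVal2. It may be preferable when the frame has other columns (for example, if NewVal was already added by the apply workaround), because `my_df.prod(axis=1)` would then multiply those columns in as well. The names `via_select` and `via_where` are illustrative only.

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({
    "OldVal1": (1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
    "OldVal2": (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
})

# Conditions built column by column, so extra columns in the frame
# cannot leak into them.
cond1 = (my_df["OldVal1"] * my_df["OldVal2"]) % 3 == 0
cond2 = (my_df["OldVal1"] - my_df["OldVal2"]) % 4 == 1

# np.select picks, for each row, the choice belonging to the first
# condition that is True (like if / elif), else the default.
via_select = np.select(
    [cond1, cond2],
    [my_df["OldVal2"] * 100, my_df["OldVal2"] + 100],
    default=my_df["OldVal2"],
)

# The same logic as nested np.where calls; fine for two or three
# branches, harder to read with many.
via_where = np.where(
    cond1,
    my_df["OldVal2"] * 100,
    np.where(cond2, my_df["OldVal2"] + 100, my_df["OldVal2"]),
)

my_df["NewVal"] = via_select
print(my_df)
```

Both variants evaluate every choice for the whole column and then pick per row, so each choice expression must be valid for all rows, not only for the rows where its condition holds.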