Suppose I have the following Pandas DataFrame (code shared below):
Suppose further that I want to create a new column called NewVal
according to the following rule (process often called "recoding"):
How would I do this recoding in Pandas?
Side-note: The conditions above are just a simple illustrative example - the real world case I'm working with has considerably more conditions and they don't cover the possibility space so nicely, which is why I also need that "in all other cases/default" condition.
case_when
In R, I know I can use tidyverse's case_when function to create a new column using these complex formulas:
# Loading the tidyverse library
library(tidyverse)

# Creating the dataframe
my_df = as.data.frame(
  list('OldVal1'=c(1,1,1,1,2,2,2,2,3,3,3,3),
       'OldVal2'=c(1,2,3,4,5,6,7,8,9,10,11,12)))

# Using case_when to create the new column
my_df = my_df %>%
  mutate(NewVal = case_when((OldVal1 * OldVal2) %% 3 == 0 ~ OldVal2 * 100,
                            (OldVal1 - OldVal2) %% 4 == 1 ~ OldVal2 + 100,
                            .default = OldVal2))
apply
I found a workaround in Python: define an auxiliary function and then apply it to the DataFrame. But this is a pretty clunky and roundabout solution, since it requires defining a whole new function that only gets used once. On top of that, it relies on apply, which can be very slow on bigger DataFrames because it is not vectorized: with axis=1 it calls the Python function once per row.
# Importing the pandas library
import pandas as pd

# Creating the dataframe
my_df = pd.DataFrame({'OldVal1':(1,1,1,1,2,2,2,2,3,3,3,3),
                      'OldVal2':(1,2,3,4,5,6,7,8,9,10,11,12)})

# Auxiliary function
def my_colmaker(in_row):
    if (in_row['OldVal1'] * in_row['OldVal2']) % 3 == 0:
        out_cel = in_row['OldVal2'] * 100
    elif (in_row['OldVal1'] - in_row['OldVal2']) % 4 == 1:
        out_cel = in_row['OldVal2'] + 100
    else:
        out_cel = in_row['OldVal2']
    return pd.Series({'NewVal': out_cel})

# Applying the function to the dataframe
my_df['NewVal'] = my_df.apply(lambda in_row: my_colmaker(in_row), axis=1)
This is what the output dataframe looks like:
Is there a better (quicker execution and code-friendly) way to create this new column?
I know of two SE threads that deal with a similar problem:
However, I posit that this specific question is not a duplicate of those other ones. In the two cases linked above, the values for the new column are static values, such as 'Easy', 'Medium' and 'Hard'. For those cases, we can easily use functions such as np.select or np.where.
But in the case I illustrated above, the values of the new column are not static. They depend on the values of other columns within that same row. Specifically, they can be OldVal2 * 100, OldVal2 + 100 or just OldVal2 (depending on the case).
As far as I can tell, functions like np.select or np.where do not allow for this type of variable output.
This is why I believe my question is not a duplicate of those linked above.
np.select suffices. Note that there is no hardcoding here; the choices are computed from the values in the columns:
import numpy as np

# Assumes my_df holds only the columns OldVal1 and OldVal2,
# since the row-wise reductions below use every column.
# (my_df.OldVal1 * my_df.OldVal2) % 3 == 0
cond1 = my_df.prod(axis=1).mod(3).eq(0)
# (my_df.OldVal1 - my_df.OldVal2) % 4 == 1
cond2 = my_df.agg(np.subtract.reduce, axis=1).mod(4).eq(1)
condlist = [cond1, cond2]
choicelist = [my_df.OldVal2.mul(100), my_df.OldVal2.add(100)]
result = np.select(condlist, choicelist, default=my_df.OldVal2)
my_df.assign(newval=result)
    OldVal1  OldVal2  newval
0         1        1       1
1         1        2       2
2         1        3     300
3         1        4     104
4         2        5     105
5         2        6     600
6         2        7       7
7         2        8       8
8         3        9     900
9         3       10    1000
10        3       11    1100
11        3       12    1200
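For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same np.select idea, with the two conditions written as explicit column arithmetic instead of row-wise reductions. The names is_multiple_of_3 and diff_mod_4_is_1 are my own, chosen for illustration, and the result is stored under NewVal as in the question.

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({
    'OldVal1': (1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
    'OldVal2': (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
})

# Conditions are boolean Series, evaluated for every row at once.
is_multiple_of_3 = (my_df['OldVal1'] * my_df['OldVal2']) % 3 == 0
diff_mod_4_is_1 = (my_df['OldVal1'] - my_df['OldVal2']) % 4 == 1

# Choices are Series too, so each row gets a value computed from its own columns.
# The first matching condition wins, as in an if/elif chain.
my_df['NewVal'] = np.select(
    [is_multiple_of_3, diff_mod_4_is_1],
    [my_df['OldVal2'] * 100, my_df['OldVal2'] + 100],
    default=my_df['OldVal2'],
)
print(my_df)
```

Writing the conditions against named columns also means the code keeps working if the DataFrame later gains extra columns.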
You can also use Series.case_when (added in pandas 2.2), where you pass a list of (condition, replacement) tuples and the Series you call it on supplies the default values:
default = my_df.OldVal2
outcome = default.case_when([(cond1, my_df.OldVal2.mul(100)),
                             (cond2, my_df.OldVal2.add(100))])
my_df.assign(newval=outcome)
    OldVal1  OldVal2  newval
0         1        1       1
1         1        2       2
2         1        3     300
3         1        4     104
4         2        5     105
5         2        6     600
6         2        7       7
7         2        8       8
8         3        9     900
9         3       10    1000
10        3       11    1100
11        3       12    1200
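The case_when version can be sketched as a standalone script too. This assumes a pandas release that ships the case_when method on Series; the hasattr guard is my own addition so the script degrades gracefully on older installs instead of raising an AttributeError.

```python
import pandas as pd

my_df = pd.DataFrame({
    'OldVal1': (1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
    'OldVal2': (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
})

# The Series that case_when is called on plays the role of the default branch.
default = my_df['OldVal2']

if hasattr(pd.Series, 'case_when'):
    # Each tuple is (condition, replacement); earlier tuples take precedence.
    new_val = default.case_when([
        ((my_df['OldVal1'] * my_df['OldVal2']) % 3 == 0, my_df['OldVal2'] * 100),
        ((my_df['OldVal1'] - my_df['OldVal2']) % 4 == 1, my_df['OldVal2'] + 100),
    ])
    my_df = my_df.assign(NewVal=new_val)
    print(my_df)
else:
    print('Series.case_when is not available in pandas', pd.__version__)
```

This reads closest to the R case_when call in the question: conditions and replacements are paired up, and rows matching no condition keep their OldVal2 value.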