I have a Dataframe where I want to derive a Boolean column based on some other columns, including a check on whether a particular column is NaN:
def secret_sauce(my_row):
if (pd.isna(my_row["col1"]) and
my_row["col2"] != "BAD_VAL"):
return False
return True
my_df = my_df.assign(
my_boolean_col=lambda df: df.apply(
secret_sauce, axis=1
)
)
col1
can be either a list or NaN. I keep running into problems with various ways to check if it is NaN, because for example pd.isna()
does not like being called on a list.
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
However, any()
or all()
does not seem to be correct for calling on NaN. So I seem to be stuck. What would be the best way to do this?
The following setup can reproduce the issue:
import pandas as pd
import numpy as np
d = {
'mypk': ['ABC', 'DEF', 'GHI'],
'col1': [["HELLO"], np.nan, ["GOODBYE", "ADIOS"]],
'col2': ['GOOD_VAL', 'GOOD_VAL', 'GOOD_VAL']}
my_df = pd.DataFrame(data=d)
mypk col1 col2
0 ABC [HELLO] GOOD_VAL
1 DEF NaN GOOD_VAL
2 GHI [GOODBYE, ADIOS] GOOD_VAL
What I want is a new column with values True, False, True, respectively for the 3 rows. It processes the first 2 rows fine, but fails when it gets to the 3rd row with multiple elements in col1
. It can process the list with a single element fine.
So what I'm wondering is, what's the best way to have a condition that is True if it's NaN, and False otherwise, in this context? The different ways I've found so far don't seem to properly handle the possible different input values.
For me, this is an XY problem because you don't need to use .apply()
in the first place; instead you should use isna
and !=
on the whole columns instead of single elements, and in that case, the problem disappears.
my_df.assign(
my_boolean_col=lambda df: ~(df['col1'].isna() & df['col2'].ne('BAD_VAL'))
)
mypk col1 col2 my_boolean_col
0 ABC [HELLO] GOOD_VAL True
1 DEF NaN GOOD_VAL False
2 GHI [GOODBYE, ADIOS] GOOD_VAL True
Check out cs95's answer on "How can I iterate over rows in a Pandas DataFrame?", which says, first prefer vectorized methods (as I'm using here), and then .apply()
is lower down the list.