I am encountering some odd behavior in pandas, and I am hoping someone can shed some light on how the df.assign(...)
method of a pandas DataFrame works. I am getting a ValueError
when trying to assign to a column, even though the function itself is valid.
def is_toc_row(row):
    m_sig = m_df.loc[m_df.signature == row.signature]
    pct = (~pd.isnull(m_sig.line_type)).sum() / m_sig.shape[0]
    return (not pd.isnull(row.line_type)) or (pct < .5)

m_df = m_df.assign(is_toc_row=is_toc_row)
Gives:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
But this works totally fine:
for ind, row in m_df.iterrows():
    m_df.at[ind, 'is_toc_row'] = is_toc_row(row)
Is there some issue with referencing the rest of the DataFrame in the function? All I see in the docs is that the subject df cannot change, which it does not.
Of course I am capable of building a workaround, I just want to understand why this does not work for future use.
Not totally sure why there are so many downvotes, but I am adding a few rows of data here anyway, as requested:
index | signature | line_type |
---|---|---|
0 | WYcxXTjq27YAP4uJOcLeRLelyUixNJaOwFwf2qqfpM4 | NaN |
1 | WYcxXTjq27YAP4uJOcLeRLelyUixNJaOwFwf2qqfpM4 | NaN |
2 | WYcxXTjq27YAP4uJOcLeRLelyUixNJaOwFwf2qqfpM4 | 1 |
3 | WYcxXTjq27YAP4uJOcLeRLelyUixNJaOwFwf2qqfpM4 | 2 |
4 | WYcxXTjq27YAP4uJOcLeRLelyUixNJaOwFwf2qqfpM4 | 2.4 |
Actually, when assign is used with a custom function, the function does not receive the DataFrame row by row (as it would with apply and axis=1); it is called once and receives the full DataFrame. Let's take a toy example:
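import pandas as pd

m_df = pd.DataFrame({'temp_b': [7.0, 5.0], 'temp_c': [17.0, 25.0]},
                    index=['Portland', 'Berkeley'])

def myfunc(x):
    # x is the whole DataFrame, so this prints once, not once per row
    print(x, "*end*")
    return x.temp_c + x.temp_b

m_df = m_df.assign(is_toc_row=myfunc)
display(m_df)  # display() is available in Jupyter/IPython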
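This also explains the ValueError in the question: inside is_toc_row, `row` is the whole DataFrame, so `pd.isnull(row.line_type)` is a Series and applying `not` to it raises the "truth value of a Series is ambiguous" error. Below is a minimal sketch of two possible ways around it, not necessarily the best fit for the real data. The sample frame (including the second signature "b") and the helper name is_toc_row_vectorized are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only: signature "b" is invented so both branches are hit.
m_df = pd.DataFrame({
    "signature": ["a", "a", "a", "a", "a", "b", "b"],
    "line_type": [np.nan, np.nan, 1, 2, 2.4, np.nan, np.nan],
})

def is_toc_row(row):
    # Same logic as in the question; expects a single row.
    m_sig = m_df.loc[m_df.signature == row.signature]
    pct = (~pd.isnull(m_sig.line_type)).sum() / m_sig.shape[0]
    return bool((not pd.isnull(row.line_type)) or (pct < .5))

# Option 1: keep the row-wise function and let apply(axis=1) feed it rows.
row_wise = m_df.assign(is_toc_row=lambda df: df.apply(is_toc_row, axis=1))

# Option 2: work on whole columns, which is what assign hands over anyway.
def is_toc_row_vectorized(df):
    has_type = df["line_type"].notna()
    # Fraction of non-null line_type values within each signature group.
    pct = has_type.astype(float).groupby(df["signature"]).transform("mean")
    return has_type | (pct < 0.5)

vectorized = m_df.assign(is_toc_row=is_toc_row_vectorized)

print(row_wise)
print(vectorized)
```

The second option replaces `not` and `or` with the element-wise operators `~` / `.notna()` and `|`, which is usually what is needed whenever a function passed to assign has to combine boolean columns.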