When working with pandas dataframes, I like to use method chains, because it makes the workflow similar to the tidyverse approach in R, where you use a string of pipes.
Consider the example in this answer:
    import numpy as np
    import pandas as pd

    N = 10
    df = (
        pd.DataFrame({"x": np.random.random(N)})
        .assign(y=lambda d: d['x']*0.5)
        .assign(z=lambda d: d.y * 2)
        .assign(w=lambda d: d.z*0.5)
    )
I think I've heard that manipulating dataframes using lambda is inefficient, because it is not a vectorized operation and some looping goes on under the hood instead.
Is this an issue with examples like the one above? Are there alternatives to using lambda in a method chain that retain the tidyverse-like approach?
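Your operations are vectorized. The lambda does not operate at the level of the individual values: assign calls it once per new column, passing it the whole DataFrame, and the expression inside it (for example d['x'] * 0.5) is ordinary vectorized column arithmetic. The overhead of the function call itself is negligible for large enough datasets.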
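To make this concrete, here is a minimal sketch that records what the callable passed to assign actually receives. The names half_x and calls are made up for this illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

calls = []  # hypothetical log: type name of every argument the callable receives


def half_x(d):
    # assign passes the DataFrame being built, not individual values
    calls.append(type(d).__name__)
    return d["x"] * 0.5  # plain vectorized Series arithmetic


df = pd.DataFrame({"x": np.random.random(1_000)}).assign(y=half_x)

# One entry for 1,000 rows: the callable ran once, on the whole DataFrame
print(calls)
```

If the function were applied per value, the log would contain one entry per row instead of a single entry.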
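However, each assign call generates a new DataFrame. You could use a single assign call instead, which avoids creating an intermediate DataFrame at each step. Later keyword arguments can refer to columns created earlier in the same call, so the dependencies between y, z and w still work: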
    df = (pd.DataFrame({'x': np.random.random(N)})
            .assign(y=lambda d: d['x'] * 0.5,
                    z=lambda d: d.y * 2,
                    w=lambda d: d.z * 0.5,
                    )
          )
There is a significant gain in performance:
NB. I'm only timing .assign(x,y,z) vs .assign(x).assign(y).assign(z); the DataFrame is pre-generated.
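A comparison along these lines could be reproduced with a sketch like the one below. The size N = 100_000, the repeat counts, and the helper names chained and single are assumptions chosen for illustration, not the setup used for the timing above, and the size of the gap will depend on the pandas version and the data.

```python
import timeit

import numpy as np
import pandas as pd

N = 100_000  # assumed size, chosen only for this illustration
base = pd.DataFrame({"x": np.random.random(N)})  # pre-generated, not timed


def chained(df):
    # three assign calls: each one returns a new DataFrame
    return (
        df.assign(y=lambda d: d["x"] * 0.5)
        .assign(z=lambda d: d["y"] * 2)
        .assign(w=lambda d: d["z"] * 0.5)
    )


def single(df):
    # one assign call: later keywords see the columns created by earlier ones
    return df.assign(
        y=lambda d: d["x"] * 0.5,
        z=lambda d: d["y"] * 2,
        w=lambda d: d["z"] * 0.5,
    )


# Both variants should produce the same result before we compare speed
assert chained(base).equals(single(base))

# Take the best of several repeats to reduce noise from other processes
t_chained = min(timeit.repeat(lambda: chained(base), number=20, repeat=5))
t_single = min(timeit.repeat(lambda: single(base), number=20, repeat=5))

print(f"chained assign: {t_chained:.4f} s for 20 runs")
print(f"single assign:  {t_single:.4f} s for 20 runs")
```

Both functions keep the method-chain style, so switching to the single call does not change how the rest of the pipeline is written.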