[SOLVED] performance issues of using lambda for assigning variables in pandas in a method chain

When working with pandas dataframes, I like to use method chains, because it makes the workflow similar to the tidyverse approach in R, where you use a string of pipes.

Consider the example in this answer:

import numpy as np
import pandas as pd

N = 10
df = (
    pd.DataFrame({"x": np.random.random(N)})
    .assign(y=lambda d: d['x']*0.5)
    .assign(z=lambda d: d.y * 2)
    .assign(w=lambda d: d.z*0.5)
)

I think I've heard that manipulating dataframes using lambda is inefficient, because it is not a vectorized operation and some looping goes on under the hood instead.

Is this an issue with examples like the one above? Are there alternatives to using lambda in a method chain that retain the tidyverse-like approach?

Solution

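Your operations are vectorized: the lambda does not operate at the level of the individual values. assign calls it once per new column, passing in the whole DataFrame, and the arithmetic inside it (for example d['x']*0.5) is an ordinary vectorized Series operation. The cost of the function call itself is negligible for large enough datasets.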
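One way to check this for yourself is to record what the callable actually receives. The sketch below is only an illustration: the helper name half_x, the calls list and the row count of 1000 are made up for the demonstration.

```python
import numpy as np
import pandas as pd

calls = []  # records the type of the argument on every invocation

def half_x(d):
    # d is the whole DataFrame, not a single row or value
    calls.append(type(d).__name__)
    return d["x"] * 0.5  # vectorized multiplication on a Series

df = pd.DataFrame({"x": np.random.random(1000)}).assign(y=half_x)

# One entry per call: a single call for 1000 rows means no per-row looping
print(len(calls), calls[0])
```

If the function were applied value by value, calls would hold one entry per row instead of one entry in total.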

However, each assign call is generating a new DataFrame.

You could use a single assign call, which avoids generating an intermediate DataFrame for each step (later keyword arguments can refer to columns created earlier in the same call):

df = (pd.DataFrame({'x': np.random.random(N)})
        .assign(y=lambda d: d['x'] * 0.5,
                z=lambda d: d.y * 2,
                w=lambda d: d.z * 0.5,
               )
     )

There is a significant gain in performance.

NB. I'm only timing .assign(x,y,z) vs .assign(x).assign(y).assign(z); the DataFrame is pre-generated.
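A rough way to repeat this comparison yourself is sketched below. The row count, the number of repetitions and the function names chained and single are arbitrary choices for illustration, and the size of the gap will depend on your pandas version and data size.

```python
import timeit

import numpy as np
import pandas as pd

N = 100_000  # arbitrary size, chosen only for this illustration
base = pd.DataFrame({"x": np.random.random(N)})  # pre-generated, not timed

def chained():
    # three assign calls, each returning a new DataFrame
    return (base
            .assign(y=lambda d: d["x"] * 0.5)
            .assign(z=lambda d: d.y * 2)
            .assign(w=lambda d: d.z * 0.5))

def single():
    # one assign call; later keywords can use columns created earlier
    return base.assign(y=lambda d: d["x"] * 0.5,
                       z=lambda d: d.y * 2,
                       w=lambda d: d.z * 0.5)

# Both variants should produce identical frames
same_result = chained().equals(single())

t_chained = timeit.timeit(chained, number=50)
t_single = timeit.timeit(single, number=50)
print(f"same result: {same_result}")
print(f"chained: {t_chained:.4f}s  single: {t_single:.4f}s")
```

In an IPython session, %timeit chained() and %timeit single() would give the same comparison with less boilerplate.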