In a Polars DataFrame, I have 3 columns (A, B, D) storing values of type double.
For n rows and a starting value (some double), the calculation is performed row by row as shown below:
value of column A = value of column A * initial value
value of column B = value of column B * initial value
value of column D = value of column A + value of column B
initial value = value of column D
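For clarity, here is a plain-Python sketch of that recurrence (illustrative only, using lists of doubles in place of the actual columns):
def reference_loop(a, b, initial_value):
    # Illustrative only: plain lists of doubles standing in for the columns.
    d = [0.0] * len(a)
    carry = initial_value
    for i in range(len(a)):
        a[i] = a[i] * carry
        b[i] = b[i] * carry
        d[i] = a[i] + b[i]
        carry = d[i]  # the next row uses this row's D as its initial value
    return a, b, d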
Unfortunately, as of Polars v0.20.26, complex custom functions of this nature can only be performed satisfactorily via numba gufuncs.
My issue is that I can only return one element from the gufunc and still have it be compatible with Polars (Polars does not support multiple returns as of yet). So in the end, even though I get my final results for column D, the values for columns A and B are not retained.
One way to circumvent this would be to simply perform the same computation two more times, once for each remaining column, but that is rather inefficient. I also tried to hoodwink Polars and numba by making the value of column D an array of [column A, column B, column D], but numba does not allow this with Polars columns: a message is returned informing me that the data is being elided (which, I've noticed, is usually a signal of numba's incompatibility with the data). I've also tried turning the column information into a comma-joined string, to no avail; as far as I know, we simply can't pass complex data like that to numba.
I also wrote a custom C function, but for some very odd reason Python is so abysmally slow at calling a C function from its native API that it takes about 100 times longer than the Polars custom-function approach described above. I can only guess this is due to the time spent wrapping everything into PyObjects, even though I got the same slow result using numpy arrays. There is simply no way for me to take it to a lower level with the Python C API, as the calls are abnormally slow regardless of how many parameters I pass or how complex my data is.
I am at my wits' end at the moment. The only options I can think of are the workarounds described above.
I wanted to post here first to see if there was an easier way out of this pickle (no pun intended).
In general, polars can handle multiple returns. They just need to be packed into a polars series (i.e. use a Struct series). From the documentation:
The output of this custom function must be a Series (or a NumPy array, in which case it will be automatically converted into a Series).
As a simple example, consider the following.
import polars as pl
df = pl.DataFrame({
    "A": [1, 2, 3],
    "B": [1, 1, 1],
    "C": [0, 1, 0],
})
df_new = df.with_columns(
    pl.col("A").map_batches(
        lambda x: pl.Series([{"D": v, "E": 2 * v} for v in x])
    ).alias("result")
)
shape: (3, 4)
┌─────┬─────┬─────┬───────────┐
│ A   ┆ B   ┆ C   ┆ result    │
│ --- ┆ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i64 ┆ struct[2] │
╞═════╪═════╪═════╪═══════════╡
│ 1   ┆ 1   ┆ 0   ┆ {1,2}     │
│ 2   ┆ 1   ┆ 1   ┆ {2,4}     │
│ 3   ┆ 1   ┆ 0   ┆ {3,6}     │
└─────┴─────┴─────┴───────────┘
The "result"
column can be unnested into the struct fields "D"
and "E"
using pl.DataFrame.unnest
.
df_new.unnest("result")
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ C   ┆ D   ┆ E   │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 0   ┆ 1   ┆ 2   │
│ 2   ┆ 1   ┆ 1   ┆ 2   ┆ 4   │
│ 3   ┆ 1   ┆ 0   ┆ 3   ┆ 6   │
└─────┴─────┴─────┴─────┴─────┘
A common pattern here is to have a pure numpy function that is numba-jit-compatible, plus a thin wrapper that handles the conversion (polars to numpy and back). In your concrete case, this could look as follows.
import numpy as np
import numba
@numba.jit(nopython=True)
def f(a, b, initial_value=1.0):
    a_new = np.empty_like(a)
    b_new = np.empty_like(b)
    d_new = np.empty_like(a)
    carry = initial_value
    for i in range(len(a)):
        a_new[i] = a[i] * carry
        b_new[i] = b[i] * carry
        d_new[i] = a_new[i] + b_new[i]
        carry = d_new[i]  # this row's D becomes the next row's initial value
    return a_new, b_new, d_new
def unpack(a_new, b_new, d_new):
    df = pl.DataFrame({
        "A_new": a_new,
        "B_new": b_new,
        "D_new": d_new,
    })
    return pl.Series(df.to_dicts())
(
    df
    .with_columns(
        pl.struct("A", "B").map_batches(
            lambda x: unpack(
                *f(
                    x.struct.field("A").to_numpy(),
                    x.struct.field("B").to_numpy(),
                )
            )
        ).alias("result")
    )
)
shape: (3, 4)
┌─────┬─────┬─────┬───────────┐
│ A   ┆ B   ┆ C   ┆ result    │
│ --- ┆ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ i64 ┆ struct[3] │
╞═════╪═════╪═════╪═══════════╡
│ 1   ┆ 1   ┆ 0   ┆ {1,1,2}   │
│ 2   ┆ 1   ┆ 1   ┆ {4,2,6}   │
│ 3   ┆ 1   ┆ 0   ┆ {18,6,24} │
└─────┴─────┴─────┴───────────┘
Note: with some more thought, the conversion of the numpy arrays to the polars Struct series (in unpack) can likely be done a bit more elegantly.
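For instance, if your Polars version exposes pl.DataFrame.to_struct, the round-trip through to_dicts can be dropped; this is just a sketch under that assumption:
def unpack(a_new, b_new, d_new):
    # Same idea as above, but build the Struct series directly;
    # assumes pl.DataFrame.to_struct is available in your Polars version.
    return pl.DataFrame({
        "A_new": a_new,
        "B_new": b_new,
        "D_new": d_new,
    }).to_struct("result")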