Passing a polars struct to a user-defined function using map_batches


I need to pass a variable number of columns to a user-defined function. The docs suggest first creating a pl.struct and then having the function extract its fields. Here's the example given on the website:

import polars as pl
from numba import guvectorize, int64, float64


# Add two arrays together:
@guvectorize([(int64[:], int64[:], float64[:])], "(n),(n)->(n)")
def add(arr, arr2, result):
    for i in range(len(arr)):
        result[i] = arr[i] + arr2[i]


df3 = pl.DataFrame({"values1": [1, 2, 3], "values2": [10, 20, 30]})

out = df3.select(
    # Create a struct that has two columns in it:
    pl.struct(["values1", "values2"])
    # Pass the struct to a lambda that then passes the individual columns to
    # the add() function:
    .map_batches(
        lambda combined: add(
            combined.struct.field("values1"), combined.struct.field("values2")
        )
    )
    .alias("add_columns")
)
print(out)

Now, in my case, I don't know upfront how many columns will enter the pl.struct. Think of using a selector like pl.struct(cs.float()). In my user-defined function, I need to operate on a np.array. That is, the user-defined function will have one input argument that takes the whole array. How can I then extract it within the user-defined function?

EDIT: The output of my user-defined function will be an array that has the exact same shape as the input array. This array needs to be appended to the existing dataframe on axis 1 (new columns).

EDIT: Using pl.concat_arr might be one way to attack my concrete issue. My use case would be along the following lines:

import polars as pl


def multiply_by_two(arr):
    # In reality, there are some complex array operations.
    return arr * 2


df = pl.DataFrame({"values1": [1, 2, 3], "values2": [10, 20, 30]})

out = df.select(
    # Create an array consisting of two columns:
    pl.concat_arr(["values1", "values2"])
    .map_batches(lambda arr: multiply_by_two(arr))
    .alias("result")
)

The newly computed column result holds an array that has the same shape as the input array. I need to unnest that array (something like pl.struct.unnest()). The new column names should be the original names suffixed with "_result" (values1_result and values2_result).

Also, I would like to make use of @guvectorize to speed things up.


Solution

  • A few things first. Calling .to_numpy on either an Array or a struct column seems to return the same np.array, so the choice between the two comes down to memory efficiency and features. The elements of an Array aren't named, and you want the output names to correspond to the input columns, so you probably want a struct. I'm not sure what the memory implications of each are. I know that going from columns to a struct is cheaper than going from columns to an Array, but intuitively columns->struct->np.array ought to cost about the same as columns->array->np.array.

    Anyway, with that said, here's how to do it:

    def multiply_by_two(arr: pl.Series) -> pl.Series:
        # Capture the field names of the input struct
        names = arr.struct.fields
        # Convert the struct Series to a 2D NumPy array (one column per field)
        arrnp = arr.to_numpy()
        res = arrnp * 2
        # Rebuild a struct whose fields carry the "_result" suffix
        return pl.Series(res).arr.to_struct(fields=[f"{name}_result" for name in names])

    df.with_columns(
        # Create a struct consisting of two columns:
        pl.struct(["values1", "values2"])
        .map_batches(multiply_by_two)
        .alias("result")
    ).unnest("result")
    
    shape: (3, 4)
    ┌─────────┬─────────┬────────────────┬────────────────┐
    │ values1 ┆ values2 ┆ values1_result ┆ values2_result │
    │ ---     ┆ ---     ┆ ---            ┆ ---            │
    │ i64     ┆ i64     ┆ i64            ┆ i64            │
    ╞═════════╪═════════╪════════════════╪════════════════╡
    │ 1       ┆ 10      ┆ 2              ┆ 20             │
    │ 2       ┆ 20      ┆ 4              ┆ 40             │
    │ 3       ┆ 30      ┆ 6              ┆ 60             │
    └─────────┴─────────┴────────────────┴────────────────┘
    

    You can't unnest from within .with_columns; you have to do it at the DataFrame level.

    As for combining the above with numba, it should work much the same way. Search for polars and numba to find other questions and answers where the two are used together. If you have a more specific question about their interaction, ask away.