Tags: python, python-polars, polars

Softmax with a polars LazyFrame


I'm relatively new to polars, and it seems very verbose compared to pandas for what I would consider even fairly basic manipulations.

Case in point: the shortest way I could find to do a softmax over a LazyFrame is the following:

import polars as pl

data = pl.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'b': [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    'c': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
}).lazy()
cols = ['a', 'b', 'c']

data = data.with_columns([pl.col(c).exp().alias(c) for c in cols])  # exp() every column
data = data.with_columns(pl.sum_horizontal(cols).alias('sum'))  # row sum of the exps
data = data.with_columns([(pl.col(c) / pl.col('sum')).alias(c) for c in cols]).drop('sum')

data.collect()

Am I missing something, and is there a shorter, more readable way of achieving this?


Solution

  • Use a multi-column selection such as pl.all() instead of the list comprehensions.

    (Or pl.col(cols) for a named "subset" of columns; a sketch of that variant follows the output below.)

    df.with_columns(
        pl.all().exp() / pl.sum_horizontal(pl.all().exp())
    )
    
    shape: (10, 3)
    ┌──────────┬──────────┬──────────┐
    │ a        ┆ b        ┆ c        │
    │ ---      ┆ ---      ┆ ---      │
    │ f64      ┆ f64      ┆ f64      │
    ╞══════════╪══════════╪══════════╡
    │ 0.000123 ┆ 0.006692 ┆ 0.993185 │
    │ 0.000895 ┆ 0.01797  ┆ 0.981135 │
    │ 0.006377 ┆ 0.047123 ┆ 0.946499 │
    │ 0.04201  ┆ 0.114195 ┆ 0.843795 │
    │ 0.211942 ┆ 0.211942 ┆ 0.576117 │
    │ 0.576117 ┆ 0.211942 ┆ 0.211942 │
    │ 0.843795 ┆ 0.114195 ┆ 0.04201  │
    │ 0.946499 ┆ 0.047123 ┆ 0.006377 │
    │ 0.981135 ┆ 0.01797  ┆ 0.000895 │
    │ 0.993185 ┆ 0.006692 ┆ 0.000123 │
    └──────────┴──────────┴──────────┘
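    For completeness, a minimal sketch of the pl.col(cols) variant on a LazyFrame. The tiny frame and the extra label column here are my own, added only to show that unselected columns pass through untouched:

    import polars as pl

    lf = pl.LazyFrame({
        'a': [1, 2, 3],
        'b': [5, 5, 5],
        'c': [3, 2, 1],
        'label': ['x', 'y', 'z'],  # not part of the softmax
    })
    cols = ['a', 'b', 'c']

    # pl.col(cols) expands into one expression per named column, so the
    # division applies column-wise while 'label' is left unchanged.
    out = lf.with_columns(
        pl.col(cols).exp() / pl.sum_horizontal(pl.col(cols).exp())
    ).collect()
    print(out)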
    

    With LazyFrames we can use .explain() to inspect the query plan.

    plan = df.lazy().with_columns(pl.all().exp() / pl.sum_horizontal(pl.all().exp())).explain()
    print(plan)
    
    # simple π 3/7 ["a", "b", "c"]
    #    WITH_COLUMNS:
    #    [[(col("__POLARS_CSER_0x9b1b3182d015f390")) / (col("__POLARS_CSER_0x762bfea120ea9e6"))].alias("a"), [(col("__POLARS_CSER_0xb82f49f764da7a09")) / (col("__POLARS_CSER_0x762bfea120ea9e6"))].alias("b"), [(col("__POLARS_CSER_0x1a200912e2bcc700")) / (col("__POLARS_CSER_0x762bfea120ea9e6"))].alias("c")]
    #      WITH_COLUMNS:
    #      [col("a").exp().alias("__POLARS_CSER_0x9b1b3182d015f390"), col("b").exp().alias("__POLARS_CSER_0xb82f49f764da7a09"), col("c").exp().alias("__POLARS_CSER_0x1a200912e2bcc700"), col("a").exp().sum_horizontal([col("b").exp(), col("c").exp()]).alias("__POLARS_CSER_0x762bfea120ea9e6")]
    #       DF ["a", "b", "c"]; PROJECT */3 COLUMNS
    

    Polars' common subexpression elimination caches the duplicated pl.all().exp() expression in a temporary __POLARS_CSER* column for you.
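    A quick way to check this yourself: .explain() accepts an optimized flag, and with optimization disabled the duplicated exp() is still visible in the plan (a small sketch reusing df from above):

    # With optimization off, the plan still evaluates pl.all().exp() twice;
    # compare with the optimized plan above to see the CSE caching at work.
    print(
        df.lazy()
        .with_columns(pl.all().exp() / pl.sum_horizontal(pl.all().exp()))
        .explain(optimized=False)
    )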
