I'm relatively new to using polars and it seems to be very verbose compared to pandas for what I would consider even relatively basic manipulations.
Case in point, the shortest way I could find to compute a softmax over a lazy dataframe is the following:
import polars as pl
data = pl.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10], 'b':[5,5,5,5,5,5,5,5,5,5], 'c': [10,9,8,7,6,5,4,3,2,1]}).lazy()
cols = ['a','b','c']
data = data.with_columns([ pl.col(c).exp().alias(c) for c in cols]) # Exp all columns
data = data.with_columns(pl.sum_horizontal(cols).alias('sum')) # Get row sum of exps
data = data.with_columns([ (pl.col(c)/pl.col('sum')).alias(c) for c in cols ]).drop('sum')
data.collect()
Am I missing something? Is there a shorter, more readable way of achieving this?
You would use a multi-column selection, e.g. pl.all(), instead of list comprehensions (or pl.col(cols) for a named "subset" of columns).
df.with_columns(
    pl.all().exp() / pl.sum_horizontal(pl.all().exp())
)
shape: (10, 3)
┌──────────┬──────────┬──────────┐
│ a        ┆ b        ┆ c        │
│ ---      ┆ ---      ┆ ---      │
│ f64      ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ 0.000123 ┆ 0.006692 ┆ 0.993185 │
│ 0.000895 ┆ 0.01797  ┆ 0.981135 │
│ 0.006377 ┆ 0.047123 ┆ 0.946499 │
│ 0.04201  ┆ 0.114195 ┆ 0.843795 │
│ 0.211942 ┆ 0.211942 ┆ 0.576117 │
│ 0.576117 ┆ 0.211942 ┆ 0.211942 │
│ 0.843795 ┆ 0.114195 ┆ 0.04201  │
│ 0.946499 ┆ 0.047123 ┆ 0.006377 │
│ 0.981135 ┆ 0.01797  ┆ 0.000895 │
│ 0.993185 ┆ 0.006692 ┆ 0.000123 │
└──────────┴──────────┴──────────┘
With LazyFrames we can use .explain() to inspect the query plan.
plan = df.lazy().with_columns(pl.all().exp() / pl.sum_horizontal(pl.all().exp())).explain()
print(plan)
# simple π 3/7 ["a", "b", "c"]
# WITH_COLUMNS:
# [[(col("__POLARS_CSER_0x9b1b3182d015f390")) / (col("__POLARS_CSER_0x762bfea120ea9e6"))].alias("a"), [(col("__POLARS_CSER_0xb82f49f764da7a09")) / (col("__POLARS_CSER_0x762bfea120ea9e6"))].alias("b"), [(col("__POLARS_CSER_0x1a200912e2bcc700")) / (col("__POLARS_CSER_0x762bfea120ea9e6"))].alias("c")]
# WITH_COLUMNS:
# [col("a").exp().alias("__POLARS_CSER_0x9b1b3182d015f390"), col("b").exp().alias("__POLARS_CSER_0xb82f49f764da7a09"), col("c").exp().alias("__POLARS_CSER_0x1a200912e2bcc700"), col("a").exp().sum_horizontal([col("b").exp(), col("c").exp()]).alias("__POLARS_CSER_0x762bfea120ea9e6")]
# DF ["a", "b", "c"]; PROJECT */3 COLUMNS
Polars caches the duplicate pl.all().exp() expression in a temporary __POLARS_CSER* column for you (common subexpression elimination), so each exp() is only computed once.