I need some help with polars:
I have a dataframe with a categorical column:
┌───────────────────┬──────────────┬────────┐
│ session_id ┆ elapsed_time ┆ fqid │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i32 ┆ cat │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0 ┆ intro │
│ 20090312431273200 ┆ 1323 ┆ gramps │
│ 20090312431273200 ┆ 831 ┆ gramps │
│ 20090312431273200 ┆ 1147 ┆ gramps │
│ … ┆ … ┆ … │
│ 20090312431273200 ┆ 5197 ┆ teddy │
│ 20090312431273200 ┆ 6180 ┆ teddy │
│ 20090312431273200 ┆ 7014 ┆ teddy │
│ 20090312431273200 ┆ 7946 ┆ teddy │
└───────────────────┴──────────────┴────────┘
And I want to transform the fqid-column to look like this:
┌───────────────────┬─────────────┬────────────┬────────────┐
│ session_id ┆ fqid_gramps ┆ fqid_intro ┆ fqid_teddy │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i32 ┆ i32 ┆ i32 │
╞═══════════════════╪═════════════╪════════════╪════════════╡
│ 20090312431273200 ┆ 3           ┆ 1          ┆ 4          │
└───────────────────┴─────────────┴────────────┴────────────┘
That is, I would like to group by session_id, count the occurrences of each fqid, and get one column per category.
Technically, I could achieve this without groupby by using something like
column_values = (train['fqid'].value_counts()
                 .with_columns(pl.concat_str(pl.lit('fqid_').alias('fqid'),
                                             pl.col('fqid').cast(pl.String)))
                 .transpose())
column_values = column_values.rename(column_values.head(1).to_dicts().pop()).slice(1)
But when I try to turn this into an aggregating function, replacing train['fqid'] with pl.col('fqid') and calling group_by('session_id').agg(func('fqid')), it gives me nothing but errors like AttributeError: 'Expr' object has no attribute 'with_columns'.
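As far as I can tell, the problem is that with_columns is a DataFrame/LazyFrame method and doesn't exist on Expr, so a minimal repro of the error (my best guess at the failure mode) is just:

import polars as pl

expr = pl.col('fqid').value_counts()
# An Expr has no DataFrame methods, so this line raises:
# AttributeError: 'Expr' object has no attribute 'with_columns'
expr.with_columns(pl.col('fqid').cast(pl.String))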
Could you kindly suggest a proper way of performing this operation?
Starting from
train=pl.from_repr(
"""┌───────────────────┬──────────────┬────────┐
│ session_id ┆ elapsed_time ┆ fqid │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i32 ┆ cat │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0 ┆ intro │
│ 20090312431273200 ┆ 1323 ┆ gramps │
│ 20090312431273200 ┆ 831 ┆ gramps │
│ 20090312431273200 ┆ 1147 ┆ gramps │
│ 20090312431273200 ┆ 5197 ┆ teddy │
│ 20090312431273200 ┆ 6180 ┆ teddy │
│ 20090312431273200 ┆ 7014 ┆ teddy │
│ 20090312431273200 ┆ 7946 ┆ teddy │
└───────────────────┴──────────────┴────────┘""")
we can do
(
    train
    .group_by(
        (piv_idx := 'session_id'),
        (len_id := 'fqid'),
        maintain_order=True,
    )
    .len()
    .pivot(
        on=len_id,
        index=piv_idx,
        values='len',
        aggregate_function='first',
    )
    .select(
        piv_idx,
        pl.exclude(piv_idx).name.prefix(f"{len_id}_"),
    )
)
shape: (1, 4)
┌───────────────────┬────────────┬─────────────┬────────────┐
│ session_id ┆ fqid_intro ┆ fqid_gramps ┆ fqid_teddy │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ u32 ┆ u32 │
╞═══════════════════╪════════════╪═════════════╪════════════╡
│ 20090312431273200 ┆ 1 ┆ 3 ┆ 4 │
└───────────────────┴────────────┴─────────────┴────────────┘
Since you want the count (now len) of the fqids, you need to include fqid in the group_by. Next, we do a pivot to make the results wide. The output of pivot doesn't keep the original column name, so we have to add it back manually. We do that in a select: first take session_id, then add every other column with the prefix 'fqid_' to get the final desired result.
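To see what the pivot is working with, here is the intermediate frame right after the group_by and len on the sample data (counts derived from the eight rows above):

(
    train
    .group_by('session_id', 'fqid', maintain_order=True)
    .len()
)
shape: (3, 3)
┌───────────────────┬────────┬─────┐
│ session_id        ┆ fqid   ┆ len │
│ ---               ┆ ---    ┆ --- │
│ i64               ┆ cat    ┆ u32 │
╞═══════════════════╪════════╪═════╡
│ 20090312431273200 ┆ intro  ┆ 1   │
│ 20090312431273200 ┆ gramps ┆ 3   │
│ 20090312431273200 ┆ teddy  ┆ 4   │
└───────────────────┴────────┴─────┘
The pivot then spreads the fqid values across columns, and the select re-attaches the fqid_ prefix.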
Incidentally, I'm not using value_counts because it returns a list of structs, so we can't do, for example, train.select(pl.col('fqid').value_counts().over('session_id')).
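If you really wanted to build on value_counts, you would have to unpack the structs yourself before pivoting; a sketch of that route (it reproduces the same long session_id/fqid/count frame, just with more steps):

(
    train
    .group_by('session_id', maintain_order=True)
    .agg(pl.col('fqid').value_counts())  # list of {'fqid', 'count'} structs per session
    .explode('fqid')                     # one struct per row
    .unnest('fqid')                      # struct fields become columns again
)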
I used the walrus operator to assign the column names to variables inside the group_by so that you only need to change the columns in one place, instead of repeating them in the pivot and the select.
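If the inline assignments feel too clever, the same pipeline works with plain variables:

piv_idx = 'session_id'
len_id = 'fqid'
(
    train
    .group_by(piv_idx, len_id, maintain_order=True)
    .len()
    .pivot(on=len_id, index=piv_idx, values='len', aggregate_function='first')
    .select(piv_idx, pl.exclude(piv_idx).name.prefix(f"{len_id}_"))
)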