pythongroup-bypython-polars

Polars group_by + value_counts


I need some help with polars:

I have a dataframe with a categorical values column

┌───────────────────┬──────────────┬────────┐
│ session_id        ┆ elapsed_time ┆ fqid   │
│ ---               ┆ ---          ┆ ---    │
│ i64               ┆ i32          ┆ cat    │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0            ┆ intro  │
│ 20090312431273200 ┆ 1323         ┆ gramps │
│ 20090312431273200 ┆ 831          ┆ gramps │
│ 20090312431273200 ┆ 1147         ┆ gramps │
│ …                 ┆ …            ┆ …      │
│ 20090312431273200 ┆ 5197         ┆ teddy  │
│ 20090312431273200 ┆ 6180         ┆ teddy  │
│ 20090312431273200 ┆ 7014         ┆ teddy  │
│ 20090312431273200 ┆ 7946         ┆ teddy  │
└───────────────────┴──────────────┴────────┘

And I want to transform the fqid-column to look like this:

┌───────────────────┬─────────────┬────────────┬────────────┐
│ session_id        ┆ fqid_gramps ┆ fqid_intro ┆ fqid_teddy │
│ ---               ┆ ---         ┆ ---        ┆ ---        │
│ i64               ┆ i32         ┆ i32        ┆ i32        │
╞═══════════════════╪═════════════╪════════════╪════════════╡
│ 20090312431273200 ┆ 1           ┆ 1          ┆ 4          │
└───────────────────┴─────────────┴────────────┴────────────┘

That is, I would like to:

Technically, I could achieve this without groupby by using something like

column_values = train['fqid'].value_counts().with_columns(pl.concat_str(pl.lit('fqid' + '_').alias('fqid'), pl.col('fqid').cast(pl.String))).transpose()
column_values = column_values.rename(column_values.head(1).to_dicts().pop()).slice(1)

But when I am trying to make an aggregating function from this replacing train['fqid'] with pl.col('fqid') and making a group_by('session_id').aggregate(func('fqid')) it gives me nothing but errors like AttributeError: 'Expr' object has no attribute 'with_columns'.

Could you kindly suggest a proper way of making this operation?


Solution

  • Starting from

    train=pl.from_repr(
         """┌───────────────────┬──────────────┬────────┐
            │ session_id        ┆ elapsed_time ┆ fqid   │
            │ ---               ┆ ---          ┆ ---    │
            │ i64               ┆ i32          ┆ cat    │
            ╞═══════════════════╪══════════════╪════════╡
            │ 20090312431273200 ┆ 0            ┆ intro  │
            │ 20090312431273200 ┆ 1323         ┆ gramps │
            │ 20090312431273200 ┆ 831          ┆ gramps │
            │ 20090312431273200 ┆ 1147         ┆ gramps │
            │ 20090312431273200 ┆ 5197         ┆ teddy  │
            │ 20090312431273200 ┆ 6180         ┆ teddy  │
            │ 20090312431273200 ┆ 7014         ┆ teddy  │
            │ 20090312431273200 ┆ 7946         ┆ teddy  │
            └───────────────────┴──────────────┴────────┘""")
    

    we can do

    (
        train
            .group_by(
                (piv_idx:='session_id'),
                (len_id:='fqid'),
                maintain_order=True) 
            .len()
            .pivot(on=len_id,
                   index=piv_idx,
                   values='len',
                   aggregate_function='first')
            .select(
                piv_idx,
                pl.exclude(piv_idx).name.prefix(f"{len_id}_")
                )
    )
    
    shape: (1, 4)
    ┌───────────────────┬────────────┬─────────────┬────────────┐
    │ session_id        ┆ fqid_intro ┆ fqid_gramps ┆ fqid_teddy │
    │ ---               ┆ ---        ┆ ---         ┆ ---        │
    │ i64               ┆ u32        ┆ u32         ┆ u32        │
    ╞═══════════════════╪════════════╪═════════════╪════════════╡
    │ 20090312431273200 ┆ 1          ┆ 3           ┆ 4          │
    └───────────────────┴────────────┴─────────────┴────────────┘
    

    Since you want the count (now len) of the fqids, you need to include that in the group_by. Next, we do a pivot to make the results wide. The output of pivot doesn't keep the original column name so we have to add that back manually. We do that in a select by first taking the session_id and then adding to that every column except session_id with the prefix 'fqid_' to get the final desired result.

    Incidentally, I'm not using value_counts because it returns a list of structs so we can't do, for example, train.select(pl.col('fqid').value_counts().over('session_id'))

    I used the walrus operator to assign the column names to variables in the group_by so that you only need to change the columns in one place without repeating yourself. in the pivot and select.