How to apply value_counts() to multiple columns in polars python?


I am trying to apply value_counts() to multiple columns, but I get an error.

import polars as pl

df = pl.from_repr("""
┌──────────────┬─────────────┐
│ sub-category ┆ category    │
│ ---          ┆ ---         │
│ str          ┆ str         │
╞══════════════╪═════════════╡
│ tv           ┆ electronics │
│ mobile       ┆ mobile      │
│ tv           ┆ electronics │
│ wm           ┆ electronics │
│ micro        ┆ kitchen     │
│ wm           ┆ electronics │
└──────────────┴─────────────┘
""")

If I convert it to pandas, I can use apply:

pl.from_pandas(
    df.to_pandas().apply(lambda x: x.value_counts()).reset_index()
)

shape: (6, 3)
┌─────────────┬──────────────┬──────────┐
│ index       ┆ sub-category ┆ category │
│ ---         ┆ ---          ┆ ---      │
│ str         ┆ f64          ┆ f64      │
╞═════════════╪══════════════╪══════════╡
│ electronics ┆ null         ┆ 4.0      │
│ kitchen     ┆ null         ┆ 1.0      │
│ micro       ┆ 1.0          ┆ null     │
│ mobile      ┆ 1.0          ┆ 1.0      │
│ tv          ┆ 2.0          ┆ null     │
│ wm          ┆ 2.0          ┆ null     │
└─────────────┴──────────────┴──────────┘

How do I get the same result in Polars?


Solution

  • .value_counts() is implemented as .group_by().len()

    Generally, it's easier to just group_by manually.

    If you first reshape with df.unpivot(), every value lands in a single column, labelled with the name of the column it came from:

    shape: (12, 2)
    ┌──────────────┬─────────────┐
    │ variable     ┆ value       │
    │ ---          ┆ ---         │
    │ str          ┆ str         │
    ╞══════════════╪═════════════╡
    │ sub-category ┆ tv          │
    │ sub-category ┆ mobile      │
    │ sub-category ┆ tv          │
    │ sub-category ┆ wm          │
    │ sub-category ┆ micro       │
    │ …            ┆ …           │
    │ category     ┆ mobile      │
    │ category     ┆ electronics │
    │ category     ┆ electronics │
    │ category     ┆ kitchen     │
    │ category     ┆ electronics │
    └──────────────┴─────────────┘
    

    Then the length of each group is the count:

    df.unpivot().group_by(pl.all()).len()
    
    shape: (7, 3)
    ┌──────────────┬─────────────┬─────┐
    │ variable     ┆ value       ┆ len │
    │ ---          ┆ ---         ┆ --- │
    │ str          ┆ str         ┆ u32 │
    ╞══════════════╪═════════════╪═════╡
    │ category     ┆ kitchen     ┆ 1   │
    │ sub-category ┆ tv          ┆ 2   │
    │ sub-category ┆ mobile      ┆ 1   │
    │ category     ┆ mobile      ┆ 1   │
    │ sub-category ┆ wm          ┆ 2   │
    │ sub-category ┆ micro       ┆ 1   │
    │ category     ┆ electronics ┆ 4   │
    └──────────────┴─────────────┴─────┘
    

    .pivot() can be used if the "wide" shape is needed.

    (df.unpivot()
       .pivot(
          on="variable",
          index="value",
          values="value",
          aggregate_function=pl.len()
       )
    )
    
    shape: (6, 3)
    ┌─────────────┬──────────────┬──────────┐
    │ value       ┆ sub-category ┆ category │
    │ ---         ┆ ---          ┆ ---      │
    │ str         ┆ u32          ┆ u32      │
    ╞═════════════╪══════════════╪══════════╡
    │ tv          ┆ 2            ┆ null     │
    │ mobile      ┆ 1            ┆ 1        │
    │ wm          ┆ 2            ┆ null     │
    │ micro       ┆ 1            ┆ null     │
    │ electronics ┆ null         ┆ 4        │
    │ kitchen     ┆ null         ┆ 1        │
    └─────────────┴──────────────┴──────────┘