Polars Convert Back From Dummies

In pandas I can use the from_dummies method to reverse one-hot encoding. There doesn't seem to be a built in method for this in polars. Here is a basic example:

pl.DataFrame({
  "col1_hi": [0,0,0,1,1],
  "col1_med": [0,0,1,0,0],
  "col1_lo": [1,1,0,0,0],
  "col2_yes": [1,1,0,1,0],
  "col2_no": [0,0,1,0,1],
})

┌─────────┬──────────┬─────────┬──────────┬─────────┐
│ col1_hi ┆ col1_med ┆ col1_lo ┆ col2_yes ┆ col2_no │
│ ---     ┆ ---      ┆ ---     ┆ ---      ┆ ---     │
│ i64     ┆ i64      ┆ i64     ┆ i64      ┆ i64     │
╞═════════╪══════════╪═════════╪══════════╪═════════╡
│ 0       ┆ 0        ┆ 1       ┆ 1        ┆ 0       │
│ 0       ┆ 0        ┆ 1       ┆ 1        ┆ 0       │
│ 0       ┆ 1        ┆ 0       ┆ 0        ┆ 1       │
│ 1       ┆ 0        ┆ 0       ┆ 1        ┆ 0       │
│ 1       ┆ 0        ┆ 0       ┆ 0        ┆ 1       │
└─────────┴──────────┴─────────┴──────────┴─────────┘

Reversing the to_dummies operation should result in something like this:

┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ lo   ┆ yes  │
│ lo   ┆ yes  │
│ med  ┆ no   │
│ hi   ┆ yes  │
│ hi   ┆ no   │
└──────┴──────┘

My first thought was to use a pivot. How could I go about implementing this functionality?

Solution

You could utilize pl.coalesce

(df
 .with_columns(
    pl.when(pl.col(col) == 1)
      .then(pl.lit(col).str.extract(r"([^_]+$)"))
      .alias(col) 
    for col in df.columns)
 .select(
    pl.coalesce(pl.col(f"^{prefix}_.+$")).alias(prefix) 
    for prefix in dict.fromkeys(
       col.rsplit("_", maxsplit=1)[0]
       for col in df.columns
    )
))

shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ lo   ┆ yes  │
│ lo   ┆ yes  │
│ med  ┆ no   │
│ hi   ┆ yes  │
│ hi   ┆ no   │
└──────┴──────┘

Update: @Rodalm's approach is much neater:

def from_dummies(df, separator="_"):
    col_exprs = {}
    
    for col in df.columns:
        name, value = col.rsplit(separator, maxsplit=1)
        expr = pl.when(pl.col(col) == 1).then(value) 
        col_exprs.setdefault(name, []).append(expr)

    return df.select(
        pl.coalesce(exprs) # keep the first non-null expression value by row
          .alias(name)
        for name, exprs in col_exprs.items()
    )