pythonpython-polars

Select column based on the value of another column Polars Python


I have a df with ten columns and another column with its values are partial name of the ten columns. Here is a similar sample:

import polars as pl
df = pl.DataFrame({
    "ID"     :["A"  ,"B"  ,"C"  ] ,
    "A Left" :["W1" ,"W2" ,"W3" ] , 
    "A Right":["P1" ,"P2" ,"P3" ] , 
    "B Left" :["G1" ,"G2" ,"G3" ] , 
    "B Right":["Y1" ,"Y2" ,"Y3" ] , 
    "C Left" :["M1" ,"M2" ,"M3" ] , 
    "C Right":["K1" ,"K2" ,"K3" ] , 
    })
df
shape: (3, 7)
┌─────┬────────┬─────────┬────────┬─────────┬────────┬─────────┐
│ ID  ┆ A Left ┆ A Right ┆ B Left ┆ B Right ┆ C Left ┆ C Right │
│ --- ┆ ---    ┆ ---     ┆ ---    ┆ ---     ┆ ---    ┆ ---     │
│ str ┆ str    ┆ str     ┆ str    ┆ str     ┆ str    ┆ str     │
╞═════╪════════╪═════════╪════════╪═════════╪════════╪═════════╡
│ A   ┆ W1     ┆ P1      ┆ G1     ┆ Y1      ┆ M1     ┆ K1      │
│ B   ┆ W2     ┆ P2      ┆ G2     ┆ Y2      ┆ M2     ┆ K2      │
│ C   ┆ W3     ┆ P3      ┆ G3     ┆ Y3      ┆ M3     ┆ K3      │
└─────┴────────┴─────────┴────────┴─────────┴────────┴─────────┘

I want to add a column with its value selected from the other columns based on ID column like below:

shape: (3, 8)
┌─────┬────────┬─────────┬────────┬─────────┬────────┬─────────┬───────┐
│ ID  ┆ A Left ┆ A Right ┆ B Left ┆ B Right ┆ C Left ┆ C Right ┆ value │
│ --- ┆ ---    ┆ ---     ┆ ---    ┆ ---     ┆ ---    ┆ ---     ┆ ---   │
│ str ┆ str    ┆ str     ┆ str    ┆ str     ┆ str    ┆ str     ┆ str   │
╞═════╪════════╪═════════╪════════╪═════════╪════════╪═════════╪═══════╡
│ A   ┆ W1     ┆ P1      ┆ G1     ┆ Y1      ┆ M1     ┆ K1      ┆ W1-P1 │
│ B   ┆ W2     ┆ P2      ┆ G2     ┆ Y2      ┆ M2     ┆ K2      ┆ G2-Y2 │
│ C   ┆ W3     ┆ P3      ┆ G3     ┆ Y3      ┆ M3     ┆ K3      ┆ M3-K3 │
└─────┴────────┴─────────┴────────┴─────────┴────────┴─────────┴───────┘

I got this result using unpivot:

df.join( df.unpivot(index='ID').with_columns(
            pl.when(pl.col("ID") == pl.col("variable").str.slice(0,1)).then(pl.col("value"))
        ).select("ID" , "value").drop_nulls().group_by("ID").agg(pl.col('value').str.join("-")) 
        ,on='ID').sort("ID")

However, I need to avoid unpivot because I have two groups of ten columns beside other 50 columns.

I have tried using pl.col() and polars.selectors, but I couldn't get the result.

import polars.selectors as cs
df.with_columns(
    cs.by_name(
        ( pl.concat_str([pl.col('ID') , " Left"] ) )
        ).alias("value")
)
TypeError: invalid name: <Expr ['col("ID").str.concat_horizonta…']

Any suggested solution?


Solution

  • It looks like you want to extract the "base" / "prefix" of the Left/Right columns?

    There are various ways you could do that:

    columns = pl.Series(df.select("^.+ (Left|Right)$").columns)
    columns = columns.str.extract(r"(.+) (Left|Right)$")
    
    shape: (3,)
    Series: '' [str]
    [
        "A"
        "B"
        "C"
    ]
    

    You could then use pl.coalesce() to create a single column of the chosen when/then values:

    df.with_columns(
       pl.coalesce(
          pl.when(pl.col("ID") == col).then(
             pl.format("{}-{}", pl.col(f"{col} Left"), pl.col(f"{col} Right"))
          ) for col in columns
       )
       .alias("value")
    )
    
    shape: (3, 8)
    ┌─────┬────────┬─────────┬────────┬─────────┬────────┬─────────┬───────┐
    │ ID  ┆ A Left ┆ A Right ┆ B Left ┆ B Right ┆ C Left ┆ C Right ┆ value │
    │ --- ┆ ---    ┆ ---     ┆ ---    ┆ ---     ┆ ---    ┆ ---     ┆ ---   │
    │ str ┆ str    ┆ str     ┆ str    ┆ str     ┆ str    ┆ str     ┆ str   │
    ╞═════╪════════╪═════════╪════════╪═════════╪════════╪═════════╪═══════╡
    │ A   ┆ W1     ┆ P1      ┆ G1     ┆ Y1      ┆ M1     ┆ K1      ┆ W1-P1 │
    │ B   ┆ W2     ┆ P2      ┆ G2     ┆ Y2      ┆ M2     ┆ K2      ┆ G2-Y2 │
    │ C   ┆ W3     ┆ P3      ┆ G3     ┆ Y3      ┆ M3     ┆ K3      ┆ M3-K3 │
    └─────┴────────┴─────────┴────────┴─────────┴────────┴─────────┴───────┘