dictionaryreplaceapplypython-polars

Apply Python dict to Polars column (using replace_strict)


I have a dict and a polars DataFrame and want to map one column to the values of the dict:

import polars as pl

df = pl.DataFrame({
    'label': ['AA', 'BB', 'AA', 'CC'],
    'type':  ['CASH', 'ITEM', 'CHECK', 'CHECK'],
})

mapping = {
    'CASH':  {'qty':  1, 'origin': 'E'},
    'ITEM':  {'qty': -9, 'origin': 'A'},
    'CHECK': {'qty': 46, 'origin': 'A'},
}
df.with_columns(pl.col('type').replace_strict(mapping).alias('mapped'))

This outputs

shape: (4, 3)
┌───────┬───────┬───────────┐
│ label ┆ type  ┆ mapped    │
│ ---   ┆ ---   ┆ ---       │
│ str   ┆ str   ┆ struct[2] │
╞═══════╪═══════╪═══════════╡
│ AA    ┆ CASH  ┆ {1,"E"}   │
│ BB    ┆ ITEM  ┆ {-9,"A"}  │
│ AA    ┆ CHECK ┆ {46,"A"}  │
│ CC    ┆ CHECK ┆ {46,"A"}  │
└───────┴───────┴───────────┘

The problem is, it only takes the values of the dict and entirely drops the keys.

So I tried using replace_strict(mapping, return_dtype=pl.Object), but this gives error

File site-packages\polars\lazyframe\frame.py:2026,
     in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown,
                          simplify_expression, slice_pushdown, comm_subplan_elim, 
                          comm_subexpr_elim, cluster_with_columns, no_optimization,
                          streaming, engine, background, _eager, **_kwargs)
   2024 # Only for testing purposes
   2025 callback = _kwargs.get("post_opt_callback", callback)
-> 2026 return wrap_df(ldf.collect(callback))

InvalidOperationError: casting from Int64 to Unknown not supported

Ultimately, the output I am after is the table below. How do I achieve this?

shape: (4, 3)
┌───────┬───────┬─────────────────────────┐
│ label ┆ type  ┆ mapped                  │
│ ---   ┆ ---   ┆ ---                     │
│ str   ┆ str   ┆ object                  │
╞═══════╪═══════╪═════════════════════════╡
│ AA    ┆ CASH  ┆ {"qty":1,"origin":"E"}  │
│ BB    ┆ ITEM  ┆ {"qty":-9,"origin":"A"} │
│ AA    ┆ CHECK ┆ {"qty":46,"origin":"A"} │
│ CC    ┆ CHECK ┆ {"qty":46,"origin":"A"} │
└───────┴───────┴─────────────────────────┘

I am using polars==1.3.0


Solution

  • Just to be clear, polars doesn't drop the keys, it just implicitly converts python dictionaries to Struct

    You can easily check it by using, for example, unnest():

    res = df.with_columns(pl.col('type').replace_strict(mapping).alias('mapped'))
    
    ┌───────┬───────┬───────────┐
    │ label ┆ type  ┆ mapped    │
    │ ---   ┆ ---   ┆ ---       │
    │ str   ┆ str   ┆ struct[2] │
    ╞═══════╪═══════╪═══════════╡
    │ AA    ┆ CASH  ┆ {1,"E"}   │
    │ BB    ┆ ITEM  ┆ {-9,"A"}  │
    │ AA    ┆ CHECK ┆ {46,"A"}  │
    │ CC    ┆ CHECK ┆ {46,"A"}  │
    └───────┴───────┴───────────┘
    
    res.unnest('mapped')
    
    ┌───────┬───────┬─────┬────────┐
    │ label ┆ type  ┆ qty ┆ origin │
    │ ---   ┆ ---   ┆ --- ┆ ---    │
    │ str   ┆ str   ┆ i64 ┆ str    │
    ╞═══════╪═══════╪═════╪════════╡
    │ AA    ┆ CASH  ┆ 1   ┆ E      │
    │ BB    ┆ ITEM  ┆ -9  ┆ A      │
    │ AA    ┆ CHECK ┆ 46  ┆ A      │
    │ CC    ┆ CHECK ┆ 46  ┆ A      │
    └───────┴───────┴─────┴────────┘
    

    All the key names are there, you just don't see it in the text representation. So question is - do you actually need data in the column to be python dictionaries? It'd be much easier to work with native polars struct.