pythonpython-polarspolars

Duplicate null columns created during pivot in polars


I have this example dataframe in polars:

df_example = pl.DataFrame(
    {
        "DATE": ["2024-11-11", "2024-11-11", "2024-11-12", "2024-11-12", "2024-11-13"],
        "A": [None, None, "option1", "option2", None],
        "B": [None, None, "YES", "YES", "NO"],
    }
)

Which looks like this:

DATE A B
0 2024-11-11
1 2024-11-11
2 2024-11-12 option1 YES
3 2024-11-12 option2 YES
4 2024-11-13 NO

As you can see this is a long format dataframe. I want to have it in a wide format, meaning that I want the DATE to be unique per row and for each other column several columns have to be created. What I want to achieve is:

DATE A_option1 A_option2 B_YES B_NO
2024-11-11 Null Null Null Null
2024-11-12 True True True Null
2024-11-13 Null Null Null True

I have tried doing the following:

df_example.pivot(
    index="DATE", on=["A", "B"], values=["A", "B"], aggregate_function="first"
)

However, I get this error:

DuplicateError: column with name 'null' has more than one occurrence

Which is logical, as it tries to create a column for the Null values in columns A, and a column for the Null values in column B.

I am looking for a clean solution to this problem. I know I can impute the nulls per column with something unique and then do the pivot. Or by pivoting per column and then dropping the Null columns. However, this will create unnecessary columns. I want something more elegant.


Solution

  • I ended up with:

    (
        df_example.pipe(
            lambda df: df.group_by("DATE").agg(
                [
                    pl.col(col).eq(val).any().alias(f"{col}_{val}")
                    for col in df.select(pl.exclude("DATE")).columns
                    for val in df.get_column(col).unique().drop_nulls()
                ]
            )
        ).sort("DATE")
    )