[SOLVED] Duplicate null columns created during pivot in polars

Duplicate null columns created during pivot in polars

I have this example dataframe in polars:

df_example = pl.DataFrame(
    {
        "DATE": ["2024-11-11", "2024-11-11", "2024-11-12", "2024-11-12", "2024-11-13"],
        "A": [None, None, "option1", "option2", None],
        "B": [None, None, "YES", "YES", "NO"],
    }
)

Which looks like this:

	DATE	A	B
0	2024-11-11
1	2024-11-11
2	2024-11-12	option1	YES
3	2024-11-12	option2	YES
4	2024-11-13		NO

As you can see this is a long format dataframe. I want to have it in a wide format, meaning that I want the DATE to be unique per row and for each other column several columns have to be created. What I want to achieve is:

DATE	A_option1	A_option2	B_YES	B_NO
2024-11-11	Null	Null	Null	Null
2024-11-12	True	True	True	Null
2024-11-13	Null	Null	Null	True

I have tried doing the following:

df_example.pivot(
    index="DATE", on=["A", "B"], values=["A", "B"], aggregate_function="first"
)

However, I get this error:

DuplicateError: column with name 'null' has more than one occurrence

Which is logical, as it tries to create a column for the Null values in columns A, and a column for the Null values in column B.

I am looking for a clean solution to this problem. I know I can impute the nulls per column with something unique and then do the pivot. Or by pivoting per column and then dropping the Null columns. However, this will create unnecessary columns. I want something more elegant.

Solution

I ended up with:

(
    df_example.pipe(
        lambda df: df.group_by("DATE").agg(
            [
                pl.col(col).eq(val).any().alias(f"{col}_{val}")
                for col in df.select(pl.exclude("DATE")).columns
                for val in df.get_column(col).unique().drop_nulls()
            ]
        )
    ).sort("DATE")
)