python, python-polars, polars

Create a uniform dataset in Polars with cross joins


I am working with Polars and need to ensure that my dataset contains all possible combinations of unique values in certain index columns. If a combination is missing in the original data, it should be filled with null.

Currently, I use the following approach with sequential cross joins:

from collections.abc import Sequence

import polars as pl


def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    # Quick exit
    if len(index_cols) == 1:
        return df

    # Get unique values of the first index column
    uniform_df = df.select(index_cols[0]).unique(maintain_order=True)

    # Cross join with other unique index columns
    for i in range(1, len(index_cols)):
        unique_index_values = df.select(index_cols[i]).unique(maintain_order=True)
        uniform_df = uniform_df.join(unique_index_values, how="cross")

    # Left join with the original DataFrame to preserve existing values
    return uniform_df.join(df, on=index_cols, how="left")

Here is an example:

df = pl.from_repr('''
┌─────┬─────┬─────┬───────┐
│ g1  ┆ g2  ┆ g3  ┆ value │
│ --- ┆ --- ┆ --- ┆ ---   │
│ str ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ A   ┆ 1   ┆ 1   ┆ 10    │
│ A   ┆ 1   ┆ 2   ┆ 20    │
│ B   ┆ 2   ┆ 1   ┆ 30    │
│ B   ┆ 2   ┆ 2   ┆ 40    │
└─────┴─────┴─────┴───────┘
''')

uniform_df = ensure_uniform(df, index_cols=["g1", "g2", "g3"])
print(uniform_df)
# ┌─────┬─────┬─────┬───────┐
# │ g1  ┆ g2  ┆ g3  ┆ value │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ str ┆ i64 ┆ i64 ┆ i64   │
# ╞═════╪═════╪═════╪═══════╡
# │ A   ┆ 1   ┆ 1   ┆ 10    │
# │ A   ┆ 1   ┆ 2   ┆ 20    │
# │ A   ┆ 2   ┆ 1   ┆ null  │
# │ A   ┆ 2   ┆ 2   ┆ null  │
# │ B   ┆ 1   ┆ 1   ┆ null  │
# │ B   ┆ 1   ┆ 2   ┆ null  │
# │ B   ┆ 2   ┆ 1   ┆ 30    │
# │ B   ┆ 2   ┆ 2   ┆ 40    │
# └─────┴─────┴─────┴───────┘

Any suggestions for making this more elegant and efficient?

Edit: @Dean MacGregor & @orlp Thank you for your answers. All approaches show comparable performance (±10%), with @Dean MacGregor's proposal being consistently slightly faster than the others. After testing on multiple setups, I found that the main bottleneck seems to be the final join that combines the uniform multi-index with the original dataset, rather than building the Cartesian product beforehand. This suggests that speed and peak memory consumption remain similar regardless of how the Cartesian product is computed, especially as dataset sizes grow.

Both proposals allow lazy evaluation, which can be useful depending on the use case. Since there is no dedicated method for computing the Cartesian product, all of these options seem valid.
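
For reference, here is a minimal sketch of how that timing split could be checked. The synthetic data shape, sizes, and column names below are illustrative assumptions, not the setups I actually benchmarked:

import time

import numpy as np
import polars as pl

# Illustrative synthetic data: three index columns with many missing combinations
rng = np.random.default_rng(0)
n = 1_000_000
df = pl.DataFrame({
    "g1": rng.integers(0, 100, n),
    "g2": rng.integers(0, 100, n),
    "g3": rng.integers(0, 100, n),
    "value": rng.random(n),
}).unique(subset=["g1", "g2", "g3"])

index_cols = ["g1", "g2", "g3"]

# Time the Cartesian product on its own ...
start = time.perf_counter()
product = df.select(index_cols[0]).unique()
for col in index_cols[1:]:
    product = product.join(df.select(col).unique(), how="cross")
print(f"cross joins:     {time.perf_counter() - start:.3f}s")

# ... versus the final left join back onto the original data
start = time.perf_counter()
uniform_df = product.join(df, on=index_cols, how="left")
print(f"final left join: {time.perf_counter() - start:.3f}s")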


Solution

  • I think your approach is sensible; it can, however, be done lazily:

    from collections.abc import Sequence
    from functools import reduce

    import polars as pl
    
    def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
        if len(index_cols) == 1:
            return df
    
        lf = df.lazy()
        # Unique values per index column, folded together with cross joins
        uniques = [lf.select(col).unique(maintain_order=True) for col in index_cols]
        product = reduce(lambda a, b: a.join(b, how="cross"), uniques)
        # Left join back onto the original data so existing values are preserved
        out = product.join(lf, on=index_cols, how="left", maintain_order="left")
        return out.collect()
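
    For example, with the sample frame from the question this returns the same 8-row result as the eager version; dropping the final .collect() call would instead return a LazyFrame that can be composed with further lazy operations:

    # Usage with the sample frame defined in the question
    uniform_df = ensure_uniform(df, index_cols=["g1", "g2", "g3"])
    print(uniform_df.shape)  # (8, 4)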