pythonpython-polars

Generating a dataframe of *combinations* (not permutations)?


Suppose I have a bag of items {a, b}. Then I can choose pairs out of it in a variety of ways. One way might be to pick all possible permutations: [a, a], [a, b], [b, a], [b, b]. But I might disallow repetition, in which case the possible permutations are: [a, b], [b, a]. I might go further and declare that [a, b] is the same as [b, a], i.e. I only care about the "combination" of choices, not their permutations.

For more about the distinction between combination vs. permutation, see: https://en.wikipedia.org/wiki/Combination

What are the best ways to produce a combination of choices (i.e. order of elements should not matter)? My current solutions looks like this:

import polars as pl

choices = pl.DataFrame(
    [
        pl.Series("flavor", ["x"] * 2 + ["y"] * 3),
        pl.Series("choice", ["a", "b"] + ["1", "2", "3"]),
    ]
)

# join to produce the choices
choices.join(choices, on=["flavor"]).with_columns(
    # generate a 2-element list representing the choice
    sorted_choice_pair=pl.concat_list("choice", "choice_right").list.sort()
).filter(pl.col.choice.eq(pl.col.sorted_choice_pair.list.first()))
shape: (9, 4)
┌────────┬────────┬──────────────┬────────────────────┐
│ flavor ┆ choice ┆ choice_right ┆ sorted_choice_pair │
│ ---    ┆ ---    ┆ ---          ┆ ---                │
│ str    ┆ str    ┆ str          ┆ list[str]          │
╞════════╪════════╪══════════════╪════════════════════╡
│ x      ┆ a      ┆ a            ┆ ["a", "a"]         │
│ x      ┆ a      ┆ b            ┆ ["a", "b"]         │
│ x      ┆ b      ┆ b            ┆ ["b", "b"]         │
│ y      ┆ 1      ┆ 1            ┆ ["1", "1"]         │
│ y      ┆ 1      ┆ 2            ┆ ["1", "2"]         │
│ y      ┆ 2      ┆ 2            ┆ ["2", "2"]         │
│ y      ┆ 1      ┆ 3            ┆ ["1", "3"]         │
│ y      ┆ 2      ┆ 3            ┆ ["2", "3"]         │
│ y      ┆ 3      ┆ 3            ┆ ["3", "3"]         │
└────────┴────────┴──────────────┴────────────────────┘

So I generate all permutations, and then filter out those that where the "left element" does not match the first element of the list.


Solution

  • You can use .join_where() with a row index predicate to prevent "duplicates".

    (choices
      .with_row_index()
      .join_where(choices.with_row_index(),
          pl.col.flavor == pl.col.flavor_right,
          pl.col.index <= pl.col.index_right
      )
    ) 
    
    shape: (9, 5)
    ┌───────┬────────┬────────┬─────────────┬──────────────┐
    │ index ┆ flavor ┆ choice ┆ index_right ┆ choice_right │
    │ ---   ┆ ---    ┆ ---    ┆ ---         ┆ ---          │
    │ u32   ┆ str    ┆ str    ┆ u32         ┆ str          │
    ╞═══════╪════════╪════════╪═════════════╪══════════════╡
    │ 0     ┆ x      ┆ a      ┆ 0           ┆ a            │
    │ 0     ┆ x      ┆ a      ┆ 1           ┆ b            │
    │ 1     ┆ x      ┆ b      ┆ 1           ┆ b            │
    │ 2     ┆ y      ┆ 1      ┆ 2           ┆ 1            │
    │ 2     ┆ y      ┆ 1      ┆ 3           ┆ 2            │
    │ 3     ┆ y      ┆ 2      ┆ 3           ┆ 2            │
    │ 2     ┆ y      ┆ 1      ┆ 4           ┆ 3            │
    │ 3     ┆ y      ┆ 2      ┆ 4           ┆ 3            │
    │ 4     ┆ y      ┆ 3      ┆ 4           ┆ 3            │
    └───────┴────────┴────────┴─────────────┴──────────────┘